Why is pandas read_csv not reading the correct number of lines?

Question

I am trying to open part of a csv file using pandas read_csv. The section I open has a heading on line 746 and goes to line 1120.

 gr = read_csv(inputfile,header=746,nrows=374,index_col=False)

Then I get the error

CParserError: Error tokenizing data. C error: Expected 9 fields in line 1121, saw 17

The error makes sense because on line 1121 of the file, the data changes from 9 fields to 17. What makes no sense is that it tries to read line 1121 at all, since nrows and header together should only read up to line 1120.

I can get it to work by decreasing nrows to 232. This works even if I increase the header number so that it starts further down the file (for example, increasing it to 800).

The last line it reads is nothing special, and it reads further into the file if I increase the header number.

I am using Python 2.7 and pandas 0.14.

The file I'm trying to read looks like this:

"River Levels","GRETA_SOUTH      (C)","GLENROWAN        (C)","ROCKY_POINT      (C)","DOCKER_RD        (C)","BOBINAWARRAH     (C)","WOOLSHED         (C)","WANGARATTA       (C)","PEECHELBA_EAST   (C)"
 41812.00001,          0.70,          0.00,          0.00,          0.20,          0.00,          0.00,          7.30,        125.00
 41812.04168,          0.70,          0.00,          0.00,          0.20,          0.00,          0.00,          7.30,        125.00

Why is it trying to open line 1121 when the nrows + header is smaller than that, and why will it only read 232 lines before it does?

python pandas csv

Chris leahy 23 Sep '14 at 2:11

2 answers

I believe this is an off-by-one / counting (user) error! That is, pd.read_csv(inputfile, header=746, nrows=374) reads up to the 1121st line (1-indexed), so you should read one fewer line. I could be wrong, but here's what I think is happening...

Python's line index (as with most Python indexing) starts at 0:

In [11]: s = 'a,b\nA,B\n1,2\n3,4\n1,2,3,4'

In [12]: for i, line in enumerate(s.splitlines()): print(i, line)
0 a,b
1 A,B
2 1,2
3 3,4
4 1,2,3,4

People usually count line numbers from 1:

In [12]: for i, line in enumerate(s.splitlines(), start=1): print(i, line)
1 a,b
2 A,B
3 1,2
4 3,4
5 1,2,3,4

Next, we read up to the line at Python index 3, which is the 4th line when counting from 1:

In [13]: pd.read_csv(StringIO(s), header=1, nrows=2)  # Note: header + nrows == 3
Out[13]:
   A  B
0  1  2
1  3  4

And if we include the following line, it will raise:

In [15]: pd.read_csv(StringIO(s), header=1, nrows=3)
CParserError: Error tokenizing data. C error: Expected 2 fields in line 5, saw 4
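
The counting above can be written out as a small calculation. This is only a sketch: `last_line_needed` is a hypothetical helper (not part of pandas), and it assumes `header` is a 0-based line index and that there are no blank or skipped lines before the block being read.

```python
def last_line_needed(header, nrows):
    """Return the 1-indexed number of the last line that header + nrows asks for.

    Assumes `header` is a 0-based line index and that no blank or skipped
    lines come before the block (both are assumptions of this sketch).
    """
    first_data_index = header + 1                   # 0-based index of the first data row
    last_data_index = first_data_index + nrows - 1  # 0-based index of the last data row
    return last_data_index + 1                      # convert to a 1-based line number


print(last_line_needed(header=1, nrows=2))      # 4    (the toy example above)
print(last_line_needed(header=746, nrows=374))  # 1121 (the line named in the error)
print(last_line_needed(header=745, nrows=374))  # 1120
```

If the heading really sits on 1-indexed line 746, then under these assumptions header=745 would be the matching 0-based value, and the read would stop at line 1120.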

Andy Hayden 23 Sep 14 at 4:53

Andy Hayden · Accepted Answer · 2014-09-24T05:27:38+0000

If I'm not misreading the docs, this looks like a bug in read_csv (I recommend filing a GitHub issue!).

A workaround, since your data is small (read the lines in as a single string):

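import pandas as pd
from StringIO import StringIO  # Python 2; on Python 3 use: from io import StringIO

with open(inputfile) as f:
    # Keep only the first 1120 lines so the parser never sees the 17-field line
    df = pd.read_csv(StringIO(''.join(f.readlines()[:1120])), header=746, nrows=374)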

I tested this with the csv you provided and it works / doesn't raise!
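
A variation on the workaround above, for files too large to load with readlines(): slice out only the wanted lines lazily before handing them to the parser. This is a sketch for Python 3; `slice_lines` is a hypothetical helper, not a pandas function, and it again treats `header` as a 0-based line index.

```python
import io
from itertools import islice


def slice_lines(lines, header, nrows):
    """Return a text buffer holding the header line plus the next nrows lines.

    `lines` is any iterable of text lines (for example an open file) and
    `header` is a 0-based line index. Hypothetical helper for illustration.
    """
    # islice stops after index header + nrows, so later lines are never consumed.
    wanted = islice(lines, header, header + nrows + 1)
    return io.StringIO("".join(wanted))


sample = io.StringIO("a,b\nA,B\n1,2\n3,4\n1,2,3,4\n")
buf = slice_lines(sample, header=1, nrows=2)
print(buf.getvalue())
```

The buffer now starts at the header line, so it could be passed to pd.read_csv(buf) without header or nrows arguments; the malformed line never reaches the parser.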
