Python pandas DataFrame from first and last line of csv

All -

I am looking to create a pandas DataFrame from only the first and last lines of a very large csv. The goal of this exercise is to be able to easily grab some of the attributes from the first and last entries in these csv files. I have no problem grabbing the first line of the csv using:

pd.read_csv(filename, nrows=1)

      

I also have no problem grabbing the last line of a text file in various ways, for example:

with open(filename) as f:
    last_line = f.readlines()[-1]

      

However, after getting these two things into one DataFrame, I chose a loop. Any insight on how best to achieve this goal?

EDIT NOTE. I am trying to accomplish this task without first loading all the data into one DataFrame as I am dealing with quite large (> 15MM lines) csv files.

Thank!

+3


source to share


3 answers


Just use head

both tail

and concat

. You can even customize the number of lines.

import pandas as pd

df = pd.read_csv("flu.csv")
top = df.head(1)
bottom = df.tail(1)
concatenated = pd.concat([top,bottom])

print concatenated

      

Result:

           Date  Cases
0      9/1/2014     45
121  12/31/2014     97

      

Adjust head

and tail

to take 5 lines from the top and 10 from the bottom ...

           Date  Cases
0      9/1/2014     45
1      9/2/2014    104
2      9/3/2014     47
3      9/4/2014    108
4      9/5/2014     49
112  12/22/2014     30
113  12/23/2014     81
114  12/24/2014     99
115  12/25/2014     85
116  12/26/2014     55
117  12/27/2014     91
118  12/28/2014     68
119  12/29/2014    109
120  12/30/2014     55
121  12/31/2014     97

      



One possible approach that you can use if you don't want to download the entire CSV file as a dataframe is to treat it as CSV only. The following code is similar to your approach.

import pandas as pd
import csv

top = pd.read_csv("flu.csv", nrows=1)
headers = top.columns.values

with open("flu.csv", "r") as f, open("flu2.csv","w") as g:
    last_line = f.readlines()[-1].strip().split(",")
    c = csv.writer(g)
    c.writerow(headers)
    c.writerow(last_line)

bottom = pd.read_csv("flu2.csv")
concatenated = pd.concat([top, bottom])
concatenated.reset_index(inplace=True, drop=True)

print concatenated

      

The result is the same except for the index. Tested against a million lines and processed in about a second.

        Date  Cases
0   9/1/2014     45
1  7/25/4885     99
[Finished in 0.9s]

      

As it scales against 15 million lines, maybe it's your ball now. So I decided to test it for exactly 15,728,626 lines and the results seem good enough.

        Date  Cases
0   9/1/2014     45
1  7/25/4885     99
[Finished in 3.3s]

      

+5


source


So the way to do it without reading the whole file in Python first is to grab the first line and then loop over the file on the last line. Then use StringIO to suck them into Pandas. Maybe something like this:



import pandas as pd
import StringIO

with open('tst.csv') as f:
    first_line = f.readline()
    for line in f:
        pass #iterate to the end
    last_line = line

mydf = pd.DataFrame()
mydf = mydf.append(pd.read_csv(StringIO.StringIO(first_line), header=None))
mydf = mydf.append(pd.read_csv(StringIO.StringIO(last_line), header=None))

      

+3


source


You need this answer fooobar.com/questions/120872 / ... - not the accepted answer, but a better one because it searches backward for the first newline instead of guessing.

Then wrap the two lines in StringIO:

from cStringIO import StringIO
import pandas as pd

# grab the lines as per first-and-last-line question
truncated_input = StringIO(the_two_lines)
truncated_input.seek(0) # need to rewind
df = pd.read_csv(truncated_input)

      

+2


source







All Articles