Problems reading a large CSV file

I am trying to read a 13 GB CSV file using the following code:

import pandas as pd

chunks = pd.read_csv('filename.csv', chunksize=10000000)
df = pd.DataFrame()
%time df = pd.concat(chunks, ignore_index=True)

I've played around with chunksize values from 10**3 to 10**7, but every time I get a MemoryError. The CSV file has about 3.3 million rows and 1900 columns.

I can clearly see that I have 30+ GB of free memory before I start reading the file, but I still get a MemoryError. How can I fix this?



1 answer


Chunking does nothing if you ultimately want to load everything in the file into memory. The point of chunking is to preprocess each chunk so that you then work only with the data you are actually interested in (perhaps writing the processed chunk back to disk). Also, your chunk size (10,000,000) is larger than the number of rows in your data (about 3.3 million), which means you are reading the entire file in a single chunk anyway.
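
For example, here is a minimal sketch of chunk-wise preprocessing, where each chunk is filtered down before anything is kept (the column names 'col_a' and 'col_b' and the filter condition are hypothetical placeholders):

import pandas as pd

needed_cols = ['col_a', 'col_b']
filtered_parts = []
for chunk in pd.read_csv('filename.csv', chunksize=100000, usecols=needed_cols):
    # Keep only the rows of interest; the rest of each chunk is discarded,
    # so peak memory stays close to the size of a single chunk.
    filtered_parts.append(chunk[chunk['col_a'] > 0])

df = pd.concat(filtered_parts, ignore_index=True)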

As suggested by @MaxU, try sparse data frames, and also use a smaller chunk size (e.g. 100k):



chunks = pd.read_csv('filename.csv', chunksize=100000)  # nrows=200000 to test, given the file size.
# Note: DataFrame.to_sparse requires pandas < 1.0.
df = pd.concat([chunk.to_sparse(fill_value=0) for chunk in chunks])
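
On pandas 1.0 or newer, where DataFrame.to_sparse has been removed, a roughly equivalent sketch (assuming the 1900 columns are numeric) converts each chunk to a sparse dtype via astype:

import pandas as pd

chunks = pd.read_csv('filename.csv', chunksize=100000)
# Convert each chunk to a sparse representation before concatenating,
# so that the zero entries take almost no memory.
df = pd.concat(
    [chunk.astype(pd.SparseDtype("float", 0.0)) for chunk in chunks],
    ignore_index=True,
)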


You might also consider something like GraphLab Create, which uses SFrames (these are not limited by RAM).
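
As a rough sketch, assuming GraphLab Create is installed (its SFrames are disk-backed, so the 13 GB file does not have to fit in memory):

import graphlab as gl

# SFrames spill to disk, so reading the full CSV does not require holding it in RAM.
sf = gl.SFrame.read_csv('filename.csv')
print(sf.num_rows(), sf.num_columns())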
