Pandas: handling DataFrame with many rows
I want to read and process a large CSV file (data_file) that has the following two-column structure:
id params
1 '14':'blah blah','25':'more cool stuff'
2 '157':'yes, more stuff','15':'and even more'
3 '14':'blah blah','25':'more cool stuff'
4 '15':'different here'
5 '157':'yes, more stuff','15':'and even more'
6 '100':'exhausted'
This file contains 30,000,000 lines (5 GB of disk space). (The actual strings are encoded in UTF-8; for simplicity I've given them in ASCII here.) Note that some of the values in the second column are repeated.
I read this using pandas.read_csv():
import pandas

df = pandas.read_csv(open(data_file, 'rb'), delimiter='\t',
                     usecols=['id', 'params'], dtype={'id': 'u4', 'params': 'str'})
After reading the file, the data frame df uses 1.2 GB of RAM.
So far so good.
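For reference, this figure can be checked from within pandas itself. A minimal sketch (memory_usage with deep=True counts the Python string objects; depending on the pandas version, the deep option may not be available):

# total size of the frame in MB, including the Python string objects
print(df.memory_usage(deep=True).sum() / 1024 ** 2)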
Now comes the processing part. I want the params column to end up in this format:
blah blah||more cool stuff
yes, more stuff||and even more
blah blah||more cool stuff
different here
yes, more stuff||and even more
exhausted
I wrote:
def clean_keywords(x):
    # split on single quotes, drop the first piece, keep every other fragment
    return "||".join(x.split("'")[1:][::2])

df['params'] = df['params'].map(clean_keywords)
This code works in the sense that it produces the correct result. But:
- The map operation uses more than 6.8 GB of RAM.
- Once the computation is complete, df uses 5.5 GB of RAM (after gc.collect()), even though the strings computed in the params column are shorter than the ones that were read.
Can someone explain this and suggest an alternative way to accomplish the above operation using pandas (I'm using python 3.4, pandas 0.16.2, win64)?
Answering my own question.
It turns out that pandas.read_csv() is smart: when the file is read, identical strings are stored only once (they share the same object). But once those strings are processed and the results are stored back in the column, they are no longer shared, so RAM usage grows. To avoid this, the uniqueness has to be preserved manually. I did it this way:
unique_strings = {}

def clean_keywords(x):
    s = "||".join(x.split("'")[1:][::2])
    # reuse the stored copy if an identical string has been seen before
    return unique_strings.setdefault(s, s)

df['params'] = df['params'].map(clean_keywords)
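To confirm that the cache is doing its job, rows with identical params should now refer to the very same string object. A quick sanity check, using the fact that rows 1 and 3 of the sample above carry identical params:

# rows 0 and 2 (ids 1 and 3) had identical params in the input,
# so after the cached map they should be the same object, not just equal
print(df['params'].iloc[0] is df['params'].iloc[2])  # expected: True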
With this solution, peak RAM usage was only 2.8 GB, and after the computation it dropped back close to the initial level after reading the data (1.2 GB), as expected.
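A related option along the same lines, which I have not benchmarked here, is to let pandas deduplicate the values itself by converting the result to a categorical column; each distinct string is then stored once and every row only keeps a small integer code. A minimal sketch, assuming the number of distinct params values stays small relative to the row count:

# equivalent in spirit to the manual cache above: one copy per distinct string
df['params'] = df['params'].map(clean_keywords).astype('category')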