Pandas: handling DataFrame with many rows

I want to read and process a large CSV file (data_file) that has the following two-column structure:

id params
1  '14':'blah blah','25':'more cool stuff'
2  '157':'yes, more stuff','15':'and even more'
3  '14':'blah blah','25':'more cool stuff'
4  '15':'different here'
5  '157':'yes, more stuff','15':'and even more'
6  '100':'exhausted'

This file contains 30,000,000 lines and takes 5 GB of disk space. (The actual strings are encoded in UTF-8; for simplicity I've given them in ASCII here.) Note that some of the values in the second column are repeated.

I read this using pandas.read_csv():

import pandas

df = pandas.read_csv(open(data_file, 'rb'), delimiter='\t',
                     usecols=['id', 'params'], dtype={'id': 'u4', 'params': 'str'})

After reading the file, the DataFrame df uses 1.2 GB of RAM.

So far so good.
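
For reference, that figure can also be checked from inside Python. This is a minimal sketch; note that the deep=True flag, which is needed to count the Python string objects held in an object column, is only available in pandas releases newer than the 0.16.2 used here:

# Total memory held by the frame, in GB.
# deep=True also counts the string objects in the 'params' column;
# without it, an object column reports only the size of its pointers.
print(df.memory_usage(deep=True).sum() / 1024 ** 3)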

Now comes the processing part. I want the strings in the params column to end up in this format:

blah blah||more cool stuff
yes, more stuff||and even more
blah blah||more cool stuff
different here
yes, more stuff||and even more
exhausted

I wrote:

def clean_keywords(x):
    # split on single quotes, keep every other piece, and join them with '||'
    return "||".join(x.split("'")[1:][::2])

df['params'] = df['params'].map(clean_keywords)
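
As a quick check, the function can be tried on a single value before mapping it over 30,000,000 rows. A standalone sketch with a made-up sample value (the [1:][::2] slice keeps exactly the pieces that sit between a pair of single quotes, so what it returns depends on which parts of the raw field are quoted):

sample = "14:'blah blah',25:'more cool stuff'"   # hypothetical raw value, only the values quoted
print(clean_keywords(sample))                    # -> blah blah||more cool stuff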

      

This code works in the sense that it produces the correct result. But:

  • The map operation uses more than 6.8 GB of RAM.
  • Once the computation is complete, df still uses 5.5 GB of RAM (even after gc.collect()), although the strings computed for the params column are shorter than the ones that were read. A sketch for measuring this follows below the list.
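
To see where the memory actually goes, the sizes of the distinct string objects held in the column can be summed before and after the map. A rough sketch (distinct_string_bytes is a made-up helper; it counts each string object once, no matter how many rows reference it):

import sys

def distinct_string_bytes(series):
    # Sum sizes over distinct string *objects* (keyed by id), so a string
    # shared by many rows is counted only once -- that is what actually
    # occupies RAM.
    seen = {}
    for s in series:
        seen[id(s)] = sys.getsizeof(s)
    return sum(seen.values())

print(distinct_string_bytes(df['params']) / 1024 ** 3)   # GB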

Can someone explain this and suggest an alternative way to accomplish the above operation with pandas? (I'm using Python 3.4, pandas 0.16.2, win64.)

1 answer


Answering my own question.

It turns out that pandas.read_csv() is smart: when the file is read, equal strings are stored only once (they are kept unique). But once the strings are processed and the new results are stored in the column, they are no longer unique, and RAM usage grows accordingly. To avoid this, you have to preserve the uniqueness manually. I did it this way:



unique_strings = {}

def clean_keywords(x):
    s = "||".join(x.split("'")[1:][::2])
    # Return the cached equal string if one has been seen already,
    # otherwise store s and return it -- equal results share one object.
    return unique_strings.setdefault(s, s)

df['params'] = df['params'].map(clean_keywords)
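
The effect of the setdefault trick can be seen in isolation: two equal strings built at runtime are separate objects, but after going through the cache only one copy stays alive. A minimal standalone sketch:

cache = {}

s1 = "||".join(["blah blah", "more cool stuff"])
s2 = "||".join(["blah blah", "more cool stuff"])
print(s1 == s2, s1 is s2)        # True False -- equal, but two objects (CPython)

t1 = cache.setdefault(s1, s1)    # not seen yet: stores s1 and returns it
t2 = cache.setdefault(s2, s2)    # equal key already present: returns the stored s1
print(t1 is t2)                  # True -- only one copy is kept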

      

With this solution, peak RAM usage was only 2.8 GB, and once the computation finished it dropped back close to the initial usage after reading the data (1.2 GB), as expected.
