How to efficiently create a large but sparse DataFrame from a dict?

I have a large but very sparse matrix (50,000 rows × 100,000 columns; only about 10% of the values are known). Each known element is a floating point number between 0.00 and 1.00, and these known values are stored in a Python dict of the form:

{'c1': {'r1':0.27, 'r3':0.45}, 
 'c2': {'r2':0.65, 'r4':0.87} }


Now the problem is how to construct the pandas.DataFrame from this dict efficiently. Efficiency here means both memory usage and the time needed to create the DataFrame.

For memory usage, I would like to store each item as an np.uint8. Since the known values are between 0.00 and 1.00 and I only care about the first two decimal digits, I can convert each value to an unsigned 8-bit integer by multiplying it by 100. This would save a lot of memory for this frame.
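
For example, the kind of scaling I have in mind (just a sketch):

import numpy as np

value = 0.27
# Keep only the first two decimal digits by scaling into the 0-100 range,
# which fits in an unsigned 8-bit integer
scaled = np.uint8(round(value * 100))   # -> 27
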

Is there a way to do this?



1 answer


A dict like:

{'c1': {'r1':0.27, 'r3':0.45}, 
 'c2': {'r2':0.65, 'r4':0.87} }


... is best converted to a normalized structure like this:

 level0    level1   value
 c1        r1        0.27
 c1        r3        0.45
 c2        r2        0.65
 c2        r4        0.87


... rather than as a pivot table like this:

      r1    r2    r3    r4
c1  0.27   nan  0.45   nan
c2   nan  0.65   nan  0.87

... since the latter takes up a lot more memory.
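
A rough way to see the difference on the toy data (the names long_df and wide_df below are just illustrative):

import pandas as pd

# Long (normalized) form: one row per known value
long_df = pd.DataFrame({
    'level0': ['c1', 'c1', 'c2', 'c2'],
    'level1': ['r1', 'r3', 'r2', 'r4'],
    'value':  [0.27, 0.45, 0.65, 0.87],
})

# Wide (pivoted) form: one cell per (row, column) pair, mostly NaN at full scale
wide_df = long_df.pivot(index='level0', columns='level1', values='value')

# The long frame grows only with the number of known values;
# the wide frame grows with rows * columns, whether known or not.
print(long_df.memory_usage(deep=True).sum())
print(wide_df.memory_usage(deep=True).sum())
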

A fairly memory-efficient way to build a normalized structure:

import pandas as pd

data = {'c1': {'r1': 0.27, 'r3': 0.45},
        'c2': {'r2': 0.65, 'r4': 0.87}}

result = []
for key, value in data.items():      # .iteritems() was Python 2 only
    # One small frame per outer key: the inner keys become an 'index' column
    row = pd.Series(value).reset_index()
    row.insert(0, 'key', key)
    result.append(row)

pd.concat(result, ignore_index=True)


This leads to:

  key index     0
0  c1    r1  0.27
1  c1    r3  0.45
2  c2    r2  0.65
3  c2    r4  0.87
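
To also get the uint8 saving asked about in the question, the value column of this result can be scaled afterwards and the columns given clearer names. A sketch, continuing from the snippet above (the column names are my own choice):

import numpy as np

df = pd.concat(result, ignore_index=True)
df.columns = ['col', 'row', 'value']
# Two decimal digits -> integers in 0-100, stored as unsigned 8-bit
df['value'] = (df['value'] * 100).round().astype(np.uint8)
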
      
