How to efficiently create a large but sparse DataFrame from a dict?
I have a large but very sparse matrix (50,000 rows * 100,000 columns; only 10% of the values are known). Each known element of this matrix is a floating-point number between 0.00 and 1.00, and these known values are stored in a Python dict with a format such as:
{'c1': {'r1':0.27, 'r3':0.45},
'c2': {'r2':0.65, 'r4':0.87} }
Now the problem is how to efficiently construct a pandas.DataFrame from this dict. Efficiency here includes both memory usage and the time it takes to create the DataFrame.
For memory usage, I hope to store each item as an np.uint8. Since each known value is between 0.00 and 1.00 and I only care about the first two digits, I could convert it to an unsigned 8-bit integer by multiplying it by 100. This would save a lot of memory for this DataFrame.
Is there a way to do this?
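For example, the scaling I have in mind is something like this (just a sketch; rounding to two digits before the cast is my assumption about how the conversion should behave):
import numpy as np

# 0.27 -> 27: the two decimal digits survive the cast, and the
# range 0..100 fits comfortably in an unsigned 8-bit integer (0..255).
value = 0.27
scaled = np.uint8(round(value * 100))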
A dict like:
{'c1': {'r1':0.27, 'r3':0.45},
'c2': {'r2':0.65, 'r4':0.87} }
... is best converted to a normalized structure like this:
level0 level1 value
c1 r1 0.27
c1 r3 0.45
c2 r2 0.65
c2 r4 0.87
... rather than to a pivoted (wide) table like this:
r1 r2 r3 r4
c1 0.27 nan 0.45 nan
c2 nan 0.65 nan 0.87
... since the latter materializes every missing cell as a NaN float and therefore takes up far more memory.
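A quick way to see the difference is to build both shapes and compare their footprints (a sketch for illustration; at real scale the wide frame allocates rows * columns float cells, while the long frame grows only with the number of known values):
import pandas as pd

# Long (normalized) form: one row per known value.
long_df = pd.DataFrame({'level0': ['c1', 'c1', 'c2', 'c2'],
                        'level1': ['r1', 'r3', 'r2', 'r4'],
                        'value':  [0.27, 0.45, 0.65, 0.87]})

# Wide (pivoted) form: every missing cell becomes a NaN float.
wide_df = long_df.pivot(index='level0', columns='level1', values='value')

print(long_df.memory_usage(deep=True).sum())
print(wide_df.memory_usage(deep=True).sum())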
A fairly memory-efficient way to build a normalized structure:
import pandas as pd

data = {'c1': {'r1': 0.27, 'r3': 0.45},
        'c2': {'r2': 0.65, 'r4': 0.87}}

result = []
for key, value in data.items():  # .iteritems() in Python 2
    # One small frame per outer key: 'index' holds the inner row labels.
    row = pd.Series(value).reset_index()
    row.insert(0, 'key', key)
    result.append(row)
df = pd.concat(result, ignore_index=True)
This leads to:
  key index     0
0  c1    r1  0.27
1  c1    r3  0.45
2  c2    r2  0.65
3  c2    r4  0.87
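To meet the memory goal from the question, the value column can then be scaled and downcast to one byte per entry (a sketch; the column name 0 comes from the unnamed Series values, and the renaming and rounding are my choices):
# 0.00..1.00 maps to 0..100, which fits in uint8.
df = df.rename(columns={'index': 'row', 0: 'value'})
df['value'] = (df['value'] * 100).round().astype('uint8')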