How to efficiently create a large but sparse DataFrame from a dict?
I have a large but very sparse matrix (50,000 rows * 100,000 columns, only 10% of the values are known). Each known element of this matrix is a floating point number between 0.00 and 1.00, and these known values are stored in a python dict with a format such as:
{'c1': {'r1':0.27, 'r3':0.45}, 
 'c2': {'r2':0.65, 'r4':0.87} }
Now the problem is how to efficiently construct the pandas.DataFrame from this dict? Efficiency here includes both memory usage and time to create the dataframe.
For memory usage, I hope to store each item as np.uint8. Since the known values are between 0.00 and 1.00 and I only care about the first two decimal digits, I could convert each value to an unsigned 8-bit integer by multiplying it by 100. This would save a lot of memory for this frame.
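For example, I imagine packing a single value would look something like this (just to illustrate the idea; the round() is there because e.g. 0.29 * 100 is 28.999... in floating point):

import numpy as np

value = 0.27
packed = np.uint8(round(value * 100))   # 27, stored in a single byte
restored = packed / 100.0               # back to 0.27 when needed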
Is there a way to do this?
A dict like:
{'c1': {'r1':0.27, 'r3':0.45}, 
 'c2': {'r2':0.65, 'r4':0.87} }
... is best converted to a normalized structure like this:
 level0    level1   value
 c1        r1        0.27
 c1        r3        0.45
 c2        r2        0.65
 c2        r4        0.87
... rather than a pivot table:
      r1    r2    r3    r4
c1  0.27   nan  0.45   nan
c2   nan  0.65   nan  0.87
... since the latter takes up a lot more memory.
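For comparison, feeding the nested dict straight into the DataFrame constructor produces exactly that wide, NaN-filled layout; a quick sketch with the sample data:

import pandas as pd

data = {'c1': {'r1': 0.27, 'r3': 0.45},
        'c2': {'r2': 0.65, 'r4': 0.87}}

# Outer keys become columns and inner keys become the row index; every
# missing cell is filled with NaN, which forces float64 storage throughout.
wide = pd.DataFrame(data).T   # .T so that c1/c2 are the rows, as in the table above

At 50,000 rows by 100,000 columns of float64, that dense layout is roughly 40 GB, while the normalized form only has to hold the ~10% of cells that actually exist.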
A fairly memory-efficient way to build a normalized structure:
import pandas as pd

data = {'c1': {'r1': 0.27, 'r3': 0.45},
        'c2': {'r2': 0.65, 'r4': 0.87}}

result = []
for key, value in data.items():          # .iteritems() is Python 2 only
    # One small frame per outer key; reset_index turns the inner keys into a column
    row = pd.Series(value).reset_index()
    row.insert(0, 'key', key)
    result.append(row)
pd.concat(result, ignore_index=True)
This leads to:
  key index     0
0  c1    r1  0.27
1  c1    r3  0.45
2  c2    r2  0.65
3  c2    r4  0.87
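From there, the uint8 storage asked about in the question is just a scale-and-downcast on the value column; a rough sketch (the 'row' and 'value' column names are my own renames of 'index' and 0):

import pandas as pd

data = {'c1': {'r1': 0.27, 'r3': 0.45},
        'c2': {'r2': 0.65, 'r4': 0.87}}

# Rebuild the normalized frame from above, with clearer column names
result = [pd.Series(v).reset_index().assign(key=k) for k, v in data.items()]
df = pd.concat(result, ignore_index=True).rename(columns={'index': 'row', 0: 'value'})

# Multiply by 100, round, and downcast: one byte per value instead of eight
df['value'] = (df['value'] * 100).round().astype('uint8')

# The label columns can also be stored compactly as categoricals
df['key'] = df['key'].astype('category')
df['row'] = df['row'].astype('category')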