How do I fill in missing values when building a DataFrame?
I am using pandas to store a large but very sparse matrix (50,000 rows * 100,000 columns). Each element of this matrix is a floating point number between 0.00 and 1.00. The original values of the elements are stored in a Python dict (only elements with known values are kept).
Now the problem is how to build the pandas.DataFrame from the dict correctly.
If I use float64, then a rough estimate of the physical size of this matrix would be 50,000 * 100,000 * 8 bytes, which is about 37 GiB (40 GB) and significantly larger than the memory of my machine.
However, since each item lies between 0.00 and 1.00 and I only need the first two decimal digits, I could multiply each item by 100 and convert it to an unsigned 8-bit integer (np.uint8), which would reduce the data to an acceptable size: 1/8 * 37 GiB, or roughly 4.7 GiB.
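The arithmetic and the scaling step can be sketched as follows. This is only an illustration: the sample array `values` is made up, and rounding with `np.rint` before the cast is an assumption about the desired behavior (a plain `astype` truncates, so 0.29 would become 28 rather than 29 because of floating point error).

```python
import numpy as np

n_rows, n_cols = 50_000, 100_000

# Dense storage cost for each dtype, in bytes
bytes_float64 = n_rows * n_cols * np.dtype(np.float64).itemsize
bytes_uint8 = n_rows * n_cols * np.dtype(np.uint8).itemsize
print(bytes_float64 / 2**30)  # size in GiB
print(bytes_uint8 / 2**30)    # one eighth of the float64 size

# Hypothetical sample values in [0.00, 1.00]
values = np.array([0.0, 0.29, 0.5, 1.0])

# Scale to 0..100 and round before casting; casting alone would truncate
scaled = np.rint(values * 100).astype(np.uint8)
print(scaled)
```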
I tried this method, but pandas.DataFrame doesn't behave as I expected: even when I specify the dtype in the pd.DataFrame() constructor, the resulting columns are float64.
Here's some sample code:
In [87]: dc = {'A':{'a':np.uint8(1.2), 'c':np.uint8(3.2)}, 'B':{'a':np.uint8(1.2), \
'b':np.uint8(2.2)}, 'C':{'b':np.uint8(2.2), 'd':np.uint8(4.2)}}
In [88]: dc
Out[88]: {'A': {'a': 1, 'c': 3}, 'B': {'a': 1, 'b': 2}, 'C': {'b': 2, 'd': 4}}
In [89]: type(dc['A']['a'])
Out[89]: numpy.uint8
In [90]: df = pd.DataFrame(dc, index=['a', 'b', 'c','d'], dtype=np.uint8)
In [91]: df
Out[91]:
     A    B    C
a    1    1  NaN
b  NaN    2    2
c    3  NaN  NaN
d  NaN  NaN    4
In [92]: df.dtypes
Out[92]:
A float64
B float64
C float64
dtype: object
@zero323 mentions that this is a pandas design choice, so is there a way to build this DataFrame efficiently?
This may not help you much, but it is expected behavior. Quoting "Caveats and Gotchas" from the pandas documentation:
When injecting NA into an existing Series or DataFrame using reindex or some other means, the boolean and integer types will be promoted to a different dtype to hold the NA.
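The promotion is easy to reproduce on a small Series. The following is a minimal sketch with made-up labels and values, not taken from the question's data:

```python
import numpy as np
import pandas as pd

# A uint8 Series with two labels
s = pd.Series([1, 2], index=['a', 'b'], dtype=np.uint8)

# Reindexing onto a label that does not exist introduces a missing value,
# so the integer dtype is promoted to a float dtype that can hold NaN
promoted = s.reindex(['a', 'b', 'c'])

print(s.dtype)
print(promoted.dtype)
```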
The comment from @EdChum suggests better solutions, but if you really need to work with dicts, you can try something like this:
from itertools import chain

import numpy as np
import pandas as pd

# Choose some default value for the missing entries
default = 0
# Prepare a dict that maps every row label found in dc to the default
defaults = {k: default for k in chain.from_iterable(dc.values())}
# Fill the gaps in each column dict, then construct the data frame
df = pd.DataFrame(
    {k: {**defaults, **v} for k, v in dc.items()},
    index=['a', 'b', 'c', 'd'], dtype=np.uint8)
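A possible variation, sketched here under the assumption that every column should share one fixed index, is to build each column as a uint8 Series and reindex it with an explicit `fill_value`, so no NaN is ever introduced. The names `index` and `columns` are illustrative, and `dc` mirrors the sample dict from the question with plain integers.

```python
import numpy as np
import pandas as pd

# Same shape as the question's sample dict, with plain ints for brevity
dc = {'A': {'a': 1, 'c': 3}, 'B': {'a': 1, 'b': 2}, 'C': {'b': 2, 'd': 4}}
index = ['a', 'b', 'c', 'd']

# Build each column as uint8 and fill absent labels with 0 instead of NaN
columns = {
    name: pd.Series(values, dtype=np.uint8).reindex(index, fill_value=0)
    for name, values in dc.items()
}
df = pd.DataFrame(columns)

print(df)
print(df.dtypes)
```

Note that with 0 as the fill value a missing entry cannot be told apart from a genuine 0.00. Since the scaled values only occupy 0 to 100, a sentinel such as 255 could mark missing entries instead, if that distinction matters.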