Cheapest way to create pandas.DataFrame or pandas.SparseDataFrame
Suppose we have a huge, sparse matrix: what's the cheapest way to get it into a pandas.DataFrame? More specifically, the matrix comes from a large dataset with many dummy variables, and its dense version takes up 150 GB+ of memory, which is clearly not sustainable.
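(As a rough sanity check of that footprint, the dense and sparse storage sizes can be compared directly; a minimal sketch on a toy matrix, with the shape and density made up for illustration:)

import numpy as np
from scipy import sparse

# Toy stand-in for the real data: mostly zeros, ~1% non-zero entries.
rng = np.random.RandomState(0)
dense = rng.rand(10000, 1000) * (rng.rand(10000, 1000) < 0.01)

csr = sparse.csr_matrix(dense)

print(dense.nbytes)  # full dense footprint: 10000 * 1000 * 8 bytes = 80 MB
print(csr.data.nbytes + csr.indices.nbytes + csr.indptr.nbytes)  # actual sparse storage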
I am new to pandas and still trying to get a grip on its memory management. The current dilemma is as follows:
- Using a dense source matrix and calling pd.DataFrame on it does not copy memory, but the dense matrix itself already consumes the most space.
- pd.DataFrame does not accept a scipy.sparse.csr_matrix as a constructor argument. Taking a step back, if we resort to pd.SparseDataFrame instead, how can I avoid copying memory? (See the sketch right after this list.)
- There is one good approach I found for converting a scipy.sparse.csr_matrix to a pd.SparseDataFrame, but its for-loop is inefficient and copies memory.
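(For what it's worth, if the pandas version allows it: since pandas 0.20 the SparseDataFrame constructor is documented to accept a scipy sparse matrix directly, which would sidestep the per-column loop entirely; a minimal sketch, using an identity matrix as a stand-in for the real data:)

import numpy as np
import pandas as pd
from scipy import sparse

sp = sparse.csr_matrix(np.eye(4))

# pandas >= 0.20: build straight from the scipy sparse matrix,
# so the data never goes through a dense intermediate.
sdf = pd.SparseDataFrame(sp, default_fill_value=0)
print(sdf.density)  # fraction of stored values, 0.25 here

(Newer pandas releases deprecate SparseDataFrame and expose pd.DataFrame.sparse.from_spmatrix for the same purpose.)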
I also tried to initialize a SparseDataFrame over a pre-allocated block of memory and assign values row by row, which ended with:
import numpy as np
import pandas as pd
from scipy import sparse

a = np.random.rand(4, 5)
b = pd.DataFrame(a)
c = sparse.csr_matrix(a)
d = pd.SparseDataFrame(index=b.index, columns=b.columns)

# Assigning a SparseSeries to a row of the SparseDataFrame fails:
elem = pd.SparseSeries(c[2].toarray().ravel())
d.loc[[2]] = [elem]  # raises NotImplementedError

# The same row assignment into a dense DataFrame works:
elem = pd.Series(c[2].toarray().ravel())
b.loc[[2]] = [elem]  # OK
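Another sparse-preserving route I have come across is SparseSeries.from_coo, which builds a (row, column) MultiIndexed sparse series straight from a COO matrix without densifying; a minimal sketch:

import numpy as np
import pandas as pd
from scipy import sparse

coo = sparse.coo_matrix(np.eye(3))

# pandas >= 0.16: only the non-zero entries are materialized,
# indexed by (row, column) pairs.
ss = pd.SparseSeries.from_coo(coo)
print(ss)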
Python being a scripting language is great, no argument there. But at this point I could really use a pointer, in both senses of the word.
Thanks in advance for any help!