Cheapest way to create pandas.DataFrame or pandas.SparseDataFrame
Suppose we have a huge, sparse matrix: what's the cheapest way to get it into a pandas.DataFrame? More specifically, the matrix comes from a large dataset with many dummy variables, and its dense version takes up 150 GB+ of memory, which is clearly not sustainable.
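(As a rough sanity check of that footprint, the dense and sparse storage sizes can be compared directly; a minimal sketch on a toy matrix, with the shape and density made up for illustration:)

import numpy as np
from scipy import sparse

# Toy stand-in for the real data: mostly zeros, ~1% non-zero entries.
rng = np.random.RandomState(0)
dense = rng.rand(10000, 1000) * (rng.rand(10000, 1000) < 0.01)

csr = sparse.csr_matrix(dense)

print(dense.nbytes)  # full dense footprint: 10000 * 1000 * 8 bytes = 80 MB
print(csr.data.nbytes + csr.indices.nbytes + csr.indptr.nbytes)  # actual sparse storage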
I am new to pandas and still trying to get a grip on its memory management. The current dilemma is as follows:
- Using a dense source matrix and calling pd.DataFrame on it does not copy memory, but the dense matrix itself already consumes the most space.
- pd.DataFrame does not accept a scipy.sparse.csr_matrix as a constructor argument. Taking a step back, if we resort to pd.SparseDataFrame instead, how can I avoid copying memory? (See the sketch right after this list.)
- There is one good approach I found for converting a scipy.sparse.csr_matrix to a pd.SparseDataFrame, but its for-loop is inefficient and copies memory.
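(For what it's worth, if the pandas version allows it: since pandas 0.20 the SparseDataFrame constructor is documented to accept a scipy sparse matrix directly, which would sidestep the per-column loop entirely; a minimal sketch, using an identity matrix as a stand-in for the real data:)

import numpy as np
import pandas as pd
from scipy import sparse

sp = sparse.csr_matrix(np.eye(4))

# pandas >= 0.20: build straight from the scipy sparse matrix,
# so the data never goes through a dense intermediate.
sdf = pd.SparseDataFrame(sp, default_fill_value=0)
print(sdf.density)  # fraction of stored values, 0.25 here

(Newer pandas releases deprecate SparseDataFrame and expose pd.DataFrame.sparse.from_spmatrix for the same purpose.)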
I also tried to initialize a SparseDataFrame over a pre-allocated block of memory and assign values row by row, which ended with:
import numpy as np
import pandas as pd
from scipy import sparse

a = np.random.rand(4, 5)
b = pd.DataFrame(a)
c = sparse.csr_matrix(a)
d = pd.SparseDataFrame(index=b.index, columns=b.columns)

# Assigning a SparseSeries to a row of the SparseDataFrame fails:
elem = pd.SparseSeries(c[2].toarray().ravel())
d.loc[[2]] = [elem]  # raises NotImplementedError

# The same row assignment into a dense DataFrame works:
elem = pd.Series(c[2].toarray().ravel())
b.loc[[2]] = [elem]  # OK
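Another sparse-preserving route I have come across is SparseSeries.from_coo, which builds a (row, column) MultiIndexed sparse series straight from a COO matrix without densifying; a minimal sketch:

import numpy as np
import pandas as pd
from scipy import sparse

coo = sparse.coo_matrix(np.eye(3))

# pandas >= 0.16: only the non-zero entries are materialized,
# indexed by (row, column) pairs.
ss = pd.SparseSeries.from_coo(coo)
print(ss)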
Python being a scripting language is great, no argument there. But at this point I could really use a pointer, in both senses of the word.
Thanks in advance for any help!