Cheapest way to create pandas.DataFrame or pandas.SparseDataFrame

Suppose we have a huge, sparse matrix: what's the cheapest way to get it into a pandas.DataFrame? More specifically, the matrix comes from a large dataset with many dummy variables, and its dense version takes up 150 GB+ of memory, which doesn't seem feasible.
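
For scale, here is a minimal, self-contained sketch (the shape and density are made-up placeholders, not my real data) comparing dense and CSR memory footprints:

import numpy as np
from scipy import sparse

# Toy stand-in for the real data: a mostly-zero dummy-variable matrix.
dense = np.zeros((10000, 1000))
dense[np.random.randint(0, 10000, 5000),
      np.random.randint(0, 1000, 5000)] = 1.0

csr = sparse.csr_matrix(dense)
print(dense.nbytes)  # 80,000,000 bytes for the dense float64 array
print(csr.data.nbytes + csr.indices.nbytes + csr.indptr.nbytes)  # ~100 KB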

I'm new to pandas and trying to understand its memory management. My current dilemma is as follows:

  • Building a pd.DataFrame from a dense source matrix does not copy memory, but the dense matrix itself already consumes the most space.
  • pd.DataFrame does not accept a scipy.sparse.csr_matrix as a constructor argument. Taking a step back, if we resort to pd.SparseDataFrame instead, how can I avoid copying memory?
  • Here's one great approach for converting a scipy.sparse.csr_matrix to a pd.SparseDataFrame, but its for-loop is inefficient and copies memory (a possible alternative is sketched right after this list).
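
If upgrading pandas is an option, I believe 0.20 added direct construction of a SparseDataFrame from a scipy sparse matrix, which would at least remove the Python-level loop; whether it avoids copies is exactly what I'd like to know. A minimal sketch, assuming pandas >= 0.20:

import numpy as np
import pandas as pd
from scipy import sparse

c = sparse.csr_matrix(np.random.rand(1000, 50))
d = pd.SparseDataFrame(c)  # direct construction; copy behavior unclear to me
print(d.density)           # fraction of points actually stored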

I also tried pre-allocating a SparseDataFrame and assigning values row by row, which ends with:

import numpy as np
import pandas as pd
from scipy import sparse

a = np.random.rand(4, 5)
b = pd.DataFrame(a)
c = sparse.csr_matrix(a)
d = pd.SparseDataFrame(index=b.index, columns=b.columns)  # pre-allocated, empty

elem = pd.SparseSeries(c[2].toarray().ravel())
d.loc[[2]] = [elem]  # raises NotImplementedError

elem = pd.Series(c[2].toarray().ravel())
b.loc[[2]] = [elem]  # works fine on the dense DataFrame
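
The closest workaround I've found is to build the SparseDataFrame column by column, along the lines of the approach linked above. A sketch, continuing from the snippet above (still one copy per column, as far as I can tell):

# CSC makes column slicing cheap; each column is densified once and
# wrapped in a SparseSeries, avoiding the unsupported row assignment.
csc = c.tocsc()
d = pd.SparseDataFrame({j: pd.SparseSeries(csc[:, j].toarray().ravel(),
                                           fill_value=0)
                        for j in range(csc.shape[1])})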


Python is a fine scripting language, no question, but right now I might just need a pointer in the right direction.

Thanks in advance for any help!
