Read sparse matrix in python

I want to read a sparse matrix. When I build n-grams using scikit-learn, transform() returns the result as a sparse matrix. I want to read this matrix without calling todense().

Code:

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
document = ['john guy','nice guy']
vectorizer = CountVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(document)
transformer = vectorizer.transform(document)
print(transformer)

      

Output:

  (0, 0)    1
  (0, 1)    1
  (0, 2)    1
  (1, 0)    1
  (1, 3)    1
  (1, 4)    1

      

How can I read this output to get its values? I need the values at (0,0), (0,1), etc., and want to save them to a list.



3 answers


The documentation for the transform method says it returns a sparse matrix, but does not specify the format. The different sparse formats let you access the data in different ways, and they convert easily to one another. Your printed output is the typical str display of a sparse matrix.

An equivalent matrix can be generated with:

import numpy as np
from scipy import sparse
i = [0, 0, 0, 1, 1, 1]
j = [0, 1, 2, 0, 3, 4]
A = sparse.csr_matrix((np.ones_like(j), (i, j)))
print(A)

      

Output:

  (0, 0)        1
  (0, 1)        1
  (0, 2)        1
  (1, 0)        1
  (1, 3)        1
  (1, 4)        1

      

A csr matrix can be indexed like a dense matrix:

In [32]: A[0,0]
Out[32]: 1    
In [33]: A[0,3]
Out[33]: 0

      

Internally, a csr matrix stores its contents in the data, indices, and indptr attributes, which is convenient for calculation but a bit obscure. Convert it to coo format to get data that looks like your input:



In [34]: A.tocoo().row
Out[34]: array([0, 0, 0, 1, 1, 1], dtype=int32)

In [35]: A.tocoo().col
Out[35]: array([0, 1, 2, 0, 3, 4], dtype=int32)
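
Since the question asks for the values in a list, here is a minimal sketch of collecting (row, col, value) triples from the coo arrays. It rebuilds the example matrix A from the printed output above rather than calling CountVectorizer; the names coo and triples are only illustrative.

```python
import numpy as np
from scipy import sparse

# Rebuild the example matrix from the question's printed output.
i = [0, 0, 0, 1, 1, 1]
j = [0, 1, 2, 0, 3, 4]
A = sparse.csr_matrix((np.ones_like(j), (i, j)))

# Convert once, then pair up the three parallel arrays.
coo = A.tocoo()
triples = [(int(r), int(c), int(v))
           for r, c, v in zip(coo.row, coo.col, coo.data)]
print(triples)
```

Only the stored entries are visited, so the matrix is never expanded to a dense array.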

      

Or you can convert it to dok format and access the data like a dictionary:

A.todok().keys()
#  dict_keys([(0, 1), (0, 0), (1, 3), (1, 0), (0, 2), (1, 4)])
A.todok().items()

      

which produces (here in Python 3):

dict_items([((0, 1), 1), 
            ((0, 0), 1), 
            ((1, 3), 1), 
            ((1, 0), 1), 
            ((0, 2), 1), 
            ((1, 4), 1)])

      

The lil format stores the data as 2 lists of lists, one with the values (all 1 in this example) and the other with the column indices of those values for each row.
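
As a rough illustration of the lil layout, the sketch below walks the rows and data attributes of the same example matrix. The loop variable names are made up for this example.

```python
import numpy as np
from scipy import sparse

# Same example matrix as above.
i = [0, 0, 0, 1, 1, 1]
j = [0, 1, 2, 0, 3, 4]
A = sparse.csr_matrix((np.ones_like(j), (i, j)))

lil = A.tolil()
# lil.rows[r] lists the column indices stored in row r;
# lil.data[r] lists the matching values.
for r, (cols, vals) in enumerate(zip(lil.rows, lil.data)):
    print(r, cols, vals)
```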

Or do you want to "read" the data in some other way?



This is a SciPy CSR matrix. To convert it to (row, col, value) triples, the simplest option is to convert to COO format and then get the triples from that:

>>> from scipy.sparse import rand
>>> X = rand(100, 100, format='csr')
>>> X
<100x100 sparse matrix of type '<type 'numpy.float64'>'
    with 100 stored elements in Compressed Sparse Row format>
>>> X = X.tocoo()
>>> list(zip(X.row, X.col, X.data))[:10]
[(1, 78, 0.73843533223380842),
 (1, 91, 0.30943772717074158),
 (2, 35, 0.52635078317400608),
 (4, 75, 0.34667509458006551),
 (5, 30, 0.86482318943934389),
 (7, 74, 0.46260571098933323),
 (8, 75, 0.74193890941716234),
 (9, 72, 0.50095749482583696),
 (9, 80, 0.85906284644174613),
 (11, 66, 0.83072142899400137)]

      



(Note that the result is sorted.)
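
Another possible route, sketched here on the small matrix from the question instead of a random one, is scipy.sparse.find, which returns the row indices, column indices, and values of the nonzero entries as three arrays. The explicit sorted() call is a choice made for this example so that it does not rely on whatever order find returns.

```python
import numpy as np
from scipy import sparse

# Small stand-in for the matrix returned by transform().
i = [0, 0, 0, 1, 1, 1]
j = [0, 1, 2, 0, 3, 4]
X = sparse.csr_matrix((np.ones_like(j), (i, j)))

# find() gives three parallel arrays for the nonzero entries.
rows, cols, vals = sparse.find(X)

# Sort explicitly by (row, col) instead of assuming an order.
triples = sorted(zip(rows.tolist(), cols.tolist(), vals.tolist()))
print(triples)
```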



You can use data and toarray() like this (note that toarray() builds the full dense array):

>>> indices=transformer.toarray()
>>> indices
array([[1, 1, 1, 0, 0],
       [1, 0, 0, 1, 1]])
>>> values=transformer.data
>>> values
array([1, 1, 1, 1, 1, 1])
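
To stay sparse, the csr attributes data, indices, and indptr can be read directly. This is a sketch only: transformer here is a hand-built stand-in for the CountVectorizer result, and per_row is a hypothetical name.

```python
import numpy as np
from scipy import sparse

# Stand-in for the CSR matrix returned by vectorizer.transform(document).
i = [0, 0, 0, 1, 1, 1]
j = [0, 1, 2, 0, 3, 4]
transformer = sparse.csr_matrix((np.ones_like(j), (i, j)))

per_row = []
for r in range(transformer.shape[0]):
    # indptr[r]:indptr[r + 1] is the slice of indices/data belonging to row r.
    start, end = transformer.indptr[r], transformer.indptr[r + 1]
    cols = transformer.indices[start:end].tolist()
    vals = transformer.data[start:end].tolist()
    per_row.append(list(zip(cols, vals)))

print(per_row)
```

Each entry of per_row holds the (column, value) pairs stored for one document, without building the dense array.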

      
