Read a sparse matrix in Python
I want to read a sparse matrix. When I build n-grams with scikit-learn, its transform() returns the result as a sparse matrix. I want to read this matrix without calling todense().
Code:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
document = ['john guy','nice guy']
vectorizer = CountVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(document)
transformer = vectorizer.transform(document)
print(transformer)
Output:
  (0, 0)    1
  (0, 1)    1
  (0, 2)    1
  (1, 0)    1
  (1, 3)    1
  (1, 4)    1
How can I read this output to get its values? I need the values at (0,0), (0,1), etc., saved to a list.
The documentation for the transform method says it returns a sparse matrix, but does not specify the type. The different sparse formats let you access the data in different ways, and they are easy to convert between. Your printed display is the typical str of a sparse matrix.
An equivalent matrix can be generated with:
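import numpy as np
from scipy import sparse
i = [0, 0, 0, 1, 1, 1]  # row index of each stored value
j = [0, 1, 2, 0, 3, 4]  # column index of each stored value
A = sparse.csr_matrix((np.ones_like(j), (i, j)))
print(A)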
which produces:
  (0, 0)    1
  (0, 1)    1
  (0, 2)    1
  (1, 0)    1
  (1, 3)    1
  (1, 4)    1
A csr matrix can be indexed like a dense matrix:
In [32]: A[0,0]
Out[32]: 1
In [33]: A[0,3]
Out[33]: 0
Internally, a csr matrix stores its data in the data, indices, and indptr arrays, which is convenient for computation but a bit obscure. Convert it to coo format to get data that looks like the output you printed:
In [34]: A.tocoo().row
Out[34]: array([0, 0, 0, 1, 1, 1], dtype=int32)
In [35]: A.tocoo().col
Out[35]: array([0, 1, 2, 0, 3, 4], dtype=int32)
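The data, indices, and indptr arrays of the csr format can also be read directly, without converting. The following is only a minimal sketch on the same example matrix; the name triples is my own, not part of SciPy. For row r, the slice indptr[r]:indptr[r + 1] selects that row's column indices and values.

```python
import numpy as np
from scipy import sparse

# Same example matrix as above: 2 rows, 5 columns, six stored ones.
i = [0, 0, 0, 1, 1, 1]
j = [0, 1, 2, 0, 3, 4]
A = sparse.csr_matrix((np.ones_like(j), (i, j)))

triples = []  # illustrative name: collects (row, col, value) tuples
for row in range(A.shape[0]):
    # indptr[row]:indptr[row + 1] is the slice of indices/data for this row
    start, end = A.indptr[row], A.indptr[row + 1]
    for col, value in zip(A.indices[start:end], A.data[start:end]):
        triples.append((row, int(col), int(value)))

print(triples)
```

This avoids building a second matrix, which may matter for large inputs, but the coo conversion is usually easier to read.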
Or you can convert it to dok format and access the data like a dictionary:
A.todok().keys()
# dict_keys([(0, 1), (0, 0), (1, 3), (1, 0), (0, 2), (1, 4)])
A.todok().items()
which produces (in Python 3):
dict_items([((0, 1), 1), ((0, 0), 1), ((1, 3), 1), ((1, 0), 1), ((0, 2), 1), ((1, 4), 1)])
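Since the question asks for the values in a list, the dok items can be collected and sorted. This is a rough sketch under the assumption that a list of ((row, col), value) pairs is what is wanted; the name entries is mine. Sorting is there because the dictionary order is not row by row, as the output above shows.

```python
import numpy as np
from scipy import sparse

# Same example matrix as above.
i = [0, 0, 0, 1, 1, 1]
j = [0, 1, 2, 0, 3, 4]
A = sparse.csr_matrix((np.ones_like(j), (i, j)))

D = A.todok()
# Convert NumPy scalars to plain ints and sort by (row, col) key,
# so the list order does not depend on the dictionary's ordering.
entries = sorted(((int(r), int(c)), int(v)) for (r, c), v in D.items())
print(entries)
```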
The lil format stores the data as two lists of lists: one with the values (all 1s in this example) and the other with the column indices of the stored entries in each row.
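To make the lil layout concrete, here is a small sketch on the same example matrix. It reads the rows and data attributes of the lil matrix; the names cols_per_row and vals_per_row are only illustrative.

```python
import numpy as np
from scipy import sparse

# Same example matrix as above.
i = [0, 0, 0, 1, 1, 1]
j = [0, 1, 2, 0, 3, 4]
A = sparse.csr_matrix((np.ones_like(j), (i, j)))

L = A.tolil()
# L.rows[r] lists the column indices stored in row r;
# L.data[r] lists the matching values, in the same order.
cols_per_row = [[int(c) for c in cols] for cols in L.rows]
vals_per_row = [[int(v) for v in vals] for vals in L.data]

for r, (cols, vals) in enumerate(zip(cols_per_row, vals_per_row)):
    print(r, cols, vals)
```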
Or do you want to "read" the data in some other way?
This is a SciPy CSR matrix. To convert it to (row, col, value) triples, the simplest option is to convert to COO format and then read the triples from that:
>>> from scipy.sparse import rand
>>> X = rand(100, 100, format='csr')
>>> X
<100x100 sparse matrix of type '<type 'numpy.float64'>'
with 100 stored elements in Compressed Sparse Row format>
>>> X = X.tocoo()  # CSR has no .row/.col attributes; COO does
>>> list(zip(X.row, X.col, X.data))[:10]
[(1, 78, 0.73843533223380842),
(1, 91, 0.30943772717074158),
(2, 35, 0.52635078317400608),
(4, 75, 0.34667509458006551),
(5, 30, 0.86482318943934389),
(7, 74, 0.46260571098933323),
(8, 75, 0.74193890941716234),
(9, 72, 0.50095749482583696),
(9, 80, 0.85906284644174613),
(11, 66, 0.83072142899400137)]
(Note that the result is sorted.)
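As an alternative sketch, scipy.sparse.find returns the row indices, column indices, and values of the nonzero entries in one call, whatever the input format. The matrix below is a hand-built stand-in for the CountVectorizer output in the question, and the name triples is mine. The result is sorted explicitly here rather than relying on the order find happens to return.

```python
import numpy as np
from scipy import sparse

# Stand-in for the matrix returned by vectorizer.transform(document).
i = [0, 0, 0, 1, 1, 1]
j = [0, 1, 2, 0, 3, 4]
X = sparse.csr_matrix((np.ones_like(j), (i, j)))

# find() gives three parallel arrays: rows, columns, values of the nonzeros.
rows, cols, vals = sparse.find(X)

# Build plain-Python (row, col, value) tuples, sorted by row then column.
triples = sorted((int(r), int(c), int(v)) for r, c, v in zip(rows, cols, vals))
print(triples)
```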