Equivalent to R removeSparseTerms in Python
We are working on a data mining project and used the removeSparseTerms function from the tm package in R to reduce the number of features in our document-term matrix.
However, we are now porting the code to Python. Is there a function in sklearn, nltk or some other package that provides the same functionality?
Thanks!
If your data is plain text, you can use the CountVectorizer to get the job done.
For example:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(min_df=2)
corpus = [
'This is the first document.',
'This is the second second document.',
'And the third one.',
'Is this the first document?',
]
vectorizer = vectorizer.fit(corpus)
print(vectorizer.vocabulary_)
# prints {'this': 4, 'is': 2, 'the': 3, 'document': 0, 'first': 1} (key order may vary)
X = vectorizer.transform(corpus)
Now X is the document-term matrix. (If you are doing information retrieval, you may also want to consider Tf-idf term weighting.)
CountVectorizer lets you build the document-term matrix in just a few lines.
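For the Tf-idf variant, here is a minimal sketch, assuming scikit-learn's TfidfVectorizer (which accepts the same min_df parameter) and the same toy corpus as above. The variable names X_tfidf and terms are only illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]

# min_df=2 drops every term that occurs in fewer than 2 documents
tfidf = TfidfVectorizer(min_df=2)
X_tfidf = tfidf.fit_transform(corpus)

# Surviving terms, sorted alphabetically (this matches the column order)
terms = sorted(tfidf.vocabulary_)
print(terms)          # ['document', 'first', 'is', 'the', 'this']
print(X_tfidf.shape)  # (4, 5)
```

The result is still a sparse matrix, so the column-filtering tricks further down apply to it as well.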
As for sparsity, you can control it with these parameters:
- min_df - the minimum document frequency allowed for a term in the document-term matrix.
- max_features - the maximum number of features allowed in the document-term matrix.
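To see what the two parameters do, here is a small sketch on the same toy corpus. It assumes the default tokenizer of CountVectorizer; the names full, half and top4 are just illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]

# No pruning: every distinct token becomes a column
full = CountVectorizer().fit(corpus)

# A float min_df is a proportion: keep terms found in at least 50% of the documents
half = CountVectorizer(min_df=0.5).fit(corpus)

# max_features keeps only the N terms with the highest total count in the corpus
top4 = CountVectorizer(max_features=4).fit(corpus)

print(len(full.vocabulary_))     # 9
print(sorted(half.vocabulary_))  # ['document', 'first', 'is', 'the', 'this']
print(sorted(top4.vocabulary_))  # ['document', 'is', 'the', 'this']
```

Note that min_df filters by the number of documents a term appears in, while max_features ranks terms by their total count, so the two can select different vocabularies.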
Alternatively, if you already have a document-term matrix or a Tf-idf matrix and have a notion of what counts as sparse, define MIN_VAL_ALLOWED and then do the following:
import numpy as np
from scipy.sparse import csr_matrix
MIN_VAL_ALLOWED = 2
X = csr_matrix([[7,8,0],
[2,1,1],
[5,5,0]])
# z is a boolean mask of the non-sparse terms (column sum above the threshold)
z = np.squeeze(np.asarray(X.sum(axis=0) > MIN_VAL_ALLOWED))
print(X[:, z].toarray())
# prints X without the third term (as it is sparse):
# [[7 8]
#  [2 1]
#  [5 5]]
(Use X = X[:, z] so that X remains a csr_matrix.)
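If you also track the term names, the same mask can be applied to them so that labels and columns stay aligned. A minimal sketch, where the terms array is made-up data for illustration and keep is just a more descriptive name for the mask:

```python
import numpy as np
from scipy.sparse import csr_matrix

MIN_VAL_ALLOWED = 2

# Hypothetical term labels for the three columns of X
terms = np.array(['apple', 'banana', 'cherry'])
X = csr_matrix([[7, 8, 0],
                [2, 1, 1],
                [5, 5, 0]])

# Boolean mask of columns whose total count exceeds the threshold
keep = np.asarray(X.sum(axis=0)).ravel() > MIN_VAL_ALLOWED

X = X[:, keep]       # still a sparse CSR matrix
terms = terms[keep]  # labels now match the remaining columns

print(terms.tolist())  # ['apple', 'banana']
print(X.shape)         # (3, 2)
```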
If it is the minimum document frequency that you want to threshold on, binarize the matrix first and then apply exactly the same approach:
import numpy as np
from scipy.sparse import csr_matrix
MIN_DF_ALLOWED = 2
X = csr_matrix([[7, 1.3, 0.9, 0],
[2, 1.2, 0.8 , 1],
[5, 1.5, 0 , 0]])
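# Work on a binarized copy so the original values in X are preserved
B = csr_matrix(X, copy=True)
B[B > 0] = 1
# Column sums of B are document frequencies, so threshold on B rather than X
z = np.squeeze(np.asarray(B.sum(axis=0) > MIN_DF_ALLOWED))
print(X[:, z].toarray())
# prints:
# [[7.  1.3]
#  [2.  1.2]
#  [5.  1.5]]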
In this example, the third and fourth terms (columns) are dropped because they appear in at most two documents (rows): the third appears in two and the fourth in only one. Use MIN_DF_ALLOWED to set the threshold.
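To get closer to the R call itself, the document-frequency approach can be wrapped in a small function. This is only a sketch: remove_sparse_terms is a hypothetical helper, not part of any library, and it assumes that tm's removeSparseTerms(x, sparse) keeps the terms whose document frequency is greater than n_docs * (1 - sparse). Check that assumption against your R results before relying on it.

```python
import numpy as np
from scipy.sparse import csr_matrix

def remove_sparse_terms(X, sparse):
    """Hypothetical helper: keep the columns whose document frequency
    is greater than n_docs * (1 - sparse) (assumed tm behaviour)."""
    if not 0 < sparse < 1:
        raise ValueError("sparse must be strictly between 0 and 1")
    X = csr_matrix(X)
    n_docs = X.shape[0]
    # Document frequency: number of rows with a non-zero entry in each column
    doc_freq = np.asarray((X != 0).sum(axis=0)).ravel()
    keep = doc_freq > n_docs * (1 - sparse)
    return X[:, keep], keep

X = csr_matrix([[7, 1.3, 0.9, 0],
                [2, 1.2, 0.8, 1],
                [5, 1.5, 0,   0]])

# With 3 documents and sparse=0.5, a term must appear in more than 1.5 documents
X_small, keep = remove_sparse_terms(X, sparse=0.5)
print(keep.tolist())  # [True, True, True, False]
print(X_small.shape)  # (3, 3)
```

Returning the mask alongside the matrix makes it easy to filter a list of term names with the same selection.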