Equivalent to R removeSparseTerms in Python

Question

We are working on a data mining project and used the removeSparseTerms function from the tm package in R to reduce the number of features in our document-term matrix.

However, we are now trying to port the code to Python. Is there a function in sklearn, nltk or some other package that provides the same functionality?

Thanks!

+3

python scikit-learn r machine-learning tm

AnirudhJ, June 29 '15 at 6:53

1 answer

omerbp · Accepted Answer · 2015-06-29T07:34:48+0000

If your data is plain text, you can use the CountVectorizer to get the job done.

For example:

from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(min_df=2)
corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]
vectorizer = vectorizer.fit(corpus)
print(vectorizer.vocabulary_)
# prints a mapping equivalent to {'this': 4, 'is': 2, 'the': 3, 'document': 0, 'first': 1}
X = vectorizer.transform(corpus)

Now X is the document-term matrix. (If you are doing information retrieval, you may also want to consider Tf-idf term weighting.)

This gives you the document-term matrix in just a few lines.
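As a rough sketch of the Tf-idf variant mentioned above, assuming scikit-learn's TfidfVectorizer and the same corpus as in the example, the min_df filter can be applied in the same way (the variable names here are illustrative only):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]

# min_df=2 drops terms that occur in fewer than 2 documents,
# just as it does for CountVectorizer
tfidf = TfidfVectorizer(min_df=2)
X_tfidf = tfidf.fit_transform(corpus)  # sparse matrix of Tf-idf weights

print(X_tfidf.shape)              # 4 documents by 5 retained terms
print(sorted(tfidf.vocabulary_))  # the retained terms
```

The result is a sparse matrix with one row per document and one column per retained term, holding Tf-idf weights instead of raw counts.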

As for sparsity, you can control it with these parameters:

min_df - minimum document frequency allowed for a term in the document-term matrix
max_features - maximum number of features allowed in the document-term matrix
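A small sketch of both parameters on the corpus from the example above may help. It relies on scikit-learn reading a float min_df as a proportion of documents rather than an absolute count; the variable names are illustrative only:

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]

# A float min_df is a proportion of documents: 0.5 of 4 documents = 2,
# so this keeps the same terms as min_df=2
by_fraction = CountVectorizer(min_df=0.5).fit(corpus)
print(sorted(by_fraction.vocabulary_))

# max_features caps the vocabulary at the most frequent terms in the corpus
top_terms = CountVectorizer(max_features=3).fit(corpus)
print(len(top_terms.vocabulary_))
```

The proportional form of min_df is probably the closest in spirit to the sparse argument of removeSparseTerms, since both are expressed relative to the number of documents.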

Alternatively, if you already have a document-term matrix or a Tf-idf matrix and a notion of what counts as sparse, define MIN_VAL_ALLOWED and then do the following:

import numpy as np
from scipy.sparse import csr_matrix
MIN_VAL_ALLOWED = 2

X = csr_matrix([[7,8,0],
                [2,1,1],
                [5,5,0]])

# z is a boolean mask of the non-sparse terms (column sum above the threshold)
z = np.squeeze(np.asarray(X.sum(axis=0) > MIN_VAL_ALLOWED))

print(X[:, z].toarray())
# prints X without the third term (as it is sparse):
# [[7 8]
#  [2 1]
#  [5 5]]

(Use X = X[:,z] so that X remains a csr_matrix.)

If it is the minimum document frequency you want to put a threshold on, binarize the matrix first and then use it in exactly the same way:

import numpy as np
from scipy.sparse import csr_matrix

MIN_DF_ALLOWED = 2

X = csr_matrix([[7, 1.3, 0.9, 0],
                [2, 1.2, 0.8  , 1],
                [5, 1.5, 0  , 0]])

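# Binarize the data: B has a 1 wherever a term occurs in a document (X is left untouched)
B = (X > 0).astype(int)
# Column sums of B are document frequencies, so threshold on B rather than X
z = np.squeeze(np.asarray(B.sum(axis=0) > MIN_DF_ALLOWED))
print(X[:, z].toarray())
# prints:
# [[7.  1.3]
#  [2.  1.2]
#  [5.  1.5]]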

In this example, the third and fourth terms (columns) have disappeared because they appear in at most two documents (rows), which does not exceed the threshold. Use MIN_DF_ALLOWED to set the threshold.
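To tie this back to the original question, here is a minimal sketch of a helper that mimics the sparse argument of removeSparseTerms. It assumes, following the tm documentation, that R keeps only terms whose fraction of empty documents is below sparse. The name remove_sparse_terms is hypothetical and not part of sklearn or scipy:

```python
import numpy as np
from scipy.sparse import csr_matrix


def remove_sparse_terms(X, sparse):
    """Hypothetical helper: keep columns (terms) whose sparsity is below `sparse`.

    X is a documents-by-terms matrix. The sparsity of a term is taken to be
    the fraction of documents in which it does not occur (an assumption
    about how R's removeSparseTerms behaves).
    """
    X = csr_matrix(X)
    n_docs = X.shape[0]
    # Number of documents in which each term has a non-zero entry
    doc_freq = np.asarray((X != 0).sum(axis=0)).ravel()
    sparsity = 1.0 - doc_freq / n_docs
    keep = sparsity < sparse  # boolean mask over the columns
    return X[:, keep], keep


X = csr_matrix([[7, 1.3, 0.9, 0],
                [2, 1.2, 0.8, 1],
                [5, 1.5, 0, 0]])

# The fourth term is empty in 2 of 3 documents (sparsity about 0.67), so it is dropped
X_reduced, kept = remove_sparse_terms(X, sparse=0.5)
print(kept)
print(X_reduced.toarray())
```

Returning the mask alongside the reduced matrix makes it possible to filter a list of feature names with the same selection.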
