Only ignore stop words for ngram_range = 1

Question


I am using the CountVectorizer from sklearn. I want to supply a stop word list and apply the vectorizer with ngram_range=(1, 3).

From what I can tell, if a word, say "me", is in the stop word list, it is also invisible to the higher-order n-grams; that is, "tell me" will not be a feature. Is there a way to specify something like "only apply the stop words when the n-gram is a unigram"?

+3

python scikit-learn nlp

Natalie Arellano May 09 '15 at 22:50


1 answer

Nikita Astrakhantsev · Accepted Answer · 2015-05-12T10:12:52+0000

You have at least 2 options:

Combine two feature extractors with FeatureUnion: one CountVectorizer with ngram_range=(1, 1) that uses the stop word list, and one with ngram_range=(2, 3) that does not.
(More efficient, but harder to implement and use.) Implement your own analyzer that checks the stop word list only for unigrams; see the example code in this answer.
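
Both options can be sketched roughly as follows. This is a minimal illustration, not the code from the linked answer: the sample documents, the two-word stop list, and the helper name make_unigram_stop_analyzer are assumptions made up for the example. It relies on the default word analyzer joining n-gram tokens with a single space, so any feature without a space is treated as a unigram.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import FeatureUnion

# Illustrative data (assumed for this sketch, not from the question).
docs = ["please tell me the story", "tell me more"]
stop_words = ["me", "the"]

# Option 1: FeatureUnion of two vectorizers.
# Unigrams are filtered by the stop list; 2- and 3-grams are not.
union = FeatureUnion([
    ("unigrams", CountVectorizer(ngram_range=(1, 1), stop_words=stop_words)),
    ("ngrams", CountVectorizer(ngram_range=(2, 3))),
])
X_union = union.fit_transform(docs)

# Collect the feature names learned by each fitted vectorizer.
union_features = set()
for _, vec in union.transformer_list:
    union_features.update(vec.vocabulary_)


# Option 2: a custom analyzer that drops stop words only among unigrams.
def make_unigram_stop_analyzer(stop_words, ngram_range=(1, 3)):
    """Hypothetical helper: wrap the default word analyzer and remove
    stop words only from the unigram features it produces."""
    base = CountVectorizer(ngram_range=ngram_range).build_analyzer()
    stop = frozenset(stop_words)

    def analyzer(doc):
        # The default analyzer joins n-gram tokens with a space,
        # so a feature containing no space is a unigram.
        return [t for t in base(doc) if " " in t or t not in stop]

    return analyzer


custom = CountVectorizer(analyzer=make_unigram_stop_analyzer(stop_words))
X_custom = custom.fit_transform(docs)
custom_features = set(custom.vocabulary_)
```

With either approach, "tell me" should survive as a bigram feature while "me" on its own is dropped. The second variant needs only one pass over each document and yields a single vocabulary, which is presumably what makes it the more efficient choice.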
