Only ignore stop words for ngram_range = 1
I am using the CountVectorizer from sklearn ... with a stop word list and ngram_range (1,3).
From what I can tell, if a word, say "me", is in the stop word list, it is also dropped from the higher-order n-grams, so "tell me" will not be a feature. Is there a way to specify something like "only ignore stop words when the n-gram is a unigram"?
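A minimal sketch that reproduces the behavior described above. The one-document corpus and the single-word stop list are made up for illustration; the only assumption about sklearn is that CountVectorizer removes stop words from the token stream before it builds n-grams.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus and stop word list, for illustration only.
docs = ["tell me more"]

vectorizer = CountVectorizer(stop_words=["me"], ngram_range=(1, 3))
vectorizer.fit(docs)

# vocabulary_ maps each extracted feature to its column index.
features = sorted(vectorizer.vocabulary_)
print(features)
```

Because "me" is removed first, the remaining tokens "tell" and "more" become adjacent, so "tell more" shows up as a bigram while "tell me" does not.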
1 answer
You have at least 2 options:
- Combine two vectorizers with FeatureUnion: one with ngram_range (1,1) and stop words, and one with ngram_range (2,3) and no stop words.
- (More efficient, but harder to implement and use) Implement your own analyzer that checks tokens against the stop word list only for unigrams; see example code in this answer.
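Both options could look roughly like the sketch below. It is not the code from the linked answer: `STOP_WORDS`, `docs`, and the helper `make_unigram_stop_analyzer` are hypothetical names chosen for this example, and the custom analyzer simply reuses CountVectorizer's default preprocessor and tokenizer.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import FeatureUnion

# Hypothetical stop word list and corpus, for illustration only.
STOP_WORDS = ["me"]
docs = ["tell me more"]

# Option 1: two vectorizers side by side.
# Unigrams are filtered by the stop list; 2- and 3-grams are not.
union = FeatureUnion([
    ("unigrams", CountVectorizer(stop_words=STOP_WORDS, ngram_range=(1, 1))),
    ("ngrams", CountVectorizer(ngram_range=(2, 3))),
])
X_union = union.fit_transform(docs)
fitted = dict(union.transformer_list)
unigram_vocab = set(fitted["unigrams"].vocabulary_)
ngram_vocab = set(fitted["ngrams"].vocabulary_)


# Option 2: a custom analyzer that applies the stop list to unigrams only.
def make_unigram_stop_analyzer(stop_words, ngram_range=(1, 3)):
    """Return a callable usable as CountVectorizer(analyzer=...)."""
    stop_words = frozenset(stop_words)
    base = CountVectorizer()
    preprocess = base.build_preprocessor()  # default: lowercases the text
    tokenize = base.build_tokenizer()       # default token_pattern
    min_n, max_n = ngram_range

    def analyzer(doc):
        tokens = tokenize(preprocess(doc))
        features = []
        for n in range(min_n, max_n + 1):
            for i in range(len(tokens) - n + 1):
                gram = tokens[i:i + n]
                # Skip stop words only when the n-gram is a single token.
                if n == 1 and gram[0] in stop_words:
                    continue
                features.append(" ".join(gram))
        return features

    return analyzer


custom = CountVectorizer(analyzer=make_unigram_stop_analyzer(STOP_WORDS))
X_custom = custom.fit_transform(docs)
custom_vocab = set(custom.vocabulary_)
print(sorted(custom_vocab))
```

With either option, "me" is absent as a unigram but "tell me" and "me more" are kept. The FeatureUnion version tokenizes each document twice, which is why the single custom analyzer is the more efficient of the two.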