Only ignore stop words for ngram_range = 1

I am using CountVectorizer from sklearn ... with a stop word list and ngram_range=(1, 3).

From what I can tell, if a word (say "me") is in the stop word list, it is also dropped from higher-order n-grams, so "tell me" will not be a feature. Is there a way to specify something like "only apply stop words when the n-gram length is 1"?



1 answer

You have at least 2 options:

  • combine two vectorizers with FeatureUnion: one with ngram_range=(1, 1) and stop words, and one with ngram_range=(2, 3) and no stop words

  • (more efficient, but harder to implement and use) write your own analyzer that checks the stop word list only for unigrams; see example code in this answer .


