Only ignore stop words for ngram_range = 1

I am using CountVectorizer from sklearn ... with a stop word list and ngram_range=(1, 3).

From what I can tell, if a word (say, "me") is in the stop word list, it is also dropped from the higher-order n-grams, so "tell me" will not be a feature. Is there a way to specify something like "only apply stop words when the n-gram size is 1"?



1 answer


You have at least 2 options:



  • combine two feature extractors with FeatureUnion: one CountVectorizer with ngram_range=(1, 1) and stop words, and one with ngram_range=(2, 3) and no stop words

  • (more efficient, but harder to implement and use) implement your own analyzer that checks tokens against the stop word list only for unigrams; see the example code in this answer.
