Unicode warning when using NLTK stopwords with scikit-learn's TfidfVectorizer

Question

I am trying to use scikit-learn's TfidfVectorizer with the Spanish stopwords from NLTK:

from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(stop_words=stopwords.words("spanish"))

The problem is that I am getting the following warning:

/home/---/.virtualenvs/thesis/local/lib/python2.7/site-packages/sklearn/feature_extraction/text.py:122: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
tokens = [w for w in tokens if w not in stop_words]

Is there an easy way to solve this problem?

python python-2.7 scikit-learn unicode nltk

markusian 22 Aug '14 at 9:31

1 answer

markusian · Answer 1 · 2014-08-22T11:20:50+0000

It turned out the problem was easier to solve than I thought. NLTK returns the stopwords as str (byte string) objects rather than unicode objects, so comparing them against the unicode tokens triggers the warning. I needed to decode them from UTF-8 before using them:

spanish_stopwords = [word.decode('utf-8') for word in stopwords.words('spanish')]
vectorizer = TfidfVectorizer(stop_words=spanish_stopwords)
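
In case it helps, here is a minimal sketch of the same decoding step written so that it also tolerates entries that are already unicode (as far as I know, newer NLTK releases and Python 3 return text already, in which case calling .decode directly would fail). The helper name to_unicode and the sample list are my own illustration, not part of NLTK or scikit-learn; the sample stands in for what stopwords.words("spanish") returned in my Python 2.7 setup.

```python
def to_unicode(words, encoding="utf-8"):
    """Decode byte strings to text; leave entries that are already text unchanged."""
    return [w.decode(encoding) if isinstance(w, bytes) else w for w in words]


# Hypothetical sample: two UTF-8 encoded byte strings and one unicode string.
raw_words = [b"est\xc3\xa1", b"tambi\xc3\xa9n", u"de"]

# After this call every entry is a unicode string, so comparisons against
# unicode tokens no longer mix byte strings and text.
clean_words = to_unicode(raw_words)
```

The resulting list can then be passed as the stop_words argument in place of the raw NLTK list.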
