How to disable default stopwords feature for sklearn TfidfVectorizer
I am trying to get tf-idf values for Japanese words. The problem I am running into is that sklearn's TfidfVectorizer is removing some Japanese characters I want to keep when it tokenizes the text.
Below is an example:
from sklearn.feature_extraction.text import TfidfVectorizer
tf = TfidfVectorizer(stop_words = None)
words_list = ["歯","が","痛い"]
tfidf_matrix = tf.fit_transform(words_list)
feature_names = tf.get_feature_names()
print (feature_names)
Output: ['痛い']
However, I want to keep all three characters in the list. I believe that TfidfVectorizer is removing characters of length 1 as stop words. How can I deactivate the default stopwords feature and keep all characters?
You can change the token_pattern parameter from its default, (?u)\\b\\w\\w+\\b, to (?u)\\b\\w\\w*\\b. The default pattern only matches tokens of two or more word characters (if you are not familiar with regex, + means one or more, so \\w\\w+ matches a word of two or more word characters; * on the other hand means zero or more, so \\w\\w* matches a word of one or more word characters).
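As a quick standalone check (this uses Python's re module directly, outside of scikit-learn), you can compare what the two patterns pick out of the same characters:

import re

# Compare the two patterns directly with plain `re`.
default_pattern = re.compile(r"(?u)\b\w\w+\b")   # two or more word characters
relaxed_pattern = re.compile(r"(?u)\b\w\w*\b")   # one or more word characters

text = "歯 が 痛い"
print(default_pattern.findall(text))  # ['痛い']
print(relaxed_pattern.findall(text))  # ['歯', 'が', '痛い']

Applied to your example: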
from sklearn.feature_extraction.text import TfidfVectorizer

# token_pattern relaxed so that single-character tokens are kept as well
tf = TfidfVectorizer(stop_words=None, token_pattern='(?u)\\b\\w\\w*\\b')
words_list = ["歯", "が", "痛い"]
tfidf_matrix = tf.fit_transform(words_list)
feature_names = tf.get_feature_names()
print(feature_names)
# ['が', '歯', '痛い']
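One version note (not part of the original answer): get_feature_names() was deprecated in scikit-learn 1.0 and removed in 1.2, so on a recent version the equivalent call is get_feature_names_out():

# On scikit-learn >= 1.0, use get_feature_names_out() instead of get_feature_names()
feature_names = tf.get_feature_names_out()
print(list(feature_names))
# ['が', '歯', '痛い']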