How to disable default stopwords feature for sklearn TfidfVectorizer

I am trying to get tf-idf values for Japanese words. The problem I am running into is that sklearn's TfidfVectorizer is removing some Japanese characters I want to keep, treating them as if they were stop words.

Below is an example:

from sklearn.feature_extraction.text import TfidfVectorizer
tf = TfidfVectorizer(stop_words = None)

words_list = ["歯","が","痛い"]
tfidf_matrix =  tf.fit_transform(words_list)
feature_names = tf.get_feature_names() 
print (feature_names)

Output: ['痛い']

However, I want to keep all three characters in the list. I believe that TfidfVectorizer is removing characters of length 1 as stop words. How can I deactivate the default stopwords feature and keep all characters?

1 answer


You can change the token_pattern parameter from (?u)\\b\\w\\w+\\b (the default) to (?u)\\b\\w\\w*\\b. The default pattern only matches tokens of two or more word characters. If you are not familiar with regex, + means one or more, so \\w\\w+ matches two or more word characters; *, on the other hand, means zero or more, so \\w\\w* matches one or more word characters:
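To see the difference between the two patterns, you can test them directly with Python's re module (a minimal sketch using the same three tokens from the question):

```python
import re

# sklearn's default token_pattern: two or more word characters
default_pattern = r"(?u)\b\w\w+\b"
# relaxed pattern: one or more word characters
relaxed_pattern = r"(?u)\b\w\w*\b"

text = "歯 が 痛い"
print(re.findall(default_pattern, text))  # ['痛い']
print(re.findall(relaxed_pattern, text))  # ['歯', 'が', '痛い']
```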



from sklearn.feature_extraction.text import TfidfVectorizer
tf = TfidfVectorizer(stop_words = None, token_pattern='(?u)\\b\\w\\w*\\b')
words_list = ["歯","が","痛い"]
tfidf_matrix =  tf.fit_transform(words_list)
feature_names = tf.get_feature_names() 
print(feature_names)
# ['が', '歯', '痛い']
