How to disable default stopwords feature for sklearn TfidfVectorizer
I am trying to get tf-idf values for Japanese words. The problem I am running into is that sklearn's TfidfVectorizer is removing some Japanese characters I want to keep when it tokenizes the text.
Below is an example:
from sklearn.feature_extraction.text import TfidfVectorizer
tf = TfidfVectorizer(stop_words = None)
words_list = ["歯","が","痛い"]
tfidf_matrix = tf.fit_transform(words_list)
feature_names = tf.get_feature_names()
print (feature_names)
Output: ['痛い']
However, I want to keep all three characters in the list. I believe that TfidfVectorizer is removing characters of length 1 as stop words. How can I deactivate the default stopwords feature and keep all characters?
You can change the token_pattern parameter from its default, (?u)\\b\\w\\w+\\b, to (?u)\\b\\w\\w*\\b. The default pattern only matches tokens of two or more word characters (if you are not familiar with regex, + means one or more, so \\w\\w+ matches a word of two or more word characters; * on the other hand means zero or more, so \\w\\w* matches a word of one or more word characters).
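As a quick standalone check (this uses Python's re module directly, outside of scikit-learn), you can compare what the two patterns pick out of the same characters:

import re

# Compare the two patterns directly with plain `re`.
default_pattern = re.compile(r"(?u)\b\w\w+\b")   # two or more word characters
relaxed_pattern = re.compile(r"(?u)\b\w\w*\b")   # one or more word characters

text = "歯 が 痛い"
print(default_pattern.findall(text))  # ['痛い']
print(relaxed_pattern.findall(text))  # ['歯', 'が', '痛い']

Applied to your example: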
from sklearn.feature_extraction.text import TfidfVectorizer

# token_pattern relaxed so that single-character tokens are kept as well
tf = TfidfVectorizer(stop_words=None, token_pattern='(?u)\\b\\w\\w*\\b')
words_list = ["歯", "が", "痛い"]
tfidf_matrix = tf.fit_transform(words_list)
feature_names = tf.get_feature_names()
print(feature_names)
# ['が', '歯', '痛い']
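One version note (not part of the original answer): get_feature_names() was deprecated in scikit-learn 1.0 and removed in 1.2, so on a recent version the equivalent call is get_feature_names_out():

# On scikit-learn >= 1.0, use get_feature_names_out() instead of get_feature_names()
feature_names = tf.get_feature_names_out()
print(list(feature_names))
# ['が', '歯', '痛い']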