How to increase weight of proper nouns in scikit TfidfVectorizer

Question

I use scikit-learn's TfidfVectorizer to extract keywords from a list of scientific articles. There is a stop_words argument, but I was wondering whether I could give more weight (a higher score) to relevant names like "Bor" or "Japan".

Should I implement my own custom TF-IDF vectorizer, or can I do this with the existing one?

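from sklearn.feature_extraction.text import TfidfVectorizer

# stopwords is my own list of stop words (its definition is not shown here)
tf = TfidfVectorizer(strip_accents='ascii',
                     analyzer='word',
                     ngram_range=(1, 1),
                     min_df=0,
                     stop_words=stopwords,
                     lowercase=True)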

python scikit-learn machine-learning nlp nltk

Kevin Sun, June 18 '17 at 14:32

1 answer

vZ10 · Accepted Answer · 2017-06-18T14:48:26+0000

You can do your own postprocessing on the TF-IDF matrix.

First, go through all the words in the vocabulary to find the indices of all named entities (proper nouns), then go through the matrix and increase the weights at those indices.
