TF-IDF: find cosine similarity between a new document and the dataset

I have a TF-IDF dataset matrix:

tfidf = TfidfVectorizer().fit_transform(words)

      

where words is a list of descriptions. This gives a 69258x22024 matrix.

Now I want to find the cosine similarity between a new product and those in the matrix, since I need to find the 10 most similar products. I vectorize it using the same method as above.

However, I cannot multiply the matrices because their sizes differ (the new description is only about 6 words, so its matrix is 1x6), so I need a TfidfVectorizer that produces the same number of columns as the original.

How should I do it?



2 answers


I found a way to make it work. Instead of calling fit_transform on the new document, you first fit the vectorizer on the corpus:

queryTFIDF = TfidfVectorizer().fit(words)

      

We can now transform the new document into the same vector space using the transform function:



queryTFIDF = queryTFIDF.transform([query])

      

where query is the query string.
We can then find the cosine similarity and find the 10 most similar / relevant documents:

cosine_similarities = cosine_similarity(queryTFIDF, datasetTFIDF).flatten()
related_product_indices = cosine_similarities.argsort()[:-11:-1]
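Putting the pieces together, here is a minimal self-contained sketch. The product descriptions are made up for illustration; datasetTFIDF stands for the matrix obtained from fit_transform on the corpus:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# hypothetical corpus of product descriptions
corpus = [
    'red cotton t-shirt for men',
    'blue denim jeans slim fit',
    'red t-shirt with logo print',
    'leather wallet with card slots',
]

vectorizer = TfidfVectorizer()
datasetTFIDF = vectorizer.fit_transform(corpus)  # fit once, on the corpus only

query = 'red t-shirt'
queryTFIDF = vectorizer.transform([query])  # reuse the fitted vocabulary: 1 x n_features

cosine_similarities = cosine_similarity(queryTFIDF, datasetTFIDF).flatten()
related_product_indices = cosine_similarities.argsort()[:-3:-1]  # top 2 here; use [:-11:-1] for top 10
```

The key point is that transform reuses the vocabulary learned during fit, so the query vector has the same number of columns as the dataset matrix.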

      



I think the variable name words is ambiguous. I advise you to rename words to corpus.

In fact, you first put all your documents into the corpus variable, and afterwards you compute the cosine similarity.

Here's an example:

tf_idf.py:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
     'This is the first document.',
     'This is the second second document.',
     'And the third one.',
     'Is this the first document?',
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(corpus)
words = vectorizer.get_feature_names()  # renamed to get_feature_names_out() in newer scikit-learn
similarity_matrix = cosine_similarity(tfidf)

      

Execute in the ipython console:

In [1]: run tf_idf.py

In [2]: words
Out[2]: ['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']

In [3]: tfidf.toarray()
Out[3]: 
array([[ 0.        ,  0.43877674,  0.54197657,  0.43877674,  0.        ,
         0.        ,  0.35872874,  0.        ,  0.43877674],
       [ 0.        ,  0.27230147,  0.        ,  0.27230147,  0.        ,
         0.85322574,  0.22262429,  0.        ,  0.27230147],
       [ 0.55280532,  0.        ,  0.        ,  0.        ,  0.55280532,
         0.        ,  0.28847675,  0.55280532,  0.        ],
       [ 0.        ,  0.43877674,  0.54197657,  0.43877674,  0.        ,
         0.        ,  0.35872874,  0.        ,  0.43877674]])

In [4]: similarity_matrix
Out[4]: 
array([[ 1.        ,  0.43830038,  0.1034849 ,  1.        ],
       [ 0.43830038,  1.        ,  0.06422193,  0.43830038],
       [ 0.1034849 ,  0.06422193,  1.        ,  0.1034849 ],
       [ 1.        ,  0.43830038,  0.1034849 ,  1.        ]])

      

Note:

  • tfidf is a scipy.sparse.csr.csr_matrix; toarray() converts it to a numpy.ndarray (expensive, used here only to easily inspect the contents).
  • similarity_matrix is a symmetric matrix.
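These two properties can be checked directly; a small self-contained sketch reusing the example corpus above:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]

tfidf = TfidfVectorizer().fit_transform(corpus)
similarity_matrix = cosine_similarity(tfidf)

# symmetric, with ones on the diagonal (every document is identical to itself)
is_symmetric = np.allclose(similarity_matrix, similarity_matrix.T)
diagonal_is_one = np.allclose(np.diag(similarity_matrix), 1.0)
```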

You can do:

import numpy as np
print(np.triu(similarity_matrix, k=1))

      

This gives:

array([[ 0.        ,  0.43830038,  0.1034849 ,  1.        ],
       [ 0.        ,  0.        ,  0.06422193,  0.43830038],
       [ 0.        ,  0.        ,  0.        ,  0.1034849 ],
       [ 0.        ,  0.        ,  0.        ,  0.        ]]) 

      

This keeps only the interesting similarities: each pair appears once, and the trivial self-similarities on the diagonal are zeroed out.
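From there, a sketch of how one might pull out the most similar distinct pair of documents (the matrix values are the rounded ones from the output above):

```python
import numpy as np

# rounded similarity values from the example output above
similarity_matrix = np.array([
    [1.0, 0.438, 0.103, 1.0],
    [0.438, 1.0, 0.064, 0.438],
    [0.103, 0.064, 1.0, 0.103],
    [1.0, 0.438, 0.103, 1.0],
])

# zero out the diagonal and lower triangle, then locate the maximum
upper = np.triu(similarity_matrix, k=1)
i, j = np.unravel_index(np.argmax(upper), upper.shape)
```

Here documents 0 and 3 contain exactly the same tokens, so their similarity is 1 and they come out as the top pair.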

See:

http://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_similarity.html
http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction







