TF-IDF: find cosine similarity between a new document and a dataset
I have a TF-IDF dataset matrix:
tfidf = TfidfVectorizer().fit_transform(words)
where words is a list of descriptions. This gives a 69258x22024 matrix.
Now I want to find the cosine similarity between a new product and those in the matrix, since I need to find the 10 most similar products. I vectorize it using the same method as above.
However, I cannot multiply the matrices because their sizes differ (the new description is only about 6 words, so its matrix is 1x6), so I need a TfidfVectorizer that produces the same number of columns as the original.
How should I do it?
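To illustrate the mismatch the question describes, here is a minimal sketch (the corpus and query below are hypothetical, chosen only to show the shapes): fitting a second TfidfVectorizer on the query alone builds a new vocabulary, so the resulting matrix has a different number of columns than the dataset matrix and cannot be compared with it.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-ins for the real dataset and query
corpus = ["red cotton shirt", "blue denim jeans", "red leather shoes"]
query = "red shirt with buttons"

dataset_tfidf = TfidfVectorizer().fit_transform(corpus)   # vocabulary from the corpus
query_tfidf = TfidfVectorizer().fit_transform([query])    # separate vocabulary from the query

print(dataset_tfidf.shape)  # (3, 8) -- 8 distinct corpus words
print(query_tfidf.shape)    # (1, 4) -- 4 distinct query words, incompatible width
```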
I found a way to make this work. Instead of using fit_transform, you first fit the vectorizer on the corpus, like this:
queryTFIDF = TfidfVectorizer().fit(words)
We can now convert the query into the same vector space as that matrix using the transform function:
queryTFIDF = queryTFIDF.transform([query])
where query is the query string.
We can then find the cosine similarity and find the 10 most similar / relevant documents:
cosine_similarities = cosine_similarity(queryTFIDF, datasetTFIDF).flatten()
related_product_indices = cosine_similarities.argsort()[:-11:-1]
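Putting the steps above together, here is a small end-to-end sketch (the product descriptions are hypothetical stand-ins for the 69258-row dataset):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical product descriptions standing in for the real dataset
words = ["red cotton shirt", "blue denim jeans",
         "red leather shoes", "blue cotton shirt"]
query = "red shirt"

vectorizer = TfidfVectorizer().fit(words)   # learn the vocabulary once
datasetTFIDF = vectorizer.transform(words)  # dataset matrix
queryTFIDF = vectorizer.transform([query])  # query projected into the same space

cosine_similarities = cosine_similarity(queryTFIDF, datasetTFIDF).flatten()
related_product_indices = cosine_similarities.argsort()[:-11:-1]  # up to 10, most similar first
print(related_product_indices)
```

Because both matrices share the vocabulary learned by fit, their column counts match and the similarity can be computed directly.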
I think the variable name words is ambiguous. I advise you to rename words to corpus. In fact, you first put all your documents into a corpus variable, and only after that do you compute your cosine similarity.
Here's an example:
tf_idf.py:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(corpus)
words = vectorizer.get_feature_names()
similarity_matrix = cosine_similarity(tfidf)
Execute in the ipython console:
In [1]: run tf_idf.py
In [2]: words
Out[2]: ['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
In [3]: tfidf.toarray()
Out[3]:
array([[ 0. , 0.43877674, 0.54197657, 0.43877674, 0. ,
0. , 0.35872874, 0. , 0.43877674],
[ 0. , 0.27230147, 0. , 0.27230147, 0. ,
0.85322574, 0.22262429, 0. , 0.27230147],
[ 0.55280532, 0. , 0. , 0. , 0.55280532,
0. , 0.28847675, 0.55280532, 0. ],
[ 0. , 0.43877674, 0.54197657, 0.43877674, 0. ,
0. , 0.35872874, 0. , 0.43877674]])
In [4]: similarity_matrix
Out[4]:
array([[ 1. , 0.43830038, 0.1034849 , 1. ],
[ 0.43830038, 1. , 0.06422193, 0.43830038],
[ 0.1034849 , 0.06422193, 1. , 0.1034849 ],
[ 1. , 0.43830038, 0.1034849 , 1. ]])
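Each row of similarity_matrix can also be used directly for ranking. A small sketch reusing the corpus above (with a stable sort so tied scores keep their original order): rank every document by its similarity to document 0.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]
tfidf = TfidfVectorizer().fit_transform(corpus)
similarity_matrix = cosine_similarity(tfidf)

# Rank all documents by similarity to document 0, most similar first.
# Documents 0 and 3 contain the same tokens, so both score 1.0.
ranking = np.argsort(similarity_matrix[0], kind='stable')[::-1]
print(ranking)
```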
Note:
- tfidf is a scipy.sparse.csr.csr_matrix; toarray converts it to a numpy.ndarray (this is expensive, and is used here only to inspect the content easily).
- similarity_matrix is a symmetric matrix.
You can do:
import numpy as np
print(np.triu(similarity_matrix, k=1))
which gives:
array([[ 0. , 0.43830038, 0.1034849 , 1. ],
[ 0. , 0. , 0.06422193, 0.43830038],
[ 0. , 0. , 0. , 0.1034849 ],
[ 0. , 0. , 0. , 0. ]])
so that only the interesting similarities are shown (the upper triangle, without the diagonal).
See:
http://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_similarity.html
http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction