Semantic Similarities Between Phrases Using GenSim

Background

I'm trying to judge whether a phrase is semantically related to other words found in a corpus using Gensim. For example, here is the pre-labeled corpus:

 **Corpus**
 Car Insurance
 Car Insurance Coverage
 Auto Insurance
 Best Insurance
 How much is car insurance
 Best auto coverage
 Auto policy
 Car Policy Insurance

      

My code (based on this gensim tutorial) judges the semantic relatedness of a phrase using cosine similarity against all strings in the corpus.

Problem

It looks like if the query contains ANY of the terms found in my vocabulary, that phrase is considered semantically similar to the corpus (e.g. **Giraffe Poop Car Murderer** gets a cosine similarity close to 1, but SHOULD be semantically unrelated). I'm not sure how to solve this problem.

Code

from gensim import corpora, models, similarities
from collections import defaultdict

# Tokenize corpus and filter out anything that is a stop word
# or appears only once (documents is assumed to be a list of
# token lists and stoplist a set of stop words, as above)
texts = [[word for word in document if word not in stoplist]
        for document in documents]

frequency = defaultdict(int)
for text in texts:
    for token in text:
        frequency[token] += 1
texts = [[token for token in text if frequency[token] > 1]
        for text in texts]
dictionary = corpora.Dictionary(texts)

# doc2bow counts the number of occurrences of each distinct word, converts the word
# to its integer word id and returns the result as a sparse vector

corpus = [dictionary.doc2bow(text) for text in texts]  
lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=2)
doc = "giraffe poop car murderer"
vec_bow = dictionary.doc2bow(doc.lower().split())

#convert the query to LSI space
vec_lsi = lsi[vec_bow]              
index = similarities.MatrixSimilarity(lsi[corpus])

# perform a similarity query against the corpus
sims = index[vec_lsi]
sims = sorted(enumerate(sims), key=lambda item: -item[1])

      

1 answer


First of all, you are not directly comparing the cosine similarity of the bag-of-words vectors; you first reduce the dimensionality of your document vectors by applying latent semantic analysis ( https://en.wikipedia.org/wiki/Latent_semantic_analysis ). This is fine, but I just wanted to highlight it. It is generally assumed that the underlying semantic space of a corpus has a lower dimensionality than the number of unique tokens. LSA therefore applies a principal component analysis to your vector space and keeps only the directions that contain the most variance (i.e. the directions in which the space changes most rapidly, and which are therefore assumed to carry the most information). This is controlled by the `num_topics` parameter that you pass to the `LsiModel` constructor.
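As a quick way to inspect what those retained directions look like, you can print the fitted model's topics. This is just an illustrative check that reuses the `corpus` and `dictionary` built in the code below; the exact weights will depend on your data and on `num_topics`:

# Illustrative only: show the directions (topics) kept by the LSI model.
# Assumes `corpus` and `dictionary` have been built as in the code below.
lsi_small = models.LsiModel(corpus, id2word=dictionary, num_topics=2)
for topic in lsi_small.show_topics(num_topics=2, num_words=5):
    print(topic)

# A larger num_topics keeps more directions (up to the number of unique
# tokens), which brings the model closer to plain bag-of-words similarity.
lsi_larger = models.LsiModel(corpus, id2word=dictionary, num_topics=5)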

Second, I cleaned up your code a bit and included the corpus inline:

# Tokenize Corpus and filter out anything that is a
# stop word or has a frequency <1

from gensim import corpora, models, similarities
from collections import defaultdict

documents = [
    'Car Insurance',  # doc_id 0
    'Car Insurance Coverage',  # doc_id 1
    'Auto Insurance',  # doc_id 2
    'Best Insurance',  # doc_id 3
    'How much is car insurance',  # doc_id 4
    'Best auto coverage',  # doc_id 5
    'Auto policy',  # doc_id 6
    'Car Policy Insurance',  # doc_id 7
]

stoplist = set(['is', 'how'])

texts = [[word.lower() for word in document.split()
          if word.lower() not in stoplist]
         for document in documents]

print(texts)
frequency = defaultdict(int)
for text in texts:
    for token in text:
        frequency[token] += 1
texts = [[token for token in text if frequency[token] > 1]
         for text in texts]
dictionary = corpora.Dictionary(texts)

# doc2bow counts the number of occurrences of each distinct word,
# converts the word to its integer word id and returns the result
# as a sparse vector

corpus = [dictionary.doc2bow(text) for text in texts]
lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=2)
doc = "giraffe poop car murderer"
vec_bow = dictionary.doc2bow(doc.lower().split())

# convert the query to LSI space
vec_lsi = lsi[vec_bow]
index = similarities.MatrixSimilarity(lsi[corpus])

# perform a similarity query against the corpus
sims = index[vec_lsi]
sims = sorted(enumerate(sims), key=lambda item: -item[1])

print(sims)

      

If I run the above, I get the following output:



[(0, 0.97798139), (4, 0.97798139), (7, 0.94720691), (1, 0.89220524), (3, 0.61052465), (2, 0.42138112), (6, -0.1468758), (5, -0.22077486)]

      

where each entry in this list is a `(doc_id, cosine_similarity)` tuple, sorted by cosine similarity in descending order.

Since the only word in your query document that is actually part of your dictionary (built from your corpus) is `car`, all other tokens are discarded. The query presented to your model therefore effectively consists of the singleton document `car`. Consequently, all documents containing `car` are, as you observed, rated as very similar to your query.
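One way to attack your original problem is to check how much of the query is actually covered by the dictionary before trusting the similarity scores. This is only one possible approach, not something gensim does for you; the `query_coverage` helper and the 0.5 threshold below are illustrative choices you would tune for your data:

def query_coverage(query, dictionary):
    """Return the fraction of query tokens that are present in the dictionary."""
    tokens = query.lower().split()
    known = [t for t in tokens if t in dictionary.token2id]
    return float(len(known)) / len(tokens) if tokens else 0.0

doc = "giraffe poop car murderer"
print(query_coverage(doc, dictionary))  # 0.25 -- only 'car' out of 4 tokens is known

# Only trust the similarity ranking when enough of the query is in-vocabulary.
if query_coverage(doc, dictionary) < 0.5:
    print("query is mostly out of vocabulary; similarity scores are unreliable")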

The reason why document #3 (Best Insurance) is also ranked fairly high is that the token `insurance` frequently co-occurs with `car` (your query). This is exactly the reasoning behind distributional semantics, i.e. "a word is characterized by the company it keeps" (Firth, J. R. 1957).
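You can see this effect directly by projecting two single-word documents into LSI space and comparing them. This is only an illustrative check that reuses the `lsi` model and `dictionary` from above; the exact value depends on the corpus and on `num_topics`:

from gensim import matutils

# Project single-word documents into LSI space.
vec_car = lsi[dictionary.doc2bow(['car'])]
vec_insurance = lsi[dictionary.doc2bow(['insurance'])]

# Because 'car' and 'insurance' co-occur in many of the corpus documents,
# their LSI vectors point in roughly the same direction, so this cosine
# similarity will typically be high.
print(matutils.cossim(vec_car, vec_insurance))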
