NLP - accelerating word similarity matching
I am trying to find the maximum similarity between two words in a pandas dataframe. Here is my routine
import pandas as pd
from nltk.corpus import wordnet
import itertools

df = pd.DataFrame({'word_1': ['desk', 'lamp', 'read'], 'word_2': ['call', 'game', 'cook']})

def max_similarity(row):
    word_1 = row['word_1']
    word_2 = row['word_2']
    ret_val = max([
        (wordnet.wup_similarity(syn_1, syn_2) or 0)
        for syn_1, syn_2 in itertools.product(wordnet.synsets(word_1), wordnet.synsets(word_2))
    ])
    return ret_val

df['result'] = df.apply(lambda x: max_similarity(x), axis=1)
It works, but it is far too slow; the WordNet lookups take most of the time. Any suggestions for speeding it up? Cython? I am also open to other packages such as spacy.
Since you said you can use spacy as an NLP library, let's run a simple benchmark. We will use the Brown news corpus and create some arbitrary word pairs by splitting it in half.
import pandas as pd
from nltk.corpus import brown

brown_corpus = list(brown.words(categories='news'))

brown_df = pd.DataFrame({
    'word_1': brown_corpus[:len(brown_corpus)//2],
    'word_2': brown_corpus[len(brown_corpus)//2:]
})

len(brown_df)
# 50277
The cosine similarity of two tokens/documents can be computed with the Doc.similarity method.
import spacy

nlp = spacy.load('en')

def spacy_max_similarity(row):
    word_1 = nlp(row['word_1'])
    word_2 = nlp(row['word_2'])
    return word_1.similarity(word_2)
Finally, apply both methods to the dataframe (the question's WordNet-based max_similarity is referred to as nltk_max_similarity here):
nltk_similarity = %timeit -o brown_df.apply(nltk_max_similarity, axis=1)
1 loop, best of 3: 59 s per loop
spacy_similarity = %timeit -o brown_df.apply(spacy_max_similarity, axis=1)
1 loop, best of 3: 8.88 s per loop
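If that is still not fast enough, the per-row nlp() calls can be avoided by processing all words in one pass with nlp.pipe. This is just a sketch of a further optimization, not part of the benchmark above; the spacy_result column name is made up:

docs_1 = list(nlp.pipe(brown_df['word_1']))
docs_2 = list(nlp.pipe(brown_df['word_2']))

# pair the pre-processed docs and reuse Doc.similarity
brown_df['spacy_result'] = [doc_1.similarity(doc_2) for doc_1, doc_2 in zip(docs_1, docs_2)]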
Note that NLTK and spacy use different methods for measuring similarity: spacy compares pretrained word vectors. From the docs:
Word vectors and semantic similarity
[...]
The default English model installs vectors for one million vocabulary entries, using 300-dimensional vectors trained on the Common Crawl corpus with the GloVe algorithm. The GloVe Common Crawl vectors have become a de facto standard for practical NLP.
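For intuition, Doc.similarity boils down to the cosine similarity of the documents' averaged word vectors. A minimal sketch, assuming the loaded model ships with word vectors:

import numpy as np
import spacy

nlp = spacy.load('en')  # same model as above; any model with word vectors works

doc_1, doc_2 = nlp('desk'), nlp('call')

# cosine similarity computed by hand from the document vectors
cosine = np.dot(doc_1.vector, doc_2.vector) / (np.linalg.norm(doc_1.vector) * np.linalg.norm(doc_2.vector))

print(cosine)                   # manual value
print(doc_1.similarity(doc_2))  # Doc.similarity returns the same number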
One way to make this faster is to cache the similarity of each word pair, so repeated pairs do not rerun the expensive WordNet search.
import pandas as pd
from nltk.corpus import wordnet
import itertools

df = pd.DataFrame({'word_1': ['desk', 'lamp', 'read'], 'word_2': ['call', 'game', 'cook']})

word_similarities = dict()

def max_similarity(row):
    word_1 = row['word_1']
    word_2 = row['word_2']
    key = tuple(sorted([word_1, word_2]))  # the measure is symmetric :)
    if key not in word_similarities:
        word_similarities[key] = max([
            (wordnet.wup_similarity(syn_1, syn_2) or 0)
            for syn_1, syn_2 in itertools.product(wordnet.synsets(word_1), wordnet.synsets(word_2))
        ])
    return word_similarities[key]

df['result'] = df.apply(lambda x: max_similarity(x), axis=1)
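The same memoization can also be written with functools.lru_cache; this is merely a sketch of an equivalent approach, not part of the answer above (the default=0 guards against words that have no synsets):

from functools import lru_cache
import itertools

import pandas as pd
from nltk.corpus import wordnet

df = pd.DataFrame({'word_1': ['desk', 'lamp', 'read'], 'word_2': ['call', 'game', 'cook']})

@lru_cache(maxsize=None)  # memoizes results per (word_1, word_2) pair
def cached_similarity(word_1, word_2):
    return max(
        ((wordnet.wup_similarity(syn_1, syn_2) or 0)
         for syn_1, syn_2 in itertools.product(wordnet.synsets(word_1), wordnet.synsets(word_2))),
        default=0)

def max_similarity(row):
    # sort so that (a, b) and (b, a) share one cache entry
    word_1, word_2 = sorted([row['word_1'], row['word_2']])
    return cached_similarity(word_1, word_2)

df['result'] = df.apply(max_similarity, axis=1)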