NLP - accelerating word similarity matching

I am trying to find the maximum similarity between two words in a pandas dataframe. Here is my routine

import pandas as pd
from nltk.corpus import wordnet
import itertools

df = pd.DataFrame({'word_1':['desk', 'lamp', 'read'], 'word_2':['call','game','cook']})

def max_similarity(row):
    word_1 = row['word_1']
    word_2 = row['word_2']

    ret_val = max([(wordnet.wup_similarity(syn_1, syn_2) or 0) for 
       syn_1, syn_2 in itertools.product(wordnet.synsets(word_1), wordnet.synsets(word_2))])

    return ret_val

df['result'] = df.apply(max_similarity, axis=1)


It works great, but it is too slow. I am looking for a way to speed it up; the wordnet lookups take most of the time. Any suggestions? Cython? I am open to other packages such as spacy.

2 answers


Since you said you can use spacy as an NLP library, let's run a simple benchmark. We will use the Brown news corpus to build some arbitrary word pairs by splitting the corpus in half.

from nltk.corpus import brown

brown_corpus = list(brown.words(categories='news'))
brown_df = pd.DataFrame({
    'word_1': brown_corpus[:len(brown_corpus)//2],   # first half of the corpus
    'word_2': brown_corpus[len(brown_corpus)//2:],   # paired with the second half
})

len(brown_df)
50277


The cosine similarity of two tokens/documents can be computed with the Doc.similarity method:

import spacy
nlp = spacy.load('en')

def spacy_max_similarity(row):
    word_1 = nlp(row['word_1'])
    word_2 = nlp(row['word_2'])

    return word_1.similarity(word_2)


Finally, apply both methods to the dataframe and time them (nltk_max_similarity is the WordNet-based max_similarity from the question):



nltk_similarity = %timeit -o brown_df.apply(nltk_max_similarity, axis=1)
1 loop, best of 3: 59 s per loop

spacy_similarity = %timeit -o brown_df.apply(spacy_max_similarity, axis=1)
1 loop, best of 3: 8.88 s per loop


Note that NLTK and spacy take very different approaches to measuring similarity: Wu-Palmer similarity works on the WordNet taxonomy, while spacy compares pretrained word vectors. From the docs:

Word vectors and semantic similarity

[...]

The default English model installs vectors for one million vocabulary entries, using the 300-dimensional vectors trained on the Common Crawl corpus with the GloVe algorithm. The GloVe Common Crawl vectors have become a de facto standard for practical NLP.

NLTK WordNet similarity vs. spacy vector similarity
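
If the remaining ~9 s is still too slow, note that nlp(...) runs the whole pipeline (tagger, parser, etc.) on every single word just to get at its vector. A minimal sketch of a shortcut, assuming every entry really is a single token: read the vectors straight from the vocabulary and compute the cosine yourself (vector_cosine_similarity is an illustrative helper, not part of spacy).

import numpy as np
import spacy

nlp = spacy.load('en')  # same model as above

def vector_cosine_similarity(word_1, word_2):
    # Look both words up directly in the vocabulary; this skips the
    # tagging/parsing work that nlp(...) would redo for every row.
    v1 = nlp.vocab[word_1].vector
    v2 = nlp.vocab[word_2].vector
    norm = np.linalg.norm(v1) * np.linalg.norm(v2)
    return float(v1.dot(v2) / norm) if norm else 0.0

brown_df['result'] = [vector_cosine_similarity(w1, w2)
                      for w1, w2 in zip(brown_df['word_1'], brown_df['word_2'])]

For single-token texts this gives essentially the same numbers as Doc.similarity, since a one-token document's vector is just that token's vector; spacy's Lexeme objects also expose their own similarity method if you prefer not to do the arithmetic by hand.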



One way to make this faster is to cache the similarity for each word pair, so that repeated pairs never trigger the expensive WordNet search again.



import pandas as pd
from nltk.corpus import wordnet
import itertools

df = pd.DataFrame({'word_1':['desk', 'lamp', 'read'], 'word_2':['call','game','cook']})

word_similarities = dict()
def max_similarity(row):
    word_1 = row['word_1']
    word_2 = row['word_2']

    key = tuple(sorted([word_1, word_2])) # symmetric measure :)

    if key not in word_similarities:
        word_similarities[key] = max([
            (wordnet.wup_similarity(syn_1, syn_2) or 0)
            for syn_1, syn_2 in itertools.product(wordnet.synsets(word_1), wordnet.synsets(word_2))
        ])

    return word_similarities[key]

df['result'] = df.apply(max_similarity, axis=1)
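
As a variation on the same idea, functools.lru_cache can do the memoisation for you instead of a hand-rolled dict. A minimal sketch against the same df (sorting the pair keeps the cache symmetric, and default=0 merely guards against words that have no synsets, which the version above would choke on):

import itertools
from functools import lru_cache

from nltk.corpus import wordnet

@lru_cache(maxsize=None)
def cached_max_similarity(word_1, word_2):
    # Best Wu-Palmer score over every synset combination of the two words.
    return max(
        (wordnet.wup_similarity(s1, s2) or 0
         for s1, s2 in itertools.product(wordnet.synsets(word_1),
                                         wordnet.synsets(word_2))),
        default=0)

df['result'] = [cached_max_similarity(*sorted((w1, w2)))
                for w1, w2 in zip(df['word_1'], df['word_2'])]

Note that lru_cache requires Python 3; on Python 2 the dict approach above is the way to go.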








