Python - comparing n-grams in multiple text files

First-time poster here. I'm a new Python user with limited programming skills. Ultimately I am trying to identify and compare n-grams in numerous text documents found in the same directory. My analysis is somewhat similar to plagiarism detection: I want to calculate the percentage of text documents in which a specific n-gram can be found. At the moment I'm working on a simplified version of the larger problem by comparing n-grams across two text documents. I have no problem identifying n-grams, but I am struggling to compare the two documents. Is there a way to store n-grams in a list to effectively compare which ones are present in both documents? Here is what I have done so far (forgive the naive coding). For reference, I use simple sentences below rather than the text documents I actually read in my code.

import nltk
from nltk.util import ngrams

text1 = 'Hello my name is Jason'
text2 = 'My name is not Mike'

n = 3
trigrams1 = ngrams(text1.split(), n)
trigrams2 = ngrams(text2.split(), n)

print(trigrams1)
for grams in trigrams1:
    print(grams)

def compare(trigrams1, trigrams2):
    for grams1 in trigrams1:
        if each_gram in trigrams2:
            print (each_gram)
    return False 

      

Thanks everyone for your help!

3 answers


Use a list, common, in the compare function. Append each n-gram that appears in both trigram lists to it, and return the list at the end:



>>> trigrams1 = list(ngrams(text1.lower().split(), n))  # lower() ignores case; list() makes the result a reusable list
>>> trigrams2 = list(ngrams(text2.lower().split(), n))
>>> trigrams1
[('hello', 'my', 'name'), ('my', 'name', 'is'), ('name', 'is', 'jason')]
>>> trigrams2
[('my', 'name', 'is'), ('name', 'is', 'not'), ('is', 'not', 'mike')]
>>> def compare(trigrams1, trigrams2):
...    common=[]
...    for grams1 in trigrams1:
...       if grams1 in trigrams2:
...         common.append(grams1)
...    return common
... 
>>> compare(trigrams1, trigrams2)
[('my', 'name', 'is')]
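
For larger inputs, the same comparison can be sketched with sets, which avoid rescanning the second list for every trigram. This is a minimal sketch and not part of the answer above: word_ngrams is a hypothetical stand-in for nltk.util.ngrams so the snippet runs without NLTK, and the result is sorted only to make the output order predictable.

```python
def word_ngrams(tokens, n):
    """Hypothetical helper: return the n-grams of a token list as tuples."""
    return list(zip(*(tokens[i:] for i in range(n))))


def common_ngrams(text_a, text_b, n=3):
    """Return the n-grams that occur in both texts, ignoring case."""
    grams_a = set(word_ngrams(text_a.lower().split(), n))
    grams_b = set(word_ngrams(text_b.lower().split(), n))
    # Set intersection keeps only the n-grams present in both sets.
    return sorted(grams_a & grams_b)


print(common_ngrams('Hello my name is Jason', 'My name is not Mike'))
# [('my', 'name', 'is')]
```

Note that a set drops duplicates, so an n-gram repeated within one text is reported once.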

      


I think it is easier to join the elements of each n-gram into a single string, build a list of those strings, and then do the comparison.

Let's go through the process using the example you provided.

text1 = 'Hello my name is Jason'
text2 = 'My name is not Mike'

      

After applying the ngrams function from nltk, you will get the following two lists, which I also call text1 and text2, as before:

text1 = [('Hello', 'my', 'name'), ('my', 'name', 'is'), ('name', 'is', 'Jason')]
text2 = [('My', 'name', 'is'), ('name', 'is', 'not'), ('is', 'not', 'Mike')]

      

If you want to compare n-grams, you have to put all elements in lower case so that 'my' and 'My' are not counted as separate tokens, which we obviously don't want.

The following function does exactly that: it joins each n-gram into one string and lowercases it.

def append_elements(n_gram):
    # Replace each n-gram tuple in the list, in place, with a lowercase space-joined string.
    for index in range(len(n_gram)):
        n_gram[index] = ' '.join(n_gram[index]).lower()
    return n_gram

      



So if we pass text1 to it, we get ['hello my name', 'my name is', 'name is jason'], which is easier to handle.

Then we create the compare function. You were right that we can use a list to store the n-grams the two texts share. I named it common here:

def compare(n_gram1, n_gram2):
    n_gram1 = append_elements(n_gram1)
    n_gram2 = append_elements(n_gram2)
    common = []
    for phrase in n_gram1:
        if phrase in n_gram2:
            common.append(phrase)
    if not common:
        return False
        # or you could print a message saying no commonality was found
    else:
        for i in common:
            print(i)

      

if not common checks whether the list common is empty, in which case the function returns False (or you could print a message instead).

Now in your example, when we run compare(text1, text2), the only thing in common is:

>>> 
my name is
>>>

      

which is the correct answer.
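
One thing to watch: append_elements rewrites the list it is given, so calling compare a second time on the same lists would join the already-joined strings character by character. A minimal sketch of a non-mutating variant follows; join_ngrams and common_phrases are hypothetical names and not part of the answer above.

```python
def join_ngrams(n_grams):
    """Return a new list of lowercase, space-joined phrases; the input is left unchanged."""
    return [' '.join(gram).lower() for gram in n_grams]


def common_phrases(n_grams1, n_grams2):
    """Return the phrases found in both n-gram lists, in the order of the first list."""
    phrases2 = set(join_ngrams(n_grams2))
    return [phrase for phrase in join_ngrams(n_grams1) if phrase in phrases2]


text1 = [('Hello', 'my', 'name'), ('my', 'name', 'is'), ('name', 'is', 'Jason')]
text2 = [('My', 'name', 'is'), ('name', 'is', 'not'), ('is', 'not', 'Mike')]

print(common_phrases(text1, text2))  # ['my name is']
print(common_phrases(text1, text2))  # same result on a second call
```

Returning the list (empty when nothing matches) instead of printing or returning False also lets the caller decide what to do with the result.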


I was doing a task very similar to yours when I came across this old thread, which seemed to work pretty well except for one bug. I'll add this answer here in case anyone else stumbles upon it. ngrams from nltk.util returns a generator object, not a list. It needs to be converted to a list in order to use the compare function you wrote. Use lower() for a case-insensitive match.
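
To see why the conversion matters, here is a small sketch that uses a plain generator in place of the one nltk.util.ngrams returns (an assumption made so the snippet runs without NLTK; make_grams is a hypothetical name). A membership test on a generator consumes items as it searches, so later tests can miss items that were already consumed.

```python
def make_grams():
    """Stand-in generator yielding the trigrams of 'my name is not mike'."""
    yield ('my', 'name', 'is')
    yield ('name', 'is', 'not')
    yield ('is', 'not', 'mike')


grams = make_grams()
# The search consumes the first two items before it finds a match.
first_check = ('name', 'is', 'not') in grams
# The earlier trigram is already gone, so this search runs to the end and fails.
second_check = ('my', 'name', 'is') in grams

grams_list = list(make_grams())
# A list can be searched any number of times.
list_checks = (('name', 'is', 'not') in grams_list, ('my', 'name', 'is') in grams_list)

print(first_check, second_check, list_checks)
# True False (True, True)
```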

Complete example:

import nltk
from nltk.util import ngrams

text1 = 'Hello my name is Jason'
text2 = 'My name is not Mike'

n = 3
trigrams1 = ngrams(text1.lower().split(), n)
trigrams2 = ngrams(text2.lower().split(), n)

def compare_ngrams(trigrams1, trigrams2):
    trigrams1 = list(trigrams1)
    trigrams2 = list(trigrams2)
    common=[]
    for gram in trigrams1:
        if gram in trigrams2:
            common.append(gram)
    return common

common = compare_ngrams(trigrams1, trigrams2)
print(common)

      

Output:

[('my', 'name', 'is')]
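
For the larger goal in the question, the percentage of documents that contain a given n-gram, one possible approach is to count each n-gram once per document. This is only a sketch under assumptions: the documents are passed as an in-memory dict of hypothetical file names to text (reading them from a directory is left out), and word_ngrams is a hypothetical stand-in for nltk.util.ngrams.

```python
from collections import Counter


def word_ngrams(tokens, n):
    """Hypothetical helper: return the n-grams of a token list as tuples."""
    return list(zip(*(tokens[i:] for i in range(n))))


def document_percentages(documents, n=3):
    """Map each n-gram to the percentage of documents that contain it."""
    doc_counts = Counter()
    for text in documents.values():
        # A set counts each n-gram at most once per document.
        doc_counts.update(set(word_ngrams(text.lower().split(), n)))
    total = len(documents)
    return {gram: 100.0 * count / total for gram, count in doc_counts.items()}


# Hypothetical file names; the texts are the two sentences from the question.
documents = {
    'doc1.txt': 'Hello my name is Jason',
    'doc2.txt': 'My name is not Mike',
}
percentages = document_percentages(documents)
print(percentages[('my', 'name', 'is')])     # 100.0
print(percentages[('name', 'is', 'jason')])  # 50.0
```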

      
