Python - comparing n-grams in multiple text files
First-time poster - I'm a new Python user with limited programming skills. Ultimately I am trying to identify and compare n-grams in numerous text documents found in the same directory. My analysis is somewhat similar to plagiarism detection - I want to calculate the percentage of text documents in which a specific n-gram can be found. At the moment I'm working on a simplified version of the larger problem: comparing n-grams across two text documents. I have no problem identifying n-grams, but I am struggling to compare the two documents. Is there a way to store n-grams in a list to effectively compare which ones are present in both documents? Here is what I have done so far (apologies for the naive coding). For reference, I provide simple sentences below rather than the text documents that I actually read in my code.
import nltk
from nltk.util import ngrams
text1 = 'Hello my name is Jason'
text2 = 'My name is not Mike'
n = 3
trigrams1 = ngrams(text1.split(), n)
trigrams2 = ngrams(text2.split(), n)
print(trigrams1)
for grams in trigrams1:
    print(grams)
def compare(trigrams1, trigrams2):
    for grams1 in trigrams1:
        if each_gram in trigrams2:
            print (each_gram)
    return False
Thanks everyone for your help!
Use a list, common, in the function compare. Append to it each n-gram that appears in both sets of trigrams, and finally return the list:
>>> trigrams1 = list(ngrams(text1.lower().split(), n))  # text1.lower() ignores case; list() because recent NLTK versions return an iterator
>>> trigrams2 = list(ngrams(text2.lower().split(), n))  # same for text2
>>> trigrams1
[('hello', 'my', 'name'), ('my', 'name', 'is'), ('name', 'is', 'jason')]
>>> trigrams2
[('my', 'name', 'is'), ('name', 'is', 'not'), ('is', 'not', 'mike')]
>>> def compare(trigrams1, trigrams2):
...     common = []
...     for grams1 in trigrams1:
...         if grams1 in trigrams2:
...             common.append(grams1)
...     return common
...
>>> compare(trigrams1, trigrams2)
[('my', 'name', 'is')]
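The same comparison can also be sketched with sets instead of a loop. This is only an illustration, not part of the answer above: word_ngrams and common_ngrams are hypothetical helper names, and word_ngrams is a plain-Python stand-in for nltk.util.ngrams so the snippet runs without NLTK.

```python
def word_ngrams(text, n):
    """Hypothetical helper: lowercase word n-grams of `text` as a list of tuples."""
    words = text.lower().split()
    # zip over n staggered slices yields each run of n consecutive words
    return list(zip(*(words[i:] for i in range(n))))


def common_ngrams(text_a, text_b, n=3):
    """Return the set of n-grams that occur in both texts."""
    return set(word_ngrams(text_a, n)) & set(word_ngrams(text_b, n))


shared = common_ngrams('Hello my name is Jason', 'My name is not Mike')
print(shared)  # {('my', 'name', 'is')}
```

Unlike the list version, a set drops duplicates and does not keep the original order, which may or may not matter for your analysis.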
I think it is easier to join the elements of each n-gram into a single string, build a list of those strings, and then do the comparison.
Let's walk through the process using the example you provided.
text1 = 'Hello my name is Jason'
text2 = 'My name is not Mike'
After applying the function ngrams from nltk, you will get the following two lists, which I also call text1 and text2, as before:
text1 = [('Hello', 'my', 'name'), ('my', 'name', 'is'), ('name', 'is', 'Jason')]
text2 = [('My', 'name', 'is'), ('name', 'is', 'not'), ('is', 'not', 'Mike')]
If you want to compare n-grams, you have to put all elements in lower case so that 'My' and 'my' are not counted as separate tokens, which we obviously don't want.
The following function does exactly that, and also joins each n-gram into a single string.
def append_elements(n_gram):
    for element in range(len(n_gram)):
        phrase = ''
        for sub_element in n_gram[element]:
            phrase += sub_element+' '
        n_gram[element] = phrase.strip().lower()
    return n_gram
So if we pass text1 to it, we get ['hello my name', 'my name is', 'name is jason'], which is easier to handle.
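The joining step above can also be sketched more compactly with str.join. The name join_ngrams is hypothetical; unlike append_elements, this version builds a new list instead of overwriting the one passed in, which is a design choice of the sketch rather than something the answer requires.

```python
def join_ngrams(n_grams):
    """Hypothetical helper: turn each n-gram tuple into one lowercase string."""
    # Builds a new list, so the input is left unchanged and may be any iterable
    return [' '.join(gram).lower() for gram in n_grams]


text1 = [('Hello', 'my', 'name'), ('my', 'name', 'is'), ('name', 'is', 'Jason')]
phrases = join_ngrams(text1)
print(phrases)  # ['hello my name', 'my name is', 'name is jason']
```

Because the input is not modified, text1 still holds the original tuples afterwards.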
Then we create a function compare. You were correct in thinking that we can use a list to store the n-grams the two documents have in common. I named it common here:
def compare(n_gram1, n_gram2):
    n_gram1 = append_elements(n_gram1)
    n_gram2 = append_elements(n_gram2)
    common = []
    for phrase in n_gram1:
        if phrase in n_gram2:
            common.append(phrase)
    if not common:
        return False
        # or you could print a message saying no commonality was found
    else:
        for i in common:
            print(i)
if not common means the list common is empty, in which case the function returns False (or could print a message instead).
Now in your example, when we run compare(text1, text2), the only thing in common is:
>>>
my name is
>>>
which is the correct answer.
I was doing a task very similar to yours when I came across this old thread, whose answers seemed to work pretty well except for one bug. I'll add this answer here in case anyone else stumbles upon this. ngrams from nltk.util returns a generator object, not a list. It needs to be converted to a list in order to use the compare function you wrote. Use lower() for a case-insensitive match.
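To see why the conversion matters, here is a small sketch that needs no NLTK. A plain generator expression stands in for the return value of ngrams (an assumption made only for illustration); the point is that a membership test on a one-shot iterator consumes it.

```python
words = ['my', 'name', 'is', 'not', 'mike']

# A generator of trigrams, standing in for what ngrams() returns
trigram_gen = (tuple(words[i:i + 3]) for i in range(len(words) - 2))

first_check = ('name', 'is', 'not') in trigram_gen   # consumes items up to the match
second_check = ('my', 'name', 'is') in trigram_gen   # the earlier item is already gone
print(first_check, second_check)  # True False

# Converting to a list first makes repeated membership tests reliable
trigram_list = [tuple(words[i:i + 3]) for i in range(len(words) - 2)]
list_checks = (('name', 'is', 'not') in trigram_list, ('my', 'name', 'is') in trigram_list)
print(list_checks)  # (True, True)
```

This is why a loop that repeatedly tests "gram in trigrams2" can silently miss matches when trigrams2 is not a list.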
Complete example:
import nltk
from nltk.util import ngrams
text1 = 'Hello my name is Jason'
text2 = 'My name is not Mike'
n = 3
trigrams1 = ngrams(text1.lower().split(), n)
trigrams2 = ngrams(text2.lower().split(), n)
def compare_ngrams(trigrams1, trigrams2):
    trigrams1 = list(trigrams1)
    trigrams2 = list(trigrams2)
    common = []
    for gram in trigrams1:
        if gram in trigrams2:
            common.append(gram)
    return common
common = compare_ngrams(trigrams1, trigrams2)
print(common)
Output:
[('my', 'name', 'is')]
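For the larger goal mentioned in the question (the percentage of documents in which a specific n-gram occurs), the same idea extends to many documents. The following is only a rough sketch under stated assumptions: the documents are already loaded as strings, and ngram_set and document_percentage are hypothetical names, with ngram_set standing in for list(ngrams(...)) so the snippet runs without NLTK.

```python
def ngram_set(text, n):
    """Hypothetical helper: the set of lowercase word n-grams in `text`."""
    words = text.lower().split()
    return set(zip(*(words[i:] for i in range(n))))


def document_percentage(documents, target, n=3):
    """Percentage of `documents` (a list of strings) that contain `target`."""
    if not documents:
        return 0.0
    hits = sum(1 for doc in documents if target in ngram_set(doc, n))
    return 100.0 * hits / len(documents)


docs = ['Hello my name is Jason', 'My name is not Mike', 'Nothing shared here at all']
pct = document_percentage(docs, ('my', 'name', 'is'))
print(round(pct, 1))  # 66.7
```

With real files, each string in docs would come from reading one file in the directory; that part is left out here.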