Most common 2-grams using Python
Given the line:
this is a test this is
How can I find the most common 2-grams? The line above contains the following 2-grams:
{this is, is a, a test, test this, this is}
As you can see, the 2-gram this is
appears twice. Therefore, the result should be:
{this is: 2}
I know I can use the Counter.most_common() method
to find the most common items, but how do I create a list of 2-grams from a string in the first place?
You can use the method presented in this blog post to create n-grams conveniently in Python.
from collections import Counter
bigrams = zip(words, words[1:])
counts = Counter(bigrams)
print(counts.most_common())
This assumes the input is a list of words, of course. If your input is a string like the one you provided (which has no punctuation), you can simply do words = text.split(' ')
to get a list of words. In general, however, you will need to handle punctuation, whitespace, and other non-alphabetic characters. In that case, you can do something like
import re
words = re.findall(r'[A-Za-z]+', text)
or you can use an external library like nltk.tokenize.
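Putting the pieces together for the string from the question, a minimal sketch might look like this (the variable names text, words, and counts are only illustrative):

```python
import re
from collections import Counter

text = "this is a test this is"

# Tokenize: keep runs of ASCII letters, dropping punctuation and spaces.
words = re.findall(r'[A-Za-z]+', text)

# Pair each word with its successor to form the 2-grams.
bigrams = zip(words, words[1:])

# Count the 2-grams and report the most frequent one.
counts = Counter(bigrams)
print(counts.most_common(1))  # [(('this', 'is'), 2)]
```

Each 2-gram is stored as a tuple of words, so the key is ('this', 'is') rather than the string 'this is'.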
Edit. If you want trigrams, or any other n-grams in general, you can use the function mentioned in the blog post I linked to:
def find_ngrams(input_list, n):
    return zip(*(input_list[i:] for i in range(n)))
trigrams = find_ngrams(words, 3)
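As a rough sketch, the same helper can be combined with Counter to count n-grams of any size. The function most_common_ngrams and its parameters n and k are made up here for illustration and do not come from the blog post:

```python
from collections import Counter

def find_ngrams(input_list, n):
    # Zip the list against n shifted copies of itself.
    return zip(*(input_list[i:] for i in range(n)))

def most_common_ngrams(words, n, k=1):
    # Hypothetical helper: join each n-gram tuple into a string, then count.
    counts = Counter(" ".join(gram) for gram in find_ngrams(words, n))
    return counts.most_common(k)

words = "this is a test this is".split()
print(most_common_ngrams(words, 2))  # [('this is', 2)]
```

Joining the tuples into strings is optional; it just makes the output match the {this is: 2} shape asked for in the question.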