Most common 2-grams using Python

Question

Given the line:

this is a test this is

How can I find the most common 2-grams? The line above contains the following 2-grams:

{this is, is a, a test, test this, this is}

As you can see, the 2-gram "this is" appears 2 times. Therefore, the result should be:

{this is: 2}

I know I can use Counter.most_common() to find the most common items, but how can I create a list of 2-grams from a string to start with?

+3

python python-2.7 pyspark n-gram python-collections

stfd1123581321 Apr 18 17 at 13:33

2 answers

Well, you can use:

words = s.split() # s is the original string
pairs = [(words[i], words[i+1]) for i in range(len(words)-1)]

Here (words[i], words[i+1]) is the pair of words at positions i and i+1, and we go over all pairs from (0, 1) to (n-2, n-1), where n is the number of words, i.e. len(words), not the length of the string s.
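To get from the list of pairs to the count the question asks for, the pairs can be fed into Counter. This is only a minimal sketch using the example line from the question; the helper name most_common_bigrams and its parameter k are made up for illustration:

```python
from collections import Counter

def most_common_bigrams(s, k=1):
    """Return the k most frequent adjacent word pairs in s (hypothetical helper)."""
    words = s.split()  # split on any run of whitespace
    # Pair each word with its successor: (0, 1), (1, 2), ..., (n-2, n-1)
    pairs = [(words[i], words[i + 1]) for i in range(len(words) - 1)]
    return Counter(pairs).most_common(k)

print(most_common_bigrams("this is a test this is"))
# [(('this', 'is'), 2)]
```

An empty string or a single word yields no pairs, so the function returns an empty list in those cases.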

+1

zmbq Apr 18 17 at 13:36

Martin Valgur · Accepted Answer · 2017-04-18T13:41:17+0000

You can use the method presented in this blog post to create n-grams conveniently in Python:

from collections import Counter

bigrams = zip(words, words[1:])
counts = Counter(bigrams)
print(counts.most_common())

This assumes the input is a list of words, of course. If your input is a string like the one you provided (which has no punctuation marks), you can simply do words = text.split(' ')

to get a list of words. In general, however, you will need to account for punctuation marks, whitespace, and other non-alphabetic characters. In this case, you can do something like

import re

words = re.findall(r'[A-Za-z]+', text)

or you can use an external library like nltk.tokenize.
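As a rough illustration of the regex approach on text that does contain punctuation, here is a small sketch. The sample sentence is invented, and lowercasing the text before tokenizing is an extra assumption (so that "This" and "this" count as the same word), not something required by the method:

```python
import re
from collections import Counter

# Made-up sample text with punctuation and mixed case
text = "This is a test. This is, again, a test!"

# Keep only runs of ASCII letters; lowercasing is an added assumption
words = re.findall(r"[A-Za-z]+", text.lower())

# Pair each word with the next one and count the pairs
counts = Counter(zip(words, words[1:]))
print(counts.most_common(2))
```

Note that this simple tokenizer ignores sentence boundaries, so a pair such as ('test', 'this') that spans two sentences is counted as well.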

Edit: If you want trigrams, or any other n-grams in general, you can use the function mentioned in the blog post I linked to:

def find_ngrams(input_list, n):
  return zip(*(input_list[i:] for i in range(n)))

trigrams = find_ngrams(words, 3)
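For completeness, a short sketch of how the result of find_ngrams might be counted. One thing to keep in mind is that in Python 3 zip returns a lazy iterator that can only be consumed once, so it is passed straight to Counter here; the sample word list is made up:

```python
from collections import Counter

def find_ngrams(input_list, n):
    # Zip the list against n progressively shifted copies of itself
    return zip(*(input_list[i:] for i in range(n)))

words = "this is a test this is a".split()  # made-up sample input

# Counter consumes the iterator directly; wrap it in list() to reuse the n-grams
trigram_counts = Counter(find_ngrams(words, 3))
print(trigram_counts.most_common(1))
# [(('this', 'is', 'a'), 2)]
```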
