Finding patterns in a set of words and grouping them

I need to find how multiple words are related to each other in a set of 5000 patterns.

Example:

  • mango, guava, lychee, apple
  • mango, guava, lychee, orange
  • mango, guava, pineapple, grapes
  • pen, pencil, book, copy, notepad
  • pen, pencil, book, copy, scale

We can see that lines 1 and 2 are very close to each other, line 3 is fairly close to lines 1 and 2, and lines 4 and 5 are very close to each other.

What approach or method can we use to measure this similarity?

Thanks in advance!

Edit: I also need help with grouping, for example group A consisting of lines 1, 2 and 3, and group B containing lines 4 and 5.

2 answers


Here is one way to solve this problem. I convert each list into a document-term matrix using scikit-learn, then compute the cosine distance matrix between every pair of rows using scipy.spatial.distance.

from sklearn.feature_extraction.text import CountVectorizer
from scipy.spatial import distance

count_vect = CountVectorizer(tokenizer=lambda x: x.split(', '))

ls = ['mango, guava, litchi, apple', 
      'mango, guava, litchi, orange',
      'mango, guava, pineapple, grape',
      'pen, pencil, book, copy, notebook',
      'pen, pencil, book, copy, scale']

X = count_vect.fit_transform(ls).toarray()
D = distance.cdist(X, X, metric='cosine')


The output is a matrix of distances between each row. It looks like this:

[[ 0.  ,  0.25,  0.5 ,  1.  ,  1.  ],
 [ 0.25,  0.  ,  0.5 ,  1.  ,  1.  ],
 [ 0.5 ,  0.5 ,  0.  ,  1.  ,  1.  ],
 [ 1.  ,  1.  ,  1.  ,  0.  ,  0.2 ],
 [ 1.  ,  1.  ,  1.  ,  0.2 ,  0.  ]]
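As a sanity check, the 0.25 in D[0, 1] can be reproduced by hand: lines 1 and 2 share 3 of their 4 words, so with binary count vectors the cosine similarity is 3/4 and the distance is 1 - 3/4 = 0.25. A minimal sketch (the vectors below are written out manually rather than read from CountVectorizer's column order):

```python
import numpy as np

# Each line has 4 words; 3 of them (mango, guava, litchi) are shared.
v1 = np.array([1, 1, 1, 1, 0])  # mango, guava, litchi, apple
v2 = np.array([1, 1, 1, 0, 1])  # mango, guava, litchi, orange

cos_sim = v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2))  # 3 / (2 * 2) = 0.75
cos_dist = 1 - cos_sim
print(cos_dist)  # 0.25, matching D[0, 1]
```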


For example, D[0, 1] = 0.25 means that line 1 is close to line 2, because the distance between them is small. Similarly, D[3, 4] = 0.2 is small, which means that line 4 is close to line 5.

Note that you can also use distance.pdist(X, metric='cosine'), which returns only the condensed form of the matrix (one triangle, flattened), since the lower and upper triangles of a distance matrix are equal.
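A condensed result from pdist can be expanded back into the full square matrix with scipy.spatial.distance.squareform; a short sketch using the same five lines as above:

```python
from sklearn.feature_extraction.text import CountVectorizer
from scipy.spatial import distance

ls = ['mango, guava, litchi, apple',
      'mango, guava, litchi, orange',
      'mango, guava, pineapple, grape',
      'pen, pencil, book, copy, notebook',
      'pen, pencil, book, copy, scale']

X = CountVectorizer(tokenizer=lambda x: x.split(', ')).fit_transform(ls).toarray()

condensed = distance.pdist(X, metric='cosine')  # 10 pairwise distances for 5 rows
square = distance.squareform(condensed)         # expand back to the full 5x5 matrix

print(condensed.shape)  # (10,)
print(square.shape)     # (5, 5)
```

The condensed form is also exactly what scipy's hierarchical-clustering functions expect as input, so there is no need to build the square matrix first.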



Grouping documents

To go a step further, you can group the rows using hierarchical clustering on the computed distance matrix.

from scipy.cluster import hierarchy

D = distance.pdist(X, metric='cosine')
Z = hierarchy.linkage(D)  # single linkage on the condensed distance matrix
partition = hierarchy.fcluster(Z, t=0.8, criterion='distance')  # [2, 2, 2, 1, 1]


This means that documents 1, 2 and 3 are grouped together in cluster 2, and documents 4 and 5 are grouped together in cluster 1. If you draw a dendrogram, you can see how the lines are merged step by step:

from scipy.cluster.hierarchy import dendrogram
import matplotlib.pyplot as plt

hierarchy.dendrogram(Z, above_threshold_color='#bcbddc',
                     orientation='top')
plt.show()

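The threshold t controls how many flat clusters fcluster returns: any merge in the dendrogram above t is cut. As a rough illustration of its effect on this data (assuming the same X as above and scipy's default single linkage):

```python
from sklearn.feature_extraction.text import CountVectorizer
from scipy.spatial import distance
from scipy.cluster import hierarchy

ls = ['mango, guava, litchi, apple',
      'mango, guava, litchi, orange',
      'mango, guava, pineapple, grape',
      'pen, pencil, book, copy, notebook',
      'pen, pencil, book, copy, scale']

X = CountVectorizer(tokenizer=lambda x: x.split(', ')).fit_transform(ls).toarray()
Z = hierarchy.linkage(distance.pdist(X, metric='cosine'))

# Sweep the cut threshold and count the resulting clusters.
for t in (0.1, 0.3, 0.8):
    labels = hierarchy.fcluster(Z, t=t, criterion='distance')
    print(t, len(set(labels)))  # 0.1 -> 5 clusters, 0.3 -> 3, 0.8 -> 2
```

At t = 0.1 nothing is merged (every pairwise merge distance here exceeds 0.1), at t = 0.3 only the tightest pairs merge, and at t = 0.8 the two groups from the question emerge.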

Another approach, or perhaps another starting point for solving your question:

import re
from itertools import chain

a = ['mango, guava, litchi, apple',
     'mango, guava, litchi, orange',
     'mango, guava, pineapple, grape',
     'pen, pencil, book, copy, notebook',
     'pen, pencil, book, copy, scale']

def get_words(lst):
    return [re.findall(r'\w+', k) for k in lst]

def get_percent(lst):
    groupped_valid_dict = {}
    for k in range(len(lst)):
        sub = []
        for j in range(k + 1, len(lst)):
            # count words that match at the same position in both lines
            s = sum(1 for m, n in zip(lst[k], lst[j]) if m == n)
            # percent = (1 - float(len(lst[k]) - s) / len(lst[k])) * 100
            # print('Words of lines %d and %d are %.2f%% close' % (k + 1, j + 1, percent))
            if s > 0:
                sub.append("Line{}".format(j + 1))
        if sub:
            groupped_valid_dict["Line{}".format(k + 1)] = sub
    return groupped_valid_dict


lst = get_words(a)
lines = get_percent(lst)
groups = [[k] + lines[k] for k in lines
          if k not in chain.from_iterable(lines.values())]
groups.sort(key=lambda x: x[0])

for k, v in enumerate(groups, 1):
    print("Group%d" % k, v)




Output:

Group1 ['Line1', 'Line2', 'Line3']
Group2 ['Line4', 'Line5']
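Note that the zip-based comparison above only counts words that match at the same position, so it would miss lines that share the same words in a different order. A set-based Jaccard overlap is order-independent; here is a hedged sketch along the same lines (the jaccard helper is an addition of mine, not part of the answer above):

```python
import re
from itertools import combinations

a = ['mango, guava, litchi, apple',
     'mango, guava, litchi, orange',
     'mango, guava, pineapple, grape',
     'pen, pencil, book, copy, notebook',
     'pen, pencil, book, copy, scale']

def jaccard(s1, s2):
    # shared words divided by total distinct words across both lines
    return len(s1 & s2) / len(s1 | s2)

word_sets = [set(re.findall(r'\w+', line)) for line in a]
for (i, s1), (j, s2) in combinations(enumerate(word_sets, 1), 2):
    sim = jaccard(s1, s2)
    if sim > 0:
        print('Line%d ~ Line%d: %.2f' % (i, j, sim))
```

For instance, lines 1 and 2 share 3 words out of 5 distinct words in total, giving 0.60, while lines 1 and 4 share nothing and are skipped.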
