Find patterns in a set of words and group them
I need to find how multiple words are related to each other in a set of 5000 patterns.
Example:
1. mango, guava, lychee, apple
2. mango, guava, lychee, orange
3. mango, guava, pineapple, grapes
4. pen, pencil, book, copy, notepad
5. pen, pencil, book, copy, scale
We see that lists 1 and 2 are very close to each other, and list 3 is fairly close to both of them. Lists 4 and 5 are also very close to each other.
What approach and method can we use to measure this kind of similarity?
Thanks in advance!
Edit: I also need help with the grouping, for example group A consisting of lists 1, 2, and 3, and group B containing lists 4 and 5.
Here is one way to solve this problem. Convert each list to a document-term matrix using scikit-learn's CountVectorizer, then compute the cosine distance matrix between each pair of rows using scipy.spatial.distance.
from sklearn.feature_extraction.text import CountVectorizer
from scipy.spatial import distance

# split each line on ', ' so that every item becomes one token
count_vect = CountVectorizer(tokenizer=lambda x: x.split(', '))
ls = ['mango, guava, litchi, apple',
      'mango, guava, litchi, orange',
      'mango, guava, pineapple, grape',
      'pen, pencil, book, copy, notebook',
      'pen, pencil, book, copy, scale']
X = count_vect.fit_transform(ls).toarray()  # document-term count matrix
D = distance.cdist(X, X, metric='cosine')   # pairwise cosine distances
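Each column of X corresponds to one word in the combined vocabulary. If you want to inspect that mapping, the vectorizer exposes it (get_feature_names_out exists in scikit-learn 1.0+; older versions call it get_feature_names):
print(count_vect.get_feature_names_out())
# ['apple' 'book' 'copy' 'grape' 'guava' 'litchi' 'mango' 'notebook'
#  'orange' 'pen' 'pencil' 'pineapple' 'scale']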
D is a matrix of cosine distances between each pair of rows. It looks like this:
[[ 0.  ,  0.25,  0.5 ,  1.  ,  1.  ],
 [ 0.25,  0.  ,  0.5 ,  1.  ,  1.  ],
 [ 0.5 ,  0.5 ,  0.  ,  1.  ,  1.  ],
 [ 1.  ,  1.  ,  1.  ,  0.  ,  0.2 ],
 [ 1.  ,  1.  ,  1.  ,  0.2 ,  0.  ]]
For example, D[0, 1] = 0.25 is small, which means that line 1 is close to line 2. Similarly, D[3, 4] = 0.2 is small, which means that line 4 is close to line 5.
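If you want to read off the closest match for each row programmatically, here is a minimal sketch using numpy; it masks the zero self-distances on the diagonal so a row cannot match itself:
import numpy as np

D_masked = D.copy()
np.fill_diagonal(D_masked, np.inf)  # ignore each row's zero distance to itself
nearest = D_masked.argmin(axis=1)
print(nearest)  # [1 0 0 4 3]: row 0 is closest to row 1, row 3 to row 4, ...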
Note that you can also use distance.pdist(X, metric='cosine'), which returns only one triangle of the matrix in condensed form, since the full matrix is symmetric.
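If you later need the full square matrix back from the condensed form, distance.squareform converts between the two representations:
D_condensed = distance.pdist(X, metric='cosine')  # 10 values for 5 rows
D_square = distance.squareform(D_condensed)       # the same 5 x 5 matrix as cdist above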
Grouping documents
To go further, you can group the rows using hierarchical clustering on the computed distance matrix.
from scipy.cluster import hierarchy

D = distance.pdist(X, metric='cosine')      # condensed distance matrix
Z = hierarchy.linkage(D, method='average')  # D is already a distance matrix, so no metric argument is needed
partition = hierarchy.fcluster(Z, t=0.8, criterion='distance')  # [2, 2, 2, 1, 1]
which means that documents 1, 2, and 3 are grouped together in cluster 2, and documents 4 and 5 are grouped together in cluster 1. If you draw a dendrogram, you can see how the lines are merged step by step:
import matplotlib.pyplot as plt

hierarchy.dendrogram(Z, above_threshold_color='#bcbddc',
                     orientation='top')
plt.show()
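To turn the flat labels from fcluster back into the group A / group B structure the question asks for, you can collect the input lists per label. A small sketch reusing partition and ls from above (the numeric labels themselves depend on the merge order):
from collections import defaultdict

clusters = defaultdict(list)
for label, doc in zip(partition, ls):
    clusters[label].append(doc)
for label in sorted(clusters):
    print(label, clusters[label])
# 1 ['pen, pencil, book, copy, notebook', 'pen, pencil, book, copy, scale']
# 2 ['mango, guava, litchi, apple', 'mango, guava, litchi, orange', 'mango, guava, pineapple, grape']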
Another approach, or maybe another idea for a fresh start on your question:
import re
from itertools import chain

a = ['mango, guava, litchi, apple',
     'mango, guava, litchi, orange',
     'mango, guava, pineapple, grape',
     'pen, pencil, book, copy, notebook',
     'pen, pencil, book, copy, scale']

def get_words(lst):
    # split each line into its individual words
    return [re.findall(r'\w+', k) for k in lst]

def get_percent(lst):
    grouped_valid_dict = {}
    for k in range(len(lst)):
        sub = []
        for j in range(k + 1, len(lst)):
            # count words that match at the same position in both lines
            s = sum(1 if m == n else 0 for m, n in zip(lst[k], lst[j]))
            # percent = (1 - float(len(lst[k]) - s) / len(lst[k])) * 100
            # print('Words of lines %d and %d are %.2f%% close' % (k + 1, j + 1, percent))
            if s > 0:
                sub.append("Line{}".format(j + 1))
        if sub:
            grouped_valid_dict["Line{}".format(k + 1)] = sub
    return grouped_valid_dict

lst = get_words(a)
lines = get_percent(lst)
# keep a line as a group head only if no earlier line already claimed it
groups = [[k] + lines[k] for k in lines
          if k not in chain.from_iterable(lines.values())]
groups.sort(key=lambda x: x[0])
for k, v in enumerate(groups, 1):
    print("Group%d" % k, v)
Output:
Group1 ['Line1', 'Line2', 'Line3']
Group2 ['Line4', 'Line5']
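One caveat about this approach: zip only compares words at the same position, so two lines containing the same words in a different order would count as having nothing in common. If word order should not matter, comparing sets is a more robust variant; a sketch under that assumption:
def shared_words(words_a, words_b):
    # order-independent overlap between two word lists
    return len(set(words_a) & set(words_b))

lst = get_words(a)
print(shared_words(lst[0], lst[1]))  # 3 (mango, guava, litchi)
print(shared_words(lst[0], lst[3]))  # 0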