How do I find a meaningful word to represent each k-means cluster derived from word2vec vectors?

I used the gensim package in Python to load the pre-trained Google News word2vec model. I then want to use k-means to find meaningful clusters among my word vectors and find a representative word for each cluster. I am thinking of using the word whose vector is closest to the cluster centroid to represent that cluster, but I don't know whether that is a good idea, since my experiment did not give good results.

My example code looks like this:

import gensim
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.metrics import pairwise_distances_argmin_min

# load the pre-trained Google News vectors (300 dimensions)
model = gensim.models.KeyedVectors.load_word2vec_format(
    '/home/Desktop/GoogleNews-vectors-negative300.bin', binary=True)

K = 3  # number of clusters

words = ["ship", "car", "truck", "bus", "vehicle", "bike", "tractor", "boat",
       "apple", "banana", "fruit", "pear", "orange", "pineapple", "watermelon",
       "dog", "pig", "animal", "cat", "monkey", "snake", "tiger", "rat", "duck", "rabbit", "fox"]
NumOfWords = len(words)

# build the input matrix: one row per word vector, shape (NumOfWords, vector_size)
x = np.zeros((NumOfWords, model.vector_size))
for i in range(NumOfWords):
    x[i, :] = model[words[i]]

# train the k-means model
classifier = MiniBatchKMeans(n_clusters=K, random_state=1, max_iter=100)
classifier.fit(x)

# check whether the words are clustered correctly
print(classifier.predict(x))

# for each cluster centroid, find the index of and distance to the closest word vector
index_closest_points, distance_closest_points = pairwise_distances_argmin_min(
    classifier.cluster_centers_, x, metric='euclidean')

for i in range(K):
    print("The closest word to the centroid of class {0} is {1}, the distance is {2}".format(
        i, words[index_closest_points[i]], distance_closest_points[i]))


The output looks like this:

[2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0]
The closest word to the centroid of class 0 is rabbit, the distance is 1.578625818679259
The closest word to the centroid of class 1 is fruit, the distance is 1.8351978219013796
The closest word to the centroid of class 2 is car, the distance is 1.6586030662247868


In the code I have three categories of words: vehicles, fruits and animals. The output shows that k-means clustered the words correctly for all three categories, but the representative words obtained with the centroid method are not very good: for class 0 I want to see "animal" but it gives "rabbit", and for class 2 I want to see "vehicle" but it returns "car".
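(One variation that might help, sketched below, is to search the model's whole vocabulary for the words nearest each centroid with gensim's similar_by_vector, instead of only my 26 input words - note it ranks by cosine similarity rather than euclidean distance, and I have not verified it gives better representatives.)

# hedged sketch: look up the nearest vocabulary words to each centroid
# across the full pre-trained vocabulary (cosine similarity, via gensim)
for i in range(K):
    nearest = model.similar_by_vector(
        classifier.cluster_centers_[i].astype(np.float32), topn=3)
    print("class {0} nearest vocabulary words: {1}".format(i, nearest))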

Any help or suggestion in finding a good representative word for each cluster would be much appreciated.

1 answer


It sounds like you are hoping to find a generic term for the words in a cluster - a kind of hypernym - through an automated process, and hoped the centroid would identify such a term.

Unfortunately, I have not seen any claims that word2vec arranges words this way. Words tend to be close to other words that could be substituted for them - but there is really no guarantee that all words of a shared general type are closer to each other than to words of other types, or that hypernyms tend to be equidistant from their hyponyms, etc. (It is certainly possible, given word2vec's success at analogy-solving, that hypernyms tend to be offset from their hyponyms in a vaguely similar direction across classes. That is, perhaps, vaguely, 'volkswagen' + ('animal' - 'dog') ~ 'car' - although I haven't tested it.)
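If you want to test that guess yourself, gensim's analogy arithmetic makes it a short sketch against the same loaded model (the exact casing of 'volkswagen' in the GoogleNews vocabulary may differ, e.g. 'Volkswagen', and the result may well be noise):

# hedged sketch: does 'volkswagen' + ('animal' - 'dog') land near 'car'?
result = model.most_similar(positive=['volkswagen', 'animal'],
                            negative=['dog'], topn=5)
for word, similarity in result:
    print(word, similarity)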

One interesting quirk of word vectors may be relevant: word vectors for words with more diffuse meanings - for example, multiple senses - often have a lower magnitude in their raw form than word vectors for words with more singular meanings. The usual most-similar calculations ignore magnitudes, comparing raw directions only, but a search for more general terms might want to favor lower-magnitude vectors. But this is also just a guess that I have not tested.
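A minimal sketch of that heuristic, assuming the raw (unnormalized) vectors are what model[word] returns: rank the words of one cluster by vector norm and take the lowest-magnitude word as the candidate generic term.

import numpy as np

# hedged sketch: pick the lowest-magnitude word in a cluster as a
# candidate generic term - an untested heuristic, not a guarantee
animal_cluster = ["dog", "pig", "animal", "cat", "monkey", "snake",
                  "tiger", "rat", "duck", "rabbit", "fox"]
norms = {w: np.linalg.norm(model[w]) for w in animal_cluster}
print(sorted(norms.items(), key=lambda kv: kv[1]))
print("lowest-magnitude word:", min(norms, key=norms.get))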



You may be able to find prior work on automatic hypernym/hyponym detection, and word2vec vectors could possibly be a contributing factor in such detection processes - either trained in the usual way, or with some new wrinkles to try to force the arrangement you want. (But such specializations are generally not supported out-of-the-box by gensim.)

There are also often papers that tweak the word2vec training process to make the vectors better for particular purposes. One recent paper from Facebook Research that seems relevant is "Poincaré Embeddings for Learning Hierarchical Representations", which reports improved modeling of hierarchies and, in particular, tests on the WordNet noun hypernymy graph.
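For what it's worth, gensim ships an implementation of that paper's method (gensim.models.poincare.PoincareModel); a minimal sketch, assuming you can express your domain as (hyponym, hypernym) pairs - the toy relations below are made up for illustration:

from gensim.models.poincare import PoincareModel

# hedged sketch: train tiny Poincare embeddings on hand-made
# (hyponym, hypernym) relations; real use would feed WordNet-scale data
relations = [("dog", "animal"), ("cat", "animal"), ("rabbit", "animal"),
             ("car", "vehicle"), ("truck", "vehicle"), ("bus", "vehicle")]
poincare = PoincareModel(relations, size=2, negative=2)
poincare.train(epochs=50)
print(poincare.kv.most_similar("dog"))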
