How to plot the confusion / similarity matrix of the K-means algorithm

I am applying the K-means algorithm to cluster some text documents using scikit-learn and displaying the clustering result. I would like to show the similarity between my clusters in a similarity matrix. I have not seen any tool in the scikit-learn library that allows this.

# headlines type: <class 'numpy.ndarray'> of tf-idf vectors
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
import matplotlib.pylab as pl

pca = PCA(n_components=2).fit(headlines)
data2D = pca.transform(to_headlines)
pl.scatter(data2D[:, 0], data2D[:, 1])
km = KMeans(n_clusters=4, init='k-means++', max_iter=300, n_init=3, random_state=0)
km.fit(headlines)

Is there a way / library that will allow me to easily draw this cosine similarity matrix?


1 answer


If I understand you correctly, you want to produce a confusion matrix similar to the one shown here. However, this requires a truth and a prediction that can be compared to each other. Assuming that you have some kind of gold standard for the classification of your headlines into k groups (the truth), you could compare this to the KMeans clustering (the prediction).

The only problem with this is that KMeans clustering is agnostic to your truth, meaning that the cluster labels it produces will not be matched to the labels of the gold standard groups. There is, however, a workaround for this, which is to match the kmeans labels to the truth labels based on the best possible match.

Here's an example of how this might work.


First, let's generate some mock data: in this case, 100 samples with 50 features each, sampled from 4 different (and slightly overlapping) normal distributions. The details are irrelevant; all this is supposed to do is mimic the kind of dataset you might be working with. The truth in this case is the mean of the normal distribution from which a sample was generated.

# Imports
import numpy as np
import matplotlib.pyplot as plt

# User input
n_samples  = 100
n_features =  50

# Prep
truth = np.empty(n_samples)
data  = np.empty((n_samples, n_features))
np.random.seed(42)

# Generate
for i,mu in enumerate(np.random.choice([0,1,2,3], n_samples, replace=True)):
    truth[i]  = mu
    data[i,:] = np.random.normal(loc=mu, scale=1.5, size=n_features)

# Show
plt.imshow(data, interpolation='none')
plt.show()

(Figure: sample data)


Next, we can apply the PCA and KMeans.

Note that I am not sure what the point of the PCA is in your example, since you are not actually using the principal components for your KMeans, and it is also unclear what the dataset to_headlines is that you transform. Here, I transform the input data and then use the principal components for the KMeans clustering. I am also using the output to illustrate the visualization that Saikat Kumar Dey suggested in a comment to your question: a scatter plot with points colored by cluster label.

# PCA
pca = PCA(n_components=2).fit(data)
data2D = pca.transform(data)

# Kmeans
km = KMeans(n_clusters=4, init='k-means++', max_iter=300, n_init=3, random_state=0)
km.fit(data2D)

# Show
plt.scatter(data2D[:, 0], data2D[:, 1],
            c=km.labels_, edgecolor='none')
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.show()

(Figure: PCA + KMeans scatter plot)




We then have to find the best-matching pairs between the truth labels that were generated at the start (here, the mu of the sampled normal distributions) and the kmeans labels generated by the clustering.

In this example, I simply match them such that the number of true-positive predictions is maximized. Please note that this is a simplistic, quick-and-dirty solution!

If your predictions are generally good, and if each group is represented by a similar number of samples in your dataset, it will probably work as intended. Otherwise, it could produce mismatches or merges, and you might somewhat overestimate the quality of your clustering as a result.

Suggestions for better solutions are welcome; one option based on the Hungarian algorithm is sketched right after the code below.

# Prep
k_labels = km.labels_  # Get cluster labels
k_labels_matched = np.empty_like(k_labels)

# For each cluster label...
for k in np.unique(k_labels):

    # ...find and assign the best-matching truth label
    match_nums = [np.sum((k_labels==k)*(truth==t)) for t in np.unique(truth)]
    k_labels_matched[k_labels==k] = np.unique(truth)[np.argmax(match_nums)]
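
For a globally optimal matching, one could instead use the Hungarian algorithm, which scipy exposes as scipy.optimize.linear_sum_assignment. Below is a minimal sketch of that idea, assuming scipy is available: it builds the contingency table between truth and cluster labels and picks the one-to-one assignment that maximizes the number of correctly matched samples.

# Optimal label matching via the Hungarian algorithm (requires scipy)
from scipy.optimize import linear_sum_assignment

truth_ids   = np.unique(truth)      # distinct truth labels
cluster_ids = np.unique(k_labels)   # distinct cluster labels

# contingency[i, j] = number of samples with truth i and cluster label j
contingency = np.array([[np.sum((truth == t) & (k_labels == c))
                         for c in cluster_ids] for t in truth_ids])

# linear_sum_assignment minimizes total cost, so negate to maximize matches
row_ind, col_ind = linear_sum_assignment(-contingency)

# map each cluster label to its optimally matched truth label
mapping = {cluster_ids[c]: truth_ids[r] for r, c in zip(row_ind, col_ind)}
k_labels_matched = np.array([mapping[c] for c in k_labels])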

      


Now that we have matched truths and predictions, we can finally compute and plot the confusion matrix.

# Compute confusion matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(truth, k_labels_matched)

# Plot confusion matrix
plt.imshow(cm,interpolation='none',cmap='Blues')
for (i, j), z in np.ndenumerate(cm):
    plt.text(j, i, z, ha='center', va='center')
plt.xlabel("kmeans label")
plt.ylabel("truth label")
plt.show()

      

(Figure: confusion matrix)
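
As a side note: newer versions of scikit-learn (0.22 and later, if I remember correctly) also ship a built-in helper that renders the same plot, which saves you the manual annotation loop. A quick sketch:

# Same plot using scikit-learn's built-in display (scikit-learn >= 0.22)
from sklearn.metrics import ConfusionMatrixDisplay

disp = ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot(cmap='Blues')
plt.xlabel("kmeans label")
plt.ylabel("truth label")
plt.show()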


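As for the cosine similarity matrix you originally asked about: sklearn.metrics.pairwise.cosine_similarity can compute it directly from your tf-idf vectors, and plt.imshow can draw it. This is only a sketch, assuming headlines holds your tf-idf vectors as rows and km has been fitted as in your question; sorting the documents by cluster label makes the within-cluster similarity blocks visible along the diagonal.

# Cosine similarity matrix of the documents, ordered by cluster label
from sklearn.metrics.pairwise import cosine_similarity

sim   = cosine_similarity(headlines)   # shape: (n_docs, n_docs)
order = np.argsort(km.labels_)

plt.imshow(sim[order][:, order], interpolation='none', cmap='Blues')
plt.colorbar(label='cosine similarity')
plt.show()
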
Hope this helps!
