How to plot the confusion / similarity matrix of a K-means clustering
I am applying the K-means algorithm with scikit-learn to cluster some text documents and display the clustering result. I would like to show the similarity of my clusters in a similarity matrix. I have not seen any tool in the scikit-learn library that allows this.
# headlines type: <class 'numpy.ndarray'> tf-idf vectors
pca = PCA(n_components=2).fit(headlines)
data2D = pca.transform(to_headlines)
pl.scatter(data2D[:, 0], data2D[:, 1])
km = KMeans(n_clusters=4, init='k-means++', max_iter=300, n_init=3, random_state=0)
km.fit(headlines)
Is there a way / library that will allow me to easily draw this cosine similarity matrix?
If I understand you correctly, you want to create a confusion matrix like the one shown here. However, this requires both a truth and a prediction that can be compared with each other. Assuming you have a gold standard that classifies the headlines into k groups (the truth), you can compare it to the KMeans clusters (the prediction).
The only problem is that KMeans clustering is agnostic to your truth, meaning that the cluster labels it produces will not correspond to the gold standard group labels. There is a workaround, however: match the kmeans labels to the truth labels based on the best match.
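To make the label problem concrete, here is a minimal sketch with made-up labels (the arrays `truth_demo` and `pred_demo` are purely illustrative). The prediction groups the samples exactly like the truth but numbers the groups differently, so a direct comparison scores it as completely wrong, whereas a permutation-invariant measure such as scikit-learn's `adjusted_rand_score` recognizes the identical grouping.

```python
import numpy as np
from sklearn.metrics import accuracy_score, adjusted_rand_score

# Illustrative labels: same grouping, different label numbers
truth_demo = np.array([0, 0, 1, 1, 2, 2])
pred_demo = np.array([2, 2, 0, 0, 1, 1])

# A direct comparison treats every sample as misclassified
direct_accuracy = accuracy_score(truth_demo, pred_demo)

# The adjusted Rand index ignores how the clusters are numbered
ari = adjusted_rand_score(truth_demo, pred_demo)

print(direct_accuracy, ari)
```

This is why the kmeans labels have to be matched to the truth labels before a confusion matrix is meaningful.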
Here's an example of how this might work.
First, let's generate some rough data: in this case, 100 samples with 50 features each, drawn from 4 different (and slightly overlapping) normal distributions. The details are not important; all the data has to do is mimic the kind of dataset you might be working with. The truth in this case is the mean of the normal distribution from which each sample is generated.
import numpy as np
import matplotlib.pyplot as plt
# User input
n_samples = 100
n_features = 50
# Prep (truth holds integer group labels)
truth = np.empty(n_samples, dtype=int)
data = np.empty((n_samples, n_features))
np.random.seed(42)
# Generate: pick one of 4 means per sample, then draw its features
for i, mu in enumerate(np.random.choice([0, 1, 2, 3], n_samples, replace=True)):
    truth[i] = mu
    data[i, :] = np.random.normal(loc=mu, scale=1.5, size=n_features)
# Show
plt.imshow(data, interpolation='none')
plt.show()
Next, we can apply PCA and KMeans.
Note that I'm not sure what the point of the PCA is in your example, since you are not actually using the PCs for your KMeans, and it is also unclear what the dataset to_headlines that you are transforming is.
Here, I transform the input data itself and then use the PCs for the KMeans clustering. I also use the output to illustrate the visualization that Saykat Kumar Day suggested in a comment to your question: a scatter plot with the points colored by cluster label.
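from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
# PCA
pca = PCA(n_components=2).fit(data)
data2D = pca.transform(data)
# Kmeans (clustering on the two PCs)
km = KMeans(n_clusters=4, init='k-means++', max_iter=300, n_init=3, random_state=0)
km.fit(data2D)
# Show: points colored by cluster label, without marker edges
plt.scatter(data2D[:, 0], data2D[:, 1],
            c=km.labels_, edgecolor='none')
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.show()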
We then have to find the best-matching pairs between the truth labels generated at the start (here, the mu of the sampled normal distributions) and the kmeans labels generated by the clustering.
In this example, I simply match them in such a way that the number of true positive predictions is maximized. Please note that this is a simplistic, quick and dirty solution!
If your predictions are generally good, and if each group is represented by a similar number of samples in your dataset, it will probably work as intended - otherwise it could lead to incorrect matches / merges, and you might slightly overestimate the quality of your clustering as a result.
Suggestions for better solutions are welcome.
# Prep
k_labels = km.labels_  # Get cluster labels
k_labels_matched = np.empty_like(k_labels)
# For each cluster label...
for k in np.unique(k_labels):
    # ...find and assign the best-matching truth label
    match_nums = [np.sum((k_labels == k) & (truth == t)) for t in np.unique(truth)]
    k_labels_matched[k_labels == k] = np.unique(truth)[np.argmax(match_nums)]
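As one possible alternative to the per-cluster matching above, here is a sketch of a one-to-one assignment, assuming SciPy is available. It builds a table of overlaps between cluster labels and truth labels and uses `scipy.optimize.linear_sum_assignment` to pick the pairing with the largest total overlap, so two clusters cannot be merged onto the same truth label. The helper name `match_labels_one_to_one` and the toy arrays are hypothetical, and the sketch assumes there are at least as many truth labels as clusters.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_labels_one_to_one(k_labels, truth):
    """Map each cluster label to a distinct truth label (hypothetical helper)."""
    k_vals = np.unique(k_labels)
    t_vals = np.unique(truth)
    # Overlap table: rows are clusters, columns are truth labels
    overlap = np.array([[np.sum((k_labels == k) & (truth == t)) for t in t_vals]
                        for k in k_vals])
    # Choose the cluster/truth pairing with the largest total overlap
    rows, cols = linear_sum_assignment(overlap, maximize=True)
    matched = np.empty_like(truth)
    for r, c in zip(rows, cols):
        matched[k_labels == k_vals[r]] = t_vals[c]
    return matched

# Toy data: cluster 1 overlaps truth 0, cluster 0 mostly overlaps truth 1
truth_toy = np.array([0, 0, 0, 1, 1, 1])
k_toy = np.array([1, 1, 0, 0, 0, 0])
matched_toy = match_labels_one_to_one(k_toy, truth_toy)
print(matched_toy)
```

Called as `match_labels_one_to_one(k_labels, truth)`, the result could stand in for `k_labels_matched`. Whether a strict one-to-one pairing is preferable depends on the data, for example when KMeans splits one true group into two clusters.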
Now that we have matched the truths and the predictions, we can finally compute and plot the confusion matrix.
# Compute confusion matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(truth, k_labels_matched)
# Plot confusion matrix, writing each count into its cell
plt.imshow(cm, interpolation='none', cmap='Blues')
for (i, j), z in np.ndenumerate(cm):
    plt.text(j, i, z, ha='center', va='center')
plt.xlabel("kmeans label")
plt.ylabel("truth label")
plt.show()
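If what you are after is instead the cluster similarity matrix mentioned in the question, one possible reading is the cosine similarity between the KMeans cluster centers. Here is a minimal sketch under that assumption, using random stand-in data (`X_demo`) rather than real tf-idf vectors and scikit-learn's `cosine_similarity`; the names ending in `_demo` are illustrative only.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity

# Stand-in data (assumption): 100 non-negative vectors with 50 features,
# loosely mimicking tf-idf vectors
rng = np.random.RandomState(0)
X_demo = rng.rand(100, 50)

km_demo = KMeans(n_clusters=4, n_init=3, random_state=0).fit(X_demo)

# Pairwise cosine similarity between the 4 cluster centers -> (4, 4) matrix
sim = cosine_similarity(km_demo.cluster_centers_)
print(np.round(sim, 2))
```

The resulting matrix can be drawn with `plt.imshow` in the same way as the confusion matrix above.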
Hope this helps!