Affinity Propagation preference parameter

I need to perform clustering without knowing the number of clusters in advance. The number of clusters may range from 1 to 5, since I can encounter cases where all samples belong to a single group or to a small number of groups. I thought Affinity Propagation might be the right choice, since I can control the number of clusters through the preference parameter. However, when I generate an artificial dataset of two clusters and set the preference to the minimum Euclidean distance between points (trying to keep the number of clusters to a minimum), I get terrible clustering.

"""
=================================================
Demo of affinity propagation clustering algorithm
=================================================

Reference:
Brendan J. Frey and Delbert Dueck, "Clustering by Passing Messages
Between Data Points", Science Feb. 2007

"""
print(__doc__)
import numpy as np
from sklearn.cluster import AffinityPropagation
from sklearn import metrics
from sklearn.datasets import make_blobs
from scipy.spatial.distance import pdist

##############################################################################
# Generate sample data
centers = [[0, 0], [1, 1]]
X, labels_true = make_blobs(n_samples=300, centers=centers, cluster_std=0.5,
                            random_state=0)
init = np.min(pdist(X))

##############################################################################
# Compute Affinity Propagation
af = AffinityPropagation(preference=init).fit(X)
cluster_centers_indices = af.cluster_centers_indices_
labels = af.labels_

n_clusters_ = len(cluster_centers_indices)

print('Estimated number of clusters: %d' % n_clusters_)
print("Homogeneity: %0.3f" % metrics.homogeneity_score(labels_true, labels))
print("Completeness: %0.3f" % metrics.completeness_score(labels_true, labels))
print("V-measure: %0.3f" % metrics.v_measure_score(labels_true, labels))
print("Adjusted Rand Index: %0.3f"
      % metrics.adjusted_rand_score(labels_true, labels))
print("Adjusted Mutual Information: %0.3f"
      % metrics.adjusted_mutual_info_score(labels_true, labels))
print("Silhouette Coefficient: %0.3f"
      % metrics.silhouette_score(X, labels, metric='sqeuclidean'))

##############################################################################
# Plot result
import matplotlib.pyplot as plt
from itertools import cycle

plt.close('all')
plt.figure(1)
plt.clf()

colors = cycle('bgrcmykbgrcmykbgrcmykbgrcmyk')
for k, col in zip(range(n_clusters_), colors):
    class_members = labels == k
    cluster_center = X[cluster_centers_indices[k]]
    plt.plot(X[class_members, 0], X[class_members, 1], col + '.')
    plt.plot(cluster_center[0], cluster_center[1], 'o', markerfacecolor=col,
             markeredgecolor='k', markersize=14)
    for x in X[class_members]:
        plt.plot([cluster_center[0], x[0]], [cluster_center[1], x[1]], col)

plt.title('Estimated number of clusters: %d' % n_clusters_)
plt.show()


Is there a downside to my approach of using Affinity Propagation? Or is Affinity Propagation unsuitable for this task altogether, so that I should use something else?


2 answers


No, there is no downside. AP does not work with distances; it requires similarities as input. I don't know scikit-learn's implementation in detail, but from what I've read it uses negative squared Euclidean distances by default to compute the similarity matrix. If you set the input preference to the minimum Euclidean distance, you get a positive value while all similarities are negative. This tends to produce as many clusters as there are samples (note: the higher the input preference, the more clusters). I would rather suggest setting the input preference to the minimum similarity, i.e. -1 times the largest squared distance in the dataset. This will give you far fewer clusters, though not necessarily a single one. I don't know whether a preferenceRange() function exists in scikit-learn's implementation; there is Matlab code for it on the AP homepage, and it is also implemented in the apcluster R package that I maintain. This function lets you determine meaningful bounds for the input preference parameter. Hope it helps.
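A minimal sketch of this suggestion against the question's setup (assuming scikit-learn's default negative-squared-Euclidean similarity; the `random_state` argument is only available in newer scikit-learn versions):

```python
import numpy as np
from sklearn.cluster import AffinityPropagation
from sklearn.datasets import make_blobs
from scipy.spatial.distance import pdist

# Same two-blob data as in the question.
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [1, 1]],
                  cluster_std=0.5, random_state=0)

# Minimum similarity = -1 * the largest squared pairwise distance,
# instead of the (positive) minimum Euclidean distance.
pref = -np.max(pdist(X, metric='sqeuclidean'))

af = AffinityPropagation(preference=pref, random_state=0).fit(X)
n_clusters = len(af.cluster_centers_indices_)
print('Estimated number of clusters:', n_clusters)
```

Because the preference now sits at the low end of the similarity range, AP favors few exemplars, so the estimated number of clusters stays small rather than degenerating to one cluster per sample.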





You can steer the result by specifying a minimal preference, but there is no guarantee that you will end up with a single cluster.

I would also suggest not forcing a single cluster: some of the data points may simply not be similar to the others, and with a minimal preference AP will absorb that mismatch into the cluster rather than expose it.







