Affinity propagation preference parameter

Question

I have been seeing encouraging clustering results for a set of entity names using scikit-learn's affinity propagation implementation, with a modified Jaro-Winkler distance as the similarity measure, but my clusters are still too numerous (i.e. too many false positives).

In the scikit-learn documentation, I see that there is a "preference" parameter that affects the number of clusters, with the following description:

preference : array-like, shape (n_samples,) or float, optional

Preferences for each point - points with larger preference values are more likely to be chosen as exemplars. The number of exemplars, i.e. of clusters, is influenced by the input preferences value. If the preferences are not passed as arguments, they will be set to the median of the input similarities. [0]
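
The default described above can be checked with a small sketch. Everything here is an assumption for illustration: the toy blobs and the negative squared Euclidean similarity stand in for the real entity names and the modified Jaro-Winkler measure. It fits once with `preference` left unset and once with the median of the similarity matrix passed explicitly, then compares the labels.

```python
import numpy as np
from sklearn.cluster import AffinityPropagation
from sklearn.datasets import make_blobs
from sklearn.metrics import pairwise_distances

# Toy data (an assumption); the question uses name similarities instead.
X, _ = make_blobs(n_samples=40, centers=3, cluster_std=0.5, random_state=0)

# Precomputed similarity matrix: negative squared Euclidean distance.
S = -pairwise_distances(X, metric="sqeuclidean")

# Fit with preference left unset (falls back to the median similarity).
default_fit = AffinityPropagation(
    affinity="precomputed", damping=0.9, random_state=0
).fit(S)

# Fit with the median of the similarities passed explicitly.
median_fit = AffinityPropagation(
    affinity="precomputed", preference=np.median(S), damping=0.9, random_state=0
).fit(S)

# Both runs should produce the same cluster assignment.
same_labels = np.array_equal(default_fit.labels_, median_fit.labels_)
print(same_labels)
```

Values below the median should tend to give fewer clusters and values above it more, so the median is a natural starting point for any search.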

However, when I started tinkering with this value, I found that a very narrow range of values gave me either far too many clusters (preference=-11.13) or far too few clusters (preference=-11.11).

Is there a way to determine what a "reasonable" preference value should be? And why am I unable to get a non-extreme number of clusters?

Similar questions:

Affinity Propagation - Cluster Imbalance

Affinity Propagation Preferences Initialization

python scikit-learn unsupervised-learning cluster-analysis

nitrl Apr 24 17 at 14:08

1 answer

Erotemic · Answer 1 · 2017-04-27T21:12:16+0000

You can try using sklearn.model_selection.GridSearchCV or sklearn.model_selection.RandomizedSearchCV.

You can define a custom error measure that encourages the hyperparameter search to produce smaller clusters. You can then search over multiple preference values to find one that works well for your dataset, based on a validation set.
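
A minimal sketch of that idea follows, under several assumptions that are not from the question: toy blob data with the default Euclidean affinity, a hand-picked preference grid, and the silhouette coefficient as a stand-in for the custom measure (`silhouette_scorer` is a hypothetical helper). Because clustering has no labels to hold out, the single "split" here fits and scores on the same points, which is a workaround rather than true validation.

```python
import numpy as np
from sklearn.cluster import AffinityPropagation
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.model_selection import GridSearchCV

# Toy feature data (an assumption) standing in for the real dataset.
X, _ = make_blobs(n_samples=60, centers=3, cluster_std=0.5, random_state=0)


def silhouette_scorer(estimator, X, y=None):
    """Hypothetical custom measure: silhouette of the fitted labels."""
    labels = estimator.labels_
    n_clusters = len(set(labels))
    # The silhouette is only defined for 2..n_samples-1 clusters;
    # this also covers non-converged fits, where every label is -1.
    if n_clusters < 2 or n_clusters >= len(X):
        return -1.0
    return float(silhouette_score(X, labels))


# Candidate preference values (illustrative, not tuned for any real data).
param_grid = {"preference": [-200, -100, -50, -20, -10, -5]}

# One split that trains and scores on all samples.
all_idx = np.arange(len(X))
search = GridSearchCV(
    AffinityPropagation(damping=0.9, max_iter=500, random_state=0),
    param_grid=param_grid,
    scoring=silhouette_scorer,
    cv=[(all_idx, all_idx)],
)
search.fit(X)

print(search.best_params_, search.best_score_)
```

To push the search toward a particular cluster count, the scorer could instead subtract a penalty based on `len(estimator.cluster_centers_indices_)`; the right penalty depends on the dataset.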

More information: http://scikit-learn.org/stable/modules/grid_search.html
