Affinity Propagation preference parameter

I have been getting encouraging clustering results for a set of entity names using scikit-learn's affinity propagation implementation, with a modified Jaro-Winkler distance as the similarity measure, but my clusters are still too numerous (i.e. too many false positives).

In the scikit-learn documentation, I see that there is a "preference" parameter that affects the number of clusters, with the following description:

preference : array, shape (n_samples) or float, optional

Preferences for each point - points with larger values of preferences are more likely to be chosen as exemplars. The number of exemplars, i.e. of clusters, is influenced by the input preferences value. If the preferences are not passed as arguments, they will be set to the median of the input similarities. [0]

However, when I started tinkering with this value, I found that a very narrow range of values gives me either too many clusters (preference=-11.13) or too few clusters (preference=-11.11).

Is there a way to determine what a "reasonable" preference value should be? And why can't I get a non-trivial number of clusters?
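For reference, the kind of sweep described above can be sketched as follows. This is only an illustration under stated assumptions: the `names` list is made up, `difflib.SequenceMatcher.ratio()` stands in for the modified Jaro-Winkler similarity (which is not shown here), and `count_clusters` is a hypothetical helper. It tries a few preference values between the minimum and the median of the similarity matrix (the median being the documented default) and reports how many exemplars each one yields.

```python
from difflib import SequenceMatcher

import numpy as np
from sklearn.cluster import AffinityPropagation

# Hypothetical entity names; replace with the real data.
names = [
    "Acme Corp", "Acme Corporation", "ACME Corp.",
    "Globex", "Globex Inc", "Globex Incorporated",
    "Initech", "Initech LLC",
    "Umbrella Corp", "Umbrella Corporation",
]

# Stand-in similarity (assumption): difflib's ratio instead of Jaro-Winkler.
S = np.array([
    [SequenceMatcher(None, a.lower(), b.lower()).ratio() for b in names]
    for a in names
])


def count_clusters(similarity, preference):
    """Fit affinity propagation on a precomputed similarity matrix and
    return the number of exemplars found (0 if it did not converge)."""
    model = AffinityPropagation(
        affinity="precomputed", preference=preference, random_state=0
    )
    model.fit(similarity)
    return len(model.cluster_centers_indices_)


# Sweep from the smallest similarity up to the median (the default preference).
candidates = np.linspace(S.min(), np.median(S), num=5)
results = [(float(p), count_clusters(S, p)) for p in candidates]

for p, k in results:
    print(f"preference={p:.3f} -> {k} clusters")
```

Lower preference values generally yield fewer exemplars, so scanning a range tied to the similarity matrix itself is one way to see where the cluster count changes.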

Similar questions:

Affinity Propagation - Cluster Imbalance

Affinity Propagation preference initialization



1 answer


You can try using sklearn.model_selection.GridSearchCV or sklearn.model_selection.RandomizedSearchCV.

You can define a custom error metric that steers the hyperparameter search toward the cluster granularity you want. You can then search over multiple preference values to find the one that works well for your dataset, judged on a validation set.
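A minimal sketch of that idea, under stated assumptions: the data come from `make_blobs` rather than real entity names, the default Euclidean affinity is used instead of a precomputed Jaro-Winkler matrix (scikit-learn's `predict` is not supported for `affinity="precomputed"`), and the silhouette coefficient is just one possible choice of metric. `silhouette_scorer` and the grid values are hypothetical.

```python
from sklearn.cluster import AffinityPropagation
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.model_selection import GridSearchCV

# Toy feature data standing in for the real dataset (assumption).
X, _ = make_blobs(n_samples=150, centers=3, cluster_std=0.6, random_state=0)


def silhouette_scorer(estimator, X, y=None):
    """Custom scorer: silhouette of the labels predicted on the held-out fold.
    Returns -1.0 when the silhouette is undefined (fewer than 2 clusters,
    or as many clusters as samples, e.g. after a failed convergence)."""
    labels = estimator.predict(X)
    n_labels = len(set(labels))
    if n_labels < 2 or n_labels >= len(X):
        return -1.0
    return silhouette_score(X, labels)


# Candidate preference values; pick a range suited to your similarities.
param_grid = {"preference": [-500.0, -200.0, -50.0, -10.0]}

search = GridSearchCV(
    AffinityPropagation(random_state=0),
    param_grid,
    scoring=silhouette_scorer,
    cv=3,
)
search.fit(X)  # unsupervised: no y is passed

print(search.best_params_)
```

If a penalty on the number of clusters matters more than cohesion, the scorer could subtract a term proportional to the cluster count; that weighting is a design choice, not something the library prescribes.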



More information: http://scikit-learn.org/stable/modules/grid_search.html
