How can GridSearchCV be used for clustering (MeanShift or DBSCAN)?

I am trying to cluster some text documents using scikit-learn. I want to try both DBSCAN and MeanShift, and to figure out which hyperparameters (for example, bandwidth for MeanShift and eps for DBSCAN) work best for the kind of data I am using (news articles).

I have some testing data that consists of pre-labeled clusters. I have been trying to use scikit-learn's GridSearchCV, but I don't understand how (or whether) it can be applied in this case, since it splits the data into folds, while I want to run the estimator across the entire dataset and compare the results against the pre-labeled data.

I tried passing a scoring function that compares the estimated labels to the true labels, but of course it doesn't work, because only a fold of the data gets clustered, not the entire dataset.
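Roughly what I tried looks like this (simplified; the toy data and the ARI-based scorer here are stand-ins for my real articles and labels):

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score
from sklearn.model_selection import GridSearchCV

X, y = make_blobs(n_samples=100, centers=3, random_state=0)  # stand-in for my vectorized articles

def cluster_scorer(estimator, X, y):
    # Re-clusters whatever slice of data GridSearchCV hands it
    # and compares the result to the true labels of that slice.
    return adjusted_rand_score(y, estimator.fit_predict(X))

search = GridSearchCV(DBSCAN(), param_grid={"eps": [0.3, 0.5, 1.0]}, scoring=cluster_scorer)
search.fit(X, y)
# Problem: each score is computed on a held-out fold,
# never on the entire dataset at once.
```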

What's the appropriate approach here?


1 answer


Have you considered implementing the search yourself?

It's not particularly hard to implement a for loop. Even if you want to optimize two parameters, it's still pretty easy.
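A minimal sketch of what that can look like, assuming pre-labeled data and using the Adjusted Rand Index as the quality measure (the toy data and parameter ranges are placeholders for your own):

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

X, true_labels = make_blobs(n_samples=200, centers=4, random_state=0)  # stand-in for vectorized articles

best_score, best_params = -1.0, None
for eps in np.arange(0.1, 1.1, 0.2):           # first parameter
    for min_samples in (3, 5, 10):             # second parameter
        # Cluster the WHOLE dataset, then compare to the pre-tagged labels.
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
        score = adjusted_rand_score(true_labels, labels)
        if score > best_score:
            best_score, best_params = score, (eps, min_samples)

print("best (eps, min_samples):", best_params, "ARI:", best_score)
```

Unlike GridSearchCV, this evaluates every parameter combination on the full dataset against the pre-labeled clusters, which is what you are asking for.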

For both DBSCAN and MeanShift, I nonetheless advise you to understand your similarity measure first. It makes more sense to choose parameters based on an understanding of your measure than to optimize them to match some labels (which carries a high risk of overfitting).



In other words: at what distance should two articles be considered similar enough to end up in the same cluster?

If this distance varies too much from one data point to another, these algorithms will fail badly, and you may need to find a normalized distance function so that the actual distance values become meaningful again. TF-IDF is standard for text, but mostly in a retrieval context; it can perform much worse in a clustering context.
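For illustration, one common normalization (though not guaranteed to be the right one for your articles) is L2-normalized TF-IDF with cosine distance, which keeps distances in a bounded, comparable range:

```python
from sklearn.cluster import DBSCAN
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["first news article ...", "second news article ...", "a third one ..."]  # placeholder documents

# Rows are L2-normalized; for these non-negative vectors
# the cosine distance is bounded in [0, 1].
X = TfidfVectorizer(norm="l2").fit_transform(docs)

labels = DBSCAN(eps=0.5, min_samples=5, metric="cosine").fit_predict(X)
```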

Also be careful that MeanShift (like k-means) needs to recompute coordinates, i.e. the cluster means. On text data, this can lead to undesirable results, where the updated coordinates actually get worse rather than better.
