Clustering using k-means in python

I have a document d1 consisting of strings of the form user_id tag_id. There is another d2 document consisting of tag_id tag_name I need to create user clusters with similar behavior. I want to try this with the k-mean algorithm in python. I am completely new to this and cannot figure out how to start with this. Can anyone provide pointers?

Do I need to create different documents for each user first using d1 with its dictionary tag? And then apply the k-means algorithm to these documents? D1 has 1 million users. I'm not sure what I'm thinking in the right direction, creating 1 million files?


source to share

4 answers

Since your data is binary and sparse (in particular, not all users have checked all documents, right)? So I'm not at all sure if k-means is the correct way to do it.

Anyway, if you want to give k-means a try, look at options like k-medians (which won't allow you to "semi-tag") and convex / spherical k-means (which presumably works better with distance functions. such as distance from cosine, which seems to be much more appropriate here).



As @Jacob Eggers mentioned, you have to denormalize the data in order to form a matrix that is really sparse. Use SciPy package in python for k tools. Cm.

Scipy Kmeans

for examples and execution. Also check / ... for more information on python kmeans clustering.



First, you need to denormalize the data so that you have one file:

userid tag1 tag2 tag3 tag4 ....
0001   1    0    1    0    ....
0002   0    1    1    0    ....
0003   0    0    1    1    ....


Then you need to loop through a k-dimensional algorithm. Here is the matlab code from the ml class:

% Initialize centroids
centroids = kMeansInitCentroids(X, K);
for iter = 1:iterations
    % Cluster assignment step: Assign each data point to the
    % closest centroid. idx(i) corresponds to cˆ(i), the index 
    % of the centroid assigned to example i
    idx = findClosestCentroids(X, centroids);

    % Move centroid step: Compute means based on centroid
    % assignments
    centroids = computeMeans(X, idx, K);




For sparse k-means, see the examples provided in scikit-learn clustering .
How many identifiers are there, how many per user on average, how many clusters are you looking for? Even rough numbers, for example 100K IDs, av 10 per user, 100 clusters, can lead to someone who did clustering in this range (or otherwise, to make the "envelope" "impossible").

MinHash may be better suited to your problem than k-means; see Chapter 3 "Finding Similar Items", Ullman, Massive Massive Arrays ;
also SO questions / tagged / similarity + algorithm + python .



All Articles