Clustering using k-means in python

Question

Clustering using k-means in python

I have a document d1 consisting of strings of the form user_id tag_id. There is another d2 document consisting of tag_id tag_name I need to create user clusters with similar behavior. I want to try this with the k-mean algorithm in python. I am completely new to this and cannot figure out how to start with this. Can anyone provide pointers?

Do I need to create different documents for each user first using d1 with its dictionary tag? And then apply the k-means algorithm to these documents? D1 has 1 million users. I'm not sure what I'm thinking in the right direction, creating 1 million files?

+3

python tags cluster-analysis data-mining k-means

Maxwell 07 Mar 12 at 3:43

source to share

4 answers

Anony-mousse · Answer 1 · 2012-03-07T07:03:19+0000

Since your data is binary and sparse (in particular, not all users have checked all documents, right)? So I'm not at all sure if k-means is the correct way to do it.

Anyway, if you want to give k-means a try, look at options like k-medians (which won't allow you to "semi-tag") and convex / spherical k-means (which presumably works better with distance functions. such as distance from cosine, which seems to be much more appropriate here).

sravan_kumar · Answer 2 · 2012-03-07T04:44:38+0000

As @Jacob Eggers mentioned, you have to denormalize the data in order to form a matrix that is really sparse. Use SciPy package in python for k tools. Cm.

Scipy Kmeans

for examples and execution. Also check fooobar.com/questions/120585 / ... for more information on python kmeans clustering.

Jacob eggers · Answer 3 · 2012-03-07T04:05:37+0000

First, you need to denormalize the data so that you have one file:

userid tag1 tag2 tag3 tag4 ....
0001   1    0    1    0    ....
0002   0    1    1    0    ....
0003   0    0    1    1    ....

Then you need to loop through a k-dimensional algorithm. Here is the matlab code from the ml class:

% Initialize centroids
centroids = kMeansInitCentroids(X, K);
for iter = 1:iterations
    % Cluster assignment step: Assign each data point to the
    % closest centroid. idx(i) corresponds to cˆ(i), the index 
    % of the centroid assigned to example i
    idx = findClosestCentroids(X, centroids);

    % Move centroid step: Compute means based on centroid
    % assignments
    centroids = computeMeans(X, idx, K);
end

denis · Answer 4 · 2012-03-07T16:51:43+0000

For sparse k-means, see the examples provided in scikit-learn clustering .
How many identifiers are there, how many per user on average, how many clusters are you looking for? Even rough numbers, for example 100K IDs, av 10 per user, 100 clusters, can lead to someone who did clustering in this range (or otherwise, to make the "envelope" "impossible").

MinHash may be better suited to your problem than k-means; see Chapter 3 "Finding Similar Items", Ullman, Massive Massive Arrays ;
also SO questions / tagged / similarity + algorithm + python .

Clustering using k-means in python

More articles: