# Clustering using k-means in Python

I have a document d1 consisting of lines of the form `user_id tag_id`. There is another document d2 consisting of lines of the form `tag_id tag_name`. I need to cluster users with similar behavior, and I want to try the k-means algorithm in Python. I am completely new to this and cannot figure out how to start. Can anyone provide some pointers?

Do I first need to create a separate document for each user from d1, containing that user's tags? And then apply the k-means algorithm to these documents? d1 has 1 million users. Am I thinking in the right direction, creating 1 million files?

---

Your data is binary and sparse (in particular, not every user has tagged every document, right?), so I'm not at all sure that k-means is the right way to do this.

Anyway, if you want to give k-means a try, look at variants such as **k-medians** (which avoids fractional, "half-tagged" centroids) and convex/spherical k-means (which presumably works better with distance functions such as cosine distance, which seems much more appropriate here).

---

As @Jacob Eggers mentioned, you have to denormalize the data to form a matrix, which is indeed sparse. Use the SciPy package in Python for k-means. See

for examples and execution. Also check fooobar.com/questions/120585 / ... for more information on k-means clustering in Python.
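A minimal sketch of running k-means with SciPy's `scipy.cluster.vq.kmeans2` (the toy user-tag matrix and `k=2` here are assumptions for illustration):

```python
import numpy as np
from scipy.cluster.vq import kmeans2

# Rows = users, columns = binary tag indicators (toy data).
X = np.array([
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
], dtype=float)

# k-means++ initialization; seed fixed so the sketch is reproducible.
centroids, labels = kmeans2(X, 2, minit="++", seed=0)
print(labels)  # users 0,1 share a cluster and users 2,3 share the other
```

`kmeans2` returns both the centroid matrix and a label per row, which is all you need to group the users afterwards.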

---

First, you need to denormalize the data so that you have one file:

```
userid tag1 tag2 tag3 tag4 ....
0001 1 0 1 0 ....
0002 0 1 1 0 ....
0003 0 0 1 1 ....
```
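Rather than creating one file per user, the denormalization can be done in memory, one binary row per user. A minimal pure-Python sketch (the in-line sample lines stand in for the contents of d1):

```python
from collections import defaultdict

# Stand-in for the "user_id tag_id" lines of d1.
d1_lines = ["u1 t1", "u1 t3", "u2 t2", "u2 t3", "u3 t3", "u3 t4"]

# Collect the set of tags per user and the global tag vocabulary.
user_tags = defaultdict(set)
tags = set()
for line in d1_lines:
    user_id, tag_id = line.split()
    user_tags[user_id].add(tag_id)
    tags.add(tag_id)

tag_order = sorted(tags)

# One binary row per user instead of one file per user.
rows = {
    user: [1 if tag in tagged else 0 for tag in tag_order]
    for user, tagged in user_tags.items()
}
print(rows["u1"])  # [1, 0, 1, 0]
```

With 1 million users you would build these rows into a sparse matrix (e.g. `scipy.sparse.csr_matrix`) rather than a dense list per user.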

Then you need to iterate the k-means algorithm. Here is the MATLAB code from the Machine Learning class:

```
% Initialize centroids
centroids = kMeansInitCentroids(X, K);
for iter = 1:iterations
    % Cluster assignment step: assign each data point to the
    % closest centroid. idx(i) corresponds to c^(i), the index
    % of the centroid assigned to example i.
    idx = findClosestCentroids(X, centroids);

    % Move centroid step: compute means based on centroid
    % assignments.
    centroids = computeMeans(X, idx, K);
end
```
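A NumPy translation of the same loop might look like this (a sketch; the tiny matrix and the naive deterministic initialization are assumptions for illustration):

```python
import numpy as np

def find_closest_centroids(X, centroids):
    # Cluster assignment step: index of the nearest centroid per row.
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)

def compute_means(X, idx, K):
    # Move centroid step: each centroid becomes the mean of its points.
    return np.array([X[idx == k].mean(axis=0) for k in range(K)])

K = 2
X = np.array([[1., 1., 0., 0.],
              [1., 0., 0., 0.],
              [0., 0., 1., 1.],
              [0., 1., 1., 1.]])

centroids = X[[0, 2]].copy()  # naive deterministic init for the sketch
for _ in range(10):
    idx = find_closest_centroids(X, centroids)
    centroids = compute_means(X, idx, K)

idx = find_closest_centroids(X, centroids)
print(idx)  # -> [0 0 1 1]
```

A real implementation would use k-means++ initialization and handle empty clusters; this sketch only mirrors the two steps of the MATLAB loop above.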

---

For sparse k-means, see the examples in the scikit-learn clustering documentation.
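For instance, scikit-learn's `KMeans` accepts a SciPy sparse matrix directly, so the user-tag matrix never has to be materialized densely (a sketch with a toy matrix and `n_clusters=2` assumed):

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.cluster import KMeans

# Sparse binary user-tag matrix: rows = users, columns = tags.
X = csr_matrix(np.array([
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
], dtype=float))

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)  # users 0,1 in one cluster, users 2,3 in the other
```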

How many tag IDs are there, how many per user on average, and how many clusters are you looking for? Even rough numbers, for example 100K IDs, about 10 per user, 100 clusters, could attract an answer from someone who has done clustering in that range (or allow a back-of-the-envelope feasibility check).

MinHash may be better suited to your problem than k-means; see chapter 3, "Finding Similar Items", of Rajaraman and Ullman, *Mining of Massive Datasets*;

see also SO questions tagged `similarity+algorithm+python`.
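The idea behind MinHash is that the fraction of matching positions in two users' signatures estimates the Jaccard similarity of their tag sets. A toy pure-Python sketch (the seeded-MD5 hash functions and signature length are arbitrary choices for illustration):

```python
import hashlib

def minhash_signature(items, num_hashes=16):
    # One min-hash value per seeded hash function over the item set.
    sig = []
    for seed in range(num_hashes):
        sig.append(min(
            int(hashlib.md5(f"{seed}:{item}".encode()).hexdigest(), 16)
            for item in items
        ))
    return sig

def estimated_jaccard(sig_a, sig_b):
    # Fraction of matching signature positions approximates Jaccard similarity.
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

user_a = {"t1", "t2", "t3"}
user_b = {"t2", "t3", "t4"}
print(estimated_jaccard(minhash_signature(user_a), minhash_signature(user_b)))
```

With only 16 hash functions the estimate is noisy (the true Jaccard here is 0.5); a production setup would use more hash functions and banding (LSH) to find candidate pairs among a million users.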

---