MATLAB: kmeans clustering gives unexpected clusters

Example:

load kmeansdata %provides X variable
Y=bsxfun(@minus,X,mean(X,2))'/sqrt(size(X,2)-1); %normalized and means adjusted
[~,~,PC] = svd(Y); % PC holds the right singular vectors of Y
plot(PC(:,1),PC(:,2),'m.','markersize',15)

      

Plot the first two columns and you get what looks like 3 clusters. I want to identify these clusters using kmeans and draw them in different colors as proof. I tried:

[idx,cntrd] = kmeans(PC(:,1:2),3,'Distance','sqEuclidean');%,'Distance','correlation');

cluster=3;
Col = {'.b','.r','.g','.y','.m','.c','.k'}; % Cell array of colours.
figure;
hold on
for clus=1:cluster
  plot(PC(idx==clus,1),PC(idx==clus,2),Col{clus},'MarkerSize',12)  
end
plot(cntrd(:,1),cntrd(:,2),'kx','MarkerSize',15,'LineWidth',3) %plotting the centroids of the clusters

      

The cluster centroids are off and the colors are not what I expected. Can anyone please help?

EDIT: Multiple answers:

I copied this code from the MathWorks site and replaced my kmeans line with it:

opts = statset('Display','final');
[idx,C] = kmeans(PC(:,1:2),3,'Distance','cityblock',...
    'Replicates',5,'Options',opts);

      

It works, but I don't quite understand what opts does. Replicates, I suppose, just repeats kmeans 5 times and picks some average for the centroids. I also restarted MATLAB in case something had crashed.

EDIT: ignore the above:

I thought the problem was solved, so I tried to find suitable k values. I entered k = 1, ran through everything, then k = 2, then k = 3, and I noticed that I got the same error again.



1 answer


kmeans can be sensitive to the initial centroid locations. The problem lies in how the algorithm selects its starting points. For example, you can get the expected result by supplying the starting centroids explicitly:

[idx,cntrd] = kmeans(PC(:,1:2),3, 'start', [-0.05 0; 0 0; 0.05  0]);
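
If suitable starting points are not known in advance, another option is to let kmeans try several random initializations. With 'Replicates', kmeans does not average the centroids; it returns the replicate with the lowest total within-cluster sum of distances. The 'Display','final' setting from the question only controls what gets printed. The following is a minimal sketch, assuming the same kmeansdata set as in the question; the seed and the number of replicates are arbitrary choices, and multiple restarts may still not recover the visual clusters for this data.

```matlab
% Rebuild PC as in the question so this snippet runs on its own.
load kmeansdata                                    % provides X
Y = bsxfun(@minus, X, mean(X,2))' / sqrt(size(X,2)-1);
[~,~,PC] = svd(Y);

rng(1);                                            % assumed seed, for reproducibility only
opts = statset('Display','final');                 % print a summary line per replicate

% Run kmeans from 20 random starts and keep the best replicate,
% i.e. the one with the smallest total within-cluster distance.
[idx, cntrd, sumd] = kmeans(PC(:,1:2), 3, ...
    'Replicates', 20, 'Options', opts);

fprintf('Total within-cluster distance: %g\n', sum(sumd));
```

Comparing sum(sumd) between this run and a run with the hand-picked 'start' matrix shows which solution kmeans itself considers better.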

      



Looks can also be deceiving. In this case, the variance of the data is not equal in the x and y dimensions. Thus, for some pairs of points, the Euclidean distance between points in different visual clusters is smaller than the distance between points within the same visual cluster.
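
A quick way to check the variance claim is to compare the spread of the two plotted columns. The sketch below does that and then, as one possible remedy that is an assumption rather than part of the original answer, rescales each column to unit variance with zscore before clustering. The variable names P, v, Z and idxZ are made up for illustration.

```matlab
% Rebuild PC as in the question so this snippet runs on its own.
load kmeansdata                                    % provides X
Y = bsxfun(@minus, X, mean(X,2))' / sqrt(size(X,2)-1);
[~,~,PC] = svd(Y);

P = PC(:,1:2);                                     % the two plotted columns
v = var(P);                                        % 1-by-2: variance of each column
fprintf('var(x) = %g, var(y) = %g, ratio = %g\n', v(1), v(2), v(1)/v(2));

% Assumed remedy: give both columns zero mean and unit variance so that
% neither dimension dominates the Euclidean distance.
Z = zscore(P);
rng(1);                                            % assumed seed, for reproducibility only
idxZ = kmeans(Z, 3, 'Replicates', 10);
```

Whether rescaling helps depends on how different the two variances actually are, so it is worth looking at the printed ratio before relying on it.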

For this data, you could use a Gaussian mixture model instead.
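
A Gaussian mixture model can be fitted with fitgmdist from the Statistics and Machine Learning Toolbox (available from R2014a). The sketch below is one way to try it on the same two columns and plot the result in the style of the question. The number of replicates, the seed, and the variable names P, gm and idxGM are assumptions made for illustration.

```matlab
% Rebuild PC as in the question so this snippet runs on its own.
load kmeansdata                                    % provides X
Y = bsxfun(@minus, X, mean(X,2))' / sqrt(size(X,2)-1);
[~,~,PC] = svd(Y);
P = PC(:,1:2);

rng(1);                                            % assumed seed, for reproducibility only
% Fit a 3-component mixture; each component gets its own full covariance,
% so elongated clusters can be modelled. 'Replicates' reruns the EM fit
% from several starts and keeps the fit with the largest log-likelihood.
gm = fitgmdist(P, 3, 'Replicates', 10);
idxGM = cluster(gm, P);                            % hard assignment per point

Col = {'.b','.r','.g'};
figure; hold on
for k = 1:3
    plot(P(idxGM==k,1), P(idxGM==k,2), Col{k}, 'MarkerSize', 12)
end
plot(gm.mu(:,1), gm.mu(:,2), 'kx', 'MarkerSize', 15, 'LineWidth', 3) % component means
hold off
```

Unlike kmeans, the mixture model estimates a covariance for each cluster, which is why it can cope with clusters that are stretched along one axis.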
