K-means clustering for multidimensional data

Question

If the dataset contains 440 objects and 8 attributes (the dataset was pulled from the UCI Machine Learning Repository), how do we calculate centroids for such a dataset? (Wholesale customers data) https://archive.ics.uci.edu/ml/datasets/Wholesale+customers

If I calculated the average of each row, would that be the centroid? And how do I display the resulting clusters in Matlab?

machine-learning cluster-analysis

Suvidha 03 Sep 14 at 17:24

1 answer

tttthomasssss · Accepted Answer · 2014-09-03T18:40:30+0000

OK, first of all, one row in the dataset corresponds to one example in the data. You have 440 rows, which means the dataset consists of 440 examples. Each column contains the values for one particular feature (or attribute, as you call it); for example, column 1 in your dataset contains the values for the feature Channel, column 2 contains the values for the feature Region, etc.

K-Means

Now, for K-Means clustering you need to specify the number of clusters (the K in K-Means). Let's say you want K = 3 clusters; then the easiest way to initialize K-Means is to randomly pick 3 examples from your dataset (that is, 3 rows drawn at random from the 440 rows you have) as your centroids. Now these 3 examples are your centroids.
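The random initialization described above could be sketched roughly as follows. The matrix X here is filled with random placeholder values standing in for the real 440x8 data, and the names K, idx and centroids are only illustrative.

```matlab
% Placeholder data: 440 examples with 8 features (random values, NOT the real dataset)
X = rand(440, 8);
K = 3;                           % number of clusters, chosen by you

% Pick K distinct row indices at random and use those rows as starting centroids
idx = randperm(size(X, 1), K);
centroids = X(idx, :);           % K-by-8 matrix, one centroid per row
```

Because randperm returns distinct indices, the same row is never chosen as a starting centroid twice.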

You can think of your centroids as 3 bins, and you want to put every example from the dataset into the closest bin (closeness is usually measured by Euclidean distance; check the function norm in Matlab).

After the first round of putting all the examples into the closest bin, you recalculate the centroids by calculating the mean of all the examples in their respective bins. You repeat the process of putting all the examples into the closest bin until no example in your dataset moves to another bin.
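Put together, the assign-and-recalculate loop might look something like the sketch below. It runs on made-up 2-feature toy data rather than the wholesale dataset, all variable names are illustrative, and keeping the old centroid when a bin ends up empty is an assumption of this sketch, not something the description above prescribes.

```matlab
% Toy data: three loosely separated groups in 2-D (made up for illustration)
X = [randn(50, 2); randn(50, 2) + 10; randn(50, 2) - 10];
K = 3;
n = size(X, 1);

centroids  = X(randperm(n, K), :);   % random rows as starting centroids
assignment = zeros(n, 1);            % bin index of each example (0 = not assigned yet)

while true
    % Assignment step: put every example into the bin of its nearest centroid
    newAssignment = zeros(n, 1);
    for i = 1:n
        dists = zeros(K, 1);
        for k = 1:K
            dists(k) = norm(X(i, :) - centroids(k, :));
        end
        [~, newAssignment(i)] = min(dists);
    end

    % Stop once no example has moved to another bin
    if isequal(newAssignment, assignment)
        break;
    end
    assignment = newAssignment;

    % Update step: each centroid becomes the mean of the examples in its bin
    for k = 1:K
        members = X(assignment == k, :);
        if ~isempty(members)         % assumption: keep the old centroid if a bin is empty
            centroids(k, :) = mean(members, 1);
        end
    end
end
```

The explicit loops are slow for large datasets but follow the description step by step; a vectorized version would compute the same assignments.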

Some Matlab starting points

You can load the data with X = load('path/to/the/dataset', '-ascii');

In your case, X will be a 440x8 matrix.

You can calculate the Euclidean distance from an example to a centroid with distance = norm(example - centroid1);, where both example and centroid1 have dimensionality 1x8.
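As a small worked example of that distance calculation, the sketch below compares one example against three centroids and picks the nearest one. All numbers are made up so that the distances are easy to check by hand.

```matlab
example   = [3 4 0 0 0 0 0 0];        % made-up 1x8 example
centroids = [0 0 0 0 0 0 0 0;         % made-up centroid 1
             3 4 0 0 0 0 0 1;         % made-up centroid 2
             9 9 9 9 9 9 9 9];        % made-up centroid 3

% Euclidean distance from the example to each centroid
distances = zeros(1, size(centroids, 1));
for k = 1:size(centroids, 1)
    distances(k) = norm(example - centroids(k, :));
end

[~, nearest] = min(distances);        % index of the closest centroid
```

Here distances(1) is 5 and distances(2) is 1, so nearest is 2 and the example would go into the second bin.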

Recalculating the centroids works as follows: suppose you have done 1 iteration of K-Means and have put all the examples into their closest bin. Let's say Bin1 now contains all the examples that are closest to centroid1, and therefore Bin1 has dimensionality 127x8, which means that 127 examples out of 440 are in this bin. To calculate the centroid position for the next iteration, you can then do centroid1 = mean(Bin1);. You would do the same for your other bins.
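To make the direction of the averaging concrete, here is a tiny sketch with made-up numbers. The centroid is the column-wise mean over the examples in a bin, whereas averaging each row (as the question proposes) yields one number per example rather than a centroid. Passing the dimension explicitly, as in mean(Bin1, 1), is a defensive choice of this sketch that also covers a bin holding a single example.

```matlab
Bin1 = [1 2; 3 4; 5 6];            % made-up bin: 3 examples, 2 features

centroid1 = mean(Bin1, 1);         % column-wise mean over the examples: [3 4]
rowAverages = mean(Bin1, 2);       % one number per example: [1.5; 3.5; 5.5], not a centroid

single = [7 8];                    % a bin that holds just one example
withoutDim = mean(single);         % 7.5, because mean works along the row here
withDim = mean(single, 1);         % [7 8], the example itself, as expected
```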

In terms of plotting, note that your dataset contains 8 features, which means it is 8-dimensional and cannot be visualized directly. I suggest you create or find a (dummy) dataset that consists of only 2 features and can therefore be visualized with Matlab's plot() function.
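A plot of such a 2-feature dummy dataset might be sketched as follows. The data and the cluster assignment are made up here (in practice the assignment would come out of K-Means), and the colors and marker styles are arbitrary choices.

```matlab
% Made-up 2-feature data with a made-up cluster assignment, for illustration only
X = [randn(30, 2); randn(30, 2) + 6];
assignment = [ones(30, 1); 2 * ones(30, 1)];
centroids = [mean(X(assignment == 1, :), 1);
             mean(X(assignment == 2, :), 1)];

figure; hold on;
plot(X(assignment == 1, 1), X(assignment == 1, 2), 'r.');   % examples in bin 1
plot(X(assignment == 2, 1), X(assignment == 2, 2), 'b.');   % examples in bin 2
plot(centroids(:, 1), centroids(:, 2), 'kx', 'MarkerSize', 12, 'LineWidth', 2);
xlabel('feature 1'); ylabel('feature 2');
legend('cluster 1', 'cluster 2', 'centroids');
hold off;
```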
