K-Means clustering for multidimensional data
Suppose a dataset contains 440 instances and 8 attributes (the dataset was pulled from the UCI Machine Learning Repository). How do we calculate centroids for such a dataset? (Wholesale customers data) https://archive.ics.uci.edu/ml/datasets/Wholesale+customers
If I calculated the average of each row, would that be the centroid? And how do I display the resulting clusters in Matlab?
OK, first of all, in the dataset one row corresponds to one example, and you have 440 rows, which means the dataset has 440 examples. Each column contains the values for one particular feature (or attribute, as you call it). For example, column 1 in your dataset contains the values for the feature Channel, column 2 contains the values for the feature Region, etc.
Now, for K-Means clustering you need to specify the number of clusters (the K in K-Means). Let's say you want K = 3 clusters. The easiest way to initialize K-Means is then to randomly select 3 examples from your dataset (that is, 3 rows picked at random from the 440 rows you have) and use them as your initial centroids.
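The random initialization described above could be sketched as follows. This is only a minimal illustration: the matrix X here is a random stand-in for the real 440x8 data, and the variable names (K, idx, centroids) are my own choices, not anything prescribed by K-Means or Matlab.

```matlab
% Hypothetical stand-in for the 440x8 data matrix; replace with your own X.
X = rand(440, 8);

K = 3;                    % number of clusters
n = size(X, 1);           % number of examples (rows)
idx = randperm(n);        % random permutation of the row indices 1..n
idx = idx(1:K);           % keep the first K, so the chosen rows are distinct
centroids = X(idx, :);    % K-by-8 matrix, one initial centroid per row
```

Taking the first K entries of a permutation, rather than drawing K random integers, avoids picking the same row twice.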
You can think of your centroids as 3 bins, and you want to put every example from the dataset into the closest bin (closeness is usually measured by Euclidean distance; check the function norm in Matlab).
After the first round of putting all the examples into their nearest bin, you recalculate the centroids by taking the mean of all the examples in each bin, feature by feature (so a centroid is an average over the rows in a bin, not the average of a single row). You then repeat the process of putting all the examples into the closest bin until no example in your dataset moves to another bin.
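Putting the steps together, the whole procedure might look roughly like the sketch below. It is a plain, unoptimized illustration under a few assumptions of mine: X is random stand-in data, maxIter is a safety cap that is not part of the description above, and an empty bin simply keeps its old centroid.

```matlab
% Hypothetical stand-in data: 440 examples, 8 features.
X = rand(440, 8);
K = 3;
n = size(X, 1);

% Initialise the centroids with K distinct, randomly chosen rows.
p = randperm(n);
centroids = X(p(1:K), :);

labels = zeros(n, 1);     % bin index of every example (0 = not assigned yet)
maxIter = 100;            % safety cap, an assumption of this sketch
for iter = 1:maxIter
    % Assignment step: put every example into its nearest bin.
    newLabels = zeros(n, 1);
    for i = 1:n
        d = zeros(K, 1);
        for k = 1:K
            d(k) = norm(X(i, :) - centroids(k, :));
        end
        [~, newLabels(i)] = min(d);
    end

    % Stop when no example has moved to another bin.
    if isequal(newLabels, labels)
        break;
    end
    labels = newLabels;

    % Update step: each centroid becomes the mean of the rows in its bin.
    for k = 1:K
        members = X(labels == k, :);
        if ~isempty(members)          % keep the old centroid if a bin is empty
            centroids(k, :) = mean(members, 1);
        end
    end
end
```

After the loop, labels(i) holds the bin of example i and centroids holds one 1x8 centroid per row.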
Some Matlab starting points
You load data with X = load('path/to/the/dataset', '-ascii');
In your case, X will be 440x8.
You can calculate the Euclidean distance from an example to a centroid with
distance = norm(example - centroid1);
where example and centroid1 both have dimensions 1x8.
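To find the nearest bin for one example, the same norm call can be repeated for every centroid and the smallest distance picked. The numbers below are made up so the result is easy to check by hand, and the names distances and nearest are only illustrative.

```matlab
example   = [1 2 3 4 5 6 7 8];      % one 1x8 example (made-up values)
centroids = [1 2 3 4 5 6 7 8;       % centroid1: identical to the example
             2 3 4 5 6 7 8 9;       % centroid2: every feature larger by 1
             0 0 0 0 0 0 0 0];      % centroid3: the origin

K = size(centroids, 1);
distances = zeros(K, 1);
for k = 1:K
    distances(k) = norm(example - centroids(k, :));   % Euclidean distance
end
[minDist, nearest] = min(distances);   % nearest is the index of the closest bin
```

Here the distance to centroid1 is 0 and the distance to centroid2 is sqrt(8), so the example goes into bin 1.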
Recalculating the centroids works as follows. Suppose you did 1 iteration of K-Means and put all examples into their nearest bin. Let's say Bin1 now contains all the examples closest to centroid1, and that Bin1 has dimensions 127x8, which means that 127 of the 440 examples are in this bin. To calculate the position of the centroid for the next iteration, you can do
centroid1 = mean(Bin1);
Applied to a matrix, mean averages each column, so centroid1 is again 1x8. You would do the same for your other bins.
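A tiny made-up bin shows the difference between averaging over the examples in a bin (which gives the centroid) and averaging each row (which does not). The explicit dimension argument to mean is my own addition here; it also keeps the result correct if a bin happens to contain a single row.

```matlab
% Made-up bin with 3 examples and 2 features, kept small for readability.
Bin1 = [1 10;
        2 20;
        3 30];

centroid1 = mean(Bin1, 1);   % column-wise mean, one value per feature: [2 20]
rowMeans  = mean(Bin1, 2);   % row-wise mean, one value per example: not a centroid
```

centroid1 has the same number of columns as the data, so it can be compared with any example using norm, whereas rowMeans has one entry per example and mixes different features together.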
In terms of plotting, note that your dataset contains 8 features, which means 8 dimensions, so it cannot be visualized directly. I suggest you create or find a (dummy) dataset that consists of only two features and can therefore be visualized with the Matlab function plot().
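As a rough sketch of such a plot, the snippet below builds a made-up two-feature dataset and draws each cluster in its own color together with its centroid. The labels vector is a hand-made stand-in for the bin assignment K-Means would produce, and the colors and marker styles are arbitrary choices.

```matlab
% Made-up 2-feature data: three blobs of 50 points each.
X = [randn(50, 2); ...
     randn(50, 2) + 5; ...
     randn(50, 2) + repmat([5 -5], 50, 1)];
% Stand-in for the bin index of each example after clustering.
labels = [ones(50, 1); 2 * ones(50, 1); 3 * ones(50, 1)];

K = 3;
styles = {'r.', 'g.', 'b.'};          % one point style per cluster
centroids = zeros(K, 2);
figure; hold on;
for k = 1:K
    pts = X(labels == k, :);          % all examples in bin k
    centroids(k, :) = mean(pts, 1);   % centroid of bin k
    plot(pts(:, 1), pts(:, 2), styles{k});
    plot(centroids(k, 1), centroids(k, 2), 'kx', 'MarkerSize', 12, 'LineWidth', 2);
end
hold off;
xlabel('feature 1'); ylabel('feature 2');
```

For the 8-feature data, one option is to plot just two chosen columns of X in the same way, keeping in mind that this only shows a 2D slice of the clusters.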