Cylindrical clustering in R - clustering timestamps together with other data

I am learning R and I have to cluster numeric data that includes a timestamp field. Time is one of the parameters, and since the data is strictly dependent on daylight hours, I want to take the circular (periodic) nature of this field into account.

As far as I understand from the manual, libraries like skmeans cannot handle "cylindrical" data, only "spherical" data (i.e. all components are in polar coordinates).

My idea for a suitable solution comes down to this: I can decompose the HOUR (0-24) column into two separate columns X and Y that express the time as a point on the unit circle, so that x^2 + y^2 = 1. Then k-means with Euclidean distance should have no problem interpreting the data.

Am I right?

2 answers


Here is a mapping from h to m, where h is the time in hours (including fractions of an hour). Then we try kmeans and, at least in this test, it works:



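h <- c(22, 23, 0, 1, 2, 10, 11, 12)   # hours of the day, wrapping around midnight
ha <- 2*pi*h/24                       # hours converted to angles in radians
m <- cbind(x = sin(ha), y = cos(ha))  # each time becomes a point on the unit circle

# cluster numbering is arbitrary, so the two labels may be swapped between runs
kmeans(m, 2)$cluster # compute cluster assignments via kmeans
## [1] 2 2 2 2 2 1 1 1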
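
To cover the question's full case (time plus other numeric columns), the same encoding can be placed next to the remaining variables. This is only a minimal sketch: the data frame `d` and its `value` column are made up for illustration, and standardizing `value` is just one possible way to balance the linear part against the circular part.

```r
# Hypothetical data: hour of day plus one other measurement (names are illustrative)
d <- data.frame(
  hour  = c(22, 23, 0, 1, 2, 10, 11, 12),
  value = c(5.1, 4.8, 5.0, 5.3, 4.9, 9.7, 10.2, 10.0)
)

angle <- 2 * pi * d$hour / 24

# Circular part (unit circle) plus the standardized linear part:
# together each row is a point on a cylinder
features <- cbind(
  x = sin(angle),
  y = cos(angle),
  z = as.numeric(scale(d$value))
)

set.seed(1)  # kmeans starts from random centers
fit <- kmeans(features, centers = 2, nstart = 10)
fit$cluster
```

How strongly `z` is scaled relative to `x` and `y` decides whether time of day or the measurement dominates the clustering, so that factor is a modelling choice.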

      



k-means uses squared Euclidean distance.

But indeed: projecting your data into a meaningful Euclidean space is an easy way to avoid this kind of problem.
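
As a small illustration of why the projection helps, the sketch below compares squared distances on raw hours with squared distances after mapping hours to the unit circle. The helper names `hour_to_xy` and `sq_dist` are made up for this example.

```r
# Map a single hour to its (sin, cos) point on the unit circle
hour_to_xy <- function(h) {
  a <- 2 * pi * h / 24
  c(sin(a), cos(a))
}

# Squared Euclidean distance between two points
sq_dist <- function(p, q) sum((p - q)^2)

raw_gap   <- (23 - 1)^2                              # 484: raw hours make 23:00 and 01:00 look far apart
near_gap  <- sq_dist(hour_to_xy(23), hour_to_xy(1))  # small: the two times are only 2 hours apart
same_gap  <- sq_dist(hour_to_xy(22), hour_to_xy(0))  # equals near_gap up to rounding: also 2 hours apart
far_gap   <- sq_dist(hour_to_xy(0), hour_to_xy(12))  # about 4, the maximum: opposite sides of the day
```

On the unit circle the squared distance between two times depends only on the angle between them (it equals 2 - 2*cos of that angle), so midnight no longer creates an artificial break.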



However, be aware that your mean will no longer lie on the cylinder. In many cases, you can simply rescale the mean back onto the desired cylinder. But the mean can become 0, in which case no meaningful scaling is possible.
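
The point about the mean can be sketched as follows for the circular (time) component. The helpers `to_circle` and `mean_hour` are hypothetical names, and the tolerance `tol` is an arbitrary assumption used to detect a mean that is numerically zero.

```r
# Map hours to points on the unit circle (same encoding as in the first answer)
to_circle <- function(h) {
  a <- 2 * pi * h / 24
  cbind(x = sin(a), y = cos(a))
}

# Project the mean of a set of hours back onto the circle and report it as an hour.
# Returns NA when the mean is (numerically) zero and therefore has no direction.
mean_hour <- function(h, tol = 1e-8) {
  ctr <- colMeans(to_circle(h))
  r <- sqrt(sum(ctr^2))            # length of the mean vector, never more than 1
  if (r < tol) return(NA_real_)
  # x = sin(angle), y = cos(angle), so the angle is atan2(x, y)
  unname((atan2(ctr["x"], ctr["y"]) * 24 / (2 * pi)) %% 24)
}

night <- mean_hour(c(23, 0, 1, 2))   # about 0.5, i.e. 00:30
split <- mean_hour(c(0, 12))         # NA: opposite times average to the origin
```

Rescaling the mean to length 1 is what `atan2` does implicitly here, since the angle does not depend on the vector's length.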

Another option is kernel k-means. Since your desired distance is Euclidean after transforming the data, you can also "kernelize" this transformation and use kernel k-means. But in your particular case it might actually be faster to transform the data explicitly. Kernel k-means will likely only pay off with much more complex transformations (say, to an infinite-dimensional vector space).
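
To make the "kernelize" idea concrete, here is a hand-rolled sketch of kernel k-means in base R, not a library API. The names `hour_kernel` and `kernel_kmeans` are made up for this example. The kernel is the inner product of the (sin, cos) embeddings, which simplifies to the cosine of the angle difference, so the result should match ordinary k-means on the transformed data.

```r
# Kernel for hours of the day: inner product of the (sin, cos) embeddings,
# sin(a)sin(b) + cos(a)cos(b) = cos(a - b)
hour_kernel <- function(h) {
  a <- 2 * pi * h / 24
  cos(outer(a, a, "-"))
}

# Minimal Lloyd-style kernel k-means on a precomputed kernel matrix K
kernel_kmeans <- function(K, k, max_iter = 100) {
  n <- nrow(K)
  cl <- sample(rep_len(seq_len(k), n))   # random initial labels, every cluster non-empty
  for (iter in seq_len(max_iter)) {
    d <- matrix(Inf, n, k)
    for (j in seq_len(k)) {
      idx <- which(cl == j)
      if (length(idx) == 0) next         # an empty cluster stays unreachable
      # squared feature-space distance of every point to the mean of cluster j
      d[, j] <- diag(K) - 2 * rowMeans(K[, idx, drop = FALSE]) + mean(K[idx, idx])
    }
    new_cl <- apply(d, 1, which.min)
    if (all(new_cl == cl)) break         # assignments stopped changing
    cl <- new_cl
  }
  cl
}

h <- c(22, 23, 0, 1, 2, 10, 11, 12)
set.seed(1)
cl <- kernel_kmeans(hour_kernel(h), k = 2)
cl
```

The cluster means are never computed explicitly; only kernel values are used, which is what makes the approach usable when the feature space cannot be written down. For real work, an existing implementation such as `kkmeans` in the kernlab package is probably a better starting point than this sketch.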
