Using R to identify two populations in a scatterplot

I compared two rasters cell by cell with a simple scatterplot and found two seemingly distinct groups:

[Figure: scatterplot of the real data]

I am now trying to extract the locations of each of these populations (by isolating row IDs, for example) so that I can see where they fall in the rasters and perhaps understand why I am getting this behavior. My original data contains about 1,000,000 rows, so the solution must also work on a large data frame. Any ideas on how I can isolate each of these groups? Thanks. Here is a reproducible example:

X <- seq(1, 1000, 1)
Z <- runif(1000, 1, 2)
A <- 1.2 * X * Z + 100
B <- 0.6 * X * Z
df <- data.frame(X = c(X, X), Y = c(A, B))
plot(df$X, df$Y)


[Figure: scatterplot of the sample data]



2 answers


Spectral clustering is useful for identifying clusters of points separated by a clear boundary. Its big advantage is that it is unsupervised, i.e. it does not rely on human judgment, although the method is slow and some hyperparameters (e.g. the number of clusters) must be provided.

Below is the code for clustering. It takes a few minutes to run in your case.



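library(kernlab)
# spectral clustering into two groups (the number of clusters must be supplied)
specc_df <- specc(as.matrix(df), centers = 2)
# colour each point by its assigned cluster
plot(df, col = specc_df)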

      

The resulting plot clearly shows two clusters of points.



The data has a linear dividing line. You can find it with:

plot(df$X,df$Y)
Pts = locator(2)

      

You need to click one point between the two groups near the origin and another at the far right (also between the groups). With your data I got:



Pts
$x
[1]   0.8066296 994.9723687
$y
[1]   48.56932 1255.32870

## Slope
(Pts$y[2] - Pts$y[1]) / (Pts$x[2] - Pts$x[1])
[1] 1.213841

## Draw the line to confirm 
abline(48,1.2, col="red")

## use the line to distinguish the groups
Group = rep(1, nrow(df))
Group[df$X*1.2 + 48 < df$Y] = 2
plot(df, pch=20, col=Group)
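
Since the question asks for row IDs rather than just a coloured plot, here is a minimal sketch of how the same dividing line could be used for that. It computes the slope and intercept from the two clicked points instead of hard-coding them. The names p1, p2, slope, intercept, above, ids_upper and ids_lower are only illustrative, and set.seed() is added purely so the sketch is repeatable; neither is part of the original code.

```r
# Reproduce the example data from the question
set.seed(1)  # assumption: seed added only to make the sketch repeatable
X <- seq(1, 1000, 1)
Z <- runif(1000, 1, 2)
A <- 1.2 * X * Z + 100
B <- 0.6 * X * Z
df <- data.frame(X = c(X, X), Y = c(A, B))

# Two points on the dividing line (values copied from the locator() output above)
p1 <- c(x = 0.8066296, y = 48.56932)
p2 <- c(x = 994.9723687, y = 1255.32870)

# Line through the two points: Y = intercept + slope * X
slope <- unname((p2["y"] - p1["y"]) / (p2["x"] - p1["x"]))
intercept <- unname(p1["y"] - slope * p1["x"])

# Vectorised comparison: TRUE for rows lying above the line
above <- df$Y > intercept + slope * df$X

ids_upper <- which(above)   # row IDs of the upper population
ids_lower <- which(!above)  # row IDs of the lower population
```

The comparison is a single vectorised operation, so it should cope with a data frame of about 1,000,000 rows without trouble. If each row of your real data frame corresponds to one raster cell in cell order (an assumption about how you built it), these row IDs can be used directly to look up the cells.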

      
