Find the local minimum in the bimodal distribution with r

My data is preprocessed image data and I want to highlight two classes. In theory (and hopefully in practice), the best threshold is the local minimum between two peaks in bimodal distributed data.

My test data: http://www.file-upload.net/download-9365389/data.txt.html

I tried to follow this thread : I plotted a histogram and computed the kernel density function:

datafile <- read.table("....txt")
data <- data$V1
hist(data)

d <- density(data) # returns the density data with defaults
hist(data,prob=TRUE)
lines(d) # plots the results

      

But how to proceed?

I would compute the first and second derivatives of the density function to find local extrema, in particular the local minimum. However, I don't know how to do this in R and density(test)

does not seem to be a normal function. So please help me: how can I calculate the derivatives and find the local minimum of the well between the two peaks in a density function density(test)

?

+3


source to share


1 answer


There are several ways to do this.

First using d

for density as in your question d$x

and d$y

contain x and y values ​​for density plot. The minimum occurs when the derivative dy / dx = 0. Since the values ​​of x are uniformly distributed, we can estimate dy with diff(d$y)

and look for d$x

where is abs(diff(d$y))

minimized:

d$x[which.min(abs(diff(d$y)))]
# [1] 2.415785

      

The problem is that the density curve can also be maximized when dy / dx = 0. In this case, the minimum is shallow, but the highs are peaking, so it works, but you can't count on it.

Thus, the second method uses optimize(...)

, which looks for a local minimum in a given interval. optimize(...)

we need a function as an argument, so we use approxfun(d$x,d$y)

to create an interpolation function.

optimize(approxfun(d$x,d$y),interval=c(1,4))$minimum
# [1] 2.415791

      

Finally, we'll show that this is indeed the minimum:

hist(data,prob=TRUE)
lines(d, col="red", lty=2)
v <- optimize(approxfun(d$x,d$y),interval=c(1,4))$minimum
abline(v=v, col="blue")

      



Another approach that is actually preferred is using k-means clustering.

df <- read.csv(header=F,"data.txt")
colnames(df) = "X"

# bimodal
km <- kmeans(df,centers=2)
df$clust <- as.factor(km$cluster)
library(ggplot2)
ggplot(df, aes(x=X)) + 
  geom_histogram(aes(fill=clust,y=..count../sum(..count..)),
                     binwidth=0.5, color="grey50")+
  stat_density(geom="line", color="red")

      

The data actually looks more trimodal than bimodal.

# trimodal
km <- kmeans(df,centers=3)
df$clust <- as.factor(km$cluster)
library(ggplot2)
ggplot(df, aes(x=X)) + 
  geom_histogram(aes(fill=clust,y=..count../sum(..count..)),
                 binwidth=0.5, color="grey50")+
  stat_density(geom="line", color="red")

      

+5


source







All Articles