Removing multivariate outliers with mvoutlier

Problem

I have a dataframe with more than 5 variables at any given time, and I am trying to run K-means on it. Since K-means is heavily influenced by outliers, I have spent a few hours trying to find out how to calculate and remove multivariate outliers. Most examples demonstrated are with two variables.


Possible solutions

- The mvoutlier package
- Another outlier detection method (a distance-from-cluster-center approach)

Problems with these so far

As for mvoutlier, I was unable to generate a result because it noticed that my dataset contains negative values and it cannot work with them. I am not sure how I would change my data to be positive-only, as I need the negatives in the set I am working with.
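For reference, the idiom I was attempting looks roughly like the sketch below. This is a hedged example rather than a fix for the negative-values error, and it assumes aq.plot returns a list with a logical outliers element (worth confirming with ?aq.plot):

library(mvoutlier)

res <- aq.plot(dataFrame)              # adjusted-quantile outlier plot; assumed return value
cleaned <- dataFrame[!res$outliers, ]  # keep only the non-flagged rows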

Regarding Another outlier detection method: I was able to list the outliers, but I don't know how to exclude them from the current dataset. Also, I realize these calculations are done after K-means, and as such I would probably need to apply the math prior to doing K-means.


Minimal testable example

Unfortunately, the dataset I am using is not publicly available, so any random dataset with more than three variables will do. Below is the code from the post Another outlier detection method, adapted to work with my data. It should work dynamically if you have a random set of data, but there should be enough of it that five cluster centers make sense.

clusterAmount <- 5
cluster <- kmeans(dataFrame, centers = clusterAmount, nstart = 20)
centers <- cluster$centers[cluster$cluster, ]        # center assigned to each row
distances <- sqrt(rowSums((dataFrame - centers)^2))  # Euclidean distance of each row to its own center
m <- tapply(distances, cluster$cluster, mean)        # mean distance within each cluster
d <- distances / m[cluster$cluster]                  # distance normalized by the cluster mean

# 1% outliers
outliers <- d[order(d, decreasing = TRUE)][1:(nrow(dataFrame) * .01)]


Result: a list of observations, ordered by their normalized distance from the center of the cluster they reside in, I believe. The problem is then pairing those results with the corresponding rows in the dataframe and deleting them, so I can start the K-means process cleanly. (Note that in the example I ran K-means before removing the outliers; in the real workflow I will remove the outliers first and run K-means after.)
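A minimal sketch of what I think the pairing should look like, assuming d keeps the row order of dataFrame (order() returns row indices, so they can subset the dataframe directly):

n.remove <- ceiling(nrow(dataFrame) * 0.01)              # top 1% most extreme rows
outlier.rows <- order(d, decreasing = TRUE)[1:n.remove]  # their row indices
cleaned <- dataFrame[-outlier.rows, ]                    # drop them
cluster2 <- kmeans(cleaned, centers = clusterAmount, nstart = 20)  # re-run K-means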


Question

Using the example from Another outlier detection method, how do I match its results back to the rows in my current dataframe so I can exclude those rows before doing K-means?


1 answer


I don't know if this is actually helpful, but if your data are multivariate normal, you could try a method based on Wilks (1963). Wilks showed that Mahalanobis distances computed from multivariate normal data follow a beta distribution. We can take advantage of this (the iris data are used as an example):
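Concretely, for row i with squared sample Mahalanobis distance D_i^2, n observations, and p variables, the result (which the function below implements) is

u_i = \frac{n D_i^2}{(n-1)^2} \sim \mathrm{Beta}\left(\frac{p}{2},\ \frac{n-p-1}{2}\right), \qquad \frac{n-p-1}{p} \cdot \frac{u_i}{1-u_i} \sim F(p,\ n-p-1)

so each row gets an F statistic and an upper-tail p value.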

test.dat <- iris[, -c(1, 2, 5)]  # keep Petal.Length and Petal.Width only

Wilks.function <- function(dat){
  n <- nrow(dat)
  p <- ncol(dat)
  # scaled squared Mahalanobis distances; under multivariate normality
  # these follow a Beta(p/2, (n-p-1)/2) distribution
  u <- n * mahalanobis(dat, center = colMeans(dat), cov = cov(dat))/(n-1)^2
  w <- 1 - u
  F.stat <- ((n-p-1)/p) * (1/w - 1)  # convert the beta variate to an F statistic
  p.val <- 1 - pf(F.stat, p, n-p-1)  # upper-tail p value for each row
  cbind(w, F.stat, p = round(p.val, 3))
}

# scatterplot of the two variables, one plotting symbol per species
plot(test.dat, 
     col = "blue", 
     pch = c(15,16,17)[as.numeric(iris$Species)])

dat.rows <- Wilks.function(test.dat); head(dat.rows)
#             w    F.stat     p
#[1,] 0.9888813 0.8264127 0.440
#[2,] 0.9907488 0.6863139 0.505
#[3,] 0.9869330 0.9731436 0.380
#[4,] 0.9847254 1.1400985 0.323
#[5,] 0.9843166 1.1710961 0.313
#[6,] 0.9740961 1.9545687 0.145


Then we can simply find which rows of our multivariate data deviate significantly from the beta distribution; these are the candidate outliers.



outliers <- which(dat.rows[,"p"] < 0.05)  # rows significant at the 5% level

# overlay the flagged rows in red on the existing plot
points(test.dat[outliers,], 
       col = "red", 
       pch = c(15,16,17)[as.numeric(iris$Species[outliers])])


outliers now holds the row indices of the flagged observations.
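To tie this back to your question: a minimal sketch of dropping the flagged rows before clustering (3 centers here only because this is the iris data; substitute your own dataframe and center count):

clean.dat <- if (length(outliers)) test.dat[-outliers, ] else test.dat
fit <- kmeans(clean.dat, centers = 3, nstart = 20)  # K-means on the cleaned data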
