How exactly are outliers removed in the R boxplot, and how can the same outliers be removed for further calculations (e.g. the mean)?

In boxplot I have set the option outline=FALSE to remove the outliers. Now I would like to add points that show the mean to the boxplot. Obviously, the means calculated using mean include the outliers.

How can the same outliers be removed from the data frame so that the calculated mean matches the data shown in the boxplot?

I know how outliers can be removed, but what settings does the outline option of boxplot use internally? Unfortunately, the manual does not provide any explanation.

+3




3 answers


To hide the outliers in the plot, you must set the option outline to FALSE.

Let's assume your data is as follows:

data <- data.frame(a = c(seq(0,1,0.1),3))

      

Then you use the function boxplot:



res <- boxplot(data, outline=FALSE)

      

The object res now holds some information about your data. Among its components, res$out gives you all the outliers. Here there is only one, the value 3.

So, to compute the mean without the outliers, you can simply do:

mean(data$a[!data$a %in% res$out])
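
Since the question asks for the mean to be shown on the plot, the two steps above could be combined roughly as follows. This is only a sketch for the single-column data used here; trimmed_mean is an illustrative name, and the point style (pch, col) is an arbitrary choice.

```r
data <- data.frame(a = c(seq(0, 1, 0.1), 3))

# outline = FALSE only hides the outlier points; res$out still lists them
res <- boxplot(data, outline = FALSE)

# Mean of the values that boxplot() did not flag as outliers
trimmed_mean <- mean(data$a[!data$a %in% res$out])

# Draw the mean as a filled point on the first (and only) box
points(1, trimmed_mean, pch = 19, col = "red")
```

With several boxes, the x position passed to points() would be the index of each box (1, 2, ...).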

      

+4




If you look at the Value section of ?boxplot, you will find:

"List with the following components: [...]

out
the values of any data points which lie beyond the extremes of the whiskers."



So you can assign the result of your call to boxplot to an object, extract the outliers from it, and remove them from the original values:

x <- c(-10, 1:5, 50)
x
# [1] -10   1   2   3   4   5  50

bx <- boxplot(x)
str(bx)
# List of 6
# $ stats: num [1:5, 1] 1 1.5 3 4.5 5
# $ n    : num 7
# $ conf : num [1:2, 1] 1.21 4.79
# $ out  : num [1:2] -10 50
# $ group: num [1:2] 1 1
# $ names: chr "1"

x2 <- x[!(x %in% bx$out)]
x2
# [1] 1 2 3 4 5
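
If no plot is needed, the same outliers can probably be obtained from boxplot.stats, which boxplot uses for its calculations. A minimal sketch with the same x (st is just an illustrative name):

```r
x <- c(-10, 1:5, 50)

# boxplot.stats() applies the same rule as boxplot() but draws nothing
st <- boxplot.stats(x)   # coef = 1.5 by default
st$out
# [1] -10  50

# Keep only the values that were not flagged, then average them
x2 <- x[!(x %in% st$out)]
mean(x2)
# [1] 3
```

Note that with several groups in one boxplot, matching with %in% on the pooled values could also drop a value that is an outlier in one group but not in another; the group component of the returned list tells you which group each outlier belongs to.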

      

+3




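To answer the second part of your question, about how the outliers are selected, it helps to recall how a boxplot is built:

  • the "body" of the box spans from the first to the third quartile, i.e. the middle 50% of the data (its length is the interquartile range, IQR)
  • each whisker extends to the most extreme data point that lies within 1.5 * IQR of that body (the factor 1.5 is the default of the range argument); anything beyond is drawn as an outlier.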
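
The rule above can be reproduced by hand. The following is a sketch, assuming the default factor of 1.5; it uses fivenum, because R's boxplot works with hinges, which can differ slightly from the quartiles returned by quantile. The names h, lower, upper and manual_out are illustrative.

```r
x <- c(-10, 1:5, 50)

h <- fivenum(x)            # min, lower hinge, median, upper hinge, max
iqr <- h[4] - h[2]         # spread between the two hinges
lower <- h[2] - 1.5 * iqr  # lower fence
upper <- h[4] + 1.5 * iqr  # upper fence

# Points beyond the fences are what boxplot() reports as outliers
manual_out <- x[x < lower | x > upper]
manual_out

# Compare with what R itself computes
identical(manual_out, boxplot.stats(x)$out)
```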

If you accept the hypothesis that your data is normally distributed, this is the fraction of the data that lies beyond each whisker:

1-pnorm(qnorm(0.75)+1.5*2*qnorm(0.75))

      

which is 0.0035. Since there are two whiskers, about 0.7% of a normally distributed variable ends up as "boxplot outliers".
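
The step from one whisker to both can be spelled out as below. This is only a restatement of the formula above for the standard normal distribution; the variable names are illustrative.

```r
q <- qnorm(0.75)              # upper quartile of the standard normal
iqr <- 2 * q                  # IQR (the distribution is symmetric around 0)
upper <- q + 1.5 * iqr        # upper whisker limit

one_side <- 1 - pnorm(upper)  # tail beyond one whisker
both_sides <- 2 * one_side    # by symmetry, the lower tail is the same size
round(c(one_side, both_sides), 4)
```

The two values come out at roughly 0.0035 and 0.0070.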

But this is not a very robust way of detecting outliers; there are packages specifically designed for this.

+2

