How exactly does the R boxplot remove outliers, and how can the same outliers be removed for further calculations (e.g. the mean)?
In boxplot I have set the option outline=FALSE to hide the outliers.
Now I would like to add points that show the mean to the boxplot. Obviously, a mean calculated with mean still includes the outliers.
How can I remove the same outliers from the data frame so that the calculated mean matches the data shown in the boxplot?
I know how outliers can be removed in general, but which rule does boxplot apply internally for the outline option? Unfortunately, the manual does not explain this.
To hide the outliers in the plot, set the outline option to FALSE.
Let's assume your data is as follows:
data <- data.frame(a = c(seq(0,1,0.1),3))
Then call the boxplot function:
res <- boxplot(data, outline=FALSE)
The object res holds the statistics that boxplot computed from your data. Among them, res$out contains all the outliers. Here it holds only the value 3.
So, to compute the mean without the outliers, you can simply do:
mean(data$a[!data$a %in% res$out])
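To come back to the original goal of marking the mean on the plot, one possible sketch is shown below. It assumes the single-column data frame from above; the name trimmed_mean and the plotting symbol and colour are arbitrary choices made for illustration.

```r
data <- data.frame(a = c(seq(0, 1, 0.1), 3))

# Draw the boxplot without outliers and keep the returned statistics
res <- boxplot(data, outline = FALSE)

# Mean of the values that boxplot did not flag as outliers
trimmed_mean <- mean(data$a[!data$a %in% res$out])

# Mark the mean at x-position 1 (the first and only box)
points(1, trimmed_mean, pch = 18, col = "red")
```

Because res$out is returned even when outline=FALSE, the same call can be used both for plotting and for filtering.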
If you look at the Value section of ?boxplot, you will find:
"List with the following components:" [...]
out
the values of any data points which lie beyond the extremes of the whiskers."
So you can assign the result of your boxplot call to an object, extract the outliers from it, and remove them from the original values:
x <- c(-10, 1:5, 50)
x
# [1] -10 1 2 3 4 5 50
bx <- boxplot(x)
str(bx)
# List of 6
# $ stats: num [1:5, 1] 1 1.5 3 4.5 5
# $ n : num 7
# $ conf : num [1:2, 1] 1.21 4.79
# $ out : num [1:2] -10 50
# $ group: num [1:2] 1 1
# $ names: chr "1"
x2 <- x[!(x %in% bx$out)]
x2
# [1] 1 2 3 4 5
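With several groups, matching against the pooled bx$out can remove a value from a group in which it is not an outlier. A minimal sketch that filters each group separately is shown below. It uses boxplot.stats, the function boxplot calls to compute its statistics; the data frame df, its column names, and trimmed_means are made up for illustration.

```r
# Hypothetical example data: two groups on different scales
df <- data.frame(
  value = c(1:5, 50, 46:50, 100),
  group = rep(c("a", "b"), each = 6)
)

# For each group, drop the values boxplot.stats() flags as outliers,
# then take the mean of what is left
trimmed_means <- tapply(df$value, df$group, function(v) {
  mean(v[!v %in% boxplot.stats(v)$out])
})
trimmed_means
```

Here 50 is an outlier in group a but an ordinary value in group b, so filtering the whole column against the pooled outliers would wrongly drop it from b as well.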
To answer the second part of your question, about how outliers are selected, it helps to recall how the boxplot is built:
- the "body" of the boxplot spans from the first to the third quartile of the data, i.e. the middle 50%; its height is the interquartile range (IQR)
- each whisker usually extends to the most extreme data point within 1.5 * IQR of that body; points beyond the whiskers are drawn as outliers.
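The rule above can be reproduced by hand and compared with boxplot.stats. Two points to note in this sketch: boxplot takes the box limits from Tukey's hinges (as returned by fivenum), which can differ slightly from quantile(), and the multiplier 1.5 is the default of the range argument of boxplot (coef in boxplot.stats). The variable names are only illustrative.

```r
x <- c(-10, 1:5, 50)

# Tukey's five-number summary: min, lower hinge, median, upper hinge, max
five <- fivenum(x)
iqr  <- five[4] - five[2]   # height of the box

# Fences: 1.5 * IQR beyond each end of the box
lower_fence <- five[2] - 1.5 * iqr
upper_fence <- five[4] + 1.5 * iqr

# Points outside the fences are what boxplot() treats as outliers
manual_out <- x[x < lower_fence | x > upper_fence]

# Compare with the outliers reported by boxplot.stats()
all.equal(manual_out, boxplot.stats(x)$out)
```

Passing a different range to boxplot (or coef to boxplot.stats) changes which points count as outliers, so the same value should be used for plotting and for filtering.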
If you assume that your data are normally distributed, the proportion of data beyond each whisker is:
1-pnorm(qnorm(0.75)+1.5*2*qnorm(0.75))
which is 0.0035. A normally distributed variable therefore has about 0.7% of its values flagged as boxplot outliers (both sides combined).
However, this is not a very robust way of detecting outliers; there are packages specifically designed for this.