Computing the categorical mean with filtered data
I have the following dataset:
A -1
A 10
B 8
D -1
A 0
A 4
B 2
C 6
And I want to add a column to it like this:
A -1 4,6
A 10 4,6
B 8 5,0
D -1 5,0
A 0 4,6
A 4 4,6
B 2 5,0
C 6 6,0
What happened here?
Well, I calculated the average of each category, ignoring negative numbers, and used that as the new column value.
If a category has only negative values, the overall mean (again ignoring negative values) is used instead.
In SQL, this would be done with a filtered group function followed by a join. In Excel, this would be a conditional VLOOKUP. How do I do this in R?
Edit:
# Create dataset
category <- c("A","A","B","D","A","A","B","C")
value <- c(-1,10,8,-1,0,4,2,6)
dataset <- data.frame(category, value)
# Calculated means
fdata <- dataset[dataset$value >= 0, ]
aggregate(fdata[,2], list(fdata$category), mean)
We can use ave from base R, grouped by category. For each group we check whether all of its values are less than 0. If they are, we take the mean of the non-negative values in the whole dataset; if not, we take the mean of the non-negative values in that group only.
dataset$mean_column <- with(dataset, ave(value, category, FUN = function(x)
  if (all(x < 0)) mean(value[value >= 0]) else mean(x[x >= 0])))
dataset
# category value mean_column
#1 A -1 4.666667
#2 A 10 4.666667
#3 B 8 5.000000
#4 D -1 5.000000
#5 A 0 4.666667
#6 A 4 4.666667
#7 B 2 5.000000
#8 C 6 6.000000
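The all(x < 0) check matters because a group with no non-negative values has nothing left to average once the negatives are dropped. A minimal sketch of that case, where d_values is a hypothetical stand-in for the values of category D:

```r
# Hypothetical stand-in for the values of category D (only a negative value)
d_values <- c(-1)

# Dropping the negative values leaves an empty vector
kept <- d_values[d_values >= 0]

# The mean of an empty numeric vector is NaN, so a fallback is needed
group_mean <- mean(kept)
is.nan(group_mean)
```

Without the fallback to the overall mean, category D would end up with NaN in the new column.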
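Using dplyr:
library(dplyr)
dataset %>%
  mutate(
    x = ifelse(value < 0, NA_real_, value),  # treat negative values as missing
    meanAll = mean(x, na.rm = TRUE)) %>%     # overall mean of the non-negative values
  group_by(category) %>%
  mutate(meanGroup = mean(x, na.rm = TRUE),
         # a group with no non-negative values gives NaN; fall back to the overall mean
         meanGroup = ifelse(is.nan(meanGroup), meanAll, meanGroup))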
# Source: local data frame [8 x 5]
# Groups: category [4]
#
# # A tibble: 8 x 5
# category value x meanAll meanGroup
# <fctr> <dbl> <dbl> <dbl> <dbl>
# 1 A -1 NA 5 4.666667
# 2 A 10 10 5 4.666667
# 3 B 8 8 5 5.000000
# 4 D -1 NA 5 5.000000
# 5 A 0 0 5 4.666667
# 6 A 4 4 5 4.666667
# 7 B 2 2 5 5.000000
# 8 C 6 6 5 6.000000
The OP wrote that in SQL this would be done with a filtered group function followed by a join. This approach can be implemented with data.table:
library(data.table)
# filter data and compute group means
setDT(dataset)[value >= 0, .(grp.mean = mean(value)), category][
# now join with dataset
dataset, on = "category"][
# fill empty group means with overall mean of filtered values
is.na(grp.mean), grp.mean := dataset[value >= 0, mean(value)]][]
which returns
category grp.mean value
1: A 4.666667 -1
2: A 4.666667 10
3: B 5.000000 8
4: D 5.000000 -1
5: A 4.666667 0
6: A 4.666667 4
7: B 5.000000 2
8: C 6.000000 6
Here's a shorter option that uses assignment by reference and avoids the join operation (I'm not sure which one is faster):
library(data.table)
# assign by reference of computed group means of filtered values
setDT(dataset)[, grp.mean := mean(value[value >=0]), category][
# fill empty group means with overall mean of filtered values
is.na(grp.mean), grp.mean := dataset[value >= 0, mean(value)]][]
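For readers without data.table, the same filter, aggregate, and look-up idea can be sketched in base R. This is only an illustration under assumptions: the names df, grp and grp_mean are made up here, and match() is used as a lookup (much like a VLOOKUP) instead of merge() so that the original row order is kept.

```r
# Recreate the example data (df is a hypothetical name used for this sketch)
df <- data.frame(
  category = c("A", "A", "B", "D", "A", "A", "B", "C"),
  value    = c(-1, 10, 8, -1, 0, 4, 2, 6),
  stringsAsFactors = FALSE
)

# Filter out negative values and compute the mean per category
fdata <- df[df$value >= 0, ]
grp <- aggregate(value ~ category, data = fdata, FUN = mean)

# Look up each row's group mean; categories missing from grp give NA
df$grp_mean <- grp$value[match(df$category, grp$category)]

# Fill the gaps with the overall mean of the non-negative values
df$grp_mean[is.na(df$grp_mean)] <- mean(fdata$value)

df
```

Category D has no non-negative values, so it is absent from grp and receives the overall mean in the last step.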