Computing the categorical mean with filtered data

I have the following dataset:

A -1
A 10
B  8
D -1
A  0
A  4
B  2
C  6

      

And I want to add a column to it like this:

A -1 4,6
A 10 4,6
B  8 5,0
D -1 5,0
A  0 4,6
A  4 4,6
B  2 5,0
C  6 6,0

      

What happened here?

Well, I calculated the mean of the values in each category, ignoring the negative numbers, and used that as the new column's value.

If a category contains only negative values (like D), the overall mean of the whole dataset (again ignoring negative values) is used instead.
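To make the rule concrete, here is a small base R sketch that checks two of the expected values by hand. The names mean_A and overall are only illustrative, and this is a spot check on the sample data rather than a general solution.

```r
# Values from the dataset above
category <- c("A", "A", "B", "D", "A", "A", "B", "C")
value <- c(-1, 10, 8, -1, 0, 4, 2, 6)

# Category A: mean of its non-negative values (10, 0, 4)
mean_A <- mean(value[category == "A" & value >= 0])

# Category D has only negative values, so it falls back to the
# mean of all non-negative values in the dataset (10, 8, 0, 4, 2, 6)
overall <- mean(value[value >= 0])

mean_A   # 14 / 3, roughly 4.67
overall  # 30 / 6 = 5
```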

In SQL, this would be done with a filtered group function followed by a join. In Excel, this would be a conditional VLOOKUP. How do I do this in R?

Edit:

# Create dataset
category <- c("A","A","B","D","A","A","B","C")
value <- c(-1,10,8,-1,0,4,2,6)
dataset <- data.frame(category, value)

# Calculated means

fdata <- dataset[dataset$value >= 0, ]  # keep only non-negative values
aggregate(fdata[,2], list(fdata$category), mean)

      



3 answers


We can use ave from base R. Grouped by category, we check whether all values in a particular group are less than 0. If they are, we take the mean of the non-negative values of the whole dataset; if not, we take the mean of the non-negative values of that group only.



dataset$mean_column <- with(dataset, ave(value, category, FUN = function(x) 
         ifelse(all(x < 0), mean(value[value >= 0]), mean(x[x >= 0]))))

dataset

#  category value  mean_column
#1        A    -1    4.666667
#2        A    10    4.666667
#3        B     8    5.000000
#4        D    -1    5.000000
#5        A     0    4.666667
#6        A     4    4.666667
#7        B     2    5.000000
#8        C     6    6.000000

      



Using dplyr



library(dplyr)

dataset %>% 
  mutate(
    # mask negative values so they are skipped by na.rm = TRUE
    x = ifelse(value < 0, NA_real_, value),
    meanAll = mean(x, na.rm = TRUE)) %>% 
  group_by(category) %>% 
  mutate(meanGroup = mean(x, na.rm = TRUE),
         meanGroup = ifelse(is.nan(meanGroup), meanAll, meanGroup))

# Source: local data frame [8 x 5]
# Groups: category [4]
# 
# # A tibble: 8 x 5
#   category value     x meanAll meanGroup
#     <fctr> <dbl> <dbl>   <dbl>     <dbl>
# 1        A    -1    NA       5  4.666667
# 2        A    10    10       5  4.666667
# 3        B     8     8       5  5.000000
# 4        D    -1    NA       5  5.000000
# 5        A     0     0       5  4.666667
# 6        A     4     4       5  4.666667
# 7        B     2     2       5  5.000000
# 8        C     6     6       5  6.000000

      



The OP wrote that in SQL this would be done with a filtered group function followed by a join. This approach can be implemented with data.table:

library(data.table)
# filter data and compute group means 
setDT(dataset)[value >= 0, .(grp.mean = mean(value)), category][
  # now join with dataset
  dataset, on = "category"][
    # fill empty group means with overall mean of filtered values
    is.na(grp.mean), grp.mean := dataset[value >= 0, mean(value)]][]

      

which returns

   category grp.mean value
1:        A 4.666667    -1
2:        A 4.666667    10
3:        B 5.000000     8
4:        D 5.000000    -1
5:        A 4.666667     0
6:        A 4.666667     4
7:        B 5.000000     2
8:        C 6.000000     6

      


Here's a shorter option that uses assignment by reference and avoids the join operation (I'm not sure which one is faster):

library(data.table)
# assign by reference of computed group means of filtered values
setDT(dataset)[, grp.mean := mean(value[value >=0]), category][
    # fill empty group means with overall mean of filtered values
    is.na(grp.mean), grp.mean := dataset[value >= 0, mean(value)]][]
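For comparison, the same filter, aggregate and join sequence can also be sketched in base R, building on the aggregate() call from the question. This is only one possible way to do it; the names df, nonneg and grp_means are made up for the illustration, and match() is used as a stand-in for a left join so that the original row order is kept.

```r
# Rebuild the sample data under a separate name
category <- c("A", "A", "B", "D", "A", "A", "B", "C")
value <- c(-1, 10, 8, -1, 0, 4, 2, 6)
df <- data.frame(category, value)

# Step 1: filter out negative values, then compute group means
nonneg <- df[df$value >= 0, ]
grp_means <- aggregate(value ~ category, data = nonneg, FUN = mean)
names(grp_means)[2] <- "grp_mean"

# Step 2: look up each row's group mean (acts like a left join);
# categories without non-negative values get NA
df$grp_mean <- grp_means$grp_mean[match(df$category, grp_means$category)]

# Step 3: fill the gaps with the overall mean of the filtered values
df$grp_mean[is.na(df$grp_mean)] <- mean(nonneg$value)

df
```

On the sample data, category D has no row in grp_means, so it is the only group that picks up the overall mean in step 3.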

      
