Apply the function to all lines except the current one (dplyr)
I am learning the dplyr
R package and still love it very much. One thing I need to do is create a new column whose value for each row will be the result of applying the function to all rows except the current one (possibly a subset by group), but I can't figure out how.
A contrived example would be
library(datasets)
library(dplyr)
data(mtcars)
x <- mtcars %>% mutate(name=rownames(mtcars)) %>% filter(cyl==4) %>% select(name,cyl,mpg)
# This is what I want to do more elegantly
x$othermpg <- NA
for (i in 1:nrow(x))
x$othermpg[i] <- mean(x$mpg[-i])
Here, the column othermpg
gets the average of mpg
all cars except the one in the current row.
Can anyone help please?
source to share
This can be done in data.table
library(data.table)
setDT(x)[, N:= 1:.N][, othermpg2:=mean(x[N!= .BY, mpg]), by=N][,N:=NULL]
x
# name cyl mpg othermpg othermpg2
#1: Datsun 710 4 22.8 27.05 27.05
#2: Merc 240D 4 24.4 26.89 26.89
#3: Merc 230 4 22.8 27.05 27.05
#4: Fiat 128 4 32.4 26.09 26.09
#5: Honda Civic 4 30.4 26.29 26.29
#6: Toyota Corolla 4 33.9 25.94 25.94
#7: Toyota Corona 4 21.5 27.18 27.18
#8: Fiat X1-9 4 27.3 26.60 26.60
#9: Porsche 914-2 4 26.0 26.73 26.73
#10: Lotus Europa 4 30.4 26.29 26.29
#11: Volvo 142E 4 21.4 27.19 27.19
- The idea was to create a row / column index first
N:=1:.N
- Group based on this index
by=N]
-
x[N!= .BY, mpg]
gives stringsmpg
that are not equal to the grouping variable - do mean
-
N:=NULL
Since N is not required, this column is discarded.
Or you could try (Inspired from @ thelatemail's answer)
setDT(x)[, N:=1:.N]
setkey(x, N)
x[,othermpg2 := mean(x[!.BY, mpg]), by=N][,N:=NULL]
Or without creating N
(from @Jon Clayden comments)
setDT(x)[, othermpg2:=mean(x[name!=.BY,mpg]), by=name]
Using dplyr
this works
x %>%
mutate(N=1:n()) %>%
do( data.frame(.,othermpg2=sapply(.$N, function(i) mean(.$mpg[!.$N %in% i]))))
# name cyl mpg othermpg N othermpg2
#1 Datsun 710 4 22.8 27.05 1 27.05
#2 Merc 240D 4 24.4 26.89 2 26.89
#3 Merc 230 4 22.8 27.05 3 27.05
#4 Fiat 128 4 32.4 26.09 4 26.09
#5 Honda Civic 4 30.4 26.29 5 26.29
#6 Toyota Corolla 4 33.9 25.94 6 25.94
#7 Toyota Corona 4 21.5 27.18 7 27.18
#8 Fiat X1-9 4 27.3 26.60 8 26.60
#9 Porsche 914-2 4 26.0 26.73 9 26.73
#10 Lotus Europa 4 30.4 26.29 10 26.29
#11 Volvo 142E 4 21.4 27.19 11 27.19
source to share
In the sample example, you can use the following code to avoid the for loop. Basically, instead of excluding the "current" line, you simply subtract it from the total:
library(dplyr)
x %>% mutate(othermpg2 = (sum(mpg)-mpg) / (length(mpg) -1 ))
# name cyl mpg othermpg othermpg2
#1 Datsun 710 4 22.8 27.05 27.05
#2 Merc 240D 4 24.4 26.89 26.89
#3 Merc 230 4 22.8 27.05 27.05
#4 Fiat 128 4 32.4 26.09 26.09
#5 Honda Civic 4 30.4 26.29 26.29
#6 Toyota Corolla 4 33.9 25.94 25.94
#7 Toyota Corona 4 21.5 27.18 27.18
#8 Fiat X1-9 4 27.3 26.60 26.60
#9 Porsche 914-2 4 26.0 26.73 26.73
#10 Lotus Europa 4 30.4 26.29 26.29
#11 Volvo 142E 4 21.4 27.19 27.19
If you feel this cannot be applied to your real data, edit your question with a more representative example.
source to share