Using functions in dplyr containing values ββin other lines
Sorry in advance for the clumsy code. I have a data frame similar to the following:
df <- data.frame(c(rep_len(1,5), 2, 2), c("A", "A", "B", "B", "C", "C", "C"))
names(df) <- c("id", "consequence")
id consequence
1 1 A
2 1 A
3 1 B
4 1 B
5 1 C
6 2 C
7 2 C
I would like to perform the following filtering action:
if the group by id contains consequences A or B, then keep those lines and remove the lines with the result C. If the group contains only C or one line, keep those / that lines / lines.
I tried to do this in dplyr using a custom function, but I have the problem that all rows are filtered, which removes all the consequences of C:
# filtering function:
consequence_select <- function(x) {
if(n_distinct(x$consequence) > 1) {
if(any(unique(x$consequence) %in% c("A", "B"))) {
x %>%
filter(consequence %in% c("A", "B"))} else {return(x)}
} else {return(x)}
}
df %>%
group_by(id) %>%
consequence_select
id consequence
1 1 A
2 1 A
3 1 B
4 1 B
I was able to do it correctly using plyr:
ddply(df, .(id), consequence_select)
id consequence
1 1 A
2 1 A
3 1 B
4 1 B
5 2 C
6 2 C
source to share
You can optimize your code by only applying it inside the argument filter
, not inside do
, since filter
this is a specialized dplyr function for this task. I created two functions and compared them to the existing answers. Which function you want to use depends on your requirements - for sampled data, they both give the same result. I also created a slightly larger sample data for the reference as shown below.
# sample data
df <- data.frame(id = sample(100, 1000, replace = T),
consequence = sample(LETTERS[1:3], 1000, replace = TRUE, prob = c(0.2, 0.2, 0.6)))
# the existing custom function
consequence_select <- function(x) {
if(n_distinct(x$consequence) > 1) {
if(any(unique(x$consequence) %in% c("A", "B"))) {
x %>%
filter(consequence %in% c("A", "B"))} else {return(x)}
} else {return(x)}
}
# eipi answer
f1 <- function() {
df %>%
group_by(id) %>%
do(consequence_select(.)) }
# jazzuro answer
f2 <- function() {
df %>%
group_by(id) %>%
do(if(all(.$consequence == "C")) {.} else{.[-which(.$consequence == "C"), ]}) }
# my answer 1
f3a <- function() {
df %>%
group_by(id) %>%
filter((consequence != "C" & n_distinct(consequence) > 1L) | all(consequence == "C") )
}
# my answer 2
f3b <- function() {
df %>%
group_by(id) %>%
filter((consequence %in% c("A", "B") & n_distinct(consequence) > 1L) | all(consequence == "C"))
}
library(microbenchmark)
microbenchmark(f1(), f2(), f3a(), f3b(), unit = "relative")
Unit: relative
expr min lq median uq max neval
f1() 11.243524 11.092915 10.956129 10.717519 8.859949 100
f2() 6.603549 6.663674 6.653424 6.566012 10.956784 100
f3a() 1.279952 1.294679 1.291719 1.294606 1.165322 100
f3b() 1.000000 1.000000 1.000000 1.000000 1.000000 100
all.equal(f1(), f3a())
#[1] TRUE
all.equal(f1(), f3b())
#[1] TRUE
As you can see, the slight increase in data size already shows a 10x speed difference between functions.
source to share