Using functions in dplyr that involve values in other rows

Sorry in advance for the clumsy code. I have a data frame similar to the following:

df <- data.frame(c(rep_len(1,5), 2, 2), c("A", "A", "B", "B", "C", "C", "C"))
names(df) <- c("id", "consequence")

  id consequence
1  1           A
2  1           A
3  1           B
4  1           B
5  1           C
6  2           C
7  2           C

      

I would like to perform the following filtering action:

If a group (by id) contains consequence A or B, keep those rows and drop that group's rows with consequence C. If a group contains only C, or consists of a single row, keep that row or rows as they are. (For the sample data above, that means keeping the four A/B rows of id 1 and both C rows of id 2.)

I tried to do this in dplyr using a custom function, but all of the rows are filtered together, which removes every row with consequence C:

library(dplyr)

# filtering function:
consequence_select <- function(x) {
  if (n_distinct(x$consequence) > 1) {
    if (any(unique(x$consequence) %in% c("A", "B"))) {
      x %>% filter(consequence %in% c("A", "B"))
    } else {
      return(x)
    }
  } else {
    return(x)
  }
}


df %>%
group_by(id) %>%
consequence_select

  id consequence
1  1           A
2  1           A
3  1           B
4  1           B
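
I suspect this happens because the whole grouped data frame is passed to the function in a single call, so the n_distinct() check is not evaluated per group and the row-wise filter() then drops every C:

# diagnostic sketch: the function sees all ids at once
grouped <- df %>% group_by(id)
n_distinct(grouped$consequence)
# [1] 3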

      

I was able to do it correctly using plyr:

library(plyr)
ddply(df, .(id), consequence_select)

  id consequence
1  1           A
2  1           A
3  1           B
4  1           B
5  2           C
6  2           C

      



3 Answers


With dplyr, you need to wrap the function in do():

df %>%
  group_by(id) %>%
  do(consequence_select(.))

      



Here . is a pronoun that refers to the data frame for the current group of df.
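
As a small illustration (not part of the original answer), the same pronoun can be used with any expression that returns a data frame:

df %>%
  group_by(id) %>%
  do(head(., 2))   # . stands for the current id's rows; keeps at most two rows per group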



You can streamline your code by putting the condition inside filter() rather than inside do(), since filter() is the specialized dplyr verb for this task. I created two functions and compared them with the existing answers. Which one you want to use depends on your requirements; for the sample data they both give the same result. I also created slightly larger sample data for the benchmark, as shown below.
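
For reference, here is the filter()-only version applied directly to the original 7-row df from the question (a standalone sketch of the same logic that f3a below wraps in a function):

df %>%
  group_by(id) %>%
  filter((consequence != "C" & n_distinct(consequence) > 1L) | all(consequence == "C"))
# keeps the four A/B rows of id 1 and both C rows of id 2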

# sample data
df <- data.frame(id = sample(100, 1000, replace = T), 
                 consequence = sample(LETTERS[1:3], 1000, replace = TRUE, prob = c(0.2, 0.2, 0.6)))

# the existing custom function
consequence_select <- function(x) {
  if(n_distinct(x$consequence) > 1) {
    if(any(unique(x$consequence) %in% c("A", "B"))) {
      x %>%
        filter(consequence %in% c("A", "B"))} else {return(x)}
  } else {return(x)}
}

# eipi answer
f1 <- function() {
  df %>%
  group_by(id) %>%
  do(consequence_select(.)) }

# jazzuro answer
f2 <- function() {
  df %>%
  group_by(id) %>%
  do(if(all(.$consequence == "C")) {.} else{.[-which(.$consequence == "C"), ]}) }

# my answer 1
f3a <- function() {
  df %>% 
    group_by(id) %>% 
    filter((consequence != "C" & n_distinct(consequence) > 1L) | all(consequence == "C") )
}

# my answer 2
f3b <- function() {
  df %>% 
    group_by(id) %>% 
    filter((consequence %in% c("A", "B") & n_distinct(consequence) > 1L) | all(consequence == "C"))
}

library(microbenchmark)

microbenchmark(f1(), f2(), f3a(), f3b(), unit = "relative")

Unit: relative
 expr       min        lq    median        uq       max neval
f1()  11.243524 11.092915 10.956129 10.717519  8.859949   100
f2()   6.603549  6.663674  6.653424  6.566012 10.956784   100
f3a()  1.279952  1.294679  1.291719  1.294606  1.165322   100
f3b()  1.000000  1.000000  1.000000  1.000000  1.000000   100

all.equal(f1(), f3a())
#[1] TRUE
all.equal(f1(), f3b())
#[1] TRUE

      



As you can see, even this modest increase in data size already shows a roughly 10x speed difference between the do()-based and the filter()-based approaches.



You can perform your filtering inside do() like this; foo is your data frame.

foo %>%
    group_by(id) %>%
    do(if(all(.$consequence == "C")) {.} else{.[-which(.$consequence == "C"), ]})

#  id consequence
#1  1           A
#2  1           A
#3  1           B
#4  1           B
#5  2           C
#6  2           C

      







