Anonymous function with dplyr :: do - using results from rle to filter data

I have time series data grouped by subject ('id') that stays on a specific "site" and has a specific "stage" at each stage of "time".

Sometimes items will switch from one site to another and possibly come back again. If people switch site back and forth (for example, from site 'a' to site 'b' and then back to site 'a') and if there is only one registration on the middle site (when going aba then site "b" would be considered a "middle site" here) and an individual is at a certain stage (stage = 2 here) on a middle site , then I want to remove the registration at this step.

My dummy data has four items. Three of them (topic 1-3) moved from site a to b, and then back to site b, and one moved from a to b.

The first two subjects have one registration on the middle site. Topic 1 is in stage 1 on a medium site and I want to keep this registration. Item 2, on the other hand, is in phase 2 on the middle site and this registration should be removed. Item 3 also moved between a and b. However, although he is in stage 2 on the middle site b, he has two registrations there and both registrations are saved. Object 4 moved from site a to site b, but did not return again. So while it is in stage 2 at site b, registration at site b is not a "middle site" and should be kept.

Data:

df <- structure(list(id = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4),
                     time = c(1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 4L, 1L, 2L),
                     site = c("a", "b", "a", "a", "b", "a", "a", "b", "b", "a", "a", "b"),
                     stage = c(1, 1, 1, 1, 2, 1, 1, 2, 2, 1, 1, 2)),
                     .Names = c("id", "time", "site", "stage"),
                row.names = c(NA, -12L), class = "data.frame")

df

#    id time site stage
# 1   1    1    a     1
# 2   1    2    b     1 <~~ A single middle registration on site 2
# 3   1    3    a     1     However, the individual is in stage 1: -> keep 

# 4   2    1    a     1
# 5   2    2    b     2 <~~ A single middle registration on site 2 with stage 2: -> remove
# 6   2    3    a     1

# 7   3    1    a     1
# 8   3    2    b     2 <~~ Two middle registrations with stage 2: -> keep both rows 
# 9   3    3    b     2 <~~
# 10  3    4    a     1

# 11  4    1    a     1 
# 12  4    2    b     2 <~~ A single registration on site 2 with stage 2,
#                            but it is not in between two sites: -> keep

      

So in the test data, it is only the registration at time = 2 for id = 2, which should be removed.

Earlier I used plyr::ddply

and the result from rle

to solve the problem:

For each individual, calculate the site run length ( rle(x$site)

)
If :
 - back and forth between sites (for example, from a to b and back to a) ( length(r$values) > 2

) &
 - only one registration on the middle site ( r$lengths[2] == 1

) &
 - the stage on the middle site is 2 ( x$stage[x$site == r$values[2]][1] == 2

)
Then : remove the registration on the middle site x[!(x$site == r$values[2]), ]

)

library(plyr)

ddply(df, .(id), function(x){
  r <- rle(x$site)
  if(length(r$values) > 2 & r$lengths[2] == 1 & x$stage[x$site == r$values[2]][1] == 2){
    x[x$site != r$values[2], ]
  } else x
})

#    id time site stage
# 1   1    1    a     1
# 2   1    2    b     1
# 3   1    3    a     1

# 4   2    1    a     1 <~~ the single middle site with stage = 2 at time 2 is removed
# 5   2    3    a     1 <~~

# 6   3    1    a     1
# 7   3    2    b     2
# 8   3    3    b     2
# 9   3    4    a     1

# 10  4    1    a     1
# 11  4    2    b     2

detach("package:plyr")

      

Now I have some problems in order to get this right in dplyr

. I found some relevant posts on SO (like this and this ) and on github ( this and this ), but I am having a hard time adapting them to my needs. Here are some desperate attempts:

library(dplyr)

df %>%
  group_by(id) %>%
  do((function(x){
    r = rle(x$site)
    if(length(r$values) > 2 & r$lengths[2] == 1 & df$stage[df$site == r$values[2]][1] == 2){
    filter(x, x$site != r$values[2])
  } else x
})(.))
# desired row is not removed

df %>%
  group_by(id) %>%
  do(function(x){
    r = rle(x$site)
    if(length(r$values) > 2 & r$lengths[2] == 1 & df$stage[df$site == r$values[2]][1] == 2){
      x[!(x$site == r$values[2]), ]
    } else x
  })
# Error: Results are not data frames at positions: 1, 2, 3

      

This attempt works (gives the same result as ddply

above), but is far from elegant and I doubt this is the "right way":

df %>%
  group_by(id) %>%
  do(r = rle(.$site)) %>%  
  do(data.frame(id = .$id,
                len = length(.$r$values),
                site = .$r$values[2],
                len2 = .$r$lengths[2])) %>%
  filter(len == 3, len2 == 1) %>%
  select(-len) %>%
  left_join(df, ., by = c("id", "site")) %>%
  filter(!(len2 %in% 1 & stage == 2)) %>%
  select(-len2)

      

How do

is it correct? WWHWD?

+3


source to share


2 answers


I'm not sure if I fully understood the logic behind your code, but this might be another way to get the same result, perhaps with some modifications if necessary:

df %>% 
  group_by(id) %>%
  group_by(grp = cumsum(abs(c(1, diff(as.numeric(site))))), add = TRUE) %>%
  filter(!(grp == 2 & n() == 1 & stage == 2))

#Source: local data frame [9 x 5]
#Groups: id, grp
#
#  id time site stage grp
#1  1    1    a     1   1
#2  1    2    b     1   2
#3  1    3    a     1   3
#4  2    1    a     1   1     <~~ row in between 
#5  2    3    a     1   3     <~~ was removed
#6  3    1    a     1   1
#7  3    2    b     2   2
#8  3    3    b     2   2
#9  3    4    a     1   3

      

This approach assumes that the "middle group" is always the second "grp".


It might be even better to create the function I'm going to call intergroup()

, since it creates groups within the grouped data and uses this:



intergroup <- function(var, start = 1) {
  cumsum(abs(c(start, diff(as.numeric(as.factor(var))))))
}

df %>% 
  group_by(id) %>%
  group_by(grp = intergroup(site), add = TRUE) %>%
  filter(!(grp == 2 & n() == 1 & stage == 2))

      


Edit after updating the question.

Try the following corrected code for the corrected issue:

df %>% 
  group_by(id) %>%
  mutate(z = lag(site, 1) != lead(site, 1)) %>%   # check if site before and after are not the same
  group_by(grp = intergroup(site), add = TRUE) %>%
  filter(!(grp == 2 & n() == 1 & stage == 2 & !is.na(z))) %>%  # check for NA in z
  ungroup() %>% select(-c(z, grp))  

#Source: local data frame [11 x 4]
#
#   id time site stage
#1   1    1    a     1
#2   1    2    b     1
#3   1    3    a     1
#4   2    1    a     1
#5   2    3    a     1
#6   3    1    a     1
#7   3    2    b     2
#8   3    3    b     2
#9   3    4    a     1
#10  4    1    a     1
#11  4    2    b     2    <~~ row is kept

      

+3


source


Here's an alternative rle

that doesn't rely on do

. The code was inspired by this answer by @akrun (posted right after my question, thanks to @beginneR for heads).



df %>%
  group_by(id) %>%
  mutate(site_idx = with(rle(site),
                           rep(x = seq_along(lengths), times = lengths))) %>%
  filter(!(n_distinct(site_idx) > 2 & sum(site_idx == 2) == 1 &
           site_idx == 2 & stage == 2)) %>%
  select(-site_idx)

#    id time site stage
# 1   1    1    a     1
# 2   1    2    b     1
# 3   1    3    a     1
# 4   2    1    a     1 <~~ the single middle site with stage = 2 at time 2 has been removed
# 5   2    3    a     1 <~~
# 6   3    1    a     1
# 7   3    2    b     2
# 8   3    3    b     2
# 9   3    4    a     1
# 10  4    1    a     1
# 11  4    2    b     2

      

+1


source







All Articles