Create a counter variable in R, grouped by ID, that is conditionally reset

I am trying to count the number of consecutive inactive days (consecDaysInactive) for each id.

I have already created an indicator variable, inactive, which is 1 on days when the id is inactive and 0 on days when it is active. I also have an id variable and a date variable. My analysis dataset will contain hundreds of thousands of rows, so efficiency will be important.

The logic I'm trying to create looks like this:

  • by id, if the user is active, consecDaysInactive = 0
  • by id, if the user is inactive and was active on the previous day, consecDaysInactive = 1
  • by id, if the user is inactive and was also inactive on the previous day, consecDaysInactive = 1 + the number of previous consecutive inactive days
  • consecDaysInactive should reset to 0 for new id values.

I was able to create a cumulative sum, but was unable to get it to reset to 0 after rows where inactive == 0.

Below I illustrate the result I want (consecDaysInactive) as well as the result I was able to achieve programmatically (bad_consecDaysInactive).

library(dplyr)
d <- data.frame(id = c(1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2), date=as.Date(c('2017-01-01','2017-01-02','2017-01-03','2017-01-04','2017-01-05','2017-01-06','2017-01-07','2017-01-08','2017-01-01','2017-01-02','2017-01-03','2017-01-04','2017-01-05','2017-01-06','2017-01-07','2017-01-08')), inactive=c(0,0,0,1,1,1,0,1,0,1,1,1,1,0,0,1), consecDaysInactive=c(0,0,0,1,2,3,0,1,0,1,2,3,4,0,0,1))

d <- d %>% 
  group_by(id) %>% 
  arrange(id, date) %>% 
  do( data.frame(., bad_consecDaysInactive = cumsum(ifelse(.$inactive==1, 1,0))
  )
  )
d

      

Here consecDaysInactive increments by 1 for each consecutive inactive day, resets to 0 on each date the user is active, and resets to 0 for new id values. As the output below shows, I cannot get bad_consecDaysInactive to reset to 0:

          id       date inactive consecDaysInactive bad_consecDaysInactive
       <dbl>     <date>    <dbl>              <dbl>                  <dbl>
    1      1 2017-01-01        0                  0                      0
    2      1 2017-01-02        0                  0                      0
    3      1 2017-01-03        0                  0                      0
    4      1 2017-01-04        1                  1                      1
    5      1 2017-01-05        1                  2                      2
    6      1 2017-01-06        1                  3                      3
    7      1 2017-01-07        0                  0                      3
    8      1 2017-01-08        1                  1                      4
    9      2 2017-01-01        0                  0                      0
    10     2 2017-01-02        1                  1                      1
    11     2 2017-01-03        1                  2                      2
    12     2 2017-01-04        1                  3                      3
    13     2 2017-01-05        1                  4                      4
    14     2 2017-01-06        0                  0                      4
    15     2 2017-01-07        0                  0                      4
    16     2 2017-01-08        1                  1                      5

      

I've also considered (and tried) incrementing a variable within group_by() and do(), but since do() is not iterative, I can't get my counter to go past 2:

d2 <- d %>%
  group_by(id) %>% 
  do( data.frame(., bad_consecDaysInactive2 = ifelse(.$inactive == 0, 0, ifelse(.$inactive==1,.$inactive+lag(.$inactive), .$inactive)))) 
d2 

      

which gave the following:

      id       date inactive consecDaysInactive bad_consecDaysInactive bad_consecDaysInactive2
   <dbl>     <date>    <dbl>              <dbl>                  <dbl>                   <dbl>
1      1 2017-01-01        0                  0                      0                       0
2      1 2017-01-02        0                  0                      0                       0
3      1 2017-01-03        0                  0                      0                       0
4      1 2017-01-04        1                  1                      1                       1
5      1 2017-01-05        1                  2                      2                       2
6      1 2017-01-06        1                  3                      3                       2
7      1 2017-01-07        0                  0                      3                       0
8      1 2017-01-08        1                  1                      4                       1
9      2 2017-01-01        0                  0                      0                       0
10     2 2017-01-02        1                  1                      1                       1
11     2 2017-01-03        1                  2                      2                       2
12     2 2017-01-04        1                  3                      3                       2
13     2 2017-01-05        1                  4                      4                       2
14     2 2017-01-06        0                  0                      4                       0
15     2 2017-01-07        0                  0                      4                       0
16     2 2017-01-08        1                  1                      5                       1

      

As you can see, my counter bad_consecDaysInactive2 resets to 0, but does not increment past 2! If there is a data.table solution, I'd love to hear it.


1 answer


Here's a very good way to do it with a for loop:

a <- c(1,1,1,1,0,0,1,0,1,1,1,0,0)
b <- rep(NA, length(a))
b[1] <- a[1]
for(i in 2:length(a)){
  b[i] <- a[i]*(a[i]+b[i-1])
}
a
b

      

It may not be the most efficient way to do it, but it is fairly quick: 11.7 seconds for ten million rows on my computer.

a <- round(runif(10000000,0,1))
b <- rep(NA, length(a))
b[1] <- a[1]
t <- Sys.time()
for(i in 2:length(a)){
  b[i] <- a[i]*(a[i]+b[i-1])
}
b
Sys.time()-t

      

Time difference 11.73612 seconds
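The same recurrence can also be written without an explicit loop. The following is a minimal sketch in base R, assuming the vector contains only 0s and 1s and no NA values; the helper name run_counter is hypothetical and not taken from the question.

```r
# Hypothetical helper: counts consecutive 1s, resetting to 0 at every 0.
# Assumes x contains only 0 and 1 (no NA values).
run_counter <- function(x) {
  runs <- rle(x)                # lengths of each run of identical values
  sequence(runs$lengths) * x    # 1..n within each run, zeroed where x == 0
}

a <- c(1,1,1,1,0,0,1,0,1,1,1,0,0)
b_vec <- run_counter(a)
b_vec
```

Multiplying by x is what zeroes out the runs of 0s, so the result should agree with the loop version for any 0/1 input.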

But the loop above doesn't account for the need to restart the count for each id. This is easy to fix with minimal loss of efficiency. Your example data frame is sorted by id. If the actual data is not already sorted, sort it first. Then:



a <- round(runif(10000000,0,1))
id <- round(runif(10000000,1,1000))
id <- id[order(id)]
b <- rep(NA, length(a))
b[1] <- a[1]
t <- Sys.time()
for(i in 2:length(a)){
  b[i] <- a[i]*(a[i]+b[i-1])
  if(id[i] != id[i-1]){
    b[i] <- a[i]
  }
}
b
Sys.time()-t

      

Time difference 13.54373 sec.

If we include the time it took to sort id, the time difference is closer to 19 seconds. Still not so bad!
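A per-id reset can also be sketched in base R without a loop by applying a run counter within each id using ave(). This is only an illustration on the question's example values; it assumes the rows are already in date order within each id, and run_counter is a hypothetical helper name.

```r
# Hypothetical helper: consecutive-1s counter that resets at every 0.
# Assumes x contains only 0 and 1 (no NA values).
run_counter <- function(x) sequence(rle(x)$lengths) * x

# The id and inactive columns from the question's example data
d <- data.frame(
  id = rep(1:2, each = 8),
  inactive = c(0,0,0,1,1,1,0,1, 0,1,1,1,1,0,0,1)
)

# ave() applies run_counter separately to the rows of each id,
# so a run can never carry over from one id to the next
d$consecDaysInactive <- ave(d$inactive, d$id, FUN = run_counter)
d$consecDaysInactive
```

Because ave() splits by id itself, the data does not need to be sorted by id, only ordered by date within each id.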

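How much efficiency can we gain using Frank's suggestion from the comments on the question?

library(data.table)  # provides setDT() and rleid()
d <- data.frame(inactive=a, id=id)

# Note: rleid(inactive) alone does not start a new run when id changes;
# grouping by rleid(id, inactive) would also reset at id boundaries.
t2 <- Sys.time()
b <- setDT(d)[, v := if (inactive[1]) seq.int(.N) else 0L, by=rleid(inactive)]
Sys.time()-t2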

      

Time difference 2.233547 sec.
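For completeness, here is a sketch of a data.table variant that also resets at id boundaries, run on the question's example values. It assumes the data.table package is installed and that rows are in date order within each id; the column name v is only illustrative.

```r
library(data.table)

# The id and inactive columns from the question's example data
d <- data.table(
  id = rep(1:2, each = 8),
  inactive = c(0,0,0,1,1,1,0,1, 0,1,1,1,1,0,0,1)
)

# rleid(id, inactive) starts a new run whenever id or inactive changes,
# so the counter restarts for every new id and after every active day.
# Within a run, inactive is constant, so multiplying zeroes the active runs.
d[, v := seq_len(.N) * inactive, by = rleid(id, inactive)]
d$v
```

On this example, v should match the consecDaysInactive column the question asks for.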
