R data.table: set a new boolean column if a given day of the week falls within a date range

I have a data.table object with two date columns, from and to. I want to create a new column that determines whether a particular business day falls within the date range.

[Data]

library(data.table)
set.seed(1)
DT <- data.table(from = seq.Date(Sys.Date(), Sys.Date() + 100, by = "day"))
DT[, to := from + sample(10, 1), by = seq_len(nrow(DT))]  # DT must exist before by = 1:nrow(DT) can work
DT[, `:=`(from_wd = wday(from), to_wd = wday(to))]

> head(DT)
         from         to from_wd to_wd
1: 2015-08-06 2015-08-10       5     2
2: 2015-08-07 2015-08-10       6     2
3: 2015-08-08 2015-08-18       7     3
4: 2015-08-09 2015-08-16       1     1
5: 2015-08-10 2015-08-13       2     5
6: 2015-08-11 2015-08-13       3     5


[My approach]

In this case, I want to add a new boolean column, flag, that is 1 (TRUE) if a Wednesday (wday == 4, counting Sunday as 1) falls within the range [from, to].

This is my attempt:

DT[, flag := 0][
  DT[, .I[4 %in% unique(wday(seq.Date(from, to, by = "day")))],
     by = 1:nrow(DT)][[1]],
  flag := 1]

> table(DT$flag)

 0  1 
21 80 


[Question]

It took a while to execute this code, and as you can imagine, it will take even longer as nrow(DT) grows.

My question is: is there a better way to do this? Better in terms of both speed and readability (I find my code is not intuitive at all).



3 answers


Here's one approach:

# First date on or after d whose wday() equals wd (4 = Wednesday; Sunday = 1)
next_wday <- function(d, wd = 4L){
    wddiff <- wd - wday(d)
    d + wddiff + (wddiff < 0L) * 7L
}


DT[, flag2 := +(next_wday(from) <= to)]

# test:
DT[,table(flag,flag2)]
#     flag2
# flag  0  1
#    0 44  0
#    1  0 57


The idea is that you compare to against the next Wednesday**. The replacement line can be written in several different ways.
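To make the weekday arithmetic concrete, here is a self-contained sketch of the same idea using only base R. (data.table's wday() counts Sunday as 1; the helper below reproduces that with format's zero-based %w.)

```r
# Base-R stand-in for data.table::wday(): 1 = Sunday ... 7 = Saturday
wday <- function(d) as.integer(format(d, "%w")) + 1L

# First date on or after d whose weekday number equals wd
next_wday <- function(d, wd = 4L){
  wddiff <- wd - wday(d)
  d + wddiff + (wddiff < 0L) * 7L
}

d <- as.Date(c("2015-08-06", "2015-08-09"))  # a Thursday and a Sunday
next_wday(d)  # both roll forward to Wednesday 2015-08-12
```

A row's range then contains a Wednesday exactly when next_wday(from) <= to.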

Benchmark

The OP mentioned that from and to can be up to 200 days apart, so...

set.seed(1)
from <- seq(as.IDate("1950-01-01"), by = "day", length = 1e6)
to   <- from + pmin(200,rpois(length(from),1))
DT   <- data.table(from,to)

system.time(DT[, flag2 := +(next_wday(from) <= to)])
#    user  system elapsed 
#    2.11    0.03    2.14

# David Arenburg solution
system.time({
    DateDT <- DT[, {
                temp <- seq(min(from), max(to), by = "day")
                temp2 <- temp[wday(temp) == 4L]
                list(from = temp2, to = temp2)
               }
             ]
    indx <- foverlaps(DT, setkey(DateDT), nomatch = 0L, which = TRUE)$xid
    DT[, flag := 0L][indx, flag := 1L]
})  
#    user  system elapsed 
#    6.75    0.14    6.89

# check agreement
DT[,table(flag,flag2)]
#     flag2
# flag      0      1
#    0 714666      0
#    1      0 285334




I use IDate because it is the integer-backed date class that ships with the data.table package and is (presumably) faster. There are several ways to make this code even faster:

  • First, it might be faster to limit attention to rows where to - from is less than 6 (since any span of 6 or more days contains every weekday), for example

    DT[, flag2 := 0L][to - from < 6, flag2 := +(next_wday(from) <= to)]


  • Second, since the computation only depends on one row at a time, parallelization may bring some improvement, as shown in @grubjesic's answer.

  • Additional improvements will depend on the characteristics of the real data.
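The first shortcut rests on the fact that any range spanning six or more days (seven consecutive dates) covers all seven weekdays, so wday == 4 is guaranteed to be inside it. A quick base-R sanity check, using the same Sunday = 1 convention as data.table's wday():

```r
# Seven consecutive dates always cover every weekday 1..7,
# regardless of which day the window starts on.
wday <- function(d) as.integer(format(d, "%w")) + 1L

from <- as.Date("2015-08-06")
days <- seq(from, from + 6, by = "day")
all(1:7 %in% wday(days))  # TRUE for any starting date
```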

The OP's code is not benchmarked here, since it entails splitting the data by row and enumerating up to 200 dates per row, which is bound to be slow.


** or whatever day a wday of 4 corresponds to.



You can also try a foverlaps approach.

First, let's create a data set of all the Wednesdays between min(from) and max(to)

DateDT <- DT[, {
                temp <- seq(min(from), max(to), by = "day")
                temp2 <- temp[wday(temp) == 4L]
                .(from = temp2, to = temp2)
               }
             ]


Then run foverlaps and extract the matching row indices



indx <- foverlaps(DT, setkey(DateDT), nomatch = 0L, which = TRUE)$xid


Then a simple update by reference would be

DT[, flag := 0L][indx, flag := 1L]
DT[, table(flag)]
#  0  1 
# 44 57 
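To see what the foverlaps call contributes here, a minimal toy example on hypothetical dates (foverlaps requires the second table to be keyed, which setkey provides):

```r
library(data.table)

# [from, to] ranges to test, and a one-day "interval" for a single Wednesday
x <- data.table(from = as.IDate(c("2015-08-10", "2015-08-13")),
                to   = as.IDate(c("2015-08-12", "2015-08-14")))
wed <- data.table(from = as.IDate("2015-08-12"),   # a Wednesday
                  to   = as.IDate("2015-08-12"))
setkey(wed, from, to)

# which = TRUE returns matching row indices instead of the joined table;
# xid gives the rows of x whose range contains the Wednesday
foverlaps(x, wed, nomatch = 0L, which = TRUE)$xid  # 1
```

Row 1 (Aug 10-12) contains Aug 12, row 2 (Aug 13-14) does not, so only index 1 is returned; the answer's indx plays the same role for all Wednesdays at once.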



Here's my example:

library(parallel)

process <- function(){

  from <- seq(as.Date("1950-01-01"), by = "day", length = 100000)
  to   <- seq(as.Date("1950-01-04"), by = "day", length = 100000)

  DT <- data.frame(from, to)

  Ncores <- detectCores()

  flagList <- mclapply(1:nrow(DT), function(id){
    # strftime's %w is zero-based (Sunday = 0), so "3" here is the same
    # day that wday() == 4 (Wednesday) refers to in the other answers
    3 %in% strftime(seq(as.Date(DT[id, 1]), as.Date(DT[id, 2]), by = "day"),
                    format = "%w")
  }, mc.cores = Ncores)

  flag <- unlist(flagList)

  return(cbind(DT, flag))
}

It takes only about 15 seconds on my i7 for 100k rows. Hope this helps.
