R data.table: set a new boolean column if a given day of the week falls within a date range

I have a data.table object with two date columns, from and to. I want to create a new column that determines whether a particular business day falls within the date range.

[Data]

library(data.table)
set.seed(1)
DT <- data.table(from = seq.Date(Sys.Date(), Sys.Date() + 100, by = "day"))
DT[, to := from + sample(10, 1), by = seq_len(nrow(DT))]  # DT must exist before by = 1:nrow(DT) can work
DT[, `:=`(from_wd = wday(from), to_wd = wday(to))]

> head(DT)
         from         to from_wd to_wd
1: 2015-08-06 2015-08-10       5     2
2: 2015-08-07 2015-08-10       6     2
3: 2015-08-08 2015-08-18       7     3
4: 2015-08-09 2015-08-16       1     1
5: 2015-08-10 2015-08-13       2     5
6: 2015-08-11 2015-08-13       3     5


[My approach]

In this case, I want to add a new boolean column, flag, that is 1 (TRUE) if a Wednesday (wday == 4, counting Sunday as 1) falls within the range [from, to].

This is my attempt:

DT[, flag := 0][
  DT[, .I[4 %in% unique(wday(seq.Date(from, to, by = "day")))],
     by = 1:nrow(DT)][[1]],
  flag := 1]

> table(DT$flag)

 0  1 
21 80 


[Question]

It took a while to execute this code, and as you can imagine, it will take even longer as nrow(DT) grows.

My question is: is there a better way to do this? Better in terms of both speed and readability (I find my code is not intuitive at all).



3 answers


Here's one approach:

# First date on or after d whose wday() equals wd (4 = Wednesday; Sunday = 1)
next_wday <- function(d, wd = 4L){
    wddiff <- wd - wday(d)
    d + wddiff + (wddiff < 0L) * 7L
}


DT[, flag2 := +(next_wday(from) <= to)]

# test:
DT[,table(flag,flag2)]
#     flag2
# flag  0  1
#    0 44  0
#    1  0 57


The idea is that you compare to against the next Wednesday**. The replacement line can be written in several different ways.
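To make the weekday arithmetic concrete, here is a self-contained sketch of the same idea using only base R. (data.table's wday() counts Sunday as 1; the helper below reproduces that with format's zero-based %w.)

```r
# Base-R stand-in for data.table::wday(): 1 = Sunday ... 7 = Saturday
wday <- function(d) as.integer(format(d, "%w")) + 1L

# First date on or after d whose weekday number equals wd
next_wday <- function(d, wd = 4L){
  wddiff <- wd - wday(d)
  d + wddiff + (wddiff < 0L) * 7L
}

d <- as.Date(c("2015-08-06", "2015-08-09"))  # a Thursday and a Sunday
next_wday(d)  # both roll forward to Wednesday 2015-08-12
```

A row's range then contains a Wednesday exactly when next_wday(from) <= to.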

Benchmark

The OP mentioned that from and to can be up to 200 days apart, so...

set.seed(1)
from <- seq(as.IDate("1950-01-01"), by = "day", length = 1e6)
to   <- from + pmin(200,rpois(length(from),1))
DT   <- data.table(from,to)

system.time(DT[, flag2 := +(next_wday(from) <= to)])
#    user  system elapsed 
#    2.11    0.03    2.14

# David Arenburg solution
system.time({
    DateDT <- DT[, {
                temp <- seq(min(from), max(to), by = "day")
                temp2 <- temp[wday(temp) == 4L]
                list(from = temp2, to = temp2)
               }
             ]
    indx <- foverlaps(DT, setkey(DateDT), nomatch = 0L, which = TRUE)$xid
    DT[, flag := 0L][indx, flag := 1L]
})  
#    user  system elapsed 
#    6.75    0.14    6.89

# check agreement
DT[,table(flag,flag2)]
#     flag2
# flag      0      1
#    0 714666      0
#    1      0 285334




I use IDate because it is the integer-backed date class that ships with the data.table package and is (presumably) faster. There are several ways to make this code even faster:

  • First, it might be faster to limit attention to rows where to - from is less than 6 (since any span of 6 or more days contains every weekday), for example

    DT[, flag2 := 0L][to - from < 6, flag2 := +(next_wday(from) <= to)]


  • Second, since the computation only depends on one row at a time, parallelization may bring some improvement, as shown in @grubjesic's answer.

  • Additional improvements will depend on the characteristics of the real data.
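The first shortcut rests on the fact that any range spanning six or more days (seven consecutive dates) covers all seven weekdays, so wday == 4 is guaranteed to be inside it. A quick base-R sanity check, using the same Sunday = 1 convention as data.table's wday():

```r
# Seven consecutive dates always cover every weekday 1..7,
# regardless of which day the window starts on.
wday <- function(d) as.integer(format(d, "%w")) + 1L

from <- as.Date("2015-08-06")
days <- seq(from, from + 6, by = "day")
all(1:7 %in% wday(days))  # TRUE for any starting date
```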

The OP's code is not benchmarked here, since it entails splitting the data by row and enumerating up to 200 dates per row, which is bound to be slow.


** or whatever day a wday of 4 corresponds to.



You can also try a foverlaps approach.

First, let's create a data set of all the Wednesdays between min(from) and max(to)

DateDT <- DT[, {
                temp <- seq(min(from), max(to), by = "day")
                temp2 <- temp[wday(temp) == 4L]
                .(from = temp2, to = temp2)
               }
             ]


Then run foverlaps and extract the matching row indices



indx <- foverlaps(DT, setkey(DateDT), nomatch = 0L, which = TRUE)$xid


Then a simple update by reference would be

DT[, flag := 0L][indx, flag := 1L]
DT[, table(flag)]
#  0  1 
# 44 57 
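To see what the foverlaps call contributes here, a minimal toy example on hypothetical dates (foverlaps requires the second table to be keyed, which setkey provides):

```r
library(data.table)

# [from, to] ranges to test, and a one-day "interval" for a single Wednesday
x <- data.table(from = as.IDate(c("2015-08-10", "2015-08-13")),
                to   = as.IDate(c("2015-08-12", "2015-08-14")))
wed <- data.table(from = as.IDate("2015-08-12"),   # a Wednesday
                  to   = as.IDate("2015-08-12"))
setkey(wed, from, to)

# which = TRUE returns matching row indices instead of the joined table;
# xid gives the rows of x whose range contains the Wednesday
foverlaps(x, wed, nomatch = 0L, which = TRUE)$xid  # 1
```

Row 1 (Aug 10-12) contains Aug 12, row 2 (Aug 13-14) does not, so only index 1 is returned; the answer's indx plays the same role for all Wednesdays at once.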



Here's my example:

library(parallel)

process <- function(){

  from <- seq(as.Date("1950-01-01"), by = "day", length = 100000)
  to   <- seq(as.Date("1950-01-04"), by = "day", length = 100000)

  DT <- data.frame(from, to)

  Ncores <- detectCores()

  flagList <- mclapply(1:nrow(DT), function(id){
    # strftime's %w is zero-based (Sunday = 0), so "3" here is the same
    # day that wday() == 4 (Wednesday) refers to in the other answers
    3 %in% strftime(seq(as.Date(DT[id, 1]), as.Date(DT[id, 2]), by = "day"),
                    format = "%w")
  }, mc.cores = Ncores)

  flag <- unlist(flagList)

  return(cbind(DT, flag))
}

It takes only about 15 seconds on my i7 for 100k rows. Hope this helps.
