R data.table sets a new boolean column if the day of the week is between the date range
I have an object data.table
with two columns date
, from
and to
. I want to create a new column to determine if a particular business day is between the date range.
[Data]
library(data.table)
set.seed(1)
DT <- data.table(from=seq.Date(Sys.Date(), Sys.Date()+100, by="day"))[, to:=from+sample(10, 1), by=1:nrow(DT)][, from_wd:=wday(from)][, to_wd:=wday(to)]
> head(DT)
from to from_wd to_wd
1: 2015-08-06 2015-08-10 5 2
2: 2015-08-07 2015-08-10 6 2
3: 2015-08-08 2015-08-18 7 3
4: 2015-08-09 2015-08-16 1 1
5: 2015-08-10 2015-08-13 2 5
6: 2015-08-11 2015-08-13 3 5
[My approach]
In this case, I want to add a new column boolean
flag
that returns TRUE
if the environment is in a range [from, to]
.
This is my attempt:
DT[, flag:=0][DT[, .I[4 %in% unique(wday(seq.Date(from, to, by="day")))], by=1:nrow(DT)][[1]], flag:=1]
> table(DT$flag)
0 1
21 80
[Question]
It took a while to execute the code, and as you can imagine, it will take more time if it nrow(DT)
gets more.
My question is: Is there a better way to do this? Better in terms of speed and code readability (I find my code is not intuitive at all).
source to share
Here's one approach:
next_wday <- function(d,wd=4L){
wddiff = wd - wday(d)
d + wddiff + (wddiff < 0L)*7L
}
DT[, flag2 := +(next_wday(from) <= to)]
# test:
DT[,table(flag,flag2)]
# flag2
# flag 0 1
# 0 44 0
# 1 0 57
The idea is that you are comparing to
to next Thursday **. The replacement line can be written in several different ways.
Benchmark
The OP mentioned that from
u to
can be up to 200 days apart, so ...
set.seed(1)
from <- seq(as.IDate("1950-01-01"), by = "day", length = 1e6)
to <- from + pmin(200,rpois(length(from),1))
DT <- data.table(from,to)
system.time(DT[, flag2 := +(next_wday(from) <= to)])
# user system elapsed
# 2.11 0.03 2.14
# David Arenburg solution
system.time({
DateDT <- DT[, {
temp <- seq(min(from), max(to), by = "day")
temp2 <- temp[wday(temp) == 4L]
list(from = temp2, to = temp2)
}
]
indx <- foverlaps(DT, setkey(DateDT), nomatch = 0L, which = TRUE)$xid
DT[, flag := 0L][indx, flag := 1L]
})
# user system elapsed
# 6.75 0.14 6.89
# check agreement
DT[,table(flag,flag2)]
# flag2
# flag 0 1
# 0 714666 0
# 1 0 285334
I use IDate
because it is the date format that comes with the data.table package and with (?) Is faster. There are several ways to make your code even faster:
-
First, it might be faster to limit attention to lines where
to-from
less than 6 (since any break of 6 or more will have every workday), for exampleDT[,flag2:=0L][to-from < 6, flag2 := +(next_wday(from) <= to)]
-
Second, since the computation only depends on one line at a time, parallelization can lead to some improvement, as shown in @ grubjesic's answer.
-
Additional improvements can be found depending on the data from the real data alone.
The OP's code is not being tested here as it entails splitting data across lines and enumerating up to 200 dates per line, which will certainly be slow.
** or independently wday
equal to 4 means.
source to share
You can also try the approach foverlaps
First, let's create a dataset of the entire environment, starting with min(from)
and ending withmax(to)
DateDT <- DT[, {
temp <- seq(min(from), max(to), by = "day")
temp2 <- temp[wday(temp) == 4L]
.(from = temp2, to = temp2)
}
]
Then run foverlaps
and extract the lines you want
indx <- foverlaps(DT, setkey(DateDT), nomatch = 0L, which = TRUE)$xid
Then a simple update from the link would be
DT[, flag := 0L][indx, flag := 1L]
DT[, table(flag)]
# 0 1
# 44 57
source to share
Here's my example:
library(parallel)
process <- function(){
from <- seq(as.Date("1950-01-01"), by = "day", length = 100000)
to <- seq(as.Date("1950-01-04"), by = "day", length = 100000)
DT <- data.frame(from,to)
Ncores <- detectCores()
flagList <- mclapply(1:nrow(DT),function(id){
4 %in% strftime(seq(as.Date(DT[id,1]), as.Date(DT[id,2]), by="day"), format="%w")
},mc.cores=Ncores)
flag <- unlist(flagList)
return(cbind(DT,flag))
}
It only takes 15 seconds for my i7 processor for 100k rows. Hope this helps.
source to share