How to lag values by the number of days in a data table
Let's say I have a data table:
id dt val1 val2
1 '2014-01-01' 100 200
2 '2014-01-01' 100 200
3 '2014-01-01' 100 200
4 '2014-01-01' 100 200
...
9 '2014-01-14' 1000 1100
10 '2014-01-14' 1000 1100
There are 100 rows - val1 and val2 for each of the ten ids for ten days.
I need to prepare a set of lagging values by id for a variable number of days. For example, if I wanted two days, I would see the following:
id dt val1 val2 val1_lag val2_lag
1 '2014-01-01' 100 200 NA NA
2 '2014-01-01' 100 200 NA NA
...
1 '2014-01-03' 300 400 100 200
2 '2014-01-03' 300 400 100 200
...
1 '2014-01-09' 500 600 300 400
...
9 '2014-01-14' 1000 1100 800 900
Note that January 9 is the previous value from January 4. There is no date exactly two days before, so it returns a value that is closest to January 7 (January 9 minus two days), no greater than January 7. It happens mechanically every day. For example, January 3 is used for values from January 1 because they are closest to January 3 minus two days.
There is one complication. There is no guarantee that the deferred date will be accurately priced on weekends and holidays. I want the latest value, which is at least x days earlier.
Is there a way to do this using the power of data tables? You can create a test table using:
library(data.table)
a <- rep(c(1:10), 10)
b <- rep(c(seq(as.Date("2014-01-01"), by = 1, len = 4), seq(as.Date("2014-01-09"), by = 1, len = 6)), 10)
c <- rep(seq(100, by = 100, len = 10), 10)
d <- rep(seq(200, by = 100, len = 10), 10)
e <- data.frame(a, b, c, d)
colnames(e) <- c("id", "dt", "val1", "val2")
setDT(e)
setkeyv(e, c("id", "dt"))
My instinct says to use something like this formula to generate lagging values, I'm just trying to do it automatically for multiple columns:
setDT(e)[, ??][by = id]
EDIT: I modified my code to reflect the potential gap in dates caused by weekends, holidays, etc.
source to share
Here's a solution using another minimal example that might be easier to understand (and covers more cases).
library(data.table)
lag <- 2;
dt <- data.table(id = c(1,1,1,1,2,2),
dt = as.Date(c("2014-1-1", "2014-1-4", "2014-1-6", "2014-1-7", "2014-1-1", "2014-1-10")),
val1 = sample(10, 6),
val2 = sample(10, 6))
dt
id dt val1 val2
1: 1 2014-01-01 8 7
2: 1 2014-01-04 10 5
3: 1 2014-01-06 3 3
4: 1 2014-01-07 7 2
5: 2 2014-01-01 2 1
6: 2 2014-01-10 5 6
dt1 <- copy(dt)
dt1[, laggedDate:=dt]
setkey(dt1, "id", "laggedDate")
dt2 <- copy(dt)
dt2[, laggedDate:=dt-2]
setkey(dt2, "id", "laggedDate")
setnames(dt1, c("dt","val1", "val2"), c("dt.lagged", "val1.lagged", "val2.lagged"))
dt1[dt2, roll=TRUE]
id laggedDate dt.lagged val1.lagged val2.lagged dt val1 val2
1: 1 2013-12-30 <NA> NA NA 2014-01-01 8 7
2: 1 2014-01-02 2014-01-01 8 7 2014-01-04 10 5
3: 1 2014-01-04 2014-01-04 10 5 2014-01-06 3 3
4: 1 2014-01-05 2014-01-04 10 5 2014-01-07 7 2
5: 2 2013-12-30 <NA> NA NA 2014-01-01 2 1
6: 2 2014-01-08 2014-01-01 2 1 2014-01-10 5 6
The key point here is the use of the circular data join function. If you're uncomfortable check out this blog post .
source to share