Creating lagging columns using data.table
Background:
Let's say I have this code
library(data.table)
#reproducibility
set.seed(45L)
#make table
dt <- data.table(V1=c(1L,2L),
V2=LETTERS[1:3],
V3=round(rnorm(4),4),
V4 = 1:12)
dt
for which i am getting
> dt
V1 V2 V3 V4
1: 1 A 0.3408 1
2: 2 B -0.7033 2
3: 1 C -0.3795 3
4: 2 A -0.7460 4
5: 1 B 0.3408 5
6: 2 C -0.7033 6
7: 1 A -0.3795 7
8: 2 B -0.7460 8
9: 1 C 0.3408 9
10: 2 A -0.7033 10
11: 1 B -0.3795 11
12: 2 C -0.7460 12
and I want to add 10 V3 delay columns.
Q:
Is there a way to do this within the data table paradigm.
More details:
If it was data.frame then I could make a loop.
dt <- as.data.frame(dt)
for(i in 1:10){
dt <- cbind(dt, shift(x = dt[, 3],
n = i,
fill = NA,
type = "lag"))
names(dt)[ncol(dt)] <- sprintf("lag_%06d",i)
}
dt
for which I get:
> dt
V1 V2 V3 V4 lag_000001 lag_000002 lag_000003 lag_000004 lag_000005 lag_000006 lag_000007 lag_000008 lag_000009 lag_000010
1 1 A 0.3408 1 NA NA NA NA NA NA NA NA NA NA
2 2 B -0.7033 2 0.3408 NA NA NA NA NA NA NA NA NA
3 1 C -0.3795 3 -0.7033 0.3408 NA NA NA NA NA NA NA NA
4 2 A -0.7460 4 -0.3795 -0.7033 0.3408 NA NA NA NA NA NA NA
5 1 B 0.3408 5 -0.7460 -0.3795 -0.7033 0.3408 NA NA NA NA NA NA
6 2 C -0.7033 6 0.3408 -0.7460 -0.3795 -0.7033 0.3408 NA NA NA NA NA
7 1 A -0.3795 7 -0.7033 0.3408 -0.7460 -0.3795 -0.7033 0.3408 NA NA NA NA
8 2 B -0.7460 8 -0.3795 -0.7033 0.3408 -0.7460 -0.3795 -0.7033 0.3408 NA NA NA
9 1 C 0.3408 9 -0.7460 -0.3795 -0.7033 0.3408 -0.7460 -0.3795 -0.7033 0.3408 NA NA
10 2 A -0.7033 10 0.3408 -0.7460 -0.3795 -0.7033 0.3408 -0.7460 -0.3795 -0.7033 0.3408 NA
11 1 B -0.3795 11 -0.7033 0.3408 -0.7460 -0.3795 -0.7033 0.3408 -0.7460 -0.3795 -0.7033 0.3408
12 2 C -0.7460 12 -0.3795 -0.7033 0.3408 -0.7460 -0.3795 -0.7033 0.3408 -0.7460 -0.3795 -0.7033
There should be a more elegant way. Something that can handle much larger data much faster and more efficiently.
Now if I want to do this first to do 10 lags, on column V3, then on V4, without guessing their names. I could just stick in the loops, but again I suspect data.table has something good (great?) To suggest there.
source to share
Yes, shift
accommodates this type of user demand. When the parameter n
for shift
is a vector, shift
returns a list for each shift in n
:
dt[, sprintf("V3_lag_%06d", 1:10) := shift(V3, 1:10, type = 'lag')]
dt[, sprintf("V4_lag_%06d", 1:10) := shift(V4, 1:10, type = 'lag')]
# V1 V2 V3 V4 V3_lag_000001 V3_lag_000002 V3_lag_000003 V3_lag_000004 V3_lag_000005
# 1: 1 A 1.2322 1 NA NA NA NA NA
# 2: 2 B 1.6094 2 1.2322 NA NA NA NA
# 3: 1 C 0.4016 3 1.6094 1.2322 NA NA NA
# 4: 2 A -0.2730 4 0.4016 1.6094 1.2322 NA NA
# 5: 1 B 1.2322 5 -0.2730 0.4016 1.6094 1.2322 NA
# 6: 2 C 1.6094 6 1.2322 -0.2730 0.4016 1.6094 1.2322
# 7: 1 A 0.4016 7 1.6094 1.2322 -0.2730 0.4016 1.6094
# 8: 2 B -0.2730 8 0.4016 1.6094 1.2322 -0.2730 0.4016
# 9: 1 C 1.2322 9 -0.2730 0.4016 1.6094 1.2322 -0.2730
# 10: 2 A 1.6094 10 1.2322 -0.2730 0.4016 1.6094 1.2322
# 11: 1 B 0.4016 11 1.6094 1.2322 -0.2730 0.4016 1.6094
# 12: 2 C -0.2730 12 0.4016 1.6094 1.2322 -0.2730 0.4016
# V3_lag_000006 V3_lag_000007 V3_lag_000008 V3_lag_000009 V3_lag_000010 V4_lag_000001
# 1: NA NA NA NA NA NA
# 2: NA NA NA NA NA 1
# 3: NA NA NA NA NA 2
# 4: NA NA NA NA NA 3
# 5: NA NA NA NA NA 4
# 6: NA NA NA NA NA 5
# 7: 1.2322 NA NA NA NA 6
# 8: 1.6094 1.2322 NA NA NA 7
# 9: 0.4016 1.6094 1.2322 NA NA 8
# 10: -0.2730 0.4016 1.6094 1.2322 NA 9
# 11: 1.2322 -0.2730 0.4016 1.6094 1.2322 10
# 12: 1.6094 1.2322 -0.2730 0.4016 1.6094 11
# V4_lag_000002 V4_lag_000003 V4_lag_000004 V4_lag_000005 V4_lag_000006 V4_lag_000007
# 1: NA NA NA NA NA NA
# 2: NA NA NA NA NA NA
# 3: 1 NA NA NA NA NA
# 4: 2 1 NA NA NA NA
# 5: 3 2 1 NA NA NA
# 6: 4 3 2 1 NA NA
# 7: 5 4 3 2 1 NA
# 8: 6 5 4 3 2 1
# 9: 7 6 5 4 3 2
# 10: 8 7 6 5 4 3
# 11: 9 8 7 6 5 4
# 12: 10 9 8 7 6 5
# V4_lag_000008 V4_lag_000009 V4_lag_000010
# 1: NA NA NA
# 2: NA NA NA
# 3: NA NA NA
# 4: NA NA NA
# 5: NA NA NA
# 6: NA NA NA
# 7: NA NA NA
# 8: NA NA NA
# 9: 1 NA NA
# 10: 2 1 NA
# 11: 3 2 1
# 12: 4 3 2
source to share
This can also be done in 1 data.table
call. This can be especially useful if you need a lot of columns:
cols <- c("V3","V4")
dt[, (paste0("lag_",rep(cols, each = 10), "_", rep(1:10, times = length(cols)))) :=
unlist(lapply(.SD, function(x) shift(x, 1:10, type = "lag")), recursive = F), .SDcols = cols]
The code paste0(...)
sets up the column names the way we want, and then the code unlist(lapply(...))
gets the lags for each column in the order we want . To see how each of them work, you can run them individually (at startup, unlist(lapply(...))
you need to substitute in dt[,c("V3","V4")]
for .SD
.
dt[,1:9]
# V1 V2 V3 V4 lag_V3_1 lag_V4_1 lag_V3_2 lag_V4_2 lag_V3_3
# 1: 1 A 0.3408 1 NA NA NA NA NA
# 2: 2 B -0.7033 2 0.3408 1 NA NA NA
# 3: 1 C -0.3795 3 -0.7033 2 0.3408 1 NA
# 4: 2 A -0.7460 4 -0.3795 3 -0.7033 2 0.3408
# 5: 1 B 0.3408 5 -0.7460 4 -0.3795 3 -0.7033
# 6: 2 C -0.7033 6 0.3408 5 -0.7460 4 -0.3795
# 7: 1 A -0.3795 7 -0.7033 6 0.3408 5 -0.7460
# 8: 2 B -0.7460 8 -0.3795 7 -0.7033 6 0.3408
# 9: 1 C 0.3408 9 -0.7460 8 -0.3795 7 -0.7033
#10: 2 A -0.7033 10 0.3408 9 -0.7460 8 -0.3795
#11: 1 B -0.3795 11 -0.7033 10 0.3408 9 -0.7460
#12: 2 C -0.7460 12 -0.3795 11 -0.7033 10 0.3408
source to share