Window (aka rolling) join in data.table
I am trying to find a way to do what this feature request asks for: [#2300] Add back and forth in roll = TRUE, which was mentioned in this post.
Basically, I would like to perform the following "window join" of X into Y:
- left join on the first n columns ({x,y} in the following example)
- and select the values of the last column (t in the following example) in Y that fall in the interval [t-w1,t+w2], where t is the last column in X. t will usually be a time column, and {w1,w2} some integers (probably w1=w2=something, or w1=0).
I built the following example (but feel free to provide another / better one)
library(data.table)
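set.seed(123)
# X: two key columns (x, y) and a time column t with millisecond offsets
X <- data.table(x=c(1,1,1,2,2), y=c(TRUE,TRUE,FALSE,FALSE,FALSE), t=as.POSIXct("08:00:00.000", format="%H:%M:%OS")+sample(0:999, 5, TRUE)/1e3)
Y <- copy(X)
set.seed(123)
# Y: same keys, times shifted by up to +/- 5 ms, plus a row index IDX
Y[, `:=`(IDX=.I, t=t+sample((-5:5)/1e3, 5, TRUE))]
# Two extra rows added by hand, 1 ms and 2 ms after the last row of X
Y <- rbindlist(list(Y, X[5,][,IDX:=6][,t:=t+0.001], X[5,][,IDX:=7][,t:=t+0.002]))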
So with (w1,w2) = (.002,.002)
R) X
   x     y                       t
1: 1  TRUE 2013-01-25 08:00:00.286
2: 1  TRUE 2013-01-25 08:00:00.788
3: 1 FALSE 2013-01-25 08:00:00.407
4: 2 FALSE 2013-01-25 08:00:00.882
5: 2 FALSE 2013-01-25 08:00:00.940
R) Y
   x     y                       t IDX
1: 1  TRUE 2013-01-25 08:00:00.284   1
2: 1  TRUE 2013-01-25 08:00:00.791   2
3: 1 FALSE 2013-01-25 08:00:00.407   3
4: 2 FALSE 2013-01-25 08:00:00.886   4
5: 2 FALSE 2013-01-25 08:00:00.945   5
6: 2 FALSE 2013-01-25 08:00:00.941   6 #by hand
7: 2 FALSE 2013-01-25 08:00:00.942   7 #by hand
The result will be
R) ans
   x     y                       t IDX
1: 1  TRUE 2013-01-25 08:00:00.286   1
2: 1  TRUE 2013-01-25 08:00:00.788  NA
3: 1 FALSE 2013-01-25 08:00:00.407   3
4: 2 FALSE 2013-01-25 08:00:00.882  NA
5: 2 FALSE 2013-01-25 08:00:00.940 6,7
Note: IDX may be a list if several rows of Y (which may have more rows than X) match, a single value if only one row matches, or NA if none match.
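To make the intended semantics concrete, here is a brute-force sketch in base R. The helper name window_join and the integer millisecond offsets standing in for the POSIXct times are assumptions made for illustration only, and this is a reference for the expected result, not an efficient solution.

```r
# Hypothetical helper (not part of data.table): for each row of X, collect the
# IDX of the rows of Y that share the key columns and whose t lies in
# [t - w1, t + w2]. Rows of X without any match get NA.
window_join <- function(X, Y, keys, w1, w2) {
  X <- as.data.frame(X)
  Y <- as.data.frame(Y)
  X$IDX <- lapply(seq_len(nrow(X)), function(i) {
    same_key <- rep(TRUE, nrow(Y))
    for (k in keys) same_key <- same_key & Y[[k]] == X[[k]][i]
    in_window <- Y$t >= X$t[i] - w1 & Y$t <= X$t[i] + w2
    hits <- Y$IDX[same_key & in_window]
    if (length(hits) == 0L) NA_integer_ else hits
  })
  X
}

# Same layout as the example above, with t given as integer milliseconds
# after 08:00:00 (an assumption that avoids floating-point comparisons).
X_ms <- data.frame(x = c(1, 1, 1, 2, 2),
                   y = c(TRUE, TRUE, FALSE, FALSE, FALSE),
                   t = c(286L, 788L, 407L, 882L, 940L))
Y_ms <- data.frame(x = c(1, 1, 1, 2, 2, 2, 2),
                   y = c(TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE),
                   t = c(284L, 791L, 407L, 886L, 945L, 941L, 942L),
                   IDX = 1:7)

ans_ms <- window_join(X_ms, Y_ms, keys = c("x", "y"), w1 = 2L, w2 = 2L)
```

Here ans_ms$IDX is a list column, so a row of X can carry zero, one, or several matches.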
I would also be happy with answers that do not use data.table.
Here's an attempt, not very elegant, that uses plyr rather than data.table. I don't know whether it will be helpful for you.
Sample data:
X <- data.frame(x=c(1,1,1,2,2), y=c(TRUE,TRUE,FALSE,FALSE,FALSE), t=rep(1,5)+sample(0:999,5,TRUE)/1e3)
Y <- data.frame(x=c(1,1,1,2,2), y=c(TRUE,TRUE,FALSE,FALSE,FALSE), t=rep(1,5)+sample(0:999,5,TRUE)/1e3, IDX=1:5)
w1 <- 0.3
w2 <- 0.3
This gives:
R> X
  x     y     t
1 1  TRUE 1.880
2 1  TRUE 1.364
3 1 FALSE 1.288
4 2 FALSE 1.170
5 2 FALSE 1.172
R> Y
  x     y     t IDX
1 1  TRUE 1.482   1
2 1  TRUE 1.252   2
3 1 FALSE 1.216   3
4 2 FALSE 1.674   4
5 2 FALSE 1.047   5
Then you can use the following code:
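library(plyr)  # provides ddply() and summarize()
# Pair each row of X with the rows of Y that share (x, y)
m <- merge(X, Y, by=c("x","y"), all.x=TRUE, all.y=FALSE)
# Keep the pairs where Y's t lies in [t.x - w1, t.x + w2]
m <- m[m$t.y >= m$t.x - w1 & m$t.y <= m$t.x + w2, ]
# Collect the matching IDX values of each X row into a list
m <- ddply(m, c("x","y","t.x"), summarize, IDX=list(IDX))
names(m) <- c("x","y","t","IDX")
merge(X, m, by=c("x","y","t"), all.x=TRUE, all.y=FALSE)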
Which gives the following output:
  x     y     t  IDX
1 1 FALSE 1.288    3
2 1  TRUE 1.364 1, 2
3 1  TRUE 1.880   NA
4 2 FALSE 1.170    5
5 2 FALSE 1.172    5
Here is a translation of yuba's code to data.table (v1.8.7 r797):
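# w1 and w2 as in the question, e.g. w1 <- w2 <- 0.002
setkey(X, x, y, t); setkey(Y, x, y, t)
m <- merge(X, Y, by=c("x","y"), all.x=TRUE, all.y=FALSE, allow.cartesian=TRUE)
# Keep the pairs where Y's t lies in [t.x - w1, t.x + w2]
m <- m[t.y >= (t.x - w1) & t.y <= (t.x + w2)]
m <- m[, list(IDX=list(IDX)), by=c("x","y","t.x")]
setnames(m, "t.x", "t")
setkey(m, x, y, t)  # key m so that m[X] joins on (x, y, t)
m <- m[X]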
I get this (note the NULL; Matthew Dowle could explain why we get this instead of NA):
R) m
   x     y                       t IDX
1: 1 FALSE 2013-01-25 08:00:00.407   3
2: 1  TRUE 2013-01-25 08:00:00.286   1
3: 1  TRUE 2013-01-25 08:00:00.788
4: 2 FALSE 2013-01-25 08:00:00.882
5: 2 FALSE 2013-01-25 08:00:00.940 6,7
I will wait for an expert to confirm that this is the optimal data.table solution before closing the post.
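For comparison, later data.table releases (1.9.8 and newer, so not the v1.8.7 used above) support non-equi joins, which let the window condition go directly into on= and avoid the cartesian merge. The following is only a sketch under that version assumption; the names X_ms, Y_ms, lo and hi are hypothetical, and t is given as integer milliseconds after 08:00:00 instead of POSIXct to keep the comparisons exact.

```r
library(data.table)  # assumes a version with non-equi joins (>= 1.9.8)

X_ms <- data.table(x = c(1, 1, 1, 2, 2),
                   y = c(TRUE, TRUE, FALSE, FALSE, FALSE),
                   t = c(286L, 788L, 407L, 882L, 940L))
Y_ms <- data.table(x = c(1, 1, 1, 2, 2, 2, 2),
                   y = c(TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE),
                   t = c(284L, 791L, 407L, 886L, 945L, 941L, 942L),
                   IDX = 1:7)
w1 <- 2L
w2 <- 2L

# Window bounds for each row of X_ms
X_ms[, `:=`(lo = t - w1, hi = t + w2)]

# Equi-join on (x, y) and non-equi join on lo <= Y_ms$t <= hi.
# by = .EACHI evaluates j once per row of X_ms, so each row gets its own list
# of matching IDX values (NA when nothing matches).
res <- Y_ms[X_ms, on = .(x, y, t >= lo, t <= hi),
            .(IDX = list(IDX)), by = .EACHI]

# res has one row per row of X_ms, in the same order; attach the list column
# and drop the helper bounds.
X_ms[, IDX := .(res$IDX)]
X_ms[, c("lo", "hi") := NULL]
```

After this, X_ms keeps its original row order and carries IDX as a list column. The join columns in res take the values of lo and hi rather than the original t, which is why the result is attached back to X_ms here instead of being used directly.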