Is it possible to accomplish the non-equilibrium SQL join task (example below) (faster and / or more tidy) with data.table?

I have a data table with three columns: pid (job), start time (time it starts) and fintime (time it ends) as follows:

require(data.table)

dt <- data.table(pid=sample(1:100,100), starttime = sample(1:100,100)/100)[,fintime:=starttime + round(runif(100)/4,2)] 

      

I need to identify all possible TWOs that can be executed sequentially, but confirm an acceptable gap between Jobs. I can do this using SQL for a span between 0.05 and 0.4 units (time) as shown below:

require(sqldf)
res <- sqldf("select a.pid as first, b.pid as second , a.starttime as startime,  b.fintime as fintime 
    from  dt a, dt b 
    where  a.fintime < b.starttime - 0.05
       and a.fintime > b.starttime - 0.4
  ")

      

How can I do this using data.table? (I am hoping for performance improvements over sqldf when the data is large and with a lot of constraints)

+3


source to share


1 answer


So, here's the data.table approach, which is about 20X faster , but there are some caveats (described at the end).

require(data.table)
set.seed(1)  # for reproducible example
n <- 100     # simple example
dt <- data.table(pid=sample(1:n,n), 
                 starttime = sample(1:n,n)/n,2)[,fintime:=starttime + round(runif(n)/4,2)]
# sqldf approach
require(sqldf)
f.sql <- function(dt) {
  sqldf("create index idx on dt(starttime,fintime)")
  res <- sqldf("select a.pid as first, b.pid as second , a.starttime as starttime,  b.fintime as fintime 
    from  dt a, dt b 
    where b.starttime >= a.fintime + 0.05
      and b.starttime <= a.fintime + 0.4
  ")  
}
res.sql <- f.sql(dt)

# data.table approach with foverlaps(...): need >= 1.9.4 for this!!
packageVersion("data.table")
# [1] β€˜1.9.4’
f.DT <- function(dt) {
  lookup <- dt[,list(second=pid, fintime, a=starttime,b=starttime)]
  setkey(lookup,a,b)
  DT      <- dt[,list(first=pid, starttime, a=fintime+0.05,b=fintime+0.4)]
  J.olaps <- foverlaps(DT,lookup,type="any",nomatch=0) 
  J.olaps[,list(first,second,starttime,fintime)]
}
res.DT <- f.DT(dt)

      

So this is using the feature foverlaps(...)

in the newest version data.table

(1.9.4). Suppose you have two data.tables, x

and y

. Each of them has a couple of columns that form a range. foverlaps(...)

finds all combinations of records in x

and y

where there is overlap between range at x

and range at y

. Here we set it up so that x

it has a range defined by fintime+0.04

and fintime+0.5

, but y

has a range defined starttime

at both ends. So now foverlaps(...)

looking for any combination of records where the start time is between 0.04 and 0.5 is greater than fintime.

Now for the caveats:

First, this only works (AFAIK) if you are willing to relax your constraints to a closed interval (e.g., b.starttime >= a.fintime + 0.05

as well as strictly >

).

Second, the data.table method finds all the records found in the sql approach plus some additional records. You can see this with the following code:



indx  <- data.table(first=res.sql$first,second=res.sql$second,key=c("first","second"))
setkey(res.DT,first,second)
extra <-  res.DT[!indx,]

      

The additional entries seem to be legitimate, so the question is, why weren't they found sqldf(...)

? I cannot answer this question.

Third, this works for your example, but it might not be easy to extend with "more constraints".

Finally, here's a "benchmark" with a dataset that looks more like your actual data:

set.seed(1)
n <- 1e4   # more realistic example
dt <- data.table(pid=sample(1:n,n), 
                 starttime = sample(1:n,n)/n)[,fintime:=starttime + round(runif(n)/4,2)]
system.time(res.sql <- f.sql(dt))
#    user  system elapsed 
#   45.25    0.53   45.80 
system.time(res.DT  <- f.DT(dt))
#    user  system elapsed 
#    2.09    0.86    2.94 

      

+3


source







All Articles