Subset a data.table by the third column of a multi-column key

Let's say I have a data.table with a 3-column key. For example, suppose we have time nested in students, nested in schools.

dt <- data.table(expand.grid(schools = 200:210, students = 1:100, time = 1:5),
                 key = c("schools", "students", "time"))


Now say I want to take a subset of my data that only includes time 5. I know I can use subset:


time.5 <- subset(dt, time == 5)


Or I could do a vector scan:

time.5 <- dt[time == 5]


But this is not the "data.table way" - I want to take advantage of the speed of a binary search. Since I have 3 columns in my key, using unique as follows leads to incorrect results:

dt[.(unique(schools), unique(students), 5)]


Any ideas?



2 answers

You may try

 setkey(dt, time)

 all( dt[J(5)][,time]==5)
 #[1] TRUE
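One thing to keep in mind with this approach is that setkey() re-sorts the table by reference, so the original three-column key is gone afterwards. A minimal sketch of the full round trip, assuming the toy data from the question (old_key is just an illustrative variable name):

```r
library(data.table)

# Same toy data as in the question: time nested in students, nested in schools.
dt <- data.table(expand.grid(schools = 200:210, students = 1:100, time = 1:5),
                 key = c("schools", "students", "time"))

old_key <- key(dt)          # remember the original three-column key

setkey(dt, time)            # re-sorts dt by reference; time is now the only key
time.5 <- dt[J(5L)]         # binary search on the new key

setkeyv(dt, old_key)        # restore the original key on the full table
setkeyv(time.5, old_key)    # key the subset the same way
```

Whether the two extra sorts pay off depends on how many lookups are done between re-keying steps.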



dt1 <- data.table(expand.grid(schools=200:450, students=1:600,time=1:50),
        key=c('schools', 'students', 'time'))
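library(microbenchmark)

# f1: vector scan on the time column
f1 <- function(){dt1[time==5]}

# f2: re-key on time, binary search, then put the three-column key back on the result
f2 <- function(){setkey(dt1, time)
               new.dt <- dt1[J(5)]
               setkeyv(new.dt, colnames(dt1))
              }

# f3: body truncated in the original post; presumably the same lookup without re-keying the result
f3 <- function() {setkey(dt1, time)
               dt1[J(5)]
              }

microbenchmark(f1(), f2(), f3(), unit='relative', times=20L)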
#Unit: relative
#expr      min       lq     mean   median       uq      max neval cld
#f1() 3.188559 3.240377 3.342936 3.218387 3.224352 5.319811    20   b
#f2() 1.050202 1.083136 1.081707 1.089292 1.087572 1.129741    20  a 
#f3() 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000    20  a 
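If the original key has to stay in place, the attempt from the question can be made to work by joining against every schools/students combination rather than against two recycled vectors. This is a sketch using data.table's CJ() (cross join), not part of the benchmark above; lookup is an illustrative name:

```r
library(data.table)

dt <- data.table(expand.grid(schools = 200:210, students = 1:100, time = 1:5),
                 key = c("schools", "students", "time"))

# Every school/student combination paired with time 5.
lookup <- CJ(schools  = unique(dt$schools),
             students = unique(dt$students),
             time     = 5L)

# Keyed join on all three key columns; nomatch = 0L drops combinations
# that do not occur in dt.
time.5 <- dt[lookup, nomatch = 0L]
```

This keeps the key of dt untouched, at the cost of building the lookup table first.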




If query performance is a major factor, you can still speed up @akrun's solution.

# install_github("jangorecki/dwtools")
# or just source:
# instead of single key you can define multiple to be used automatically without the need to re-setkey
Idx = list(
  c('schools', 'students', 'time')
  # ... (remaining index definitions not shown)
)
IDX <- idxv(dt1, Idx)
f4 <- function(){
  # ... (lookup through IDX; body not shown)
}
microbenchmark(f4(), f1(), f2(), f3(), unit='relative', times=1L)
#Unit: relative
#expr       min        lq      mean    median        uq       max neval
#f4()  1.000000  1.000000  1.000000  1.000000  1.000000  1.000000     1
#f1()  6.431114  6.431114  6.431114  6.431114  6.431114  6.431114     1
#f2()  2.320577  2.320577  2.320577  2.320577  2.320577  2.320577     1
#f3() 23.706655 23.706655 23.706655 23.706655 23.706655 23.706655     1


Correct me if I'm wrong, but f3() seems to reuse its key in microbenchmark when times > 1L.


Keep in mind that multiple indexes (Idx) require a lot of memory.
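A similar effect may be available without an extra package: reasonably recent data.table versions have secondary indices (setindex()) and the on= argument, which allow a binary-search subset on a non-key column while leaving the key alone. A minimal sketch on the question's toy data, not benchmarked here:

```r
library(data.table)

dt <- data.table(expand.grid(schools = 200:210, students = 1:100, time = 1:5),
                 key = c("schools", "students", "time"))

setindex(dt, time)                  # secondary index on time; key and row order unchanged
time.5 <- dt[.(5L), on = "time"]    # binary-search subset on the non-key column

key(dt)       # still schools, students, time
indices(dt)   # lists the secondary index on time
```

Like Idx above, each secondary index stores an extra ordering vector, so it also costs memory.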


