Is the order of the columns taken into account when keying a data.table (setkey)?
I have a question about keying data.table objects:
setkey(data, A, B)
data[, C:=length(unique(B, na.rm=T)), by=A]
I was wondering whether I should change the key order to
setkey(data, B, A)
to increase speed, or whether they are equivalent. And how should I key for
data[c>=3, D:=sum(A), by=B]
First, your length(unique(B, na.rm = T)) bit doesn't do what you think it does: na.rm = TRUE is not an argument of unique(); it is passed through ... and ignored (thanks @akrun for pointing this out). Probably the best way to get what you want is uniqueN(na.omit(B)).
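A quick base-R demonstration of this pitfall (data.table's uniqueN(na.omit(x)) computes the same count as the last line):

```r
x <- c(1, 2, 2, NA, NA)

# na.rm is not a formal argument of unique(); it is absorbed by '...'
# and silently ignored, so NA still counts as a distinct value:
length(unique(x, na.rm = TRUE))   # 3, not 2

# Dropping NAs first gives the intended count of distinct non-NA values:
length(unique(na.omit(x)))        # 2
```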
With this in mind, I ran 9 (= 3x3) tests comparing the speed of a (slightly enhanced) version of the code you suggested, varying the key before each of the two grouped operations: (B,A), (A,B), or no key (X). For example, the function BAX below keys by (B,A) up front and does not re-key before the second operation:
BAX <- function(){
  data <- data.table(A = sample(50, size = 1e6, replace = TRUE),
                     B = sample(c(1:150000, NA), size = 1e6, replace = TRUE))
  setkey(data, B, A)
  data[ , C := uniqueN(na.omit(B)), by = A]
  data[C >= 18500, D := sum(A), by = B]
}
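For reference, here is a sketch of the unkeyed variant XX, reconstructed from the same pattern (no setkey() at all, so both grouped operations run as ad hoc bys):

```r
library(data.table)

XX <- function(){
  data <- data.table(A = sample(50, size = 1e6, replace = TRUE),
                     B = sample(c(1:150000, NA), size = 1e6, replace = TRUE))
  # no setkey() calls: both by= operations below are "ad hoc" bys
  data[ , C := uniqueN(na.omit(B)), by = A]
  data[C >= 18500, D := sum(A), by = B]
}
```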
Here are the results of 200 repetitions of each configuration:
> microbenchmark(times = 200L,
XX(), XAB(), XBA(), ABX(), ABAB(),
ABBA(), BAX(), BAAB(), BABA())
Unit: milliseconds
expr min lq mean median uq max neval cld
XX() 70.05867 73.66665 105.2628 96.55443 116.5883 213.2926 200 a
XAB() 112.52981 121.91760 161.2687 157.66455 172.6626 370.4791 200 ef
XBA() 112.56648 122.65417 165.9513 158.96873 174.6038 406.3392 200 f
ABX() 79.59582 82.33355 110.8462 101.04939 125.0158 198.1082 200 a
ABAB() 83.81686 90.40803 123.1391 126.94853 132.0878 182.0694 200 b
ABBA() 112.50687 117.68602 151.8467 155.72603 161.2123 228.5776 200 de
BAX() 85.82144 93.87965 134.5259 130.40824 146.1559 263.9083 200 bc
BAAB() 100.48214 105.35192 150.9692 146.76173 156.0230 392.4626 200 de
BABA() 93.29706 104.70251 142.8426 138.12848 149.1106 279.4645 200 cd
From this simple example, your best options are either not to key the table at all (pre-sorting apparently buys very little here) or to key it (A,B) once and leave it alone. Keying it (B,A) once and leaving it alone also works reasonably well. With that in mind, I'm actually quite surprised at how poorly XBA performed.
If you're wondering why it's so fast without keying: basically, all that keying does, for what you're trying to do, is pre-sort the data; that only marginally speeds up the grouped operations themselves, but it costs a re-sort every time you re-key between operations. In the manual's language, this is keyed by versus ad hoc by:
When by contains the first n columns of x's key, we call it a keyed by. In a keyed by the groups appear contiguously in RAM and memory is copied in bulk internally, for extra speed. Otherwise, we call it an ad hoc by. Ad hoc by is still many times faster than tapply, for example, but not as fast as keyed by when datasets are very large, in particular when the size of each group is large. Not to be confused with keyby= as defined below.
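A small sketch of that distinction (the data and column names here are invented for illustration):

```r
library(data.table)

dt <- data.table(A = sample(5, 1e5, replace = TRUE), B = rnorm(1e5))

# Ad hoc by: no key is set, so the groups are located on the fly.
adhoc <- dt[ , .(m = mean(B)), by = A]

# Keyed by: 'by' matches the key's leading column, so the groups sit
# contiguously in RAM and results come back in key order.
setkey(dt, A)
keyed <- dt[ , .(m = mean(B)), by = A]

# Both produce the same group means, possibly in a different row order.
```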
The real benefits of keys show up in subsetting and merging; for operations like yours, I've found ad hoc bys quite satisfactory.
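For example (a sketch with made-up data), a subset on a keyed column uses binary search rather than scanning the whole column:

```r
library(data.table)

dt <- data.table(id = sample(1e6), val = rnorm(1e6))
setkey(dt, id)   # sorts by id and marks it as the key

dt[.(42)]        # binary-search lookup on the key column;
                 # dt[id == 42] on an unkeyed table would scan every row
```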