Is the order of the columns taken into account when keying a data.table (setkey)?
I have a question about keying data.table objects:
setkey(data, A, B)
data[, C:=length(unique(B, na.rm=T)), by=A]
I was wondering whether I should change the key order to
setkey(data, B, A)
to increase speed, or whether they are equivalent. And how should I key for
data[c>=3, D:=sum(A), by=B]
First, your length(unique(B, na.rm = T)) bit doesn't do what you think it does: na.rm = TRUE is not an argument of unique(); it is passed through ... and ignored (thanks @akrun for pointing this out). Probably the best way to get what you want is uniqueN(na.omit(B)).
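A quick base-R demonstration of this pitfall (data.table's uniqueN(na.omit(x)) computes the same count as the last line):

```r
x <- c(1, 2, 2, NA, NA)

# na.rm is not a formal argument of unique(); it is absorbed by '...'
# and silently ignored, so NA still counts as a distinct value:
length(unique(x, na.rm = TRUE))   # 3, not 2

# Dropping NAs first gives the intended count of distinct non-NA values:
length(unique(na.omit(x)))        # 2
```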
With this in mind, I ran 9 (= 3x3) tests comparing the speed of a (slightly enhanced) version of the code you suggested, varying the key before each of the two grouped operations: (B,A), (A,B), or no key (X). For example, the function BAX below keys by (B,A) up front and does not re-key before the second operation:
BAX <- function(){
  data <- data.table(A = sample(50, size = 1e6, replace = TRUE),
                     B = sample(c(1:150000, NA), size = 1e6, replace = TRUE))
  setkey(data, B, A)
  data[ , C := uniqueN(na.omit(B)), by = A]
  data[C >= 18500, D := sum(A), by = B]
}
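For reference, here is a sketch of the unkeyed variant XX, reconstructed from the same pattern (no setkey() at all, so both grouped operations run as ad hoc bys):

```r
library(data.table)

XX <- function(){
  data <- data.table(A = sample(50, size = 1e6, replace = TRUE),
                     B = sample(c(1:150000, NA), size = 1e6, replace = TRUE))
  # no setkey() calls: both by= operations below are "ad hoc" bys
  data[ , C := uniqueN(na.omit(B)), by = A]
  data[C >= 18500, D := sum(A), by = B]
}
```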
Here are the results of 200 repetitions of each configuration:
> microbenchmark(times = 200L,
XX(), XAB(), XBA(), ABX(), ABAB(),
ABBA(), BAX(), BAAB(), BABA())
Unit: milliseconds
expr min lq mean median uq max neval cld
XX() 70.05867 73.66665 105.2628 96.55443 116.5883 213.2926 200 a
XAB() 112.52981 121.91760 161.2687 157.66455 172.6626 370.4791 200 ef
XBA() 112.56648 122.65417 165.9513 158.96873 174.6038 406.3392 200 f
ABX() 79.59582 82.33355 110.8462 101.04939 125.0158 198.1082 200 a
ABAB() 83.81686 90.40803 123.1391 126.94853 132.0878 182.0694 200 b
ABBA() 112.50687 117.68602 151.8467 155.72603 161.2123 228.5776 200 de
BAX() 85.82144 93.87965 134.5259 130.40824 146.1559 263.9083 200 bc
BAAB() 100.48214 105.35192 150.9692 146.76173 156.0230 392.4626 200 de
BABA() 93.29706 104.70251 142.8426 138.12848 149.1106 279.4645 200 cd
From this simple example, your best options are either not to key the table at all (pre-sorting apparently buys very little here) or to key it (A,B) once and leave it alone. Keying it (B,A) once and leaving it alone also works reasonably well. With that in mind, I'm actually quite surprised at how poorly XBA performed.
If you're wondering why it's so fast without keying: basically, all that keying does, for what you're trying to do, is pre-sort the data; that only marginally speeds up the grouped operations themselves, but it costs a re-sort every time you re-key between operations. In the manual's language, this is keyed by versus ad hoc by:
When by contains the first n columns of x's key, we call it a keyed by. In a keyed by the groups appear contiguously in RAM and memory is copied in bulk internally, for extra speed. Otherwise, we call it an ad hoc by. Ad hoc by is still many times faster than tapply, for example, but not as fast as keyed by when datasets are very large, in particular when the size of each group is large. Not to be confused with keyby= as defined below.
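A small sketch of that distinction (the data and column names here are invented for illustration):

```r
library(data.table)

dt <- data.table(A = sample(5, 1e5, replace = TRUE), B = rnorm(1e5))

# Ad hoc by: no key is set, so the groups are located on the fly.
adhoc <- dt[ , .(m = mean(B)), by = A]

# Keyed by: 'by' matches the key's leading column, so the groups sit
# contiguously in RAM and results come back in key order.
setkey(dt, A)
keyed <- dt[ , .(m = mean(B)), by = A]

# Both produce the same group means, possibly in a different row order.
```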
The real benefits of keys show up in subsetting and merging; for operations like yours, I've found ad hoc bys quite satisfactory.
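For example (a sketch with made-up data), a subset on a keyed column uses binary search rather than scanning the whole column:

```r
library(data.table)

dt <- data.table(id = sample(1e6), val = rnorm(1e6))
setkey(dt, id)   # sorts by id and marks it as the key

dt[.(42)]        # binary-search lookup on the key column;
                 # dt[id == 42] on an unkeyed table would scan every row
```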