When searching for a value in a large vector, does performance differ if the vector is sorted?

I often have to search for a value in a vector; the vectors I search are large, and I do it repeatedly. Often the vectors in my data are already sorted. Is searching for a value in an unsorted vector more expensive than searching a sorted one? If so, do we need to tell R that the vector is sorted before searching, and if so, how?

set.seed(4)
v1 <- runif(1000000)
x <- v1[384932]
v2 <- sort(v1)

We are looking for x in both vectors

+3




3 answers


Base R has to scan the entire vector with which(); the data.table package optimizes this:

which(x == v1)
#[1] 384932

library(data.table)
DT <- data.table(v = v1)
DT[, ind := .I]   # .I is the row number, stored so we can report the position
DT[v == x,]       # first call builds a secondary index on v; later calls reuse it
#          v    ind
#1: 0.807405 384932

library(microbenchmark)
microbenchmark(which(x == v1),
               DT[v == x,])
#Unit: microseconds
#           expr      min        lq     mean   median       uq      max neval cld
# which(x == v1) 6507.923 6735.3560 6994.638 6930.145 7038.240 9693.825   100   b
#   DT[v == x, ]  813.030  849.9855 1082.465 1228.628 1258.347 1456.677   100  a 

      



Quoting the 1.9.4 NEWS:

"DT[column==value] and DT[column %in% values] are now optimized to use DT's key when key(DT)[1]=="column", otherwise a secondary key (a.k.a. index) is automatically added so the next DT[column==value] is much faster. No code changes are needed; existing code should automatically benefit."
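If you would rather make the key explicit instead of relying on the automatic index, something like this should work (a sketch; note that setkey physically reorders the table, which is why the ind column is added first to preserve the original positions):

library(data.table)
DT <- data.table(v = v1, ind = seq_along(v1))
setkey(DT, v)   # sorts DT by v and marks v as the key
DT[.(x)]        # keyed lookup: a binary search instead of a full vector scan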

+3




I can't give you the technical details, but here's an example showing that sortedness makes essentially no difference to this kind of lookup (in base R):



set.seed(4)
v1 <- runif(5e7)
x <- v1[384932]
v2 <- sort(v1)

library(microbenchmark)

microbenchmark(
  unsorted = {which(v1 == x)},
  sorted = {which(v2 == x)})

#Unit: milliseconds
#     expr      min       lq   median       uq      max neval
# unsorted 263.2967 269.6947 273.4812 278.6712 871.6447   100
#   sorted 263.7483 270.8948 274.0171 278.2555 831.2782   100
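That said, if the vector really is sorted, base R can exploit that with a binary search; findInterval() runs in O(log n) instead of scanning every element. A sketch using the v2 from above:

# findInterval returns the largest index i with v2[i] <= x,
# so x is present exactly when v2[i] == x
i <- findInterval(x, v2)
if (i > 0 && v2[i] == x) i else NA   # position of x in v2, or NA if absent

On a vector of 5e7 elements this should be several orders of magnitude faster than the linear scan that which() performs.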


+1




The efficiency of "searching" for an index by a numeric value will not depend on the ordering of the values at all. The only time a fetch operation (one using "[") depends on any sort of ordering is with a named vector or a named list, and it is hard to imagine that ordering matters when performing a logical test. To get the most out of accessing named elements, a trusted guru suggests storing the elements in an environment so that the indexes can be hashed: http://markmail.org/message/pu3uib5muqjmz3l6?q=list:org%2Er-project%2Er-help+indices+hashing
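A minimal sketch of that environment idea (the names and values here are made up for illustration):

# environments are hash tables by default, so lookup by name is O(1),
# unlike the linear scan used for named vectors and lists
e <- new.env(hash = TRUE)
assign("alpha", 42, envir = e)
assign("beta", 99, envir = e)
get("alpha", envir = e)   # returns 42 via a hashed lookup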

Admittedly, the following answers a different question. The help page ?sort indicates that a shell sort is used by default and that in most cases it is faster than the alternative quicksort method. My testing shows that when the sequence is known to be completely random, quicksort is better. (But since I don't think you should be sorting at all here, this is beside the point.)

install.packages("rbenchmark")
library(rbenchmark)
rand <- sample(1:100000)
benchmark(rs = { x <- sort(rand) },
          rq = { x <- sort(rand, method = "quick") },
          ss = { x <- sort(1:1000) },
          sq = { x <- sort(1:10000, method = "quick") })
 #----------
  test replications elapsed relative user.self sys.self user.child sys.child
2   rq          100   0.834  119.143     0.792    0.046          0         0
1   rs          100   1.583  226.143     1.539    0.047          0         0
4   sq          100   0.031    4.429     0.030    0.001          0         0
3   ss          100   0.007    1.000     0.008    0.000          0         0


It makes sense once I compare the right combinations: shell sort is faster on already-sorted material, and quicksort is faster on random material.
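For a fairer comparison the sorted and random inputs should be the same length (the ss and sq cases above use shorter vectors than rand); a sketch, with timings that will of course vary by machine:

library(rbenchmark)
n <- 100000
rand <- sample(1:n)
benchmark(shell_random = sort(rand),                   # shell sort, random input
          quick_random = sort(rand, method = "quick"), # quicksort, random input
          shell_sorted = sort(1:n),                    # shell sort, sorted input
          quick_sorted = sort(1:n, method = "quick"))  # quicksort, sorted input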

+1








