Speed up vector search for data.table

Question

Speed up vector search for data.table

In a row in the data table, I need to find the nearest smaller number from the vector. The following minimal working example gets the job done, but is too slow, especially for longer pre.numbers vectors (about 1 million elements in real data).

library(data.table)
set.seed(2)
pre.numbers <- sort(floor(runif(50000, 1, 1000000)))
the.table <- data.table(cbind(rowid=1:10000, current.number=floor(runif(1000, 1, 100000)), closest.lower.number=NA_integer_))
setkey(the.table, rowid)
the.table[, closest.lower.number:=max(pre.numbers[pre.numbers<current.number]), by=rowid]

There must be a smarter way to do this. There is no relationship between vector numbers and numbers in the data table.

+3

r data.table

Chris 12 jan. 15 at 18:47

source to share

2 answers

Here's a vector solution:

algo1 = function()
{
    vec     = the.table$current.number
    indices = findInterval(vec-0.1, pre.numbers)
    res     = ifelse(indices==0, 0, vec[indices])

    the.table$closest.lower.number = res
}

algo2 = function()
{
    setkey(the.table, rowid)
    the.table[, closest.lower.number:=max(pre.numbers[pre.numbers<current.number]), by=rowid]
}

On my machine:

t1 = system.time(algo1())
#> t1
#user  system elapsed 
#0.0      0.0     0.0 

t2 = system.time(algo2())
#> t2
#user  system elapsed 
#9.73    0.00    9.73

+4

Colonel Beauvel 12 jan. 15 at 18:58

source to share

Arun · Accepted Answer · 2015-01-12T19:34:34+0000

How about this? Using connections for framing data.

DT = data.table(pre = pre.numbers, 
       current.number = pre.numbers+0.5, key="current.number")
setkey(the.table, current.number)
ans = DT[the.table, roll=Inf, rollends=FALSE]

Since you are dealing with integers, I just added 0.5 (any number between 0 and 1 should be fine) to create DT

from pre.numbers

.

The last step is to perform a rolling LOCF join (the last observation is carried forward). For each value current.number

(key column), the corresponding rows appear in DT

current.number

(key column). If there is no match, the last observation will be folded forward. And if the match happens at the start / end, it results in NA

( rollends = FALSE

).

To better illustrate what's going on, consider this case:

# pre.numbers:
# c(10, 11)

# the.table:
# current.numbers
#               9
#              10
#              11

First we convert pre.number to DT

, which will result in columns

# DT:
# pre  current.numbers (key col)
#  10             10.5
#  11             11.5

For each value in the.table

:

#  9 -> falls before 10.5, LOCF and rollends = FALSE => result is NA
# 10 -> falls before 10.5 => same as above
# 11 -> falls between 10.5 and 11.5, LOCF matches with previous row = 10.5 
#       corresponding pre = 10.

NTN

Here's the code I used to generate the data:

require(data.table)
set.seed(1L)
pre.numbers = sort(floor(runif(50000, 1, 1000000)))
the.table = data.table(rowid=1:10000, current.number=floor(runif(1000, 1, 100000)))

Speed ​​up vector search for data.table

More articles:

Speed up vector search for data.table