Speed up vector search for data.table
In a row in the data table, I need to find the nearest smaller number from the vector. The following minimal working example gets the job done, but is too slow, especially for longer pre.numbers vectors (about 1 million elements in real data).
library(data.table)
set.seed(2)
pre.numbers <- sort(floor(runif(50000, 1, 1000000)))
the.table <- data.table(cbind(rowid=1:10000, current.number=floor(runif(1000, 1, 100000)), closest.lower.number=NA_integer_))
setkey(the.table, rowid)
the.table[, closest.lower.number:=max(pre.numbers[pre.numbers<current.number]), by=rowid]
There must be a smarter way to do this. There is no relationship between vector numbers and numbers in the data table.
source to share
How about this? Using connections for framing data.
DT = data.table(pre = pre.numbers,
current.number = pre.numbers+0.5, key="current.number")
setkey(the.table, current.number)
ans = DT[the.table, roll=Inf, rollends=FALSE]
Since you are dealing with integers, I just added 0.5 (any number between 0 and 1 should be fine) to create DT
from pre.numbers
.
The last step is to perform a rolling LOCF join (the last observation is carried forward). For each value current.number
(key column), the corresponding rows appear in DT
current.number
(key column). If there is no match, the last observation will be folded forward. And if the match happens at the start / end, it results in NA
( rollends = FALSE
).
To better illustrate what's going on, consider this case:
# pre.numbers:
# c(10, 11)
# the.table:
# current.numbers
# 9
# 10
# 11
First we convert pre.number to DT
, which will result in columns
# DT:
# pre current.numbers (key col)
# 10 10.5
# 11 11.5
For each value in the.table
:
# 9 -> falls before 10.5, LOCF and rollends = FALSE => result is NA
# 10 -> falls before 10.5 => same as above
# 11 -> falls between 10.5 and 11.5, LOCF matches with previous row = 10.5
# corresponding pre = 10.
NTN
Here's the code I used to generate the data:
require(data.table)
set.seed(1L)
pre.numbers = sort(floor(runif(50000, 1, 1000000)))
the.table = data.table(rowid=1:10000, current.number=floor(runif(1000, 1, 100000)))
source to share
Here's a vector solution:
algo1 = function()
{
vec = the.table$current.number
indices = findInterval(vec-0.1, pre.numbers)
res = ifelse(indices==0, 0, vec[indices])
the.table$closest.lower.number = res
}
algo2 = function()
{
setkey(the.table, rowid)
the.table[, closest.lower.number:=max(pre.numbers[pre.numbers<current.number]), by=rowid]
}
On my machine:
t1 = system.time(algo1())
#> t1
#user system elapsed
#0.0 0.0 0.0
t2 = system.time(algo2())
#> t2
#user system elapsed
#9.73 0.00 9.73
source to share