Set new column value based on TRUE / FALSE obtained from existing columns
I want to create a new column in an object data.table
and the values should be set based on the match of the number in the range. from
and to
ranges are two columns in an existing object data.table
.
[Data]
library(data.table)
set.seed(1)
DT <- data.table(from=sample(10000))[, to:=from+sample(10000)]
> head(DT)
from to
1: 2656 3304
2: 3721 10487
3: 5728 13081
4: 9080 10193
5: 2017 2484
6: 8980 10289
[My approach]
In this case, I want to add a new column boolean
flag
that returns TRUE
if 6500 is in the range [from, to]
.
This is my attempt:
DT[, flag:=0][DT[, .I[6500 %in% seq(from, to, by=1)], by=1:nrow(DT)][[1]], flag:=1]
> table(DT$flag)
0 1
5567 4433
[Question]
It took a while to execute the code, and as you can imagine, it will take longer if n
there are sample(n)
more.
My question is: Is there a better way to do this? Better in terms of speed and code readability (I find my code is not intuitive at all).
source to share
Why not just use:
DT[, flag := +(from < 6500 & to > 6500)]
The statement between ()
gives conditions, +
makes it a logical whole (either 0
, or 1
). This gives the same result as:
DT[, flag := ifelse(from < 6500 & to > 6500, 1, 0)]
Another possibility (suggested by @Frank):
DT[, flag := mapply(between,6500,from,to)]
The challenge, however, comes mapply
with quite an overhead. Just using:
DT[, flag := between(6500,from,to)]
much faster, as can be seen in the following speed comparison between different approaches:
library(microbenchmark)
microbenchmark(q = DT[, flag:=0][DT[, .I[6500 %in% seq(from, to, by=1)], by=1:nrow(DT)][[1]], flag:=1],
j1 = DT[, flag := +(from < 6500 & to > 6500)],
j2 = DT[, flag := ifelse(from < 6500 & to > 6500, 1, 0)],
j3 = DT[, flag := mapply(between,6500,from,to)],
j4 = DT[, flag := between(6500,from,to)],
nikos = DT[, flag:=0][from<=6500 & to>=6500, flag:=1],
jimbo = DT[, flag := 6500<=to & 6500>=from])
Unit: microseconds
expr min lq mean median uq max
q 2424842.405 2498646.2495 2638230.4775 2545083.2020 2684601.6290 4336768.458
j1 843.639 896.2505 1074.3921 955.3120 1047.0570 3689.399
j2 2063.674 2205.7850 2766.0470 2282.0050 3115.9475 7978.479
j3 16072.188 16406.6920 19550.4140 19056.4665 20915.1620 75465.362
j4 687.094 731.7385 877.6009 780.1280 858.3825 3073.322
nikos 1077.945 1186.2395 1424.4156 1290.7015 1389.8500 3699.621
jimbo 759.372 836.2075 1000.6322 884.4715 958.0035 3016.492
So the fastest approach (option between
c j4
) is about 3000 times faster than the original approach in the question.
source to share