Set new column value based on TRUE / FALSE obtained from existing columns

I want to create a new column in an object data.table

and the values ​​should be set based on the match of the number in the range. from

and to

ranges are two columns in an existing object data.table

.

[Data]

library(data.table)
set.seed(1)
DT <- data.table(from=sample(10000))[, to:=from+sample(10000)]

> head(DT)
from    to
1: 2656  3304
2: 3721 10487
3: 5728 13081
4: 9080 10193
5: 2017  2484
6: 8980 10289

      

[My approach]

In this case, I want to add a new column boolean

flag

that returns TRUE

if 6500 is in the range [from, to]

.

This is my attempt:

DT[, flag:=0][DT[, .I[6500 %in% seq(from, to, by=1)], by=1:nrow(DT)][[1]], flag:=1]

> table(DT$flag)

   0    1 
5567 4433 

      

[Question]

It took a while to execute the code, and as you can imagine, it will take longer if n

there are sample(n)

more.

My question is: Is there a better way to do this? Better in terms of speed and code readability (I find my code is not intuitive at all).

+3


source to share


3 answers


Why not just use:

DT[, flag := +(from < 6500 & to > 6500)]

      

The statement between ()

gives conditions, +

makes it a logical whole (either 0

, or 1

). This gives the same result as:

DT[, flag := ifelse(from < 6500 & to > 6500, 1, 0)]

      

Another possibility (suggested by @Frank):

DT[, flag := mapply(between,6500,from,to)]

      



The challenge, however, comes mapply

with quite an overhead. Just using:

DT[, flag := between(6500,from,to)]

      

much faster, as can be seen in the following speed comparison between different approaches:

library(microbenchmark)
microbenchmark(q = DT[, flag:=0][DT[, .I[6500 %in% seq(from, to, by=1)], by=1:nrow(DT)][[1]], flag:=1],
               j1 = DT[, flag := +(from < 6500 & to > 6500)],
               j2 = DT[, flag := ifelse(from < 6500 & to > 6500, 1, 0)],
               j3 = DT[, flag := mapply(between,6500,from,to)],
               j4 = DT[, flag := between(6500,from,to)],
               nikos = DT[, flag:=0][from<=6500 & to>=6500, flag:=1],
               jimbo = DT[, flag := 6500<=to & 6500>=from])

Unit: microseconds
  expr         min           lq         mean       median           uq         max
     q 2424842.405 2498646.2495 2638230.4775 2545083.2020 2684601.6290 4336768.458
    j1     843.639     896.2505    1074.3921     955.3120    1047.0570    3689.399
    j2    2063.674    2205.7850    2766.0470    2282.0050    3115.9475    7978.479
    j3   16072.188   16406.6920   19550.4140   19056.4665   20915.1620   75465.362
    j4     687.094     731.7385     877.6009     780.1280     858.3825    3073.322
 nikos    1077.945    1186.2395    1424.4156    1290.7015    1389.8500    3699.621
 jimbo     759.372     836.2075    1000.6322     884.4715     958.0035    3016.492

      

So the fastest approach (option between

c j4

) is about 3000 times faster than the original approach in the question.

+2


source


Sorry why don't you do something like this:

DT[,flag:=0]
DT[from<=6500 & to>=6500,flag:=1]

      



This should have the same result if I didn't understand your question.

+2


source


library(data.table)
set.seed(1)
DT <- data.table(from=sample(10000))[, to:=from+sample(10000)]
DT[, flag := 6500<=to & 6500>=from]

> table(DT$flag)

FALSE  TRUE 
 5567  4433 

      

0


source







All Articles