Alternative to slow ifelse in R data.table

I am writing a function that uses multiple ifelse calls on a data.table. Although I use data.table for speed, the multiple ifelse calls make my code slow, and this function is meant for large datasets. So I was wondering whether there is an alternative to ifelse. Here is one example ifelse from the function (there are about 15 of them); in this example, flag is set to 1 if x is empty, else 0.

    dt<-dt[,flag:=ifelse(is.na(x)|!nzchar(x),1,0)]

      

My apologies if this is a duplicate question.

Thanks in advance.



1 answer


The fastest approach will probably depend on what your data looks like. The approaches suggested in the comments are all comparable for this example:

(twice was suggested by @DavidArenburg and onceadd by @akrun. I'm not really sure how to benchmark them with replications > 1, since the objects are actually modified during the benchmark.)

library(data.table)

DT <- data.table(x=sample(c(NA,"",letters),1e8,replace=TRUE))

DT0 <- copy(DT)
DT1 <- copy(DT)
DT2 <- copy(DT)
DT3 <- copy(DT)
DT4 <- copy(DT)
DT5 <- copy(DT)
DT6 <- copy(DT)
DT7 <- copy(DT)

library(rbenchmark)
benchmark(
ifelse  = DT0[,flag:=ifelse(is.na(x)|!nzchar(x),1L,0L)],
keyit   = {
    setkey(DT1,x)   
    DT1[,flag:=0L]
    DT1[J(c(NA_character_,"")),flag:=1L]  # one join column holding both values
},
twiceby = DT2[, flag:= 0L][is.na(x)|!nzchar(x), flag:= 1L,by=x],
twice   = DT3[, flag:= 0L][is.na(x)|!nzchar(x), flag:= 1L],
onceby  = DT4[, flag:= +(is.na(x)|!nzchar(x)), by=x],
once    = DT5[, flag:= +(is.na(x)|!nzchar(x))],
onceadd = DT6[, flag:= (is.na(x)|!nzchar(x))+0L],
oncebyk = {setkey(DT7,x); DT7[, flag:= +(is.na(x)|!nzchar(x)), by=x]},
replications=1
)[1:5]
#      test replications elapsed relative user.self
# 1  ifelse            1   19.61   31.127     17.32
# 2   keyit            1    0.63    1.000      0.47
# 6    once            1    3.26    5.175      2.68
# 7 onceadd            1    3.24    5.143      2.88
# 5  onceby            1    1.81    2.873      1.75
# 8 oncebyk            1    0.91    1.444      0.82
# 4   twice            1    3.17    5.032      2.79
# 3 twiceby            1    3.45    5.476      3.16
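
Since every expression modifies its table by reference, it can be worth confirming on a small table that the variants really produce the same flag. The following is only a sketch and was not part of the benchmark above: the table small and the column names f_ifelse, f_once, f_twice and f_join are made up for illustration. The last variant assumes data.table 1.9.6 or later, where a join can be specified with on= instead of setkey, so the row order is left alone.

```r
library(data.table)

# Small illustrative table (hypothetical data, not the benchmark table)
small <- data.table(x = c("a", NA, "", "b", "", NA, "c"))

# Baseline: ifelse
small[, f_ifelse := ifelse(is.na(x) | !nzchar(x), 1L, 0L)]

# "once": coerce the logical vector to integer with unary plus
small[, f_once := +(is.na(x) | !nzchar(x))]

# "twice": initialise to 0, then overwrite only the matching rows
small[, f_twice := 0L][is.na(x) | !nzchar(x), f_twice := 1L]

# Join-based variant without setkey, so the row order is preserved
small[, f_join := 0L]
small[.(c(NA_character_, "")), on = "x", f_join := 1L]

# All four columns should agree
stopifnot(
  identical(small$f_ifelse, small$f_once),
  identical(small$f_ifelse, small$f_twice),
  identical(small$f_ifelse, small$f_join)
)
```

If the stopifnot call passes silently, the variants agree on this table; it says nothing about their relative speed.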

      



Discussion. In this example, keyit is the fastest. However, it is also the most verbose, and it changes the sort order of the table. Also, keyit is very specific to the OP's question (it takes advantage of the fact that exactly two character values, NA and "", match the condition is.na(x)|!nzchar(x)), so it might not work as well for other applications, where one would need to write something like

keyit   = {
    setkey(DT1,x)
    flagem = DT1[,some_other_condition(x),by=x][(V1)]$x
    DT1[,flag:=0L]
    DT1[J(flagem),flag:=1L]
}
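
To make the generalised version concrete, here is a minimal sketch on a small table. In the snippet above, some_other_condition is only a placeholder; the definition used here (flag missing, empty, or whitespace-only strings) and the table DT_demo are assumptions made purely for illustration.

```r
library(data.table)

# Hypothetical condition standing in for some_other_condition():
# TRUE for NA, empty, or whitespace-only strings
some_other_condition <- function(x) is.na(x) | !nzchar(trimws(x))

# Small illustrative table (hypothetical data)
DT_demo <- data.table(x = c("a", NA, "", "  ", "b", "a", ""))

setkey(DT_demo, x)  # sorts DT_demo by x, so the row order changes

# Evaluate the condition once per distinct value of x,
# then keep the values for which it is TRUE
flagem <- DT_demo[, some_other_condition(x), by = x][(V1)]$x

# Default everything to 0, then set the matching rows via a keyed join
DT_demo[, flag := 0L]
DT_demo[J(flagem), flag := 1L]

# The result should match evaluating the condition row by row
stopifnot(identical(DT_demo$flag, as.integer(some_other_condition(DT_demo$x))))
```

The possible gain comes from calling the condition once per distinct value rather than once per row, so this is only likely to pay off when x has few distinct values relative to the number of rows.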

      
