Why is data.table so slow in this example in R

The task: find all column names whose columns contain any NA values.

I was comparing data.frame and data.table versions of the same operation and found the data.table version to be about 10 times slower. That surprised me, because in most code data.table is much faster than the data.frame equivalent.

set.seed(49)
df1 <- as.data.frame(matrix(sample(c(NA,1:200), 1e4*5000, replace=TRUE), ncol=5000))

library(microbenchmark) 
f1 <- function() {names(df1)[sapply(df1, function(x) any(is.na(x)))]}
f2 <- function() { setDT(df1); names(df1)[df1[,sapply(.SD, function(x) any(is.na(x))),]]  } 
microbenchmark(f1(), f2(), unit="relative")
Unit: relative
 expr      min       lq   median       uq      max neval
 f1()  1.00000  1.00000 1.000000 1.000000 1.000000   100
 f2() 10.56342 10.20919 9.996129 9.967001 7.199539   100

The same comparison with setDT applied in advance:

set.seed(49)
df1 <- as.data.frame(matrix(sample(c(NA,1:200), 1e4*5000, replace=TRUE), ncol=5000))
setDT(df1)

library(microbenchmark) 
f1 <- function() {names(df1)[sapply(df1, function(x) any(is.na(x)))]}
f2 <- function() {names(df1)[df1[,sapply(.SD, function(x) any(is.na(x))),]]  } 
microbenchmark(f1(), f2(), unit="relative")
Unit: relative
 expr      min       lq   median       uq      max neval
 f1()  1.00000  1.00000  1.00000  1.00000 1.000000   100
 f2() 10.64642 10.77769 10.79191 10.77536 7.716308   100

What could be the reason?



1 answer


data.table will not give a magic speed-up in this case.

# Unit: relative
#  expr      min       lq   median       uq      max neval
#  f1() 1.000000 1.000000 1.000000 1.000000 1.000000    10
#  f2() 8.350364 8.146091 6.966839 5.766292 4.595742    10

(For comparison, the timings above are from my machine.)

In the "data.frame" approach, you are actually leveraging the fact that it data.frame

is a list and iterate over the list.
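As a quick illustration (a minimal sketch using the df1 built above, not part of the original answer):

is.list(df1)                  # TRUE: a data.frame is a list of columns
length(df1)                   # 5000, one element per column
identical(df1[[1]], df1$V1)   # TRUE: element 1 is the column V1

So lapply()/sapply() walk the columns directly, with no copying.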

In the data.table approach you do the same thing; however, by using .SD you force the entire table to be copied (to make the data available). This is a consequence of data.table's care to copy only the data you need into the j expression. By using .SD, you copy everything.
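You can check for the copy by comparing column addresses with data.table::address() (a hedged sketch; at the time of this answer .SD forced a copy, while newer data.table versions avoid the deep copy, so the addresses may match on your install):

library(data.table)
dt <- data.table(a = 1:3, b = 4:6)
address(dt$a)         # address of column a in the original table
dt[, address(a)]      # direct column reference inside j
dt[, address(.SD$a)]  # via .SD: a differing address means the column was copied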

A better-performing approach is to use anyNA(), which is a faster (primitive) way of finding any NA values: it stops as soon as it finds the first NA, instead of creating a logical vector with is.na() and then scanning it for any TRUE values.
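The short-circuiting is easy to see when an NA sits at the front of a long vector (a minimal sketch, not from the original answer):

x <- c(NA, runif(1e7))   # NA first, so anyNA() can return immediately
microbenchmark(
  anyNA(x),              # primitive; stops at the first NA
  any(is.na(x)),         # allocates a full logical vector, then scans it
  times = 10
)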



For a more specialized test, you could write one yourself in Rcpp (Rcpp sugar style); a sketch follows below.
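A minimal sketch of such a function (assuming the Rcpp package; any_na_int is a hypothetical name, and it uses an explicit loop rather than sugar so it can exit at the first NA):

library(Rcpp)
cppFunction('
bool any_na_int(IntegerVector x) {
  for (R_xlen_t i = 0; i < x.size(); ++i) {
    if (x[i] == NA_INTEGER) return true;  // stop at the first NA
  }
  return false;
}')
any_na_int(c(1L, NA, 3L))   # TRUE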

You will also find that unlist(lapply(...)) is usually faster than sapply(...).

f3 <- function() names(df1)[unlist(lapply(df1, anyNA))]
f4 <- function() names(df1)[sapply(df1, anyNA)]
microbenchmark(f1(), f2(), f3(), f4(), unit="relative", times=10)

# Unit: relative
# expr       min        lq    median        uq        max neval
# f1() 10.988322 11.200684 11.048738 10.697663  13.110318    10
# f2() 92.915256 92.000781 91.000729 88.421331 103.627198    10
# f3()  1.000000  1.000000  1.000000  1.000000   1.000000    10
# f4()  1.591301  1.663222  1.650136  1.652701   2.133943    10

And with a suggestion from Martin Morgan, dropping the names in unlist() with use.names=FALSE:

f3.1 <- function() names(df1)[unlist(lapply(df1, anyNA), use.names=FALSE)]
microbenchmark(f1(), f2(), f3(), f3.1(), f4(), unit="relative", times=10)
# Unit: relative
#    expr        min         lq    median         uq        max neval
#    f1()  18.125295  17.902925  18.17514  18.410682  9.2177043    10
#    f2() 147.914282 145.805223 145.05835 143.630573 81.9495460    10
#    f3()   1.608688   1.623366   1.66078   1.648530  0.8257108    10
#  f3.1()   1.000000   1.000000   1.00000   1.000000  1.0000000    10
#    f4()   2.555962   2.553768   2.60892   2.646575  1.3510561    10