Why is data.table so slow in this example in R
This relates to getting all column names with any NA values in R.
I was comparing data.frame and data.table versions of the same operation and found the data.table version to be about 10 times slower. That contradicts most of my data.table code, which is usually much faster than the data.frame equivalent.
set.seed(49)
df1 <- as.data.frame(matrix(sample(c(NA,1:200), 1e4*5000, replace=TRUE), ncol=5000))
library(microbenchmark)
f1 <- function() {names(df1)[sapply(df1, function(x) any(is.na(x)))]}
f2 <- function() { setDT(df1); names(df1)[df1[,sapply(.SD, function(x) any(is.na(x))),]] }
microbenchmark(f1(), f2(), unit="relative")
Unit: relative
expr min lq median uq max neval
f1() 1.00000 1.00000 1.000000 1.000000 1.000000 100
f2() 10.56342 10.20919 9.996129 9.967001 7.199539 100
With setDT() called in advance:
set.seed(49)
df1 <- as.data.frame(matrix(sample(c(NA,1:200), 1e4*5000, replace=TRUE), ncol=5000))
setDT(df1)
library(microbenchmark)
f1 <- function() {names(df1)[sapply(df1, function(x) any(is.na(x)))]}
f2 <- function() {names(df1)[df1[,sapply(.SD, function(x) any(is.na(x))),]] }
microbenchmark(f1(), f2(), unit="relative")
Unit: relative
expr min lq median uq max neval
f1() 1.00000 1.00000 1.00000 1.00000 1.000000 100
f2() 10.64642 10.77769 10.79191 10.77536 7.716308 100
What could be the reason?
data.table will not give a magic speedup in this case.
# Unit: relative
# expr min lq median uq max neval
# f1() 1.000000 1.000000 1.000000 1.000000 1.000000 10
# f2() 8.350364 8.146091 6.966839 5.766292 4.595742 10
For reference, those are the timings on my machine.
In the "data.frame" approach, you are actually leveraging the fact that it data.frame
is a list and iterate over the list.
In the approach, data.table
you do the same, however, by using .SD
, you are forcing the entire data table to be copied (to make the data available). This is due to the data.table
skill of only copying the data you need into an expression j
. Using .SD you copy everything.
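As an aside, when only some columns are needed, restricting .SD with .SDcols limits how much gets copied. A minimal sketch (the subset of 100 columns is arbitrary, and it does not help in this particular question, since every column has to be scanned):
library(data.table)
# only these columns are copied into .SD for the j expression
some_cols <- names(df1)[1:100]
df1[, sapply(.SD, function(x) any(is.na(x))), .SDcols = some_cols]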
A better-performing approach is to use anyNA(), which is a faster (primitive) way of finding any NA values: it stops as soon as it finds the first NA, instead of creating a full logical vector with is.na() and then scanning it for any TRUE values.
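To see that difference in isolation, a quick sketch timing anyNA() against any(is.na()) on a single vector (the vector and its size are arbitrary; the NA is placed first so the early exit is visible):
library(microbenchmark)
x <- c(NA, rnorm(1e6))  # NA up front, so anyNA() can return immediately
microbenchmark(any(is.na(x)), anyNA(x))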
For an even more specialized test, you may need to write your own (for example, in Rcpp sugar style).
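A minimal sketch of what that could look like (the function name any_na_sugar is just for illustration; IntegerVector matches the integer columns used here):
library(Rcpp)
cppFunction('
bool any_na_sugar(IntegerVector x) {
  // Rcpp sugar is lazy, so any() can stop at the first NA
  // instead of scanning the whole vector
  return is_true(any(is_na(x)));
}')
any_na_sugar(c(1L, NA, 3L))  # TRUE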
You will also find that unlist(lapply(...)) will usually be faster than sapply().
f3 <- function() names(df1)[unlist(lapply(df1, anyNA))]
f4 <- function() names(df1)[sapply(df1, anyNA)]
microbenchmark(f1(), f2(), f3(), f4(), unit="relative", times=10)
# Unit: relative
# expr min lq median uq max neval
# f1() 10.988322 11.200684 11.048738 10.697663 13.110318 10
# f2() 92.915256 92.000781 91.000729 88.421331 103.627198 10
# f3() 1.000000 1.000000 1.000000 1.000000 1.000000 10
# f4() 1.591301 1.663222 1.650136 1.652701 2.133943 10
And with a suggestion from Martin Morgan (use.names=FALSE skips building the names attribute in unlist(), saving a little extra work):
f3.1 <- function() names(df1)[unlist(lapply(df1, anyNA),use.names=FALSE)]
microbenchmark(f1(), f2(), f3(), f3.1(), f4(), unit="relative", times=10)
# Unit: relative
# expr min lq median uq max neval
# f1() 18.125295 17.902925 18.17514 18.410682 9.2177043 10
# f2() 147.914282 145.805223 145.05835 143.630573 81.9495460 10
# f3() 1.608688 1.623366 1.66078 1.648530 0.8257108 10
# f3.1() 1.000000 1.000000 1.00000 1.000000 1.0000000 10
# f4() 2.555962 2.553768 2.60892 2.646575 1.3510561 10
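Not part of the original comparison, but along the same lines, vapply() with a declared result type is another variant worth timing:
f5 <- function() names(df1)[vapply(df1, anyNA, logical(1L))]
microbenchmark(f3.1(), f5(), unit="relative", times=10)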