data.table lapply(.SD, ...) slows down dramatically as the number of columns increases

I have a 20k x 60k table to aggregate, and I am experimenting with how to do it with good memory efficiency and speed. I noticed that data.table slows down dramatically as the number of columns increases. For example:

library(data.table)  
# a 200 x 1,000 table.
test_dt= data.table(sample= rep(1:100,2), value= matrix(sample(6e07, 2e05), nrow = 200 ))
system.time(test_dt[, lapply(.SD, mean), by= sample, .SDcols= colnames(test_dt)[-1]])
#   user  system elapsed 
#  0.470   0.009   0.117 

# a 200 x 10,000 table
test_dt= data.table(sample= rep(1:100,2), value= matrix(sample(6e07, 2e06), nrow = 200 ))
system.time(test_dt[, lapply(.SD, mean), by= sample, .SDcols= colnames(test_dt)[-1]])
#   user  system elapsed 
# 15.055   0.603  15.334 


Is there any explanation for this non-linear (100x slowdown for 10x the columns) increase in time? One way around it is to melt the table into a long data.table, but that uses several times the memory. Is there a way to get both reasonable memory usage and speed? Thanks.



1 answer


I see a similar result to the OP's:

# a 200 x 10,000 table
set.seed(1)
test_dt= data.table(sample= rep(1:100,2), value= matrix(sample(6e07, 2e06), nrow = 200 ))[, 
  (2:10001) := lapply(.SD, as.numeric), .SDcols=2:10001]
system.time(z <- test_dt[, lapply(.SD, mean), by= sample])
#    user  system elapsed 
#   12.27    0.00   12.26


(I'm converting to numeric, since it seems clear the values are meant to be treated as floats, and adding set.seed() so it's easier to compare results if needed.)

Is there any explanation for this non-linear (100x slowdown for 10x the columns) increase in time?

Broadly, data.tables and data.frames are optimized for grouping rows/observations together, not for iterating over a huge number of columns. My guess is that your approach is running up against your RAM limit and spilling into swap, though I don't know much about that.
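
To see this column-wise overhead directly, here is a rough, machine-dependent sketch (not from the original post) that times the same grouped call over an increasing number of columns; only the trend matters, not the exact numbers:

library(data.table)

# Time the same grouped lapply(.SD, mean) call as the column count grows.
for (k in c(1000, 2000, 4000)) {
  dt <- data.table(sample = rep(1:100, 2),
                   value  = matrix(rnorm(200 * k), nrow = 200))
  t <- system.time(dt[, lapply(.SD, mean), by = sample])[["elapsed"]]
  cat(sprintf("%5d columns: %6.2f sec elapsed\n", k, t))
}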

I think that if you want to take full advantage of data.table's speed, you will need to conform to its natural long-format storage. As the results below show, the difference is substantial.



One way around it is to melt the table into a long data.table, but that uses several times the memory. Is there a way to get both reasonable memory usage and speed?

I think the best approach is to get more RAM and keep the data in long form. The melted table is about twice as large, but the computation is over 100 times faster.

test_mdt = melt(test_dt, id = "sample")[, variable := match(variable, unique(variable))]

system.time(mz <- test_mdt[, .(res = mean(value)), by=.(sample, variable)])
#    user  system elapsed 
#    0.11    0.00    0.11 

object.size(test_dt)  # 17.8 MB
object.size(test_mdt) # 32.0 MB
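
If you need the wide shape back after aggregating, dcast can reshape the long result; a minimal sketch (column names here come from the integer variable index created above):

# Cast the long result back to one column per variable.
wz <- dcast(mz, sample ~ variable, value.var = "res")
dim(wz)  # 100 rows, 1 + 10000 columns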


Alternatively, if each sample appears the same number of times, use a list of matrices, or possibly an array:

test_dt[, g := rowid(sample)]
test_mats = lapply( split(test_dt[, !"sample"], by="g", keep.by=FALSE), as.matrix )
system.time(matz <- Reduce(`+`, test_mats)/length(test_mats))
#    user  system elapsed 
#       0       0       0 

object.size(test_mats) # 17.3 MB
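
For the array variant mentioned above, one option (a sketch, assuming all matrices in test_mats have the same dimensions) is to stack them into a 3-d array and average over the replicate dimension with rowMeans:

# Stack the per-replicate matrices into a 100 x 10000 x 2 array and
# average over the third (replicate) dimension.
test_arr = simplify2array(test_mats)
system.time(arrz <- rowMeans(test_arr, dims = 2))
all.equal(matz, arrz)  # should be TRUE

object.size(test_arr)  # roughly the same size as test_mats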

