Why is R (in my example) very slow for handling dates / times?

I have a list of 40 data frames with about 250k rows each, and I want to add a new variable to each data frame. This new variable, period, is computed from another variable that contains Date objects. The transformation is very simple: if the year part of the date is earlier than 2015, period is set to "old", otherwise "new".

I thought the computation would be very fast thanks to vectorization, but it takes about 41 seconds! (A for loop and lapply give the same performance.)

Reproducible example:

datas.d <- function(nDf, nRow) {
  lapply(seq_len(nDf), function(x) {
    data.frame(
      id1 = sample(7e8:9e8, nRow), 
      id2 = sample(1e9, nRow), 
      id3 = sample(1e9, nRow), 
      date = sample(seq(as.Date("2012-01-01"), Sys.Date(), by = 1), nRow, rep = TRUE), 
      code1 = sample(10, nRow, rep = TRUE), 
      code2 = sample(10, nRow, rep = TRUE), 
      code3 = sample(10, nRow, rep = TRUE)
    )
  })
}

datasDate <- datas.d(40, 25e4)

forLoopDate <- function(datas) {
  for (i in seq_along(datas)) {
    datas[[i]]$period <- rep("old", nrow(datas[[i]]))
    datas[[i]]$period[format(datas[[i]]$date, "%Y") == "2015"] <- "new"
  }
  return(datas)
}

> system.time(forLoopDate(datasDate))
       user      system     elapsed 
      41.46        0.31       41.84

      

I had already run into slowness when coercing strings to dates in an 800k-row data frame, so I suspected that date manipulation was to blame for the poor performance. The R profiler confirmed this:

Rprof(tmp <- tempfile())
datas <- forLoopDate(datasDate)
Rprof(NULL)
summaryRprof(tmp)
$by.self
                  self.time self.pct total.time total.pct
"format.POSIXlt"      39.34    94.16      39.34     94.16
"as.POSIXlt.Date"      1.80     4.31       1.80      4.31
"=="                   0.36     0.86       0.36      0.86
"forLoopDate"          0.22     0.53      41.78    100.00
"format.Date"          0.06     0.14      41.20     98.61

      

So I tried the same transformation while skipping the date formatting, i.e. directly using a string for the year. I also tested another extraction function, year from the lubridate package. The performance gain is unambiguous, and year is very fast, I guess because it works at the C level?

datas.s <- function(nDf, nRow) {
  lapply(seq_len(nDf), function(x) {
    data.frame(
      id1 = sample(7e8:9e8, nRow), 
      id2 = sample(1e9, nRow), 
      id3 = sample(1e9, nRow), 
      date = sample(2012:2015, nRow, rep = TRUE), 
      code1 = sample(10, nRow, rep = TRUE), 
      code2 = sample(10, nRow, rep = TRUE), 
      code3 = sample(10, nRow, rep = TRUE)
    )
  })
}

datasString <- datas.s(40, 25e4)

forLoopString <- function(datas) {
  for (i in seq_along(datas)) {
    datas[[i]]$period <- rep("old", nrow(datas[[i]]))
    datas[[i]]$period[datas[[i]]$date == "2015"] <- "new"
  }
  return(datas)
}

library(lubridate)
forLoopDate2 <- function(datas) {
  for (i in seq_along(datas)) {
    datas[[i]]$period <- rep("old", nrow(datas[[i]]))
    datas[[i]]$period[year(datas[[i]]$date) == 2015] <- "new"
  }
  return(datas)
}

library(microbenchmark)
mbm <- microbenchmark(
  date = datas <- forLoopDate(datasDate), 
  string = datas <- forLoopString(datasString),
  lubridate = datas <- forLoopDate2(datasDate),
  times = 10L)

> mbm
Unit: seconds
      expr       min        lq      mean    median       uq       max neval
      date 41.502728 41.561497 41.649533 41.652306 41.69218 41.875110    10
    string  4.119266  4.131186  4.167809  4.166946  4.17993  4.239481    10
 lubridate  2.088281  2.105413  2.133042  2.111710  2.15794  2.250739    10

      

This raises several questions:

- Why is formatting / converting Date objects slow in R?

- Can I improve the performance of my code with base R? What are the best practices in terms of performance when working with dates / datetimes?

Thanks!

1 answer


The format function can return many different formats, so you might expect it to be rather slow. If you are happy with lubridate's year function, you can simply use its (very simple) code:

as.POSIXlt(x, tz = tz(x))$year + 1900
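
For Date input, the same idea works without loading lubridate, since as.POSIXlt is base R. The following is a minimal sketch applied to the question's task; year_of and add_period are hypothetical helper names introduced here, not part of the original post or of any package.

```r
# Hypothetical helper: extract the year as a number, with no formatting
# to character. as.POSIXlt() stores years as an offset from 1900.
year_of <- function(x) as.POSIXlt(x)$year + 1900L

# Hypothetical helper: add the "period" column to every data frame in a list.
add_period <- function(datas) {
  lapply(datas, function(df) {
    df$period <- ifelse(year_of(df$date) == 2015, "new", "old")
    df
  })
}

# Tiny usage example
small <- list(
  data.frame(date = as.Date(c("2014-12-31", "2015-01-01", "2015-06-15")))
)
res <- add_period(small)
res[[1]]$period
# [1] "old" "new" "new"
```

This skips format.Date (and therefore format.POSIXlt, which dominated the profile in the question), so it should behave much like the lubridate variant, although it has not been benchmarked here.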

      



In general, you should avoid conversions between any types / classes and character strings when performance matters, as they are often slow. It is better to do numeric calculations. For example, you could work with the integers that underlie Date variables, but that leads to trouble with leap years, so it is better to use POSIXlt, which takes care of this for you.
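
As an illustration of staying numeric, here is a sketch of a different technique from the one in the answer: comparing the Date column against cutoff dates. A Date is stored as the number of days since 1970-01-01, so the comparison is a plain numeric one, and because the cutoffs are themselves built with as.Date, no leap-year arithmetic is needed. The variable names (start_2015, start_2016, dates) are assumptions made for this example.

```r
# Cutoffs built once; as.Date() handles the calendar for us.
start_2015 <- as.Date("2015-01-01")
start_2016 <- as.Date("2016-01-01")

dates <- as.Date(c("2012-02-29", "2014-12-31", "2015-01-01", "2015-12-31"))

# Underlying representation: days since 1970-01-01 (a numeric vector)
unclass(dates)

# Same rule as the question's code (year == 2015 gives "new"),
# expressed as a numeric range check with no conversion at all.
period <- ifelse(dates >= start_2015 & dates < start_2016, "new", "old")
period
# [1] "old" "old" "new" "new"
```

Whether this is faster than the POSIXlt route on the 40 data frames from the question would need to be measured, for example with microbenchmark as in the question.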
