Why is R (in my example) so slow at handling dates / times?
I have a list of 40 data frames with about 250K rows each, and I want to add a new variable to each data frame. This new variable, period, is calculated from another variable that contains Date objects. The conversion is very simple: if the year part of the date is earlier than 2015, period is set to "old", otherwise "new".
I thought the computation would be very fast using vectorization, but it takes about 41 seconds! (A for loop and lapply give the same performance.)
Reproducible example:
datas.d <- function(nDf, nRow) {
  lapply(seq_len(nDf), function(x) {
    data.frame(
      id1 = sample(7e8:9e8, nRow),
      id2 = sample(1e9, nRow),
      id3 = sample(1e9, nRow),
      date = sample(seq(as.Date("2012-01-01"), Sys.Date(), by = 1), nRow, replace = TRUE),
      code1 = sample(10, nRow, replace = TRUE),
      code2 = sample(10, nRow, replace = TRUE),
      code3 = sample(10, nRow, replace = TRUE)
    )
  })
}
datasDate <- datas.d(40, 25e4)
forLoopDate <- function(datas) {
  for (i in seq_along(datas)) {
    datas[[i]]$period <- rep("old", nrow(datas[[i]]))
    datas[[i]]$period[format(datas[[i]]$date, "%Y") == "2015"] <- "new"
  }
  return(datas)
}
> system.time(forLoopDate(datasDate))
   user  system elapsed
  41.46    0.31   41.84
I had already seen slow execution when coercing values to dates in an 800K-row data frame, so I suspected that date manipulation was to blame for the poor performance. The R profiler confirmed this:
Rprof(tmp <- tempfile())
datas <- forLoopDate(datasDate)
Rprof(NULL)
summaryRprof(tmp)
$by.self
                  self.time self.pct total.time total.pct
"format.POSIXlt"      39.34    94.16      39.34     94.16
"as.POSIXlt.Date"      1.80     4.31       1.80      4.31
"=="                   0.36     0.86       0.36      0.86
"forLoopDate"          0.22     0.53      41.78    100.00
"format.Date"          0.06     0.14      41.20     98.61
So I tried the same conversion while skipping date formatting, i.e. using the year directly. I also tested another function, year from the lubridate package, which is very fast, I guess because it works at the C level? The performance gain is unambiguous:
datas.s <- function(nDf, nRow) {
  lapply(seq_len(nDf), function(x) {
    data.frame(
      id1 = sample(7e8:9e8, nRow),
      id2 = sample(1e9, nRow),
      id3 = sample(1e9, nRow),
      date = sample(2012:2015, nRow, replace = TRUE),
      code1 = sample(10, nRow, replace = TRUE),
      code2 = sample(10, nRow, replace = TRUE),
      code3 = sample(10, nRow, replace = TRUE)
    )
  })
}
datasString <- datas.s(40, 25e4)
forLoopString <- function(datas) {
  for (i in seq_along(datas)) {
    datas[[i]]$period <- rep("old", nrow(datas[[i]]))
    datas[[i]]$period[datas[[i]]$date == "2015"] <- "new"
  }
  return(datas)
}
library(lubridate)
forLoopDate2 <- function(datas) {
  for (i in seq_along(datas)) {
    datas[[i]]$period <- rep("old", nrow(datas[[i]]))
    datas[[i]]$period[year(datas[[i]]$date) == 2015] <- "new"
  }
  return(datas)
}
library(microbenchmark)
mbm <- microbenchmark(
  date = datas <- forLoopDate(datasDate),
  string = datas <- forLoopString(datasString),
  lubridate = datas <- forLoopDate2(datasDate),
  times = 10L)
> mbm
Unit: seconds
      expr       min        lq      mean    median       uq       max neval
      date 41.502728 41.561497 41.649533 41.652306 41.69218 41.875110    10
    string  4.119266  4.131186  4.167809  4.166946  4.17993  4.239481    10
 lubridate  2.088281  2.105413  2.133042  2.111710  2.15794  2.250739    10
So several questions arise:
- Why is formatting / converting dates so slow in R?
- Can I improve the performance of my code with base R? What are the best practices in terms of performance when working with dates / times?
Thanks!
Since format can return many different formats, you would expect it to be rather slow. If you are happy with lubridate's year function, you can simply use its (very simple) code:
as.POSIXlt(x, tz = tz(x))$year + 1900
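For a Date column, the same idea seems to work with base R alone, since as.POSIXlt has a method for Date objects. A minimal sketch, where year_of is a hypothetical helper name (not part of any package) and dropping the tz argument is an assumption that is only reasonable for Date input, not for POSIXct:

```r
# Base-R year extraction for Date vectors, without format() or lubridate.
# year_of() is a hypothetical helper name used only for illustration.
year_of <- function(x) as.POSIXlt(x)$year + 1900L  # $year counts from 1900

d <- as.Date(c("2012-02-29", "2014-12-31", "2015-01-01"))

# Same rule as in the question: "new" for 2015, "old" otherwise.
period <- ifelse(year_of(d) == 2015L, "new", "old")
```

Inside the question's loop, the condition would then read year_of(datas[[i]]$date) == 2015 instead of the format() call.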
In general, you should avoid conversions between any type / class and character strings when performance matters. These will often be slow. It is better to do numeric calculations. For example, you could work with the numbers underlying Date variables (days since 1970-01-01), but that leads to problems with leap years, so it is better to use POSIXlt, which takes care of this for you.
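One way to stay numeric without handling leap years by hand is to compare the dates against year boundaries that are themselves Date objects. This is a sketch under that assumption rather than something spelled out above; the names start_2015 and end_2015 are illustrative:

```r
# Comparing Date objects compares their underlying day counts, so no
# character conversion happens. The boundaries are built once with as.Date().
start_2015 <- as.Date("2015-01-01")
end_2015   <- as.Date("2016-01-01")

d <- as.Date(c("2014-12-31", "2015-01-01", "2015-12-31", "2016-01-01"))

# TRUE only for dates that fall inside the year 2015.
in_2015 <- d >= start_2015 & d < end_2015

period <- rep("old", length(d))
period[in_2015] <- "new"
```

Because the calendar arithmetic is done only when building the two boundaries, the per-row work reduces to two vectorized comparisons.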