Optimization For a loop with nested, if in R
I am trying to combine multiple csv files into one dataframe and trying to manipulate the resulting dataframe with a for loop. The resulting data frame can have between 1,500,000 and 2,000,000 rows.
I am using below code for this.
setwd("D:/Projects")
library(dplyr)
library(readr)
merge_data = function(path)
{
files = dir(path, pattern = '\\.csv', full.names = TRUE)
tables = lapply(files, read_csv)
do.call(rbind, tables)
}
Data = merge_data("D:/Projects")
Data1 = cbind(Data[,c(8,9,17)],Category = "",stringsAsFactors=FALSE)
head(Data1)
for (i in 1:nrow(Data1))
{
Data1$Category[i] = ""
Data1$Category[i] = ifelse(Data1$Days[i] <= 30, "<30",
ifelse(Data1$Days[i] <= 60, "31-60",
ifelse(Data1$Days[i] <= 90, "61-90",">90")))
}
However, the code takes a very long time. Is there a better and quicker way to accomplish the same operation?
source to share
We can make this more optimized by reading fread
from data.table
and then using cut/findInterval
. This will become more pronounced when it is running in multiple cores, nodes on the server where fread
all nodes use and run in parallel
library(data.table)
merge_data <- function(path) {
files = dir(path, pattern = '\\.csv', full.names = TRUE)
rbindlist(lapply(files, fread, select = c(8, 9, 17)))
}
Data <- merge_data("D:/Projects")
Data[, Category := cut(Data1, breaks = c(-Inf, 30, 60, 90, Inf),
labels = c("<=30", "31-60", "61-90", ">90"))]
source to share
Akrun is indeed correct that fread is significantly faster than read.csv.
However, apart from his post, I would add that your for loop is completely unnecessary. He replaced it with cut / findInterval, which I am not familiar with. In terms of simple R programming, although for loops it is necessary when some factor in your calculation changes line by line. However, in your code this is not the case and there is no need for a for loop.
Basically, you are performing the calculation up to 2 million times when you only need to perform the calculation on a column once.
You can replace the for loop like this:
Data1$category = ifelse(Data1$Days <= 30, "<=30",
ifelse(Data1$Days <= 60, "31-60",
ifelse(Data1$Days <= 90, "61-90",">90")))
and your code will run waaaaaay faster
source to share