Optimization for a loop with nested if in R

I am trying to combine multiple CSV files into one data frame and then manipulate the result with a for loop. The resulting data frame can have between 1,500,000 and 2,000,000 rows.

I am using the code below for this:

setwd("D:/Projects")
library(dplyr)
library(readr)
merge_data = function(path) 
{ 
  files = dir(path, pattern = '\\.csv', full.names = TRUE)
  tables = lapply(files, read_csv)
  do.call(rbind, tables)
}


Data = merge_data("D:/Projects")
Data1 = cbind(Data[,c(8,9,17)],Category = "",stringsAsFactors=FALSE)
head(Data1)

for (i in 1:nrow(Data1))
{ 
  Data1$Category[i] = ""
  Data1$Category[i] = ifelse(Data1$Days[i] <= 30, "<30",
                       ifelse(Data1$Days[i] <= 60, "31-60",
                       ifelse(Data1$Days[i] <= 90, "61-90",">90")))     

}

      

However, the code takes a very long time. Is there a better and quicker way to accomplish the same operation?



3 answers


We can make this faster by reading the files with fread from data.table and then using cut/findInterval to create the categories. The gains become even more pronounced when this runs on multiple cores or server nodes, where fread can be used on all nodes and run in parallel.



library(data.table)
merge_data <- function(path) { 
  files <- dir(path, pattern = '\\.csv', full.names = TRUE)
  rbindlist(lapply(files, fread, select = c(8, 9, 17)))
}

Data <- merge_data("D:/Projects")
Data[, Category := cut(Days, breaks = c(-Inf, 30, 60, 90, Inf), 
      labels = c("<=30", "31-60", "61-90", ">90"))]
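
The answer mentions findInterval as an alternative to cut; a minimal base-R sketch of that approach (using a small made-up days vector for illustration):

```r
days <- c(5, 30, 45, 90, 120)           # made-up sample values
labels <- c("<=30", "31-60", "61-90", ">90")
# findInterval returns 0 for values below the first break, so the
# breaks are the lower edges of bins 2..4 and we shift the index by 1.
idx <- findInterval(days, c(31, 61, 91)) + 1L
category <- labels[idx]
category
#> [1] "<=30"  "<=30"  "31-60" "61-90" ">90"
```

Unlike cut, findInterval returns plain integer positions rather than a factor, which can be marginally faster on very large vectors.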

      



You are already using dplyr, so why not just:



Data = merge_data("D:/Projects") %>%
  select(8, 9, 17) %>%
  mutate(Category = cut(Days,
                        breaks = c(-Inf, 30, 60, 90, Inf), 
                        labels = c("<=30", "31-60", "61-90", ">90")))

      



Akrun is indeed correct that fread is significantly faster than read.csv.

However, beyond his answer, I would add that your for loop is completely unnecessary. He replaced it with cut/findInterval, which I am not familiar with, but in terms of plain R programming, a for loop is only needed when something in your calculation changes row by row. That is not the case in your code, so there is no need for a loop.

Basically, you are performing the calculation up to 2 million times when you only need to perform it once on the whole column.

You can replace the for loop like this:

Data1$Category = ifelse(Data1$Days <= 30, "<=30",
                 ifelse(Data1$Days <= 60, "31-60",
                 ifelse(Data1$Days <= 90, "61-90", ">90")))

      

and your code will run much faster.
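
For anyone comparing the two approaches, a quick base-R check (on a small made-up vector) that the nested ifelse and the cut call from the first answer produce the same categories:

```r
days <- c(5, 30, 45, 90, 120)  # made-up sample values
via_ifelse <- ifelse(days <= 30, "<=30",
              ifelse(days <= 60, "31-60",
              ifelse(days <= 90, "61-90", ">90")))
# cut's intervals are right-closed by default, matching the <= tests above
via_cut <- as.character(cut(days, breaks = c(-Inf, 30, 60, 90, Inf),
                            labels = c("<=30", "31-60", "61-90", ">90")))
identical(via_ifelse, via_cut)
#> [1] TRUE
```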
