Optimization For a loop with nested, if in R

Question

Optimization For a loop with nested, if in R

I am trying to combine multiple csv files into one dataframe and trying to manipulate the resulting dataframe with a for loop. The resulting data frame can have between 1,500,000 and 2,000,000 rows.

I am using below code for this.

setwd("D:/Projects")
library(dplyr)
library(readr)
merge_data = function(path) 
{ 
  files = dir(path, pattern = '\\.csv', full.names = TRUE)
  tables = lapply(files, read_csv)
  do.call(rbind, tables)
}


Data = merge_data("D:/Projects")
Data1 = cbind(Data[,c(8,9,17)],Category = "",stringsAsFactors=FALSE)
head(Data1)

for (i in 1:nrow(Data1))
{ 
  Data1$Category[i] = ""
  Data1$Category[i] = ifelse(Data1$Days[i] <= 30, "<30",
                       ifelse(Data1$Days[i] <= 60, "31-60",
                       ifelse(Data1$Days[i] <= 90, "61-90",">90")))     

}

However, the code takes a very long time. Is there a better and quicker way to accomplish the same operation?

+3

for-loop r nested-loops

Srivathsan V Apr 24 17 at 17:08

source to share

3 answers

You are already using dplyr

, so why not just:

Data = merge_data("D:/Projects") %>%
  select(8, 9, 17) %>%
  mutate(Category = cut(Days,
                        breaks = c(-Inf, 30, 60, 90, Inf), 
                        labels = c("<=30", "31-60", "61-90", ">90"))

+1

ulfelder Apr 24 17 at 17:46

source to share

Akrun is indeed correct that fread is significantly faster than read.csv.

However, apart from his post, I would add that your for loop is completely unnecessary. He replaced it with cut / findInterval, which I am not familiar with. In terms of simple R programming, although for loops it is necessary when some factor in your calculation changes line by line. However, in your code this is not the case and there is no need for a for loop.

Basically, you are performing the calculation up to 2 million times when you only need to perform the calculation on a column once.

You can replace the for loop like this:

Data1$category = ifelse(Data1$Days <= 30, "<=30",
                 ifelse(Data1$Days <= 60, "31-60",
                 ifelse(Data1$Days <= 90, "61-90",">90")))

and your code will run waaaaaay faster

0

Daniel Apr 24 17 at 17:50

source to share

akrun · Accepted Answer · 2017-04-24T17:15:18+0000

We can make this more optimized by reading fread

from data.table

and then using cut/findInterval

. This will become more pronounced when it is running in multiple cores, nodes on the server where fread

all nodes use and run in parallel

library(data.table)
merge_data <- function(path) { 
   files = dir(path, pattern = '\\.csv', full.names = TRUE)
  rbindlist(lapply(files, fread, select = c(8, 9, 17)))
 }

Data <- merge_data("D:/Projects")
Data[, Category := cut(Data1, breaks = c(-Inf, 30, 60, 90, Inf), 
      labels = c("<=30", "31-60", "61-90", ">90"))]

Optimization For a loop with nested, if in R

More articles: