How can I improve this R function

Question

I am new to R. I wrote the function below to calculate the mean of a pollutant from data stored in 332 CSV files. I would like advice on how to improve this code. It takes about 38 seconds to run, which makes me think it is not very efficient.

pollutantmean <- function(directory, pollutant, id = 1:332) {
        files_list <- list.files(directory, full.names = TRUE) # creates a list of files
        dat <- data.frame() # creates an empty data frame
        for (i in id) {
                dat <- rbind(dat, read.csv(files_list[i])) # combine all the monitor data
        }
        good <- complete.cases(dat) # flag rows that contain no NA values
        mean(dat[good, pollutant]) # calculate the mean
} # run time ~37 sec - NEED TO OPTIMISE THE CODE

r

RiskyB Apr 19 15 at 12:16

2 answers

Colonel Beauvel · Answer 1 · 2015-04-19T12:21:41+0000

Instead of creating an empty data.frame and calling rbind on every iteration of the for loop, you can keep all the data.frames in a list and combine them in a single step. You can also use the na.rm option of the mean function to ignore NA values.

pollutantmean <- function(directory, pollutant, id = 1:332)
{
    files_list = list.files(directory, full.names = TRUE)[id] 
    df         = do.call(rbind, lapply(files_list, read.csv))

    mean(df[[pollutant]], na.rm=TRUE)
}
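
One detail worth checking before swapping the versions: the question's code drops every row that has an NA in any column (complete.cases), while na.rm = TRUE only drops NAs in the requested pollutant column, so the two can return different means. The sketch below illustrates this on made-up data. The temporary directory, the file names, the column names (sulfate, nitrate, ID) and the helper pollutantmean_complete are assumptions for illustration only, not part of the original dataset.

```r
# Hypothetical demo data: three small CSV files in a temporary directory.
demo_dir <- file.path(tempdir(), "pollutant_demo")
dir.create(demo_dir, showWarnings = FALSE)

monitors <- list(
  data.frame(sulfate = c(1, 2, NA), nitrate = c(NA, 4, 6), ID = 1),
  data.frame(sulfate = c(3, NA),    nitrate = c(5, 7),     ID = 2),
  data.frame(sulfate = c(4, 5),     nitrate = c(NA, 8),    ID = 3)
)
for (i in seq_along(monitors)) {
  write.csv(monitors[[i]],
            file.path(demo_dir, sprintf("%03d.csv", i)),
            row.names = FALSE)
}

# List-based version from the answer: ignores NAs in `pollutant` only.
pollutantmean <- function(directory, pollutant, id = 1:332) {
  files_list <- list.files(directory, full.names = TRUE)[id]
  df <- do.call(rbind, lapply(files_list, read.csv))
  mean(df[[pollutant]], na.rm = TRUE)
}

# Assumed variant that keeps the question's rule:
# drop rows with an NA in any column before averaging.
pollutantmean_complete <- function(directory, pollutant, id = 1:332) {
  files_list <- list.files(directory, full.names = TRUE)[id]
  df <- do.call(rbind, lapply(files_list, read.csv))
  mean(df[complete.cases(df), pollutant])
}

res_na_rm    <- pollutantmean(demo_dir, "sulfate", id = 1:3)
res_complete <- pollutantmean_complete(demo_dir, "sulfate", id = 1:3)
print(c(na_rm = res_na_rm, complete_cases = res_complete))
```

On this toy data the first function averages the five non-missing sulfate values, while the second averages only the three rows where both pollutants are present. Which behaviour is wanted depends on the task, so it is worth deciding explicitly rather than treating the two as interchangeable.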

Optionally, I would improve readability with magrittr:

library(magrittr)

pollutantmean <- function(directory, pollutant, id = 1:332)
{
    list.files(directory, full.names = TRUE)[id] %>%
        lapply(read.csv) %>%
        do.call(rbind,.) %>%
        extract2(pollutant) %>%
        mean(na.rm=TRUE)
}

Rentrop · Answer 2 · 2015-04-19T13:49:07+0000

You can improve it using the data.table function fread (see the question on quickly reading very large tables as data frames in R). Binding the results with data.table::rbindlist is also faster.

require(data.table)    

pollutantmean <- function(directory, pollutant, id = 1:332) {
    files_list = list.files(directory, full.names = TRUE)[id]
    DT = rbindlist(lapply(files_list, fread))
    mean(DT[[pollutant]], na.rm=TRUE) 
}
