How can I improve the performance of my data-cleansing code (currently using ddply) by using data.table?
I am trying to clean data using ddply, but it is very slow on 1.3M rows.
Sample code:
#Create Sample Data Frame
num_rows <- 10000
df <- data.frame(id=sample(1:20, num_rows, replace=T),
                 Consumption=sample(-20:20, num_rows, replace=T),
                 StartDate=as.Date(sample(15000:15020, num_rows, replace=T),
                                   origin = "1970-01-01"))
df$EndDate <- df$StartDate + 90
#df <- df[order(df$id, df$StartDate, df$Consumption),]
# Record whether values are negative:
# needed so ddply can match rows with equal but opposite Consumption values
df$Neg <- ifelse(df$Consumption < 0, -1, 1)
df$Consumption <- abs(df$Consumption)
I wrote a function to delete rows where one row's consumption value is equal but opposite in sign to another row's consumption value (for the same id).
#Remove rows from a data frame where there is an equal but opposite consumption value
#Should ensure only one negative value is removed for each positive one.
clean_negatives <- function(x3) {
  copies <- abs(sum(x3$Neg))
  sgn <- ifelse(sum(x3$Neg) < 0, -1, 1)
  x3 <- x3[0:copies, ]
  x3$Consumption <- sgn * x3$Consumption
  x3$Neg <- NULL
  x3
}
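For example, on a small hypothetical group (already grouped the way ddply will group it, with Consumption made absolute and Neg holding the original sign), the function keeps one row for each unmatched value:

# Hypothetical single group: two rows that were originally +5 and one that was -5
example_group <- data.frame(id = 1,
                            Consumption = c(5, 5, 5),
                            StartDate = as.Date("2011-01-26"),
                            EndDate = as.Date("2011-01-26") + 90,
                            Neg = c(1, 1, -1))
clean_negatives(example_group)
# One +5/-5 pair cancels out, leaving a single row with Consumption = 5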
Then I use ddply to apply this function and remove those erroneous rows from the data:
ptm <- proc.time()
df_cleaned <- ddply(df, .(id, StartDate, EndDate, Consumption),
                    function(x) clean_negatives(x))
proc.time() - ptm
I was hoping I could use data.table to speed this up, but I couldn't figure out how to apply it here.
With 1.3M rows, it has been running on my desktop all day and still hasn't finished.
Your question asks about a data.table implementation, so that is what I show here. Your function can also be greatly simplified: you can first obtain the sign by summing Neg, then filter the table, and then multiply Consumption by that sign (as shown below).
require(data.table)
# convert the data.frame to a data.table keyed by the grouping columns
dt <- data.table(df, key = c("id", "StartDate", "EndDate", "Consumption"))
# first obtain the sign directly
dt <- dt[, sign := sign(sum(Neg)), by = c("id", "StartDate", "EndDate", "Consumption")]
# then filter by abs(sum(Neg))
dt.fil <- dt[, .SD[seq_len(abs(sum(Neg)))], by = c("id", "StartDate", "EndDate", "Consumption")]
# modify for the final output (the line below was commented out after Statquant's comment; use := instead)
# dt.fil$Consumption <- dt.fil$Consumption * dt.fil$sign
dt.fil[, Consumption := (Consumption*sign)]
dt.fil <- subset(dt.fil, select=-c(Neg, sign))
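If you want a quick sanity check that this produces the same result as your ddply version, a rough check like the following (assuming df_cleaned from your ddply run is still in your workspace) should return TRUE on both lines:

# Compare row counts and the multiset of Consumption values between the two results
nrow(dt.fil) == nrow(df_cleaned)
all.equal(sort(dt.fil$Consumption), sort(df_cleaned$Consumption))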
Benchmarking
- Data with a million rows:

#Create Sample Data Frame
num_rows <- 1e6
df <- data.frame(id=sample(1:20, num_rows, replace=T),
                 Consumption=sample(-20:20, num_rows, replace=T),
                 StartDate=as.Date(sample(15000:15020, num_rows, replace=T),
                                   origin = "1970-01-01"))
df$EndDate <- df$StartDate + 90
df$Neg <- ifelse(df$Consumption < 0, -1, 1)
df$Consumption <- abs(df$Consumption)
- Function with data.table:

FUN.DT <- function() {
  require(data.table)
  dt <- data.table(df, key = c("id", "StartDate", "EndDate", "Consumption"))
  dt <- dt[, sign := sign(sum(Neg)), by = c("id", "StartDate", "EndDate", "Consumption")]
  dt.fil <- dt[, .SD[seq_len(abs(sum(Neg)))],
               by = c("id", "StartDate", "EndDate", "Consumption")]
  dt.fil[, Consumption := (Consumption*sign)]
  dt.fil <- subset(dt.fil, select = -c(Neg, sign))
}
- Your function with ddply:

FUN.PLYR <- function() {
  require(plyr)
  clean_negatives <- function(x3) {
    copies <- abs(sum(x3$Neg))
    sgn <- ifelse(sum(x3$Neg) < 0, -1, 1)
    x3 <- x3[0:copies, ]
    x3$Consumption <- sgn * x3$Consumption
    x3$Neg <- NULL
    x3
  }
  df_cleaned <- ddply(df, .(id, StartDate, EndDate, Consumption),
                      function(x) clean_negatives(x))
}
- Benchmarking with rbenchmark (only 1 run):

require(rbenchmark)
benchmark(FUN.DT(), FUN.PLYR(), replications = 1, order = "elapsed")

        test replications elapsed relative user.self sys.self user.child sys.child
1   FUN.DT()            1   6.137    1.000     5.926    0.211          0         0
2 FUN.PLYR()            1 242.268   39.477   152.855   82.881          0         0
My data.table implementation is about 39 times faster than your current plyr implementation (I compare mine against yours because the two functions are not identical).
Note: I loaded the packages inside the functions so that the benchmark measures the full time needed to get the result. For the same reason, I also converted the data.frame to a keyed data.table inside the benchmarked function. This is therefore the minimum speedup you can expect.
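If you instead build the keyed data.table once up front, you can time only the grouped computation itself. A rough sketch (FUN.DT.GROUPONLY is just an illustrative name, not from any package) might look like this:

require(data.table)
# build the keyed data.table once, outside the timed call
dt <- data.table(df, key = c("id", "StartDate", "EndDate", "Consumption"))

FUN.DT.GROUPONLY <- function() {
  dt[, sign := sign(sum(Neg)), by = c("id", "StartDate", "EndDate", "Consumption")]
  dt.fil <- dt[, .SD[seq_len(abs(sum(Neg)))],
               by = c("id", "StartDate", "EndDate", "Consumption")]
  dt.fil[, Consumption := Consumption * sign]
  subset(dt.fil, select = -c(Neg, sign))
}

system.time(FUN.DT.GROUPONLY())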