How can I improve the performance of my data-cleansing code (currently using ddply) by using data.table?
I am trying to clean data using ddply, but it is very slow on 1.3M rows.
Sample code:
#Create Sample Data Frame
num_rows <- 10000
df <- data.frame(id=sample(1:20, num_rows, replace=T),
                 Consumption=sample(-20:20, num_rows, replace=T),
                 StartDate=as.Date(sample(15000:15020, num_rows, replace=T),
                                   origin = "1970-01-01"))
df$EndDate <- df$StartDate + 90
#df <- df[order(df$id, df$StartDate, df$Consumption),]
# Record whether values are negative:
# needed so ddply can match rows with equal but opposite Consumption values
df$Neg <- ifelse(df$Consumption < 0, -1, 1)
df$Consumption <- abs(df$Consumption)
I wrote a function to delete rows where one row's consumption value is equal but opposite in sign to another row's consumption value (for the same id).
#Remove rows from a data frame where there is an equal but opposite consumption value
#Should ensure only one negative value is removed for each positive one.
clean_negatives <- function(x3) {
  copies <- abs(sum(x3$Neg))
  sgn <- ifelse(sum(x3$Neg) < 0, -1, 1)
  x3 <- x3[0:copies, ]
  x3$Consumption <- sgn * x3$Consumption
  x3$Neg <- NULL
  x3
}
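For example, on a small hypothetical group (already grouped the way ddply will group it, with Consumption made absolute and Neg holding the original sign), the function keeps one row for each unmatched value:

# Hypothetical single group: two rows that were originally +5 and one that was -5
example_group <- data.frame(id = 1,
                            Consumption = c(5, 5, 5),
                            StartDate = as.Date("2011-01-26"),
                            EndDate = as.Date("2011-01-26") + 90,
                            Neg = c(1, 1, -1))
clean_negatives(example_group)
# One +5/-5 pair cancels out, leaving a single row with Consumption = 5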
Then I use ddply to apply this function and remove those erroneous rows from the data:
ptm <- proc.time()
df_cleaned <- ddply(df, .(id, StartDate, EndDate, Consumption),
                    function(x) clean_negatives(x))
proc.time() - ptm
I was hoping I could use data.table to speed this up, but I couldn't figure out how to apply it here.
With 1.3M rows, it has been running on my desktop all day and still hasn't finished.
Your question asks about a data.table implementation, so that is what I show here. Your function can also be greatly simplified: you can first obtain the sign by summing Neg, then filter the table, and then multiply Consumption by that sign (as shown below).
require(data.table)
# convert the data.frame to a data.table keyed by the grouping columns
dt <- data.table(df, key = c("id", "StartDate", "EndDate", "Consumption"))
# first obtain the sign directly
dt <- dt[, sign := sign(sum(Neg)), by = c("id", "StartDate", "EndDate", "Consumption")]
# then filter by abs(sum(Neg))
dt.fil <- dt[, .SD[seq_len(abs(sum(Neg)))], by = c("id", "StartDate", "EndDate", "Consumption")]
# modify for the final output (the line below was commented out after Statquant's comment; use := instead)
# dt.fil$Consumption <- dt.fil$Consumption * dt.fil$sign
dt.fil[, Consumption := (Consumption*sign)]
dt.fil <- subset(dt.fil, select=-c(Neg, sign))
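If you want a quick sanity check that this produces the same result as your ddply version, a rough check like the following (assuming df_cleaned from your ddply run is still in your workspace) should return TRUE on both lines:

# Compare row counts and the multiset of Consumption values between the two results
nrow(dt.fil) == nrow(df_cleaned)
all.equal(sort(dt.fil$Consumption), sort(df_cleaned$Consumption))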
Benchmarking
- Data with a million rows:

#Create Sample Data Frame
num_rows <- 1e6
df <- data.frame(id=sample(1:20, num_rows, replace=T),
                 Consumption=sample(-20:20, num_rows, replace=T),
                 StartDate=as.Date(sample(15000:15020, num_rows, replace=T),
                                   origin = "1970-01-01"))
df$EndDate <- df$StartDate + 90
df$Neg <- ifelse(df$Consumption < 0, -1, 1)
df$Consumption <- abs(df$Consumption)
- Function with data.table:

FUN.DT <- function() {
  require(data.table)
  dt <- data.table(df, key = c("id", "StartDate", "EndDate", "Consumption"))
  dt <- dt[, sign := sign(sum(Neg)), by = c("id", "StartDate", "EndDate", "Consumption")]
  dt.fil <- dt[, .SD[seq_len(abs(sum(Neg)))],
               by = c("id", "StartDate", "EndDate", "Consumption")]
  dt.fil[, Consumption := (Consumption*sign)]
  dt.fil <- subset(dt.fil, select = -c(Neg, sign))
}
- Your function with ddply:

FUN.PLYR <- function() {
  require(plyr)
  clean_negatives <- function(x3) {
    copies <- abs(sum(x3$Neg))
    sgn <- ifelse(sum(x3$Neg) < 0, -1, 1)
    x3 <- x3[0:copies, ]
    x3$Consumption <- sgn * x3$Consumption
    x3$Neg <- NULL
    x3
  }
  df_cleaned <- ddply(df, .(id, StartDate, EndDate, Consumption),
                      function(x) clean_negatives(x))
}
- Benchmarking with rbenchmark (only 1 run):

require(rbenchmark)
benchmark(FUN.DT(), FUN.PLYR(), replications = 1, order = "elapsed")

        test replications elapsed relative user.self sys.self user.child sys.child
1   FUN.DT()            1   6.137    1.000     5.926    0.211          0         0
2 FUN.PLYR()            1 242.268   39.477   152.855   82.881          0         0
My data.table implementation is about 39 times faster than your current plyr implementation (I compare mine against yours because the two functions are not identical).
Note: I loaded the packages inside the functions so that the benchmark measures the full time needed to get the result. For the same reason, I also converted the data.frame to a keyed data.table inside the benchmarked function. This is therefore the minimum speedup you can expect.
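If you instead build the keyed data.table once up front, you can time only the grouped computation itself. A rough sketch (FUN.DT.GROUPONLY is just an illustrative name, not from any package) might look like this:

require(data.table)
# build the keyed data.table once, outside the timed call
dt <- data.table(df, key = c("id", "StartDate", "EndDate", "Consumption"))

FUN.DT.GROUPONLY <- function() {
  dt[, sign := sign(sum(Neg)), by = c("id", "StartDate", "EndDate", "Consumption")]
  dt.fil <- dt[, .SD[seq_len(abs(sum(Neg)))],
               by = c("id", "StartDate", "EndDate", "Consumption")]
  dt.fil[, Consumption := Consumption * sign]
  subset(dt.fil, select = -c(Neg, sign))
}

system.time(FUN.DT.GROUPONLY())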