R: a faster alternative to period.apply
I have the following data:
Timestamp Weighted_Value SumVal Group
1 1600 800 1
2 1000 1000 2
3 1000 1000 2
4 1000 1000 2
5 800 500 3
6 400 500 3
7 2000 800 4
8 1200 1000 4
I want to calculate sum(Weighted_Value) / sum(SumVal) for each group; for example, for group 3 the result will be (800 + 400) / (500 + 500) = 1.2.
I used period.apply for this:
period.apply(x4, intervalIndex, function(z) sum(z[,4])/sum(z[,2]))
But this is too slow for my application, so I wanted to ask if anyone knows a faster alternative. I've already tried ave, but it seems even slower.
My goal, by the way, is to calculate a weighted average over time, i.e. to convert an irregular time series into a time series with regular intervals.
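To make the interval idea concrete, here is a rough sketch with a made-up two-unit interval width (my real intervals come from intervalIndex, which isn't shown here): derive an interval id from the timestamp and then aggregate by it, the same way the answers below aggregate by Group.
library(data.table)
dt <- data.table(Timestamp      = 1:8,
                 Weighted_Value = c(1600, 1000, 1000, 1000, 800, 400, 2000, 1200),
                 SumVal         = c(800, 1000, 1000, 1000, 500, 500, 800, 1000))
# hypothetical bucket width of 2 time units
dt[, Interval := ceiling(Timestamp / 2)]
dt[, .(WAvg = sum(Weighted_Value) / sum(SumVal)), by = Interval]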
Thanks!
Using rowsum seems to be faster (at least for this small sample dataset) than the data.table approach:
sgibb <- function(datframe) {
  data.frame(Group = unique(datframe$Group),
             Avg = rowsum(datframe$Weighted_Value, datframe$Group) /
                   rowsum(datframe$SumVal, datframe$Group))
}
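Calling it on the sample data should give the same per-group averages as the other answers:
sgibb(df)
# Group averages: 2, 1, 1.2 and 1.777778 (same as the split/sapply answer below)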
Adding the rowsum approach to @plafort's benchmark:
library(microbenchmark)
library(dplyr)
library(data.table)
microbenchmark(
  Nader = df %>%
    group_by(Group) %>%
    summarise(res = sum(Weighted_Value) / sum(SumVal)),
  Henk = setDT(df)[, sum(Weighted_Value) / sum(SumVal), by = Group],
  plafort = weight.avg(df),
  sgibb = sgibb(df)
)
# Unit: microseconds
# expr min lq mean median uq max neval
# Nader 2179.890 2280.462 2583.8798 2399.0885 2497.6000 6647.236 100
# Henk 648.191 693.519 788.1421 726.0940 751.0810 2386.260 100
# plafort 2638.967 2740.541 2935.4756 2785.7425 2909.4640 5000.652 100
# sgibb 347.125 384.830 442.6447 409.2815 441.8935 2039.563 100
Try using dplyr; it should be faster than base R.
library(dplyr)
df <- read.table(text = "Timestamp Weighted_Value SumVal Group
1 1600 800 1
2 1000 1000 2
3 1000 1000 2
4 1000 1000 2
5 800 500 3
6 400 500 3
7 2000 800 4
8 1200 1000 4" , header = T)
df %>%
  group_by(Group) %>%
  summarise(res = sum(Weighted_Value) / sum(SumVal))
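On the sample data this should return one row per Group, with values matching the base R answer below:
# Group 1: 2, Group 2: 1, Group 3: 1.2, Group 4: 1.777778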
Here's a base R solution. It's not the fastest for large (500k+) datasets, but it shows what might be going on under the hood in other functions.
weight.avg <- function(datframe) {
  # split the data frame by Group
  s <- split(datframe, datframe$Group)
  # apply sum(Weighted_Value) / sum(SumVal) to each group
  avg <- sapply(s, function(x) sum(x[, 2]) / sum(x[, 3]))
  # assemble the results into a new data frame
  data.frame(Group = names(avg), Avg = avg)
}
weight.avg(df)
Group Avg
1 1 2.000000
2 2 1.000000
3 3 1.200000
4 4 1.777778
The first line of the function splits the data frame by Group. The second applies the formula to each group. The last builds a new data frame of the results.
Data
df <- read.table(text = "Timestamp Weighted_Value SumVal Group
1 1600 800 1
2 1000 1000 2
3 1000 1000 2
4 1000 1000 2
5 800 500 3
6 400 500 3
7 2000 800 4
8 1200 1000 4" , header = T)
Fastest time
library(microbenchmark)
library(dplyr)
library(data.table)
microbenchmark(
  Nader = df %>%
    group_by(Group) %>%
    summarise(res = sum(Weighted_Value) / sum(SumVal)),
  Henk = setDT(df)[, sum(Weighted_Value) / sum(SumVal), by = Group],
  plafort = weight.avg(df)
)
Unit: microseconds
expr min lq mean median uq max
Nader 2619.174 2827.0100 3094.5570 2949.976 3107.481 7980.684
Henk 783.186 833.7155 932.5883 888.783 944.640 3275.646
plafort 3550.787 3772.4395 4085.2323 3853.561 3995.869 7595.801