Replacing for-loops with apply to improve performance (with weighted.mean)

I'm an R newbie, so hopefully this is a solvable problem for some of you. I have a dataframe containing over a million data points. My goal is to compute a weighted average with a varying starting point.

To illustrate, consider this data frame:

A <- data.frame(matrix(c(1,2,3,2,2,1), 3, 2))

  X1 X2
1  1  2
2  2  2
3  3  1


where X1 is data and X2 is sample weight.

I want to calculate a weighted average of X1 for each starting point, i.e. over rows 1:3, 2:3, and 3:3.

With a loop, I just wrote:

B <- rep(NA, 3) # empty result vector
for (i in 1:3) {
  # shift the starting point of the data and weights toward the end
  B[i] <- weighted.mean(x = A$X1[i:3], w = A$X2[i:3])
}


With my real data this is not feasible: the data.frame is subset again on every iteration, and the computation runs for hours without producing a result.

Is there a way to implement the varying starting point with the apply family of functions to improve performance?

Best regards, Ruben



2 answers

Building on @joran's answer to get the correct result:

with(A, rev(cumsum(rev(X1*X2)) / cumsum(rev(X2))))
# [1] 1.800000 2.333333 3.000000
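
To see why the reversed cumulative sums work: the weighted mean over rows i..n is sum(X1[i:n] * X2[i:n]) / sum(X2[i:n]), and both of these "suffix" sums can be obtained for every i at once by reversing the vector, taking cumsum, and reversing back. Below is a minimal sketch of that idea checked against the original loop. The function name suffix_weighted_mean is made up for illustration, and the sketch assumes there are no NA values and that the weights never sum to zero.

```r
# Example data from the question
A <- data.frame(matrix(c(1, 2, 3, 2, 2, 1), 3, 2))

# Hypothetical helper: weighted mean of x[i:n] for every start index i
suffix_weighted_mean <- function(x, w) {
  num <- rev(cumsum(rev(x * w)))  # sum of x * w over rows i..n
  den <- rev(cumsum(rev(w)))      # sum of w over rows i..n
  num / den
}

res_vec <- suffix_weighted_mean(A$X1, A$X2)

# Reference result: the loop from the question, generalized to n rows
n <- nrow(A)
res_loop <- numeric(n)
for (i in seq_len(n)) {
  res_loop[i] <- weighted.mean(A$X1[i:n], A$X2[i:n])
}

print(res_vec)
print(all.equal(res_vec, res_loop))
```

The vectorized version makes a single pass over the data, whereas the loop re-sums an ever-shrinking tail on each iteration, so its total work grows roughly with the square of the number of rows.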


Also note that this is much faster than the sapply/lapply approach.
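
One way to check the speed difference yourself is a rough timing comparison such as the sketch below. The data are random, the size n = 2000 is an arbitrary choice, and the names dat_big, t_apply and t_cumsum are made up for this example. Timings depend on the machine, so no numbers are quoted here.

```r
set.seed(1)
n <- 2000  # arbitrary size; increase it to make the gap more visible
dat_big <- data.frame(X1 = runif(n), X2 = runif(n))

# sapply/lapply approach: one data frame subset and one weighted.mean per start
t_apply <- system.time(
  res_apply <- sapply(
    lapply(1:n, ":", n),
    function(idx) with(dat_big[idx, ], weighted.mean(X1, X2))
  )
)

# Reversed cumulative sums: a single vectorized pass
t_cumsum <- system.time(
  res_cumsum <- with(dat_big, rev(cumsum(rev(X1 * X2)) / cumsum(rev(X2))))
)

# Elapsed seconds for each approach
print(c(apply = t_apply[["elapsed"]], cumsum = t_cumsum[["elapsed"]]))

# Both approaches should agree up to floating-point tolerance
print(all.equal(res_apply, res_cumsum))
```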




You can use lapply to create your subsets and sapply to iterate over them, but I would bet there is a faster way.

sapply(lapply(1:3, ":", 3), function(x) with(A[x, ], weighted.mean(X1, X2)))
[1] 1.800000 2.333333 3.000000



