Calculating the average of every two consecutive values in a data frame

I would like to aggregate every two consecutive cell values by their mean, and repeat this down the whole column of the data frame. To be more precise, see the following piece of data:

    X         Y             Z
1   FRI 200101010000    -6.72
2   FRI 200101010030    -6.30
3   FRI 200101010100    -6.26
4   FRI 200101010130    -5.82
5   FRI 200101010200    -5.64
6   FRI 200101010230    -5.29
7   FRI 200101010300    -5.82
8   FRI 200101010330    -5.83
9   FRI 200101010400    -5.83
10  FRI 200101010430    -6.04
11  FRI 200101010500    -5.80
12  FRI 200101010530    -6.09

      

I would like to average Z over each pair of Y values ending in 00 and 30, i.e. the mean of rows 1 and 2, rows 3 and 4, rows 5 and 6, and so on. This is the output I expect:

    X         Y             Z
1   FRI 200101010100    -6.51
2   FRI 200101010200    -6.04
3   FRI 200101010300    -5.47
...

      

Explanation: Y is the time in the format YYYYMMDDhhmm, and I would like to average the 30-minute measurements into hourly measurements.
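To make the answers easier to try out, the sample shown in the question can be rebuilt roughly as follows. This is only a sketch: storing Y as character is an assumption (the question does not say how the column is typed), chosen so that substr() on Y behaves predictably.

```r
# Rebuild the sample data from the question.
# Assumption: Y is kept as character (YYYYMMDDhhmm); the original type is not stated.
df <- data.frame(
  X = "FRI",
  Y = c("200101010000", "200101010030", "200101010100", "200101010130",
        "200101010200", "200101010230", "200101010300", "200101010330",
        "200101010400", "200101010430", "200101010500", "200101010530"),
  Z = c(-6.72, -6.30, -6.26, -5.82, -5.64, -5.29,
        -5.82, -5.83, -5.83, -6.04, -5.80, -6.09),
  stringsAsFactors = FALSE
)

# Quick look at the structure
str(df)
```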

+3




4 answers


Here's a possible solution using data.table:

library(data.table)
setDT(df)[, .(Y = Y[1L], Z = mean(Z)), by = .(X, indx = cumsum(substr(Y, 11, 12) == '00'))]
#      X indx            Y      Z
# 1: FRI    1 200101010000 -6.510
# 2: FRI    2 200101010100 -6.040
# 3: FRI    3 200101010200 -5.465
# 4: FRI    4 200101010300 -5.825
# 5: FRI    5 200101010400 -5.935
# 6: FRI    6 200101010500 -5.945

      



Or, per @akrun's comment, using aggregate from base R (although the output will probably need some extra tweaking):

aggregate(Z ~ X + indx, transform(df, indx = cumsum(substr(Y, 11, 12) == '00')), mean)
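One possible form of that extra tweaking is to attach the first Y of each group as its label, which the aggregate call above drops. The following is a minimal sketch on the first four rows of the question's data; the helper names tmp, first_y and res are made up for illustration, and Y is assumed to be stored as character.

```r
# First four rows of the question's data (Y assumed to be character).
df <- data.frame(
  X = "FRI",
  Y = c("200101010000", "200101010030", "200101010100", "200101010130"),
  Z = c(-6.72, -6.30, -6.26, -5.82),
  stringsAsFactors = FALSE
)

# Same grouping index as above: a new group starts at every Y ending in "00".
tmp <- transform(df, indx = cumsum(substr(Y, 11, 12) == "00"))

# Mean of Z per group.
res <- aggregate(Z ~ X + indx, tmp, mean)

# First Y of each group, computed with the same grouping columns.
first_y <- aggregate(tmp["Y"], by = tmp[c("X", "indx")], FUN = function(y) y[1])

# Combine both pieces; the result has the columns X, indx, Y, Z.
res <- merge(first_y, res, by = c("X", "indx"))
res
```

The indx column can be dropped afterwards with res$indx <- NULL if it is not needed.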

      

+4




A base R solution: I first split the vector into pieces of two and calculate the mean of each. This of course assumes the rows are always in the order you showed. Finally, I combine the means with the X and Y columns to give your result:

Z <- unlist(lapply(split(df$Z, ceiling(seq_along(df$Z) / 2)), mean))
new_df <- cbind(df[seq(1,nrow(df), 2), c("X", "Y")], Z)

      



output:

     X            Y      Z
1  FRI 200101010000 -6.510
3  FRI 200101010100 -6.040
5  FRI 200101010200 -5.465
7  FRI 200101010300 -5.825
9  FRI 200101010400 -5.935
11 FRI 200101010500 -5.945

      

+2




dplyr version

library(dplyr)

df$Y <- as.character(df$Y)

means <- df %>%
        group_by(hour = substr(Y, start = 1, stop=10)) %>% summarise(Z = mean(Z))

> means
Source: local data frame [6 x 2]

        hour      Z
1 2001010100 -6.510
2 2001010101 -6.040
3 2001010102 -5.465
4 2001010103 -5.825
5 2001010104 -5.935
6 2001010105 -5.945

      

This groups the data by Y with its last two digits (the minutes) removed, i.e. by hour.

+1




While this is not the case for the OP, in general, to aggregate by second when you have a POSIXct column, you can:

library(lubridate)
library(tidyverse)
s <- seq(from=Sys.time(), length.out = 100, by=0.4)
df  = data.frame(time=s,v=rnorm(length(s)))
df %>% 
  group_by(time=floor_date(time, "1 second")) %>% 
  summarize(v=mean(v)) # any other interval works too, e.g. "5 minutes"
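The same idea can be applied to the question's data by first parsing Y into a POSIXct column. Below is a minimal base R sketch on the first four rows; the UTC time zone is an assumption (the question does not state one), and the column names time and hour are made up for illustration.

```r
# First four rows of the question's data (Y assumed to be character).
df <- data.frame(
  Y = c("200101010000", "200101010030", "200101010100", "200101010130"),
  Z = c(-6.72, -6.30, -6.26, -5.82),
  stringsAsFactors = FALSE
)

# Parse YYYYMMDDhhmm into a date-time. UTC is an assumption.
df$time <- as.POSIXct(df$Y, format = "%Y%m%d%H%M", tz = "UTC")

# Truncate each timestamp to the start of its hour.
df$hour <- as.POSIXct(trunc(df$time, units = "hours"))

# Average Z within each hour.
hourly <- aggregate(Z ~ hour, data = df, FUN = mean)
hourly
```

With lubridate loaded, floor_date(time, "1 hour") could replace the trunc() step.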

      

0

