Calculating the average of every two consecutive values in a data frame
I would like to average every two consecutive values in a column of a data frame, working down the whole column. To be more precise, consider the following data:
X Y Z
1 FRI 200101010000 -6.72
2 FRI 200101010030 -6.30
3 FRI 200101010100 -6.26
4 FRI 200101010130 -5.82
5 FRI 200101010200 -5.64
6 FRI 200101010230 -5.29
7 FRI 200101010300 -5.82
8 FRI 200101010330 -5.83
9 FRI 200101010400 -5.83
10 FRI 200101010430 -6.04
11 FRI 200101010500 -5.80
12 FRI 200101010530 -6.09
I would like to calculate the average of Z for each pair of Y values ending in 00 and 30, that is, the average of rows 1 and 2, rows 3 and 4, rows 5 and 6, and so on. This is the result I expect:
X Y Z
1 FRI 200101010100 -6.51
2 FRI 200101010200 -6.04
3 FRI 200101010300 -5.47
...
Explanation: Y is the time in the format YYYYMMDDhhmm, and I would like to average the 30-minute measurements into 1-hour measurements.
Here's a possible solution using data.table:
library(data.table)
setDT(df)[, .(Y = Y[1L], Z = mean(Z)), by = .(X, indx = cumsum(substr(Y, 11, 12) == '00'))]
# X indx Y Z
# 1: FRI 1 200101010000 -6.510
# 2: FRI 2 200101010100 -6.040
# 3: FRI 3 200101010200 -5.465
# 4: FRI 4 200101010300 -5.825
# 5: FRI 5 200101010400 -5.935
# 6: FRI 6 200101010500 -5.945
Or, per @akrun's comment, using aggregate from base R (although the output will probably need some extra tweaking):
aggregate(Z ~ X + indx, transform(df, indx = cumsum(substr(Y, 11, 12) == '00')), mean)
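One way the "extra tweaking" could look is sketched below. It is only an illustration, not part of the original answer: it rebuilds the sample data from the question (with Y stored as character, which is an assumption), and the names df2, z_mean, first_rows and res are made up for this example. The idea is to take the first Y of each group and merge it back onto the aggregated means.

```r
# Sample data from the question; Y is assumed to be stored as character.
df <- data.frame(
  X = "FRI",
  Y = c("200101010000", "200101010030", "200101010100", "200101010130",
        "200101010200", "200101010230", "200101010300", "200101010330",
        "200101010400", "200101010430", "200101010500", "200101010530"),
  Z = c(-6.72, -6.30, -6.26, -5.82, -5.64, -5.29,
        -5.82, -5.83, -5.83, -6.04, -5.80, -6.09),
  stringsAsFactors = FALSE
)

# A new group starts at every timestamp whose minutes are "00".
df2 <- transform(df, indx = cumsum(substr(Y, 11, 12) == "00"))

# Mean of Z per group, as in the aggregate call above.
z_mean <- aggregate(Z ~ X + indx, df2, mean)

# First row of each group supplies the Y label for that group.
first_rows <- df2[!duplicated(df2[c("X", "indx")]), c("X", "indx", "Y")]

# Join the labels to the means and drop the helper column.
res <- merge(first_rows, z_mean, by = c("X", "indx"))
res <- res[order(res$X, res$indx), c("X", "Y", "Z")]
res
```

The result has one row per hour with the columns X, Y and Z, where Y is the timestamp of the first measurement in each pair.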
A base-R solution: I first split the vector into chunks of two and calculate the mean of each chunk. This of course assumes the rows are always in the order you showed. Finally, I combine the means with the X and Y values from every other row to give your result:
Z <- unlist(lapply(split(df$Z, ceiling(seq_along(df$Z) / 2)), mean))
new_df <- cbind(df[seq(1,nrow(df), 2), c("X", "Y")], Z)
output:
X Y Z
1 FRI 200101010000 -6.510
3 FRI 200101010100 -6.040
5 FRI 200101010200 -5.465
7 FRI 200101010300 -5.825
9 FRI 200101010400 -5.935
11 FRI 200101010500 -5.945
A dplyr version:
library(dplyr)
df$Y <- as.character(df$Y)
means <- df %>%
  group_by(hour = substr(Y, start = 1, stop = 10)) %>%
  summarise(Z = mean(Z))
> means
Source: local data frame [6 x 2]
hour Z
1 2001010100 -6.510
2 2001010101 -6.040
3 2001010102 -5.465
4 2001010103 -5.825
5 2001010104 -5.935
6 2001010105 -5.945
This groups the data by Y with its last two digits (the minutes) removed, so each group covers one hour.
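If the X column and a full Y timestamp should survive the aggregation, a variant along the following lines might work. This is a sketch only: it rebuilds the question's sample data with Y as character, and it assumes dplyr 1.0.0 or later for the .groups argument of summarise().

```r
library(dplyr)

# Sample data from the question; Y is assumed to be stored as character.
df <- data.frame(
  X = "FRI",
  Y = c("200101010000", "200101010030", "200101010100", "200101010130",
        "200101010200", "200101010230", "200101010300", "200101010330",
        "200101010400", "200101010430", "200101010500", "200101010530"),
  Z = c(-6.72, -6.30, -6.26, -5.82, -5.64, -5.29,
        -5.82, -5.83, -5.83, -6.04, -5.80, -6.09),
  stringsAsFactors = FALSE
)

means <- df %>%
  # Group by X and by the timestamp without its minutes (YYYYMMDDhh).
  group_by(X, hour = substr(Y, 1, 10)) %>%
  # Keep the first Y of each hour as the label and average Z.
  # .groups = "drop" assumes dplyr >= 1.0.0.
  summarise(Y = first(Y), Z = mean(Z), .groups = "drop") %>%
  select(X, Y, Z)
means
```

Grouping by X as well keeps measurements from different X values apart if the data contain more than one.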
While this is not the case for the OP, in general, to aggregate by second when you have a POSIXct column, you can do the following:
library(lubridate)
library(tidyverse)
s <- seq(from=Sys.time(), length.out = 100, by=0.4)
df = data.frame(time=s,v=rnorm(length(s)))
df %>%
  group_by(time = floor_date(time, "1 second")) %>%
  summarize(v = mean(v))  # any other interval works too, e.g. "5 minutes"
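The same idea could be applied to the OP's data by first parsing Y into POSIXct. The sketch below is an illustration under assumptions: the time zone "UTC" is a guess (the question does not state one), the data are rebuilt from the question with Y as character, and the name hourly is made up for this example.

```r
library(dplyr)
library(lubridate)

# Sample data from the question; Y is assumed to be stored as character.
df <- data.frame(
  X = "FRI",
  Y = c("200101010000", "200101010030", "200101010100", "200101010130",
        "200101010200", "200101010230", "200101010300", "200101010330",
        "200101010400", "200101010430", "200101010500", "200101010530"),
  Z = c(-6.72, -6.30, -6.26, -5.82, -5.64, -5.29,
        -5.82, -5.83, -5.83, -6.04, -5.80, -6.09),
  stringsAsFactors = FALSE
)

hourly <- df %>%
  # Parse YYYYMMDDhhmm; tz = "UTC" is an assumption, not stated by the OP.
  mutate(time = as.POSIXct(Y, format = "%Y%m%d%H%M", tz = "UTC")) %>%
  # Round each timestamp down to the start of its hour.
  group_by(X, time = floor_date(time, "1 hour")) %>%
  # Average the two half-hourly values in each hour.
  summarise(Z = mean(Z), .groups = "drop")
hourly
```

Working with real date-times rather than substrings also behaves sensibly if a half-hour measurement is missing, since each row is assigned to its hour independently of its neighbours.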