Averaging data based on the current line and previous lines
I have a simple dataset with the following form
df<- data.frame(c(10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20),
c(80, 80, 80, 80, 80, 80, 80, 80, 90, 90, 90, 90, 90, 90, 90, 90, 80, 80, 80, 80, 80, 80, 80, 80, 90, 90, 90, 90, 90, 90, 90, 90),
c(1, 1, 2, 2, 3, 3, 4, 4, 1, 1, 2, 2, 3, 3, 4, 4, 1, 1, 2, 2, 3, 3, 4, 4, 1, 1, 2, 2, 3, 3, 4, 4),
c(25, 75, 20, 40, 60, 50, 20, 10, 20, 30, 40, 60, 25, 75, 20, 40, 5, 5, 2, 4, 6, 5, 2, 1, 2, 3, 4, 6, 2, 7, 2, 4))
colnames(df)<-c("car_number", "year", "marker", "val")
What I am trying to do is quite simple, actually: per car_number, I want to find the average of the values associated with the current marker value and the preceding 3 marker values. So for the example data above, the output I want is
car=10, year=80 1: 50
car=10, year=80 2: 40
car=10, year=80 3: 45
car=10, year=80 4: 37.5
car=10, year=90 1: 31.25
car=10, year=90 2: 36.25
car=10, year=90 3: 35
car=10, year=90 4: 38.75
car=20, year=80 1: 5
car=20, year=80 2: 4
car=20, year=80 3: 4.5
car=20, year=80 4: 3.75
car=20, year=90 1: 3.125
car=20, year=90 2: 3.625
car=20, year=90 3: 3.375
car=20, year=90 4: 3.750
Note that, for simplicity, the example markers above come in pairs. This is not the case with the real data, so I think the general solution will involve some group_by (?)
Any effective solution is appreciated!
Here's a second example of dataset and output:
df<- data.frame(c(10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20),
c(80, 80, 80, 80, 80, 80, 80, 80, 90, 90, 90, 90, 90, 90, 90, 90, 80, 80, 80, 80, 80, 80, 80, 80, 90, 90, 90, 90, 90, 90, 90, 90),
c(1, 2, 2, 2, 3, 3, 4, 4, 1, 1, 2, 2, 3, 3, 3, 4, 1, 1, 1, 2, 3, 3, 4, 4, 4, 1, 2, 2, 3, 3, 3, 4),
c(25, 75, 20, 40, 60, 50, 20, 10, 20, 30, 40, 60, 25, 75, 20, 40, 5, 5, 2, 4, 6, 5, 2, 1, 2, 3, 4, 6, 2, 7, 2, 4))
colnames(df)<-c("car_number", "year", "marker", "val")
And the output (based on the above rules)
car=10, year=80 1: Mean[{25}] = 25
car=10, year=80 2: Mean[{40, 20, 75, 25}] = 40
car=10, year=80 3: Mean[{50, 60, 40, 20, 75, 25}] = 45
car=10, year=80 4: Mean[{10, 20, 50, 60, 40, 20, 75, 25}] = 37.5
car=10, year=90 1: Mean[{30, 20, 10, 20, 50, 60, 40, 20, 75}] = 36.11
car=10, year=90 2: Mean[{60, 40, 30, 20, 10, 20, 50, 60}] = 36.25
car=10, year=90 3: Mean[{20, 75, 25, 60, 40, 30, 20, 10, 20}] = 33.33
car=10, year=90 4: Mean[{40, 20, 75, 25, 60, 40, 30, 20}] = 38.75
car=20, year=80 1: Mean[{2, 5, 5}] = 4
car=20, year=80 2: Mean[{4, 2, 5, 5}] = 4
car=20, year=80 3: Mean[{5, 6, 4, 2, 5, 5}] = 4.5
car=20, year=80 4: Mean[{2, 1, 2, 5, 6, 4, 2, 5, 5}] = 3.55
car=20, year=90 1: Mean[{3, 2, 1, 2, 5, 6, 4}] = 3.29
car=20, year=90 2: Mean[{6, 4, 3, 2, 1, 2, 5, 6}] = 3.625
car=20, year=90 3: Mean[{2, 7, 2, 6, 4, 3, 2, 1, 2}] = 3.22
car=20, year=90 4: Mean[{4, 2, 7, 2, 6, 4, 3}] = 4
The first group_by computes the mean of val for each combination of car_number, year, and marker, and keeps its weight (the number of rows). The second group_by, on car_number alone, lets us pull the lagged means and weights needed to compute the required weighted average:
library(purrr)
library(dplyr)
df %>%
arrange(car_number, year, marker) %>%
group_by(car_number, year, marker) %>%
summarise(mean_1 = mean(val, na.rm = TRUE), weight = n()) %>%
group_by(car_number) %>%
mutate(mean_2 = pmap_dbl(
list(mean_1, lag(mean_1), lag(mean_1, 2), lag(mean_1, 3),
weight, lag(weight), lag(weight, 2), lag(weight, 3)),
~ weighted.mean(c(..1, ..2, ..3, ..4),
c(..5, ..6, ..7, ..8),
na.rm = TRUE)
)) %>%
ungroup()
Result:
# # A tibble: 16 × 6
# car_number year marker mean_1 weight mean_2
# <dbl> <dbl> <dbl> <dbl> <int> <dbl>
# 1 10 80 1 50.0 2 50.000
# 2 10 80 2 30.0 2 40.000
# 3 10 80 3 55.0 2 45.000
# 4 10 80 4 15.0 2 37.500
# 5 10 90 1 25.0 2 31.250
# 6 10 90 2 50.0 2 36.250
# 7 10 90 3 50.0 2 35.000
# 8 10 90 4 30.0 2 38.750
# 9 20 80 1 5.0 2 5.000
# 10 20 80 2 3.0 2 4.000
# 11 20 80 3 5.5 2 4.500
# 12 20 80 4 1.5 2 3.750
# 13 20 90 1 2.5 2 3.125
# 14 20 90 2 5.0 2 3.625
# 15 20 90 3 4.5 2 3.375
# 16 20 90 4 3.0 2 3.750
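Why the weights matter can be seen in a minimal base R sketch, using the car 10, year 80 rows of the second example from the question (marker 1 has one row, marker 2 has three). The variable names are made up for illustration; only mean and weighted.mean from base R are used.

```r
# Second example, car 10, year 80: values behind marker 1 and marker 2
val_m1 <- c(25)
val_m2 <- c(75, 20, 40)

# Group summaries: current group first, then its lag
means   <- c(mean(val_m2), mean(val_m1))
weights <- c(length(val_m2), length(val_m1))

pooled   <- mean(c(val_m2, val_m1))        # mean over all underlying rows
weighted <- weighted.mean(means, weights)  # rebuilt from the group summaries
plain    <- mean(means)                    # ignores the group sizes

# Lags that run off the start of a car are NA; na.rm = TRUE drops each NA mean
# together with the weight in the same position
first_row <- weighted.mean(c(25, NA, NA, NA), c(1, NA, NA, NA), na.rm = TRUE)

print(c(pooled = pooled, weighted = weighted, plain = plain, first_row = first_row))
```

With equal group sizes, as in the first example, the plain and weighted means coincide; they differ only when the markers have different numbers of rows.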
Edit: Alternative syntax for versions of purrr before 0.2.2.9000:
df %>%
arrange(car_number, year, marker) %>%
group_by(car_number, year, marker) %>%
summarise(mean_1 = mean(val, na.rm = TRUE), weight = n()) %>%
group_by(car_number) %>%
mutate(mean_2 = pmap_dbl(
list(mean_1, lag(mean_1), lag(mean_1, 2), lag(mean_1, 3),
weight, lag(weight), lag(weight, 2), lag(weight, 3)),
function(a, b, c, d, e, f, g, h)
weighted.mean(c(a, b, c, d),
c(e, f, g, h),
na.rm = TRUE)
)) %>%
ungroup()
Here is a method with data.table that modifies Frank's suggestion from David Arenburg's answer here.
library(data.table)
# aggregate data by car_number, year, and marker
dfNew <- setDT(df)[, .(val=mean(val)), by=.(car_number, year, marker)]
# calculate average of current and previous three values
dfNew[, val := rowMeans(dfNew[,shift(val, 0:3), by=car_number][, -1], na.rm=TRUE)]
The first line is a standard aggregation call. The second line makes some changes to the rowMeans method in the linked answer. rowMeans is fed a data.table of shifted values, where the shift is done by car_number (thanks to sotos for catching this). That call is chained to a statement that drops the first column (using -1), which is the car_number column returned by the first part of the chain.
This returns
car_number year marker val
1: 10 80 1 50.000
2: 10 80 2 40.000
3: 10 80 3 45.000
4: 10 80 4 37.500
5: 10 90 1 31.250
6: 10 90 2 36.250
7: 10 90 3 35.000
8: 10 90 4 38.750
9: 20 80 1 5.000
10: 20 80 2 4.000
11: 20 80 3 4.500
12: 20 80 4 3.750
13: 20 90 1 3.125
14: 20 90 2 3.625
15: 20 90 3 3.375
16: 20 90 4 3.750
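The shift-then-rowMeans idea can be illustrated without data.table. The following is only a base R sketch of the same technique, not the code above: lag_matrix is a hypothetical helper, and the input is the first five aggregated values for car 20 in the first example.

```r
# First five aggregated values for car 20 (first example)
x <- c(5, 3, 5.5, 1.5, 2.5)

# Hypothetical helper: column k + 1 holds x lagged by k positions, padded with NA
lag_matrix <- function(x, lags) {
  sapply(lags, function(k) c(rep(NA, k), head(x, length(x) - k)))
}

m <- lag_matrix(x, 0:3)   # columns: lag 0, lag 1, lag 2, lag 3

# Row-wise mean of the current value and up to three previous ones;
# na.rm = TRUE shortens the window at the start of the series
rolling <- rowMeans(m, na.rm = TRUE)
print(rolling)
```

Each row of m plays the role of one row of the shifted table, so the row means are the averages over the current and up to three preceding values.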
Just throwing a base R solution into the mix. We can create a custom function using Reduce with accumulate = TRUE, and tail(x, 4) to keep only the last 4 observations (the current one and the 3 before it). All of this happens after we average the dataset by car_number, year, marker, i.e.
f1 <- function(x){
sapply(Reduce(c, x, accumulate = TRUE), function(i) mean(tail(i,4)))
}
dd <- aggregate(val ~ car_number+year+marker, df, mean)
dd <- dd[order(dd$car_number, dd$year, dd$marker),]
dd$new_avg <- with(dd, ave(val, car_number, FUN = f1))
dd
# car_number year marker val new_avg
#1 10 80 1 50.0 50.000
#5 10 80 2 30.0 40.000
#9 10 80 3 55.0 45.000
#13 10 80 4 15.0 37.500
#3 10 90 1 25.0 31.250
#7 10 90 2 50.0 36.250
#11 10 90 3 50.0 35.000
#15 10 90 4 30.0 38.750
#2 20 80 1 5.0 5.000
#6 20 80 2 3.0 4.000
#10 20 80 3 5.5 4.500
#14 20 80 4 1.5 3.750
#4 20 90 1 2.5 3.125
#8 20 90 2 5.0 3.625
#12 20 90 3 4.5 3.375
#16 20 90 4 3.0 3.750
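To see what Reduce with accumulate = TRUE contributes, here is a small step-by-step sketch on the first five aggregated values for car 10 in the first example. The intermediate names (prefixes, windows, rolling) are introduced only for illustration.

```r
# First five aggregated values for car 10 (first example)
x <- c(50, 30, 55, 15, 25)

# accumulate = TRUE keeps every intermediate result of c(), so we get a list of
# growing prefixes: 50; 50, 30; 50, 30, 55; ...
prefixes <- Reduce(c, x, accumulate = TRUE)

# Keep at most the last four elements of each prefix, then average them
windows <- lapply(prefixes, tail, 4)
rolling <- sapply(windows, mean)
print(rolling)
```

Because every prefix is materialised, this is simple but quadratic in the length of the series, which is fine for small groups.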
Given df as your input, you can use dplyr and zoo. Averaging by car_number, year, and marker first, and then grouping over car_number alone for the rolling mean, you can try:
library(dplyr)
library(zoo)
df %>%
group_by(car_number, year, marker) %>%
summarise(mm = mean(val)) %>%
group_by(car_number) %>%
# window width grows 1, 2, 3 at the start of each car, then stays at 4
mutate(rM=rollapply(mm, pmin(row_number(), 4), FUN=mean, align="right"))%>%
select(year, rM)
which gives:
Source: local data frame [16 x 3]
Groups: car_number [2]
car_number year rM
<dbl> <dbl> <dbl>
1 10 80 50.000
2 10 80 40.000
3 10 80 45.000
4 10 80 37.500
5 10 90 31.250
6 10 90 36.250
7 10 90 35.000
8 10 90 38.750
9 20 80 5.000
10 20 80 4.000
11 20 80 4.500
12 20 80 3.750
13 20 90 3.125
14 20 90 3.625
15 20 90 3.375
16 20 90 3.750
You can do it like this:
df %>%
group_by(car_number, year, marker) %>%
summarise(s = sum(val), w = n()) %>% # sum and number of values
group_by(car_number) %>%
mutate(S = cumsum(s) - cumsum(lag(s, 4, default=0))) %>% # sum of last four s
mutate(W = cumsum(w) - cumsum(lag(w, 4, default=0))) %>% # same for the weights
mutate(result = S/W)
Output for the second example:
# Source: local data frame [16 x 8]
# Groups: car_number [2]
#
# car_number year marker s w S W result
# <dbl> <dbl> <dbl> <dbl> <int> <dbl> <int> <dbl>
# 1 10 80 1 25 1 25 1 25.000000
# 2 10 80 2 135 3 160 4 40.000000
# 3 10 80 3 110 2 270 6 45.000000
# 4 10 80 4 30 2 300 8 37.500000
# 5 10 90 1 50 2 325 9 36.111111
# 6 10 90 2 100 2 290 8 36.250000
# 7 10 90 3 120 3 300 9 33.333333
# 8 10 90 4 40 1 310 8 38.750000
# 9 20 80 1 12 3 12 3 4.000000
# 10 20 80 2 4 1 16 4 4.000000
# 11 20 80 3 11 2 27 6 4.500000
# 12 20 80 4 5 3 32 9 3.555556
# 13 20 90 1 3 1 23 7 3.285714
# 14 20 90 2 10 2 29 8 3.625000
# 15 20 90 3 11 3 29 9 3.222222
# 16 20 90 4 4 1 28 7 4.000000
Edit: This is probably more efficient with the RcppRoll package; you can try S = roll_sum(c(0, 0, 0, s), 4) (and the same for W).
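The cumsum difference trick can be checked in plain base R. The sketch below uses the s and w columns for car 10 from the output above; lag0 is a hypothetical stand-in for dplyr's lag(x, k, default = 0), written here so the example needs no packages.

```r
# Per-marker sums and row counts for car 10 (second example)
s <- c(25, 135, 110, 30, 50, 100, 120, 40)
w <- c(1, 3, 2, 2, 2, 2, 3, 1)

# Hypothetical helper: shift x down by k positions, padding the start with zeros
lag0 <- function(x, k) head(c(rep(0, k), x), length(x))

# Running total minus the running total from four steps back
# leaves the sum over the last four entries
S <- cumsum(s) - cumsum(lag0(s, 4))
W <- cumsum(w) - cumsum(lag0(w, 4))
result <- S / W
print(round(result, 3))
```

The subtraction works because cumsum(lag0(s, 4)) at position i is the total of everything up to position i - 4, which is exactly the part of the running total that falls outside the window.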