Averaging data based on the current line and previous lines

Question

I have a simple dataset with the following form

df<- data.frame(c(10, 10, 10,  10,  10,  10,  10,  10, 10, 10, 10, 10, 10, 10, 10, 10, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20),   
                c(80,  80,  80,  80,  80,  80,  80,  80, 90, 90, 90, 90, 90, 90, 90, 90, 80,  80,  80,  80,  80,  80,  80,  80, 90, 90, 90, 90, 90, 90, 90, 90),
                c(1,    1,   2,   2,   3,   3,   4,   4,     1,   1,    2, 2,   3,    3,   4,   4,  1,    1,   2,   2,   3,   3,   4,   4,     1,   1,    2, 2,   3,    3,   4,   4),
                c(25,   75,  20,  40,  60,  50,  20,  10,  20,  30,  40,  60, 25,   75,  20,  40, 5,   5,  2,  4,  6,  5,  2,  1,  2,  3,  4,  6, 2,   7,  2,  4))

colnames(df)<-c("car_number", "year", "marker", "val")

What I am trying to do is actually quite simple: per car_number, I want to find the average of the values associated with the current marker value and the preceding 3 marker values. So for the example data above, the output I want is

car=10, year=80 1: 50
car=10, year=80 2: 40
car=10, year=80 3: 45
car=10, year=80 4: 37.5

car=10, year=90 1: 31.25
car=10, year=90 2: 36.25
car=10, year=90 3: 35
car=10, year=90 4: 38.75


car=20, year=80 1: 5
car=20, year=80 2: 4
car=20, year=80 3: 4.5
car=20, year=80 4: 3.75

car=20, year=90 1: 3.125
car=20, year=90 2: 3.625
car=20, year=90 3: 3.375
car=20, year=90 4: 3.750

Note that for simplicity, the example markers above come in pairs. This is not the case with real data, so I think the general solution will contain some group_by (?)

Any effective solution is appreciated!

Here's a second example dataset and its output:

df<- data.frame(c(10, 10, 10,  10,  10,  10,  10,  10, 10, 10, 10, 10, 10, 10, 10, 10, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20),   
                c(80,  80,  80,  80,  80,  80,  80,  80, 90, 90, 90, 90, 90, 90, 90, 90, 80,  80,  80,  80,  80,  80,  80,  80, 90, 90, 90, 90, 90, 90, 90, 90),
                c(1,    2,   2,   2,   3,   3,   4,   4,     1,   1,    2, 2,   3,    3,   3,   4,  1,    1,   1,   2,   3,   3,   4,   4,     4,   1,    2, 2,   3,    3,   3,   4),
                c(25,   75,  20,  40,  60,  50,  20,  10,  20,  30,  40,  60, 25,   75,  20,  40, 5,   5,  2,  4,  6,  5,  2,  1,  2,  3,  4,  6, 2,   7,  2,  4))

colnames(df)<-c("car_number", "year", "marker", "val")

And the output (based on the above rules)

car=10, year=80 1: Mean[{25}]                                  = 25
car=10, year=80 2: Mean[{40, 20, 75, 25}]                      = 40
car=10, year=80 3: Mean[{50, 60, 40, 20, 75, 25}]              = 45
car=10, year=80 4: Mean[{10, 20, 50, 60, 40, 20, 75, 25}]      = 37.5

car=10, year=90 1: Mean[{30, 20, 10, 20, 50, 60, 40, 20, 75}] = 36.11
car=10, year=90 2: Mean[{60, 40, 30, 20, 10, 20, 50, 60}]     = 36.25
car=10, year=90 3: Mean[{20, 75, 25, 60, 40, 30, 20, 10, 20}] = 33.33
car=10, year=90 4: Mean[{40, 20, 75, 25, 60, 40, 30, 20}]     = 38.75

car=20, year=80 1: Mean[{2, 5, 5}]                   = 4
car=20, year=80 2: Mean[{4, 2, 5, 5}]                = 4
car=20, year=80 3: Mean[{5, 6, 4, 2, 5, 5}]          = 4.5
car=20, year=80 4: Mean[{2, 1, 2, 5, 6, 4, 2, 5, 5}] = 3.56

car=20, year=90 1: Mean[{3, 2, 1, 2, 5, 6, 4}]       = 3.29
car=20, year=90 2: Mean[{6, 4, 3, 2, 1, 2, 5, 6}]    = 3.625
car=20, year=90 3: Mean[{2, 7, 2, 6, 4, 3, 2, 1, 2}] = 3.22
car=20, year=90 4: Mean[{4, 2, 7, 2, 6, 4, 3}]       = 4

r dplyr

BillyJean 06 Apr 17 at 9:28 am

5 answers

Aurèle · Answer 1 · 2017-04-06T10:12:10+0000

The first group_by computes the mean of val for each combination of car_number, year and marker, and keeps its weight (the number of rows).
The second group_by, on car_number alone, lets us pull the lagged means and weights with lag to calculate the required weighted average:

library(purrr)
library(dplyr)
df %>% 
  arrange(car_number, year, marker) %>% 
  group_by(car_number, year, marker) %>% 
  summarise(mean_1 = mean(val, na.rm = TRUE), weight = n()) %>% 
  group_by(car_number) %>% 
  mutate(mean_2 = pmap_dbl(
    list(mean_1, lag(mean_1), lag(mean_1, 2), lag(mean_1, 3),
         weight, lag(weight), lag(weight, 2), lag(weight, 3)),
    ~ weighted.mean(c(..1, ..2, ..3, ..4),
                    c(..5, ..6, ..7, ..8),
                    na.rm = TRUE)
  )) %>%
  ungroup()

Result:

# # A tibble: 16 × 6
#    car_number  year marker mean_1 weight mean_2
#         <dbl> <dbl>  <dbl>  <dbl>  <int>  <dbl>
# 1          10    80      1   50.0      2 50.000
# 2          10    80      2   30.0      2 40.000
# 3          10    80      3   55.0      2 45.000
# 4          10    80      4   15.0      2 37.500
# 5          10    90      1   25.0      2 31.250
# 6          10    90      2   50.0      2 36.250
# 7          10    90      3   50.0      2 35.000
# 8          10    90      4   30.0      2 38.750
# 9          20    80      1    5.0      2  5.000
# 10         20    80      2    3.0      2  4.000
# 11         20    80      3    5.5      2  4.500
# 12         20    80      4    1.5      2  3.750
# 13         20    90      1    2.5      2  3.125
# 14         20    90      2    5.0      2  3.625
# 15         20    90      3    4.5      2  3.375
# 16         20    90      4    3.0      2  3.750
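
To see why the answer carries the group sizes along as weights, here is a minimal base R sketch. The values are the first two marker groups of car 10, year 80 in the question's second example; the variable names are illustrative only. A weighted mean of the group means reproduces the mean of the pooled raw values (40 here), whereas a plain mean of the group means gives 35 because the groups differ in size:

```r
# Raw values for marker 1 and marker 2 (car 10, year 80, second example)
m1 <- c(25)
m2 <- c(75, 20, 40)

group_means <- c(mean(m1), mean(m2))      # 25 and 45
group_sizes <- c(length(m1), length(m2))  # 1 and 3

pooled   <- mean(c(m1, m2))                          # mean of all raw values
weighted <- weighted.mean(group_means, group_sizes)  # same quantity from the summaries
naive    <- mean(group_means)                        # ignores the group sizes

print(c(pooled = pooled, weighted = weighted, naive = naive))
```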

Edit: Alternative syntax for versions of purrr before 0.2.2.9000:

df %>% 
  arrange(car_number, year, marker) %>% 
  group_by(car_number, year, marker) %>% 
  summarise(mean_1 = mean(val, na.rm = TRUE), weight = n()) %>% 
  group_by(car_number) %>% 
  mutate(mean_2 = pmap_dbl(
    list(mean_1, lag(mean_1), lag(mean_1, 2), lag(mean_1, 3),
         weight, lag(weight), lag(weight, 2), lag(weight, 3)),
    function(a, b, c, d, e, f, g, h)
      weighted.mean(c(a, b, c, d),
                    c(e, f, g, h),
                    na.rm = TRUE)
  )) %>%
  ungroup()

lmo · Answer 2 · 2017-04-06T12:38:09+0000

Here is a method with data.table that modifies Frank's suggestion to David Arenburg's answer here.

library(data.table)

# aggregate data by car_number, year, and marker
dfNew <- setDT(df)[, .(val=mean(val)), by=.(car_number, year, marker)]
# calculate average of current and previous three values
dfNew[, val := rowMeans(dfNew[,shift(val, 0:3), by=car_number][, -1], na.rm=TRUE)]

The first line is a standard aggregation call. The second line adapts the rowMeans method from the linked answer. rowMeans is fed a data.table of shifted values, where the shift is done by car_number (thanks to Sotos for catching this). That call is chained to a statement that drops the first column (using -1), which is the car_number column returned by the first part of the chain.

This returns

   car_number year marker    val
 1:         10   80      1 50.000
 2:         10   80      2 40.000
 3:         10   80      3 45.000
 4:         10   80      4 37.500
 5:         10   90      1 31.250
 6:         10   90      2 36.250
 7:         10   90      3 35.000
 8:         10   90      4 38.750
 9:         20   80      1  5.000
10:         20   80      2  4.000
11:         20   80      3  4.500
12:         20   80      4  3.750
13:         20   90      1  3.125
14:         20   90      2  3.625
15:         20   90      3  3.375
16:         20   90      4  3.750
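
As a rough illustration of what the second line does within one car_number group, the sketch below applies shift with 0:3 to a plain vector holding the first five per-group means of car 10 in the first example (the names x, lags and rolling are illustrative). shift returns one lagged copy of the vector per requested offset, and rowMeans with na.rm = TRUE averages across them, so the first rows simply use fewer values. Note that this averages the group means without weights, which matches the pooled average only when every group has the same number of rows, as in the first example.

```r
library(data.table)

# Per-group means of car 10 in the first example (first five groups)
x <- c(50, 30, 55, 15, 25)

# One column per lag: 0 (current), 1, 2 and 3 rows back; missing lags are NA
lags <- do.call(cbind, shift(x, 0:3))

# Row-wise mean over whatever values are available
rolling <- rowMeans(lags, na.rm = TRUE)
print(rolling)
```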

Sotos · Answer 3 · 2017-04-06T13:10:02+0000

Just to throw a base R solution into the mix: we can create a custom function using Reduce with accumulate = TRUE and tail(x, 4) to include only the current observation and the previous 3. All this comes after we average the dataset by car_number, year and marker, i.e.

f1 <- function(x){
    sapply(Reduce(c, x, accumulate = TRUE), function(i) mean(tail(i,4)))
  }

dd <- aggregate(val ~ car_number+year+marker, df, mean)
dd <- dd[order(dd$car_number, dd$year, dd$marker),]
dd$new_avg <- with(dd, ave(val, car_number, FUN = f1))

dd
#   car_number year marker  val new_avg
#1          10   80      1 50.0  50.000
#5          10   80      2 30.0  40.000
#9          10   80      3 55.0  45.000
#13         10   80      4 15.0  37.500
#3          10   90      1 25.0  31.250
#7          10   90      2 50.0  36.250
#11         10   90      3 50.0  35.000
#15         10   90      4 30.0  38.750
#2          20   80      1  5.0   5.000
#6          20   80      2  3.0   4.000
#10         20   80      3  5.5   4.500
#14         20   80      4  1.5   3.750
#4          20   90      1  2.5   3.125
#8          20   90      2  5.0   3.625
#12         20   90      3  4.5   3.375
#16         20   90      4  3.0   3.750
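
To make the helper less opaque, the following sketch shows the intermediate list that Reduce builds, using the first five entries of the val column above as input (the names x, prefixes and new_avg are illustrative). Each list element is the series up to that row, and tail(i, 4) keeps at most the last four entries, that is, the current value and up to three preceding ones:

```r
x <- c(50, 30, 55, 15, 25)

# Growing prefixes: 50 | 50 30 | 50 30 55 | ...
prefixes <- Reduce(c, x, accumulate = TRUE)
str(prefixes)

# Mean of the last (up to) four entries of each prefix
new_avg <- sapply(prefixes, function(i) mean(tail(i, 4)))
print(new_avg)
```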

Aramis7d · Answer 4 · 2017-04-06T12:31:23+0000

Given df as your input, you can use dplyr and zoo. Grouping just over car_number, you can try:

library(dplyr)
library(zoo)

df %>%
  group_by(car_number, year, marker) %>%
  summarise(mm = mean(val)) %>%
  group_by(car_number) %>%
  # window width: 1, 2, 3 for the first three rows of each car, then 4
  mutate(rM = rollapply(mm, pmin(row_number(), 4), FUN = mean, align = "right")) %>%
  select(year, rM)

which gives:

Source: local data frame [16 x 3]
Groups: car_number [2]

   car_number  year     rM
        <dbl> <dbl>  <dbl>
1          10    80 50.000
2          10    80 40.000
3          10    80 45.000
4          10    80 37.500
5          10    90 31.250
6          10    90 36.250
7          10    90 35.000
8          10    90 38.750
9          20    80  5.000
10         20    80  4.000
11         20    80  4.500
12         20    80  3.750
13         20    90  3.125
14         20    90  3.625
15         20    90  3.375
16         20    90  3.750

Scarabee · Answer 5 · 2017-04-06T21:21:27+0000

You can do it like this:

library(dplyr)

df %>%
  group_by(car_number, year, marker) %>%
  summarise(s = sum(val), w = n()) %>% # sum and number of values
  group_by(car_number) %>%
  mutate(S = cumsum(s) - cumsum(lag(s, 4, default=0))) %>% # sum of last four s
  mutate(W = cumsum(w) - cumsum(lag(w, 4, default=0))) %>% # same for the weights
  mutate(result = S/W)

Output for the second example:

# Source: local data frame [16 x 8]
# Groups: car_number [2]
# 
#    car_number  year marker     s     w     S     W    result
#         <dbl> <dbl>  <dbl> <dbl> <int> <dbl> <int>     <dbl>
# 1          10    80      1    25     1    25     1 25.000000
# 2          10    80      2   135     3   160     4 40.000000
# 3          10    80      3   110     2   270     6 45.000000
# 4          10    80      4    30     2   300     8 37.500000
# 5          10    90      1    50     2   325     9 36.111111
# 6          10    90      2   100     2   290     8 36.250000
# 7          10    90      3   120     3   300     9 33.333333
# 8          10    90      4    40     1   310     8 38.750000
# 9          20    80      1    12     3    12     3  4.000000
# 10         20    80      2     4     1    16     4  4.000000
# 11         20    80      3    11     2    27     6  4.500000
# 12         20    80      4     5     3    32     9  3.555556
# 13         20    90      1     3     1    23     7  3.285714
# 14         20    90      2    10     2    29     8  3.625000
# 15         20    90      3    11     3    29     9  3.222222
# 16         20    90      4     4     1    28     7  4.000000

Edit: It is probably more efficient with the RcppRoll package; you can try this: S = roll_sum(c(0, 0, 0, s), 4) (and the same for W).
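
The cumulative-sum trick can be checked in isolation. The sketch below is base R only and uses the s and w columns for car 10 from the output table above; lag0 and window_sum are hypothetical helper names, with lag0 standing in for lag(x, 4, default = 0). The sum of the last four entries is the running total minus the running total four rows back, and a brute-force loop gives the same numbers:

```r
# s and w for car 10, copied from the output table above
s <- c(25, 135, 110, 30, 50, 100, 120, 40)
w <- c(1, 3, 2, 2, 2, 2, 3, 1)

# Base R stand-in for dplyr::lag(x, k, default = 0)
lag0 <- function(x, k) head(c(rep(0, k), x), length(x))

# Sum of the last k entries = running total minus the running total k rows back
window_sum <- function(x, k = 4) cumsum(x) - cumsum(lag0(x, k))

# Brute-force check: sum over rows max(1, i - 3) .. i
brute <- sapply(seq_along(s), function(i) sum(s[max(1, i - 3):i]))

S <- window_sum(s)
W <- window_sum(w)
print(S / W)
```

Dividing S by W reproduces the result column for car 10 (25, 40, 45, 37.5, 36.11, 36.25, 33.33, 38.75).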
