R - rolling 12 months in a group with an inconsistent time interval
I have data (values) collected over several years by numerous teams. I would like to calculate a 12 month moving median (using the previous 12 months) for each group. I've looked at the roll functions (and others) and they appear to be all fixed interval (e.g. 12 months = 12 rows), but the date intervals in my data don't match every group. Data values ββwere collected every month or every month, but there are additional gaps. I think I need a sliding median function that collects values ββfrom the previous 364 days for each group. I would appreciate some help!
Below is an example of my data:
Date Group Value
8/17/2013 A 5
10/2/2013 A 13
1/15/2014 A 11
3/15/2014 A 2
5/22/2014 A 7
7/15/2014 A 1
9/3/2014 A 1
11/15/2014 A 7
7/22/2013 B 13
8/5/2013 B 13
9/7/2013 B 12
10/16/2013 B 6
11/17/2013 B 5
12/9/2013 B 15
1/30/2014 B 1
2/23/2014 B 10
3/24/2014 B 15
4/5/2014 B 3
5/26/2014 B 3
6/16/2014 B 4
8/5/2014 B 6
9/26/2014 B 8
10/16/2014 B 15
11/29/2014 B 12
12/13/2016 B 1
I want to add a column to this "Rolling Median" table that contains a rolling 12 month (or 365 day) median for each group.
source to share
This package may be useful to you:
https://github.com/mgahan/boRingTrees
It is called boRingTrees
and it handles this type of problem. If you don't want to download the package, you can also use the following code:
##Utilize the data.table package
library(data.table)
setDT(data)
data[, Date2 := as.Date(Date,format="%m/%d/%Y")] #Format date field
#Apply rollingByCalcs function (full function code is below)
data[, Roll_Median := rollingByCalcs(data,bylist=c("Group"),dates="Date2",target="Value",
lower=0,upper=365,incbounds=T,stat=median,na.rm=T,cores=1)]
rollingByCalcs <- function(data,bylist=NULL,dates,target=NULL,
lower,upper,incbounds=T,stat=length,na.rm=T,cores=1){
tic <- Sys.time()
require("data.table")
require("parallel")
data <- data.table(data)
if (is.null(bylist)){
data[, id.filler := 1]
bylist <- "id.filler"
}
if (is.null(target)){
data[,target:=1]
target <- "target"
}
##Create group by variable
data[,Grp.Var:=.GRP,by=bylist]
##Assign variable names
data[,target:=data[,eval(parse(text=target))]]
data[,dates:=data[,eval(parse(text=dates))]]
##Create "list" of comparison dates
Ref <- data[,list(Compare_Value=list(I(target)),Compare_Date=list(I(dates))), by=c("Grp.Var")]
##Compare two lists and see of the compare date is within N days
data$Roll.Val <- mcmapply(FUN = function(RD, NUM) {
d <- as.numeric(RD-Ref$Compare_Date[[NUM]])
true.vals <- between(x=d,lower=lower,upper=upper,incbounds=incbounds)
out <- stat(Ref$Compare_Value[[NUM]][true.vals])
return(out)
}, RD = data$dates,NUM=data$Grp.Var,mc.cores=cores)
print(Sys.time()-tic)
return(data$Roll.Val)
}
source to share
You can write a helper function. The dplyr package can be used here :
library(dplyr)
rollingMedian <- function(targetDate, targetGroup) {
dat %>%
mutate(thisDiff = difftime(as.Date(Date), targetDate, unit = "days")) %>%
filter(thisDiff < 0, thisDiff > -366, Group == targetGroup) %>%
summarise(medValue = median(Value))
}
dat$rollingMed <- mapply(rollingMedian, dat$Date, dat$Group)
Result:
dat
Date Group Value rollingMed
1 2013-08-17 A 5 NA
2 2013-10-02 A 13 5
3 2014-01-15 A 11 9
4 2014-03-15 A 2 11
5 2014-05-22 A 7 8
6 2014-07-15 A 1 7
7 2014-09-03 A 1 7
8 2014-11-15 A 7 2
9 2013-07-22 B 13 NA
10 2013-08-05 B 13 13
...
Data used:
dat <- structure(list(Date = structure(c(1376697600, 1380672000, 1389744000,
1394841600, 1400716800, 1405382400, 1409702400, 1416009600, 1374451200,
1375660800, 1378512000, 1381881600, 1384646400, 1386547200, 1391040000,
1393113600, 1395619200, 1396656000, 1401062400, 1402876800, 1407196800,
1411689600, 1413417600, 1417219200, 1481587200), tzone = "UTC", class = c("POSIXct",
"POSIXt")), Group = c("A", "A", "A", "A", "A", "A", "A", "A",
"B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B",
"B", "B", "B", "B"), Value = c(5L, 13L, 11L, 2L, 7L, 1L, 1L,
7L, 13L, 13L, 12L, 6L, 5L, 15L, 1L, 10L, 15L, 3L, 3L, 4L, 6L,
8L, 15L, 12L, 1L)), .Names = c("Date", "Group", "Value"), row.names = c(NA,
-25L), class = "data.frame")
source to share