R - rolling 12 months in a group with an inconsistent time interval

I have data (values) collected over several years by numerous teams. I would like to calculate a 12 month moving median (using the previous 12 months) for each group. I've looked at the roll functions (and others) and they appear to be all fixed interval (e.g. 12 months = 12 rows), but the date intervals in my data don't match every group. Data values ​​were collected every month or every month, but there are additional gaps. I think I need a sliding median function that collects values ​​from the previous 364 days for each group. I would appreciate some help!

Below is an example of my data:

Date    Group   Value
8/17/2013   A   5
10/2/2013   A   13
1/15/2014   A   11
3/15/2014   A   2
5/22/2014   A   7
7/15/2014   A   1
9/3/2014    A   1
11/15/2014  A   7
7/22/2013   B   13
8/5/2013    B   13
9/7/2013    B   12
10/16/2013  B   6
11/17/2013  B   5
12/9/2013   B   15
1/30/2014   B   1
2/23/2014   B   10
3/24/2014   B   15
4/5/2014    B   3
5/26/2014   B   3
6/16/2014   B   4
8/5/2014    B   6
9/26/2014   B   8
10/16/2014  B   15
11/29/2014  B   12
12/13/2016  B   1

      

I want to add a column to this "Rolling Median" table that contains a rolling 12 month (or 365 day) median for each group.

+3


source to share


2 answers


This package may be useful to you:

https://github.com/mgahan/boRingTrees



It is called boRingTrees

and it handles this type of problem. If you don't want to download the package, you can also use the following code:

##Utilize the data.table package
library(data.table)
setDT(data)
data[, Date2 := as.Date(Date,format="%m/%d/%Y")] #Format date field

#Apply rollingByCalcs function (full function code is below)
data[, Roll_Median := rollingByCalcs(data,bylist=c("Group"),dates="Date2",target="Value",
                      lower=0,upper=365,incbounds=T,stat=median,na.rm=T,cores=1)]


rollingByCalcs <- function(data,bylist=NULL,dates,target=NULL,
                           lower,upper,incbounds=T,stat=length,na.rm=T,cores=1){
  tic <- Sys.time()

  require("data.table")
  require("parallel")
  data <- data.table(data)


  if (is.null(bylist)){
    data[, id.filler := 1]
    bylist <- "id.filler"
  }

  if (is.null(target)){
    data[,target:=1]
    target <- "target"
  }

  ##Create group by variable
  data[,Grp.Var:=.GRP,by=bylist]

  ##Assign variable names
  data[,target:=data[,eval(parse(text=target))]]
  data[,dates:=data[,eval(parse(text=dates))]]

  ##Create "list" of comparison dates
  Ref <- data[,list(Compare_Value=list(I(target)),Compare_Date=list(I(dates))), by=c("Grp.Var")]

  ##Compare two lists and see of the compare date is within N days
  data$Roll.Val <- mcmapply(FUN = function(RD, NUM) {
    d <- as.numeric(RD-Ref$Compare_Date[[NUM]])
    true.vals <- between(x=d,lower=lower,upper=upper,incbounds=incbounds)  
    out <- stat(Ref$Compare_Value[[NUM]][true.vals])
    return(out)
  }, RD = data$dates,NUM=data$Grp.Var,mc.cores=cores)

  print(Sys.time()-tic)
  return(data$Roll.Val)
}

      

+2


source


You can write a helper function. The dplyr package can be used here :

library(dplyr)
rollingMedian <- function(targetDate, targetGroup) {
  dat %>%
    mutate(thisDiff = difftime(as.Date(Date), targetDate, unit = "days")) %>%
    filter(thisDiff < 0, thisDiff > -366, Group == targetGroup) %>%
    summarise(medValue = median(Value))
}

dat$rollingMed <- mapply(rollingMedian, dat$Date, dat$Group)

      

Result:



dat
         Date Group Value rollingMed
1  2013-08-17     A     5         NA
2  2013-10-02     A    13          5
3  2014-01-15     A    11          9
4  2014-03-15     A     2         11
5  2014-05-22     A     7          8
6  2014-07-15     A     1          7
7  2014-09-03     A     1          7
8  2014-11-15     A     7          2
9  2013-07-22     B    13         NA
10 2013-08-05     B    13         13
...

      

Data used:

dat <- structure(list(Date = structure(c(1376697600, 1380672000, 1389744000, 
1394841600, 1400716800, 1405382400, 1409702400, 1416009600, 1374451200, 
1375660800, 1378512000, 1381881600, 1384646400, 1386547200, 1391040000, 
1393113600, 1395619200, 1396656000, 1401062400, 1402876800, 1407196800, 
1411689600, 1413417600, 1417219200, 1481587200), tzone = "UTC", class = c("POSIXct", 
"POSIXt")), Group = c("A", "A", "A", "A", "A", "A", "A", "A", 
"B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", 
"B", "B", "B", "B"), Value = c(5L, 13L, 11L, 2L, 7L, 1L, 1L, 
7L, 13L, 13L, 12L, 6L, 5L, 15L, 1L, 10L, 15L, 3L, 3L, 4L, 6L, 
8L, 15L, 12L, 1L)), .Names = c("Date", "Group", "Value"), row.names = c(NA, 
-25L), class = "data.frame")

      

0


source







All Articles