Define Overlapping Ranges - R
I have two data frames: one with purchases made during the month, and one with advertisements that ran during the same month. To judge whether a purchase can be reliably linked to an ad, I want to know how many purchase dates fall within the 4 days following an ad shown to the same customer. To do this, I wrote some code that expands each row of the ad data frame to cover the corresponding 4-day period and then merges it with the purchases to see where there is (or is not) an overlap. This seems like a very cumbersome way to do it. Ideally, I would like to do this elegantly in dplyr. Any suggestions are welcome.
library(dplyr)
library(lubridate)
require(data.table)
# set start and end dates to sample between
day.start <- "2007/01/01"
day.end <- "2007/01/30"
set.seed(1)
# define a random date/time selection function
rand.day.time <- function(day.start, day.end, size) {
  dayseq <- seq.Date(as.Date(day.start), as.Date(day.end), by = "day")
  dayselect <- sample(dayseq, size, replace = TRUE)
  as.POSIXlt(paste(dayselect))
}
dateval <- rand.day.time(day.start, day.end, size = 20)
###create initial dataframes
action=rep(c("ad","purchase"),10)
id=rep(c(1,1,2,2),5)
df=data.frame(customer=id,date=dateval,action=action)
df_pur=filter(df,action=="purchase");(df_pur=df_pur[order(df_pur$date),])
df_ad=filter(df,action=="ad");(df_ad=df_ad[order(df_ad$date),])
#expand data-frame to include all the ranges for which the ad might trigger purchases
df_ad_exp = df_ad %>%
  group_by(customer, date) %>%
  summarize(start = min(date), end = min(date + days(4)))
df_ad_exp=as.data.frame(df_ad_exp)
df_ad_exp2=setDT(df_ad_exp)[, list(customer=customer, range=seq(start,end,by="day")), by=1:nrow(df_ad_exp)]
###merge the dataframe, use NA values to identify those dates in which purchase was made but no ad was "active"
df_ad_exp2=as.data.frame(df_ad_exp2)
(df_ad_exp2=df_ad_exp2[,c("customer","range")])
df_ad_exp2$helpercol=0
(df_pur_m=merge(df_pur,df_ad_exp2,by.x=c("date","customer"),by.y=c("range","customer"),all.x=TRUE))
df_pur_m$ad_in_range=df_pur_m$helpercol;df_pur_m$helpercol=NULL
df_pur_m$ad_in_range[!is.na(df_pur_m$ad_in_range)]=1;df_pur_m$ad_in_range[is.na(df_pur_m$ad_in_range)]=0
#outcomes
df_pur
df_ad
df_pur_m
> df_ad
   customer       date action
3         1 2007-01-07     ad
6         2 2007-01-07     ad
1         1 2007-01-08     ad
10        2 2007-01-12     ad
2         2 2007-01-18     ad
5         1 2007-01-19     ad
7         1 2007-01-21     ad
9         1 2007-01-22     ad
8         2 2007-01-24     ad
4         2 2007-01-29     ad
> df_pur_m
         date customer   action ad_in_range
1  2007-01-02        1 purchase           0
2  2007-01-06        2 purchase           0
3  2007-01-12        1 purchase           1
4  2007-01-12        1 purchase           1
5  2007-01-15        2 purchase           1
6  2007-01-20        2 purchase           1
7  2007-01-24        2 purchase           1
8  2007-01-27        1 purchase           0
9  2007-01-28        2 purchase           1
10 2007-01-30        1 purchase           0
Try foverlaps in data.table, which is designed for exactly this (I can't think of an elegant dplyr way, sorry). Both tables need a start and an end date column: for an ad, the interval runs from the ad date to 4 days later; for a purchase, the start and end are both the purchase date.
# df_ad must be keyed
setDT(df_ad)[, ad_date_end:=date + days(4)]
setnames(df_ad, 'date', 'ad_date') # just for readability later
setkey(df_ad, customer, ad_date, ad_date_end)
setDT(df_pur)[, purch_end:=date]
setnames(df_pur, 'date', 'purch_date') # for readability
# type='within': the x interval (purchase) is within the y interval (ad)
# we merge on customer ID, start & end date
ovl <- foverlaps(df_pur, df_ad,
                 by.x = c('customer', 'purch_date', 'purch_end'), type = 'within')
#     customer    ad_date action ad_date_end purch_date i.action  purch_end
#  1:        1       <NA>     NA        <NA> 2007-01-02 purchase 2007-01-02
#  2:        2       <NA>     NA        <NA> 2007-01-06 purchase 2007-01-06
#  3:        1 2007-01-08     ad  2007-01-12 2007-01-12 purchase 2007-01-12
#  4:        1 2007-01-08     ad  2007-01-12 2007-01-12 purchase 2007-01-12
#  5:        2 2007-01-12     ad  2007-01-16 2007-01-15 purchase 2007-01-15
#  6:        2 2007-01-18     ad  2007-01-22 2007-01-20 purchase 2007-01-20
#  7:        2 2007-01-24     ad  2007-01-28 2007-01-24 purchase 2007-01-24
#  8:        1       <NA>     NA        <NA> 2007-01-27 purchase 2007-01-27
#  9:        2 2007-01-24     ad  2007-01-28 2007-01-28 purchase 2007-01-28
# 10:        1       <NA>     NA        <NA> 2007-01-30 purchase 2007-01-30
# tidyup
ovl[, action:=i.action][, c('ad_date_end', 'purch_end', 'i.action'):=NULL]
#     customer    ad_date   action purch_date
#  1:        1       <NA> purchase 2007-01-02
#  2:        2       <NA> purchase 2007-01-06
#  3:        1 2007-01-08 purchase 2007-01-12
#  4:        1 2007-01-08 purchase 2007-01-12
#  5:        2 2007-01-12 purchase 2007-01-15
#  6:        2 2007-01-18 purchase 2007-01-20
#  7:        2 2007-01-24 purchase 2007-01-24
#  8:        1       <NA> purchase 2007-01-27
#  9:        2 2007-01-24 purchase 2007-01-28
# 10:        1       <NA> purchase 2007-01-30
The rows with NA in ad_date are purchases that did not fall within any ad's 4-day window.
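If you want the 0/1 ad_in_range column from the question rather than the joined table, one option is to collapse the foverlaps result to one row per purchase. This is only a rough sketch on made-up toy data: the purch_id column is a hypothetical addition (it is not in the original data) that keeps duplicate purchases apart and lets a purchase that falls inside several ad windows be counted once.

```r
library(data.table)

# Toy data (hypothetical), using the same column names as above
ads <- data.table(customer = c(1, 1, 2),
                  ad_date  = as.Date(c("2007-01-07", "2007-01-08", "2007-01-12")))
ads[, ad_date_end := ad_date + 4]            # each ad is "active" for 4 days
setkey(ads, customer, ad_date, ad_date_end)  # the y table must be keyed

purchases <- data.table(purch_id   = 1:3,    # hypothetical row identifier
                        customer   = c(1, 1, 2),
                        purch_date = as.Date(c("2007-01-02", "2007-01-10", "2007-01-15")))
purchases[, purch_end := purch_date]         # zero-length purchase interval

ovl <- foverlaps(purchases, ads,
                 by.x = c("customer", "purch_date", "purch_end"),
                 type = "within")

# One row per purchase: 1 if at least one ad window contains it, 0 otherwise
flags <- ovl[, .(ad_in_range = as.integer(any(!is.na(ad_date)))),
             by = .(purch_id, customer, purch_date)]

# Number and share of purchases that can be linked to an ad
n_linked     <- sum(flags$ad_in_range)
share_linked <- mean(flags$ad_in_range)
```

Here the second purchase sits inside two ad windows, so ovl has four rows while flags has three; sum() and mean() on the flag then give the count and proportion of linked purchases.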
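On the dplyr side, newer releases (1.1.0 and later, as far as I know) support non-equi joins through join_by(), so something along these lines may be closer to what the question asked for. Treat it as a sketch that assumes such a dplyr version is installed; the toy data and the purch_id column are hypothetical and only there to make the example self-contained.

```r
library(dplyr)

# Toy data (hypothetical), one ad window per row
ads <- tibble(customer = c(1, 1, 2),
              ad_date  = as.Date(c("2007-01-07", "2007-01-08", "2007-01-12"))) %>%
  mutate(ad_date_end = ad_date + 4)

purchases <- tibble(purch_id   = 1:3,   # hypothetical row identifier
                    customer   = c(1, 1, 2),
                    purch_date = as.Date(c("2007-01-02", "2007-01-10", "2007-01-15")))

# Same customer, and purchase date inside [ad_date, ad_date_end]
matched <- purchases %>%
  left_join(ads, by = join_by(customer, between(purch_date, ad_date, ad_date_end)))

# Collapse to one row per purchase with a 0/1 flag
flags <- matched %>%
  group_by(purch_id, customer, purch_date) %>%
  summarise(ad_in_range = as.integer(any(!is.na(ad_date))), .groups = "drop")
```

Purchases without a matching ad come back with NA in ad_date, just like in the foverlaps result, so the same NA check produces the flag.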