Efficient cross join with aggregation and filter
As per the title, I'm looking to do a cross join against a table, applying an aggregation function and filtering on multiple variables in the table.
I have data similar to this:
library(dplyr)
library(data.table)
library(sqldf)
sales <- data.frame(salesx = c(3000, 2250, 850, 1800, 1700, 560, 58, 200, 965, 1525),
                    week = seq(from = 1, to = 10, by = 1),
                    uplift = c(0.04),
                    slope = c(100),
                    carryover = c(.35))
spend <- data.frame(spend = seq(from = 1, to = 50000, by = 1))
tempdata <- merge(spend, sales, all = TRUE)
tempdata$singledata <- as.numeric(1)
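Note that merge() on tables with no columns in common is itself a cross join, so the sample data already has 50,000 x 10 = 500,000 rows:
nrow(tempdata)
[1] 500000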
And here is an example of what I am trying to accomplish with my SQL-based solution:
newdata <- sqldf("select a.spend, a.week,
sum(case when b.week > a.week
then b.salesx*(b.uplift*(1-exp(-(power(b.singledata,b.week-a.week)/b.slope))))/b.spend
else 0.0 end) as calc3
from tempdata a, tempdata b
where a.spend = b.spend
group by a.spend,a.week")
This gives the results I want, but it is a little slow, especially with my real dataset of about 1 million records. It would be great to get some advice on a) how to speed up sqldf, or b) a more efficient data.table / dplyr approach (I can't figure out how to handle the trifecta of cross join, aggregation and filtering).
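For reference, the closest direct dplyr translation of the SQL I can come up with is below. It still materialises the full self-join on spend, so I don't expect it to scale any better than sqldf; the right_join at the end just restores the zero rows that the week filter drops (a sketch only, written against the sample data above, with the .a/.b suffixes and the name newdata_dplyr chosen purely for illustration):
library(dplyr)
newdata_dplyr <- tempdata %>%
  # self-join on spend: pairs every week with every other week for that spend
  inner_join(tempdata, by = "spend", suffix = c(".a", ".b")) %>%
  # keep only later weeks, mirroring the CASE WHEN b.week > a.week condition
  filter(week.b > week.a) %>%
  group_by(spend, week = week.a) %>%
  summarise(calc3 = sum(salesx.b * (uplift.b *
              (1 - exp(-(singledata.b^(week.b - week.a) / slope.b)))) / spend)) %>%
  ungroup() %>%
  # restore the combinations the filter removed (the last week) with calc3 = 0
  right_join(distinct(tempdata, spend, week), by = c("spend", "week")) %>%
  mutate(calc3 = coalesce(calc3, 0))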
Edit - a clarification on the non-equi join solution below:
I had a few questions about the non-equi join solution - the output was fine and very fast. To understand how the code works, I have broken it down like this:
breakdown <- setDT(tempdata)[tempdata, .(spend, uplift, slope,carryover,salesx, singledata, week, i.week,x.week, i.salesx,x.salesx, x.spend, i.spend), on=.(spend, week > week)]
According to the breakdown, in order to match the original calculation, it should be:
x.salesx*(uplift*(1.0-exp(-(`^`(singledata,x.week-week)/slope))))/i.spend
The reason this is not obvious is that in the example I used, the power side of the equation didn't really do anything (it was always 1). The actual calculation used (adding the carryover variable to the data) is, in SQL:
b.salesx*(b.uplift*(1-exp(-(power((b.singledata*b.carryover),b.week-a.week)/b.slope))))/b.spend
and my data.table solution:
sum(salesx.y*(uplift.y*(1-exp(-((singledata.y*adstock.y)^(week.y-week.x)/slope.y))))/spend), by=list(spend, week.x)
However, I cannot get this to work with the non-equi join solution when adding the carryover variable, i.e.
x.salesx*(uplift*(1.0-exp(-(`^`((singledata*carryover),x.week-week)/slope))))/i.spend
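For completeness, the fully prefixed variant I would have expected to be equivalent (every b-side column explicitly taken from x and every a-side column from i; untested beyond the sample data) is:
newdata5 <- setDT(tempdata)[
  tempdata, on = .(spend, week > week),
  # x.* = the "b" side, i.* = the "a" side; x.week keeps b's original week values
  .(calc3 = sum(x.salesx * (x.uplift * (1 - exp(-((x.singledata * x.carryover)^(x.week - i.week) / x.slope)))) / i.spend)),
  by = .EACHI][
  is.na(calc3), calc3 := 0.0][]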
With version 1.9.8 (on CRAN 25 November 2016), data.table introduced non-equi joins, which help to avoid memory-hungry cross joins:
library(data.table)
newdata4 <-
# coerce to data.table
setDT(tempdata)[
# non-equi self-join
tempdata, on = .(spend, week > week),
# compute result
.(calc3 = sum(salesx*(uplift*(1.0-exp(-(`^`(singledata,week-i.week)/slope))))/i.spend)),
# grouped by join parameters
by = .EACHI][
# replace NA
is.na(calc3), calc3 := 0.0][]
# check that results are equal
all.equal(newdata, as.data.frame(newdata4[order(spend, week)]))
[1] TRUE
Benchmark
The OP has provided three different solutions, two sqldf variants and one data.table cross join approach. These are compared with the non-equi join.
The code below
dt_tempdata <- data.table(tempdata)
microbenchmark::microbenchmark(
sqldf = {
newdata <- sqldf("select a.spend, a.week,
sum(case when b.week > a.week
then b.salesx*(b.uplift*(1-exp(-(power(b.singledata,b.week-a.week)/b.slope))))/b.spend
else 0.0 end) as calc3
from tempdata a, tempdata b
where a.spend = b.spend
group by a.spend,a.week")
},
sqldf_idx = {
newdata2 <- sqldf(c('create index newindex on tempdata(spend)',
'select a.spend, a.week,
sum(case when b.week > a.week
then b.salesx*(b.uplift*(1-exp(-(power(b.singledata,b.week-a.week)/b.slope))))/b.spend
else 0.0 end) as calc3
from main.tempdata a left join main.tempdata b
on a.spend = b.spend
group by a.spend,a.week'), dbname = tempfile())
},
dt_merge = {
newdata3 <- merge(dt_tempdata, dt_tempdata, by="spend", all=TRUE, allow.cartesian=TRUE)[
week.y > week.x,
.(calc3 = sum(salesx.y*(uplift.y*(1-exp(-(singledata.y^(week.y-week.x)/slope.y)))))),
by=.(spend, week.x)]
},
dt_nonequi = {
newdata4 <- dt_tempdata[
dt_tempdata, on = .(spend, week > week),
.(calc3 = sum(salesx*(uplift*(1.0-exp(-(`^`(singledata,week-i.week)/slope))))/i.spend)),
by = .EACHI][is.na(calc3), calc3 := 0.0]
},
times = 3L
)
returns these timings:
Unit: seconds
expr min lq mean median uq max neval cld
sqldf 9.456110 10.081704 10.647193 10.707299 11.242735 11.778171 3 b
sqldf_idx 10.980590 11.477774 11.734239 11.974958 12.111064 12.247170 3 b
dt_merge 3.037857 3.147274 3.192227 3.256692 3.269412 3.282131 3 a
dt_nonequi 1.768764 1.776581 1.792359 1.784397 1.804156 1.823916 3 a
For the given problem size, the non-equi join is the fastest, almost twice as fast as the data.table merge / cross join approach and six times as fast as the sqldf codes. Interestingly, index creation and/or temporary file usage is rather costly on my system.
Note that I have streamlined the OP's data.table solution.
Finally, all versions except the merge / cross join one (I refrained from fixing that version) return the same result.
all.equal(newdata, newdata2) # TRUE
all.equal(newdata, as.data.frame(newdata3[order(spend, week.x)])) # FALSE (last week missing)
all.equal(newdata, as.data.frame(newdata4[order(spend, week)])) # TRUE
Bigger problem size
The OP has reported that the data.table merge / cross join solution runs out of memory for his 1 M row production dataset. To verify that the non-equi join approach consumes less memory, I tested it with a problem size of 5 M rows (nrow(tempdata)), ten times the size of the previous benchmark runs. On my PC with 8 GB of memory, the run completed without problems in about 18 seconds.
Unit: seconds
expr min lq mean median uq max neval
dt_nonequi 18.12387 18.12657 18.23454 18.12927 18.28987 18.45047 3
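For reference, one way to build a problem of that size (the answer does not show its test-data code, so this is only an assumption, and the _5m names are purely illustrative) is to enlarge the spend table tenfold:
spend_5m <- data.frame(spend = seq(from = 1, to = 500000, by = 1))
tempdata_5m <- merge(spend_5m, sales, all = TRUE)   # 500,000 x 10 = 5,000,000 rows
tempdata_5m$singledata <- as.numeric(1)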
Finally, I have managed to look at this again:
My original solution:
system.time(newdata <- sqldf("select a.spend, a.week,
sum(case when b.week > a.week
then b.salesx*(b.uplift*(1-exp(-(power(b.singledata,b.week-a.week)/b.slope))))/b.spend
else 0.0 end) as calc3
from tempdata a, tempdata b
where a.spend = b.spend
group by a.spend,a.week"))
user system elapsed
11.99 3.77 16.11
With an index (although something tells me it doesn't work as expected):
system.time(newdata2 <- sqldf(c('create index newindex on tempdata(spend)',
'select a.spend, a.week,
sum(case when b.week > a.week
then b.salesx*(b.uplift*(1-exp(-(power(b.singledata,b.week-a.week)/b.slope))))/b.spend
else 0.0 end) as calc3
from main.tempdata a left join main.tempdata b
on a.spend = b.spend
group by a.spend,a.week'), dbname = tempfile()))
user system elapsed
12.73 2.93 15.76
The data.table solution (which does not return the zero rows produced by the CASE statement in the SQL):
datatablefunc <- function(g){
tempdata2 <- as.data.table(g)
setkey(tempdata2, spend)
tempdata3 <- merge(tempdata2, tempdata2, by="spend", all=TRUE, allow.cartesian=TRUE)
tempdata4 <- tempdata3[week.y > week.x, sum(salesx.y*(uplift.y*(1-exp(-(singledata.y^(week.y-week.x)/slope.y))))/spend), by=list(spend, week.x)]
return(tempdata4)
}
system.time(newdata3 <- datatablefunc(tempdata))
user system elapsed
2.36 0.25 2.62
The beauty of the SQL-based solution is that, because the temporary output is stored in the SQL database rather than in R's memory, I don't run into the "cannot allocate vector of size ..." errors that I hit with the data.table / dplyr solutions when I add more data; the downside is that it takes a little longer to run.
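(Side note: sqldf defaults to an in-memory SQLite database; it is the dbname = tempfile() argument, as in the indexed version above, that keeps the working database in a file on disk rather than in RAM, e.g.:)
newdata_disk <- sqldf("select a.spend, a.week,
sum(case when b.week > a.week
then b.salesx*(b.uplift*(1-exp(-(power(b.singledata,b.week-a.week)/b.slope))))/b.spend
else 0.0 end) as calc3
from tempdata a, tempdata b
where a.spend = b.spend
group by a.spend,a.week", dbname = tempfile())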