Faster aggregation over subsets in data.table
I want to add a new column to my data.table. The value of this column has to be calculated from two columns of all rows that satisfy a certain condition. An example with a small data.table looks like this:
library(data.table)
DT <- data.table(pattern = c("A", "A & B", "A & B & C", "A & C & D"),
                 value1  = c(1, 2, 3, 4),
                 value2  = c(5, 6, 7, 8))
     pattern value1 value2
1:         A      1      5
2:     A & B      2      6
3: A & B & C      3      7
4: A & C & D      4      8
For every row x and every row i where pattern[x] is a sub-pattern of pattern[i], I want to perform the computation
min((value1[i]-value1[x])/(value1[i]/value2[i]-value1[x]/value2[x]))
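To make the formula concrete, here is the arithmetic for the pair x = "A" (row 1) and i = "A & B" (row 2) from the example data:
# x = "A":     value1[x] = 1, value2[x] = 5
# i = "A & B": value1[i] = 2, value2[i] = 6
(2 - 1) / (2/6 - 1/5)
# [1] 7.5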
Since the items within a pattern always appear in the same order, I can find the matching rows with a regex by replacing "&" with the wildcard ".*" and checking that the match is not the pattern itself. Hence, I can use a for-loop over each row:
setkey(DT, pattern)
for (i in 1:nrow(DT)) {
  # take all rows whose pattern contains pattern[i] (excluding row i itself)
  # and compute the minimum of the expression over those rows (NA if there are none)
  DT[i, foo := DT[grepl(gsub("&", ".*", DT[i]$pattern, fixed = TRUE), pattern) & DT[i]$pattern != pattern,
                  ifelse(.N == 0,
                         NA,
                         min((DT[i]$value1 - value1) / (DT[i]$value1 / DT[i]$value2 - value1 / value2)))]]
}
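For illustration, this is what the regex match looks like for the second row ("A & B") of the example data:
gsub("&", ".*", "A & B", fixed = TRUE)
# [1] "A .* B"
grepl("A .* B", DT$pattern)
# [1] FALSE  TRUE  TRUE FALSE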
Unfortunately, the real dataset is quite large and this for-loop is terribly slow on it. I hope someone can help me with some data.table magic I don't know yet to solve this problem. My question is basically similar to this one, but the patterns are given as strings, so I cannot use range joins.
Background: The patterns come from association rule mining, for example {onions, potatoes => burger}. The real data contains thousands of different items (A, B, C and D in the example). I am trying to add a statistical measure of how a rule relates to its sub-rules.
I am not following what calculations you want to do (I tried running your code and got Inf in two rows), but as a general idea, you could do something like this as an intermediate step:
DT[, hasA := grepl("A", pattern)]
DT[, hasB := grepl("B", pattern)]
DT[, hasC := grepl("C", pattern)]
DT[, hasD := grepl("D", pattern)]
DT[, foo_0 := value1*value2]
and go from there.
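For instance, here is a minimal sketch of one way to continue from those indicator columns (the foo2 column, the item matrix and the containment test are my additions, not part of the question): it reproduces the pairwise logic of the for-loop without any grepl() inside the loop, although it is still quadratic in the number of rows.
items   <- as.matrix(DT[, .(hasA, hasB, hasC, hasD)]) * 1L   # 0/1 item matrix
n_items <- rowSums(items)                                     # items per pattern
shared   <- items %*% t(items)    # shared[i, x]: number of items rows i and x have in common
contains <- shared == matrix(n_items, nrow(DT), nrow(DT), byrow = TRUE)
# contains[i, x] is TRUE when pattern[x] is contained in pattern[i]

DT[, foo2 := sapply(seq_len(.N), function(i) {
  s <- setdiff(which(contains[, i]), i)   # rows whose pattern contains pattern[i]
  if (length(s) == 0L) return(NA_real_)
  min((value1[i] - value1[s]) / (value1[i] / value2[i] - value1[s] / value2[s]))
})]
On the real data you would build the item matrix programmatically (e.g. from strsplit(pattern, " & ", fixed = TRUE)) rather than with one grepl() call per item.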