Faster aggregation over subsets in data.table
I want to add a new column to my data.table. The value of this column has to be calculated from two columns of all rows that satisfy a certain condition. An example with a small data.table looks like this:
library(data.table)
DT <- data.table(pattern = c("A", "A & B", "A & B & C", "A & C & D"),
                 value1  = c(1, 2, 3, 4),
                 value2  = c(5, 6, 7, 8))
     pattern value1 value2
1:         A      1      5
2:     A & B      2      6
3: A & B & C      3      7
4: A & C & D      4      8
For every row x and every row i where pattern[x] is a sub-pattern of pattern[i], I want to perform the computation
min((value1[i]-value1[x])/(value1[i]/value2[i]-value1[x]/value2[x]))
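To make the formula concrete, here is the arithmetic for the pair x = "A" (row 1) and i = "A & B" (row 2) from the example data:
# x = "A":     value1[x] = 1, value2[x] = 5
# i = "A & B": value1[i] = 2, value2[i] = 6
(2 - 1) / (2/6 - 1/5)
# [1] 7.5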
Since the items within a pattern always appear in the same order, I can find the matching rows with a regex by replacing "&" with the wildcard ".*" and checking that the match is not the pattern itself. Hence, I can use a for-loop over each row:
setkey(DT, pattern)
for (i in 1:nrow(DT)) {
  # take all rows whose pattern contains pattern[i] (excluding row i itself)
  # and compute the minimum of the expression over those rows (NA if there are none)
  DT[i, foo := DT[grepl(gsub("&", ".*", DT[i]$pattern, fixed = TRUE), pattern) & DT[i]$pattern != pattern,
                  ifelse(.N == 0,
                         NA,
                         min((DT[i]$value1 - value1) / (DT[i]$value1 / DT[i]$value2 - value1 / value2)))]]
}
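For illustration, this is what the regex match looks like for the second row ("A & B") of the example data:
gsub("&", ".*", "A & B", fixed = TRUE)
# [1] "A .* B"
grepl("A .* B", DT$pattern)
# [1] FALSE  TRUE  TRUE FALSE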
Unfortunately, the real dataset is quite large and this for-loop is terribly slow on it. I hope someone can help me with some data.table magic I don't know yet to solve this problem. My question is basically similar to this one, but the patterns are given as strings, so I cannot use range joins.
Background: The patterns come from association rule mining, for example {onions, potatoes => burger}. The real data contains thousands of different items (A, B, C and D in the example). I am trying to add a statistical measure of how a rule relates to its sub-rules.
I am not following what calculations you want to do (I tried running your code and got Inf in two rows), but as a general idea, you could do something like this as an intermediate step:
DT[, hasA := grepl("A", pattern)]
DT[, hasB := grepl("B", pattern)]
DT[, hasC := grepl("C", pattern)]
DT[, hasD := grepl("D", pattern)]
DT[, foo_0 := value1*value2]
and go from there.
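For instance, here is a minimal sketch of one way to continue from those indicator columns (the foo2 column, the item matrix and the containment test are my additions, not part of the question): it reproduces the pairwise logic of the for-loop without any grepl() inside the loop, although it is still quadratic in the number of rows.
items   <- as.matrix(DT[, .(hasA, hasB, hasC, hasD)]) * 1L   # 0/1 item matrix
n_items <- rowSums(items)                                     # items per pattern
shared   <- items %*% t(items)    # shared[i, x]: number of items rows i and x have in common
contains <- shared == matrix(n_items, nrow(DT), nrow(DT), byrow = TRUE)
# contains[i, x] is TRUE when pattern[x] is contained in pattern[i]

DT[, foo2 := sapply(seq_len(.N), function(i) {
  s <- setdiff(which(contains[, i]), i)   # rows whose pattern contains pattern[i]
  if (length(s) == 0L) return(NA_real_)
  min((value1[i] - value1[s]) / (value1[i] / value2[i] - value1[s] / value2[s]))
})]
On the real data you would build the item matrix programmatically (e.g. from strsplit(pattern, " & ", fixed = TRUE)) rather than with one grepl() call per item.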