Computed .BY operations in data.table

As an extension of this question, I would like to run calculations that involve a grouping variable .BY that is itself the product of a calculation. The questions I've looked at use a key that only accesses existing values, without transforming or aggregating them.
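
For context, here is a minimal toy illustration (not from the question) of what .BY exposes: inside j it is a one-element-per-grouping-variable list holding the current group's value.

library(data.table)
d <- data.table(g = c('a', 'a', 'b'), x = 1:3)
# .BY[[1]] is the current value of the grouping variable g for each group
d[, .(group = .BY[[1]], total = sum(x)), by = g]
# one row per group: ('a', 3) and ('b', 3), with g repeated in the 'group' column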

In this example I am trying to create an ROC curve for a binary classifier with a function that uses data.table (since ROC calculations in existing packages are quite slow). Here the variable .BY is the cutpoint (a threshold on the estimated probability), and the calculations are the true positive and false positive rates at that cutpoint.
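
For reference, at a single cutpoint the two rates reduce to something like the sketch below (using the dt and y_est built in the code that follows; 0.5 is just a hypothetical threshold):

thr <- 0.5  # hypothetical cutpoint
dt[, .(tpr = sum(y_est >= thr & y == 1) / sum(y == 1),   # true positive rate at thr
       fpr = sum(y_est >= thr & y == 0) / sum(y == 0))]  # false positive rate at thr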

I can do this with an intermediate data.table, but I am looking for a more efficient solution. This works:

# dummy example: German credit data, with binary target y = 1 for 'Bad' credit
library(data.table)
data(GermanCredit, package = 'caret')
dt <- setDT(GermanCredit)[, `:=`(y = as.integer(Class == 'Bad'), Class = NULL)]

# logistic regression and in-sample predicted probabilities
model <- glm(y ~ ., family = 'binomial', data = dt)
dt[, y_est := predict(model, type = 'response')]

#--- Generate ROC with specified # of cutpoints  ---
# level of resolution of ROC curve -- up to uniqueN(y_est)
res <- 5 

# vector of cutpoints (thresholds for y_est)
cuts <- dt[,.( thresh=quantile(y_est, probs=0:res/res) )]

# at y_est >= each threshold, how many true positive and false positives?
roc <-  cuts[, .( tpr = dt[y_est>=.BY[[1]],sum(y==1)]/dt[,sum(y==1)],
                  fpr = dt[y_est>=.BY[[1]],sum(y==0)]/dt[,sum(y==0)]
                 ), by=thresh]

plot(tpr~fpr,data=roc,type='s') # looks right

      

(plot: the resulting ROC curve)

But this doesn't work, because the by expression has to evaluate to one value per row of dt (or per row returned by i), not to an arbitrary vector of cutpoints:

# this doesn't work, and doesn't have access to the total positives & negatives
dt[, .(tp=sum( (y_est>=.BY[[1]]) & (y==1)  ),
       fp=sum( (y_est>=.BY[[1]]) & (y==0)  ) ),
   keyby=.(thresh= quantile(y_est, probs=0:res/res) )]
# Error in `[.data.table`(dt, , .(tp = sum((y_est >= .BY[[1]]) & (y == 1)),  : 
#   The items in the 'by' or 'keyby' list are length (6).
#   Each must be same length as rows in x or number of rows returned by i (1000).
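
(Just to illustrate the length requirement, not as a fix: a by expression that does evaluate per row, such as binning y_est with cut(), is accepted, but it produces disjoint bins rather than the cumulative counts an ROC needs.)

# per-row grouping works, but bins are not cumulative thresholds
dt[, .N, keyby = .(bin = cut(y_est, quantile(y_est, probs = 0:res/res), include.lowest = TRUE))]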

      

Is there an idiomatic data.table (or at least more efficient) way to do this?



1 answer


You can use non-equi joins:



dt[.(thresh = quantile(y_est, probs=0:res/res)), on = .(y_est >= thresh),
   .(fp = sum(y == 0), tp = sum(y == 1)), by = .EACHI][,
   lapply(.SD, function(x) x/x[1]), .SDcols = -"y_est"]
#           fp          tp
#1: 1.00000000 1.000000000
#2: 0.72714286 0.970000000
#3: 0.46857143 0.906666667
#4: 0.24142857 0.770000000
#5: 0.08142857 0.476666667
#6: 0.00000000 0.003333333
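
A note on how this works, plus a variant sketch of my own (roc2 below is not part of the answer) that keeps the cutpoints for plotting: in the non-equi join the result column named y_est actually holds the thresh values from i, and because the cutpoints are passed in ascending order the first row corresponds to the lowest threshold, where every row of dt matches, so dividing by x[1] turns the counts into rates.

# sketch: same non-equi join, but keeping the cutpoint and naming the rate columns
roc2 <- dt[.(thresh = quantile(y_est, probs = 0:res/res)), on = .(y_est >= thresh),
           .(tp = sum(y == 1), fp = sum(y == 0)), by = .EACHI
           ][, .(thresh = y_est, tpr = tp/max(tp), fpr = fp/max(fp))]
plot(tpr ~ fpr, data = roc2, type = 's')  # should match the curve from the cuts/.BY approach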

      
