R: how to calculate the sensitivity and specificity of an rpart tree
library(rpart)
library(rpart.plot)  # provides prp()
train <- data.frame(ClaimID = c(1,2,3,4,5,6,7,8,9,10),
                    RearEnd = c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, FALSE),
                    Whiplash = c(TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE),
                    Activity = factor(c("active", "very active", "very active", "inactive", "very inactive", "inactive", "very inactive", "active", "active", "very active"),
                                      levels=c("very inactive", "inactive", "active", "very active"),
                                      ordered=TRUE),
                    Fraud = c(FALSE, TRUE, TRUE, FALSE, FALSE, TRUE, TRUE, FALSE, FALSE, TRUE))
mytree <- rpart(Fraud ~ RearEnd + Whiplash + Activity, data = train, method = "class", minsplit = 2, minbucket = 1, cp=-1)
prp(mytree, type = 4, extra = 101, leaf.round = 0, fallen.leaves = TRUE,
    varlen = 0, tweak = 1.2)
Then, using printcp, I can see the results of the cross-validation:
> printcp(mytree)
Classification tree:
rpart(formula = Fraud ~ RearEnd + Whiplash + Activity, data = train,
method = "class", minsplit = 2, minbucket = 1, cp = -1)
Variables actually used in tree construction:
[1] Activity RearEnd Whiplash
Root node error: 5/10 = 0.5
n= 10
    CP nsplit rel error xerror xstd
1  0.6      0       1.0    2.0  0.0
2  0.2      1       0.4    0.4  0.3
3 -1.0      3       0.0    0.4  0.3
So the root node error is 0.5, and I understand that it is a misclassification error. But I'm having trouble calculating the sensitivity (the proportion of actual positives that are classified as positive) and the specificity (the proportion of actual negatives that are classified as negative). How can I calculate them from the rpart output?
(the above example is from http://gormanalysis.com/decision-trees-in-r-using-rpart/ )
You can use the caret package for this:
Data:
library(rpart)
train <- data.frame(ClaimID = c(1,2,3,4,5,6,7,8,9,10),
                    RearEnd = c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, FALSE),
                    Whiplash = c(TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE),
                    Activity = factor(c("active", "very active", "very active", "inactive", "very inactive", "inactive", "very inactive", "active", "active", "very active"),
                                      levels=c("very inactive", "inactive", "active", "very active"),
                                      ordered=TRUE),
                    Fraud = c(FALSE, TRUE, TRUE, FALSE, FALSE, TRUE, TRUE, FALSE, FALSE, TRUE))
mytree <- rpart(Fraud ~ RearEnd + Whiplash + Activity, data = train, method = "class", minsplit = 2, minbucket = 1, cp=-1)
Solution:
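library(caret)
# predicted class labels; type = "class" returns a factor with levels "FALSE" and "TRUE"
# (without it, predict() returns a matrix of class probabilities)
preds <- predict(mytree, train, type = "class")
# observed values as a factor with the same levels as the predictions
obs <- factor(train$Fraud, levels = levels(preds))
# calculate sensitivity, with fraud (TRUE) as the positive class
> sensitivity(preds, obs, positive = "TRUE")
[1] 1
# calculate specificity, with non-fraud (FALSE) as the negative class
> specificity(preds, obs, negative = "FALSE")
[1] 1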
Both sensitivity and specificity take the predictions as the first argument and the observed values (the response variable, i.e. train$Fraud) as the second argument.
According to the documentation, both the predictions and the observed values must be passed to the functions as factors that have the same levels. Unless the positive argument is set, sensitivity treats the first level of the observed values as the positive class.
Both the specificity and the sensitivity are 1 in this case because the predictions are 100% accurate.
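For reference, the same two numbers can be checked by hand from a confusion table, without caret. This is only a minimal base-R sketch: obs, pred, lv and tab are illustrative names that do not appear above, fraud (TRUE) is assumed to be the positive class, and the predicted labels are simply set equal to the observed ones, which matches the perfect fit of this tree on the training data.

```r
# Observed labels from the example data (TRUE = fraud).
obs  <- c(FALSE, TRUE, TRUE, FALSE, FALSE, TRUE, TRUE, FALSE, FALSE, TRUE)
# Assumption: the fully grown tree reproduces the training labels exactly.
pred <- obs

# Confusion table with fixed levels, so every row and column always exists.
lv  <- c("FALSE", "TRUE")
tab <- table(predicted = factor(pred, levels = lv),
             observed  = factor(obs,  levels = lv))

tp <- tab["TRUE", "TRUE"]    # fraud predicted as fraud
fn <- tab["FALSE", "TRUE"]   # fraud predicted as non-fraud
tn <- tab["FALSE", "FALSE"]  # non-fraud predicted as non-fraud
fp <- tab["TRUE", "FALSE"]   # non-fraud predicted as fraud

sens <- unname(tp / (tp + fn))  # sensitivity = TP / (TP + FN)
spec <- unname(tn / (tn + fp))  # specificity = TN / (TN + FP)
```

With real predictions, pred would instead come from predict(mytree, train, type = "class"), and the two ratios would generally be below 1.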
The root node error is the misclassification error at the root of the tree, i.e. the error made before any splits are added, when every observation is assigned to the majority class. It is not the misclassification error of the final tree.
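To make the difference concrete, the sketch below recomputes both errors for the example data. It assumes the rpart package is installed; root_error, final_error and rel_error are illustrative names, and the final-tree error here is the error on the training data, not a cross-validated one.

```r
library(rpart)

# Same example data as above (ClaimID is omitted because the model does not use it).
train <- data.frame(
  RearEnd  = c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, FALSE),
  Whiplash = c(TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE),
  Activity = factor(c("active", "very active", "very active", "inactive",
                      "very inactive", "inactive", "very inactive", "active",
                      "active", "very active"),
                    levels = c("very inactive", "inactive", "active", "very active"),
                    ordered = TRUE),
  Fraud    = c(FALSE, TRUE, TRUE, FALSE, FALSE, TRUE, TRUE, FALSE, FALSE, TRUE)
)
mytree <- rpart(Fraud ~ RearEnd + Whiplash + Activity, data = train,
                method = "class", minsplit = 2, minbucket = 1, cp = -1)

# Root node error: assign everyone to the majority class, before any split.
root_error <- 1 - max(prop.table(table(train$Fraud)))

# Error of the final (fully grown) tree on the training data.
pred_class  <- predict(mytree, train, type = "class")
final_error <- mean(as.character(pred_class) != as.character(train$Fraud))

# The "rel error" column of printcp is expressed relative to the root node error,
# so the last row times root_error gives the final training error.
rel_error <- mytree$cptable[nrow(mytree$cptable), "rel error"]
```

Here rel_error * root_error equals final_error, which is how the rel error column in the printcp table relates to the root node error.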