Evaluating the Performance of a Zero-Inflated Negative Binomial Model
I am modeling the diffusion of films along chains (based on telephone data) using a zero-inflated negative binomial model (package: pscl):
m1 <- zeroinfl(LENGTH_OF_DIFF ~ ., data = trainData, dist = "negbin")
(variables described below). The next step is to evaluate the performance of the model.
My attempt was to make some out-of-sample predictions and calculate the MSE.
Using
predict(m1, newdata = testData)
I get a prediction of the average diffusion chain length for each datapoint, and using
predict(m1, newdata = testData, type = "prob")
I get a matrix containing the probability that each datapoint has each particular length.
The estimation problem: since my dataset is inflated with zeros (and ones), the model would be correct most of the time if it simply predicted 0 for every observation. The predictions I get are good for chains of zero length (according to MSE), but the deviation between predicted and true values for chains of length 1 or more is substantial.
My question is:
- How can I assess how well the model predicts chains of non-zero length?
- Is this the right way to make predictions from a zero-inflated negative binomial model?
- If so: how can I interpret these results?
- If not: what alternative can I use?
My variables:
- Dependent variable:
- diffusion chain length (count, range [0, 36])
- Independent variables:
- film characteristics (both dummy and continuous variables)
Thanks!
You could directly estimate the RMSPE (Root Mean Square Predictive Error), but it is probably best to transform your counts beforehand so that very large counts don't dominate that sum.
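A minimal sketch of that idea, using made-up observed and predicted vectors (the log1p transform here is one possible choice for taming large counts while handling zeros; substitute whatever transform you prefer):

```r
# Hypothetical observed counts and model predictions, for illustration only
obs <- c(0, 0, 1, 3, 12, 0, 5)
preds.count <- c(0.2, 0.4, 1.5, 2.1, 8.7, 0.1, 4.3)

# RMSPE on log1p-transformed counts, so very large counts don't dominate
rmspe <- sqrt(mean((log1p(obs) - log1p(preds.count))^2))
rmspe
```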
You could also compute false negative and false positive error rates (FNR and FPR). FNR is the probability that a chain with an actually non-zero length is falsely predicted to have zero (i.e. no, hence "negative") length. FPR is the probability that a chain with an actual length of zero is falsely predicted to have a nonzero (i.e. "positive") length. I suggest googling these terms to find a paper in your favorite quantitative journal, or a book chapter that explains them simply. For ecologists, I tend to go back to Fielding and Bell (1997, Environmental Conservation). First, let's define a reproducible example that anyone can use (not sure where your trainData comes from). This is from the help page of the zeroinfl function in the pscl library:
# an example from help on zeroinfl function in pscl library
library(pscl)
fm_zinb2 <- zeroinfl(art ~ . | ., data = bioChemists, dist = "negbin")
There are several packages in R that compute these rates, but here's a manual approach. First, calculate the observed and predicted values:
# store observed values, and determine how many are nonzero
obs <- bioChemists$art
obs.nonzero <- obs > 0
table(obs)
table(obs.nonzero)
# calculate predicted counts, and check their distribution
preds.count <- predict(fm_zinb2, type="response")
plot(density(preds.count))
# also the predicted probability that each item is nonzero
preds <- 1-predict(fm_zinb2, type = "prob")[,1]
preds.nonzero <- preds > 0.5
plot(density(preds))
table(preds.nonzero)
Then compute the confusion matrix (the basis for FNR and FPR):
# the confusion matrix is obtained by tabulating the dichotomized observations and predictions
confusion.matrix <- table(preds.nonzero, obs.nonzero)
FNR <- confusion.matrix[1,2] / sum(confusion.matrix[,2])  # predicted zero, actually nonzero
FPR <- confusion.matrix[2,1] / sum(confusion.matrix[,1])  # predicted nonzero, actually zero
FNR
FPR
In terms of calibration, we can assess this visually or with a calibration regression.
# let look at how well the counts are being predicted
library(ggplot2)
output <- as.data.frame(list(preds.count=preds.count, obs=obs))
ggplot(aes(x=obs, y=preds.count), data=output) + geom_point(alpha=0.3) + geom_smooth(col="blue")
Log-transforming the counts to "see" what is happening:
output$log.obs <- log(output$obs)
output$log.preds.count <- log(output$preds.count)
ggplot(aes(x=log.obs, y=log.preds.count), data=output[!is.na(output$log.obs) & !is.na(output$log.preds.count),]) + geom_jitter(alpha=0.3, width=.15, size=2) + geom_smooth(col="blue") + labs(x="Observed count (non-zero, natural logarithm)", y="Predicted count (non-zero, natural logarithm)")
In your case, you could also evaluate correlations between the predicted counts and the actual values, both including and excluding the zeros.
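A sketch of those correlations, rebuilt on the bioChemists example from above so it runs standalone (the rank-based Spearman variant is an extra option I'm adding, useful for skewed counts):

```r
library(pscl)

# Refit the example model so this chunk is self-contained
fm_zinb2 <- zeroinfl(art ~ . | ., data = bioChemists, dist = "negbin")
obs <- bioChemists$art
preds.count <- predict(fm_zinb2, type = "response")

cor(obs, preds.count)                        # all observations
cor(obs[obs > 0], preds.count[obs > 0])      # zeros excluded
cor(obs, preds.count, method = "spearman")   # rank-based, robust to skew
```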
So you can even fit a regression as a kind of calibration to evaluate this! However, since the predictions are not necessarily integer counts, we cannot use Poisson regression; instead we can take a lognormal approach, regressing the log of the predictions on the log of the observed counts and assuming a normal response:
calibrate <- lm(log(preds.count) ~ log(obs), data=output[output$obs!=0 & output$preds.count!=0,])
summary(calibrate)
sigma <- summary(calibrate)$sigma
sigma
There are fancier ways to assess calibration, I suppose, as with any modeling exercise... but it's a start.
For a more advanced evaluation of zero-inflated models, look at how the log likelihood can be used, following the references given in the documentation for the zeroinfl function. It takes a little finesse.
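One common route along those lines (my suggestion, not spelled out above) is to compare the zero-inflated fit against a plain negative binomial on log likelihood and AIC, and with pscl's vuong() test for non-nested models:

```r
library(pscl)
library(MASS)  # for glm.nb

# Zero-inflated vs. plain negative binomial on the same data
fm_zinb2 <- zeroinfl(art ~ . | ., data = bioChemists, dist = "negbin")
fm_nb <- glm.nb(art ~ ., data = bioChemists)

logLik(fm_zinb2)        # log likelihood of the zero-inflated model
logLik(fm_nb)           # log likelihood of the plain negative binomial
AIC(fm_zinb2, fm_nb)    # lower AIC is preferred
vuong(fm_zinb2, fm_nb)  # Vuong test for non-nested model comparison
```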