How to check the quality of the probability estimate?

I created a heuristic (an ANN, but it doesn't matter) to estimate the probability of an event (sports results, but that doesn't matter either). Given some inputs, this heuristic tells you what the probability of the event is. Something like: given these inputs, Team B has a 65% chance of winning.

I have a large set of inputs for which I now know the outcome (previously played games). What formula / metric could I use to determine the accuracy of my estimates?

The problem I see is this: if the estimate says an event has a 20% chance and the event actually happens, I cannot tell whether I was right or wrong. Perhaps the estimate was wrong and the event was actually much more likely. Perhaps it was right: the event had a 20% chance of occurring and it occurred. Or perhaps it was wrong in the other direction: the event had a very low chance of occurring, say 1 in 1000, but it happened this time anyway.

Fortunately, I have a lot of test data, so there should be a way to use it to evaluate my heuristic.

Any ideas?

3 answers


There are several dimensions along which the performance of a binary classifier can be quantified.

The first question is whether you care about your estimator (the ANN, for example) outputting calibrated probabilities or not.

If not, then all that matters is the rank ordering, and the area under the ROC curve (AUROC) is a pretty good summary of performance. Others are the KS statistic and lift. Many such measures are in use, and they highlight different aspects of performance.
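For illustration, a minimal sketch of these rank-ordering metrics in Python, assuming NumPy, SciPy and scikit-learn are available; the arrays y_true (0/1 outcomes) and y_prob (predicted probabilities) are placeholder names and made-up data, not anything from the question:

    import numpy as np
    from scipy.stats import ks_2samp
    from sklearn.metrics import roc_auc_score

    # Hypothetical example data: 0/1 outcomes and the model's predicted probabilities.
    y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])
    y_prob = np.array([0.10, 0.40, 0.35, 0.80, 0.20, 0.70, 0.55, 0.90])

    # Area under the ROC curve: 1.0 is a perfect ranking, 0.5 is random.
    auroc = roc_auc_score(y_true, y_prob)

    # KS statistic: maximum separation between the score distributions of the
    # positive and negative classes.
    ks = ks_2samp(y_prob[y_true == 1], y_prob[y_true == 0]).statistic

    print(f"AUROC = {auroc:.3f}, KS = {ks:.3f}")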



If you care about calibrated probabilities, the most common metrics are cross entropy (also known as the Bernoulli log-likelihood, the typical measure used in logistic regression) and the Brier score. The Brier score is nothing more than the mean squared error between the continuous predicted probabilities and the binary actual outcomes.
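Again a minimal sketch, assuming scikit-learn and the same placeholder y_true / y_prob arrays as above:

    import numpy as np
    from sklearn.metrics import brier_score_loss, log_loss

    # Same hypothetical example data as before.
    y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])
    y_prob = np.array([0.10, 0.40, 0.35, 0.80, 0.20, 0.70, 0.55, 0.90])

    # Cross entropy / negative Bernoulli log-likelihood (lower is better).
    cross_entropy = log_loss(y_true, y_prob)

    # Brier score: mean squared error between predicted probabilities and 0/1 outcomes.
    brier = brier_score_loss(y_true, y_prob)
    brier_by_hand = np.mean((y_prob - y_true) ** 2)  # same value, computed directly

    print(f"cross entropy = {cross_entropy:.3f}, Brier score = {brier:.3f} ({brier_by_hand:.3f})")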

Which one to use depends on the end use of the classifier. For example, your classifier might estimate blowout probabilities very well but be substandard on close games.

Usually, the true metric you are trying to optimize is "dollars". That is often difficult to express mathematically, but starting from it is your best chance of finding a suitable, computable metric.

It depends to some extent on the decision function used.

In the case of a binary classification problem (predicting whether an event happened or not [e.g. a win]), a simple decision function is to predict 1 if the probability is greater than 50%, and 0 otherwise.

If you have a multi-class problem (predicting which of K events occurred [e.g. win / draw / lose]), you can predict the class with the highest probability.



The way to evaluate your heuristic is then to compute the prediction error: compare the actual class of each instance to the class your heuristic predicted for it.
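A minimal sketch of that decision function plus accuracy evaluation, assuming NumPy and made-up example arrays (all names here are placeholders):

    import numpy as np

    # Hypothetical binary case: y_prob holds the predicted probability of a win.
    y_true = np.array([0, 1, 1, 0, 1])
    y_prob = np.array([0.30, 0.65, 0.80, 0.55, 0.40])

    # Predict 1 if the probability is above 50%, otherwise 0, then measure accuracy.
    y_pred = (y_prob > 0.5).astype(int)
    accuracy = np.mean(y_pred == y_true)

    # Hypothetical multi-class case (win / draw / lose): one row of class
    # probabilities per game; predict the class with the highest probability.
    prob_matrix = np.array([[0.6, 0.3, 0.1],
                            [0.2, 0.5, 0.3],
                            [0.1, 0.2, 0.7]])
    true_class = np.array([0, 1, 2])
    pred_class = prob_matrix.argmax(axis=1)
    multi_accuracy = np.mean(pred_class == true_class)

    print(f"binary accuracy = {accuracy:.2f}, multi-class accuracy = {multi_accuracy:.2f}")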

Note that you should split your data into training / test sets in order to get unbiased performance estimates.

There are other evaluation tools as well, such as ROC curves, which display performance in terms of true and false positive rates.
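If you want the curve itself rather than just a summary number, scikit-learn's roc_curve returns the true / false positive rates at each decision threshold; a minimal sketch with made-up placeholder data:

    import numpy as np
    from sklearn.metrics import roc_curve

    y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])
    y_prob = np.array([0.10, 0.40, 0.35, 0.80, 0.20, 0.70, 0.55, 0.90])

    # fpr / tpr are the false and true positive rates at each threshold;
    # plotting tpr against fpr gives the ROC curve.
    fpr, tpr, thresholds = roc_curve(y_true, y_prob)
    for f, t, th in zip(fpr, tpr, thresholds):
        print(f"threshold {th:.2f}: TPR = {t:.2f}, FPR = {f:.2f}")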

As you stated, if you predict that an event will happen 20% of the time (and not happen 80% of the time), observing a single isolated event will not tell you how good or bad your estimator is. However, if you have a large sample of events for which you predicted a 20% chance of occurrence, but observe that they actually occurred about 30% of the time, you may start to suspect that your estimator is off.

One approach would be to group your events by predicted probability of occurrence, observe the actual frequency in each group, and measure the difference. For example, depending on how much data you have, take all the events for which you predicted an occurrence probability between 20% and 25%, calculate the actual frequency of occurrence in that group, and do the same for every other group. This should give you an idea of whether your estimator is biased, and possibly in which probability ranges it is off.
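For illustration, a minimal Python sketch of that binning idea; the function name calibration_table and the toy data are my own, and only NumPy is assumed:

    import numpy as np

    def calibration_table(y_true, y_prob, n_bins=10):
        """Group predictions into probability bins and compare the average
        predicted probability to the observed frequency in each bin."""
        edges = np.linspace(0.0, 1.0, n_bins + 1)
        # Assign each prediction to a bin; clip so that a probability of
        # exactly 1.0 falls into the last bin.
        idx = np.clip(np.digitize(y_prob, edges) - 1, 0, n_bins - 1)
        rows = []
        for b in range(n_bins):
            mask = idx == b
            if not mask.any():
                continue
            rows.append((f"[{edges[b]:.1f}, {edges[b + 1]:.1f})",
                         int(mask.sum()),
                         float(y_prob[mask].mean()),
                         float(y_true[mask].mean())))
        return rows

    # Toy data drawn so that the predictions are perfectly calibrated; a biased
    # estimator would show a gap between the last two columns.
    rng = np.random.default_rng(0)
    y_prob = rng.uniform(0.0, 1.0, 5000)
    y_true = (rng.uniform(0.0, 1.0, 5000) < y_prob).astype(int)

    print("bin          n      mean predicted   observed frequency")
    for bin_label, n, pred, obs in calibration_table(y_true, y_prob):
        print(f"{bin_label}   {n:5d}   {pred:.3f}            {obs:.3f}")

If you prefer a library function, scikit-learn's calibration_curve (in sklearn.calibration) computes essentially the same per-bin averages.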
