How to calculate the classification error rate

This question is rather difficult, so I'll give you an example.

In the listing below, the left numbers are the cluster labels assigned by my algorithm and the right numbers are the original (true) class IDs:

177 86
177 86
177 86
177 86
177 86
177 86
177 86
177 86
177 86
177 89
177 89
177 89
177 89
177 89
177 89
177 89

      

Here my algorithm merged 2 different classes into 1: as you can see, it combined classes 86 and 89 into a single cluster. So what would the error rate be in the example above?

Or here's another example

203 7
203 7
203 7
203 7
16 7
203 7
17 7
16 7
203 7

      

In the example above, the left numbers are again my algorithm's cluster labels and the right numbers are the original class IDs. As you can see, it misclassifies 3 products (they are all the same commercial product, so they should fall into one cluster). So what would the error rate be in this example? How would you calculate it?

This question is quite complex. We have finished the clustering, but we cannot find the right way to calculate the success rate :D

+3




4 answers


Here's a longer example: a real confusion matrix with 10 input classes "0" - "9" (handwritten digits) and 10 output clusters labeled A - J.

Confusion matrix for 5620 optdigits:

True 0 - 9 down, clusters A - J across
-----------------------------------------------------
      A    B    C    D    E    F    G    H    I    J
-----------------------------------------------------
0:    2         4         1       546    1
1:   71  249        11    1    6            228    5
2:   13    5        64    1   13    1       460
3:   29    2       507        20         5    9
4:        33  483         4   38         5    3    2
5:    1    1    2   58    3            480   13
6:    2    1    2       294         1         1  257
7:    1    5    1            546         6    7
8:  415   15    2    5    3   12        13   87    2
9:   46   72    2  357        35    1   47    2
----------------------------------------------------
    580  383  496 1002  307  670  549  557  810  266  estimates in each cluster

y class sizes: [554 571 557 572 568 558 558 566 554 562]
kmeans cluster sizes: [ 580  383  496 1002  307  670  549  557  810  266]

      

For example, cluster A has 580 data points, of which 415 are "8"; Cluster B has 383 data points, 249 of which are "1"; etc.

The problem is that the output cluster labels are arbitrary (scrambled, rearranged). They match up in this order, with counts:

      A    B    C    D    E    F    G    H    I    J
      8    1    4    3    6    7    0    5    2    6
    415  249  483  507  294  546  546  480  460  257

      



We could say that the "success rate" is 75% = (415 + 249 + 483 + 507 + 294 + 546 + 546 + 480 + 460 + 257) / 5620,
but this discards useful information: here, that both E and J say "6" and that no cluster says "9".

So one measure is: add the largest number in each column of the confusion matrix and divide by the total.
But how should overlapping / missing clusters be counted, like the two clusters saying "6" and none saying "9" here?
I am not aware of any general agreement (I doubt the Hungarian algorithm, which finds an optimal one-to-one matching, is used much in practice).
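That column-max computation can be sketched in a few lines of Python. The matrix is the optdigits confusion matrix from this answer, with the blanks filled in as zeros:

```python
# Purity of a clustering: sum the largest count in each cluster (column)
# and divide by the total number of points.
# Rows are true digits 0-9, columns are clusters A-J (from the table above).
confusion = [
    [  2,   0,   4,   0,   1,   0, 546,   1,   0,   0],
    [ 71, 249,   0,  11,   1,   6,   0,   0, 228,   5],
    [ 13,   5,   0,  64,   1,  13,   1,   0, 460,   0],
    [ 29,   2,   0, 507,   0,  20,   0,   5,   9,   0],
    [  0,  33, 483,   0,   4,  38,   0,   5,   3,   2],
    [  1,   1,   2,  58,   3,   0,   0, 480,  13,   0],
    [  2,   1,   2,   0, 294,   0,   1,   0,   1, 257],
    [  1,   5,   1,   0,   0, 546,   0,   6,   7,   0],
    [415,  15,   2,   5,   3,  12,   0,  13,  87,   2],
    [ 46,  72,   2, 357,   0,  35,   1,  47,   2,   0],
]

column_maxima = [max(col) for col in zip(*confusion)]  # majority count per cluster
total = sum(sum(row) for row in confusion)             # 5620 points in all
purity = sum(column_maxima) / total

print(column_maxima)      # [415, 249, 483, 507, 294, 546, 546, 480, 460, 257]
print(round(purity, 2))   # 0.75
```

This reproduces the 75% figure above, but as noted, the single number hides the two-"6" / no-"9" structure.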

Bottom line: don't discard information; look at the whole confusion matrix.

NB: such a "success rate" will be optimistic for new data!
It is common practice to divide the data into a 2/3 "training set" and a 1/3 "test set", train (for example, k-means) on the 2/3 only,
then measure the confusion / success rate on the test set; it is generally worse than on the training set.
Much more can be said; see for example cross-validation.
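A minimal sketch of that 2/3 / 1/3 split in plain Python (the fraction and the seed are arbitrary choices; no particular clustering library is assumed):

```python
import random

def train_test_split(data, train_fraction=2/3, seed=0):
    """Shuffle indices and split the data into a training set and a test set."""
    rng = random.Random(seed)          # fixed seed for reproducibility
    indices = list(range(len(data)))
    rng.shuffle(indices)
    cut = int(len(data) * train_fraction)
    train = [data[i] for i in indices[:cut]]
    test = [data[i] for i in indices[cut:]]
    return train, test

train, test = train_test_split(list(range(30)))
print(len(train), len(test))  # 20 10
```

You would then fit the clustering on `train` only and compute the confusion matrix / success rate on `test`.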

+4




You have to define an error criterion if you want to evaluate an algorithm's performance, so I'm not sure exactly what you are asking. In many clustering and machine learning algorithms, you define an error rate and then minimize it.



Take a look at https://en.wikipedia.org/wiki/Confusion_matrix to get some ideas.
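As one idea along those lines, here is a minimal sketch of building such a confusion matrix from paired (true class, cluster label) observations, using the asker's second example:

```python
from collections import Counter

def confusion_matrix(classes, clusters):
    """Count how often each (true class, cluster label) pair occurs."""
    counts = Counter(zip(classes, clusters))
    class_ids = sorted(set(classes))
    cluster_ids = sorted(set(clusters))
    return {c: {k: counts[(c, k)] for k in cluster_ids} for c in class_ids}

# The asker's second example: true class 7 spread over clusters 16, 17 and 203.
clusters = [203, 203, 203, 203, 16, 203, 17, 16, 203]
classes = [7] * 9
print(confusion_matrix(classes, clusters))  # {7: {16: 2, 17: 1, 203: 6}}
```

The nested dict makes it easy to see how a single true class has been split across clusters.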

0




You must define an error rate in order to measure performance. In your case, a simple approach would be to define a mapping from each product to its properties:

p = properties(id)

where id is the product identifier and p is likely a vector with one entry per property. Then you can define the error function e (or distance) between two products as

e = d(p1, p2)

Of course, every property must be reduced to a number for this function. This error function can then be used in the classification and learning algorithm.

In your second example it seems that you consider the pair (203 7) a successful classification, so I think you already have a metric. You could be more specific to get a better answer.
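As a sketch of one possible choice for d, assuming p1 and p2 are already numeric property vectors (a plain Euclidean distance; many other metrics would do):

```python
import math

def d(p1, p2):
    """Euclidean distance between two property vectors (one possible choice of d)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p1, p2)))

print(d([1.0, 2.0], [4.0, 6.0]))  # 5.0
```

Categorical properties would first have to be encoded as numbers (or given their own distance term) before this applies.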

0




Classification Error Rate (CER) is 1 - Purity ( http://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-clustering-1.html )

ClusterPurity <- function(clusters, classes) {
    sum(apply(table(classes, clusters), 2, max)) / length(clusters)
}

      

Or, adapting @John-Colby's code:

CER <- function(clusters, classes) {
    1 - sum(apply(table(classes, clusters), 2, max)) / length(clusters)
}
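A rough Python equivalent of those R functions, assuming clusters and classes are parallel label sequences:

```python
from collections import Counter

def purity(clusters, classes):
    """Sum of the majority-class count in each cluster, divided by total points."""
    counts = Counter(zip(clusters, classes))
    majority = Counter()
    for (cluster, _), n in counts.items():
        majority[cluster] = max(majority[cluster], n)
    return sum(majority.values()) / len(clusters)

def cer(clusters, classes):
    """Classification Error Rate = 1 - purity."""
    return 1 - purity(clusters, classes)

# The asker's first example: classes 86 and 89 merged into one cluster 177.
clusters = [177] * 16
classes = [86] * 9 + [89] * 7
print(purity(clusters, classes))  # 0.5625  (9/16)
```

Note that on the asker's second example (one true class split over several clusters) purity is 1.0, since every cluster is pure; purity penalizes merging but not splitting.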

      

-1








