How to calculate the classification error rate
This question is rather difficult, so I'll give you an example.
The numbers on the left are my algorithm's classifications, and the numbers on the right are the original class IDs:
177 86
177 86
177 86
177 86
177 86
177 86
177 86
177 86
177 86
177 89
177 89
177 89
177 89
177 89
177 89
177 89
Here my algorithm merged two different classes into one: as you can see, it combined classes 86 and 89 into a single class. So what would the error be in the above example?
Or here's another example
203 7
203 7
203 7
203 7
16 7
203 7
17 7
16 7
203 7
In the example above, the numbers on the left are again my algorithm's classifications and the numbers on the right are the original class IDs. As you can see, it misclassifies 3 products (I am classifying identical commercial products). So what would the error rate be in this example, and how would you calculate it?
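One way to arrive at the 3-out-of-9 count above (a sketch, assuming each true class is assigned to the single cluster that holds most of its items, and everything outside that cluster counts as an error):

```python
from collections import Counter

# (cluster assigned by the algorithm, original class ID) for the second example
pairs = [(203, 7), (203, 7), (203, 7), (203, 7), (16, 7),
         (203, 7), (17, 7), (16, 7), (203, 7)]

# for each true class, count how its items are spread across clusters
by_class = {}
for cluster, cls in pairs:
    by_class.setdefault(cls, Counter())[cluster] += 1

errors = 0
for cls, counts in by_class.items():
    dominant = counts.most_common(1)[0][1]      # items in the dominant cluster
    errors += sum(counts.values()) - dominant   # everything else is an error

error_rate = errors / len(pairs)  # 3 / 9
```

Here class 7's dominant cluster is 203 (6 of 9 items), so the 16/16/17 items count as the 3 errors.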
This problem is quite complex. We have finished the classification, but we cannot find the right way to calculate the success rate. :D
Here's a longer example: a real confusion matrix with 10 input classes "0" - "9" (handwritten digits) and 10 output clusters labeled A - J.
Confusion matrix for 5620 optdigits:
True 0 - 9 down, clusters A - J across
-----------------------------------------------------
A B C D E F G H I J
-----------------------------------------------------
0: 2 4 1 546 1
1: 71 249 11 1 6 228 5
2: 13 5 64 1 13 1 460
3: 29 2 507 20 5 9
4: 33 483 4 38 5 3 2
5: 1 1 2 58 3 480 13
6: 2 1 2 294 1 1 257
7: 1 5 1 546 6 7
8: 415 15 2 5 3 12 13 87 2
9: 46 72 2 357 35 1 47 2
----------------------------------------------------
580 383 496 1002 307 670 549 557 810 266 estimates in each cluster
y class sizes: [554 571 557 572 568 558 558 566 554 562]
kmeans cluster sizes: [ 580 383 496 1002 307 670 549 557 810 266]
For example, cluster A has 580 data points, of which 415 are "8"; Cluster B has 383 data points, 249 of which are "1"; etc.
The problem is that the output clusters are scrambled, i.e. permuted with respect to the classes; they match up in this order, with these counts:
A B C D E F G H I J
8 1 4 3 6 7 0 5 2 6
415 249 483 507 294 546 546 480 460 257
We can say that the "success rate" is 75% = (415 + 249 + 483 + 507 + 294 + 546 + 546 + 480 + 460 + 257) / 5620
but this discards useful information; here, for instance, both E and J say "6" and no cluster says "9".
So, add the largest numbers in each column of the confusion matrix and divide by the total.
But how to count overlapping / missing clusters as 2 "6", no "9" here?
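The column-max rule above, applied to the per-cluster maxima already listed for A - J, takes only a few lines (a sketch):

```python
# largest count in each cluster column A - J, taken from the table above
col_max = [415, 249, 483, 507, 294, 546, 546, 480, 460, 257]
total = 5620  # optdigits data points

success_rate = sum(col_max) / total  # 4237 / 5620, about 0.75
print(round(success_rate, 3))
```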
I am not aware of any general agreement on this (I doubt the Hungarian algorithm is used in practice).
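For a small confusion matrix you can see what an optimal one-to-one matching would give by brute force over all permutations; the Hungarian algorithm computes the same thing efficiently for large matrices. A sketch with a made-up 3x3 matrix:

```python
from itertools import permutations

# made-up confusion matrix: rows = true classes, columns = clusters
conf = [[5, 1, 0],
        [0, 2, 4],
        [3, 0, 3]]
total = sum(sum(row) for row in conf)  # 18 points

# try every one-to-one assignment of clusters to classes, keep the best
best = max(sum(conf[i][p[i]] for i in range(3))
           for p in permutations(range(3)))
success_rate = best / total
```

Unlike the column-max rule, this never lets two classes claim the same cluster, so it directly exposes the "two clusters say 6, none says 9" problem.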
Bottom line: don't discard information; look at the whole confusion matrix.
NB: such a "success rate" would be optimistic for new data!
It is common practice to split the data into a 2/3 "training set" and a 1/3 "test set", train (e.g. k-means) on the 2/3 only,
then measure the confusion / success rate on the test set; it is generally worse than on the training set.
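The 2/3 - 1/3 split itself is simple; a stdlib-only sketch of the index shuffling (not the clustering itself):

```python
import random

data = list(range(30))  # stand-in for your samples
random.seed(42)         # fixed seed for a reproducible split
random.shuffle(data)

cut = len(data) * 2 // 3
train, test = data[:cut], data[cut:]
# train the clustering on `train` only, then score the confusion matrix on `test`
```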
Much more can be said; see for example cross-validation.
You have to define an error criterion if you want to evaluate the performance of an algorithm, so I'm not sure exactly what you are asking. In some clustering and machine learning algorithms, you define the error rate and minimize it.
Take a look at this https://en.wikipedia.org/wiki/Confusion_matrix to get some ideas
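Building a confusion matrix from (true, predicted) pairs needs nothing more than a counter; a sketch using the question's second example:

```python
from collections import Counter

predicted = [203, 203, 203, 203, 16, 203, 17, 16, 203]
true = [7] * 9

# (true class, predicted cluster) -> count
confusion = Counter(zip(true, predicted))
```

Each entry of `confusion` is one cell of the matrix; cells not present count as zero.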
You must define an error metric before you can measure anything. In your case, a simple approach would be to define a mapping from each product to its properties,
p = properties(id)
where id is the product identifier and p is a vector with one entry per property. Then you can define an error function e (or distance) between two products as
e = d(p1, p2)
Of course, every property must be mapped to a number for this function to work. This error function can then be used in the classification and learning algorithm.
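A sketch of this idea (the property table and the IDs here are hypothetical, purely for illustration):

```python
import math

# hypothetical property table: product id -> numeric property vector
PROPS = {
    101: [1.0, 0.5, 3.0],
    102: [1.1, 0.4, 2.9],
    203: [9.0, 7.5, 0.2],
}

def properties(pid):
    return PROPS[pid]

def d(p1, p2):
    # Euclidean distance between two property vectors
    return math.dist(p1, p2)

e = d(properties(101), properties(102))  # small value: similar products
```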
In your second example it seems that you consider the pair (203, 7) a successful classification, so I think you already have your metric. You could be more specific to get a better answer.
Classification Error Rate (CER) is 1 - Purity ( http://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-clustering-1.html )
ClusterPurity <- function(clusters, classes) {
  # for each cluster, take the count of its most frequent class, then divide by n
  sum(apply(table(classes, clusters), 2, max)) / length(clusters)
}
(@John Colby's code.) Or:
CER <- function(clusters, classes) {
  1 - sum(apply(table(classes, clusters), 2, max)) / length(clusters)
}
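The same purity / CER computation, sketched in Python for readers not using R (the cross-tabulation that R's table() produces is built with a Counter here):

```python
from collections import Counter

def cluster_purity(clusters, classes):
    # cross-tabulate (cluster, class) pairs, then take each cluster's largest count
    table = Counter(zip(clusters, classes))
    best = Counter()
    for (cluster, cls), n in table.items():
        best[cluster] = max(best[cluster], n)
    return sum(best.values()) / len(clusters)

def cer(clusters, classes):
    return 1 - cluster_purity(clusters, classes)

# first example from the question: one cluster covering classes 86 and 89
clusters = [177] * 16
classes = [86] * 9 + [89] * 7
# cluster_purity -> 9/16, cer -> 7/16
```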