Machine Learning: Moving Threshold

I am trying to solve a binary classification problem where 80% of the data belongs to class x and 20% belongs to class y. All my models (AdaBoost, neural networks, and SVC) simply predict that all data belongs to class x, since that is the maximum accuracy they can achieve.

My goal is to achieve higher precision for the records assigned to class x, and I don't care how many records are falsely classified as class y.

My idea is to assign records to class x only when the model is confident about them, and put them into class y otherwise.

How can I achieve this? Is there a way to move the decision threshold so that only very obvious entries are classified as class x?

I am using Python and scikit-learn.

Sample code:

from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import confusion_matrix

adaboost = AdaBoostClassifier(random_state=1)
adaboost.fit(X_train, y_train)
adaboost_prediction = adaboost.predict(X_test)

confusion_matrix(adaboost_prediction, y_test) outputs:

array([[    0,     0],
       [10845, 51591]])

      

2 answers


With AdaBoostClassifier you can output class probabilities and then threshold them yourself, using predict_proba instead of predict:

adaboost = AdaBoostClassifier(random_state=1)
adaboost.fit(X_train, y_train)
adaboost_probs = adaboost.predict_proba(X_test)  # shape (n_samples, 2)

threshold = 0.8  # for example
# take the probability column for one class and threshold it
thresholded_adaboost_prediction = adaboost_probs[:, 1] > threshold

With this approach you can also inspect (just by debug printing, or perhaps by sorting and plotting) how the confidence levels of your final model vary across the test data, to help decide whether it is worth continuing.
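As a minimal sketch of that inspection step (using synthetic 80/20 data from make_classification in place of the asker's data, and assuming binary 0/1 labels so that column 1 of predict_proba holds the probability of class 1):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data standing in for the asker's 80/20 split
X, y = make_classification(n_samples=2000, weights=[0.8, 0.2], random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

adaboost = AdaBoostClassifier(random_state=1)
adaboost.fit(X_train, y_train)
probs = adaboost.predict_proba(X_test)[:, 1]  # P(class 1) for each test row

# Look at the spread of confidence levels to help pick a sensible threshold
for q in (10, 50, 90):
    print(f"{q}th percentile of P(class 1): {np.percentile(probs, q):.3f}")
```

Sorting and plotting probs instead of printing percentiles gives the same information visually.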



There are several ways to approach your problem. For example, see Miriam Farber's answer, which discusses re-weighting the classifier to correct for the 80/20 class imbalance during training. You may find you have other problems too, including that the classifiers you are using may not be able to separate the x and y classes given your current data. Working through all the possible data problems could lead to several different approaches.

If you have more questions about issues with your data rather than your code, there are Stack Exchange sites that can help as well as Stack Overflow (read each site's guide before posting): Data Science and Cross Validated.



In an SVM, one way to move the threshold is to choose class_weight so that you put much more weight on data points from class y. Consider the example below, taken from SVM: Separating hyperplane for unbalanced classes:

[figure: SVC decision boundaries for unbalanced classes, with and without class weighting]

The solid line is the decision boundary you get from SVC with the default class weights (the same weight for each class). The dashed line is the decision boundary you get with class_weight={1: 10} (that is, you put much more weight on class 1, relative to class 0).



The class weight scales the penalty parameter C in the SVM:

class_weight : dict or 'balanced', optional

Set the parameter C of class i to class_weight[i]*C for SVC. If not given, all classes are supposed to have weight one. The 'balanced' mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as n_samples / (n_classes * np.bincount(y))
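To illustrate the effect, here is a sketch on synthetic 80/20 data (not the asker's data; class 1 plays the role of the minority class y, and class_weight={1: 10} is simply the value from the example above):

```python
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Hypothetical 80/20 data; class 1 plays the role of the minority class y
X, y = make_classification(n_samples=2000, weights=[0.8, 0.2], random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Default weights vs. penalising mistakes on class 1 ten times as hard
plain = SVC(random_state=1).fit(X_train, y_train)
weighted = SVC(class_weight={1: 10}, random_state=1).fit(X_train, y_train)

print(confusion_matrix(y_test, plain.predict(X_test)))
print(confusion_matrix(y_test, weighted.predict(X_test)))
```

The weighted model shifts the decision boundary toward the majority class, so it predicts class 1 more often: fewer class-1 points are missed, at the cost of more false positives for class 1.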
