Scikit-learn classifier fitting objective function, precision and recall

The performance of a machine learning classifier can be measured by a variety of metrics, such as precision, recall, and classification accuracy.

Given code like this:

clf = svm.SVC(kernel='rbf')
clf.fit(X_train, y_train)


  • What metric is the fit function trying to optimize?

  • How can the model be tuned to improve precision when precision is much more important than recall?



3 answers

  • As far as I know, SVMs minimize hinge loss.

  • I don't know of any general-purpose way to make a support vector classifier prioritize precision over recall. As always, you can cross-validate and then play with the hyperparameters to see if anything helps. Alternatively, instead of a classifier you can train a regressor that outputs a value in [0,1]. Then, by choosing a threshold and putting every example that scores above it in the "1" category, you get a classifier with a tunable threshold that you can set arbitrarily high to favor precision over recall.
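The thresholding idea in the second point can be sketched as follows. This is only an illustration under assumptions: the synthetic data from `make_classification`, the use of `LogisticRegression.predict_proba` as the model with output in [0,1], and the helper `predict_with_threshold` are all stand-ins, not anything from the question.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic binary data stands in for the real X, y (assumption).
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# A model whose output lies in [0, 1]: the predicted probability of class 1.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
scores = model.predict_proba(X_val)[:, 1]


def predict_with_threshold(scores, threshold):
    """Hypothetical helper: label an example 1 only if its score reaches the threshold."""
    return (scores >= threshold).astype(int)


# Raising the threshold predicts "1" less often: recall can only drop,
# while precision typically (though not always) rises.
results = []
for threshold in (0.5, 0.7, 0.9):
    y_pred = predict_with_threshold(scores, threshold)
    p = precision_score(y_val, y_pred, zero_division=0)
    r = recall_score(y_val, y_pred)
    results.append((threshold, p, r))
    print(f"threshold={threshold:.1f}  precision={p:.3f}  recall={r:.3f}")
```

The threshold should be chosen on data that was not used for fitting, otherwise the precision estimate is likely to be optimistic.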



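You can tune the parameters of your SVM using grid search with cross-validation to maximize precision. To do this, set the "scoring" parameter as follows (in current scikit-learn versions GridSearchCV lives in sklearn.model_selection; the older sklearn.grid_search module has been removed):

sklearn.model_selection.GridSearchCV(clf, param_grid, scoring="precision")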


Here clf is your SVC classifier and, of course, you also need to set the parameter grid param_grid. The scikit-learn documentation has examples.
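A minimal runnable sketch of such a search might look like this. The synthetic data and the particular values in `param_grid` are assumptions chosen for illustration; a real grid should be adapted to the problem.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic binary data stands in for the real training set (assumption).
X_train, y_train = make_classification(
    n_samples=300, n_features=10, random_state=0
)

clf = SVC(kernel="rbf")

# Illustrative grid only; these values are not from the original answer.
param_grid = {"C": [0.1, 1, 10], "gamma": ["scale", 0.01, 0.1]}

# scoring="precision" makes the search rank parameter settings by
# cross-validated precision of the positive class.
search = GridSearchCV(clf, param_grid, scoring="precision", cv=5)
search.fit(X_train, y_train)

print(search.best_params_)
print(f"best cross-validated precision: {search.best_score_:.3f}")
```

After fitting, `search.best_estimator_` is the SVC refit on all of the training data with the winning parameters.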



I see two ways: optimizing by grid search over the parameters, as @laneok suggests, or optimizing by adjusting the threshold, as @cfh suggests.

Ideally, you should do both.

I wouldn't optimize for precision alone, as you can usually get 100% precision by setting a very high threshold and accepting very low recall. So if possible, define a trade-off between precision and recall that you like, and grid-search over that.
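One way to encode such a trade-off as a single score is the F-beta measure, where beta < 1 weights precision more heavily than recall. The sketch below assumes beta = 0.5, synthetic data, and an illustrative grid; none of these come from the answer itself.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import fbeta_score, make_scorer
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic binary data stands in for the real training set (assumption).
X_train, y_train = make_classification(
    n_samples=300, n_features=10, random_state=0
)

# F-beta with beta=0.5 counts precision more than recall; beta is an
# assumed value here and should reflect the trade-off you actually want.
precision_weighted_f = make_scorer(fbeta_score, beta=0.5)

param_grid = {"C": [0.1, 1, 10], "gamma": ["scale", 0.01, 0.1]}  # illustrative
search = GridSearchCV(
    SVC(kernel="rbf"), param_grid, scoring=precision_weighted_f, cv=5
)
search.fit(X_train, y_train)

print(search.best_params_)
print(f"best cross-validated F0.5: {search.best_score_:.3f}")
```

Unlike pure precision, this score drops when recall collapses, so the search cannot win by predicting the positive class almost never.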

You can probably get better results if you also choose the threshold separately. You can use SVC.decision_function to get a continuous output and then pick the best threshold for the trade-off you want to achieve. However, to select a threshold you need a validation set, which makes it harder (though possible) to do this inside the grid search.
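A sketch of threshold selection on a validation set, using `precision_recall_curve` to find the lowest threshold that reaches a target precision. The synthetic data, the default SVC parameters, and the value of `target_precision` are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import precision_recall_curve, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic binary data stands in for the real X, y (assumption).
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=0
)

clf = SVC(kernel="rbf").fit(X_train, y_train)

# Continuous output: signed distance-like score, positive means class 1.
val_scores = clf.decision_function(X_val)

# precision[i] and recall[i] describe the rule "score >= thresholds[i]";
# both arrays carry one extra final entry (precision=1, recall=0).
precision, recall, thresholds = precision_recall_curve(y_val, val_scores)

target_precision = 0.95  # hypothetical requirement
meets_target = precision[:-1] >= target_precision
if meets_target.any():
    # thresholds are sorted ascending, so the first hit keeps the most recall.
    threshold = thresholds[np.argmax(meets_target)]
else:
    threshold = 0.0  # fall back to the decision boundary at 0

y_pred = (val_scores >= threshold).astype(int)
print(f"threshold={threshold:.3f}")
print(f"precision={precision_score(y_val, y_pred, zero_division=0):.3f}")
print(f"recall={recall_score(y_val, y_pred):.3f}")
```

Because the threshold is tuned on the validation set, the printed precision is an optimistic estimate; a separate test set gives a fairer number.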

What I usually find to be a good trade-off between optimizing what you want and keeping the pipeline simple is to optimize in the grid search for something that takes precision into account, say "roc_auc", and after the grid search pick a threshold on a validation set based on the trade-off you like.

roc_auc basically optimizes over all possible thresholds at the same time, so the parameters won't be as specific to the threshold you want as they could be.
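The two-stage approach could be sketched roughly like this: a grid search scored by "roc_auc" on the training data, followed by a threshold sweep on held-out validation data. The synthetic data, the grid values, and the choice of F0.5 as the trade-off to maximize are all assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import fbeta_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic binary data stands in for the real X, y (assumption).
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Stage 1: threshold-independent model selection with roc_auc.
param_grid = {"C": [0.1, 1, 10], "gamma": ["scale", 0.01]}  # illustrative
search = GridSearchCV(SVC(kernel="rbf"), param_grid, scoring="roc_auc", cv=5)
search.fit(X_train, y_train)

# Stage 2: pick the threshold on data the grid search never saw.
# GridSearchCV.decision_function delegates to the refit best estimator.
val_scores = search.decision_function(X_val)
candidates = np.unique(val_scores)
f_scores = [
    fbeta_score(y_val, (val_scores >= t).astype(int), beta=0.5, zero_division=0)
    for t in candidates
]
best_threshold = candidates[int(np.argmax(f_scores))]

print(search.best_params_)
print(f"threshold={best_threshold:.3f}  validation F0.5={max(f_scores):.3f}")
```

Keeping the threshold sweep outside the grid search is what keeps the pipeline simple, at the cost noted above: the hyperparameters are not tuned for the specific threshold that ends up being used.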


