Scikit-learn classifier fit: objective function, precision, and recall

The performance of a machine learning classifier can be measured with various metrics, such as precision, recall, and classification accuracy, among others.

The code looks like this:

from sklearn import svm

clf = svm.SVC(kernel='rbf')
clf.fit(X_train, y_train)

  • What metric is the fitting function trying to optimize?

  • How can you tune the model to improve precision, when precision is more important than recall?



3 answers


  • As far as I know, SVMs minimize the hinge loss, max(0, 1 - y·f(x)), plus a regularization term.

  • I don't know of a one-size-fits-all way to make a support vector classifier favor precision over recall. As always, you can cross-validate and then play with the hyperparameters to see if anything helps. Alternatively, you can train a regressor that outputs a value in [0, 1] instead of a classifier. Then, by choosing a threshold such that every example scoring above it goes into the "1" category, you get a classifier with a configurable threshold that you can set arbitrarily high to maximize precision at the expense of recall (see the sketch below).
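
A minimal sketch of that thresholding idea, using logistic regression as a stand-in for the suggested [0, 1]-valued model; the synthetic dataset, the split, and the 0.9 threshold are all assumptions for illustration:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data, just to make the example self-contained
X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression().fit(X_train, y_train)
proba = model.predict_proba(X_test)[:, 1]  # estimated P(y = 1), in [0, 1]

# Raising the threshold above 0.5 trades recall for precision
threshold = 0.9
y_pred = (proba >= threshold).astype(int)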





You can tune the parameters of your SVM with grid-search cross-validation to maximize precision. To do this, set the "scoring" parameter as follows:

sklearn.grid_search.GridSearchCV(clf, param_grid, scoring="precision")



Here clf is your SVC classifier, and of course you also need to set the parameter grid param_grid. Examples here.
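
A self-contained version of this, assuming X_train and y_train from the question; note that in recent scikit-learn releases GridSearchCV lives in sklearn.model_selection (the old sklearn.grid_search module was removed), and the grid below is only an illustrative guess, not a tuned one:

from sklearn import svm
from sklearn.model_selection import GridSearchCV

param_grid = {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}
search = GridSearchCV(svm.SVC(kernel="rbf"), param_grid, scoring="precision")
search.fit(X_train, y_train)

# Best hyperparameters and the cross-validated precision they achieved
print(search.best_params_, search.best_score_)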



I see two paths: optimizing with a grid search over the parameters, as @laneok suggests, or optimizing by adjusting the threshold, as @cfh suggests.

Ideally, you would do both.

I wouldn't optimize for precision alone, as you can usually get 100% precision by setting a very high threshold and accepting very low recall. So, if possible, define a trade-off between precision and recall that you like, and grid-search over that.
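
One way to encode such a trade-off, assuming you are happy with F-beta as the target (beta < 1 weights precision more heavily than recall; beta = 0.5 is an arbitrary choice here):

from sklearn import svm
from sklearn.metrics import fbeta_score, make_scorer
from sklearn.model_selection import GridSearchCV

# F0.5 favors precision over recall; pass it to the grid search as the scorer
f_half = make_scorer(fbeta_score, beta=0.5)
search = GridSearchCV(svm.SVC(kernel="rbf"), {"C": [0.1, 1, 10]}, scoring=f_half)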

You can probably get better results if you actually choose a separate threshold. You can use SVC.decision_function to get a continuous output and then pick the optimal threshold for the trade-off you want to achieve. To select a threshold, however, you need a validation set, which makes doing this inside a grid search harder (though possible).

What I usually find to be a good compromise between optimizing what you actually want and keeping the pipeline simple is to optimize in the grid search for something that takes precision into account, say "roc_auc", and after the grid search pick a threshold on a validation set according to the trade-off you like.

roc_auc basically optimizes over all possible thresholds at once, so the parameters won't be as tailored to the particular threshold you want as they could be.
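
A sketch of that combined recipe, again assuming X_train and y_train from the question; the split, the grid, and the 0.95 precision target are illustrative assumptions:

import numpy as np
from sklearn import svm
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import GridSearchCV, train_test_split

# Hold out a validation set for threshold selection
X_fit, X_val, y_fit, y_val = train_test_split(X_train, y_train, random_state=0)

search = GridSearchCV(svm.SVC(kernel="rbf"),
                      {"C": [0.1, 1, 10], "gamma": [0.01, 0.1]},
                      scoring="roc_auc")
search.fit(X_fit, y_fit)

# Continuous SVM outputs on the validation set, then the first threshold
# whose precision reaches the target
scores = search.decision_function(X_val)
precision, recall, thresholds = precision_recall_curve(y_val, scores)
threshold = thresholds[np.argmax(precision[:-1] >= 0.95)]

# At prediction time, classify as "1" only above the chosen threshold
y_pred = (search.decision_function(X_val) >= threshold).astype(int)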
