Can I explicitly set the list of possible classes for sklearn SVMs?
I have a program that uses the SVC class from sklearn. More precisely, I am using OneVsRestClassifier, which wraps SVC. My problem is that the predict_proba() method sometimes returns a vector that is too short. This happens because a class is missing from the classes_ attribute, which in turn happens when that label is absent from the training data.
Consider the following example (code shown below). Assume all possible classes are 1, 2, 3, and 4, and that the training data happens to contain no samples labeled with class 3. That is fine, except that when I call predict_proba() I want a vector of length 4 and instead get a vector of length 3. That is, predict_proba() returns [p(1) p(2) p(4)], but I want [p(1) p(2) p(3) p(4)], where p(3) = 0.
My guess is that clf.classes_ is implicitly defined by the labels seen during training, which are incomplete in this case. Is there a way to explicitly set the possible class labels? I know the simple workaround is to take the output of predict_proba() and manually build the array I want. However, this is inconvenient and may slow down my program a little.
# Python 2.7.6
from sklearn.svm import SVC
from sklearn.multiclass import OneVsRestClassifier
import numpy as np
X_train = [[1], [2], [4]] * 10
y = [1, 2, 4] * 10
X_test = [[1]]
clf = OneVsRestClassifier(SVC(probability=True, kernel="linear"))
clf.fit(X_train, y)
# calling predict_proba() gives: [p(1) p(2) p(4)]
# I want: [p(1) p(2) p(3) p(4)], where p(3) = 0
print clf.predict_proba(X_test)
The workaround I had in mind is to create a new list of probabilities and build it one element at a time with multiple append() calls (see code below). It seems like this would be slow compared to having predict_proba() return what I want automatically. I don't know yet whether it will slow down my program significantly, because I haven't tried it. Regardless, I wanted to know if there is a better way.
def workAround(probs, classes_, all_classes):
    """
    probs: list of probabilities, output of predict_proba (but 1D)
    classes_: clf.classes_
    all_classes: all possible classes; superset of classes_
    """
    all_probs = []
    i = 0  # index into probs and classes_
    for cls in all_classes:
        # The bounds check avoids an IndexError when the trailing classes
        # of all_classes are missing from classes_.
        if i < len(classes_) and cls == classes_[i]:
            all_probs.append(probs[i])
            i += 1
        else:
            all_probs.append(0.0)
    return np.asarray(all_probs)
As stated in the comments, scikit-learn does not provide a way to explicitly set possible class labels.
I NumPyfied your workaround:
import sklearn
import sklearn.svm
import numpy as np
np.random.seed(3) # for reproducibility
def predict_proba_ordered(probs, classes_, all_classes):
    """
    probs: list of probabilities, output of predict_proba
    classes_: clf.classes_
    all_classes: all possible classes (superset of classes_)
    """
    # np.float was removed from recent NumPy releases; the builtin float is equivalent here.
    proba_ordered = np.zeros((probs.shape[0], all_classes.size), dtype=float)
    sorter = np.argsort(all_classes)  # http://stackoverflow.com/a/32191125/395857
    # Column index in all_classes for each class the classifier was trained on.
    idx = sorter[np.searchsorted(all_classes, classes_, sorter=sorter)]
    proba_ordered[:, idx] = probs
    return proba_ordered
# Prepare the data set
all_classes = np.array([1,2,3,4]) # explicitly set the possible class labels.
X_train = [[1], [2], [4]] * 3
print('X_train: {0}'.format(X_train))
y = [1, 2, 4] * 3 # Label 3 is missing.
print('y: {0}'.format(y))
X_test = [[1], [2], [3]]
print('X_test: {0}'.format(X_test))
# Train
clf = sklearn.svm.SVC(probability=True, kernel="linear")
clf.fit(X_train, y)
print('clf.classes_: {0}'.format(clf.classes_))
# Predict
probs = clf.predict_proba(X_test) #As label 3 isn't in train set, the probs' size is 3, not 4
proba_ordered = predict_proba_ordered(probs, clf.classes_, all_classes)
print('proba_ordered: {0}'.format(proba_ordered))
Output:
X_train: [[1], [2], [4], [1], [2], [4], [1], [2], [4]]
y: [1, 2, 4, 1, 2, 4, 1, 2, 4]
X_test: [[1], [2], [3]]
clf.classes_: [1 2 4]
proba_ordered: [[ 0.81499201 0.08640176 0. 0.09860622]
[ 0.21105955 0.63893181 0. 0.15000863]
[ 0.08965731 0.49640147 0. 0.41394122]]
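The question used OneVsRestClassifier rather than a bare SVC. Since it also exposes classes_, the same idea should carry over. Below is a minimal sketch under that assumption; the helper name expand_proba is hypothetical (not part of scikit-learn), and it uses a plain dict lookup instead of searchsorted, so all_classes does not need to be sorted.

```python
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC


def expand_proba(probs, classes_, all_classes):
    """Hypothetical helper: spread the columns of probs over all_classes.

    Classes that the classifier never saw get a probability of 0.
    """
    all_classes = np.asarray(all_classes)
    # Map each possible label to its column in the full-width output.
    column_of = {label: j for j, label in enumerate(all_classes.tolist())}
    idx = [column_of[label] for label in np.asarray(classes_).tolist()]
    expanded = np.zeros((probs.shape[0], all_classes.size), dtype=float)
    expanded[:, idx] = probs
    return expanded


all_classes = [1, 2, 3, 4]
X_train = [[1], [2], [4]] * 10
y = [1, 2, 4] * 10  # label 3 never appears in the training data

clf = OneVsRestClassifier(SVC(probability=True, kernel="linear"))
clf.fit(X_train, y)

probs = clf.predict_proba([[1]])  # one column per class seen in training
full = expand_proba(probs, clf.classes_, all_classes)
print(full.shape)
```

The column for class 3 is filled with zeros, and the remaining columns keep the values returned by predict_proba().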
Note that you can explicitly set the possible class labels in the sklearn.metrics functions (for example, sklearn.metrics.f1_score) using the labels parameter:
labels : array
    Integer array of labels.
Example:
# Score
import sklearn.metrics  # imported explicitly; `import sklearn` alone may not load it
y_pred = clf.predict(X_test)
y_true = np.array([1,2,3])
precision = sklearn.metrics.precision_score(y_true, y_pred, labels=all_classes, average=None)
print('precision: {0}'.format(precision))
recall = sklearn.metrics.recall_score(y_true, y_pred, labels=all_classes, average=None)
print('recall: {0}'.format(recall))
f1_score = sklearn.metrics.f1_score(y_true, y_pred, labels=all_classes, average=None)
print('f1_score: {0}'.format(f1_score))
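When labels contains a class that never shows up in the predictions (or in the ground truth), the corresponding precision (or recall) is a 0/0 case and scikit-learn emits a warning. Here is a small sketch of how the zero_division parameter can make that choice explicit. It assumes a scikit-learn version recent enough to have zero_division, and the y_pred values are made up for illustration rather than taken from the classifier above.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

all_classes = np.array([1, 2, 3, 4])
y_true = np.array([1, 2, 3])
y_pred = np.array([1, 2, 2])  # illustrative predictions; class 3 is never predicted

# zero_division=0 reports 0.0 for labels whose score would be 0/0.
precision = precision_score(
    y_true, y_pred, labels=all_classes, average=None, zero_division=0
)
recall = recall_score(
    y_true, y_pred, labels=all_classes, average=None, zero_division=0
)
print('precision: {0}'.format(precision))
print('recall: {0}'.format(recall))
```

Each returned array has one entry per label in all_classes, in that order, regardless of which labels actually occur in y_true or y_pred.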
Note that, as of now, you will run into problems if you try to use sklearn.metrics.roc_auc_score() when there are no true positive examples for a given label.
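One way around this is to compute the AUC label by label and skip the labels for which it is undefined, rather than relying on how a given scikit-learn version reacts. The sketch below is only an illustration: per_label_auc is a hypothetical helper, and the probability matrix is made-up data shaped like the proba_ordered output above.

```python
import numpy as np
from sklearn.metrics import roc_auc_score


def per_label_auc(y_true, proba, all_classes):
    """Hypothetical helper: one-vs-rest ROC AUC per label, NaN where undefined."""
    y_true = np.asarray(y_true)
    scores = []
    for j, label in enumerate(all_classes):
        is_positive = (y_true == label).astype(int)
        # AUC needs both positive and negative examples in the ground truth.
        if is_positive.min() == is_positive.max():
            scores.append(float("nan"))
        else:
            scores.append(roc_auc_score(is_positive, proba[:, j]))
    return np.array(scores)


y_true = [1, 2, 4, 1]  # label 3 has no positive example
proba = np.array([
    [0.7, 0.2, 0.0, 0.1],
    [0.2, 0.6, 0.0, 0.2],
    [0.1, 0.2, 0.0, 0.7],
    [0.6, 0.3, 0.0, 0.1],
])
scores = per_label_auc(y_true, proba, [1, 2, 3, 4])
print(scores)
```

Labels without any positive example come back as NaN, so they can be filtered out or reported separately before averaging.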