GridSearch for multi-label classification in Scikit-learn

I'm trying to run a grid search for the best hyperparameters within each fold of a 10-fold cross-validation. It worked fine in my previous multi-class classification work, but not this time with multi-label data.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)
clf = OneVsRestClassifier(LinearSVC())

C_range = 10.0 ** np.arange(-2, 9)
param_grid = dict(estimator__clf__C = C_range)

clf = GridSearchCV(clf, param_grid)
clf.fit(X_train, y_train)


I am getting the error:

ValueError                                Traceback (most recent call last)
<ipython-input-65-dcf9c1d2e19d> in <module>()
      7 clf = GridSearchCV(clf, param_grid)
----> 8 clf.fit(X_train, y_train)

/usr/local/lib/python2.7/site-packages/sklearn/grid_search.pyc in fit(self, X, y)
    596         """
--> 597         return self._fit(X, y, ParameterGrid(self.param_grid))

/usr/local/lib/python2.7/site-packages/sklearn/grid_search.pyc in _fit(self, X, y,   
    357                                  % (len(y), n_samples))
    358             y = np.asarray(y)
--> 359         cv = check_cv(cv, X, y, classifier=is_classifier(estimator))
    361         if self.verbose > 0:

/usr/local/lib/python2.7/site-packages/sklearn/cross_validation.pyc in _check_cv(cv, X,  
y, classifier, warn_mask)
   1365             needs_indices = None
   1366         if classifier:
-> 1367             cv = StratifiedKFold(y, cv, indices=needs_indices)
   1368         else:
   1369             if not is_sparse:

/usr/local/lib/python2.7/site-packages/sklearn/cross_validation.pyc in __init__(self, 
y, n_folds, indices, shuffle, random_state)
    427         for test_fold_idx, per_label_splits in enumerate(zip(*per_label_cvs)):
    428             for label, (_, test_split) in zip(unique_labels, per_label_splits):
--> 429                 label_test_folds = test_folds[y == label]
    430                 # the test split can be too big because we used
    431                 # KFold(max(c, self.n_folds), self.n_folds) instead of

ValueError: boolean index array should have 1 dimension


This may be related to the dimensions or format of the label indicator matrix.

print X_train.shape, y_train.shape


we get:

(147, 1024) (147, 6)


It seems that GridSearchCV uses StratifiedKFold internally. The problem arises when the stratified K-fold strategy is applied to a multi-label problem.

StratifiedKFold(y_train, 10)



ValueError                                Traceback (most recent call last)
<ipython-input-87-884ffeeef781> in <module>()
----> 1 StratifiedKFold(y_train, 10)

/usr/local/lib/python2.7/site-packages/sklearn/cross_validation.pyc in __init__(self,   
y, n_folds, indices, shuffle, random_state)
    427         for test_fold_idx, per_label_splits in enumerate(zip(*per_label_cvs)):
    428             for label, (_, test_split) in zip(unique_labels, per_label_splits):
--> 429                 label_test_folds = test_folds[y == label]
    430                 # the test split can be too big because we used
    431                 # KFold(max(c, self.n_folds), self.n_folds) instead of

ValueError: boolean index array should have 1 dimension


For now, using the regular K-fold strategy works fine. Is there any way to apply stratified K-fold to multi-label classification?


Grid search performs stratified cross-validation for classification tasks, but this is not implemented for multi-label tasks; in fact, multi-label stratification is an unsolved problem in machine learning. I recently ran into the same problem, and the only literature I could find was the method proposed in this article (whose authors state that they could not find any other attempt to solve it either).
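
If stratification is not essential, one workaround, in line with the question's observation that regular K-fold works, is to pass a plain KFold splitter to GridSearchCV through its cv argument. Below is a minimal sketch, assuming a recent scikit-learn (the sklearn.model_selection module rather than the old sklearn.grid_search) and synthetic data standing in for the question's X and y. The grid key estimator__C is how a parameter of the wrapped LinearSVC is normally addressed, so it differs from the estimator__clf__C key used in the question; the shorter C range is only there to keep the example fast.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

# Synthetic stand-in for the question's data: 147 samples, 6 binary labels.
rng = np.random.RandomState(0)
X = rng.randn(147, 20)
y = (X @ rng.randn(20, 6) > 0).astype(int)  # multi-label indicator matrix

# "estimator__C" reaches the C parameter of the LinearSVC wrapped by OneVsRestClassifier.
param_grid = {"estimator__C": 10.0 ** np.arange(-2, 3)}

# An explicit, unstratified splitter avoids StratifiedKFold entirely.
cv = KFold(n_splits=10, shuffle=True, random_state=0)

search = GridSearchCV(OneVsRestClassifier(LinearSVC()), param_grid, cv=cv)
search.fit(X, y)

print(search.best_params_)
```

As far as I know, recent scikit-learn releases already fall back to unstratified K-fold on their own when y is a multi-label indicator matrix, so the explicit cv mostly matters on older versions or when you want control over shuffling.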



As pointed out by Fred Foo, stratified cross-validation is not implemented for multi-label tasks. One alternative is to use scikit-learn's StratifiedKFold class in a transformed label space, as suggested here.

Below is some example Python code.

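from sklearn.model_selection import StratifiedKFold

# n_splits, shuffle, classifier, X, y and lp are not defined in this snippet.
# lp is assumed to be a transformer that maps each row of y to a single
# class label (for example a label powerset).
kf = StratifiedKFold(n_splits=n_splits, random_state=None, shuffle=shuffle)

# stratify on the transformed labels, but train on the original label matrix
for train_index, test_index in kf.split(X, lp.transform(y)):
    X_train = X[train_index,:]
    y_train = y[train_index,:]

    X_test = X[test_index,:]
    y_test = y[test_index,:]

    # learn the classifier
    classifier.fit(X_train, y_train)

    # predict labels for test data
    predictions = classifier.predict(X_test)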
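
For a version that runs on its own, here is a rough sketch of the same idea with a hand-rolled label powerset step. The label_powerset helper is hypothetical (it is not part of scikit-learn), the data is synthetic, and the choice of OneVsRestClassifier(LinearSVC()) is simply borrowed from the question. Each distinct row of the label matrix becomes one integer class, and StratifiedKFold then balances those classes across folds.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC


def label_powerset(y):
    """Hypothetical helper: map each distinct label row to one integer class id."""
    mapping = {}
    rows = np.asarray(y).tolist()
    return np.array([mapping.setdefault(tuple(row), len(mapping)) for row in rows])


# Synthetic data: 4 label combinations, each occurring 30 times.
rng = np.random.RandomState(0)
combos = np.array([[1, 0, 0], [0, 1, 0], [1, 1, 0], [0, 1, 1]])
y = np.repeat(combos, 30, axis=0)                # shape (120, 3)
X = rng.randn(len(y), 5) + y @ rng.randn(3, 5)   # features loosely tied to labels

y_lp = label_powerset(y)  # one class id per sample, used only for splitting

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_counts = []
for train_index, test_index in skf.split(X, y_lp):
    # Train on the original multi-label matrix, not on the powerset ids.
    classifier = OneVsRestClassifier(LinearSVC())
    classifier.fit(X[train_index], y[train_index])
    predictions = classifier.predict(X[test_index])

    # Record how many samples of each label combination landed in this test fold.
    fold_counts.append(np.bincount(y_lp[test_index], minlength=len(combos)))

print(np.array(fold_counts))
```

Keep in mind that a label combination occurring fewer than n_splits times cannot be spread across all folds, so with very sparse label sets this approach only approximates stratification.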




