GridSearch for multi-label classification in Scikit-learn
I'm trying to run a GridSearch for the best hyperparameters inside 10-fold cross-validation. It worked fine in my previous multi-class classification work, but not this time with a multi-label task.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)
clf = OneVsRestClassifier(LinearSVC())
C_range = 10.0 ** np.arange(-2, 9)
param_grid = dict(estimator__C=C_range)
clf = GridSearchCV(clf, param_grid)
clf.fit(X_train, y_train)
I am getting the error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-65-dcf9c1d2e19d> in <module>()
6
7 clf = GridSearchCV(clf, param_grid)
----> 8 clf.fit(X_train, y_train)
/usr/local/lib/python2.7/site-packages/sklearn/grid_search.pyc in fit(self, X, y)
595
596 """
--> 597 return self._fit(X, y, ParameterGrid(self.param_grid))
598
599
/usr/local/lib/python2.7/site-packages/sklearn/grid_search.pyc in _fit(self, X, y, parameter_iterable)
357 % (len(y), n_samples))
358 y = np.asarray(y)
--> 359 cv = check_cv(cv, X, y, classifier=is_classifier(estimator))
360
361 if self.verbose > 0:
/usr/local/lib/python2.7/site-packages/sklearn/cross_validation.pyc in _check_cv(cv, X, y, classifier, warn_mask)
1365 needs_indices = None
1366 if classifier:
-> 1367 cv = StratifiedKFold(y, cv, indices=needs_indices)
1368 else:
1369 if not is_sparse:
/usr/local/lib/python2.7/site-packages/sklearn/cross_validation.pyc in __init__(self, y, n_folds, indices, shuffle, random_state)
427 for test_fold_idx, per_label_splits in enumerate(zip(*per_label_cvs)):
428 for label, (_, test_split) in zip(unique_labels, per_label_splits):
--> 429 label_test_folds = test_folds[y == label]
430 # the test split can be too big because we used
431 # KFold(max(c, self.n_folds), self.n_folds) instead of
ValueError: boolean index array should have 1 dimension
This may relate to the size or format of the label indicator matrix.
print X_train.shape, y_train.shape
we get:
(147, 1024) (147, 6)
It seems that GridSearchCV uses StratifiedKFold at its core for classifiers, and the stratified K-fold strategy cannot handle a 2-D label indicator matrix. Indeed, running
StratifiedKFold(y_train, 10)
gives
ValueError Traceback (most recent call last)
<ipython-input-87-884ffeeef781> in <module>()
----> 1 StratifiedKFold(y_train, 10)
/usr/local/lib/python2.7/site-packages/sklearn/cross_validation.pyc in __init__(self, y, n_folds, indices, shuffle, random_state)
427 for test_fold_idx, per_label_splits in enumerate(zip(*per_label_cvs)):
428 for label, (_, test_split) in zip(unique_labels, per_label_splits):
--> 429 label_test_folds = test_folds[y == label]
430 # the test split can be too big because we used
431 # KFold(max(c, self.n_folds), self.n_folds) instead of
ValueError: boolean index array should have 1 dimension
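The same restriction can be reproduced in current scikit-learn versions (a minimal sketch; the arrays below are made-up toy data): StratifiedKFold expects a 1-D class vector and rejects a 2-D label indicator matrix outright.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.zeros((6, 4))                                  # dummy features
y_multiclass = np.array([0, 1, 0, 1, 0, 1])           # 1-D target: fine
y_multilabel = np.array([[1, 0], [0, 1], [1, 1],
                         [0, 1], [1, 0], [1, 1]])     # 2-D indicator: rejected

skf = StratifiedKFold(n_splits=2)
folds = list(skf.split(X, y_multiclass))              # works, 2 folds
try:
    list(skf.split(X, y_multilabel))
except ValueError as e:
    print("StratifiedKFold rejects multi-label y:", e)
```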
Using the regular K-fold strategy works fine. Is there any way to do stratified K-fold cross-validation for multi-label classification?
Grid search performs stratified cross-validation for classification problems, but this is not implemented for multi-label tasks; in fact, multi-label stratification is an unsolved problem in machine learning. I recently ran into the same problem, and all the literature I could find pointed to the method proposed in this article (whose authors state that they could not find any other attempt at solving the problem).
As pointed out by Fred Foo, stratified cross-validation is not implemented for multi-label tasks. One alternative is to use scikit-learn's StratifiedKFold class on the transformed label space, as suggested here.
Below is some example Python code.

from skmultilearn.problem_transform import LabelPowerset
from sklearn.model_selection import StratifiedKFold

# Transform the multi-label problem into a multi-class one: each distinct
# label combination becomes a single class, which StratifiedKFold can handle.
lp = LabelPowerset()

kf = StratifiedKFold(n_splits=n_splits, random_state=None, shuffle=shuffle)
for train_index, test_index in kf.split(X, lp.transform(y)):
    X_train = X[train_index, :]
    y_train = y[train_index, :]
    X_test = X[test_index, :]
    y_test = y[test_index, :]
    # learn the classifier
    classifier.fit(X_train, y_train)
    # predict labels for test data
    predictions = classifier.predict(X_test)
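Putting this together with the original grid search: the label-powerset transform can also be done by hand with np.unique, and the resulting 1-D class ids fed to StratifiedKFold to build the cv splits passed to GridSearchCV. A sketch with made-up toy data (the shapes, C grid, and label combinations are illustrative only):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, GridSearchCV
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

# Toy multi-label data: 60 samples, 3 labels, 4 distinct label combinations
rng = np.random.RandomState(0)
X = rng.randn(60, 10)
combos = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 0]])
y = np.tile(combos, (15, 1))              # each combination appears 15 times

# Hand-rolled label powerset: every distinct row of y becomes one class id
_, y_lp = np.unique(y, axis=0, return_inverse=True)

# Stratify on the 1-D class ids, then reuse those splits for the 2-D y
cv = list(StratifiedKFold(n_splits=3, shuffle=True, random_state=0).split(X, y_lp))

param_grid = {"estimator__C": [0.1, 1.0, 10.0]}
clf = GridSearchCV(OneVsRestClassifier(LinearSVC()), param_grid, cv=cv)
clf.fit(X, y)
```

Since GridSearchCV accepts any iterable of (train, test) index pairs via its cv parameter, precomputing the stratified splits this way sidesteps the internal StratifiedKFold entirely.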