Put custom functions in the Sklearn pipeline

Question

There are several steps in my classification scheme, including:

1. SMOTE (Synthetic Minority Over-sampling Technique)
2. Fisher criterion for feature selection
3. Standardization (Z-score normalization)
4. SVC (Support Vector Classifier)

The main parameters to be tuned in the scheme above are the percentile (step 2) and the hyperparameters for SVC (step 4), and I want to tune them with a grid search.

The current solution builds a "partial" pipeline that includes only steps 3 and 4 of the scheme, clf = Pipeline([('normal',preprocessing.StandardScaler()),('svc',svm.SVC(class_weight='auto'))])

and splits the scheme into two parts:

1) Tune the percentile of features to keep with a first grid search

skf = StratifiedKFold(y)
for train_ind, test_ind in skf:
    X_train, X_test, y_train, y_test = X[train_ind], X[test_ind], y[train_ind], y[test_ind]
    # SMOTE synthesizes the training data (we want to keep test data intact)
    X_train, y_train = SMOTE(X_train, y_train)
    for percentile in percentiles:
        # Fisher returns the indices of the selected features specified by the parameter 'percentile'
        selected_ind = Fisher(X_train, y_train, percentile)
        # Features are columns, so index the second axis
        X_train_selected, X_test_selected = X_train[:, selected_ind], X_test[:, selected_ind]
        model = clf.fit(X_train_selected, y_train)
        y_predict = model.predict(X_test_selected)
        # f1_score expects the true labels first
        f1 = f1_score(y_test, y_predict)

The f1 scores are stored and then averaged across all folds for each percentile, and the percentile with the best CV score is returned. The point of making the "percentile for loop" the inner loop is to allow a fair comparison: within each fold, every percentile is evaluated on the same training data (including the synthesized data).
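The averaging and selection step described above could be sketched as follows. The numbers are invented purely for illustration, and the `scores` array with its layout (one row per fold, one column per percentile) is an assumption, not part of the original code.

```python
import numpy as np

# Hypothetical F1 scores: one row per fold, one column per percentile.
percentiles = [25, 50, 75]
scores = np.array([
    [0.60, 0.72, 0.70],   # fold 1
    [0.64, 0.70, 0.66],   # fold 2
    [0.62, 0.74, 0.68],   # fold 3
])

# Average over folds (axis 0), then keep the percentile with the best mean.
mean_scores = scores.mean(axis=0)
best_percentile = percentiles[int(np.argmax(mean_scores))]
```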

2) After determining the percentile, tune the hyperparameters with a second grid search

skf = StratifiedKFold(y)
for train_ind, test_ind in skf:
    X_train, X_test, y_train, y_test = X[train_ind], X[test_ind], y[train_ind], y[test_ind]
    # SMOTE synthesizes the training data (we want to keep test data intact)
    X_train, y_train = SMOTE(X_train, y_train)
    for parameters in parameter_comb:
        # Select the features based on the tuned percentile
        selected_ind = Fisher(X_train, y_train, best_percentile)
        # Features are columns, so index the second axis
        X_train_selected, X_test_selected = X_train[:, selected_ind], X_test[:, selected_ind]
        clf.set_params(svc__C=parameters['C'], svc__gamma=parameters['gamma'])
        model = clf.fit(X_train_selected, y_train)
        y_predict = model.predict(X_test_selected)
        # f1_score expects the true labels first
        f1 = f1_score(y_test, y_predict)

This is done in a similar way, except that we tune the hyperparameters for SVC rather than the percentile of features to select.

My questions:

I) In the current solution, I only include steps 3 and 4 in clf and do steps 1 and 2 kind of "manually" in two nested loops as described above. Is there a way to include all four steps in a pipeline and run the whole process at once?

II) If it is okay to keep the first nested loop, is it possible (and how) to simplify the second nested loop using a single pipeline

clf_all = Pipeline([('smote', SMOTE()),
                    ('fisher', Fisher(percentile=best_percentile)),
                    ('normal',preprocessing.StandardScaler()),
                    ('svc',svm.SVC(class_weight='auto'))])

and simply use GridSearchCV(clf_all, parameter_comb) for tuning?

Please note that both SMOTE and Fisher (the ranking criterion) must be applied only to the training data in each fold.

Any comment would be much appreciated.

EDIT: SMOTE and Fisher are shown below:

from numpy import shape, argsort, ceil

def Fscore(X, y, percentile=None):
    X_pos, X_neg = X[y==1], X[y==0]
    X_mean = X.mean(axis=0)
    X_pos_mean, X_neg_mean = X_pos.mean(axis=0), X_neg.mean(axis=0)
    deno = (1.0/(shape(X_pos)[0]-1))*X_pos.var(axis=0) + (1.0/(shape(X_neg)[0]-1))*X_neg.var(axis=0)
    num = (X_pos_mean - X_mean)**2 + (X_neg_mean - X_mean)**2
    F = num/deno
    # Sort feature indices by descending score
    sort_F = argsort(F)[::-1]
    n_feature = (float(percentile)/100)*shape(X)[1]
    # ceil returns a float; slice bounds must be integers
    ind_feature = sort_F[:int(ceil(n_feature))]
    return ind_feature
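As a quick sanity check of the ranking above, here is a minimal sketch on toy data in which only one column depends on the class. The helper name `fisher_indices` and the data are hypothetical; the scoring formula follows the function above. The returned indices identify features, so they are applied to the columns of X.

```python
import numpy as np

def fisher_indices(X, y, percentile):
    """Return column indices of the top `percentile` percent of features."""
    X_pos, X_neg = X[y == 1], X[y == 0]
    X_mean = X.mean(axis=0)
    num = (X_pos.mean(axis=0) - X_mean) ** 2 + (X_neg.mean(axis=0) - X_mean) ** 2
    deno = X_pos.var(axis=0) / (len(X_pos) - 1) + X_neg.var(axis=0) / (len(X_neg) - 1)
    order = np.argsort(num / deno)[::-1]              # best feature first
    n_keep = int(np.ceil(percentile / 100.0 * X.shape[1]))
    return order[:n_keep]

# Toy data (an assumption for illustration): 4 features, 50 samples per class.
rng = np.random.RandomState(0)
y = np.array([0] * 50 + [1] * 50)
X = rng.normal(size=(100, 4))
X[:, 2] += 3.0 * y                                    # only column 2 is informative

selected = fisher_indices(X, y, percentile=25)        # keep 1 of 4 features
X_selected = X[:, selected]                           # select columns, not rows
```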

SMOTE is taken from https://github.com/blacklab/nyan/blob/master/shared_modules/smote.py; it returns the synthesized data. I modified it to return the original input data stacked with the synthesized data, along with the corresponding labels.

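def smote(X, y):
    # Count positive (minority) and negative (majority) samples
    n_pos, n_neg = sum(y==1), sum(y==0)
    # Number of synthetic samples needed per positive sample
    n_syn = (n_neg-n_pos)/float(n_pos)
    X_pos = X[y==1]
    X_syn = SMOTE(X_pos, int(round(n_syn))*100, 5)
    y_syn = np.ones(shape(X_syn)[0])
    # Stack the original data with the synthetic positives
    X, y = np.vstack([X, X_syn]), np.concatenate([y, y_syn])
    return X, y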
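To check the bookkeeping in the wrapper above, here is a small self-contained sketch. The `SMOTE` function in it is a hypothetical stand-in, not the implementation from the linked repository: it only assumes the same call shape `SMOTE(T, N, k)` and that the call returns `(N/100) * len(T)` synthetic rows, and it ignores `k`.

```python
import numpy as np

def SMOTE(T, N, k, seed=0):
    """Stand-in, NOT the real algorithm: interpolates between random pairs
    of minority rows and returns (N/100) * len(T) synthetic rows.
    The neighbour count k is accepted but ignored."""
    rng = np.random.RandomState(seed)
    n_new = (N // 100) * T.shape[0]
    a = T[rng.randint(0, T.shape[0], n_new)]
    b = T[rng.randint(0, T.shape[0], n_new)]
    return a + rng.uniform(size=(n_new, 1)) * (b - a)

def smote(X, y):
    n_pos, n_neg = np.sum(y == 1), np.sum(y == 0)
    n_syn = (n_neg - n_pos) / float(n_pos)            # synthetic rows per positive
    X_syn = SMOTE(X[y == 1], int(round(n_syn)) * 100, 5)
    y_syn = np.ones(X_syn.shape[0])
    return np.vstack([X, X_syn]), np.concatenate([y, y_syn])

# Toy data (an assumption): 10 positives, 40 negatives, 3 features.
rng = np.random.RandomState(1)
X = rng.normal(size=(50, 3))
y = np.array([1] * 10 + [0] * 40)

# n_syn = (40 - 10) / 10 = 3.0, so N = 300 and 30 synthetic positives are added.
X_bal, y_bal = smote(X, y)
```

Because `n_syn` is rounded to a whole multiple of the minority count, the classes end up exactly balanced only when the majority count is a multiple of the minority count.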

scikit-learn machine-learning pipeline feature-selection cross-validation

Francis 07 jul. At 4:44 am

1 answer

David · Accepted Answer · 2015-07-08T17:36:59+0000

I don't know where your SMOTE() and Fisher() functions come from, but the answer is yes, you can do this. You will need to write a wrapper class around those functions. The easiest way is to inherit from sklearn's BaseEstimator and TransformerMixin classes; see this for an example: http://scikit-learn.org/stable/auto_examples/hetero_feature_union.html

If that doesn't make sense to you, please post the details of at least one of your functions (the library it comes from, or your code if you wrote it yourself) and we can go from there.

EDIT:

Sorry, I didn't look at your functions closely enough to realize that they transform your target in addition to your training data (i.e. both X and y). Pipeline does not support transformations of the target, so you will have to apply them beforehand, as you originally did. For your reference, here is what it would look like to write your own class for your Fisher process, which would work if the function itself did not need to affect your target variable.

>>> from sklearn.base import BaseEstimator, TransformerMixin
>>> from sklearn.preprocessing import StandardScaler
>>> from sklearn.svm import SVC
>>> from sklearn.pipeline import Pipeline
>>> from sklearn.grid_search import GridSearchCV
>>> from sklearn.datasets import load_iris
>>> 
>>> class Fisher(BaseEstimator, TransformerMixin):
...     def __init__(self,percentile=0.95):
...             self.percentile = percentile
...     def fit(self, X, y):
...             from numpy import shape, argsort, ceil
...             X_pos, X_neg = X[y==1], X[y==0]
...             X_mean = X.mean(axis=0)
...             X_pos_mean, X_neg_mean = X_pos.mean(axis=0), X_neg.mean(axis=0)
...             deno = (1.0/(shape(X_pos)[0]-1))*X_pos.var(axis=0) + (1.0/(shape(X_neg)[0]-1))*X_neg.var(axis=0)
...             num = (X_pos_mean - X_mean)**2 + (X_neg_mean - X_mean)**2
...             F = num/deno
...             sort_F = argsort(F)[::-1]
...             n_feature = (float(self.percentile)/100)*shape(X)[1]
...             self.ind_feature = sort_F[:ceil(n_feature)]
...             return self
...     def transform(self, x):
...             return x[self.ind_feature,:]
... 
>>> 
>>> data = load_iris()
>>> 
>>> pipeline = Pipeline([
...     ('fisher', Fisher()),
...     ('normal',StandardScaler()),
...     ('svm',SVC(class_weight='auto'))
... ])
>>> 
>>> grid = {
...     'fisher__percentile':[0.75,0.50],
...     'svm__C':[1,2]
... }
>>> 
>>> model = GridSearchCV(estimator = pipeline, param_grid=grid, cv=2)
>>> model.fit(data.data,data.target)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/dmcgarry/anaconda/lib/python2.7/site-packages/sklearn/grid_search.py", line 596, in fit
    return self._fit(X, y, ParameterGrid(self.param_grid))
  File "/Users/dmcgarry/anaconda/lib/python2.7/site-packages/sklearn/grid_search.py", line 378, in _fit
    for parameters in parameter_iterable
  File "/Users/dmcgarry/anaconda/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 653, in __call__
    self.dispatch(function, args, kwargs)
  File "/Users/dmcgarry/anaconda/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 400, in dispatch
    job = ImmediateApply(func, args, kwargs)
  File "/Users/dmcgarry/anaconda/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 138, in __init__
    self.results = func(*args, **kwargs)
  File "/Users/dmcgarry/anaconda/lib/python2.7/site-packages/sklearn/cross_validation.py", line 1239, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/Users/dmcgarry/anaconda/lib/python2.7/site-packages/sklearn/pipeline.py", line 130, in fit
    self.steps[-1][-1].fit(Xt, y, **fit_params)
  File "/Users/dmcgarry/anaconda/lib/python2.7/site-packages/sklearn/svm/base.py", line 149, in fit
    (X.shape[0], y.shape[0]))
ValueError: X and y have incompatible shapes.
X has 1 samples, but y has 75.
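The traceback above is consistent with `transform` indexing rows instead of columns (`x[self.ind_feature,:]`): with a percentile of 0.75 divided by 100, a single index is kept, so a single sample reaches the SVC. A possible corrected sketch follows. It assumes a recent scikit-learn, where `GridSearchCV` lives in `sklearn.model_selection` and `class_weight='balanced'` replaces `'auto'`, and it uses synthetic binary data from `make_classification` because the score is defined for labels 0 and 1 only. The percentile values and dataset settings are illustrative choices, not from the original post.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


# Mixin listed first, as scikit-learn's own estimators do.
class Fisher(TransformerMixin, BaseEstimator):
    """Keep the top `percentile` percent of features ranked by Fisher score."""

    def __init__(self, percentile=50):
        self.percentile = percentile

    def fit(self, X, y):
        X_pos, X_neg = X[y == 1], X[y == 0]
        X_mean = X.mean(axis=0)
        num = (X_pos.mean(axis=0) - X_mean) ** 2 + (X_neg.mean(axis=0) - X_mean) ** 2
        deno = (X_pos.var(axis=0) / (X_pos.shape[0] - 1)
                + X_neg.var(axis=0) / (X_neg.shape[0] - 1))
        order = np.argsort(num / deno)[::-1]          # best feature first
        n_keep = int(np.ceil(self.percentile / 100.0 * X.shape[1]))
        self.ind_feature_ = order[:n_keep]
        return self

    def transform(self, X):
        # Index columns (features), not rows (samples).
        return X[:, self.ind_feature_]


# Synthetic, mildly imbalanced binary data (an assumption for illustration).
X, y = make_classification(n_samples=200, n_features=8, n_informative=3,
                           weights=[0.7, 0.3], random_state=0)

pipeline = Pipeline([
    ('fisher', Fisher()),
    ('normal', StandardScaler()),
    ('svm', SVC(class_weight='balanced')),
])
grid = {'fisher__percentile': [25, 50, 75], 'svm__C': [1, 2]}

# The selector is refit on the training part of every fold.
model = GridSearchCV(pipeline, param_grid=grid, cv=3, scoring='f1')
model.fit(X, y)
```

Since the transformer is fitted inside the pipeline, the feature ranking is recomputed from the training portion of each fold only. The resampling step still has to happen outside a plain scikit-learn Pipeline; the separate imbalanced-learn package provides its own Pipeline class intended for samplers, which is not shown here.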
