Sklearn TimeSeriesSplit: cross_val_predict only works for partitions

Question

I am trying to use the TimeSeriesSplit cross-validation strategy in sklearn 0.18.1 with LogisticRegression. I am getting the error:

cross_val_predict only works for partitions

The following code snippet reproduces it:

from sklearn import linear_model, neighbors
from sklearn.model_selection import train_test_split, cross_val_predict, TimeSeriesSplit, KFold, cross_val_score
import pandas as pd
import numpy as np
from datetime import date, datetime

df = pd.DataFrame(data=np.random.randint(0,10,(100,5)), index=pd.date_range(start=date.today(), periods=100), columns='x1 x2 x3 x4 y'.split())


X, y = df['x1 x2 x3 x4'.split()], df['y']
score = cross_val_score(linear_model.LogisticRegression(fit_intercept=True), X, y, cv=TimeSeriesSplit(n_splits=2))
y_hat = cross_val_predict(linear_model.LogisticRegression(fit_intercept=True), X, y, cv=TimeSeriesSplit(n_splits=2), method='predict_proba')

What am I doing wrong?

python scikit-learn logistic-regression cross-validation

nickos556 19 jan. 17 at 23:50

1 answer

glao · Accepted Answer · 2017-03-01T23:24:09+0000

There are several ways to pass the cv argument to cross_val_score. Here you need to pass the generator returned by split. For example:

y = range(14)
cv = TimeSeriesSplit(n_splits=2).split(y)

This gives a generator that yields pairs of train and test index arrays. The first pair looks like this:

print(next(cv))
    (array([0, 1, 2, 3, 4, 5]), array([6, 7, 8, 9]))
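To see what the splitter produces overall, it can help to walk through every fold rather than just the first. The sketch below is only an illustration: the names X_demo, tested and never_tested are made up for this example and are not part of sklearn.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

n_samples = 14
# Hypothetical toy data: one feature column holding 0..13.
X_demo = np.arange(n_samples).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=2)
tested = []
for train_idx, test_idx in tscv.split(X_demo):
    # Each fold trains on everything before its test block.
    print("train:", train_idx, "test:", test_idx)
    tested.extend(test_idx.tolist())

# Samples that never show up in any test fold.
never_tested = sorted(set(range(n_samples)) - set(tested))
print("never in a test fold:", never_tested)
```

The earliest block of samples is only ever used for training, so the test folds do not cover the whole dataset. That is the likely reason cross_val_predict complains: it expects every sample to land in exactly one test fold.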

You can also pass a DataFrame as input to split.

df = pd.DataFrame(data=np.random.randint(0,10,(100,5)), 
                  index=pd.date_range(start=date.today(), 
                  periods=100), columns='x1 x2 x3 x4 y'.split())

cv = TimeSeriesSplit(n_splits=2).split(df)
print(next(cv))
    (array([ 0,  1,  2, ..., 31, 32, 33]), array([34, 35, 36, ..., 64, 65, 66]))

In your case, this should work:

score = cross_val_score(linear_model.LogisticRegression(fit_intercept=True), 
                         X, y, cv=TimeSeriesSplit(n_splits=2).split(df))

Have a look at cross_val_score and TimeSeriesSplit for details.
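If out-of-fold probabilities are what you are after, one possible workaround is to loop over the splits yourself and fill in predictions only for the rows that appear in a test fold. This is a minimal sketch, not the library's own mechanism: the fixed start date, the random seed, max_iter=1000 and the names proba, classes and cols are assumptions made for this example.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import TimeSeriesSplit

# Same shape of data as in the question; seed and start date are
# fixed here only to make the example reproducible (assumption).
rng = np.random.RandomState(0)
df = pd.DataFrame(rng.randint(0, 10, (100, 5)),
                  index=pd.date_range("2017-01-01", periods=100),
                  columns="x1 x2 x3 x4 y".split())
X, y = df["x1 x2 x3 x4".split()], df["y"]

classes = np.unique(y)
# One row per sample, one column per class; NaN means "never predicted".
proba = np.full((len(X), len(classes)), np.nan)

for train_idx, test_idx in TimeSeriesSplit(n_splits=2).split(X):
    model = LogisticRegression(max_iter=1000)
    model.fit(X.iloc[train_idx], y.iloc[train_idx])
    # A training fold may miss some classes, so map the fitted
    # classes onto the columns of the full class list.
    cols = np.searchsorted(classes, model.classes_)
    proba[test_idx] = 0.0
    proba[np.ix_(test_idx, cols)] = model.predict_proba(X.iloc[test_idx])
```

Rows belonging to the first training block stay NaN, because no model trained on earlier data exists to predict them. Whether to drop those rows or handle them differently depends on what you do with the predictions afterwards.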
