Difference between cross_val_score and cross_val_predict

I want to evaluate a regression model built with scikit-learn using cross-validation, and I am confused which of the two functions, cross_val_score and cross_val_predict, I should use. One option:

cvs = DecisionTreeRegressor(max_depth = depth)
scores = cross_val_score(cvs, predictors, target, cv=cvfolds, scoring='r2')
print("R2-Score: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

      

The other is to use the predictions from cross_val_predict together with the standard r2_score:

cvp = DecisionTreeRegressor(max_depth = depth)
predictions = cross_val_predict(cvp, predictors, target, cv=cvfolds)
print ("CV R^2-Score: {}".format(r2_score(df[target], predictions_cv)))

      

I would assume both methods are valid and give similar results. But that is only the case for a small number of folds k. While R² is about the same for 10-fold CV, it gets increasingly lower for higher values of k in the first version using cross_val_score. The second version is largely unaffected by changes in the number of folds.

Is this behavior to be expected, or am I missing something about how CV works in scikit-learn?


3 answers


cross_val_score returns the score of the test fold, whereas cross_val_predict returns the predicted y values for the test fold.

For cross_val_score() you use the average of the per-fold scores, which will be influenced by the number of folds, because with more folds some folds may have a high error (a poor fit).

cross_val_predict(), on the other hand, returns, for each element in the input, the prediction that was obtained for that element when it was in the test set. (Note that only cross-validation strategies that assign each element to a test set exactly once can be used.) Thus, increasing the number of folds only increases the training data available for each test element, and therefore its result may not be much affected.
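To make the difference concrete, here is a minimal sketch (synthetic regression data via make_regression; the regressor and fold count are just illustrative):

from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score, cross_val_predict
from sklearn.metrics import r2_score
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=500, n_features=5, noise=10.0, random_state=0)
reg = DecisionTreeRegressor(max_depth=3)

# cross_val_score: one score per fold, each computed only on that fold's test set
fold_scores = cross_val_score(reg, X, y, cv=10, scoring='r2')   # shape (10,)
print("mean of fold scores:", fold_scores.mean())

# cross_val_predict: one out-of-fold prediction per sample
y_pred = cross_val_predict(reg, X, y, cv=10)                    # shape (500,)
print("score over pooled predictions:", r2_score(y, y_pred))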

Hope this helps. Feel free to ask if you have any doubts.



Edit: answer to a question in the comments

Please see the following for how cross_val_predict works:

I think cross_val_predict will overfit, because as the number of folds increases, there is more data for training and less for testing. So the resulting label is more dependent on the training data. Also, as mentioned above, the prediction for any one sample is made only once, so it may be more susceptible to how the data is split. This is why most sites and tutorials recommend using cross_val_score for analysis.


I think the difference can be clarified by checking their results. Consider this snippet:

from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score, cross_val_predict

# X is a NumPy array; the last column is the label
print(X.shape)  # (7040, 133)

clf = MLPClassifier()

scores = cross_val_score(clf, X[:,:-1], X[:,-1], cv=5)
print(scores.shape)  # (5,)

y_pred = cross_val_predict(clf, X[:,:-1], X[:,-1], cv=5)
print(y_pred.shape)  # (7040,)

      

Pay attention to the shapes: why are they like this?

scores.shape has length 5 because it is the result of a 5-fold cross-validation (see the argument cv=5). Therefore, a single value is computed for each fold. That value is the score of the classifier:

given the true labels and the predicted labels, how many answers did the predictor get right in a particular fold?

In this case, the y labels given in the input are used twice: to learn from the data and to evaluate the performance of the classifier.



On the other hand, y_pred.shape has length 7040, which is the length of the input dataset. This means that each value is not a score computed from several values, but a single value: the prediction of the classifier:

given the input data and its labels, what is the classifier's prediction for a specific example that was in the test set of a particular fold?

Note that you do not know which fold was used: each output was computed on the test data of some fold, but you cannot tell which one (at least not from this output).

In this case, the labels are used only once: to train the classifier. It is up to you to compare these predictions with the true labels in order to compute a score. If you just average them, as you did, the result is not a score but merely an average prediction.
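To make that last point concrete, a small sketch continuing the snippet above (accuracy is assumed as the metric, which is the default classifier score in scikit-learn):

from sklearn.metrics import accuracy_score

# What cross_val_score gives you: the mean of the five per-fold scores
print(scores.mean())

# What you can compute from cross_val_predict: one score over the pooled
# out-of-fold predictions
print(accuracy_score(X[:,-1], y_pred))

# Averaging y_pred itself is not a score, just the mean predicted label
print(y_pred.mean())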



So this question bothered me as well, and while others have made good comments, they haven't answered all aspects of the OP's question.

The correct answer is: the divergence of the scores with increasing k is caused by the chosen metric, R2 (coefficient of determination). For metrics such as MSE, MSLE or MAE it makes no difference whether you use cross_val_score or cross_val_predict.

See the definition of R2:

R^2 = 1 - MSE(ground truth, prediction) / MSE(ground truth, mean(ground truth))

The mean(ground truth) term in the denominator explains why the score starts to differ as k increases: the more splits we have, the fewer samples there are in each test fold, and the higher the variance of the mean of that test fold. Conversely, for small k the mean of a test fold does not differ much from the overall true mean, since the sample size is still large enough to keep the variance small.
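As a quick, hypothetical illustration of that definition (toy arrays, not the OP's data):

import numpy as np
from sklearn.metrics import mean_squared_error as mse, r2_score

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 6.0, 10.0])

# R^2 = 1 - MSE(ground truth, prediction) / MSE(ground truth, mean(ground truth))
manual_r2 = 1 - mse(y_true, y_pred) / mse(y_true, np.full_like(y_true, y_true.mean()))
print(manual_r2, r2_score(y_true, y_pred))  # both print 0.875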

Evidence:

import numpy as np
from sklearn.metrics import mean_absolute_error as mae
from sklearn.metrics import mean_squared_log_error as msle, r2_score

# Random predictions and ground truth, deliberately unrelated
predictions = np.random.rand(1000) * 100
groundtruth = np.random.rand(1000) * 20

def scores_for_increasing_k(score_func):
    # "skewed" score: computed once over all pooled predictions,
    # which is what you get when scoring cross_val_predict output
    skewed_score = score_func(groundtruth, predictions)
    print(f'skewed score (from cross_val_predict): {skewed_score}')
    # "correct" CV score: mean of the per-fold scores, as cross_val_score does
    for k in (2, 4, 5, 10, 20, 50, 100, 200, 250):
        fold_preds = np.split(predictions, k)
        fold_gtruth = np.split(groundtruth, k)
        correct_score = np.mean([score_func(g, p) for g, p in zip(fold_gtruth, fold_preds)])
        print(f'correct CV for k={k}: {correct_score}')

for name, score in [('MAE', mae), ('MSLE', msle), ('R2', r2_score)]:
    print(name)
    scores_for_increasing_k(score)
    print()

      

The output would be:

MAE
skewed score (from cross_val_predict): 42.25333901481263
correct CV for k=2: 42.25333901481264
correct CV for k=4: 42.25333901481264
correct CV for k=5: 42.25333901481264
correct CV for k=10: 42.25333901481264
correct CV for k=20: 42.25333901481264
correct CV for k=50: 42.25333901481264
correct CV for k=100: 42.25333901481264
correct CV for k=200: 42.25333901481264
correct CV for k=250: 42.25333901481264

MSLE
skewed score (from cross_val_predict): 3.5252449697327175
correct CV for k=2: 3.525244969732718
correct CV for k=4: 3.525244969732718
correct CV for k=5: 3.525244969732718
correct CV for k=10: 3.525244969732718
correct CV for k=20: 3.525244969732718
correct CV for k=50: 3.5252449697327175
correct CV for k=100: 3.5252449697327175
correct CV for k=200: 3.5252449697327175
correct CV for k=250: 3.5252449697327175

R2
skewed score (from cross_val_predict): -74.5910282783694
correct CV for k=2: -74.63582817089443
correct CV for k=4: -74.73848598638291
correct CV for k=5: -75.06145142821893
correct CV for k=10: -75.38967601572112
correct CV for k=20: -77.20560102267272
correct CV for k=50: -81.28604960074824
correct CV for k=100: -95.1061197684949
correct CV for k=200: -144.90258384605787
correct CV for k=250: -210.13375041871123

      

Of course, there is the additional effect, which others have mentioned: as k increases, more models are trained on more samples and tested on fewer samples, which will affect the final scores, but this is not caused by the choice between cross_val_score and cross_val_predict.
