Why is Weka RandomForest giving me a better result than Scikit RandomForestClassifier?

I am getting peculiar differences in results between WEKA and scikit-learn using the same RandomForest method and the same dataset. With scikit-learn, I get an AUC of around 0.62 (consistently, across extensive testing). With WEKA, however, I get results close to 0.79. This is a huge difference!

The dataset I tested the algorithms on is KC1.arff; I put a copy in my public Dropbox folder: https://dl.dropbox.com/u/30688032/KC1.arff . For WEKA, I just downloaded the .jar file from http://www.cs.waikato.ac.nz/ml/weka/downloading.html . In WEKA, I set cross-validation to 10 folds, the dataset to KC1.arff, and the algorithm to "RandomForest -l 19 -K 0 -S 1", then ran the experiment. WEKA saves the results to a file (CSV or .arff). Open this file and check the "Area_under_ROC" column; the value should be close to 0.79.

Below is the code for the scikit-learn RandomForest:

import numpy as np
from pandas import *
from sklearn.ensemble import RandomForestClassifier

def read_arff(f):
    from scipy.io import arff
    data, meta = arff.loadarff(f) 
    return DataFrame(data)

def kfold(clr,X,y,folds=10):
    from sklearn.cross_validation import StratifiedKFold
    from sklearn import metrics
    auc_sum=0
    kf = StratifiedKFold(y, folds)
    for train_index, test_index in kf:
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]
        clr.fit(X_train, y_train)
        pred_test = clr.predict(X_test)
        print metrics.auc_score(y_test,pred_test)
        auc_sum+=metrics.auc_score(y_test,pred_test)

    print 'AUC: ',  auc_sum/folds
    print  "----------------------------" 



#read the dataset
X=read_arff('KC1.arff')
y=X['Defective']

#map the class labels N and Y to 0 and 1 respectively
s = np.unique(y)
mapping = Series([x[0] for x in enumerate(s)], index = s)  
y=y.map(mapping) 
del X['Defective']

#initialize random forests (by default it is set to 10 trees)
rf=RandomForestClassifier()

#run algorithm
kfold(rf,np.array(X),y)

#You will get an average AUC around 0.62 as opposed to 0.79 in WEKA

      

Please keep in mind that the real AUC value, as reported in the experimental results of the relevant articles, is around 0.79, so the problem must be in my implementation, which uses scikit-learn's random forest.

Your kind help would be much appreciated!

Many thanks!

1 answer


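After posting a question on the scikit-learn issue tracker, I got feedback that the problem was the prediction method I was using. It should have been "pred_test = clr.predict_proba(X_test)[:, 1]" instead of "pred_test = clr.predict(X_test)". AUC is computed from how the classifier ranks the samples, so it needs a continuous score such as the predicted probability of class 1; predict() returns only hard 0/1 labels, which reduces the ROC curve to a single operating point.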
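To see why the choice of method matters, here is a tiny sketch using made-up numbers (the labels and probabilities are illustrative assumptions, not values from KC1). It scores the same four samples twice with the current `roc_auc_score` function: once with probabilities and once with the hard labels obtained by thresholding them at 0.5.

```python
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1]
proba = [0.1, 0.2, 0.3, 0.9]                    # made-up class-1 probabilities
labels = [1 if p >= 0.5 else 0 for p in proba]  # hard labels: [0, 0, 0, 1]

# The probabilities rank both positives above both negatives.
auc_from_proba = roc_auc_score(y_true, proba)

# The hard labels tie one positive with the negatives, which costs ranking credit.
auc_from_labels = roc_auc_score(y_true, labels)

print(auc_from_proba, auc_from_labels)
```

With these numbers the probability-based AUC is 1.0 while the label-based AUC drops to 0.75, even though the underlying model output is identical.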



After implementing the change, the results were the same for WEKA and scikit random forest :)
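For reference, a minimal sketch of the corrected cross-validation loop, written against a recent scikit-learn release (where `sklearn.cross_validation` and `auc_score` from the question have been replaced by `sklearn.model_selection` and `roc_auc_score`). The function name `kfold_auc`, the shuffle seed, and the synthetic stand-in data are assumptions for illustration; with the real data you would pass the KC1 feature array and the 0/1 labels instead.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold


def kfold_auc(clf, X, y, folds=10, seed=1):
    """Return the mean ROC AUC over stratified folds, scored on probabilities."""
    skf = StratifiedKFold(n_splits=folds, shuffle=True, random_state=seed)
    fold_aucs = []
    for train_idx, test_idx in skf.split(X, y):
        clf.fit(X[train_idx], y[train_idx])
        # Column 1 is the estimated probability of the positive class (label 1).
        scores = clf.predict_proba(X[test_idx])[:, 1]
        fold_aucs.append(roc_auc_score(y[test_idx], scores))
    return float(np.mean(fold_aucs))


# Synthetic stand-in data (an assumption); replace with the KC1 features and labels.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=20, n_informative=5,
    weights=[0.85, 0.15], random_state=0,
)
rf = RandomForestClassifier(n_estimators=10, random_state=1)
mean_auc = kfold_auc(rf, X_demo, y_demo)
print("AUC: %.3f" % mean_auc)
```

The number of trees is set to 10 here only to mirror the default mentioned in the question; the value printed for the synthetic data says nothing about what to expect on KC1.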
