Why does decision tree code written in python predict differently than code written in R?
I'm working with the load_iris dataset from sklearn in python and R (it's just called iris in R).
I have built the model in both languages using the "gini" criterion, and in both languages the model validates correctly when the test data is taken directly from the iris dataset itself.
However, if I give a new data point as test input, Python and R put it into different categories.
I'm not sure what I'm missing here or whether something is amiss, so any guidance would be much appreciated.
Code given below: Python 2.7:
from sklearn.datasets import load_iris
from sklearn import tree
iris = load_iris()
model = tree.DecisionTreeClassifier(criterion='gini')
model.fit(iris.data, iris.target)
model.score(iris.data, iris.target)
print iris.data[49],model.predict([iris.data[49]])
print iris.data[99],model.predict([iris.data[99]])
print iris.data[100],model.predict([iris.data[100]])
print iris.data[149],model.predict([iris.data[149]])
print [6.3,2.8,6,1.3],model.predict([[6.3,2.8,6,1.3]])
R (RStudio, R 3.3.2, 32-bit):
library(rpart)
iris<- iris
x_train = iris[c('Sepal.Length','Sepal.Width','Petal.Length','Petal.Width')]
y_train = as.matrix(cbind(iris['Species']))
x <- cbind(x_train,y_train)
fit <- rpart(y_train ~ ., data = x_train,method="class",parms = list(split = "gini"))
summary(fit)
x_test = x[149,]
x_test[,1]=6.3
x_test[,2]=2.8
x_test[,3]=6
x_test[,4]=1.3
predicted1= predict(fit,x[49,]) # same as python result
predicted2= predict(fit,x[100,]) # same as python result
predicted3= predict(fit,x[101,]) # same as python result
predicted4= predict(fit,x[149,]) # same as python result
predicted5= predict(fit,x_test) ## this value does not match with pythons result
My python output:
[ 5. 3.3 1.4 0.2] [0]
[ 5.7 2.8 4.1 1.3] [1]
[ 6.3 3.3 6. 2.5] [2]
[ 5.9 3. 5.1 1.8] [2]
[6.3, 2.8, 6, 1.3] [2] -----> this means it's putting the test data into the virginica bucket
and the R output:
> predicted1
setosa versicolor virginica
49 1 0 0
> predicted2
setosa versicolor virginica
100 0 0.9074074 0.09259259
> predicted3
setosa versicolor virginica
101 0 0.02173913 0.9782609
> predicted4
setosa versicolor virginica
149 0 0.02173913 0.9782609
> predicted5
setosa versicolor virginica
149 0 0.9074074 0.09259259 --> this means it's putting the test data into the versicolor bucket
Please help. Thanks.
Decision trees involve quite a few tuning parameters (minimum leaf size, tree depth, splitting criterion, etc.), and different packages may have different default settings. If you want to get the same results, you need to make sure the implicit defaults are similar. For example, try running the following:
fit <- rpart(y_train ~ ., data = x_train,method="class",
parms = list(split = "gini"),
control = rpart.control(minsplit = 2, minbucket = 1, xval=0, maxdepth = 30))
(predicted5= predict(fit,x_test))
setosa versicolor virginica
149 0 0.3333333 0.6666667
Here the options minsplit = 2, minbucket = 1, xval = 0 and maxdepth = 30 are chosen to match sklearn's defaults (see the sklearn documentation). maxdepth = 30 is the largest value rpart allows; sklearn has no limit here. If you want the probabilities etc. to be the same as well, you probably want to play around with the cp parameter.
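In case it helps with the comparison, sklearn's implicit defaults can be read off programmatically with get_params(); a quick sketch (the printed values are sklearn's documented defaults, which is what the rpart.control() call above mirrors):

```python
from sklearn.tree import DecisionTreeClassifier

# Inspect the implicit defaults that sklearn uses, to compare them
# with the rpart.control() settings in R.
defaults = DecisionTreeClassifier(criterion='gini').get_params()
print(defaults['min_samples_split'])  # 2
print(defaults['min_samples_leaf'])   # 1
print(defaults['max_depth'])          # None, i.e. no depth limit
```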
Similarly, for
model = tree.DecisionTreeClassifier(criterion='gini',
min_samples_split=20,
min_samples_leaf=round(20.0/3.0), max_depth=30)
model.fit(iris.data, iris.target)
I get
print model.predict([iris.data[49]])
print model.predict([iris.data[99]])
print model.predict([iris.data[100]])
print model.predict([iris.data[149]])
print model.predict([[6.3,2.8,6,1.3]])
[0]
[1]
[2]
[2]
[1]
which looks pretty similar to your original R output.
Needless to say, be careful when your predictions on the training set seem "unreasonably good", as you are likely overfitting the data. For example, look at model.predict_proba(...), which gives the class probabilities in sklearn (instead of the predicted classes). You should see that with your current Python code / settings you are almost certainly overfitting.
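To illustrate (a minimal sketch; random_state=0 is an arbitrary choice, not something from your question):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
model = DecisionTreeClassifier(criterion='gini', random_state=0)
model.fit(iris.data, iris.target)

# A fully grown tree memorises the training set: the training score is
# (nearly) perfect and predict_proba returns hard 0/1 class
# probabilities, both classic signs of overfitting.
print(model.score(iris.data, iris.target))
print(model.predict_proba([[6.3, 2.8, 6.0, 1.3]]))
```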
In addition to @coffeeinjunky's answer, you need to pay attention to the random_state parameter (this is a Python parameter; I'm not sure what the equivalent is called in R). Growing the tree is itself pseudo-random, so you need to make sure both models use the same seed. Otherwise you can fit / predict with the same model and get different results on each run, because a different tree is used each time.
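A small sketch of fixing the seed in Python (random_state=42 here is just an arbitrary value):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()

# Two trees grown with the same fixed seed break ties between equally
# good splits the same way, so their predictions agree exactly.
m1 = DecisionTreeClassifier(criterion='gini', random_state=42).fit(iris.data, iris.target)
m2 = DecisionTreeClassifier(criterion='gini', random_state=42).fit(iris.data, iris.target)
print((m1.predict(iris.data) == m2.predict(iris.data)).all())
```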
Check out the section on decision trees in Müller and Guido, "Introduction to Machine Learning with Python". It does a great job of explaining the various parameters visually, and PDFs float around the web if you try a Google search. With decision trees and ensemble learning methods, the parameters you specify will have a significant impact on predictions.