Why does h2o.randomForest in R make much better predictions than the randomForest package?

setwd("D:/Santander")

## import train dataset
train<-read.csv("train.csv",header=T)


dim(train)

summary(train)

str(train)

## class balance of the target
prop.table(table(train$TARGET))

## per-variable summary: missing count, range, centre, number of distinct
## values, and the share of the single most frequent value (max_freq)
stats<-function(x){
  n<-length(x)
  nmiss<-sum(is.na(x))
  y<-x[!is.na(x)]
  tab<-as.data.frame(table(y))
  max_freq<-max(tab[,2])/n    # share of the most frequent value
  freq<-length(unique(y))     # number of distinct values
  return(c(nmiss=nmiss,min=min(y),median=median(y),mean=mean(y),max=max(y),freq=freq,max_freq=max_freq))
}


## apply stats() to every column, then transpose so each row is one variable
var_stats<-sapply(train,stats)

var_stats_1<-t(var_stats)

### drop every variable whose most frequent value accounts for more than 0.9999 of rows (i.e. all other values together are rarer than 1/10000)

exclude_var<-rownames(var_stats_1)[var_stats_1[,"max_freq"]>0.9999]

train2<-train[,! colnames(train) %in% c(exclude_var,"ID")]   # also drop the ID column




## keep only train2 and free everything else
rm(list=setdiff(ls(),"train2"))

## use only the first 10,000 rows to keep the run fast
train2<-train2[1:10000,]

write.csv(train2,"example data.csv",row.names = F)

## randomly split the data into training (80%) and test (20%) sets
set.seed(1)
ind<-sample(c(1,2),size=nrow(train2),replace=T,prob=c(0.8,0.2))

train2$TARGET<-factor(train2$TARGET)
train_set<-train2[ind==1,]
test_set<-train2[ind==2,]

rm(train2)
## 1. Build a prediction model with R's randomForest package (50 trees)
library(randomForest)
library(ROCR)   # prediction() and performance() below come from ROCR

memory.limit(4000)   # Windows-only: raise the memory limit (in MB)

random<-randomForest(TARGET~.,data=train_set,ntree=50)

print(random)

random.importance<-importance(random)

p_train<-predict(random,train_set,type="prob")

pred.auc<-prediction(p_train[,2],train_set$TARGET)

performance(pred.auc,"auc")
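
## Optional (not in the original script): performance() returns an S4 object;
## the AUC can be extracted as a plain number from its y.values slot
auc_train<-performance(pred.auc,"auc")@y.values[[1]]
auc_train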

##train_set auc=0.8177


## predict test_set
p_test<-predict(random,newdata = test_set,type="prob")

pred.auc<-prediction(p_test[,2],test_set$TARGET)
performance(pred.auc,"auc")

##test_set auc=0.60


#________________________________________________#

##_________h2o.randomForest_______________

library(h2o)
h2o.init()
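## Note (my addition): h2o.init() also accepts resource options such as
## nthreads and max_mem_size, e.g. h2o.init(nthreads=-1, max_mem_size="4g");
## the defaults are used here, as in the original run.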

train.h2o<-as.h2o(train_set)
test.h2o<-as.h2o(test_set)

## y="TARGET"; when x is omitted, H2O uses all other columns as predictors
random.h2o<-h2o.randomForest(y="TARGET",training_frame = train.h2o,ntrees=50)


importance.h2o<-h2o.varimp(random.h2o)

p_train.h2o<-as.data.frame(h2o.predict(random.h2o,train.h2o))

pred.auc<-prediction(p_train.h2o$p1,train_set$TARGET)

performance(pred.auc,"auc")

##training-set auc=0.9388, higher than randomForest's 0.8177

###test_set prediction

p_test.h2o<-as.data.frame(h2o.predict(random.h2o,test.h2o))

pred.auc<-prediction(p_test.h2o$p1,test_set$TARGET)

performance(pred.auc,"auc")

###test_set auc=0.775

      

I tried to make predictions for the Kaggle Santander Customer Satisfaction competition (https://www.kaggle.com/c/santander-customer-satisfaction). When I use the randomForest package in R, I get a test-set AUC of 0.57, but when I use h2o.randomForest I get a test-set AUC of 0.81. The parameters in both functions are the same; I only used the default parameters with ntree = 100. So why does h2o.randomForest make much better predictions than the randomForest package itself?



1 answer


First, as noted by user1808924, there are differences in the algorithms and their default hyperparameters. For example, R's randomForest splits based on the Gini criterion, while H2O trees split based on squared-error reduction (even for classification). H2O also uses histograms for splitting and can split on categorical variables without dummy (one-hot) encoding (although that shouldn't matter here, since the Santander dataset is entirely numeric). More information on H2O's splitting can be found here (it is in the GBM section, but splitting is the same for both algos).
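One way to see how much of the gap comes from the default hyperparameters rather than from the algorithms themselves is to bring the tree-size settings closer together. A rough sketch under that assumption, reusing train_set and train.h2o from the question's script (maxnodes caps the size of R's trees, max_depth caps H2O's; H2O's default max_depth is 20, while R's randomForest grows trees until the leaves are pure; the cap values here are illustrative, not tuned):

# cap R randomForest tree size instead of growing to purity
random_shallow<-randomForest(TARGET~.,data=train_set,ntree=50,maxnodes=1024)

# let H2O grow deeper trees than its default max_depth of 20
random.h2o_deep<-h2o.randomForest(y="TARGET",training_frame=train.h2o,ntrees=50,max_depth=50)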



If you look at the predictions from your R randomForest model, you will see that they are all in increments of 0.02. R's randomForest builds very deep trees, which results in pure leaf nodes. That means the predicted outcome for an observation will be either 0 or 1 in each tree, and since you set ntrees=50, the predictions will all be in increments of 1/50 = 0.02. The reason you get poor AUC scores is that AUC depends only on the ordering of the predictions, and since all of your predictions are in {0.00, 0.02, 0.04, ...} there are many ties. The trees in an H2O random forest are not as deep, and therefore not as pure, so its predictions have more granularity: they can be ranked more finely, which yields a better AUC.
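
This is easy to check from the prediction vectors already computed in the question's script (p_test from randomForest and p_test.h2o from H2O; the check itself is my addition):

# R randomForest: at most 51 distinct values, all multiples of 1/50 -> many ties
sort(unique(p_test[,2]))
length(unique(p_test[,2]))

# H2O random forest: many more distinct predicted probabilities -> finer ranking
length(unique(p_test.h2o$p1))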
