H2o predictions sometimes fail when the response variable is not present in the test case

When predicting on a test set in which the response variable is missing, h2o fails in various ways if one-hot encoding was used for a factor variable during training, whether it was applied implicitly (as in GLM training) or specified explicitly (as in other methods).

This bug is present in R 3.4.0 with h2o 3.12.0.1. We also tested h2o 3.10.3.3.

library(h2o)
localH2O = h2o.init()

# Load the prostate data and add a factor with 380 distinct levels (one per row)
prostatePath = system.file("extdata", "prostate.csv", package = "h2o")
prostate.hex = read.csv(prostatePath)
prostate.hex$our_factor<-as.factor(paste0("Q",c(rep(c(1:380),1))))

prostate.hex<-as.h2o(prostate.hex)
prostate.hex$weight<-1 #constant column, used as the offset below

prostate_train<-prostate.hex[1:300,]
prostate_test<-prostate.hex[301:380,]
prostate_test<-prostate_test[,-3] #delete response variable from test data

# Example 1: GLM with an offset column
model<-h2o.glm(y = "AGE", x = c("our_factor"), 
               training_frame = prostate_train,offset_column="weight")
predict(model,newdata=prostate_train)
predict(model,newdata=prostate_test)

# Example 2: GLM without an offset column
model<-h2o.glm(y = "AGE", x = c("our_factor"), 
               training_frame = prostate_train)
predict(model,newdata=prostate_train)
predict(model,newdata=prostate_test)

# Example 3: GBM with explicit one-hot encoding
model<-h2o.gbm(y = "AGE", x = c("our_factor"), 
               training_frame = prostate_train,categorical_encoding = "OneHotExplicit")
predict(model,newdata=prostate_train)
predict(model,newdata=prostate_test)

The first GLM example, which was trained with an offset column, returns all NaNs when predicting on the test data. The second GLM example raises this error:

DistributedException from localhost/127.0.0.1:54321: '0', caused by java.lang.ArrayIndexOutOfBoundsException: 0

DistributedException from localhost/127.0.0.1:54321: '0', caused by java.lang.ArrayIndexOutOfBoundsException: 0
    at water.MRTask.getResult(MRTask.java:478)
    at water.MRTask.getResult(MRTask.java:486)
    at water.MRTask.doAll(MRTask.java:390)
    at water.MRTask.doAll(MRTask.java:396)
    at hex.glm.GLMModel.predictScoreImpl(GLMModel.java:1215)
    at hex.Model.score(Model.java:1077)
    at water.api.ModelMetricsHandler$1.compute2(ModelMetricsHandler.java:351)
    at water.H2O$H2OCountedCompleter.compute(H2O.java:1349)
    at jsr166y.CountedCompleter.exec(CountedCompleter.java:468)
    at jsr166y.ForkJoinTask.doExec(ForkJoinTask.java:263)
    at jsr166y.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:974)
    at jsr166y.ForkJoinPool.runWorker(ForkJoinPool.java:1477)
    at jsr166y.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104)
Caused by: java.lang.ArrayIndexOutOfBoundsException: 0
    at hex.DataInfo.extractDenseRow(DataInfo.java:1025)
    at hex.glm.GLMScore.map(GLMScore.java:148)
    at water.MRTask.compute2(MRTask.java:657)
    at water.H2O$H2OCountedCompleter.compute1(H2O.java:1352)
    at hex.glm.GLMScore$Icer.compute1(GLMScore$Icer.java)
    at water.H2O$H2OCountedCompleter.compute(H2O.java:1348)
    ... 5 more

Error: DistributedException from localhost/127.0.0.1:54321: '0', caused by java.lang.ArrayIndexOutOfBoundsException: 0

The GBM example generates this error (although the only column missing from the test data is the response variable):

java.lang.IllegalArgumentException: Test/Validation dataset has no columns in common with the training set
java.lang.IllegalArgumentException: Test/Validation dataset has no columns in common with the training set
    at hex.Model.adaptTestForTrain(Model.java:1028)
    at hex.Model.adaptTestForTrain(Model.java:854)
    at hex.Model.score(Model.java:1072)
    at water.api.ModelMetricsHandler$1.compute2(ModelMetricsHandler.java:351)
    at water.H2O$H2OCountedCompleter.compute(H2O.java:1349)
    at jsr166y.CountedCompleter.exec(CountedCompleter.java:468)
    at jsr166y.ForkJoinTask.doExec(ForkJoinTask.java:263)
    at jsr166y.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:974)
    at jsr166y.ForkJoinPool.runWorker(ForkJoinPool.java:1477)
    at jsr166y.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104)

Error: java.lang.IllegalArgumentException: Test/Validation dataset has no columns in common with the training set
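The claim that only the response column is missing can be checked directly by comparing column names. The helper below is a small sketch with a hypothetical name (`missing_columns`); it relies only on `names()` and `setdiff()`, and is demonstrated here on plain data frames rather than on the H2O frames from the script above.

```r
# Hypothetical helper: list columns present in the training frame
# but absent from the test frame.
missing_columns <- function(train, test) {
  setdiff(names(train), names(test))
}

# Demonstration on small plain data frames (stand-ins for the real data)
train_df <- data.frame(AGE = c(65, 72), our_factor = factor(c("Q1", "Q2")))
test_df  <- train_df[, "our_factor", drop = FALSE]
missing_columns(train_df, test_df)
```

With the frames from the script, the equivalent call would be `missing_columns(prostate_train, prostate_test)`, assuming `names()` behaves on H2O frames as it does on data frames.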

The error appears to be specific to factor variables that are one-hot encoded. It can be worked around by adding a "fake" response column to the test dataset (we tested this and, as we would expect, the value of this column makes no difference to the predictions), but this is obviously not ideal.
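A minimal sketch of that workaround follows. The helper name `add_placeholder_response` and the placeholder value 0 are our own choices for illustration, not part of h2o. It is demonstrated on a plain data frame; for an H2O frame we assume that column assignment works the same way as the `prostate.hex$weight<-1` line in the script above.

```r
# Hypothetical helper: add a placeholder response column if it is missing.
# The placeholder value is arbitrary; per our tests it does not affect
# the predictions.
add_placeholder_response <- function(frame, response, value = 0) {
  if (!(response %in% names(frame))) {
    frame[[response]] <- value
  }
  frame
}

# Demonstration on a plain data frame (stand-in for the test data)
test_df <- data.frame(our_factor = factor(c("Q301", "Q302")))
patched <- add_placeholder_response(test_df, "AGE")
names(patched)

# With the objects from the script above (requires a running H2O cluster):
# predict(model, newdata = add_placeholder_response(prostate_test, "AGE"))
```

The helper leaves the frame untouched when the response column is already present, so it can be applied to training and test frames alike.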

The errors remain even when every factor level is present in both the training set and the test set, provided there are 5 or more factor levels:

prostate.hex$our_factor<-as.factor(paste0("Q",c(rep(c(1:5),76))))

If there are 4 or fewer levels, there is no problem with GLM, but the error message from GBM remains.
