Errors when running Caret package in R

I am trying to create a model to predict whether a product will sell on an ecommerce website with a 1 or 0 being the result.

My data consists of several categorical variables (some binary, one with more levels) and one continuous variable (price), with an output variable of 1 or 0 depending on whether the product listing sold.

This is my code:

inTrainingset<-createDataPartition(C$Sale, p=.75, list=FALSE)
CTrain<-C[inTrainingset,]
CTest<-C[-inTrainingset,]


gbmfit<-gbm(Sale~., data=C, distribution="bernoulli", n.trees=5, interaction.depth=7, shrinkage=.01)
plot(gbmfit)


gbmTune<-train(Sale~.,data=CTrain, method="gbm")


ctrl<-trainControl(method="repeatedcv",repeats=5)
gbmTune<-train(Sale~.,data=CTrain, 
           method="gbm", 
           verbose=FALSE, 
           trControl=ctrl)


ctrl<-trainControl(method="repeatedcv", repeats=5, classProbs=TRUE, summaryFunction = twoClassSummary)
gbmTune<-trainControl(Sale~., data=CTrain, 
                  method="gbm", 
                  metric="ROC", 
                  verbose=FALSE , 
                  trControl=ctrl)



  grid<-expand.grid(.interaction.depth=seq(1,7, by=2), .n.trees=seq(100,300, by=50),  .shrinkage=c(.01,.1))

  gbmTune<-train(Sale~., data=CTrain, 
           method="gbm", 
           metric="ROC", 
           tunegrid= grid, 
           verebose=FALSE,
           trControl=ctrl)



  set.seed(1)
  gbmTune <- train(Sale~., data = CTrain,
               method = "gbm",
               metric = "ROC",
               tuneGrid = grid,
               verbose = FALSE,
               trControl = ctrl)

      

I am facing two problems. First, when I try to add summaryFunction = twoClassSummary and then train the model, I get this:

Error in trainControl(Sale ~ ., data = CTrain, method = "gbm", metric = "ROC", : unused arguments (data = CTrain, metric = "ROC", trControl = ctrl)

The second problem: if I skip the summaryFunction and just try to run the model, I get this error:

Error in evalSummaryFunction(y, wts = weights, ctrl = trControl, lev = classLevels, : train()'s use of ROC codes requires class probabilities. See the classProbs option of trainControl(). In addition: Warning message: In train.default(x, y, weights = w, ...) : cannot compute class probabilities for regression

I tried changing the output variable from the numeric values 1 and 0 to a text value in Excel, but that did not change anything.

Any help on how to stop the model from being treated as regression, or on the first error message, would be greatly appreciated.

Best,

Will will@nubimetrics.com



2 answers


Your outcome variable is coded as:

Sale = c(1L, 0L, 1L, 1L, 0L)

      

While gbm expects the outcome to be coded this way, it is a rather unnatural way to encode the data. Almost every other modelling function wants factors.



So if you give train numeric 0/1 data, it thinks you want to do regression. If you convert the outcome to a factor and use "0" and "1" as the levels (and you want class probabilities), you will get a warning that says: "At least one of the class levels is not a valid R variable name; this may cause errors if class probabilities are generated because the variable names will be converted to ...". That is not a benign warning.

Use factor levels that are valid R variable names and you should be fine.
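
For example, a conversion along these lines (a minimal sketch; the "No"/"Yes" labels are just one possible choice) turns the numeric 0/1 column into a two-level factor with valid level names:

# recode the numeric 0/1 outcome as a factor with valid level names
C$Sale <- factor(C$Sale, levels = c(0, 1), labels = c("No", "Yes"))

# Sale is now a factor, so train() will treat this as two-class
# classification and classProbs/twoClassSummary will work
str(C$Sale)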

Max



I was able to reproduce your error using the data(GermanCredit) dataset.

Your error comes from calling trainControl as if it were gbm, train, or something similar.

If you look at the documentation and the vignette via ?trainControl, you can see that it expects input very different from what you are giving it.
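
To make the difference concrete, here is roughly how the work is split (a sketch using the same objects as in your question): trainControl only describes how to resample and summarise, while the formula, data, model method and metric all go to train:

# trainControl describes HOW to resample and summarise -- no formula or data
ctrl <- trainControl(method = "repeatedcv", repeats = 5,
                     classProbs = TRUE, summaryFunction = twoClassSummary)

# train takes the formula, data, model method and metric,
# and receives the control object through trControl
gbmTune <- train(Sale ~ ., data = CTrain, method = "gbm",
                 metric = "ROC", trControl = ctrl)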



It works:

require(caret)
require(gbm)
data(GermanCredit)

# Your dependent variable was Sale and it was binary
#   in place of Sale I will use the binary variable Telephone 

C      <- GermanCredit
C$Sale <- GermanCredit$Telephone

inTrainingset<-createDataPartition(C$Sale, p=.75, list=FALSE)
CTrain<-C[inTrainingset,]
CTest<-C[-inTrainingset,]
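
# optional: pre-generated seeds for reproducible resampling, as described in
# ?trainControl (one integer vector per resample, plus one for the final model)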
set.seed(123)
seeds <- vector(mode = "list", length = 51)
for(i in 1:50) seeds[[i]] <- sample.int(1000, 22)
seeds[[51]] <- sample.int(1000, 1)

gbmfit<-gbm(Sale~Age+ResidenceDuration, data=C,
            distribution="bernoulli", n.trees=5, interaction.depth=7, shrinkage=.01)
plot(gbmfit)


gbmTune<-train(Sale~Age+ResidenceDuration,data=CTrain, method="gbm")


ctrl<-trainControl(method="repeatedcv",repeats=5)
gbmTune<-train(Sale~Age+ResidenceDuration,data=CTrain, 
               method="gbm", 
               verbose=FALSE, 
               trControl=ctrl)


ctrl<-trainControl(method="repeatedcv", repeats=5, classProbs=TRUE, summaryFunction = twoClassSummary)

# gbmTune<-trainControl(Sale~Age+ResidenceDuration, data=CTrain, 
#                       method="gbm", 
#                       metric="ROC", 
#                       verbose=FALSE , 
#                       trControl=ctrl)
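# trainControl itself only takes resampling/control options -- not a formula, data, or model method: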

gbmTune <- trainControl(method = "adaptive_cv", 
                      repeats = 5,
                      verboseIter = TRUE,
                      seeds = seeds)

grid<-expand.grid(.interaction.depth=seq(1,7, by=2), .n.trees=seq(100,300, by=50),  .shrinkage=c(.01,.1))

gbmTune<-train(Sale~Age+ResidenceDuration, data=CTrain, 
               method="gbm", 
               metric="ROC", 
               tuneGrid=grid, 
               verbose=FALSE,
               trControl=ctrl)



set.seed(1)
gbmTune <- train(Sale~Age+ResidenceDuration, data = CTrain,
                 method = "gbm",
                 metric = "ROC",
                 tuneGrid = grid,
                 verbose = FALSE,
                 trControl = ctrl)

      

Depending on what you are trying to accomplish, you may want to specify things a little differently, but it all comes down to the fact that you used trainControl as if it were train.
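
Once a model has trained cleanly, you can sanity-check it on the held-out CTest partition; something along these lines should work (assuming the outcome has been recoded as a factor, as Max suggests):

# class predictions and a confusion matrix on the hold-out set
testPred <- predict(gbmTune, newdata = CTest)
confusionMatrix(testPred, CTest$Sale)

# predicted class probabilities, which is what twoClassSummary/ROC use
testProb <- predict(gbmTune, newdata = CTest, type = "prob")
head(testProb)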
