Errors when running Caret package in R
I am trying to build a model to predict whether a product will sell on an e-commerce website, with the outcome coded as 1 or 0.
My data consists of several categorical variables (one with many levels, the others binary) and one continuous variable (price). The output variable is 1 or 0, indicating whether or not the product listing sold.
This is my code:
inTrainingset<-createDataPartition(C$Sale, p=.75, list=FALSE)
CTrain<-C[inTrainingset,]
CTest<-C[-inTrainingset,]
gbmfit<-gbm(Sale~., data=C,distribution="bernoulli",n.trees=5,interaction.depth=7,shrinkage= .01)
plot(gbmfit)
gbmTune<-train(Sale~.,data=CTrain, method="gbm")
ctrl<-trainControl(method="repeatedcv",repeats=5)
gbmTune<-train(Sale~.,data=CTrain,
method="gbm",
verbose=FALSE,
trControl=ctrl)
ctrl<-trainControl(method="repeatedcv", repeats=5, classProbs=TRUE, summaryFunction = twoClassSummary)
gbmTune<-trainControl(Sale~., data=CTrain,
method="gbm",
metric="ROC",
verbose=FALSE ,
trControl=ctrl)
grid<-expand.grid(.interaction.depth=seq(1,7, by=2), .n.trees=seq(100,300, by=50), .shrinkage=c(.01,.1))
gbmTune<-train(Sale~., data=CTrain,
method="gbm",
metric="ROC",
tunegrid= grid,
verebose=FALSE,
trControl=ctrl)
set.seed(1)
gbmTune <- train(Sale~., data = CTrain,
method = "gbm",
metric = "ROC",
tuneGrid = grid,
verbose = FALSE,
trControl = ctrl)
I am facing two problems. First, when I add summaryFunction = twoClassSummary and then try to tune, I get this:
Error in trainControl(Sale ~ ., data = CTrain, method = "gbm", metric = "ROC", : unused arguments (data = CTrain, metric = "ROC", trControl = ctrl)
Second, if I leave out the summaryFunction and try to run the model, I get this error:
Error in evalSummaryFunction(y, wts = weights, ctrl = trControl, lev = classLevels, : train()'s use of ROC codes requires class probabilities. See the classProbs option of trainControl() In addition: Warning message: In train.default(x, y, weights = w, ...) : cannnot compute class probabilities for regression
I tried changing the output variable from a numeric 1 or 0 to a text value in Excel, but that didn't change anything.
Any help would be greatly appreciated, either with stopping the model from being treated as a regression or with the first error message.
Best,
Will will@nubimetrics.com
Your outcome is:
Sale = c(1L, 0L, 1L, 1L, 0L)
While gbm expects this, it is a rather unnatural way to encode the data. Almost every other function uses factors.
So if you give train numeric 0/1 data, it thinks you want to do regression. If you convert the outcome to a factor and use "0" and "1" as the levels (and you want class probabilities), you should see a warning that says, "At least one of the class levels is not a valid R variable name; this may cause errors if class probabilities are generated because the variable names will be converted to ... ". This is not an idle warning.
Use factor levels that are valid R variable names and you should be fine.
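As a rough illustration of that advice, here is a minimal base R sketch. The vector sale_num and the labels "NoSale" and "Sold" are made up for the example; any syntactically valid names should do.

```r
# Hypothetical 0/1 outcome, like the Sale column in the question
sale_num <- c(1L, 0L, 1L, 1L, 0L)

# A plain factor() call gives the levels "0" and "1", which are not
# valid R variable names: make.names() would rewrite them
bad <- factor(sale_num)
levels(bad)                             # "0" "1"
make.names(levels(bad)) == levels(bad)  # FALSE FALSE

# Relabel with syntactically valid names (labels are illustrative)
sale_fac <- factor(sale_num, levels = c(0, 1), labels = c("NoSale", "Sold"))
all(make.names(levels(sale_fac)) == levels(sale_fac))  # TRUE
```

In the question's data this would be applied to C$Sale before calling createDataPartition, so that both CTrain and CTest carry the factor.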
Max
I was able to reproduce your error using the data(GermanCredit) dataset.
Your error comes from using trainControl as if it were gbm, train, or something like that.
If you look at the documentation with ?trainControl, you can see that it expects input that is very different from what you are giving it.
This works:
require(caret)
require(gbm)
data(GermanCredit)
# Your dependent variable was Sale and it was binary
# in place of Sale I will use the binary variable Telephone
C <- GermanCredit
C$Sale <- GermanCredit$Telephone
inTrainingset<-createDataPartition(C$Sale, p=.75, list=FALSE)
CTrain<-C[inTrainingset,]
CTest<-C[-inTrainingset,]
set.seed(123)
seeds <- vector(mode = "list", length = 51)
for(i in 1:50) seeds[[i]] <- sample.int(1000, 22)
gbmfit<-gbm(Sale~Age+ResidenceDuration, data=C,
distribution="bernoulli",n.trees=5,interaction.depth=7,shrinkage= .01)
plot(gbmfit)
gbmTune<-train(Sale~Age+ResidenceDuration,data=CTrain, method="gbm")
ctrl<-trainControl(method="repeatedcv",repeats=5)
gbmTune<-train(Sale~Age+ResidenceDuration,data=CTrain,
method="gbm",
verbose=FALSE,
trControl=ctrl)
ctrl<-trainControl(method="repeatedcv", repeats=5, classProbs=TRUE, summaryFunction = twoClassSummary)
# gbmTune<-trainControl(Sale~Age+ResidenceDuration, data=CTrain,
# method="gbm",
# metric="ROC",
# verbose=FALSE ,
# trControl=ctrl)
gbmTune <- trainControl(method = "adaptive_cv",
repeats = 5,
verboseIter = TRUE,
seeds = seeds)
grid<-expand.grid(.interaction.depth=seq(1,7, by=2), .n.trees=seq(100,300, by=50), .shrinkage=c(.01,.1))
gbmTune<-train(Sale~Age+ResidenceDuration, data=CTrain,
method="gbm",
metric="ROC",
tuneGrid= grid,
verbose=FALSE,
trControl=ctrl)
set.seed(1)
gbmTune <- train(Sale~Age+ResidenceDuration, data = CTrain,
method = "gbm",
metric = "ROC",
tuneGrid = grid,
verbose = FALSE,
trControl = ctrl)
Depending on what you are trying to accomplish, you may want to specify it a little differently, but it all comes down to the fact that you used trainControl as if it were train.
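To make the division of labour concrete, here is a small self-contained sketch on simulated data. It assumes the caret, gbm and pROC packages are installed; the data frame d and its columns are invented stand-ins for the question's data, and the resampling settings are kept small only so that it runs quickly.

```r
library(caret)  # assumes caret, gbm and pROC are installed

set.seed(1)
n <- 200
# Simulated stand-in for the question's data; all names are hypothetical
d <- data.frame(price = runif(n, 1, 100),
                category = factor(sample(c("a", "b", "c"), n, replace = TRUE)))
# Outcome stored as a factor with valid R names as levels
d$Sale <- factor(ifelse(d$price + rnorm(n, sd = 20) > 50, "Sold", "NoSale"),
                 levels = c("NoSale", "Sold"))

# trainControl() only describes how to resample and summarise;
# it takes no formula, data, metric or trControl argument
ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 2,
                     classProbs = TRUE, summaryFunction = twoClassSummary)

# train() takes the formula, the data, the metric and the control object
fit <- train(Sale ~ ., data = d, method = "gbm", metric = "ROC",
             verbose = FALSE, trControl = ctrl)

# Resampled ROC for each candidate in caret's default tuning grid
fit$results[, c("n.trees", "interaction.depth", "ROC")]
```

A custom grid can be passed to train through tuneGrid, as in the code above; which columns that grid must contain depends on the caret version, so it is left at the default here.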