Why do glmnet coefficient estimates vary greatly between models with the same input parameters?

I am trying to fit a lasso model using cv.glmnet. I implemented four different models (three using cv.glmnet and one using caret::train) that differ only in how the predictors are standardized. All four models give very different coefficient estimates, and I cannot understand why.

Here is the fully reproducible code:

library("glmnet")
data(iris)
iris <- iris
dat <- iris[iris$Species %in% c("setosa","versicolor"),]
X <- as.matrix(dat[,1:4])
Y <- as.factor(as.character(dat$Species))

set.seed(123)
model1 <- cv.glmnet(x = X,
                    y = Y,
                    family = "binomial",
                    standardize = FALSE,
                    alpha = 1,
                    lambda = rev(seq(0,1,length=100)),
                    nfolds = 3)

set.seed(123)
model2 <- cv.glmnet(x = scale(X, center = T, scale = T),
                    y = Y,
                    family = "binomial",
                    standardize = FALSE,
                    alpha = 1,
                    lambda = rev(seq(0,1,length=100)),
                    nfolds = 3)
set.seed(123)
model3 <- cv.glmnet(x = X,
                    y = Y,
                    family = "binomial",
                    standardize = TRUE,
                    alpha = 1,
                    lambda = rev(seq(0,1,length=100)),
                    nfolds = 3)

##Using caret
library("caret")

lambda.grid <- rev(seq(0,1,length=100)) #set of lambda values for cross-validation
alpha.grid <- 1 #alpha
trainControl <- trainControl(method ="cv",
                             number=3) #3-fold cross-validation
tuneGrid <- expand.grid(.alpha=alpha.grid, .lambda=lambda.grid) #these are tuning parameters to be passed into the train function below

set.seed(123)
model4 <- train(x = X,
                y = Y,
                method="glmnet",
                family="binomial",
                standardize = FALSE,
                trControl = trainControl,                          
                tuneGrid = tuneGrid)
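
As a quick side check (not part of my original code), caret stores the tuning values it picked, so the selected alpha/lambda can be inspected directly:

model4$bestTune               # alpha and lambda chosen by caret's 3-fold CV
model4$finalModel$lambdaOpt   # the selected lambda, as stored on the underlying glmnet fit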

c1 <- coef(model1, s=model1$lambda.min)
c2 <- coef(model2, s=model2$lambda.min)
c3 <- coef(model3, s=model3$lambda.min)
c4 <- coef(model4$finalModel, s=model4$finalModel$lambdaOpt)
c1 <- as.matrix(c1)
c2 <- as.matrix(c2)
c3 <- as.matrix(c3)
c4 <- as.matrix(c4)
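
For a side-by-side look at the four estimates (a small convenience step, with column names added for readability):

comparison <- cbind(c1, c2, c3, c4)              # one column of coefficients per model
colnames(comparison) <- c("model1", "model2", "model3", "model4")
comparison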

model2 pre-standardizes the independent variables (by passing scale(X)), and model3 does the same thing by setting standardize = TRUE. Therefore at least these two models should return the same results, but they do not.

lambda.min derived from the four models:

model1 = 0
model2 = 0
model3 = 0
model4 = 0.6565657
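
These values come straight from the fitted objects, e.g.:

model1$lambda.min             # lambda with minimum CV error from cv.glmnet
model2$lambda.min
model3$lambda.min
model4$finalModel$lambdaOpt   # lambda selected by caret over its grid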


The coefficient estimates also differ substantially between the models. Why is this happening?

1 answer


Actually, scale(x) with standardize = FALSE and x with standardize = TRUE are slightly different. scale() divides by the sample standard deviation (denominator N-1), whereas glmnet's internal standardization divides by N, so a factor of (N-1)/N comes into play.

See here.
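
To make the (N-1)/N point concrete, here is a minimal sketch (a toy vector, not from the question) comparing the sample standard deviation used by scale() with the population-style standard deviation glmnet uses when it standardizes internally:

x <- runif(20)                  # toy data
n <- length(x)
sd(x)                           # sample sd: divides by n - 1 (what scale() uses)
sqrt(mean((x - mean(x))^2))     # population sd: divides by n (glmnet's internal standardization)
sd(x) * sqrt((n - 1)/n)         # identical to the population sd, so the two scalings differ by this factor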

If we use a Gaussian family,

library(glmnet)
X <- matrix(runif(100, 0, 1), ncol=2)
y <- 1 -2*X[,1] + X[,2]

# let glmnet standardize internally
enet <- glmnet(X, y, lambda = 0.1, standardize = TRUE, family = "gaussian")
coef <- coefficients(enet)
coef[2]*sd(X[,1])/sd(y) # coefficient rescaled to the standardized scale
#[1] -0.6895065

# standardize by hand: rescale scale(X) by 100/99 and divide y and lambda by (99/100)*sd(y)
enet1 <- glmnet(scale(X)/99*100, y/(99/100*sd(y)), lambda = 0.1/(99/100*sd(y)),
                standardize = FALSE, family = "gaussian")
coefficients(enet1)[2]
#[1] -0.6894995

If we use a binomial family,

data(iris)
iris <- iris
dat <- iris[iris$Species %in% c("setosa","versicolor"),]
X <- as.matrix(dat[,1:4])
Y <- as.factor(as.character(dat$Species))

set.seed(123)
# let glmnet standardize the predictors internally
model1 <- cv.glmnet(x = X,
                    y = Y,
                    family = "binomial",
                    standardize = TRUE,
                    alpha = 1,
                    lambda = rev(seq(0,1,length=100)),
                    nfolds = 3)
coefficients(model1, s=0.03)[3]*sd(X[,2]) # Sepal.Width coefficient at lambda = 0.03, put on the standardized scale
#[1] -0.3374946

set.seed(123)
# standardize the predictors by hand, rescaling scale(X) by 100/99
model3 <- cv.glmnet(x = scale(X)/99*100,
                    y = Y,
                    family = "binomial",
                    standardize = FALSE,
                    alpha = 1,
                    lambda = rev(seq(0,1,length=100)),
                    nfolds = 3)
coefficients(model3, s=0.03)[3] # the same coefficient from the manually standardized fit
#[1] -0.3355027

These results are almost the same. Hope it's not too late for this answer.
