Coef is undefined due to similarity and pvalues

I have a CRM dataset used for an experiment where a dummy W corresponds to a treatment / control group (see code below). When I checked the independence of W from other functions, I realized two things:

  • When using model.matrix, some coefficients (1 in this dummy dataset) were not determined due to similarities. This did not happen when feeding DT directly to lm ()
  • The model obtained in both cases gives different results, i.e. The p-values ​​of the individual characteristics change.

I (think I) understand the concept of multicolonity, but in this particular case I don't quite understand a) why it occurs b) why it has different effects on model.matrix and lm

What am I missing?

Thank you so much!

set.seed(1)

n = 302

DT = data.table(
    zipcode = factor(sample(seq(1,52), n, replace=TRUE)),
    gender = factor(sample(c("M","F"), n, replace=TRUE)),
    age = sample(seq(1,95), n, replace=TRUE),
    days_since_last_purchase = sample(seq(1,259), n, replace=TRUE),
    W = sample(c(0,1), n, replace=TRUE)
)

summary(DT)

m = model.matrix(W ~ . +0, DT)
f1 = lm(DT$W ~ m) 
f2= lm(W~ ., DT)

p_value_ratio <- function(lm)
{
    summary_randomization = summary(lm)
    p_values_randomization = summary_randomization$coefficients[, 4] 
    L = length(p_values_randomization)
    return(sum(p_values_randomization <= 0.05)/(L-1))
}

all.equal(p_value_ratio(f1), p_value_ratio(f2))

alias(f1)
alias(f2)

      

+3


source to share


1 answer


Your problem is + 0

in model.matrix

. The second approach involves interception in the model matrix. If you exclude it, there are fewer factors to exclude (which are usually represented by interception):

colnames(model.matrix(W ~ ., DT))
#excludes zipcode1 and genderf since these define the intercept

colnames(model.matrix(W ~ . + 0, DT))
#excludes only genderf

      

Note what f1

includes the intercept being added lm

(I assume an internal call model.matrix

, but not marked):

m = model.matrix(W ~ . + 0, DT);
f1 = lm(DT$W ~ m ); 
model.matrix(f1)

      

You may need the following:



m = model.matrix(W ~ ., DT);
f1 = lm(DT$W ~ m[,-1]); 

      

(Usually you only build the model matrix by hand if you want to use it directly lm.fit

.)

f2= lm(W~ ., DT);
all.equal(unname(coef(f1)), unname(coef(f2)))
#[1] TRUE

      

Ultimately it comes down to your understanding of treatment contrasts. Generally, you should not exclude interception from the model matrix.

+1


source







All Articles