Coef is undefined due to similarity and pvalues
I have a CRM dataset used for an experiment where a dummy W corresponds to a treatment / control group (see code below). When I checked the independence of W from other functions, I realized two things:
- When using model.matrix, some coefficients (1 in this dummy dataset) were not determined due to similarities. This did not happen when feeding DT directly to lm ()
- The model obtained in both cases gives different results, i.e. The p-values of the individual characteristics change.
I (think I) understand the concept of multicolonity, but in this particular case I don't quite understand a) why it occurs b) why it has different effects on model.matrix and lm
What am I missing?
Thank you so much!
set.seed(1)
n = 302
DT = data.table(
zipcode = factor(sample(seq(1,52), n, replace=TRUE)),
gender = factor(sample(c("M","F"), n, replace=TRUE)),
age = sample(seq(1,95), n, replace=TRUE),
days_since_last_purchase = sample(seq(1,259), n, replace=TRUE),
W = sample(c(0,1), n, replace=TRUE)
)
summary(DT)
m = model.matrix(W ~ . +0, DT)
f1 = lm(DT$W ~ m)
f2= lm(W~ ., DT)
p_value_ratio <- function(lm)
{
summary_randomization = summary(lm)
p_values_randomization = summary_randomization$coefficients[, 4]
L = length(p_values_randomization)
return(sum(p_values_randomization <= 0.05)/(L-1))
}
all.equal(p_value_ratio(f1), p_value_ratio(f2))
alias(f1)
alias(f2)
source to share
Your problem is + 0
in model.matrix
. The second approach involves interception in the model matrix. If you exclude it, there are fewer factors to exclude (which are usually represented by interception):
colnames(model.matrix(W ~ ., DT))
#excludes zipcode1 and genderf since these define the intercept
colnames(model.matrix(W ~ . + 0, DT))
#excludes only genderf
Note what f1
includes the intercept being added lm
(I assume an internal call model.matrix
, but not marked):
m = model.matrix(W ~ . + 0, DT); f1 = lm(DT$W ~ m ); model.matrix(f1)
You may need the following:
m = model.matrix(W ~ ., DT);
f1 = lm(DT$W ~ m[,-1]);
(Usually you only build the model matrix by hand if you want to use it directly lm.fit
.)
f2= lm(W~ ., DT);
all.equal(unname(coef(f1)), unname(coef(f2)))
#[1] TRUE
Ultimately it comes down to your understanding of treatment contrasts. Generally, you should not exclude interception from the model matrix.
source to share