Coef is undefined due to similarity and pvalues

Question

Coef is undefined due to similarity and pvalues

I have a CRM dataset used for an experiment where a dummy W corresponds to a treatment / control group (see code below). When I checked the independence of W from other functions, I realized two things:

When using model.matrix, some coefficients (1 in this dummy dataset) were not determined due to similarities. This did not happen when feeding DT directly to lm ()
The model obtained in both cases gives different results, i.e. The p-values of the individual characteristics change.

I (think I) understand the concept of multicolonity, but in this particular case I don't quite understand a) why it occurs b) why it has different effects on model.matrix and lm

What am I missing?

Thank you so much!

set.seed(1)

n = 302

DT = data.table(
    zipcode = factor(sample(seq(1,52), n, replace=TRUE)),
    gender = factor(sample(c("M","F"), n, replace=TRUE)),
    age = sample(seq(1,95), n, replace=TRUE),
    days_since_last_purchase = sample(seq(1,259), n, replace=TRUE),
    W = sample(c(0,1), n, replace=TRUE)
)

summary(DT)

m = model.matrix(W ~ . +0, DT)
f1 = lm(DT$W ~ m) 
f2= lm(W~ ., DT)

p_value_ratio <- function(lm)
{
    summary_randomization = summary(lm)
    p_values_randomization = summary_randomization$coefficients[, 4] 
    L = length(p_values_randomization)
    return(sum(p_values_randomization <= 0.05)/(L-1))
}

all.equal(p_value_ratio(f1), p_value_ratio(f2))

alias(f1)
alias(f2)

+3

r lm

Stefano 09 Aug 17 at 8:09

source to share

1 answer

Roland · Accepted Answer · 2017-08-09T08:52:03+0000

Your problem is + 0

in model.matrix

. The second approach involves interception in the model matrix. If you exclude it, there are fewer factors to exclude (which are usually represented by interception):

colnames(model.matrix(W ~ ., DT))
#excludes zipcode1 and genderf since these define the intercept

colnames(model.matrix(W ~ . + 0, DT))
#excludes only genderf

Note what f1

includes the intercept being added lm

(I assume an internal call model.matrix

, but not marked):

m = model.matrix(W ~ . + 0, DT);
f1 = lm(DT$W ~ m ); 
model.matrix(f1)

You may need the following:

m = model.matrix(W ~ ., DT);
f1 = lm(DT$W ~ m[,-1]);

(Usually you only build the model matrix by hand if you want to use it directly lm.fit

.)

f2= lm(W~ ., DT);
all.equal(unname(coef(f1)), unname(coef(f2)))
#[1] TRUE

Ultimately it comes down to your understanding of treatment contrasts. Generally, you should not exclude interception from the model matrix.

Coef is undefined due to similarity and pvalues

More articles: