H2O GLM: interact only certain predictors

I'm interested in creating interaction terms in h2o.glm(), but I don't want to generate all pairwise interactions. For example, in the mtcars dataset, I want to interact "mpg" with all the other predictors, such as "cyl", "hp" and "disp", but I don't want the other predictors to interact with each other (so I don't want disp_hp or disp_cyl).

What is the best way to approach this problem using the interactions = interactions_list parameter of h2o.glm()?

Thanks



1 answer


According to ?h2o.glm, the interactions= parameter takes:

A list of predictor column indices to interact. All pairwise combinations will be computed for the list.

You don't want all pairwise combinations, only specific ones.

Unfortunately, the R H2O API does not provide a formula interface. If it did, it would be possible to specify an arbitrary set of interactions, as in vanilla R glm() (see footnote 1).

Option 1: use beta_constraints

One solution is to include all pairwise combinations in the model, and then suppress the ones you don't want by constraining their coefficients to 0.

According to the GLM docs, beta_constraints= is used to:

Specify a dataset to use beta constraints. The selected frame is used to constrain the coefficient vector to provide upper and lower bounds. The dataset must contain a names column with valid coefficient names.

According to the H2O glossary, the format of beta_constraints is:

A data.frame or H2OParsedData object with the columns ["names", "lower_bounds", "upper_bounds", "beta_given"], where each row corresponds to a predictor in the GLM. "names" contains the predictor names, "lower_bounds" and "upper_bounds" are the lower and upper bounds of beta, and "beta_given" is some supplied starting values for beta.

We now know how to populate our beta_constraints data frame, except for how to format the names of the interaction terms. The documentation for interactions doesn't tell us what to expect, so let's just run an example with interactions in H2O and see how the interaction terms are named.

library('h2o')
remoteH2O <- h2o.init(ip='xxx.xx.xx.xxx', startH2O=FALSE)

data(mtcars)

df1 <- as.h2o(mtcars, destination_frame = 'demo_mtcars')

target <- 'wt'
predictors <- c('mpg','cyl','hp','disp')

glm1 <- h2o.glm(x = predictors,
                y = target,
                training_frame = 'demo_mtcars',
                model_id = 'demo_glm',
                lambda = 0, # disable regularization, but your use case may vary
                standardize = FALSE, # we want to see the raw parameters, but your use case may vary
                interactions = predictors # create all interactions
                )
print(glm1) # output includes:
# Coefficients: glm coefficients
#        names coefficients
# 1  Intercept     4.336269
# 2    mpg_cyl     0.019558
# 3     mpg_hp     0.000156
# ..

      

So it looks like the interaction terms are named like v1_v2.



So let's build the names of all the interaction terms we want to suppress, using setdiff() against the terms we want to keep.

library(tidyr)
intx_terms_keep <- # see footnote 1 for explanation
  expand.grid(c('mpg'),c('cyl','hp','disp')) %>%
    unite(intx, Var1, Var2, sep='_') %>% unlist()

intx_terms_suppress <- setdiff( # suppress all interactions minus those we wish to keep
                             combn(predictors,2,FUN=paste,collapse='_'), 
                             intx_terms_keep
                            )
constraints <- data.frame(names=intx_terms_suppress, 
                          lower_bounds=0, 
                          upper_bounds=0, 
                          beta_given=0)

glm2 <- h2o.glm(x = predictors,
                y = target,
                training_frame = 'demo_mtcars',
                model_id = 'demo_glm',
                lambda = 0,
                standardize = FALSE, 
                interactions = predictors, # create all interactions
                beta_constraints = constraints
)
print(glm2) # output includes:
# Coefficients: glm coefficients
#        names coefficients
# 1  Intercept     3.405154
# 2    mpg_cyl    -0.012740
# 3     mpg_hp    -0.000250
# 4   mpg_disp     0.000066
# 5     cyl_hp     0.000000
# 6   cyl_disp     0.000000
# 7    hp_disp     0.000000
# 8        mpg    -0.018981
# 9        cyl     0.168820
# 10      disp     0.004070
# 11        hp     0.000501

      

As you can see, only the desired interaction terms have non-zero coefficients. The rest are effectively ignored. However, since they are still terms in the model, they may count against the degrees of freedom and may affect some metrics (e.g., adjusted R-squared).

Option 2: pre-create the interaction terms

As Edward Cook pointed out, another solution is to precompute the interactions as variables in the training set.

This approach ensures that unwanted interactions do not consume degrees of freedom or affect your adjusted R-squared.
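A minimal sketch of Option 2 in base R, under two assumptions: that the interaction between two numeric predictors is simply their product, and that naming the new columns v1_v2 is just a convention borrowed from the H2O output above. The helper name add_interactions is hypothetical, not part of any package.

```r
# Hypothetical helper: add one product column per (base, other) pair.
# Assumes all columns involved are numeric.
add_interactions <- function(df, base, others, sep = "_") {
  for (v in others) {
    df[[paste(base, v, sep = sep)]] <- df[[base]] * df[[v]]
  }
  df
}

data(mtcars)

predictors <- c("mpg", "cyl", "hp", "disp")
others <- setdiff(predictors, "mpg")

# Only mpg is crossed with the other predictors; no cyl_hp, cyl_disp or hp_disp.
mtcars_intx <- add_interactions(mtcars, base = "mpg", others = others)

# Column names to use as predictors: main effects plus the new product columns.
intx_terms <- paste("mpg", others, sep = "_")
x_cols <- c(predictors, intx_terms)
```

The augmented data frame could then be uploaded with as.h2o() and x_cols passed as x to h2o.glm(), leaving the interactions argument unset so that H2O does not generate any further terms.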

1 Alternative, non-H2O solution for vanilla glm with a formula interface

In vanilla R glm(), which allows a formula interface, I would use expand.grid to create a string of interaction terms and include it in the formula.

Pass expand.grid two vectors: every member of the first vector will be interacted with every member of the second.

To use your example, you want to interact mpg with cyl, hp and disp:

library(tidyr)
intx_term_string <- 
  expand.grid(c('mpg'),c('cyl','hp','disp')) %>%
    unite(intx, Var1, Var2, sep=':') %>% apply(2, paste, collapse='+')

      

This gives you a string of interaction terms, "mpg:cyl+mpg:hp+mpg:disp", that you can append to the string of other predictors (perhaps with paste) and convert with as.formula().
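As a sketch of that last step, using only base R: the variable names here are illustrative, and the interaction string is built with paste() instead of tidyr, which yields the same result for this example.

```r
data(mtcars)

target <- "wt"
predictors <- c("mpg", "cyl", "hp", "disp")

# Interaction terms: mpg crossed with each of the other predictors.
intx_term_string <- paste("mpg", c("cyl", "hp", "disp"), sep = ":", collapse = "+")

# Main effects plus the selected interactions, joined into one formula string.
rhs <- paste(c(predictors, intx_term_string), collapse = "+")
fml <- as.formula(paste(target, "~", rhs))

# glm() uses the gaussian family by default, i.e. ordinary linear regression.
fit <- glm(fml, data = mtcars)
coef_names <- names(coef(fit))
```

Inspecting coef_names should show the intercept, the four main effects and the three mpg interactions, with no terms such as cyl:hp.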
