How do I get the columns that a statsmodels / patsy formula depends on?

Suppose I have a pandas dataframe:

import pandas as pd

df = pd.DataFrame({'x1': [0, 1, 2, 3, 4],
                   'x2': [10, 9, 8, 7, 6],
                   'x3': [.1, .1, .2, 4, 8],
                   'y': [17, 18, 19, 20, 21]})

      

Now I fit a model with statsmodels using a formula (which uses patsy under the hood):

import statsmodels.formula.api as smf
fit = smf.ols(formula='y ~ x1:x2', data=df).fit()

      

I need a list of the columns of df that fit depends on, so that I can use fit.predict() on another dataset. If I try list(fit.params.index), for example, I get:

['Intercept', 'x1:x2']

      

I tried to recreate the patsy design matrix and use design_info, but I still only get x1:x2. What I want is:

['x1', 'x2']

      

Or even:

['Intercept', 'x1', 'x2']

      

How can I get this from the fit object alone?



2 answers


Just check if the column names appear in the string representation of the formula:

ols = smf.ols(formula='y ~ x1:x2', data=df)
fit = ols.fit()

print([c for c in df.columns if c in ols.formula])
['x1', 'x2', 'y']

      



Another approach is to rebuild the patsy model description from the formula. It is more verbose but more robust, and it does not depend on the original dataframe:

import patsy

# Parse the formula string into a patsy model description
md = patsy.ModelDesc.from_formula(ols.formula)
termlist = md.rhs_termlist + md.lhs_termlist

factors = []
for term in termlist:
    for factor in term.factors:
        factors.append(factor.name())

print(factors)
['x1', 'x2', 'y']
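
The loop above can be wrapped in a small helper that removes duplicates and optionally leaves out the response, which is usually what is needed when selecting columns for predict. This is only a sketch: the name formula_columns and its include_response parameter are made up for illustration, and it assumes the formula uses plain column names. For a transformed factor such as np.log(x1), factor.name() returns the code string 'np.log(x1)' rather than the column name.

```python
import patsy


def formula_columns(formula, include_response=False):
    """Return the unique factor names used by a patsy formula string.

    Hypothetical helper, not part of patsy or statsmodels.
    """
    desc = patsy.ModelDesc.from_formula(formula)
    terms = list(desc.rhs_termlist)
    if include_response:
        terms += list(desc.lhs_termlist)

    names = []
    for term in terms:
        # The intercept term has no factors, so it contributes nothing here.
        for factor in term.factors:
            name = factor.name()
            if name not in names:
                names.append(name)
    return names


predictors = formula_columns('y ~ x1:x2')
print(predictors)
```

Since the fitted results keep a reference to the model, the formula string should also be reachable as fit.model.formula (the same attribute as ols.formula above), so something like fit.predict(new_df[formula_columns(fit.model.formula)]) would work from the fit object alone.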

      



predict takes the same dataframe or dictionary structure as the original data, and a call to patsy converts it in a compatible way. To reproduce this, you can check the code in statsmodels.base.model.Results.predict, whose core is

exog = dmatrix(self.model.data.design_info.builder,
               exog, return_type="dataframe")

      



The information about the formula itself is stored in the term descriptions in design_info. The variable names themselves are used in summary() and as the index of the returned pandas Series, for example in results.params.
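
To make the term descriptions in design_info concrete, here is a minimal sketch that uses patsy directly instead of going through statsmodels. It assumes the design_info stored on the model (model.data.design_info in the snippet above) behaves like the one patsy attaches to a design matrix; the new_df values are invented for illustration.

```python
import pandas as pd
import patsy

df = pd.DataFrame({'x1': [0, 1, 2, 3, 4],
                   'x2': [10, 9, 8, 7, 6]})

# Build the right-hand-side design matrix for the formula's predictors.
design = patsy.dmatrix('x1:x2', df)
info = design.design_info

# Column names of the design matrix: this is what shows up in params.index.
column_names = info.column_names

# The underlying variables are the factors of each term.
factor_names = sorted({factor.name()
                       for term in info.terms
                       for factor in term.factors})

# Reuse the stored design_info to transform new data in a compatible way.
new_df = pd.DataFrame({'x1': [5, 6], 'x2': [5, 4]})
new_design = patsy.build_design_matrices([info], new_df)[0]

print(column_names)
print(factor_names)
print(new_design)
```

Only the columns named in factor_names have to be present in new_df; any other columns of the original dataframe are not needed for the transformation.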
