How do I get the columns that the statsmodels / patsy formula depends on?

Question

Suppose I have a pandas dataframe:

import pandas as pd

df = pd.DataFrame({'x1': [0, 1, 2, 3, 4],
                   'x2': [10, 9, 8, 7, 6],
                   'x3': [.1, .1, .2, 4, 8],
                   'y': [17, 18, 19, 20, 21]})

Now I am fitting a model with statsmodels using a formula (which uses patsy under the hood):

import statsmodels.formula.api as smf
fit = smf.ols(formula='y ~ x1:x2', data=df).fit()

I need a list of the columns of df that fit depends on, so I can use fit.predict() on another dataset. If I try list(fit.params.index), for example, I get:

['Intercept', 'x1:x2']

I tried to recreate the patsy design matrix and use design_info, but I still only get x1:x2. What I want is:

['x1', 'x2']

Or even:

['Intercept', 'x1', 'x2']

How can I get this from the fit object alone?

python pandas statsmodels patsy

bwk Apr 12 17 at 19:26

2 answers

predict takes the same kind of structure (a DataFrame or a dictionary), and the call to patsy converts it in a compatible way. To reproduce this yourself, you can check the code in statsmodels.base.model.Results.predict, whose core is

exog = dmatrix(self.model.data.design_info.builder,
               exog, return_type="dataframe")

The information about the formula itself is stored in the description of the terms in design_info. The variable names themselves are used in summary() and as the index of the returned pandas Series, for example in results.params.
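As a rough illustration of reading the term descriptions from design_info, the sketch below walks the terms stored on the fitted model and collects the factor names. It assumes a statsmodels version that builds formulas with patsy and exposes fit.model.data.design_info, as in the excerpt above; the names needed and new_df are made up for the example.

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({'x1': [0, 1, 2, 3, 4],
                   'x2': [10, 9, 8, 7, 6],
                   'x3': [.1, .1, .2, 4, 8],
                   'y': [17, 18, 19, 20, 21]})
fit = smf.ols(formula='y ~ x1:x2', data=df).fit()

# Assumption: the patsy DesignInfo for the right-hand side lives here.
design_info = fit.model.data.design_info

# Each term is a product of factors; the intercept term has no factors.
needed = []
for term in design_info.terms:
    for factor in term.factors:
        name = factor.name()
        if name not in needed:  # keep first occurrence, drop repeats
            needed.append(name)

print(needed)

# Hypothetical new data that carries only the columns found above.
new_df = pd.DataFrame({'x1': [5, 6], 'x2': [5, 4]})
predictions = fit.predict(new_df[needed])
print(predictions)
```

Note that factor.name() returns the factor as written in the formula, so a transformed factor such as np.log(x1) would be reported under that name rather than as x1.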

Josef Apr 12 17 at 19:58

Jan Trienes · Accepted Answer · 2017-04-12T20:10:09+0000

Just check if the column names appear in the string representation of the formula:

ols = smf.ols(formula='y ~ x1:x2', data=df)
fit = ols.fit()

print([c for c in df.columns if c in ols.formula])
# ['x1', 'x2', 'y']

Another approach is to rebuild the patsy model description from the formula. It is more verbose but more robust, and it does not depend on the original dataframe:

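import patsy

# Parse the formula on its own; no data is needed for this step.
md = patsy.ModelDesc.from_formula(ols.formula)
termlist = md.rhs_termlist + md.lhs_termlist

# Collect the name of every factor in every term.
factors = []
for term in termlist:
    for factor in term.factors:
        factors.append(factor.name())

print(factors)
# ['x1', 'x2', 'y']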
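Both approaches have a blind spot: the substring test would also match a column named x against a formula that only mentions x1, and the factor names come back as written, so np.log(x1) is not reported as x1. One possible way around this, offered only as a sketch, is to parse each factor's Python expression with the standard ast module and keep the identifiers that are also candidate column names. The helper formula_variables and the candidate list are hypothetical, and the sketch assumes every factor is a patsy EvalFactor with a code attribute; columns quoted with Q("...") appear as strings rather than identifiers and would be missed.

```python
import ast

import patsy


def formula_variables(formula, candidates):
    """Return the names in `candidates` that `formula` refers to."""
    desc = patsy.ModelDesc.from_formula(formula)
    referenced = set()
    for term in desc.lhs_termlist + desc.rhs_termlist:
        for factor in term.factors:
            # factor.code is the Python expression behind the factor,
            # for example 'np.log(x1)'.
            tree = ast.parse(factor.code, mode='eval')
            for node in ast.walk(tree):
                if isinstance(node, ast.Name):
                    referenced.add(node.id)
    # Keep the order of the candidate list (e.g. df.columns).
    return [name for name in candidates if name in referenced]


# Hypothetical column names, including 'x' to show it is not matched.
columns = ['x', 'x1', 'x2', 'x3', 'y']
result = formula_variables('y ~ np.log(x1) + x1:x2', columns)
print(result)
```

Passing list(df.columns) as the candidates filters out identifiers such as np that are not columns.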
