Statsmodels - broadcast shapes different?

I am trying to perform logistic regression on a dataset that contains a boolean target variable ("default") and two features ("fico_interp", "home_ownership_int") using the logit function in statsmodels. All three columns come from the same data frame, traindf:

from sklearn import datasets
import statsmodels.formula.api as smf

lmf = smf.logit('default ~ fico_interp + home_ownership_int',traindf).fit()

      

Which generates the error message:

ValueError: operands could not be broadcast together with shapes (40406,2) (40406,)

How can this happen?



1 answer


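The problem is that traindf['default'] contains values that are not numeric. The formula interface then treats the column as categorical and expands it into one indicator column per level, which is why the response ends up with shape (40406,2) instead of (40406,).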
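One way to confirm this diagnosis before fitting is to inspect the dtype of the target column. The snippet below is only a sketch: traindf_demo and its values are made-up stand-ins for the real traindf, which is not shown in the question.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for traindf: 'default' holds the strings '0' and '1'.
rng = np.random.default_rng(0)
traindf_demo = pd.DataFrame({
    "fico_interp": rng.normal(700, 50, size=6),
    "default": ["0", "1", "0", "0", "1", "1"],
})

# A numeric column has an int, float or bool dtype; strings show up as
# object (or a dedicated string dtype), for which this check is False.
is_numeric = pd.api.types.is_numeric_dtype(traindf_demo["default"])
distinct_values = sorted(traindf_demo["default"].unique())

print(traindf_demo["default"].dtype, is_numeric, distinct_values)
```

If the check reports a non-numeric column, the distinct values show which labels need to be mapped to 0 and 1.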

The following code reproduces the error:

import pandas as pd, numpy as np, statsmodels.formula.api as smf
df = pd.DataFrame(np.random.randn(1000,2), columns=list('AB'))
df['C'] = ((df['B'] > 0)*1).apply(str)
lmf = smf.logit('C ~ A', df).fit()

      



The following code is one possible way to fix this example:

df.replace(to_replace={'C' : {'1': 1, '0': 0}}, inplace = True)
lmf = smf.logit('C ~ A', df).fit()
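When the strings are all valid integer literals, as in the example above, casting the column with astype(int) is another option. This is a sketch under that assumption; df_demo is a hypothetical copy of the reproduction data, and the cast raises a ValueError if any value is not an integer string.

```python
import numpy as np
import pandas as pd

# Rebuild data shaped like the reproduction example (df_demo is hypothetical).
rng = np.random.default_rng(0)
df_demo = pd.DataFrame(rng.standard_normal((1000, 2)), columns=list("AB"))
df_demo["C"] = (df_demo["B"] > 0).astype(int).astype(str)  # strings '0'/'1'

# Cast the string labels back to integers so the target is numeric.
df_demo["C"] = df_demo["C"].astype(int)

print(df_demo["C"].dtype)
```

After the cast, the model can be fitted as before with smf.logit('C ~ A', df_demo).fit().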

      

This post reports a similar issue.
