Pandas / Statsmodels OLS: predicting future values

I am trying to get predictions for future values from a model that I have created. I've tried OLS in both pandas and statsmodels. This is what I have in statsmodels:

import statsmodels.api as sm
endog = pd.DataFrame(dframe['monthly_data_smoothed8'])
smresults = sm.OLS(dframe['monthly_data_smoothed8'], dframe['date_delta']).fit()
sm_pred = smresults.predict(endog)
sm_pred

      

The length of the returned array equals the number of records in my original DataFrame, but the values are not the same. When I do the following using pandas, I get no values back at all.

from pandas.stats.api import ols
res1 = ols(y=dframe['monthly_data_smoothed8'], x=dframe['date_delta'])
res1.predict

      

(Note that there is no .fit function for OLS in pandas.) Can anyone shed some light on how I can get future predictions from my OLS model in either pandas or statsmodels? I realize I must not be using .predict properly, and I've read about several other problems people have had, but they don't seem to apply to my case.

Edit: I believe that "endog" as defined is incorrect; I should be passing the values I want predictions for, so I created a date range of 12 periods past the last recorded value. But I am still missing something, as I get the error:

matrices are not aligned

      

Edit: here is a chunk of the data. The last column of numbers is the date delta, the difference in months from the first date:

month   monthly_data    monthly_data_smoothed5  monthly_data_smoothed8  monthly_data_smoothed12 monthly_data_smoothed3  date_delta
0   2011-01-31  3.711838e+11    3.711838e+11    3.711838e+11    3.711838e+11    3.711838e+11    0.000000
1   2011-02-28  3.776706e+11    3.750759e+11    3.748327e+11    3.746975e+11    3.755084e+11    0.919937
2   2011-03-31  4.547079e+11    4.127964e+11    4.083554e+11    4.059256e+11    4.207653e+11    1.938438
3   2011-04-30  4.688370e+11    4.360748e+11    4.295531e+11    4.257843e+11    4.464035e+11    2.924085

      

1 answer


I think your problem here is that statsmodels doesn't add an intercept by default, so your model isn't achieving much of a fit. To fix that, your code would look something like this:

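import pandas as pd
import statsmodels.api as sm

dframe = pd.read_clipboard() # your sample data
dframe['intercept'] = 1  # constant column so the model can fit an intercept
X = dframe[['intercept', 'date_delta']]
y = dframe['monthly_data_smoothed8']

smresults = sm.OLS(y, X).fit()

dframe['pred'] = smresults.predict()  # in-sample fitted values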

      

Also, for what it's worth, I think the statsmodels formula API is much nicer to work with when you are using DataFrames, and it adds an intercept by default (add "- 1" to the formula to remove it). See below; it should give the same answer.

import statsmodels.formula.api as smf

smresults = smf.ols('monthly_data_smoothed8 ~ date_delta', dframe).fit()

dframe['pred'] = smresults.predict()
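
With the formula API, new data for prediction only needs the columns named on the right-hand side of the formula. A small sketch, again using the four sample rows typed in by hand; the date_delta values 4, 5 and 6 are made up for illustration.

```python
import pandas as pd
import statsmodels.formula.api as smf

# The four sample rows from the question (only the two columns used here).
dframe = pd.DataFrame({
    'date_delta': [0.000000, 0.919937, 1.938438, 2.924085],
    'monthly_data_smoothed8': [3.711838e+11, 3.748327e+11,
                               4.083554e+11, 4.295531e+11],
})

smresults = smf.ols('monthly_data_smoothed8 ~ date_delta', dframe).fit()
print(smresults.params)  # 'Intercept' and 'date_delta'

# Hypothetical future date_delta values; no intercept column is needed,
# the formula machinery adds it when building the design matrix.
new_data = pd.DataFrame({'date_delta': [4.0, 5.0, 6.0]})
future = smresults.predict(new_data)
print(future)

# Appending '- 1' to the formula drops the intercept again.
no_intercept = smf.ols('monthly_data_smoothed8 ~ date_delta - 1', dframe).fit()
print(no_intercept.params)  # only 'date_delta'
```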

      

Edit: To predict future values, simply pass the new data to .predict().

For example, using the first model:

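new_X = pd.DataFrame({'intercept': 1,
                      'date_delta': [0.5, 0.75, 1.0]})
# columns must be in the same order as X: intercept first, then date_delta
smresults.predict(new_X[['intercept', 'date_delta']])
# for the four sample rows: roughly 3.757e+11, 3.811e+11, 3.864e+11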

      

On the intercept: there is nothing special encoded in the number 1; it simply falls out of the math of OLS (an intercept is exactly analogous to a regressor that always equals 1), so you can read its value straight off the summary. Looking at the statsmodels docs, an alternative way to add an intercept is:

X = sm.add_constant(X)

      
