How to correctly set start / end parameters of statsmodels.tsa.ar_model.AR.predict function
I have project cost data in the form of an irregularly spaced time series, and I would like to try applying the statsmodels AR model to it.
Here is a sample of the data:
cost
date
2015-07-16 35.98
2015-08-11 25.00
2015-08-11 43.94
2015-08-13 26.25
2015-08-18 15.38
2015-08-24 77.72
2015-09-09 40.00
2015-09-09 20.00
2015-09-09 65.00
2015-09-23 70.50
2015-09-29 59.00
2015-11-03 19.25
2015-11-04 19.97
2015-11-10 26.25
2015-11-12 19.97
2015-11-12 23.97
2015-11-12 21.88
2015-11-23 23.50
2015-11-23 33.75
2015-11-23 22.70
2015-11-23 33.75
2015-11-24 27.95
2015-11-24 27.95
2015-11-24 27.95
...
2017-03-31 21.93
2017-04-06 22.45
2017-04-06 26.85
2017-04-12 60.40
2017-04-12 37.00
2017-04-12 20.00
2017-04-12 66.00
2017-04-12 60.00
2017-04-13 41.95
2017-04-13 25.97
2017-04-13 29.48
2017-04-19 41.00
2017-04-19 58.00
2017-04-19 78.00
2017-04-19 12.00
2017-04-24 51.05
2017-04-26 21.88
2017-04-26 50.05
2017-04-28 21.00
2017-04-28 30.00
I'm having a hard time figuring out how to use start and end in the predict function.
According to the docs:
start: int, str, or datetime. Zero-indexed observation number at which to start forecasting, i.e., the first forecast is start. Can also be a date string to parse or a datetime type.
end: int, str, or datetime. Zero-indexed observation number at which to end forecasting, i.e., the first forecast is start. Can also be a date string to parse or a datetime type.
I create a dataframe with an empty daily time series, join my irregularly spaced time series to it, and then try to apply the model.
from datetime import datetime
import pandas as pd
import statsmodels.api as sm

data = pd.read_csv('data.csv', index_col=1, parse_dates=True)
# Empty daily frame covering the whole period, then join the irregular data onto it
df = pd.DataFrame(index=pd.date_range(start=datetime(2015, 1, 1), end=datetime(2017, 12, 31), freq='d'))
df = df.join(data)
df.cost.interpolate(inplace=True)
ar_model = sm.tsa.AR(df, missing='drop', freq='D')
ar_res = ar_model.fit(maxlag=9, method='mle', disp=-1)
pred = ar_res.predict(start='2016', end='2016')
The predict function raises an error:
pandas.tslib.OutOfBoundsDatetime: Out of bounds nanosecond timestamp: 605-12-31 00:00:00
If I try to use a more specific date, I get the same type of error:
pred = ar_res.predict(start='2016-01-01', end='2016-06-01')
If I try to use integers, I get another error:
pred = ar_res.predict(start=0, end=len(data))
Wrong number of items passed 202, placement implies 197
If I use an actual datetime, I get the error: no rule for interpreting end.
I'm hitting a wall here, so I think I am missing something.
Ultimately, I would like to use the model to make out-of-sample predictions (for example, forecasting for the next quarter).
So, I was creating a daily index to satisfy the requirement for an equally spaced time series, but the index still was not unique (comment by @user333700).
I added a groupby to sum repeated dates together, and then I could run the predict function using datetime objects (answer by @andy-hayden).
df = df.groupby(pd.TimeGrouper(freq='D')).sum()
...
ar_res.predict(start=min(df.index), end=datetime(2018,12,31))
With the predict function now returning a result, I can analyze the output and tweak the parameters to get something useful.
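For readers on newer pandas releases, where pd.TimeGrouper is no longer available (pd.Grouper(freq='D') is its replacement), here is a minimal pandas-only sketch of the same idea: collapse repeated dates so the index is unique, then expand to a regular daily grid. It swaps in groupby(level=0) and asfreq rather than reproducing the exact call above. The five sample rows are copied from the question, and the order of steps (sum first, interpolate afterwards) is an assumption, not necessarily what was run originally.

```python
import pandas as pd

# Hypothetical sample: five rows from the question, including a repeated date.
raw = pd.DataFrame(
    {"cost": [35.98, 25.00, 43.94, 26.25, 15.38]},
    index=pd.to_datetime(
        ["2015-07-16", "2015-08-11", "2015-08-11", "2015-08-13", "2015-08-18"]
    ),
)
raw.index.name = "date"

# Repeated dates make the index non-unique.
print(raw.index.is_unique)  # False

# Sum the costs that share a date so each day appears exactly once.
daily = raw.groupby(level=0).sum()

# Expand to a regular daily grid; days without data become NaN,
# which are then filled by linear interpolation (as in the question).
daily = daily.asfreq("D")
daily["cost"] = daily["cost"].interpolate()

print(daily.index.is_unique)  # True
print(daily.head())
```

A frame prepared this way has a unique, evenly spaced DatetimeIndex, which is the property the comment above points to as the missing piece.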
This works if you pass a datetime (and not a date):
from datetime import datetime
...
pred = ar_res.predict(start=datetime(2015, 1, 1), end=datetime(2017,12,31))
In [21]: pred.head(2) # my dummy numbers from data
Out[21]:
2015-01-01 35
2015-01-02 23
Freq: D, dtype: float64
In [22]: pred.tail(2)
Out[22]:
2017-12-30 44
2017-12-31 44
Freq: D, dtype: float64
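If the bounds arrive as plain date objects (an assumption about how they might be produced, for example from date.today()), one possible way to promote them to datetime before calling predict is sketched below. The helper name as_datetime is hypothetical and not part of statsmodels or pandas.

```python
from datetime import date, datetime, time


def as_datetime(value):
    """Promote a plain date to a midnight datetime; pass datetimes through."""
    # datetime is a subclass of date, so it has to be checked first.
    if isinstance(value, datetime):
        return value
    if isinstance(value, date):
        return datetime.combine(value, time.min)
    raise TypeError(f"expected date or datetime, got {type(value).__name__}")


start = as_datetime(date(2015, 1, 1))       # date -> datetime at 00:00
end = as_datetime(datetime(2017, 12, 31))   # already a datetime, unchanged
print(start, end)
```

The resulting start and end values could then be passed as ar_res.predict(start=start, end=end), in the same way as the datetime literals above.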