Creating missing time slots and filling 0 values ββin pandas series
I have a pandas series as such (which I did by doing a group on two columns
divvy_df.groupby(['from_station_name', 'starttime']).size()
):
from_station_name starttime
900 W Harrison 2014-07-01 08:00:00 1
2014-07-01 10:00:00 1
2014-07-01 11:00:00 1
2014-07-01 12:00:00 1
2014-07-01 13:00:00 1
2014-07-01 16:00:00 1
2014-07-01 17:00:00 3
2014-07-01 22:00:00 1
2014-07-02 01:00:00 1
2014-07-02 08:00:00 1
2014-07-02 12:00:00 2
...
As you can see, the resulting series shows the number of times the start time is recorded for this station name. I would like to do this somehow so that the missing timeslots are created and filled with a value of 0. Is this possible?
So, I would like something like this:
from_station_name starttime
900 W Harrison 2014-07-01 00:00:00 0
2014-07-01 01:00:00 0
2014-07-01 02:00:00 0
2014-07-01 03:00:00 0
...
2014-07-01 08:00:00 1
2014-07-01 09:00:00 0
2014-07-01 10:00:00 1
2014-07-01 11:00:00 1
2014-07-01 12:00:00 1
2014-07-01 13:00:00 0
...
2014-07-01 22:00:00 1
2014-07-01 23:00:00 0
2014-07-02 00:00:00 0
What's the best way to get something like this?
source to share
Usually, when you have a series, ts
with DatetimeIndex, you can expand it to include new dates using ts.reindex
, or ts.asfreq
or ts.resample
. Since it ts.reindex
has a parameter fill_value
to fill in missing values, I think which is the easiest to use in this case ts.reindex
.
However, this issue ts
has MultiIndex with from_station_name
and
starttime
.
So the first step is to move the level from_station_name
out of the index. There are two options. You can use reset_index
to move it to a column or use
unstack
to move it to a new index level of the column. I think unstacking
with
ts = ts.unstack(level=0)
is the best choice here because with the other option (using reset_index
) you run into problems with duplicate dates in the index and there is no problem of padding and padding with padding and forward-filling from_station_name
.
The series now has a sibling DatetimeIndex, so you can reindex
like this:
index = pd.date_range('2014-07-01', '2014-07-03', freq='H')
ts = ts.reindex(index, fill_value=0)
Now, undo the operation unstack
, so set the column index level back to the row index:
ts = ts.stack()
ts = ts.swaplevel(0,1)
import pandas as pd
ts = pd.Series({('900 W Harrison', '2014-07-01 08:00:00'): 1,
('900 W Harrison', '2014-07-01 10:00:00'): 1,
('900 W Harrison', '2014-07-01 11:00:00'): 1,
('900 W Harrison', '2014-07-01 12:00:00'): 1,
('900 W Harrison', '2014-07-01 13:00:00'): 1,
('900 W Harrison', '2014-07-01 16:00:00'): 1,
('900 W Harrison', '2014-07-01 17:00:00'): 3,
('900 W Harrison', '2014-07-01 22:00:00'): 1,
('900 W Harrison', '2014-07-02 01:00:00'): 1,
('900 W Harrison', '2014-07-02 08:00:00'): 1,
('900 W Harrison', '2014-07-02 12:00:00'): 2})
ts = ts.unstack(level=0)
# ensure ts.index is a DatetimeIndex
ts.index = pd.to_datetime(ts.index)
index = pd.date_range('2014-07-01', '2014-07-03', freq='H')
ts = ts.reindex(index, fill_value=0)
ts = ts.stack()
ts = ts.swaplevel(0,1)
ts = ts.sortlevel()
print(ts)
gives
900 W Harrison 2014-07-01 00:00:00 0
2014-07-01 01:00:00 0
2014-07-01 02:00:00 0
2014-07-01 03:00:00 0
2014-07-01 04:00:00 0
2014-07-01 05:00:00 0
2014-07-01 06:00:00 0
2014-07-01 07:00:00 0
2014-07-01 08:00:00 1
2014-07-01 09:00:00 0
2014-07-01 10:00:00 1
...
2014-07-02 21:00:00 0
2014-07-02 22:00:00 0
2014-07-02 23:00:00 0
2014-07-03 00:00:00 0
dtype: int64
source to share