Pandas - Add missing years in time series data with repeating years
I have a dataset like this that is missing data for several years.
County Year Pop
12 1999 1.1
12 2001 1.2
13 1999 1.0
13 2000 1.1
I need something like this:
County Year Pop
12 1999 1.1
12 2000 NaN
12 2001 1.2
13 1999 1.0
13 2000 1.1
13 2001 NaN
I tried setting the index to the year and then reindexing with another DataFrame containing just the years (the Pandas method mentioned here: Add data for missing months), but that gives me a "can't reindex" error because of the duplicate values. I've also tried df.loc, but it has the same problem. I even tried a full outer join with an empty df containing only the years, but that didn't work either.
How can I solve this?
Make a MultiIndex so that you don't have duplicates:
df.set_index(['County', 'Year'], inplace=True)
Then create a complete MultiIndex with all combinations:
index = pd.MultiIndex.from_product(df.index.levels)
Then reindex:
df.reindex(index)
The MultiIndex construction is untested and may need a little tweaking (for example, if a year is completely missing from all counties), but I think you get the idea.
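The steps above can be put together as a minimal, self-contained sketch on the question's sample data. The variable names `indexed`, `full_index`, and `sparse` are illustrative assumptions, and the result is assigned rather than modified in place. The last lines illustrate the caveat about a year missing from every county: the index levels only contain years that appear somewhere in the data.

```python
import pandas as pd

df = pd.DataFrame({
    "County": [12, 12, 13, 13],
    "Year": [1999, 2001, 1999, 2000],
    "Pop": [1.1, 1.2, 1.0, 1.1],
})

# Index by (County, Year) so that every index entry is unique.
indexed = df.set_index(["County", "Year"])

# Every combination of the observed counties and the observed years.
full_index = pd.MultiIndex.from_product(
    indexed.index.levels, names=indexed.index.names
)

# Combinations that were absent come back with NaN in 'Pop'.
result = indexed.reindex(full_index).reset_index()
print(result)

# Caveat: if no county has the year 2000, it is not among the levels,
# so this approach would not add it.
sparse = df[df["Year"] != 2000].set_index(["County", "Year"])
print(list(sparse.index.levels[1]))
```

If years absent from every county must also appear, the year values need to be supplied explicitly (for instance as a range) instead of being read from the index levels.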
You can use pivot_table:
In [11]: df.pivot_table(values="Pop", index="County", columns="Year")
Out[11]:
Year 1999 2000 2001
County
12 1.1 NaN 1.2
13 1.0 1.1 NaN
and then stack the result (a Series is required):
In [12]: df.pivot_table(values="Pop", index="County", columns="Year").stack(dropna=False)
Out[12]:
County Year
12 1999 1.1
2000 NaN
2001 1.2
13 1999 1.0
2000 1.1
2001 NaN
dtype: float64
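If a flat DataFrame with the original three columns is wanted instead of a Series, one possible variation (an assumption, not part of the answer above) is to melt the pivoted table back into long form. melt keeps the NaN cells, so the missing (County, Year) pairs survive. The names `wide` and `long_df` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "County": [12, 12, 13, 13],
    "Year": [1999, 2001, 1999, 2000],
    "Pop": [1.1, 1.2, 1.0, 1.1],
})

# One row per county, one column per observed year.
wide = df.pivot_table(values="Pop", index="County", columns="Year")

# Back to long form; NaN cells are kept as rows.
long_df = (
    wide.reset_index()
        .melt(id_vars=["County"], var_name="Year", value_name="Pop")
        .astype({"Year": "int64"})        # year labels come back as objects
        .sort_values(["County", "Year"])
        .reset_index(drop=True)
)
print(long_df)
```

Like the pivot itself, this only covers years that occur for at least one county.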
I am working on the assumption that you may want to add all the years between the minimum and maximum years. It may be the case that 2000 is missing for both counties 12 and 13.
I will build a pd.MultiIndex with from_product, using the unique values from the 'County' column and all integer years between and including the minimum and maximum years in the 'Year' column.
Note: this solution fills in all missing years, even those that are not currently present for any county.
mux = pd.MultiIndex.from_product([
df.County.unique(),
range(df.Year.min(), df.Year.max() + 1)
], names=['County', 'Year'])
df.set_index(['County', 'Year']).reindex(mux).reset_index()
County Year Pop
0 12 1999 1.1
1 12 2000 NaN
2 12 2001 1.2
3 13 1999 1.0
4 13 2000 1.1
5 13 2001 NaN
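As a sketch of how this could be reused, the same idea can be wrapped in a small helper. The function name `fill_missing_years` and its optional `start` and `end` parameters are hypothetical, introduced only for illustration. The sample data here omits 2000 for both counties to show that the year is still filled in.

```python
import pandas as pd

def fill_missing_years(df, start=None, end=None):
    """Return one row per (County, Year) for every year in [start, end]."""
    # Default to the observed minimum and maximum years.
    start = int(df["Year"].min()) if start is None else start
    end = int(df["Year"].max()) if end is None else end
    mux = pd.MultiIndex.from_product(
        [df["County"].unique(), range(start, end + 1)],
        names=["County", "Year"],
    )
    return df.set_index(["County", "Year"]).reindex(mux).reset_index()

# 2000 is absent for every county in this sample.
df = pd.DataFrame({
    "County": [12, 12, 13, 13],
    "Year": [1999, 2001, 1999, 2001],
    "Pop": [1.1, 1.2, 1.0, 1.1],
})

filled = fill_missing_years(df)
print(filled)

# Extending the range past the observed maximum adds further NaN rows.
extended = fill_missing_years(df, end=2002)
print(extended)
```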
You mentioned that you tried joining with an empty df; that approach can actually work.
Setup:
import pandas as pd

df = pd.DataFrame({'County': {0: 12, 1: 12, 2: 13, 3: 13},
                   'Pop': {0: 1.1, 1: 1.2, 2: 1.0, 3: 1.1},
                   'Year': {0: 1999, 1: 2001, 2: 1999, 3: 2000}})
Solution:
import itertools

# Create a new blank df with all the required Years for each County.
# itertools.product yields every (County, Year) pair; pd.tools.util.cartesian_product
# is not available in current pandas versions.
df_2 = pd.DataFrame(list(itertools.product(df.County.unique(), range(1999, 2002))),
                    columns=['County', 'Year'])
# Left join the new dataframe to the existing dataframe to populate the Pop values.
pd.merge(df_2, df, on=['Year', 'County'], how='left')
Out[73]:
County Year Pop
0 12 1999 1.1
1 12 2000 NaN
2 12 2001 1.2
3 13 1999 1.0
4 13 2000 1.1
5 13 2001 NaN
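Another way to build the blank frame, sketched here under the assumption that pandas 1.2 or later is available (the `how="cross"` merge option requires it), is a cross join between a frame of counties and a frame of years. The names `counties`, `years`, and `grid` are illustrative, and the year range is taken from the data rather than hard-coded.

```python
import pandas as pd

df = pd.DataFrame({
    "County": [12, 12, 13, 13],
    "Year": [1999, 2001, 1999, 2000],
    "Pop": [1.1, 1.2, 1.0, 1.1],
})

# One frame per dimension of the grid.
counties = pd.DataFrame({"County": df["County"].unique()})
years = pd.DataFrame({"Year": range(df["Year"].min(), df["Year"].max() + 1)})

# Cross join: every county paired with every year.
grid = counties.merge(years, how="cross")

# Left join keeps every grid row; missing pairs get NaN in 'Pop'.
result = grid.merge(df, on=["County", "Year"], how="left")
print(result)
```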