Pandas: Add missing years in time series data with repeating years

I have a dataset like this that is missing data for several years.

County Year Pop
12     1999 1.1
12     2001 1.2
13     1999 1.0
13     2000 1.1


I need something like

County Year Pop
12     1999 1.1
12     2000 NaN
12     2001 1.2
13     1999 1.0
13     2000 1.1
13     2001 NaN


I tried setting the index to Year and then reindexing with another dataframe that contains only the years (the method mentioned here: "Pandas: Add data for missing months"), but that gives me a "cannot reindex" error because of the duplicate values. I've also tried df.loc, but it has the same problem. I even tried a full outer join with an empty df containing only the years, but that didn't work either.

How can I solve this?



5 answers

Make a MultiIndex so that you don't have duplicates:

df.set_index(['County', 'Year'], inplace=True)


Then create a complete MultiIndex with all combinations:

index = pd.MultiIndex.from_product(df.index.levels)


Then reindex:

df = df.reindex(index)

The MultiIndex construction is untested and may need a little tweaking (for example, if a year is completely missing from all counties), but I think you get the idea.
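For reference, here is a minimal end-to-end sketch of the steps above, using the sample data from the question. The variable names (`indexed`, `full_index`, `filled`, `sparse`) are only illustrative, and passing `names=` explicitly is a precaution rather than something the answer requires. The last part illustrates the caveat: a year that is missing from every county is not in the index levels, so this approach does not add it.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    'County': [12, 12, 13, 13],
    'Year': [1999, 2001, 1999, 2000],
    'Pop': [1.1, 1.2, 1.0, 1.1],
})

indexed = df.set_index(['County', 'Year'])

# Every (County, Year) pair built from the values already present in each level.
full_index = pd.MultiIndex.from_product(indexed.index.levels,
                                        names=indexed.index.names)

# Pairs that had no row get NaN for Pop.
filled = indexed.reindex(full_index).reset_index()
print(filled)

# Caveat: if 2000 is absent from all counties, it is not in the levels,
# so the reindexed result still has no rows for 2000.
sparse = df[df['Year'] != 2000].set_index(['County', 'Year'])
sparse_index = pd.MultiIndex.from_product(sparse.index.levels,
                                          names=sparse.index.names)
sparse_years = sparse.reindex(sparse_index).reset_index()['Year'].unique().tolist()
print(sparse_years)
```

If a year can be missing everywhere, building the year level from an explicit range instead of from the existing levels avoids the problem.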



You can use pivot_table


In [11]: df.pivot_table(values="Pop", index="County", columns="Year")
Year    1999  2000  2001
County
12       1.1   NaN   1.2
13       1.0   1.1   NaN


and then stack the result (this gives a Series):

In [12]: df.pivot_table(values="Pop", index="County", columns="Year").stack(dropna=False)
County  Year
12      1999    1.1
        2000    NaN
        2001    1.2
13      1999    1.0
        2000    1.1
        2001    NaN
dtype: float64
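If you need a flat DataFrame rather than a Series, one possible follow-up is to melt the pivoted table instead of stacking it. This is a swapped-in technique (`melt` rather than `stack`), chosen here because `melt` keeps NaN values without needing a `dropna` argument. The names `wide` and `long` are only illustrative.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    'County': [12, 12, 13, 13],
    'Year': [1999, 2001, 1999, 2000],
    'Pop': [1.1, 1.2, 1.0, 1.1],
})

# One row per County, one column per Year; missing combinations become NaN.
wide = df.pivot_table(values='Pop', index='County', columns='Year')

# Back to long form. melt keeps the NaN cells, so the missing years survive.
long = (
    wide.reset_index()
        .melt(id_vars='County', var_name='Year', value_name='Pop')
        .astype({'Year': int})            # year labels come back as objects
        .sort_values(['County', 'Year'])
        .reset_index(drop=True)
)
print(long)
```

Note that this only adds years that appear for at least one county, because the pivot columns come from the existing data.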




I am working on the assumption that you want to add every year between the minimum and maximum years. For example, 2000 might have been missing for both counties 12 and 13.

I will build a pd.MultiIndex using the unique values from the 'County' column and all integer years between and including the minimum and maximum years in the 'Year' column.

Note: this solution fills in all missing years in that range, even years that do not currently appear for any county.

mux = pd.MultiIndex.from_product([
        df.County.unique(),
        range(df.Year.min(), df.Year.max() + 1)
    ], names=['County', 'Year'])

df.set_index(['County', 'Year']).reindex(mux).reset_index()

   County  Year  Pop
0      12  1999  1.1
1      12  2000  NaN
2      12  2001  1.2
3      13  1999  1.0
4      13  2000  1.1
5      13  2001  NaN
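If you need this in more than one place, the same idea can be wrapped in a small helper. This is only a sketch: `fill_missing_years` and its `start`/`end` parameters are hypothetical names, not part of pandas, and it assumes each (group, year) pair appears at most once.

```python
import pandas as pd


def fill_missing_years(df, group_col='County', year_col='Year', start=None, end=None):
    """Return df with one row per (group, year) for every year in [start, end].

    Hypothetical helper: start and end default to the min and max year in df.
    """
    start = df[year_col].min() if start is None else start
    end = df[year_col].max() if end is None else end
    mux = pd.MultiIndex.from_product(
        [df[group_col].unique(), range(start, end + 1)],
        names=[group_col, year_col],
    )
    # Reindexing adds NaN rows for missing pairs (and drops years outside the range).
    return df.set_index([group_col, year_col]).reindex(mux).reset_index()


# Sample data from the question.
df = pd.DataFrame({
    'County': [12, 12, 13, 13],
    'Year': [1999, 2001, 1999, 2000],
    'Pop': [1.1, 1.2, 1.0, 1.1],
})

# Extend the range past the last observed year.
extended = fill_missing_years(df, end=2003)
print(extended)
```

Passing an explicit `start` or `end` is useful when the expected range is known up front and the data may not cover it.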




Or you can try black magic :P

min_year, max_year = df.Year.min(), df.Year.max()

(df.groupby('County')
   .apply(lambda g: g.set_index("Year").reindex(range(min_year, max_year + 1)))
   .drop("County", axis=1)
   .reset_index())
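How `groupby(...).apply` treats the grouping column has changed between pandas versions, so here is a hedged variant of the same per-group reindex that selects only the 'Pop' column and therefore never has to drop 'County'. It is a sketch of the same idea, not the answer's exact code; the names `years` and `filled` are illustrative.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    'County': [12, 12, 13, 13],
    'Year': [1999, 2001, 1999, 2000],
    'Pop': [1.1, 1.2, 1.0, 1.1],
})

years = range(df['Year'].min(), df['Year'].max() + 1)

filled = (
    df.set_index('Year')
      .groupby('County')['Pop']
      .apply(lambda s: s.reindex(years))   # reindex each county's Pop by year
      .rename_axis(['County', 'Year'])     # make the index level names explicit
      .reset_index()
)
print(filled)
```

Like the original one-liner, this uses the overall minimum and maximum year for every county.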




You mentioned that you tried to join with an empty df; that approach can actually work.


import numpy as np
import pandas as pd

df = pd.DataFrame({'County': {0: 12, 1: 12, 2: 13, 3: 13},
 'Pop': {0: 1.1, 1: 1.2, 2: 1.0, 3: 1.1},
 'Year': {0: 1999, 1: 2001, 2: 1999, 3: 2000}})

# Create a new blank df with all the required Years for each County
# (MultiIndex.from_product builds every County x Year combination)
df_2 = pd.MultiIndex.from_product([df.County.unique(), np.arange(1999, 2002)], names=['County', 'Year']).to_frame(index=False)

# Left join the new dataframe to the existing dataframe to populate the Pop values
pd.merge(df_2, df, on=['County', 'Year'], how='left')

   County  Year  Pop
0      12  1999  1.1
1      12  2000  NaN
2      12  2001  1.2
3      13  1999  1.0
4      13  2000  1.1
5      13  2001  NaN
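If you also want to know which rows were added by the join, `merge` accepts `indicator=True`, which adds a `_merge` column. The sketch below assumes the same sample data; the names `grid`, `merged`, and `added` are illustrative.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    'County': [12, 12, 13, 13],
    'Year': [1999, 2001, 1999, 2000],
    'Pop': [1.1, 1.2, 1.0, 1.1],
})

# Blank frame holding every County x Year combination.
grid = pd.MultiIndex.from_product(
    [df['County'].unique(), range(df['Year'].min(), df['Year'].max() + 1)],
    names=['County', 'Year'],
).to_frame(index=False)

# The left join keeps every grid row; _merge records where each row came from.
merged = grid.merge(df, on=['County', 'Year'], how='left', indicator=True)

# Rows present only in the grid are the ones that were missing from df.
added = merged[merged['_merge'] == 'left_only']
print(added[['County', 'Year']])
```

Dropping the `_merge` column afterwards gives the same result as the plain left join.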



