Pandas - Add missing years in time series data with repeating years

I have a dataset like this that is missing data for several years.

County Year Pop
12     1999 1.1
12     2001 1.2
13     1999 1.0
13     2000 1.1

      

I need something like

County Year Pop
12     1999 1.1
12     2000 NaN
12     2001 1.2
13     1999 1.0
13     2000 1.1
13     2001 NaN

      

I tried setting the index to Year and then reindexing with another dataframe containing only the years (the method mentioned here: Pandas: Add data for missing months), but that gives me a "cannot reindex" error because of duplicate values. I've also tried df.loc, but it has the same problem. I even tried a full outer join with an empty df containing only the years, but that also didn't work.

How can I solve this?



5 answers


Create a MultiIndex so you don't have duplicates:

df.set_index(['County', 'Year'], inplace=True)

      

Then create a complete MultiIndex with all combinations:

index = pd.MultiIndex.from_product(df.index.levels)

      



Then reindex:

df.reindex(index)

      

The MultiIndex construction is untested and may need a little tweaking (for example, if a year is completely missing from all counties), but I think you get the idea.
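The three steps above can be put together as a minimal sketch on the sample data from the question. The variable names `indexed`, `full_index`, and `result` are only illustrative, and the level names are passed explicitly so they survive regardless of pandas version.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "County": [12, 12, 13, 13],
    "Year": [1999, 2001, 1999, 2000],
    "Pop": [1.1, 1.2, 1.0, 1.1],
})

# Step 1: (County, Year) pairs are unique, so they can serve as the index.
indexed = df.set_index(["County", "Year"])

# Step 2: every combination of the counties and years that occur in the data.
full_index = pd.MultiIndex.from_product(
    indexed.index.levels, names=indexed.index.names
)

# Step 3: reindex returns a new frame; missing combinations get NaN.
result = indexed.reindex(full_index).reset_index()
print(result)
```

Because the index is built from the existing levels, a year that is absent from every county will not be added by this variant.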



You can use pivot_table:

In [11]: df.pivot_table(values="Pop", index="County", columns="Year")
Out[11]:
Year    1999  2000  2001
County
12       1.1   NaN   1.2
13       1.0   1.1   NaN

      



and then stack the result (this returns a Series):

In [12]: df.pivot_table(values="Pop", index="County", columns="Year").stack(dropna=False)
Out[12]:
County  Year
12      1999    1.1
        2000    NaN
        2001    1.2
13      1999    1.0
        2000    1.1
        2001    NaN
dtype: float64
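If a flat DataFrame is wanted rather than a Series, one possible variation (not part of the answer above) is to replace the stack step with melt, which keeps the NaN cells. This is only a sketch on the question's sample data; `wide` and `long_df` are illustrative names. Note that pivot_table aggregates with the mean by default, so duplicate County/Year pairs would be averaged.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "County": [12, 12, 13, 13],
    "Year": [1999, 2001, 1999, 2000],
    "Pop": [1.1, 1.2, 1.0, 1.1],
})

# Wide table: one row per County, one column per Year, NaN where data is missing.
wide = df.pivot_table(values="Pop", index="County", columns="Year")

# Back to long form; melt keeps the NaN cells instead of dropping them.
long_df = (
    wide.reset_index()
        .melt(id_vars="County", var_name="Year", value_name="Pop")
        .astype({"Year": "int64"})
        .sort_values(["County", "Year"])
        .reset_index(drop=True)
)
print(long_df)
```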

      



I am working under the assumption that you want to add every year between the minimum and maximum year. Perhaps 2000 could be missing for both counties 12 and 13.

I will build a pd.MultiIndex with from_product, using the unique values from the column 'County' and all integer years between and including the minimum and maximum years in the column 'Year'.

Note: this solution fills in all missing years, even ones that are not currently present for any county.

mux = pd.MultiIndex.from_product([
        df.County.unique(),
        range(df.Year.min(), df.Year.max() + 1)
    ], names=['County', 'Year'])

df.set_index(['County', 'Year']).reindex(mux).reset_index()

   County  Year  Pop
0      12  1999  1.1
1      12  2000  NaN
2      12  2001  1.2
3      13  1999  1.0
4      13  2000  1.1
5      13  2001  NaN
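To illustrate the note above, here is a small sketch with a variant of the data in which 2000 is missing from every county (the `sparse` frame is made up for illustration, not taken from the question). It compares an index built from the existing levels with one built from the full year range.

```python
import pandas as pd

# Hypothetical variant of the data: no county has a row for 2000.
sparse = pd.DataFrame({
    "County": [12, 12, 13, 13],
    "Year": [1999, 2001, 1999, 2001],
    "Pop": [1.1, 1.2, 1.0, 1.1],
})
indexed = sparse.set_index(["County", "Year"])

# Index built only from the years that occur somewhere in the data.
from_levels = pd.MultiIndex.from_product(
    indexed.index.levels, names=["County", "Year"]
)

# Index built from every integer year between the minimum and the maximum.
from_range = pd.MultiIndex.from_product(
    [sparse["County"].unique(),
     range(sparse["Year"].min(), sparse["Year"].max() + 1)],
    names=["County", "Year"],
)

by_levels = indexed.reindex(from_levels)  # 2000 is still absent
by_range = indexed.reindex(from_range)    # 2000 appears with NaN
print(len(by_levels), len(by_range))
```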

      



Or you can try black magic :P

min_year, max_year = df.Year.min(), df.Year.max()

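# errors="ignore": depending on the pandas version, the grouping column may already be excluded inside apply
df.groupby('County').apply(lambda g: g.set_index("Year").reindex(range(min_year, max_year+1))).drop(columns="County", errors="ignore").reset_index()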

      



You mentioned that you tried to join with an empty df and this approach might actually work.

Setup:

import numpy as np
import pandas as pd

df = pd.DataFrame({'County': {0: 12, 1: 12, 2: 13, 3: 13},
 'Pop': {0: 1.1, 1: 1.2, 2: 1.0, 3: 1.1},
 'Year': {0: 1999, 1: 2001, 2: 1999, 3: 2000}})

      

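Solution:

# Create a new df with every required Year for each County.
# (pd.tools.util.cartesian_product is not available in recent pandas versions;
# MultiIndex.from_product builds the same County x Year grid.)
df_2 = pd.MultiIndex.from_product([df.County.unique(), np.arange(1999, 2002)],
                                  names=['County', 'Year']).to_frame(index=False)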

#Left join the new dataframe to the existing dataframe to populate the Pop values.
pd.merge(df_2,df,on=['Year','County'],how='left')
Out[73]: 
   County  Year  Pop
0      12  1999  1.1
1      12  2000  NaN
2      12  2001  1.2
3      13  1999  1.0
4      13  2000  1.1
5      13  2001  NaN
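As a possible alternative way to build the County x Year grid, assuming pandas 1.2 or later (where merge accepts how="cross"), the sketch below derives the year range from the data instead of hard-coding it. The names `counties`, `years`, `grid`, and `result` are illustrative only.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "County": [12, 12, 13, 13],
    "Year": [1999, 2001, 1999, 2000],
    "Pop": [1.1, 1.2, 1.0, 1.1],
})

# One frame per dimension of the grid.
counties = pd.DataFrame({"County": df["County"].unique()})
years = pd.DataFrame({"Year": range(df["Year"].min(), df["Year"].max() + 1)})

# Cross join: every county paired with every year (requires pandas >= 1.2).
grid = counties.merge(years, how="cross")

# Left join the original data onto the grid; unmatched pairs get NaN.
result = grid.merge(df, on=["County", "Year"], how="left")
print(result)
```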

      
