Limit pandas forward fill of missing data to one index level in a multi-indexed DataFrame

As an example, let's say I have a DataFrame with columns for 'year', 'quarter' (sequential within each year), a variable ('var') and a measurement ('value'):

year   quarter   var  value
2015         1     A    0.1
2015         2     A    0.5
2015         3     A    0.6
2015         4     A    1.0
2015         1     B    0.1
2015         4     B    0.5
2015         2     C    0.0
2015         3     C    0.7
2015         4     C    1.2

      

But sometimes data are missing (for example, see [2015, quarter 2, "B"]). It isn't hard to insert NaN into the data by reindexing, so I get this:

year   quarter   var  value
2015         1     A    0.1
2015         2     A    0.5
2015         3     A    0.6
2015         4     A    1.0
2015         1     B    0.1
2015         2     B    NaN
2015         3     B    NaN
2015         4     B    0.5
2015         1     C    NaN
2015         2     C    0.0
2015         3     C    0.7
2015         4     C    1.2

      

But what I would like to do is fill in the "missing" data using forward fill to propagate the values, i.e. df.ffill(), and then fill the remaining values with zero, i.e. df.fillna(0), so that I get something like this:

year   quarter   var  value
2015         1     A    0.1
2015         2     A    0.5
2015         3     A    0.6
2015         4     A    1.0
2015         1     B    0.1
2015         2     B    0.1
2015         3     B    0.1
2015         4     B    0.5
2015         1     C    0.0
2015         2     C    0.0
2015         3     C    0.7
2015         4     C    1.2

      

However, when I use df.ffill(), I haven't found a way to restrict the fill to within each 'var' or 'year'.

My first idea was to convert the data to a pivot table:

pd.pivot_table(data,values='value',index=['year','quarter'],columns='var',aggfunc=np.sum)

      

and then forward fill, but I can't figure out how to limit it to a year (or how to unpivot the table back to its original form).

Any help is appreciated!

1 answer


You basically want your data in a table with time on the row index and everything else in columns. You can use a pivot table or set_index / unstack:

df2 = df.set_index(['year', 'quarter', 'var']).unstack('var')
>>> df2
             value          
var              A    B    C
year quarter                
2015 1         0.1  0.1  NaN
     2         0.5  NaN  0.0
     3         0.6  NaN  0.7
     4         1.0  0.5  1.2
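The same wide layout can also be built with pivot_table, as in the question, and melt is one way to get back to the long form afterwards. This is only a rough sketch, assuming the example data from the question; the names wide, filled and long_df are made up for illustration. Note that the columns end up as a plain 'var' index rather than the two-level ('value', var) columns that unstack produces.

```python
import pandas as pd

# Example data from the question (some year/quarter/var combinations are absent).
df = pd.DataFrame({
    "year": [2015] * 9,
    "quarter": [1, 2, 3, 4, 1, 4, 2, 3, 4],
    "var": ["A", "A", "A", "A", "B", "B", "C", "C", "C"],
    "value": [0.1, 0.5, 0.6, 1.0, 0.1, 0.5, 0.0, 0.7, 1.2],
})

# Wide form: one row per (year, quarter), one column per var.
# Combinations that are absent from df show up as NaN.
wide = pd.pivot_table(df, values="value", index=["year", "quarter"],
                      columns="var", aggfunc="sum")

# Forward fill down each column, then zero-fill whatever is still missing.
filled = wide.ffill().fillna(0)

# Unpivot back to the long layout: one row per (year, quarter, var).
long_df = filled.reset_index().melt(id_vars=["year", "quarter"],
                                    var_name="var", value_name="value")
print(long_df)
```

melt walks the columns in order, so the rows come out grouped by var without an extra sort.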

      

Once the data is in this form, forward fill, then fill any remaining NaNs with zero.

df2 = df2.ffill().fillna(0)
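Note that a plain ffill() on the wide table also carries values across year boundaries. If the fill should stay within each year, as the question asks, one option is to group on the 'year' index level first. A minimal sketch, assuming made-up two-year data (the 2016 rows and the names plain and per_year are invented for illustration):

```python
import pandas as pd

# Hypothetical data spanning two years; B is missing in 2015 Q3 and 2016 Q1.
df = pd.DataFrame({
    "year": [2015, 2015, 2016, 2016, 2015, 2016],
    "quarter": [3, 4, 1, 2, 4, 2],
    "var": ["A", "A", "A", "A", "B", "B"],
    "value": [0.6, 1.0, 1.1, 1.3, 0.5, 0.9],
})

df2 = df.set_index(["year", "quarter", "var"]).unstack("var")

# Plain forward fill: the 2015 Q4 value of B leaks into 2016 Q1.
plain = df2.ffill().fillna(0)

# Grouped forward fill: each year is filled on its own, so values
# never cross a year boundary; leftover NaNs become 0.
per_year = df2.groupby(level="year").ffill().fillna(0)

print(plain)
print(per_year)
```

Here B in 2016 Q1 ends up as 0.5 with the plain fill and as 0.0 with the grouped fill.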

      

Finally, stack and sort your data, then reset your index if you like (sortlevel has been removed from pandas, so sort_index is used here):

   >>> df2.stack('var').sort_index(level='var').reset_index()
        year  quarter var  value
    0   2015        1   A    0.1
    1   2015        2   A    0.5
    2   2015        3   A    0.6
    3   2015        4   A    1.0
    4   2015        1   B    0.1
    5   2015        2   B    0.1
    6   2015        3   B    0.1
    7   2015        4   B    0.5
    8   2015        1   C    0.0
    9   2015        2   C    0.0
    10  2015        3   C    0.7
    11  2015        4   C    1.2
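If you would rather not reshape at all, a groupby on the long data may do the same job. This is a sketch under assumptions rather than part of the answer above: it assumes four quarters per year and builds the full grid with MultiIndex.from_product, and the names full_index, long and result are hypothetical.

```python
import pandas as pd

# Example data from the question.
df = pd.DataFrame({
    "year": [2015] * 9,
    "quarter": [1, 2, 3, 4, 1, 4, 2, 3, 4],
    "var": ["A", "A", "A", "A", "B", "B", "C", "C", "C"],
    "value": [0.1, 0.5, 0.6, 1.0, 0.1, 0.5, 0.0, 0.7, 1.2],
})

# Every (year, var, quarter) combination, assuming quarters 1 to 4.
full_index = pd.MultiIndex.from_product(
    [df["year"].unique(), df["var"].unique(), [1, 2, 3, 4]],
    names=["year", "var", "quarter"],
)

# Reindexing inserts NaN for the combinations that were absent.
long = df.set_index(["year", "var", "quarter"])["value"].reindex(full_index)

# Forward fill within each (year, var) group only, then zero-fill the rest.
result = (
    long.groupby(level=["year", "var"]).ffill()
        .fillna(0)
        .reset_index()[["year", "quarter", "var", "value"]]
)
print(result)
```

Grouping on both 'year' and 'var' keeps the fill from crossing either boundary, which is the restriction the question asks about.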

      
