Pandas diff() for first records in time series, missing data returns NaN

Question

In Pandas 0.14.1, diff() does not generate values at the start of a time series.

diff() seems to handle missing data differently from cumsum(), which treats NaN as 0. Is there a way to make diff() assume 0 for the missing previous value (missing because it falls before the start of the time series)?

For example:

    >print df

    2014-05-01  A     Apple        1
                B     Banana       2
    2014-06-01  A     Apple        3
                B     Banana       4

leads to:

    >print df.groupby(level=[1,2]).diff()

    2014-05-01  A     Apple        NaN
                B     Banana       NaN
    2014-06-01  A     Apple        2
                B     Banana       2

while the desired result is:

    2014-05-01  A     Apple        1
                B     Banana       2
    2014-06-01  A     Apple        2
                B     Banana       2

python numpy pandas dataframe

LPG 13 Aug 14 at 14:40

1 answer

chrisb · Answer 1 · 2014-08-13T16:20:39+0000

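As far as I know, groupby(...).diff() is just a call to np.diff, which always returns an array 1 (or n) shorter than the one passed to it. Pandas keeps the original length by putting NaN in the positions that have no predecessor, which is why the first record of each group is NaN.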
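The length difference is easy to check on a toy array. This is a minimal sketch with values invented for illustration (they are not from the question); it only compares the shape of the two results.

```python
import numpy as np
import pandas as pd

values = [1, 3, 6]  # illustrative values, not taken from the question

# np.diff drops the first position: the result has len(values) - 1 entries.
np_result = np.diff(values)

# Series.diff keeps the original length and marks the first slot as NaN.
pd_result = pd.Series(values).diff()

print(len(np_result), len(pd_result))
print(pd_result)
```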

But filling in the missing data should be pretty simple. Something like this?

    In [175]: df
    Out[175]: 
                         d
    a          b c        
    2014-05-01 A Apple   1
               B Banana  2
    2014-06-01 A Apple   3
               B Banana  4

    In [176]: df['diff'] = df.groupby(level=[1,2])['d'].diff()

    In [177]: df['diff'] = df['diff'].fillna(df['d'])

    In [178]: df
    Out[178]: 
                         d  diff
    a          b c              
    2014-05-01 A Apple   1     1
               B Banana  2     2
    2014-06-01 A Apple   3     2
               B Banana  4     2
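To reproduce this end to end, something like the following self-contained sketch should work. The level names a, b, c and the column name d follow the session above, but how the original frame was built is an assumption. It was written with a recent pandas in mind rather than 0.14.1, so print is a function here.

```python
import pandas as pd

# Rebuild the example frame. The construction is assumed; only the
# values and names are taken from the session above.
index = pd.MultiIndex.from_tuples(
    [
        (pd.Timestamp("2014-05-01"), "A", "Apple"),
        (pd.Timestamp("2014-05-01"), "B", "Banana"),
        (pd.Timestamp("2014-06-01"), "A", "Apple"),
        (pd.Timestamp("2014-06-01"), "B", "Banana"),
    ],
    names=["a", "b", "c"],
)
df = pd.DataFrame({"d": [1, 2, 3, 4]}, index=index)

# Difference within each (b, c) group; the first row of each group is NaN.
df["diff"] = df.groupby(level=["b", "c"])["d"].diff()

# Treat the missing predecessor as 0, so the first difference is the value itself.
df["diff"] = df["diff"].fillna(df["d"])

print(df)
```

The diff column comes back as floating point, because NaN forces a float dtype before the fill. If integers are needed, calling .astype(int) after fillna is one option.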
