Pandas diff() returns NaN for the first records in a time series
In Pandas 0.14.1, diff() does not generate values at the start of a time series.
diff() seems to handle missing data differently from cumsum(), which assumes NaN == 0. Is there a way to make diff() assume 0 for the missing previous value (missing because it comes from before the start of the time series)?
For example:
>print df
2014-05-01  A  Apple   1
            B  Banana  2
2014-06-01  A  Apple   3
            B  Banana  4
leads to:
>print df.groupby(level=[1,2]).diff()
2014-05-01  A  Apple   NaN
            B  Banana  NaN
2014-06-01  A  Apple     2
            B  Banana    2
When the desired result is:
2014-05-01  A  Apple   1
            B  Banana  2
2014-06-01  A  Apple   2
            B  Banana  2
1 answer
As far as I know, groupby(...).diff() is just a call to np.diff, which always returns an array 1 (or n) shorter than the one passed to it. The first row of each group has no previous value to subtract, so it comes back as NaN.
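The length behaviour of np.diff can be checked directly in NumPy. This is only a minimal sketch with made-up values; it illustrates np.diff itself and does not show how pandas implements diff() internally.

```python
import numpy as np

# Made-up sample values, purely for illustration.
values = np.array([1, 3, 6, 10])

# First-order difference: one element shorter than the input.
first_diff = np.diff(values)

# Second-order difference (n=2): two elements shorter than the input.
second_diff = np.diff(values, n=2)

print(len(values), len(first_diff), len(second_diff))
```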
But filling in the missing data should be pretty simple. Something like this?
In [175]: df
Out[175]:
                     d
a          b c
2014-05-01 A Apple   1
           B Banana  2
2014-06-01 A Apple   3
           B Banana  4
In [176]: df['diff'] = df.groupby(level=[1,2])['d'].diff()
In [177]: df['diff'] = df['diff'].fillna(df['d'])
In [178]: df
Out[178]:
                     d  diff
a          b c
2014-05-01 A Apple   1     1
           B Banana  2     2
2014-06-01 A Apple   3     2
           B Banana  4     2
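For anyone who wants to run this outside an interactive session, here is a self-contained sketch of the same idea. The way the frame is built (a three-level MultiIndex named a, b, c, with Timestamp dates) is an assumption reconstructed from the printed output, and diff_with_initial is a hypothetical helper name, not a pandas function.

```python
import pandas as pd

# Assumed reconstruction of the example frame: a 3-level MultiIndex
# (a = date, b = letter, c = fruit) and a single value column "d".
index = pd.MultiIndex.from_tuples(
    [
        (pd.Timestamp("2014-05-01"), "A", "Apple"),
        (pd.Timestamp("2014-05-01"), "B", "Banana"),
        (pd.Timestamp("2014-06-01"), "A", "Apple"),
        (pd.Timestamp("2014-06-01"), "B", "Banana"),
    ],
    names=["a", "b", "c"],
)
df = pd.DataFrame({"d": [1, 2, 3, 4]}, index=index)


def diff_with_initial(frame, column, levels):
    """Hypothetical helper: group-wise diff where NaN results are
    replaced by the row's own value (i.e. the previous value is
    treated as 0)."""
    diffs = frame.groupby(level=levels)[column].diff()
    # fillna aligns on the index, so each NaN takes the value of
    # `column` from the same row.
    return diffs.fillna(frame[column])


df["diff"] = diff_with_initial(df, "d", ["b", "c"])
print(df)
```

Note that fillna replaces every NaN in the diff, not only the first row of each group, so if d itself contains gaps in the middle of a series, the row after a gap would also be filled with its own value.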