Group then apply function then flatten back to dataframe in Pandas Python
I have some data where some of the column values are year-to-date (cumulative) amounts. I want to change these columns to reflect quarterly differences, not cumulative amounts. The data looks like this:
ReportNumber NIY OANCFY FQTR FYEARQ
Reference Published
2007-12-31 2008-02-21 00:00:00 1 3131 3073 4 2007
2008-03-31 2008-05-08 00:00:00 1 1189 1482 1 2008
2009-05-07 16:00:00 2 1149 NaN 1 2008
2008-06-30 2008-08-07 00:00:00 1 2296 2493 2 2008
2009-08-18 00:00:00 3 2214 NaN 2 2008
2008-09-30 2008-11-06 00:00:00 1 3402 3763 3 2008
2009-11-07 00:00:00 3 3277 NaN 3 2008
2008-12-31 2009-02-17 16:00:00 1 NaN 4959 4 2008
2009-02-18 00:00:00 3 4202 NaN 4 2008
2010-03-21 00:00:00 5 4031 NaN 4 2008
2009-03-31 2009-05-07 16:00:00 1 942 1441 1 2009
2009-06-30 2009-08-06 00:00:00 1 1748 3017 2 2009
2009-09-30 2009-11-07 00:00:00 1 2458 4423 3 2009
2009-12-31 2010-02-24 16:00:00 1 3181 5598 4 2009
2010-03-31 2010-05-07 16:00:00 1 677 1172 1 2010
2010-06-30 2010-08-06 08:00:00 1 1392 2441 2 2010
2010-09-30 2010-11-08 16:00:00 1 1760 3150 3 2010
2010-12-31 2011-02-24 08:00:00 1 961 3946 4 2010
2011-03-31 2011-05-06 16:00:00 1 310 390 1 2011
2012-08-16 16:00:00 2 319 NaN 1 2011
2011-06-30 2011-08-09 08:00:00 1 465 730 2 2011
2012-08-16 16:00:00 2 443 NaN 2 2011
2011-09-30 2011-11-09 00:00:00 1 394 1222 3 2011
2012-11-06 16:00:00 2 411 NaN 3 2011
2011-12-31 2012-03-06 00:00:00 1 -5725 1785 4 2011
2013-03-05 00:00:00 2 -5754 NaN 4 2011
2012-03-31 2012-05-05 16:00:00 1 42 540 1 2012
2012-08-16 16:00:00 2 10 NaN 1 2012
2012-06-30 2012-08-02 16:00:00 1 -294 999 2 2012
2012-09-30 2012-11-06 16:00:00 1 -675 1785 3 2012
2012-12-31 2013-03-05 00:00:00 1 -219 2708 4 2012
So I need to compute the differences between consecutive FQTR values within a given FYEARQ, ordered by "FQTR" and then by the "Published" index, and end up with a single frame. My attempt below works until multiple values are given for an item ('NIY', 'OANCFY') on the same reference date.
ft = dict()
for group_name, subdf in cftmp.dropna().drop_duplicates().groupby('FYEARQ'):
    # keep the first row as-is, turn the remaining rows into row-to-row differences
    tmp = pd.concat([subdf.head(1), subdf.diff()]).dropna()
    tmp['FQTR'] = subdf['FQTR']
    tmp['FYEARQ'] = subdf['FYEARQ']
    tmp['ReportNumber'] = subdf['ReportNumber']
    ft.update({group_name: tmp})
    print(group_name)
    print('differences')
    print(tmp)
    print(' ')
Here pd.concat stitches the first row back together with the quarter-over-quarter ('FQTR') differences. It returns:
2007
differences
ReportNumber NIY OANCFY FQTR FYEARQ
Reference Published
2007-12-31 2008-02-21 1 3131 3073 4 2007
2008
differences
ReportNumber NIY OANCFY FQTR FYEARQ
Reference Published
2008-03-31 2008-05-08 1 1189 1482 1 2008
2008-06-30 2008-08-07 1 1107 1011 2 2008
2008-09-30 2008-11-06 1 1106 1270 3 2008
2009
differences
ReportNumber NIY OANCFY FQTR FYEARQ
Reference Published
2009-03-31 2009-05-07 16:00:00 1 942 1441 1 2009
2009-06-30 2009-08-06 00:00:00 1 806 1576 2 2009
2009-09-30 2009-11-07 00:00:00 1 710 1406 3 2009
2009-12-31 2010-02-24 16:00:00 1 723 1175 4 2009
2010
differences
ReportNumber NIY OANCFY FQTR FYEARQ
Reference Published
2010-03-31 2010-05-07 16:00:00 1 677 1172 1 2010
2010-06-30 2010-08-06 08:00:00 1 715 1269 2 2010
2010-09-30 2010-11-08 16:00:00 1 368 709 3 2010
2010-12-31 2011-02-24 08:00:00 1 -799 796 4 2010
2011
differences
ReportNumber NIY OANCFY FQTR FYEARQ
Reference Published
2011-03-31 2011-05-06 16:00:00 1 310 390 1 2011
2011-06-30 2011-08-09 08:00:00 1 155 340 2 2011
2011-09-30 2011-11-09 00:00:00 1 -71 492 3 2011
2011-12-31 2012-03-06 00:00:00 1 -6119 563 4 2011
2012
differences
ReportNumber NIY OANCFY FQTR FYEARQ
Reference Published
2012-03-31 2012-05-05 16:00:00 1 42 540 1 2012
2012-06-30 2012-08-02 16:00:00 1 -336 459 2 2012
2012-09-30 2012-11-06 16:00:00 1 -381 786 3 2012
2012-12-31 2013-03-05 00:00:00 1 456 923 4 2012
The problem with this solution is that it only works for ReportNumber == 1.
Then I use pd.concat to flatten it back into one frame:
pd.concat([ft[f] for f in ft])
Any suggestions?
Your loop calls cftmp.dropna(), which drops the data associated with report numbers other than 1; in your example, those rows have NaN for OANCFY.
To avoid the loop entirely, you can do something like this: take the first observation of each group with head(1) and the deltas with diff(), then concat the two together.
In [71]: newdf = pd.concat((df.groupby(['FYEARQ', 'ReportNumber']).head(1),
                            df.groupby(['FYEARQ', 'ReportNumber']).diff().dropna()))\
                   .reset_index()\
                   .sort('Reference')\
                   .dropna(subset=['OANCFY'])\
                   .reset_index(drop=True)\
                   .fillna(method='pad')

In [72]: newdf['FQTR'] = newdf.FQTR.groupby(newdf.FYEARQ).cumsum()

In [73]: print(newdf)
Reference Published FQTR FYEARQ NIY OANCFY ReportNumber
0 2007-12-31 2008-02-21 00:00:00 4 2007 3131 3073 1
1 2008-03-31 2008-05-08 00:00:00 1 2008 1189 1482 1
2 2008-06-30 2008-08-07 00:00:00 2 2008 1107 1011 1
3 2008-09-30 2008-11-06 00:00:00 3 2008 1106 1270 1
4 2009-03-31 2009-05-07 16:00:00 1 2009 942 1441 1
5 2009-06-30 2009-08-06 00:00:00 2 2009 806 1576 1
6 2009-09-30 2009-11-07 00:00:00 3 2009 710 1406 1
7 2009-12-31 2010-02-24 16:00:00 4 2009 723 1175 1
8 2010-03-31 2010-05-07 16:00:00 1 2010 677 1172 1
9 2010-06-30 2010-08-06 08:00:00 2 2010 715 1269 1
10 2010-09-30 2010-11-08 16:00:00 3 2010 368 709 1
11 2010-12-31 2011-02-24 08:00:00 4 2010 -799 796 1
12 2011-03-31 2011-05-06 16:00:00 1 2011 310 390 1
13 2011-06-30 2011-08-09 08:00:00 2 2011 155 340 1
14 2011-09-30 2011-11-09 00:00:00 3 2011 -71 492 1
15 2011-12-31 2012-03-06 00:00:00 4 2011 -6119 563 1
16 2012-03-31 2012-05-05 16:00:00 1 2012 42 540 1
17 2012-06-30 2012-08-02 16:00:00 2 2012 -336 459 1
18 2012-09-30 2012-11-06 16:00:00 3 2012 -381 786 1
19 2012-12-31 2013-03-05 00:00:00 4 2012 456 923 1
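The head(1)-plus-diff() idea above can also be written without the explicit concat, by filling the first-quarter NaN that diff() produces with the original cumulative value (for Q1, year-to-date already equals the quarterly amount). A minimal sketch of my own, on a small slice of the sample data, using sort_values, which replaced the deprecated DataFrame.sort in newer pandas:

```python
import pandas as pd

# a small slice of the sample data: OANCFY is cumulative year-to-date
df = pd.DataFrame({
    'FYEARQ':       [2008, 2008, 2008, 2009, 2009],
    'FQTR':         [1, 2, 3, 1, 2],
    'ReportNumber': [1, 1, 1, 1, 1],
    'OANCFY':       [1482.0, 2493.0, 3763.0, 1441.0, 3017.0],
})

# diff within each (year, report) group, in quarter order; the first
# quarter's diff is NaN, so fall back to the cumulative value itself
g = df.sort_values('FQTR').groupby(['FYEARQ', 'ReportNumber'])['OANCFY']
df['OANCFY_qtr'] = g.diff().fillna(df['OANCFY'])

print(df['OANCFY_qtr'].tolist())
```

Note that this also fills NaN diffs caused by genuinely missing data with the cumulative value, so it only behaves sensibly on rows where nothing but the first quarter is NaN.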
I'm not entirely sure I'm reading your question the way you intend, but I'm going to assume the following:
- Difference quarter over quarter, within a given year, per report number.
That certainly looks like what you are looking for.
Here's how to get there:
- Sort the DataFrame by FQTR. This ensures the diff operations run in quarter order.
- Group by year and report number.
- Select the columns you want to difference and apply pd.DataFrame.diff.
Here's an example. First, some nice copy/paste-able data:
In [155]: from numpy import nan

In [156]: df = pd.DataFrame({'FQTR': {0: 4, 1: 1, 2: 1, 3: 2, 4: 2, 5: 3, 6: 3, 7: 4, 8: 4, 9: 4, 10: 1, 11: 2, 12: 3, 13: 4, 14: 1, 15: 2, 16: 3, 17: 4, 18: 1, 19: 1, 20: 2, 21: 2, 22: 3, 23: 3, 24: 4, 25: 4, 26: 1, 27: 1, 28: 2, 29: 3, 30: 4}, 'FYEARQ': {0: 2007, 1: 2008, 2: 2008, 3: 2008, 4: 2008, 5: 2008, 6: 2008, 7: 2008, 8: 2008, 9: 2008, 10: 2009, 11: 2009, 12: 2009, 13: 2009, 14: 2010, 15: 2010, 16: 2010, 17: 2010, 18: 2011, 19: 2011, 20: 2011, 21: 2011, 22: 2011, 23: 2011, 24: 2011, 25: 2011, 26: 2012, 27: 2012, 28: 2012, 29: 2012, 30: 2012}, 'NIY': {0: 3131.0, 1: 1189.0, 2: 1149.0, 3: 2296.0, 4: 2214.0, 5: 3402.0, 6: 3277.0, 7: nan, 8: 4202.0, 9: 4031.0, 10: 942.0, 11: 1748.0, 12: 2458.0, 13: 3181.0, 14: 677.0, 15: 1392.0, 16: 1760.0, 17: 961.0, 18: 310.0, 19: 319.0, 20: 465.0, 21: 443.0, 22: 394.0, 23: 411.0, 24: -5725.0, 25: -5754.0, 26: 42.0, 27: 10.0, 28: -294.0, 29: -675.0, 30: -219.0}, 'OANCFY': {0: 3073.0, 1: 1482.0, 2: nan, 3: 2493.0, 4: nan, 5: 3763.0, 6: nan, 7: 4959.0, 8: nan, 9: nan, 10: 1441.0, 11: 3017.0, 12: 4423.0, 13: 5598.0, 14: 1172.0, 15: 2441.0, 16: 3150.0, 17: 3946.0, 18: 390.0, 19: nan, 20: 730.0, 21: nan, 22: 1222.0, 23: nan, 24: 1785.0, 25: nan, 26: 540.0, 27: nan, 28: 999.0, 29: 1785.0, 30: 2708.0}, 'Published': {0: '2008-02-21 00:00:00', 1: '2008-05-08 00:00:00', 2: '2009-05-07 16:00:00', 3: '2008-08-07 00:00:00', 4: '2009-08-18 00:00:00', 5: '2008-11-06 00:00:00', 6: '2009-11-07 00:00:00', 7: '2009-02-17 16:00:00', 8: '2009-02-18 00:00:00', 9: '2010-03-21 00:00:00', 10: '2009-05-07 16:00:00', 11: '2009-08-06 00:00:00', 12: '2009-11-07 00:00:00', 13: '2010-02-24 16:00:00', 14: '2010-05-07 16:00:00', 15: '2010-08-06 08:00:00', 16: '2010-11-08 16:00:00', 17: '2011-02-24 08:00:00', 18: '2011-05-06 16:00:00', 19: '2012-08-16 16:00:00', 20: '2011-08-09 08:00:00', 21: '2012-08-16 16:00:00', 22: '2011-11-09 00:00:00', 23: '2012-11-06 16:00:00', 24: '2012-03-06 00:00:00', 25: '2013-03-05 00:00:00', 26: '2012-05-05 16:00:00', 27: '2012-08-16 16:00:00', 28: '2012-08-02 16:00:00', 29: '2012-11-06 16:00:00', 30: '2013-03-05 00:00:00'}, 'Reference': {0: '2007-12-31', 1: '2008-03-31', 2: '2008-03-31', 3: '2008-06-30', 4: '2008-06-30', 5: '2008-09-30', 6: '2008-09-30', 7: '2008-12-31', 8: '2008-12-31', 9: '2008-12-31', 10: '2009-03-31', 11: '2009-06-30', 12: '2009-09-30', 13: '2009-12-31', 14: '2010-03-31', 15: '2010-03-31', 16: '2010-09-30', 17: '2010-12-31', 18: '2011-03-31', 19: '2011-03-31', 20: '2011-06-30', 21: '2011-06-30', 22: '2011-09-30', 23: '2011-09-30', 24: '2011-12-31', 25: '2011-12-31', 26: '2012-03-31', 27: '2012-03-31', 28: '2012-06-30', 29: '2012-09-30', 30: '2012-12-31'}, 'ReportNumber': {0: 1, 1: 1, 2: 2, 3: 1, 4: 3, 5: 1, 6: 3, 7: 1, 8: 3, 9: 5, 10: 1, 11: 1, 12: 1, 13: 1, 14: 1, 15: 1, 16: 1, 17: 1, 18: 1, 19: 2, 20: 1, 21: 2, 22: 1, 23: 2, 24: 1, 25: 2, 26: 1, 27: 2, 28: 1, 29: 1, 30: 1}})
In [165]: df.head()
Out[165]:
Reference Published ReportNumber NIY OANCFY FQTR FYEARQ
0 2007-12-31 2008-02-21 00:00:00 1 3131 3073 4 2007
1 2008-03-31 2008-05-08 00:00:00 1 1189 1482 1 2008
2 2008-03-31 2009-05-07 16:00:00 2 1149 NaN 1 2008
3 2008-06-30 2008-08-07 00:00:00 1 2296 2493 2 2008
4 2008-06-30 2009-08-18 00:00:00 3 2214 NaN 2 2008
Now apply the above logic:
In [166]: diffs = df.sort('FQTR').groupby(['FYEARQ', 'ReportNumber'])[['NIY', 'OANCFY']].apply(pd.DataFrame.diff)
Rename the resulting columns:
In [177]: diffs.columns = ['%s_diff' % col for col in diffs.columns]
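As an aside, pandas also provides DataFrame.add_suffix, which does the same renaming in one call. A minimal sketch on a dummy frame standing in for the diffs result:

```python
import pandas as pd

# dummy stand-in for the diffs result
diffs = pd.DataFrame({'NIY': [1107.0], 'OANCFY': [1011.0]})
diffs = diffs.add_suffix('_diff')

print(list(diffs.columns))
```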
and, if you like, join the diff columns back onto the original df:
In [180]: with_diffs = pd.concat([df, diffs], axis=1)
Leaving you with something close to what you want:
In [182]: with_diffs[['Published', 'ReportNumber', 'NIY', 'FQTR', 'FYEARQ', 'NIY_diff']]
Out[182]:
Published ReportNumber NIY FQTR FYEARQ NIY_diff
0 2008-02-21 00:00:00 1 3131 4 2007 NaN
1 2008-05-08 00:00:00 1 1189 1 2008 NaN
2 2009-05-07 16:00:00 2 1149 1 2008 NaN
3 2008-08-07 00:00:00 1 2296 2 2008 1107
4 2009-08-18 00:00:00 3 2214 2 2008 NaN
5 2008-11-06 00:00:00 1 3402 3 2008 1106
6 2009-11-07 00:00:00 3 3277 3 2008 1063
7 2009-02-17 16:00:00 1 NaN 4 2008 NaN
8 2009-02-18 00:00:00 3 4202 4 2008 925
9 2010-03-21 00:00:00 5 4031 4 2008 NaN
10 2009-05-07 16:00:00 1 942 1 2009 NaN
11 2009-08-06 00:00:00 1 1748 2 2009 806
12 2009-11-07 00:00:00 1 2458 3 2009 710
13 2010-02-24 16:00:00 1 3181 4 2009 723
14 2010-05-07 16:00:00 1 677 1 2010 NaN
15 2010-08-06 08:00:00 1 1392 2 2010 715
16 2010-11-08 16:00:00 1 1760 3 2010 368
17 2011-02-24 08:00:00 1 961 4 2010 -799
18 2011-05-06 16:00:00 1 310 1 2011 NaN
19 2012-08-16 16:00:00 2 319 1 2011 NaN
20 2011-08-09 08:00:00 1 465 2 2011 155
21 2012-08-16 16:00:00 2 443 2 2011 124
22 2011-11-09 00:00:00 1 394 3 2011 -71
23 2012-11-06 16:00:00 2 411 3 2011 -32
24 2012-03-06 00:00:00 1 -5725 4 2011 -6119
25 2013-03-05 00:00:00 2 -5754 4 2011 -6165
26 2012-05-05 16:00:00 1 42 1 2012 NaN
27 2012-08-16 16:00:00 2 10 1 2012 NaN
28 2012-08-02 16:00:00 1 -294 2 2012 -336
29 2012-11-06 16:00:00 1 -675 3 2012 -381
30 2013-03-05 00:00:00 1 -219 4 2012 456
There's a bit more cleanup to be done (e.g. replacing the NaN values if you want), but that's left as an exercise!
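For completeness, one way to do that cleanup, sketched on a small slice of the sample columns and assuming the desired fill for a first quarter's NaN diff is its own cumulative value (year-to-date through Q1 equals the Q1 amount):

```python
import pandas as pd

df = pd.DataFrame({
    'FYEARQ':       [2011, 2011, 2011, 2011],
    'FQTR':         [1, 2, 1, 2],
    'ReportNumber': [1, 1, 2, 2],
    'NIY':          [310.0, 465.0, 319.0, 443.0],
})

# same recipe as above: sort by quarter, group, diff
diffs = df.sort_values('FQTR').groupby(['FYEARQ', 'ReportNumber'])[['NIY']].diff()
diffs.columns = ['%s_diff' % c for c in diffs.columns]

with_diffs = df.copy()
with_diffs['NIY_diff'] = diffs['NIY_diff']  # aligns on the index

# a first quarter has no previous quarter, so its NaN diff is
# replaced by the cumulative value itself
with_diffs['NIY_diff'] = with_diffs['NIY_diff'].fillna(with_diffs['NIY'])

print(with_diffs['NIY_diff'].tolist())
```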