Group then apply function then flatten back to dataframe in Pandas Python
I have some data where some of the column values are year-to-date (cumulative) amounts. I want to change these columns to reflect quarterly differences, not cumulative amounts. The data looks like this:
ReportNumber NIY OANCFY FQTR FYEARQ
Reference Published
2007-12-31 2008-02-21 00:00:00 1 3131 3073 4 2007
2008-03-31 2008-05-08 00:00:00 1 1189 1482 1 2008
2009-05-07 16:00:00 2 1149 NaN 1 2008
2008-06-30 2008-08-07 00:00:00 1 2296 2493 2 2008
2009-08-18 00:00:00 3 2214 NaN 2 2008
2008-09-30 2008-11-06 00:00:00 1 3402 3763 3 2008
2009-11-07 00:00:00 3 3277 NaN 3 2008
2008-12-31 2009-02-17 16:00:00 1 NaN 4959 4 2008
2009-02-18 00:00:00 3 4202 NaN 4 2008
2010-03-21 00:00:00 5 4031 NaN 4 2008
2009-03-31 2009-05-07 16:00:00 1 942 1441 1 2009
2009-06-30 2009-08-06 00:00:00 1 1748 3017 2 2009
2009-09-30 2009-11-07 00:00:00 1 2458 4423 3 2009
2009-12-31 2010-02-24 16:00:00 1 3181 5598 4 2009
2010-03-31 2010-05-07 16:00:00 1 677 1172 1 2010
2010-06-30 2010-08-06 08:00:00 1 1392 2441 2 2010
2010-09-30 2010-11-08 16:00:00 1 1760 3150 3 2010
2010-12-31 2011-02-24 08:00:00 1 961 3946 4 2010
2011-03-31 2011-05-06 16:00:00 1 310 390 1 2011
2012-08-16 16:00:00 2 319 NaN 1 2011
2011-06-30 2011-08-09 08:00:00 1 465 730 2 2011
2012-08-16 16:00:00 2 443 NaN 2 2011
2011-09-30 2011-11-09 00:00:00 1 394 1222 3 2011
2012-11-06 16:00:00 2 411 NaN 3 2011
2011-12-31 2012-03-06 00:00:00 1 -5725 1785 4 2011
2013-03-05 00:00:00 2 -5754 NaN 4 2011
2012-03-31 2012-05-05 16:00:00 1 42 540 1 2012
2012-08-16 16:00:00 2 10 NaN 1 2012
2012-06-30 2012-08-02 16:00:00 1 -294 999 2 2012
2012-09-30 2012-11-06 16:00:00 1 -675 1785 3 2012
2012-12-31 2013-03-05 00:00:00 1 -219 2708 4 2012
So I need to compute the differences between consecutive FQTR values within a given FYEARQ, ordered by "FQTR" and then by the "Published" index, and end up with a single frame. My attempt below works until multiple values are given for an item ('NIY', 'OANCFY') on the same reference date.
ft = dict()
for group_name, subdf in cftmp.dropna().drop_duplicates().groupby('FYEARQ'):
    # keep the first row as-is, turn the remaining rows into row-to-row differences
    tmp = pd.concat([subdf.head(1), subdf.diff()]).dropna()
    tmp['FQTR'] = subdf['FQTR']
    tmp['FYEARQ'] = subdf['FYEARQ']
    tmp['ReportNumber'] = subdf['ReportNumber']
    ft.update({group_name: tmp})
    print(group_name)
    print('differences')
    print(tmp)
    print(' ')
Here pd.concat stitches the first row back together with the quarter-over-quarter ('FQTR') differences. It returns:
2007
differences
ReportNumber NIY OANCFY FQTR FYEARQ
Reference Published
2007-12-31 2008-02-21 1 3131 3073 4 2007
2008
differences
ReportNumber NIY OANCFY FQTR FYEARQ
Reference Published
2008-03-31 2008-05-08 1 1189 1482 1 2008
2008-06-30 2008-08-07 1 1107 1011 2 2008
2008-09-30 2008-11-06 1 1106 1270 3 2008
2009
differences
ReportNumber NIY OANCFY FQTR FYEARQ
Reference Published
2009-03-31 2009-05-07 16:00:00 1 942 1441 1 2009
2009-06-30 2009-08-06 00:00:00 1 806 1576 2 2009
2009-09-30 2009-11-07 00:00:00 1 710 1406 3 2009
2009-12-31 2010-02-24 16:00:00 1 723 1175 4 2009
2010
differences
ReportNumber NIY OANCFY FQTR FYEARQ
Reference Published
2010-03-31 2010-05-07 16:00:00 1 677 1172 1 2010
2010-06-30 2010-08-06 08:00:00 1 715 1269 2 2010
2010-09-30 2010-11-08 16:00:00 1 368 709 3 2010
2010-12-31 2011-02-24 08:00:00 1 -799 796 4 2010
2011
differences
ReportNumber NIY OANCFY FQTR FYEARQ
Reference Published
2011-03-31 2011-05-06 16:00:00 1 310 390 1 2011
2011-06-30 2011-08-09 08:00:00 1 155 340 2 2011
2011-09-30 2011-11-09 00:00:00 1 -71 492 3 2011
2011-12-31 2012-03-06 00:00:00 1 -6119 563 4 2011
2012
differences
ReportNumber NIY OANCFY FQTR FYEARQ
Reference Published
2012-03-31 2012-05-05 16:00:00 1 42 540 1 2012
2012-06-30 2012-08-02 16:00:00 1 -336 459 2 2012
2012-09-30 2012-11-06 16:00:00 1 -381 786 3 2012
2012-12-31 2013-03-05 00:00:00 1 456 923 4 2012
The problem with this solution is that it only works for ReportNumber == 1.
Then I use pd.concat to flatten it back into one frame:
pd.concat([ft[f] for f in ft])
Any suggestions?
Your loop calls cftmp.dropna(), which drops the data associated with report numbers other than 1; in your example, those rows have NaN for OANCFY.
To avoid the loop entirely, you can do something like this: take the first observation of each group with head(1) and the deltas with diff(), then concat the two together.
In [71]: newdf = pd.concat((df.groupby(['FYEARQ', 'ReportNumber']).head(1),
                            df.groupby(['FYEARQ', 'ReportNumber']).diff().dropna()))\
                   .reset_index()\
                   .sort('Reference')\
                   .dropna(subset=['OANCFY'])\
                   .reset_index(drop=True)\
                   .fillna(method='pad')

In [72]: newdf['FQTR'] = newdf.FQTR.groupby(newdf.FYEARQ).cumsum()

In [73]: print(newdf)
Reference Published FQTR FYEARQ NIY OANCFY ReportNumber
0 2007-12-31 2008-02-21 00:00:00 4 2007 3131 3073 1
1 2008-03-31 2008-05-08 00:00:00 1 2008 1189 1482 1
2 2008-06-30 2008-08-07 00:00:00 2 2008 1107 1011 1
3 2008-09-30 2008-11-06 00:00:00 3 2008 1106 1270 1
4 2009-03-31 2009-05-07 16:00:00 1 2009 942 1441 1
5 2009-06-30 2009-08-06 00:00:00 2 2009 806 1576 1
6 2009-09-30 2009-11-07 00:00:00 3 2009 710 1406 1
7 2009-12-31 2010-02-24 16:00:00 4 2009 723 1175 1
8 2010-03-31 2010-05-07 16:00:00 1 2010 677 1172 1
9 2010-06-30 2010-08-06 08:00:00 2 2010 715 1269 1
10 2010-09-30 2010-11-08 16:00:00 3 2010 368 709 1
11 2010-12-31 2011-02-24 08:00:00 4 2010 -799 796 1
12 2011-03-31 2011-05-06 16:00:00 1 2011 310 390 1
13 2011-06-30 2011-08-09 08:00:00 2 2011 155 340 1
14 2011-09-30 2011-11-09 00:00:00 3 2011 -71 492 1
15 2011-12-31 2012-03-06 00:00:00 4 2011 -6119 563 1
16 2012-03-31 2012-05-05 16:00:00 1 2012 42 540 1
17 2012-06-30 2012-08-02 16:00:00 2 2012 -336 459 1
18 2012-09-30 2012-11-06 16:00:00 3 2012 -381 786 1
19 2012-12-31 2013-03-05 00:00:00 4 2012 456 923 1
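The head(1)-plus-diff() idea above can also be written without the explicit concat, by filling the first-quarter NaN that diff() produces with the original cumulative value (for Q1, year-to-date already equals the quarterly amount). A minimal sketch of my own, on a small slice of the sample data, using sort_values, which replaced the deprecated DataFrame.sort in newer pandas:

```python
import pandas as pd

# a small slice of the sample data: OANCFY is cumulative year-to-date
df = pd.DataFrame({
    'FYEARQ':       [2008, 2008, 2008, 2009, 2009],
    'FQTR':         [1, 2, 3, 1, 2],
    'ReportNumber': [1, 1, 1, 1, 1],
    'OANCFY':       [1482.0, 2493.0, 3763.0, 1441.0, 3017.0],
})

# diff within each (year, report) group, in quarter order; the first
# quarter's diff is NaN, so fall back to the cumulative value itself
g = df.sort_values('FQTR').groupby(['FYEARQ', 'ReportNumber'])['OANCFY']
df['OANCFY_qtr'] = g.diff().fillna(df['OANCFY'])

print(df['OANCFY_qtr'].tolist())
```

Note that this also fills NaN diffs caused by genuinely missing data with the cumulative value, so it only behaves sensibly on rows where nothing but the first quarter is NaN.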
I'm not entirely sure I'm reading your question the way you intend, but I'm going to assume the following:
- Difference quarter over quarter, within a given year, per report number.
That certainly looks like what you are looking for.
Here's how to get there:
- Sort the DataFrame by FQTR. This ensures the diff operations run in quarter order.
- Group by year and report number.
- Select the columns you want to difference and apply pd.DataFrame.diff.
Here's an example. First, some nice copy/paste-able data:
In [155]: from numpy import nan

In [156]: df = pd.DataFrame({'FQTR': {0: 4, 1: 1, 2: 1, 3: 2, 4: 2, 5: 3, 6: 3, 7: 4, 8: 4, 9: 4, 10: 1, 11: 2, 12: 3, 13: 4, 14: 1, 15: 2, 16: 3, 17: 4, 18: 1, 19: 1, 20: 2, 21: 2, 22: 3, 23: 3, 24: 4, 25: 4, 26: 1, 27: 1, 28: 2, 29: 3, 30: 4}, 'FYEARQ': {0: 2007, 1: 2008, 2: 2008, 3: 2008, 4: 2008, 5: 2008, 6: 2008, 7: 2008, 8: 2008, 9: 2008, 10: 2009, 11: 2009, 12: 2009, 13: 2009, 14: 2010, 15: 2010, 16: 2010, 17: 2010, 18: 2011, 19: 2011, 20: 2011, 21: 2011, 22: 2011, 23: 2011, 24: 2011, 25: 2011, 26: 2012, 27: 2012, 28: 2012, 29: 2012, 30: 2012}, 'NIY': {0: 3131.0, 1: 1189.0, 2: 1149.0, 3: 2296.0, 4: 2214.0, 5: 3402.0, 6: 3277.0, 7: nan, 8: 4202.0, 9: 4031.0, 10: 942.0, 11: 1748.0, 12: 2458.0, 13: 3181.0, 14: 677.0, 15: 1392.0, 16: 1760.0, 17: 961.0, 18: 310.0, 19: 319.0, 20: 465.0, 21: 443.0, 22: 394.0, 23: 411.0, 24: -5725.0, 25: -5754.0, 26: 42.0, 27: 10.0, 28: -294.0, 29: -675.0, 30: -219.0}, 'OANCFY': {0: 3073.0, 1: 1482.0, 2: nan, 3: 2493.0, 4: nan, 5: 3763.0, 6: nan, 7: 4959.0, 8: nan, 9: nan, 10: 1441.0, 11: 3017.0, 12: 4423.0, 13: 5598.0, 14: 1172.0, 15: 2441.0, 16: 3150.0, 17: 3946.0, 18: 390.0, 19: nan, 20: 730.0, 21: nan, 22: 1222.0, 23: nan, 24: 1785.0, 25: nan, 26: 540.0, 27: nan, 28: 999.0, 29: 1785.0, 30: 2708.0}, 'Published': {0: '2008-02-21 00:00:00', 1: '2008-05-08 00:00:00', 2: '2009-05-07 16:00:00', 3: '2008-08-07 00:00:00', 4: '2009-08-18 00:00:00', 5: '2008-11-06 00:00:00', 6: '2009-11-07 00:00:00', 7: '2009-02-17 16:00:00', 8: '2009-02-18 00:00:00', 9: '2010-03-21 00:00:00', 10: '2009-05-07 16:00:00', 11: '2009-08-06 00:00:00', 12: '2009-11-07 00:00:00', 13: '2010-02-24 16:00:00', 14: '2010-05-07 16:00:00', 15: '2010-08-06 08:00:00', 16: '2010-11-08 16:00:00', 17: '2011-02-24 08:00:00', 18: '2011-05-06 16:00:00', 19: '2012-08-16 16:00:00', 20: '2011-08-09 08:00:00', 21: '2012-08-16 16:00:00', 22: '2011-11-09 00:00:00', 23: '2012-11-06 16:00:00', 24: '2012-03-06 00:00:00', 25: '2013-03-05 00:00:00', 26: '2012-05-05 16:00:00', 27: '2012-08-16 16:00:00', 28: '2012-08-02 16:00:00', 29: '2012-11-06 16:00:00', 30: '2013-03-05 00:00:00'}, 'Reference': {0: '2007-12-31', 1: '2008-03-31', 2: '2008-03-31', 3: '2008-06-30', 4: '2008-06-30', 5: '2008-09-30', 6: '2008-09-30', 7: '2008-12-31', 8: '2008-12-31', 9: '2008-12-31', 10: '2009-03-31', 11: '2009-06-30', 12: '2009-09-30', 13: '2009-12-31', 14: '2010-03-31', 15: '2010-03-31', 16: '2010-09-30', 17: '2010-12-31', 18: '2011-03-31', 19: '2011-03-31', 20: '2011-06-30', 21: '2011-06-30', 22: '2011-09-30', 23: '2011-09-30', 24: '2011-12-31', 25: '2011-12-31', 26: '2012-03-31', 27: '2012-03-31', 28: '2012-06-30', 29: '2012-09-30', 30: '2012-12-31'}, 'ReportNumber': {0: 1, 1: 1, 2: 2, 3: 1, 4: 3, 5: 1, 6: 3, 7: 1, 8: 3, 9: 5, 10: 1, 11: 1, 12: 1, 13: 1, 14: 1, 15: 1, 16: 1, 17: 1, 18: 1, 19: 2, 20: 1, 21: 2, 22: 1, 23: 2, 24: 1, 25: 2, 26: 1, 27: 2, 28: 1, 29: 1, 30: 1}})
In [165]: df.head()
Out[165]:
Reference Published ReportNumber NIY OANCFY FQTR FYEARQ
0 2007-12-31 2008-02-21 00:00:00 1 3131 3073 4 2007
1 2008-03-31 2008-05-08 00:00:00 1 1189 1482 1 2008
2 2008-03-31 2009-05-07 16:00:00 2 1149 NaN 1 2008
3 2008-06-30 2008-08-07 00:00:00 1 2296 2493 2 2008
4 2008-06-30 2009-08-18 00:00:00 3 2214 NaN 2 2008
Now apply the above logic:
In [166]: diffs = df.sort('FQTR').groupby(['FYEARQ', 'ReportNumber'])[['NIY', 'OANCFY']].apply(pd.DataFrame.diff)
Rename the resulting columns:
In [177]: diffs.columns = ['%s_diff' % col for col in diffs.columns]
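As an aside, pandas also provides DataFrame.add_suffix, which does the same renaming in one call. A minimal sketch on a dummy frame standing in for the diffs result:

```python
import pandas as pd

# dummy stand-in for the diffs result
diffs = pd.DataFrame({'NIY': [1107.0], 'OANCFY': [1011.0]})
diffs = diffs.add_suffix('_diff')

print(list(diffs.columns))
```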
and, if you like, join the diff columns back onto the original df:
In [180]: with_diffs = pd.concat([df, diffs], axis=1)
Leaving you with something close to what you want:
In [182]: with_diffs[['Published', 'ReportNumber', 'NIY', 'FQTR', 'FYEARQ', 'NIY_diff']]
Out[182]:
Published ReportNumber NIY FQTR FYEARQ NIY_diff
0 2008-02-21 00:00:00 1 3131 4 2007 NaN
1 2008-05-08 00:00:00 1 1189 1 2008 NaN
2 2009-05-07 16:00:00 2 1149 1 2008 NaN
3 2008-08-07 00:00:00 1 2296 2 2008 1107
4 2009-08-18 00:00:00 3 2214 2 2008 NaN
5 2008-11-06 00:00:00 1 3402 3 2008 1106
6 2009-11-07 00:00:00 3 3277 3 2008 1063
7 2009-02-17 16:00:00 1 NaN 4 2008 NaN
8 2009-02-18 00:00:00 3 4202 4 2008 925
9 2010-03-21 00:00:00 5 4031 4 2008 NaN
10 2009-05-07 16:00:00 1 942 1 2009 NaN
11 2009-08-06 00:00:00 1 1748 2 2009 806
12 2009-11-07 00:00:00 1 2458 3 2009 710
13 2010-02-24 16:00:00 1 3181 4 2009 723
14 2010-05-07 16:00:00 1 677 1 2010 NaN
15 2010-08-06 08:00:00 1 1392 2 2010 715
16 2010-11-08 16:00:00 1 1760 3 2010 368
17 2011-02-24 08:00:00 1 961 4 2010 -799
18 2011-05-06 16:00:00 1 310 1 2011 NaN
19 2012-08-16 16:00:00 2 319 1 2011 NaN
20 2011-08-09 08:00:00 1 465 2 2011 155
21 2012-08-16 16:00:00 2 443 2 2011 124
22 2011-11-09 00:00:00 1 394 3 2011 -71
23 2012-11-06 16:00:00 2 411 3 2011 -32
24 2012-03-06 00:00:00 1 -5725 4 2011 -6119
25 2013-03-05 00:00:00 2 -5754 4 2011 -6165
26 2012-05-05 16:00:00 1 42 1 2012 NaN
27 2012-08-16 16:00:00 2 10 1 2012 NaN
28 2012-08-02 16:00:00 1 -294 2 2012 -336
29 2012-11-06 16:00:00 1 -675 3 2012 -381
30 2013-03-05 00:00:00 1 -219 4 2012 456
There's a bit more cleanup to be done (e.g. replacing the NaN values if you want), but that's left as an exercise!
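For completeness, one way to do that cleanup, sketched on a small slice of the sample columns and assuming the desired fill for a first quarter's NaN diff is its own cumulative value (year-to-date through Q1 equals the Q1 amount):

```python
import pandas as pd

df = pd.DataFrame({
    'FYEARQ':       [2011, 2011, 2011, 2011],
    'FQTR':         [1, 2, 1, 2],
    'ReportNumber': [1, 1, 2, 2],
    'NIY':          [310.0, 465.0, 319.0, 443.0],
})

# same recipe as above: sort by quarter, group, diff
diffs = df.sort_values('FQTR').groupby(['FYEARQ', 'ReportNumber'])[['NIY']].diff()
diffs.columns = ['%s_diff' % c for c in diffs.columns]

with_diffs = df.copy()
with_diffs['NIY_diff'] = diffs['NIY_diff']  # aligns on the index

# a first quarter has no previous quarter, so its NaN diff is
# replaced by the cumulative value itself
with_diffs['NIY_diff'] = with_diffs['NIY_diff'].fillna(with_diffs['NIY'])

print(with_diffs['NIY_diff'].tolist())
```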