Group, apply a function, then flatten back to a DataFrame in pandas (Python)

I have some data where some of the column values are year-to-date (cumulative) amounts. I want to change these columns to reflect quarterly differences rather than cumulative amounts. The data looks like this:

                                ReportNumber   NIY  OANCFY  FQTR  FYEARQ
Reference  Published                                                    
2007-12-31 2008-02-21 00:00:00             1  3131    3073     4    2007
2008-03-31 2008-05-08 00:00:00             1  1189    1482     1    2008
           2009-05-07 16:00:00             2  1149     NaN     1    2008
2008-06-30 2008-08-07 00:00:00             1  2296    2493     2    2008
           2009-08-18 00:00:00             3  2214     NaN     2    2008
2008-09-30 2008-11-06 00:00:00             1  3402    3763     3    2008
           2009-11-07 00:00:00             3  3277     NaN     3    2008
2008-12-31 2009-02-17 16:00:00             1   NaN    4959     4    2008
           2009-02-18 00:00:00             3  4202     NaN     4    2008
           2010-03-21 00:00:00             5  4031     NaN     4    2008
2009-03-31 2009-05-07 16:00:00             1   942    1441     1    2009
2009-06-30 2009-08-06 00:00:00             1  1748    3017     2    2009
2009-09-30 2009-11-07 00:00:00             1  2458    4423     3    2009
2009-12-31 2010-02-24 16:00:00             1  3181    5598     4    2009
2010-03-31 2010-05-07 16:00:00             1   677    1172     1    2010
2010-06-30 2010-08-06 08:00:00             1  1392    2441     2    2010
2010-09-30 2010-11-08 16:00:00             1  1760    3150     3    2010
2010-12-31 2011-02-24 08:00:00             1   961    3946     4    2010
2011-03-31 2011-05-06 16:00:00             1   310     390     1    2011
           2012-08-16 16:00:00             2   319     NaN     1    2011
2011-06-30 2011-08-09 08:00:00             1   465     730     2    2011
           2012-08-16 16:00:00             2   443     NaN     2    2011
2011-09-30 2011-11-09 00:00:00             1   394    1222     3    2011
           2012-11-06 16:00:00             2   411     NaN     3    2011
2011-12-31 2012-03-06 00:00:00             1 -5725    1785     4    2011
           2013-03-05 00:00:00             2 -5754     NaN     4    2011
2012-03-31 2012-05-05 16:00:00             1    42     540     1    2012
           2012-08-16 16:00:00             2    10     NaN     1    2012
2012-06-30 2012-08-02 16:00:00             1  -294     999     2    2012
2012-09-30 2012-11-06 16:00:00             1  -675    1785     3    2012
2012-12-31 2013-03-05 00:00:00             1  -219    2708     4    2012

      

So I need to compute the differences between quarters (FQTR) within a given FYEARQ, ordered by 'FQTR' and then by the 'Published' index, and end up with a single frame. My attempt below works until multiple values are given for an item ('NIY', 'OANCFY') for the same reference date.

import pandas as pd

ft = dict()
for group_name, subdf in cftmp.dropna().drop_duplicates().groupby('FYEARQ'):
    # head(1) keeps the first quarter's YTD value; diff() gives the deltas
    tmp = pd.concat([subdf.head(1), subdf.diff()]).dropna()
    tmp['FQTR'] = subdf['FQTR']
    tmp['FYEARQ'] = subdf['FYEARQ']
    tmp['ReportNumber'] = subdf['ReportNumber']
    ft[group_name] = tmp
    print(group_name)
    print('differences')
    print(tmp)
    print(' ')

      

The pd.concat of head(1) and diff() handles the differences between quarters ('FQTR'). It returns:

2007
differences
                       ReportNumber   NIY  OANCFY  FQTR  FYEARQ
Reference  Published                                           
2007-12-31 2008-02-21             1  3131    3073     4    2007

2008
differences
                       ReportNumber   NIY  OANCFY  FQTR  FYEARQ
Reference  Published                                           
2008-03-31 2008-05-08             1  1189    1482     1    2008
2008-06-30 2008-08-07             1  1107    1011     2    2008
2008-09-30 2008-11-06             1  1106    1270     3    2008

2009
differences
                                ReportNumber  NIY  OANCFY  FQTR  FYEARQ
Reference  Published                                                   
2009-03-31 2009-05-07 16:00:00             1  942    1441     1    2009
2009-06-30 2009-08-06 00:00:00             1  806    1576     2    2009
2009-09-30 2009-11-07 00:00:00             1  710    1406     3    2009
2009-12-31 2010-02-24 16:00:00             1  723    1175     4    2009

2010
differences
                                ReportNumber  NIY  OANCFY  FQTR  FYEARQ
Reference  Published                                                   
2010-03-31 2010-05-07 16:00:00             1  677    1172     1    2010
2010-06-30 2010-08-06 08:00:00             1  715    1269     2    2010
2010-09-30 2010-11-08 16:00:00             1  368     709     3    2010
2010-12-31 2011-02-24 08:00:00             1 -799     796     4    2010

2011
differences
                                ReportNumber   NIY  OANCFY  FQTR  FYEARQ
Reference  Published                                                    
2011-03-31 2011-05-06 16:00:00             1   310     390     1    2011
2011-06-30 2011-08-09 08:00:00             1   155     340     2    2011
2011-09-30 2011-11-09 00:00:00             1   -71     492     3    2011
2011-12-31 2012-03-06 00:00:00             1 -6119     563     4    2011

2012
differences
                                ReportNumber  NIY  OANCFY  FQTR  FYEARQ
Reference  Published                                                   
2012-03-31 2012-05-05 16:00:00             1   42     540     1    2012
2012-06-30 2012-08-02 16:00:00             1 -336     459     2    2012
2012-09-30 2012-11-06 16:00:00             1 -381     786     3    2012
2012-12-31 2013-03-05 00:00:00             1  456     923     4    2012

      

The problem with this approach is that it only works for rows where ReportNumber == 1.

I then use pd.concat to flatten the dict back into one frame:

pd.concat([ft[f] for f in ft])
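For concreteness, here is what that flatten step does, sketched with hypothetical stand-in frames for `ft` (made-up numbers, default integer index instead of my real Reference/Published MultiIndex):

```python
import pandas as pd

# Hypothetical stand-ins for the per-year frames collected in `ft`
ft = {
    2008: pd.DataFrame({'NIY': [1189, 1107, 1106], 'FQTR': [1, 2, 3]}),
    2009: pd.DataFrame({'NIY': [942, 806], 'FQTR': [1, 2]}),
}

# Concatenating the dict's values stacks the groups back into one frame;
# passing the dict itself instead would add the year as an extra index level.
flat = pd.concat(ft.values(), ignore_index=True)
print(len(flat))  # 5
```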

      

Any suggestions?

2 answers


Your version calls cftmp.dropna(), which drops the rows associated with report numbers other than 1; in your example, those rows have NaN for OANCFY.

To avoid the loop entirely, you can take the first observation of each group with head(1) and the deltas with diff(), then concat the two together.
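The core of that trick can be seen on a single year's YTD column (using the 2008 OANCFY figures from the question):

```python
import pandas as pd

# Year-to-date values for one year (the 2008 OANCFY figures)
ytd = pd.DataFrame({'OANCFY': [1482.0, 2493.0, 3763.0, 4959.0]})

# head(1) keeps Q1's YTD value (already a quarterly amount);
# diff() turns the remaining YTD values into quarter-on-quarter deltas.
quarterly = pd.concat([ytd.head(1), ytd.diff().dropna()])
print(quarterly['OANCFY'].tolist())  # [1482.0, 1011.0, 1270.0, 1196.0]
```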



In [71]:

grp = df.groupby(['FYEARQ', 'ReportNumber'])

newdf = pd.concat((grp.head(1),
                   grp.diff().dropna())).reset_index()\
                                        .sort_values('Reference')\
                                        .dropna(subset=['OANCFY'])\
                                        .reset_index(drop=True)\
                                        .ffill()

# diff() also differenced FQTR (leaving 1s), so a cumsum within
# each year restores the actual quarter numbers
newdf['FQTR'] = newdf.FQTR.groupby(newdf.FYEARQ).cumsum()

print(newdf)

     Reference           Published  FQTR  FYEARQ   NIY  OANCFY  ReportNumber
0   2007-12-31 2008-02-21 00:00:00     4    2007  3131    3073             1
1   2008-03-31 2008-05-08 00:00:00     1    2008  1189    1482             1
2   2008-06-30 2008-08-07 00:00:00     2    2008  1107    1011             1
3   2008-09-30 2008-11-06 00:00:00     3    2008  1106    1270             1
4   2009-03-31 2009-05-07 16:00:00     1    2009   942    1441             1
5   2009-06-30 2009-08-06 00:00:00     2    2009   806    1576             1
6   2009-09-30 2009-11-07 00:00:00     3    2009   710    1406             1
7   2009-12-31 2010-02-24 16:00:00     4    2009   723    1175             1
8   2010-03-31 2010-05-07 16:00:00     1    2010   677    1172             1
9   2010-06-30 2010-08-06 08:00:00     2    2010   715    1269             1
10  2010-09-30 2010-11-08 16:00:00     3    2010   368     709             1
11  2010-12-31 2011-02-24 08:00:00     4    2010  -799     796             1
12  2011-03-31 2011-05-06 16:00:00     1    2011   310     390             1
13  2011-06-30 2011-08-09 08:00:00     2    2011   155     340             1
14  2011-09-30 2011-11-09 00:00:00     3    2011   -71     492             1
15  2011-12-31 2012-03-06 00:00:00     4    2011 -6119     563             1
16  2012-03-31 2012-05-05 16:00:00     1    2012    42     540             1
17  2012-06-30 2012-08-02 16:00:00     2    2012  -336     459             1
18  2012-09-30 2012-11-06 16:00:00     3    2012  -381     786             1
19  2012-12-31 2013-03-05 00:00:00     4    2012   456     923             1

      


I'm not entirely sure from your description exactly what you want, but I'm going to assume the following:

  • Difference quarter over quarter, within a given year, per report number.

That certainly looks like what you're after.

Here's how to go about it:

  • Sort the DataFrame by FQTR. This ensures the diff operations are done quarter by quarter.
  • Group by year and report number.
  • Select the columns you want to difference and apply pd.DataFrame.diff.

Here's an example. First, some copy/pasteable data:

In [155]: import pandas as pd; from numpy import nan

In [156]: df = pd.DataFrame(
    {'FQTR': {0: 4, 1: 1, 2: 1, 3: 2, 4: 2, 5: 3, 6: 3, 7: 4, 8: 4, 9: 4, 10: 1, 11: 2, 12: 3, 13: 4, 14: 1, 15: 2, 16: 3, 17: 4, 18: 1, 19: 1, 20: 2, 21: 2, 22: 3, 23: 3, 24: 4, 25: 4, 26: 1, 27: 1, 28: 2, 29: 3, 30: 4},
     'FYEARQ': {0: 2007, 1: 2008, 2: 2008, 3: 2008, 4: 2008, 5: 2008, 6: 2008, 7: 2008, 8: 2008, 9: 2008, 10: 2009, 11: 2009, 12: 2009, 13: 2009, 14: 2010, 15: 2010, 16: 2010, 17: 2010, 18: 2011, 19: 2011, 20: 2011, 21: 2011, 22: 2011, 23: 2011, 24: 2011, 25: 2011, 26: 2012, 27: 2012, 28: 2012, 29: 2012, 30: 2012},
     'NIY': {0: 3131.0, 1: 1189.0, 2: 1149.0, 3: 2296.0, 4: 2214.0, 5: 3402.0, 6: 3277.0, 7: nan, 8: 4202.0, 9: 4031.0, 10: 942.0, 11: 1748.0, 12: 2458.0, 13: 3181.0, 14: 677.0, 15: 1392.0, 16: 1760.0, 17: 961.0, 18: 310.0, 19: 319.0, 20: 465.0, 21: 443.0, 22: 394.0, 23: 411.0, 24: -5725.0, 25: -5754.0, 26: 42.0, 27: 10.0, 28: -294.0, 29: -675.0, 30: -219.0},
     'OANCFY': {0: 3073.0, 1: 1482.0, 2: nan, 3: 2493.0, 4: nan, 5: 3763.0, 6: nan, 7: 4959.0, 8: nan, 9: nan, 10: 1441.0, 11: 3017.0, 12: 4423.0, 13: 5598.0, 14: 1172.0, 15: 2441.0, 16: 3150.0, 17: 3946.0, 18: 390.0, 19: nan, 20: 730.0, 21: nan, 22: 1222.0, 23: nan, 24: 1785.0, 25: nan, 26: 540.0, 27: nan, 28: 999.0, 29: 1785.0, 30: 2708.0},
     'Published': {0: '2008-02-21 00:00:00', 1: '2008-05-08 00:00:00', 2: '2009-05-07 16:00:00', 3: '2008-08-07 00:00:00', 4: '2009-08-18 00:00:00', 5: '2008-11-06 00:00:00', 6: '2009-11-07 00:00:00', 7: '2009-02-17 16:00:00', 8: '2009-02-18 00:00:00', 9: '2010-03-21 00:00:00', 10: '2009-05-07 16:00:00', 11: '2009-08-06 00:00:00', 12: '2009-11-07 00:00:00', 13: '2010-02-24 16:00:00', 14: '2010-05-07 16:00:00', 15: '2010-08-06 08:00:00', 16: '2010-11-08 16:00:00', 17: '2011-02-24 08:00:00', 18: '2011-05-06 16:00:00', 19: '2012-08-16 16:00:00', 20: '2011-08-09 08:00:00', 21: '2012-08-16 16:00:00', 22: '2011-11-09 00:00:00', 23: '2012-11-06 16:00:00', 24: '2012-03-06 00:00:00', 25: '2013-03-05 00:00:00', 26: '2012-05-05 16:00:00', 27: '2012-08-16 16:00:00', 28: '2012-08-02 16:00:00', 29: '2012-11-06 16:00:00', 30: '2013-03-05 00:00:00'},
     'Reference': {0: '2007-12-31', 1: '2008-03-31', 2: '2008-03-31', 3: '2008-06-30', 4: '2008-06-30', 5: '2008-09-30', 6: '2008-09-30', 7: '2008-12-31', 8: '2008-12-31', 9: '2008-12-31', 10: '2009-03-31', 11: '2009-06-30', 12: '2009-09-30', 13: '2009-12-31', 14: '2010-03-31', 15: '2010-06-30', 16: '2010-09-30', 17: '2010-12-31', 18: '2011-03-31', 19: '2011-03-31', 20: '2011-06-30', 21: '2011-06-30', 22: '2011-09-30', 23: '2011-09-30', 24: '2011-12-31', 25: '2011-12-31', 26: '2012-03-31', 27: '2012-03-31', 28: '2012-06-30', 29: '2012-09-30', 30: '2012-12-31'},
     'ReportNumber': {0: 1, 1: 1, 2: 2, 3: 1, 4: 3, 5: 1, 6: 3, 7: 1, 8: 3, 9: 5, 10: 1, 11: 1, 12: 1, 13: 1, 14: 1, 15: 1, 16: 1, 17: 1, 18: 1, 19: 2, 20: 1, 21: 2, 22: 1, 23: 2, 24: 1, 25: 2, 26: 1, 27: 2, 28: 1, 29: 1, 30: 1}})
In [165]: df.head()
Out[165]: 
    Reference             Published  ReportNumber   NIY  OANCFY  FQTR  FYEARQ
0  2007-12-31   2008-02-21 00:00:00             1  3131    3073     4    2007
1  2008-03-31   2008-05-08 00:00:00             1  1189    1482     1    2008
2  2008-03-31   2009-05-07 16:00:00             2  1149     NaN     1    2008
3  2008-06-30   2008-08-07 00:00:00             1  2296    2493     2    2008
4  2008-06-30   2009-08-18 00:00:00             3  2214     NaN     2    2008

      

Now apply the above logic:



In [166]: diffs = df.sort_values('FQTR').groupby(['FYEARQ',
    'ReportNumber'])[['NIY', 'OANCFY']].apply(pd.DataFrame.diff)
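As an aside, the same per-group differencing can also be done with the groupby object's own .diff() instead of apply; a sketch on a hypothetical toy frame (made-up subset of the question's columns):

```python
import pandas as pd

# Hypothetical toy subset of the question's data
toy = pd.DataFrame({
    'FYEARQ': [2008, 2008, 2008, 2009, 2009],
    'FQTR':   [1, 2, 3, 1, 2],
    'NIY':    [1189.0, 2296.0, 3402.0, 942.0, 1748.0],
})

# .diff() on the groupby object differences within each year only;
# the first quarter of each year comes out as NaN.
deltas = toy.sort_values('FQTR').groupby('FYEARQ')['NIY'].diff()
print(deltas.dropna().tolist())  # [1107.0, 806.0, 1106.0]
```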

      

Rename the resulting columns:

In [177]: diffs.columns = ['%s_diff' % col for col in diffs.columns]

      

and, if you like, join the diff columns back onto the original df:

In [180]: with_diffs = pd.concat([df, diffs], axis=1)

      

Leaving you with something close to what you want:

In [182]: with_diffs[['Published', 'ReportNumber', 'NIY', 'FQTR', 'FYEARQ', 'NIY_diff']]
Out[182]: 
               Published  ReportNumber   NIY  FQTR  FYEARQ  NIY_diff
0    2008-02-21 00:00:00             1  3131     4    2007       NaN
1    2008-05-08 00:00:00             1  1189     1    2008       NaN
2    2009-05-07 16:00:00             2  1149     1    2008       NaN
3    2008-08-07 00:00:00             1  2296     2    2008      1107
4    2009-08-18 00:00:00             3  2214     2    2008       NaN
5    2008-11-06 00:00:00             1  3402     3    2008      1106
6    2009-11-07 00:00:00             3  3277     3    2008      1063
7    2009-02-17 16:00:00             1   NaN     4    2008       NaN
8    2009-02-18 00:00:00             3  4202     4    2008       925
9    2010-03-21 00:00:00             5  4031     4    2008       NaN
10   2009-05-07 16:00:00             1   942     1    2009       NaN
11   2009-08-06 00:00:00             1  1748     2    2009       806
12   2009-11-07 00:00:00             1  2458     3    2009       710
13   2010-02-24 16:00:00             1  3181     4    2009       723
14   2010-05-07 16:00:00             1   677     1    2010       NaN
15   2010-08-06 08:00:00             1  1392     2    2010       715
16   2010-11-08 16:00:00             1  1760     3    2010       368
17   2011-02-24 08:00:00             1   961     4    2010      -799
18   2011-05-06 16:00:00             1   310     1    2011       NaN
19   2012-08-16 16:00:00             2   319     1    2011       NaN
20   2011-08-09 08:00:00             1   465     2    2011       155
21   2012-08-16 16:00:00             2   443     2    2011       124
22   2011-11-09 00:00:00             1   394     3    2011       -71
23   2012-11-06 16:00:00             2   411     3    2011       -32
24   2012-03-06 00:00:00             1 -5725     4    2011     -6119
25   2013-03-05 00:00:00             2 -5754     4    2011     -6165
26   2012-05-05 16:00:00             1    42     1    2012       NaN
27   2012-08-16 16:00:00             2    10     1    2012       NaN
28   2012-08-02 16:00:00             1  -294     2    2012      -336
29   2012-11-06 16:00:00             1  -675     3    2012      -381
30   2013-03-05 00:00:00             1  -219     4    2012       456

      

There's a bit more cleanup to be done (e.g. dealing with the remaining NaN values, if you want), but that's left as an exercise!
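One such cleanup step, sketched on a hypothetical slice of with_diffs: a Q1 row has no previous quarter, so diff() left NaN, but its YTD value is already the quarterly amount and can fill the gap:

```python
import pandas as pd
import numpy as np

# Hypothetical stand-in for a few ReportNumber-1 rows of `with_diffs`
cleaned = pd.DataFrame({
    'FQTR': [1, 2, 3],
    'NIY': [1189.0, 2296.0, 3402.0],
    'NIY_diff': [np.nan, 1107.0, 1106.0],
})

# Fill the Q1 NaN from the original YTD column.
# (Only valid when the group's first row really is Q1.)
cleaned['NIY_diff'] = cleaned['NIY_diff'].fillna(cleaned['NIY'])
print(cleaned['NIY_diff'].tolist())  # [1189.0, 1107.0, 1106.0]
```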
