Using a function on a groupby object to add a row to each group

I have a fairly large dataset, but for reproducibility, let's say I have the following multi-index dataframe:

import numpy as np
import pandas as pd

arrays = [['bar', 'bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'foo', 'qux', 'qux'],
          ['one', 'one', 'two', 'one', 'two', 'one', 'two', 'two', 'one', 'two']]
tuples = list(zip(*arrays))
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
a = pd.DataFrame(np.random.random((10,)), index=index)
a[1] = pd.date_range('2017-07-02', periods=10, freq='5min')

a
Out[68]: 
                     0                   1
first second                              
bar   one     0.705488 2017-07-02 00:00:00
      one     0.715645 2017-07-02 00:05:00
      two     0.194648 2017-07-02 00:10:00
baz   one     0.129729 2017-07-02 00:15:00
      two     0.449889 2017-07-02 00:20:00
foo   one     0.031531 2017-07-02 00:25:00
      two     0.320757 2017-07-02 00:30:00
      two     0.876243 2017-07-02 00:35:00
qux   one     0.443682 2017-07-02 00:40:00
      two     0.802774 2017-07-02 00:45:00

      

I want to add the current timestamp as a new row for each group identified by the combination of the first and second indices (e.g. bar-one, bar-two, etc.).

What I've done:

The function that adds a timestamp row to each group:

def myfunction(g, now):
    g.loc[g.shape[0], 1] = now # current timestamp
    return g
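Applied to a single group, the function simply appends one row whose only populated field is the timestamp. A minimal stand-alone sketch, using the first two rows of the example data above as a hypothetical group:

import pandas as pd

# one group as it looks inside the groupby (after reset_index)
g = pd.DataFrame({'first': ['bar', 'bar'],
                  'second': ['one', 'one'],
                  0: [0.705488, 0.715645],
                  1: pd.to_datetime(['2017-07-02 00:00:00', '2017-07-02 00:05:00'])})

now = pd.Timestamp.now()
g = myfunction(g, now)  # .loc with a new label (here 2) appends a row; the other columns become NaN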

      

Then I apply the function to the groupby object:

# current timestamp
now = pd.Timestamp.now()

a = a.reset_index().groupby(['first', 'second']).apply(lambda x: myfunction(x, now))

      

This returns:

               first second         0                       1
first second                                                 
bar   one    0   bar    one  0.705488 2017-07-02 00:00:00.000
             1   bar    one  0.715645 2017-07-02 00:05:00.000
             2   NaN    NaN       NaN 2017-07-02 02:05:06.442
      two    2   bar    two  0.194648 2017-07-02 00:10:00.000
             1   NaN    NaN       NaN 2017-07-02 02:05:06.442
baz   one    3   baz    one  0.129729 2017-07-02 00:15:00.000
             1   NaN    NaN       NaN 2017-07-02 02:05:06.442
      two    4   baz    two  0.449889 2017-07-02 00:20:00.000
             1   NaN    NaN       NaN 2017-07-02 02:05:06.442
foo   one    5   foo    one  0.031531 2017-07-02 00:25:00.000
             1   NaN    NaN       NaN 2017-07-02 02:05:06.442
      two    6   foo    two  0.320757 2017-07-02 00:30:00.000
             7   foo    two  0.876243 2017-07-02 00:35:00.000
             2   NaN    NaN       NaN 2017-07-02 02:05:06.442
qux   one    8   qux    one  0.443682 2017-07-02 00:40:00.000
             1   NaN    NaN       NaN 2017-07-02 02:05:06.442
      two    9   qux    two  0.802774 2017-07-02 00:45:00.000
             1   NaN    NaN       NaN 2017-07-02 02:05:06.442

      

I can't figure out why the new index level was introduced, but I can get rid of it and ultimately get what I want:

a = a.reset_index(level=2, drop=True).loc[:, [0, 1]]

                     0                       1
first second                                  
bar   one     0.705488 2017-07-02 00:00:00.000
      one     0.715645 2017-07-02 00:05:00.000
      one          NaN 2017-07-02 02:05:06.442
      two     0.194648 2017-07-02 00:10:00.000
      two          NaN 2017-07-02 02:05:06.442
baz   one     0.129729 2017-07-02 00:15:00.000
      one          NaN 2017-07-02 02:05:06.442
      two     0.449889 2017-07-02 00:20:00.000
      two          NaN 2017-07-02 02:05:06.442
foo   one     0.031531 2017-07-02 00:25:00.000
      one          NaN 2017-07-02 02:05:06.442
      two     0.320757 2017-07-02 00:30:00.000
      two     0.876243 2017-07-02 00:35:00.000
      two          NaN 2017-07-02 02:05:06.442
qux   one     0.443682 2017-07-02 00:40:00.000
      one          NaN 2017-07-02 02:05:06.442
      two     0.802774 2017-07-02 00:45:00.000
      two          NaN 2017-07-02 02:05:06.442

      

Question:

I am wondering if there is a more elegant, more pandas-idiomatic way to do this: adding a new row to each group and (although not shown here) conditionally filling in the other fields of that new row besides the timestamp field.



2 answers


You can group on the index levels to build the extra row you need for each group, then concatenate it with the original frame and sort the result.



import datetime as dt

(
    pd.concat([a,
               a.groupby(level=[0, 1]).first()
                # one template row per group; 'expand' turns the returned list into columns 0 and 1
                .apply(lambda x: [np.nan, dt.datetime.now()],
                       axis=1, result_type='expand')])
    .sort_index()
)

Out[538]: 
                     0                          1
first second                                     
bar   one     0.587648 2017-07-02 00:00:00.000000
      one     0.974524 2017-07-02 00:05:00.000000
      one          NaN 2017-07-02 15:18:57.503371
      two     0.555171 2017-07-02 00:10:00.000000
      two          NaN 2017-07-02 15:18:57.503371
baz   one     0.832874 2017-07-02 00:15:00.000000
      one          NaN 2017-07-02 15:18:57.503371
      two     0.956891 2017-07-02 00:20:00.000000
      two          NaN 2017-07-02 15:18:57.503371
foo   one     0.872959 2017-07-02 00:25:00.000000
      one          NaN 2017-07-02 15:18:57.503371
      two     0.056546 2017-07-02 00:30:00.000000
      two     0.359184 2017-07-02 00:35:00.000000
      two          NaN 2017-07-02 15:18:57.503371
qux   one     0.301327 2017-07-02 00:40:00.000000
      one          NaN 2017-07-02 15:18:57.503371
      two     0.891815 2017-07-02 00:45:00.000000
      two          NaN 2017-07-02 15:18:57.503371

      



Or simply:

b = a.groupby(level=[0, 1]).max()   # one row per (first, second) group, same MultiIndex
b[:] = np.nan, pd.Timestamp.now()   # overwrite: NaN for column 0, current timestamp for column 1
a = pd.concat([a, b]).sort_index()  # append the new rows and sort

      



Grouping by the index levels retains the index structure, so the result is easier to manage.
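A quick check of that claim, reusing the original frame a from the question (a sketch):

b = a.groupby(level=[0, 1]).max()
print(b.index.names)  # the same level names as a: 'first', 'second'
print(len(b))         # 8 -- one row per (first, second) combination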
