Using a function on a groupby object to add a row to each group
I have a fairly large dataset, but for reproducibility, let's say I have the following multi-index dataframe:
import numpy as np
import pandas as pd

arrays = [['bar', 'bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'foo', 'qux', 'qux'],
          ['one', 'one', 'two', 'one', 'two', 'one', 'two', 'two', 'one', 'two']]
tuples = list(zip(*arrays))
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
a = pd.DataFrame(np.random.random((10,)), index=index)
a[1] = pd.date_range('2017-07-02', periods=10, freq='5min')
a
Out[68]:
                     0                   1
first second
bar   one     0.705488 2017-07-02 00:00:00
      one     0.715645 2017-07-02 00:05:00
      two     0.194648 2017-07-02 00:10:00
baz   one     0.129729 2017-07-02 00:15:00
      two     0.449889 2017-07-02 00:20:00
foo   one     0.031531 2017-07-02 00:25:00
      two     0.320757 2017-07-02 00:30:00
      two     0.876243 2017-07-02 00:35:00
qux   one     0.443682 2017-07-02 00:40:00
      two     0.802774 2017-07-02 00:45:00
I want to add the current timestamp as a new row for each group identified by a combination of the first and second index levels (e.g. bar-one, bar-two, etc.).
What I've done:
A function that adds a timestamp row to each group:
def myfunction(g, now):
    g.loc[g.shape[0], 1] = now  # current timestamp
    return g
Applying the function to a groupby object:
# current timestamp (pd.datetime was removed in pandas 2.0, so use pd.Timestamp)
now = pd.Timestamp.now()
a = a.reset_index().groupby(['first', 'second']).apply(lambda x: myfunction(x, now))
This returns:
                first second         0                       1
first second
bar   one    0    bar    one  0.705488 2017-07-02 00:00:00.000
             1    bar    one  0.715645 2017-07-02 00:05:00.000
             2    NaN    NaN       NaN 2017-07-02 02:05:06.442
      two    2    bar    two  0.194648 2017-07-02 00:10:00.000
             1    NaN    NaN       NaN 2017-07-02 02:05:06.442
baz   one    3    baz    one  0.129729 2017-07-02 00:15:00.000
             1    NaN    NaN       NaN 2017-07-02 02:05:06.442
      two    4    baz    two  0.449889 2017-07-02 00:20:00.000
             1    NaN    NaN       NaN 2017-07-02 02:05:06.442
foo   one    5    foo    one  0.031531 2017-07-02 00:25:00.000
             1    NaN    NaN       NaN 2017-07-02 02:05:06.442
      two    6    foo    two  0.320757 2017-07-02 00:30:00.000
             7    foo    two  0.876243 2017-07-02 00:35:00.000
             2    NaN    NaN       NaN 2017-07-02 02:05:06.442
qux   one    8    qux    one  0.443682 2017-07-02 00:40:00.000
             1    NaN    NaN       NaN 2017-07-02 02:05:06.442
      two    9    qux    two  0.802774 2017-07-02 00:45:00.000
             1    NaN    NaN       NaN 2017-07-02 02:05:06.442
I can't figure out why the new index level was introduced. However, I can get rid of it and ultimately get what I want:
a = a.reset_index(level=2).drop(['level_2', 'first', 'second'], axis=1).loc[:, [0, 1]]
                     0                       1
first second
bar   one     0.705488 2017-07-02 00:00:00.000
      one     0.715645 2017-07-02 00:05:00.000
      one          NaN 2017-07-02 02:05:06.442
      two     0.194648 2017-07-02 00:10:00.000
      two          NaN 2017-07-02 02:05:06.442
baz   one     0.129729 2017-07-02 00:15:00.000
      one          NaN 2017-07-02 02:05:06.442
      two     0.449889 2017-07-02 00:20:00.000
      two          NaN 2017-07-02 02:05:06.442
foo   one     0.031531 2017-07-02 00:25:00.000
      one          NaN 2017-07-02 02:05:06.442
      two     0.320757 2017-07-02 00:30:00.000
      two     0.876243 2017-07-02 00:35:00.000
      two          NaN 2017-07-02 02:05:06.442
qux   one     0.443682 2017-07-02 00:40:00.000
      one          NaN 2017-07-02 02:05:06.442
      two     0.802774 2017-07-02 00:45:00.000
      two          NaN 2017-07-02 02:05:06.442
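A likely explanation for the extra index level, sketched on a toy frame: when the applied function returns a frame whose index differs from the group it received, groupby(...).apply prepends the group keys as outer index levels (the default group_keys=True), and the innermost level is simply each group's own row labels plus the newly added one. The names key, val, ts and add_row below are made up for illustration, and the timestamp is fixed so the result is reproducible.

```python
import pandas as pd

# Toy frame; "key", "val" and "ts" are illustrative names only.
df = pd.DataFrame({
    "key": ["x", "x", "y"],
    "val": [0.1, 0.2, 0.3],
    "ts": pd.date_range("2017-07-02", periods=3, freq="5min"),
})
now = pd.Timestamp("2017-07-02 02:05:06")  # fixed stand-in for the current time


def add_row(g, stamp):
    g = g.copy()  # work on a copy rather than the group handed over by pandas
    g.loc[g.index.max() + 1, "ts"] = stamp  # enlarge with a label not yet used in g
    return g


cols = ["val", "ts"]

# Default group_keys=True: the group key becomes an outer index level.
with_keys = df.groupby("key")[cols].apply(add_row, stamp=now)

# group_keys=False: no extra level, but the new rows no longer say which group they belong to.
without_keys = df.groupby("key", group_keys=False)[cols].apply(add_row, stamp=now)

# Dropping only the innermost level keeps the group label on every row.
cleaned = with_keys.reset_index(level=-1, drop=True)
```

Under these assumptions, with_keys has a two-level index while cleaned is indexed by key alone, which is the same cleanup as the reset_index step above without having to drop columns afterwards.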
Question:
I am wondering if there is a more elegant, more idiomatic pandas way to do this: adding a new row to each group and (although not shown here) conditionally padding the fields of that new row other than the timestamp.
You can group on the index levels first to build the one extra row that each group needs, then concatenate those rows with the original frame and sort the index.
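import datetime as dt

# One extra row per (first, second) group: NaN in column 0, the current time in column 1
extra = a.groupby(level=[0, 1]).first()
extra[0] = np.nan
extra[1] = dt.datetime.now()

pd.concat([a, extra]).sort_index()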
Out[538]:
                     0                          1
first second
bar   one     0.587648 2017-07-02 00:00:00.000000
      one     0.974524 2017-07-02 00:05:00.000000
      one          NaN 2017-07-02 15:18:57.503371
      two     0.555171 2017-07-02 00:10:00.000000
      two          NaN 2017-07-02 15:18:57.503371
baz   one     0.832874 2017-07-02 00:15:00.000000
      one          NaN 2017-07-02 15:18:57.503371
      two     0.956891 2017-07-02 00:20:00.000000
      two          NaN 2017-07-02 15:18:57.503371
foo   one     0.872959 2017-07-02 00:25:00.000000
      one          NaN 2017-07-02 15:18:57.503371
      two     0.056546 2017-07-02 00:30:00.000000
      two     0.359184 2017-07-02 00:35:00.000000
      two          NaN 2017-07-02 15:18:57.503371
qux   one     0.301327 2017-07-02 00:40:00.000000
      one          NaN 2017-07-02 15:18:57.503371
      two     0.891815 2017-07-02 00:45:00.000000
      two          NaN 2017-07-02 15:18:57.503371
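For the conditional padding mentioned in the question, the same concat idea can be extended. This is only a sketch under assumptions: the question does not say what the padding rule is, so the rule used here (carry a group's last value of column 0 into the new row only when it exceeds 0.5) is hypothetical, as are the small sample frame and the fixed timestamp.

```python
import numpy as np
import pandas as pd

# Small deterministic stand-in for the frame in the question.
index = pd.MultiIndex.from_tuples(
    [("bar", "one"), ("bar", "one"), ("bar", "two"), ("baz", "one")],
    names=["first", "second"],
)
a = pd.DataFrame(
    {0: [0.7, 0.9, 0.2, 0.4],
     1: pd.date_range("2017-07-02", periods=4, freq="5min")},
    index=index,
)
now = pd.Timestamp("2017-07-02 15:18:57")  # fixed stand-in for the current time

# One row per group, seeded with that group's last observed values.
extra = a.groupby(level=["first", "second"]).last()

# Hypothetical padding rule: keep the last value of column 0 only if it is above 0.5,
# otherwise leave the field as NaN.
extra[0] = extra[0].where(extra[0] > 0.5)

# Every new row gets the same timestamp.
extra[1] = now

result = pd.concat([a, extra]).sort_index()
```

Because the extra rows are built as an ordinary frame before concatenating, any vectorised rule (where, mask, fillna and so on) can be swapped in for the padding step without touching the rest.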