Strange behavior when trying to add a row to each group in a group by object
This question is about a function that behaves unpredictably when applied on two different data frames - more specifically, groupby objects. Either I am missing something that is clearly wrong, or there is a bug in pandas.
I wrote the following function to add a line to each group in a groupby object. This question is another question related to this feature.
def myfunction(g, now):
'''This function appends a row to each group and populates the DTM column value of that row with the current timestamp. Other columns of the new row will have NaN s.
g: a groupby object
now: current timestamp
returns a dataframe that has the current timestamp appended in the DTM column for each group
'''
g.loc[g.shape[0], 'DTM'] = now # Appending the current timestamp to a DTM column in each group
return g
We'll run two tests to test the function.
TEST 1
It works as expected in the dataframe a
in the linked question (demonstrated in the above question). Here's a slightly increased rerun for clarity (mostly copied from the linked question).
arrays = [['bar', 'bar','bar', 'baz', 'baz', 'foo', 'foo', 'foo', 'qux', 'qux'],
['one', 'one','two', 'one', 'two', 'one', 'two', 'two', 'one', 'two']]
tuples = list(zip(*arrays))
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
a = pd.DataFrame(np.random.random((10,)), index = index)
a
Out[50]:
0
first second
bar one 0.134379
one 0.967928
two 0.067502
baz one 0.182887
two 0.926932
foo one 0.806225
two 0.718322
two 0.932114
qux one 0.772494
two 0.141510
By applying this function,
a = a.reset_index().groupby(['first', 'second']).apply(lambda x: myfunction(x, now))
It adds a new line to each group. Added a new column DTM
as it was not in the original a
. The group is a pair first
- second
.
a
Out[52]:
first second 0 DTM
first second
bar one 0 bar one 0.134379 NaT
1 bar one 0.967928 NaT
2 NaN NaN NaN 2017-07-03 18:56:33.183
two 2 bar two 0.067502 NaT
1 NaN NaN NaN 2017-07-03 18:56:33.183
baz one 3 baz one 0.182887 NaT
1 NaN NaN NaN 2017-07-03 18:56:33.183
two 4 baz two 0.926932 NaT
1 NaN NaN NaN 2017-07-03 18:56:33.183
foo one 5 foo one 0.806225 NaT
1 NaN NaN NaN 2017-07-03 18:56:33.183
two 6 foo two 0.718322 NaT
7 foo two 0.932114 NaT
2 NaN NaN NaN 2017-07-03 18:56:33.183
qux one 8 qux one 0.772494 NaT
1 NaN NaN NaN 2017-07-03 18:56:33.183
two 9 qux two 0.141510 NaT
1 NaN NaN NaN 2017-07-03 18:56:33.183
Some clarifications,
a = a.reset_index(level = 2).drop(('level_2', 'first', 'second')).loc[:,(0,'DTM')]
this gives the final a
as,
a
Out[62]:
0 DTM
first second
bar one 0.371683 NaT
one 0.327870 NaT
one NaN 2017-07-03 18:56:33.183
two 0.048794 NaT
two NaN 2017-07-03 18:56:33.183
baz one 0.462747 NaT
one NaN 2017-07-03 18:56:33.183
two 0.758674 NaT
two NaN 2017-07-03 18:56:33.183
foo one 0.238607 NaT
one NaN 2017-07-03 18:56:33.183
two 0.156104 NaT
two 0.594270 NaT
two NaN 2017-07-03 18:56:33.183
qux one 0.091088 NaT
one NaN 2017-07-03 18:56:33.183
two 0.795864 NaT
two NaN 2017-07-03 18:56:33.183
So far so good. This is expected behavior. A new row was added to each pair of first
- second
, and the column of DTM
that row was filled with the current timestamp.
TEST 2
Surprisingly, I was unable to reproduce this behavior in the dataframe below df
. A group is a combination ID
-SEQ
This one df
can be reproduced using:
1. Copy below
C1 572 5/9/2017 10:13 PE
C1 572 5/9/2017 12:24 OK
C1 579 5/9/2017 10:19 PE
C1 579 5/9/2017 13:25 OK
C1 587 5/9/2017 10:20 PE
C1 587 5/9/2017 12:25 OK
C1 590 5/9/2017 10:21 PE
C1 590 5/9/2017 13:09 OK
C1 604 5/9/2017 10:38 PE
C1 604 5/9/2017 12:32 OK
C1 609 5/9/2017 10:39 PE
C1 609 5/9/2017 13:29 OK
C1 613 5/9/2017 10:39 PE
C1 613 5/9/2017 13:08 OK
C1 618 5/9/2017 10:40 PE
C1 618 5/9/2017 13:33 OK
C1 636 5/9/2017 10:54 PE
C1 636 5/9/2017 13:36 OK
C1 642 5/9/2017 10:55 PE
C1 642 5/9/2017 13:35 OK
C1 643 5/9/2017 10:56 PE
C1 643 5/9/2017 13:34 OK
C1 656 5/9/2017 10:55 PE
C1 656 5/9/2017 13:36 OK
C2 86 9/5/2016 19:45 PE
C2 86 9/6/2016 11:55 OK
C3 10 4/17/2017 12:23 PE
C3 10 4/17/2017 14:51 OK
C4 38 3/25/2017 10:35 PE
C4 38 3/25/2017 10:51 OK
2. And then after completing them,
df = pd.read_clipboard(sep = '[ ]{2,}')
df.columns = ['ID', 'SEQ', 'DTM', 'STATUS']
Setting up a multi-index
d = df.set_index(['ID', 'SEQ', 'DTM']) # I have three index levels this time in the original dataframe
What looks d
,
d
Out[40]:
STATUS
ID SEQ DTM
C1 572 5/9/2017 10:13 PE
5/9/2017 12:24 OK
579 5/9/2017 10:19 PE
5/9/2017 13:25 OK
587 5/9/2017 10:20 PE
5/9/2017 12:25 OK
590 5/9/2017 10:21 PE
5/9/2017 13:09 OK
604 5/9/2017 10:38 PE
5/9/2017 12:32 OK
609 5/9/2017 10:39 PE
5/9/2017 13:29 OK
613 5/9/2017 10:39 PE
5/9/2017 13:08 OK
618 5/9/2017 10:40 PE
5/9/2017 13:33 OK
636 5/9/2017 10:54 PE
5/9/2017 13:36 OK
642 5/9/2017 10:55 PE
5/9/2017 13:35 OK
643 5/9/2017 10:56 PE
5/9/2017 13:34 OK
656 5/9/2017 10:55 PE
5/9/2017 13:36 OK
C2 86 9/5/2016 19:45 PE
9/6/2016 11:55 OK
C3 10 4/17/2017 12:23 PE
4/17/2017 14:51 OK
C4 38 3/25/2017 10:35 PE
3/25/2017 10:51 OK
By applying this function,
dd = d.reset_index().groupby(['ID', 'SEQ']).apply(lambda x: myfunction(x, now)) # a group is a unique combination of ID-SEQ pairs
This returns, (note the fourth line)
dd
Out[37]:
ID SEQ DTM STATUS
ID SEQ
C1 572 0 C1 572.0 5/9/2017 10:13 PE
1 C1 572.0 5/9/2017 12:24 OK
2 NaN NaN 2017-07-03 18:56:33.183000 NaN
579 2 C1 579.0 2017-07-03 18:56:33.183000 PE
3 C1 579.0 5/9/2017 13:25 OK
587 4 C1 587.0 5/9/2017 10:20 PE
5 C1 587.0 5/9/2017 12:25 OK
2 NaN NaN 2017-07-03 18:56:33.183000 NaN
590 6 C1 590.0 5/9/2017 10:21 PE
7 C1 590.0 5/9/2017 13:09 OK
2 NaN NaN 2017-07-03 18:56:33.183000 NaN
604 8 C1 604.0 5/9/2017 10:38 PE
9 C1 604.0 5/9/2017 12:32 OK
2 NaN NaN 2017-07-03 18:56:33.183000 NaN
609 10 C1 609.0 5/9/2017 10:39 PE
11 C1 609.0 5/9/2017 13:29 OK
2 NaN NaN 2017-07-03 18:56:33.183000 NaN
613 12 C1 613.0 5/9/2017 10:39 PE
13 C1 613.0 5/9/2017 13:08 OK
2 NaN NaN 2017-07-03 18:56:33.183000 NaN
618 14 C1 618.0 5/9/2017 10:40 PE
15 C1 618.0 5/9/2017 13:33 OK
2 NaN NaN 2017-07-03 18:56:33.183000 NaN
636 16 C1 636.0 5/9/2017 10:54 PE
17 C1 636.0 5/9/2017 13:36 OK
2 NaN NaN 2017-07-03 18:56:33.183000 NaN
642 18 C1 642.0 5/9/2017 10:55 PE
19 C1 642.0 5/9/2017 13:35 OK
2 NaN NaN 2017-07-03 18:56:33.183000 NaN
643 20 C1 643.0 5/9/2017 10:56 PE
21 C1 643.0 5/9/2017 13:34 OK
2 NaN NaN 2017-07-03 18:56:33.183000 NaN
656 22 C1 656.0 5/9/2017 10:55 PE
23 C1 656.0 5/9/2017 13:36 OK
2 NaN NaN 2017-07-03 18:56:33.183000 NaN
C2 86 24 C2 86.0 9/5/2016 19:45 PE
25 C2 86.0 9/6/2016 11:55 OK
2 NaN NaN 2017-07-03 18:56:33.183000 NaN
C3 10 26 C3 10.0 4/17/2017 12:23 PE
27 C3 10.0 4/17/2017 14:51 OK
2 NaN NaN 2017-07-03 18:56:33.183000 NaN
C4 38 28 C4 38.0 3/25/2017 10:35 PE
29 C4 38.0 3/25/2017 10:51 OK
2 NaN NaN 2017-07-03 18:56:33.183000 NaN
some clarification,
ddd = dd.reset_index(level = 2).drop(('level_2', 'ID', 'SEQ')).loc[:,('DTM','STATUS')]
ddd
Out[39]:
DTM STATUS
ID SEQ
C1 572 5/9/2017 10:13 PE
572 5/9/2017 12:24 OK
572 2017-07-03 18:56:33.183000 NaN
579 2017-07-03 18:56:33.183000 PE
579 5/9/2017 13:25 OK
587 5/9/2017 10:20 PE
587 5/9/2017 12:25 OK
587 2017-07-03 18:56:33.183000 NaN
590 5/9/2017 10:21 PE
590 5/9/2017 13:09 OK
590 2017-07-03 18:56:33.183000 NaN
604 5/9/2017 10:38 PE
604 5/9/2017 12:32 OK
604 2017-07-03 18:56:33.183000 NaN
609 5/9/2017 10:39 PE
609 5/9/2017 13:29 OK
609 2017-07-03 18:56:33.183000 NaN
613 5/9/2017 10:39 PE
613 5/9/2017 13:08 OK
613 2017-07-03 18:56:33.183000 NaN
618 5/9/2017 10:40 PE
618 5/9/2017 13:33 OK
618 2017-07-03 18:56:33.183000 NaN
636 5/9/2017 10:54 PE
636 5/9/2017 13:36 OK
636 2017-07-03 18:56:33.183000 NaN
642 5/9/2017 10:55 PE
642 5/9/2017 13:35 OK
642 2017-07-03 18:56:33.183000 NaN
643 5/9/2017 10:56 PE
643 5/9/2017 13:34 OK
643 2017-07-03 18:56:33.183000 NaN
656 5/9/2017 10:55 PE
656 5/9/2017 13:36 OK
656 2017-07-03 18:56:33.183000 NaN
C2 86 9/5/2016 19:45 PE
86 9/6/2016 11:55 OK
86 2017-07-03 18:56:33.183000 NaN
C3 10 4/17/2017 12:23 PE
10 4/17/2017 14:51 OK
10 2017-07-03 18:56:33.183000 NaN
C4 38 3/25/2017 10:35 PE
38 3/25/2017 10:51 OK
38 2017-07-03 18:56:33.183000 NaN
Problem
In each group ID
- SEQ
adds a new string that contains the current timestamp, except for the group C1
- 579
! (4th line in dd
and ddd
)
Questions
- What is the cause of this problem?
- What is the additional index level introduced in
dd
?
source to share
After a big debug issue was found.
There is a problem with the same number at the level 3
- in the last example it is a group form 2
, but this value existed before, so no new line was added, the onlu line was overwritten.
ID SEQ DTM STATUS
ID SEQ
C1 572 0 C1 572.0 2017-05-09 10:13:00.000000 PE
1 C1 572.0 2017-05-09 12:24:00.000000 OK
2 NaN NaN 2017-07-06 08:46:02.341472 NaN
579 2 C1 579.0 2017-07-06 08:46:02.341472 PE <- ovetwritten values in row
3 C1 579.0 2017-05-09 13:25:00.000000 OK
587 4 C1 587.0 2017-05-09 10:20:00.000000 PE
5 C1 587.0 2017-05-09 12:25:00.000000 OK
2 NaN NaN 2017-07-06 08:46:02.341472 NaN
The first sample was nice because the second group only has one line.
But if it has 2 lines:
arrays = [['bar', 'bar','bar', 'baz', 'baz', 'foo', 'foo', 'foo', 'qux', 'qux'],
['one', 'two','two', 'one', 'two', 'one', 'two', 'two', 'one', 'two']]
tuples = list(zip(*arrays))
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
a = pd.DataFrame(np.random.random((10,)), index = index)
print (a)
0
first second
bar one 0.366258
two 0.583205
two 0.159388
baz one 0.598198
two 0.274027
foo one 0.086461
two 0.353577
two 0.823377
qux one 0.098737
two 0.128470
same problem.
print (a)
first second 0 DTM
first second
bar one 0 bar one 0.366258 NaT
1 NaN NaN NaN 2017-07-06 08:47:55.610671
two 1 bar two 0.583205 NaT
2 bar two 0.159388 2017-07-06 08:47:55.610671 <- ovetwritten
baz one 3 baz one 0.598198 NaT
1 NaN NaN NaN 2017-07-06 08:47:55.610671
two 4 baz two 0.274027 NaT
So, if the function is slightly modified, everything works fine:
now = pd.datetime.now()
def myfunction(g, now):
g.loc[str(g.shape[0]) + 'a', 'DTM'] = now
return g
arrays = [['bar', 'bar','bar', 'baz', 'baz', 'foo', 'foo', 'foo', 'qux', 'qux'],
['one', 'two','two', 'one', 'two', 'one', 'two', 'two', 'one', 'two']]
tuples = list(zip(*arrays))
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
a = pd.DataFrame(np.random.random((10,)), index = index)
print (a)
a = a.reset_index().groupby(['first', 'second']).apply(lambda x: myfunction(x, now))
print (a)
first second 0 DTM
first second
bar one 0 bar one 0.677641 NaT
1a NaN NaN NaN 2017-07-06 08:54:47.481671
two 1 bar two 0.274588 NaT
2 bar two 0.524903 NaT
2a NaN NaN NaN 2017-07-06 08:54:47.481671
baz one 3 baz one 0.198272 NaT
1a NaN NaN NaN 2017-07-06 08:54:47.481671
two 4 baz two 0.787949 NaT
1a NaN NaN NaN 2017-07-06 08:54:47.481671
foo one 5 foo one 0.484197 NaT
1a NaN NaN NaN 2017-07-06 08:54:47.481671
source to share