Efficient grouping in pandas based on another series
I need to perform a grouped operation based on another boolean column in my DataFrame. This is most easily seen in an example: I have the following DataFrame:
b id
0 False 0
1 True 0
2 False 0
3 False 1
4 True 1
5 True 2
6 True 2
7 False 3
8 True 4
9 True 4
10 False 4
and would like to get a column whose elements are True if column b is True and that row is the last one where b is True for a given id:
b id lastMention
0 False 0 False
1 True 0 True
2 False 0 False
3 False 1 False
4 True 1 False
5 True 2 True
6 True 3 True
7 False 3 False
8 True 4 False
9 True 4 True
10 False 4 False
I have some code that achieves this, albeit inefficiently:
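def lastMentionFun(df):
    b = df['b']
    a = b.sum()
    # Only mark a row if the group has at least one True in b
    if a > 0:
        maxInd = b[b].index.max()
        df.loc[maxInd, 'lastMention'] = True
    return df

df['lastMention'] = False
# group_keys=False keeps the original index instead of prepending the id level
df = df.groupby('id', group_keys=False).apply(lastMentionFun)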
Can anyone suggest a more Pythonic way to do this cleanly and quickly?
First, you can filter the True values in column b and then get the max index value for each id using groupby and the max aggregation:
print (df[df.b].reset_index().groupby('id')['index'].max())
id
0 1
1 4
2 6
4 9
Name: index, dtype: int64
Then replace the False values with True at those index values using loc:
df['lastMention'] = False
df.loc[df[df.b].reset_index().groupby('id')['index'].max(), 'lastMention'] = True
print (df)
b id lastMention
0 False 0 False
1 True 0 True
2 False 0 False
3 False 1 False
4 True 1 True
5 True 2 False
6 True 2 True
7 False 3 False
8 True 4 False
9 True 4 True
10 False 4 False
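For readers who want to run the steps above end to end, here is a minimal self-contained sketch. It assumes the sample data from the question and a default, unnamed integer index (so that reset_index() produces a column called 'index'); the variable name last_true is only illustrative.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "b": [False, True, False, False, True, True, True, False, True, True, False],
    "id": [0, 0, 0, 1, 1, 2, 2, 3, 4, 4, 4],
})

# Keep only rows where b is True, expose the index as a column,
# and take the largest index label within each id.
last_true = df[df["b"]].reset_index().groupby("id")["index"].max()

# Start with all False, then flag the selected index labels.
df["lastMention"] = False
df.loc[last_true.to_numpy(), "lastMention"] = True
print(df)
```

Ids that never have b set to True (id 3 here) simply do not appear in last_true, so their rows stay False.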
Another solution is to get the max index values with groupby and apply, then check which index values are among them with isin, which returns a boolean mask:
print (df[df.b].groupby('id').apply(lambda x: x.index.max()))
id
0 1
1 4
2 6
4 9
dtype: int64
df['lastMention'] = df.index.isin(df[df.b].groupby('id').apply(lambda x: x.index.max()))
print (df)
b id lastMention
0 False 0 False
1 True 0 True
2 False 0 False
3 False 1 False
4 True 1 True
5 True 2 False
6 True 2 True
7 False 3 False
8 True 4 False
9 True 4 True
10 False 4 False
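A self-contained sketch of the isin variant, again assuming the question's sample data. One deviation from the snippet above, made here only as an illustrative choice: column b is selected before apply, so the lambda receives a Series per group rather than a DataFrame.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "b": [False, True, False, False, True, True, True, False, True, True, False],
    "id": [0, 0, 0, 1, 1, 2, 2, 3, 4, 4, 4],
})

# For each id, the largest index label among rows where b is True.
last_idx = df[df["b"]].groupby("id")["b"].apply(lambda s: s.index.max())

# True where the row's index label is one of those per-id maxima.
df["lastMention"] = df.index.isin(last_idx)
print(df)
```

Because isin builds the whole mask in one step, there is no need to initialise the column to False first.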
Not sure if this is the most efficient method, but it only uses built-in functions: mainly cumsum, followed by max to check whether each row's running count equals the final count for its id. pd.merge is only used to bring the per-id max back into the table; maybe there is a better way to do this?
import numpy as np
import pandas as pd

# Running count of True values in b within each id
df['cum_b'] = df.groupby('id')['b'].cumsum()
df = pd.merge(df, df[['id','cum_b']].groupby('id', as_index=False).max(), how='left', on='id', suffixes=('','_max'))
df['lastMention'] = np.logical_and(df.b, df.cum_b == df.cum_b_max)
PS: The example you provided changed slightly from the first to the second snippet; I hope I interpreted your request correctly!
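On the question of whether the merge can be avoided: one possibility, sketched here as an assumption rather than as part of the answer above, is groupby(...).transform('max'), which broadcasts each group's maximum back to every row of that group. The sample data is the question's; cum_b and cum_b_max are kept as standalone Series instead of columns.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "b": [False, True, False, False, True, True, True, False, True, True, False],
    "id": [0, 0, 0, 1, 1, 2, 2, 3, 4, 4, 4],
})

# Running count of True values within each id (booleans cast to 0/1 first).
cum_b = df["b"].astype(int).groupby(df["id"]).cumsum()

# Each group's final count, repeated on every row of that group.
cum_b_max = cum_b.groupby(df["id"]).transform("max")

# A row is the last mention if b is True and no later True follows in its id.
df["lastMention"] = df["b"] & (cum_b == cum_b_max)
print(df)
```

The extra condition on b matters: rows after the last True share the final count, and groups with no True at all have a count of 0 everywhere.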