How to compare rows in a dataframe in pandas
I would like to compare two rows where the ID numbers are the same (e.g. rows 0 and 1) and then delete the row with the smaller absolute income. Is there a way to do this using only pandas functions, without looping through the rows with .itertuples()? I've been thinking about using .shift and .apply, but I'm not sure how to go about it.
Index ID Income
0 2011000070 55019
1 2011000070 0
2 2011000074 23879
3 2011000074 0
4 2011000078 0
5 2011000078 0
6 2011000118 -32500
7 2011000118 0
Desired output:
Index ID Income
0 2011000070 55019
2 2011000074 23879
4 2011000078 0
6 2011000118 -32500
You need groupby with idxmax and Series.abs to get, for each ID, the index label of the row with the maximum absolute value, and then select those rows with loc:
print (df.groupby('ID')['Income'].apply(lambda x: x.abs().idxmax()))
ID
2011000070 0
2011000074 2
2011000078 4
2011000118 6
Name: Income, dtype: int64
df = df.loc[df.groupby('ID')['Income'].apply(lambda x: x.abs().idxmax())]
print (df)
Index ID Income
0 0 2011000070 55019
2 2 2011000074 23879
4 4 2011000078 0
6 6 2011000118 -32500
Alternative solution:
df = df.loc[df['Income'].abs().groupby(df['ID']).idxmax()]
print (df)
Index ID Income
0 0 2011000070 55019
2 2 2011000074 23879
4 4 2011000078 0
6 6 2011000118 -32500
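For anyone who wants to reproduce this, here is a minimal self-contained sketch of the groupby/idxmax approach. It assumes the question's data, with the default integer index standing in for the Index column; the names keep_labels and result are only illustrative.

```python
import pandas as pd

# Sample data copied from the question. The "Index" column is left out and
# the default RangeIndex (0..7) plays its role (an assumption of this sketch).
df = pd.DataFrame({
    "ID": [2011000070, 2011000070, 2011000074, 2011000074,
           2011000078, 2011000078, 2011000118, 2011000118],
    "Income": [55019, 0, 23879, 0, 0, 0, -32500, 0],
})

# For each ID, find the index label of the row with the largest |Income|.
# On ties, idxmax returns the first occurrence within the group.
keep_labels = df["Income"].abs().groupby(df["ID"]).idxmax()

# Select those rows by label; the original Income values (including the
# negative one) are kept, because abs() was only used to pick the rows.
result = df.loc[keep_labels]
print(result)
```

Note that idxmax returns index labels, not positions, which is why the selection goes through loc rather than iloc.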
Using pandas.DataFrame.drop_duplicates, plus sorting by ID and by the absolute value of Income, should solve your problem. Its keep parameter defaults to "first", which is what you need.
df['Income_abs'] = df['Income'].abs()
df.sort_values(['ID', 'Income_abs'], ascending=[True,False]).drop_duplicates(['ID']).drop('Income_abs',axis=1)
Out[26]:
Index ID Income
0 0 2011000070 55019
2 2 2011000074 23879
4 4 2011000078 0
6 6 2011000118 -32500
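The same idea can probably be written as one chain that leaves the original frame untouched. This is a sketch under the same assumptions as the question's data (default integer index in place of the Index column); Income_abs is just a temporary helper column, and result is an illustrative name.

```python
import pandas as pd

# Sample data copied from the question, with the default RangeIndex.
df = pd.DataFrame({
    "ID": [2011000070, 2011000070, 2011000074, 2011000074,
           2011000078, 2011000078, 2011000118, 2011000118],
    "Income": [55019, 0, 23879, 0, 0, 0, -32500, 0],
})

result = (
    df.assign(Income_abs=df["Income"].abs())            # temporary helper column
      .sort_values(["ID", "Income_abs"],
                   ascending=[True, False])             # largest |Income| first per ID
      .drop_duplicates("ID")                            # keep="first" is the default
      .drop(columns="Income_abs")                       # remove the helper again
)
print(result)
```

Because assign returns a new frame, df itself never gains the helper column. When two rows of one ID have the same absolute income, which of them survives depends on how the sort orders ties, so it is safer not to rely on that.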