How to compare rows in a DataFrame in pandas

I would like to compare two rows that share the same ID (e.g. rows 0 and 1) and then delete the row with the smaller absolute income. Is there a way to do this using only pandas functions, without looping over the rows with .itertuples()? I've been thinking about using .shift and .apply, but I'm not sure how to accomplish this.

 Index   ID             Income  
 0       2011000070      55019   
 1       2011000070          0   
 2       2011000074      23879   
 3       2011000074          0   
 4       2011000078          0   
 5       2011000078          0   
 6       2011000118     -32500   
 7       2011000118          0 

      

The result I want:

 Index   ID             Income  
 0       2011000070      55019     
 2       2011000074      23879     
 4       2011000078          0     
 6       2011000118     -32500   

      



3 answers


You need DataFrameGroupBy.idxmax with Series.abs to get the index labels of the maximum absolute values, and then select those rows with loc:

print (df.groupby('ID')['Income'].apply(lambda x: x.abs().idxmax()))
ID
2011000070    0
2011000074    2
2011000078    4
2011000118    6
Name: Income, dtype: int64

df = df.loc[df.groupby('ID')['Income'].apply(lambda x: x.abs().idxmax())]
print (df)
   Index          ID  Income
0      0  2011000070   55019
2      2  2011000074   23879
4      4  2011000078       0
6      6  2011000118  -32500

      



Alternative solution:

df = df.loc[df['Income'].abs().groupby(df['ID']).idxmax()]
print (df)
   Index          ID  Income
0      0  2011000070   55019
2      2  2011000074   23879
4      4  2011000078       0
6      6  2011000118  -32500
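For anyone who wants to try this without the original file, here is a minimal, self-contained sketch of the same idea. It rebuilds the sample data from the question by hand (treating the Index column as the frame's ordinary index, which is an assumption), and the names keep and result are just illustrative:

```python
import pandas as pd

# Sample data copied from the question; the default RangeIndex
# stands in for the "Index" column shown there (an assumption).
df = pd.DataFrame({
    "ID": [2011000070, 2011000070, 2011000074, 2011000074,
           2011000078, 2011000078, 2011000118, 2011000118],
    "Income": [55019, 0, 23879, 0, 0, 0, -32500, 0],
})

# For each ID, find the index label of the row with the largest |Income|.
# On ties, idxmax returns the first occurrence.
keep = df["Income"].abs().groupby(df["ID"]).idxmax()

# Select exactly those rows; the signed Income values are preserved.
result = df.loc[keep]
print(result)
```

Because idxmax returns the first label on ties, the ID whose two rows both have an income of 0 keeps its first row.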

      



Using pandas.DataFrame.drop_duplicates, plus sorting by ID and by the absolute value of Income, should solve your problem. Its keep parameter defaults to "first", which is what you need.



df['Income_abs'] = df['Income'].abs()

df.sort_values(['ID', 'Income_abs'], ascending=[True,False]).drop_duplicates(['ID']).drop('Income_abs',axis=1)
Out[26]: 
   Index          ID  Income
0      0  2011000070   55019
2      2  2011000074   23879
4      4  2011000078       0
6      6  2011000118  -32500
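If you would rather not add a helper column, the same sort-then-deduplicate idea can probably be written with the key argument of sort_values. This is only a sketch under a few assumptions: it needs a pandas version that supports key (1.1 or later), the sample frame is rebuilt by hand from the question, and result is an illustrative name:

```python
import pandas as pd

# Sample data copied from the question (the "Index" column is assumed
# to be the frame's ordinary index).
df = pd.DataFrame({
    "ID": [2011000070, 2011000070, 2011000074, 2011000074,
           2011000078, 2011000078, 2011000118, 2011000118],
    "Income": [55019, 0, 23879, 0, 0, 0, -32500, 0],
})

result = (
    # Sort by |Income|, largest first; a stable sort keeps tied rows
    # in their original relative order.
    df.sort_values("Income", key=lambda s: s.abs(),
                   ascending=False, kind="mergesort")
      # keep="first" is the default, so the largest |Income| per ID survives.
      .drop_duplicates("ID")
      # Restore the original row order.
      .sort_index()
)
print(result)
```

Sorting on Income alone is enough here because drop_duplicates only looks at which row comes first for each ID; the final sort_index puts the surviving rows back in their original order.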

      



This might work:

In [458]: df.groupby('ID', as_index=False).apply(lambda x: x.loc[x.Income.abs().idxmax()])
Out[458]:
   Index          ID  Income
0      0  2011000070   55019
1      2  2011000074   23879
2      4  2011000078       0
3      6  2011000118  -32500

      
