Remove duplicate rows in pandas dataframe based on condition
is_avail valu data_source
2015-08-07 False 0.282 source_a
2015-08-07 False 0.582 source_b
2015-08-23 False 0.296 source_a
2015-09-08 False 0.433 source_a
2015-10-01 True 0.169 source_b
In the DataFrame above, I want to remove duplicate rows (that is, rows where the index is repeated) while keeping the row with the higher value in the column valu.
I can delete rows with duplicate indices like:
df = df[~df.index.duplicated()]
... But how do I delete based on the condition above?
2 answers
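You can sort df by valu in descending order, then group by the index and take the first entry of each group. Note that first() returns the first non-null entry per column, so this matches whole rows only when the data has no missing values.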
df.sort_values(by='valu', ascending=False).groupby(level=0).first()
Out[1277]:
is_avail valu data_source
2015-08-07 False 0.582 source_b
2015-08-23 False 0.296 source_a
2015-09-08 False 0.433 source_a
2015-10-01 True 0.169 source_b
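For anyone who wants to run this end to end, here is a minimal sketch that rebuilds the sample frame from the question and combines the same sorting idea with the index.duplicated() mask the question already uses. The variable names by_value and deduped are just illustrative, and the final sort_index() is an assumption that you want the original date order back.

```python
import pandas as pd

# Sample data copied from the question; the index contains a repeated date.
df = pd.DataFrame(
    {
        "is_avail": [False, False, False, False, True],
        "valu": [0.282, 0.582, 0.296, 0.433, 0.169],
        "data_source": ["source_a", "source_b", "source_a", "source_a", "source_b"],
    },
    index=pd.to_datetime(
        ["2015-08-07", "2015-08-07", "2015-08-23", "2015-09-08", "2015-10-01"]
    ),
)

# Put the largest valu first, so the first occurrence of each index label
# is the row we want to keep.
by_value = df.sort_values("valu", ascending=False)

# Keep only the first occurrence of each index label, then restore date order.
deduped = by_value[~by_value.index.duplicated(keep="first")].sort_index()

print(deduped)
```

Unlike groupby(...).first(), the boolean mask always keeps whole rows, which may matter if some columns contain missing values.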
Using drop_duplicates with keep='last'
df.rename_axis('date').reset_index() \
.sort_values(['date', 'valu']) \
.drop_duplicates('date', keep='last') \
.set_index('date').rename_axis(df.index.name)
is_avail valu data_source
2015-08-07 False 0.582 source_b
2015-08-23 False 0.296 source_a
2015-09-08 False 0.433 source_a
2015-10-01 True 0.169 source_b
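If you need this in more than one place, the chain above could be wrapped in a small helper. This is only a sketch: the function name keep_max_per_index and the temporary column name _index_key are made up for illustration, and it assumes no existing column is already called _index_key.

```python
import pandas as pd


def keep_max_per_index(df, column):
    """Keep, for each index label, the row with the largest value in `column`."""
    key = "_index_key"  # hypothetical temporary name; assumed not to clash
    out = (
        df.rename_axis(key)
        .reset_index()
        .sort_values([key, column])         # within each label, largest value last
        .drop_duplicates(key, keep="last")  # so keep='last' retains the maximum
        .set_index(key)
    )
    out.index.name = df.index.name  # restore the original index name
    return out


# Small sample with one repeated date, taken from the question's data.
sample = pd.DataFrame(
    {
        "valu": [0.282, 0.582, 0.296],
        "data_source": ["source_a", "source_b", "source_a"],
    },
    index=pd.to_datetime(["2015-08-07", "2015-08-07", "2015-08-23"]),
)

result = keep_max_per_index(sample, "valu")
print(result)
```

One thing to watch: sort_values places NaN last by default, so if the value column can contain NaN, keep='last' would retain the NaN row. Passing na_position='first' to sort_values avoids that.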