Remove duplicate rows in pandas dataframe based on condition

Question


            is_avail   valu data_source
2015-08-07     False  0.282    source_a
2015-08-07     False  0.582    source_b
2015-08-23     False  0.296    source_a
2015-09-08     False  0.433    source_a
2015-10-01      True  0.169    source_b

In the dataframe above, I want to remove duplicate rows (that is, rows whose index value is repeated) while keeping the row with the higher value in the column valu.

I can delete rows with duplicate indices like:

df = df[~df.index.duplicated()]

But how do I choose which of the duplicate rows to keep based on the condition above?


python pandas

user308827 05 May '17 at 22:15


2 answers

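Use drop_duplicates with keep='last': move the index into a column, sort by date and then by valu, keep the last row for each date (the one with the largest valu), and restore the index.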

df.rename_axis('date').reset_index() \
    .sort_values(['date', 'valu']) \
    .drop_duplicates('date', keep='last') \
    .set_index('date').rename_axis(df.index.name)

           is_avail   valu data_source
2015-08-07    False  0.582    source_b
2015-08-23    False  0.296    source_a
2015-09-08    False  0.433    source_a
2015-10-01     True  0.169    source_b
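For anyone who wants to run this end to end, here is a self-contained sketch. It rebuilds the sample frame from the question, assuming the index holds dates and has no name; the variable name deduped is only for illustration.

```python
import pandas as pd

# Rebuild the sample frame from the question (assumption: a DatetimeIndex with no name).
df = pd.DataFrame(
    {
        "is_avail": [False, False, False, False, True],
        "valu": [0.282, 0.582, 0.296, 0.433, 0.169],
        "data_source": ["source_a", "source_b", "source_a", "source_a", "source_b"],
    },
    index=pd.to_datetime(
        ["2015-08-07", "2015-08-07", "2015-08-23", "2015-09-08", "2015-10-01"]
    ),
)

deduped = (
    df.rename_axis("date")                 # give the index a temporary name
    .reset_index()                         # turn it into a regular column
    .sort_values(["date", "valu"])         # largest valu ends up last within each date
    .drop_duplicates("date", keep="last")  # keep that last row per date
    .set_index("date")
    .rename_axis(df.index.name)            # restore the original index name
)
print(deduped)
```

Because the sort is ascending on valu, keep='last' is what selects the larger value; with keep='first' the same chain would keep the smaller one.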


piRSquared 05 May '17 at 22:21


Allen · Accepted Answer · 2017-05-05T22:21:09+0000

You can sort df by valu in descending order, then group by the index and take the first row of each group.

df.sort_values(by='valu', ascending=False).groupby(level=0).first()
Out[1277]: 
           is_avail   valu data_source
2015-08-07    False  0.582    source_b
2015-08-23    False  0.296    source_a
2015-09-08    False  0.433    source_a
2015-10-01     True  0.169    source_b
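A runnable sketch of the same idea on the question's sample data follows, together with a variant that reuses the index.duplicated() mask from the question. The names by_group and by_mask are illustrative, and the frame is rebuilt on the assumption that the index holds dates.

```python
import pandas as pd

# Rebuild the sample frame from the question (assumption: a DatetimeIndex).
df = pd.DataFrame(
    {
        "is_avail": [False, False, False, False, True],
        "valu": [0.282, 0.582, 0.296, 0.433, 0.169],
        "data_source": ["source_a", "source_b", "source_a", "source_a", "source_b"],
    },
    index=pd.to_datetime(
        ["2015-08-07", "2015-08-07", "2015-08-23", "2015-09-08", "2015-10-01"]
    ),
)

# Approach from the answer: sort descending, then take the first row per index value.
by_group = df.sort_values(by="valu", ascending=False).groupby(level=0).first()

# Variant: sort descending, then drop every later occurrence of an index value.
by_mask = df.sort_values(by="valu", ascending=False)
by_mask = by_mask[~by_mask.index.duplicated(keep="first")].sort_index()

print(by_group)
print(by_mask)
```

One difference to be aware of: GroupBy.first() returns the first non-null entry per column, so if the data contained missing values the result could combine values from different rows. The mask variant always keeps whole rows.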
