Drop duplicates and add values in pandas

I have the DataFrame below. I would like to remove duplicates, but append each duplicate's value from column E to the record that is kept.

import pandas as pd
import numpy as np
dfp = pd.DataFrame({'A' : [np.nan,np.nan,3,4,5,5,3,1,6,7], 
                    'B' : [1,1,3,5,0,0,np.nan,9,0,0], 
                    'C' : ['AA1233445','AA1233445', 'rmacy','Idaho Rx','Ab123455','TV192837','RX','Ohio Drugs','RX12345','USA Pharma'], 
                    'D' : [123456,123456,1234567,12345678,12345,12345,12345678,123456789,1234567,np.nan],
                    'E' : ['Assign','Allign','Hello','Ugly','Appreciate','Undo','Testing','Unicycle','Pharma','Unicorn']})
print(dfp)

      

I grab all the duplicates:

df2 = dfp.loc[(dfp['A'].duplicated(keep=False))].copy()

     A    B          C           D           E
0  NaN  1.0  AA1233445    123456.0      Assign
1  NaN  1.0  AA1233445    123456.0      Allign
2  3.0  3.0      rmacy   1234567.0       Hello
4  5.0  0.0   Ab123455     12345.0  Appreciate
5  5.0  0.0   TV192837     12345.0        Undo
6  3.0  NaN         RX  12345678.0     Testing

      

and would like my result to be as follows:

     A    B          C           D           E
0  NaN  1.0  AA1233445    123456.0      Assign Allign
2  3.0  3.0      rmacy   1234567.0      Hello Testing
4  5.0  0.0   Ab123455     12345.0      Appreciate Undo

      

I know that I need to use dfp.loc[(dfp['A'].duplicated(keep='last'))].copy() to capture the first occurrence, but I cannot set the value of column E to include the values from the other duplicates.

I think I need to try something like:

df3 = dfp.loc[(dfp['A'].duplicated(keep='last'))].copy()
df3['E'] = df3['E'] + dfp.loc[(dfp['A'].duplicated(keep=False).copy()),'E']

      

but my output is:

     A    B          C          D                     E
0  NaN  1.0  AA1233445   123456.0          AssignAssign
2  3.0  3.0      rmacy  1234567.0            HelloHello
4  5.0  0.0   Ab123455    12345.0  AppreciateAppreciate

      

I'm stumped. Am I overcomplicating this? How can I get the output I'm looking for, so that I can drop all duplicates except the first one but still keep the values of the discarded rows in column E?



2 answers


Define the functions to use in agg and apply them in a groupby. To make groupby work with NaNs, I converted the key column to strings and then back to float.

f = {c: ' '.join if c == 'E' else 'first' for c in ['B', 'C', 'D', 'E']}

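# Group on the string form of A so that NaN gets its own group,
# then convert the key back to a numeric column.
dfp.groupby(
    dfp.A.astype(str), sort=False
).agg(f).reset_index().assign(
    A=lambda d: pd.to_numeric(d['A'], errors='coerce')
)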

     A    B           C            D                E
0  NaN  1.0   AA1233445     123456.0    Assign Allign
1  3.0  3.0       rmacy    1234567.0    Hello Testing
2  4.0  5.0    Idaho Rx   12345678.0             Ugly
3  5.0  0.0    Ab123455      12345.0  Appreciate Undo
4  1.0  9.0  Ohio Drugs  123456789.0         Unicycle
5  6.0  0.0     RX12345    1234567.0           Pharma
6  7.0  0.0  USA Pharma          NaN          Unicorn
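
In pandas 1.1 and later, groupby accepts dropna=False, which keeps NaN as a group of its own, so the string round trip may not be needed. The following is only a minimal sketch under that assumption, on a smaller frame; the names df, agg_funcs and out and the sample data are illustrative and not part of the answer above.

```python
import numpy as np
import pandas as pd

# Small illustrative frame (hypothetical data, same kind of problem as dfp)
df = pd.DataFrame({
    'A': [np.nan, np.nan, 3, 4, 3],
    'B': [1, 1, 3, 5, np.nan],
    'E': ['Assign', 'Allign', 'Hello', 'Ugly', 'Testing'],
})

# Join the strings in E, take the first non-null value for everything else
agg_funcs = {'B': 'first', 'E': ' '.join}

# dropna=False keeps the rows whose key is NaN as one group
out = df.groupby('A', dropna=False).agg(agg_funcs).reset_index()
print(out)
```

With the default sort=True the groups come back ordered by key; passing sort=False, as in the answer, should keep them in order of first appearance instead.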

      


To limit it to the duplicated rows only:



f = {c: ' '.join if c == 'E' else 'first' for c in ['B', 'C', 'D', 'E']}
d1 = dfp[dfp.duplicated('A', keep=False)]
d2 = d1.groupby(d1.A.astype(str), sort=False).agg(f).reset_index()
d2.A = d2.A.astype(float)

      

d2

     A    B          C          D                E
0  NaN  1.0  AA1233445   123456.0    Assign Allign
1  3.0  3.0      rmacy  1234567.0    Hello Testing
2  5.0  0.0   Ab123455    12345.0  Appreciate Undo

      



Here's my ugly solution:



In [263]: (dfp.reset_index()
     ...:     .assign(A=dfp.A.fillna(-1))
     ...:     .groupby('A')
     ...:     .filter(lambda x: len(x) > 1)
     ...:     .groupby('A', as_index=False)
     ...:     .apply(lambda x: x.head(1).assign(E=x.E.str.cat(sep=' ')))
     ...:     .replace({'A':{-1:np.nan}})
     ...:     .set_index('index'))
     ...:
Out[263]:
         A    B          C          D                E
index
0      NaN  1.0  AA1233445   123456.0    Assign Allign
2      3.0  3.0      rmacy  1234567.0    Hello Testing
4      5.0  0.0   Ab123455    12345.0  Appreciate Undo
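
The same idea (use a sentinel so NaN forms a group, join E per group, keep the first row of each group) can probably also be written with transform, which leaves the original index in place without the reset_index/set_index pair. This is only a sketch on a smaller frame: like the code above, it assumes -1 never occurs in A, and the names df, key, joined and result are illustrative.

```python
import numpy as np
import pandas as pd

# Small illustrative frame (hypothetical data, same kind of problem as dfp)
df = pd.DataFrame({
    'A': [np.nan, np.nan, 3, 4, 3],
    'E': ['Assign', 'Allign', 'Hello', 'Ugly', 'Testing'],
})

# Sentinel key so NaN rows land in one group (assumes -1 is not a real value of A)
key = df['A'].fillna(-1)

# Every row receives the space-joined E values of its whole group
joined = df.groupby(key)['E'].transform(lambda s: ' '.join(s))

# Keep only rows whose A is duplicated, then the first row of each group;
# A itself is never modified, so NaN stays NaN
dup_mask = df['A'].duplicated(keep=False)
result = df.assign(E=joined)[dup_mask].drop_duplicates('A')
print(result)
```

Because the grouping key is a separate Series, column A does not need to be restored afterwards.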

      
