How to delete data in a data frame based on another data frame

I have a dataframe A like this

   user_id  sku_id                 time
0    56804   75906  2016-02-01 00:10:48
1    56804   75906  2016-02-01 08:36:59
2    56805   75906  2016-02-01 08:36:59
3    56806   81256  2016-02-01 00:08:15
……

      

and then I have another dataframe B like this:

   user_id  sku_id
0    56804   75906
1    56806   81256
……

      

I want to select the rows of dataframe A whose (user_id, sku_id) pair does not appear in dataframe B. How can I do this efficiently? My data is relatively large, and I am working on a PC with limited memory.



2 answers


Use merge with the parameter indicator, query to filter, and then drop to remove the auxiliary column:

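import pandas as pd

# indicator=True adds a _merge column recording where each row came from;
# the parentheses let the method chain span several lines
df = (pd.merge(df1, df2, how='outer', indicator=True)
        .query('_merge == "left_only"')
        .drop(columns='_merge'))
print (df)
   user_id  sku_id                 time
2    56805   75906  2016-02-01 08:36:59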
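For anyone who wants to run this end to end, here is a minimal sketch built from the sample rows in the question. The names df1 and df2 stand in for A and B. As an assumption on my part, how="left" is swapped in for how="outer" (rows that exist only in B are never needed for this filter), and B is reduced to its distinct key pairs first so the merge cannot multiply rows of A.

```python
import pandas as pd

# Sample data copied from the question (df1 plays the role of A, df2 of B)
df1 = pd.DataFrame({
    "user_id": [56804, 56804, 56805, 56806],
    "sku_id": [75906, 75906, 75906, 81256],
    "time": ["2016-02-01 00:10:48", "2016-02-01 08:36:59",
             "2016-02-01 08:36:59", "2016-02-01 00:08:15"],
})
df2 = pd.DataFrame({"user_id": [56804, 56806], "sku_id": [75906, 81256]})

keys = ["user_id", "sku_id"]

# Left merge against the distinct key pairs of df2; _merge marks the origin of each row
merged = df1.merge(df2[keys].drop_duplicates(), on=keys, how="left", indicator=True)

# Keep rows found only in df1, then drop the helper column
result = merged.loc[merged["_merge"] == "left_only"].drop(columns="_merge")
print(result)
```

With a left merge the intermediate frame has exactly as many rows as df1, which may matter when memory is tight.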

      



Another solution:

x = pd.MultiIndex.from_arrays([df1['user_id'], df1['sku_id']])
y = pd.MultiIndex.from_arrays([df2['user_id'], df2['sku_id']])
inter = x.difference(y)
df1 = df1.set_index(['user_id', 'sku_id']).loc[inter].reset_index()
print (df1)
   user_id  sku_id                 time
0    56805   75906  2016-02-01 08:36:59
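A possible variant of the MultiIndex idea, sketched here under the assumption that you would rather keep the original row order and index of df1: build a boolean mask with MultiIndex.isin instead of re-indexing the frame. The names keys, in_b and filtered are illustrative only.

```python
import pandas as pd

# Sample data copied from the question (df1 plays the role of A, df2 of B)
df1 = pd.DataFrame({
    "user_id": [56804, 56804, 56805, 56806],
    "sku_id": [75906, 75906, 75906, 81256],
    "time": ["2016-02-01 00:10:48", "2016-02-01 08:36:59",
             "2016-02-01 08:36:59", "2016-02-01 00:08:15"],
})
df2 = pd.DataFrame({"user_id": [56804, 56806], "sku_id": [75906, 81256]})

keys = ["user_id", "sku_id"]

# One MultiIndex entry per row, built from the key columns only
idx_a = pd.MultiIndex.from_frame(df1[keys])
idx_b = pd.MultiIndex.from_frame(df2[keys])

# Boolean array: True where the (user_id, sku_id) pair of df1 also occurs in df2
in_b = idx_a.isin(idx_b)

# Invert the mask to keep only the pairs that are absent from df2
filtered = df1[~in_b]
print(filtered)
```

Because only a mask is applied, df1 itself is never re-indexed and the surviving rows keep their original index labels.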

      



There are two ways to do this: 1) Use isin to filter out the rows you don't want. To do that, first combine the two key columns into a single string key:

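# Build one string key per row; astype(str) converts element-wise,
# whereas str(A["user_id"]) would turn the whole column into a single string
A["id"] = A["user_id"].astype(str) + "_" + A["sku_id"].astype(str)
B["id"] = B["user_id"].astype(str) + "_" + B["sku_id"].astype(str)
# Keep only the rows of A whose key does not appear in B
A2 = A[~A["id"].isin(B["id"])]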

      

2) Add a marker column to B that is always 1. Then merge the two data frames on both key columns and discard the rows where the marker is 1:



B["unique"] = 1
A2 = A.merge(B,on=["user_id","sku_id"],how="outer")
A2 = A2[A2["unique"]!=1]
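A self-contained sketch of the marker-column idea on the sample rows from the question, in case it helps. It makes a few assumptions that are mine rather than part of the answer: the marker is attached to a copy of B's distinct key pairs (so B itself is not modified), the merge uses how="left", and the helper column, here given the hypothetical name in_b, is dropped at the end.

```python
import pandas as pd

# Sample data copied from the question
A = pd.DataFrame({
    "user_id": [56804, 56804, 56805, 56806],
    "sku_id": [75906, 75906, 75906, 81256],
    "time": ["2016-02-01 00:10:48", "2016-02-01 08:36:59",
             "2016-02-01 08:36:59", "2016-02-01 00:08:15"],
})
B = pd.DataFrame({"user_id": [56804, 56806], "sku_id": [75906, 81256]})

keys = ["user_id", "sku_id"]

# Distinct key pairs of B with a marker column; assign returns a new frame
marked = B[keys].drop_duplicates().assign(in_b=1)

# Rows of A without a partner in B get NaN in the marker column
merged = A.merge(marked, on=keys, how="left")

# Keep the unmatched rows and remove the helper column again
A2 = merged[merged["in_b"].isna()].drop(columns="in_b")
print(A2)
```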

      

Let me know if this helps
