How to delete data in a data frame based on another data frame

Question

I have a dataframe A like this:

    user_id  sku_id  time
0   56804    75906   2016-02-01 00:10:48
1   56804    75906   2016-02-01 08:36:59
2   56805    75906   2016-02-01 08:36:59
3   56806    81256   2016-02-01 00:08:15
……

and another dataframe B like this:

    user_id  sku_id
0   56804    75906
1   56806    81256
……

I want to select the rows of dataframe A whose (user_id, sku_id) pair does not appear in dataframe B. How can I do this efficiently? My data is fairly large, and I am working on a PC with limited memory.

python pandas

Husy May 12 '17 at 5:14

2 answers

jezrael · Answer 1 · 2017-05-12T05:19:30+0000

Use merge with the parameter indicator, query to filter, and then drop to remove the auxiliary column:

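import pandas as pd

# Merge on the shared columns (user_id, sku_id); _merge records where each row came from.
df = (pd.merge(df1, df2, how='outer', indicator=True)
        .query('_merge == "left_only"')
        .drop(columns='_merge'))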
print (df)
   user_id  sku_id                 time
2    56805   75906  2016-02-01 08:36:59
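
For reference, here is a minimal self-contained sketch of the same anti-join on the sample rows from the question. The frames df1 and df2 are rebuilt by hand, and two details are assumptions of mine rather than part of the answer: how='left' replaces how='outer' (only left_only rows are kept, so rows that exist only in df2 are never needed), and df2 is reduced to its unique key pairs first so the merge cannot multiply rows.

```python
import pandas as pd

# Sample data copied from the question.
df1 = pd.DataFrame({
    "user_id": [56804, 56804, 56805, 56806],
    "sku_id": [75906, 75906, 75906, 81256],
    "time": ["2016-02-01 00:10:48", "2016-02-01 08:36:59",
             "2016-02-01 08:36:59", "2016-02-01 00:08:15"],
})
df2 = pd.DataFrame({"user_id": [56804, 56806], "sku_id": [75906, 81256]})

keys = ["user_id", "sku_id"]

# Left merge with an indicator, then keep rows that found no partner in df2.
result = (
    df1.merge(df2[keys].drop_duplicates(), on=keys, how="left", indicator=True)
       .query('_merge == "left_only"')
       .drop(columns="_merge")
)
print(result)
```

With limited memory, the left merge may be the gentler option because the intermediate frame never grows beyond the rows of df1.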

Another solution:

# Build one MultiIndex of (user_id, sku_id) pairs per frame.
x = pd.MultiIndex.from_arrays([df1['user_id'], df1['sku_id']])
y = pd.MultiIndex.from_arrays([df2['user_id'], df2['sku_id']])
# Pairs that occur in df1 but not in df2.
diff = x.difference(y)
df1 = df1.set_index(['user_id', 'sku_id']).loc[diff].reset_index()
print (df1)
   user_id  sku_id                 time
0    56805   75906  2016-02-01 08:36:59
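
A variation on the MultiIndex idea, sketched under the assumption that you want to keep df1's original row order and index: build a boolean mask with MultiIndex.isin and filter with it, instead of re-indexing the frame. The sample frames are rebuilt by hand from the question.

```python
import pandas as pd

# Sample data copied from the question.
df1 = pd.DataFrame({
    "user_id": [56804, 56804, 56805, 56806],
    "sku_id": [75906, 75906, 75906, 81256],
    "time": ["2016-02-01 00:10:48", "2016-02-01 08:36:59",
             "2016-02-01 08:36:59", "2016-02-01 00:08:15"],
})
df2 = pd.DataFrame({"user_id": [56804, 56806], "sku_id": [75906, 81256]})

keys = ["user_id", "sku_id"]

# True for every row of df1 whose (user_id, sku_id) pair also occurs in df2.
in_df2 = pd.MultiIndex.from_frame(df1[keys]).isin(pd.MultiIndex.from_frame(df2[keys]))

# Invert the mask to keep only the pairs that are absent from df2.
result = df1[~in_df2]
print(result)
```

Because this only filters df1, the surviving row keeps its original index label (2 in the sample data).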

Aman · Answer 2 · 2017-05-12T05:47:51+0000

There are two ways to do this. 1) Using isin, you can filter out the rows you don't want. To apply it to two columns at once, first combine them into a single key column:

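# Build the key element-wise; str(A["user_id"]) would stringify the whole Series at once.
A["id"] = A["user_id"].astype(str) + "_" + A["sku_id"].astype(str)
B["id"] = B["user_id"].astype(str) + "_" + B["sku_id"].astype(str)
A2 = A[~A["id"].isin(B["id"])]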
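
A quick sketch of the key-column approach on the sample rows. It assumes the key is built element-wise with .astype(str), since calling str() on a whole column returns one string for the entire Series rather than one per row. The frames are rebuilt by hand, and keeping the keys in separate Series (key_a, key_b) instead of adding an "id" column to A and B is my choice, not part of the answer.

```python
import pandas as pd

# Sample data copied from the question.
A = pd.DataFrame({
    "user_id": [56804, 56804, 56805, 56806],
    "sku_id": [75906, 75906, 75906, 81256],
    "time": ["2016-02-01 00:10:48", "2016-02-01 08:36:59",
             "2016-02-01 08:36:59", "2016-02-01 00:08:15"],
})
B = pd.DataFrame({"user_id": [56804, 56806], "sku_id": [75906, 81256]})

# One "user_sku" string per row; "_" cannot occur inside the integer ids.
key_a = A["user_id"].astype(str) + "_" + A["sku_id"].astype(str)
key_b = B["user_id"].astype(str) + "_" + B["sku_id"].astype(str)

# Keep the rows of A whose key does not occur among B's keys.
A2 = A[~key_a.isin(key_b)]
print(A2)
```

String keys tend to take more memory than the integer columns they are built from, so on a very large frame the merge-based approach may be lighter.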

2) Add a flag column to B that is always 1. Then merge the two data frames on both key columns and keep only the rows where the flag is not 1:

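B["unique"] = 1
# A left merge is enough: rows of A without a match in B get NaN in "unique".
A2 = A.merge(B, on=["user_id", "sku_id"], how="left")
A2 = A2[A2["unique"] != 1].drop(columns="unique")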

Let me know if this helps
