How to delete data in a data frame based on another data frame
I have a dataframe A like this
user_id sku_id time
0 56804 75906 2016-02-01 00:10:48
1 56804 75906 2016-02-01 08:36:59
2 56805 75906 2016-02-01 08:36:59
3 56806 81256 2016-02-01 00:08:15
……
and another dataframe B like this:
user_id sku_id
0 56804 75906
1 56806 81256
……
I want to select the rows in dataframe A whose (user_id, sku_id) pair does not appear in dataframe B. How can I do this efficiently? My data is fairly large and I am working on a PC with limited memory.
Use merge with the indicator parameter, query to filter, and then drop to remove the auxiliary column:
# the parentheses let the method chain span several lines
df = (pd.merge(df1, df2, how='outer', indicator=True)
        .query('_merge == "left_only"')
        .drop(columns='_merge'))
print (df)
user_id sku_id time
2 56805 75906 2016-02-01 08:36:59
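If you want to try the merge approach end to end, here is a minimal self-contained sketch. The frames df1 and df2 are rebuilt from only the four and two rows visible in the question (the elided rows are left out). Using how='left' on explicit keys instead of how='outer' is my own assumption for keeping memory down, since a left merge never creates rows that exist only in df2.

```python
import pandas as pd

# Sample data copied from the rows shown in the question
df1 = pd.DataFrame({
    "user_id": [56804, 56804, 56805, 56806],
    "sku_id": [75906, 75906, 75906, 81256],
    "time": pd.to_datetime([
        "2016-02-01 00:10:48",
        "2016-02-01 08:36:59",
        "2016-02-01 08:36:59",
        "2016-02-01 00:08:15",
    ]),
})
df2 = pd.DataFrame({"user_id": [56804, 56806], "sku_id": [75906, 81256]})

keys = ["user_id", "sku_id"]

# Keep only the key columns of df2 and drop repeated pairs so the
# merge cannot multiply rows of df1
right = df2[keys].drop_duplicates()

# indicator=True adds a "_merge" column: "left_only" marks pairs absent from df2
merged = df1.merge(right, on=keys, how="left", indicator=True)
result = merged.loc[merged["_merge"] == "left_only"].drop(columns="_merge")

print(result)
```

On this sample only the row for user 56805 survives, which matches the output above.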
Another solution:
x = pd.MultiIndex.from_arrays([df1['user_id'], df1['sku_id']])
y = pd.MultiIndex.from_arrays([df2['user_id'], df2['sku_id']])
inter = x.difference(y)
df1 = df1.set_index(['user_id', 'sku_id']).loc[inter].reset_index()
print (df1)
user_id sku_id time
0 56805 75906 2016-02-01 08:36:59
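A variation on the MultiIndex idea, in case memory is tight: instead of difference plus set_index, you can build a boolean mask with MultiIndex.isin and filter in place. This is a different technique from the one above, not the same code; the sample frames are again rebuilt from the rows shown in the question, and the names keys, pairs_a, pairs_b and mask are only illustrative.

```python
import pandas as pd

# Sample data copied from the rows shown in the question
df1 = pd.DataFrame({
    "user_id": [56804, 56804, 56805, 56806],
    "sku_id": [75906, 75906, 75906, 81256],
    "time": pd.to_datetime([
        "2016-02-01 00:10:48",
        "2016-02-01 08:36:59",
        "2016-02-01 08:36:59",
        "2016-02-01 00:08:15",
    ]),
})
df2 = pd.DataFrame({"user_id": [56804, 56806], "sku_id": [75906, 81256]})

keys = ["user_id", "sku_id"]

# One (user_id, sku_id) tuple per row of each frame
pairs_a = pd.MultiIndex.from_frame(df1[keys])
pairs_b = pd.MultiIndex.from_frame(df2[keys])

# True where the pair from df1 does NOT occur in df2
mask = ~pairs_a.isin(pairs_b)

# Boolean filtering keeps the original row order and index of df1
result = df1[mask]
print(result)
```

Because no merge is performed, no wide intermediate frame is built; only the two indexes and one boolean array are allocated.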
There are two ways to do this. 1) Using isin, you can filter out the rows you don't want. To do that, you first need to combine the two key columns into a single one:
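# astype(str) converts element-wise; str(A["user_id"]) would stringify the whole Series
A["id"] = A["user_id"].astype(str)+"_"+A["sku_id"].astype(str)
B["id"] = B["user_id"].astype(str)+"_"+B["sku_id"].astype(str)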
l = list(B["id"])
A2 = A[~A["id"].isin(l)]
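2) Add a flag column to dataframe B that is always 1. Then merge the two dataframes on both key columns and discard the rows where the flag is 1:
B["unique"] = 1
# a left merge is enough here and avoids creating rows that exist only in B
A2 = A.merge(B[["user_id","sku_id","unique"]],on=["user_id","sku_id"],how="left")
A2 = A2[A2["unique"]!=1].drop(columns="unique")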
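Both ways can be checked side by side with a small runnable sketch. The frames A and B below contain only the rows shown in the question, the key is built element-wise with astype(str), and the names key_a, key_b, via_isin and via_merge are made up for the illustration. I also kept the helper key and the flag out of A and B themselves, which is a choice of mine to avoid storing an extra column on a large frame.

```python
import pandas as pd

# Sample data copied from the rows shown in the question
A = pd.DataFrame({
    "user_id": [56804, 56804, 56805, 56806],
    "sku_id": [75906, 75906, 75906, 81256],
    "time": pd.to_datetime([
        "2016-02-01 00:10:48",
        "2016-02-01 08:36:59",
        "2016-02-01 08:36:59",
        "2016-02-01 00:08:15",
    ]),
})
B = pd.DataFrame({"user_id": [56804, 56806], "sku_id": [75906, 81256]})

# Way 1: one string key per row, then isin
key_a = A["user_id"].astype(str) + "_" + A["sku_id"].astype(str)
key_b = B["user_id"].astype(str) + "_" + B["sku_id"].astype(str)
via_isin = A[~key_a.isin(key_b)]

# Way 2: flag every pair of B with 1, left-merge, keep rows without the flag
flagged = B[["user_id", "sku_id"]].drop_duplicates().assign(unique=1)
merged = A.merge(flagged, on=["user_id", "sku_id"], how="left")
via_merge = merged[merged["unique"].isna()].drop(columns="unique")

print(via_isin)
print(via_merge)
```

On the sample data both frames contain the single row for user 56805.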
Let me know if this helps