Identify records NOT in another data frame
I have two blocks of data:
import pandas as pd
data1 = pd.DataFrame([['a','z',0],['a','y',20],['b','z',1]],columns=['id1','id2','number'])
data2 = pd.DataFrame([['a','y',1],['a','y',1],['b','z',0]],columns=['id1','id2','number'])
I want to return the records that are in data1 but not in data2 (matched on the combination of id1 and id2).
In this case, I just want it to return one record ['a', 'z', 0], since both ['a', 'y'] and ['b', 'z'] exist in data2.
I think there is an alternative way. If we set both columns as the index, we can use the .isin method to filter out what is needed:
data1.set_index(['id1', 'id2'], inplace=True)
data2.set_index(['id1', 'id2'], inplace=True)
data1[~data1.index.isin(data2.index)].reset_index()
Output:
  id1 id2  number
0   a   z       0
This works no matter what you have in number.
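Note that set_index(..., inplace=True) modifies data1 and data2 themselves. If you would rather leave the original frames untouched, the same idea can be sketched without inplace. This is only a minimal variant of the approach above; the names keys, mask and result are introduced here for illustration.

```python
import pandas as pd

data1 = pd.DataFrame([['a', 'z', 0], ['a', 'y', 20], ['b', 'z', 1]],
                     columns=['id1', 'id2', 'number'])
data2 = pd.DataFrame([['a', 'y', 1], ['a', 'y', 1], ['b', 'z', 0]],
                     columns=['id1', 'id2', 'number'])

keys = ['id1', 'id2']  # columns that together form the composite key

# set_index without inplace returns a new frame, so data1/data2 stay unchanged.
# The resulting MultiIndex holds (id1, id2) tuples, and isin compares whole tuples.
mask = data1.set_index(keys).index.isin(data2.set_index(keys).index)

# Keep the rows of data1 whose (id1, id2) pair does not occur in data2.
result = data1[~mask]
print(result)
```

Because the pair is compared as a whole, a row is dropped only when both id1 and id2 match the same row of data2.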
This is a bit tricky. Usually, when we want to filter rows using multiple conditions, we do something like:
In [39]:
data1[(data1.id1 != data2.id1) & (data1.id2 != data2.id2)]
Out[39]:
Empty DataFrame
Columns: [id1, id2, number]
Index: []
but this returns no rows: the condition is never met, because in every row at least one of the id values matched.
So what we really want is to treat both columns together as a composite id and then keep the rows that exist only in data1.
To do this, we can first do a left merge:
In [33]:
merged = data1.merge(data2, on=['id1', 'id2'], how='left')
merged
Out[33]:
  id1 id2  number_x  number_y
0   a   z         0       NaN
1   a   y        20         1
2   a   y        20         1
3   b   z         1         0
Now we only need the rows where the right side is null (NaN), as this means that the composite key does not exist in data2:
In [36]:
merged_null = merged[merged.number_y.isnull()]
merged_null
Out[36]:
  id1 id2  number_x  number_y
0   a   z         0       NaN
We can now use this to select our rows from the original frame, using isin to pick the rows whose id1 and id2 values both appear in merged_null:
In [38]:
data1[(data1.id1.isin(merged_null['id1'])) & (data1.id2.isin(merged_null['id2']))]
Out[38]:
  id1 id2  number
0   a   z       0
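One caveat with the last step: the two isin checks test id1 and id2 independently, so with other data a row could pass even though its exact (id1, id2) pair is not among the unmatched ones. A possible way to sidestep that, sketched here as a variation rather than as part of the original answer, is to let the merge itself report where each row came from through the indicator parameter of merge. The names keys, right_keys and result are made up for this example.

```python
import pandas as pd

data1 = pd.DataFrame([['a', 'z', 0], ['a', 'y', 20], ['b', 'z', 1]],
                     columns=['id1', 'id2', 'number'])
data2 = pd.DataFrame([['a', 'y', 1], ['a', 'y', 1], ['b', 'z', 0]],
                     columns=['id1', 'id2', 'number'])

keys = ['id1', 'id2']  # columns that together form the composite key

# Keep only the key columns of data2 and drop duplicate pairs, so the
# left merge cannot multiply rows of data1 or add number_x/number_y columns.
right_keys = data2[keys].drop_duplicates()

# indicator=True adds a '_merge' column whose value is 'left_only'
# for rows of data1 that found no matching pair in data2.
merged = data1.merge(right_keys, on=keys, how='left', indicator=True)

# Select the unmatched rows and restore the original column layout.
result = merged.loc[merged['_merge'] == 'left_only', data1.columns]
print(result)
```

This also does not rely on number being non-null in data2, which the isnull() check on number_y implicitly assumes.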