Identify NOT records in another data frame

Question

I have two data frames:

import pandas as pd
data1 = pd.DataFrame([['a','z',0],['a','y',20],['b','z',1]],columns=['id1','id2','number'])
data2 = pd.DataFrame([['a','y',1],['a','y',1],['b','z',0]],columns=['id1','id2','number'])

I want to return the records that are in data1 but not in data2, matching on the combination of id1 and id2.

In this case, I just want it to return one record ['a', 'z', 0], since both ['a', 'y'] and ['b', 'z'] exist in data2.

+2

merge pandas

Chris, Nov 24 '14 at 3:36

2 answers

This is a bit tricky. Usually, when we want to filter rows using multiple conditions, we do something like:

In [39]:
data1[(data1.id1 != data2.id1) & (data1.id2 != data2.id2)]
Out[39]:
Empty DataFrame
Columns: [id1, id2, number]
Index: []

but this returns no rows: the comparison is made row by row (aligned on the index), and in every row at least one of the id values matches its counterpart in data2.
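To see why the filter comes back empty, it can help to look at the two boolean masks separately. This is only a small sketch using the frames from the question; the names mask_id1 and mask_id2 are made up for illustration.

```python
import pandas as pd

data1 = pd.DataFrame([['a', 'z', 0], ['a', 'y', 20], ['b', 'z', 1]],
                     columns=['id1', 'id2', 'number'])
data2 = pd.DataFrame([['a', 'y', 1], ['a', 'y', 1], ['b', 'z', 0]],
                     columns=['id1', 'id2', 'number'])

# Each comparison pairs row i of data1 with row i of data2 (same index label).
mask_id1 = data1['id1'] != data2['id1']  # a != a, a != a, b != b
mask_id2 = data1['id2'] != data2['id2']  # z != y, y != y, z != z

print(mask_id1.tolist())               # [False, False, False]
print(mask_id2.tolist())               # [True, False, False]
print((mask_id1 & mask_id2).tolist())  # [False, False, False]
```

Since id1 is equal in every aligned row, the combined mask is False everywhere. The comparison never asks whether a pair exists anywhere in data2, only whether it differs from the row at the same position.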

So what we really want is to treat both columns together as a composite key and then keep the rows whose key appears only in data1.

To do this, we can first do a left merge:

In [33]:
merged = data1.merge(data2, on=['id1', 'id2'], how='left')
merged
Out[33]:
  id1 id2  number_x  number_y
0   a   z         0       NaN
1   a   y        20         1
2   a   y        20         1
3   b   z         1         0

Now we only need the rows where the right-hand side is null (NaN), as this means the composite key does not exist in data2:

In [36]:

merged_null = merged[merged.number_y.isnull()]
merged_null

Out[36]:
  id1 id2  number_x  number_y
0   a   z         0       NaN

We can now use this to select our rows from the original frame, using isin to pick the rows whose id1 and id2 values both appear in merged_null:

In [38]:

data1[(data1.id1.isin(merged_null['id1']) ) & (data1.id2.isin(merged_null['id2']))]
Out[38]:
  id1 id2  number
0   a   z       0
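One caveat with the last step: id1 and id2 are tested separately, so on other data a row could be selected even though its pair exists in data2, as long as each value shows up somewhere among the unmatched pairs. A possible way around that, offered here only as a sketch and not as part of the answer above, is to let the merge itself flag unmatched rows with the indicator=True option of merge (available in newer pandas versions than the one used at the time). The names keys, right_keys and only_in_data1 are made up for illustration.

```python
import pandas as pd

data1 = pd.DataFrame([['a', 'z', 0], ['a', 'y', 20], ['b', 'z', 1]],
                     columns=['id1', 'id2', 'number'])
data2 = pd.DataFrame([['a', 'y', 1], ['a', 'y', 1], ['b', 'z', 0]],
                     columns=['id1', 'id2', 'number'])

keys = ['id1', 'id2']

# Keep only the distinct key pairs of data2, so the left merge
# cannot duplicate rows of data1 (as happened with ['a', 'y'] above).
right_keys = data2[keys].drop_duplicates()

# indicator=True adds a '_merge' column: 'left_only', 'right_only' or 'both'.
merged = data1.merge(right_keys, on=keys, how='left', indicator=True)

# Rows flagged 'left_only' have a key pair that is absent from data2.
only_in_data1 = merged.loc[merged['_merge'] == 'left_only', data1.columns]
print(only_in_data1)
#   id1 id2  number
# 0   a   z       0
```

Because only the key columns of data2 take part in the merge, number keeps its original name and integer dtype.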

+1

EdChum, Nov 24 '14 at 5:00

Primer · Accepted Answer · 2014-11-25T09:49:31+0000

I think there is an alternative way. If we set both columns as the index, we can use the .isin method to select what is needed:

data1.set_index(['id1', 'id2'], inplace=True)
data2.set_index(['id1', 'id2'], inplace=True)
data1[~data1.index.isin(data2.index)].reset_index()

Output:

  id1 id2  number
0   a   z       0
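Note that inplace=True changes data1 and data2 themselves. If the original frames should stay untouched, the same idea can be wrapped in a small helper. This is a sketch under that assumption; rows_not_in is a made-up name, not a pandas function.

```python
import pandas as pd

def rows_not_in(left, right, keys):
    """Return the rows of `left` whose `keys` combination is absent from `right`."""
    # set_index without inplace returns a new frame, so the inputs are not modified.
    left_index = left.set_index(keys).index
    right_index = right.set_index(keys).index
    # isin on a MultiIndex compares whole (id1, id2) tuples, not single columns.
    return left[~left_index.isin(right_index)]

data1 = pd.DataFrame([['a', 'z', 0], ['a', 'y', 20], ['b', 'z', 1]],
                     columns=['id1', 'id2', 'number'])
data2 = pd.DataFrame([['a', 'y', 1], ['a', 'y', 1], ['b', 'z', 0]],
                     columns=['id1', 'id2', 'number'])

result = rows_not_in(data1, data2, ['id1', 'id2'])
print(result)
#   id1 id2  number
# 0   a   z       0
```

Afterwards data1 and data2 still have id1 and id2 as ordinary columns, so no reset_index is needed.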

This works regardless of the values in number.
