Pandas dataframe group and filter

Question

Pandas dataframe group and filter

OID,TYPE,ResponseType
100,mod,ok
100,mod,ok
101,mod,ok
101,mod,ok
101,mod,ok
101,mod,ok
101,mod,no
102,mod,ok
102,mod,ok2
103,mod,ok
103,mod,no2

I want to remove all OIDs that never had or didn't have 2 as an answer.

I tried:

dfnew = df.groupby('OID').filter(lambda x: ((x['ResponseType']=='no') | x['ResponseType']=='no2')).any() )

But I am getting SyntaxError: invalid syntax

Another example would be to create set

all the OIDs to be filtered and then use them to filter the df. The df has 5,000,000 rows!

Expected OP

OID,TYPE,ResponseType
100,mod,ok
100,mod,ok

102,mod,ok
102,mod,ok2

+3

python filter pandas dataframe

pythonRcpp June 27. 17 at 11:42

source to share

2 answers

You can do it like this:

df[~df['OID'].isin(df[df['ResponseType'].isin(['no', 'no2'])]['OID'])]

0

zipa June 27. 17 at 11:49

source to share

jezrael · Accepted Answer · 2017-06-27T11:44:47+0000

You need to add one (

and ~

to invert the booelan mask - but this is very slow:

dfnew = df.groupby('OID').filter(lambda x: ~((x['ResponseType']=='no') | 
                                             (x['ResponseType']=='no2')).any() )
                                          #here

print (dfnew)
   OID TYPE ResponseType
0  100  mod           ok
1  100  mod           ok
7  102  mod           ok
8  102  mod          ok2

Another solution, faster with boolean indexing

and double isin

:

oids = df.loc[df['ResponseType'].isin(['no','no2']), 'OID']
print (oids)
6     101
10    103
Name: OID, dtype: int64

dfnew = df[~df['OID'].isin(oids)]
print (dfnew)
   OID TYPE ResponseType
0  100  mod           ok
1  100  mod           ok
7  102  mod           ok
8  102  mod          ok2

Slower solution with unique

:

oids = df.loc[df['ResponseType'].isin(['no','no2']), 'OID'].unique()
print (oids)
[101 103]

Delay

np.random.seed(123)
N = 1000000
df = pd.DataFrame({'ResponseType': np.random.choice(['ok','ok2','no2', 'no'], N),
                   'TYPE':['mod'] * N,
                   'OID':np.random.randint(100000, size=N)})
print (df)

In [285]: %timeit (df[~df['OID'].isin(df.loc[df['ResponseType'].isin(['no','no2']), 'OID'])])
10 loops, best of 3: 67.2 ms per loop

In [286]: %timeit (df[~df['OID'].isin(df.loc[df['ResponseType'].isin(['no','no2']), 'OID'].unique())])
10 loops, best of 3: 69.5 ms per loop

#zipa solution
In [287]: %timeit (df[~df['OID'].isin(df[df['ResponseType'].isin(['no', 'no2'])]['OID'])])
10 loops, best of 3: 91.5 ms per loop

#groupby solution :(
In [288]: %timeit (df.groupby('OID').filter(lambda x: ~((x['ResponseType']=='no') |  (x['ResponseType']=='no2')).any() ))
1 loop, best of 3: 1min 54s per loop

Pandas dataframe group and filter

More articles: