Pandas dataframe group and filter

OID,TYPE,ResponseType
100,mod,ok
100,mod,ok
101,mod,ok
101,mod,ok
101,mod,ok
101,mod,ok
101,mod,no
102,mod,ok
102,mod,ok2
103,mod,ok
103,mod,no2

      

I want to remove all OIDs that never had or didn't have 2 as an answer.

I tried:

dfnew = df.groupby('OID').filter(lambda x: ((x['ResponseType']=='no') | x['ResponseType']=='no2')).any() )

      

But I am getting SyntaxError: invalid syntax

Another example would be to create set

all the OIDs to be filtered and then use them to filter the df. The df has 5,000,000 rows!

Expected OP

OID,TYPE,ResponseType
100,mod,ok
100,mod,ok

102,mod,ok
102,mod,ok2

      

+3


source to share


2 answers


You need to add one (

and ~

to invert the booelan mask - but this is very slow:

dfnew = df.groupby('OID').filter(lambda x: ~((x['ResponseType']=='no') | 
                                             (x['ResponseType']=='no2')).any() )
                                          #here

print (dfnew)
   OID TYPE ResponseType
0  100  mod           ok
1  100  mod           ok
7  102  mod           ok
8  102  mod          ok2

      

Another solution, faster with boolean indexing

and double isin

:

oids = df.loc[df['ResponseType'].isin(['no','no2']), 'OID']
print (oids)
6     101
10    103
Name: OID, dtype: int64

dfnew = df[~df['OID'].isin(oids)]
print (dfnew)
   OID TYPE ResponseType
0  100  mod           ok
1  100  mod           ok
7  102  mod           ok
8  102  mod          ok2

      



Slower solution with unique

:

oids = df.loc[df['ResponseType'].isin(['no','no2']), 'OID'].unique()
print (oids)
[101 103]

      

Delay

np.random.seed(123)
N = 1000000
df = pd.DataFrame({'ResponseType': np.random.choice(['ok','ok2','no2', 'no'], N),
                   'TYPE':['mod'] * N,
                   'OID':np.random.randint(100000, size=N)})
print (df)

In [285]: %timeit (df[~df['OID'].isin(df.loc[df['ResponseType'].isin(['no','no2']), 'OID'])])
10 loops, best of 3: 67.2 ms per loop

In [286]: %timeit (df[~df['OID'].isin(df.loc[df['ResponseType'].isin(['no','no2']), 'OID'].unique())])
10 loops, best of 3: 69.5 ms per loop

#zipa solution
In [287]: %timeit (df[~df['OID'].isin(df[df['ResponseType'].isin(['no', 'no2'])]['OID'])])
10 loops, best of 3: 91.5 ms per loop

#groupby solution :(
In [288]: %timeit (df.groupby('OID').filter(lambda x: ~((x['ResponseType']=='no') |  (x['ResponseType']=='no2')).any() ))
1 loop, best of 3: 1min 54s per loop

      

+2


source


You can do it like this:



df[~df['OID'].isin(df[df['ResponseType'].isin(['no', 'no2'])]['OID'])]

      

0


source







All Articles