Pandas dataframe group and filter
OID,TYPE,ResponseType
100,mod,ok
100,mod,ok
101,mod,ok
101,mod,ok
101,mod,ok
101,mod,ok
101,mod,no
102,mod,ok
102,mod,ok2
103,mod,ok
103,mod,no2
I want to remove all OIDs that never had or didn't have 2 as an answer.
I tried:
dfnew = df.groupby('OID').filter(lambda x: ((x['ResponseType']=='no') | x['ResponseType']=='no2')).any() )
But I am getting SyntaxError: invalid syntax
Another example would be to create set
all the OIDs to be filtered and then use them to filter the df. The df has 5,000,000 rows!
Expected OP
OID,TYPE,ResponseType 100,mod,ok 100,mod,ok 102,mod,ok 102,mod,ok2
+3
source to share
2 answers
You need to add one (
and ~
to invert the booelan mask - but this is very slow:
dfnew = df.groupby('OID').filter(lambda x: ~((x['ResponseType']=='no') |
(x['ResponseType']=='no2')).any() )
#here
print (dfnew)
OID TYPE ResponseType
0 100 mod ok
1 100 mod ok
7 102 mod ok
8 102 mod ok2
Another solution, faster with boolean indexing
and double isin
:
oids = df.loc[df['ResponseType'].isin(['no','no2']), 'OID']
print (oids)
6 101
10 103
Name: OID, dtype: int64
dfnew = df[~df['OID'].isin(oids)]
print (dfnew)
OID TYPE ResponseType
0 100 mod ok
1 100 mod ok
7 102 mod ok
8 102 mod ok2
Slower solution with unique
:
oids = df.loc[df['ResponseType'].isin(['no','no2']), 'OID'].unique()
print (oids)
[101 103]
Delay
np.random.seed(123)
N = 1000000
df = pd.DataFrame({'ResponseType': np.random.choice(['ok','ok2','no2', 'no'], N),
'TYPE':['mod'] * N,
'OID':np.random.randint(100000, size=N)})
print (df)
In [285]: %timeit (df[~df['OID'].isin(df.loc[df['ResponseType'].isin(['no','no2']), 'OID'])])
10 loops, best of 3: 67.2 ms per loop
In [286]: %timeit (df[~df['OID'].isin(df.loc[df['ResponseType'].isin(['no','no2']), 'OID'].unique())])
10 loops, best of 3: 69.5 ms per loop
#zipa solution
In [287]: %timeit (df[~df['OID'].isin(df[df['ResponseType'].isin(['no', 'no2'])]['OID'])])
10 loops, best of 3: 91.5 ms per loop
#groupby solution :(
In [288]: %timeit (df.groupby('OID').filter(lambda x: ~((x['ResponseType']=='no') | (x['ResponseType']=='no2')).any() ))
1 loop, best of 3: 1min 54s per loop
+2
source to share