Find strings with k-consecutive NaNs in Pandas
In the following example:
df =
0 NaN 5.0 NaN 6.0 NaN
1 5.0 6.0 6.0 NaN NaN
2 6.0 6.0 NaN NaN NaN
3 6.0 NaN NaN NaN 6.0
4 NaN NaN NaN 6.0 NaN
5 6.0 6.0 6.0 8.0 7.0
6 6.0 6.0 8.0 7.0 8.0
7 6.0 8.0 7.0 8.0 8.0
8 8.0 7.0 8.0 8.0 NaN
9 7.0 8.0 8.0 NaN 9.0
how to find lines with sequential k-NaN? For example, for the k=3
required lines: [2,3,4]
+3
source to share
3 answers
In [164]: df[df.astype(str).sum(1).str.contains(''.join(['nan']*3))]
Out[164]:
0 1 2 3 4 5
2 2 6.0 6.0 NaN NaN NaN
3 3 6.0 NaN NaN NaN 6.0
4 4 NaN NaN NaN 6.0 NaN
Explanation:
In [166]: df.astype(str).sum(1)
Out[166]:
0 0nan5.06.06.0nan
1 15.06.06.0nannan
2 26.06.0nannannan
3 36.0nannannan6.0
4 4nannannan6.0nan
5 56.06.06.08.07.0
6 66.06.08.07.08.0
7 76.08.07.08.08.0
8 88.07.08.08.0nan
9 97.08.08.0nan9.0
dtype: object
In [167]: ''.join(['nan']*3)
Out[167]: 'nannannan'
+3
source to share
You can use a scrolling window with nan
s count :
>>> import numpy as np
>>> np.flatnonzero(np.any(np.isnan(df).rolling(window=3, axis=1).sum() >= 3, axis=1))
array([2, 3, 4], dtype=int64)
To get the matching lines, just use iloc
:
>>> df.iloc[rows_with_k_consecutive_nans(df, )]
0 1 2 3 4 5
2 2 6.0 6.0 NaN NaN NaN
3 3 6.0 NaN NaN NaN 6.0
4 4 NaN NaN NaN 6.0 NaN
This can also be wrapped in a function:
def rows_with_k_consecutive_nans(df, k):
"""This is exactly like the above but using pandas functions instead of
numpys. (see also Scott Boston answer). The approach is completly identical!
"""
return df.isnull().rolling(window=k, axis=1).sum().ge(k).any(axis=1)
>>> df[rows_with_k_consecutive_nans(df, 3)] # no iloc here!
0 1 2 3 4 5
2 2 6.0 6.0 NaN NaN NaN
3 3 6.0 NaN NaN NaN 6.0
4 4 NaN NaN NaN 6.0 NaN
>>> df[rows_with_k_consecutive_nans(df, 2)] # with 2 consecutives
0 1 2 3 4 5
1 1 5.0 6.0 6.0 NaN NaN
2 2 6.0 6.0 NaN NaN NaN
3 3 6.0 NaN NaN NaN 6.0
4 4 NaN NaN NaN 6.0 NaN
Step by step:
I'll only explain the numpy method, pandas functions are almost identical to these.
np.isnan
to find nan
s
>>> np.isnan(df)
0 1 2 3 4 5
0 False True False True False True
1 False False False False True True
2 False False False True True True
3 False False True True True False
4 False True True True False True
5 False False False False False False
6 False False False False False False
7 False False False False False False
8 False False False False False True
9 False False False False True False
pd.DataFrame.rolling
to get consecutive NaNs
>>> np.isnan(df).rolling(window=3, axis=1).sum()
0 1 2 3 4 5
0 NaN NaN 1.0 2.0 1.0 2.0
1 NaN NaN 0.0 0.0 1.0 2.0
2 NaN NaN 0.0 1.0 2.0 3.0
3 NaN NaN 1.0 2.0 3.0 2.0
4 NaN NaN 2.0 3.0 2.0 2.0
5 NaN NaN 0.0 0.0 0.0 0.0
6 NaN NaN 0.0 0.0 0.0 0.0
7 NaN NaN 0.0 0.0 0.0 0.0
8 NaN NaN 0.0 0.0 0.0 1.0
9 NaN NaN 0.0 0.0 1.0 1.0
Check availability for 3 consective NaNs
>>> np.isnan(df).rolling(window=3, axis=1).sum() >= 3
0 1 2 3 4 5
0 False False False False False False
1 False False False False False False
2 False False False False False True
3 False False False False True False
4 False False False True False False
5 False False False False False False
6 False False False False False False
7 False False False False False False
8 False False False False False False
9 False False False False False False
>>> np.any(np.isnan(df).rolling(window=3, axis=1).sum() >= 3, axis=1) # rows with at least 1 True
array([False, False, True, True, True, False, False, False, False, False], dtype=bool)
np.flatnonzero
gives you the indices True
s.
>>> np.flatnonzero(np.any(np.isnan(df).rolling(window=3, axis=1).sum() >= 3, axis=1))
array([2, 3, 4], dtype=int64)
+2
source to share