Find rows with k consecutive NaNs in Pandas

In the following example:

 df = 
  0   NaN   5.0   NaN   6.0   NaN      
  1   5.0   6.0   6.0   NaN   NaN      
  2   6.0   6.0   NaN   NaN   NaN      
  3   6.0   NaN   NaN   NaN   6.0      
  4   NaN   NaN   NaN   6.0   NaN      
  5   6.0   6.0   6.0   8.0   7.0    
  6   6.0   6.0   8.0   7.0   8.0    
  7   6.0   8.0   7.0   8.0   8.0     
  8   8.0   7.0   8.0   8.0   NaN     
  9   7.0   8.0   8.0   NaN   9.0     

      

How can I find the rows that contain k consecutive NaNs? For example, for k=3 the required rows are [2, 3, 4].



3 answers


In [164]: df[df.astype(str).sum(1).str.contains(''.join(['nan']*3))]
Out[164]:
   0    1    2   3    4    5
2  2  6.0  6.0 NaN  NaN  NaN
3  3  6.0  NaN NaN  NaN  6.0
4  4  NaN  NaN NaN  6.0  NaN

      

Explanation:



In [166]: df.astype(str).sum(1)
Out[166]:
0    0nan5.06.06.0nan
1    15.06.06.0nannan
2    26.06.0nannannan
3    36.0nannannan6.0
4    4nannannan6.0nan
5    56.06.06.08.07.0
6    66.06.08.07.08.0
7    76.08.07.08.08.0
8    88.07.08.08.0nan
9    97.08.08.0nan9.0
dtype: object

In [167]: ''.join(['nan']*3)
Out[167]: 'nannannan'
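
The same string-matching idea can be sketched without relying on how astype(str) renders missing values. This is only an illustration under a few assumptions: the frame is rebuilt from the five value columns in the question (without the extra index column shown in the output above), and the helper name rows_with_nan_run_str is hypothetical. Each row's NaN mask is encoded as a string of '0'/'1' characters, and a run of k NaNs then shows up as k consecutive '1' characters.

```python
import numpy as np
import pandas as pd

nan = np.nan
# Value columns from the question; the index column of the printed frame is left out.
df = pd.DataFrame([
    [nan, 5.0, nan, 6.0, nan],
    [5.0, 6.0, 6.0, nan, nan],
    [6.0, 6.0, nan, nan, nan],
    [6.0, nan, nan, nan, 6.0],
    [nan, nan, nan, 6.0, nan],
    [6.0, 6.0, 6.0, 8.0, 7.0],
    [6.0, 6.0, 8.0, 7.0, 8.0],
    [6.0, 8.0, 7.0, 8.0, 8.0],
    [8.0, 7.0, 8.0, 8.0, nan],
    [7.0, 8.0, 8.0, nan, 9.0],
])


def rows_with_nan_run_str(frame, k):
    """Hypothetical helper: True for rows with at least k consecutive NaNs."""
    # Encode each row's NaN mask as text, e.g. [False, True, True] -> "011".
    encoded = frame.isna().apply(
        lambda row: "".join("1" if is_missing else "0" for is_missing in row),
        axis=1,
    )
    # k consecutive NaNs appear as the literal substring "1" * k.
    return encoded.str.contains("1" * k, regex=False)


mask = rows_with_nan_run_str(df, 3)
matching = df[mask]  # rows of df that contain a run of 3 NaNs
```

Because the mask comes from isna(), the match does not depend on the text form of the values themselves.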

      



You can use a rolling window that counts the NaNs:

>>> import numpy as np
>>> np.flatnonzero(np.any(np.isnan(df).rolling(window=3, axis=1).sum() >= 3, axis=1))
array([2, 3, 4], dtype=int64)

      

To get the matching rows, just use iloc:

>>> df.iloc[np.flatnonzero(np.any(np.isnan(df).rolling(window=3, axis=1).sum() >= 3, axis=1))]
   0    1    2   3    4    5
2  2  6.0  6.0 NaN  NaN  NaN
3  3  6.0  NaN NaN  NaN  6.0
4  4  NaN  NaN NaN  6.0  NaN

      

This can also be wrapped in a function:

def rows_with_k_consecutive_nans(df, k):
    """This is exactly like the above but using pandas functions instead of
    NumPy's (see also Scott Boston's answer). The approach is completely identical!
    """
    return df.isnull().rolling(window=k, axis=1).sum().ge(k).any(axis=1)

>>> df[rows_with_k_consecutive_nans(df, 3)]  # no iloc here!
   0    1    2   3    4    5
2  2  6.0  6.0 NaN  NaN  NaN
3  3  6.0  NaN NaN  NaN  6.0
4  4  NaN  NaN NaN  6.0  NaN

>>> df[rows_with_k_consecutive_nans(df, 2)]  # with 2 consecutive NaNs
   0    1    2    3    4    5
1  1  5.0  6.0  6.0  NaN  NaN
2  2  6.0  6.0  NaN  NaN  NaN
3  3  6.0  NaN  NaN  NaN  6.0
4  4  NaN  NaN  NaN  6.0  NaN
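
As far as I know, recent pandas releases deprecate the axis argument of DataFrame.rolling and suggest operating on the transposed object instead. A minimal sketch of the same function under that assumption follows; the name rows_with_k_consecutive_nans_t is hypothetical, and the frame is rebuilt from the five value columns in the question rather than the six-column frame printed above.

```python
import numpy as np
import pandas as pd

nan = np.nan
# Value columns from the question; the index column of the printed frame is left out.
df = pd.DataFrame([
    [nan, 5.0, nan, 6.0, nan],
    [5.0, 6.0, 6.0, nan, nan],
    [6.0, 6.0, nan, nan, nan],
    [6.0, nan, nan, nan, 6.0],
    [nan, nan, nan, 6.0, nan],
    [6.0, 6.0, 6.0, 8.0, 7.0],
    [6.0, 6.0, 8.0, 7.0, 8.0],
    [6.0, 8.0, 7.0, 8.0, 8.0],
    [8.0, 7.0, 8.0, 8.0, nan],
    [7.0, 8.0, 8.0, nan, 9.0],
])


def rows_with_k_consecutive_nans_t(frame, k):
    """Hypothetical variant: boolean Series, True for rows with k consecutive NaNs."""
    # 1.0 where the cell is NaN, 0.0 elsewhere.
    nan_mask = frame.isna().astype(float)
    # After transposing, the default row-wise window runs across the original columns.
    window_counts = nan_mask.T.rolling(window=k).sum()
    # A count of k means every cell in that window was NaN; incomplete windows are
    # NaN and compare as False. any(axis=0) reduces back to one flag per original row.
    return window_counts.ge(k).any(axis=0)


mask3 = rows_with_k_consecutive_nans_t(df, 3)
matching = df[mask3]  # boolean indexing, so no iloc is needed
```

The returned Series is indexed like the original frame, so it can be used directly for boolean indexing just like the function above.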

      


Step by step:

I'll only explain the NumPy method; the pandas functions are almost identical.



np.isnan to find the NaNs:

>>> np.isnan(df)
       0      1      2      3      4      5
0  False   True  False   True  False   True
1  False  False  False  False   True   True
2  False  False  False   True   True   True
3  False  False   True   True   True  False
4  False   True   True   True  False   True
5  False  False  False  False  False  False
6  False  False  False  False  False  False
7  False  False  False  False  False  False
8  False  False  False  False  False   True
9  False  False  False  False   True  False

      

pd.DataFrame.rolling to count the NaNs in each window of 3 adjacent columns:

>>> np.isnan(df).rolling(window=3, axis=1).sum()
    0   1    2    3    4    5
0 NaN NaN  1.0  2.0  1.0  2.0
1 NaN NaN  0.0  0.0  1.0  2.0
2 NaN NaN  0.0  1.0  2.0  3.0
3 NaN NaN  1.0  2.0  3.0  2.0
4 NaN NaN  2.0  3.0  2.0  2.0
5 NaN NaN  0.0  0.0  0.0  0.0
6 NaN NaN  0.0  0.0  0.0  0.0
7 NaN NaN  0.0  0.0  0.0  0.0
8 NaN NaN  0.0  0.0  0.0  1.0
9 NaN NaN  0.0  0.0  1.0  1.0

      

Check for windows that contain 3 consecutive NaNs:

>>> np.isnan(df).rolling(window=3, axis=1).sum() >= 3
       0      1      2      3      4      5
0  False  False  False  False  False  False
1  False  False  False  False  False  False
2  False  False  False  False  False   True
3  False  False  False  False   True  False
4  False  False  False   True  False  False
5  False  False  False  False  False  False
6  False  False  False  False  False  False
7  False  False  False  False  False  False
8  False  False  False  False  False  False
9  False  False  False  False  False  False

>>> np.any(np.isnan(df).rolling(window=3, axis=1).sum() >= 3, axis=1)  # rows with at least 1 True
array([False, False,  True,  True,  True, False, False, False, False, False], dtype=bool)

      

np.flatnonzero gives you the indices of the True values:

>>> np.flatnonzero(np.any(np.isnan(df).rolling(window=3, axis=1).sum() >= 3, axis=1))
array([2, 3, 4], dtype=int64)
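
The same four steps can also be sketched in NumPy alone, swapping the pandas rolling sum for numpy.lib.stride_tricks.sliding_window_view (this assumes NumPy 1.20 or newer). The frame below is rebuilt from the five value columns in the question, so the shapes differ from the six-column output shown above.

```python
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view

nan = np.nan
# Value columns from the question; the index column of the printed frame is left out.
df = pd.DataFrame([
    [nan, 5.0, nan, 6.0, nan],
    [5.0, 6.0, 6.0, nan, nan],
    [6.0, 6.0, nan, nan, nan],
    [6.0, nan, nan, nan, 6.0],
    [nan, nan, nan, 6.0, nan],
    [6.0, 6.0, 6.0, 8.0, 7.0],
    [6.0, 6.0, 8.0, 7.0, 8.0],
    [6.0, 8.0, 7.0, 8.0, 8.0],
    [8.0, 7.0, 8.0, 8.0, nan],
    [7.0, 8.0, 8.0, nan, 9.0],
])

k = 3
# Step 1: boolean NaN mask, shape (rows, cols).
nan_mask = df.isna().to_numpy()
# Step 2: every run of k adjacent columns, shape (rows, cols - k + 1, k).
windows = sliding_window_view(nan_mask, k, axis=1)
# Step 3: a window qualifies only if all k of its cells are NaN.
full_windows = windows.all(axis=2)
# Step 4: keep rows with at least one qualifying window, as integer positions.
positions = np.flatnonzero(full_windows.any(axis=1))

matching = df.iloc[positions]  # positions are positional, hence iloc
```

Note that sliding_window_view raises a ValueError when k is larger than the number of columns, so a caller may want to guard against that case.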

      



MSeifert's rolling solution, written with pandas only:

 df[df.isnull().rolling(window=3,axis=1).sum().ge(3).any(axis=1)]

      

Output:

   0    1    2   3    4    5
2  2  6.0  6.0 NaN  NaN  NaN
3  3  6.0  NaN NaN  NaN  6.0
4  4  NaN  NaN NaN  6.0  NaN

      
