Find rows with k consecutive NaNs in Pandas

Question

In the following example:

 df = 
  0   NaN   5.0   NaN   6.0   NaN      
  1   5.0   6.0   6.0   NaN   NaN      
  2   6.0   6.0   NaN   NaN   NaN      
  3   6.0   NaN   NaN   NaN   6.0      
  4   NaN   NaN   NaN   6.0   NaN      
  5   6.0   6.0   6.0   8.0   7.0    
  6   6.0   6.0   8.0   7.0   8.0    
  7   6.0   8.0   7.0   8.0   8.0     
  8   8.0   7.0   8.0   8.0   NaN     
  9   7.0   8.0   8.0   NaN   9.0

How can I find the rows that contain k consecutive NaNs? For example, for k=3

the required rows are: [2, 3, 4]

+3

python pandas dataframe

Arnold Klein May 11 '17 at 12:45


3 answers

You can use a rolling window over the count of `nan`s:

>>> import numpy as np
>>> np.flatnonzero(np.any(np.isnan(df).rolling(window=3, axis=1).sum() >= 3, axis=1))
array([2, 3, 4], dtype=int64)

To get the matching rows, just use `iloc`:

>>> df.iloc[np.flatnonzero(np.any(np.isnan(df).rolling(window=3, axis=1).sum() >= 3, axis=1))]
   0    1    2   3    4    5
2  2  6.0  6.0 NaN  NaN  NaN
3  3  6.0  NaN NaN  NaN  6.0
4  4  NaN  NaN NaN  6.0  NaN

This can also be wrapped in a function:

def rows_with_k_consecutive_nans(df, k):
    """This is exactly like the above, but uses pandas methods instead of
    NumPy functions (see also Scott Boston's answer). The approach is
    completely identical!
    """
    return df.isnull().rolling(window=k, axis=1).sum().ge(k).any(axis=1)

>>> df[rows_with_k_consecutive_nans(df, 3)]  # no iloc here!
   0    1    2   3    4    5
2  2  6.0  6.0 NaN  NaN  NaN
3  3  6.0  NaN NaN  NaN  6.0
4  4  NaN  NaN NaN  6.0  NaN

>>> df[rows_with_k_consecutive_nans(df, 2)]  # with 2 consecutive NaNs
   0    1    2    3    4    5
1  1  5.0  6.0  6.0  NaN  NaN
2  2  6.0  6.0  NaN  NaN  NaN
3  3  6.0  NaN  NaN  NaN  6.0
4  4  NaN  NaN  NaN  6.0  NaN
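On recent pandas releases the `axis` keyword of `rolling` is, as far as I know, deprecated, so the function above may emit a warning or stop working. Below is a minimal sketch of the same idea that rolls over the transposed mask instead. The sample frame is rebuilt from the question's data (five value columns), and the helper name `rows_with_k_consecutive_nans_t` is hypothetical.

```python
import numpy as np
import pandas as pd

nan = np.nan
# Sample data copied from the question (5 value columns, default 0..9 index).
df = pd.DataFrame([
    [nan, 5.0, nan, 6.0, nan],
    [5.0, 6.0, 6.0, nan, nan],
    [6.0, 6.0, nan, nan, nan],
    [6.0, nan, nan, nan, 6.0],
    [nan, nan, nan, 6.0, nan],
    [6.0, 6.0, 6.0, 8.0, 7.0],
    [6.0, 6.0, 8.0, 7.0, 8.0],
    [6.0, 8.0, 7.0, 8.0, 8.0],
    [8.0, 7.0, 8.0, 8.0, nan],
    [7.0, 8.0, 8.0, nan, 9.0],
])


def rows_with_k_consecutive_nans_t(df, k):
    """Boolean mask of rows with at least k consecutive NaNs (hypothetical helper)."""
    # Transpose so the default row-wise rolling window runs along the
    # original columns, then transpose back.
    counts = df.isna().astype(float).T.rolling(window=k).sum().T
    # Incomplete windows are NaN and compare as False.
    return counts.ge(k).any(axis=1)


mask = rows_with_k_consecutive_nans_t(df, 3)
print(df.index[mask].tolist())
```

The mask can be used exactly like the one above, for example `df[rows_with_k_consecutive_nans_t(df, 3)]`.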

Step by step:

I'll only explain the NumPy approach; the pandas methods are almost identical.

`np.isnan` to find the `nan`s:

>>> np.isnan(df)
       0      1      2      3      4      5
0  False   True  False   True  False   True
1  False  False  False  False   True   True
2  False  False  False   True   True   True
3  False  False   True   True   True  False
4  False   True   True   True  False   True
5  False  False  False  False  False  False
6  False  False  False  False  False  False
7  False  False  False  False  False  False
8  False  False  False  False  False   True
9  False  False  False  False   True  False

`pd.DataFrame.rolling` to count the NaNs in each window of 3 consecutive columns:

>>> np.isnan(df).rolling(window=3, axis=1).sum()
    0   1    2    3    4    5
0 NaN NaN  1.0  2.0  1.0  2.0
1 NaN NaN  0.0  0.0  1.0  2.0
2 NaN NaN  0.0  1.0  2.0  3.0
3 NaN NaN  1.0  2.0  3.0  2.0
4 NaN NaN  2.0  3.0  2.0  2.0
5 NaN NaN  0.0  0.0  0.0  0.0
6 NaN NaN  0.0  0.0  0.0  0.0
7 NaN NaN  0.0  0.0  0.0  0.0
8 NaN NaN  0.0  0.0  0.0  1.0
9 NaN NaN  0.0  0.0  1.0  1.0
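To see why a window sum equal to the window size signals a run of consecutive NaNs, here is a small sketch on single rows. The 0/1 masks are written by hand from rows 2 and 0 of the question's data, and the variable names are only illustrative.

```python
import pandas as pd

# NaN mask of row 2 in the question: 6.0, 6.0, NaN, NaN, NaN
row_mask = pd.Series([0, 0, 1, 1, 1])
window_sums = row_mask.rolling(window=3).sum()
print(window_sums.tolist())

# A window sums to 3 only if all 3 of its entries are NaN.
has_run = bool((window_sums >= 3).any())

# NaN mask of row 0 in the question: NaN, 5.0, NaN, 6.0, NaN
scattered = pd.Series([1, 0, 1, 0, 1])
scattered_has_run = bool((scattered.rolling(window=3).sum() >= 3).any())
print(has_run, scattered_has_run)
```

The first two entries of each rolling result are NaN because those windows are incomplete, which is why they never satisfy the `>= 3` comparison.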

Check which windows contain 3 consecutive NaNs:

>>> np.isnan(df).rolling(window=3, axis=1).sum() >= 3
       0      1      2      3      4      5
0  False  False  False  False  False  False
1  False  False  False  False  False  False
2  False  False  False  False  False   True
3  False  False  False  False   True  False
4  False  False  False   True  False  False
5  False  False  False  False  False  False
6  False  False  False  False  False  False
7  False  False  False  False  False  False
8  False  False  False  False  False  False
9  False  False  False  False  False  False

>>> np.any(np.isnan(df).rolling(window=3, axis=1).sum() >= 3, axis=1)  # rows with at least 1 True
array([False, False,  True,  True,  True, False, False, False, False, False], dtype=bool)

`np.flatnonzero` gives you the indices of the `True`s:

>>> np.flatnonzero(np.any(np.isnan(df).rolling(window=3, axis=1).sum() >= 3, axis=1))
array([2, 3, 4], dtype=int64)
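The same steps can also be sketched without pandas at all. This is a different technique from the answer's rolling sum: it uses `numpy.lib.stride_tricks.sliding_window_view` (available in NumPy 1.20 and later) and checks whether any window consists entirely of NaNs. The helper name `rows_with_nan_run` is hypothetical, and only the first five rows of the question's data are used.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def rows_with_nan_run(values, k):
    """Indices of rows of a 2-D float array with >= k consecutive NaNs (hypothetical helper)."""
    mask = np.isnan(values)
    if k > mask.shape[1]:
        # No window of length k fits into a row.
        return np.array([], dtype=np.intp)
    # Shape (rows, cols - k + 1, k): every length-k window along each row.
    windows = sliding_window_view(mask, k, axis=1)
    # A row qualifies if any of its windows is all NaN.
    return np.flatnonzero(windows.all(axis=2).any(axis=1))


nan = np.nan
# First five rows of the question's data.
values = np.array([
    [nan, 5.0, nan, 6.0, nan],
    [5.0, 6.0, 6.0, nan, nan],
    [6.0, 6.0, nan, nan, nan],
    [6.0, nan, nan, nan, 6.0],
    [nan, nan, nan, 6.0, nan],
])
result = rows_with_nan_run(values, 3)
print(result)
```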

+2

MSeifert May 11 '17 at 13:07


MSeifert's rolling solution, written with pandas methods:

 df[df.isnull().rolling(window=3,axis=1).sum().ge(3).any(axis=1)]

Output:

   0    1    2   3    4    5
2  2  6.0  6.0 NaN  NaN  NaN
3  3  6.0  NaN NaN  NaN  6.0
4  4  NaN  NaN NaN  6.0  NaN

+1

Scott Boston May 11 '17 at 12:50


MaxU · Accepted Answer · 2017-05-11T12:52:24+0000

In [164]: df[df.astype(str).sum(1).str.contains(''.join(['nan']*3))]
Out[164]:
   0    1    2   3    4    5
2  2  6.0  6.0 NaN  NaN  NaN
3  3  6.0  NaN NaN  NaN  6.0
4  4  NaN  NaN NaN  6.0  NaN

Explanation:

In [166]: df.astype(str).sum(1)
Out[166]:
0    0nan5.06.06.0nan
1    15.06.06.0nannan
2    26.06.0nannannan
3    36.0nannannan6.0
4    4nannannan6.0nan
5    56.06.06.08.07.0
6    66.06.08.07.08.0
7    76.08.07.08.08.0
8    88.07.08.08.0nan
9    97.08.08.0nan9.0
dtype: object

In [167]: ''.join(['nan']*3)
Out[167]: 'nannannan'
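The string trick can be spelled out a little more explicitly. The sketch below joins each row with `"".join` instead of relying on `sum(1)` to concatenate strings, and uses a plain substring match. It assumes the data is purely numeric, so that the text `nan` can only come from a missing value. The three-row sample frame and the variable names are illustrative.

```python
import numpy as np
import pandas as pd

nan = np.nan
# Three rows taken from the question's data.
df = pd.DataFrame([
    [nan, 5.0, nan, 6.0, nan],
    [6.0, 6.0, nan, nan, nan],
    [6.0, 6.0, 6.0, 8.0, 7.0],
])

k = 3
# Each value becomes its string form ('nan' for missing values),
# then every row is concatenated into one string.
joined = df.astype(str).agg("".join, axis=1)
# k consecutive NaNs show up as 'nan' repeated k times.
mask = joined.str.contains("nan" * k, regex=False)
print(joined.tolist())
print(mask.tolist())
```

Because the values are concatenated without a separator, this approach is best kept to numeric frames; a text column containing the letters `nan` would produce false matches.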
