Pandas DataFrame: find index values for sequences of a specific length where two columns are equal / identical

I have a pandas DataFrame which is defined as:

# -*- coding: utf-8 -*-
import datetime as dt
import pandas as pd


data = [[1, 1], [1, 1], [2, 2], [2, 2], [2, 2], [3, 3], [4, 4], [4, 4],
        [4, 4], [5, 5], [5, 5]]
df = pd.DataFrame(data, columns=['A', 'B'])
df.index = pd.date_range(dt.datetime(2012, 1, 1), periods=len(df), freq='H')

print(df)


and gives:

                     A  B
2012-01-01 00:00:00  1  1
2012-01-01 01:00:00  1  1
2012-01-01 02:00:00  2  2
2012-01-01 03:00:00  2  2
2012-01-01 04:00:00  2  2
2012-01-01 05:00:00  3  3
2012-01-01 06:00:00  4  4
2012-01-01 07:00:00  4  4
2012-01-01 08:00:00  4  4
2012-01-01 09:00:00  5  5
2012-01-01 10:00:00  5  5


Now I am trying to get the index of the rows where columns A and B are equal for at least n consecutive rows (hours here); exactly n would also be sufficient. In other words, I want to extract the index values of consecutive runs (slices of length >= n) where A and B are equal.

So in this case for n = 2 it should be the index for "twos" and "fours":

2012-01-01 02:00:00
2012-01-01 03:00:00
2012-01-01 04:00:00
2012-01-01 06:00:00
2012-01-01 07:00:00
2012-01-01 08:00:00


Getting only the index for rows where A and B are equal is simple.

But how do I additionally require at least n consecutive equal rows?

I guess there must be some fancy groupby approach that I am not seeing at the moment.

1 answer


From your description, I don't understand why 1 and 5 would be excluded from your results, since each contains 2 or more consecutive rows with matching values for A and B.

The solution below should help, however, and I'm sure you can adapt it to your needs. It first filters the dataframe to the rows where the values in columns A and B match (df_matching). It then uses the shift-cumsum pattern to group consecutive matching values, and finally filters the groups by n.

n = 2
# Keep only the rows where A equals B.
df_matching = df[df.A == df.B]
# Start a new group whenever the value of A changes between consecutive rows.
gb = df_matching.groupby((df_matching.A != df_matching.A.shift()).cumsum())
# Keep only the groups with at least n rows.
df_target = gb.filter(lambda x: len(x) >= n)

>>> df_target
                     A  B
2012-01-01 00:00:00  1  1
2012-01-01 01:00:00  1  1
2012-01-01 02:00:00  2  2
2012-01-01 03:00:00  2  2
2012-01-01 04:00:00  2  2
2012-01-01 06:00:00  4  4
2012-01-01 07:00:00  4  4
2012-01-01 08:00:00  4  4
2012-01-01 09:00:00  5  5
2012-01-01 10:00:00  5  5
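To see how the shift-cumsum grouping works, you can inspect the grouping key on its own (a minimal sketch):

# True whenever A changes between consecutive rows of df_matching;
# cumsum turns those change points into increasing group labels
# (for this data: 1 1 2 2 2 3 4 4 4 5 5).
group_key = (df_matching.A != df_matching.A.shift()).cumsum()
print(group_key)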

The dataframe above should be close to what you expect. Then just extract the index:

>>> df_target.index
DatetimeIndex(['2012-01-01 00:00:00', '2012-01-01 01:00:00',
               '2012-01-01 02:00:00', '2012-01-01 03:00:00',
               '2012-01-01 04:00:00', '2012-01-01 06:00:00',
               '2012-01-01 07:00:00', '2012-01-01 08:00:00',
               '2012-01-01 09:00:00', '2012-01-01 10:00:00'],
              dtype='datetime64[ns]', freq=None)


Note that you get your expected result if n = 3.
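For example, re-running the filter with n = 3 (a minimal sketch, reusing gb from above) should keep only the runs of twos and fours:

n = 3
df_target = gb.filter(lambda x: len(x) >= n)
# df_target.index should now contain only the six timestamps listed in the question.
print(df_target.index)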
