Pandas DataFrame: find index values for sequences of a specific length where two columns are equal / identical

Question

I have a pandas DataFrame, which is defined as:

# -*- coding: utf-8 -*-
import datetime as dt
import pandas as pd


data = [[1, 1], [1, 1], [2, 2], [2, 2], [2, 2], [3, 3], [4, 4], [4, 4],
        [4, 4], [5, 5], [5, 5]]
df = pd.DataFrame(data, columns=['A', 'B'])
df.index = pd.date_range(dt.datetime(2012, 1, 1), periods=len(df), freq='H')

print(df)

and gives:

                     A  B
2012-01-01 00:00:00  1  1
2012-01-01 01:00:00  1  1
2012-01-01 02:00:00  2  2
2012-01-01 03:00:00  2  2
2012-01-01 04:00:00  2  2
2012-01-01 05:00:00  3  3
2012-01-01 06:00:00  4  4
2012-01-01 07:00:00  4  4
2012-01-01 08:00:00  4  4
2012-01-01 09:00:00  5  5
2012-01-01 10:00:00  5  5

Now I am trying to get the index of the rows where columns A and B are equal and where at least n consecutive rows (hours here) are equal in A and B (exactly n would also be sufficient).

In other words, I want to extract the index values of consecutive runs (slices of length >= n) in which A and B are equal.

So in this case for n = 2 it should be the index for "twos" and "fours":

2012-01-01 02:00:00
2012-01-01 03:00:00
2012-01-01 04:00:00
2012-01-01 06:00:00
2012-01-01 07:00:00
2012-01-01 08:00:00

Getting only the index for rows where A and B are equal is simple.

But how can I only get n consecutive equal elements?

I guess there must be some fancy groupby approach that I am not seeing at the moment.

python pandas

Cord Kaldemeyer, Jun 28 '17 at 15:58

1 answer

Alexander · Accepted Answer · 2017-06-28T16:44:09+0000

In your description, I don't understand why 1 and 5 would be excluded from your results for n = 2, since each of them covers 2 or more consecutive rows with matching values in A and B.

The solution below should help, however, and I'm sure you can modify it to suit your needs. It first filters the DataFrame down to the rows where columns A and B match (df_matching). It then uses the shift-cumsum pattern to group consecutive rows that share the same value, and finally keeps only the groups with at least n rows.

n = 2
df_matching = df[df.A == df.B]
gb = df_matching.groupby((df_matching.A != df_matching.A.shift()).cumsum())
df_target = gb.filter(lambda x: len(x) >= n)

>>> df_target
                     A  B
2012-01-01 00:00:00  1  1
2012-01-01 01:00:00  1  1
2012-01-01 02:00:00  2  2
2012-01-01 03:00:00  2  2
2012-01-01 04:00:00  2  2
2012-01-01 06:00:00  4  4
2012-01-01 07:00:00  4  4
2012-01-01 08:00:00  4  4
2012-01-01 09:00:00  5  5
2012-01-01 10:00:00  5  5

Inspect the DataFrame above to make sure it matches your expectations. Then just extract the index:

>>> df_target.index
DatetimeIndex(['2012-01-01 00:00:00', '2012-01-01 01:00:00',
               '2012-01-01 02:00:00', '2012-01-01 03:00:00',
               '2012-01-01 04:00:00', '2012-01-01 06:00:00',
               '2012-01-01 07:00:00', '2012-01-01 08:00:00',
               '2012-01-01 09:00:00', '2012-01-01 10:00:00'],
              dtype='datetime64[ns]', freq=None)
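
To make the shift-cumsum step less opaque, here is a minimal sketch of the intermediate values it produces for column A of the example data. The variable names starts, run_id and run_len are only illustrative and do not appear in the code above.

```python
import pandas as pd

# Column A from the example (identical to B in every row).
a = pd.Series([1, 1, 2, 2, 2, 3, 4, 4, 4, 5, 5])

# True wherever the value differs from the previous row,
# i.e. at the first row of each run (the very first row compares against NaN).
starts = a != a.shift()

# The cumulative sum of those flags gives every run its own integer label.
run_id = starts.cumsum()

# Length of the run that each row belongs to.
run_len = run_id.map(run_id.value_counts())

print(pd.DataFrame({'A': a, 'starts': starts,
                    'run_id': run_id, 'run_len': run_len}))
```

Filtering on run_len >= n keeps the same rows as gb.filter(lambda x: len(x) >= n). In this data the run labels happen to coincide with the values in A; in general they are just counters.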

Note that you get the expected result if n=3.
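
One caveat with filtering first: matching rows that are separated by a non-matching row become neighbours in df_matching, so they may be counted as a single run. The example data has no such rows, so the results above are unaffected. Below is a minimal sketch that builds the run labels on the full frame instead, assuming a "run" means consecutive rows where A == B and the value in A does not change. The helper name consecutive_match_index is hypothetical and not part of pandas.

```python
import datetime as dt
import pandas as pd


def consecutive_match_index(df, n, col_a='A', col_b='B'):
    """Index labels of rows in runs of >= n consecutive rows where
    col_a == col_b and the value in col_a stays the same."""
    flag = (df[col_a] == df[col_b]).astype(int)
    # A new run starts when the match flag flips or the value in col_a changes.
    new_run = flag.ne(flag.shift()) | df[col_a].ne(df[col_a].shift())
    run_id = new_run.cumsum()
    run_len = run_id.map(run_id.value_counts())
    mask = ((flag == 1) & (run_len >= n)).to_numpy()
    return df.index[mask]


data = [[1, 1], [1, 1], [2, 2], [2, 2], [2, 2], [3, 3], [4, 4], [4, 4],
        [4, 4], [5, 5], [5, 5]]
df = pd.DataFrame(data, columns=['A', 'B'])
# A Timedelta is used for the frequency to avoid version-specific alias strings.
df.index = pd.date_range(dt.datetime(2012, 1, 1), periods=len(df),
                         freq=pd.Timedelta(hours=1))

idx = consecutive_match_index(df, n=3)
print(idx)  # the hours 02:00-04:00 and 06:00-08:00
```

With n=3 this returns the six timestamps listed in the question; with n=2 it returns the same ten rows as df_target above.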
