Pandas DataFrame: find index values for sequences of a specific length where two columns are equal / identical
I have a pandas DataFrame, which is defined as:
# -*- coding: utf-8 -*-
import datetime as dt
import pandas as pd
data = [[1, 1], [1, 1], [2, 2], [2, 2], [2, 2], [3, 3], [4, 4], [4, 4],
[4, 4], [5, 5], [5, 5]]
df = pd.DataFrame(data, columns=['A', 'B'])
df.index = pd.date_range(dt.datetime(2012, 1, 1), periods=len(df), freq='H')
print(df)
and gives:
                     A  B
2012-01-01 00:00:00  1  1
2012-01-01 01:00:00  1  1
2012-01-01 02:00:00  2  2
2012-01-01 03:00:00  2  2
2012-01-01 04:00:00  2  2
2012-01-01 05:00:00  3  3
2012-01-01 06:00:00  4  4
2012-01-01 07:00:00  4  4
2012-01-01 08:00:00  4  4
2012-01-01 09:00:00  5  5
2012-01-01 10:00:00  5  5
Now I am trying to get the index of the rows where columns A and B are equal for at least n consecutive rows (hours here); exactly n would also be sufficient. That is, I want to extract the index values of consecutive stretches (slices of length >= n) where A and B are equal.
So in this case for n = 2 it should be the index for "twos" and "fours":
2012-01-01 02:00:00
2012-01-01 03:00:00
2012-01-01 04:00:00
2012-01-01 06:00:00
2012-01-01 07:00:00
2012-01-01 08:00:00
Getting only the index for rows where A and B are equal is simple.
But how can I get only the runs of n consecutive equal elements?
I guess there must be some fancy groupby approach that I am not seeing at the moment.
In your description, I don't understand why 1 and 5 would be excluded from your results, since each has 2 or more consecutive rows with matching values for A and B.
The solution below should help, however, and I'm sure you can modify it to suit your needs. It first filters the dataframe down to the rows where the values in columns A and B match (df_matching). It then uses the shift-cumsum pattern to group consecutive rows that share the same value, and finally keeps only the groups with at least n rows.
n = 2
df_matching = df[df.A == df.B]
gb = df_matching.groupby((df_matching.A != df_matching.A.shift()).cumsum())
df_target = gb.filter(lambda x: len(x) >= n)
>>> df_target
                     A  B
2012-01-01 00:00:00  1  1
2012-01-01 01:00:00  1  1
2012-01-01 02:00:00  2  2
2012-01-01 03:00:00  2  2
2012-01-01 04:00:00  2  2
2012-01-01 06:00:00  4  4
2012-01-01 07:00:00  4  4
2012-01-01 08:00:00  4  4
2012-01-01 09:00:00  5  5
2012-01-01 10:00:00  5  5
The dataframe above lets you check that the result meets your expectations. Then just extract the index:
>>> df_target.index
DatetimeIndex(['2012-01-01 00:00:00', '2012-01-01 01:00:00',
'2012-01-01 02:00:00', '2012-01-01 03:00:00',
'2012-01-01 04:00:00', '2012-01-01 06:00:00',
'2012-01-01 07:00:00', '2012-01-01 08:00:00',
'2012-01-01 09:00:00', '2012-01-01 10:00:00'],
dtype='datetime64[ns]', freq=None)
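For readers unfamiliar with the shift-cumsum pattern used in the groupby call, here is a minimal sketch of what it computes. The Series `s` and the names `starts`, `run_id` and `run_len` are illustrative and not part of the original code; the values are chosen so that the run labels are not confused with the data.

```python
import pandas as pd

# Illustrative data with the same run structure as column A above.
s = pd.Series([10, 10, 20, 20, 20, 30, 40, 40, 40, 50, 50])

# True wherever a value differs from the previous one. The first row is
# always True because shift() puts a NaN in front of it.
starts = s != s.shift()

# The running sum of those flags gives every run its own integer label.
run_id = starts.cumsum()
print(run_id.tolist())   # [1, 1, 2, 2, 2, 3, 4, 4, 4, 5, 5]

# Length of the run each row belongs to.
run_len = run_id.groupby(run_id).transform('size')
print(run_len.tolist())  # [2, 2, 3, 3, 3, 1, 3, 3, 3, 2, 2]
```

Grouping by `run_id` is what lets `filter` drop every run shorter than n.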
Note that you get your expected result if n=3.
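As a possible variation, the same idea can be wrapped in a small helper that works on the unfiltered frame. This is only a sketch: the function name `index_of_equal_runs` and its parameters are made up for illustration. It also starts a new run whenever the A == B flag flips, which matters for data where a non-matching row sits between two matching rows with the same value in A (the sample data has no such rows). Filtering first and grouping afterwards would count those two rows as one run.

```python
import pandas as pd

def index_of_equal_runs(df, n, a='A', b='B'):
    """Index labels of rows that belong to a run of at least n consecutive
    rows in which df[a] == df[b] and the value stays the same."""
    equal = df[a] == df[b]
    # A new run starts when the value changes or when the equality flag flips.
    new_run = (df[a] != df[a].shift()) | (equal != equal.shift(fill_value=False))
    run_id = new_run.cumsum()
    run_len = run_id.groupby(run_id).transform('size')
    return df.index[equal & (run_len >= n)]

data = [[1, 1], [1, 1], [2, 2], [2, 2], [2, 2], [3, 3], [4, 4], [4, 4],
        [4, 4], [5, 5], [5, 5]]
df = pd.DataFrame(data, columns=['A', 'B'])
# A Timedelta frequency sidesteps the 'H' vs 'h' alias difference
# between pandas versions.
df.index = pd.date_range('2012-01-01', periods=len(df),
                         freq=pd.Timedelta(hours=1))

idx = index_of_equal_runs(df, 3)
print(idx)  # the rows for hours 02-04 and 06-08

# Hypothetical edge case: the middle row breaks the run of matching rows.
df_gap = pd.DataFrame({'A': [2, 2, 2], 'B': [2, 9, 2]})
print(len(index_of_equal_runs(df_gap, 2)))  # 0
```

With n=3 this returns the index values for the "twos" and "fours" listed in the question.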