Pandas measure elapsed time when condition is true
I have the following DataFrame:
dt binary
2016-01-01 00:00:00 False
2016-01-01 00:00:01 False
2016-01-01 00:00:02 False
2016-01-01 00:00:03 False
2016-01-01 00:00:04 True
2016-01-01 00:00:05 True
2016-01-01 00:00:06 True
2016-01-01 00:00:07 False
2016-01-01 00:00:08 False
2016-01-01 00:00:09 True
2016-01-01 00:00:10 True
I would like to sum the elapsed time while binary equals True. I am sharing my solution below, but something tells me there should be an easier way, since this is a fairly simple time series operation. Note that the data is most likely equidistant, but I cannot rely on this.
df['binary_grp'] = (df.binary.diff(1) != False).astype(int).cumsum()
# Throw away False values
df = df[df.binary]
groupby = df.groupby('binary_grp')
df = pd.DataFrame({'timespan': groupby.dt.last() - groupby.dt.first()})
return df.timespan.sum().seconds / 60.0
The hardest part is probably the first line. It assigns an incrementing number to each block of consecutive equal values. This is what the data looks like after that:
dt binary binary_grp
2016-01-01 00:00:00 False 1
2016-01-01 00:00:01 False 1
2016-01-01 00:00:02 False 1
2016-01-01 00:00:03 False 1
2016-01-01 00:00:04 True 2
2016-01-01 00:00:05 True 2
2016-01-01 00:00:06 True 2
2016-01-01 00:00:07 False 3
2016-01-01 00:00:08 False 3
2016-01-01 00:00:09 True 4
2016-01-01 00:00:10 True 4
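The block numbering above can be reproduced step by step. This is a minimal sketch, assuming the sample data is rebuilt by hand, and using ne with shift to flag changes instead of the diff comparison from the question (the names changed and block_id are only illustrative):

```python
import pandas as pd

# Rebuild the sample frame from the question: 11 rows, one per second.
df = pd.DataFrame({
    "dt": pd.Timestamp("2016-01-01") + pd.to_timedelta(list(range(11)), unit="s"),
    "binary": [False, False, False, False, True, True, True,
               False, False, True, True],
})

# True on every row whose value differs from the previous row.
# The first row is compared against NaN, so it also counts as a change.
changed = df["binary"].ne(df["binary"].shift())

# A running total of the change flags gives each consecutive block its own id.
block_id = changed.cumsum()

print(block_id.tolist())
```

For this data the ids come out as 1, 1, 1, 1, 2, 2, 2, 3, 3, 4, 4, matching the binary_grp column shown above.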
Is there a better way to do this? I assume this code performs well; my concern is readability.
I think your solution is nice.
Another solution:
Compare the shifted values with ne, then get the groups with cumsum.
After filtering, you can use apply and take the difference between the last and first values selected with iloc:
df['binary_grp'] = (df.binary.ne(df.binary.shift())).cumsum()
df = df[df.binary]
s = df.groupby('binary_grp')['dt'].apply(lambda x: x.iloc[-1] - x.iloc[0])
print (s)
binary_grp
2 00:00:02
4 00:00:01
Name: dt, dtype: timedelta64[ns]
all_time = s.sum().seconds / 60.0
print (all_time)
0.05
Your solution doesn't need a new DataFrame if you only need all_time:
groupby = df.groupby('binary_grp')
s = groupby.dt.last() - groupby.dt.first()
all_time = s.sum().seconds / 60.0
print (all_time)
0.05
But if necessary, you can create one from the Series s using to_frame:
df1 = s.to_frame('timestamp')
print (df1)
timestamp
binary_grp
2 00:00:02
4 00:00:01
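One caveat about the `.seconds / 60.0` step used above: the seconds attribute of a Timedelta holds only the seconds component and leaves out whole days, so it gives the full duration only while the total stays under one day. A small sketch, using a made-up total of one day and 90 seconds, shows the difference from total_seconds():

```python
import pandas as pd

# Hypothetical total: 1 day and 90 seconds of accumulated True time.
total = pd.Timedelta(days=1, seconds=90)

# .seconds ignores the day part and only sees the 90 seconds.
minutes_component = total.seconds / 60.0

# total_seconds() covers the full duration, days included.
minutes_full = total.total_seconds() / 60.0

print(minutes_component)  # 1.5
print(minutes_full)       # 1441.5
```

For the sample data in this thread both give 0.05, since the total is only 3 seconds.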
IIUC:
You want to find the total time covered across the entire series where binary is True.
However, we have to make some choices or assumptions:
dt binary
0 2016-01-01 00:00:00 False
1 2016-01-01 00:00:01 False
2 2016-01-01 00:00:02 False
3 2016-01-01 00:00:03 False
4 2016-01-01 00:00:04 True # <- This is where time starts
5 2016-01-01 00:00:05 True
6 2016-01-01 00:00:06 True
7 2016-01-01 00:00:07 False # <- And ends here. So this would
8 2016-01-01 00:00:08 False # be 00:00:07 - 00:00:04 or 3 seconds
9 2016-01-01 00:00:09 True # <- Starts again
10 2016-01-01 00:00:10 True # <- But ends here because
# I don't have another Timestamp
With these assumptions, we can use diff, multiply and sum:
df.dt.diff().shift(-1).mul(df.binary).sum()
Timedelta('0 days 00:00:04')
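The same idea can be written as a self-contained sketch. It assumes the sample data is rebuilt by hand, and it selects the True rows with boolean indexing instead of mul, in case multiplying a timedelta column by a boolean column is not accepted by a given pandas version (the name to_next is only illustrative):

```python
import pandas as pd

# Rebuild the sample frame: 11 rows, one per second.
df = pd.DataFrame({
    "dt": pd.Timestamp("2016-01-01") + pd.to_timedelta(list(range(11)), unit="s"),
    "binary": [False, False, False, False, True, True, True,
               False, False, True, True],
})

# Time from each row to the next one. The last row has no successor,
# so its gap is NaT and is skipped by sum().
to_next = df["dt"].diff().shift(-1)

# Keep only the gaps that start on a True row and add them up.
total = to_next[df["binary"]].sum()

print(total)
```

Because each gap is measured from the actual timestamps, this does not rely on the rows being equally spaced.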
We can use this concept together with groupby
# Use xor and cumsum to identify change in True to False and False to True
grps = (df.binary ^ df.binary.shift()).cumsum()
mask = df.binary.groupby(grps).first()
df.dt.diff().shift(-1).groupby(grps).sum()[mask]
binary
1 00:00:03
3 00:00:01
Name: dt, dtype: timedelta64[ns]
Or without a mask
pd.concat([df.dt.diff().shift(-1).groupby(grps).sum(), mask], axis=1)
dt binary
binary
0 00:00:04 False
1 00:00:03 True
2 00:00:02 False
3 00:00:01 True
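A per-block breakdown like the one above can also be sketched with ne instead of xor. This is an assumption-laden variant rather than the answer's exact code: it rebuilds the sample data by hand, and it avoids ^ because the NaN introduced by shift may be handled differently across pandas versions. The block ids here start at 1 rather than 0.

```python
import pandas as pd

# Rebuild the sample frame: 11 rows, one per second.
df = pd.DataFrame({
    "dt": pd.Timestamp("2016-01-01") + pd.to_timedelta(list(range(11)), unit="s"),
    "binary": [False, False, False, False, True, True, True,
               False, False, True, True],
})

# Number each run of consecutive equal values.
blocks = df["binary"].ne(df["binary"].shift()).cumsum()

# Time from each row to the next one (NaT for the last row).
to_next = df["dt"].diff().shift(-1)

# One row per block: whether it is a True block and how long it lasted.
summary = pd.DataFrame({
    "binary": df["binary"].groupby(blocks).first(),
    "elapsed": to_next.groupby(blocks).sum(),
})

print(summary)

# Total time spent in True blocks only.
print(summary.loc[summary["binary"], "elapsed"].sum())
```

Keeping the binary flag next to each duration makes it easy to filter for True blocks afterwards, or to inspect the False gaps as well.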