Combining data across interval boundaries

Question

Combining data across interval boundaries

Let's say I have this data:

      start       end duration
1   2.67026  2.903822 0.233562
2   4.40529  5.606470 1.201180
3   9.24340 10.010818 0.767418
4  11.87930 13.414140 1.534840
5  14.78210 15.182492 0.400392
6  16.51720 16.817494 0.300294
7  22.08930 25.125610 3.036310
8  32.13240 33.667240 1.534840
9  45.47880 45.912558 0.433758
10 52.85270 54.454270 1.601570
11 55.62210 56.389518 0.767418

They represent 11 events that happened within a minute. Each of them has a start and end time (in seconds) and the duration of that event (in seconds).

What I want to calculate is how many seconds was spent executing these events in each 10 second box / epoch.

The standard way to double data in data.table

would be to do something like:

as.data.table(df)[, .(total = sum(duration)), by = .(INTERVAL = cut(end, seq(0,60,10)))]

   INTERVAL    total
1:   (0,10] 1.434742
2:  (10,20] 3.002944
3:  (20,30] 3.036310
4:  (30,40] 1.534840
5:  (40,50] 0.433758
6:  (50,60] 2.368988

Note, however, that event 3 starts at 9.24340 seconds and ends at 10.010818 seconds. This method only summed the durations of the first two events in the interval (0.10). I want the first interval to include 10-9.24340 = 0.7566 seconds, i.e. should be 2.19132 seconds. This number should be subtracted from the second interval, it should be 2.246344 seconds.

In this example, 0-10 / 10-20 seconds are the only ones where the event spans the cut point, however, obviously I need to find a solution that generalizes to any number of possible cut points.

I think the solution might be to convert the time to datetime format (including milliseconds?) And use that for the cut

data, however I was unable to get it working.

EDIT after @Arun's answer:

@Argun works well for the specified problem. But what if we want to include all intervals - even those where the summed duration = 0.

Example:

set.seed(1)
df<-
  data.frame(
    start=c(2.3, 3.5,6.7,9.4,10.4,13.5,16.3,18.1),
    duration=runif(8,0,1)
  )
df$end<-df$start+df$duration
dt<-data.table(df)
dt


   start  duration       end
1:   2.3 0.2655087  2.565509
2:   3.5 0.3721239  3.872124
3:   6.7 0.5728534  7.272853
4:   9.4 0.9082078 10.308208
5:  10.4 0.2016819 10.601682
6:  13.5 0.8983897 14.398390
7:  16.3 0.9446753 17.244675
8:  18.1 0.6607978 18.760798

Arun's next solution:

lookup = data.table(start = seq(0, 18, by = 2), end = seq(2, 20, by = 2))
ans = foverlaps(dt, setkey(lookup, start, end))
ans[, sum(pmin(i.end, end) - pmax(i.start, start)), by=.(start,end)]

Result:

1:     2   4 0.6376326
2:     6   8 0.5728534
3:     8  10 0.6000000
4:    10  12 0.5098897
5:    12  14 0.5000000
6:    14  16 0.3983897
7:    16  18 0.9446753
8:    18  20 0.6607978

Note that intervals 0-2 and 4-6 are not included in the result. Obviously we could link them back together, but I'm wondering if this can be done simply by changing the code data.table

?

+3

r data.table

jalapic June 20. 15 at 14:36

source to share

1 answer

Arun · Accepted Answer · 2015-06-20T15:18:12+0000

This is how I might think with foverlaps()

.

require(data.table) # v1.9.5+ (due to bug fixes in foverlaps for double)
lookup = data.table(start = seq(0, 50, by = 10), end = seq(10, 60, by = 10))
#    start end
# 1:     0  10
# 2:    10  20
# 3:    20  30
# 4:    30  40
# 5:    40  50
# 6:    50  60

ans = foverlaps(dt, setkey(lookup, start, end))
ans[, sum(pmin(i.end, end) - pmax(i.start, start)), by=.(start,end)]
#    start end       V1
# 1:     0  10 2.191342
# 2:    10  20 2.246344
# 3:    20  30 3.036310
# 4:    30  40 1.534840
# 5:    40  50 0.433758
# 6:    50  60 2.368988

I feel like there might be better options out there though ..

Combining data across interval boundaries

More articles: