Merging Observations with Overlapping Dates

Each observation in my data frame has its own "before" and "after" date. The problem is that the date ranges overlap within an ID. For example, in the following table, IDs 1 and 4 contain overlapping date ranges.

ID  before date  after date
1   10/1/1996   12/1/1996
1   1/1/1998    9/30/2003
1   1/1/2000    12/31/2004
2   1/1/2001    3/31/2006
3   1/1/2001    9/30/2006
4   1/1/2001    9/30/2005
4   10/1/2004   12/30/2004
4   10/3/2004   11/28/2004

      

I am trying to get something like this:

ID  before date  after date
1   10/1/1996   12/1/1996
1   1/1/1998    12/31/2004
2   1/1/2001    3/31/2006
3   1/1/2001    9/30/2006
4   1/1/2001    9/30/2005

      

Basically, I would like to replace any overlapping date ranges with a single range that covers them, leave the non-overlapping ranges unchanged, and remove the rows that become redundant. I'm not sure how to do this.



1 answer


First, you must convert your strings to Date-classed values, which makes comparison possible. Here's how I defined and coerced your data:

df <- data.frame(ID=c(1,1,1,2,3,4,4,4), before.date=c('10/1/1996','1/1/1998','1/1/2000','1/1/2001','1/1/2001','1/1/2001','10/1/2004','10/3/2004'), after.date=c('12/1/1996','9/30/2003','12/31/2004','3/31/2006','9/30/2006','9/30/2005','12/30/2004','11/28/2004') );
dcis <- grep('date$',names(df));
df[dcis] <- lapply(df[dcis],as.Date,'%m/%d/%Y');
df;
##   ID before.date after.date
## 1  1  1996-10-01 1996-12-01
## 2  1  1998-01-01 2003-09-30
## 3  1  2000-01-01 2004-12-31
## 4  2  2001-01-01 2006-03-31
## 5  3  2001-01-01 2006-09-30
## 6  4  2001-01-01 2005-09-30
## 7  4  2004-10-01 2004-12-30
## 8  4  2004-10-03 2004-11-28

      

Now my solution involves calculating an "overlapping grouping" vector, which I named og. It assumes that the input df is ordered by ID and then by before.date, which is the case for your example data. If it is not, the ordering can be achieved with df[order(df$ID,df$before.date),]. This is how I calculate og:

cummax.Date <- function(x) as.Date(cummax(as.integer(x)),'1970-01-01');
og <- with(df,c(0,cumsum(!(ID[-length(ID)]==ID[-1] & ave(after.date,ID,FUN=cummax)[-length(after.date)]>before.date[-1]))));
og;
## [1] 0 1 1 2 3 4 4 4

      

Unfortunately, the base R function cummax() doesn't work with Date-classed objects, so I had to write a cummax.Date() shim. I will explain the need for the ave() and cummax() business at the end of the post.

As you can see, the above computation lags the RHS of each of the two vectorized comparisons by excluding the first element via [-1]. This allows us to compare a record's ID for equality with the next record's ID, and also to check whether its after.date is later than the before.date of the next record. The resulting logical vectors are ANDed (&) together. The negation of this logical vector then marks adjacent pairs of records that do not overlap, and hence we can cumsum() the result (and prepend zero, since the first record must start at zero) to get our grouping vector.
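
The lagged comparison can be seen in isolation on a toy example. This is only a sketch: it uses plain numbers instead of dates, and the names id, start, end and grp are made up for the illustration, not taken from the data above.

```r
# Toy data, assumed already sorted by id and then by start.
id    <- c(1, 1, 1, 2)
start <- c(1, 5, 20, 3)
end   <- c(10, 8, 25, 4)

n <- length(id)

# Element i of each vector compares record i with record i + 1.
same_id  <- id[-n] == id[-1]      # does the next record share the id?
overlaps <- end[-n] > start[-1]   # does this record end after the next one starts?

# A new group starts wherever the pair is NOT (same id AND overlapping).
new_group <- !(same_id & overlaps)

# Prepend 0 for the first record, then count group starts cumulatively.
grp <- c(0, cumsum(new_group))
grp
```

Records 1 and 2 end up in the same group, while record 3 (no overlap) and record 4 (different id) each start a new one.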

Finally, for the last part of the solution, I used by() to work with each overlapping group independently:

do.call(rbind,by(df,og,function(g) transform(g[1,],after.date=max(g$after.date))));
##   ID before.date after.date
## 0  1  1996-10-01 1996-12-01
## 1  1  1998-01-01 2004-12-31
## 2  2  2001-01-01 2006-03-31
## 3  3  2001-01-01 2006-09-30
## 4  4  2001-01-01 2005-09-30

      

Since all the records in a group must have the same ID, and we have assumed that the records are ordered by before.date (after being ordered by ID, which is no longer relevant within a group), we can get the correct ID and before.date values from the first record in the group. That's why I started with g[1,]. Then we just need to get the largest after.date of the group via max(g$after.date) and overwrite the first record's after.date with it, which I did with transform().

A word on performance: the ordering assumption helps performance, because it allows us to simply compare each record to the immediately following record using lagged vectorized comparisons, rather than comparing each record in the group to every other record.

Now for the ave() and cummax() business. I realized after writing the initial version of my answer that there was a flaw in my solution that happens not to be exposed by your example data. Let's say there are three records in a group. If the first record has a range that overlaps with both of the next two records, and the middle record does not overlap with the third record, then my (original) code would fail to identify that the third record is part of the same overlapping group as the previous two records.



The solution is to not just use the after.date of the current record when comparing with the next record, but instead use the cumulative maximum after.date within the group. If any earlier record extends completely beyond its next record, it clearly overlaps with that record, and its after.date is what matters when considering overlapping groups for subsequent records.
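
To make the difference concrete, here is a minimal sketch with plain numbers instead of dates, so cummax() applies without the shim. The names id, start, end, run_end, naive and fixed are hypothetical and only serve the illustration.

```r
# Toy data, assumed sorted by id then start. Record 1 spans both later
# records, but record 2 ends before record 3 starts.
id    <- c(1, 1, 1)
start <- c(1, 2, 5)
end   <- c(10, 3, 7)

# Running maximum of `end` within each id.
run_end <- ave(end, id, FUN = cummax)

n <- length(id)

# Grouping that only looks at the current record's own end.
naive <- c(0, cumsum(!(id[-n] == id[-1] & end[-n] > start[-1])))

# Grouping that looks at the furthest end seen so far within the id.
fixed <- c(0, cumsum(!(id[-n] == id[-1] & run_end[-n] > start[-1])))

naive
fixed
```

The naive version splits the third record into its own group, while the running-maximum version keeps all three records together.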

Here's an input that requires this fix, using df as a base:

df2 <- df;
df2[7,'after.date'] <- '2004-10-02';
df2;
##   ID before.date after.date
## 1  1  1996-10-01 1996-12-01
## 2  1  1998-01-01 2003-09-30
## 3  1  2000-01-01 2004-12-31
## 4  2  2001-01-01 2006-03-31
## 5  3  2001-01-01 2006-09-30
## 6  4  2001-01-01 2005-09-30
## 7  4  2004-10-01 2004-10-02
## 8  4  2004-10-03 2004-11-28

      

Now record 6 overlaps with both records 7 and 8, but record 7 does not overlap with record 8. The solution still works:

cummax.Date <- function(x) as.Date(cummax(as.integer(x)),'1970-01-01');
og <- with(df2,c(0,cumsum(!(ID[-length(ID)]==ID[-1] & ave(after.date,ID,FUN=cummax)[-length(after.date)]>before.date[-1]))));
og;
## [1] 0 1 1 2 3 4 4 4
do.call(rbind,by(df2,og,function(g) transform(g[1,],after.date=max(g$after.date))));
##   ID before.date after.date
## 0  1  1996-10-01 1996-12-01
## 1  1  1998-01-01 2004-12-31
## 2  2  2001-01-01 2006-03-31
## 3  3  2001-01-01 2006-09-30
## 4  4  2001-01-01 2005-09-30

      

Here's proof that the og calculation would be wrong without the ave()/cummax() fix:

og <- with(df2,c(0,cumsum(!(ID[-length(ID)]==ID[-1] & after.date[-length(after.date)]>before.date[-1]))));
og;
## [1] 0 1 1 2 3 4 4 5

      


A minor tweak to the solution is to overwrite after.date before computing og and avoid the max() call (this makes sense if you plan on overwriting the original df with the new aggregation):

cummax.Date <- function(x) as.Date(cummax(as.integer(x)),'1970-01-01');
df$after.date <- ave(df$after.date,df$ID,FUN=cummax);
df;
##   ID before.date after.date
## 1  1  1996-10-01 1996-12-01
## 2  1  1998-01-01 2003-09-30
## 3  1  2000-01-01 2004-12-31
## 4  2  2001-01-01 2006-03-31
## 5  3  2001-01-01 2006-09-30
## 6  4  2001-01-01 2005-09-30
## 7  4  2004-10-01 2005-09-30
## 8  4  2004-10-03 2005-09-30
og <- with(df,c(0,cumsum(!(ID[-length(ID)]==ID[-1] & after.date[-length(after.date)]>before.date[-1]))));
og;
## [1] 0 1 1 2 3 4 4 4
df <- do.call(rbind,by(df,og,function(g) transform(g[1,],after.date=g$after.date[nrow(g)])));
df;
##   ID before.date after.date
## 0  1  1996-10-01 1996-12-01
## 1  1  1998-01-01 2004-12-31
## 2  2  2001-01-01 2006-03-31
## 3  3  2001-01-01 2006-09-30
## 4  4  2001-01-01 2005-09-30
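
If the steps need to be reused, they could be wrapped in a function along the following lines. This is a sketch under assumptions rather than the code used above: the name merge_overlaps() and its arguments are hypothetical, the date columns are assumed to already be of class Date, each after.date is assumed to be no earlier than its before.date, and the running maximum is taken on the underlying day counts so that no cummax.Date() shim is needed.

```r
# Hypothetical helper; name and argument names are illustrative only.
merge_overlaps <- function(d, id = "ID", from = "before.date", to = "after.date") {
  # Enforce the ordering the lagged comparison relies on.
  d <- d[order(d[[id]], d[[from]]), ]
  n <- nrow(d)

  # Running maximum of the end date within each id, as day counts.
  run_to <- ave(as.numeric(d[[to]]), d[[id]], FUN = cummax)

  # Overlap grouping: a new group starts when the next record has a
  # different id or starts on/after the furthest end seen so far.
  og <- c(0, cumsum(!(d[[id]][-n] == d[[id]][-1] &
                      run_to[-n] > as.numeric(d[[from]][-1]))))

  # First row of each group supplies id and start; last row supplies the end.
  first <- !duplicated(og)
  last  <- !duplicated(og, fromLast = TRUE)
  out <- d[first, ]
  out[[to]] <- as.Date(run_to[last], origin = "1970-01-01")
  rownames(out) <- NULL
  out
}

# Small self-contained check using IDs 1 and 4 from the modified input.
df <- data.frame(
  ID = c(1, 1, 1, 4, 4, 4),
  before.date = as.Date(c("1996-10-01", "1998-01-01", "2000-01-01",
                          "2001-01-01", "2004-10-01", "2004-10-03")),
  after.date  = as.Date(c("1996-12-01", "2003-09-30", "2004-12-31",
                          "2005-09-30", "2004-10-02", "2004-11-28"))
)
res <- merge_overlaps(df)
res
```

Sorting inside the function removes the need to assume the caller has already ordered the data, at the cost of one extra order() call.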

      
