Merging Observations with Overlapping Dates
Each observation in my dataframe has its own "before" and "after" date. The problem is that the date ranges overlap within some IDs. For example, in the following table, IDs 1 and 4 contain overlapping date ranges.
ID  before date  after date
1   10/1/1996    12/1/1996
1   1/1/1998     9/30/2003
1   1/1/2000     12/31/2004
2   1/1/2001     3/31/2006
3   1/1/2001     9/30/2006
4   1/1/2001     9/30/2005
4   10/1/2004    12/30/2004
4   10/3/2004    11/28/2004
I am trying to get something like this:
ID  before date  after date
1   10/1/1996    12/1/1996
1   1/1/1998     12/31/2004
2   1/1/2001     3/31/2006
3   1/1/2001     9/30/2006
4   1/1/2001     9/30/2005
Basically, I would like to replace any overlapping date ranges with a single range that spans them, leave non-overlapping ranges untouched, and remove the rows that become redundant. I'm not sure how to do this.
First, you must convert your strings to Date-classed values, which makes comparison possible. Here's how I defined and coerced your data:
df <- data.frame(
  ID=c(1,1,1,2,3,4,4,4),
  before.date=c('10/1/1996','1/1/1998','1/1/2000','1/1/2001','1/1/2001','1/1/2001','10/1/2004','10/3/2004'),
  after.date=c('12/1/1996','9/30/2003','12/31/2004','3/31/2006','9/30/2006','9/30/2005','12/30/2004','11/28/2004')
);
dcis <- grep('date$',names(df));
df[dcis] <- lapply(df[dcis],as.Date,'%m/%d/%Y');
df;
## ID before.date after.date
## 1 1 1996-10-01 1996-12-01
## 2 1 1998-01-01 2003-09-30
## 3 1 2000-01-01 2004-12-31
## 4 2 2001-01-01 2006-03-31
## 5 3 2001-01-01 2006-09-30
## 6 4 2001-01-01 2005-09-30
## 7 4 2004-10-01 2004-12-30
## 8 4 2004-10-03 2004-11-28
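To illustrate why the conversion matters, here is a small sketch (not part of the original answer) that compares the same two dates as strings and as Date values. The variable names are illustrative only.

```r
# Two dates in the question's m/d/Y format, still stored as character strings.
a <- '9/30/2003'
b <- '10/1/2004'

# Character comparison goes character by character, so '9...' sorts after '1...'
# even though the first date is earlier on the calendar.
string_cmp <- a < b

# After conversion to Date, the comparison follows the calendar.
date_cmp <- as.Date(a, format = '%m/%d/%Y') < as.Date(b, format = '%m/%d/%Y')

print(c(string = string_cmp, date = date_cmp))
```

The string comparison says the 2003 date is not before the 2004 date, which is why the overlap test below only makes sense on Date values.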
Now my solution involves calculating an "overlapping grouping" vector, which I named og. It assumes that the input df is ordered by ID and then by before.date, which is the case for your example data. If it is not, the ordering can be achieved with df[order(df$ID,df$before.date),]. This is how I calculate og:
cummax.Date <- function(x) as.Date(cummax(as.integer(x)),'1970-01-01');
og <- with(df,c(0,cumsum(!(ID[-length(ID)]==ID[-1] & ave(after.date,ID,FUN=cummax)[-length(after.date)]>before.date[-1]))));
og;
## [1] 0 1 1 2 3 4 4 4
Unfortunately, the base R function cummax() doesn't work with Date-classed objects, so I had to write a cummax.Date() shim. I will explain the need for the ave() and cummax() business at the end of the post.
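As a quick check of the shim in isolation, the following sketch applies it to a short Date vector. The sample dates are made up for illustration; the shim itself is the one from the answer, with the origin argument named explicitly.

```r
# Shim from the answer: take the running maximum of the underlying day counts,
# then convert the result back to Date.
cummax.Date <- function(x) as.Date(cummax(as.integer(x)), origin = '1970-01-01')

# Illustrative dates (not from the question's data).
d <- as.Date(c('2005-09-30', '2004-12-30', '2004-11-28', '2006-01-15'))

# Because the function is named cummax.Date, calling cummax() on a Date
# vector dispatches to it as an S3 method.
running_max <- cummax(d)
print(running_max)
```

Each element of the result is the latest date seen so far, so the second and third elements repeat the first date until a later one appears.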
As you can see, the above computation shifts the RHS of each of the two vectorized comparisons by excluding the first element via [-1], while the LHS drops its last element so that the lengths match. This lets us compare each record's ID for equality with the following record's ID, and also test whether its after.date is later than the before.date of the following record. The resulting logical vectors are ANDed (&) together. The negation of that logical vector then marks adjacent pairs of records that do not overlap, so we can cumsum() the result (and prepend zero, since the first record must start at zero) to get our grouping vector.
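The shifted comparison is easier to follow on plain numbers. Here is a minimal sketch on toy data, assuming the records are already sorted by id and then by start; the names id, start, end and grp are illustrative, and the cumulative-maximum refinement is left out to keep the example short.

```r
# Toy data, assumed sorted by id and then by start.
id    <- c(1, 1, 1, 2, 2)
start <- c(1, 5, 8, 2, 20)
end   <- c(3, 9, 12, 10, 25)

n <- length(id)

# Left side drops the last element (record i); right side drops the first
# element (record i + 1), so each position compares a record with its successor.
same_id  <- id[-n] == id[-1]
overlaps <- end[-n] > start[-1]

# TRUE where record i + 1 continues the group of record i.
continues <- same_id & overlaps

# Every FALSE starts a new group; prepend 0 for the first record.
grp <- c(0, cumsum(!continues))
print(grp)
```

Records 2 and 3 share a group because 9 > 8 within the same id; every other adjacent pair either changes id or leaves a gap.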
Finally, for the last part of the solution, I used by() to work with each overlapping group independently:
do.call(rbind,by(df,og,function(g) transform(g[1,],after.date=max(g$after.date))));
## ID before.date after.date
## 0 1 1996-10-01 1996-12-01
## 1 1 1998-01-01 2004-12-31
## 2 2 2001-01-01 2006-03-31
## 3 3 2001-01-01 2006-09-30
## 4 4 2001-01-01 2005-09-30
Since all the records in a group must have the same ID, and we have assumed that the records are ordered by before.date (after being ordered by ID, which is no longer relevant within a group), we can take the correct ID and before.date values from the first record in the group. That's why I started with g[1,]. Then we just need the latest after.date in the group, obtained via max(g$after.date), and overwrite the first record's after.date with it, which I did with transform().
A word on performance: the ordering assumption helps performance, because it allows us to simply compare each record to the immediately following record using lagged vectorized comparisons, rather than comparing each record in the group to every other record.
Now for the ave() and cummax() business. I realized after writing the initial version of my answer that there was a flaw in my solution that happens not to be exposed by your example data. Say there are three records in a group. If the first record has a range that overlaps both of the next two records, but the middle record does not overlap the third record, then my (original) code would fail to identify the third record as part of the same overlapping group as the previous two.
The fix is to not just use the after.date of the current record when comparing with the next record, but instead use the cumulative maximum after.date within each ID. If an earlier record extends completely beyond the record that follows it, it clearly overlaps that record, and its after.date is the one that matters when testing for overlap with subsequent records.
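The three-record case can be reproduced with plain numbers. This is a minimal sketch with made-up values; id, start, end, naive and fixed are illustrative names, not part of the original answer.

```r
# Toy records for a single id, sorted by start. The first range (1 to 30)
# covers both later ranges, but the second (10 to 12) ends before the
# third (13 to 20) begins.
id    <- c(4, 4, 4)
start <- c(1, 10, 13)
end   <- c(30, 12, 20)

# Running maximum of 'end' within each id.
run_end <- ave(end, id, FUN = cummax)

n <- length(id)

# Comparing only the current record's end splits off the third record.
naive <- c(0, cumsum(!(id[-n] == id[-1] & end[-n] > start[-1])))

# Comparing the running maximum keeps all three records in one group.
fixed <- c(0, cumsum(!(id[-n] == id[-1] & run_end[-n] > start[-1])))

print(rbind(naive = naive, fixed = fixed))
```

With the naive comparison the third record lands in a new group, while the running maximum correctly carries the first record's end date forward.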
Here's an input that requires this fix, using df as a base:
df2 <- df;
df2[7,'after.date'] <- '2004-10-02';
df2;
## ID before.date after.date
## 1 1 1996-10-01 1996-12-01
## 2 1 1998-01-01 2003-09-30
## 3 1 2000-01-01 2004-12-31
## 4 2 2001-01-01 2006-03-31
## 5 3 2001-01-01 2006-09-30
## 6 4 2001-01-01 2005-09-30
## 7 4 2004-10-01 2004-10-02
## 8 4 2004-10-03 2004-11-28
Now record 6 overlaps both records 7 and 8, but record 7 does not overlap record 8. The solution still works:
cummax.Date <- function(x) as.Date(cummax(as.integer(x)),'1970-01-01');
og <- with(df2,c(0,cumsum(!(ID[-length(ID)]==ID[-1] & ave(after.date,ID,FUN=cummax)[-length(after.date)]>before.date[-1]))));
og;
## [1] 0 1 1 2 3 4 4 4
do.call(rbind,by(df2,og,function(g) transform(g[1,],after.date=max(g$after.date))));
## ID before.date after.date
## 0 1 1996-10-01 1996-12-01
## 1 1 1998-01-01 2004-12-31
## 2 2 2001-01-01 2006-03-31
## 3 3 2001-01-01 2006-09-30
## 4 4 2001-01-01 2005-09-30
Here's proof that the og calculation would be wrong without the ave()/cummax() fix:
og <- with(df2,c(0,cumsum(!(ID[-length(ID)]==ID[-1] & after.date[-length(after.date)]>before.date[-1]))));
og;
## [1] 0 1 1 2 3 4 4 5
A minor tweak to the solution is to overwrite after.date before computing og, which avoids the max() call (this makes sense if you plan on overwriting the original df with the new aggregation):
cummax.Date <- function(x) as.Date(cummax(as.integer(x)),'1970-01-01');
df$after.date <- ave(df$after.date,df$ID,FUN=cummax);
df;
## ID before.date after.date
## 1 1 1996-10-01 1996-12-01
## 2 1 1998-01-01 2003-09-30
## 3 1 2000-01-01 2004-12-31
## 4 2 2001-01-01 2006-03-31
## 5 3 2001-01-01 2006-09-30
## 6 4 2001-01-01 2005-09-30
## 7 4 2004-10-01 2005-09-30
## 8 4 2004-10-03 2005-09-30
og <- with(df,c(0,cumsum(!(ID[-length(ID)]==ID[-1] & after.date[-length(after.date)]>before.date[-1]))));
og;
## [1] 0 1 1 2 3 4 4 4
df <- do.call(rbind,by(df,og,function(g) transform(g[1,],after.date=g$after.date[nrow(g)])));
df;
## ID before.date after.date
## 0 1 1996-10-01 1996-12-01
## 1 1 1998-01-01 2004-12-31
## 2 2 2001-01-01 2006-03-31
## 3 3 2001-01-01 2006-09-30
## 4 4 2001-01-01 2005-09-30
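For reuse, the steps can be wrapped into one function. The following is only a sketch under stated assumptions: merge_overlaps() is a hypothetical helper name, the column names are assumed to match the example data, and both date columns are assumed to be of class Date already. It works on the underlying day counts, so it does not need the cummax.Date() shim, and it sorts its input rather than assuming it is sorted.

```r
# Hypothetical helper (not from the original answer): collapse overlapping
# [before.date, after.date] ranges within each ID.
merge_overlaps <- function(d) {
  if (nrow(d) == 0) return(d)
  # Enforce the ordering that the lagged comparison relies on.
  d <- d[order(d$ID, d$before.date), ]
  n <- nrow(d)
  before <- as.numeric(d$before.date)
  after  <- as.numeric(d$after.date)
  # Running maximum of after.date within each ID, on day counts.
  run_after <- ave(after, d$ID, FUN = cummax)
  # A record continues the previous group if the ID matches and the running
  # end date is later than its start date.
  continues <- d$ID[-n] == d$ID[-1] & run_after[-n] > before[-1]
  og <- c(0, cumsum(!continues))
  # Keep the first row of each group and give it the group's latest end date.
  out <- d[!duplicated(og), ]
  out$after.date <- as.Date(as.vector(tapply(after, og, max)), origin = '1970-01-01')
  rownames(out) <- NULL
  out
}

# Example data from the question, already as Date, fed in reverse order
# to show that the helper does its own sorting.
df <- data.frame(
  ID = c(1, 1, 1, 2, 3, 4, 4, 4),
  before.date = as.Date(c('1996-10-01', '1998-01-01', '2000-01-01', '2001-01-01',
                          '2001-01-01', '2001-01-01', '2004-10-01', '2004-10-03')),
  after.date = as.Date(c('1996-12-01', '2003-09-30', '2004-12-31', '2006-03-31',
                         '2006-09-30', '2005-09-30', '2004-12-30', '2004-11-28'))
)
res <- merge_overlaps(df[rev(seq_len(nrow(df))), ])
print(res)
```

Like the code above, this uses a strict > comparison, so two ranges that merely touch (one ends on the day the next begins) are kept separate; switching to >= would merge them, if that is the behaviour you want.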