Identifying duplicate values including the first value in R
My data looks like this:
ROW ID DATE
1 1 09/20/2014
2 1 09/21/2014
3 1 09/22/2014
4 1 09/22/2014
5 2 09/19/2014
6 2 09/20/2014
7 2 09/21/2014
8 2 09/21/2014
9 2 09/21/2014
10 3 09/18/2014
11 3 09/19/2014
12 3 09/20/2014
13 3 09/20/2014
I want to create a FL column to identify duplicate dates by ID, I know duplicated () will identify subsequent duplicate values, but I also want to define the first value before it starts repeating
My data should look like this:
ROW ID DATE FL
1 1 09/20/2014 0
2 1 09/21/2014 0
3 1 09/22/2014 1
4 1 09/22/2014 1
5 2 09/19/2014 0
6 2 09/20/2014 0
7 2 09/21/2014 1
8 2 09/21/2014 1
9 2 09/21/2014 1
10 3 09/18/2014 0
11 3 09/19/2014 0
12 3 09/20/2014 1
13 3 09/20/2014 1
So, for each id, non-repeating dates get a FL value of 0 and repeating dates get a FL value of 1. It would be great if you could help me with the R-code for this. Thank.
change
here's the amount of data:
structure(list(ID = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L),
DATE = structure(c(3L, 4L, 5L, 5L, 2L, 3L, 4L, 4L, 4L, 1L, 2L, 3L, 3L),
.Label = c("9/18/2014", "9/19/2014", "9/20/2014", "9/21/2014", "9/22/2014"),
class = "factor")),
.Names = c("ID", "DATE"), class = "data.frame", row.names = c(NA, -13L))
source to share
Use something like this with a package data.table
:
library(data.table)
setDT(dat)[,FL := (duplicated(DATE) | duplicated(DATE, fromLast = TRUE))*1,ID]
ID DATE FL
1: 1 9/20/2014 0
2: 1 9/21/2014 0
3: 1 9/22/2014 1
4: 1 9/22/2014 1
5: 2 9/19/2014 0
6: 2 9/20/2014 0
7: 2 9/21/2014 1
8: 2 9/21/2014 1
9: 2 9/21/2014 1
10: 3 9/18/2014 0
11: 3 9/19/2014 0
12: 3 9/20/2014 1
13: 3 9/20/2014 1
Or in R base (using @akrun):
transform(dat, ave(as.numeric(factor(DATE)), ID,
FUN=function(x) duplicated(x)|duplicated(x,fromLast=TRUE)))
source to share
Fwiw, here's one rough way to solve this problem.
# your original data frame
dat <- structure(list(ID = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L),
DATE = structure(c(3L, 4L, 5L, 5L, 2L, 3L, 4L, 4L, 4L, 1L, 2L, 3L, 3L),
.Label = c("9/18/2014", "9/19/2014", "9/20/2014", "9/21/2014", "9/22/2014"),
class = "factor")),
.Names = c("ID", "DATE"), class = "data.frame", row.names = c(NA, -13L))
# glue the columns to avoid need of grouping by ID first
dat2 <- paste(dat$ID, dat$DATE, sep='/')
# alternatively, you can use following for string comparison, if needed.
# dat2<-paste(as.character(dat$ID),as.character(dat$DATE),sep='/')
# create a lookup table for counts of each ID+DATE combo
lookup<-table(dat2)
# add a column based on counts. If count is 1 then ID+DATE is not duplicated.
dat$FL <- sapply(dat2,FUN = function(x) { if (lookup[x] == 1) 0 else 1})
# output
print(dat)
This should give you what you are looking for.
ID DATE FL
1 1 9/20/2014 0
2 1 9/21/2014 0
3 1 9/22/2014 1
4 1 9/22/2014 1
5 2 9/19/2014 0
6 2 9/20/2014 0
7 2 9/21/2014 1
8 2 9/21/2014 1
9 2 9/21/2014 1
10 3 9/18/2014 0
11 3 9/19/2014 0
12 3 9/20/2014 1
13 3 9/20/2014 1
There are more sophisticated ways to do this, and it table()
has its limitations, but for the most part it's simple, easy to read and should do the job for you.
source to share