Identifying duplicate values including the first value in R

Question

Identifying duplicate values including the first value in R

My data looks like this:

ROW  ID   DATE  
1    1    09/20/2014
2    1    09/21/2014
3    1    09/22/2014
4    1    09/22/2014
5    2    09/19/2014
6    2    09/20/2014
7    2    09/21/2014
8    2    09/21/2014
9    2    09/21/2014
10   3    09/18/2014
11   3    09/19/2014
12   3    09/20/2014
13   3    09/20/2014

I want to create a FL column to identify duplicate dates by ID, I know duplicated () will identify subsequent duplicate values, but I also want to define the first value before it starts repeating

My data should look like this:

ROW  ID   DATE         FL
1    1    09/20/2014    0
2    1    09/21/2014    0
3    1    09/22/2014    1
4    1    09/22/2014    1
5    2    09/19/2014    0
6    2    09/20/2014    0
7    2    09/21/2014    1
8    2    09/21/2014    1
9    2    09/21/2014    1
10   3    09/18/2014    0 
11   3    09/19/2014    0
12   3    09/20/2014    1
13   3    09/20/2014    1

So, for each id, non-repeating dates get a FL value of 0 and repeating dates get a FL value of 1. It would be great if you could help me with the R-code for this. Thank.

change

here's the amount of data:

structure(list(ID = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L), 
              DATE = structure(c(3L, 4L, 5L, 5L, 2L, 3L, 4L, 4L, 4L, 1L, 2L, 3L, 3L), 
              .Label = c("9/18/2014", "9/19/2014", "9/20/2014", "9/21/2014", "9/22/2014"), 
                        class = "factor")), 
              .Names = c("ID", "DATE"), class = "data.frame", row.names = c(NA, -13L))

+3

r

Vineet Goti 24 Sep 14 at 18:11

source to share

3 answers

agstudy · Answer 1 · 2014-09-24T18:19:18+0000

Use something like this with a package data.table

:

 library(data.table)
 setDT(dat)[,FL := (duplicated(DATE) | duplicated(DATE, fromLast = TRUE))*1,ID]
    ID      DATE FL
 1:  1 9/20/2014  0
 2:  1 9/21/2014  0
 3:  1 9/22/2014  1
 4:  1 9/22/2014  1
 5:  2 9/19/2014  0
 6:  2 9/20/2014  0
 7:  2 9/21/2014  1
 8:  2 9/21/2014  1
 9:  2 9/21/2014  1
10:  3 9/18/2014  0
11:  3 9/19/2014  0
12:  3 9/20/2014  1
13:  3 9/20/2014  1

Or in R base (using @akrun):

transform(dat, ave(as.numeric(factor(DATE)), ID, 
           FUN=function(x) duplicated(x)|duplicated(x,fromLast=TRUE)))

DivM · Answer 2 · 2014-09-24T21:58:55+0000

Fwiw, here's one rough way to solve this problem.

# your original data frame
dat <- structure(list(ID = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L), 
         DATE = structure(c(3L, 4L, 5L, 5L, 2L, 3L, 4L, 4L, 4L, 1L, 2L, 3L, 3L), 
           .Label = c("9/18/2014", "9/19/2014", "9/20/2014", "9/21/2014", "9/22/2014"), 
           class = "factor")), 
         .Names = c("ID", "DATE"), class = "data.frame", row.names = c(NA, -13L))

# glue the columns to avoid need of grouping by ID first
dat2 <- paste(dat$ID, dat$DATE, sep='/')
# alternatively, you can use following for string comparison, if needed.
# dat2<-paste(as.character(dat$ID),as.character(dat$DATE),sep='/')

# create a lookup table for counts of each ID+DATE combo
lookup<-table(dat2)

# add a column based on counts. If count is 1 then ID+DATE is not duplicated.
dat$FL <- sapply(dat2,FUN = function(x) { if (lookup[x] == 1) 0 else 1})

# output
print(dat)

This should give you what you are looking for.

   ID      DATE FL
1   1 9/20/2014  0
2   1 9/21/2014  0
3   1 9/22/2014  1
4   1 9/22/2014  1
5   2 9/19/2014  0
6   2 9/20/2014  0
7   2 9/21/2014  1
8   2 9/21/2014  1
9   2 9/21/2014  1
10  3 9/18/2014  0
11  3 9/19/2014  0
12  3 9/20/2014  1
13  3 9/20/2014  1

There are more sophisticated ways to do this, and it table()

has its limitations, but for the most part it's simple, easy to read and should do the job for you.

Athos · Answer 3 · 2014-09-26T05:57:44+0000

dplyr

+ magrittr

alternative:

dat %>% 
    group_by(ID, DATE) %>% 
    mutate(FL = ifelse(n() > 1, 1, 0))

Small downside: this will result in 1 for every duplicate date, not just the last one. See if this is inconvenient.

Identifying duplicate values ​​including the first value in R

change

More articles:

Identifying duplicate values including the first value in R