Detection and correction of grouping variables with more than one repetition of a time sequence

I have a dataset in which the personal identifier, id, is wrong. I would like to find an automatic detection and correction method, but I cannot figure it out.

So far I only know a manual way to do this, which is quite cumbersome.

The data looks like this:

   id time
1   1    1
2   1    2
3   1    3
4   2    1
5   2    2
6   2    3
7   2    1
8   2    2
9   2    3
10  3    1
11  3    2
12  3    3
13  3    1
14  3    2
15  3    3

      

id 2 and id 3 are incorrect, as the time variable shows. Every time the time sequence starts over, id must change.

I create a row counter (row in the code below) and a copy of id that will hold the fixed values (id_f).

dta$row = 1:nrow(dta)
dta$id_f = dta$id

      

Then I fix the cases manually:

dta[4:6, 'id_f'] <- paste( dta[4:6, 'id_f'], 'a', sep = '')
dta[7:9, 'id_f'] <- paste( dta[7:9, 'id_f'], 'b', sep = '')

dta[10:12, 'id_f'] <- paste( dta[10:12, 'id_f'], 'a', sep = '')
dta[13:15, 'id_f'] <- paste( dta[13:15, 'id_f'], 'b', sep = '')

      

Do you have any hint on how I could avoid doing this manually?

The output I want is the following, with the corrected id in id_f:

   id time row id_f
1   1    1   1    1
2   1    2   2    1
3   1    3   3    1
4   2    1   4   2a
5   2    2   5   2a
6   2    3   6   2a
7   2    1   7   2b
8   2    2   8   2b
9   2    3   9   2b
10  3    1  10   3a
11  3    2  11   3a
12  3    3  12   3a
13  3    1  13   3b
14  3    2  14   3b
15  3    3  15   3b

      

Data

dta = structure(list(id = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 
3L, 3L, 3L, 3L, 3L), time = c(1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 
3L, 1L, 2L, 3L, 1L, 2L, 3L)), .Names = c("id", "time"), class = "data.frame", row.names = c(NA, 
-15L))

      



3 answers


Here is one possibility:



do.call(rbind, 
        by(dta, dta$id, function(x){

          # identify ids in which the time sequence starts more than once;
          # "more than one row with time == 1" is the criterion used here
          if(sum(x$time == 1) > 1){

            # diff: to detect non-consecutive times, i.e. differences not equal to one.
            # cumsum: to create an index variable, used to index letters
            x$id2 <- paste0(x$id, letters[cumsum(c(FALSE, diff(x$time) != 1)) + 1])

          # for id with a correct sequence of "time", just use the original id
          } else {
            x$id2 <- x$id
          }
          x
        }))

#      id time id2
# 1.1   1    1   1
# 1.2   1    2   1
# 1.3   1    3   1
# 2.4   2    1  2a
# 2.5   2    2  2a
# 2.6   2    3  2a
# 2.7   2    1  2b
# 2.8   2    2  2b
# 2.9   2    3  2b
# 3.10  3    1  3a
# 3.11  3    2  3a
# 3.12  3    3  3a
# 3.13  3    1  3b
# 3.14  3    2  3b
# 3.15  3    3  3b
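
To see what the diff/cumsum step does in isolation, here is a minimal sketch on a single time vector shaped like the one for id 2. The names breaks and run are only illustrative and do not appear in the code above.

```r
# time values for one id whose sequence starts twice
time <- c(1L, 2L, 3L, 1L, 2L, 3L)

# diff() gives 1 1 -2 1 1; anything other than 1 marks the start of a new run
breaks <- c(FALSE, diff(time) != 1)

# cumsum() turns the break flags into a run index: 1 1 1 2 2 2
run <- cumsum(breaks) + 1

# the run index selects the suffix letter
letters[run]
# "a" "a" "a" "b" "b" "b"
```

Note that letters has only 26 entries, so an id with more than 26 runs would get NA as a suffix.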

      



Not exactly what you asked for, but if you can tolerate having a 1a without a 1b, this will work. It does require that your data are properly sorted.



library(dplyr)
dta %>%
  mutate(time_diff = c(-1, diff(time)),
         new_time = (time_diff < 0),
         time_id = cumsum(new_time),
         row = 1:n()) %>%
  group_by(id) %>%
  mutate(time_id = time_id - (min(time_id) - 1),
         time_id = letters[time_id],
         id_f = paste0(id, time_id)) %>%
  ungroup() %>%
  select(id, time, row, id_f) 
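
One way to read "properly sorted" is that every id occupies a single contiguous block of rows, with the rows inside each block in their original order. Under that assumption, the following base R sketch checks the condition and restores it if needed. The helper is_blocked and the data frames shuffled and fixed are hypothetical names made up for this illustration.

```r
# Example data from the question, then a copy with the id blocks interleaved
dta <- data.frame(id = rep(1:3, times = c(3, 6, 6)),
                  time = rep(1:3, times = 5))
shuffled <- dta[c(4:6, 1:3, 10:12, 7:9, 13:15), ]

# Hypothetical helper: TRUE when each id forms one contiguous block of rows
is_blocked <- function(id) anyDuplicated(rle(as.vector(id))$values) == 0

is_blocked(dta$id)       # TRUE
is_blocked(shuffled$id)  # FALSE

# order() leaves ties in their original order, so sorting by id alone
# groups the rows without disturbing the row order within each id
fixed <- shuffled[order(shuffled$id), ]
is_blocked(fixed$id)     # TRUE
```

Sorting by id and time together would not help here, because it would merge the repeated time sequences that the pipeline is supposed to tell apart.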

      



I named the data frame z.

z$timediff <- c(0,diff(z$time)) < 0
z$iddiff <- c(0,diff(z$id))
z$timediffminusiddiff <- z$timediff - z$iddiff
z$cumsumtimediff <- cumsum(z$timediff)

z$haserr <- ave(z$timediffminusiddiff,z$id,FUN = max)
z$newnum <- letters[z$cumsumtimediff - ave(z$cumsumtimediff,z$id,FUN = min) + 1]
z[z$haserr == 1,'id'] <- paste0(z$id,z$newnum)[z$haserr == 1]
z[ ,c('id','time')]

      

You could squeeze this together into fewer lines, but then it's harder to read.
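
As a rough sketch of what a more compact variant might look like (not the code above, just the same idea of counting restarts), ave() can number the runs within each id directly. The names run, n_runs and id_f are assumptions for illustration, and z is rebuilt here from the question's data.

```r
# Example data from the question
z <- data.frame(id = rep(1:3, times = c(3, 6, 6)),
                time = rep(1:3, times = 5))

# Within each id, start a new run whenever time decreases
run <- ave(z$time, z$id, FUN = function(t) cumsum(c(TRUE, diff(t) < 0)))

# Number of runs per id; only ids with more than one run get a letter suffix
n_runs <- ave(run, z$id, FUN = max)
z$id_f <- ifelse(n_runs > 1, paste0(z$id, letters[run]), as.character(z$id))

z$id_f
# "1" "1" "1" "2a" "2a" "2a" "2b" "2b" "2b" "3a" "3a" "3a" "3b" "3b" "3b"
```

Unlike the step-by-step version, this writes the result to a new column and leaves id untouched, which makes it easier to compare the two.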
