Detection and correction of grouping variables with more than one repetition of a time sequence
I have a dataset in which the personal identifier, id, is wrong. I would like to find an automatic way to detect and correct it, but I cannot figure it out.
I only know a manual way to do this, which is quite cumbersome.
The data looks like this:
   id time
1   1    1
2   1    2
3   1    3
4   2    1
5   2    2
6   2    3
7   2    1
8   2    2
9   2    3
10  3    1
11  3    2
12  3    3
13  3    1
14  3    2
15  3    3
The ids 2 and 3 are incorrect, as the variable time shows. Every time time starts over, id must change.
I create a row counter, row, and a copy of id, id_f, to hold the fixed id.
dta$row = 1:nrow(dta)
dta$id_f = dta$id
Then I fix the cases manually:
dta[4:6, 'id_f'] <- paste( dta[4:6, 'id_f'], 'a', sep = '')
dta[7:9, 'id_f'] <- paste( dta[7:9, 'id_f'], 'b', sep = '')
dta[10:12, 'id_f'] <- paste( dta[10:12, 'id_f'], 'a', sep = '')
dta[13:15, 'id_f'] <- paste( dta[13:15, 'id_f'], 'b', sep = '')
Do you have any hint on how I could avoid doing this manually?
The output I want is the following, with the corrected id in id_f:
   id time row id_f
1   1    1   1    1
2   1    2   2    1
3   1    3   3    1
4   2    1   4   2a
5   2    2   5   2a
6   2    3   6   2a
7   2    1   7   2b
8   2    2   8   2b
9   2    3   9   2b
10  3    1  10   3a
11  3    2  11   3a
12  3    3  12   3a
13  3    1  13   3b
14  3    2  14   3b
15  3    3  15   3b
Data
dta = structure(list(id = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 3L,
3L, 3L, 3L, 3L, 3L), time = c(1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L,
3L, 1L, 2L, 3L, 1L, 2L, 3L)), .Names = c("id", "time"), class = "data.frame", row.names = c(NA,
-15L))
Here is one possibility:
do.call(rbind,
        by(dta, dta$id, function(x){
          # identify ids in which "time" starts over;
          # I used "more than one 1" as the criterion
          if(sum(x$time == 1) > 1){
            # diff: detects non-consecutive times, i.e. differences not equal to one
            # cumsum: creates a running index, used to pick the letters
            x$id2 <- paste0(x$id, letters[cumsum(c(FALSE, diff(x$time) != 1)) + 1])
          # for ids with a correct sequence of "time", just use the original id
          } else {
            x$id2 <- x$id
          }
          x
        }))
#      id time id2
# 1.1   1    1   1
# 1.2   1    2   1
# 1.3   1    3   1
# 2.4   2    1  2a
# 2.5   2    2  2a
# 2.6   2    3  2a
# 2.7   2    1  2b
# 2.8   2    2  2b
# 2.9   2    3  2b
# 3.10  3    1  3a
# 3.11  3    2  3a
# 3.12  3    3  3a
# 3.13  3    1  3b
# 3.14  3    2  3b
# 3.15  3    3  3b
Not exactly what you asked for, but if you can tolerate having a 1a without a 1b, this will work. It does require that your data is properly sorted.
library(dplyr)
dta %>%
  mutate(time_diff = c(-1, diff(time)),
         new_time = (time_diff < 0),
         time_id = cumsum(new_time),
         row = 1:n()) %>%
  group_by(id) %>%
  mutate(time_id = time_id - (min(time_id) - 1),
         time_id = letters[time_id],
         id_f = paste0(id, time_id)) %>%
  ungroup() %>%
  select(id, time, row, id_f)
I named the data frame z.
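# TRUE where time drops relative to the previous row, i.e. a new sequence starts
z$timediff <- c(0, diff(z$time)) < 0
# change in id relative to the previous row
z$iddiff <- c(0, diff(z$id))
# 1 where time restarts but id stays the same
z$timediffminusiddiff <- z$timediff - z$iddiff
# running count of sequence starts
z$cumsumtimediff <- cumsum(z$timediff)
# 1 for every row of an id that contains such a restart
z$haserr <- ave(z$timediffminusiddiff, z$id, FUN = max)
# letter for each sequence within its id
z$newnum <- letters[z$cumsumtimediff - ave(z$cumsumtimediff, z$id, FUN = min) + 1]
# append the letter only to the flagged ids
z[z$haserr == 1, 'id'] <- paste0(z$id, z$newnum)[z$haserr == 1]
z[, c('id', 'time')]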
You could squeeze this together into fewer lines, but then it's harder to read.
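For comparison, here is a minimal sketch of what a condensed base-R version might look like. It is not the code above rewritten line for line: it numbers the runs of time within each id and adds a letter only where an id has more than one run. The names run, n_runs and id_f are chosen for illustration, and it assumes the rows are already in order and that no id has more than 26 runs (the length of letters).

```r
# Example data, same values as in the question
dta <- data.frame(id   = rep(c(1L, 2L, 3L), times = c(3, 6, 6)),
                  time = c(1:3, rep(1:3, 4)))

# Run index within each id: goes up whenever time fails to increase
run <- ave(dta$time, dta$id,
           FUN = function(t) cumsum(c(TRUE, diff(t) <= 0)))

# Number of runs per id, repeated on every row of that id
n_runs <- ave(run, dta$id, FUN = max)

# Add a letter suffix only for ids that contain more than one run
dta$id_f <- ifelse(n_runs > 1,
                   paste0(dta$id, letters[run]),
                   as.character(dta$id))
dta
```

Unlike the check for "more than one 1", this treats any non-increasing step in time as the start of a new run, so it should also cope with sequences that do not start at 1.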