Return the column index of the first run of consecutive values in a row of a data frame in R

I have a large data frame (200k rows) of monthly sample data. Each monthly variable records the test result for that month: positive (1) or negative (0). The file also contains unique identifiers and a number of factor variables for use in analysis. Here's a simplified example to illustrate:

w <- c(101, 0, 0, 0, 1, 1, 1, 5)
x <- c(102, 0, 0, 0, 0, 0, 0, 3)
y <- c(103, 1, 0, 0, 0, 0, 0, 2)
z <- c(104, 1, 1, 1, 0, 0, 0, 2)
dfrm <- data.frame(rbind(w,x,y,z), row.names = NULL)
names(dfrm) <- c("id","jan","feb","mar","apr","may","jun","start")

      

Test participants joined at different times; the last column is an index giving the column in which each participant's first test result is recorded. Results for months before a participant joined are recorded as zeros (as in the first row of the example).

I want to identify the first sequence of three consecutive zeros for each participant and return the position at which that three-zero sequence begins, limiting the search to the columns from when they started the trial (those from the index column onwards).

My approach (and I'm sure there are many) was to split it into two tasks: setting the test results from before the participant joined to NA using a for loop:

for (i in 1:nrow(dfrm)){
  if(dfrm$start[i] > 2)
    dfrm[i,2:(dfrm$start[i]-1)] <- NA
}

      

before using match in a loop across the entire data range, now that the rogue early zeros are set to NA:

for (i in 1:nrow(dfrm)){
f <- match(c(0,0,0), dfrm[i,2:7])
dfrm$outputmth[i] <- f[1]
}

dfrm$outputmth <- dfrm$outputmth - (dfrm$start - 2)

      

This succeeded (I think) in generating my desired result: the first occurrence of three consecutive zeros for each participant while active, and NA where no such run was found.

This comes with some awkward workarounds, in particular the second loop, which returns three values in f, of which I only select the first to fill dfrm$outputmth. More importantly, it took about 30 minutes to run this code on the full dataset. So, feeling a little stumped, I hope there is a more efficient way to write and run this?

Thanks a lot for any help.



1 answer


I don't think what you've already written gives the correct result, because match(c(0, 0, 0), ...) won't match the first three consecutive zeros; it gives the position of the first zero, repeated three times. In general, you should avoid loops that iterate over the rows of a data frame because they tend to be slow (for example, if you modify the data frame in the body of the loop, it will create copies). One alternative uses apply to iterate over the rows of the data frame and the function rle to check for three consecutive zeros.
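To make the difference concrete, here is a small sketch on a made-up vector (v is a hypothetical example, not part of the data above) showing what match and rle each return:

```r
# v is a made-up vector used only for illustration
v <- c(1, 0, 1, 0, 0, 0)

# match() looks up each element of its first argument separately,
# so three zeros just give the position of the first 0 three times
m <- match(c(0, 0, 0), v)
m          # 2 2 2

# rle() collapses the vector into runs of equal values
r <- rle(v)
r$values   # 1 0 1 0
r$lengths  # 1 1 1 3
```

The run lengths are what make it possible to find where the first run of at least three zeros begins: add up the lengths of the runs that come before it.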



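dfrm$outputmth <- apply(dfrm[-1], 1, function(x) {
    # x is one row without id: x[1:6] are jan..jun and x[7] is start
    # (with id dropped, x[k] holds data frame column k + 1)
    y <- rle(x[x[7]:6])                     # runs from position x[7] onwards
    z <- y$values == 0 & y$lengths >= 3     # runs of at least three zeros
    i <- which(z)[1]                        # first such run, NA if there is none
    if (is.na(i)) return(NA)
    if (i == 1) return(x[7])                # run begins at the first position searched
    return(sum(y$lengths[1:(i-1)]) + x[7])  # skip over the earlier runs
})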

dfrm
#  id jan feb mar apr may jun start outputmth
# 101   0   0   0   1   1   1     5        NA
# 102   0   0   0   0   0   0     3         3
# 103   1   0   0   0   0   0     2         2
# 104   1   1   1   0   0   0     2         4
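For a data set of 200k rows it may also be worth avoiding the per-row function call altogether. The following is only a sketch under stated assumptions, not a tested replacement: first_zero_run is a hypothetical helper name, start is taken to be a column index of dfrm counted with id as column 1 (as in the question's own loop), and the value returned is the data frame column index of the first zero in the run. Because the apply version above indexes the row after id has been dropped, its results can differ from this sketch by one position, so check which convention you need.

```r
# Rebuild the example data from the question
dfrm <- data.frame(
  id    = c(101, 102, 103, 104),
  jan   = c(0, 0, 1, 1),
  feb   = c(0, 0, 0, 1),
  mar   = c(0, 0, 0, 1),
  apr   = c(1, 0, 0, 0),
  may   = c(1, 0, 0, 0),
  jun   = c(1, 0, 0, 0),
  start = c(5, 3, 2, 2)
)

# Hypothetical helper: column index of the first run of `run` zeros
# at or after each row's start column, or NA if there is no such run
first_zero_run <- function(df, month_cols = 2:7, run = 3) {
  m <- as.matrix(df[, month_cols])
  # data frame column index of every cell in m
  col_idx <- matrix(month_cols, nrow = nrow(m), ncol = ncol(m), byrow = TRUE)
  # TRUE where the result is 0 and the column is not before the start column
  z <- m == 0 & col_idx >= df$start
  # a run begins at position j when z is TRUE at j, j + 1, ..., j + run - 1
  n <- ncol(z) - run + 1
  hit <- z[, 1:n, drop = FALSE]
  for (k in seq_len(run - 1)) {
    hit <- hit & z[, (1:n) + k, drop = FALSE]
  }
  # position of the first TRUE in each row; rows without any TRUE become NA
  first <- max.col(hit * 1, ties.method = "first")
  first[rowSums(hit) == 0] <- NA
  month_cols[first]
}

res <- first_zero_run(dfrm)
res                   # NA 3 3 5
res - dfrm$start + 1  # NA 1 2 4 (months since joining, counting the start month as 1)
```

The only loop here runs over the run length (two iterations for three zeros), not over the rows, so the work per row is done with whole-matrix operations.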

      
