Idiomatic way of copying cell values "down" in an R vector

Possible duplicate:
Padding NA in a vector using values other than NA?

Is there an idiomatic way to copy cell values "down" in an R vector? By "copy-down" I mean replacing each NA with the closest preceding non-NA value.

While I can do this very simply with a for loop, it is very slow. Any advice on how to do this better would be appreciated.
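
For example, on a small vector the desired result would look like this (just an illustration, not part of the timed code below):

x <- c(1, NA, NA, 2, NA, 3)
# copy-down result: 1 1 1 2 2 3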

# Test code
# Set up test data
len <- 1000000
data <- rep(c(1, rep(NA, 9)), len %/% 10) * rep(1:(len %/% 10), each=10)
head(data, n=25)
tail(data, n=25)

# Time naive method
system.time({
  data.clean <- data;
  for (i in 2:length(data.clean)){
    if(is.na(data.clean[i])) data.clean[i] <- data.clean[i-1]
  }
})

# Print results
head(data.clean, n=25)
tail(data.clean, n=25)

Test run result:

> # Set up test data
> len <- 1000000
> data <- rep(c(1, rep(NA, 9)), len %/% 10) * rep(1:(len %/% 10), each=10)
> head(data, n=25)
 [1]  1 NA NA NA NA NA NA NA NA NA  2 NA NA NA NA NA NA NA NA NA  3 NA NA NA NA
> tail(data, n=25)
 [1]     NA     NA     NA     NA     NA  99999     NA     NA     NA     NA
[11]     NA     NA     NA     NA     NA 100000     NA     NA     NA     NA
[21]     NA     NA     NA     NA     NA
> 
> # Time naive method
> system.time({
+   data.clean <- data;
+   for (i in 2:length(data.clean)){
+     if(is.na(data.clean[i])) data.clean[i] <- data.clean[i-1]
+   }
+ })
   user  system elapsed 
   3.09    0.00    3.09 
> 
> # Print results
> head(data.clean, n=25)
 [1] 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3
> tail(data.clean, n=25)
 [1]  99998  99998  99998  99998  99998  99999  99999  99999  99999  99999
[11]  99999  99999  99999  99999  99999 100000 100000 100000 100000 100000
[21] 100000 100000 100000 100000 100000
> 



2 answers


Use zoo::na.locf

Wrapping your code in a function f (including returning data.clean at the end):



library(rbenchmark)
library(zoo)

# The question's loop, wrapped as a function f that returns data.clean
f <- function(data) {
  data.clean <- data
  for (i in 2:length(data.clean)) {
    if (is.na(data.clean[i])) data.clean[i] <- data.clean[i-1]
  }
  data.clean
}

identical(f(data), na.locf(data))
## [1] TRUE

benchmark(f(data), na.locf(data), replications=10, columns=c("test", "elapsed", "relative"))
##            test elapsed relative
## 1       f(data)  21.460   14.471
## 2 na.locf(data)   1.483    1.000
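
For reference, here is na.locf on a small vector; its na.rm argument (default TRUE) controls whether leading NAs are dropped. This snippet is my own illustration, not part of the benchmark above:

na.locf(c(NA, 1, NA, 3))               # drops the leading NA: 1 1 3
na.locf(c(NA, 1, NA, 3), na.rm=FALSE)  # keeps it: NA 1 1 3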



I don't know about idiomatic, but here we find the non-NA values (idx) and the index of the most recent non-NA value (cumsum(idx)):

f1 <- function(x) {
    idx <- !is.na(x)
    x[idx][cumsum(idx)]
}
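
To see the trick on a small vector (an illustration added here, not part of the original answer):

x <- c(1, NA, NA, 2, NA)
idx <- !is.na(x)       # TRUE FALSE FALSE TRUE FALSE
cumsum(idx)            # 1 1 1 2 2: index of the most recent non-NA
x[idx]                 # 1 2: the non-NA values
x[idx][cumsum(idx)]    # 1 1 1 2 2: each position takes the latest non-NA value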

which appears to be about 6x faster than na.locf on the example data. Like na.locf's default, it drops leading NAs, so



f2 <- function(x, na.rm=TRUE) {
    idx <- !is.na(x)
    cidx <- cumsum(idx)
    if (!na.rm)
        cidx[cidx==0] <- NA_integer_
    x[idx][cidx]
}
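
A quick illustration of the na.rm switch on a vector with a leading NA (my example, not from the original answer):

f2(c(NA, 1, NA, 3, NA))               # 1 1 3 3 (leading NA dropped)
f2(c(NA, 1, NA, 3, NA), na.rm=FALSE)  # NA 1 1 3 3 (leading NA kept)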


which seems to add about 30% to the running time when na.rm=FALSE. Presumably na.locf has other merits, capturing more corner cases and allowing values to be filled "up" as well as "down" (which would be an interesting challenge in the cumsum world). It is also clear that we are making at least five allocations of possibly big data (idx, for which we actually compute is.na() and its complement, cumsum(idx), x[idx], and x[idx][cumsum(idx)]), so there is room for further improvement, for example at the C level.
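
A minimal sketch of what such a C-level fill might look like, assuming Rcpp is available (locf_c is a hypothetical name of mine; this is not part of the original answer):

library(Rcpp)
cppFunction('
NumericVector locf_c(NumericVector x) {
    NumericVector out = clone(x);  // copy so the input is left untouched
    for (int i = 1; i < out.size(); i++) {
        // carry the previous value forward over NAs
        if (NumericVector::is_na(out[i])) out[i] = out[i - 1];
    }
    return out;
}')

identical(locf_c(data), f2(data, na.rm=FALSE))  # should be TRUE for this data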
