Idiomatic way of copying cell values "down" in an R vector
Possible duplicate:
Padding NA in vector using values other than NA?
Is there an idiomatic way to copy cell values "down" in an R vector? By "copy-down" I mean replacing each NA with the closest preceding non-NA value.
While I can do this very simply with a for loop, it is very slow. Any advice on how to speed it up would be appreciated.
# Test code
# Set up test data
len <- 1000000
data <- rep(c(1, rep(NA, 9)), len %/% 10) * rep(1:(len %/% 10), each=10)
head(data, n=25)
tail(data, n=25)
# Time naive method
system.time({
  data.clean <- data
  for (i in 2:length(data.clean)) {
    if (is.na(data.clean[i])) data.clean[i] <- data.clean[i-1]
  }
})
# Print results
head(data.clean, n=25)
tail(data.clean, n=25)
Test run result:
> # Set up test data
> len <- 1000000
> data <- rep(c(1, rep(NA, 9)), len %/% 10) * rep(1:(len %/% 10), each=10)
> head(data, n=25)
[1] 1 NA NA NA NA NA NA NA NA NA 2 NA NA NA NA NA NA NA NA NA 3 NA NA NA NA
> tail(data, n=25)
[1] NA NA NA NA NA 99999 NA NA NA NA
[11] NA NA NA NA NA 100000 NA NA NA NA
[21] NA NA NA NA NA
>
> # Time naive method
> system.time({
+ data.clean <- data;
+ for (i in 2:length(data.clean)){
+ if(is.na(data.clean[i])) data.clean[i] <- data.clean[i-1]
+ }
+ })
user system elapsed
3.09 0.00 3.09
>
> # Print results
> head(data.clean, n=25)
[1] 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3
> tail(data.clean, n=25)
[1] 99998 99998 99998 99998 99998 99999 99999 99999 99999 99999
[11] 99999 99999 99999 99999 99999 100000 100000 100000 100000 100000
[21] 100000 100000 100000 100000 100000
>
Use zoo::na.locf.
Wrapping your code in a function f (including returning data.clean at the end):
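For concreteness, a sketch of such an f (the question's loop, returning data.clean at the end):
f <- function(x) {
  data.clean <- x
  for (i in 2:length(data.clean)) {
    if (is.na(data.clean[i])) data.clean[i] <- data.clean[i-1]
  }
  data.clean  # return the filled vector
}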
library(rbenchmark)
library(zoo)
identical(f(data), na.locf(data))
## [1] TRUE
benchmark(f(data), na.locf(data), replications=10, columns=c("test", "elapsed", "relative"))
## test elapsed relative
## 1 f(data) 21.460 14.471
## 2 na.locf(data) 1.483 1.000
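One detail worth knowing: na.locf drops leading NAs by default (na.rm=TRUE); pass na.rm=FALSE to keep them. A small illustration on a toy vector:
library(zoo)
x <- c(NA, 1, NA, NA, 2, NA)
na.locf(x)               ## 1 1 1 2 2      (leading NA removed)
na.locf(x, na.rm=FALSE)  ## NA 1 1 1 2 2   (leading NA kept)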
I don't know that it's idiomatic, but here we identify the non-NA values (idx) and the index of the most recent non-NA value (cumsum(idx)):
f1 <- function(x) {
  idx <- !is.na(x)       # TRUE at non-NA positions
  x[idx][cumsum(idx)]    # cumsum(idx): index of the most recent non-NA at each position
}
which appears to be about 6x faster than na.locf on the example data. By default it drops leading NAs, as na.locf does, so
f2 <- function(x, na.rm=TRUE) {
  idx <- !is.na(x)
  cidx <- cumsum(idx)
  if (!na.rm)
    cidx[cidx == 0] <- NA_integer_  # positions before the first non-NA stay NA
  x[idx][cidx]
}
which seems to add about 30% to the elapsed time when na.rm=FALSE. Presumably na.locf has other merits, capturing more corner cases and allowing filling "up" as well as "down" (an interesting exercise in the cumsum world). It is also clear that we make at least five allocations of possibly large data: idx (in fact, we compute is.na() and its complement), cumsum(idx), x[idx], and x[idx][cumsum(idx)]. So there is room for further improvement, for example in C.
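As a quick sanity check of the na.rm behaviour, a small hand-worked example using f1 and f2 as defined above:
x <- c(NA, 1, NA, NA, 2, NA)
f1(x)               ## 1 1 1 2 2     -- leading NA dropped, like na.locf() default
f2(x, na.rm=FALSE)  ## NA 1 1 1 2 2  -- leading NA kept as NA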