Replace NA with the average of the column groups

I want to find the mean of all values within each group of columns. A column group can contain missing values, and I want to replace the missing values in a group with the mean for that group. In my case, the number of columns in a group is constant: years.

Below is code that does this. However, I hope someone can provide code that is much more efficient. lapply finds the mean for a given group of columns, but I have not yet come up with a similar approach for replacing the missing observations. Thanks for any advice.

Here's an example dataset:

my.first.year <- 1980
my.last.year  <- 1982
years <- (my.last.year - my.first.year) + 1

x = read.table(text = "
 city county   state      a80    a81    a82    b80     b81   b82
  1      B       AA        2      20    200     4       8     12
  2      B       AA        4      NA    400     5       9     NA
  1      C       AA        6      60     NA    NA      10     14
  2      C       AA       NA      80    800     7      11     15    
", sep = "", header = TRUE, stringsAsFactors = FALSE)

(2 + 4 + 6 + 20 + 60 + 80 + 200 + 400 + 800) / 9
(4 + 5 + 7 + 8 + 9 + 10 + 11 + 12 + 14 + 15) / 10

my.means <- lapply( seq(4, ncol(x), years) , function(i) { mean(unlist(x[,i : (i+years-1) ]) , na.rm=TRUE) } )

x2 <- x

x2[,(3+years*0+1):(3+years*1)][is.na(x2[,(3+years*0+1):(3+years*1)])] = my.means[[1]]
x2[,(3+years*1+1):(3+years*2)][is.na(x2[,(3+years*1+1):(3+years*2)])] = my.means[[2]]


Here's the result:

#   city county state      a80      a81      a82 b80 b81  b82
# 1    1      B    AA   2.0000  20.0000 200.0000 4.0   8 12.0
# 2    2      B    AA   4.0000 174.6667 400.0000 5.0   9  9.5
# 3    1      C    AA   6.0000  60.0000 174.6667 9.5  10 14.0
# 4    2      C    AA 174.6667  80.0000 800.0000 7.0  11 15.0
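
The two hand-written replacement lines generalise to a loop over the first column of each group. This is only a sketch under the assumptions stated in the question: three leading id columns (the name n_id is made up here) and exactly years columns per group.

```r
# Example data from the question
x <- read.table(text = "
 city county state a80 a81 a82 b80 b81 b82
 1 B AA 2 20 200 4 8 12
 2 B AA 4 NA 400 5 9 NA
 1 C AA 6 60 NA NA 10 14
 2 C AA NA 80 800 7 11 15
", header = TRUE, stringsAsFactors = FALSE)

years <- 3   # columns per group, as in the question
n_id  <- 3   # assumed: number of leading id columns (city, county, state)

x2 <- x
for (first in seq(n_id + 1, ncol(x2), by = years)) {
  cols  <- first:(first + years - 1)        # column indices of this group
  block <- as.matrix(x2[, cols])            # numeric matrix for the group
  block[is.na(block)] <- mean(block, na.rm = TRUE)
  x2[, cols] <- block                       # write the filled group back
}
x2
```

Working on a matrix per group keeps the mean and the replacement in one place, so nothing has to be repeated by hand when more groups are added.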



Here's another solution using reshape from base R, an often-overlooked function with amazing power.

x2 = reshape(x, direction = 'long', varying = 4:9, sep = "")
x2[,c('a', 'b')] = apply(x2[,c('a', 'b')], 2, function(y){
  y[is.na(y)] = mean(y, na.rm = T)   # replace NAs with the column mean
  return(y)
})
x3 = reshape(x2, direction = 'wide', idvar = names(x2)[1:3], timevar = 'time', 
 sep = "")


This is how it works. First, we reshape the data into long format, where a and b become columns and the years become rows. Second, we replace the NAs in columns a and b with their respective means. Finally, we reshape the data back to wide format. reshape is a confusing function, but working through the examples on its help page will get you up to speed.
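
If the long format is hard to picture, it can help to look at the intermediate object before any NA is replaced. A small sketch using the question's data (the name long is chosen here for illustration):

```r
# Example data from the question
x <- read.table(text = "
 city county state a80 a81 a82 b80 b81 b82
 1 B AA 2 20 200 4 8 12
 2 B AA 4 NA 400 5 9 NA
 1 C AA 6 60 NA NA 10 14
 2 C AA NA 80 800 7 11 15
", header = TRUE, stringsAsFactors = FALSE)

# With sep = "", reshape splits names such as "a80" into "a" and 80
long <- reshape(x, direction = "long", varying = 4:9, sep = "")

names(long)          # id columns plus time, a, b and an id column added by reshape
nrow(long)           # 4 rows x 3 years = 12
unique(long$time)    # the year part taken from the column names
head(long, 4)
```

Each original row now appears once per year, which is why a single column mean of a or b covers the whole group.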


To restore the original column order, you can do

x3 = x3[, names(x)]

To reset the row names, you can do

rownames(x3) = 1:NROW(x3)




One answer, though perhaps not the simplest, uses the packages plyr and reshape2:

library(plyr)
library(reshape2)

First convert your data frame from "wide" to "long" format (one observation per row) and create a groups column:

mx <- melt(x, id.vars=c("city","county","state"))
mx$groups[mx$variable %in% c("a80","a81","a82")] <- 1
mx$groups[mx$variable %in% c("b80","b81","b82")] <- 2


The first lines of your data should now look like this:

  city county state variable value groups
1    1      B    AA      a80     2      1
2    2      B    AA      a80     4      1
3    1      C    AA      a80     6      1
4    2      C    AA      a80    NA      1
5    1      B    AA      a81    20      1
6    2      B    AA      a81    NA      1


Then you can use ddply to replace the missing values with the group means:

mx <- ddply(mx, .(groups), function(df) {df$value[is.na(df$value)] <- mean(df$value, na.rm=TRUE); return(df)})

Finally, use dcast to return the data to "wide" format:

x <- dcast(mx, city + county + state ~ variable)

Which gives:

  city county state      a80      a81      a82 b80 b81  b82
1    1      B    AA   2.0000  20.0000 200.0000 4.0   8 12.0
2    1      C    AA   6.0000  60.0000 174.6667 9.5  10 14.0
3    2      B    AA   4.0000 174.6667 400.0000 5.0   9  9.5
4    2      C    AA 174.6667  80.0000 800.0000 7.0  11 15.0
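
Listing every column name by hand to build groups gets tedious with many groups. Assuming the names always consist of a letter prefix followed by the year digits, as they do here, the group number can be derived from the names instead. A sketch in base R (prefix is a name made up for this example):

```r
# Variable names as they appear after melting the example data
variable <- c("a80", "a81", "a82", "b80", "b81", "b82")

# Strip the trailing digits to get the group prefix
prefix <- sub("[0-9]+$", "", variable)

# Number the groups in order of first appearance
groups <- match(prefix, unique(prefix))

data.frame(variable, prefix, groups)
```

The same two lines should work on mx$variable after as.character(), since melt() returns that column as a factor.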




I use your code and add one line with na.fill (even though I don't like your grouping by 3 columns).

Note: na.fill is from the zoo package. It was so handy that I thought it was in base R. Next time I will restart the session before posting here.

library(zoo)
ll <- lapply( seq(4, ncol(x), years) , 
        function(i) { 
          m <- mean(unlist(x[,i : (i+years-1) ]) , na.rm=TRUE) 
          na.fill(x[,i : (i+years-1) ],m)      ## here is the line I add 
        })
do.call(cbind, ll)

    a80      a81      a82 b80 b81  b82
[1,]   2.0000  20.0000 200.0000 4.0   8 12.0
[2,]   4.0000 174.6667 400.0000 5.0   9  9.5
[3,]   6.0000  60.0000 174.6667 9.5  10 14.0
[4,] 174.6667  80.0000 800.0000 7.0  11 15.0


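I would use something like this to select columns:

ll <- lapply(c('a','b'),function(i){
       ## columns of group i; the name `cols` is a stand-in for one lost from the original
       cols <- regmatches(colnames(x), regexpr(paste0('^', i, '[0-9]+$'), colnames(x)))
       m <- mean(unlist(x[,cols]) , na.rm=TRUE) 
       na.fill(x[, cols],m) 
     })
## assumed final step, inferred from the output shown: re-attach the id columns
cbind(x[, 1:3], do.call(cbind, ll))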

  city county state      a80      a81      a82 b80 b81  b82
1    1      B    AA   2.0000  20.0000 200.0000 4.0   8 12.0
2    2      B    AA   4.0000 174.6667 400.0000 5.0   9  9.5
3    1      C    AA   6.0000  60.0000 174.6667 9.5  10 14.0
4    2      C    AA 174.6667  80.0000 800.0000 7.0  11 15.0
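
If you would rather not depend on zoo, the same idea can be sketched in base R by splitting the value columns on their letter prefix. The helper fill_group_mean is hypothetical, and the sketch assumes the first three columns are ids and that every other column is named prefix plus digits.

```r
# Example data from the question
x <- read.table(text = "
 city county state a80 a81 a82 b80 b81 b82
 1 B AA 2 20 200 4 8 12
 2 B AA 4 NA 400 5 9 NA
 1 C AA 6 60 NA NA 10 14
 2 C AA NA 80 800 7 11 15
", header = TRUE, stringsAsFactors = FALSE)

value_cols <- names(x)[-(1:3)]               # assumed: first 3 columns are ids
prefix     <- sub("[0-9]+$", "", value_cols) # "a" or "b" for each value column

# Hypothetical helper: fill the NAs of one column group with the group mean
fill_group_mean <- function(block) {
  m <- mean(unlist(block), na.rm = TRUE)
  block[is.na(block)] <- m
  block
}

filled <- lapply(split(value_cols, prefix),
                 function(cols) fill_group_mean(x[cols]))

# unname() stops cbind from prefixing the column names with the list names
x2 <- cbind(x[1:3], do.call(cbind, unname(filled)))
x2 <- x2[names(x)]                            # keep the original column order
x2
```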




You are making things difficult for yourself by storing the data in wide format rather than long format. My take on this is to convert to long format using melt() from the reshape2 package. Using your data:

my.first.year <- 1980
my.last.year  <- 1982

x <- read.table(text = "
 city county   state      a80    a81    a82    b80     b81   b82
  1      B       AA        2      20    200     4       8     12
  2      B       AA        4      NA    400     5       9     NA
  1      C       AA        6      60     NA    NA      10     14
  2      C       AA       NA      80    800     7      11     15    
", sep = "", header = TRUE, stringsAsFactors = FALSE)


First we melt() the data, then do some manipulation of variable to get the group and the year:

library(reshape2)
xx <- melt(x, id.vars = c("city","county","state"))
## Add year and group variables by processing the `variable` column
xx <- transform(xx, year = as.numeric(sub("^[a-zA-Z]", "", variable)),
                group = regmatches(variable, regexpr("^[a-zA-Z]", variable)), 
                stringsAsFactors = FALSE)
## format start and end years as they are stored in the column names
start <- as.numeric(substring(my.first.year, first = 3))
end <- as.numeric(substring(my.last.year, first = 3))



start and end are versions of your first and last years without the century part. At this point xx looks like:

> head(xx)
  city county state variable value year group
1    1      B    AA      a80     2   80     a
2    2      B    AA      a80     4   80     a
3    1      C    AA      a80     6   80     a
4    2      C    AA      a80    NA   80     a
5    1      B    AA      a81    20   81     a
6    2      B    AA      a81    NA   81     a


Next, I use one of base R's split-apply-combine idioms, starting with split():

xxs <- split(xx, f = xx$group)


Then lapply() can apply a function to each subset, restricted to the rows whose year lies in start:end (inclusive). The function computes the mean of value over those rows with the NAs removed and returns it:

foo <- function(x, start, end) {
  take <- with(x, year >= start & year <= end)   ## rows within the year range
  xbar <- mean(x[take, "value"], na.rm = TRUE)   ## group mean, ignoring NAs
  xbar                                           ## return
}

lapply(xxs, foo, start = start, end = end)


This gives:

> lapply(xxs, foo, start = start, end = end)
$a
[1] 174.6667

$b
[1] 9.5
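
As a cross-check on these two means, tapply() computes a mean per group in a single call once the data are in long format. The sketch below builds a minimal long table in base R rather than with melt(), so the construction of xx here is an assumption and not the code used in this answer.

```r
# Example data from the question
x <- read.table(text = "
 city county state a80 a81 a82 b80 b81 b82
 1 B AA 2 20 200 4 8 12
 2 B AA 4 NA 400 5 9 NA
 1 C AA 6 60 NA NA 10 14
 2 C AA NA 80 800 7 11 15
", header = TRUE, stringsAsFactors = FALSE)

vals <- x[-(1:3)]                               # the value columns only

# Minimal long table: one row per cell of the value columns
xx <- data.frame(
  variable = rep(names(vals), each = nrow(x)),
  value    = unlist(vals, use.names = FALSE),
  stringsAsFactors = FALSE
)
xx$group <- sub("[0-9]+$", "", xx$variable)     # "a" or "b"

# One mean per group, ignoring NAs
means <- tapply(xx$value, xx$group, mean, na.rm = TRUE)
means
```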


As for a function to replace the NAs, a minor modification of foo() achieves this:

foor <- function(x, start, end) {
  take <- with(x, year >= start & year <= end)
  xbar <- mean(x[take, "value"], na.rm = TRUE)
  nas <- is.na(x[take, "value"]) ## which are NA?
  x[take, "value"][nas] <- xbar  ## replace NA with xbar
  x                              ## return
}


To get a data frame back, I wrap this in do.call(), which arranges a call to rbind() on the output from lapply():

xx2 <- do.call(rbind, lapply(xxs, foor, start = start, end = end))


which gives:

> head(xx2)
    city county state variable    value year group
a.1    1      B    AA      a80   2.0000   80     a
a.2    2      B    AA      a80   4.0000   80     a
a.3    1      C    AA      a80   6.0000   80     a
a.4    2      C    AA      a80 174.6667   80     a
a.5    1      B    AA      a81  20.0000   81     a
a.6    2      B    AA      a81 174.6667   81     a


If you need to revert to the original data format, then dcast() (also from reshape2) is your friend:

x2 <- dcast(xx2[, 1:5], city + county + state ~ variable)

> head(x)
  city county state a80 a81 a82 b80 b81 b82
1    1      B    AA   2  20 200   4   8  12
2    2      B    AA   4  NA 400   5   9  NA
3    1      C    AA   6  60  NA  NA  10  14
4    2      C    AA  NA  80 800   7  11  15
> head(x2)
  city county state      a80      a81      a82 b80 b81  b82
1    1      B    AA   2.0000  20.0000 200.0000 4.0   8 12.0
2    1      C    AA   6.0000  60.0000 174.6667 9.5  10 14.0
3    2      B    AA   4.0000 174.6667 400.0000 5.0   9  9.5
4    2      C    AA 174.6667  80.0000 800.0000 7.0  11 15.0




I could have ticked any of the answers, but I prefer Ramnat's answer because it is entirely in base R and seems very straightforward. However, when I tried to use his answer, I realized that I needed separate means for each of the many states. So, I changed his answer as follows:

my.first.year <- 1980
my.last.year  <- 1982
years <- (my.last.year - my.first.year) + 1

x = read.table(text = "
 city county   state      a80    a81    a82    b80     b81   b82
  1      B       AA        2      20    200     4       8     12
  2      B       AA        4      NA    400     5       9     NA
  1      C       AA        6      60     NA    NA      10     14
  2      C       AA       NA      80    800     7      11     15

  1      A       BB        1       2      1     2       2      2
  2      A       BB        2      NA      1     2       2     NA
  1      B       BB        1       1     NA    NA       2      2
  2      B       BB       NA       2      1     2       2     10
", sep = "", header = TRUE, stringsAsFactors = FALSE)

x2 = reshape(x, direction = 'long', varying = 4:9, sep = "")

x2 <- x2[order(x2$state, x2$time),]

x2[,c('a', 'b')] = apply(x2[,c('a', 'b')], 2, function(z) {
      sapply(split(z, x2$state), 
      function(y) {  y[is.na(y)] = mean(y, na.rm = T)  
      return(y)   }) 
})

x3 <- reshape(x2, direction = 'wide', idvar = names(x2)[1:3], timevar = 'time', 
 sep = "")



This code works, although for some reason I needed to order x2 by state. I don't quite understand the return statements. If I find that the code does not work with future datasets, I will edit this post to resolve the issue.
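
One possible way to avoid the ordering step is ave(), which applies a function within groups and puts each result back in the position of the original element. This is a sketch rather than a tested replacement for the code above; the helper fill_mean is a made-up name, and only the long-format step is shown.

```r
# Example data with two states
x <- read.table(text = "
 city county state a80 a81 a82 b80 b81 b82
 1 B AA 2 20 200 4 8 12
 2 B AA 4 NA 400 5 9 NA
 1 C AA 6 60 NA NA 10 14
 2 C AA NA 80 800 7 11 15
 1 A BB 1 2 1 2 2 2
 2 A BB 2 NA 1 2 2 NA
 1 B BB 1 1 NA NA 2 2
 2 B BB NA 2 1 2 2 10
", header = TRUE, stringsAsFactors = FALSE)

x2 <- reshape(x, direction = "long", varying = 4:9, sep = "")

# Hypothetical helper: replace the NAs in y with the mean of y
fill_mean <- function(y) replace(y, is.na(y), mean(y, na.rm = TRUE))

# ave() works state by state and keeps the row order, so no sorting is needed
for (v in c("a", "b")) {
  x2[[v]] <- ave(as.numeric(x2[[v]]), x2$state, FUN = fill_mean)
}

x2[x2$state == "BB", ]
```

Because ave() writes each group's values back to the rows they came from, the result does not depend on how x2 is sorted or on the states having equal numbers of rows.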



