Replace NA with the average of the column groups
I want to find means for all values in column groups. A column group can contain missing values, and I want to replace the missing values in a column group with the mean for that column group. In my case, the number of columns in a group is constant (years).
Below is code that does this. However, I hope someone can provide code that is much more efficient. lapply finds the mean for a given group of columns, but I have not yet come up with a similar approach for replacing missing observations. Thanks for any advice.
Here's an example dataset:
my.first.year <- 1980
my.last.year <- 1982
years <- (my.last.year - my.first.year) + 1
x = read.table(text = "
city county state a80 a81 a82 b80 b81 b82
1 B AA 2 20 200 4 8 12
2 B AA 4 NA 400 5 9 NA
1 C AA 6 60 NA NA 10 14
2 C AA NA 80 800 7 11 15
", sep = "", header = TRUE, stringsAsFactors = FALSE)
(2 + 4 + 6 + 20 + 60 + 80 + 200 + 400 + 800) / 9
(4 + 5 + 7 + 8 + 9 + 10 + 11 + 12 + 14 + 15) / 10
my.means <- lapply( seq(4, ncol(x), years) , function(i) { mean(unlist(x[,i : (i+years-1) ]) , na.rm=TRUE) } )
my.means
x2 <- x
x2[,(3+years*0+1):(3+years*1)][is.na(x2[,(3+years*0+1):(3+years*1)])] = my.means[[1]]
x2[,(3+years*1+1):(3+years*2)][is.na(x2[,(3+years*1+1):(3+years*2)])] = my.means[[2]]
Here's the result:
# city county state a80 a81 a82 b80 b81 b82
# 1 1 B AA 2.0000 20.0000 200.0000 4.0 8 12.0
# 2 2 B AA 4.0000 174.6667 400.0000 5.0 9 9.5
# 3 1 C AA 6.0000 60.0000 174.6667 9.5 10 14.0
# 4 2 C AA 174.6667 80.0000 800.0000 7.0 11 15.0
Here's another solution using reshape from base R, an often-forgotten function with amazing power.
x2 = reshape(x, direction = 'long', varying = 4:9, sep = "")
x2[,c('a', 'b')] = apply(x2[,c('a', 'b')], 2, function(y){
y[is.na(y)] = mean(y, na.rm = T)
return(y)
})
x3 = reshape(x2, direction = 'wide', idvar = names(x2)[1:3], timevar = 'time',
sep = "")
This is how it works. First, we reshape the data into long format, where a and b become columns and years become rows. Second, we replace the NAs in columns a and b with their respective means. Finally, we convert the data back to wide format. reshape is a confusing function, but working through the examples on its help page will get you up to speed.
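The three steps above can be seen on a smaller example. This is only a sketch on made-up data: the data frame wide, its column unit, and the values are assumptions for illustration and are not part of the question.

```r
# Toy data: two column groups (a, b) observed in years 80 and 81
wide <- data.frame(unit = 1:2,
                   a80 = c(1, NA), a81 = c(3, 5),
                   b80 = c(10, 20), b81 = c(NA, 40))

# Step 1: wide -> long; sep = "" splits names such as "a80" into "a" and 80
long <- reshape(wide, direction = "long", varying = 2:5, sep = "")

# Step 2: replace NAs in each group column with that column's mean
for (v in c("a", "b")) {
  long[[v]][is.na(long[[v]])] <- mean(long[[v]], na.rm = TRUE)
}

# Step 3: long -> wide, then keep only the original columns in their order
wide2 <- reshape(long, direction = "wide", idvar = "unit",
                 timevar = "time", sep = "")[, names(wide)]
print(wide2)
```

Printing long after step 1 is a quick way to see what reshape produced before trusting the round trip.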
EDIT
To change the order of the columns, you can do
x3[,names(x)]
To reset the row names, you can do
rownames(x3) = 1:NROW(x3)
One answer, though perhaps not the easiest one, uses the packages plyr and reshape2:
library(reshape2)
library(plyr)
First convert your data frame from "wide" to "long" format (one observation per row) and create a groups column:
mx <- melt(x, id.vars=c("city","county","state"))
mx$groups[mx$variable %in% c("a80","a81","a82")] <- 1
mx$groups[mx$variable %in% c("b80","b81","b82")] <- 2
head(mx)
The first lines of your data should now look like this:
city county state variable value groups
1 1 B AA a80 2 1
2 2 B AA a80 4 1
3 1 C AA a80 6 1
4 2 C AA a80 NA 1
5 1 B AA a81 20 1
6 2 B AA a81 NA 1
Then you can use ddply to replace the missing values with the group means:
mx <- ddply(mx, .(groups), function(df) {df$value[is.na(df$value)] <- mean(df$value, na.rm=TRUE); return(df)})
Finally, use dcast to return the data to "wide" format:
x <- dcast(mx, city + county + state ~ variable)
x
Which gives:
city county state a80 a81 a82 b80 b81 b82
1 1 B AA 2.0000 20.0000 200.0000 4.0 8 12.0
2 1 C AA 6.0000 60.0000 174.6667 9.5 10 14.0
3 2 B AA 4.0000 174.6667 400.0000 5.0 9 9.5
4 2 C AA 174.6667 80.0000 800.0000 7.0 11 15.0
I use your code and add one line with na.fill (even though I don't like your grouping by 3 columns).
EDIT
na.fill is from the zoo package. It is so handy that I thought it was in base R. Next time I will restart my session before posting here.
library(zoo) ## provides na.fill
ll <- lapply( seq(4, ncol(x), years) ,
function(i) {
m <- mean(unlist(x[,i : (i+years-1) ]) , na.rm=TRUE)
na.fill(x[,i : (i+years-1) ],m) ## here the line I add
}
)
do.call(cbind,ll)
a80 a81 a82 b80 b81 b82
[1,] 2.0000 20.0000 200.0000 4.0 8 12.0
[2,] 4.0000 174.6667 400.0000 5.0 9 9.5
[3,] 6.0000 60.0000 174.6667 9.5 10 14.0
[4,] 174.6667 80.0000 800.0000 7.0 11 15.0
I would use something like this to select columns:
ll <- lapply(c('a','b'),function(i){
cols.group <- regmatches(colnames(x),
regexpr(paste(i,"[0-9]+",sep=''),colnames(x)))
m <- mean(unlist(x[,cols.group]) , na.rm=TRUE)
na.fill(x[,cols.group ],m)
})
do.call(cbind,ll)
cbind(x[,!grepl("(a|b)[0-9]+",colnames(x))],do.call(cbind,ll))
city county state a80 a81 a82 b80 b81 b82
1 1 B AA 2.0000 20.0000 200.0000 4.0 8 12.0
2 2 B AA 4.0000 174.6667 400.0000 5.0 9 9.5
3 1 C AA 6.0000 60.0000 174.6667 9.5 10 14.0
4 2 C AA 174.6667 80.0000 800.0000 7.0 11 15.0
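The idea of selecting columns by name rather than by position can also be sketched without zoo. The following is only an illustration in base R, assuming every value column is named as a single lowercase letter followed by digits; value_cols, groups and block are hypothetical names introduced here.

```r
x <- read.table(text = "
city county state a80 a81 a82 b80 b81 b82
1 B AA 2 20 200 4 8 12
2 B AA 4 NA 400 5 9 NA
1 C AA 6 60 NA NA 10 14
2 C AA NA 80 800 7 11 15
", header = TRUE, stringsAsFactors = FALSE)

# Value columns are those named letter + digits, e.g. a80, b82
value_cols <- grep("^[a-z][0-9]+$", names(x), value = TRUE)

# Group the column names by their prefix with the digits stripped
groups <- split(value_cols, sub("[0-9]+$", "", value_cols))

x2 <- x
for (cols in groups) {
  block <- as.matrix(x2[, cols])                    # all values in one group
  block[is.na(block)] <- mean(block, na.rm = TRUE)  # fill with the group mean
  x2[, cols] <- block
}
x2
```

Because the groups come from the column names, this does not depend on each group having the same number of columns.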
You are making things difficult for yourself by storing the data in wide format rather than long format. My take on this is to convert to long format using melt() from the reshape2 package. Using your data:
my.first.year <- 1980
my.last.year <- 1982
x <- read.table(text = "
city county state a80 a81 a82 b80 b81 b82
1 B AA 2 20 200 4 8 12
2 B AA 4 NA 400 5 9 NA
1 C AA 6 60 NA NA 10 14
2 C AA NA 80 800 7 11 15
", sep = "", header = TRUE, stringsAsFactors = FALSE)
First we melt() x and do some manipulation of variable to get the group and the year:
require(reshape2)
xx <- melt(x, id.vars = c("city","county","state"))
## Add year and group variables by processing the `variable` column
xx <- transform(xx, year = as.numeric(sub("^[a-zA-Z]", "", variable)),
group = regmatches(variable, regexpr("^[a-zA-Z]", variable)),
stringsAsFactors = FALSE)
## format start and end years as per way stored in column names
start <- as.numeric(substring(my.first.year, first = 3))
end <- as.numeric(substring(my.last.year, first = 3))
start and end are versions of your first and last years without the century part. At this point xx looks like
> head(xx)
city county state variable value year group
1 1 B AA a80 2 80 a
2 2 B AA a80 4 80 a
3 1 C AA a80 6 80 a
4 2 C AA a80 NA 80 a
5 1 B AA a81 20 81 a
6 2 B AA a81 NA 81 a
Next, I use one of the basic split-apply-combine idioms and split() xx on group:
xxs <- split(xx, f = xx$group)
Then lapply() can apply a function to each group, restricted to the rows whose year lies in the range start to end inclusive. The function computes the mean of value over that subset, removing NAs, and returns it.
foo <- function(x, start, end) {
take <- with(x, year >= start & year <= end)
xbar <- mean(x[take, "value"], na.rm = TRUE)
xbar
}
lapply(xxs, foo, start = start, end = end)
This gives:
> lapply(xxs, foo, start = start, end = end)
$a
[1] 174.6667
$b
[1] 9.5
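If only the group means are needed, tapply() from base R can stand in for split() plus lapply(). This is a minimal sketch, not the approach above: it builds the long data with stack() instead of melt(), it ignores the start and end year filter, and the names wide, long and group_means are made up for the example.

```r
# Only the value columns of the question's data
wide <- data.frame(a80 = c(2, 4, 6, NA),  a81 = c(20, NA, 60, 80),
                   a82 = c(200, 400, NA, 800),
                   b80 = c(4, 5, NA, 7),  b81 = c(8, 9, 10, 11),
                   b82 = c(12, NA, 14, 15))

# stack() gives a long data frame with columns `values` and `ind`
long <- stack(wide)

# The group is the column name with the trailing digits removed
long$group <- sub("[0-9]+$", "", as.character(long$ind))

# One mean per group, ignoring NAs
group_means <- tapply(long$values, long$group, mean, na.rm = TRUE)
group_means
```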
As for a function to replace the NAs, a minor modification of foo() achieves this:
foor <- function(x, start, end) {
take <- with(x, year >= start & year <= end)
xbar <- mean(x[take, "value"], na.rm = TRUE)
nas <- is.na(x[take, "value"]) ## which are NA?
x[take, "value"][nas] <- xbar ## replace NA with xbar
x ## return
}
To get a data frame back, I wrap this in do.call(), which calls rbind() on the output from lapply():
xx2 <- do.call(rbind, lapply(xxs, foor, start = start, end = end))
which gives:
> head(xx2)
city county state variable value year group
a.1 1 B AA a80 2.0000 80 a
a.2 2 B AA a80 4.0000 80 a
a.3 1 C AA a80 6.0000 80 a
a.4 2 C AA a80 174.6667 80 a
a.5 1 B AA a81 20.0000 81 a
a.6 2 B AA a81 174.6667 81 a
If you need to revert to the original data format, then dcast() (also from reshape2) is your friend:
x2 <- dcast(xx2[, 1:5], city + county + state ~ variable)
> head(x)
city county state a80 a81 a82 b80 b81 b82
1 1 B AA 2 20 200 4 8 12
2 2 B AA 4 NA 400 5 9 NA
3 1 C AA 6 60 NA NA 10 14
4 2 C AA NA 80 800 7 11 15
> head(x2)
city county state a80 a81 a82 b80 b81 b82
1 1 B AA 2.0000 20.0000 200.0000 4.0 8 12.0
2 1 C AA 6.0000 60.0000 174.6667 9.5 10 14.0
3 2 B AA 4.0000 174.6667 400.0000 5.0 9 9.5
4 2 C AA 174.6667 80.0000 800.0000 7.0 11 15.0
I could have ticked any of the answers, but I prefer Ramnat's answer because it is entirely in base R and seems very straightforward. However, when I tried to use his answer, I realized that I needed separate means for each of the many states. So, I changed his answer as follows:
my.first.year <- 1980
my.last.year <- 1982
years <- (my.last.year - my.first.year) + 1
x = read.table(text = "
city county state a80 a81 a82 b80 b81 b82
1 B AA 2 20 200 4 8 12
2 B AA 4 NA 400 5 9 NA
1 C AA 6 60 NA NA 10 14
2 C AA NA 80 800 7 11 15
1 A BB 1 2 1 2 2 2
2 A BB 2 NA 1 2 2 NA
1 B BB 1 1 NA NA 2 2
2 B BB NA 2 1 2 2 10
", sep = "", header = TRUE, stringsAsFactors = FALSE)
x
x2 = reshape(x, direction = 'long', varying = 4:9, sep = "")
x2 <- x2[order(x2$state, x2$time),]
x2[,c('a', 'b')] = apply(x2[,c('a', 'b')], 2, function(z) {
sapply(split(z, x2$state),
function(y) { y[is.na(y)] = mean(y, na.rm = T)
return(y) })
})
x2
x3 <- reshape(x2, direction = 'wide', idvar = names(x2)[1:3], timevar = 'time',
sep = "")
x3[,names(x)]
This code works, although for some reason I needed to order x2 by state. I don't quite understand the return statements. If I find that the code does not work with future datasets, I will edit this post to resolve the issue.
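A possible explanation for the ordering, offered as a guess: sapply() over split() returns one column per state, and apply() flattens those columns state by state, so the values only line up with the rows if the rows are already sorted by state and every state has the same number of rows. A sketch that avoids this by using ave() from base R, which returns the values in their original row order, is below. The helper fill_mean is a name made up for this example.

```r
x <- read.table(text = "
city county state a80 a81 a82 b80 b81 b82
1 B AA 2 20 200 4 8 12
2 B AA 4 NA 400 5 9 NA
1 C AA 6 60 NA NA 10 14
2 C AA NA 80 800 7 11 15
1 A BB 1 2 1 2 2 2
2 A BB 2 NA 1 2 2 NA
1 B BB 1 1 NA NA 2 2
2 B BB NA 2 1 2 2 10
", header = TRUE, stringsAsFactors = FALSE)

# Replace the NAs in a vector with the mean of its non-missing values
fill_mean <- function(v) {
  v[is.na(v)] <- mean(v, na.rm = TRUE)
  v
}

x2 <- reshape(x, direction = "long", varying = 4:9, sep = "")

# ave() applies fill_mean within each state and keeps the row order,
# so no sorting by state is needed
for (v in c("a", "b")) {
  x2[[v]] <- ave(as.numeric(x2[[v]]), x2$state, FUN = fill_mean)
}

x3 <- reshape(x2, direction = "wide", idvar = names(x)[1:3],
              timevar = "time", sep = "")[, names(x)]
rownames(x3) <- NULL
x3
```

This should also cope with states that have different numbers of rows, which the sapply() version would not.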