Replacing missing value in R with mean
I have a data frame whose columns contain missing values, and I would like to replace each missing value with the average of the values in the cells directly above and below it.
df1<-c(2,2,NA,10, 20, NA,3)
if(df1[i]== NA){
df1[i]= mean(df1[i+1],df1[i-1])
}
However, I am getting this error:
Error in if (df1[i] == NA) { : missing value where TRUE/FALSE needed
In addition: Warning message:
In if (df1[i] == NA) { :
the condition has length > 1 and only the first element will be used
Any advice would be appreciated to resolve this issue.
If you are sure that you have no consecutive NA values, and the first and last elements are never NA, then you can do
df1<-c(2,2,NA,10, 20, NA,3)
idx<-which(is.na(df1))
df1[idx] <- (df1[idx-1] + df1[idx+1])/2
df1
# [1] 2.0 2.0 6.0 10.0 20.0 11.5 3.0
It should be more efficient than a loop.
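The assumptions above (no NA at either end, no consecutive NAs) can be checked before filling. A minimal sketch in base R; the helper name fill_isolated_na is hypothetical and not part of the original answer:

```r
# Hypothetical helper: fill isolated NAs with the mean of their two neighbours.
fill_isolated_na <- function(x) {
  idx <- which(is.na(x))
  if (length(idx) == 0) return(x)  # nothing to fill
  # Guard the stated assumptions: no NA at either end, no consecutive NAs.
  stopifnot(
    min(idx) > 1,
    max(idx) < length(x),
    all(diff(idx) > 1)
  )
  x[idx] <- (x[idx - 1] + x[idx + 1]) / 2
  x
}

df1 <- c(2, 2, NA, 10, 20, NA, 3)
filled <- fill_isolated_na(df1)
filled
```

If one of the checks fails, stopifnot() raises an error instead of silently producing wrong values or a length mismatch.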
Using lag and lead from dplyr:
library(dplyr)
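# is.na(lag(df1)) marks the element after each NA; is.na(lead(df1)) marks the element before it.
# A numeric default keeps the padded end from counting as NA (recent dplyr rejects default = "").
df1[is.na(df1)] <- (df1[is.na(lag(df1, default = 0))] +
                    df1[is.na(lead(df1, default = 0))]) / 2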
This should be much faster than the for-loop version.
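A possibly more readable variant of the same idea, as a sketch only: compute the neighbour mean at every position and let coalesce() use it only where the original value is NA. The variable names neighbour_mean and filled are illustrative. An NA at either end, or next to another NA, simply stays NA here.

```r
library(dplyr)

df1 <- c(2, 2, NA, 10, 20, NA, 3)

# Mean of the previous and next element at every position
# (NA wherever either neighbour is missing or does not exist).
neighbour_mean <- (dplyr::lag(df1) + dplyr::lead(df1)) / 2

# Keep the original value where present, otherwise fall back to the neighbour mean.
filled <- coalesce(df1, neighbour_mean)
filled
```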
You can use na.approx() from package zoo to replace NA with interpolated values:
library(zoo)
na.approx(df1)
# [1] 2.0 2.0 6.0 10.0 20.0 11.5 3.0
As @G. Grothendieck mentioned, this will also fill in the NAs if there are several in a row. Also, if there may be NAs at the ends, then adding the argument na.rm = FALSE will keep them, or adding rule = 2 will replace them with the first or last non-NA value.
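To illustrate how na.approx() treats consecutive NAs and NAs at the ends, here is a small sketch on a made-up vector x (not from the question):

```r
library(zoo)

# Illustrative vector: NA at both ends and two consecutive NAs in the middle.
x <- c(NA, 2, NA, NA, 8, NA)

# Default: interior NAs are interpolated, leading/trailing NAs are dropped.
interior_only <- as.numeric(na.approx(x))              # 2 4 6 8

# na.rm = FALSE: the end NAs are kept, so the length is unchanged.
keep_ends <- as.numeric(na.approx(x, na.rm = FALSE))   # NA 2 4 6 8 NA

# rule = 2: the end NAs are replaced by the nearest non-NA value.
extend_ends <- as.numeric(na.approx(x, rule = 2))      # 2 2 4 6 8 8
```

Note that the default call returns a shorter vector when the ends are NA, which matters if the result is assigned back into a data frame column.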
To check for NA, use is.na(). Create a loop, and give mean() a vector as its argument; otherwise it will only see the first value. This should work if you don't have consecutive NAs and the first and last entries are not NA:
df1 <- c(2, 2, NA, 10, 20, NA, 3)
for (i in 2:(length(df1) - 1)) {
  if (is.na(df1[i])) {
    df1[i] <- mean(c(df1[i + 1], df1[i - 1]))
  }
}
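The two pitfalls behind the original error can be seen in isolation. A short base R sketch; the variable names are only for illustration:

```r
# Comparing anything with NA yields NA, so if() has no TRUE/FALSE to act on.
cmp <- NA == NA        # NA, not TRUE
chk <- is.na(NA)       # TRUE: is.na() is the way to test for missingness

# mean() averages its first argument only; a second positional argument
# is taken as `trim`, not as more data.
wrong <- mean(10, 2)     # 10
right <- mean(c(10, 2))  # 6
```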