Replacing missing value in R with mean

Question

I have a data frame whose columns contain missing values, and I would like to replace each missing value with the average of the values in the cells directly above and below it.

 df1<-c(2,2,NA,10, 20, NA,3)
 if(df1[i]== NA){
 df1[i]= mean(df1[i+1],df1[i-1])
}

However, I am getting this error:

  Error in if (df1[i] == NA) { : missing value where TRUE/FALSE needed
  In addition: Warning message:
  In if (df1[i] == NA) { :
  the condition has length > 1 and only the first element will be used

Any advice would be appreciated to resolve this issue.

+3

r if-statement average missing-data

NickWilson, June 26, 2015 at 18:48

4 answers

Using lag and lead from dplyr:

library(dplyr)

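# default = 0 pads the shifted vector with a non-NA value so the ends are never
# flagged; recent dplyr versions reject "" because it cannot be cast to numeric.
df1[is.na(df1)] <- (df1[is.na(lag(df1, default = 0))] +
                    df1[is.na(lead(df1, default = 0))]) / 2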

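Because it is vectorized, this should be faster than the for-loop version on long vectors. Like the loop, it assumes there are no consecutive NAs and that the first and last entries are not NA.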

+2

jeremycg, June 26, 2015 at 19:14

You can use na.approx() from package zoo to replace NA with interpolated values:

library(zoo)
> na.approx(df1)
# [1]  2.0  2.0  6.0 10.0 20.0 11.5  3.0

As @G. Grothendieck mentioned, this will also fill in the NAs when there are several NAs in a row. Also, if there may be NAs at the ends, then adding the argument na.rm = FALSE will keep them, or adding rule = 2 will replace them with the first or last non-NA value.
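The effect of rule = 2 at the ends can be sketched without installing zoo. The example below uses base R's approx() as a stand-in, since it accepts a rule argument with the same meaning; the vector x and the variable names are made up for illustration, and this is not a replacement for na.approx() itself.

```r
# Illustrative vector with NAs at both ends and two NAs in a row.
x <- c(NA, 2, NA, NA, 8, NA)
ok <- !is.na(x)

# rule = 1 (the default): positions outside the observed range stay NA.
interior_only <- approx(which(ok), x[ok], xout = seq_along(x))$y
print(interior_only)
# NA 2 4 6 8 NA

# rule = 2: positions at the ends take the nearest observed value.
ends_filled <- approx(which(ok), x[ok], xout = seq_along(x), rule = 2)$y
print(ends_filled)
# 2 2 4 6 8 8
```

The consecutive NAs at positions 3 and 4 are filled by linear interpolation between 2 and 8 in both cases; only the treatment of the two end positions differs.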

+2

Steven beaupré, June 26, 2015 at 22:16

To check for NA, use is.na() rather than == NA. Wrap the test in a loop, and pass mean() a single vector as its argument, otherwise it will only use the first value. This should work if you don't have consecutive NAs and the first and last entries are not NA:

df1<-c(2,2,NA,10, 20, NA,3)
for(i in 2:(length(df1)-1)){
  if(is.na(df1[i])){
     df1[i]= mean(c(df1[i+1],df1[i-1]))
  }
}
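The two pitfalls in the question's code can be seen in isolation. A minimal base R sketch (the variable names are illustrative only):

```r
# Comparing anything with NA yields NA, not TRUE or FALSE,
# which is why if() reports "missing value where TRUE/FALSE needed".
cmp <- NA == NA
print(cmp)            # NA

# is.na() is the reliable test for a missing value.
x <- c(2, NA, 10)
na_flags <- is.na(x)
print(na_flags)       # FALSE TRUE FALSE

# mean() averages only its first argument; a second positional argument
# is matched to `trim`, so the second number is not part of the average.
wrong <- mean(10, 2)
right <- mean(c(10, 2))
print(wrong)          # 10
print(right)          # 6
```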

+1

mts, June 26, 2015 at 18:52

MrFlick · Accepted Answer · 2015-06-26T19:10:28+0000

If you are sure that you have no consecutive NA values, and the first and last elements are never NA, then you can do

df1<-c(2,2,NA,10, 20, NA,3)
idx<-which(is.na(df1))
df1[idx] <- (df1[idx-1] + df1[idx+1])/2
df1
# [1]  2.0  2.0  6.0 10.0 20.0 11.5  3.0

It should be more efficient than a loop.
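If those assumptions cannot be guaranteed, the same vectorized idea can be guarded so that it only touches NAs that have two non-missing neighbours. The following is a sketch, not part of the original answer; fill_isolated_na is a hypothetical helper name, and any NA at an end or inside a run of NAs is simply left as NA.

```r
# Hypothetical helper: fill only isolated, interior NAs with the mean of
# their two neighbours; every other NA is left untouched.
fill_isolated_na <- function(x) {
  n <- length(x)
  idx <- which(is.na(x))
  # Drop NAs in the first or last position (a neighbour is missing on one side).
  idx <- idx[idx > 1 & idx < n]
  # Drop NAs whose left or right neighbour is itself NA (consecutive runs).
  idx <- idx[!is.na(x[idx - 1]) & !is.na(x[idx + 1])]
  x[idx] <- (x[idx - 1] + x[idx + 1]) / 2
  x
}

print(fill_isolated_na(c(2, 2, NA, 10, 20, NA, 3)))
# 2.0 2.0 6.0 10.0 20.0 11.5 3.0

print(fill_isolated_na(c(NA, 1, NA, 3, NA, NA, 7)))
# NA 1 2 3 NA NA 7
```

The remaining NAs can then be handled separately, for example with an interpolation function, depending on what makes sense for the data.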
