Replacing outliers with NA in multiple columns of a data frame using R

I am trying to replace outliers in a large dataset (over 3000 columns and 250,000 rows) with NA. I want to replace observations that are more than three standard deviations above or below the column mean with NA. I managed to do it one column at a time:

height = ifelse(abs(height-mean(height,na.rm=TRUE)) < 3*sd(height,na.rm=TRUE),height,NA)

      

However, I would like to create a function to do this on a subset of the columns. To do this, I created a vector with the names of the columns whose outliers I want to replace. But it doesn't work. Can anyone help me please?

An example of my dataset would be:

name = factor(c("A","B","C","D","E","F","G","H","H"))
height = c(120,NA,150,170,NA,146,132,210,NA)
age = c(10,20,0,30,40,50,60,NA,130)
mark = c(100,0.5,100,50,90,100,NA,50,210)
data = data.frame(name=name,mark=mark,age=age,height=height)
data

      

This was my last try:

d1=names(data)
list = c("age","height","mark")
ntraits=length(list)
nrows=dim(data)[1]
for(i in 1:ntraits){
a=list[i]
b=which(d1==a)
d2=data[,b]
for (j in 1:nrows){
                  d2[j] = ifelse(abs(d2[j]-mean(d2,na.rm=TRUE)) < 3*sd(d2,na.rm=TRUE),d2[j],NA)
                  }
}

      

Sorry, I'm still learning to program in R. Thanks a lot. Greetings.



1 answer


I would look at using apply and scale. scale() skips NA values when it computes the column means and standard deviations, so the missing values do not get in the way. The following code should work:



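 # z-scores for a subset of the columns (scale() ignores NAs)
 cols <- c("age","height","mark")
 data.scale <- scale(data[ ,cols ])

 # flag outliers: more than 3 standard deviations from the column mean
 outlier <- !is.na(data.scale) & abs(data.scale) > 3

 # set only the outliers to NA, keeping the original values elsewhere
 data[ ,cols ][ outlier ] <- NA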
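
Since the question asks for a function that works on a chosen subset of columns, here is a minimal sketch of one built on lapply. The names `replace_outliers`, `k` and the `demo` data frame are made up for illustration and are not part of the question or the code above; the threshold of three standard deviations is taken from the question.

```r
# Hypothetical helper: for each named column, replace values that lie more
# than `k` standard deviations from that column's mean with NA.
replace_outliers <- function(df, cols, k = 3) {
  df[cols] <- lapply(df[cols], function(x) {
    m <- mean(x, na.rm = TRUE)
    s <- sd(x, na.rm = TRUE)
    # which() drops NA comparisons, so existing NAs are left alone
    x[which(abs(x - m) > k * s)] <- NA
    x
  })
  df
}

# Made-up data with one obvious outlier in column x
demo <- data.frame(id = 1:12, x = c(rep(10, 11), 1000))
cleaned <- replace_outliers(demo, "x")
```

On the data from the question it would be called as `data <- replace_outliers(data, c("age","height","mark"))`. Note that with so few non-missing values per column (eight at most), no single value can be more than three standard deviations from the mean, so the sample comes back unchanged; the effect only shows on larger columns.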

      
