Using lapply to list percentage of null variables in each column in R

I was provided with a large CSV which contains 115 columns and 1000 rows. Columns have a lot of data, some are character based, some are integers, etc. However, the data has many null variables of different types (NA, -999, NULL, etc.).

What I want to do is write a script that will generate a list of columns where over 30% of the data in the column is NULL of some type.

To do this, I wrote a script to give me zero percent (as decimal) for one column. This script works great for me.

length(which(indata$ObservationYear == "" | is.na(indata$ObservationYear) |
indata$ObservationYear == "NA" | indata$ObservationYear == "-999" |
indata$ObservationYear == "0"))/nrow(indata)

      

I want to write a script to do this for all columns. I believe I need to use the lapply function.

I tried to do it here, however, I cannot get this script to work at all:

Null_Counter <- lapply(indata, 2, length(x),
                   length(which(indata == "" | is.na(indata) | indata == "NA" | indata == "-999" | indata == "0")))
                   names(indata(which(0.3>=Null_Counter / nrow(indata))))

      

I am getting the following errors:

Error in match.fun(FUN) : '2' is not a function, character or symbol

      

and

Error: could not find function "indata"

      

Ideally, I want it to give me a vectorized list of all column names where the percentage of all null variables (NA, -999, 0, NULL) is greater than 30%.

Can anyone please help?

+3


source to share


2 answers


I believe you want to use apply and not lapply, which apply a function to a list. Try the following:



Null_Counter <- apply(indata, 2, function(x) length(which(x == "" | is.na(x) | x == "NA" | x == "-999" | x == "0"))/length(x))
Null_Name <- colnames(indata)[Null_Counter >= 0.3]

      

+1


source


Here's another way to do it in data.table

:

#first, make a reproducible example:
library(data.table)
#make it so that all columns have ~30% "NA" as you define it
dt<-as.data.table(replicate(
  115,sample(c(1:100,"",NA,"NA",-999,0),size=1000,replace=T,
             prob=c(rep(.007,100),rep(.06,5)))))

      

Now figure out what the trouble is:

x<-as.matrix(dt[,lapply(.SD,function(x){
  mean(is.na(x) | x %in% c("","NA","-999","0"))})])
colnames(x)[x>.3]

      



Probably a more concise way of doing this, but that eludes me.

If you are trying to remove these columns, this can be customized:

dt[,!colnames(x)[x>.3],with=F]

      

0


source







All Articles