Using lapply to list percentage of null variables in each column in R
I was provided with a large CSV which contains 115 columns and 1000 rows. Columns have a lot of data, some are character based, some are integers, etc. However, the data has many null variables of different types (NA, -999, NULL, etc.).
What I want to do is write a script that will generate a list of columns where over 30% of the data in the column is NULL of some type.
To do this, I wrote a script to give me zero percent (as decimal) for one column. This script works great for me.
length(which(indata$ObservationYear == "" | is.na(indata$ObservationYear) |
indata$ObservationYear == "NA" | indata$ObservationYear == "-999" |
indata$ObservationYear == "0"))/nrow(indata)
I want to write a script to do this for all columns. I believe I need to use the lapply function.
I tried to do it here, however, I cannot get this script to work at all:
Null_Counter <- lapply(indata, 2, length(x),
length(which(indata == "" | is.na(indata) | indata == "NA" | indata == "-999" | indata == "0")))
names(indata(which(0.3>=Null_Counter / nrow(indata))))
I am getting the following errors:
Error in match.fun(FUN) : '2' is not a function, character or symbol
and
Error: could not find function "indata"
Ideally, I want it to give me a vectorized list of all column names where the percentage of all null variables (NA, -999, 0, NULL) is greater than 30%.
Can anyone please help?
source to share
Here's another way to do it in data.table
:
#first, make a reproducible example:
library(data.table)
#make it so that all columns have ~30% "NA" as you define it
dt<-as.data.table(replicate(
115,sample(c(1:100,"",NA,"NA",-999,0),size=1000,replace=T,
prob=c(rep(.007,100),rep(.06,5)))))
Now figure out what the trouble is:
x<-as.matrix(dt[,lapply(.SD,function(x){
mean(is.na(x) | x %in% c("","NA","-999","0"))})])
colnames(x)[x>.3]
Probably a more concise way of doing this, but that eludes me.
If you are trying to remove these columns, this can be customized:
dt[,!colnames(x)[x>.3],with=F]
source to share