How to find and remove columns containing more than k consecutive zeros in an R data.frame?
I have a huge data.frame with about 200 variables, each represented by a column. Unfortunately, the data comes from a poorly formatted data dump (and therefore cannot be changed) that represents both missing values and zeros as 0. The data was observed every 5 minutes for a month, and a full day of 0 can only reasonably be interpreted as a day when the meter was not functioning, which leads to the conclusion that those 0 values are in fact NA.
I want to find (and delete) columns that have at least 288 consecutive 0s at any point. Or, more generally, how can I remove columns containing >= k consecutive 0s from a data.frame?
I am relatively new to R and any help would be greatly appreciated. Thanks!
EDIT: Here's a reproducible example. Given k = 4, I would like to delete columns A and B (but not C, as its 0s are not consecutive).
df <- data.frame(A = c(4,5,8,2,0,0,0,0,6,3), B = c(3,0,0,0,0,6,8,2,1,0), C = c(4,5,6,0,3,0,2,1,0,0), D = c(1:10))
df
   A B C  D
1  4 3 4  1
2  5 0 5  2
3  8 0 6  3
4  2 0 0  4
5  0 0 3  5
6  0 6 0  6
7  0 8 2  7
8  0 2 1  8
9  6 1 0  9
10 3 0 0 10
You can use this function for your data:
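cons.Zeros <- function(x, n) {
  # Drop NAs, then flag which of the remaining values are zero
  x <- x[!is.na(x)] == 0
  # Run-length encode the TRUE/FALSE vector
  r <- rle(x)
  # TRUE if any run of zeros is at least n long
  any(r$lengths[r$values] >= n)
}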
This function returns TRUE for the columns to be dropped; n is the number of consecutive zeros that requires a column to be dropped.
For your sample dataset, use n = 3:
df.dropped <- df[, !sapply(df, cons.Zeros, n = 3)]
# output:
# > df.dropped
#    C  D
# 1  4  1
# 2  5  2
# 3  6  3
# 4  0  4
# 5  3  5
# 6  0  6
# 7  2  7
# 8  1  8
# 9  0  9
# 10 0 10
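To see why this works, it may help to look at what rle() returns for a single column. The sketch below rebuilds the sample data from the question, inspects column A, and then selects columns with n = 4 (the k from the question). The names zero_runs and keep are only illustrative, and drop = FALSE is an optional addition that is not part of the answer above.

```r
# Sample data from the question
df <- data.frame(A = c(4, 5, 8, 2, 0, 0, 0, 0, 6, 3),
                 B = c(3, 0, 0, 0, 0, 6, 8, 2, 1, 0),
                 C = c(4, 5, 6, 0, 3, 0, 2, 1, 0, 0),
                 D = 1:10)

# Same function as in the answer
cons.Zeros <- function(x, n) {
  x <- x[!is.na(x)] == 0
  r <- rle(x)
  any(r$lengths[r$values] >= n)
}

# rle() collapses a vector into runs of equal values
zero_runs <- rle(df$A == 0)
zero_runs$lengths  # 4 4 2
zero_runs$values   # FALSE TRUE FALSE

# Logical vector marking the columns to keep when k = 4
keep <- !sapply(df, cons.Zeros, n = 4)
names(df)[keep]    # "C" "D"

# drop = FALSE (an optional addition) keeps the result a data.frame
# even if only a single column survives the filter
df.dropped <- df[, keep, drop = FALSE]
```

On this sample, n = 3 and n = 4 select the same columns, because the longest zero run in C has length 2. One thing to be aware of: the function removes NA values before counting, so zeros separated only by NAs are treated as one consecutive run.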