R: efficiently grep characters in strings of a large data.frame
I have a data frame of character strings with more than 1M rows:
>head(df)
A B C D
1 S1 S2 U1 U2
2 S1 S2 S2 S1
3 S2 S1 S1 S2
4 S1 M2 U1 S2
5 S1 S1 M2 M1
6 M2 M2 M1 M2
I would like to identify all rows where a specific character is present (e.g. "U"). The solutions I have found so far work, but they are very slow, for example:
matches <- apply(as.matrix(df), 1, function(x){ sum(grepl("U", x, perl=T)) > 0 })
Any idea how to speed up this query? Thanks!
EDIT: Updated to address the comments:
Also very fast (0.31 seconds, even faster than before):
rows <- which(
rowSums(
`dim<-`(grepl("U", as.matrix(df), fixed=TRUE), dim(df))
) > 0
)
This gives the same result as the previous answer. Using fixed=FALSE roughly doubles the time, but your example doesn't require it.
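As a rough illustration of when fixed=TRUE is enough and when a regular expression is actually needed, here is a minimal sketch. The vector x is made up for the example and is not part of the question's data.

```r
# Hypothetical strings, chosen only to contrast literal and regex matching
x <- c("U1", "S2", "MU", "a.b", "axb")

# Literal substring search: is there a "U" anywhere in the string?
lit_u <- grepl("U", x, fixed = TRUE)

# Regular expression: does the string start with "U"?
re_u <- grepl("^U", x)

# A dot is an ordinary character with fixed = TRUE ...
lit_dot <- grepl(".", x, fixed = TRUE)

# ... but matches any character when treated as a regex
re_dot <- grepl(".", x)
```

If the pattern is a plain substring, as in the question, the literal form gives the same matches as the regex form and skips the regex engine.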
The trick here is that we cheat by applying grepl to a matrix, although what we really care about is turning df into a vector (which a matrix is), and as.matrix is one of the quickest ways to do that. Then we can run just one grepl command. Finally, we use dim<- to cast the vector result of grepl back into a matrix, and use rowSums to check which rows had matches.
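The same steps can be written out one at a time on a tiny data frame. This is only a sketch: df_small and its values are invented for illustration, loosely following the layout in the question.

```r
# Toy data frame (hypothetical values) with the same shape as the question's data
df_small <- data.frame(
  A = c("S1", "S1", "S2"),
  B = c("S2", "M2", "S1"),
  C = c("U1", "S2", "S1"),
  D = c("U2", "S1", "U2"),
  stringsAsFactors = FALSE
)

m <- as.matrix(df_small)                # character matrix, i.e. a vector with a dim attribute
hits <- grepl("U", m, fixed = TRUE)     # one vectorized call; the result is a plain logical vector
dim(hits) <- dim(m)                     # restore the 3 x 4 shape, same effect as `dim<-`(hits, dim(m))
rows_small <- which(rowSums(hits) > 0)  # indices of rows with at least one match

df_small[rows_small, ]                  # rows 1 and 3 contain a "U"
```

Restoring the dimensions is needed because grepl returns a flat vector in column-major order, so rowSums would otherwise have no rows to sum over.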
This is why it is much faster than your version:
- We call grepl once, not a million times as you do with apply, since the function passed to apply is called once for each row; grepl is vectorized, which means you want to minimize how many times you call it and take advantage of the vectorization.
- We count the matches per row with rowSums instead of apply; rowSums is a much faster version of apply(x, 1, sum) (see the docs at ?rowSums).
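To check both points on your own machine, a small timing sketch along these lines may help. The data are randomly generated (df_bench is a made-up name), the size n is kept small so the slow version finishes quickly, and the timings will vary by system.

```r
set.seed(42)
n <- 1e4  # assumption: small enough to run in a moment; raise it to see the gap grow

# Random single-letter data, 4 columns, similar in spirit to the answer's test data
df_bench <- as.data.frame(
  setNames(replicate(4, sample(LETTERS, n, replace = TRUE), simplify = FALSE),
           letters[1:4]),
  stringsAsFactors = FALSE
)
m <- as.matrix(df_bench)

# Row by row: the anonymous function, and therefore grepl, runs once per row
t_apply <- system.time(
  slow <- apply(m, 1, function(x) any(grepl("U", x, fixed = TRUE)))
)

# Vectorized: a single grepl call over every cell, then rowSums
t_vec <- system.time(
  fast <- rowSums(`dim<-`(grepl("U", m, fixed = TRUE), dim(m))) > 0
)

# Both approaches should flag exactly the same rows
same <- identical(unname(slow), fast)
print(rbind(apply = t_apply, vectorized = t_vec))
```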
PREVIOUS ANSWER:
Here's a relatively simple solution that runs on my system in 0.35 seconds for a 1M row by 4 column data frame:
rows <- which(rowSums(as.matrix(df) == "U") > 0)
To confirm
df[head(rows), ]
produces (each row has a U):
a b c d
5 F B D U
8 R S U F
15 U L R P
20 U E E O
21 Y U D I
32 P F U H
And the data:
set.seed(1)
df <- as.data.frame(
`names<-`(
replicate(4, sample(LETTERS, 1e6, replace=TRUE), simplify=FALSE),
letters[1:4]
)
)
[This answered the original question, which was about an exact match of characters in a matrix, not a regex.] Coerce to a matrix (which is the appropriate representation anyway), compare each element with "U" (use %in% if there is more than one possible value of interest) to create a logical matrix, and calculate the row sums; use the result to subset the original:
which(rowSums(as.matrix(df) == "U") > 0)
There is no need to loop explicitly (via apply or vapply); these are "vectorized" calculations and are fast (although the above involves creating 2 new matrices, and so could be improved).
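For the %in% variant with several values of interest, a minimal sketch could look like the following. One detail to watch: %in% returns a plain logical vector without dimensions, so the matrix shape has to be restored before calling rowSums. The data frame and the targets vector are made up for illustration.

```r
# Hypothetical data: single-letter cells, as in the exact-match version of the question
df_small <- data.frame(
  a = c("F", "R", "E", "A"),
  b = c("B", "S", "L", "B"),
  c = c("D", "U", "R", "C"),
  d = c("U", "F", "P", "M"),
  stringsAsFactors = FALSE
)

m <- as.matrix(df_small)
targets <- c("U", "M")                  # assumed set of values of interest

hits <- m %in% targets                  # exact matches; dimensions are dropped here
dim(hits) <- dim(m)                     # put the 4 x 4 shape back
rows_multi <- which(rowSums(hits) > 0)  # rows containing any of the targets

df_small[rows_multi, ]                  # rows 1, 2 and 4
```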