R: efficiently grep characters in rows of a large data.frame

Question


I have a data frame of character strings with more than 1M rows:

>head(df)
     A    B     C     D
1   S1   S2    U1    U2
2   S1   S2    S2    S1
3   S2   S1    S1    S2
4   S1   M2    U1    S2
5   S1   S1    M2    M1
6   M2   M2    M1    M2

I would like to identify all rows where a specific character is present (e.g. "U"). The solutions I have found so far work, but they are very slow, for example:

matches <- apply(as.matrix(df), 1, function(x){ sum(grepl("U", x, perl=T)) > 0 })

Any idea how to speed this up? Thanks!

+3

grep r dataframe

jul635 04 Sep 14 at 14:21


4 answers

library(data.table)

df = fread("~/Rscripts/SO.csv")  # fast read
x = df[, lapply(.SD, function(x) x %like% "U")] # fast grep
y = x[, rowSums(x) > 0]
z = df[y,]

+2

Henk 04 Sep 14 at 16:26


If you're just looking for the row indices of the matching characters, maybe try this. It should be somewhat faster than a loop.

unique(row(df)[grep("U", unlist(df))])
# [1] 1 4
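
To see why the one-liner above works, here is a small sketch on the six sample rows from the question (retyped by hand here, so treat the data frame as an illustration). unlist(df) flattens the columns one after another, and row(df) holds the row number of every cell in that same column-major order, so indexing it with the grep hits gives row numbers directly.

```r
# Sample rows from the question, retyped for illustration
df <- data.frame(
  A = c("S1", "S1", "S2", "S1", "S1", "M2"),
  B = c("S2", "S2", "S1", "M2", "S1", "M2"),
  C = c("U1", "S2", "S1", "U1", "M2", "M1"),
  D = c("U2", "S1", "S2", "S2", "M1", "M2"),
  stringsAsFactors = FALSE
)

flat <- unlist(df, use.names = FALSE)  # column-major: all of A, then B, C, D
hits <- grep("U", flat, fixed = TRUE)  # positions within the flattened vector
row_of_cell <- row(df)                 # row number of each cell, same ordering

# A row with several matches appears several times, hence unique()
rows <- unique(row_of_cell[hits])
rows
```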

+2

Rich Scriven 04 Sep 14 at 17:18


[This answered the original question, which asked for an exact match of characters in a matrix, not a regex.] Coerce to a matrix (which is the appropriate representation anyway), compare each element with "U" (use %in% if there is more than one value of interest) to create a logical matrix, and calculate the row sums; use the result to subset the original:

which(rowSums(as.matrix(df) == "U") > 0)

There is no need to loop explicitly (via apply or vapply); these are "vectorized" calculations and are fast (although the above creates 2 new matrices, and so could be improved).
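
A minimal sketch of the %in% variant mentioned in this answer, assuming two values of interest ("U1" and "U2", chosen here only for illustration) and the sample rows from the question retyped by hand. One detail to watch: unlike ==, %in% returns a plain vector, so the matrix shape has to be restored before rowSums.

```r
# Sample rows from the question, retyped for illustration
df <- data.frame(
  A = c("S1", "S1", "S2", "S1", "S1", "M2"),
  B = c("S2", "S2", "S1", "M2", "S1", "M2"),
  C = c("U1", "S2", "S1", "U1", "M2", "M1"),
  D = c("U2", "S1", "S2", "S2", "M1", "M2"),
  stringsAsFactors = FALSE
)

m <- as.matrix(df)
wanted <- c("U1", "U2")  # hypothetical set of exact values to look for

# %in% drops the dim attribute, so rebuild a matrix with the original row count
is_match <- matrix(m %in% wanted, nrow = nrow(m))

rows <- which(rowSums(is_match) > 0)
rows
```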

+1

Martin Morgan 04 Sep 14 at 14:38


BrodieG · Accepted Answer · 2014-09-04T14:35:28+0000

EDIT: Updated in response to the comments:

Also very fast (0.31 seconds, even faster than before):

rows <- which(
  rowSums(
    `dim<-`(grepl("U", as.matrix(df), fixed=TRUE), dim(df))
  ) > 0
)

This gives the same result as the previous answers. Using fixed=FALSE roughly doubles the time, but your example doesn't require it.

What we're doing here is cheating by applying grepl to a matrix, although all we really care about is turning df into a vector (which a matrix is), and as.matrix is one of the fastest ways to do that. Then we only need to run grepl once. Finally, we use dim<- to cast the vector result of grepl back into a matrix, and use rowSums to check which rows match.
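
The steps described above can be traced on a tiny data frame. This is only an illustrative sketch: the three-row df below is made up for the example, and the intermediate variable names are not from the answer.

```r
# Hypothetical toy data, small enough to follow by eye
df <- data.frame(
  a = c("S1", "U1", "M2"),
  b = c("S2", "S1", "U2"),
  stringsAsFactors = FALSE
)

m <- as.matrix(df)                   # character matrix, 3 rows by 2 columns
hits <- grepl("U", m, fixed = TRUE)  # one call over all cells; result is a flat vector
stopifnot(is.null(dim(hits)))        # grepl dropped the matrix shape

dim(hits) <- dim(df)                 # same effect as `dim<-`(hits, dim(df))
rows <- which(rowSums(hits) > 0)     # rows with at least one match
rows
```

The inline form `dim<-`(..., dim(df)) used in the answer does the same reshaping without needing a temporary variable.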

This is why it is much faster than your version:

We call grepl once, not a million times as you do with apply, since the function passed to apply is called once for each row; grepl is vectorized, which means you want to minimize how many times you call it and let the vectorization do the work.

We count the matches per row with rowSums instead of apply; rowSums is a much faster version of apply(x, 1, sum) (see the docs at ?rowSums).
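
A rough way to check both points yourself, sketched under some assumptions: the data are random letters generated as in this answer but with only 1e4 rows so the slow version finishes quickly, and by_row and vectorized are hypothetical helper names introduced for the comparison. Timings depend on the machine, so none are quoted here.

```r
set.seed(1)
n <- 1e4  # far fewer rows than the question's 1e6, to keep the demo quick
cols <- replicate(4, sample(LETTERS, n, replace = TRUE), simplify = FALSE)
names(cols) <- letters[1:4]
df <- as.data.frame(cols, stringsAsFactors = FALSE)

# Hypothetical helper: the question's approach, one grepl call per row
by_row <- function(d) {
  which(apply(as.matrix(d), 1, function(x) any(grepl("U", x, fixed = TRUE))))
}

# Hypothetical helper: a single grepl call, then rowSums
vectorized <- function(d) {
  hits <- grepl("U", as.matrix(d), fixed = TRUE)
  dim(hits) <- dim(d)
  which(rowSums(hits) > 0)
}

t_apply <- system.time(r1 <- by_row(df))
t_vec <- system.time(r2 <- vectorized(df))

# Both approaches should select exactly the same rows
stopifnot(identical(unname(r1), unname(r2)))
print(rbind(apply = t_apply, vectorized = t_vec)[, "elapsed"])
```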

PREVIOUS ANSWER:

Here's a relatively simple solution that runs in 0.35 seconds on my system for a 1M-row by 4-column data frame:

rows <- which(rowSums(as.matrix(df) == "U") > 0)

To confirm,

df[head(rows), ]

produces (each row has a U):

   a b c d
5  F B D U
8  R S U F
15 U L R P
20 U E E O
21 Y U D I
32 P F U H

And the data:

set.seed(1)
# 4 columns (a to d) of 1e6 random capital letters each
df <- as.data.frame(
  `names<-`(
    replicate(4, sample(LETTERS, 1e6, replace=TRUE), simplify=FALSE),
    letters[1:4]
  )
)
