R: efficiently grep characters in the strings of a large data.frame

I have a data frame of character strings with more than 1M rows:

>head(df)
     A    B     C     D
1   S1   S2    U1    U2
2   S1   S2    S2    S1
3   S2   S1    S1    S2
4   S1   M2    U1    S2
5   S1   S1    M2    M1
6   M2   M2    M1    M2

      

I would like to identify all rows where a specific character is present (e.g. "U"). The solutions I have found so far work, but they are very slow, for example:

matches <- apply(as.matrix(df), 1, function(x){ sum(grepl("U", x, perl=T)) > 0 })

      

Any idea how to speed this up? Thanks!

+3




4 answers


EDIT: Updated in response to the comments:

Also very fast (0.31 seconds, even faster than before):

rows <- which(
  rowSums(
    `dim<-`(grepl("U", as.matrix(df), fixed=TRUE), dim(df))
  ) > 0
)

      

This gives the same result as the previous answer. Using fixed=FALSE roughly doubles the time, but your example doesn't require it.

What we're doing here is cheating a little by applying grepl to a matrix. All we really care about is turning df into a vector (which is what a matrix is underneath), and as.matrix is one of the quickest ways to do that. We then only need to run grepl once. Finally, we use dim<- to cast the vector returned by grepl back into a matrix, and use rowSums to check which rows contain a match.
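The steps above can be traced on the six rows shown in the question. This is only a sketch: the data frame is retyped by hand from the head(df) output, and the intermediate names m and hits are introduced here for illustration.

```r
# The six rows from head(df) in the question, rebuilt by hand
df <- data.frame(
  A = c("S1", "S1", "S2", "S1", "S1", "M2"),
  B = c("S2", "S2", "S1", "M2", "S1", "M2"),
  C = c("U1", "S2", "S1", "U1", "M2", "M1"),
  D = c("U2", "S1", "S2", "S2", "M1", "M2"),
  stringsAsFactors = FALSE
)

m <- as.matrix(df)                   # 6 x 4 character matrix
hits <- grepl("U", m, fixed = TRUE)  # a single call; returns a plain logical vector of length 24
dim(hits) <- dim(m)                  # restore the 6 x 4 shape (elements stay in column-major order)
rows <- which(rowSums(hits) > 0)     # rows with at least one match
rows
# [1] 1 4
```

The one-liner in the answer does the same thing, using the function form of dim<- so that no intermediate variables are needed.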

This is why it is much faster than your version:

  • We call grepl once, not a million times as you do with apply, since the function passed to apply is called once for each row; grepl is vectorized, which means you want to call it as few times as possible and let the vectorization do the work
  • We count the matches per row with rowSums instead of apply; rowSums is a much faster version of apply(x, 1, sum) (see the docs at ?rowSums).
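To check both points on your own machine, something like the following sketch may help. It uses a smaller random data set (1e4 rows, an arbitrary choice so the apply version finishes quickly), and the names df_small, slow and fast are made up for this illustration. Timings will vary by system, so none are quoted here.

```r
set.seed(1)
n <- 1e4  # far fewer rows than the 1M above, chosen only to keep the slow version quick

# Random single-letter data with the same layout as the test data in this answer
df_small <- data.frame(
  a = sample(LETTERS, n, replace = TRUE),
  b = sample(LETTERS, n, replace = TRUE),
  c = sample(LETTERS, n, replace = TRUE),
  d = sample(LETTERS, n, replace = TRUE),
  stringsAsFactors = FALSE
)
m <- as.matrix(df_small)

# Row-by-row version: grepl is called once per row
slow <- function(m) apply(m, 1, function(x) sum(grepl("U", x, fixed = TRUE)) > 0)

# Vectorized version: grepl is called once for the whole matrix
fast <- function(m) rowSums(`dim<-`(grepl("U", m, fixed = TRUE), dim(m))) > 0

# Both versions should flag exactly the same rows
stopifnot(identical(unname(slow(m)), unname(fast(m))))

system.time(slow(m))
system.time(fast(m))
```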

PREVIOUS ANSWER:



Here's a relatively simple solution that runs on my system in 0.35 seconds for a data frame of 1M rows by 4 columns:

rows <- which(rowSums(as.matrix(df) == "U") > 0)

      

To confirm

df[head(rows), ]

      

produces (each row has a U):

   a b c d
5  F B D U
8  R S U F
15 U L R P
20 U E E O
21 Y U D I
32 P F U H

      

And the data:

set.seed(1)
# 1M rows by 4 columns of random capital letters
df <- as.data.frame(
  `names<-`(
    replicate(4, sample(LETTERS, 1e6, replace=TRUE), simplify=FALSE),
    letters[1:4]
  )
)

      

+4




library(data.table)

df = fread("~/Rscripts/SO.csv")  # fast read
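# %like% is data.table's wrapper around grepl(); test every column for "U"
x = df[, lapply(.SD, function(x) x %like% "U")] # fast grep
# TRUE for every row with at least one match
y = x[, rowSums(x) > 0]
z = df[y,]  # keep only the matching rows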

      



+2




If you're just looking for the indices of the rows that contain the character, maybe try this. It should be slightly faster than a loop.

unique(row(df)[grep("U", unlist(df))])
# [1] 1 4

      

+2




[This answered the original question, which asked for an exact match of characters in a matrix, not a regex.] Coerce to a matrix (which is the correct representation anyway), compare each element with "U" (use %in% if there is more than one possible value of interest) to create a logical matrix, and compute the row sums; use the result to subset the original:

which(rowSums(as.matrix(df) == "U") > 0)

      

There is no need to loop explicitly (via apply or vapply); these are "vectorized" calculations and are fast (although the above creates 2 new matrices, and could therefore be improved).
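For the %in% variant mentioned above, one detail to watch is that %in% returns a plain vector and drops the matrix shape, so the dimensions have to be put back before calling rowSums. A minimal sketch, assuming a toy three-row data frame and a made-up vector of values of interest called targets:

```r
# Toy data, invented for this illustration
df <- data.frame(
  a = c("F", "U", "M"),
  b = c("B", "S", "U"),
  stringsAsFactors = FALSE
)

m <- as.matrix(df)
targets <- c("U", "M")   # hypothetical values of interest

hit <- m %in% targets    # plain logical vector; the matrix shape is lost
dim(hit) <- dim(m)       # put the rows and columns back
rows <- which(rowSums(hit) > 0)
rows
# [1] 2 3
```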

+1








