Regexp: replace a matched pattern with its length

Suppose I have a line like this:

> x <- c("16^TG40")

      

I am trying to get the result c(16, 2, 40), where 2 is the length of ^TG minus 1 (that is, nchar("^TG") - 1). I can find this pattern, for example:

> gsub("(\\^[ACGT]+)", " \\1 ", x)
[1] "16 ^TG 40"

      

However, I cannot directly replace the match with its length minus 1. Is there an easy way to replace a matched pattern with its length?

After quite a lot of searching (here on SO and on Google), I ended up with the stringr package, which I think is awesome. But it still boils down to finding the locations of the pattern (using str_locate_all) and then replacing each substring with whatever value you want (using str_sub). I have over 100,000 lines and it takes a long time (since the pattern can also appear multiple times in a line).

I am running this in parallel at the moment to compensate for the slowness, but I would be happy to know whether a direct (and fast) replacement is even possible.

Any ideas?

+3




3 answers


Here's a base-R approach.

The syntax is far from intuitive, but by following this pattern closely you can perform all sorts of manipulations and substitutions on matched substrings. (See ?gregexpr for more complex examples.)



x2 <- x <- c("16^TG40", "16^TGCT40", "16^TG40^GATTACA40")

pat <- "(\\^[ACGT]+)"              ## A pattern matching substrings of interest
modFun <- function(ss) {           ## A function to modify them
    paste0(" ", nchar(ss) - 1, " ")
}

## Use regmatches() <- regmatches(gregexpr()) to search, modify, and replace.
m <- gregexpr(pat, x2)
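## lapply() rather than sapply(): sapply() would simplify the result to a matrix
## whenever every string has the same number (two or more) of matches.
regmatches(x2, m) <- lapply(regmatches(x2, m), modFun)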
x2
## [1] "16 2 40"      "16 4 40"      "16 2 40 7 40"
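
The question asks for numeric output such as c(16, 2, 40) rather than a string. As a minimal sketch of one way to get there (the names out and nums are only illustrative, and sprintf() stands in for modFun), the same search/modify/replace steps can be followed by a split on spaces:

```r
x <- c("16^TG40", "16^TGCT40", "16^TG40^GATTACA40")

## Same search/modify/replace steps as above, written inline.
m <- gregexpr("\\^[ACGT]+", x)
out <- x
regmatches(out, m) <- lapply(regmatches(x, m),
                             function(ss) sprintf(" %d ", nchar(ss) - 1L))

## Split on runs of spaces and convert each piece to numeric.
## trimws() guards against a leading or trailing space.
nums <- lapply(strsplit(trimws(out), " +"), as.numeric)
nums[[1]]
## [1] 16  2 40
```

The result is a list with one numeric vector per input line, since lines can contain different numbers of matches.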

      

+8




(1) gsubfn

The gsubfn function replaces each ^... portion with its length surrounded by spaces, and strapply then extracts the digit strings from the result and converts them to numeric. Omit the strapply step if character output is sufficient.

> library(gsubfn)
> xx <- gsubfn("\\^[ACGT]*", ~ sprintf(" %s ", nchar(x) - 1), x)
> strapply(xx, "\\d+", as.numeric)
[[1]]
[1] 16  2 40

      

(2) Loop over lengths

This assumes that the number of characters in each ACGT sequence is between mn and mx. It uses gsub to replace every ACGT sequence of length i with i, looping over the candidate values of i. If there are only a few possible lengths there will be only a few iterations, so it will be fast; if the sequences can have many different lengths it will be slower, since more iterations are required. Below we assume that the ACGT sequences have length 2, 4, or 6, which may need adjusting for real data. A possible disadvantage of this solution is that the set of possible sequence lengths must be known in advance.

x <- "4^CG5^CAGT656"

mn <- 2
mx <- 6
y <- x
for(i in seq(mn, mx, 2)) {
   pat <- sprintf("\\^[ACGT]{%d}(\\d)", i)
   replacement <- sprintf(" %d \\1", i)
   y <- gsub(pat, replacement, y)
}
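
If the set of lengths is not known in advance, it can be derived from the data first. The following is only a sketch under that assumption: replace_by_length is a hypothetical helper name, and it uses a Perl-style lookahead instead of the trailing (\\d) group so that a sequence at the end of a line is also handled.

```r
## Hypothetical helper: collect the lengths that actually occur, then run the
## loop-over-lengths idea with one vectorised gsub() per distinct length.
replace_by_length <- function(x) {
  seqs <- unlist(regmatches(x, gregexpr("\\^[ACGT]+", x)))
  lens <- unique(nchar(seqs)) - 1L          # subtract 1 for the leading "^"
  y <- x
  for (i in lens) {
    ## (?![ACGT]) ensures the sequence is exactly i letters long.
    pat <- sprintf("\\^[ACGT]{%d}(?![ACGT])", i)
    y <- gsub(pat, sprintf(" %d ", i), y, perl = TRUE)
  }
  y
}

replace_by_length(c("4^CG5^CAGT656", "16^TGA40"))
## [1] "4 2 5 4 656" "16 3 40"
```

The extra pass to collect the lengths costs one gregexpr() call over the data, so whether this beats the other approaches depends on how many distinct lengths occur.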

      

(3) Loop through ACGT sequences

This loop goes through the ACGT sequences one at a time, replacing each with its length until none remain. If there are only a few ACGT sequences it may be fast, since there will be few iterations; if there are many ACGT sequences it will be slower, since more iterations are needed.

x <- "4^CG5^CAGT656"
y <- x
while(regexpr("^", y, fixed = TRUE) > 0) {
    y <- sprintf("%s %d %s", sub("\\^.*", "", y),
        nchar(sub("^[0-9 ]+\\^([ACGT]+).*", "\\1", y)),
        sub("^[0-9 ]+\\^[ACGT]+", "", y))
}
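
The while loop above works on a single string. To apply it to many lines, one option is to wrap it in a function and call it per element. This is a sketch only: length_sub_one is a hypothetical name, and it assumes every line starts with digits and every ^ is followed by at least one of A, C, G, T (otherwise the loop would not terminate).

```r
## Hypothetical wrapper around the loop above, for one string at a time.
length_sub_one <- function(y) {
  while (grepl("^", y, fixed = TRUE)) {
    y <- sprintf("%s %d %s",
                 sub("\\^.*", "", y),                             # text before the first ^
                 nchar(sub("^[0-9 ]+\\^([ACGT]+).*", "\\1", y)),  # length of the first sequence
                 sub("^[0-9 ]+\\^[ACGT]+", "", y))                # text after the first sequence
  }
  y
}

x <- c("4^CG5^CAGT656", "16^TG40")
res <- vapply(x, length_sub_one, character(1), USE.NAMES = FALSE)
res
## [1] "4 2 5 4 656" "16 2 40"
```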

      



Benchmark

Here is the benchmark. Note that some of the solutions above convert the strings to numeric (which of course takes extra time); to compare like with like, the benchmark times only the string generation, without any numeric conversion.

x <- "4^CGT5^CCA656"
library(rbenchmark)
benchmark(order = "relative", replications = 10000,
   columns = c("test", "replications", "relative", "elapsed"),
   regmatch = {   ## modFun() is the function defined in the regmatches answer
      pat <- "(\\^[ACGT]+)"
      x2 <- x
      m <- gregexpr(pat, x2)
      regmatches(x2, m) <- sapply(regmatches(x2, m), modFun)
      x2
   },
   gsubfn = gsubfn("\\^[ACGT]*", ~ sprintf(" %s ", nchar(x) - 1), x),
   loop.on.len = {
    mn <- 2
    mx <- 6
    y <- x
    ## Lengths tried are 2, 4, 6; the length-3 sequences in this x are never matched.
    for(i in seq(mn, mx, 2)) {
       pat <- sprintf("\\^[ACGT]{%d}(\\d)", i)
       replacement <- sprintf(" %d \\1", i)
       y <- gsub(pat, replacement, y)
    }
   },
   loop.on.seq = {
    y <- x
    while(regexpr("^", y, fixed = TRUE) > 0) {
        y <- sprintf("%s %d %s", sub("\\^.*", "", y),
            nchar(sub("^[0-9 ]+\\^([ACGT]+).*", "\\1", y)),
            sub("^[0-9 ]+\\^[ACGT]+", "", y))
    }
  }
)

      

The results are shown below. The two loop solutions were the fastest on the input shown, but their performance depends on how many iterations are required, so real data may behave differently. The loop.on.len solution has the disadvantage that the ACGT lengths must be within the assumed set. Josh's regmatch solution is loop-free and fast. The advantage of gsubfn is that it is a single line of code and especially straightforward.

         test replications relative elapsed
4 loop.on.seq        10000    1.000    1.93
3 loop.on.len        10000    1.140    2.20
1    regmatch        10000    1.803    3.48
2      gsubfn        10000    7.145   13.79
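
The benchmark above repeats one short string many times, whereas the question involves over 100,000 lines processed in one go. As a rough sketch of how to check that case, the two vectorised approaches can be timed on a long vector with system.time(). Here big is a made-up stand-in for the real data, by_regmatches and by_length are hypothetical names, and lens = 2:6 is an assumed set of sequence lengths. No timings are quoted because they vary by machine.

```r
## Hypothetical stand-in for the real data: one string repeated 100,000 times.
big <- rep("4^CGT5^CCA656", 1e5)

## regmatches approach, vectorised over all lines.
by_regmatches <- function(x, pat = "\\^[ACGT]+") {
  m <- gregexpr(pat, x)
  regmatches(x, m) <- lapply(regmatches(x, m),
                             function(ss) sprintf(" %d ", nchar(ss) - 1L))
  x
}

## Loop-over-lengths approach; gsub() is vectorised, so there is one pass
## over all lines per candidate length.
by_length <- function(x, lens = 2:6) {
  for (i in lens) {
    x <- gsub(sprintf("\\^[ACGT]{%d}(\\d)", i), sprintf(" %d \\1", i), x)
  }
  x
}

t_regmatches <- system.time(r1 <- by_regmatches(big))[["elapsed"]]
t_length     <- system.time(r2 <- by_length(big))[["elapsed"]]

identical(r1, r2)   # both approaches should agree before timings are compared
c(regmatches = t_regmatches, by_length = t_length)
```

Checking that the outputs are identical before comparing timings guards against benchmarking a method that silently produces a different result.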

      

UPDATE: Added two looping solutions and removed those previously in the post that did not handle more than one ACGT sequence (based on comments clarifying the question). The benchmark was rerun, including only solutions that handle multiple ACGT sequences.

UPDATE: Removed one solution that doesn't work with multiple ^ sequences. It had already been dropped from the benchmark, but its code had not been removed. Improved the explanation in (1).

+8




I vote for the incredibly slick gsubfn answer, but since I already have this clunky code:

mod <- gsub("(\\^[ACGT]+)", " \\1 ", x)
locs <- gregexpr(" ", mod , fixed=TRUE)[[1]]
paste( substr( x, 1, locs[1]-1), 
       diff(locs)-2, 
       substr(mod, locs[2]+1, nchar(mod) ) , sep=" ")
#[1] "16 2 40"

      

+2

