Which R function to use for regex capture groups?

I am doing some text wrangling in R and for a specific extraction I need to use a capture group. For some reason, the base / stringr functions I'm familiar with don't support capture groups:

str_extract("abcd123asdc", pattern = "([0-9]{3}).+$") 
# Returns: "123asdc"

stri_extract(str = "abcd123asdc", regex = "([0-9]{3}).+$")
# Returns: "123asdc"

grep(x = "abcd123asdc", pattern = "([0-9]{3}).+$", value = TRUE)
# Returns: "abcd123asdc"

      

A normal search for "R regex capture group" turns up no useful hits for this problem. Am I missing something, or are capture groups not implemented in R?

EDIT: I tried the solution suggested in the comments. It works on a small example, but not in my situation. Please note that this text comes from the Enron email dataset and therefore does not contain sensitive information:

txt <- "Message-ID: <24216240.1075855687451.JavaMail.evans@thyme>
Date: Wed, 18 Oct 2000 03:00:00 -0700 (PDT)
From: phillip.allen@enron.com
To: leah.arsdall@enron.com
Subject: Re: test
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-From: Phillip K Allen
X-To: Leah Van Arsdall
X-cc: 
X-bcc: 
X-Folder: \\Phillip_Allen_Dec2000\\Notes Folders\\'sent mail
X-Origin: Allen-P
X-FileName: pallen.nsf

test successful.  way to go!!!"

sub("X-FileName:.+\n\n([\\W\\w]+)$", "\\1", txt)
# Returns all of "txt", not the capture group

      

Since we only have one capture group, shouldn't "\\1" capture it? I tested the regex with an online regex tester and it should work. I also tried both \\n and \n for the newlines. Any ideas?

1 answer


Solution

You can always extract capture groups with stringr using str_match or str_match_all:

> result <- str_match(txt, "X-FileName:.+\n\n(?s)(.+)$")
> result[,2]
[1] "test successful.  way to go!!!"

      

Pattern details:

  • X-FileName: - a literal substring
  • .+ - any 1+ characters other than line break characters (in ICU regex, the dot does not match line break characters by default)
  • \n\n - 2 newline characters
  • (?s) - an inline DOTALL modifier (from here on, any . to its right also matches line break characters)
  • (.+) - Group 1, capturing any 1+ characters (including line breaks) up to
  • $ - the end of the string.
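
To connect this back to the small example in the question, here is a minimal sketch of how str_match exposes capture groups. It assumes the stringr package is installed; the input "a1b22" and the variable names are made up for illustration.

```r
library(stringr)

# Small example from the question: three digits followed by the rest of the string.
# str_match() returns a character matrix: column 1 is the whole match,
# column 2 is capture group 1, column 3 is group 2, and so on.
m <- str_match("abcd123asdc", "([0-9]{3}).+$")
whole_match <- m[, 1]   # the full match
group_1     <- m[, 2]   # only the three digits

# str_match_all() returns a list with one matrix per input string
# and one row per match found in that string.
all_m  <- str_match_all("a1b22", "([a-z])([0-9]+)")
digits <- all_m[[1]][, 3]   # group 2 of every match

print(whole_match)
print(group_1)
print(digits)
```

The point is that str_extract only ever returns the whole match (column 1 here), while str_match keeps the groups as extra columns.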

Or you can use base R regmatches with regexec:

> result <- regmatches(txt, regexec("X-FileName:[^\n]+\n\n(.+)$", txt))
> result[[1]][2]
[1] "test successful.  way to go!!!"

      

See the online R demo. Here the TRE regex engine is used (with regexec, the PCRE engine unfortunately cannot be used), so . matches any character, including line break characters, and the pattern looks like X-FileName:[^\n]+\n\n(.+)$:



  • X-FileName: - a literal string
  • [^\n]+ - 1+ characters other than a newline
  • \n\n - 2 newlines
  • (.+) - Group 1, capturing any 1+ characters (including line break characters), as many as possible, up to
  • $ - the end of the string.
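
As a rough illustration of the structure regmatches returns, here is a base R sketch applied to the small example from the question. The second input string and the helper name first_group are hypothetical, added only to show what happens when there is no match.

```r
# Base R only: regexec() records the positions of the whole match and of each
# capture group; regmatches() turns them into substrings.
x <- c("abcd123asdc", "no digits here")   # second element is a made-up non-match
m <- regmatches(x, regexec("([0-9]{3}).+$", x))

# m is a list with one character vector per input:
# element 1 of each vector is the whole match, element 2 is capture group 1.
# Inputs without a match yield character(0), so guard before indexing.
first_group <- vapply(
  m,
  function(v) if (length(v) >= 2) v[2] else NA_character_,
  character(1)
)

print(m[[1]])
print(first_group)
```

Indexing with [[1]][2], as in the answer, is the single-string case of the same idea.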

A sub option can also be considered:

sub(".*X-FileName:[^\n]+\n\n", "", txt)
[1] "test successful.  way to go!!!"

      

See this R demo. Here, .* matches any 0+ characters, as many as possible (the whole string), and then backtracks to find the X-FileName: substring; [^\n]+ matches 1+ characters other than a newline, and then \n\n matches 2 newlines.
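
The same idea also explains the "\\1" attempt in the question: sub only replaces the part of the string the pattern matched, so a backreference returns just the group only when the pattern covers the whole string. A minimal sketch, using a shortened, hypothetical stand-in (msg) for the full message:

```r
# Shortened stand-in for the message in the question (illustration only).
msg <- "Subject: Re: test\nX-FileName: pallen.nsf\n\ntest successful.  way to go!!!"

# Without a leading .*, everything before "X-FileName:" is left untouched,
# so the result still contains the earlier header line.
partial <- sub("X-FileName:[^\n]+\n\n(.+)$", "\\1", msg)

# With a leading .*, the match spans the whole string, and replacing it
# with "\\1" leaves only capture group 1.
body <- sub(".*X-FileName:[^\n]+\n\n(.+)$", "\\1", msg)

# Same technique on the small example from the question.
digits <- sub("^[^0-9]*([0-9]{3}).*$", "\\1", "abcd123asdc")

print(partial)
print(body)
print(digits)
```

Note that when the pattern does not match at all, sub returns the input unchanged, which is another way to end up with the whole text back.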

Benchmark comparison

Taking hwnd's comment into account, I added the TRE-based sub option above, and it seems to be the fastest of the 4 suggested options, with str_match almost as fast as my sub code above:

library(microbenchmark)
library(stringr)  # provides str_match

# Each function works on its own argument (text) rather than the global txt
f1 <- function(text) { return(str_match(text, "X-FileName:.+\n\n(?s)(.+)$")[,2]) }
f2 <- function(text) { return(regmatches(text, regexec("X-FileName:[^\n]+\n\n(.+)$", text))[[1]][2]) }
f3 <- function(text) { return(sub('(?s).*X-FileName:[^\n]+\\R+', '', text, perl=TRUE)) }
f4 <- function(text) { return(sub('.*X-FileName:[^\n]+\n\n', '', text)) }

> test <- microbenchmark( f1(txt), f2(txt), f3(txt), f4(txt), times = 500000 )
> test
Unit: microseconds
    expr    min     lq     mean median     uq       max neval  cld
 f1(txt) 21.130 24.451 28.08150 27.168 28.677 53796.565 5e+05  b  
 f2(txt) 29.280 32.903 37.46800 35.318 37.431 54556.635 5e+05   c 
 f3(txt) 57.655 59.466 63.36906 60.674 61.881  1651.448 5e+05    d
 f4(txt) 22.036 23.545 25.56820 24.451 25.356  1660.504 5e+05 a   
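
Before reading much into timings like these, it may be worth confirming that the candidates return the same string. The following is only a sketch of such a check, restricted to the three base R variants so it runs without extra packages; msg is a shortened, hypothetical stand-in for the full message.

```r
# Shortened stand-in for the message in the question (illustration only).
msg <- "Subject: Re: test\nX-FileName: pallen.nsf\n\ntest successful.  way to go!!!"

# Base R variants from the benchmark, each using its own argument.
f2 <- function(text) regmatches(text, regexec("X-FileName:[^\n]+\n\n(.+)$", text))[[1]][2]
f3 <- function(text) sub("(?s).*X-FileName:[^\n]+\\R+", "", text, perl = TRUE)
f4 <- function(text) sub(".*X-FileName:[^\n]+\n\n", "", text)

# Collect the results and check that all variants agree.
results   <- c(f2 = f2(msg), f3 = f3(msg), f4 = f4(msg))
all_agree <- length(unique(results)) == 1

print(results)
print(all_agree)
```

The absolute numbers in the table depend on the machine and package versions, so only the relative ordering is likely to carry over.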

      
