R - Substring text with a regular expression that recognizes numbers of more than one digit

Question

I am trying to extract information from a string using a combination of gregexpr and substr. Each line contains a phrase that starts with a word and ends with a number (sometimes greater than 9).

Here is the vector of strings:

y = c("Hearing #3: The document states in Article ABC 3 Section 9 line 10 that...",
  "Hearing #3: The document states in Article ABC 31 Section 9 that...",
  "Hearing #3: The document states in Article ABC 3.1 Section 9 that...")

First I strip off everything before the word Article, which begins the phrase I'm interested in:

z = substr(y, gregexpr("Article", y)[[1]][1], nchar(y))

> z
[1] "Article ABC 3 Section 9 line 10 that..."   "Article ABC 31 Section 9 that..."  "Article ABC 3.1 Section 9 that..."
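Note that gregexpr("Article", y)[[1]][1] uses only the position found in the first string; it works here because every string has the same prefix. A minimal sketch of a per-element variant, assuming the prefixes could differ in length (the name start is introduced here for illustration):

```r
y <- c("Hearing #3: The document states in Article ABC 3 Section 9 line 10 that...",
       "Hearing #3: The document states in Article ABC 31 Section 9 that...",
       "Hearing #3: The document states in Article ABC 3.1 Section 9 that...")

# regexpr() is vectorized: it returns one start position per element of y
start <- as.integer(regexpr("Article", y, fixed = TRUE))

# substr() is vectorized over start and stop as well
z <- substr(y, start, nchar(y))
z
```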

So far so good, but now I need to capture the first whole number (not just its first digit) after the word Article:

> substr(z, 0, regexpr(pattern='[0-9]', z)[1][1])
[1] "Article ABC 3" "Article ABC 3" "Article ABC 3"

This is not quite right, so I tried to work out the positions with another gregexpr call:

gregexpr(pattern='[0-9]', z)

I can't figure out how to make this work, and I'm not even sure I'm going about it the right way.

Desired result:

[1] "Article ABC 3" "Article ABC 31" "Article ABC 3.1"

+3

regex r

jdesilvio 09 Aug 15 at 20:29

2 answers

You can fix your approach by adding a negated character class after the digit, so that the match lands on the last character of the number.

substr(z, 0, regexpr('[0-9][^0-9.]', z))
# [1] "Article ABC 3"   "Article ABC 31"  "Article ABC 3.1"
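To see why this works, it can help to look at the positions regexpr returns. A small sketch (z is rebuilt here so the snippet is self-contained; end is a name introduced for illustration):

```r
z <- c("Article ABC 3 Section 9 line 10 that...",
       "Article ABC 31 Section 9 that...",
       "Article ABC 3.1 Section 9 that...")

# Position of the first digit that is NOT followed by another digit or a dot,
# i.e. the last character of the first number in each string
end <- as.integer(regexpr("[0-9][^0-9.]", z))
end
# [1] 13 14 15

# Cutting at that position keeps the whole number
res <- substr(z, 1, end)
res
```

Note that the pattern needs at least one character after the number, so a string that ends with the number itself would not match.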

It would be much easier to use sub for this task:

sub('.*(Article\\D*[0-9.]+).*', '\\1', y)
# [1] "Article ABC 3"   "Article ABC 31"  "Article ABC 3.1"
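One thing to keep in mind is that sub returns its input unchanged when the pattern does not match. If some strings might lack an Article phrase, a guarded version along these lines could return NA instead. This is only a sketch; the helper name extract_article and the test vector x are made up for illustration:

```r
# Hypothetical helper: returns NA for strings without a match
extract_article <- function(x, pattern = ".*(Article\\D*[0-9.]+).*") {
  ifelse(grepl(pattern, x), sub(pattern, "\\1", x), NA_character_)
}

x <- c("Hearing #3: The document states in Article ABC 31 Section 9 that...",
       "No matching phrase here")

# Plain sub() leaves the second string untouched
plain <- sub(".*(Article\\D*[0-9.]+).*", "\\1", x)

# The guarded version marks it as missing
guarded <- extract_article(x)
guarded
```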

+2

hwnd 09 Aug 15 at 20:38

akrun · Accepted Answer · 2015-08-09T20:30:52+0000

We can use str_extract from stringr to extract the substring that starts at "Article" and runs up to and including the numeric part.

 library(stringr)
 str_extract(y, 'Article[^0-9]*[0-9.]+')
 #[1] "Article ABC 3"   "Article ABC 31"  "Article ABC 3.1"

Or with sub: we match Article, followed by zero or more non-digit characters ( [^0-9]* ), followed by one or more digits or dots ( [0-9.]+ ). Wrapping that in parentheses makes it a capturing group, which the replacement refers to as \\1.

sub('^.*(Article[^0-9]*[0-9.]+).*', '\\1', y)
#[1] "Article ABC 3"   "Article ABC 31"  "Article ABC 3.1"
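Because [0-9.]+ accepts any run of digits and dots, it would also pick up a trailing period, as in "Article ABC 3. Section". If that matters for your data, a stricter number pattern is one option. The following is a base R sketch using regmatches; the pattern and the example string x are assumptions for illustration, not part of the answers above:

```r
y <- c("Hearing #3: The document states in Article ABC 3 Section 9 line 10 that...",
       "Hearing #3: The document states in Article ABC 31 Section 9 that...",
       "Hearing #3: The document states in Article ABC 3.1 Section 9 that...")

# Digits, optionally followed by a dot and more digits
strict <- "Article[^0-9]*[0-9]+(\\.[0-9]+)?"

# regmatches() returns the matched text; elements without a match are dropped
res <- regmatches(y, regexpr(strict, y))
res

# A sentence-ending dot after the number is no longer captured
x <- "See Article ABC 3. Section 9 follows"
loose_hit  <- regmatches(x, regexpr("Article[^0-9]*[0-9.]+", x))
strict_hit <- regmatches(x, regexpr(strict, x))
```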
