R - Extract text with a regular expression that recognizes numbers with more than one digit

I am trying to extract information from a string using a combination of gregexpr and substr. Each line contains a phrase that starts with a word and ends with a number (sometimes greater than 9).

Here is a list of lines:

y = c("Hearing #3: The document states in Article ABC 3 Section 9 line 10 that...",
  "Hearing #3: The document states in Article ABC 31 Section 9 that...",
  "Hearing #3: The document states in Article ABC 3.1 Section 9 that...")

      

First I removed everything before the word Article, which starts the phrase I'm interested in:

z = substr(y, gregexpr("Article", y)[[1]][1], nchar(y))

> z
[1] "Article ABC 3 Section 9 line 10 that..."   "Article ABC 31 Section 9 that..."  "Article ABC 3.1 Section 9 that..."

      

So far so good, but now I need to capture the first whole number (not just the first digit) after the word Article:

> substr(z, 0, regexpr(pattern='[0-9]', z)[1][1])
[1] "Article ABC 3" "Article ABC 3" "Article ABC 3"

      

This is not quite right, so I tried to work out the positions with another gregexpr call:

gregexpr(pattern='[0-9]', z)

      

I can't figure out how to proceed from here, and I'm not even sure this is the right approach.

Desired result:

[1] "Article ABC 3" "Article ABC 31" "Article ABC 3.1"

      

+3




2 answers


We can use str_extract from stringr to extract the substring that runs from "Article" up to and including the numeric part.

 library(stringr)
 str_extract(y, 'Article[^0-9]*[0-9.]+')
 #[1] "Article ABC 3"   "Article ABC 31"  "Article ABC 3.1"

      



Or with sub: we match Article, followed by zero or more non-digit characters ( [^0-9]* ), followed by one or more digits or dots ( [0-9.]+ ). Wrapping that part of the pattern in parentheses makes it a capturing group, which the replacement refers to as \\1.

sub('^.*(Article[^0-9]*[0-9.]+).*', '\\1', y)
#[1] "Article ABC 3"   "Article ABC 31"  "Article ABC 3.1"
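One difference worth knowing, shown as a minimal sketch: when the pattern does not match, sub returns the input string unchanged, whereas an extraction-style approach can return NA instead. The vector y3 and the variable names below are made up for illustration; the base R combination of regexpr and regmatches is just one way to get the NA behaviour without stringr.

```r
# Hypothetical input: the second element contains no "Article ..." phrase
y3 <- c("Hearing #3: The document states in Article ABC 31 Section 9 that...",
        "No match here")

pat <- 'Article[^0-9]*[0-9.]+'

# sub() leaves a non-matching string untouched
res_sub <- sub(paste0('^.*(', pat, ').*'), '\\1', y3)

# regexpr() returns -1 where there is no match; regmatches() returns
# only the matched pieces, so fill them into an NA vector by position
m <- regexpr(pat, y3)
res_na <- rep(NA_character_, length(y3))
res_na[m > 0] <- regmatches(y3, m)

res_sub
res_na
```

Here res_sub keeps "No match here" as is, while res_na has NA in that position.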

      

+1




You can fix your approach by adding a negated character class after the digit, so that the match lands on the last digit of the number rather than the first:

substr(z, 0, regexpr('[0-9][^0-9.]', z))
# [1] "Article ABC 3"   "Article ABC 31"  "Article ABC 3.1"
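To see why this works, it may help to look at the positions regexpr returns. This is a small sketch using the z vector from the question; end_pos and res are names chosen here for illustration.

```r
z <- c("Article ABC 3 Section 9 line 10 that...",
       "Article ABC 31 Section 9 that...",
       "Article ABC 3.1 Section 9 that...")

# Position of the first digit that is followed by something
# other than a digit or a dot, i.e. the last digit of the number
end_pos <- as.integer(regexpr('[0-9][^0-9.]', z))
end_pos

# regexpr() is vectorised, so each string gets its own end position
res <- substr(z, 1, end_pos)
res
```

Note that this pattern needs one character after the number. If a string ended right at the number, regexpr would return -1 and substr would give an empty string.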

      



It would be much easier to use sub for this task:

sub('.*(Article\\D*[0-9.]+).*', '\\1', y)
# [1] "Article ABC 3"   "Article ABC 31"  "Article ABC 3.1"
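One caveat, as a hedged sketch: the class [0-9.]+ also accepts a trailing period, so a number that ends a sentence would keep the dot. Assuming the numbers are plain integers or decimals, a stricter pattern avoids that. The vector y2 is invented for illustration, and perl = TRUE is used here only to make the matching semantics explicit.

```r
# Hypothetical input: the first number is followed by a sentence-ending period
y2 <- c("See Article ABC 3. Then continue.",
        "See Article ABC 3.1 Section 9")

# The permissive class keeps the trailing period in the first element
loose <- sub('.*(Article[^0-9]*[0-9.]+).*', '\\1', y2, perl = TRUE)

# Stricter: digits, optionally followed by a dot and more digits
strict <- sub('.*(Article[^0-9]*[0-9]+(\\.[0-9]+)?).*', '\\1', y2, perl = TRUE)

loose
strict
```

With this input, loose keeps "Article ABC 3." for the first element, while strict returns "Article ABC 3"; both return "Article ABC 3.1" for the second.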

      

+2

