R - Extract text up to a number of more than one digit using a regular expression
I am trying to extract information from a string using a combination of gregexpr and substr. Each line contains a phrase that starts with a word and ends with a number (sometimes greater than 9).
Here is a vector of example strings:
y = c("Hearing #3: The document states in Article ABC 3 Section 9 line 10 that...",
"Hearing #3: The document states in Article ABC 31 Section 9 that...",
"Hearing #3: The document states in Article ABC 3.1 Section 9 that...")
First I strip everything before the word Article, which starts the phrase I'm interested in:
z = substr(y, gregexpr("Article", y)[[1]][1], nchar(y))
> z
[1] "Article ABC 3 Section 9 line 10 that..." "Article ABC 31 Section 9 that..." "Article ABC 3.1 Section 9 that..."
So far so good, but now I need to capture the first whole number (not just its first digit) after the word Article:
> substr(z, 0, regexpr(pattern='[0-9]', z)[1][1])
[1] "Article ABC 3" "Article ABC 3" "Article ABC 3"
This is not quite right, so I tried to work out the positions with another call to gregexpr:
gregexpr(pattern='[0-9]', z)
I can't figure out how to proceed from here, and I'm not even sure this is the right approach.
Desired result:
[1] "Article ABC 3" "Article ABC 31" "Article ABC 3.1"
We can use str_extract from stringr to extract the substring that runs from "Article" up to and including the numeric part.
library(stringr)
str_extract(y, 'Article[^0-9]*[0-9.]+')
#[1] "Article ABC 3" "Article ABC 31" "Article ABC 3.1"
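If stringr is not available, roughly the same extraction can be sketched in base R with regexpr and regmatches. This is a swapped-in alternative rather than part of the answer above; the variable names m and res are chosen here purely for illustration.

```r
# Base R sketch of the same extraction, without stringr.
y <- c("Hearing #3: The document states in Article ABC 3 Section 9 line 10 that...",
       "Hearing #3: The document states in Article ABC 31 Section 9 that...",
       "Hearing #3: The document states in Article ABC 3.1 Section 9 that...")

# regexpr() locates the first match of the pattern in each string;
# regmatches() then pulls out the matched text.
m <- regexpr("Article[^0-9]*[0-9.]+", y)
res <- regmatches(y, m)
print(res)
#[1] "Article ABC 3"   "Article ABC 31"  "Article ABC 3.1"
```

One difference to keep in mind: regmatches drops strings that have no match, whereas str_extract returns NA for them, so the result may be shorter than the input.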
Or with sub: we match Article, followed by zero or more non-digit characters ([^0-9]*), followed by one or more digits or dots ([0-9.]+). Wrapping this part of the pattern in parentheses makes it a capturing group, which the replacement refers to as \\1.
sub('^.*(Article[^0-9]*[0-9.]+).*', '\\1', y)
#[1] "Article ABC 3" "Article ABC 31" "Article ABC 3.1"
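Two edge cases of this pattern are worth checking. The inputs in x below are hypothetical and do not come from the question, and the stricter pattern is one possible variant, not the answer's own. Because [0-9.]+ accepts any run of digits and dots, it also captures a full stop that directly follows the number. And when nothing matches, sub returns the input unchanged.

```r
# Hypothetical inputs chosen to exercise the edge cases.
x <- c("see Article ABC 3. Section 9",    # number followed by a full stop
       "see Article ABC 3.1 Section 9",   # decimal-style number
       "no article here")                 # pattern does not match at all

# Pattern from the answer: the class [0-9.]+ also swallows the trailing dot.
loose <- sub('^.*(Article[^0-9]*[0-9.]+).*', '\\1', x)

# Stricter variant: digits, optionally followed by a dot and more digits.
strict <- sub('^.*(Article[^0-9]*[0-9]+(\\.[0-9]+)?).*', '\\1', x)

print(loose)
#[1] "Article ABC 3."  "Article ABC 3.1" "no article here"
print(strict)
#[1] "Article ABC 3"   "Article ABC 3.1" "no article here"
```

If unmatched strings should not pass through silently, checking them first with grepl on the same pattern is one option.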
You can fix your approach by adding a negated character class after the digit, so that the match starts at the last digit of the number rather than the first.
substr(z, 0, regexpr('[0-9][^0-9.]', z))
# [1] "Article ABC 3" "Article ABC 31" "Article ABC 3.1"
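To see why this works, it helps to look at the positions regexpr returns. The sketch below also covers a case that does not occur in the question's data: a string that ends directly after the number. The |$ alternative is a suggested variant for that case, and the names end_pos and tail_case are illustrative.

```r
z <- c("Article ABC 3 Section 9 line 10 that...",
       "Article ABC 31 Section 9 that...",
       "Article ABC 3.1 Section 9 that...")

# Position of the first digit that is followed by something other than
# a digit or a dot, i.e. the last digit of the first number.
end_pos <- as.integer(regexpr('[0-9][^0-9.]', z))
print(end_pos)
# [1] 13 14 15

# Hypothetical edge case: nothing follows the number, so there is no match (-1)
# and substr() returns an empty string.
tail_case <- "Article ABC 31"
no_match <- as.integer(regexpr('[0-9][^0-9.]', tail_case))
empty <- substr(tail_case, 0, no_match)

# Variant that also accepts end-of-string after the digit.
fixed <- substr(tail_case, 0, regexpr('[0-9]([^0-9.]|$)', tail_case))
print(fixed)
# [1] "Article ABC 31"
```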
It would be much easier to use sub for this task:
sub('.*(Article\\D*[0-9.]+).*', '\\1', y)
# [1] "Article ABC 3" "Article ABC 31" "Article ABC 3.1"