Match word starting with a known pattern

Question

I am struggling to match a whole word that starts with a known pattern and ends with either a space or the end of the line. I think I have a pattern for the word:

pat <- "https?:\\/\\/.*"

require(stringr)
str_extract("http://t.co/som7hing", pat)
# [1] "http://t.co/som7hing" # So far so good...

I don't understand how to define word boundaries. Four situations are possible:

My url is at the beginning of the line
My url is at the end of the line
My url precedes another token
My url is accompanied by a number of other tokens

In all four cases, my pattern should match only the URL, from start to finish.

str_extract("something something http://t.co/som7hing", pat)
# [1] "http://t.co/som7hing"

So far so good ...

str_extract("http://t.co/som7hing ", pat)
# [1] "http://t.co/som7hing "

First problem: the trailing space is also matched.

str_extract("http://t.co/som7hing #hash name", pat)
# [1] "http://t.co/som7hing #hash name"

Second problem: all the trailing words are matched as well.

+3

regex r stringr

CptNemo 08 Aug 14 at 1:54

3 answers

The * quantifier is greedy, which causes both problems (the trailing space and the trailing words). Hence .* will match as much as it can while still allowing the rest of the regex to match.
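A minimal sketch of that greedy behaviour, assuming the stringr package is installed. The input is the string from the question. The lazy variant (.*?) is shown only for contrast and is not something proposed in this thread.

```r
library(stringr)

s <- "http://t.co/som7hing #hash name"

# Greedy: .* runs to the end of the string, so the trailing words are included.
greedy <- str_extract(s, "https?://.*")

# Lazy: .*? matches as little as possible, so nothing after "//" is kept.
lazy <- str_extract(s, "https?://.*?")

# \S+ takes non-whitespace characters only, so it stops at the first space.
nonspace <- str_extract(s, "https?://\\S+")

print(c(greedy = greedy, lazy = lazy, nonspace = nonspace))
```

Making the quantifier lazy does not help here because nothing after it forces the match to extend; restricting the character class is what stops the match at the space.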

I recommend using the following regex:

re <- '\\bhttps?://\\S+'

We use \b, which is a word boundary. A word boundary does not consume any characters; it asserts that there is a word character on one side and not on the other. \S matches any non-whitespace character.

You can see how this works on the examples you posted.

x  <- c('http://t.co/som7hing', 
        'http://t.co/som7hing ',
        'something something http://t.co/som7hing', 
        'http://t.co/som7hing #hash name',
        'foohttp://www.example.com',
        'barhttp://www.foo.com    ')

re <- '\\bhttps?://\\S+'

for (i in x) print(str_extract(i, re))
# [1] "http://t.co/som7hing"
# [1] "http://t.co/som7hing"
# [1] "http://t.co/som7hing"
# [1] "http://t.co/som7hing"
# [1] NA
# [1] NA

The last two were not matched because of the word boundary. If you want to match the prefix anywhere in a string, remove the \b from the regex.
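As a sketch of that last point, the two inputs that returned NA can be run with and without the boundary. This assumes stringr is installed; the variable names are only illustrative. str_extract is vectorised over its input, so no loop is needed.

```r
library(stringr)

x <- c("foohttp://www.example.com",
       "barhttp://www.foo.com    ")

# With \b there is no boundary between "foo"/"bar" and "http", so nothing matches.
with_boundary <- str_extract(x, "\\bhttps?://\\S+")

# Without \b the prefix is found wherever it occurs in the string.
without_boundary <- str_extract(x, "https?://\\S+")

print(with_boundary)
print(without_boundary)
```

Whether to keep \b depends on whether a URL glued to a preceding word should count as a match for your data.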

+4

hwnd 08 Aug 14 at 2:05

I think this does the trick. It matches up to the space and stops. I used backslashes to escape the colon and the forward slashes in the address. Instead of matching any character any number of times, I matched any character that is not a space, [!\S]

https?\:\/\/[!\S]*

I tested this at http://regexpal.com/
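The pattern above is written as a bare regex, so using it from R means doubling every backslash inside the string literal. A minimal sketch with base R follows; choosing perl = TRUE (the PCRE engine) is an assumption made for this example, not something stated in the answer. Note that inside a character class ! is a literal character rather than a negation, so [!\S] matches the same characters as \S on its own.

```r
x <- c("http://t.co/som7hing ",
       "http://t.co/som7hing #hash name")

# Same pattern as above, with each backslash doubled for the R string literal.
pat <- "https?\\:\\/\\/[!\\S]*"

# regexpr() finds the first match in each element; regmatches() extracts it.
m <- regmatches(x, regexpr(pat, x, perl = TRUE))

print(m)
```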

0

RowanC 08 Aug 14 at 2:08

waternova · Accepted Answer · 2014-08-08T02:02:54+0000

The pattern you are looking for is

pat <- "https?:\\/\\/\\S*"

The . in a regex matches any character, including spaces. You want to match only non-whitespace characters, which is done with \S.
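A short sketch of this pattern on a string with more than one URL, assuming stringr is installed. The tweet text and the second URL are made up for illustration. It also shows one consequence of using \S* rather than \S+: a bare scheme with nothing after it still counts as a match.

```r
library(stringr)

pat <- "https?:\\/\\/\\S*"

# Hypothetical input containing two URLs.
tweet <- "see http://t.co/som7hing and https://t.co/0ther #hash"

# str_extract_all() returns a list with one character vector per input string.
urls <- str_extract_all(tweet, pat)[[1]]

# \S* allows zero characters after "//", so "http://" alone is matched.
bare <- str_extract("broken link: http:// oops", pat)

# \S+ requires at least one character after "//", so the same input gives NA.
strict <- str_extract("broken link: http:// oops", "https?://\\S+")

print(urls)
print(c(bare = bare, strict = strict))
```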
