Match word starting with a known pattern

I am struggling to match a whole word that starts with a known pattern and ends at either a space or the end of the line. I think I have a pattern for the word:

pat <- "https?:\\/\\/.*"

require(stringr)
str_extract("http://t.co/som7hing", pat)
# [1] "http://t.co/som7hing" # So far so good...

      

I don't understand how to define word boundaries. Four situations are possible:

  • My URL is at the beginning of the line
  • My URL is at the end of the line
  • My URL precedes another token
  • My URL is accompanied by a number of other tokens

In all four cases, my pattern should match only the URL, from start to finish.

str_extract("something something http://t.co/som7hing", pat)
# [1] "http://t.co/som7hing" 

      

So far so good ...

str_extract("http://t.co/som7hing ", pat)
# [1] "http://t.co/som7hing " 

      

First problem: the trailing space is also matched.

str_extract("http://t.co/som7hing #hash name", pat)
# [1] "http://t.co/som7hing #hash name" 

      

Second problem: all the words after the URL are matched as well.



3 answers


The pattern you are looking for is

pat <- "https?:\\/\\/\\S*"

      



.

in regex will match any character, including spaces. You want to match any non-whitespace character that is done with \S

.
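As a quick check of the difference, here is a minimal sketch that runs both patterns over the inputs from the question. The variable names (x, pat_any, pat_non_space) are only illustrative.

```r
library(stringr)

# The four situations from the question
x <- c("http://t.co/som7hing",
       "http://t.co/som7hing ",
       "something something http://t.co/som7hing",
       "http://t.co/som7hing #hash name")

pat_any       <- "https?://.*"    # . also matches spaces, so the match runs to the end of the string
pat_non_space <- "https?://\\S*"  # \S stops at the first whitespace character

res_any       <- str_extract(x, pat_any)
res_non_space <- str_extract(x, pat_non_space)

print(res_any)
print(res_non_space)
```

With \\S* every element should come back as just the URL, while .* keeps the trailing space and the trailing words.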



* is a greedy operator, and that is what causes both problems (the trailing space and the trailing words): .* matches as much as it can while still allowing the rest of the regex to match.
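To see the greediness in isolation, here is a small sketch that compares .* with a lazy .*? followed by a lookahead for whitespace or end of string. The lazy variant is not the pattern recommended in this answer; it is only an illustration, and the names greedy and lazy are made up for the example.

```r
library(stringr)

x <- c("http://t.co/som7hing ",
       "http://t.co/som7hing #hash name")

greedy <- "https?://.*"            # takes everything up to the end of the string
lazy   <- "https?://.*?(?=\\s|$)"  # takes as little as possible before whitespace or the end

res_greedy <- str_extract(x, greedy)
res_lazy   <- str_extract(x, lazy)

print(res_greedy)
print(res_lazy)
```

The lazy version stops at the first whitespace, but a \S-based pattern says the same thing more directly.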

I recommend using the following regex:

re <- '\\bhttps?://\\S+'

      

We use \b, which is a word boundary. A word boundary does not consume any characters; it asserts that there is a word character on one side and not on the other. \S matches any non-whitespace character.



Here is how it behaves on your posted examples, plus two cases where the URL is glued to a preceding word.

x  <- c('http://t.co/som7hing', 
        'http://t.co/som7hing ',
        'something something http://t.co/som7hing', 
        'http://t.co/som7hing #hash name',
        'foohttp://www.example.com',
        'barhttp://www.foo.com    ')

re <- '\\bhttps?://\\S+'

for (i in x) print(str_extract(i, re))
# [1] "http://t.co/som7hing"
# [1] "http://t.co/som7hing"
# [1] "http://t.co/som7hing"
# [1] "http://t.co/som7hing"
# [1] NA
# [1] NA

      

The last two were not matched because of the word boundary: there is no boundary between the preceding letters and http. If you want to match the prefix anywhere in a string, remove the \b from the regex.
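A minimal sketch of that change, using the last two strings from the example. The names with_boundary and without_boundary are only illustrative.

```r
library(stringr)

x <- c("foohttp://www.example.com",
       "barhttp://www.foo.com    ")

with_boundary    <- "\\bhttps?://\\S+"  # requires a word boundary right before "http"
without_boundary <- "https?://\\S+"     # matches "http" wherever it appears

res_with    <- str_extract(x, with_boundary)
res_without <- str_extract(x, without_boundary)

print(res_with)
print(res_without)
```

Without the boundary, the URL part is extracted even when it is attached to a preceding word, so keep \b only if such cases should be rejected.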



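I think this does the trick. It matches up to the space and stops. I used backslashes to escape the colon and the forward slashes in the address. Instead of matching any character any number of times, I matched any character that is not whitespace, [!\S]. Note that ! does not negate anything inside a character class; it is a literal exclamation mark, so [!\S] matches the same characters as \S alone.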

https?\:\/\/[!\S]*

      

I tested this at http://regexpal.com/
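To use this pattern from R, every backslash has to be doubled inside the string literal. The sketch below assumes stringr, as in the question, and compares the pattern with a plain \S version; the names pat_class and pat_plain and the extra test string containing ! are only illustrative.

```r
library(stringr)

x <- c("http://t.co/som7hing ",
       "http://t.co/som7hing #hash name",
       "see http://t.co/som7hing! now")

pat_class <- "https?\\:\\/\\/[!\\S]*"  # the pattern above, written as an R string
pat_plain <- "https?://\\S*"           # same idea without the redundant escapes and class

res_class <- str_extract(x, pat_class)
res_plain <- str_extract(x, pat_plain)

print(res_class)
print(identical(res_class, res_plain))
```

Both patterns should give the same matches, which suggests the escapes on : and / and the ! in the class are not needed.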
