Match a word starting with a known pattern
I am struggling to match a whole word that starts with a known pattern and ends with either a space or the end of the line. I think I have a pattern for a word:
pat <- "https?:\\/\\/.*"
require(stringr)
str_extract("http://t.co/som7hing", pat)
# [1] "http://t.co/som7hing" # So far so good...
I don't understand how to define word boundaries. Four situations are possible:
- My url is at the beginning of the line
- My url is at the end of the line
- My url precedes another token
- My url is accompanied by a number of other tokens
In all four cases, my pattern should match only the URL, from start to finish.
str_extract("something something http://t.co/som7hing", pat)
# [1] "http://t.co/som7hing"
So far so good ...
str_extract("http://t.co/som7hing ", pat)
# [1] "http://t.co/som7hing "
First problem: the trailing space is also matched.
str_extract("http://t.co/som7hing #hash name", pat)
# [1] "http://t.co/som7hing #hash name"
Second problem: all the trailing words are matched as well.
* is a greedy operator, which causes both problems (the trailing space and the trailing words). .* matches as much as it can while still allowing the rest of the regex to match, and since nothing follows it in your pattern, it runs to the end of the line.
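To make the greedy behaviour concrete, here is a minimal sketch comparing .* with its lazy form .*?. The lookahead (?=\s|$) and the variable names are my own additions for illustration, not part of the question.

```r
library(stringr)

x <- "http://t.co/som7hing #hash name"

# Greedy: .* runs to the end of the string, spaces included
greedy <- str_extract(x, "https?://.*")

# Lazy: .*? stops as soon as the lookahead sees whitespace or the end of the string
lazy <- str_extract(x, "https?://.*?(?=\\s|$)")

print(greedy)
# [1] "http://t.co/som7hing #hash name"
print(lazy)
# [1] "http://t.co/som7hing"
```

The lazy version works, but a character class that simply excludes whitespace is usually easier to read.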
I recommend using the following regex:
re <- '\\bhttps?://\\S+'
We use \b, which is a word boundary. A word boundary does not consume any characters; it asserts that there is a word character on one side and not on the other. \S matches any non-whitespace character, so \S+ stops at the first space or at the end of the string.
You can see how this works with the examples you posted:
x <- c('http://t.co/som7hing',
'http://t.co/som7hing ',
'something something http://t.co/som7hing',
'http://t.co/som7hing #hash name',
'foohttp://www.example.com',
'barhttp://www.foo.com ')
re <- '\\bhttps?://\\S+'
for (i in x) print(str_extract(i, re))
# [1] "http://t.co/som7hing"
# [1] "http://t.co/som7hing"
# [1] "http://t.co/som7hing"
# [1] "http://t.co/som7hing"
# [1] NA
# [1] NA
The last two were not matched because of the word boundary: in foohttp and barhttp, the h is preceded by another word character, so \b fails there. If you want to match the prefix anywhere in a string, remove \b from the regex.
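As a quick sketch of that last point, this is the same regex without the word boundary, run against the two strings that previously returned NA. The names re_anywhere and res are only illustrative.

```r
library(stringr)

x <- c('foohttp://www.example.com',
       'barhttp://www.foo.com ')

# Same regex without \\b, so the prefix may start in the middle of a word
re_anywhere <- 'https?://\\S+'

res <- str_extract(x, re_anywhere)
print(res)
# [1] "http://www.example.com" "http://www.foo.com"
```

Note that str_extract is vectorised, so the for loop is not strictly needed.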
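I think this does the trick. It matches up to the space and stops. I used backslashes to escape the colon and the forward slashes in the address. Instead of matching any character any number of times, I matched any character that is not whitespace, \S. (Inside a character class, ! is a literal character rather than a negation, so [!\S] is equivalent to plain \S.)
https?\:\/\/\S*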
I tested this at http://regexpal.com/
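regexpal tests the pattern as a bare regex. In an R string literal every backslash has to be doubled, so a rough R equivalent might look like the sketch below. It assumes stringr as in the question, writes the character class as plain \S, and uses illustrative variable names (pat2, res2).

```r
library(stringr)

# Each regex backslash is doubled inside the R string literal
pat2 <- "https?\\:\\/\\/\\S*"

x <- c("http://t.co/som7hing ",
       "http://t.co/som7hing #hash name")

res2 <- str_extract(x, pat2)
print(res2)
# [1] "http://t.co/som7hing" "http://t.co/som7hing"
```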