Regex to remove hash tags but keep # in URL

I want to remove hash tags from tweets using regex in R (I would like to keep this in base R, but other solutions are welcome for the completeness of the answer for future searchers).

I have a regex that I thought would remove the hash tags, but I found a corner case when a # is present in the URL, as shown in the MWE below. How can I remove hash tags in the text but keep the # in the URL?

Here is the MWE and the code I tried:

text.var <- c("Slides from great talk: @ramnath_vaidya: Interactive slides from Interactive Visualization", 
    "presentation #user2014. http://ramnathv.github.io/user2014-rcharts/#1")

gsub("#\\w+", "", text.var)
gsub("#\\S+", "", text.var)

      

Desired output:

[1] "Slides from great talk: @ramnath_vaidya: Interactive slides from Interactive Visualization"
[2] "presentation . http://ramnathv.github.io/user2014-rcharts/#1"

      

Note: R regex is similar to other regex flavors, but has its own quirks. This question is about R regex specifically, not regex in general.



1 answer


Well, for this specific case, you can use a negative lookbehind assertion.

gsub('(?<!/)#\\w+', '', text.var, perl=T)
# [1] "Slides from great talk: @ramnath_vaidya: Interactive slides from Interactive Visualization"
# [2] "presentation . http://ramnathv.github.io/user2014-rcharts/#1" 
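
The "specific case" caveat matters because the lookbehind only protects a # that comes directly after a slash. A small sketch with made-up inputs (the example.com URLs and the name lookbehind_result are assumptions, not from the question) shows a fragment that follows a page name being stripped along with the real hash tag:

```r
# Hypothetical inputs: the same hash tag, with two different URL shapes.
x <- c("talk #rstats. http://example.com/#top",
       "talk #rstats. http://example.com/page#top")

# (?<!/) only rejects a '#' whose preceding character is '/'.
lookbehind_result <- gsub("(?<!/)#\\w+", "", x, perl = TRUE)
lookbehind_result
# [1] "talk . http://example.com/#top"
# [2] "talk . http://example.com/page"
```

In the second string the fragment #top is lost, so this pattern fits only URLs shaped like the one in the question.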

      

Or you can use the dark magic that PCRE offers:



gsub('http://\\S+(*SKIP)(*F)|#\\w+', '', text.var, perl=T)
# [1] "Slides from great talk: @ramnath_vaidya: Interactive slides from Interactive Visualization"
# [2] "presentation . http://ramnathv.github.io/user2014-rcharts/#1"    

      

The idea here is to skip any URL that starts with http://, which you can customize if you need to.

On the left side of the alternation operator, we match the URL and then make the subpattern fail with (*F). The (*SKIP) backtracking control verb forces the regex engine not to retry the substring it has already consumed and to advance to the next position in the string after it. The right side of the alternation operator matches what we want...
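
As one possible customization, the prefix can be loosened to https?:// so that both schemes are skipped. The inputs and the names url_safe and skip_result below are invented for illustration, and the sketch assumes a URL never contains whitespace:

```r
# Hypothetical inputs: one https URL, one http URL with a fragment after a page name.
x <- c("slides #user2014. https://example.org/deck/#1",
       "notes #rstats: http://example.org/page#intro")

# Left branch: consume a whole URL, then discard it with (*SKIP)(*F).
# Right branch: the hash tags that should be removed.
url_safe <- "https?://\\S+(*SKIP)(*F)|#\\w+"

skip_result <- gsub(url_safe, "", x, perl = TRUE)
skip_result
# [1] "slides . https://example.org/deck/#1"
# [2] "notes : http://example.org/page#intro"
```

Because the whole URL is consumed before the match is rejected, the # inside it is never tested against the right-hand branch, wherever it sits in the URL.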
