Regular Expression Reference: Identifying Websites in Text

I am trying to write a function that removes websites from a piece of text. I have:

removeWebsites<- function(text){
  text = gsub("(http://|https://|www.)[[:alnum:]~!#$%&+-=?,:/;._]*",'',text)
  return(text)
}

      

This handles a large set of cases, but not a very common one: bare domains of the form xyz.com

I don't want to append .com to the end of the above expression, as that would limit the scope of this regex. Instead, I tried writing a separate regular expression:

gsub("[[:alnum:]~!#$%&+-=?,:/;._]*.com",'',testset[10])

      

This worked, but it also changed email IDs of the form abc@xyz.com to abc@. I don't want this, so I changed it to:

gsub("*((^@)[[:alnum:]~!#$%&+-=?,:/;._]*).com",'\\1',testset[10])

      

This left email IDs alone, but it stopped recognizing websites of the form xyz.com

I understand that I need some kind of set difference here, but I have not been able to implement it (mainly because I have not fully understood it). Any idea how I should go about solving this problem?

Edit: I've tried negative lookahead:

gsub("[[:alnum:]~!#$%&+-=?,:/;._](?!@)[^(?!.*@)]*.com",'',testset[10])

      

I got an "invalid regular expression" error. I believe a little help with fixing it might make this work ...
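A hedged note on the error, not a definitive diagnosis: R's default regex engine does not support lookaround syntax such as (?!...), which is a likely cause of the message. gsub() and grepl() only use the Perl-compatible engine when perl = TRUE is passed. Separately, lookaround syntax loses its meaning inside [...], so [^(?!.*@)] is just a negated character class. The input strings below are made up for illustration.

```r
# Made-up inputs for illustration.
hosts <- c("xyz.com", "xyz@abc.com")

# A real negative lookahead sits outside any [...] and needs perl = TRUE:
# "xyz" must not be followed directly by "@".
not_before_at <- grepl("xyz(?!@)", hosts, perl = TRUE)
print(not_before_at)

# Inside [...] the same characters are plain literals: this class means
# "any character except ( ? ! . * @ )", so only those characters survive.
leftover <- gsub("[^(?!.*@)]", "", "user@host (test)", perl = TRUE)
print(leftover)
```

The first call reports a match only for the string where xyz is not immediately followed by @; the second shows that the bracketed "lookahead" never looks ahead at all.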

+3




2 answers


I can't believe this. There is actually a simple solution.

gsub(" ([[:alnum:]~!#$%&+-=?,:/;._]+)((\\.com)|(\\.net)|(\\.org)|(\\.info))",' ',text)

      

How it works:

  • Start with a space.
  • Match one or more allowed characters, none of which is @.
  • End with .com, .net, .org or .info.

Please test it! I'm sure there will be cases that break this as well.
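One case that breaks the space-anchored idea is a domain at the very start of the text, where there is no leading space. The sketch below is one possible tightening, not a complete URL matcher: it escapes the dots so they match literally, accepts either start-of-string or whitespace before the domain, and requires a word boundary after the TLD. The function name remove_bare_domains, the tlds argument and the sample sentence are hypothetical and were chosen for illustration.

```r
# Hypothetical helper: remove bare domains such as xyz.com while leaving
# email addresses untouched. The TLD list is an assumption; extend as needed.
remove_bare_domains <- function(text, tlds = c("com", "net", "org", "info")) {
  tld_alt <- paste(tlds, collapse = "|")
  # (^|\s)          start of string or one whitespace character (kept via \\1)
  # [[:alnum:]._-]+ host characters; '@' is deliberately not allowed
  # \.(com|net|...) a literal dot followed by one of the listed TLDs
  # \b              the TLD must end at a word boundary
  pattern <- paste0("(^|\\s)[[:alnum:]._-]+\\.(", tld_alt, ")\\b")
  gsub(pattern, "\\1", text, perl = TRUE)
}

cleaned <- remove_bare_domains("xyz.com is down, mail abc@xyz.com or see foo.org")
print(cleaned)
```

Because a match has to begin at the start of the string or right after whitespace, the xyz.com inside abc@xyz.com is never a candidate. Multi-part TLDs such as .com.au would only be partly removed by this sketch.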

+1




Your lookarounds look a little odd to me: you can't use a lookaround inside a character class, and why are you looking ahead? A lookbehind is more appropriate. I think the following expression should work, although I haven't tested it:

gsub("*((?<!@)[[:alnum:]~!#$%&+-=?,:/;._]*).com",'\\1',testset[10])

      



Also note that a lookbehind must be of fixed length, so quantifiers are not allowed inside it.
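As a hedged sketch of the lookbehind idea: in R, lookbehind is only available with perl = TRUE, and a lookbehind of (?<!@) inspects just the single character before the match start, so a match could still begin one character further into the host name (at yz.com in abc@xyz.com). The variant below keeps the lookbehind at a fixed length of one character but also forbids host characters there, so a match cannot start in the middle of a name. The function name strip_domains_keep_emails and the sample text are hypothetical, and only .com is handled.

```r
# Hypothetical helper: strip bare .com domains, keep email addresses.
strip_domains_keep_emails <- function(text) {
  # (?<![[:alnum:]@._-])  one-character negative lookbehind: the match may not
  #                       start right after '@' or inside a host name
  # [[:alnum:]-]+         first label of the host
  # (\.[[:alnum:]-]+)*    optional further labels, e.g. "www.example"
  # \.com\b               literal ".com" ending at a word boundary
  pattern <- "(?<![[:alnum:]@._-])[[:alnum:]-]+(\\.[[:alnum:]-]+)*\\.com\\b"
  gsub(pattern, "", text, perl = TRUE)
}

stripped <- strip_domains_keep_emails("mail abc@xyz.com or visit xyz.com today")
print(stripped)
```

The bare xyz.com is removed (leaving a double space behind), while the one that follows @ is kept.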

0

