Regular Expression Reference: Identifying Websites in Text
I am trying to write a function that removes websites from a piece of text. I have:
removeWebsites <- function(text) {
  # Escape the dot in "www." and keep "-" last in the bracket expression so it is a literal.
  text <- gsub("(http://|https://|www\\.)[[:alnum:]~!#$%&+=?,:/;._-]*", "", text)
  return(text)
}
This handles a large set of cases, but not a popular one: websites of the form xyz.com. I don't want to add .com to the end of the above expression, as that limits the scope of the regex. However, I tried writing some more regular expressions, like:
gsub("[[:alnum:]~!#$%&+-=?,:/;._]*.com",'',testset[10])
This worked, but it also changed email ids of the form abc@xyz.com to abc@. I don't want this, so I changed it to:
gsub("*((^@)[[:alnum:]~!#$%&+-=?,:/;._]*).com",'\\1',testset[10])
This left email ids alone, but it stopped recognizing websites of the form xyz.com.
I understand that I need some kind of set difference here, but I have not been able to implement it (mainly because I have not been able to fully understand it). Any ideas on how I can solve this problem?
Edit: I've tried negative lookaheads:
gsub("[[:alnum:]~!#$%&+-=?,:/;._](?!@)[^(?!.*@)]*.com",'',testset[10])
I got an "invalid regular expression" error. I believe a little help fixing it might make this work ...
I can't believe this. There is actually a simple solution.
gsub(" [[:alnum:]~!#$%&+=?,:/;._-]+\\.(com|net|org|info)", " ", text)
How it works:
- Start with a space.
- Match one or more of the allowed characters, none of which is @.
- End with .com, .net, .org, or .info.
Please test it! I'm sure there will be cases that break this as well.
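To make the three steps above concrete, here is a small sketch that runs a variant of this pattern (dots escaped, hyphen moved to the end of the bracket expression) on a few sample strings. The strings and the names samples, pattern and cleaned are made up for illustration and are not from the original post.

```r
# Hypothetical sample inputs, not from the original post.
samples <- c(
  "visit xyz.com today",      # bare domain preceded by a space
  "mail abc@xyz.com please",  # email address
  "xyz.com is down"           # bare domain at the very start of the string
)

# Space, then one or more allowed characters (no "@"), then a literal dot
# and one of the listed top-level domains.
pattern <- " [[:alnum:]~!#$%&+=?,:/;._-]+\\.(com|net|org|info)"

cleaned <- gsub(pattern, " ", samples)
print(cleaned)
```

The first string loses its domain and the email address is left untouched, but the third string is also left untouched because nothing precedes xyz.com there. That is one of the cases that breaks the leading-space approach.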
Your lookarounds look a little funny to me: you can't put a lookaround inside a character class, and why are you using a lookahead? A lookbehind is more appropriate. I think the following expression should work, although I haven't tested it:
gsub("((?<!@)[[:alnum:]~!#$%&+=?,:/;._-]*)\\.com", '\\1', testset[10], perl = TRUE)
Also note that a lookbehind must be of fixed length, so quantifiers are not allowed inside it.
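As a rough sketch of the lookbehind idea (my own variant, not the expression above): a bare (?<!@) only inspects the one character before the position where the match starts, so the engine can simply start one character later, inside the email's domain. Widening the lookbehind to reject any preceding domain character as well avoids that, and it is still fixed length because it tests a single character. The sample strings and the names bare_domain and stripped are assumptions for illustration. Lookarounds need perl = TRUE in gsub.

```r
# Hypothetical sample inputs, not from the original post.
samples <- c(
  "visit xyz.com today",
  "mail abc@xyz.com please",
  "xyz.com is down"
)

# The lookbehind rejects any start position preceded by "@" or by a
# character that could itself be part of a domain, so a match cannot begin
# in the middle of an email address. \\b keeps ".com" from matching inside
# a longer word.
bare_domain <- "(?<![[:alnum:]@._-])[[:alnum:]._-]+\\.(com|net|org|info)\\b"

stripped <- gsub(bare_domain, "", samples, perl = TRUE)
print(stripped)
```

On these inputs the bare domains are removed wherever they appear, including at the start of the string, and the email address is kept whole. The list of top-level domains is still hard-coded, so this has the same scope limitation as the previous answer.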