Regex to extract URL from text (with / without protocol and www or subdomains)
I'm looking to extract a URL from the text inside an element. I'm not very good with regexes, but this is what I have so far:
var regexp = /((https?:\/\/)?[\w-]+(\.[\w-]+)+\.?(:\d+)?(\/\S*)?)/i;
See how my regex works: http://jsfiddle.net/h70mr1zt/5/
This is the result I want:
1. stackoverflow => not found
2. stackoverflow.com => found => stackoverflow.com
3. www.stackoverflow.com => found => www.stackoverflow.com
4. api.stackoverflow.com => found => api.stackoverflow.com
5. http://www.stackoverflow.com => found => http://www.stackoverflow.com
6. foo://www.stackoverflow.com => found => www.stackoverflow.com
7. someone@stackoverflow.com => not found
8. .com => not found
As you can see in my fiddle, I get pretty much everything right except number 7, where the regex recognizes the domain part of the email address as a URL.
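To reproduce the problem outside the fiddle, here is a minimal sketch that runs the regex from the question against a few of the sample strings. The `samples` and `results` names are only for illustration and are not taken from the fiddle.

```javascript
// The regex from the question, unchanged.
var regexp = /((https?:\/\/)?[\w-]+(\.[\w-]+)+\.?(:\d+)?(\/\S*)?)/i;

// A few of the sample strings from the list above (cases 1, 2, 7 and 8).
var samples = ['stackoverflow', 'stackoverflow.com', 'someone@stackoverflow.com', '.com'];

// For each sample, keep the matched text, or null when nothing matches.
var results = samples.map(function (text) {
  var m = regexp.exec(text);
  return m ? m[0] : null;
});

console.log(results);
// [ null, 'stackoverflow.com', 'stackoverflow.com', null ]
```

The third entry is the unwanted one: the regex is not anchored, so the engine skips past `someone@` and matches the domain that follows it.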
You can use a regex like this:
^(http:\/\/)?(www\.)?\w+\.(com|net|org)$
See the example:
http://regex101.com/r/uQ9aL4/1
How does it work?
^
anchors the regular expression at the beginning of the string.
(http:\/\/)?
matches 0 or 1 occurrences of http://
(www\.)?
matches 0 or 1 occurrences of www.
\w+
matches one or more word characters (letters, digits, or underscore).
\.(com|net|org)
matches .com, .net, or .org
$
anchors the regex at the end of the string.
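As a quick check of the explanation above, this sketch runs that regex against the eight sample strings from the question. The `pattern`, `samples`, and `matched` names are made up for the example. Note that the regex is anchored at both ends, so it tests whether a whole string is a URL rather than extracting one from surrounding text.

```javascript
// The regex from this answer, unchanged.
var pattern = /^(http:\/\/)?(www\.)?\w+\.(com|net|org)$/;

// The eight sample strings from the question.
var samples = [
  'stackoverflow',
  'stackoverflow.com',
  'www.stackoverflow.com',
  'api.stackoverflow.com',
  'http://www.stackoverflow.com',
  'foo://www.stackoverflow.com',
  'someone@stackoverflow.com',
  '.com'
];

// Keep only the samples the regex accepts.
var matched = samples.filter(function (text) {
  return pattern.test(text);
});

console.log(matched);
// [ 'stackoverflow.com', 'www.stackoverflow.com', 'http://www.stackoverflow.com' ]
```

So the email and the bare `.com` are rejected as wanted, but `api.stackoverflow.com` (a subdomain other than `www`) and `foo://www.stackoverflow.com` are rejected too, and `https://` or other top-level domains would need to be added to the pattern by hand.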
You can do it with this regex:
/^(?:[a-z]*?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/i
see jsfiddle
EDIT
It is very difficult to match arbitrary (even bogus) protocols and at the same time exclude the domain part of, for example, an email address without using lookaround assertions. JavaScript supports lookahead, but lookbehind was not available in JavaScript regexes before ES2018.
I would go for something like this:
$('li').each(function(){
    var text = $(this).text(),
        // URL pattern: optional http(s):// prefix, host, TLD, optional path
        regexp = /(^https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/i,
        // Email pattern, used only to reject text that contains an address
        regexpMail = /\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b/i,
        url = regexp.exec(text);
    if(url && !regexpMail.test(text)){
        $(this).append(' => <b>found</b> => <span>'+url[0]+'</span>');
    }else{
        $(this).append(' => <b class="nf">not found</b>');
    }
});
Breakdown:
Matches http/s     matches the rest
          v              v
regexp = /(^https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/i;
Because the regex above would also extract the domain part of an email address, a second check is needed to exclude emails. That is done with this regex:
regexpMail = /\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b/i;
Combining the two checks gives the desired result. Maybe someone else can merge this into a single regexp, but I'm not that good.
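For what it's worth, on engines that implement ES2018 lookbehind the two checks can be folded into one pattern. The following is only a sketch under that assumption, not a tested drop-in replacement: `urlPattern` and `extractUrl` are hypothetical names, the host part is borrowed from the regex in the question, and the top-level domain is not validated.

```javascript
// Assumption: the JavaScript engine supports ES2018 lookbehind, (?<!...).
//   (?<![\w@.-])    the match may not start right after a word char, "@", "." or "-"
//   (?![\w.%+-]*@)  the match may not start at the local part of an email address
//   (?:https?:\/\/)? optional http:// or https:// prefix
//   [\w-]+(?:\.[\w-]+)+  host with at least one dot
//   (?::\d+)?(?:\/\S*)?  optional port and path
var urlPattern = /(?<![\w@.-])(?![\w.%+-]*@)(?:https?:\/\/)?[\w-]+(?:\.[\w-]+)+(?::\d+)?(?:\/\S*)?/i;

// Return the first URL-like substring of text, or null if there is none.
function extractUrl(text) {
  var m = urlPattern.exec(text);
  return m ? m[0] : null;
}

console.log(extractUrl('http://www.stackoverflow.com')); // http://www.stackoverflow.com
console.log(extractUrl('foo://www.stackoverflow.com'));  // www.stackoverflow.com
console.log(extractUrl('someone@stackoverflow.com'));    // null
```

The lookbehind stops the match from starting in the middle of a word or directly after `@`, and the lookahead stops it from starting on the part of an address before the `@`, which together cover case 7 without a separate email regex.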