Regex to extract URL from text (with / without protocol and www or subdomains)

I'm looking to extract URLs from the text inside an element. I'm not very good with regular expressions, but this is what I have so far:

var regexp = /((https?:\/\/)?[\w-]+(\.[\w-]+)+\.?(:\d+)?(\/\S*)?)/i;

      

See how my regex works: http://jsfiddle.net/h70mr1zt/5/

This is the result I want:

 1. stackoverflow => not found
 2. stackoverflow.com => found => stackoverflow.com
 3. www.stackoverflow.com => found => www.stackoverflow.com
 4. api.stackoverflow.com => found => api.stackoverflow.com
 5. http://www.stackoverflow.com => found => http://www.stackoverflow.com
 6. foo://www.stackoverflow.com => found => www.stackoverflow.com
 7. someone@stackoverflow.com => not found
 8. .com => not found

      

As you can see in my fiddle, I get pretty much everything right except number 7, where the regex matches the domain part of the email address.
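The behaviour described above can be reproduced outside the fiddle. This is a minimal sketch, assuming plain strings instead of list elements; the helper name `firstUrl` is hypothetical and not part of the fiddle.

```javascript
// The regex from the question, unchanged.
var regexp = /((https?:\/\/)?[\w-]+(\.[\w-]+)+\.?(:\d+)?(\/\S*)?)/i;

// Hypothetical helper: return the first match in the text, or null.
function firstUrl(text) {
    var m = regexp.exec(text);
    return m ? m[0] : null;
}

// The eight inputs listed in the question.
var samples = [
    'stackoverflow',
    'stackoverflow.com',
    'www.stackoverflow.com',
    'api.stackoverflow.com',
    'http://www.stackoverflow.com',
    'foo://www.stackoverflow.com',
    'someone@stackoverflow.com',
    '.com'
];

samples.forEach(function (s) {
    console.log(s + ' => ' + (firstUrl(s) || 'not found'));
});
```

Input 7 comes back as stackoverflow.com because the pattern is not anchored: once matching fails at the start of the string, the engine retries further along and succeeds right after the @.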



2 answers


You can use a regex like:

^(http:\/\/)?(www\.)?\w+\.(com|net|org)$

See the example:

http://regex101.com/r/uQ9aL4/1

How does it work?

^

anchors the match at the beginning of the line.

(http:\/\/)?

matches 0 or 1 occurrences of http://

(www\.)?

matches 0 or 1 occurrences of www.

\w+

matches one or more word characters (letters, digits or underscore)

\.(com|net|org)

matches .com, .net or .org

$

anchors the match at the end of the line.
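A quick way to see what this pattern accepts is to run it over the inputs from the question. This is only a sketch, assuming each input is tested as a whole string (the pattern is anchored with ^ and $); the names `pattern`, `inputs` and `results` are chosen for illustration.

```javascript
// The pattern from this answer, unchanged.
var pattern = /^(http:\/\/)?(www\.)?\w+\.(com|net|org)$/;

// The eight inputs listed in the question.
var inputs = [
    'stackoverflow',
    'stackoverflow.com',
    'www.stackoverflow.com',
    'api.stackoverflow.com',
    'http://www.stackoverflow.com',
    'foo://www.stackoverflow.com',
    'someone@stackoverflow.com',
    '.com'
];

// true where the whole string matches the pattern.
var results = inputs.map(function (s) {
    return pattern.test(s);
});

inputs.forEach(function (s, i) {
    console.log(s + ' => ' + (results[i] ? 'found' : 'not found'));
});
```

Under these assumptions only inputs 2, 3 and 5 are accepted. The email address is rejected, but so are api.stackoverflow.com (www. is the only prefix allowed) and foo://www.stackoverflow.com, and because of the anchors the pattern validates a whole string rather than extracting a URL from longer text.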



You can do it with this regex:

/^(?:[a-z]*?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/i

      

see jsfiddle

EDIT

It is very difficult to match any (even bogus) protocol and at the same time exclude domain names that belong to, for example, an email address without lookbehind assertions, which JavaScript did not support when this was written (lookahead was available; lookbehind was only added in ES2018).

I would go for something like this:



$('li').each(function(){
    // Declare everything with var so nothing leaks into the global scope.
    var text = $(this).text(),
        regexp = /(^https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/i,
        regexpMail = /\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b/i,
        url = regexp.exec(text);
    // Report a URL only when the text does not also contain an email address.
    if(url && !regexpMail.test(text)){
        $(this).append(' => <b>found</b> => <span>'+url[0]+'</span>');
    }else{
        $(this).append(' => <b class="nf">not found</b>');
    }
});
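The same two-step check can be tried without jQuery or a page. This is a minimal sketch, assuming plain strings as input; the function name `findUrl` is hypothetical, and the two patterns are copied from the snippet above.

```javascript
var regexp = /(^https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/i,
    regexpMail = /\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b/i;

// Hypothetical helper: the matched URL, or null when nothing matches
// or when the text contains an email address.
function findUrl(text) {
    var url = regexp.exec(text);
    return url && !regexpMail.test(text) ? url[0] : null;
}

console.log(findUrl('foo://www.stackoverflow.com'));
console.log(findUrl('someone@stackoverflow.com'));
```

One consequence of this design is that an email address anywhere in the text suppresses the result, even if the same text also contains a genuine URL.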

      

Breakdown:

          Matches http/s        matches the rest
                v                   v
regexp = /(^https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?$/i;

      

Because the regex above would still extract the domain part of an email address, a second check is needed to exclude emails. That is done with this regex:

 regexpMail = /\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b/i;

      

Doing all of this leads to the desired result. Maybe someone else can combine this into a single regexp, but I'm not that good.
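Engines that implement ES2018 lookbehind can fold the email check into the pattern itself. The following is a minimal sketch, assuming such an engine (it is a syntax error in older ones); it reuses the regex from the question with a negative lookbehind in front, and the function name `extractUrl` is hypothetical.

```javascript
// The question's regex, prefixed with a negative lookbehind: a match may not
// start right after "@", ".", "-" or a word character, so it cannot begin
// in the middle of an email address.
var regexp = /(?<![@\w.-])((https?:\/\/)?[\w-]+(\.[\w-]+)+\.?(:\d+)?(\/\S*)?)/i;

// Hypothetical helper: return the first match in the text, or null.
function extractUrl(text) {
    var m = regexp.exec(text);
    return m ? m[0] : null;
}

console.log(extractUrl('foo://www.stackoverflow.com'));
console.log(extractUrl('someone@stackoverflow.com'));
```

This has only been reasoned through for the eight inputs in the question, so treat it as a starting point rather than a general URL matcher.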
