I need a regex to match common URLs

I need to check for common URLs using any protocol (http, https, shttp, ftp, svn, mysql, and others I don't know about).

My first pass:

\w+://(\w+\.)+[\w+](/[\w]+)(\?[-A-Z0-9+&@#/%=~_|!:,.;]*)?

      

(PCRE and .NET, so nothing too fancy)



3 answers


According to RFC 2396 (Appendix B):



^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
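To show how the groups in that expression map to URI components, here is a minimal Python sketch. The group numbers (2, 4, 5, 7, 9) follow the RFC's appendix; the `split_uri` helper name is made up for illustration. Note that every part of the pattern is optional, so it splits any string rather than validating it.

```python
import re

# The expression from RFC 2396, Appendix B, copied unchanged.
URI_RE = re.compile(r'^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?')

def split_uri(uri):
    """Split a URI reference into its five components (hypothetical helper)."""
    # Every part of the pattern is optional, so match() never returns None.
    m = URI_RE.match(uri)
    return {
        'scheme': m.group(2),
        'authority': m.group(4),
        'path': m.group(5),
        'query': m.group(7),
        'fragment': m.group(9),
    }

print(split_uri('http://www.ics.uci.edu/pub/ietf/uri/#Related'))
# Plain text is accepted too: it simply ends up in 'path'.
print(split_uri('not a url'))
```

Because plain text such as `not a url` still matches (as a bare path), this expression is better suited to taking a known URI apart than to finding URLs in free text.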

      



Adding that regex as a wiki answer:

[\w+-]+://([a-zA-Z0-9]+\.)+[a-zA-Z0-9]+(/[%\w]+)(\?[-A-Z0-9+&@#/%=~_|!:,.;]*)?

      

Option 2 (re: CMS's answer):



^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?

      

But this is too permissive for any practical use, so I stripped it down to make it more restrictive and to tell the individual parts apart.

proto      ://  name      : pass      @  server    :port      /path     ? args
^([^:/?#]+)://(([^/?#@:]+(:[^/?#@:]+)?@)?[^/?#@:]+(:[0-9]+)?)(/[^?#]*)(\?([^#]*))?
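A quick way to see what the stricter expression accepts is to run it over a few inputs. This is only a sketch in Python; the `parse_url` helper and the sample URLs are made up for illustration. One thing it makes visible: the path group is not optional, so a URL with no `/` after the server does not match.

```python
import re

# The stricter expression from above, copied unchanged.
STRICT_URL_RE = re.compile(
    r'^([^:/?#]+)://(([^/?#@:]+(:[^/?#@:]+)?@)?[^/?#@:]+(:[0-9]+)?)(/[^?#]*)(\?([^#]*))?'
)

def parse_url(url):
    """Return the captured parts, or None if the URL is rejected (hypothetical helper)."""
    m = STRICT_URL_RE.match(url)
    if m is None:
        return None
    return {
        'proto': m.group(1),
        'authority': m.group(2),  # name:pass@server:port as one string
        'userinfo': m.group(3),   # includes the trailing '@' when present
        'port': m.group(5),       # includes the leading ':' when present
        'path': m.group(6),
        'args': m.group(8),
    }

print(parse_url('mysql://user:secret@db.example.com:3306/shop?charset=utf8'))
print(parse_url('http://example.com/'))
print(parse_url('http://example.com'))       # None: no path after the server
print(parse_url('mailto:user@example.com'))  # None: no '://'
```

If URLs without a trailing slash should be accepted, the path group would need to be made optional; that is a change to the expression, not something it does as written.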

      



I came at this from a slightly different direction. I wanted to emulate gchat's ability to match something.co.uk and linkify it. So I went with a regex that looks for a . that is not followed by another period and has no whitespace on either side, and then grabs everything around it until it hits whitespace. It does match the period at the end of a URI, but I'll strip that off later. As such, it might be an option if you would rather get false positives than miss any potential URLs.

import re

url_re = re.compile(r"""
           [^\s]             # not whitespace
           [a-zA-Z0-9:/\-]+  # the protocol and domain name
           \.(?!\.)          # A literal '.' not followed by another
           [\w\-\./\?=&%~#]+ # country and path components
           [^\s]             # not whitespace""", re.VERBOSE) 

url_re.findall('http://thereisnothing.com/a/path adn some text www.google.com/?=query#%20 https://somewhere.com other-countries.co.nz. ellipsis... is also a great place to buy. But try text-hello.com ftp://something.com')

['http://thereisnothing.com/a/path',
 'www.google.com/?=query#%20',
 'https://somewhere.com',
 'other-countries.co.nz.',
 'text-hello.com',
 'ftp://something.com']
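The trailing period (as in 'other-countries.co.nz.') can be removed in a second pass. Here is one possible sketch of that cleanup step; the `find_urls` helper and the set of punctuation to strip are my own assumptions, not part of the approach above.

```python
import re

url_re = re.compile(r"""
           [^\s]             # not whitespace
           [a-zA-Z0-9:/\-]+  # the protocol and domain name
           \.(?!\.)          # A literal '.' not followed by another
           [\w\-\./\?=&%~#]+ # country and path components
           [^\s]             # not whitespace""", re.VERBOSE)

def find_urls(text, trailing='.,;:!'):
    """Find URL candidates and strip trailing sentence punctuation (hypothetical helper)."""
    return [candidate.rstrip(trailing) for candidate in url_re.findall(text)]

print(find_urls('Try other-countries.co.nz. Or www.google.com/?=query#%20, maybe.'))
```

This strips punctuation only from the end of each candidate, so dots inside the domain or path are left alone. A URL that legitimately ends in one of those characters would be shortened as well, which seems an acceptable trade-off when the goal is linkifying text.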

      
