RegEx pattern for partial URL (including two values in the path)

I have a URL pattern that must contain either APPLES or ORANGES in the path, and no other value. It can also have query parameters. I've tried several RegEx patterns, but I can't find one that enforces a strict match.

URL examples

Good

http://www.website.com/en/pages/APPLES
http://www.website.com/en/pages/APPLES?k=v
http://www.website.com/en/pages/ORANGES?k=v&k2=v2
http://www.website.com/en/pages/ORANGES

      

Bad

http://www.website.com/en/pages/APPLES???k=v
http://www.website.com/en/pages/APPLES?k=v=v
http://www.website.com/en/pages/APPLESORANGES
http://www.website.com/en/pages/1APPLES
http://www.website.com/en/APPLES

      

RegEx attempts (well, at least the best ones)

(http://*.*.website*.*.com/*.*/pages(/APPLES)|(/ORANGES)[\?]*.*)
(http://*.*.website*.*.com/*.*/pages(/APPLES|/ORANGES)[\?]*.*)

      

In case you're wondering, I intentionally want to allow any subdomain, any suffix after "website" (for different environments), and any path between .com/ and /pages, hence the use of . in several places.

What would be the best way to achieve this?

** Edit: definitive answer **

My final answer combines the answers from math.coffee and fardjad.

^https?://.*\.website\b.*\.com/.*/pages/(APPLES\b|ORANGES\b)((\?\w+=\w+)(&?\w+=\w+)*)?$

      

The only limitation I found is that it does not allow the other valid characters (. ~ _ - % +) in query string parameter keys and values (see http://en.wikipedia.org/wiki/Query_string#Structure). This is not a problem for me, as I am matching against the string returned by the .NET Uri class, so I know the URL is well formed overall.
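A quick way to sanity-check the final pattern against the Good and Bad examples is a small script. This is a minimal sketch in Python's re module (the language is an assumption; the question itself targets .NET), and the names FINAL, GOOD, BAD and matches are made up for illustration.

```python
import re

# The final pattern from the edit above, unchanged (split over two lines).
FINAL = re.compile(
    r"^https?://.*\.website\b.*\.com/.*/pages/"
    r"(APPLES\b|ORANGES\b)((\?\w+=\w+)(&?\w+=\w+)*)?$"
)

BASE = "http://www.website.com/en/pages/"

# URLs copied from the "Good" and "Bad" lists in the question.
GOOD = [
    BASE + "APPLES",
    BASE + "APPLES?k=v",
    BASE + "ORANGES?k=v&k2=v2",
    BASE + "ORANGES",
]
BAD = [
    BASE + "APPLES???k=v",
    BASE + "APPLES?k=v=v",
    BASE + "APPLESORANGES",
    BASE + "1APPLES",
    "http://www.website.com/en/APPLES",
]


def matches(url: str) -> bool:
    """Return True when the whole URL satisfies the final pattern."""
    return FINAL.match(url) is not None


if __name__ == "__main__":
    for url in GOOD + BAD:
        print("match   " if matches(url) else "no match", url)
```

Note that the & between pairs is optional in this pattern (&?), so a query such as ?a=bc=d is also accepted (it is read as a=b followed by c=d). If that matters, dropping the ? after & should tighten it.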



2 answers


I think your *.* should be .*:

http://.*\.website\b.*\.com/.*/pages/PAGE[12](\?[^=]+=[^&=]+(&[^=]+=[^=&]+)*)?

      

Explanation:

http://      # just http://
.*\.         # anything, as long as it is followed by '.'
website\b    # website, the whole word
.*\.com      # anything between website and .com
/.*/pages/   # anything between the .com and the pages
PAGE[12]     # PAGE1 or PAGE2
(\?          # opening bracket and '?' (query string)
[^=]+        # the key: it can't include '='
=            # =
[^=&]+       # the value: it can't include '=' or '&'
(&           # opening bracket and '&' for next part of query string
[^=]+=[^=&]+ # key=value pair, same regex as before
)*           # 0 or more of these (the &key=value)
)?           # the entire query string is optional.
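The commented breakdown above maps almost directly onto a verbose-mode regex. Here is a minimal sketch in Python (an assumed language, not the asker's .NET), with PAGE[12] swapped for the question's APPLES and ORANGES; that substitution and the names ANSWER and is_match are assumptions for illustration.

```python
import re

# Same structure as the explained regex, with PAGE[12] replaced by the
# question's APPLES/ORANGES (that substitution is an assumption).
ANSWER = re.compile(
    r"""
    http://                  # just http://
    .*\.                     # anything, as long as it is followed by '.'
    website\b                # website, as a whole word
    .*\.com                  # anything between website and .com
    /.*/pages/               # anything between .com and pages
    (?:APPLES|ORANGES)       # one of the two allowed values
    (?:\?                    # '?' opens the optional query string
        [^=]+=[^=&]+         # first key=value pair
        (?:&[^=]+=[^=&]+)*   # zero or more &key=value pairs
    )?                       # the entire query string is optional
    """,
    re.VERBOSE,
)


def is_match(url: str) -> bool:
    """fullmatch anchors the pattern at both ends of the URL."""
    return ANSWER.fullmatch(url) is not None


if __name__ == "__main__":
    base = "http://www.website.com/en/pages/"
    for tail in ["APPLES", "ORANGES?k=v&k2=v2", "APPLES?k=v=v", "APPLES???k=v"]:
        print(is_match(base + tail), base + tail)
```

Under these assumptions the Good URLs match and APPLES?k=v=v is rejected, but APPLES???k=v is accepted, because the key class [^=]+ permits '?'. Using [^=&?]+ for the key would be one way to reject it.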

      



NOTE: It is common to run into problems when parsing query strings with a regex and trying to guarantee that they are syntactically correct.

For example, in the regex above, I said that the value in &key=value cannot contain an ampersand. But it could contain an escaped entity, for example &amp;, which is legal.

You will always run into similar problems when trying to parse the syntax with a regular expression. This is the risk you will have to take.

Alternatively, I'm sure there is a C# module for parsing URLs (as there is in many other languages), and it will take care of all those special cases for you.
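To illustrate the parser-based route, here is a minimal sketch using Python's standard urllib.parse rather than a C# module. The helper names (ALLOWED_PAGES, query_is_valid, is_valid_page_url) and the host rule are assumptions made for the example, not something stated in the question.

```python
from urllib.parse import urlsplit

ALLOWED_PAGES = {"APPLES", "ORANGES"}


def query_is_valid(query: str) -> bool:
    """Accept an empty query or '&'-separated key=value pairs."""
    if not query:
        return True
    for pair in query.split("&"):
        key, sep, value = pair.partition("=")
        # Exactly one '=', a non-empty key and value, and no stray '?'.
        if not (sep and key and value) or "=" in value or "?" in pair:
            return False
    return True


def is_valid_page_url(url: str) -> bool:
    """Check scheme, host, path and query separately instead of one regex."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        return False

    labels = (parts.hostname or "").split(".")
    # Assumed host rule: ends in "com" and some earlier label starts
    # with "website" (e.g. www.website.com).
    if labels[-1] != "com" or not any(
        label.startswith("website") for label in labels[:-1]
    ):
        return False

    segments = parts.path.split("/")
    # Expect /<anything>/pages/<PAGE>, i.e. at least ["", x, "pages", PAGE].
    if (
        len(segments) < 4
        or segments[-2] != "pages"
        or segments[-1] not in ALLOWED_PAGES
    ):
        return False

    return query_is_valid(parts.query)


if __name__ == "__main__":
    print(is_valid_page_url("http://www.website.com/en/pages/APPLES?k=v"))
    print(is_valid_page_url("http://www.website.com/en/pages/APPLESORANGES"))
```

Splitting the URL first means the page name becomes a plain set lookup and each query pair can be validated on its own, which avoids most of the backtracking surprises of a single large pattern.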



Try this:



^https?://(www\.)?\w+[^/]+(/\w+(?=/)){2}/(PAGE1|PAGE2)((\?\w+=\w+)(&?\w+=\w+)*)?$

      
