Remove query string from URL in HTML using regex

Given an HTML document, what is the most correct and concise regex pattern for removing the query string from every URL in the document?

0




3 answers


You cannot parse HTML with regex. If you know the format of the page in advance, for example:

  • links are always in the form <a href="url without extra characters">, or
  • all links are absolute and no other non-link strings starting with http: exist


then you can just about get away with it, but for general [X]HTML a regex-based parser is not suitable.

Depending on which language you are using, you need to find either an HTML parser library (like Python's BeautifulSoup) or an HTML tidier combined with a standard XML parser, then scan the document for <a> elements (and maybe others, e.g. <img>, if you are interested in them), then split the attribute value at the "?".
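As a rough illustration of the parser route, here is a minimal Python sketch. It uses the standard library's html.parser rather than BeautifulSoup so that it runs without extra installs; the names URL_ATTRS, strip_query and QueryStripper are made up for this example, and the choice of href and src as the URL-bearing attributes is an assumption. It only collects the cleaned URLs; writing them back into the document is left to whichever parser library you actually use.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit, urlunsplit

# Assumption: these are the attributes that hold URLs in your documents.
URL_ATTRS = {"href", "src"}


def strip_query(url):
    """Return url without its query string, keeping any #fragment."""
    parts = urlsplit(url)
    return urlunsplit(parts._replace(query=""))


class QueryStripper(HTMLParser):
    """Collects (tag, attribute, cleaned URL) for each URL-bearing attribute."""

    def __init__(self):
        super().__init__()
        self.cleaned = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in URL_ATTRS and value:
                self.cleaned.append((tag, name, strip_query(value)))


html_doc = (
    '<p><a href="http://example.com/page?id=1#top">link</a> '
    '<img src="/pic.png?v=3"></p>'
)
parser = QueryStripper()
parser.feed(html_doc)
parser.close()
print(parser.cleaned)
```

Splitting with urlsplit instead of cutting the string at "?" is a design choice: it keeps a trailing #fragment, which a plain cut would throw away.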

+5




Re: Bobing's comment: HTMLAgilityPack is a good HTML parser for .NET; it is more forgiving of malformed markup than other parsers.



Using it, you can find all the <a> tags, read each href, and remove everything from the '?' onward, including the '?' itself.
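The string step itself is language-independent. A minimal sketch in Python (not .NET, and not HTMLAgilityPack's API; cut_at_question_mark is a hypothetical helper name) might look like this:

```python
def cut_at_question_mark(href):
    """Drop the first '?' and everything after it."""
    head, _sep, _tail = href.partition("?")
    return head


print(cut_at_question_mark("/search?q=regex"))  # -> /search
print(cut_at_question_mark("/about"))           # -> /about
```

Note that a cut like this also discards any #fragment that follows the query string, which may or may not be what you want.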

+2




Find this:

/href="([^\?"]*?)\?[^\"]*"/

      

Replace with:

href="\1"

      

You may need to make sure it doesn't break <link> tags, since the pattern matches an href attribute on any tag.
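For anyone who wants to try the find-and-replace pair, here is a sketch of it in Python's re module. The language is an assumption (the answer does not name one), and HREF_QUERY and strip_href_queries are hypothetical names. The backslashes before ? and " inside the character classes are dropped because they are not needed there; the pattern is otherwise the same.

```python
import re

# Same pattern as above, as a Python raw string.
HREF_QUERY = re.compile(r'href="([^?"]*?)\?[^"]*"')


def strip_href_queries(html):
    """Remove the query string from every double-quoted href attribute."""
    return HREF_QUERY.sub(r'href="\1"', html)


sample = '<a href="http://example.com/a?x=1">a</a> <a href="/b">b</a>'
print(strip_href_queries(sample))
# -> <a href="http://example.com/a">a</a> <a href="/b">b</a>
```

As written, this only touches attributes spelled exactly href="..." with double quotes and no spaces around the equals sign; single-quoted or unquoted values pass through unchanged.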

0








