Remove query string from URL in HTML using regex
3 answers
You cannot parse HTML with regex. If you know the format of the page in advance, for example that
- links are always in the form <a href="url without extra characters">, or
- all links are absolute and no other non-link strings starting with http: exist,
then you can get away with it, but for general (X)HTML a regex-based parser is not suitable.
Depending on which language you are using, find either an HTML parser library (such as Python's BeautifulSoup) or an HTML tidier combined with a standard XML parser. Then scan the document for <a> elements (and perhaps others, such as <img>, if you care about them), and split each URL attribute value on '?'.
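The approach above can be sketched roughly as follows. This is a minimal illustration, not a definitive implementation: it uses Python's standard-library html.parser rather than BeautifulSoup so that it runs without extra packages, and the names URL_ATTRS, QueryStripper and stripped_urls are made up for this example. It only collects the stripped URLs; it does not write the modified HTML back out.

```python
from html.parser import HTMLParser

# Which attribute carries the URL for each tag of interest.
# (Illustrative choice for this sketch; extend it if other tags matter.)
URL_ATTRS = {"a": "href", "img": "src"}


class QueryStripper(HTMLParser):
    """Collect URL attribute values with everything from '?' onward removed."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag and attribute names in lowercase.
        wanted = URL_ATTRS.get(tag)
        if wanted is None:
            return
        for name, value in attrs:
            if name == wanted and value:
                # Split on the first '?' and keep only the left-hand part.
                self.urls.append(value.split("?", 1)[0])


def stripped_urls(html):
    """Return the query-free URLs found in <a href> and <img src>."""
    parser = QueryStripper()
    parser.feed(html)
    parser.close()
    return parser.urls
```

Calling stripped_urls() on a page's source returns the link targets in document order. Elements without the relevant attribute, such as a named anchor with no href, are skipped.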
+5
Re: Bobing comment, HTMLAgilityPack is a good HTML parser for .NET; it is more forgiving of incorrect markup than other parsers.
Using it, you can find all the A tags, then read each HREF and remove the '?' and everything after it.
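One detail of the removal step: cutting at the '?' also throws away any #fragment, because the fragment comes after the query string in a URL. If that matters, the URL can be split into its components instead. The sketch below shows only this step, in Python with the standard urllib.parse module. It is not HTMLAgilityPack code, and drop_query with its keep_fragment parameter is a hypothetical helper named for this example.

```python
from urllib.parse import urlsplit, urlunsplit


def drop_query(url, keep_fragment=True):
    """Return url without its query string.

    With keep_fragment=False the result matches plain truncation at '?'
    for URLs that contain a query string.
    """
    parts = urlsplit(url)
    fragment = parts.fragment if keep_fragment else ""
    # Rebuild the URL with an empty query component.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", fragment))
```

For example, drop_query("http://example.com/a/b?x=1#top") keeps the "#top" part, while passing keep_fragment=False discards it along with the query.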
+2