LXML etree.tostring escaping URLs in link href attributes
When using LXML to parse an HTML document and then calling etree.tostring(), I notice that the ampersands in links are converted to HTML-escaped entities.
This breaks the link for obvious reasons. Here's a simple self-contained example of the problem:
>>> from lxml import etree
>>> parser = etree.HTMLParser()
>>> tree = etree.fromstring("""<a href="https://www.example.com/?param1=value1&param2=value2">link</a>""", parser)
>>> etree.tostring(tree)
'<html><body><a href="https://www.example.com/?param1=value1&amp;param2=value2">link</a></body></html>'
I would like the result to be:
<html><body><a href="https://www.example.com/?param1=value1&param2=value2">link</a></body></html>
Escaping the ampersand is the standard behavior: &amp; inside an attribute value is valid HTML, and browsers decode it back to & when following the link. If you really need to avoid the conversion for some reason, you can do the following:
Step 1. Find a unique string that does not occur in your HTML source. You can simply use ANDamp; as reserved_amp if you are sure that the string "ANDamp;" will not appear in your HTML source. Otherwise, you can generate a random alphanumeric string and check that it does not occur in your HTML source:
>>> import random
>>> import string
>>> length = 15  # increase the length if collisions still occur
>>> reserved_amp = "&"  # start value that forces at least one pass through the loop
>>> html = """<a href="https://www.example.com/?param1=value1&param2=value2">link</a>"""
>>> while reserved_amp == "&" or reserved_amp in html:  # substring check against the source
... reserved_amp = ''.join(random.choice(string.ascii_lowercase + string.digits) for _ in range(length)) + "amp;"  # the "amp;" suffix makes the placeholder easy to spot
...
>>> print(reserved_amp)
2eya6oywxg5z7q5amp;
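Step 1 can also be wrapped in a small helper. This is only a sketch for Python 3: make_placeholder is a hypothetical name chosen for illustration, not part of lxml, and it assumes that a substring test against the HTML source is the collision check you want.

```python
import random
import string


def make_placeholder(html, length=15):
    """Return a random placeholder ending in 'amp;' that does not occur in html.

    Hypothetical helper for illustration; not part of lxml.
    """
    alphabet = string.ascii_lowercase + string.digits
    while True:
        candidate = "".join(random.choice(alphabet) for _ in range(length)) + "amp;"
        # Substring test: retry if the candidate already appears in the source.
        if candidate not in html:
            return candidate


html = '<a href="https://www.example.com/?param1=value1&param2=value2">link</a>'
reserved_amp = make_placeholder(html)
print(reserved_amp)  # random on every run
```

Because the random part is drawn only from lowercase letters and digits, the placeholder itself never contains an ampersand.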
Step 2. Replace all occurrences of & with the placeholder before parsing:
>>> html = html.replace("&", reserved_amp)
>>> html
'<a href="https://www.example.com/?param1=value12eya6oywxg5z7q5amp;param2=value2">link</a>'
>>>
Step 3. Replace it back only where you need the original form:
>>> from lxml import etree
>>> parser = etree.HTMLParser()
>>> tree = etree.fromstring(html, parser)
>>> etree.tostring(tree, encoding="unicode").replace(reserved_amp, "&")  # encoding="unicode" returns text, so this also works on Python 3
'<html><body><a href="https://www.example.com/?param1=value1&param2=value2">link</a></body></html>'
>>>
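Steps 2 and 3 can be combined into one function. The following is a minimal sketch for Python 3, assuming lxml is installed; tostring_keep_ampersands is a hypothetical name, and the fixed placeholder "ANDamp;" is only safe if that string never occurs in your input.

```python
from lxml import etree


def tostring_keep_ampersands(html, reserved_amp):
    """Parse html with lxml and serialize it with raw '&' characters restored.

    Hypothetical helper for illustration; reserved_amp must not occur in html.
    """
    # Step 2: hide every "&" behind the placeholder before parsing.
    protected = html.replace("&", reserved_amp)
    tree = etree.fromstring(protected, etree.HTMLParser())
    # encoding="unicode" makes tostring return str instead of bytes.
    serialized = etree.tostring(tree, encoding="unicode")
    # Step 3: put the raw ampersands back into the serialized markup.
    return serialized.replace(reserved_amp, "&")


html = '<a href="https://www.example.com/?param1=value1&param2=value2">link</a>'
result = tostring_keep_ampersands(html, "ANDamp;")
print(result)
```

Note that the returned string contains a bare & inside an attribute value, so a strict XML parser would reject it; use it only where the consumer expects the unescaped form.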
[UPDATE]:
The semicolon at the end of reserved_amp acts as a safeguard.
Suppose the generated reserved_amp were:
ampXampXampXampX + amp;
And the HTML contains:
yyYampX&
It will be encoded in this form:
yyYampXampXampXampXampXamp;
However, decoding cannot produce an incorrect result such as yy&YampX (the original is yyYampX&), thanks to the semicolon as the last character: it is not an ASCII letter or digit, so it can never appear in the random part of reserved_amp generated from string.ascii_lowercase + string.digits above.
So make sure the random part does not contain a semicolon (or any other non-alphanumeric character), and append the semicolon at the end (it MUST be the last character). Then there is no need to worry about yyYampX& falling into the yy&YampX trap.
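The effect of the trailing semicolon can be checked with plain string operations. This sketch reuses the example values from the update above; the variant without the semicolon is added here purely for contrast and is not something the answer recommends.

```python
original = "yyYampX&"

# Placeholder with the trailing semicolon, as in the example above.
reserved_amp = "ampXampXampXampX" + "amp;"
encoded = original.replace("&", reserved_amp)
decoded = encoded.replace(reserved_amp, "&")
print(encoded)              # yyYampXampXampXampXampXamp;
print(decoded == original)  # True

# Contrast case (assumption for illustration): no terminating semicolon.
unsafe_amp = "ampXampXampXampX"
unsafe_encoded = original.replace("&", unsafe_amp)
unsafe_decoded = unsafe_encoded.replace(unsafe_amp, "&")
print(unsafe_decoded)       # yyY&ampX, which is not the original
```

Without the semicolon, the placeholder matches too early, at the text that happened to precede the ampersand. With it, the only position where the full placeholder matches is the one that was actually inserted.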
According to the lxml documentation for tostring(), method='xml' can be passed to avoid HTML-specific serialization behavior:
etree.tostring(tree, method='xml')
In my projects I use:
from lxml import html
html.tostring(node, with_tail=False, method='xml', encoding='unicode')