lxml etree.tostring escaping URLs in link href attributes

When using lxml to parse an HTML document and then calling etree.tostring(), I notice that the ampersands in links are converted to HTML character entities.

This breaks the link for obvious reasons. Here is a simple self-contained example of the problem:

>>> from lxml import etree
>>> parser = etree.HTMLParser()
>>> tree = etree.fromstring("""<a href="https://www.example.com/?param1=value1&param2=value2">link</a>""", parser)
>>> etree.tostring(tree)
'<html><body><a href="https://www.example.com/?param1=value1&amp;param2=value2">link</a></body></html>'

      

I would like the result to be:

<html><body><a href="https://www.example.com/?param1=value1&param2=value2">link</a></body></html>

      



2 answers


Although escaping the ampersand should be the standard way, if you really need to avoid the conversion for some reason, you can do the following:

Step 1. Find a unique string that does not occur in your HTML source. You can simply use ANDamp; as reserved_amp if you are sure that the string "ANDamp;" will never appear in your HTML source. Otherwise, generate a random alphanumeric string and check that it does not occur in your HTML source:

>>> import random
>>> import string
>>> length = 15  # increase the length if collisions keep occurring
>>> reserved_amp = "&amp;"
>>> html = """<a href="https://www.example.com/?param1=value1&param2=value2">link</a>"""
>>> while reserved_amp == "&amp;" or reserved_amp in html:
...     reserved_amp = ''.join(random.choice(string.ascii_lowercase + string.digits) for _ in range(length)) + "amp;"  # the "amp;" suffix makes the token easy to spot
... 
>>> print(reserved_amp)
2eya6oywxg5z7q5amp;

      

Step 2. Replace every occurrence of & with reserved_amp before parsing:

>>> html = html.replace("&", reserved_amp)
>>> html
'<a href="https://www.example.com/?param1=value12eya6oywxg5z7q5amp;param2=value2">link</a>'
>>> 

      

Step 3. Replace it back only when you need the original form:

>>> from lxml import etree
>>> parser = etree.HTMLParser()
>>> tree = etree.fromstring(html, parser)
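>>> etree.tostring(tree, encoding="unicode").replace(reserved_amp, "&")  # encoding="unicode" returns text rather than bytes, so str.replace also works on Python 3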
'<html><body><a href="https://www.example.com/?param1=value1&param2=value2">link</a></body></html>'
>>> 
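
The three steps above can be folded into one helper. This is only a minimal sketch: the names make_placeholder and roundtrip_keeping_ampersands are hypothetical, and the parse/serialize step is passed in as a callable (transform) so the helper itself does not depend on lxml.

```python
import secrets
import string


def make_placeholder(html, length=15):
    """Return a random token that does not occur in `html` and ends with 'amp;'."""
    alphabet = string.ascii_lowercase + string.digits
    while True:
        token = "".join(secrets.choice(alphabet) for _ in range(length)) + "amp;"
        if token not in html:
            return token


def roundtrip_keeping_ampersands(html, transform):
    """Hide every '&' behind a placeholder, run `transform`, then restore '&'."""
    token = make_placeholder(html)
    protected = html.replace("&", token)          # step 2
    return transform(protected).replace(token, "&")  # step 3
```

With lxml, transform could be something like lambda s: etree.tostring(etree.fromstring(s, etree.HTMLParser()), encoding="unicode"). Keep in mind that the result then contains bare ampersands, so it is no longer strictly valid HTML.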

      

[UPDATE]:

The semicolon at the end of reserved_amp is a safeguard.

What if we generated a reserved_amp like this?

ampXampXampXampX + amp;

And the html contains:

yyYampX&

It would be encoded in this form:

yyYampXampXampXampXampXamp;

However, decoding cannot produce an incorrect result such as yy&YampX (the original is yyYampX&), thanks to the semicolon as the last character: it is not an ASCII letter or digit, so it can never be generated as part of the random portion of reserved_amp from string.ascii_lowercase + string.digits above.

So make sure the random part never contains a semicolon (or any other character outside the letters and digits), and then append the semicolon at the end (it MUST be the last character). Then there is no need to worry about yyYampX& falling into the yy&YampX trap.
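
To make the point concrete, here is a small sketch that runs the same example with and without the trailing semicolon. The helper names encode and decode are hypothetical; they just wrap the str.replace calls from steps 2 and 3.

```python
def encode(text, token):
    """Step 2: hide every '&' behind the token."""
    return text.replace("&", token)


def decode(text, token):
    """Step 3: turn every token back into '&'."""
    return text.replace(token, "&")


original = "yyYampX&"
unsafe_token = "ampXampXampXampXamp"  # same token, but without the final ';'
safe_token = unsafe_token + ";"       # the token from the example above

unsafe_result = decode(encode(original, unsafe_token), unsafe_token)
safe_result = decode(encode(original, safe_token), safe_token)

print(unsafe_result == original)  # False: the token matched too early
print(safe_result == original)    # True: only the real token ends in ';'
```

Without the semicolon, the text just before the ampersand can combine with the start of the token and be mistaken for the token itself, so the ampersand is restored in the wrong place.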



According to the lxml documentation for tostring(), you can pass method='xml' to avoid the HTML-specific serialization:

etree.tostring(tree, method='xml')

      



In my projects I use:

from lxml import html
html.tostring(node, with_tail=False, method='xml', encoding='unicode')
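
As a quick check of what the method argument does to ampersands, here is a sketch that uses the standard library's xml.etree.ElementTree as a stand-in for lxml. That substitution is an assumption: lxml's etree is modelled on the same API, but its output may differ in details. A bare & is not valid in XML attribute values either, so the ampersand stays escaped in both modes.

```python
import xml.etree.ElementTree as ET

# Build the same link as in the question (stdlib stand-in, not lxml itself).
link = ET.Element("a", {"href": "https://www.example.com/?param1=value1&param2=value2"})
link.text = "link"

as_xml = ET.tostring(link, method="xml", encoding="unicode")
as_html = ET.tostring(link, method="html", encoding="unicode")

print(as_xml)
print(as_html)
print("&amp;param2" in as_xml, "&amp;param2" in as_html)
```

So if the raw & is really required in the output, the placeholder approach from the first answer (or a plain text replacement afterwards) is still needed.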

      
