How do I remove bad data in an XPath element using Python?

I have this short example to demonstrate my problem:

from lxml import html

post = """<p>This a page with URLs.
<a href="http://google.com">This goes to&#xA; Google</a><br/>
<a href="http://yahoo.com">This &#xA; goes to Yahoo!</a><br/>
<a&#xA;href="http://example.com">This is invalid due to that&#xA;line feed character</p>&#xA;"""

doc = html.fromstring(post)
for link in doc.xpath('//a'):
    print link.get('href')

      

Output:

http://google.com
http://yahoo.com
None

      

The problem is that my data contains &#xA; characters embedded in it. In my last link, one is embedded directly between the anchor tag name and the href attribute. The line feeds that appear outside of the tags are the ones I care about keeping.

doc.xpath('//a') correctly saw <a&#xA;href="http://example.com"> as a link, but I can't access its href attribute when I do link.get('href').

How do I clean up the data when link.get('href') returns None, so that I can still retrieve the href attribute of the link that was detected?

I cannot remove all &#xA; characters from the whole post string, since the ones in the text are important.
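
For reference, this is roughly how I am picking out the links that come back without an href (a minimal sketch; it reuses the doc parsed in the example above):

# Minimal sketch: reuses the doc parsed from post in the example above.
# Collect the anchors whose href attribute could not be read.
broken = [link for link in doc.xpath('//a') if link.get('href') is None]
print len(broken)  # prints 1 -- the example.com link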



2 answers


Unidecode module

Since you need the data outside of the tags, you can try using unidecode. It doesn't deal with Chinese and Korean, but it will do things like change left and right quotation marks to plain ASCII quotes. It should also help with those &#xA; characters, replacing them with ordinary spaces rather than non-breaking spaces. Hopefully that covers everything you need as far as preserving your other data. A plain str.replace(u"\xa0", u" ") is less cumbersome if an ASCII space is acceptable.

import unidecode, urllib2
from lxml import html

# urlopen() returns a file-like response object, so read the body and
# decode it (assuming the page is UTF-8) before handing it to unidecode.
html_text = urllib2.urlopen("http://www.yourwebsite.com").read().decode("utf-8")

# Transliterate non-ASCII characters (curly quotes, non-breaking spaces,
# and so on) to their closest ASCII equivalents, then parse.
ascii_text = unidecode.unidecode(html_text)
doc = html.fromstring(ascii_text)
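
If a plain ASCII space is acceptable, the str.replace mentioned above is less cumbersome. A minimal sketch, assuming html_text already holds the decoded unicode text:

# Minimal sketch: swap non-breaking spaces (U+00A0) for ordinary ASCII
# spaces before parsing, assuming html_text is already a unicode string.
ascii_text = html_text.replace(u"\xa0", u" ")
doc = html.fromstring(ascii_text)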

      



Explanation of the problem

There seems to be a known issue with this in multiple versions of Python, and in C# as well. The linked report appears to indicate that the problem was closed because XML attribute values are not built to support carriage returns, so escaping them in every XML context would be pointless. As it turns out, the W3C specification requires this normalization to be applied to the input at parse time (see Section 1):

All line breaks must be normalized on input to #xA, as described in End-of-Line Processing Section 2.11, so the rest of this algorithm works with text normalized in this way.



You can solve your specific problem:

post = post.replace('&#xA;', '\n')

      

Resulting test program:



from lxml import html

post = """<p>This a page with URLs. 
<a href="http://google.com">This goes to&#xA; Google</a><br/>
<a href="http://yahoo.com">This &#xA; goes to Yahoo!</a><br/>
<a&#xA;href="http://example.com">This is invalid due to that&#xA;line feed character</p>&#xA;"""

post = post.replace('&#xA;', '\n')

doc = html.fromstring(post)
for link in doc.xpath('//a'):
    print link.get('href')

      

Output:

http://google.com
http://yahoo.com
http://example.com

      
