Parsing HTML using LXML in Python

Question

I am trying to parse a website for

blahblahblah 
<a  href="THIS IS WHAT I WANT" title="NOT THIS">I DONT CARE ABOUT THIS EITHER</a>
blahblahblah

(there are a lot of them, and I want them all collected in some structured form, such as a list). Unfortunately the HTML is very large and fairly complex, so walking the tree just to sort out the nested elements could take a while. Is there an easy way to extract just this value?

Thanks!

+3

python html parsing html-parsing lxml

user1922956 02 Feb At 15:57

1 answer

Jon Clements · Accepted Answer · 2013-02-02T15:59:17+0000

If you just want the href attribute of every a tag, use:

data = """blahblahblah 
<a  href="THIS IS WHAT I WANT" title="NOT THIS">I DONT CARE ABOUT THIS EITHER</a>
blahblahblah"""

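import lxml.html

# Parse the HTML string into an element tree.
tree = lxml.html.fromstring(data)
# Select the href attribute of every <a> element in the document.
print(tree.xpath('//a/@href'))

# ['THIS IS WHAT I WANT']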
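Since the question mentions that the HTML is very large, here is a rough sketch of an alternative that never builds a tree at all. It is not part of the answer above and does not use lxml: it relies on the standard library's html.parser module, and the class name HrefCollector is made up for illustration. It only shows the idea of collecting href values as the tags stream past.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Hypothetical helper: collects the href of every <a> start tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased; attrs is a list of (name, value) pairs.
        if tag == "a":
            for name, value in attrs:
                # value is None for attributes written without a value.
                if name == "href" and value is not None:
                    self.hrefs.append(value)


data = """blahblahblah
<a  href="THIS IS WHAT I WANT" title="NOT THIS">I DONT CARE ABOUT THIS EITHER</a>
blahblahblah"""

collector = HrefCollector()
collector.feed(data)   # feed() can be called repeatedly with chunks of a large page
collector.close()
print(collector.hrefs)

# ['THIS IS WHAT I WANT']
```

If the XPath one-liner is fast enough on your page, it is simpler and probably the better choice. This version may be worth a look only if building the whole tree turns out to be a problem.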
