
Parsing HTML using LXML in Python

I am trying to parse a website to pull the href value out of links like this one:

blahblahblah 
<a  href="THIS IS WHAT I WANT" title="NOT THIS">I DONT CARE ABOUT THIS EITHER</a>
blahblahblah 

      

There are a lot of these links, and I want all of them collected in some structured form. Unfortunately, the HTML is very large and fairly complex, so walking the tree by hand just to sort through the nested elements can take a while. Is there an easy way to get only these values?

Thanks!



1 answer


If you just want the href attribute of every a tag, use:



import lxml.html

data = """blahblahblah 
<a  href="THIS IS WHAT I WANT" title="NOT THIS">I DONT CARE ABOUT THIS EITHER</a>
blahblahblah"""

# Parse the HTML fragment, then select the href attribute of every <a> element.
tree = lxml.html.fromstring(data)
print(tree.xpath('//a/@href'))

# ['THIS IS WHAT I WANT']
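
If the links are nested at different depths, the same XPath should still find them without any manual tree walking. Here is a minimal sketch; the sample markup and the names `html`, `hrefs` and `pairs` are made up for illustration and are not from the question:

```python
import lxml.html

# Hypothetical sample markup: links at different nesting depths,
# plus one <a> element that has no href attribute at all.
html = """<div>
  <p>blah <a href="/first" title="NOT THIS">one</a></p>
  <ul>
    <li><a href="/second">two</a></li>
    <li><a name="anchor">no href here</a></li>
  </ul>
</div>"""

tree = lxml.html.fromstring(html)

# '//a/@href' matches the href attribute of every <a> in the document,
# however deeply nested; <a> elements without an href are simply skipped.
hrefs = tree.xpath('//a/@href')
print(hrefs)

# If you also need the link text, select the elements and read both values.
pairs = [(a.get('href'), a.text_content()) for a in tree.xpath('//a[@href]')]
print(pairs)
```

The first query returns plain strings, so the result can be used like any ordinary list. The second form is only worth it when you need something besides the href itself.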
