Using lxml to find the literal text of url links
(Python 3.4.2) First, I'm pretty new to python - more than a beginner, but less than an intermediate user.
I am trying to display the literal text of the url links on a page using lxml. I think I've almost got this, but I am missing something. I can get the actual URLs of the links, but not their names.
For example, from this:
<a class="yt-uix-sessionlink yt-uix-tile-link spf-link yt-ui-ellipsis yt-ui-ellipsis-2" dir="ltr" aria-describedby="description-id-588180" data-sessionlink="ei=6t2FVJLtEsOWrAbQ24HYAg&ved=CAcQvxs&feature=c4-videos-u" href="/watch?v=I2AcJG4112A&list=UUrtZO4nmCBN4C9ySmi013oA">Zombie on Omegle!</a>
I want to get the following:
'Zombie on Omegle!'
(I'll make this html tag a little more readable for you guys)
<a class="yt-uix-sessionlink yt-uix-tile-link spf-link yt-ui-ellipsis yt-ui-ellipsis-2"
dir="ltr" aria-describedby="description-id-588180"
data-sessionlink="ei=6t2FVJLtEsOWrAbQ24HYAg&ved=CAcQvxs&feature=c4-videos-u"
href="/watch?v=I2AcJG4112A&list=UUrtZO4nmCBN4C9ySmi013oA">
Zombie on Omegle!
</a>
I am trying to do this on a YouTube page and one problem is that YouTube does not specify a tag or attribute for the titles of its links, if that makes sense.
Here's what I've tried:
import lxml.html
from lxml import etree
import urllib.request
url = 'https://www.youtube.com/user/makemebad35/videos'
response = urllib.request.urlopen(url)
content = response.read()
doc = lxml.html.fromstring(content)
tree = lxml.etree.HTML(content)
parser = etree.HTMLParser()
href_list = tree.xpath('//a/@href')
#Perfect. List of all urls under the 'href' attribute.
href_res = [lxml.etree.tostring(href) for href in href_list]
#^TypeError: Type 'lxml.etree._ElementUnicodeResult' cannot be serialized.
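#(As far as I can tell, the @href query returns _ElementUnicodeResult objects,
# which behave like plain strings rather than elements, so tostring() has
# nothing to serialize.)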
#So I tried extracting the 'a' tag without the attribute 'href'.
a_list = tree.xpath('//a')
a_res = [lxml.etree.tostring(clas) for clas in a_list]
#^This works.
links_fail = lxml.html.find_rel_links(doc,'href')
#^I named it 'links_fail' because it doesn't work: the list is empty on output.
# But the 'links_success' list below works.
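#(If I understand the docs right, find_rel_links matches the rel attribute,
# e.g. <a rel="nofollow">, not href, so passing 'href' finds nothing.)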
urls = doc.xpath('//a/@href')
links_success = [link for link in urls if link.startswith('/watch')]
links_success
#^Out: ['/watch?v=K_yEaIBByFo&list=UUrtZO4nmCBN4C9ySmi013oA', ...]
#Awesome! List of all urls that begin with '/watch?v=...'
#Now if only I could get the titles of the links...
contents = [text.text_content() for text in urls if text.startswith('/watch')]
#^Empty list.
#I thought the block below wouldn't work,
# but I decided to try it anyway.
texts_fail = doc.xpath('//a/[@href="watch"]')
#^XPathEvalError: Invalid expression
#^Oops, I made a syntax error there. I forgot a '/' before 'watch'.
# But after correcting it (below), the output is the same.
texts_fail = doc.xpath('//a/[@href="/watch"]')
#^XPathEvalError: Invalid expression
texts_false = doc.xpath('//a/@href="watch"')
texts_false
#^Out: False
#^Typo again. But again, the output is still the same.
texts_false = doc.xpath('//a/@href="/watch"')
texts_false
#^Out: False
target_tag = ''.join(('//a/@class=',
                      '"yt-uix-sessionlink yt-uix-tile-link spf-link ',
                      'yt-ui-ellipsis yt-ui-ellipsis-2"'))
texts_html = doc.xpath(target_tag)
#^Out: True
#But YouTube doesn't make attributes for link titles.
texts_tree = tree.xpath(target_tag)
#^Out: True
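#(It seems an XPath expression containing '=', like //a/@class="...", is a
# comparison, so it evaluates to True or False instead of returning elements.)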
#I also tried this below, which I found in another stackoverflow question.
#It fails. The error is below.
doc_abs = doc.make_links_absolute(url)
#^Returns None: make_links_absolute() modifies doc in place, which is why the rest of this block fails.
text = []
text_content = []
notText = []
hasText = []
for each in doc_abs.iter():
    if each.text:
        text.append(each.text)
        hasText.append(each)  #list of elements that have text (each.text is truthy)
    text_content.append(each.text_content())  #the text for all elements
    if each not in hasText:
        notText.append(each)
#AttributeError Traceback (most recent call last)
#<ipython-input-215-38c68f560efe> in <module>()
#----> 1 for each in doc_abs.iter():
# 2 if each.text:
# 3 text.append(each.text)
# 4 hasText.append(each) # list of elements that has text each.text is true
# 5 text_content.append(each.text_content()) #the text for all elements
#
#AttributeError: 'NoneType' object has no attribute 'iter'
I have no idea. Anyone want to help this Python padawan? :P
----- EDIT -----
I got one step further, thanks to theSmallNothing. This command gets the text items:
doc.xpath('//a/text()')
Unfortunately this command returns a lot of spaces and newlines ('\n') as values. I will probably write another question about that later. If I do, I'll post a link to it here in case anyone else with the same question ends up here.
How to use lxml to match "url links" to link "names" (e.g. {name: link})
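In the meantime, here is a minimal sketch of filtering out those whitespace-only entries (assuming doc is the parsed page from the code above; the strip() call is just a workaround, not something lxml requires):
raw_texts = doc.xpath('//a/text()')
#keep only the entries that contain something besides whitespace
titles = [t.strip() for t in raw_texts if t.strip()]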
In your example, you want to use the text() selector in the xpath query:
doc.xpath('//a/text()')
which returns the text of every a element it can find.
To get the href and the text of all the a elements, which I think is what you're trying to do, you can first retrieve all the a elements, then iterate over them and extract the href and text separately.
watch_els = []
els = doc.xpath('//a')
for el in els:
    #use relative XPath ('.//' and './') so we only search inside this element
    text = el.xpath(".//text()")
    href = el.xpath("./@href")
    #check text and href arrays are not empty...
    if len(href) <= 0 or len(text) <= 0:
        #empty text/href, skip.
        continue
    text = text[0]
    href = href[0]
    if "/watch?" in href:
        #do something with a youtube video link...
        watch_els.append((text, href))
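As a follow-up to the {name: link} mapping mentioned in the edit, the pairs collected in watch_els can be turned into a dict; this is only a sketch, and the strip() is an extra step to drop the surrounding whitespace from the titles:
video_map = {text.strip(): href for text, href in watch_els if text.strip()}
#e.g. {'Zombie on Omegle!': '/watch?v=I2AcJG4112A&list=UUrtZO4nmCBN4C9ySmi013oA', ...}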