Using lxml to find the literal text of url links
(Python 3.4.2) First, I'm pretty new to python - more than a beginner, but less than an intermediate user.
I am trying to display the literal text of the url links on a page using lxml. I think I've almost got this, but I am missing something. I can get the actual URLs of the links, but not their names.
For example, from this:
<a class="yt-uix-sessionlink yt-uix-tile-link spf-link yt-ui-ellipsis yt-ui-ellipsis-2" dir="ltr" aria-describedby="description-id-588180" data-sessionlink="ei=6t2FVJLtEsOWrAbQ24HYAg&ved=CAcQvxs&feature=c4-videos-u" href="/watch?v=I2AcJG4112A&list=UUrtZO4nmCBN4C9ySmi013oA">Zombie on Omegle!</a>
I want to get the following:
'Zombie on Omegle!'
(I'll make this html tag a little more readable for you guys)
<a class="yt-uix-sessionlink yt-uix-tile-link spf-link yt-ui-ellipsis yt-ui-ellipsis-2"
dir="ltr" aria-describedby="description-id-588180"
data-sessionlink="ei=6t2FVJLtEsOWrAbQ24HYAg&ved=CAcQvxs&feature=c4-videos-u"
href="/watch?v=I2AcJG4112A&list=UUrtZO4nmCBN4C9ySmi013oA">
Zombie on Omegle!
</a>
I am trying to do this on a YouTube page and one problem is that YouTube does not specify a tag or attribute for the titles of its links, if that makes sense.
Here's what I've tried:
import lxml.html
from lxml import etree
import urllib.request
url = 'https://www.youtube.com/user/makemebad35/videos'
response = urllib.request.urlopen(url)
content = response.read()
doc = lxml.html.fromstring(content)
tree = lxml.etree.HTML(content)
parser = etree.HTMLParser()
href_list = tree.xpath('//a/@href')
#Perfect. List of all urls under the 'href' attribute.
href_res = [lxml.etree.tostring(href) for href in href_list]
#^TypeError: Type 'lxml.etree._ElementUnicodeResult' cannot be serialized.
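#(As far as I can tell, the @href query returns _ElementUnicodeResult objects,
# which behave like plain strings rather than elements, so tostring() has
# nothing to serialize.)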
#So I tried extracting the 'a' tag without the attribute 'href'.
a_list = tree.xpath('//a')
a_res = [lxml.etree.tostring(clas) for clas in a_list]
#^This works.
links_fail = lxml.html.find_rel_links(doc,'href')
#^I named it 'links_fail' because it doesn't work: the list is empty on output.
# But the 'links_success' list below works.
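#(If I understand the docs right, find_rel_links matches the rel attribute,
# e.g. <a rel="nofollow">, not href, so passing 'href' finds nothing.)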
urls = doc.xpath('//a/@href')
links_success = [link for link in urls if link.startswith('/watch')]
links_success
#^Out: ['/watch?v=K_yEaIBByFo&list=UUrtZO4nmCBN4C9ySmi013oA', ...]
#Awesome! List of all urls that begin with '/watch?v=...'
#Now if only I could get the titles of the links...
contents = [text.text_content() for text in urls if text.startswith('/watch')]
#^Empty list.
#I thought the block below wouldn't work,
# but I decided to try it anyway.
texts_fail = doc.xpath('//a/[@href="watch"]')
#^XPathEvalError: Invalid expression
#^Oops, I made a syntax error there. I forgot a '/' before 'watch'.
# But after correcting it (below), the output is the same.
texts_fail = doc.xpath('//a/[@href="/watch"]')
#^XPathEvalError: Invalid expression
texts_false = doc.xpath('//a/@href="watch"')
texts_false
#^Out: False
#^Typo again. But again, the output is still the same.
texts_false = doc.xpath('//a/@href="/watch"')
texts_false
#^Out: False
target_tag = ''.join(('//a/@class=',
                      '"yt-uix-sessionlink yt-uix-tile-link spf-link ',
                      'yt-ui-ellipsis yt-ui-ellipsis-2"'))
texts_html = doc.xpath(target_tag)
#^Out: True
#But YouTube doesn't make attributes for link titles.
texts_tree = tree.xpath(target_tag)
#^Out: True
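#(It seems an XPath expression containing '=', like //a/@class="...", is a
# comparison, so it evaluates to True or False instead of returning elements.)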
#I also tried this below, which I found in another stackoverflow question.
#It fails. The error is below.
doc_abs = doc.make_links_absolute(url)
#^Returns None: make_links_absolute() modifies doc in place, which is why the rest of this block fails.
text = []
text_content = []
notText = []
hasText = []
for each in doc_abs.iter():
    if each.text:
        text.append(each.text)
        hasText.append(each)  #list of elements that have text (each.text is truthy)
    text_content.append(each.text_content())  #the text for all elements
    if each not in hasText:
        notText.append(each)
#AttributeError Traceback (most recent call last)
#<ipython-input-215-38c68f560efe> in <module>()
#----> 1 for each in doc_abs.iter():
# 2 if each.text:
# 3 text.append(each.text)
# 4 hasText.append(each) # list of elements that has text each.text is true
# 5 text_content.append(each.text_content()) #the text for all elements
#
#AttributeError: 'NoneType' object has no attribute 'iter'
I have no idea. Anyone want to help this Python padawan? :P
----- EDIT -----
I got one step further, thanks to theSmallNothing. This command gets the text items:
doc.xpath('//a/text()')
Unfortunately this command returns a lot of spaces and newlines ('\n') as values. I will probably write another question about that later. If I do, I'll post a link to it here in case anyone else with the same question ends up here.
How to use lxml to match "url links" to link "names" (e.g. {name: link})
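In the meantime, here is a minimal sketch of filtering out those whitespace-only entries (assuming doc is the parsed page from the code above; the strip() call is just a workaround, not something lxml requires):
raw_texts = doc.xpath('//a/text()')
#keep only the entries that contain something besides whitespace
titles = [t.strip() for t in raw_texts if t.strip()]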
In your example, you want to use the text() selector in the xpath query:
doc.xpath('//a/text()')
which returns the text of every a element it can find.
To get the href and the text of all the a elements, which I think is what you're trying to do, you can first retrieve all the a elements, then iterate over them and extract the href and text separately.
watch_els = []
els = doc.xpath('//a')
for el in els:
    #use relative XPath ('.//' and './') so we only search inside this element
    text = el.xpath(".//text()")
    href = el.xpath("./@href")
    #check text and href arrays are not empty...
    if len(href) <= 0 or len(text) <= 0:
        #empty text/href, skip.
        continue
    text = text[0]
    href = href[0]
    if "/watch?" in href:
        #do something with a youtube video link...
        watch_els.append((text, href))
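As a follow-up to the {name: link} mapping mentioned in the edit, the pairs collected in watch_els can be turned into a dict; this is only a sketch, and the strip() is an extra step to drop the surrounding whitespace from the titles:
video_map = {text.strip(): href for text, href in watch_els if text.strip()}
#e.g. {'Zombie on Omegle!': '/watch?v=I2AcJG4112A&list=UUrtZO4nmCBN4C9ySmi013oA', ...}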