How to use lxml to map link "names" to their URLs (e.g. {name: link})
For reference, I posted a related question earlier: using lxml to find the literal text of url links
Also, I'm fairly new to Python: more than a beginner, but not 100% comfortable with it yet.
I am trying to use lxml to map every link on a page to its name (blue hyperlink text displayed in a web browser). I am doing this for a YouTube page and one of the problems is that YouTube is not generating html attributes for the link titles.
I almost figured it out, but I am missing something. It may be as simple as a syntax change. :/
Problem: when I get the literal text of the <a> tags on a page (as a Python list), it returns a TON of values, most of which contain only whitespace.
I'll explain what I would like to do at the end of this post. But first, here is my code:
import lxml.html
from lxml import etree
import re
import urllib.request
url = 'https://www.youtube.com/user/makemebad35/videos'
response = urllib.request.urlopen(url)
content = response.read()
doc = lxml.html.fromstring(content)
tree = lxml.etree.HTML(content)
parser = etree.HTMLParser()
#Get all links.
urls = doc.xpath('//a/@href')
len(urls)
#^Out: 109
#Get link names (try, at least).
texts = doc.xpath('//a/text()')
len(texts)
#^Out: 263
#That's a lot more than 109.
#A lot of the values in this list are whitespace or '\n'.
#Make copies of the list 'texts' from above,
# and try to filter out the whitespace.
texts_test = []
#^This list will hold every value with surrounding whitespace stripped,
#^ so whitespace-only values become empty strings but are not deleted.
texts_test2 = []
#^This list will only hold the values of list 'texts'
#^ that contain something other than whitespace (\S and not \s)
texts_test3 = []
#^This list will only hold the values of list 'texts'
#^ that contain no newline characters
for t in texts:
    texts_test.append(t.strip())
    #^List of stripped values
    if re.findall(r'\S', t):
        texts_test2.append(t)
    if not re.findall('\n', t):
        texts_test3.append(t)
#Now filter out the values in list 'urls'.
urls_test = []
#^This list will only contain the values of list 'urls'
#^ that begin with 'https://www.youtube.com/watch'.
#^ In other words, only the urls of YouTube videos.
urls = doc.xpath('//a/@href')
for u in urls:
    if u.startswith('https://www.youtube.com/watch'):
        urls_test.append(u)
len(texts) #List holds all literal text under html tag <a>.
#263
len(texts_test) #Copy of list above with 'junk' values emptied but not deleted.
#263
len(texts_test2) #List holds values with something other than whitespace.
#44
len(texts_test3) #List holds values that contain no '\n'.
#43
len(urls) #List holds all url links.
#109
len(urls_test) #List holds only links of YouTube videos.
#60
For the lists that are close in length (texts_test3 and urls_test), I checked the values. They basically contain what I want. I also checked whether urls_test merely had a few extra values at the beginning or at the end, but unfortunately that is not the case: the differences are spread throughout the list. For example, urls_test[5:15] does not line up with the corresponding ten values of texts_test3.
Currently I am getting the text of ALL <a> tags using this command:
texts = doc.xpath('//a/text()')
What I would like to do is get the text of ONLY the <a> tags that have an href attribute. So something like this:
texts = doc.xpath('//a/@href/text()')
But this command doesn't output anything. I also tried this:
texts = doc.xpath('//a/[@href]/text()')
But I am getting this error:
XPathEvalError: Invalid expression
I'm out of ideas. Does anyone else have one?
XPath requires that the predicate (the specific quality you want in a tag, such as a particular attribute) come right after the tag name, with no slash in between:
//title[@lang] Selects all the title elements that have an attribute named lang
(Taken from W3Schools.)
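To see the predicate rule in action without hitting YouTube, here is a minimal sketch on a tiny hand-written page. The HTML string and the variable names are made up for illustration; only `lxml.html.fromstring`, `xpath` and `etree.XPathEvalError` come from lxml itself.

```python
import lxml.html
from lxml import etree

# Hypothetical stand-in page (not YouTube's real markup).
html = """
<html><body>
  <a href="/watch?v=1">First video</a>
  <a name="anchor-only">No link here</a>
  <a href="/watch?v=2">Second video</a>
</body></html>
"""
doc = lxml.html.fromstring(html)

# Text of every <a>, with or without an href attribute.
all_texts = doc.xpath('//a/text()')

# Predicate directly after the tag name: only <a> elements that have href.
linked_texts = doc.xpath('//a[@href]/text()')

# A slash between the tag name and the predicate is not valid XPath.
try:
    doc.xpath('//a/[@href]/text()')
    slash_error = None
except etree.XPathEvalError as exc:
    slash_error = exc

print(len(all_texts), linked_texts, slash_error)
```

Here the second `<a>` has no `href`, so it shows up in `all_texts` but not in `linked_texts`.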
In your case the problem is an extra slash. Change:
texts = doc.xpath('//a/[@href]/text()')
to
texts = doc.xpath('//a[@href]/text()')
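Note that even with the corrected expression, the text list and the href list may still differ in length, because one `<a>` can contain several text nodes (or none, e.g. when it only wraps an image). One possible way to get the {name: link} mapping from the question is to loop over the elements themselves and read the text and the href from the same element. This is only a sketch on made-up HTML, not something tested against the real YouTube page; `name_to_link` is a name chosen here for illustration.

```python
import lxml.html

# Hypothetical stand-in page (not YouTube's real markup).
html = """
<html><body>
  <a href="/watch?v=1">
    <span>First</span> video
  </a>
  <a href="/watch?v=2">Second video</a>
  <a href="/about"><img src="logo.png"></a>
</body></html>
"""
doc = lxml.html.fromstring(html)

name_to_link = {}
for a in doc.xpath('//a[@href]'):
    # text_content() gathers all text inside the element, including text
    # in nested tags; split() + join() collapses runs of whitespace.
    name = ' '.join(a.text_content().split())
    if name:  # skip links with no visible text (e.g. image-only links)
        name_to_link[name] = a.get('href')

print(name_to_link)
```

Because the name and the URL come from the same element, they cannot get out of step. One caveat: if two links share the same text, the later one overwrites the earlier one in the dict, so a list of (name, link) tuples may be a better fit in that case.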