How to use lxml to map link "names" to their "url links" (e.g. {name: link})

For reference, I previously posted a related question: using lxml to find the literal text of url links

Also, I'm fairly new to Python: past the beginner stage, but not yet fully comfortable with it.

I am trying to use lxml to map every link on a page to its name (the blue hyperlink text displayed in a web browser). I am doing this for a YouTube page, and one of the problems is that YouTube does not generate HTML attributes for the link titles.

I've almost figured it out, but I am missing something. It may be as simple as a syntax change. :/

Problem: When I get the literal text of the <a> tags on a page (as a Python list), it returns a TON of values, most of which contain only whitespace.

I'll explain what I would like to do at the end of this post. But first, here is my code:

import urllib.request
import lxml.html
from lxml import etree
import re

url = 'https://www.youtube.com/user/makemebad35/videos'
response = urllib.request.urlopen(url)
content = response.read()
doc = lxml.html.fromstring(content)
tree = lxml.etree.HTML(content)
parser = etree.HTMLParser()

#Get all links.
urls = doc.xpath('//a/@href')
len(urls)
#^Out: 109

#Get link names (try, at least).
texts = doc.xpath('//a/text()')
len(texts)
#^Out: 263
#That's quite a bit more than 109.
#A lot of the values in this list are just whitespace or '\n'.

#Make copies of the list 'texts' from above,
#    and try to filter out the whitespace.
texts_test = []
#^This list will strip the values filled of whitespace,
#^    but the values themselves won't be deleted, just empty.
texts_test2 = []
#^This list will only hold the values of list 'texts'
#^    that contain something other than whitespace (\S and not \s)
texts_test3 = []
#^This list will only hold the values of list 'texts'
#^    that contain something other than newlines
for t in texts:
    texts_test.append(t.strip())
    #^List of stripped values.
    if re.findall(r'\S', t):
        texts_test2.append(t)
    if not re.findall('\n', t):
        texts_test3.append(t)

#Now filter out the values in list 'urls'.
urls_test = []
#^This list will only contain the values of list 'urls'
#^    that point to a YouTube 'watch' page.
#^    In other words, only the urls of YouTube videos.
urls = doc.xpath('//a/@href')
for u in urls:
    if u.startswith('https://www.youtube.com/watch'):
        urls_test.append(u)

len(texts)       #List holds all literal text under html tag <a>.
#263
len(texts_test)  #Copy of list above with 'junk' values emptied but not deleted.
#263
len(texts_test2) #List holds values with something other than whitespace.
#44
len(texts_test3) #List holds values with something other than '\n'.
#43
len(urls)        #List holds all url links.
#109
len(urls_test)   #List holds only links of YouTube videos.
#60

      

For the two lists that are closest in length (texts_test3 and urls_test), I checked their values. They basically contain what I want. I hoped that urls_test would only have some extra values at the beginning or at the end, but unfortunately that is not the case: the differences are spread throughout the list. For example, urls_test[5-15] does not match the corresponding values of texts_test3.

Currently I am getting the text of ALL <a> tags using this command:

texts = doc.xpath('//a/text()')

      

What I would like to do is get the text of ONLY the <a> tags that have an href attribute. So something like this:

texts = doc.xpath('//a/@href/text()')

      

But this command doesn't output anything. I also tried this:

texts = doc.xpath('//a/[@href]/text()')

      

But I am getting this error:

XPathEvalError: Invalid expression

      

I'm out of ideas. Does anyone have any?



1 answer


XPath requires that a predicate (the condition you want a tag to satisfy, such as having a specific attribute) come right after the tag name, with no slash in between:

//title[@lang]  Selects all the title elements that have an attribute named lang

      

Taken from W3Schools.
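
As a minimal sketch of that rule, the snippet below runs the three expressions from the question against a small, made-up HTML fragment (the markup and the `sample` name are assumptions for illustration, not YouTube's actual page). It should show that the predicate form selects only anchors carrying an href, that asking for text() beneath an attribute yields nothing, and that a predicate placed after a bare slash is rejected.

```python
import lxml.html
from lxml import etree

# Hypothetical markup, invented for this example.
sample = """
<div>
  <a href="/watch?v=abc">First video</a>
  <a name="anchor-only">No link here</a>
  <a href="/watch?v=def">Second video</a>
</div>
"""
doc = lxml.html.fromstring(sample)

# Predicate directly after the tag name: only <a> elements with an href.
with_href = doc.xpath('//a[@href]/text()')

# An attribute node has no text() children, so this comes back empty.
attr_text = doc.xpath('//a/@href/text()')

# A predicate cannot follow a bare slash; lxml raises XPathEvalError.
try:
    doc.xpath('//a/[@href]/text()')
    invalid = False
except etree.XPathEvalError:
    invalid = True

print(with_href)
print(attr_text)
print(invalid)
```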

In your case, the problem is the extra slash. Change:



texts = doc.xpath('//a/[@href]/text()')

      

to

texts = doc.xpath('//a[@href]/text()')
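
To get from there to the {name: link} mapping asked for in the title, one option is to iterate over the matching elements instead of building separate text and href lists, so each name stays attached to its own link. The following is a rough sketch under assumptions: the HTML fragment is invented, and `link_map` is a hypothetical helper name. It uses text_content() so that text nested inside child tags is included, and it skips anchors whose text is only whitespace.

```python
import lxml.html

def link_map(html):
    """Return {link text: href} for every <a> with an href (hypothetical helper)."""
    doc = lxml.html.fromstring(html)
    mapping = {}
    for a in doc.xpath('//a[@href]'):
        # text_content() joins all text inside the element, including children.
        name = a.text_content().strip()
        if name:  # skip anchors whose text is only whitespace
            mapping[name] = a.get('href')
    return mapping

# Hypothetical markup, invented for this example.
sample = """
<ul>
  <li><a href="/watch?v=abc"> First video </a></li>
  <li><a href="/watch?v=def"><span>Second</span> video</a></li>
  <li><a href="/watch?v=ghi">
  </a></li>
  <li><a name="no-href">Not a link</a></li>
</ul>
"""

result = link_map(sample)
print(result)
```

Note that a dict keeps only one link per name, so if two links share the same text, the later one overwrites the earlier; collecting (name, href) tuples in a list would avoid that.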

      
