Regex not working in bs4

I am trying to extract some links from a specific page on the watchseriesfree.to website. In this case, I want the rapidvideo links, so I use a regex to filter for tags whose text contains rapidvideo.

import re
import urllib2
from bs4 import BeautifulSoup

def gethtml(link):
    req = urllib2.Request(link, headers={'User-Agent': "Magic Browser"})
    con = urllib2.urlopen(req)
    html = con.read()
    return html


def findLatest():
    url = "https://watchseriesfree.to/serie/Madam-Secretary"
    head = "https://watchseriesfree.to"

    soup = BeautifulSoup(gethtml(url), 'html.parser')
    latep = soup.find("a", title=re.compile('Latest Episode'))

    soup = BeautifulSoup(gethtml(head + latep['href']), 'html.parser')
    firstVod = soup.findAll("tr",text=re.compile('rapidvideo'))

    return firstVod

print(findLatest())

      

However, the above code returns an empty list. What am I doing wrong?



1 answer


The problem is here:

firstVod = soup.findAll("tr",text=re.compile('rapidvideo'))

      

When BeautifulSoup applies your regex pattern, it tests it against the .string attribute of every matching tr element. Now, .string has this important caveat: when an element has multiple children, .string is None:

"If a tag contains multiple objects, it is not clear which one .string should refer to, so .string is defined to be None."



Hence, you have no results.
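To see the caveat in isolation, here is a minimal sketch. The markup is made up for illustration and is only an assumption about what a row on the real page might look like. It also uses the string= keyword, which is the newer name for text= in Beautiful Soup 4.4.0 and later.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup for illustration; the real page may be structured differently.
html = (
    '<table>'
    '<tr><td>rapidvideo</td><td><a href="/open/1">Watch</a></td></tr>'
    '<tr><td>rapidvideo</td></tr>'
    '</table>'
)
soup = BeautifulSoup(html, "html.parser")
multi_child_row, single_child_row = soup.find_all("tr")

# Two <td> children: .string is None, so a regex has nothing to match against.
print(multi_child_row.string)

# A single chain of children: .string is inherited from the innermost tag.
print(single_child_row.string)

# Only the row whose .string is set can match the pattern.
matches = soup.find_all("tr", string=re.compile("rapidvideo"))
print(len(matches))
```

A row with a host name in one cell and a link in another falls into the first case, which would explain the empty list.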

What you can do instead is check the actual text of the tr elements by passing a search function that calls .get_text():

soup.find_all(lambda tag: tag.name == 'tr' and 'rapidvideo' in tag.get_text())
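As a rough, self-contained sketch of the same idea, the snippet below runs the filter on made-up markup and then pulls the href out of each matching row. The HTML, the /open/... paths, and the helper name is_rapidvideo_row are all hypothetical; the real page may need a different way of locating the link.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration; the real page may be structured differently.
html = (
    '<table>'
    '<tr><td>rapidvideo</td><td><a href="/open/1">Watch</a></td></tr>'
    '<tr><td>othersite</td><td><a href="/open/2">Watch</a></td></tr>'
    '</table>'
)
soup = BeautifulSoup(html, "html.parser")


def is_rapidvideo_row(tag):
    """Match <tr> tags whose combined text mentions rapidvideo."""
    return tag.name == "tr" and "rapidvideo" in tag.get_text()


rows = soup.find_all(is_rapidvideo_row)

# Collect the link targets from the matching rows only.
links = [a["href"] for row in rows for a in row.find_all("a", href=True)]
print(links)
```

Unlike .string, .get_text() concatenates the text of all descendants, so it works no matter how many children the row has.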

      
