Regex not working in bs4

Question

I am trying to extract some links from a specific page on the watchseriesfree.to website. In this case I want the rapidvideo links, so I use a regex to select the tags whose text contains rapidvideo:

import re
import urllib2
from bs4 import BeautifulSoup

def gethtml(link):
    req = urllib2.Request(link, headers={'User-Agent': "Magic Browser"})
    con = urllib2.urlopen(req)
    html = con.read()
    return html


def findLatest():
    url = "https://watchseriesfree.to/serie/Madam-Secretary"
    head = "https://watchseriesfree.to"

    soup = BeautifulSoup(gethtml(url), 'html.parser')
    latep = soup.find("a", title=re.compile('Latest Episode'))

    soup = BeautifulSoup(gethtml(head + latep['href']), 'html.parser')
    firstVod = soup.findAll("tr",text=re.compile('rapidvideo'))

    return firstVod

print(findLatest())

However, the above code returns an empty list. What am I doing wrong?

python regex urllib2 bs4

Echchama nayak 27 Mar 17 at 12:40 am

1 answer

alecxe · Accepted Answer · 2017-03-27T00:51:45+0000

The problem is here:

firstVod = soup.findAll("tr",text=re.compile('rapidvideo'))

When BeautifulSoup applies your text regex pattern, it uses the .string attribute of every matched tr element. The .string attribute has an important caveat: when an element has multiple children, .string is None. As the documentation puts it:

"If a tag contains more than one thing, then it's not clear what .string should refer to, so .string is defined to be None."

Hence, you have no results.
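The caveat can be reproduced on a small scale. The sketch below is only an illustration: the markup and the id values are made up and are not taken from the real site, and it uses string=, the newer name that recent BeautifulSoup releases prefer over text=.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup (not from the real site): one row with two cells,
# one row with a single cell.
html = (
    "<table>"
    "<tr id='multi'><td>rapidvideo.com</td><td>Watch</td></tr>"
    "<tr id='single'><td>rapidvideo.com</td></tr>"
    "</table>"
)
soup = BeautifulSoup(html, "html.parser")

# A row with several children has no single string, so .string is None.
multi_string = soup.find("tr", id="multi").string

# A row with exactly one child defers to that child's .string.
single_string = soup.find("tr", id="single").string

# The regex is tested against .string, so only the single-cell row matches.
matches = soup.find_all("tr", string=re.compile("rapidvideo"))
matched_ids = [tr["id"] for tr in matches]

print(multi_string, single_string, matched_ids)
```

Rows on a real listing page usually hold several cells, which is why the filter in the question matches nothing.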

What you can do instead is check the actual text of the tr elements by passing a search function that calls .get_text():

soup.find_all(lambda tag: tag.name == 'tr' and 'rapidvideo' in tag.get_text())
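For completeness, here is a minimal sketch of the same idea that also pulls the links out of the matching rows. The markup, the href values, and the helper name row_mentions are assumptions made for illustration; the real page may be structured differently.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the episode page.
html = (
    "<table>"
    "<tr><td>rapidvideo.com</td><td><a href='/open/1'>Watch</a></td></tr>"
    "<tr><td>openload.co</td><td><a href='/open/2'>Watch</a></td></tr>"
    "</table>"
)
soup = BeautifulSoup(html, "html.parser")


def row_mentions(tag, needle="rapidvideo"):
    """Return True for tr tags whose combined text contains the needle."""
    return tag.name == "tr" and needle in tag.get_text()


# get_text() joins the text of all descendants, so multi-cell rows match.
rows = soup.find_all(row_mentions)

# Collect the href of every link found inside the matching rows.
links = [a["href"] for row in rows for a in row.find_all("a", href=True)]

print(links)
```

A named function behaves the same as the lambda above; it is just easier to reuse and to extend, for example with a case-insensitive comparison.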
