Regex not working in bs4
I am trying to extract some links for a specific file host from the watchseriesfree.to website. In the following case I want the rapidvideo links, so I use a regex to filter the tags whose text contains rapidvideo:
import re
import urllib2
from bs4 import BeautifulSoup
def gethtml(link):
    req = urllib2.Request(link, headers={'User-Agent': "Magic Browser"})
    con = urllib2.urlopen(req)
    html = con.read()
    return html

def findLatest():
    url = "https://watchseriesfree.to/serie/Madam-Secretary"
    head = "https://watchseriesfree.to"
    soup = BeautifulSoup(gethtml(url), 'html.parser')
    latep = soup.find("a", title=re.compile('Latest Episode'))
    soup = BeautifulSoup(gethtml(head + latep['href']), 'html.parser')
    firstVod = soup.findAll("tr",text=re.compile('rapidvideo'))
    return firstVod

print(findLatest())
However, the above code returns an empty list. What am I doing wrong?
The problem is here:
firstVod = soup.findAll("tr",text=re.compile('rapidvideo'))
When BeautifulSoup applies your regex pattern, it uses the value of the .string attribute of every matching tr element. Now, .string has this important caveat: when an element has multiple children, .string is None:
"If a tag contains more than one thing, then it's not clear what .string should refer to, so .string is defined to be None."
Hence, you have no results.
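To see the caveat in isolation, here is a minimal sketch. The markup is made up for illustration and is not taken from the real page; it also assumes a Beautiful Soup version (4.4.0 or later) where the text argument is spelled string.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup: one row with a single cell, one row with two cells.
html = """
<table>
  <tr><td>rapidvideo</td></tr>
  <tr><td>rapidvideo</td><td><a href="/link/1">Watch</a></td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
rows = soup.find_all("tr")

# A row whose only child is a single <td> inherits that cell's .string.
single_child_string = rows[0].string

# A row with several children has no unambiguous .string, so it is None.
multi_child_string = rows[1].string

# The regex is tested against .string, so only the single-cell row matches.
# (string= is the newer spelling of the text= argument used in the question.)
matches = soup.find_all("tr", string=re.compile("rapidvideo"))

print(repr(single_child_string))
print(repr(multi_child_string))
print(len(matches))
```

Rows on a real listing page usually have several cells, which is why the regex filter finds nothing there.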
What you can do is check the actual text of the tr elements by using a search function and calling .get_text():
soup.find_all(lambda tag: tag.name == 'tr' and 'rapidvideo' in tag.get_text())
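As a rough, self-contained sketch of how that filter could be used to get at the links themselves: the sample markup, the host names and the helper name rows_containing are all assumptions made for illustration, since the real page structure may differ.

```python
from bs4 import BeautifulSoup


def rows_containing(soup, needle):
    """Return every <tr> whose combined text contains `needle` (hypothetical helper)."""
    return soup.find_all(lambda tag: tag.name == "tr" and needle in tag.get_text())


# Hypothetical markup standing in for the episode page.
html = """
<table>
  <tr><td>rapidvideo.com</td><td><a href="/open/1">Watch</a></td></tr>
  <tr><td>otherhost.example</td><td><a href="/open/2">Watch</a></td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
rows = rows_containing(soup, "rapidvideo")

# Collect the href of every link inside the matching rows.
links = [a["href"] for row in rows for a in row.find_all("a", href=True)]
print(links)
```

Because .get_text() joins the text of all descendants, the check works no matter how many cells a row has.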