Python requests: check if a URL is not an HTML page
So, I have a crawler that uses something like this:
#if ".mp3" in baseUrl[0] or ".pdf" in baseUrl[0]:
if baseUrl[0][-4] == "." and ".htm" not in baseUrl[0]:
    raise Exception
html = requests.get(baseUrl[0], timeout=3).text
It works really well. The problem is that if the crawler hits a file such as an .mp4 or .m4a instead of an HTML page, the script hangs, and on Linux, when I try to run the script, it just prints:
Killed
Is there a better way to catch these non-HTML pages?
1 answer
You can send a HEAD request and check the content type, and only fetch the page if it is text/html:
r = requests.head(url)
# .get() avoids a KeyError when the server sends no Content-Type header
if "text/html" in r.headers.get("content-type", ""):
    html = requests.get(url).text
else:
    print("non html page")
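The HEAD check above can be wrapped in a small helper. This is only a sketch: the names `looks_like_html` and `head_is_html` are made up for illustration, and treating `application/xhtml+xml` as HTML is an assumption you may not want. Note that `requests.head` does not follow redirects unless `allow_redirects=True` is passed, and that some servers answer HEAD requests incorrectly or not at all.

```python
import requests


def looks_like_html(headers):
    """Return True if the Content-Type header names an HTML media type."""
    # Header values look like "text/html; charset=utf-8"; keep only the media type.
    content_type = headers.get("Content-Type", "")
    media_type = content_type.split(";", 1)[0].strip().lower()
    # Accepting XHTML here is an assumption, not part of the original answer.
    return media_type in ("text/html", "application/xhtml+xml")


def head_is_html(url, timeout=3):
    """Send a HEAD request (no body is transferred) and inspect the headers."""
    # requests.head() leaves redirects unfollowed by default, so ask for them.
    response = requests.head(url, allow_redirects=True, timeout=timeout)
    return looks_like_html(response.headers)
```

In the crawler, `head_is_html(baseUrl[0])` could replace the file-extension test before calling `requests.get`.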
If you just want to make one request:
r = requests.get(url)
# .get() avoids a KeyError when the server sends no Content-Type header
if "text/html" in r.headers.get("content-type", ""):
    html = r.text
else:
    print("non html page")
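One caveat with the single-request version: by default `requests.get` downloads the whole response body before it returns, so a large .mp4 is still fetched even though its text is never used. A possible way around this, sketched below, is to pass `stream=True`, which defers the body download until `.text` or `.content` is accessed. The function name `fetch_html` and its `get` parameter (there only so the HTTP call can be swapped out) are hypothetical, not part of the original answer.

```python
import requests


def fetch_html(url, timeout=3, get=requests.get):
    """Return the page text if the server reports HTML, otherwise None."""
    # stream=True: only the status line and headers are read at this point.
    response = get(url, stream=True, timeout=timeout)
    try:
        content_type = response.headers.get("Content-Type", "")
        if "text/html" not in content_type.lower():
            return None  # body is never downloaded for non-HTML responses
        # Accessing .text triggers the actual body download.
        return response.text
    finally:
        # Release the connection whether or not the body was read.
        response.close()
```

A call such as `html = fetch_html(baseUrl[0])` then returns `None` for media files instead of pulling them into memory.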