Python requests: check whether a URL is not an HTML page

Question

So, I have a crawler that uses something like this:

#if ".mp3" in baseUrl[0] or ".pdf" in baseUrl[0]:
if baseUrl[0][-4] == "." and ".htm" not in baseUrl[0]:
    raise Exception
html = requests.get(baseUrl[0], timeout=3).text

It works reasonably well. However, if the crawler hits a media file such as .mp4 or .m4a instead of an HTML page, the script freezes, and on Linux it eventually just prints:

Killed

Is there a better way to catch these non-HTML pages?

+3

python python-requests

User 19 Aug 14 at 20:21

1 answer

Ankush shah · Accepted Answer · 2014-08-19T20:31:13+0000

You can send a HEAD request and check the content type, then download the body only if it is text/html:

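import requests

# HEAD returns only the headers; requests.head() does not follow redirects by default
r = requests.head(url, allow_redirects=True, timeout=3)
if "text/html" in r.headers.get("content-type", ""):
    html = requests.get(url, timeout=3).text
else:
    print("non html page")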
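
The substring test above is fine for typical headers. A slightly stricter check compares only the media type, ignoring parameters such as charset and letter case. This is a minimal sketch: the helper name is_html is hypothetical, and counting application/xhtml+xml as HTML is my assumption, not something the question asks for.

```python
def is_html(content_type):
    """Return True if a Content-Type header value denotes an HTML document."""
    if not content_type:
        # Header missing or empty: treat as "not HTML".
        return False
    # Drop parameters such as "; charset=utf-8" and normalise case.
    media_type = content_type.split(";", 1)[0].strip().lower()
    # Assumption: XHTML served with its own media type counts as HTML too.
    return media_type in ("text/html", "application/xhtml+xml")
```

You would call it as is_html(r.headers.get("content-type")) in place of the "in" test.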

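If you want to make only one request, pass stream=True so that the body is not downloaded until you access r.text:

r = requests.get(url, stream=True, timeout=3)
if "text/html" in r.headers.get("content-type", ""):
    html = r.text
else:
    print("non html page")
r.close()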
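
Note that the timeout in requests applies to connecting and to gaps between received bytes, not to the total download time, so a large file can still be fetched in full. If you also want to guard against pages that claim to be HTML but are huge, you could stream the body and stop at a size limit. A rough sketch, not a definitive implementation: the names read_capped and fetch_html and the 2 MB default are assumptions chosen for illustration.

```python
def read_capped(chunks, max_bytes):
    """Join byte chunks, stopping once max_bytes have been collected."""
    data = bytearray()
    for chunk in chunks:
        data.extend(chunk)
        if len(data) >= max_bytes:
            break  # stop pulling further chunks from the source
    return bytes(data[:max_bytes])


def fetch_html(url, max_bytes=2000000, timeout=3):
    """Return the page text, or None if the URL does not serve HTML."""
    # Imported here so read_capped stays usable without requests installed.
    import requests

    # stream=True fetches only the headers until the body is iterated.
    with requests.get(url, stream=True, timeout=timeout) as r:
        if "text/html" not in r.headers.get("content-type", "").lower():
            return None
        body = read_capped(r.iter_content(chunk_size=8192), max_bytes)
        # r.encoding is taken from the response headers and may be None.
        return body.decode(r.encoding or "utf-8", errors="replace")
```

In the crawler, html = fetch_html(baseUrl[0]) followed by a check for None would replace both the extension test and the plain requests.get call.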
