Python requests: check if a URL is not an HTML page

So, I have a crawler that uses something like this:

#if ".mp3" in baseUrl[0] or ".pdf" in baseUrl[0]:
if baseUrl[0][-4] == "." and ".htm" not in baseUrl[0]:
    raise Exception
html = requests.get(baseUrl[0], timeout=3).text

      

It works really well. However, if a file such as .mp4 or .m4a reaches the crawler instead of an HTML page, the script freezes, and on Linux, when I run the script, it just prints:

Killed

      

Is there a better way to catch these non-HTML pages?



1 answer


You can send a HEAD request and check the content type, and only proceed if it is text/html:

r = requests.head(url)
# .get() avoids a KeyError when the server sends no Content-Type header
if "text/html" in r.headers.get("content-type", ""):
    html = requests.get(url).text
else:
    print("non html page")
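
For reuse, the HEAD check above could be wrapped in a small helper. This is only a sketch: the names `is_html_content_type` and `fetch_html_if_page` are made up for illustration, the 3-second timeout is carried over from the question, and it assumes the server answers a HEAD request with the same Content-Type it would send for a GET (not every server does). Note that `requests.head` does not follow redirects unless `allow_redirects=True` is passed.

```python
import requests


def is_html_content_type(content_type):
    """Return True if a Content-Type header value denotes an HTML page."""
    if not content_type:
        return False
    # Drop parameters such as "; charset=utf-8" and normalise the case.
    media_type = content_type.split(";", 1)[0].strip().lower()
    # Treating XHTML as HTML is an assumption; remove it if not wanted.
    return media_type in ("text/html", "application/xhtml+xml")


def fetch_html_if_page(url, timeout=3):
    """Return the page text for HTML responses, otherwise None (hypothetical helper)."""
    # HEAD transfers only the headers, so a large media file is never downloaded.
    head = requests.head(url, timeout=timeout, allow_redirects=True)
    if not is_html_content_type(head.headers.get("Content-Type")):
        return None
    return requests.get(url, timeout=timeout).text
```

In the crawler from the question this would replace the extension test: call `fetch_html_if_page(baseUrl[0])` and skip the URL when it returns `None`.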

      



If you want to make only one request:

r = requests.get(url)
# .get() avoids a KeyError when the server sends no Content-Type header
if "text/html" in r.headers.get("content-type", ""):
    html = r.text
else:
    print("non html page")
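
One caveat with the single-request version: a plain `requests.get(url)` downloads the whole body before the header check runs, so a large .mp4 would still be fetched. A possible variation, sketched here, passes `stream=True` so that only the headers are read up front and the body is pulled in chunks afterwards. The names `read_html` and `fetch_html` and the 2 MB cap are assumptions chosen for illustration, not part of requests.

```python
import requests

MAX_BYTES = 2 * 1024 * 1024  # assumed size cap; tune for your crawler


def read_html(response, max_bytes=MAX_BYTES):
    """Return the decoded body of an HTML response, or None if it is
    not HTML or larger than max_bytes (hypothetical helper)."""
    content_type = response.headers.get("Content-Type", "")
    if "text/html" not in content_type.lower():
        return None
    chunks = []
    total = 0
    # iter_content() reads the body piece by piece instead of all at once.
    for chunk in response.iter_content(chunk_size=8192):
        total += len(chunk)
        if total > max_bytes:
            return None
        chunks.append(chunk)
    encoding = response.encoding or "utf-8"
    return b"".join(chunks).decode(encoding, errors="replace")


def fetch_html(url, timeout=3):
    """Fetch url with a single GET and return its HTML text, or None."""
    # stream=True defers the body download until it is actually read;
    # the with-block closes the connection even if the body is skipped.
    with requests.get(url, stream=True, timeout=timeout) as response:
        return read_html(response)
```

The size cap also guards against servers that label a huge file as text/html, which the content-type check alone would not catch.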

      
