Urllib2 getparam charset returns None for some sites
I've been struggling with this for a while. The following code snippet returns None
for some websites even though the character set is present in the page's <meta> tag, so it doesn't seem like a reliable way to get the correct encoding of a web page.
import urllib2

conn = urllib2.urlopen(req)  # req is a URL string or a urllib2.Request
charset = conn.headers.getparam('charset')  # None if the Content-Type header has no charset parameter
I've read several threads here on SO, and some of them mention using chardet
, but I'd rather not import an additional module if possible. Instead, I'm going to load just the header and extract the encoding information with some string functions.
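Something along these lines is what I had in mind (just a rough sketch; the 4096-byte read is an arbitrary guess at how much covers the <head>, and I've used a small regex instead of pure string slicing):

import re
import urllib2

conn = urllib2.urlopen(req)
head = conn.read(4096)  # the <head> section is normally within the first few KB
match = re.search(r'charset=["\']?([\w-]+)', head)
charset = match.group(1) if match else None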
Does anyone have a better idea?
conn.headers.getparam('charset') does not parse the HTML content (the <meta> tag); the charset it returns comes only from the HTTP headers (i.e. the Content-Type header).
You can use an HTML parser to get the character encoding when it isn't specified in the HTTP headers.
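For example, a minimal sketch using the standard-library HTMLParser (no third-party module required; it checks both the HTML5 <meta charset> form and the older http-equiv form, and omits error handling for malformed HTML):

import urllib2
from HTMLParser import HTMLParser

class CharsetSniffer(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.charset = None

    def handle_starttag(self, tag, attrs):
        if tag != 'meta' or self.charset:
            return
        attrs = dict(attrs)
        if 'charset' in attrs:  # HTML5: <meta charset="utf-8">
            self.charset = attrs['charset']
        elif attrs.get('http-equiv', '').lower() == 'content-type':
            # older form: <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
            content = attrs.get('content', '')
            if 'charset=' in content:
                self.charset = content.split('charset=', 1)[1].strip()

conn = urllib2.urlopen(url)
sniffer = CharsetSniffer()
sniffer.feed(conn.read())
print sniffer.charset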
Moving my comment here and posting it as an answer.
Thanks to @JF Sebastian, I could get the charset from the <meta> tag using the code snippet below:
from lxml.html import parse  # lxml provides parse() and cssselect()

conn = urllib2.urlopen(url)
site = parse(conn).getroot()
charset = site.cssselect('meta[http-equiv="Content-Type"]')[0].get('content').split("charset=", 1)[1]
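To cover both cases, a combined version might look something like this (a rough sketch, not a definitive solution: it prefers the HTTP header and only falls back to the <meta> tag, assuming lxml and its cssselect support are installed):

import urllib2
from lxml.html import parse

def get_charset(url, default='utf-8'):
    conn = urllib2.urlopen(url)
    # 1. Prefer the charset from the Content-Type HTTP header, when present
    charset = conn.headers.getparam('charset')
    if charset:
        return charset
    # 2. Otherwise fall back to the <meta http-equiv="Content-Type"> tag
    root = parse(conn).getroot()
    metas = root.cssselect('meta[http-equiv="Content-Type"]')
    if metas:
        content = metas[0].get('content', '')
        if 'charset=' in content:
            return content.split('charset=', 1)[1]
    return default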