Urllib2 getparam charset returns None for some sites
I've been struggling with this for a while. The following code snippet returns None
for some websites even though the character set is declared in the page's <meta> tag, so it doesn't seem like a reliable way to get the correct encoding of a web page.
import urllib2

conn = urllib2.urlopen(req)
charset = conn.headers.getparam('charset')
I've read several threads here on SO, and some of them suggest using chardet
, but I'd rather not import an additional module if I can avoid it. Instead, I'm planning to load just the head of the document and extract the encoding information with some string functions.
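Something along these lines is what I had in mind (just a rough sketch; it assumes the charset declaration shows up in the first couple of kilobytes of the page):

import re
import urllib2

conn = urllib2.urlopen(req)
head = conn.read(2048)  # only read the start of the document
match = re.search(r'charset=["\']?([\w-]+)', head)
charset = match.group(1) if match else None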
Does anyone have a better idea?
conn.headers.getparam('charset') does not parse the HTML content (the <meta> tag); it only looks at the HTTP response headers (such as Content-Type).
You can use an HTML parser to get the character encoding when it is not specified in the HTTP headers.
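For example, a rough sketch using only the standard library's HTMLParser could look like the following; the url variable and the CharsetSniffer class are just placeholders, and it only handles the two common <meta> forms:

from HTMLParser import HTMLParser
import urllib2

class CharsetSniffer(HTMLParser):
    # keeps the first charset declaration found in a <meta> tag
    def __init__(self):
        HTMLParser.__init__(self)
        self.charset = None

    def handle_starttag(self, tag, attrs):
        if tag != 'meta' or self.charset:
            return
        attrs = dict(attrs)
        if 'charset' in attrs:  # <meta charset="utf-8">
            self.charset = attrs['charset']
        elif (attrs.get('http-equiv') or '').lower() == 'content-type':
            content = attrs.get('content') or ''
            if 'charset=' in content:  # <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
                self.charset = content.split('charset=', 1)[1]

conn = urllib2.urlopen(url)
sniffer = CharsetSniffer()
sniffer.feed(conn.read())
print sniffer.charset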
Moving my comment here and posting it as an answer.
Thanks to @JF Sebastian, I was able to get the charset from the meta tag using the code snippet below:
import urllib2
from lxml.html import parse

conn = urllib2.urlopen(url)
site = parse(conn).getroot()
charset = site.cssselect('meta[http-equiv="Content-Type"]')[0].get('content').split("charset=", 1)[1]
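Note that the cssselect(...)[0] lookup raises an IndexError when the page doesn't declare the charset in that <meta> tag, so it's probably worth wrapping it and falling back to conn.headers.getparam('charset') in that case.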