Urllib2 getparam charset returns None for some sites
I've been struggling with this for a while. The following code snippet returns None
for some websites even though the character set is present in the page's <meta> tag, so it doesn't seem like a reliable way to get the correct encoding of a web page.
import urllib2

conn = urllib2.urlopen(req)  # req is a URL string or a urllib2.Request
charset = conn.headers.getparam('charset')  # None if the Content-Type header has no charset parameter
I've read several threads here on SO, and some of them mention using chardet
, but I'd rather not import an additional module if possible. Instead, I'm going to load just the header and extract the encoding information with some string functions.
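Something along these lines is what I had in mind (just a rough sketch; the 4096-byte read is an arbitrary guess at how much covers the <head>, and I've used a small regex instead of pure string slicing):

import re
import urllib2

conn = urllib2.urlopen(req)
head = conn.read(4096)  # the <head> section is normally within the first few KB
match = re.search(r'charset=["\']?([\w-]+)', head)
charset = match.group(1) if match else None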
Does anyone have a better idea?
conn.headers.getparam('charset') does not parse the HTML content (the <meta> tag); the charset it returns comes only from the HTTP headers (i.e. the Content-Type header).
You can use an HTML parser to get the character encoding when it isn't specified in the HTTP headers.
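For example, a minimal sketch using the standard-library HTMLParser (no third-party module required; it checks both the HTML5 <meta charset> form and the older http-equiv form, and omits error handling for malformed HTML):

import urllib2
from HTMLParser import HTMLParser

class CharsetSniffer(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.charset = None

    def handle_starttag(self, tag, attrs):
        if tag != 'meta' or self.charset:
            return
        attrs = dict(attrs)
        if 'charset' in attrs:  # HTML5: <meta charset="utf-8">
            self.charset = attrs['charset']
        elif attrs.get('http-equiv', '').lower() == 'content-type':
            # older form: <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
            content = attrs.get('content', '')
            if 'charset=' in content:
                self.charset = content.split('charset=', 1)[1].strip()

conn = urllib2.urlopen(url)
sniffer = CharsetSniffer()
sniffer.feed(conn.read())
print sniffer.charset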
Moving my comment here and posting it as an answer.
Thanks to @JF Sebastian, I could get the charset from the <meta> tag using the code snippet below:
from lxml.html import parse  # lxml provides parse() and cssselect()

conn = urllib2.urlopen(url)
site = parse(conn).getroot()
charset = site.cssselect('meta[http-equiv="Content-Type"]')[0].get('content').split("charset=", 1)[1]
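To cover both cases, a combined version might look something like this (a rough sketch, not a definitive solution: it prefers the HTTP header and only falls back to the <meta> tag, assuming lxml and its cssselect support are installed):

import urllib2
from lxml.html import parse

def get_charset(url, default='utf-8'):
    conn = urllib2.urlopen(url)
    # 1. Prefer the charset from the Content-Type HTTP header, when present
    charset = conn.headers.getparam('charset')
    if charset:
        return charset
    # 2. Otherwise fall back to the <meta http-equiv="Content-Type"> tag
    root = parse(conn).getroot()
    metas = root.cssselect('meta[http-equiv="Content-Type"]')
    if metas:
        content = metas[0].get('content', '')
        if 'charset=' in content:
            return content.split('charset=', 1)[1]
    return default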