Urllib2 getparam charset returns None for some sites

I've been struggling with this for a while. The following code snippet returns None for some websites even though the character set is declared in the page's meta tag, so it doesn't seem like a reliable way to get the correct encoding of a web page.

import urllib2

conn = urllib2.urlopen(req)  # req: a urllib2.Request built earlier
charset = conn.headers.getparam('charset')  # None when the Content-Type header carries no charset


I've read several threads here on SO, and some mention using chardet, but I don't want to import an additional module if possible. Instead, I plan to load just the beginning of the page and pull out the encoding information with some string functions, roughly along the lines of the sketch below.
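
A minimal sketch of that string-based idea (illustrative only; the url variable, the 2048-byte read size, and the regular expression are my own assumptions, not code from the question):

import re
import urllib2

conn = urllib2.urlopen(url)
head = conn.read(2048)  # read only the first couple of KB, which normally contains <head>
match = re.search(r'charset=["\']?([\w-]+)', head, re.IGNORECASE)
charset = match.group(1) if match else None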

Does anyone have a better idea?


2 answers


conn.headers.getparam('charset') does not parse the HTML content (the <meta> tag); it only looks at the HTTP headers (such as Content-Type).

You can use an HTML parser to get the character encoding if it is not specified in the HTTP headers.
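
A sketch of that fallback, assuming lxml (with cssselect support) is installed; the helper name get_charset and its url argument are illustrative, not part of the answer:

import urllib2
from lxml.html import parse

def get_charset(url):
    conn = urllib2.urlopen(url)
    # Prefer the charset declared in the HTTP Content-Type header, when present.
    charset = conn.headers.getparam('charset')
    if charset:
        return charset
    # Otherwise fall back to the <meta http-equiv="Content-Type"> tag in the page itself.
    root = parse(conn).getroot()
    for meta in root.cssselect('meta[http-equiv="Content-Type"]'):
        content = meta.get('content', '')
        if 'charset=' in content:
            return content.split('charset=', 1)[1].strip()
    return None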


Moving my comment here and posting it as an answer.

Thanks to @JF Sebastian, I was able to get the charset from the <meta> tag using the code snippet below:



import urllib2
from lxml.html import parse  # lxml's HTML parser

conn = urllib2.urlopen(url)
site = parse(conn).getroot()
charset = site.cssselect('meta[http-equiv="Content-Type"]')[0].get('content').split("charset=", 1)[1]
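
Note that this assumes the page declares its encoding with the older <meta http-equiv="Content-Type" ...> form; an HTML5-style <meta charset="..."> tag would need a separate selector (e.g. meta[charset]), and the [0] indexing raises an IndexError if no matching tag exists.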
