Is the implementation of response.info().getencoding() broken in urllib2?

Question

I would expect getencoding() to return "ISO-8859-1" in the following Python session:

>>> import urllib2
>>> response = urllib2.urlopen("http://www.google.com/")
>>> response.info().plist
['charset=ISO-8859-1']
>>> response.info().getencoding()
'7bit'

This is with Python 2.6 (specifically '2.6 (r26:66714, Aug 17 2009, 16:01:07) \n[GCC 4.0.1 (Apple Inc. build 5484)]').

python encoding urllib2

John 20 Aug 09 at 22:41

2 answers

Lennart Regebro · Answer 1 · 2009-08-21T11:05:34+0000

Well, what do you think is broken?

I get ISO-8859-2 with urllib2 and wget (I'm in Poland at the moment), and UTF-8 with Firefox. That is because my Firefox tells the site that it accepts ISO-8859-1 and UTF-8, while wget and urllib2 don't say anything about which charsets they accept. The relevant request header from Firefox:

Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7

Remove UTF-8 from that header and you no longer get UTF-8, which is easily verified by telnetting to port 80.
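The same experiment can be set up from Python instead of telnet. This is only a minimal sketch: the header value is an illustrative one that leaves out UTF-8, and whether a server honors it is up to the server. The request is built but not sent, so the snippet runs offline.

```python
try:
    from urllib2 import Request            # Python 2, as used in the question
except ImportError:
    from urllib.request import Request     # Python 3 equivalent

# Illustrative header: advertise ISO-8859-1 only, with no mention of UTF-8.
req = Request("http://www.google.com/",
              headers={"Accept-Charset": "ISO-8859-1"})

# Request stores header names via str.capitalize(), hence the lower-case "c".
accept_charset = req.get_header("Accept-charset")
print(accept_charset)

# To actually send it, pass req to urlopen() and inspect response.info().
```

Passing `req` to `urlopen()` and comparing the Content-Type of the response with and without the header is one way to repeat the comparison described above.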

google.com simply (and sanely) defaults to ISO-8859-1 and google.pl to ISO-8859-2, and I'm sure other sites have other defaults.

I don't get any encoding header with wget, urllib2, or telnet. I think urllib2 then assumes 7bit, which may be somewhat nonsensical, since Content-Encoding is usually either gzip or nothing.

zhangyoufu · Answer 2 · 2013-06-17T14:40:27+0000

According to the documentation:

Message.getencoding()

Return the encoding specified in the Content-Transfer-Encoding message header. If no such header exists, return '7bit'. The encoding is converted to lower case.
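The difference between the two headers can be reproduced offline. The sketch below is an illustration, not urllib2's own code: it parses a hand-written header block (made up to match the session in the question) with the standard `email` package and reads the Content-Transfer-Encoding separately from the charset parameter of Content-Type.

```python
from email import message_from_string

# Hand-written headers mirroring the question: a charset, no transfer encoding.
raw_headers = "Content-Type: text/html; charset=ISO-8859-1\n\n"
msg = message_from_string(raw_headers)

# What getencoding() is documented to report: Content-Transfer-Encoding,
# lower-cased, or '7bit' when the header is absent.
transfer_encoding = msg.get("Content-Transfer-Encoding", "7bit").lower()

# The character set lives in the Content-Type parameters instead.
charset = msg.get_content_charset()

print(transfer_encoding)  # 7bit
print(charset)            # iso-8859-1
```

On the Python 2 object returned by `response.info()`, `getparam('charset')` should be the analogous way to read the charset, since that object is a `mimetools.Message` subclass.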
