Is the implementation of response.info().getencoding() broken in urllib2?

I would expect the getencoding() output in the following Python session to be "ISO-8859-1":

>>> import urllib2
>>> response = urllib2.urlopen("http://www.google.com/")
>>> response.info().plist
['charset=ISO-8859-1']
>>> response.info().getencoding()
'7bit'

      

This is with Python 2.6 (specifically '2.6 (r26:66714, Aug 17, 2009, 16:01:07) \n[GCC 4.0.1 (Apple Inc. build 5484)]').



2 answers


Well, what do you think is broken?

I get ISO-8859-2 with urllib and wget (I'm in Poland at the moment), and UTF-8 with Firefox. This is because Firefox tells the site that it accepts ISO-8859-1 and UTF-8, while wget and urllib2 don't say anything. The relevant request header:

Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7

      



Remove UTF-8 from that header and you no longer get UTF-8, which is easily verified by telnetting to port 80.

google.com simply (and sanely) defaults to ISO-8859-1 and google.pl to ISO-8859-2, and I'm sure there are other defaults for other sites.

I don't get any encoding header with wget, urllib2, or telnet. I guess urllib2 then assumes 7bit, which may not be very sensible, since Content-Encoding is usually either gzip or nothing.
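To experiment with this from Python rather than telnet, a request can carry its own Accept-Charset header. The following is only a sketch: it uses Python 3's urllib.request (the successor to urllib2), the helper names build_request and fetch_charset are made up for illustration, and whether a given server honors the header is not guaranteed.

```python
from urllib.request import Request, urlopen


def build_request(url, accept_charset=None):
    """Build a request, optionally advertising the charsets we accept.

    Hypothetical helper, not part of urllib.
    """
    headers = {}
    if accept_charset is not None:
        headers["Accept-Charset"] = accept_charset
    return Request(url, headers=headers)


def fetch_charset(request):
    """Send the request and return the charset named in Content-Type.

    Hypothetical helper; needs network access, so it is not called here.
    """
    with urlopen(request) as response:
        return response.headers.get_content_charset()


with_header = build_request(
    "http://www.google.com/",
    accept_charset="ISO-8859-1,utf-8;q=0.7,*;q=0.7",
)
without_header = build_request("http://www.google.com/")

# Request normalizes header names with str.capitalize(), hence "Accept-charset".
print(with_header.get_header("Accept-charset"))
print(without_header.has_header("Accept-charset"))
```

Calling fetch_charset on each request and comparing the results is one way to check whether the header changes the charset a server picks.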



According to the documentation:

Message.getencoding()

Return the encoding specified in the Content-Transfer-Encoding message header. If no such header exists, return '7bit'. The encoding is converted to lower case.
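So getencoding() answers a different question from the one being asked: it reports the transfer encoding, not the character set. The difference can be sketched without any network access. This is an illustration under assumptions: it uses Python 3's email package rather than the Python 2.6 classes from the question, and transfer_encoding is a hypothetical helper that only mimics the documented behavior quoted above.

```python
from email import message_from_string


def transfer_encoding(msg):
    """Mimic the documented getencoding() behavior (hypothetical helper):
    the Content-Transfer-Encoding value in lower case, or '7bit' if absent."""
    return (msg.get("Content-Transfer-Encoding") or "7bit").lower()


# Headers resembling the response in the question: a charset, no transfer encoding.
plain = message_from_string("Content-Type: text/html; charset=ISO-8859-1\n\n")

# The same headers plus an explicit transfer encoding.
encoded = message_from_string(
    "Content-Type: text/html; charset=ISO-8859-1\n"
    "Content-Transfer-Encoding: Base64\n\n"
)

print(transfer_encoding(plain))     # 7bit
print(transfer_encoding(encoded))   # base64
print(plain.get_content_charset())  # iso-8859-1
```

In the Python 2.6 session from the question, the charset should be available through response.info().getparam('charset'), since the object returned by info() derives from mimetools.Message.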
