Is the implementation of response.info().getencoding() broken in urllib2?
I would expect getencoding() in the following Python session to return "ISO-8859-1":
>>> import urllib2
>>> response = urllib2.urlopen("http://www.google.com/")
>>> response.info().plist
['charset=ISO-8859-1']
>>> response.info().getencoding()
'7bit'
This is with Python 2.6 (specifically '2.6 (r26:66714, Aug 17 2009, 16:01:07) \n[GCC 4.0.1 (Apple Inc. build 5484)]').
Well, what do you think is broken?
I get ISO-8859-2 with urllib2 and wget (I'm in Poland at the moment), and UTF-8 with Firefox. This is because my Firefox tells the site that it accepts ISO-8859-1 and UTF-8, while wget and urllib2 don't state any charset preference. The relevant request header:
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Remove UTF-8 from that header and you no longer get UTF-8; this is easily verified by telnetting to port 80.
google.com simply (and sanely) defaults to ISO-8859-1 and google.pl to ISO-8859-2, and I'm sure there are other defaults for other sites.
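To try the same thing from Python rather than telnet, the header can be attached to the request explicitly. This is only a sketch: build_request is a hypothetical helper name, the import fallback is there so the snippet runs on both Python 2 and 3, and nothing is actually sent until the request is passed to urlopen.

```python
try:
    from urllib.request import Request  # Python 3
except ImportError:
    from urllib2 import Request  # Python 2, as used in the question

# The header value quoted above, as sent by Firefox.
ACCEPT_CHARSET = "ISO-8859-1,utf-8;q=0.7,*;q=0.7"


def build_request(url, accept_charset=ACCEPT_CHARSET):
    """Hypothetical helper: a Request that states a charset preference."""
    return Request(url, headers={"Accept-Charset": accept_charset})


request = build_request("http://www.google.com/")

# Request stores header names via str.capitalize(), hence "Accept-charset".
sent_value = request.get_header("Accept-charset")
```

Passing request to urlopen() would send the header along; which charset the server then picks is up to the server.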
I don't get any encoding header with wget, urllib2, or telnet. I think urllib2 then assumes 7bit, which may not be very sensible, since Content-Encoding is usually either gzip or nothing.
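For what it's worth, in Python 2 response.info() is a mimetools.Message, and its getencoding() reports the Content-Transfer-Encoding header (falling back to '7bit' when the header is absent), not the charset; the charset is available via response.info().getparam('charset'). The sketch below shows the same distinction offline, using the standard email package as a stand-in for mimetools. RAW_HEADERS is made-up sample data mirroring the plist in the question, and describe is a hypothetical helper name.

```python
from email import message_from_string

# Made-up header block for illustration; note there is no
# Content-Transfer-Encoding header in it.
RAW_HEADERS = (
    "Content-Type: text/html; charset=ISO-8859-1\n"
    "\n"
)


def describe(raw_headers):
    """Return (charset, transfer_encoding) for a block of headers."""
    msg = message_from_string(raw_headers)
    # The charset is a parameter of Content-Type (returned lowercased).
    charset = msg.get_content_charset()
    # The transfer encoding is a separate header; default to '7bit'
    # when it is absent, as mimetools' getencoding() does.
    transfer_encoding = msg.get("Content-Transfer-Encoding", "7bit")
    return charset, transfer_encoding


charset, transfer_encoding = describe(RAW_HEADERS)
```

So a '7bit' result says nothing about the page's character set; it only means no transfer encoding was declared.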