How to encode Chinese characters as "gbk" in JSON to format a URL query string parameter?
I want to dump a dict that contains some Chinese characters as a JSON string and use that string as a URL request parameter.
Here is my Python code:
import httplib
import simplejson as json
import urllib
d={
"key":"上海",
"num":1
}
jsonStr = json.dumps(d,encoding='gbk')
url_encode=urllib.quote_plus(jsonStr)
conn = httplib.HTTPConnection("localhost",port=8885)
conn.request("GET","/?json="+url_encode)
res = conn.getresponse()
What I expected in the query string:
GET /?json=%7B%22num%22%3A+1%2C+%22key%22%3A+%22%C9%CF%BA%A3%22%7D
Here "%C9%CF%BA%A3" represents "上海" encoded as 'gbk' in the URL.
But I got this:
GET /?json=%7B%22num%22%3A+1%2C+%22key%22%3A+%22%5Cu6d93%5Cu5a43%5Cu6363%22%7D
Here %5Cu6d93%5Cu5a43%5Cu6363 is some other representation of the Chinese characters "上海".
I also tried dumping the JSON with the option ensure_ascii=False:
jsonStr = json.dumps(d,ensure_ascii=False,encoding='gbk')
but that doesn't work.
So how can I do this? Thanks.
You almost had it with ensure_ascii=False. This works:
jsonStr = json.dumps(d, encoding='gbk', ensure_ascii=False).encode('gbk')
You need to tell json.dumps() that the strings it will read are GBK and that it should not try to ASCII-fy them. Then you have to specify the output encoding again, because json.dumps() has no separate option for that.
This solution is similar to another answer here: fooobar.com/questions/31075 / ...
So this does what you want, although I should point out that the standard for URIs seems to say they should be in UTF-8 whenever possible. For more details see here: fooobar.com/questions/2169531 / ...
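As a rough illustration of the approach above, here is a minimal sketch. It makes two assumptions that are not in the answer: it starts from a Unicode string instead of a GBK byte string (so it does not depend on how the source file was saved, and the encoding argument is left out), and it uses the standard json module with a fallback import so that it runs on both Python 2 and Python 3. The variable names are made up for the example.

```python
# -*- coding: utf-8 -*-
import json

try:
    from urllib import quote_plus        # Python 2
except ImportError:
    from urllib.parse import quote_plus  # Python 3

# Assumption: the value is already a Unicode string (u"上海").
d = {"key": u"\u4e0a\u6d77"}

# Keep the Chinese characters as-is instead of \uXXXX escapes ...
json_text = json.dumps(d, ensure_ascii=False)

# ... then pick the byte encoding explicitly before percent-encoding.
gbk_bytes = json_text.encode("gbk")
query_value = quote_plus(gbk_bytes)

print(query_value)  # %7B%22key%22%3A+%22%C9%CF%BA%A3%22%7D
```

The %C9%CF%BA%A3 part is the GBK form of "上海" that the question asked for.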
"key":"上海",
You saved your source file as UTF-8, so this is the byte string '\xe4\xb8\x8a\xe6\xb5\xb7'.
jsonStr = json.dumps(d,encoding='gbk')
JSON only supports Unicode strings. The encoding parameter can be used to make json.dumps accept byte strings, automatically decoding them to Unicode using the given encoding.
However, the byte string's encoding is actually UTF-8, not 'gbk', so json.dumps decodes it incorrectly, giving u'涓婃捣'. It then produces the incorrect JSON output "\u6d93\u5a43\u6363", which gets URL-encoded to %22%5Cu6d93%5Cu5a43%5Cu6363%22.
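The mis-decoding described above can be reproduced in isolation. This is only a sketch: it builds the UTF-8 bytes explicitly instead of relying on a source-file encoding, and the variable names are invented for the example.

```python
import json

# The UTF-8 bytes of u"上海", i.e. what a Python 2 byte-string literal
# holds when the source file is saved as UTF-8.
utf8_bytes = u"\u4e0a\u6d77".encode("utf-8")

# Decoding those six bytes as GBK pairs them up differently and yields
# three unrelated characters instead of the original two.
misread = utf8_bytes.decode("gbk")

# json.dumps then faithfully escapes the wrong characters.
wrong_json = json.dumps(misread)
print(wrong_json)  # "\u6d93\u5a43\u6363"
```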
To fix this, make the input to json.dumps a proper Unicode (u'') string:
# coding: utf-8
d = {
"key": u"上海", # or u'\u4e0a\u6d77' if you don't want to rely on the coding decl
"num":1
}
jsonStr = json.dumps(d)
...
This will give you the JSON "\u4e0a\u6d77", URL-encoded as %22%5Cu4e0a%5Cu6d77%22.
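A small check of that result, as a sketch only: it uses the standard json module and a fallback import so that it runs on both Python 2 and Python 3 (both are assumptions beyond the code above), and it also confirms that a JSON parser turns the escapes back into the original characters.

```python
import json

try:
    from urllib import quote_plus        # Python 2
except ImportError:
    from urllib.parse import quote_plus  # Python 3

# With the default ensure_ascii=True, non-ASCII characters become
# \uXXXX escapes, so the JSON text is plain ASCII.
escaped = json.dumps(u"\u4e0a\u6d77")
encoded = quote_plus(escaped)
print(encoded)  # %22%5Cu4e0a%5Cu6d77%22

# Any JSON parser restores the original characters from the escapes,
# with no byte encoding to agree on.
restored = json.loads(escaped)
```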
If you really don't want the \u escapes in your JSON, you can indeed use ensure_ascii=False and then .encode() the output before URL-encoding it. But I wouldn't recommend it, as you then have to worry about which encoding the target application expects for URL parameters, which is a source of some pain. The \u version is accepted by all JSON parsers and is usually not much longer after URL-encoding.