How do I encode Chinese characters as GBK in JSON to build a URL query string parameter?

I want to dump a dict that contains some Chinese characters to a JSON string, and then use that string as a URL request parameter.

Here is my Python code:

import httplib
import simplejson as json
import urllib

d={
  "key":"上海",
  "num":1
}

jsonStr = json.dumps(d,encoding='gbk')
url_encode=urllib.quote_plus(jsonStr)

conn = httplib.HTTPConnection("localhost",port=8885)
conn.request("GET","/?json="+url_encode)
res = conn.getresponse()

      

What I expected from the query string:

GET /?json=%7B%22num%22%3A+1%2C+%22key%22%3A+%22%C9%CF%BA%A3%22%7D
                                                ------------
                                                     |
                                                     V
                       "%C9%CF%BA%A3" is "上海" encoded as GBK and percent-encoded for the URL.

      

But I got this:

GET /?json=%7B%22num%22%3A+1%2C+%22key%22%3A+%22%5Cu6d93%5Cu5a43%5Cu6363%22%7D
                                                ------------------------
                                                         |
                                                         v
           %5Cu6d93%5Cu5a43%5Cu6363 is some other encoding of the Chinese characters "上海"

      

I also tried dumping the JSON with the option ensure_ascii=False:

jsonStr = json.dumps(d,ensure_ascii=False,encoding='gbk')

      

But that doesn't work either.

So how can I do this? Thanks.

+3



2 answers


You almost had it with ensure_ascii=False. This works:

jsonStr = json.dumps(d, encoding='gbk', ensure_ascii=False).encode('gbk')

      

You need to tell json.dumps() that the byte strings it will read are GBK and that it should not try to ASCII-fy them. Then you have to specify the output encoding again, because json.dumps() has no separate option for that.
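For readers on Python 3, where strings are already Unicode and json.dumps() has no encoding argument, the same idea might look like the sketch below. It assumes the standard json module in place of simplejson and urllib.parse.quote_plus in place of urllib.quote_plus. The key order in the dict is chosen only so the result lines up with the expected query string in the question.

```python
import json
from urllib.parse import quote_plus

# Key order picked to match the question's expected output (assumption).
d = {"num": 1, "key": "上海"}

# Keep the Chinese characters as-is instead of \uXXXX escapes,
# then encode the whole JSON text to GBK bytes.
json_bytes = json.dumps(d, ensure_ascii=False).encode("gbk")

# quote_plus accepts bytes and percent-encodes them directly.
query_value = quote_plus(json_bytes)
print("/?json=" + query_value)
```

The important part is that the percent-encoding step is given GBK bytes rather than a text string, so no second, implicit encoding gets applied.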



This solution is similar to another answer here: fooobar.com/questions/31075 / ...

So this does what you want, although I should point out that the standard for URIs seems to say they should be in UTF-8 whenever possible. For more details see here: fooobar.com/questions/2169531 / ...

+2



"key":"上海",

      

You saved your source file as UTF-8, so this is the byte string '\xe4\xb8\x8a\xe6\xb5\xb7'.

jsonStr = json.dumps(d,encoding='gbk')

      

The JSON format only supports Unicode strings. The encoding parameter can be used to make json.dumps accept byte strings, automatically decoding them to Unicode using the given encoding.

However, the byte string's encoding is actually UTF-8, not 'gbk', so json.dumps decodes it incorrectly, giving u'涓婃捣'. It then produces the incorrect JSON output "\u6d93\u5a43\u6363", which gets URL-encoded to %22%5Cu6d93%5Cu5a43%5Cu6363%22.
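The mis-decoding can be reproduced on its own. This is a small sketch assuming Python 3, where the bytes have to be produced explicitly because string literals are already Unicode. The variable names are only illustrative.

```python
import json

# What a UTF-8 source file stores for the literal "上海".
utf8_bytes = "上海".encode("utf-8")

# Decoding those UTF-8 bytes with the wrong codec yields different characters.
mojibake = utf8_bytes.decode("gbk")

# json.dumps then faithfully escapes the wrong characters.
escaped = json.dumps(mojibake)
print(repr(utf8_bytes), mojibake, escaped)
```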



To fix this, make the input to json.dumps a proper Unicode (u'') string:

# coding: utf-8

d = {
    "key": u"上海",  # or u'\u4e0a\u6d77' if you don't want to rely on the coding decl
    "num":1
}
jsonStr = json.dumps(d)
...

      

This will give you the JSON "\u4e0a\u6d77", which is URL-encoded to %22%5Cu4e0a%5Cu6d77%22.
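As a rough check of that round trip, here is a sketch assuming Python 3 (urllib.parse instead of urllib, the standard json module instead of simplejson). It encodes only the string value rather than the whole dict, to keep the output short.

```python
import json
from urllib.parse import quote_plus, unquote_plus

value = "上海"

# The default ensure_ascii=True emits \uXXXX escapes, so the JSON text is pure ASCII.
encoded = quote_plus(json.dumps(value))

# Receiver side: undo the URL encoding, then parse the JSON.
decoded = json.loads(unquote_plus(encoded))
print(encoded, decoded)
```

Because the JSON text is plain ASCII, neither side has to agree on a character encoding for the URL.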

If you really don't want the \u escapes in your JSON, you can indeed use ensure_ascii=False and then .encode() the output before URL-encoding. But I would not recommend it, as you then have to worry about which encoding the target application expects in its URL parameters, which is a source of some pain. The \u version is accepted by all JSON parsers and is usually not much longer after URL-encoding.
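To illustrate the kind of pain meant here, the sketch below (assuming Python 3 and urllib.parse.unquote_plus on the receiving side) decodes a GBK percent-encoded value twice: once with the matching codec and once with the library default.

```python
from urllib.parse import unquote_plus

# The JSON string "上海" as GBK bytes, percent-encoded.
raw = "%22%C9%CF%BA%A3%22"

# The receiver gets the right text only if it knows the bytes are GBK.
as_gbk = unquote_plus(raw, encoding="gbk")

# unquote_plus defaults to encoding="utf-8", errors="replace",
# so the same input silently turns into replacement characters.
as_default = unquote_plus(raw)
print(as_gbk, as_default)
```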

0

