Can't parse plain JSON with Python

I have a very simple JSON string that I am unable to parse with the simplejson module. Reproduction:

import simplejson as json
json.loads(r'{"translatedatt1":"Vari\351es"}')

      

Result:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/pymodules/python2.5/simplejson/__init__.py", line 307, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/pymodules/python2.5/simplejson/decoder.py", line 335, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/pymodules/python2.5/simplejson/decoder.py", line 351, in raw_decode
    obj, end = self.scan_once(s, idx)
ValueError: Invalid \escape: line 1 column 23 (char 23)

      

Does anyone know what is wrong and how to parse the json above correctly?

The string that is encoded there is: Variées

P.S. I am using Python 2.5.

Thank you so much!



2 answers


The error is entirely correct: Vari\351es contains an invalid escape sequence. The JSON standard does not allow a \ followed by only digits.
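As a quick check of that rule, here is a minimal sketch. It assumes Python 3 and the standard-library json module (rather than simplejson on Python 2.5 as in the question), and the helper name parses is made up for illustration. It feeds the parser one string with a permitted \uXXXX escape and one with the digit-only escape from the question.

```python
import json  # standard-library json; assumed to follow the same escape rules as simplejson

# JSON permits only these escapes inside a string: \" \\ \/ \b \f \n \r \t \uXXXX
valid = r'{"translatedatt1":"Vari\u00e9es"}'
# A backslash followed by digits is not among them
invalid = r'{"translatedatt1":"Vari\351es"}'

def parses(text):
    """Return True if the JSON parser accepts text (hypothetical helper)."""
    try:
        json.loads(text)
        return True
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return False

valid_ok = parses(valid)      # the \u00e9 form is accepted
invalid_ok = parses(invalid)  # the \351 form is rejected
```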

Whatever code produces this output needs to be fixed. If that is not possible, you will need to use a regular expression to either remove those escapes or replace them with valid ones.

If we interpret the number 351 as octal, it refers to the Unicode code point U+00E9, the character é (LATIN SMALL LETTER E WITH ACUTE). You can "repair" your JSON input with:

import re

invalid_escape = re.compile(r'\\[0-7]{1,6}')  # up to 6 digits for codepoints up to FFFF

def replace_with_codepoint(match):
    return unichr(int(match.group(0)[1:], 8))


def repair(brokenjson):
    return invalid_escape.sub(replace_with_codepoint, brokenjson)

      



Using repair(), your example can be loaded:

>>> json.loads(repair(r'{"translatedatt1":"Vari\351es"}'))
{u'translatedatt1': u'Vari\xe9es'}

      

You may need to adjust the interpretation of the code points; I chose octal (because Variées is an actual word), but you should test this further with other code points.
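For readers on Python 3, where unichr no longer exists, a variant of repair() might look like the sketch below. It is not the code from this answer: the names repair_to_escapes and _to_json_escape are made up, and it rewrites each octal escape into a valid \uXXXX JSON escape instead of inserting the character directly. It also leaves a legitimately escaped backslash (\\) alone. It still assumes the digits are octal, and a literal digit 0-7 immediately after an escape would be swallowed into it.

```python
import json
import re

# Match either an escaped backslash (kept as-is) or a backslash
# followed by 1-6 octal digits (the invalid escape to rewrite).
_escape = re.compile(r'\\\\|\\([0-7]{1,6})')

def _to_json_escape(match):
    digits = match.group(1)
    if digits is None:
        # The match was an escaped backslash, which is valid JSON already
        return match.group(0)
    codepoint = int(digits, 8)  # assumption: the digits are octal
    if codepoint > 0xFFFF:
        # A single \uXXXX escape cannot express this value
        raise ValueError('code point out of range: %s' % match.group(0))
    return '\\u%04x' % codepoint

def repair_to_escapes(brokenjson):
    """Rewrite octal-looking escapes as \\uXXXX escapes (hypothetical helper)."""
    return _escape.sub(_to_json_escape, brokenjson)

fixed = repair_to_escapes(r'{"translatedatt1":"Vari\351es"}')
parsed = json.loads(fixed)
```

Producing \uXXXX escapes keeps the repaired text pure ASCII, so it can be passed on or stored without worrying about its encoding.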



You probably did not mean to use a raw string, but a unicode string?

>>> import simplejson as json
>>> json.loads(u'{"translatedatt1":"Vari\351es"}')
{u'translatedatt1': u'Vari\xe9es'}
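The difference comes from Python itself, not from the JSON parser. A small sketch, assuming Python 3 (where every str is unicode and the u prefix is optional) and the standard-library json module:

```python
import json

raw = r'Vari\351es'    # raw literal: the backslash and the digits 351 are kept
cooked = 'Vari\351es'  # normal literal: Python reads \351 as an octal escape for é

raw_length = len(raw)        # V a r i \ 3 5 1 e s
cooked_length = len(cooked)  # V a r i é e s

# With a normal literal, the JSON parser never sees a backslash at all
parsed = json.loads('{"translatedatt1":"Vari\351es"}')
```

So this only helps when the JSON text is a literal in your own source code. Data read from a file or a socket still contains the backslash and will fail to parse.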

      

If you want to escape data inside the JSON string, you need to use \uNNNN:



>>> json.loads(r'{"translatedatt1":"Vari\u351es"}')
{'translatedatt1': u'Vari\u351es'}

      

Note that the resulting dict is slightly different in this case. When parsing a unicode string, simplejson uses unicode strings for keys. Otherwise, it uses byte strings for keys.

If your JSON data really does use \351e, then it is simply broken and not valid JSON.
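For comparison, this is roughly what a well-behaved producer would emit for the same data. The sketch assumes Python 3 and the standard-library json module; on the Python 2.5 setup from the question, simplejson's dumps should behave much the same, but that is not verified here.

```python
import json

data = {"translatedatt1": "Vari\u00e9es"}  # the é is written as a Python escape here

# ensure_ascii defaults to True, so non-ASCII characters become \uXXXX escapes
encoded = json.dumps(data)

# Parsing the encoded text gives back the original dict
decoded = json.loads(encoded)
```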
