In what world will \u00c3\u00a9 become é?

I have an incorrectly encoded JSON document from a source I have no control over, which contains the following lines:

d\u00c3\u00a9cor

business\u00e2\u20ac\u2122 active accounts 

the \u00e2\u20ac\u0153Made in the USA\u00e2\u20ac\u009d label


From this I can surmise that \u00c3\u00a9 is intended to become é, which would be UTF-8 hex C3 A9. That makes sense. For the others, I assume we are dealing with some kind of smart quotes.
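
For example, this little round trip reproduces exactly that (a quick sketch of my theory, assuming the producer decoded the UTF-8 bytes with CP1252 before serializing to JSON):

>>> print repr(u'\xe9'.encode('utf8'))
'\xc3\xa9'
>>> print u'\xe9'.encode('utf8').decode('cp1252')
Ã©

and Ã© is precisely what a JSON serializer escapes as \u00c3\u00a9.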

My theory here is that it is either using some encoding I have never encountered before, or that it has been double-encoded in some way. I am fine with writing some code to convert their broken output into something I can understand, as it is unlikely that they would fix the system even if I brought the problem to their attention.

Any ideas on how to convert this output into something I can understand? For the record, I'm working in Python.





2 answers


You should try the ftfy module:



>>> import ftfy
>>> print ftfy.ftfy(u"d\u00c3\u00a9cor")
décor
>>> print ftfy.ftfy(u"business\u00e2\u20ac\u2122 active accounts")
business' active accounts
>>> print ftfy.ftfy(u"the \u00e2\u20ac\u0153Made in the USA\u00e2\u20ac\u009d label")
the "Made in the USA" label
>>> print ftfy.ftfy(u"the \u00e2\u20ac\u0153Made in the USA\u00e2\u20ac\u009d label", uncurl_quotes=False)
the “Made in the USA” label






You have Mojibake data here: UTF-8 bytes that were decoded with the wrong codec.

The trick is to figure out which encoding was used to decode the bytes before the JSON output was generated. The first two samples can be repaired if you assume that encoding was Windows codepage 1252:

>>> sample = u'''\
... d\u00c3\u00a9cor
... business\u00e2\u20ac\u2122 active accounts 
... the \u00e2\u20ac\u0153Made in the USA\u00e2\u20ac\u009d label
... '''.splitlines()
>>> print sample[0].encode('cp1252').decode('utf8')
décor
>>> print sample[1].encode('cp1252').decode('utf8')
business’ active accounts


but this codec doesn't work for the third one:

>>> print sample[2].encode('cp1252').decode('utf8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/mpieters/Development/venvs/stackoverflow-2.7/lib/python2.7/encodings/cp1252.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_table)
UnicodeEncodeError: 'charmap' codec can't encode character u'\x9d' in position 24: character maps to <undefined>


The first 3 'strange' bytes are of course the CP1252 Mojibake for the U+201C LEFT DOUBLE QUOTATION MARK codepoint:

>>> sample[2]
u'the \xe2\u20ac\u0153Made in the USA\xe2\u20ac\x9d label'
>>> sample[2][:22].encode('cp1252').decode('utf8')
u'the \u201cMade in the USA'




so the other combination is presumably meant to be U+201D RIGHT DOUBLE QUOTATION MARK, but the last character results in a UTF-8 byte that is not normally present in CP1252:

>>> u'\u201d'.encode('utf8').decode('cp1252')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/mpieters/Development/venvs/stackoverflow-2.7/lib/python2.7/encodings/cp1252.py", line 15, in decode
    return codecs.charmap_decode(input,errors,decoding_table)
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 2: character maps to <undefined>


This is because the CP1252 codec has nothing at the hex position 9D, yet that codepoint did end up in the JSON output:

>>> sample[2][22:]
u'\xe2\u20ac\x9d label'


The ftfy library that Ned so helpfully pointed out works around this problem by using a 'sloppy' CP1252 codec, mapping the non-existent bytes one-to-one (UTF-8 byte to Latin-1 Unicode codepoint). The resulting 'fancy quotes' are then mapped to ASCII quotes by the library, but you can turn that off:

>>> import ftfy
>>> ftfy.fix_text(sample[2])
u'the "Made in the USA" label'
>>> ftfy.fix_text(sample[2], uncurl_quotes=False)
u'the \u201cMade in the USA\u201d label'
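
You can also use that sloppy codec directly; importing ftfy.bad_codecs registers it under the name 'sloppy-windows-1252' (a sketch; the one-to-one mapping of the missing 9D byte is what makes the closing quote recoverable):

>>> import ftfy.bad_codecs  # the import itself registers the extra codecs
>>> print sample[2].encode('sloppy-windows-1252').decode('utf8')
the “Made in the USA” label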


Since this library automates the task for you, and does a better job at it than the standard Python codecs can do for you here, you should just install it and apply it to the mess this API serves you. Do not hesitate to berate the people who hand you this data, however, if you get half a chance. They have produced a right mess.
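
Applying it to the decoded JSON could look something like this (a minimal Python 2 sketch; the fix_strings helper and the recursive walk are my own illustration, not part of ftfy):

>>> import json, ftfy
>>> def fix_strings(obj):
...     # recursively run ftfy.fix_text over every string in the decoded JSON
...     if isinstance(obj, unicode):
...         return ftfy.fix_text(obj)
...     if isinstance(obj, list):
...         return [fix_strings(o) for o in obj]
...     if isinstance(obj, dict):
...         return dict((fix_strings(k), fix_strings(v)) for k, v in obj.items())
...     return obj
...
>>> data = json.loads('{"label": "the \u00e2\u20ac\u0153Made in the USA\u00e2\u20ac\u009d label"}')
>>> fix_strings(data)
{u'label': u'the "Made in the USA" label'}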









