Unicode latin1 encoding / decoding

When migrating data from an unknown, old, or inconsistent MySQL database to a UTF-8 Postgres database using a Python ORM (Django), I sometimes get incorrectly encoded data.

Target: grégory

> a
u'gr\xe3\xa9gory'

> print a
grã©gory

      

I've tried several encode / decode combinations with no success:

 > print a.encode('utf-8').decode('latin1')
 grã©gory

 > print a.decode('latin-1')
UnicodeEncodeError: 'ascii' codec can't encode characters in position 2-3: ordinal not in range(128)

      

Even attempts with unicode_escape did not help.

+3




3 answers


My guess is that the string was incorrectly converted to lowercase at some point, changing \xc3 to \xe3 (in latin1, \xc3 is Ã and \xe3 is ã). The lowercase conversion assumed latin1 encoding when the data was actually utf-8.



>>> print 'gr\xc3\xa9gory'.decode('utf8')
grégory
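
To see how such a value could arise, here is a minimal sketch of the suspected corruption, written for Python 3 rather than the Python 2 used elsewhere in this thread. The variable names are illustrative, and the scenario (some layer reading UTF-8 bytes as Latin-1 text and then lowercasing them) is an assumption, not something confirmed for the original database.

```python
# Assumed scenario: UTF-8 bytes were misread as Latin-1 and then lowercased.
original = "gr\u00e9gory"                    # the target name, with U+00E9

utf8_bytes = original.encode("utf-8")        # b'gr\xc3\xa9gory'
misread = utf8_bytes.decode("latin-1")       # \xc3 is read as U+00C3, \xa9 as U+00A9
corrupted = misread.lower()                  # U+00C3 is lowercased to U+00E3

# Same value as the u'gr\xe3\xa9gory' shown in the question.
print(corrupted == "gr\xe3\xa9gory")
```

The continuation byte \xa9 survives because the copyright sign has no lowercase form; only the lead byte \xc3 is altered.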

      

+7




Since the problem was lower(), I could fix it like this:



print a.upper().encode('latin1').lower()
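
The same idea can be sketched for Python 3, where text and bytes are separate types. This is only an illustration, assuming the damage is exactly a Latin-1 lowercasing of UTF-8 bytes; `repair_lowercased_utf8` is a hypothetical helper name, and the original letter case cannot be recovered.

```python
def repair_lowercased_utf8(text):
    """Best-effort repair; returns the input unchanged if it does not fit the pattern."""
    try:
        # upper() restores lead bytes such as \xc3 that lower() turned into \xe3.
        raw = text.upper().encode("latin-1")
        # Reinterpret the bytes as UTF-8, then lowercase the properly decoded text.
        return raw.decode("utf-8").lower()
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not representable in Latin-1, or not valid UTF-8 afterwards: leave it alone.
        return text


print(repair_lowercased_utf8("gr\xe3\xa9gory"))
```

Note that the helper lowercases everything it manages to round-trip, so it is best suited to data already known to have been lowercased. A correctly stored accented value such as "café" fails the UTF-8 decode and is returned as-is.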

      

-2




Try the following:

print a.decode('latin1')
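
For reference, the question already shows this call failing with UnicodeEncodeError. In Python 2, calling .decode() on a unicode object first encodes it implicitly with the default ASCII codec, and that hidden step is what fails. A rough Python 3 sketch of that implicit step follows; Python 3 strings have no .decode(), so the encode is written out explicitly, and the variable names are illustrative.

```python
a = "gr\xe3\xa9gory"
error = None

try:
    # Python 2 performs this ASCII encode implicitly before decoding.
    a.encode("ascii").decode("latin-1")
except UnicodeEncodeError as exc:
    error = exc
    print(exc)
```

The failure comes from the two non-ASCII characters at positions 2 and 3, so no choice of codec passed to .decode() would change the outcome.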

      

-6

