Python string encoding - ASCII in unicode string; how to remove this 'u'?

When I use the Python module "pygoogle" with Chinese text, I get a URL like u'http://zh.wikipedia.org/zh/\xe6\xb1\x89\xe8\xaf\xad'

It is a unicode object, but its contents look like raw UTF-8 bytes. I tried to encode it back to UTF-8, but the byte values change as well:

>>> a = u'http://zh.wikipedia.org/zh/\xe6\xb1\x89\xe8\xaf\xad'
>>> a.encode('utf-8')
'http://zh.wikipedia.org/zh/\xc3\xa6\xc2\xb1\xc2\x89\xc3\xa8\xc2\xaf\xc2\xad'

      

I also tried:

str(a)

      

but I got an error:

UnicodeEncodeError: 'ascii' codec can't encode characters in position 27-32: ordinal not in range(128)

      

How can I encode it to remove the 'u'?

By the way, without the 'u' prefix I get the correct result, e.g.:

>>> s = 'http://zh.wikipedia.org/zh/\xe6\xb1\x89\xe8\xaf\xad'
>>> print s
http://zh.wikipedia.org/zh/汉语

      



1 answer


You have a Mojibake; in this case, UTF-8 bytes were decoded as if they were Latin-1 bytes.

To reverse the process, encode to Latin-1 again:
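As a rough illustration of how this kind of damage comes about (the variable names here are mine, not from pygoogle), the sketch below starts from the correct text, decodes its UTF-8 bytes with the wrong codec, and then shows why encoding the result as UTF-8 produces the doubled bytes seen in the question. It uses explicit u'' and b'' literals so it should run on Python 2.7 as well as Python 3.3+.

```python
# Correct text: the two characters U+6C49 U+8BED.
original = u'\u6c49\u8bed'

# Proper UTF-8 encoding: six bytes, three per character.
utf8_bytes = original.encode('utf-8')      # b'\xe6\xb1\x89\xe8\xaf\xad'

# Wrong step: decoding those bytes as Latin-1 yields one character per byte.
mojibake = utf8_bytes.decode('latin-1')    # u'\xe6\xb1\x89\xe8\xaf\xad'

# Encoding the damaged text as UTF-8 encodes each of the six characters
# again (two bytes each), which is the output shown in the question.
double_encoded = mojibake.encode('utf-8')
```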

>>> a =  u'http://zh.wikipedia.org/zh/\xe6\xb1\x89\xe8\xaf\xad'
>>> a.encode('latin-1')
'http://zh.wikipedia.org/zh/\xe6\xb1\x89\xe8\xaf\xad'
>>> print a.encode('latin-1')
http://zh.wikipedia.org/zh/汉语

      

The print statement worked because my terminal is configured to handle UTF-8. You can get a unicode object again by decoding the bytes as UTF-8:



>>> a.encode('latin-1').decode('utf8')
u'http://zh.wikipedia.org/zh/\u6c49\u8bed'
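If you need this in more than one place, the round trip can be wrapped in a small helper. This is only a sketch: fix_latin1_mojibake is a name I made up, and it assumes the only damage is "UTF-8 decoded as Latin-1". When the round trip fails, it returns the input untouched rather than guessing.

```python
def fix_latin1_mojibake(text):
    """Undo 'UTF-8 bytes decoded as Latin-1'; return text unchanged otherwise."""
    try:
        return text.encode('latin-1').decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either the text has characters above U+00FF (not Latin-1 encodable)
        # or the resulting bytes are not valid UTF-8, so leave it alone.
        return text


a = u'http://zh.wikipedia.org/zh/\xe6\xb1\x89\xe8\xaf\xad'
fixed = fix_latin1_mojibake(a)            # u'http://zh.wikipedia.org/zh/\u6c49\u8bed'
unchanged = fix_latin1_mojibake(fixed)    # already correct, returned as-is
```

Note that a correct string which happens to consist of Latin-1 characters forming valid UTF-8 byte sequences would also be "repaired", so only apply this to input you know is damaged.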

      

The ISO-8859-1 (Latin-1) codec maps one-to-one to the first 256 Unicode code points (U+0000 to U+00FF), which is why the contents of the string otherwise look unchanged.
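A quick way to convince yourself of that mapping (again just a sketch, with variable names of my own choosing) is to decode every possible byte value as Latin-1 and compare the resulting code points with the byte values:

```python
# All 256 byte values, built in a way that works on Python 2.7 and Python 3.
all_bytes = bytes(bytearray(range(256)))

# Latin-1 never fails to decode: each byte becomes one character.
text = all_bytes.decode('latin-1')

# Each character's code point equals the byte it came from.
code_points = [ord(ch) for ch in text]
same_values = code_points == list(range(256))

# Encoding back returns the exact original bytes, so the round trip is lossless.
round_trip_ok = text.encode('latin-1') == all_bytes
```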

You can use the ftfy library for jobs like this; it handles a wide variety of text problems, including Mojibake from Windows code pages, where some of the resulting "code points" cannot be encoded back with that code page. The ftfy.fix_text() function takes Unicode input and repairs it:

>>> import ftfy
>>> ftfy.fix_text(a)
u'http://zh.wikipedia.org/zh/\u6c49\u8bed'

      
