Python encoding string - ASCII in unicode string; how to remove this 'u'?
When I use the Python module "pygoogle" with Chinese, I get a URL like u'http://zh.wikipedia.org/zh/\xe6\xb1\x89\xe8\xaf\xad'
It is a unicode string, but it contains what look like raw byte escapes. I tried to encode it back to UTF-8, but the bytes changed as well:
a = u'http://zh.wikipedia.org/zh/\xe6\xb1\x89\xe8\xaf\xad'
a.encode('utf-8')
>>> 'http://zh.wikipedia.org/zh/\xc3\xa6\xc2\xb1\xc2\x89\xc3\xa8\xc2\xaf\xc2\xad'
I also tried:
str(a)
but I got an error:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 27-32: ordinal not in range(128)
How can I encode it to remove the 'u'?
By the way, without the 'u' prefix I get the correct result, e.g.:
s = 'http://zh.wikipedia.org/zh/\xe6\xb1\x89\xe8\xaf\xad'
print s
>>> http://zh.wikipedia.org/zh/汉语
You have Mojibake; in this case, the string holds UTF-8 bytes that were decoded as if they were Latin-1 bytes.
To reverse the process, encode to Latin-1 again:
>>> a = u'http://zh.wikipedia.org/zh/\xe6\xb1\x89\xe8\xaf\xad'
>>> a.encode('latin-1')
'http://zh.wikipedia.org/zh/\xe6\xb1\x89\xe8\xaf\xad'
>>> print a.encode('latin-1')
http://zh.wikipedia.org/zh/汉语
print worked because my terminal is configured to handle UTF-8. You can get a unicode object again by decoding the bytes as UTF-8:
>>> a.encode('latin-1').decode('utf8')
u'http://zh.wikipedia.org/zh/\u6c49\u8bed'
The ISO-8859-1 (Latin-1) codec maps one-to-one to the first 256 Unicode code points (U+0000 to U+00FF), which is why the contents of the string look otherwise unchanged.
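To make the round trip concrete, here is a minimal sketch that first reproduces the Mojibake from the correct text and then undoes it. The variable names are purely illustrative, and the snippet is written so that it should run on Python 3 as well as Python 2.7; how pygoogle actually produced the bad string is an assumption, not something confirmed here.

```python
# Sketch: reproduce the Mojibake from correct text, then undo it.
# Variable names are illustrative only.

correct = u'http://zh.wikipedia.org/zh/\u6c49\u8bed'

# Step 1: the bug. UTF-8 bytes are wrongly decoded as Latin-1,
# so every byte becomes one code point in the range U+0000..U+00FF.
utf8_bytes = correct.encode('utf-8')
mojibake = utf8_bytes.decode('latin-1')

# Step 2: the fix. Encoding to Latin-1 recovers the exact original bytes,
# which can then be decoded with the codec that should have been used.
recovered_bytes = mojibake.encode('latin-1')
repaired = recovered_bytes.decode('utf-8')

print(repaired == correct)
```

The intermediate `mojibake` value is the same string as `a` in the question, which is why encoding it to Latin-1 gives back bytes that are valid UTF-8.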
You can use the ftfy library for jobs like this; it handles a wide variety of text problems, including Windows code page Mojibake where some of the resulting "code points" cannot legally be encoded to that code page. The ftfy.fix_text() function takes Unicode input and repairs it:
>>> import ftfy
>>> ftfy.fix_text(a)
u'http://zh.wikipedia.org/zh/\u6c49\u8bed'
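If installing a library is not an option, a guarded version of the manual repair can be sketched as follows. This is not how ftfy works internally; `fix_latin1_mojibake` is a hypothetical helper that only covers the single case of UTF-8 read as Latin-1, and it assumes its input is a unicode string (`str` on Python 3).

```python
def fix_latin1_mojibake(text):
    """Hypothetical helper: undo UTF-8-decoded-as-Latin-1 Mojibake.

    Returns the input unchanged when the round trip is not possible,
    i.e. when the text does not look like this kind of Mojibake.
    """
    try:
        # Latin-1 encoding fails for code points above U+00FF;
        # UTF-8 decoding fails if the bytes are not valid UTF-8.
        return text.encode('latin-1').decode('utf-8')
    except UnicodeError:
        return text


broken = u'http://zh.wikipedia.org/zh/\xe6\xb1\x89\xe8\xaf\xad'
already_fine = u'http://zh.wikipedia.org/zh/\u6c49\u8bed'

fixed = fix_latin1_mojibake(broken)
untouched = fix_latin1_mojibake(already_fine)
```

The try/except matters because correct text usually cannot survive the round trip and is returned as-is. The check is still only a heuristic: a genuine Latin-1 string that happens to form valid UTF-8 bytes would be altered, which is one reason a dedicated library is the safer choice.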