Python 2.7: How to convert unicode escapes in a string to actual utf-8 characters

Question

I am using Python 2.7 and I receive a line from a server as a byte string (not Unicode!). Inside this line there is text with unicode escape sequences, for example:

<a href = "http://www.mypage.com/\u0441andmoretext">\u00b2<\a>

How do I convert those \uxxxx sequences back to utf-8? The answers I found were about &# or required eval(), which is too slow for my purposes. I need a universal solution for any text containing such sequences.

Edit: <\a> is a typo, but I want such typos to be tolerated as well. Only \u should trigger a conversion.

The sample text, in correct Python syntax:

"<a href = \"http://www.mypage.com/\\u0441andmoretext\">\\u00b2<\\a>"

Desired output, in correct Python syntax:

"<a href = \"http://www.mypage.com/\xd1\x81andmoretext\">\xc2\xb2<\\a>"

+3

python string utf-8 converter unicode-escapes

evolution Apr 22 15 at 17:55

2 answers

Ella shar · Answer 1 · 2015-04-23T20:20:05+0000

Try

>>> s = "<a href = \"http://www.mypage.com/\\u0441andmoretext\">\\u00b2<\\a>"
>>> s.decode("raw_unicode_escape")
u'<a href = "http://www.mypage.com/\u0441andmoretext">\xb2<\\a>'

And then you can encode to utf8 as usual.
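
Putting both steps together, a minimal sketch. The variable names are illustrative, and the b prefix is only there so the same lines also run on Python 3; on 2.7 it is a plain str:

```python
# Byte string as received from the server, containing literal \uXXXX escapes.
s = b'<a href = "http://www.mypage.com/\\u0441andmoretext">\\u00b2<\\a>'

# Step 1: turn the \uXXXX escapes into real characters (unicode object).
text = s.decode("raw_unicode_escape")

# Step 2: encode the unicode object as UTF-8 bytes for output.
utf8_bytes = text.encode("utf-8")

print(repr(utf8_bytes))
```

The result should match the desired output from the question, with <\a> left untouched.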

jsbueno · Answer 2 · 2015-04-22T18:14:28+0000

Python contains some special string codecs for such cases.

In this case, if there are no characters outside the 32-127 range, you can safely decode your byte string with the "unicode_escape" codec to get a proper Unicode text object in Python (on which your program should perform all its text operations). Whenever you output this text again, you encode it to utf-8 as usual:

rawtext = r"""<a href="http://www.mypage.com/\u0441andmoretext">\u00b2<\a>"""
text = rawtext.decode("unicode_escape")
# Text operations go here, on the unicode object `text`
output_text = text.encode("utf-8")
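
One caveat regarding the edit in the question: as far as I can tell, "unicode_escape" interprets every Python backslash escape, so <\a> turns into a BEL character, whereas "raw_unicode_escape" only reacts to \uXXXX (and \UXXXXXXXX). A small check, using a made-up input string:

```python
# Illustrative input: a stray \a plus one \uXXXX escape.
raw = b'<\\a>\\u00b2'

# unicode_escape also processes \a, producing the BEL control character.
with_unicode_escape = raw.decode("unicode_escape")

# raw_unicode_escape leaves \a alone and only converts \u00b2.
with_raw_unicode_escape = raw.decode("raw_unicode_escape")

print(repr(with_unicode_escape))
print(repr(with_raw_unicode_escape))
```

If only \u should be touched, "raw_unicode_escape" looks like the safer choice of the two.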

If there are bytes outside the 32-127 range, the unicode_escape codec assumes they are in latin1 encoding. So if your response mixes utf-8 and these \uXXXX sequences, you should:

decode the original string with utf-8
encode back to latin1
decode using "unicode_escape"
work on the text
encode back to utf-8
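
The steps above can be sketched as a small helper. The function name decode_mixed and the sample bytes are illustrative assumptions, not part of any library. Note that the latin1 step raises UnicodeEncodeError if the utf-8 part contains characters above U+00FF, so this only covers input whose non-ASCII characters fit in latin1:

```python
def decode_mixed(raw_bytes):
    # Hypothetical helper following the five steps listed above.
    text = raw_bytes.decode("utf-8")              # 1. decode the original with utf-8
    latin1_bytes = text.encode("latin1")          # 2. encode back to latin1
    text = latin1_bytes.decode("unicode_escape")  # 3. resolve the escape sequences
    # 4. work on the unicode text here
    return text.encode("utf-8")                   # 5. encode back to utf-8


# Sample input: utf-8 encoded "cafe" with an accented e, plus a literal \u0441 escape.
sample = b'caf\xc3\xa9 \\u0441'
result = decode_mixed(sample)
print(repr(result))
```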
