Python 2.7: How to convert unicode escapes in a string to actual utf-8 characters
I am using Python 2.7 and I receive a string from a server (a byte string, not unicode!). Inside this string there is text with unicode escape sequences. For example:
<a href = "http://www.mypage.com/\u0441andmoretext">\u00b2<\a>
How do I convert those \uxxxx sequences back to utf-8? The answers I found either dealt with &# or required eval(), which is too slow for my purposes. I need a universal solution for any text containing such sequences.
Edit: <\a> is a typo, but I want stray backslashes like that to be tolerated as well. Only \u sequences should be converted.
The sample text, written in correct Python syntax:
"<a href = \"http://www.mypage.com/\\u0441andmoretext\">\\u00b2<\\a>"
Desired output, written in correct Python syntax:
"<a href = \"http://www.mypage.com/\xd1\x81andmoretext\">\xc2\xb2<\\a>"
Python contains some special string codecs for such cases.
In this case, if there are no other characters outside the 32-127 range, you can safely decode your byte string with the "unicode_escape" codec to get a proper unicode text object in Python (on which your program should perform all its text operations). Whenever you output this text again, you encode it to utf-8 as usual:
rawtext = r"""<a href="http://www.mypage.com/\u0441andmoretext">\u00b2<\a>"""
text = rawtext.decode("unicode_escape")
# ... text operations on the unicode object go here ...
output_text = text.encode("utf-8")
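Note that "unicode_escape" interprets every backslash escape it recognizes, not only \uXXXX, so the stray \a in the sample is decoded to a BEL character (0x07). If only \uXXXX sequences should be touched, as the edit to the question asks, a regular expression is one possible alternative. The following is a minimal sketch and not part of the original answer: the name expand_u_escapes is hypothetical, and it assumes each escape is a single code point (surrogate pairs such as \ud83d\ude00 are not combined).

```python
import re

# Matches a literal backslash, "u", and exactly four hex digits in a byte string.
_U_ESCAPE = re.compile(br"\\u[0-9a-fA-F]{4}")


def _to_utf8(match):
    # Decode just this one escape, then encode the character as utf-8 bytes.
    return match.group(0).decode("unicode_escape").encode("utf-8")


def expand_u_escapes(raw):
    """Replace backslash-u escapes in a byte string with utf-8 bytes.

    Everything else, including other backslash sequences, is left untouched.
    """
    return _U_ESCAPE.sub(_to_utf8, raw)


# Sample from the question, as a byte string.
raw = b'<a href = "http://www.mypage.com/\\u0441andmoretext">\\u00b2<\\a>'
result = expand_u_escapes(raw)
print(repr(result))
```

Because the substitution works directly on bytes, any utf-8 already present in the input passes through unchanged and the result matches the desired output given in the question.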
If there are bytes outside of the 32-127 range, the unicode_escape codec assumes that they are in latin1 encoding. So if your response mixes utf-8 and these \uXXXX sequences, you should:
- decode original string with utf-8
- encode back to latin1
- decode using "unicode_escape"
- work on the text
- encode back to utf-8
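The steps above can be sketched as follows. The helper name decode_mixed and the sample bytes are illustrative assumptions, not part of the original answer. One caveat: latin1 can only represent characters up to U+00FF, so this sketch raises UnicodeEncodeError if the utf-8 input already contains anything beyond that range.

```python
def decode_mixed(raw):
    """Decode a byte string that mixes utf-8 text with backslash-u escapes."""
    text = raw.decode("utf-8")            # 1. decode the original bytes as utf-8
    latin1_bytes = text.encode("latin1")  # 2. re-encode as latin1 (fails above U+00FF)
    return latin1_bytes.decode("unicode_escape")  # 3. expand the escapes


# Hypothetical input: "caf" plus utf-8 bytes for e-acute, then one escape.
raw = b"caf\xc3\xa9 \\u00b2"
text = decode_mixed(raw)
# 4. work on the unicode text here
output = text.encode("utf-8")  # 5. encode back to utf-8
print(repr(output))
```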