Python 2.7: How to convert unicode escapes in a string to actual utf-8 characters

I am using Python 2.7 and I receive a line from a server as a byte string (not as Unicode!). Inside this line there is text containing unicode escape sequences, for example:

<a href = "http://www.mypage.com/\u0441andmoretext">\u00b2<\a>

      

How do I convert those \uxxxx sequences back to utf-8? The answers I found were about &# or required eval(), which is too slow for my purposes. I need a universal solution for any text containing such sequences.

Edit: the <\a> is a typo, but I want malformed input like that to pass through unchanged as well. Only \u sequences should trigger a conversion.

The sample text is meant in correct Python syntax, e.g.:

"<a href = \"http://www.mypage.com/\\u0441andmoretext\">\\u00b2<\\a>"

      

Desired output in correct Python syntax:

"<a href = \"http://www.mypage.com/\xd1\x81andmoretext\">\xc2\xb2<\\a>"

      

+3




2 answers


Try:

>>> s = "<a href = \"http://www.mypage.com/\\u0441andmoretext\">\\u00b2<\\a>"
>>> s.decode("raw_unicode_escape")
u'<a href = "http://www.mypage.com/\u0441andmoretext">\xb2<\\a>'

      



And then you can encode to utf8 as usual.
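
Putting both steps together, a minimal sketch of the full round trip might look like this. The variable names are illustrative, and the input is written as a b"" literal (a plain str in Python 2.7) so the snippet also runs unchanged on Python 3:

```python
# Byte string as received from the server (a plain str in Python 2.7)
s = b"<a href = \"http://www.mypage.com/\\u0441andmoretext\">\\u00b2<\\a>"

# Step 1: turn the \uXXXX escapes into real code points; other
# backslash sequences such as \a are left untouched by this codec
text = s.decode("raw_unicode_escape")

# Step 2: encode the resulting unicode object as UTF-8 bytes
utf8_bytes = text.encode("utf-8")
```

For the sample line from the question, utf8_bytes should equal the desired output given there, with <\a> preserved as-is.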

+5




Python contains some special string codecs for such cases.

In this case, if there are no bytes outside the 32-127 range, you can safely decode your byte string using the "unicode_escape" codec to get a proper Unicode text object in Python, on which your program should do all of its text operations. Whenever you output this text again, you encode it to utf-8 as usual:

rawtext = r"""<a href="http://www.mypage.com/\u0441andmoretext">\u00b2<\a>"""
# Decode the escape sequences into a unicode object
text = rawtext.decode("unicode_escape")
# ... text operations go here ...
# Encode back to utf-8 bytes for output
output_text = text.encode("utf-8")
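
One caveat worth checking against the question's requirement that only \u should be converted: "unicode_escape" interprets every Python string escape, not just \uXXXX, so a stray \a becomes the BEL control character, whereas "raw_unicode_escape" leaves it alone. A small comparison, with an illustrative input and variable names that are not from the original post:

```python
# Input containing a stray \a and a \u00b2 escape (plain str in Python 2.7)
raw = b"<\\a> \\u00b2"

# unicode_escape processes all backslash escapes, so \a turns into U+0007
via_unicode_escape = raw.decode("unicode_escape")

# raw_unicode_escape only processes \uXXXX (and \UXXXXXXXX) escapes
via_raw_unicode_escape = raw.decode("raw_unicode_escape")
```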

      



If there are bytes outside of the 32-127 range, the unicode_escape codec assumes that they are in latin1 encoding. So if your response mixes utf-8 and these \uXXXX sequences, you should:

  • decode the original string from utf-8
  • encode it back to latin1
  • decode it using "unicode_escape"
  • work on the text
  • encode back to utf-8
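
The steps above can be sketched roughly as follows. The function name decode_mixed and the sample bytes are hypothetical, and the sketch assumes every literal non-ASCII character in the input fits in latin1; a character outside that range (for example literal Cyrillic) would make the latin1 step raise UnicodeEncodeError:

```python
def decode_mixed(raw_bytes):
    """Decode bytes that mix literal utf-8 text with \\uXXXX escapes."""
    # Step 1: interpret the incoming bytes as utf-8
    text = raw_bytes.decode("utf-8")
    # Step 2: re-encode as latin1, the encoding unicode_escape assumes
    # (assumption: all literal characters are within U+0000..U+00FF)
    latin1_bytes = text.encode("latin1")
    # Step 3: resolve the escape sequences
    return latin1_bytes.decode("unicode_escape")

# Hypothetical input: a utf-8 encoded e-acute plus an escaped euro sign
raw = b"caf\xc3\xa9 costs 2\\u20ac"
text = decode_mixed(raw)

# Step 4: text operations would go here
# Step 5: encode back to utf-8 for output
output = text.encode("utf-8")
```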
+1

