Decoding a bytes object and writing it to a file gives unexpected, invalid UTF-8: how can I avoid this?

The code below (Python 3.6) takes a bytes object that represents the multiplication sign in UTF-8 (b'\xc3\x97'), decodes it to a string, and writes the string to a file:

# Byte sequence corresponds to multiplication sign in UTF-8
myBytes = b'\xc3\x97'
# Decode to string 
myString = myBytes.decode('utf-8')

# Write myString to file
with open("myString.txt", "w") as ms_file:
    ms_file.write(myString)

      

This gives me the following output:

Bytes written to the file myString.txt (checked by opening the file in a hex editor): D7

I expected the result to be the 2-byte sequence C3 97, which is the UTF-8 representation of the multiplication sign. Moreover, D7 is not even a valid (single-byte) UTF-8 sequence (see also UTF-8 Codepage Layout). It is the byte value that corresponds to the multiplication sign in the ISO/IEC 8859-1 (Latin-1) encoding.
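The byte values described above can be checked directly. The following is only a small sketch; the variable names are made up for illustration, and cp1252 is included on the assumption that it is the relevant Windows code page (the question itself only mentions ISO/IEC 8859-1).

```python
# U+00D7 MULTIPLICATION SIGN, encoded three ways
times = "\u00d7"

utf8_bytes = times.encode("utf-8")      # two bytes
latin1_bytes = times.encode("latin-1")  # one byte
cp1252_bytes = times.encode("cp1252")   # one byte (assumed Windows code page)

print(utf8_bytes.hex(), latin1_bytes.hex(), cp1252_bytes.hex())

# A lone 0xD7 is a UTF-8 lead byte without its continuation byte,
# so decoding it as UTF-8 fails.
try:
    b"\xd7".decode("utf-8")
    lone_d7_is_valid_utf8 = True
except UnicodeDecodeError:
    lone_d7_is_valid_utf8 = False

print(lone_d7_is_valid_utf8)
```

So a single D7 byte is what a Latin-1 or cp1252 encoder would produce for this character, while a UTF-8 encoder produces C3 97.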

So my question is: how can I ensure that I end up with valid UTF-8? Am I overlooking something really obvious, or is this a bug in Python?

For some context: I ran into this issue while writing code that reads XML files (which use UTF-8), parses them into Element objects using lxml, and retrieves the text values of some elements, which are subsequently written to another XML file (which also uses UTF-8). Because of this issue, I can end up with XML files that are not well-formed.

I am using Python 3.6 on Windows 7.

EDIT: The original question / code contained a function that was supposed to print the hexadecimal representation of myString to the screen, but it turned out not to behave as expected. Since this made things unnecessarily confusing (and the function was not essential to the question), I removed it from the code.
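One likely explanation, offered as a sketch rather than a confirmed diagnosis of this particular setup: in Python 3.6, open() in text mode without an encoding argument uses the locale's preferred encoding, which on a Western European Windows installation is typically cp1252 rather than UTF-8. Passing encoding="utf-8" explicitly should then produce the expected bytes. The temporary directory and the names default_encoding and written below are illustrative only.

```python
import locale
import os
import tempfile

# Same decoded string as in the question
my_string = b"\xc3\x97".decode("utf-8")

# Encoding that open() falls back to in text mode when none is given
# (Python 3.6 behaviour; the value depends on the system locale)
default_encoding = locale.getpreferredencoding(False)

# Write with an explicit encoding, then read the raw bytes back.
# A temporary directory is used here only to keep the sketch self-contained.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "myString.txt")
    with open(path, "w", encoding="utf-8") as ms_file:
        ms_file.write(my_string)
    with open(path, "rb") as ms_file:
        written = ms_file.read()

print(default_encoding, written.hex())
```

If default_encoding reports something like cp1252, that would account for the single D7 byte seen in the hex editor.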
