Need help dealing with characters that are 2 or more bytes long in Python
I am learning about bits and bytes in Python by writing a small program that converts strings to binary and back to strings again. For now I have a function that converts to binary.
string = 'word'
for c in string:
    convertToBinary(c)  # Function that converts to binary
Output:
01110111 01101111 01110010 01100100
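The body of convertToBinary() is not shown here. A minimal sketch that would produce the output above, assuming the characters are encoded as UTF-8 and each byte is printed as 8 bits, might look like this:

```python
def convertToBinary(c):
    # Assumption: encode the character as UTF-8, then format every
    # resulting byte as a zero-padded 8-bit binary string.
    return ' '.join(format(byte, '08b') for byte in c.encode('utf-8'))


string = 'word'
print(' '.join(convertToBinary(c) for c in string))
# 01110111 01101111 01110010 01100100
```

With this sketch a character outside ASCII yields more than one 8-bit group, one per byte of its encoding.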
Now I want to write a function fromBinary() that converts from binary to string. However, I am stuck on how to work with characters longer than 1 byte, for example 'å'.
string = 'å'
for c in string:
    convertToBinary(c)
Output:
11000011 10100101
This becomes a problem when I have a string containing characters of different lengths (in bytes).
string = 'åw'
for c in string:
    convertToBinary(c)
Output:
11000011 #first byte of 'å'
10100101 #second byte of 'å'
01110111 #w
I thought that I could join the bytes together as one; however, I am really puzzled as to how to determine which bytes to join. How can I create a function that recognizes which bytes together form a character?
It's not that hard. Of course there is a system to it; otherwise no program could print or edit names like Ñáñez.
The upper bits of each byte indicate the role of that byte:
1) If bit 7 is 0, then it's just ASCII (*0*1110111 = w).
2) If you find 11 at the top, then more bytes follow. The number of leading 1 bits gives the total length of the sequence, and every following byte starts with 10:
*110*xxxxx *10*xxxxxx
*1110*xxxx *10*xxxxxx *10*xxxxxx
*11110*xxx *10*xxxxxx *10*xxxxxx *10*xxxxxx
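*111110*xx *10*xxxxxx *10*xxxxxx *10*xxxxxx *10*xxxxxx
*1111110*x *10*xxxxxx *10*xxxxxx *10*xxxxxx *10*xxxxxx *10*xxxxxx
(The 5- and 6-byte forms come from the original UTF-8 design. Current UTF-8, as defined in RFC 3629, stops at 4 bytes.)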
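The lead-byte patterns above are enough to decide which bytes belong together. Here is a minimal sketch, with the hypothetical helper names sequence_length() and group_bytes(), that handles sequences of up to 4 bytes and does not validate the continuation bytes:

```python
def sequence_length(lead_byte):
    """Return how many bytes the character starting with lead_byte occupies."""
    if lead_byte & 0b10000000 == 0b00000000:  # 0xxxxxxx: plain ASCII
        return 1
    if lead_byte & 0b11100000 == 0b11000000:  # 110xxxxx: 1 more byte follows
        return 2
    if lead_byte & 0b11110000 == 0b11100000:  # 1110xxxx: 2 more bytes follow
        return 3
    if lead_byte & 0b11111000 == 0b11110000:  # 11110xxx: 3 more bytes follow
        return 4
    # 10xxxxxx is a continuation byte and cannot start a character.
    raise ValueError(f"not a valid lead byte: {lead_byte:08b}")


def group_bytes(data):
    """Split a bytes object into one chunk per character."""
    groups = []
    i = 0
    while i < len(data):
        n = sequence_length(data[i])
        groups.append(data[i:i + n])
        i += n
    return groups


print(group_bytes(bytes([0b11000011, 0b10100101, 0b01110111])))
# [b'\xc3\xa5', b'w']
```

Each chunk can then be turned into a character on its own, for example with chunk.decode('utf-8').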
11000011 #first byte of 'å'
10100101 #second byte of 'å'
Thus, *110* means 1 more byte follows:
*110*00011 *10*100101
Strip the marker bits and concatenate what is left: 00011 followed by 100101 gives 000 11100101, the Unicode value for å (0x00e5).
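The same arithmetic can be written as a fromBinary()-style function. What follows is only a sketch: the name from_binary() and the space-separated input format are assumptions, it stops at 4-byte sequences, and it does not reject overlong encodings. In real code, bytes(...).decode('utf-8') does the same job with full validation.

```python
def from_binary(bits):
    """Decode space-separated 8-bit groups (UTF-8 bytes) into a string."""
    data = [int(group, 2) for group in bits.split()]
    chars = []
    i = 0
    while i < len(data):
        lead = data[i]
        # The lead byte gives the sequence length and the first payload bits.
        if lead >> 7 == 0b0:
            n, code_point = 1, lead
        elif lead >> 5 == 0b110:
            n, code_point = 2, lead & 0b00011111
        elif lead >> 4 == 0b1110:
            n, code_point = 3, lead & 0b00001111
        elif lead >> 3 == 0b11110:
            n, code_point = 4, lead & 0b00000111
        else:
            raise ValueError(f"invalid lead byte: {lead:08b}")
        if i + n > len(data):
            raise ValueError("truncated sequence")
        # Each continuation byte (10xxxxxx) contributes 6 more payload bits.
        for cont in data[i + 1:i + n]:
            if cont >> 6 != 0b10:
                raise ValueError(f"invalid continuation byte: {cont:08b}")
            code_point = (code_point << 6) | (cont & 0b00111111)
        chars.append(chr(code_point))
        i += n
    return ''.join(chars)


print(from_binary('11000011 10100101 01110111'))
# åw
```

For 'å' the loop computes (0b00011 << 6) | 0b100101, which is 0xE5, matching the result worked out by hand above.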
Note. I believe there is a problem with your w in your example.