Python utf-8 character range
I am working with a UTF-8 encoded text file and reading its contents with Python. After reading the content, I split the text into an array of characters.
import codecs

# 'fullpath' is the path to the UTF-8 encoded input file
with codecs.open(fullpath, 'r', encoding='utf8') as f:
    text = f.read()
# Split 'text' into characters
chars = list(text)
Now I iterate over each character. First, I convert it to its numeric (hexadecimal) value and run some code on it.
numerialValue = ord(char)
I noticed that among these characters, some fall outside the expected range.
The expected maximum value is FFFF, but the actual value of one character is 1D463.
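For illustration, here is a minimal sketch of the kind of check that surfaces such characters. It assumes Python 3, where a string is a sequence of code points; the `sample` text and the helper name `chars_outside_bmp` are hypothetical and not part of the original code.

```python
# Hypothetical sample text; the real input would be the text read from the file.
sample = "v = \U0001D463"

def chars_outside_bmp(text):
    """Return (index, character, hex code point) for every code point above U+FFFF."""
    return [
        (i, ch, format(ord(ch), "X"))
        for i, ch in enumerate(text)
        if ord(ch) > 0xFFFF
    ]

found = chars_outside_bmp(sample)
print(found)
```

Any entry in `found` is a character that cannot be expressed with a four-digit escape.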
I translated this code to Python. The original source code is C#, where the literal "\u1D463" is reported as an invalid character.
Any ideas?
You seem to have escaped the Unicode code point (U+1D463) with \u instead of \U. The former expects four hexadecimal digits, whereas the latter expects eight hexadecimal digits. According to Microsoft Visual Studio:
The condition was ch == '\u1D463'
When I used this literal in the Python interpreter, it did not complain. It happily treated the first four hex digits as the escape and printed the trailing 3 as an ordinary character when run in cmd:
>>> print('\u1D463')
ᵆ3
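To make the difference concrete, the following small sketch (assuming Python 3) compares the two literals. The variable names `wrong` and `right` are only illustrative.

```python
wrong = '\u1D463'      # parsed as the escape '\u1D46' followed by the ordinary character '3'
right = '\U0001D463'   # a single code point, U+1D463

# Show the length of each string and the code point of every character in it.
print(len(wrong), [hex(ord(c)) for c in wrong])
print(len(right), [hex(ord(c)) for c in right])
```

The first literal is a two-character string, so a comparison such as ch == '\u1D463' can never be true for a single character.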
You get the error "Expected max value - FFFF. Actual character value - 1D463" because you are using the wrong Unicode escape: use \U0001D463 instead of \u1D463. The largest code point that \u can express is \uFFFF, while \U has room for up to \UFFFFFFFF (although Python rejects anything above \U0010FFFF, the last Unicode code point). Note the leading zeros in \U0001D463: \U takes exactly eight hexadecimal digits, and \u takes exactly four hexadecimal digits:
>>> '\U1D463'
File "<stdin>", line 1
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-6: truncated \UXXXXXXXX escape
>>> '\uFF'
File "<stdin>", line 1
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-3: truncated \uXXXX escape
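As a rough sketch (assuming Python 3), the character can also be built from its number with chr(), which avoids counting escape digits altogether. The helper `utf16_code_units` is a hypothetical name introduced here. It shows how the same character looks in UTF-16, which is my assumption about where the FFFF limit in the C# original comes from, since a C# char holds one 16-bit code unit.

```python
import unicodedata

target = chr(0x1D463)              # the same string as '\U0001D463'
assert target == '\U0001D463'

def utf16_code_units(ch):
    """Return the UTF-16 code units of a string as a list of integers."""
    data = ch.encode('utf-16-be')  # big-endian, no byte order mark
    return [int.from_bytes(data[i:i + 2], 'big') for i in range(0, len(data), 2)]

units = utf16_code_units(target)
print(unicodedata.name(target))
print([format(u, '04X') for u in units])
```

In the loop from the question, the comparison would then be ch == target, or equivalently ord(ch) == 0x1D463.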