Python UTF-8 character range

I am working with a UTF-8 encoded text file and reading its contents using Python. After reading the content, I split the text into a list of characters.

import codecs

# fullpath is the path to the UTF-8 encoded file
with codecs.open(fullpath, 'r', encoding='utf8') as f:
    text = f.read()

# Split the text into individual characters
chars = list(text)

      

Now I iterate over each character. First, I convert it to its numeric value (shown in hexadecimal below) and run some code on it.

numerialValue = ord(char)

      

I noticed that among all these characters, some are outside the expected range.

The expected maximum value is FFFF. The actual value of the symbol is 1D463.
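As a rough illustration of the situation described above, the sketch below scans a string for characters whose code point exceeds 0xFFFF. It assumes Python 3; the helper name `find_astral_chars` and the sample string are hypothetical and not part of the original code.

```python
def find_astral_chars(text):
    """Return (index, char, hex code point) for every character above U+FFFF."""
    return [
        (i, ch, format(ord(ch), "X"))
        for i, ch in enumerate(text)
        if ord(ch) > 0xFFFF
    ]

# Hypothetical sample: 'a', the character U+1D463, then 'b'
sample = "a\U0001D463b"
hits = find_astral_chars(sample)
print(hits)
```

Running this on the real file contents instead of `sample` would list every character that falls outside the FFFF limit.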

I translated this code to Python. The original source code is in C#, where the value "\u1D463" is treated as an invalid character.


Vaguely.



1 answer


You seem to have escaped the Unicode code point (U+1D463) with \u instead of \U. The former expects four hexadecimal digits, whereas the latter expects eight hexadecimal digits. According to Microsoft Visual Studio:

The condition was ch == '\u1D463'

When I use this literal in the Python interpreter, it doesn't complain; it happily escapes the first four hex digits and prints the trailing 3 as an ordinary character when run in cmd:



 >>> print('\u1D463')
ᵆ3
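To check how Python 3 parses that literal, a small sketch like the following inspects the resulting string character by character. The variable names are illustrative only.

```python
s = '\u1D463'  # parsed as the escape '\u1D46' followed by the literal character '3'

length = len(s)
code_points = [format(ord(ch), '04X') for ch in s]

print(length)       # number of characters in the string
print(code_points)  # hex code point of each character
```

The string holds two characters, U+1D46 and the digit 3 (U+0033), rather than the single code point U+1D463.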

      

You get the exception "Expected max value - FFFF. Actual character value - 1D463" because you are using the wrong Unicode escape: use \U0001D463 instead of \u1D463. The maximum value that \u can express is \uFFFF; \U has room for eight digits, although the highest valid code point is \U0010FFFF. Note the leading zeros in \U0001D463: \U takes exactly eight hexadecimal digits, and \u takes exactly four:

>>> '\U1D463'
  File "<stdin>", line 1
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-6: truncated \UXXXXXXXX escape

>>> '\uFF'
  File "<stdin>", line 1
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-3: truncated \uXXXX escape
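For comparison, here is a minimal sketch of the zero-padded form in Python 3. It also prints the UTF-16 encoding, on the assumption that the FFFF limit in the question comes from a 16-bit character type on the C# side; that link is a guess, not something the question confirms.

```python
import unicodedata

ch = '\U0001D463'  # eight hex digits, zero-padded

print(len(ch))                       # a single character in Python 3
print(format(ord(ch), 'X'))          # code point in hexadecimal
print(ch == chr(0x1D463))            # chr() builds the same character
print(unicodedata.name(ch))          # official Unicode character name
print(ch.encode('utf-8'))            # four bytes in UTF-8
print(ch.encode('utf-16-be').hex())  # two 16-bit units (a surrogate pair)
```

In UTF-16 the character takes two code units, d835 and dc63, each of which fits below FFFF, whereas the code point itself does not.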

      
