Support for Chinese and Japanese characters in Python

How do I read Japanese and Chinese characters correctly? I am using Python 2.5. The output is displayed as "E:\Test\?????????".

path = r"E:\Test\は最高のプログラマ"
t = path.encode()
print t
u = path.decode()
print u
t = path.encode("utf-8")
print t
t = path.decode("utf-8")
print t

      



3 answers


Please read the Python Unicode HOWTO; it explains how to handle and include non-ASCII text in your Python code.

If you want to include Japanese text literals in your code, you have several options:

  • Use unicode literals (creating unicode objects instead of byte strings), but represent every non-ASCII character with a Unicode escape sequence. These take the form \uabcd: a backslash, u, and 4 hexadecimal digits:

    ru = u'\u30EB'
    
          

    will be one character, the katakana 'ru' code point ('ル').

  • Use unicode literals, but include the characters directly in some encoding. Your text editor saves the file in a given encoding (for example, UTF-8); you need to declare that encoding at the top of your source file:

    # encoding: utf-8

    ru = u'ル'

    where 'ル' is included without using an escape. The default source encoding for Python 2 files is ASCII, so declaring the encoding lets you use Japanese directly.

  • Use byte string literals that are already encoded. Encode the code points by some other means and include the resulting bytes in your byte string literals. If all you are going to do with them is use them in encoded form anyway, this should be fine:

    ru = '\xeb\x30'  # ru encoded to UTF16 little-endian
    
          

    I encoded 'ル' as UTF-16 little-endian because that is the standard Windows NTFS filename encoding.
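The relationship between the escaped literal and the pre-encoded byte strings in the list above can be checked directly. This is a minimal sketch, not part of the original answer; the variable names are illustrative, and the b'' prefix assumes Python 2.6 or later (on 2.5 a plain '...' literal is already a byte string).

```python
# Sketch: the same katakana 'ru' (U+30EB) as an escaped unicode literal
# and as already-encoded byte strings.
ru = u'\u30EB'                        # one code point, written as an escape

ru_utf16le = ru.encode('utf-16-le')   # the two bytes used in the byte string option
ru_utf8 = ru.encode('utf-8')          # the three bytes a UTF-8 source file would hold

# Round trip: decoding the bytes with the matching codec gives the code point back.
assert ru_utf16le == b'\xeb\x30'
assert ru_utf8 == b'\xe3\x83\xab'
assert ru_utf16le.decode('utf-16-le') == ru
```

Whichever form you pick, the byte values only mean 'ル' together with the codec that produced them, so the encoding has to be tracked alongside any byte string.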



The next problem will be your terminal: the Windows console is known for not supporting many character sets out of the box. You probably want to configure it to handle UTF-8. See this question for some details, but you need to run the following command in the console:

chcp 65001

      

to switch to UTF-8, and you might need to switch to a console font that can handle your code points (Lucida perhaps?).
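Before printing, it can help to see what a given console encoding would do to the text. The sketch below is an illustration only: console_safe is a hypothetical helper, not a standard function, and it assumes sys.stdout.encoding reflects the active code page (it may be None when output is redirected).

```python
import sys

def console_safe(text, encoding=None):
    """Encode unicode text for a console, replacing unsupported characters with '?'."""
    # Fall back to ASCII when the stream does not report an encoding.
    enc = encoding or getattr(sys.stdout, 'encoding', None) or 'ascii'
    return text.encode(enc, 'replace')

ascii_view = console_safe(u'E:\\Test\\\u30eb', 'ascii')   # code page without Japanese
utf8_view = console_safe(u'E:\\Test\\\u30eb', 'utf-8')    # console switched to UTF-8
```

With an ASCII-only target the katakana character degrades to a question mark, while the UTF-8 target keeps it as three bytes.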



There are two independent problems:

  • You should declare the Python source encoding if you use non-ASCII characters, and use Unicode literals for data that represents text, for example:

    # -*- coding: utf-8 -*-
    path = ur"E:\Test\は最高のプログラマ"
    
          

  • Printing Unicode to the Windows console is tricky, but if you set the correct font, then just:

    print path

    can work.

Regardless of whether your console can display the path, it is fine to pass the Unicode path to filesystem functions such as:



entries = os.listdir(path)
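As a self-contained illustration of passing Unicode paths to filesystem functions, the sketch below creates a temporary directory with a Japanese name and lists it. The function name list_japanese_dir and the file name are made up for the example, and it assumes the filesystem can store these names (true for NTFS and for typical UTF-8 systems).

```python
import os
import shutil
import tempfile

def list_japanese_dir():
    """Create a folder and file with Japanese names, then list the folder."""
    base = tempfile.mkdtemp()
    try:
        folder = os.path.join(base, u'\u30eb')             # unicode path component
        os.mkdir(folder)
        open(os.path.join(folder, u'\u6700\u9ad8.txt'), 'w').close()
        # A unicode argument makes os.listdir return unicode names on Python 2.
        return os.listdir(folder)
    finally:
        shutil.rmtree(base)                                # always clean up

entries = list_japanese_dir()
```

Nothing here is printed, so the example works even when the console itself cannot display the names.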

      

Don't call .encode(char_enc) on bytestrings; call it on Unicode strings instead.
Don't call .decode(char_enc) on Unicode strings; call it on bytestrings instead.
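One way to keep the direction straight is to wrap each conversion in a helper that only acts on the right type. This is a hedged sketch rather than an established API: to_text and to_bytes are hypothetical names, UTF-8 is an assumed default, and the bytes name requires Python 2.6 or later (on 2.5, str fills that role).

```python
def to_text(value, encoding='utf-8'):
    """Decode bytestrings to unicode; leave unicode strings untouched."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value

def to_bytes(value, encoding='utf-8'):
    """Encode unicode strings to bytes; leave bytestrings untouched."""
    if isinstance(value, bytes):
        return value
    return value.encode(encoding)

raw = b'\xe3\x83\xab'        # UTF-8 bytes, e.g. as read from a file
text = to_text(raw)          # bytes -> unicode
again = to_bytes(text)       # unicode -> bytes
```

Because each helper returns its input unchanged when it already has the target type, calling it twice is harmless, which avoids the implicit ASCII conversion that Python 2 performs when .encode() is called on a bytestring.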



You have to force the string to be a unicode object, like:

path = ur"E:\Test\は最高のプログラマ"

      

The docs on string literals specific to 2.5 are located here

Edit: I'm not sure whether the object is a unicode object in 2.5, but the docs indicate that \uXXXX[XXXX] escapes will be processed and the string will be a "Unicode string".
