Support for Chinese and Japanese characters in Python
Please read the Python Unicode HOWTO; it explains how to process and include non-ASCII text in your Python code.
If you want to include Japanese text literals in your code, you have several options:
-
Use Unicode literals (creating unicode objects instead of byte strings), with any non-ASCII characters written as Unicode escape sequences. These take the form \uabcd: a backslash, a u, and 4 hexadecimal digits:
ru = u'\u30EB'
This is a single character, the katakana 'ru' code point ('ル').
-
Use unicode literals, but include the characters directly, in some form of encoding. Your text editor saves the file in a given encoding (for example, UTF-8); you need to declare that encoding at the top of the source file:
# encoding: utf-8
ru = u'ル'
where 'ル' is included without using an escape. The default encoding for Python 2 source files is ASCII, so declaring the encoding is what lets you use Japanese directly. Note that the declared encoding must be ASCII-compatible, so UTF-16 cannot be used for Python source files.
-
Use byte string literals that are already encoded. Encode the code points by some other means and include the resulting bytes in your string literals. If all you are going to do with them is use them in encoded form anyway, this should be fine:
ru = '\xeb\x30'  # ru encoded as UTF-16 little-endian
I encoded 'ル' as UTF-16 little-endian because that is the standard Windows NTFS file name encoding.
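As a rough sketch of how the options above relate to each other, the snippet below starts from the escaped literal and derives the encoded byte forms from it. The variable names `ru_utf16le` and `ru_utf8` are only illustrative, and the source is kept ASCII-only so that it should run unchanged on Python 2.7 and on Python 3.3 or later.

```python
# One katakana character, written three ways.
ru = u'\u30EB'                        # Unicode literal using an escape

ru_utf16le = ru.encode('utf-16-le')   # the byte string from the third option
ru_utf8 = ru.encode('utf-8')          # the same character as UTF-8 bytes

assert len(ru) == 1                   # one code point
assert ru_utf16le == b'\xeb\x30'      # two bytes, low byte first
assert ru_utf8 == b'\xe3\x83\xab'     # three bytes in UTF-8

# Decoding with the matching codec gives back the original text.
assert ru_utf16le.decode('utf-16-le') == ru
assert ru_utf8.decode('utf-8') == ru
```

Decoding only works with the codec that produced the bytes, which is why a bare byte string literal is only safe when you know which encoding it holds.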
The next problem will be your terminal: the Windows console is known for not supporting many character sets out of the box. You probably want to configure it to handle UTF-8. See this question for some details, but you need to run the following command in the console:
chcp 65001
to switch to UTF-8, and you might need to switch to a console font that can handle your code points (Lucida perhaps?).
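If the console cannot be reconfigured, one possible fallback is to replace whatever the console encoding cannot represent before printing. The helper `console_safe` below is a hypothetical name, not something from the standard library, and the fallback to ASCII when no encoding is reported is an assumption of this sketch.

```python
import sys

def console_safe(text, encoding=None):
    """Return text with characters the target encoding cannot
    represent replaced by '?' (hypothetical helper)."""
    if encoding is None:
        # sys.stdout.encoding can be None (e.g. when output is piped);
        # falling back to ASCII is an assumption of this sketch.
        encoding = getattr(sys.stdout, 'encoding', None) or 'ascii'
    return text.encode(encoding, 'replace').decode(encoding)

ru = u'\u30EB'
print(console_safe(u'katakana ru: ' + ru))
```

With a legacy code page such as cp437 the character degrades to a question mark instead of raising UnicodeEncodeError; with UTF-8 it passes through unchanged. Whether it is then displayed correctly still depends on the console font.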
There are two independent problems:
-
You must declare the Python source encoding if you use non-ASCII characters in your source file, and you should use Unicode literals for data that represents text, for example:
# -*- coding: utf-8 -*-
path = ur"E:\Test\は最高のプログラマ"
-
Printing Unicode to the Windows console is tricky, but if you set the correct font, a simple:
print path
might work.
Whether or not your console can display the path, it should be fine to pass the Unicode path to filesystem functions such as:
entries = os.listdir(path)
Don't call .encode(char_enc) on bytestrings; call it on Unicode strings instead.
Don't call .decode(char_enc) on Unicode strings; call it on bytestrings instead.
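The two rules above can be sketched as a simple round trip. The path value here is only an illustrative stand-in written with an escape, and the code is kept ASCII-only so it should run on both Python 2.7 and Python 3.3 or later.

```python
text = u'E:\\Test\\\u30EB'         # Unicode string: this is text
data = text.encode('utf-8')        # encode: Unicode -> bytes
roundtrip = data.decode('utf-8')   # decode: bytes -> Unicode

assert isinstance(data, bytes)
assert roundtrip == text

# Going the wrong way fails. On Python 2, encoding a bytestring first
# decodes it implicitly as ASCII (UnicodeDecodeError here); on Python 3,
# bytes objects have no encode method at all (AttributeError).
wrong_way_failed = False
try:
    data.encode('utf-8')
except (UnicodeDecodeError, AttributeError):
    wrong_way_failed = True

assert wrong_way_failed
```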
You have to force the string to be a unicode object, like
path = ur"E:\Test\は最高のプログラマ"
The docs on string literals specific to 2.5 are located here
Edit: I'm not sure whether the object is a unicode in 2.5, but the docs indicate that \uXXXX[XXXX] escapes will be processed and the string will be a "Unicode string".