Why does locale.getpreferredencoding() return 'ANSI_X3.4-1968' instead of 'UTF-8'?

Whenever I try to read UTF-8 encoded text files using open(file_name, encoding='utf-8'), I always get a message that the ASCII codec cannot decode some characters (for example, when using for line in f: print(line)).

Python 3.5.3 (default, Jan 19 2017, 14:11:04)
[GCC 6.3.0 20170118] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import locale
>>> locale.getpreferredencoding()
'ANSI_X3.4-1968'
>>> import sys
>>> sys.getfilesystemencoding()
'ascii'
>>>

      

And the locale command prints:

locale: Cannot set LC_CTYPE to default locale: No such file or directory
locale: Cannot set LC_ALL to default locale: No such file or directory
LANG=en_US.UTF-8
LANGUAGE=
LC_CTYPE=en_HK.UTF-8
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

      

+5




3 answers


I had a similar problem. For me, the environment variable LANG was initially not set (you can check this by running env):

$ python3 -c 'import locale; print(locale.getdefaultlocale())'
(None, None)
$ python3 -c 'import locale; print(locale.getpreferredencoding())'
ANSI_X3.4-1968

      

The locales available for me were (on a fresh Ubuntu 18.04 Docker image):

$ locale -a
C
C.UTF-8
POSIX

      

So I chose C.UTF-8:

$ export LANG="C.UTF-8"

      

And then everything works:

$ python3 -c 'import locale; print(locale.getdefaultlocale())'
('en_US', 'UTF-8')
$ python3 -c 'import locale; print(locale.getpreferredencoding())'
UTF-8

      




If you choose an unavailable locale like

export LANG="en_US.UTF-8"

      

it won't work:

$ python3 -c 'import locale; print(locale.getdefaultlocale())'
('en_US', 'UTF-8')
$ python3 -c 'import locale; print(locale.getpreferredencoding())'
ANSI_X3.4-1968

      

and this is why the locale command gives these error messages:

locale: Cannot set LC_CTYPE to default locale: No such file or directory
locale: Cannot set LC_ALL to default locale: No such file or directory
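
The same availability check can be done from Python. What follows is only a minimal sketch: locale_is_available is a hypothetical helper name, not part of the locale module, and it assumes that a locale which the C library cannot activate makes locale.setlocale() raise locale.Error.

```python
import locale


def locale_is_available(name):
    """Hypothetical helper: True if `name` can be activated for LC_CTYPE."""
    saved = locale.setlocale(locale.LC_CTYPE)  # query the current setting
    try:
        locale.setlocale(locale.LC_CTYPE, name)
        return True
    except locale.Error:
        # Raised when the requested locale is not installed on the system.
        return False
    finally:
        locale.setlocale(locale.LC_CTYPE, saved)  # restore the previous setting


# The results depend on which locales are installed (compare with `locale -a`).
for candidate in ("C", "C.UTF-8", "en_US.UTF-8"):
    print(candidate, locale_is_available(candidate))
```

A name that reports False here is the kind of value that leaves LANG pointing at a locale the system cannot load.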

      

+4




I solved it by running the following:



apt install locales-all

      

0




I think you are reading the error message incorrectly. Be careful to distinguish UnicodeDecodeError from UnicodeEncodeError.

You say that Python complains that "the ascii codec cannot decode some characters". However, as far as I know, there is no such error message. Compare the following two cases:

>>> b = 'é'.encode('utf8')
>>> b.decode('ascii')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
>>> 'é'.encode('ascii')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character '\xe9' in position 0: ordinal not in range(128)

      

The message is either "can't decode byte" or "can't encode character", but never "can't decode character".

This may sound pedantic, but in this line

for line in f: print(line)

      

you have both decoding (before the colon) and encoding (the print call). Therefore, you need to be sure which process is causing the problem. One possibility is to write this on two lines.
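
Splitting the loop into separate steps could look like the sketch below. The temporary file and its content are made up for the demonstration, and line.encode("ascii") is only a stand-in for what print() has to do when sys.stdout uses the ASCII codec.

```python
import os
import tempfile

# Create a small UTF-8 file for the demonstration (made-up content).
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "w", encoding="utf-8") as out:
    out.write("caf\u00e9\n")

failed_step = None

# Step 1: decoding (bytes from the file -> str).
with open(path, encoding="utf-8") as f:
    try:
        lines = list(f)
    except UnicodeDecodeError:
        failed_step = "decode"
        lines = []

# Step 2: encoding (str -> bytes), simulated with an explicit ASCII encode.
for line in lines:
    try:
        line.encode("ascii")
    except UnicodeEncodeError:
        failed_step = "encode"

print(failed_step)
```

With the file opened as UTF-8, the decoding step succeeds and only the encoding step fails, which matches the suspicion that print is the culprit.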

However, if f is opened with encoding='utf-8' as you write, I'm pretty sure the print call is causing the problem. By default, print() writes to sys.stdout. Since this stream is already open when you start Python, its encoding is already set, depending on your environment. Since your locale setting LC_ALL is not specified, the default ASCII ("ANSI X3.4-1968") is used (this may answer your question in the title).
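
To see which encoding the already-open stream ended up with, the relevant values can be printed side by side. This is just a sketch: describe_encodings is a hypothetical helper name, and the values it reports depend entirely on the environment Python was started in.

```python
import locale
import sys


def describe_encodings():
    """Hypothetical helper: collect the encodings Python picked up at startup."""
    return {
        # Encoding used by print(); may be None if stdout was replaced.
        "stdout": getattr(sys.stdout, "encoding", None),
        # Encoding derived from the locale, without changing the locale.
        "preferred": locale.getpreferredencoding(False),
        "filesystem": sys.getfilesystemencoding(),
    }


for name, value in describe_encodings().items():
    print(name, value)
```

If the stdout entry shows an ASCII name such as ANSI_X3.4-1968, printing non-ASCII text will raise UnicodeEncodeError.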

If you can't or don't want to change the locale, here's what you can do to send UTF-8 text to STDOUT from Python:

  • write to the underlying binary stream:

    import sys
    for line in f:
        sys.stdout.buffer.write(line.encode('utf-8'))
    
          

  • re-encode sys.stdout (actually: replace sys.stdout with a re-encoding wrapper):

    import codecs
    import sys
    sys.stdout = codecs.getwriter('utf-8')(sys.stdout.buffer)
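
The effect of the wrapper can be checked without touching the real stdout. In this sketch an io.BytesIO object stands in for sys.stdout.buffer (an assumption made only so that the written bytes can be inspected).

```python
import codecs
import io

# Stand-in for sys.stdout.buffer so the written bytes can be inspected.
fake_stdout_buffer = io.BytesIO()

# Same wrapping technique as above: a writer that encodes str to UTF-8 bytes.
writer = codecs.getwriter("utf-8")(fake_stdout_buffer)
writer.write("caf\u00e9\n")

data = fake_stdout_buffer.getvalue()
print(data)  # b'caf\xc3\xa9\n'
```

The bytes are UTF-8 regardless of the locale, because the encoding is chosen explicitly instead of being taken from the environment.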
    
          

In any case, it is still possible that your terminal cannot display UTF-8 text correctly, either because it is unable to do so or because it is not configured to do so. In this case, you will probably see question marks or mojibake. But that's a different story, outside of Python's control ...

-2








