Python encoding - Is there any explanation?
Can someone explain to me why python has this behavior?
Let me explain.
BACKGROUND
I have a Python installation and I want to use some characters that are not in the ASCII table, so I changed Python's default encoding. I store each string in a .py file this way: '_MAIL_TITLE_': u' ',
Now, using a method that replaces my dictionary keys, I want to dynamically insert my strings into the html template.
I put in the header of the html page:
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
...... <!-- Some Css -->
</head>
Unfortunately my html document comes to me (after replacement) with some wrong characters (unconverted? Incorrectly converted?)
So I opened a terminal to try to sort things out:
1 - Python 2.4.6 (#1, Jan 27 2012, 15:41:03)
2 - [GCC 4.1.2 20080704 (Red Hat 4.1.2-51)] on linux2
3 - Type "help", "copyright", "credits" or "license" for more information.
4 - >>> import sys
5 - >>> sys.getdefaultencoding()
6 - 'utf-8'
7 - >>> u'èéòç'
8 - u'\xe8\xe9\xf2\xe7'
9 - >>> u'èéòç'.encode('utf-8')
10 - '\xc3\xa8\xc3\xa9\xc3\xb2\xc3\xa7'
11 - >>> u'è'
12 - u'\xe8'
13 - >>> u'è'.encode()
14 - '\xc3\xa8'
Question
Look at lines [7-10]. Isn't it weird? Why, if my Python's default encoding is utf-8 (line 6), does it convert that string (line 7) differently than line 9 does? Now take a look at lines [11-14] and their output.
Now I am completely confused!
TIP
So I tried changing my terminal's input encoding (formerly ISO-8859-1, now utf-8), and something changed:
1 - Python 2.4.6 (#1, Jan 27 2012, 15:41:03)
2 - [GCC 4.1.2 20080704 (Red Hat 4.1.2-51)] on linux2
3 - Type "help", "copyright", "credits" or "license" for more information.
4 - >>> import sys
5 - >>> sys.getdefaultencoding()
6 - 'utf-8'
7 - >>> u'èéòç'
8 - u'\xc3\xc3\xa8\xc3\xa9\xc3\xb2\xc3\xa7'
9 - >>> u'èéòç'.encode('utf-8')
10 - '\xc3\xa8\xc3\xa9\xc3\xb2\xc3\xa7'
11 - >>> u'è'
12 - u'\xe8'
13 - >>> u'è'.encode()
14 - '\xc3\xa8'
So explicit encoding works regardless of the input encoding (or so it seems to me, but I've been stuck on this for days, so maybe I have confused myself).
WHERE IS THE SOLUTION
If you compare line 8 of the BACKGROUND session with line 8 of the TIP session, you will see that the unicode objects being created differ. So I started thinking about it. What have I concluded? Nothing. Nothing, except that maybe my problems come from the encoding of the file in which I save my .py (the one that contains all the utf-8 characters that should be inserted into the html document).
"REAL" CODE
The code does nothing special: it opens the html template, reads it into a string, replaces the placeholders with unicode strings (utf-8 encoded? I hope so) and saves the result in another file that will be rendered on the web (yes, my landing page has the utf-8 meta header). I don't have the code here because it's scattered across multiple files, but I'm confident in the program's workflow (I have traced it).
FINAL QUESTION
In light of this, does anyone have any ideas to make my code work? Ideas about encoding unix files? Or .py file encoding? How do I change the encoding to make my code work?
LATEST TIPS
If, before replacing the placeholders with the utf-8 object, I insert
utf8Obj.encode('latin-1')
my document displays perfectly on the web!
Thanks to anyone who answers.
EDIT1 - DEVELOPMENT WORKFLOW
OK, my development workflow is:
I have a CVS repository for this project. The project is located at the OSS. This server is a 64-bit machine. I develop my code on Windows 7 (64-bit) with Eclipse. Every modification goes through a CVS commit ONLY. The code is executed on a CentOS machine that uses this Python:
Python 2.4.6 (#1, Jan 27 2012, 15:41:03)
[GCC 4.1.2 20080704 (Red Hat 4.1.2-51)] on linux2
I have set up Eclipse to work this way: PREFERENCES -> GENERAL -> WORKSPACE -> ENCODING TEXT FILE: UTF-8
The Zope / Plone application runs on the same server: it serves up some PHP pages. The PHP pages call some of the Python web-service (WS) methods (application logic) that are located on the Zope / Plone server. This server interfaces directly with the application logic.
That's all.
EDIT2
This is the function that does the replacement:
def _fillTemplate(self, buf):
    """_fillTemplate(buf)-->str
    Return the document with the fields replaced using dict_template.
    """
    try:
        for k, v in self.dict_template.iteritems():
            if not isinstance(v, unicode):
                v = str(v)
            else:
                v = v.encode('latin-1')  # In that way it works, but why?
            buf = buf.replace(k, v)
To address this and future issues, I would suggest that you look at the answers to the question "UnicodeDecodeError on file redirection", which contain a general discussion of what this encoding / decoding business is about.
In the first example, your terminal is Latin1 encoded:
7 - >>> u'èéòç'
8 - u'\xe8\xe9\xf2\xe7'
The Latin1 byte values of these characters coincide with their Unicode code points (for example, è is byte 0xE8 in Latin1 and code point U+00E8 in Unicode), so Python does not need to do any conversion. When you switch the terminal to UTF-8 you get
7 - >>> u'èéòç'
8 - u'\xc3\xc3\xa8\xc3\xa9\xc3\xb2\xc3\xa7'
Your terminal sends the UTF-8 encoding of your characters to Python as four 2-byte sequences. Your Python interpreter took these bytes verbatim and stored each one as a separate code point, so the unicode object holds the bytes of the encoded representation of your string rather than the four characters you typed.
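The two terminal sessions can be reproduced without a terminal. What follows is a minimal sketch, written in Python 3 terms as an assumption (there, bytes plays the role of Python 2's str and str the role of unicode); the variable names are illustrative and do not come from the question.

```python
# The text 'èéòç' written as Unicode code points.
text = '\xe8\xe9\xf2\xe7'

# Bytes a terminal would send for that text under each encoding.
latin1_bytes = text.encode('latin-1')  # one byte per character
utf8_bytes = text.encode('utf-8')      # two bytes per character here

# Latin-1 terminal, input read as Latin-1: each byte maps to the
# code point with the same number, so the text comes back intact.
from_latin1_terminal = latin1_bytes.decode('latin-1')

# UTF-8 terminal, input still read as Latin-1: every byte becomes
# its own character, giving eight characters instead of four.
from_utf8_terminal = utf8_bytes.decode('latin-1')

# ascii() escapes non-ASCII characters, so this prints safely anywhere.
print(ascii(from_latin1_terminal))
print(ascii(from_utf8_terminal))
print(len(from_latin1_terminal), len(from_utf8_terminal))
```

The second printed value has the same shape as the unicode object in the UTF-8 terminal session: one code point per UTF-8 byte.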
If your editor saves UTF-8, you should place the following on top of your .py file:
# -*- coding: utf-8 -*-
This declaration must match the encoding your editor actually uses when it saves the file.
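To check which encoding a source file declares, the standard library of Python 3 offers tokenize.detect_encoding; it is not available in the Python 2.4 used in the question, so treat this as a sketch under that assumption. The helper name declared_encoding and the sample sources are made up for illustration.

```python
import io
import tokenize


def declared_encoding(source_bytes):
    """Return the encoding Python would use to read this source code."""
    # detect_encoding wants a readline callable that yields bytes.
    encoding, _ = tokenize.detect_encoding(io.BytesIO(source_bytes).readline)
    return encoding


# Two tiny in-memory "files" that differ only in their declaration.
utf8_source = b"# -*- coding: utf-8 -*-\ntitle = 'x'\n"
latin1_source = b"# -*- coding: latin-1 -*-\ntitle = 'x'\n"

print(declared_encoding(utf8_source))
print(declared_encoding(latin1_source))
```

The declaration only tells Python how to read the bytes; it does not convert the file, which is why it has to agree with what the editor writes.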
The most reliable approach to handling encodings is probably one of the following two:
- Your program handles only (byte) strings in a single internal encoding (UTF-8 is a good choice). This means that if you receive, say, Latin1-encoded data, you have to recode it to UTF-8:
data.decode('latin1').encode('utf8')
The best way to handle your string literals in this case is to have your editor save your file in UTF-8 and to use regular (byte) string literals ("This is a string", with no leading u).
- Your program can alternatively manipulate only Unicode strings. My experience is that this is a bit cumbersome with Python 2. This would be my method of choice with Python 3, though, because Python 3 has much more natural support for these encoding issues (string literals are character strings, not byte strings, etc.).
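The two approaches can be sketched as follows. This is an illustration in Python 3 terms (an assumption, since the question uses Python 2.4), and the function names recode_latin1_to_utf8 and fill_template are hypothetical; only the placeholder '_MAIL_TITLE_' comes from the question.

```python
def recode_latin1_to_utf8(data):
    """First approach: turn Latin-1 bytes into UTF-8 bytes."""
    return data.decode('latin-1').encode('utf-8')


def fill_template(template_bytes, replacements, encoding='utf-8'):
    """Second approach: decode once, work on text, encode once at the end."""
    text = template_bytes.decode(encoding)
    for placeholder, value in replacements.items():
        # str() leaves text unchanged and converts numbers and the like.
        text = text.replace(placeholder, str(value))
    return text.encode(encoding)


latin1_data = b'\xe8'  # the character è encoded as Latin-1
utf8_data = recode_latin1_to_utf8(latin1_data)

# '\xe8' is the character è written as an escape.
page = fill_template(b'<h1>_MAIL_TITLE_</h1>', {'_MAIL_TITLE_': '\xe8 qui'})

print(utf8_data)
print(page)
```

In the second function, encoded bytes exist only at the two edges (reading the template and writing the page), which keeps the replacement logic free of encoding concerns.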
While you are answering my comment, here is the answer to the first question:
Look at lines [7-10]. Isn't it weird? Why, if my Python's default encoding is utf-8 (line 6), does it convert that string (line 7) differently than line 9 does? Now take a look at lines [11-14] and their output.
No, this is not strange: you have to distinguish between the Python encoding, the shell encoding, the system encoding, the file encoding, the declared encoding, and the application encoding. That's a lot of encodings, isn't it?
sys.getdefaultencoding()
This gives you the default encoding Python uses when it implicitly converts between unicode and byte strings. It is not related to the output.
In [7]: u'è'
Out[7]: u'\xe8'
In [8]: u'è'.encode('utf8')
Out[8]: '\xc3\xa8'
In [9]: print u'è'
è
In [10]: print u'è'.encode('utf8')
è
When you use print, the characters are printed to the screen; if you don't, Python gives you a representation that you can copy / paste to get the same data.
Since the unicode string is not the same as the utf8 string, it doesn't give you the same data.
Unicode is the "neutral" representation of the string, while utf8 is encoded.
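The difference between the copy-pasteable view and the data itself can be sketched like this, again in Python 3 terms (an assumption, as the sessions above are Python 2). ascii() and repr() produce the escaped view, and ast.literal_eval turns that view back into the object; the variable names are illustrative.

```python
import ast

text = '\xe8'                   # the character è as a text (unicode) string
encoded = text.encode('utf-8')  # the same character as UTF-8 bytes

# Escaped, copy-pasteable representations, similar to what the
# interactive prompt echoes back when print is not used.
text_view = ascii(text)
bytes_view = repr(encoded)

print(text_view)
print(bytes_view)

# Pasting a representation back yields the original object again.
text_again = ast.literal_eval(text_view)
encoded_again = ast.literal_eval(bytes_view)

# The two objects are different things: one character versus two bytes.
print(len(text), len(encoded))
```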
On line 7, you output the Unicode object:
>>> u'èéòç'
u'\xe8\xe9\xf2\xe7'
No encoding happens; it just tells you that your input consists of the Unicode code points \xe8, \xe9, etc.
On line 9, you create a UTF-8 encoded string from a Unicode object. The output of an encoded string differs from that of an unencoded Unicode object, as you would expect:
>>> u'èéòç'.encode('utf-8')
'\xc3\xa8\xc3\xa9\xc3\xb2\xc3\xa7'
In your second experiment, where you changed the terminal encoding, you actually broke the interpretation of the input characters:
>>> u'èéòç'
u'\xc3\xa8\xc3\xa9\xc3\xb2\xc3\xa7'
When you type these four characters, your terminal encodes them as eight bytes of UTF-8, and Python treats each of those bytes as a separate character. But the resulting eight characters are not the ones you wanted to enter. It looks like Python expects ISO-8859-1 characters from the terminal while it actually receives UTF-8 data, which leads to a mess.
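One plausible reading of why the .encode('latin-1') workaround in the question produces a correct page is that the unicode objects already hold UTF-8 bytes misread as ISO-8859-1 characters; encoding such a string as Latin-1 turns each character back into the byte it came from. This is an assumption about the asker's data, not something the posts confirm. A minimal sketch in Python 3 terms, with illustrative variable names:

```python
# What the user meant to enter: the four characters èéòç.
intended = '\xe8\xe9\xf2\xe7'

# What Python ends up with when UTF-8 input is read as ISO-8859-1:
# eight characters, one per UTF-8 byte.
garbled = intended.encode('utf-8').decode('iso-8859-1')

# Encoding the garbled text as Latin-1 maps every character back to
# the byte it came from, which restores the original UTF-8 bytes.
recovered_bytes = garbled.encode('latin-1')
recovered_text = recovered_bytes.decode('utf-8')

print(ascii(garbled))
print(recovered_bytes)
print(recovered_text == intended)
```

If that reading is right, the workaround hides the problem rather than fixing it: the cleaner fix is to make sure the strings are decoded with the right encoding in the first place.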