UnicodeDecodeError: 'ascii' codec can't decode byte 0x92 in position 47: ordinal not in range(128)
I am trying to write data to a StringIO object in Python and then load that data into a Postgres database using psycopg2's copy_from() function.
At first, copy_from() was throwing an error: ERROR: invalid byte sequence for encoding "UTF8": 0xc92. So I followed this question.
I figured out that my Postgres database is UTF8 encoded.
The file/StringIO object I'm writing my data from reports its encoding like this: setgid Non-ISO extended-ASCII English text with very long lines with CRLF line terminators
I have tried to encode every string I write to the intermediate file/StringIO object as UTF-8, calling .encode(encoding='UTF-8', errors='strict') on each line.
This is the error I get now: UnicodeDecodeError: 'ascii' codec can't decode byte 0x92 in position 47: ordinal not in range(128)
What does it mean? How to fix it?
EDIT: I am using Python 2.7. Some code snippets:
I read from a MySQL database which has data encoded in UTF-8 according to MySQL Workbench. These are a few lines of code to write my data (retrieved from MySQL db) to a StringIO object:
# Populate the table_data variable with rows delimited by \n and columns delimited by \t
row_num = 0
for row in cursor.fetchall():
    # Separate rows in a table by new line delimiter
    if(row_num != 0):
        table_data.write("\n")
    col_num = 0
    for cell in row:
        # Separate cells in a row by tab delimiter
        if(col_num != 0):
            table_data.write("\t")
        table_data.write(cell.encode(encoding='UTF-8', errors='strict'))
        col_num = col_num + 1
    row_num = row_num + 1
This is the code that writes to Postgres database from my StringIO object table_data:
cursor = db_connection.cursor()
cursor.copy_from(table_data, <postgres_table_name>)
The problem is that you are calling encode on a str object.
A str is a byte string, usually representing text encoded in some way, such as UTF-8. When you call encode on it, the text first has to be decoded back, so that it can be re-encoded. By default, Python does this by calling s.decode(sys.getdefaultencoding()), and getdefaultencoding() usually returns 'ascii'.
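This implicit ASCII decode is exactly what produces the error above. A small sketch (written in Python 3 terms, with a made-up sample string) shows why a byte like 0x92 can never survive it:

```python
# Byte 0x92 is not valid ASCII (its ordinal is >= 128), so decoding
# the byte string as ASCII raises UnicodeDecodeError -- the same error
# Python 2 raises when str.encode() implicitly decodes first.
data = b"it\x92s"  # hypothetical sample; 0x92 sits at index 2

try:
    data.decode("ascii")
except UnicodeDecodeError as e:
    print(e)  # 'ascii' codec can't decode byte 0x92 in position 2: ...
```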
So you are taking UTF-8-encoded text, decoding it as if it were ASCII, and then re-encoding it to UTF-8.
The general solution is to explicitly call decode with the correct encoding, instead of letting Python use the default, and then encode the result. But when the correct encoding is already the one you want, the easiest thing is to just skip the .decode('utf-8').encode('utf-8') and use the UTF-8 str as it already is.
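As a minimal sketch of that general decode-then-encode fix: the choice of 'cp1252' below is an assumption on my part (byte 0x92 is a curly apostrophe in Windows-1252, which would match the "Non-ISO extended-ASCII" description of the file), so substitute whatever encoding your bytes are actually in.

```python
# Decode with the encoding the bytes are actually in, then encode
# to the target encoding. Never let Python guess the intermediate step.
raw = b"it\x92s"              # bytes as read from the source (sample)
text = raw.decode("cp1252")   # real Unicode text: u"it\u2019s"
utf8 = text.encode("utf-8")   # UTF-8 bytes, safe for copy_from
```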
Or, conversely, if your MySQL library has an option that lets you specify an encoding and get back unicode values for CHAR / VARCHAR / TEXT columns instead of str values (for example, in MySQLdb you pass use_unicode=True to the connect call, or charset='utf8' if your database is too old to auto-detect it), just do that. Then you have unicode objects, and you can call .encode('utf-8') on them.
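A connection-configuration sketch of that MySQLdb option (host, user, and database names here are placeholders, not anything from the question):

```python
import MySQLdb

# use_unicode asks MySQLdb to return unicode objects for text columns;
# charset sets the connection charset explicitly for older servers.
conn = MySQLdb.connect(
    host="localhost",
    user="me",
    passwd="secret",
    db="mydb",
    use_unicode=True,
    charset="utf8",
)
cursor = conn.cursor()
```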
In general, the best way to deal with Unicode problems is to decode everything as early as possible, do all the processing in Unicode, and then encode as late as possible. But in any case, you must be consistent. Don't call str on something that might be a unicode; don't concatenate a str literal onto a unicode or pass one to its replace method; etc. Every time you mix and match, Python implicitly converts for you using the default encoding, which will almost never be what you want.
As a side note, this is one of the many things that Python 3.x's Unicode handling fixes. First, str is now Unicode text, not encoded bytes. More importantly, if you have encoded bytes, for example in a bytes object, calling encode will give you an AttributeError instead of silently trying to decode so it can re-encode. And, similarly, trying to mix and match Unicode and bytes gives you an obvious TypeError instead of an implicit conversion that succeeds in some cases and, in others, gives a cryptic message about an encoding or decoding you never asked for.
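Both of those Python 3 behaviors can be demonstrated in a few lines:

```python
# In Python 3, str is text and bytes is encoded data, and the two
# never convert implicitly.
encoded = "it\u2019s".encode("utf-8")  # a bytes object

# bytes has no encode() method at all, so this fails loudly:
try:
    encoded.encode("utf-8")
except AttributeError as e:
    print(type(e).__name__)  # AttributeError

# and mixing str with bytes is an immediate, obvious error:
try:
    "text" + encoded
except TypeError as e:
    print(type(e).__name__)  # TypeError
```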