UnicodeDecodeError: 'ascii' codec can't decode byte 0x92 in position 47: ordinal not in range(128)

I am trying to write data to a StringIO object in Python and then load that data into a Postgres database using psycopg2's copy_from() function.

When I first did this, copy_from() threw an error: ERROR: invalid byte sequence for encoding "UTF8": 0xc92. So I followed this question.

I figured out that my Postgres database is UTF8 encoded.

The file / StringIO object I'm writing my data from reports its encoding as follows: setgid Non-ISO extended-ASCII English text, with very long lines, with CRLF line terminators

I have tried to encode every string I write to the intermediate file / StringIO object as UTF-8, using .encode(encoding='UTF-8', errors='strict') for each line.

This is the error I get now: UnicodeDecodeError: 'ascii' codec can't decode byte 0x92 in position 47: ordinal not in range(128)

What does this mean, and how do I fix it?

EDIT: I am using Python 2.7. Some code snippets:

I read from a MySQL database which has data encoded in UTF-8 according to MySQL Workbench. These are a few lines of code to write my data (retrieved from MySQL db) to a StringIO object:

# Populate the table_data variable with rows delimited by \n and columns delimited by \t
row_num=0
for row in cursor.fetchall() :

    # Separate rows in a table by new line delimiter
    if(row_num!=0):
        table_data.write("\n")

    col_num=0
    for cell in row:    
        # Separate cells in a row by tab delimiter
        if(col_num!=0):
            table_data.write("\t") 

        table_data.write(cell.encode(encoding='UTF-8',errors='strict'))
        col_num = col_num+1

    row_num = row_num+1   

      

This is the code that writes to Postgres database from my StringIO object table_data:

cursor = db_connection.cursor()
cursor.copy_from(table_data, <postgres_table_name>)

      


1 answer


The problem is that you are calling encode on a str object.

A str is a byte string, usually representing text encoded in some way, such as UTF-8. When you call encode on it, it first has to be decoded back to text so that the text can be re-encoded. By default, Python does this by calling s.decode(sys.getdefaultencoding()), and getdefaultencoding() usually returns 'ascii'.

So you are taking UTF-8 encoded text, decoding it as if it were ASCII, and then re-encoding it as UTF-8.
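
To make that sequence concrete, here is a minimal sketch that performs the implicit decode step by hand. The sample string and the helper name reencode_like_python2 are made up for illustration. In Python 2, calling encode on a str does the equivalent of this internally; writing it out explicitly lets the same code run on Python 2 and 3.

```python
# Sample data (hypothetical): UTF-8 bytes containing U+2019, a non-ASCII character.
raw = u"It\u2019s".encode("utf-8")


def reencode_like_python2(data, implicit_codec="ascii"):
    """Illustrative helper: decode with the default codec, then encode as UTF-8.

    This mirrors what Python 2 does when encode() is called on a byte string.
    """
    return data.decode(implicit_codec).encode("utf-8")


try:
    reencode_like_python2(raw)
    error_name = None
    failing_codec = None
except UnicodeDecodeError as exc:
    # The *decode* step fails, even though the caller only asked to encode.
    error_name = type(exc).__name__
    failing_codec = exc.encoding
```

This is why an encode call can end in a UnicodeDecodeError that mentions the 'ascii' codec.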

The general solution is to explicitly call decode with the correct encoding, instead of letting Python use the default, and then encode the result.
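
As a rough sketch of that explicit decode-then-encode step: the example below assumes, purely for illustration, that the input bytes are Windows-1252 (where byte 0x92 is a right single quotation mark). The variable names are hypothetical; substitute whatever encoding your data actually uses.

```python
# Hypothetical input: bytes in Windows-1252, where 0x92 is U+2019.
source_bytes = b"It\x92s"
source_encoding = "cp1252"  # assumption for this example only

text = source_bytes.decode(source_encoding)  # bytes -> Unicode text
utf8_bytes = text.encode("utf-8")            # Unicode text -> UTF-8 bytes
```

Naming the encoding yourself means the default 'ascii' codec never gets involved.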



But when the correct encoding is already the one you want, the easiest thing is to skip the .decode('utf-8').encode('utf-8') round trip and just use your UTF-8 str as the UTF-8 str it already is.
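
A small sketch of why the round trip can be dropped, assuming the cell value really is UTF-8 already. The sample value is made up, and io.BytesIO merely stands in for the StringIO buffer from the question.

```python
import io

# Hypothetical cell value as a driver might return it: already UTF-8 bytes.
cell = u"caf\u00e9".encode("utf-8")

# Decoding and re-encoding with the same codec gives back the same bytes...
round_tripped = cell.decode("utf-8").encode("utf-8")

# ...so the bytes can be written to the buffer unchanged.
table_data = io.BytesIO()
table_data.write(cell)
```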

Or, alternatively, if your MySQL wrapper has a feature that lets you specify an encoding and get back unicode values for CHAR / VARCHAR / TEXT columns instead of str values (for example, in MySQLdb, you pass use_unicode=True to the connect call, or charset='UTF-8' if your database is too old to auto-detect it), just do that. Then you will have unicode objects, and you can call .encode('utf-8') on them.

In general, the best way to deal with Unicode problems is to decode everything as early as possible, do all the processing in Unicode, and then encode as late as possible. But either way, you have to be consistent. Don't call str on something that might be a unicode; don't concatenate a str literal to a unicode or pass one to its replace method; etc. Any time you mix and match, Python will implicitly convert for you using the default encoding, which is almost never what you want.
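
The "decode early, encode late" pattern might look like the following sketch when building a tab-delimited buffer like the one in the question. The helper to_text, the sample rows, and the assumption that incoming byte strings are UTF-8 are all illustrative, not taken from the original code.

```python
import io

TEXT_TYPE = type(u"")  # unicode on Python 2, str on Python 3


def to_text(cell, encoding="utf-8"):
    """Decode early: turn any cell into Unicode text (the encoding is an assumption)."""
    if isinstance(cell, bytes):
        return cell.decode(encoding)
    return TEXT_TYPE(cell)


# Hypothetical rows mixing byte strings, Unicode text, and numbers.
rows = [(b"caf\xc3\xa9", u"na\u00efve", 3), (b"plain", u"text", 4)]

# Process in Unicode: join cells with tabs and rows with newlines.
table_text = u"\n".join(u"\t".join(to_text(c) for c in row) for row in rows)

# Encode late: a single encode call, right before handing the data off.
table_data = io.BytesIO(table_text.encode("utf-8"))
```

Note that this sketch does not escape tabs, newlines, or backslashes inside cell values, and it does not handle None; real data headed for a COPY would need both.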

As a side note, this is one of the many things that Python 3.x's Unicode changes help with. First, str is now Unicode text, not encoded bytes. More importantly, if you have encoded bytes, for example in a bytes object, calling encode will give you an AttributeError instead of trying to silently decode so it can re-encode. Similarly, trying to mix and match Unicode and bytes gives you an obvious TypeError, instead of an implicit conversion that succeeds in some cases and in others gives a cryptic message about an encode or decode you never asked for.
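
For comparison, a short Python 3 only sketch of those two failure modes (the sample value is made up):

```python
# Python 3 only: bytes and str are distinct types with no implicit conversion.
data = "caf\u00e9".encode("utf-8")  # a bytes object

try:
    data.encode("utf-8")  # bytes has no encode method in Python 3
    encode_error = None
except AttributeError as exc:
    encode_error = type(exc).__name__

try:
    "prefix: " + data  # mixing str and bytes is rejected outright
    concat_error = None
except TypeError as exc:
    concat_error = type(exc).__name__
```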
