UnicodeEncodeError: 'latin-1' codec can't encode character u'\u2014'

I am getting this error: UnicodeEncodeError: 'latin-1' codec can't encode character u'\u2014'

I am trying to load a lot of news articles into MySQL using MySQLdb. However, I am having a hard time dealing with non-standard characters: I am getting hundreds of these errors for all kinds of characters. I can handle them individually with .replace(), but I would like a more complete solution that handles them correctly.

ubuntu@ip-10-0-0-21:~/scripts/work$ python test_db_load_error.py
Traceback (most recent call last):
  File "test_db_load_error.py", line 27, in <module>
    cursor.execute(sql_load)
  File "/usr/lib/python2.7/dist-packages/MySQLdb/cursors.py", line 157, in execute
    query = query.encode(charset)
UnicodeEncodeError: 'latin-1' codec can't encode character u'\u2014' in position 158: ordinal not in range(256)

      

My script:

import MySQLdb as mdb
from goose import Goose
import string
import datetime

host = 'rds.amazonaws.com'
user = 'news'
password = 'xxxxxxx'
db_name = 'news_reader'
conn = mdb.connect(host, user, password, db_name)

url = 'http://www.dailymail.co.uk/wires/ap/article-3060183/Andrew-Lesnie-Lord-Rings-cinematographer-dies.html?ITO=1490&ns_mchannel=rss&ns_campaign=1490'
g = Goose()
article = g.extract(url=url)
body = article.cleaned_text
body = body.replace("'","`")
load_date = str(datetime.datetime.now())
summary = article.meta_description
title = article.title
image = article.top_image

sql_load = "insert into articles " \
        "    (title,summary,article,image,source,load_date) " \
        "     values ('%s','%s','%s','%s','%s','%s');" % \
        (title,summary,body,image,url,load_date)
cursor = conn.cursor()
cursor.execute(sql_load)
#conn.commit()

      

Any help would be appreciated.



3 answers


If your database is indeed configured for Latin-1, then you cannot store non-Latin-1 characters in it. This includes U+2014, EM DASH.
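
You can see the limit for yourself with nothing but the standard library. This is only a minimal sketch; the helper name fits_latin1 is made up for illustration and is not part of MySQLdb or any other library:

```python
# -*- coding: utf-8 -*-

def fits_latin1(text):
    """Return True if every character in `text` can be encoded as Latin-1."""
    try:
        text.encode('latin-1')
    except UnicodeEncodeError:
        return False
    return True

e_acute = u'\u00e9'   # LATIN SMALL LETTER E WITH ACUTE, code point 233
em_dash = u'\u2014'   # EM DASH, code point 8212

print(fits_latin1(e_acute))  # code point is below 256, so Latin-1 can hold it
print(fits_latin1(em_dash))  # code point is above 255, so encoding fails
```

The second case is the same failure that query.encode(charset) hits in the traceback from the question.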

The ideal solution is to simply switch to a database configured for UTF-8. Just specify the utf8 character set when you initially create the database, and pass charset='utf8' every time you connect to it. (MySQL's name for the character set is utf8, without the hyphen.) If you already have existing data, you probably want to use the MySQL tools to migrate the old database to the new one instead of Python code, but the basic idea is the same.



However, sometimes this is not possible. Perhaps you have other software that cannot be updated, requires Latin-1, and needs to share the same database. Or maybe you've been mixing Latin-1 text and binary data in ways that can't be programmatically unmixed, or your database is too big to migrate, or whatever. In this case, you have two options:

  • Convert your strings to Latin-1 destructively before storing and searching. For example, you might need to convert the em dash to - or to --, or maybe it doesn't matter that much and you can just convert all non-Latin-1 characters to ? (which is faster and easier).

  • Come up with an encoding scheme for smuggling non-Latin-1 characters into the database. This means that some searches become more complex or simply cannot be executed directly in the database.
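
Both options can be sketched with Python's built-in codec error handlers. Treat this as an illustration under assumptions, not a finished solution: the helper names and the replacement table are hypothetical, and numeric character references are just one possible smuggling scheme among many.

```python
# -*- coding: utf-8 -*-
import re

try:
    unichr            # Python 2
except NameError:
    unichr = chr      # Python 3

# Hypothetical replacement table: choose whatever stand-ins suit your data.
REPLACEMENTS = {
    u'\u2014': u'--',   # EM DASH
    u'\u2013': u'-',    # EN DASH
    u'\u2018': u"'",    # LEFT SINGLE QUOTATION MARK
    u'\u2019': u"'",    # RIGHT SINGLE QUOTATION MARK
}

def to_latin1_lossy(text):
    """Option 1 (destructive): known characters get a stand-in,
    anything else outside Latin-1 becomes '?'."""
    for char, stand_in in REPLACEMENTS.items():
        text = text.replace(char, stand_in)
    return text.encode('latin-1', 'replace')

def to_latin1_escaped(text):
    """Option 2 (smuggling): characters outside Latin-1 become
    numeric references such as &#8212;."""
    return text.encode('latin-1', 'xmlcharrefreplace')

def from_latin1_escaped(data):
    """Undo to_latin1_escaped: turn &#NNNN; references back into characters."""
    text = data.decode('latin-1')
    return re.sub(r'&#(\d+);', lambda m: unichr(int(m.group(1))), text)

stored = to_latin1_escaped(u'Lesnie \u2014 cinematographer')
print(repr(stored))
print(repr(from_latin1_escaped(stored)))
```

Note the cost of the second option: a search for an em dash now has to look for the literal text &#8212; in the database, and any text that already contained such a reference before storing will be changed when it is read back.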



When you create the connection with MySQLdb, pass charset='utf8' to connect().



conn = mdb.connect(host, user, password, db_name, charset='utf8')

      



This may be heavy reading, but it at least got me started.

http://www.joelonsoftware.com/articles/Unicode.html
