How do I configure MySQL to handle Unicode diacritics correctly?

Question

This is an odd conundrum. AFAIK, utf8_bin is meant to ensure that every accented character is stored properly in the database, i.e. without any weird conversion to ASCII. So I have a table with:

DEFAULT CHARSET=utf8 COLLATE=utf8_bin

and yet when I compare or query records such as "Krąków" and "Kraków", MySQL treats them as the same string.

Out of curiosity, I also tried utf8_polish_ci, and MySQL claims that for Polish speakers "a" and "ą" make no difference.

So how do I set up my MySQL table so that I can store Unicode strings safely, without losing the diacritics?

Server: MySQL 5.5 + openSUSE 11.4, Client: Windows 7 + MySQL Workbench 5.2.

Update - CREATE TABLE

CREATE TABLE `Cities` (
  `city_Name` VARCHAR(145) CHARACTER SET utf8 NOT NULL,
  PRIMARY KEY (`city_Name`)
) DEFAULT CHARSET=utf8 COLLATE=utf8_bin;

Note that I cannot explicitly set utf8_bin on the column: because the whole table is utf8_bin, the column's collation is reset to "default".

mysql unicode utf-8 diacritics collation

greenoldman 15 Feb 13 at 20:47

2 answers

When a table is created, MySQL applies its default character set and collation at that moment; those defaults can be changed on a per-connection basis. Changing the defaults after a table has been created does not affect existing tables.

Character sets and collations are attributes of individual columns. They can be set by default, but they belong to the columns.
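Because the collation that matters is the one attached to the column, it can help to inspect it directly rather than rely on the table default. A minimal sketch, assuming the `Cities` table from the question lives in the currently selected database:

```sql
-- Show each column with its effective collation (see the "Collation" field).
SHOW FULL COLUMNS FROM Cities;

-- The same information from the data dictionary.
SELECT COLUMN_NAME, CHARACTER_SET_NAME, COLLATION_NAME
FROM information_schema.COLUMNS
WHERE TABLE_SCHEMA = DATABASE()
  AND TABLE_NAME = 'Cities';

-- The table-level default, for comparison with the column value above.
SELECT TABLE_NAME, TABLE_COLLATION
FROM information_schema.TABLES
WHERE TABLE_SCHEMA = DATABASE()
  AND TABLE_NAME = 'Cities';
```

If the column reports a different collation than the table, comparisons on that column follow the column's value, not the table's.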

The utf8 encoding should be sufficient to represent all European languages correctly. You should definitely be able to store "a" and "ą" as two different characters.

The utf8_bin collation makes comparison and sorting both case-sensitive and accent-sensitive.

Here are some examples of the difference between the stored text value and the collation behavior. I am using three example strings: "abcd", "ĄBCD" and "ąbcd". The last two start with the letter A-ogonek.

This first example shows that with the utf8 character set and the utf8_general_ci collation, the strings are displayed exactly as entered, but they compare equal. This is to be expected from a collation that does not distinguish between a and ą. It is a typical case-insensitive collation, in which all variants of a character sort equal to the unaccented character.

SET NAMES 'utf8' COLLATE 'utf8_general_ci';
SELECT 'abcd', 'ąbcd' , 'abcd' < 'ąbcd',  'abcd' = 'ąbcd';
                               false            true

This next example shows that in the case-insensitive Polish-language collation, a sorts before ą. I don't know Polish, but I suspect that Polish phone books list the A entries and the Ą entries separately.

SET NAMES 'utf8' COLLATE 'utf8_polish_ci';
SELECT 'abcd', 'ĄBCD' , 'ąbcd', 'abcd' < 'ĄBCD', 'abcd' < 'ąbcd' , 'ąbcd' = 'ĄBCD';
                                      true             true              true

The following example shows what happens with the utf8_bin collation.

SET NAMES 'utf8' COLLATE 'utf8_bin';
SELECT 'abcd', 'ĄBCD' , 'ąbcd', 'abcd' < 'ĄBCD', 'abcd' < 'ąbcd' , 'ąbcd' = 'ĄBCD';
                                      true           true               false

In this case, you may notice one unintuitive thing: 'abcd' < 'ĄBCD' is true (whereas 'abcd' < 'ABCD' with pure ASCII is false). This is an odd result if you think linguistically. It happens because both A-ogonek characters have binary values in utf8 that are above all of the abc and ABC characters. So if you use the utf8_bin collation for ORDER BY operations, you will get linguistically weird results.
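The byte-value explanation can be checked by looking at the encoded bytes. This is a small sketch, assuming the client really sends the literals as UTF-8 (which is what `SET NAMES 'utf8'` declares):

```sql
SET NAMES 'utf8';

-- HEX() shows the raw bytes that a binary collation compares.
SELECT HEX('A') AS upper_a,
       HEX('a') AS lower_a,
       HEX('Ą') AS upper_a_ogonek,
       HEX('ą') AS lower_a_ogonek;
```

In UTF-8, A and a are the single bytes 41 and 61, while Ą (U+0104) and ą (U+0105) are the two-byte sequences C4 84 and C4 85, so a byte-wise comparison places both ogonek letters after every ASCII letter.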

You say that "Krąków" and "Kraków" compare as equal, and that you are puzzled by this. They compare equal when the collation in use is utf8_general_ci, but not with utf8_bin or utf8_polish_ci. According to the Polish language support in MySQL, the two spellings of the city name are different.

When you're developing an application, you need to figure out how you want the whole thing to work linguistically. Are "Krąków" and "Kraków" the same place? Are "Sharon" and "Aaron" the same person? If so, you want utf8_general_ci.
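One way to see which collation produces which result, without touching the table, is to force the collation per comparison with the `COLLATE` operator. This is a sketch, assuming a UTF-8 connection; the column aliases are made up for illustration:

```sql
SET NAMES 'utf8';

-- COLLATE binds to the operand on its left and decides how "=" compares.
SELECT
  'Krąków' = 'Kraków' COLLATE utf8_general_ci AS equal_general_ci,
  'Krąków' = 'Kraków' COLLATE utf8_polish_ci  AS equal_polish_ci,
  'Krąków' = 'Kraków' COLLATE utf8_bin        AS equal_bin;

-- The same operator can be applied to a lookup on the stored column.
SELECT city_Name
FROM Cities
WHERE city_Name = 'Kraków' COLLATE utf8_bin;
```

Overriding the collation inside a `WHERE` clause may prevent MySQL from using the index on the column, so this is better suited to diagnosis than to a permanent fix.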

You may want to consider modifying the table you specified as follows:

ALTER TABLE Cities
  MODIFY COLUMN city_Name
    VARCHAR(145)
    CHARACTER SET utf8
    COLLATE utf8_general_ci;

This will set the column in your table the way you want.

O. Jones Feb 17 13 at 15:13

greenoldman · Accepted Answer · 2013-02-17T21:11:18+0000

All credit for the solution goes to bobince, so please upvote his comment on my question.

The solution to the problem is somewhat strange, and I would venture to say that MySQL is broken in this regard.

Let's say I created a table with utf8 and didn't specify anything for the column. Later I realize that I need strict character comparison, so I change the collation of both the table AND the column to utf8_bin. Problem solved?

No. MySQL now sees that the table is utf8_bin and the column is also utf8_bin, and concludes that the column simply uses the table's DEFAULT collation. However, MySQL does not realize that the previous default is not the same as the current default, and so the comparison still doesn't work.

Thus, you first have to move the column away from the default, to some foreign value from outside the collation "family" (for "utf8_xxx" that means anything other than another "utf8_xxx"). Once that change is applied and the column's collation entry no longer says "default", you can set utf8_bin. It is again evaluated as the default, but since we are now coming from a non-default collation, everything works as expected.

Remember to apply your changes at every step.
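The steps above describe a round trip through the table editor. As an alternative sketch in plain SQL (not the procedure from this answer, and only an assumption about what fixes the same symptom), the column's collation can be stated explicitly and then verified:

```sql
-- State both character set and collation on the column itself,
-- so nothing is left to be inherited from a default.
ALTER TABLE Cities
  MODIFY COLUMN city_Name
    VARCHAR(145)
    CHARACTER SET utf8
    COLLATE utf8_bin
    NOT NULL;

-- Confirm what the server actually stored for the column.
SHOW CREATE TABLE Cities;
SHOW FULL COLUMNS FROM Cities;
```

If `SHOW FULL COLUMNS` reports utf8_bin for `city_Name`, comparisons on that column should be byte-exact regardless of the table default.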
