MySQL full text search, sorting and non-breaking space
I put UTF-8 encoded data into a database table configured to use utf8 encoding, but when I then do a full text search, it doesn't match a word before non-breaking spaces.
For example, for formatting issues, we have a non-breaking place in hepatitis B. This string is not matched when looking for hepatitis.
CREATE TABLE `search` (
`id` bigint(20) NOT NULL AUTO_INCREMENT,
`title` text COLLATE FULLTEXT KEY `title` (`title`),
PRIMARY KEY (`id`),
FULLTEXT KEY `title` (`title`),
) ENGINE=MyISAM AUTO_INCREMENT=202337 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
This query returns nothing:
SELECT
title,
MATCH(title) AGAINST ('hepatitis') AS `titleScore`
FROM
`search`
WHERE
MATCH(title) AGAINST ("hepatitis")
ORDER BY
`titleScore` DESC LIMIT 10;
But this query returns the following:
SELECT
title
FROM
search
WHERE
title LIKE "%hepatitis%";
+-------------------------------------------------------------------------+
| title |
+-------------------------------------------------------------------------+
| Comparison of drugs for chronic HBeAg-positive hepatitis B |
| Antivirals in chronic hepatitis C |
| Chronic hepatitis C |
| Antivirals for hepatitis C |
| Antivirals for hepatitis B |
| Other antivirals for hepatitis C |
| Chronic hepatitis B |
| Hepatitis A vaccine |
| Hepatitis B vaccine |
| Hepatitis B immunoglobulin |
| Hepatitis C virus protease inhibitors, see HCV-protease inhibitors |
+-------------------------------------------------------------------------+
According to "Full-text material that we did not enter in the manual" at http://ftp.nchu.edu.tw/MySQL/tech-resources/articles/full-text-revealed.html#breaking full text should only match alphanumerics as elements of a word and, therefore, break in a non-breaking space (although it does not explicitly indicate the non-breaking space character itself).
I found a comment on the MySQL manual - http://dev.mysql.com/doc/refman/5.5/en/fulltext-search.html
For FULLTEXT MATCH to work with Japanese UTF-8 text, be careful that words from your Japanese text will be separated by an ASCII character space and not Japanese UTF-8 (or other) characters. (when using phpMyAdmin to manage data / write SQL query, you have to switch from your Japanese IME to insert a char space ...)
I created a new collation according to the MySQL manual with the following rules:
<charset name="utf8">
...
<collation name="utf8_custom" id="1001">
<rules>
<reset>\u0020</reset> <!-- ascii space character -->
<i>\u00A0</i> <!-- non-breaking space -->
<reset>A</reset> <!-- test -->
<i>B</i>
</rules>
</collation>
</charset>
I restarted the server and then confirmed that the sort is available with show collation like 'utf8_custom';
Then I changed the table to use the new collation and quickly rebuilt the indexes using the repair table.
SELECT title FROM search WHERE "Hepatitis A vaccine";
still returns no results
SELECT title FROM search WHERE "Hepatitis A vaccine";
returns results - two, in fact:
+------------------------+
| title |
+------------------------+
| Hepatitis A vaccine |
| Hepatitis B vaccine |
+------------------------+
This shows that the rule in sorting that makes up B, which is identical to A, is observed, but the non-breaking space is not.
Â
bothers me - my table is utf8 and my client is utf8 and the original data is utf8. I'm not sure I should be seeing this character.
source to share
The problem is the lookup data is being written to the database - I had to issue SET NAMES "utf8"
(or the Zend / PDO equivalent) to make sure the utf8 string sent to the utf8 table was migrated as utf8.
Adding a parameter to my database config in my Zend application.ini for charset = 'utf8'
resolved this.
source to share