MySQL full text search, sorting and non-breaking space

Question

MySQL full text search, sorting and non-breaking space

I put UTF-8 encoded data into a database table configured to use utf8 encoding, but when I then do a full text search, it doesn't match a word before non-breaking spaces.

For example, for formatting issues, we have a non-breaking place in hepatitis B. This string is not matched when looking for hepatitis.

CREATE TABLE `search` (
  `id` bigint(20) NOT NULL AUTO_INCREMENT,
  `title` text COLLATE FULLTEXT KEY `title` (`title`),
  PRIMARY KEY (`id`),
  FULLTEXT KEY `title` (`title`),
) ENGINE=MyISAM AUTO_INCREMENT=202337 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;

This query returns nothing:

SELECT 
  title, 
  MATCH(title) AGAINST ('hepatitis') AS `titleScore` 
FROM 
  `search` 
WHERE 
  MATCH(title) AGAINST ("hepatitis")
ORDER BY 
  `titleScore` DESC LIMIT 10;

But this query returns the following:

SELECT
  title
FROM
  search
WHERE
  title LIKE "%hepatitis%";

+-------------------------------------------------------------------------+
| title                                                                   |
+-------------------------------------------------------------------------+
| Comparison of drugs for chronic HBeAg-positive hepatitisÂ B             |
| Antivirals in chronic hepatitisÂ C                                      |
| Chronic hepatitisÂ C                                                    |
| Antivirals for hepatitisÂ C                                             |
| Antivirals for hepatitisÂ B                                             |
| Other antivirals for hepatitisÂ C                                       |
| Chronic hepatitisÂ B                                                    |
| HepatitisÂ A vaccine                                                    |
| HepatitisÂ B vaccine                                                    |
| HepatitisÂ B immunoglobulin                                             |
| HepatitisÂ C virus protease inhibitors, see  HCV-protease inhibitors    |
+-------------------------------------------------------------------------+

According to "Full-text material that we did not enter in the manual" at http://ftp.nchu.edu.tw/MySQL/tech-resources/articles/full-text-revealed.html#breaking full text should only match alphanumerics as elements of a word and, therefore, break in a non-breaking space (although it does not explicitly indicate the non-breaking space character itself).

I found a comment on the MySQL manual - http://dev.mysql.com/doc/refman/5.5/en/fulltext-search.html

For FULLTEXT MATCH to work with Japanese UTF-8 text, be careful that words from your Japanese text will be separated by an ASCII character space and not Japanese UTF-8 (or other) characters. (when using phpMyAdmin to manage data / write SQL query, you have to switch from your Japanese IME to insert a char space ...)

I created a new collation according to the MySQL manual with the following rules:

<charset name="utf8">
  ...
  <collation name="utf8_custom" id="1001">
    <rules>
      <reset>\u0020</reset> <!-- ascii space character -->
      <i>\u00A0</i>         <!-- non-breaking space -->
      <reset>A</reset>      <!-- test -->
      <i>B</i>
    </rules>
  </collation>
</charset>

I restarted the server and then confirmed that the sort is available with show collation like 'utf8_custom';

Then I changed the table to use the new collation and quickly rebuilt the indexes using the repair table.

SELECT title FROM search WHERE "Hepatitis A vaccine";

still returns no results

SELECT title FROM search WHERE "HepatitisÂ A vaccine";

returns results - two, in fact:

 +------------------------+
 | title                  |
 +------------------------+
 | HepatitisÂ A vaccine   |
 | HepatitisÂ B vaccine   |
 +------------------------+

This shows that the rule in sorting that makes up B, which is identical to A, is observed, but the non-breaking space is not.

Â

bothers me - my table is utf8 and my client is utf8 and the original data is utf8. I'm not sure I should be seeing this character.

+3

mysql encoding full-text-search utf-8 collation

HorusKol 05 jan. 15 at 7:03

source to share

1 answer

HorusKol · Answer 1 · 2015-01-08T03:11:02+0000

The problem is the lookup data is being written to the database - I had to issue SET NAMES "utf8"

(or the Zend / PDO equivalent) to make sure the utf8 string sent to the utf8 table was migrated as utf8.

Adding a parameter to my database config in my Zend application.ini for charset = 'utf8'

resolved this.

MySQL full text search, sorting and non-breaking space

More articles: