Search with asciifolding and utf-8 characters in elasticsearch
I index all names on a web page with accented characters, like "José". I want to be able to find this name by searching for either "Jose" or "José".
How do I set up the analyzer for a simple index with one field, "name"?
I have set up the analyzer for the name field as follows:
"analyzer": {
"folding": {
"tokenizer": "standard",
"filter": ["lowercase", "asciifolding"]
}
}
But it folds all accents into their ASCII equivalents and drops the accents when indexing "é". I want the "é" character to be kept in the index, so that I can find "José" by searching for either "Jose" or "José".
Thanks
You need to keep the original accented token as well. To do this, define your own token filter based on asciifolding with preserve_original enabled, for example:
PUT /my_index
{
  "settings" : {
    "analysis" : {
      "analyzer" : {
        "folding" : {
          "tokenizer" : "standard",
          "filter" : ["lowercase", "my_ascii_folding"]
        }
      },
      "filter" : {
        "my_ascii_folding" : {
          "type" : "asciifolding",
          "preserve_original" : true
        }
      }
    }
  },
  "mappings": {
    "my_type": {
      "properties": {
        "name": {
          "type": "text",
          "analyzer": "folding"
        }
      }
    }
  }
}
Thereafter, both the tokens jose and josé will be indexed and searchable.
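You can verify this with the _analyze API (assuming the index above was created as my_index): with preserve_original enabled, the folding analyzer should emit both the folded and the original token at the same position.

GET /my_index/_analyze
{
  "analyzer": "folding",
  "text": "José"
}

The response token list should contain both "jose" and "josé", which is why a query for either form now matches.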
Here's what I worked out to solve the accented-search problem:
Analyzer used:
{
  "settings": {
    "analysis": {
      "analyzer": {
        "folding": {
          "tokenizer": "standard",
          "filter": [ "lowercase", "asciifolding" ]
        }
      }
    }
  }
}
Mapping used:
{
  "properties": {
    "title": {
      "type": "string",
      "analyzer": "standard",
      "fields": {
        "folded": {
          "type": "string",
          "analyzer": "folding"
        }
      }
    }
  }
}
- The title field uses the standard analyzer and contains the original word with its diacritics.
- The title.folded field uses the folding analyzer, which strips the diacritical marks.
Below is the search query:
{
  "query": {
    "multi_match": {
      "type": "most_fields",
      "query": "esta loca",
      "fields": [ "title", "title.folded" ]
    }
  }
}
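A sketch of why most_fields helps here (index name and document IDs are illustrative; the exact indexing URL syntax varies by Elasticsearch version):

PUT /my_index/_doc/1
{ "title": "Esta loca" }

PUT /my_index/_doc/2
{ "title": "Está loca" }

Both documents match the query, because both fold to "esta loca" in title.folded. But document 1 also matches "esta" in the plain title field, while document 2 does not, so with most_fields the exact unaccented match accumulates score from both fields and ranks higher. Searching for "está loca" instead would favor document 2 for the same reason.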