Search with asciifolding and utf-8 characters in elasticsearch

I am indexing all the names on a web page, including names with accented characters such as "José". I want to be able to find this name by searching for either "Jose" or "José".

How do I set up my index mapping and analyzer for a simple index with one field, "name"?

I have set up the analyzer for the name field as follows:

"analyzer": {
  "folding": {
    "tokenizer": "standard",
    "filter": ["lowercase", "asciifolding"]
  }
}

      

But it folds all accented characters into their ASCII equivalents, so the accent is lost when "é" is indexed. I want the "é" character to stay in the index, and I want to be able to search for "José" using either "Jose" or "José".

Thanks.



2 answers


You need to preserve the original accented token. To do this, define your own token filter of type asciifolding with preserve_original set to true, for example:

PUT /my_index
{
    "settings" : {
        "analysis" : {
            "analyzer" : {
                "folding" : {
                    "tokenizer" : "standard",
                    "filter" : ["lowercase", "my_ascii_folding"]
                }
            },
            "filter" : {
                "my_ascii_folding" : {
                    "type" : "asciifolding",
                    "preserve_original" : true
                }
            }
        }
    },
    "mappings": {
        "my_type": {
            "properties": {
                "name": {
                    "type": "text",
                    "analyzer": "folding"
                }
            }
        }
    }
}

      



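After that, both tokens, jose and josé, will be indexed and searchable.

Note that the my_type level inside mappings applies to Elasticsearch 5.x and 6.x. Mapping types were removed in 7.0, where properties sits directly under mappings.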
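One way to check the result is the _analyze API. The request body below is a minimal sketch, assuming the index was created as my_index with the settings above; send it with GET /my_index/_analyze.

```json
{
  "analyzer": "folding",
  "text": "José"
}
```

With preserve_original enabled, the response should list two tokens at the same position, jose and josé. With the plain asciifolding filter it should list only jose.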



Here's what I came up with to solve the folding problem with diacritics:

Analyzer used:
{
  "settings": {
    "analysis": {
      "analyzer": {
        "folding": {
          "tokenizer": "standard",
          "filter":  [ "lowercase", "asciifolding" ]
        }
      }
    }
  }
}

      

Mappings used:
{
  "properties": {
    "title": {
      "type":     "string",
      "analyzer": "standard",
      "fields": {
        "folded": {
          "type":     "string",
          "analyzer": "folding"
        }
      }
    }
  }
}

      



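  • The title field uses the standard analyzer and will contain the original word with its diacritics in place.
  • The title.folded field uses the folding analyzer, which strips the diacritical marks.

The "string" type in this mapping comes from older Elasticsearch versions; from 5.0 on, the equivalent type for analyzed fields is "text".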

Below is the search query:

{
  "query": {
    "multi_match": {
      "type":     "most_fields",
      "query":    "esta loca",
      "fields": [ "title", "title.folded" ]
    }
  }
}
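To see what each field actually stores, the same text can be run through the _analyze API once per field. This is a sketch, assuming the mappings above live in an index called my_index (a hypothetical name, not given in this answer); send it with GET /my_index/_analyze, then repeat it with "field" set to "title".

```json
{
  "field": "title.folded",
  "text": "Está loca"
}
```

The folded field should return esta and loca, while title should keep está. A query for "está loca" then matches both fields for documents that contain the accent, and most_fields adds the two scores together, so those documents rank above ones that only match the folded form.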

      
