Ignore leading zeros with Elasticsearch

I am trying to create a search bar where the most common query would be for "serviceOrderNo". "serviceOrderNo" is not a numeric field in the database, it is a string field. Examples:

000000007
000000002
WO0000042
123456789
AllTextss
000000054
000000065
000000874

      

The most common format is simply an integer padded with leading zeros.

How do I configure Elasticsearch so that the search for "65" matches "000000065"? I also want to give preference to the "serviceOrderNo" field (which I already have). This is where I am now:

{
   "query": {
      "multi_match": {
         "query": "65",
         "fields": ["serviceOrderNo^2", "_all"]
      }
   }
}

      



2 answers


One way to do this is to use the regexp query, which takes a Lucene-flavoured regular expression:

https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-regexp-query.html

"query": {
     "regexp":{
        "serviceOrderNo": "[0]*65"
     }
}
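For context, the same query embedded in a complete search request would look like the sketch below (the index name my-index is just a placeholder):

GET /my-index/_search
{
   "query": {
      "regexp": {
         "serviceOrderNo": "[0]*65"
      }
   }
}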

      

In addition, the query_string query supports a small set of special characters (a more limited wildcard syntax) as well as Lucene regular expressions, and would look like this: https://www.elastic.co/guide/en/elasticsearch/reference/1.x/query-dsl-query-string-query.html

"query": {
    "query_string": {
       "default_field": "serviceOrderNo",
       "query": "0*65"
    }
}
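If you also want to keep the boost on serviceOrderNo from your original multi_match, query_string accepts a fields list with per-field boosts instead of default_field; a sketch along those lines, using the same wildcard pattern, would be:

"query": {
    "query_string": {
       "fields": ["serviceOrderNo^2", "_all"],
       "query": "0*65"
    }
}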

      



These are fairly simple regular expressions: [0]* matches the character contained in the brackets, 0, repeated any number of times.

If you have the option to re-index, or have not yet indexed your data, you can also make things easier for yourself by writing a custom analyzer. Right now you are using the default analyzer for strings on the serviceOrderNo field, so when you index "serviceOrderNo":"00000065", ES stores just the single token 00000065.

A custom analyzer can index this field as both "00000065" and "65" using the same kind of regex. The advantage is that the regex runs only once, at index time, rather than every time you run a query, because ES will then find a match on either "00000065" or "65".

You can also check the ES documentation on analyzers.

"settings":{
    "analysis": {
        "filter":{
           "trimZero": {
                "type":"pattern_capture",
                "patterns":"^0*([0-9]*$)"
           }
        },
       "analyzer": {
           "serviceOrderNo":{
               "type":"custom",
               "tokenizer":"standard",
               "filter":"trimZero"
           }
        }
    }
},
"mappings":{
    "serviceorderdto": {
        "properties":{
            "serviceOrderNo":{
                "type":"String",
                "analyzer":"serviceOrderNo"
            }
        }
    }
}
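If you go this route, you can verify what the analyzer emits with the _analyze API before running searches (my-index is a placeholder; on 1.x the query-string form below works, newer versions take a JSON body). Because pattern_capture keeps the original token by default, you should see both "000000065" and "65", and a plain match query for "65" will then find the document:

GET /my-index/_analyze?analyzer=serviceOrderNo&text=000000065

GET /my-index/_search
{
    "query": {
        "match": {
            "serviceOrderNo": "65"
        }
    }
}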

      



One way to do this is to use the ngram token filter, so that "12345" is tokenized as:

[ 1, 2, 3, 4, 5 ]
[ 12, 23, 34, 45 ]
[ 123, 234, 345 ]
[ 12345 ]

      

When the field is tokenized like this, a search for "65" will match "000000065".

To set this up, create a new index with a custom analyzer that uses the ngram filter:



POST /my-index
{
   "mappings": {
      "serviceorderdto": {
         "properties": {
            "serviceOrderNo": {
               "type": "string",
               "analyzer": "autocomplete"
            }
         }
      }
   },
   "settings": {
      "analysis": {
         "filter": {
            "autocomplete_filter": {
               "type": "ngram",
               "min_gram": 1,
               "max_gram": 20
            }
         },
         "analyzer": {
            "autocomplete": {
               "type": "custom",
               "tokenizer": "standard",
               "filter": [
                  "lowercase",
                  "autocomplete_filter"
               ]
            }
         }
      }
   }
}
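Before reindexing, it is worth sanity-checking the new analyzer with the _analyze API (again, on 1.x the query-string form works; newer versions take a JSON body). For "000000065" the ngram filter with min_gram 1 and max_gram 20 emits every substring, so "65" appears among the tokens, which is why the search below matches:

GET /my-index/_analyze?analyzer=autocomplete&text=000000065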

      

Index some documents, then run your query:

GET /my-index/_search
{
    "query": {
        "multi_match": {
            "query": "55", 
            "fields": [
               "serviceOrderNo^2",
               "_all"
            ]
        }
    }
}   

      
