How to match space-separated terms to a single word in Elasticsearch?
I have a content field (string) indexed in elasticsearch. The default analyzer is the standard analyzer.
When I use a match query to search:
{"query":{"match":{"content":{"query":"micro soft","operator":"and"}}}}
The query does not match documents containing "microsoft".
How can I make the query "micro soft" match documents whose content contains "microsoft"?
Another solution is to use the ngram token filter, which allows for "fuzzier" matching.
Using your example of "microsoft" and "micro soft", here is how the ngram token filter splits the tokens:
POST /test
{
  "settings": {
    "analysis": {
      "filter": {
        "my_ngrams": {
          "type": "ngram",
          "min_gram": "3",
          "max_gram": "5"
        }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["my_ngrams"]
        }
      }
    }
  },
  "mappings": {
    "doc": {
      "properties": {
        "body": {
          "type": "string",
          "analyzer": "my_analyzer"
        }
      }
    }
  }
}
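With this index in place, a match query against the body field should find the document regardless of spacing. Note that the operator option must be nested under the field name, which is likely why the query in the question errors or is ignored:

```json
POST /test/_search
{
  "query": {
    "match": {
      "body": {
        "query": "micro soft",
        "operator": "and"
      }
    }
  }
}
```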
Now let's analyze both strings:
curl '0:9200/test/_analyze?field=body&pretty' -d'microsoft'
{
"tokens" : [ {
"token" : "mic"
}, {
"token" : "micr"
}, {
"token" : "micro"
}, {
"token" : "icr"
}, {
"token" : "icro"
}, {
"token" : "icros"
}, {
"token" : "cro"
}, {
"token" : "cros"
}, {
"token" : "croso"
}, {
"token" : "ros"
}, {
"token" : "roso"
}, {
"token" : "rosof"
}, {
"token" : "oso"
}, {
"token" : "osof"
}, {
"token" : "osoft"
}, {
"token" : "sof"
}, {
"token" : "soft"
}, {
"token" : "oft"
} ]
}
curl '0:9200/test/_analyze?field=body&pretty' -d'micro soft'
{
"tokens" : [ {
"token" : "mic"
}, {
"token" : "micr"
}, {
"token" : "micro"
}, {
"token" : "icr"
}, {
"token" : "icro"
}, {
"token" : "cro"
}, {
"token" : "sof"
}, {
"token" : "soft"
}, {
"token" : "oft"
} ]
}
(I cut out part of the output, the full output is here: https://gist.github.com/dakrone/10abb4a0cfe8ce8636ad )
As you can see, the ngram terms for "microsoft" and "micro soft" overlap, so a query for one should be able to find documents containing the other.
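The overlap can be sketched outside Elasticsearch. The following Python snippet is only a rough approximation of the standard tokenizer plus an ngram filter with min_gram 3 and max_gram 5, not the actual Lucene implementation:

```python
def char_ngrams(token, min_gram=3, max_gram=5):
    """Emit all character n-grams of length min_gram..max_gram."""
    grams = []
    for start in range(len(token)):
        for size in range(min_gram, max_gram + 1):
            if start + size <= len(token):
                grams.append(token[start:start + size])
    return grams

def analyze(text):
    """Rough stand-in for the standard tokenizer + ngram filter."""
    grams = []
    for token in text.lower().split():
        grams.extend(char_ngrams(token))
    return grams

single = set(analyze("microsoft"))
spaced = set(analyze("micro soft"))

# Every ngram produced for "micro soft" is also produced for "microsoft",
# which is why a match query on the ngram-analyzed field succeeds.
print(spaced <= single)   # → True
print(sorted(spaced & single))
```

All nine ngrams generated for "micro soft" are a subset of the eighteen generated for "microsoft", matching the two _analyze outputs above.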
Another approach to this problem is compound word decomposition, which can be either dictionary-based (the compound word token filter) or algorithmic (the Decompound plugin, which splits words without a dictionary).
The word "microsoft", for example, will be split into the following tokens:
{
  "tokens": [
    { "token": "microsoft" },
    { "token": "micro" },
    { "token": "soft" }
  ]
}
These tokens will let you match the partial words you are looking for.
Compared to the ngram approach mentioned in the other answer, this approach yields higher precision at the cost of slightly lower recall.
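As a sketch of the dictionary-based variant, an index using the dictionary_decompounder token filter might look like the following. The word_list here is purely illustrative; a real setup would use a much larger dictionary:

```json
PUT /test
{
  "settings": {
    "analysis": {
      "filter": {
        "my_decompounder": {
          "type": "dictionary_decompounder",
          "word_list": ["micro", "soft"]
        }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "my_decompounder"]
        }
      }
    }
  }
}
```

The filter keeps the original token and appends any dictionary words it finds as subwords, which produces exactly the "microsoft" / "micro" / "soft" token stream shown above.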