How to match terms to spaces in elasticsearch?

I have a content field (string) indexed in elasticsearch. The default analyzer is the standard analyzer.

When I use a match query to search for "micro soft":

{"query":{"match":{"content":{"query":"micro soft","operator":"and"}}}}

The result shows that it does not match "microsoft".

So how can I make the query "micro soft" match documents whose content contains "microsoft"?

3 answers


One solution is to use the ngram token filter, which allows for a more "fuzzy" match.

Using your example of "microsoft" and "micro soft", here is how the ngram token filter splits the tokens:

POST /test
{
  "settings": {
    "analysis": {
      "filter": {
        "my_ngrams": {
          "type": "ngram",
          "min_gram": "3",
          "max_gram": "5"
        }
      },
      "analyzer" : {
        "my_analyzer" : {
          "type" : "custom",
          "tokenizer" : "standard",
          "filter": ["my_ngrams"]
        }
      }
    }
  },
  "mappings": {
    "doc": {
      "properties": {
        "body": {
          "type": "string",
          "analyzer": "my_analyzer"
        }
      }
    }
  }
}

Now let's analyze both strings:



curl '0:9200/test/_analyze?field=body&pretty' -d'microsoft'
{
  "tokens" : [ {
    "token" : "mic"
  }, {
    "token" : "micr"
  }, {
    "token" : "micro"
  }, {
    "token" : "icr"
  }, {
    "token" : "icro"
  }, {
    "token" : "icros"
  }, {
    "token" : "cro"
  }, {
    "token" : "cros"
  }, {
    "token" : "croso"
  }, {
    "token" : "ros"
  }, {
    "token" : "roso"
  }, {
    "token" : "rosof"
  }, {
    "token" : "oso"
  }, {
    "token" : "osof"
  }, {
    "token" : "osoft"
  }, {
    "token" : "sof"
  }, {
    "token" : "soft"
  }, {
    "token" : "oft"
  } ]
}

curl '0:9200/test/_analyze?field=body&pretty' -d'micro soft'
{
  "tokens" : [ {
    "token" : "mic"
  }, {
    "token" : "micr"
  }, {
    "token" : "micro"
  }, {
    "token" : "icr"
  }, {
    "token" : "icro"
  }, {
    "token" : "cro"
  }, {
    "token" : "sof"
  }, {
    "token" : "soft"
  }, {
    "token" : "oft"
  } ]
}

(I cut out part of the output; the full output is here: https://gist.github.com/dakrone/10abb4a0cfe8ce8636ad)

As you can see, the ngram terms for "microsoft" and "micro soft" overlap, so you will be able to find matches for these queries.
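To see why this works without a cluster at hand, here is a small Python sketch (not part of Elasticsearch; a simplified re-implementation of the standard tokenizer followed by a 3-5 ngram filter, matching the settings above):

```python
def ngrams(text, min_gram=3, max_gram=5):
    """Roughly mimic the standard tokenizer plus an ngram token filter:
    split on whitespace, then emit every substring of length
    min_gram..max_gram for each token."""
    grams = set()
    for token in text.lower().split():
        for n in range(min_gram, max_gram + 1):
            for i in range(len(token) - n + 1):
                grams.add(token[i:i + n])
    return grams

indexed = ngrams("microsoft")   # grams stored for the document
queried = ngrams("micro soft")  # grams produced from the query
# every query gram (e.g. "micro", "soft") also occurs in "microsoft"
print(sorted(indexed & queried))
```

Since all of the query's grams are contained in the document's grams, the match succeeds.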



Another approach to this problem is word decompounding, either dictionary-based, using the compound word token filter, or algorithmic, using the Decompound plugin.

The word "microsoft" will, for example, be split into the following tokens:

{
   "tokens": [
      { "token": "microsoft" },
      { "token": "micro" },
      { "token": "soft" }
   ]
}

These tokens allow you to search for the parts of compound words, as you requested.

Compared to the ngram approach mentioned in the other answer, this approach gives higher precision at the cost of slightly lower recall.
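As an illustration, a dictionary-based decompounder can be sketched in a few lines of Python (the `decompound` function and the word list here are hypothetical; the real token filter reads its dictionary from `word_list` or `word_list_path` in the index settings):

```python
def decompound(token, dictionary):
    """Emit the original token plus every dictionary word found inside
    it, mirroring the output shape of a compound-word token filter."""
    tokens = [token]
    for word in dictionary:
        if word in token:
            tokens.append(word)
    return tokens

print(decompound("microsoft", ["micro", "soft"]))
# ['microsoft', 'micro', 'soft']
```

Because only dictionary words are emitted as subtokens, the query "micro soft" matches, while arbitrary fragments like "icroso" do not, which is where the precision gain over ngrams comes from.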



Try using an Elasticsearch wildcard query:

{
  "query": {
    "bool": {
      "must": {
        "wildcard": { "content": "micro*soft" }
      }
    }
  }
}
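Note that wildcard queries run against the indexed terms and can be slow on large indices. The pattern semantics themselves can be checked locally with Python's `fnmatch`, which uses the same `*` (match any characters) convention:

```python
from fnmatch import fnmatch

# "*" matches zero or more characters, so "micro*soft" covers the
# compound word "microsoft" (zero characters between the fragments).
print(fnmatch("microsoft", "micro*soft"))   # True
print(fnmatch("microscope", "micro*soft"))  # False
```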

      







