How to match space-separated terms to a single word in Elasticsearch?
I have a content field (string) indexed in elasticsearch. The default analyzer is the standard analyzer.
When I use a match query to search:
{"query":{"match":{"content":{"query":"micro soft","operator":"and"}}}}
The query does not match documents containing "microsoft".
How can I make the query "micro soft" match documents whose content contains "microsoft"?
Another solution is to use the ngram token filter, which allows for "fuzzier" matching.
Using your example of "microsoft" and "micro soft", here is how the ngram token filter splits the tokens:
POST /test
{
  "settings": {
    "analysis": {
      "filter": {
        "my_ngrams": {
          "type": "ngram",
          "min_gram": "3",
          "max_gram": "5"
        }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["my_ngrams"]
        }
      }
    }
  },
  "mappings": {
    "doc": {
      "properties": {
        "body": {
          "type": "string",
          "analyzer": "my_analyzer"
        }
      }
    }
  }
}
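With this index in place, a match query against the body field should find the document regardless of spacing. Note that the operator option must be nested under the field name, which is likely why the query in the question errors or is ignored:

```json
POST /test/_search
{
  "query": {
    "match": {
      "body": {
        "query": "micro soft",
        "operator": "and"
      }
    }
  }
}
```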
Now let's analyze both strings:
curl '0:9200/test/_analyze?field=body&pretty' -d'microsoft'
{
"tokens" : [ {
"token" : "mic"
}, {
"token" : "micr"
}, {
"token" : "micro"
}, {
"token" : "icr"
}, {
"token" : "icro"
}, {
"token" : "icros"
}, {
"token" : "cro"
}, {
"token" : "cros"
}, {
"token" : "croso"
}, {
"token" : "ros"
}, {
"token" : "roso"
}, {
"token" : "rosof"
}, {
"token" : "oso"
}, {
"token" : "osof"
}, {
"token" : "osoft"
}, {
"token" : "sof"
}, {
"token" : "soft"
}, {
"token" : "oft"
} ]
}
curl '0:9200/test/_analyze?field=body&pretty' -d'micro soft'
{
"tokens" : [ {
"token" : "mic"
}, {
"token" : "micr"
}, {
"token" : "micro"
}, {
"token" : "icr"
}, {
"token" : "icro"
}, {
"token" : "cro"
}, {
"token" : "sof"
}, {
"token" : "soft"
}, {
"token" : "oft"
} ]
}
(I cut out part of the output, the full output is here: https://gist.github.com/dakrone/10abb4a0cfe8ce8636ad )
As you can see, the ngram terms for "microsoft" and "micro soft" overlap, so a query for one should be able to find documents containing the other.
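The overlap can be sketched outside Elasticsearch. The following Python snippet is only a rough approximation of the standard tokenizer plus an ngram filter with min_gram 3 and max_gram 5, not the actual Lucene implementation:

```python
def char_ngrams(token, min_gram=3, max_gram=5):
    """Emit all character n-grams of length min_gram..max_gram."""
    grams = []
    for start in range(len(token)):
        for size in range(min_gram, max_gram + 1):
            if start + size <= len(token):
                grams.append(token[start:start + size])
    return grams

def analyze(text):
    """Rough stand-in for the standard tokenizer + ngram filter."""
    grams = []
    for token in text.lower().split():
        grams.extend(char_ngrams(token))
    return grams

single = set(analyze("microsoft"))
spaced = set(analyze("micro soft"))

# Every ngram produced for "micro soft" is also produced for "microsoft",
# which is why a match query on the ngram-analyzed field succeeds.
print(spaced <= single)   # → True
print(sorted(spaced & single))
```

All nine ngrams generated for "micro soft" are a subset of the eighteen generated for "microsoft", matching the two _analyze outputs above.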
Another approach to this problem is compound word decomposition, which can be either dictionary-based (the compound word token filter) or algorithmic (the Decompound plugin, which splits words without a dictionary).
The word "microsoft", for example, will be split into the following tokens:
{
  "tokens": [
    { "token": "microsoft" },
    { "token": "micro" },
    { "token": "soft" }
  ]
}
These tokens will let you match the partial words you are looking for.
Compared to the ngram approach mentioned in the other answer, this approach yields higher precision at the cost of slightly lower recall.
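As a sketch of the dictionary-based variant, an index using the dictionary_decompounder token filter might look like the following. The word_list here is purely illustrative; a real setup would use a much larger dictionary:

```json
PUT /test
{
  "settings": {
    "analysis": {
      "filter": {
        "my_decompounder": {
          "type": "dictionary_decompounder",
          "word_list": ["micro", "soft"]
        }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "my_decompounder"]
        }
      }
    }
  }
}
```

The filter keeps the original token and appends any dictionary words it finds as subwords, which produces exactly the "microsoft" / "micro" / "soft" token stream shown above.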