Preserving the order of terms in an ElasticSearch query
Is it possible in ElasticSearch to generate a query that preserves the ordering of terms?
A simple example would be indexing these documents with the standard analyzer:
- You know to search
- You know the search
- Know what you are looking for.
I could query +you +search and this would return all documents, including the third one.
What if I only wanted to retrieve documents that contain the terms in this particular order? Can I build a query that will do this for me?
Given that phrases can be matched simply by quoting the text, e.g. "you know" (which returns the 1st and 2nd documents), it seems to me that there should be a way to preserve the order of several terms that are not adjacent.
In the simple example above, I could use proximity searches, but that doesn't cover more complex cases.
Matching a phrase doesn't guarantee order ;-). If you specify a large enough slop, for example 2, then "hello world" will also match "world hello". This is not necessarily a bad thing, because results are usually more relevant when the two terms are close to each other, regardless of their order. And I don't think the authors of this feature had in mind matching words with a slop of 1000.
There is one solution I could find that enforces order rather than just proximity: using scripts. Here's one example:
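To see why a slop of 2 is enough for "hello world" to match "world hello", here is a minimal Python sketch of the idea. It is a simplified model of a phrase query with slop, not Lucene's actual implementation: the `tokenize` helper and the `sloppy_phrase_match` function are hypothetical names, the tokenizer is only a crude stand-in for an analyzer, and repeated query terms are not handled specially.

```python
import re
from itertools import product


def tokenize(text):
    """Crude stand-in for an analyzer: lowercase word tokens."""
    return re.findall(r"\w+", text.lower())


def sloppy_phrase_match(phrase, document, slop):
    """Simplified model of a phrase query with slop (an assumption, not Lucene code)."""
    query_terms = tokenize(phrase)
    doc_terms = tokenize(document)
    # For every query term, collect the positions where it occurs in the document.
    positions = [
        [i for i, tok in enumerate(doc_terms) if tok == term]
        for term in query_terms
    ]
    if any(not p for p in positions):
        return False  # at least one term is missing entirely
    # Subtract each term's offset within the phrase. An exact phrase yields
    # identical values; the spread between them is the slop the match needs.
    for combo in product(*positions):
        shifted = [pos - offset for offset, pos in enumerate(combo)]
        if max(shifted) - min(shifted) <= slop:
            return True
    return False


print(sloppy_phrase_match("hello world", "world hello", slop=1))  # False
print(sloppy_phrase_match("hello world", "world hello", slop=2))  # True
```

In this model, swapping two adjacent terms costs 2, which is why a generous slop stops saying anything about order.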
POST /my_index/my_type/_bulk
{ "index": { "_id": 1 }}
{ "title": "hello world" }
{ "index": { "_id": 2 }}
{ "title": "world hello" }
{ "index": { "_id": 3 }}
{ "title": "hello term1 term2 term3 term4 world" }
POST my_index/_search
{
  "query": {
    "filtered": {
      "query": {
        "match": {
          "title": {
            "query": "hello world",
            "slop": 5,
            "type": "phrase"
          }
        }
      },
      "filter": {
        "script": {
          "script": "term1Pos=0;term2Pos=0;term1Info = _index['title'].get('hello',_POSITIONS);term2Info = _index['title'].get('world',_POSITIONS); for(pos in term1Info){term1Pos=pos.position;}; for(pos in term2Info){term2Pos=pos.position;}; return term1Pos<term2Pos;",
          "params": {}
        }
      }
    }
  }
}
To make the script itself more readable, here it is again with indentation:
term1Pos = 0;
term2Pos = 0;
term1Info = _index['title'].get('hello',_POSITIONS);
term2Info = _index['title'].get('world',_POSITIONS);
for(pos in term1Info) {
    term1Pos = pos.position;
};
for(pos in term2Info) {
    term2Pos = pos.position;
};
return term1Pos < term2Pos;
Above is a query that looks for "hello world" with a slop of 5, which matches all of the documents above. But the script filter ensures that the position of the word "hello" in the document is lower than the position of the word "world". Thus, no matter how large a slop we specify in the query, the comparison of the positions ensures the order.
The documentation sheds light on the constructs used in the script above.
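The logic of the script filter can be sketched in plain Python as follows. This is only an illustration: `positions`, `script_order_check` and `any_ordered_pair` are hypothetical names, the tokenizer is a crude stand-in for the analyzer, and the sketch assumes positions are iterated in ascending order, so each loop ends on the last occurrence of its term.

```python
import re


def positions(term, text):
    """Positions of `term` in `text`, using a crude lowercase word tokenizer."""
    tokens = re.findall(r"\w+", text.lower())
    return [i for i, tok in enumerate(tokens) if tok == term]


def script_order_check(first, second, text):
    """Mirror of the script: compare the last position of each term (default 0)."""
    first_pos = 0
    second_pos = 0
    for pos in positions(first, text):
        first_pos = pos  # overwritten on every occurrence, so the last one wins
    for pos in positions(second, text):
        second_pos = pos
    return first_pos < second_pos


def any_ordered_pair(first, second, text):
    """Hypothetical variant: true if some `first` occurs before some `second`."""
    first_positions = positions(first, text)
    second_positions = positions(second, text)
    if not first_positions or not second_positions:
        return False
    return min(first_positions) < max(second_positions)


for title in ["hello world", "world hello", "hello term1 term2 term3 term4 world"]:
    print(title, "->", script_order_check("hello", "world", title))
```

One thing the sketch makes visible: because only the last occurrence of each term is compared, a title such as "hello world hello" is rejected even though it contains the terms in order. If that matters for your data, a check along the lines of `any_ordered_pair` may be closer to what you want.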
This is exactly what the match_phrase query does (see here).
It checks the positions of the terms in addition to their presence.
For example, these documents:
POST test/values
{
"test": "Hello World"
}
POST test/values
{
"test": "Hello nice World"
}
POST test/values
{
"test": "World, I don't say hello"
}
will all be found by a general match query:
POST test/_search
{
  "query": {
    "match": {
      "test": "Hello World"
    }
  }
}
But using match_phrase, only the first document will be returned:
POST test/_search
{
  "query": {
    "match_phrase": {
      "test": "Hello World"
    }
  }
}
{
  ...
  "hits": {
    "total": 1,
    "max_score": 2.3953633,
    "hits": [
      {
        "_index": "test",
        "_type": "values",
        "_id": "qFZAKYOTQh2AuqplLQdHcA",
        "_score": 2.3953633,
        "_source": {
          "test": "Hello World"
        }
      }
    ]
  }
}
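The difference between the two queries can be sketched in a few lines of Python. This is a toy model, not how Elasticsearch executes anything: `tokenize`, `match` and `match_phrase` are hypothetical helpers, the regex is a rough stand-in for the standard analyzer, and `match` assumes the default behaviour where any one of the query terms is enough.

```python
import re

DOCS = [
    "Hello World",
    "Hello nice World",
    "World, I don't say hello",
]


def tokenize(text):
    """Rough stand-in for the standard analyzer: lowercase tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def match(query, doc):
    """Like a default match query: any query term being present is enough."""
    doc_terms = set(tokenize(doc))
    return any(term in doc_terms for term in tokenize(query))


def match_phrase(query, doc):
    """Like match_phrase without slop: terms must be adjacent and in order."""
    q = tokenize(query)
    d = tokenize(doc)
    return any(d[i:i + len(q)] == q for i in range(len(d) - len(q) + 1))


print([doc for doc in DOCS if match("Hello World", doc)])
print([doc for doc in DOCS if match_phrase("Hello World", doc)])
```

The first line prints all three documents, the second only "Hello World", mirroring the two result sets above.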
In your case, you want to allow some distance between your terms. This can be achieved with the slop parameter, which specifies how far apart you allow your terms to be:
POST test/_search
{
  "query": {
    "match": {
      "test": {
        "query": "Hello world",
        "slop": 1,
        "type": "phrase"
      }
    }
  }
}
With this last query, you will also find the second document:
{
  ...
  "hits": {
    "total": 2,
    "max_score": 0.38356602,
    "hits": [
      {
        "_index": "test",
        "_type": "values",
        "_id": "7mhBJgm5QaO2_aXOrTB_BA",
        "_score": 0.38356602,
        "_source": {
          "test": "Hello World"
        }
      },
      {
        "_index": "test",
        "_type": "values",
        "_id": "VKdUJSZFQNCFrxKk_hWz4A",
        "_score": 0.2169777,
        "_source": {
          "test": "Hello nice World"
        }
      }
    ]
  }
}
You can find a whole chapter about this use case in the Definitive Guide.