How to intelligently combine shingles and edgeNGram to provide flexible full-text search?

We have an OData-compliant API that delegates some of its full-text search needs to an Elasticsearch cluster. Since OData expressions can get quite complex, we decided to simply translate them into their equivalent Lucene query syntax and feed them into a query_string query.

We support a set of OData text filter expressions, such as:

  • startswith(field,'bla')

  • endswith(field,'bla')

  • substringof('bla',field)

  • name eq 'bla'
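To make the translation concrete, here is a naive sketch of what startswith(name,'bla') might become as a query_string query (simplified on purpose; as explained below, our real translation is more involved because of case insensitivity and multi-token input):

{
  "query": {
    "query_string": {
      "query": "name:bla*"
    }
  }
}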

The fields we are mapping to can be analyzed, not_analyzed or both (i.e. via a multi-field). The search text may be a single token (e.g. table), only part of a token (e.g. tab) or several tokens (e.g. table 1., table 10, etc.). The search must be case insensitive.

Here are some examples of the behavior we need to support:

  • startswith(name,'table 1') must match "Table 1", "Table 100", "Table 1.5", "Table 112 top level"
  • endswith(name,'table 1') must match "Room 1, Table 1", "Subtable 1", "Table 1", "Jeff's Table 1"
  • substringof('table 1',name) must match "Big Table 1 back", "table 1", "Table 1", "Small Table 12"
  • name eq 'table 1' must match "Table 1", "TABLE 1", "table 1"

So, basically, we take the user input (i.e. the 2nd parameter of startswith/endswith, resp. the 1st parameter of substringof, resp. the right-hand value of eq) and try to match it, whether the tokens fully match or only partially.

Right now, we have the clunky solution shown below, which works pretty well but is far from ideal.

In our query_string query, we match against the not_analyzed field using regular expression syntax. Since the field is not_analyzed and the search must be case insensitive, we do our own tokenizing while preparing the regular expression to feed into the query, in order to come up with something like this, i.e. the equivalent of the OData filter endswith(name,'table 8') (=> match all documents whose name ends with "table 8"):

  "query": {
    "query_string": {
      "query": "name.raw:/.*(T|t)(A|a)(B|b)(L|l)(E|e) 8/",
      "lowercase_expanded_terms": false,
      "analyze_wildcard": true
    }
  }

      

So even though this solution works pretty well and the performance isn't bad either (which came as a surprise), we would like to do it differently and leverage the full power of analyzers in order to shift all this burden from search time to indexing time. However, since reindexing all our data will take weeks, we would first like to find out whether there is a good combination of token filters and analyzers that would help us achieve the same search requirements listed above.

My thinking is that the ideal solution would be some wise mixture of shingles (i.e. several tokens together) and edge-nGram (i.e. matching at the start or end of a token). What I'm not sure about, though, is whether it is possible to make them work together in order to match several tokens, where one of them might not be fully input by the user. For example, if the indexed name field is "Large table 123", I need substringof('table 1',name) to match it, where "table" is a fully matched token and "1" is only a prefix of the next token.
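Something along these lines is what I have in mind, as a rough, untested sketch (all names are mine; a shingle filter glues adjacent tokens together, then an edgeNGram filter allows the last token of a shingle to be only partially typed):

{
  "settings": {
    "analysis": {
      "analyzer": {
        "shingle_edge_ngram_analyzer": {
          "tokenizer": "standard",
          "filter": ["lowercase", "my_shingle", "my_edge_ngram"]
        }
      },
      "filter": {
        "my_shingle": {
          "type": "shingle",
          "min_shingle_size": 2,
          "max_shingle_size": 3,
          "output_unigrams": true
        },
        "my_edge_ngram": {
          "type": "edgeNGram",
          "min_gram": 2,
          "max_gram": 25
        }
      }
    }
  }
}

With such an analyzer, "Large table 123" would produce the shingle "table 123", whose edge grams include "table 1", which is exactly the partial multi-token match described above; whether the index size would stay manageable is another question.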

Thanks in advance for sharing your thoughts on this matter.

UPDATE 1: after testing Andrey's solution

=> Exact match (eq) and startswith work perfectly.

A. endswith glitches

Searching for substringof('table 112', name) yields 107 documents. Searching for the more specific case endswith(name,'table 112') yields 1525 documents, whereas it should yield fewer documents (suffix matches are a subset of substring matches). Checking a bit deeper, I found some mismatches, such as "Social Club, Table 12" (doesn't contain "112") or "Order 312" (contains neither "table" nor "112"). I guess it's because they end with "12", and "12" is a valid gram for the token "112", hence the match.
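This is easy to verify with the _analyze API (the request form below is the ES 5+ one; older versions pass the text differently). Running the search text through the same analyzer, e.g. POST test_index/_analyze with:

{
  "analyzer": "my_reverse_edge_ngram_analyzer",
  "text": "table 112"
}

yields the suffix grams "12", "112", " 112", "e 112", etc. Since the suffix grams of "Social Club, Table 12" also include "12", any query that analyzes its input (as query_string does) will produce this false positive.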

B. substringof glitches

Searching for substringof('table',name) matches "Party table" and "Alex on big table", but doesn't match "Table 1", "table 112", etc. Searching for substringof('tabl',name) doesn't match anything at all.

UPDATE 2

It was sort of implied, but I forgot to explicitly mention that the solution will have to work with the query_string query, mainly due to the fact that the OData expressions (however complex they may be) will keep getting translated into their Lucene equivalent. I am aware that we're trading the power of the Elasticsearch Query DSL for the Lucene query syntax, which is a bit less powerful and less expressive, but that's something we cannot really change. We're pretty close, though!

UPDATE 3 (June 25, 2019):

ES 7.2 introduced a new data type called search_as_you_type that natively allows this kind of behavior. Read more at: https://www.elastic.co/guide/en/elasticsearch/reference/7.2/search-as-you-type.html
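For reference, here is a minimal sketch of that approach (the field name is illustrative):

{
  "mappings": {
    "properties": {
      "name": {
        "type": "search_as_you_type"
      }
    }
  }
}

together with the matching query:

{
  "query": {
    "multi_match": {
      "query": "table 1",
      "type": "bool_prefix",
      "fields": ["name", "name._2gram", "name._3gram"]
    }
  }
}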



1 answer


This is an interesting use case. Here's my take:

{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_ngram_analyzer": {
          "tokenizer": "my_ngram_tokenizer",
          "filter": ["lowercase"]
        },
        "my_edge_ngram_analyzer": {
          "tokenizer": "my_edge_ngram_tokenizer",
          "filter": ["lowercase"]
        },
        "my_reverse_edge_ngram_analyzer": {
          "tokenizer": "keyword",
          "filter" : ["lowercase","reverse","substring","reverse"]
        },
        "lowercase_keyword": {
          "type": "custom",
          "filter": ["lowercase"],
          "tokenizer": "keyword"
        }
      },
      "tokenizer": {
        "my_ngram_tokenizer": {
          "type": "nGram",
          "min_gram": "2",
          "max_gram": "25"
        },
        "my_edge_ngram_tokenizer": {
          "type": "edgeNGram",
          "min_gram": "2",
          "max_gram": "25"
        }
      },
      "filter": {
        "substring": {
          "type": "edgeNGram",
          "min_gram": 2,
          "max_gram": 25
        }
      }
    }
  },
  "mappings": {
    "test_type": {
      "properties": {
        "text": {
          "type": "string",
          "analyzer": "my_ngram_analyzer",
          "fields": {
            "starts_with": {
              "type": "string",
              "analyzer": "my_edge_ngram_analyzer"
            },
            "ends_with": {
              "type": "string",
              "analyzer": "my_reverse_edge_ngram_analyzer"
            },
            "exact_case_insensitive_match": {
              "type": "string",
              "analyzer": "lowercase_keyword"
            }
          }
        }
      }
    }
  }
}
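To try this out, you could create a test index with the body above (e.g. PUT test_index; the index and type names here are just for illustration) and index a sample document, e.g. PUT test_index/test_type/1 with:

{
  "text": "Big Table 123"
}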

      

  • my_ngram_analyzer is used to divide every text into smaller pieces; how small depends on your use case (I chose 25 characters as a max for testing). lowercase is used since you said the search should be case insensitive. This is, basically, the analyzer used for substringof('table 1',name). The query is simple:
{
  "query": {
    "term": {
      "text": {
        "value": "table 1"
      }
    }
  }
}

      

  • my_edge_ngram_analyzer is used to split the text up starting from the beginning, and it is specifically used for the startswith(name,'table 1') use case. Again, the query is simple:
{
  "query": {
    "term": {
      "text.starts_with": {
        "value": "table 1"
      }
    }
  }
}

      

  • The hardest part, I found, is endswith(name,'table 1'). For this I defined my_reverse_edge_ngram_analyzer, which uses a keyword tokenizer together with a lowercase filter and an edgeNGram filter preceded and followed by a reverse filter. What this analyzer basically does is split the text into edgeNGrams, but with the edge being the end of the text rather than the beginning (as with a regular edgeNGram). The query:
{
  "query": {
    "term": {
      "text.ends_with": {
        "value": "table 1"
      }
    }
  }
}
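To see what this analyzer actually emits, you can run a sample text through the _analyze API (the request form below is the ES 5+ one), e.g. POST test_index/_analyze with:

{
  "analyzer": "my_reverse_edge_ngram_analyzer",
  "text": "Jeff's Table 1"
}

This produces suffix grams such as " 1", "e 1", "le 1", and so on up to "jeff's table 1", including "table 1" itself, which is why endswith(name,'table 1') matches "Jeff's Table 1".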

      



  • For the name eq 'table 1' case, a simple keyword tokenizer together with a lowercase filter should do it. The query:
{
  "query": {
    "term": {
      "text.exact_case_insensitive_match": {
        "value": "table 1"
      }
    }
  }
}

      


Regarding your query_string requirement, this changes the solution a bit, because I was counting on term queries not to analyze the input text and to match it exactly against one of the terms in the index. But it can be "simulated" with query_string if the appropriate analyzer is specified for it.

The solution is a set of queries like the following one (always using that analyzer, changing only the field name):

{
  "query": {
    "query_string": {
      "query": "text.starts_with:(\"table 1\")",
      "analyzer": "lowercase_keyword"
    }
  }
}
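For completeness, the full set of equivalents under this scheme would be the following (all with the lowercase_keyword analyzer, so that the input is kept as a single lowercased token):

  • substringof('table 1',name) => "query": "text:(\"table 1\")"

  • startswith(name,'table 1') => "query": "text.starts_with:(\"table 1\")"

  • endswith(name,'table 1') => "query": "text.ends_with:(\"table 1\")"

  • name eq 'table 1' => "query": "text.exact_case_insensitive_match:(\"table 1\")"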

      
