Elasticsearch "pattern_replace", replacing spaces while analyzing

Basically I want to remove all spaces and tokenize the entire string as a single token. (I will be using nGram on top of this later).

These are my index settings:

"settings": {
 "index": {
  "analysis": {
    "filter": {
      "whitespace_remove": {
        "type": "pattern_replace",
        "pattern": " ",
        "replacement": ""
      }
    },
    "analyzer": {
      "meliuz_analyzer": {
        "filter": [
          "lowercase",
          "whitespace_remove"
        ],
        "type": "custom",
        "tokenizer": "standard"
      }
    }
  }
 }
}


Instead of "pattern": " ", I also tried "pattern": "\\u0020" and "pattern": "\\s".

But when I analyze the text "beleza na web", it still produces three separate tokens: "beleza", "na" and "web", instead of a single "belezanaweb".
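One way to see which tokens the analyzer emits is the _analyze API. This is only a sketch, and meliuz_index is a placeholder for the actual index name:

# check how "beleza na web" is tokenized by the custom analyzer (Kibana Dev Tools syntax)
GET /meliuz_index/_analyze
{
  "analyzer": "meliuz_analyzer",
  "text": "beleza na web"
}

With the settings above, the response lists "beleza", "na" and "web" as three separate tokens.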



1 answer


An analyzer processes the string by first running the tokenizer and then applying the chain of token filters. You specified the tokenizer as standard, which means the input has already been split into separate tokens by the standard tokenizer before the filters run. The pattern_replace filter is then applied to each of those tokens individually, so there are no spaces left for it to remove.

Use the keyword tokenizer instead of the standard tokenizer. The rest of your mapping is fine. You can change your mapping as shown below.



"settings": {
 "index": {
  "analysis": {
    "filter": {
      "whitespace_remove": {
        "type": "pattern_replace",
        "pattern": " ",
        "replacement": ""
      }
    },
    "analyzer": {
      "meliuz_analyzer": {
        "filter": [
          "lowercase",
          "whitespace_remove",
          "nGram"
        ],
        "type": "custom",
        "tokenizer": "keyword"
      }
    }
  }
 }
}
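To confirm the behavior, you can run the same text through the rebuilt analyzer with the _analyze API (just a sketch; meliuz_index is a placeholder index name, and the nGram filter uses its default gram sizes here):

# re-test the analyzer after switching to the keyword tokenizer
GET /meliuz_index/_analyze
{
  "analyzer": "meliuz_analyzer",
  "text": "Beleza na Web"
}

Because the keyword tokenizer emits the whole input as a single token, lowercase and whitespace_remove turn it into "belezanaweb", and the nGram filter then builds its grams from that single token rather than from "beleza", "na" and "web" separately.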

