Elasticsearch "pattern_replace", replacing spaces while analyzing

Basically I want to remove all spaces and tokenize the entire string as a single token. (I will be using nGram on top of this later).

These are my index settings:

"settings": {
 "index": {
  "analysis": {
    "filter": {
      "whitespace_remove": {
        "type": "pattern_replace",
        "pattern": " ",
        "replacement": ""
      }
    },
    "analyzer": {
      "meliuz_analyzer": {
        "filter": [
          "lowercase",
          "whitespace_remove"
        ],
        "type": "custom",
        "tokenizer": "standard"
      }
    }
  }
 }
}


Instead of "pattern": " ", I also tried "pattern": "\\u0020" and "pattern": "\\s".

But when I analyze the text "beleza na web", it still produces three separate tokens: "beleza", "na" and "web", instead of a single "belezanaweb".
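One way to see which tokens the analyzer emits is the _analyze API. This is only a sketch, and meliuz_index is a placeholder for the actual index name:

# check how "beleza na web" is tokenized by the custom analyzer (Kibana Dev Tools syntax)
GET /meliuz_index/_analyze
{
  "analyzer": "meliuz_analyzer",
  "text": "beleza na web"
}

With the settings above, the response lists "beleza", "na" and "web" as three separate tokens.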



1 answer


An analyzer processes the string by first running the tokenizer and then applying the chain of token filters. You specified the tokenizer as standard, which means the input has already been split into separate tokens by the standard tokenizer before the filters run. The pattern_replace filter is then applied to each of those tokens individually, so there are no spaces left for it to remove.

Use the keyword tokenizer instead of the standard tokenizer. The rest of your mapping is fine. You can change your mapping as shown below.



"settings": {
 "index": {
  "analysis": {
    "filter": {
      "whitespace_remove": {
        "type": "pattern_replace",
        "pattern": " ",
        "replacement": ""
      }
    },
    "analyzer": {
      "meliuz_analyzer": {
        "filter": [
          "lowercase",
          "whitespace_remove",
          "nGram"
        ],
        "type": "custom",
        "tokenizer": "keyword"
      }
    }
  }
 }
}
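To confirm the behavior, you can run the same text through the rebuilt analyzer with the _analyze API (just a sketch; meliuz_index is a placeholder index name, and the nGram filter uses its default gram sizes here):

# re-test the analyzer after switching to the keyword tokenizer
GET /meliuz_index/_analyze
{
  "analyzer": "meliuz_analyzer",
  "text": "Beleza na Web"
}

Because the keyword tokenizer emits the whole input as a single token, lowercase and whitespace_remove turn it into "belezanaweb", and the nGram filter then builds its grams from that single token rather than from "beleza", "na" and "web" separately.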

