word_delimiter with split_on_numerics removes all tokens

Question

When analyzing the text alpha 1a beta, I want the resulting tokens to be [alpha 1 a beta]. Why isn't myAnalyzer doing the trick?

POST myindex
{
  "settings" : {
    "analysis" : {
      "analyzer" : {
        "myAnalyzer" : {
          "type" : "custom",
          "tokenizer" : "standard",
          "filter" : [ "split_on_numerics" ]
        }
      },
      "filter" : {
        "split_on_numerics" : {
          "type" : "word_delimiter",
          "split_on_numerics" : true,
          "split_on_case_change" : false,
          "generate_word_parts" : false,
          "generate_number_parts" : false,
          "catenate_all" : false
        }
      }
    }
  }
}

Now when I run

GET /myindex/_analyze?analyzer=myAnalyzer&text=alpha 1a beta

no tokens are returned. Again, why?

tokenize elasticsearch

i_love_nachos May 16 '15 at 21:22

1 answer

keety · Accepted Answer · 2015-05-17T01:11:10+0000

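To achieve this with a custom word_delimiter filter, you need to set "generate_word_parts" : true and "generate_number_parts" : true.

These options make the filter emit the word parts and number parts it produces when it splits an alphanumeric token. With both set to false, as in the question, the filter splits the tokens but emits none of the parts, which is why no tokens are returned.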

An example filter would be as follows:

{
  "settings" : {
    "analysis" : {
      "analyzer" : {
        "myAnalyzer" : {
          "type" : "custom",
          "tokenizer" : "standard",
          "filter" : [ "split_on_numerics" ]
        }
      },
      "filter" : {
        "split_on_numerics" : {
          "type" : "word_delimiter",
          "split_on_numerics" : true,
          "split_on_case_change" : false,
          "generate_word_parts" : true,
          "generate_number_parts" : true,
          "catenate_all" : false
        }
      }
    }
  }
}

If you also want the original term "1a" to be indexed, set "preserve_original" : true.
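To make the effect of these flags concrete, here is a rough Python sketch of the behaviour described above. It is a toy model, not Elasticsearch's implementation: the function names word_delimiter and analyze are made up for illustration, whitespace splitting stands in for the standard tokenizer, and only ASCII letters and digits are handled.

```python
import re


def word_delimiter(token, generate_word_parts=True,
                   generate_number_parts=True, preserve_original=False):
    """Toy model of a word_delimiter filter with split_on_numerics enabled."""
    # Split at letter/digit transitions, e.g. "1a" -> ["1", "a"].
    parts = re.findall(r"[0-9]+|[A-Za-z]+", token)
    # The original token is only kept when preserve_original is set.
    out = [token] if preserve_original else []
    for part in parts:
        # A part is emitted only if the flag for its kind is enabled.
        wanted = generate_number_parts if part.isdigit() else generate_word_parts
        # Avoid emitting an unsplit token twice when it was already preserved.
        if wanted and not (preserve_original and part == token):
            out.append(part)
    return out


def analyze(text, **flags):
    """Whitespace split (stand-in for the standard tokenizer), then the filter."""
    tokens = []
    for token in text.split():
        tokens.extend(word_delimiter(token, **flags))
    return tokens


# Settings from the question: both generate_* flags off, so nothing is emitted.
print(analyze("alpha 1a beta",
              generate_word_parts=False, generate_number_parts=False))
# -> []

# Settings from the answer: both flags on.
print(analyze("alpha 1a beta"))
# -> ['alpha', '1', 'a', 'beta']

# With preserve_original, the unsplit "1a" is kept as well.
print(analyze("alpha 1a beta", preserve_original=True))
# -> ['alpha', '1a', '1', 'a', 'beta']
```

The point of the sketch is that splitting and emitting are separate decisions: split_on_numerics only decides where a token is cut, while the generate_* flags decide whether any of the resulting pieces make it into the token stream.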
