word_delimiter with split_on_numerics removes all tokens
When analyzing the text alpha 1a beta, I want the resulting tokens to be [alpha 1 a beta]. Why isn't myAnalyzer doing the trick?
POST myindex
{
  "settings" : {
    "analysis" : {
      "analyzer" : {
        "myAnalyzer" : {
          "type" : "custom",
          "tokenizer" : "standard",
          "filter" : [ "split_on_numerics" ]
        }
      },
      "filter" : {
        "split_on_numerics" : {
          "type" : "word_delimiter",
          "split_on_numerics" : true,
          "split_on_case_change" : false,
          "generate_word_parts" : false,
          "generate_number_parts" : false,
          "catenate_all" : false
        }
      }
    }
  }
}
Now when I run
GET /myindex/_analyze?analyzer=myAnalyzer&text=alpha 1a beta
no tokens are returned. Again, why?
To achieve this with a custom word_delimiter filter, you need to set "generate_word_parts" : true and "generate_number_parts" : true.
This ensures that when an alphanumeric token is split, the filter emits both its number parts and its word parts.
An example filter would be as follows:
{
  "settings" : {
    "analysis" : {
      "analyzer" : {
        "myAnalyzer" : {
          "type" : "custom",
          "tokenizer" : "standard",
          "filter" : [ "split_on_numerics" ]
        }
      },
      "filter" : {
        "split_on_numerics" : {
          "type" : "word_delimiter",
          "split_on_numerics" : true,
          "split_on_case_change" : false,
          "generate_word_parts" : true,
          "generate_number_parts" : true,
          "catenate_all" : false
        }
      }
    }
  }
}
If you also want the original term "1a" to be indexed, you need to set "preserve_original" : true in the filter.
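As a rough sketch of that last point, the settings below are the same as in the example above, with only the preserve_original line added. The exact tokens returned may differ between Elasticsearch versions, so it is worth checking the result with _analyze on your own cluster.

```json
{
  "settings" : {
    "analysis" : {
      "analyzer" : {
        "myAnalyzer" : {
          "type" : "custom",
          "tokenizer" : "standard",
          "filter" : [ "split_on_numerics" ]
        }
      },
      "filter" : {
        "split_on_numerics" : {
          "type" : "word_delimiter",
          "split_on_numerics" : true,
          "split_on_case_change" : false,
          "generate_word_parts" : true,
          "generate_number_parts" : true,
          "catenate_all" : false,
          "preserve_original" : true
        }
      }
    }
  }
}
```

With this in place, the unsplit term should be kept in the token stream alongside the parts produced by splitting it.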