Elasticsearch: splitting words on underscores; nothing is found
I am setting up an analyzer that splits words on the underscore character as well as all other punctuation characters. I decided to use word_delimiter. Then I set it as the analyzer for the field in question.
I have two problems:
- The analyzer splits the string into words, but does not preserve the original token, despite the preserve_original parameter. See the analysis test below.
- Searching for underscore-delimited substrings still produces no results.
Here are my template, test document, analyzer test, and search queries:
PUT simple
{
  "template" : "simple",
  "settings" : {
    "index" : {
      "analysis" : {
        "analyzer" : {
          "underscore_splits_words" : {
            "tokenizer" : "standard",
            "filter" : ["word_delimiter"],
            "generate_word_parts" : true,
            "preserve_original" : true
          }
        }
      }
    }
  },
  "mappings" : {
    "_default_" : {
      "properties" : {
        "request" : { "type" : "string", "analyzer" : "underscore_splits_words" }
      }
    }
  }
}
Data object:
POST simple/0
{ "request" : "GET /queue/1/under_score-hyphenword/poll?ttl=300&limit=10" }
This returns the tokens "under", "score", "hyphenword", but not the original "under_score-hyphenword":
POST simple/_analyze?analyzer=underscore_splits_words
{"/queue/1/under_score-hyphenword/poll?ttl=300&limit=10"}
Search results:
Hit:
GET simple/_search?q=hyphenword
Hit:
POST simple/_search
{
  "query": {
    "query_string": {
      "query": "hyphenword"
    }
  }
}
Miss:
GET simple/_search?q=score
Miss:
POST simple/_search
{
  "query": {
    "query_string": {
      "query": "score"
    }
  }
}
Please suggest the correct way to achieve my goal. Thanks!
You should be able to use the "simple" analyzer for this. There is no need for a custom analyzer, because the simple analyzer breaks text into tokens at every non-letter character and lowercases them, so underscores, hyphens, and the rest of the punctuation in your string all start a new token. The reason you are not getting any hits is that you are not specifying a field in your query, so you are actually searching the _all field, which is mainly a convenience for full-text search and is analyzed with the default analyzer rather than your custom one.
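To see what the simple analyzer does with a string like yours, you can run it through the _analyze API (shown here against no particular index; the exact shape of the response varies a little by version):
POST _analyze?analyzer=simple
GET /queue/1/key_word-hyphenword/poll?ttl=300&limit=10
This should return the tokens get, queue, key, word, hyphenword, poll, ttl and limit; the digits are dropped because the simple analyzer only keeps runs of letters and lowercases them.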
Create Index
PUT myindex
{
  "mappings": {
    "mytype": {
      "properties": {
        "request": {
          "type": "string",
          "analyzer": "simple"
        }
      }
    }
  }
}
Insert document
POST myindex/mytype/1
{ "request" : "GET /queue/1/key_word-hyphenword/poll?ttl=300&limit=10" }
Query for the document
GET myindex/mytype/_search?q=request:key
Query using the query DSL:
POST myindex/mytype/_search
{
  "query": {
    "query_string": {
      "default_field": "request",
      "query": "key"
    }
  }
}
Another query using the query DSL:
POST myindex/mytype/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "request": "key" } }
      ]
    }
  }
}
The query result looks correct:
{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.095891505,
    "hits": [
      {
        "_index": "myindex",
        "_type": "mytype",
        "_id": "1",
        "_score": 0.095891505,
        "_source": {
          "request": "GET /queue/1/key_word-hyphenword/poll?ttl=300&limit=10"
        }
      }
    ]
  }
}
If you want to avoid specifying the field you are searching (NOT RECOMMENDED), you can set the default analyzer for all mappings in the index when you create it. (Note: this feature is deprecated, and you shouldn't rely on it, for performance and stability reasons.)
Create an index whose default mapping analyzes everything, including the _all field, with the "simple" analyzer
PUT myindex
{
  "mappings": {
    "_default_": {
      "index_analyzer": "simple"
    }
  }
}
Insert document
POST myindex/mytype/1
{ "request" : "GET /queue/1/key_word-hyphenword/poll?ttl=300&limit=10" }
Query the index without specifying a field
GET myindex/mytype/_search?q=key
You will get the same result (1 hit).
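As an aside, if you would rather keep the word_delimiter approach from the question, options such as preserve_original and generate_word_parts have to be defined on a custom token filter, not directly on the analyzer. A rough, untested sketch of that settings block (the filter name my_word_delimiter is just illustrative):
PUT simple
{
  "settings": {
    "analysis": {
      "filter": {
        "my_word_delimiter": {
          "type": "word_delimiter",
          "generate_word_parts": true,
          "preserve_original": true
        }
      },
      "analyzer": {
        "underscore_splits_words": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["my_word_delimiter"]
        }
      }
    }
  },
  "mappings": {
    "_default_": {
      "properties": {
        "request": { "type": "string", "analyzer": "underscore_splits_words" }
      }
    }
  }
}
With the options on the filter, the analyzer should emit both the split parts (under, score) and the original under_score token, so a field-qualified search such as request:score or request:under_score should then match.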