Custom Elasticsearch analyzer for hyphens, underscores and numbers

Admittedly, I'm not that good with the analysis side of ES. Here's the index definition:

{
    "mappings": {
        "event": {
            "properties": {
                "ipaddress": {
                    "type": "string"
                },
                "hostname": {
                    "type": "string",
                    "analyzer": "my_analyzer",
                    "fields": {
                        "raw": {
                            "type": "string",
                            "index": "not_analyzed"
                        }
                    }
                }
            }
        }
    },
    "settings": {
        "analysis": {
            "filter": {
                "my_filter": {
                    "type": "word_delimiter",
                    "preserve_original": true
                }
            },
            "analyzer": {
                "my_analyzer": {
                    "type": "custom",
                    "tokenizer": "whitespace",
                    "filter": ["lowercase", "my_filter"]
                }
            }
        }
    }
}

      

You can see that I tried to use my own analyzer for the hostname field. This kind of works when I use this query to find a host named "WIN_1":

{
    "query": {
        "match": {
            "hostname": "WIN_1"
        }
    }
}

      

The problem is that it also returns any hostname that contains a 1. Using the _analyze endpoint, I can see that the numbers are emitted as separate tokens too.

{
    "tokens": [
        {
            "token": "win_1",
            "start_offset": 0,
            "end_offset": 5,
            "type": "word",
            "position": 1
        },
        {
            "token": "win",
            "start_offset": 0,
            "end_offset": 3,
            "type": "word",
            "position": 1
        },
        {
            "token": "1",
            "start_offset": 4,
            "end_offset": 5,
            "type": "word",
            "position": 2
        }
    ]
}
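
To see why a match query for WIN_1 pulls in every host containing a 1, here is a minimal Python sketch of what the analyzer above appears to do. The analyze and matches helpers are hypothetical, the word_delimiter logic is a rough ASCII-only approximation rather than the real Lucene filter, and the match query is modelled with its default OR operator.

```python
import re

def analyze(text):
    """Approximate whitespace tokenizer + lowercase + word_delimiter
    with preserve_original=true (ASCII only, illustrative)."""
    tokens = []
    for word in text.lower().split():
        tokens.append(word)  # preserve_original keeps the whole token
        # word_delimiter splits on non-alphanumerics and letter/digit boundaries
        parts = re.findall(r"[a-z]+|[0-9]+", word)
        tokens.extend(p for p in parts if p != word)
    return tokens

def matches(query, hostname):
    """With the default OR operator, one shared term is enough for a hit."""
    return bool(set(analyze(query)) & set(analyze(hostname)))

hosts = ["WIN_8_ENT_1", "server1", "ServA-1"]
print(analyze("WIN_1"))                           # ['win_1', 'win', '1']
print([h for h in hosts if matches("WIN_1", h)])  # all three, via the shared term "1"
```

Every test hostname produces the term "1", so the query term "1" alone is enough to match all of them.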

      

What I would like to do is search for WIN and get back any host that has WIN in its name. But I also need to be able to search for WIN_1 and get back that exact host, or any host with WIN_1 in its name. Below is some test data.

{
    "ipaddress": "192.168.1.253",
    "hostname": "WIN_8_ENT_1"
}
{
    "ipaddress": "10.0.0.1",
    "hostname": "server1"
}
{
    "ipaddress": "172.20.10.36",
    "hostname": "ServA-1"
}

      

Hopefully someone can point me in the right direction. Maybe my simple query isn't the right fit either. I have pored over the ES docs, but they are light on examples.

+3




4 answers


Here's the analyzer and the queries I ended up with:

{
    "mappings": {
        "event": {
            "properties": {
                "ipaddress": {
                    "type": "string"
                },
                "hostname": {
                    "type": "string",
                    "analyzer": "hostname_analyzer",
                    "fields": {
                        "raw": {
                            "type": "string",
                            "index": "not_analyzed"
                        }
                    }
                }
            }
        }
    },
    "settings": {
        "analysis": {
            "filter": {
                "hostname_filter": {
                    "type": "pattern_capture",
                    "preserve_original": false,
                    "patterns": [
                        "(\\p{Ll}{3,})"
                    ]
                }
            },
            "analyzer": {
                "hostname_analyzer": {
                    "type": "custom",
                    "tokenizer": "whitespace",
                    "filter": [  "lowercase", "hostname_filter" ]
                }
            }
        }
    }
}
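
As a rough illustration of which tokens this analyzer would index, here is a small Python sketch. It is an approximation under stated assumptions: hostname_analyzer is a hypothetical helper, [a-z]{3,} stands in for \p{Ll}{3,} (ASCII only), and the fallback to the original token when nothing is captured reflects my understanding of the pattern_capture filter rather than something verified against this setup.

```python
import re

# ASCII stand-in for the \p{Ll}{3,} pattern in hostname_filter
PATTERN = re.compile(r"[a-z]{3,}")

def hostname_analyzer(text):
    """Approximate whitespace tokenizer + lowercase + pattern_capture."""
    tokens = []
    for word in text.lower().split():
        captures = PATTERN.findall(word)
        # Assumption: if no pattern matches, the original token is kept
        tokens.extend(captures or [word])
    return tokens

for host in ["WIN_8_ENT_1", "server1", "ServA-1"]:
    print(host, hostname_analyzer(host))
```

Only runs of three or more lowercase letters survive, so single digits such as the "1" in WIN_1 are no longer searchable terms in the analyzed field.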

      

Queries: Find hostnames starting with:

{
    "query": {
        "prefix": {
            "hostname.raw": "WIN_8"
        }
    }
}
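
Because hostname.raw is not_analyzed and a prefix query does not analyze its input, this behaves like a case-sensitive starts-with test on the stored string. A minimal sketch of that behaviour (prefix_raw is a hypothetical helper, not an Elasticsearch API):

```python
def prefix_raw(prefix, hostnames):
    """Model a prefix query on a not_analyzed field: exact, case-sensitive."""
    return [h for h in hostnames if h.startswith(prefix)]

hosts = ["WIN_8_ENT_1", "server1", "ServA-1"]
print(prefix_raw("WIN_8", hosts))  # ['WIN_8_ENT_1']
print(prefix_raw("win_8", hosts))  # [] because the raw value keeps its original case
```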

      



Find a hostname containing:

{
    "query": {
        "multi_match": {
            "fields": [
                "hostname",
                "hostname.raw"
            ],
            "query": "WIN"
       }
   }
}

      

Thanks to Dan for pointing me in the right direction.

+1




You can modify your analysis to use a pattern analyzer that discards digits and underscores:

{
   "analysis": {
      "analyzer": {
          "word_only": {
              "type": "pattern",
              "pattern": "([^\\p{L}]+)"
          }
       }
    }
}

      

Using the analysis API:

curl -XGET 'localhost:9200/{yourIndex}/_analyze?analyzer=word_only&pretty=true' -d 'WIN_8_ENT_1'

      

returns:



"tokens" : [ {
    "token" : "win",
    "start_offset" : 0,
    "end_offset" : 3,
    "type" : "word",
    "position" : 1
}, {
    "token" : "ent",
    "start_offset" : 6,
    "end_offset" : 9,
    "type" : "word",
    "position" : 2
} ]
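
The same splitting can be sketched in Python to check other hostnames without a running cluster. This is an approximation: word_only here is a hypothetical helper, and [\W\d_]+ is used as a stand-in for [^\p{L}]+ because Python's re module has no \p{L} class.

```python
import re

# Approximates [^\p{L}]+ : runs of anything that is not a letter
NON_LETTERS = re.compile(r"[\W\d_]+")

def word_only(text):
    """Split on non-letter runs and lowercase, like the pattern analyzer."""
    return [t.lower() for t in NON_LETTERS.split(text) if t]

for host in ["WIN_8_ENT_1", "server1", "ServA-1"]:
    print(host, word_only(host))
```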

      

Your mapping will become:

{
    "event": {
        "properties": {
            "ipaddress": {
                 "type": "string"
             },
             "hostname": {
                 "type": "string",
                 "analyzer": "word_only",
                 "fields": {
                     "raw": {
                         "type": "string",
                         "index": "not_analyzed"
                     }
                 }
             }
         }
    }
}

      

You can use a multi_match query to get the results you want:

{
    "query": {
        "multi_match": {
            "fields": [
                "hostname",
                "hostname.raw"
            ],
            "query": "WIN_1"
       }
   }
}
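
A sketch of which field each host would match on for this query may help. It assumes the word_only analysis is approximated as splitting on non-letters and lowercasing, and that hostname.raw only matches the exact original string. The matched_fields helper and the WIN_1 host are hypothetical additions for illustration.

```python
import re

def word_only(text):
    """Approximate the word_only analyzer: split on non-letters, lowercase."""
    return [t.lower() for t in re.split(r"[\W\d_]+", text) if t]

def matched_fields(query, hostname):
    """Return the fields of a multi_match on hostname and hostname.raw that hit."""
    fields = []
    if set(word_only(query)) & set(word_only(hostname)):
        fields.append("hostname")      # analyzed field: "WIN_1" becomes "win"
    if query == hostname:
        fields.append("hostname.raw")  # not_analyzed field: exact string only
    return fields

# "WIN_1" is a hypothetical host added to the question's test data
for host in ["WIN_1", "WIN_8_ENT_1", "server1"]:
    print(host, matched_fields("WIN_1", host))
```

In this model every WIN host matches through the analyzed field, while only the exact host also matches hostname.raw, which is what distinguishes it from the partial matches.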

      

+3




When ES 1.4 is released, there will be a new filter called "keep types" that will let you keep only certain token types once a string has been tokenized (i.e. keep only words, only numbers, etc.).

Check it out here: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/analysis-keep-types-tokenfilter.html#analysis-keep-types-tokenfilter

It may be a more convenient solution for your future needs.

+1




It sounds like you want to apply two different types of lookups to your hostname field: one for exact matches and one for some variant of a pattern match (perhaps, in your particular case, a prefix query).

After trying to implement all sorts of different searches using several different analyzers, I've found it is sometimes easier to add another field for each type of search you want to do. Is there any reason why you don't want to add another field, like this:

{"ipaddress": "192.168.1.253", "hostname": "WIN_8_ENT_1", "system": "WIN"}
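
One way to populate such a field is to derive it in application code before indexing. The sketch below is only an illustration: with_system_field is a hypothetical helper, and taking the leading run of letters as the "system" value is my assumption about the rule, not something stated above.

```python
import re

def with_system_field(doc):
    """Return a copy of doc with a 'system' field taken from the
    leading letters of its hostname (assumed rule, for illustration)."""
    enriched = dict(doc)
    m = re.match(r"[A-Za-z]+", doc["hostname"])
    if m:
        enriched["system"] = m.group(0)
    return enriched

doc = {"ipaddress": "192.168.1.253", "hostname": "WIN_8_ENT_1"}
print(with_system_field(doc))
```

The extra field can then be queried with a plain term or match query, leaving hostname itself for exact lookups.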

Otherwise, you might consider writing your own custom filter that does the same thing under the hood. Your filter would read the hostname field and index both the exact keyword and the substring that matches your pattern (like WIN in WIN_8_ENT_1).

I don't think there is any existing analyzer / filter combination that can do what you are looking for, provided I understand your requirements correctly.

0

