How to define bucket aggregation where buckets are defined by arbitrary filters in a field (GROUP BY CASE equivalent)

Question

Elasticsearch allows us to filter a set of documents by a regular expression on any given field, and also to group the resulting documents by the terms in a given field (the same field or another one) using "bucket aggregations". For example, on an index that contains a "Url" field and a "UserAgent" field (some kind of web server log), the following returns the top document counts for the terms found in the UserAgent field.

{
    query: { filtered: { filter: { regexp: { Url : ".*interestingpage.*" } } } },
    size: 0,                            
    aggs: { myaggregation: { terms: { field: "UserAgent" } } }                          
}

What I would like to do is use a regexp filter (which operates on the whole field value, not just on individual terms within the field) to manually define my aggregation buckets, so that I can relatively reliably split my documents and counts by "user agent type" rather than by the arbitrary terms that Elasticsearch produces when it analyzes the field.

In SQL terms, I'm looking for the equivalent of a CASE expression inside a GROUP BY. A SQL query that expresses my intent would be something like this:

SELECT Bucket, Count(*)
FROM (
    SELECT CASE 
        WHEN UserAgent LIKE '%android%' OR UserAgent LIKE '%ipad%' OR UserAgent LIKE '%iphone%' OR UserAgent LIKE '%mobile%' THEN 'Mobile'
        WHEN UserAgent LIKE '%msie 7.0%' then 'IE7'
        WHEN UserAgent LIKE '%msie 8.0%' then 'IE8'
        WHEN UserAgent LIKE '%firefox%' then 'FireFox'
        ELSE 'OTHER'
        END Bucket
    FROM pagedata
    WHERE Url LIKE '%interestingpage%'
) Buckets
GROUP BY Bucket

Can this be done in an ElasticSearch query?

elasticsearch

Tao June 18. '15 at 8:14

2 answers

This is an interesting use case.

Here is an Elasticsearch-only solution. The idea is to do all the regex work at indexing time, so that search time stays fast (scripts at search time do not perform well when there are many documents, and they take time). Let me explain:

- Define a sub-field for your main field, with a custom analyzer that controls which terms are indexed. This is done so that the only terms stored in the index are the ones you have defined: FireFox, IE8, IE7, Mobile. Each document can have more than one of these terms. A text value such as "msie 7.0 sucks and ipad rules" will generate only two terms: IE7 and Mobile. All of this is made possible by the keep token filter.
- There should be another list of token filters that actually perform the replacement. This can be done with the pattern_replace token filter.
- Because some patterns span two words (for example msie 7.0), you need a way to capture these two words (msie and 7.0) next to each other. This is possible with the shingle token filter.
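To make the chain easier to follow, here is a rough Python simulation of what the three token filters do to a value. This is only a sketch and not how Elasticsearch implements analysis: the helper names shingles and analyze are made up for illustration, the patterns and keep list are copied from this answer, and details such as token order and positions are ignored.

```python
import re
from collections import Counter

# Patterns and replacements as used by the pattern_replace filters in this answer.
REPLACEMENTS = [
    (r"android|ipad|iphone|mobile", "Mobile"),
    (r"msie 7.0", "IE7"),
    (r"msie 8.0", "IE8"),
    (r"firefox", "FireFox"),
]
# Terms allowed through by the keep filter.
KEEP_WORDS = {"FireFox", "IE8", "IE7", "Mobile"}


def shingles(tokens, min_size=2, max_size=10):
    """Hypothetical helper: emit unigrams plus word n-grams joined by a space."""
    out = list(tokens)  # corresponds to output_unigrams: true
    for size in range(min_size, max_size + 1):
        for start in range(len(tokens) - size + 1):
            out.append(" ".join(tokens[start:start + size]))
    return out


def analyze(text):
    """Hypothetical helper: whitespace tokenizer -> shingle -> pattern_replace -> keep."""
    tokens = shingles(text.split())
    for pattern, replacement in REPLACEMENTS:
        # Replace every match inside each token, one filter after another.
        tokens = [re.sub(pattern, replacement, t) for t in tokens]
    # Only tokens that exactly equal a keep word survive.
    return {t for t in tokens if t in KEEP_WORDS}


docs = [
    "android OS is the best firefox",
    "firefox is my favourite browser",
    "msie 7.0 sucks and ipad rules",
]
# A terms aggregation counts each distinct term once per document.
counts = Counter(term for doc in docs for term in analyze(doc))
print(sorted(counts.items()))
```

Note that the shingle "msie 7.0 sucks" becomes "IE7 sucks" after replacement and is dropped by the keep step, while the two-word shingle "msie 7.0" becomes exactly "IE7" and is kept.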

Let me put it all together and provide a complete solution:

PUT /test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_pattern_replace_analyzer": {
          "tokenizer": "whitespace",
          "filter": [
            "filter_shingle",
            "my_pattern_replace1",
            "my_pattern_replace2",
            "my_pattern_replace3",
            "my_pattern_replace4",
            "words_to_be_kept"
          ]
        }
      },
      "filter": {
        "filter_shingle": {
          "type": "shingle",
          "max_shingle_size": 10,
          "min_shingle_size": 2,
          "output_unigrams": true
        },
        "my_pattern_replace1": {
          "type": "pattern_replace",
          "pattern": "android|ipad|iphone|mobile",
          "replacement": "Mobile"
        },
        "my_pattern_replace2": {
          "type": "pattern_replace",
          "pattern": "msie 7.0",
          "replacement": "IE7"
        },
        "my_pattern_replace3": {
          "type": "pattern_replace",
          "pattern": "msie 8.0",
          "replacement": "IE8"
        },
        "my_pattern_replace4": {
          "type": "pattern_replace",
          "pattern": "firefox",
          "replacement": "FireFox"
        },
        "words_to_be_kept": {
          "type": "keep",
          "keep_words": [
            "FireFox", "IE8", "IE7", "Mobile"
          ]
        }
      }
    }
  },
  "mappings": {
    "test": {
      "properties": {
        "UserAgent": {
          "type": "string",
          "fields": {
            "custom": {
              "analyzer": "my_pattern_replace_analyzer",
              "type": "string"
            }
          }
        }
      }
    }
  }
}

Test data:

POST /test/test/_bulk
{"index":{"_id":1}}
{"UserAgent": "android OS is the best firefox"}
{"index":{"_id":2}}
{"UserAgent": "firefox is my favourite browser"}
{"index":{"_id":3}}
{"UserAgent": "msie 7.0 sucks and ipad rules"}

Query:

GET /test/test/_search?search_type=count
{
  "aggs": {
    "myaggregation": {
      "terms": {
        "field": "UserAgent.custom",
        "size": 10
      }
    }
  }
}

Results:

   "hits": {
      "total": 3,
      "max_score": 0,
      "hits": []
   },
   "aggregations": {
      "myaggregation": {
         "doc_count_error_upper_bound": 0,
         "sum_other_doc_count": 0,
         "buckets": [
            {
               "key": "FireFox",
               "doc_count": 2
            },
            {
               "key": "Mobile",
               "doc_count": 2
            },
            {
               "key": "IE7",
               "doc_count": 1
            }
         ]
      }
   }

Andrei Stefan June 18. '15 at 9:26

Shadocko · Accepted Answer · 2015-06-18T09:08:07+0000

You can use a terms aggregation with a script:

{
  query: { filtered: { filter: { regexp: { Url : ".*interestingpage.*" } } } },
  size: 0,
  aggs: {
    myaggregation: {
      terms: {
        script: "doc['UserAgent'] =~ /.*android.*/ || doc['UserAgent'] =~ /.*ipad.*/ || doc['UserAgent'] =~ /.*iphone.*/ || doc['UserAgent'] =~ /.*mobile.*/ ? 'Mobile' : doc['UserAgent'] =~ /.*msie 7.0.*/ ? 'IE7' : '...you got the idea by now...'"
      }
    }
  }
}

But beware of the performance hit!
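
For readers who find the nested ternary hard to read, the same first-match-wins logic can be sketched in Python. This is only an illustration of the bucketing rule, not something Elasticsearch runs: the function name bucket_for and the RULES list are hypothetical, the rules are taken from the CASE expression in the question, and matching is case-sensitive here, which may differ from SQL LIKE depending on collation.

```python
import re

# (pattern, bucket) pairs checked in order; the first match wins, like SQL CASE.
# Rules are copied from the CASE expression in the question.
RULES = [
    (r"android|ipad|iphone|mobile", "Mobile"),
    (r"msie 7\.0", "IE7"),
    (r"msie 8\.0", "IE8"),
    (r"firefox", "FireFox"),
]


def bucket_for(user_agent):
    """Hypothetical helper: return the single bucket a user agent falls into."""
    for pattern, bucket in RULES:
        if re.search(pattern, user_agent):
            return bucket
    return "OTHER"  # the ELSE branch of the CASE


print(bucket_for("msie 7.0 sucks and ipad rules"))
```

With this approach every document lands in exactly one bucket, so a value that mentions both ipad and msie 7.0 is counted only as Mobile because that rule is checked first.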
