Analytics in Elasticsearch

I am working on event analytics: I use Hadoop to process logs and save some results to MySQL. This no longer works due to scalability issues, as new logs keep arriving daily.

We need to show statistics per year, month, week, day, and hour, along with filtering options. Our data can grow to 100k users, each using 20 websites every hour:
100,000 (users) * 20 (unique sites) * 2 (locations) * 24 (hours) = 96,000,000 (96 million entries per day, at most)

Our table looks like event_src_id, time, user, site, location, some statistics

An example of some queries:

1) select website, sum(stats), count(distinct(user_id)) from table group by website;
2) select website, sum(stats), count(distinct(user_id)) from table where YEAR(Time) = 2009 group by website, MONTH(Time);
3) select website, sum(stats), count(distinct(user_id)) from table where event_src_id=XXXXXXXXXXX group by website;
4) select website, sum(stats), count(distinct(user_id)) from table where time > '2014-01-01' and time <= '2014-01-31' group by website;
5) select website, location, sum(stats), count(distinct(user_id)) from table group by website, location;
6) select website, sum(stats) as stats_val from table group by website order by stats_val desc limit 10;
   select location, sum(stats) as stats_val from table group by location order by stats_val desc limit 10;
7) delete from table where event_src_id=XXXXXXXXXXX; (may delete all 96M records)

      

I looked into Elasticsearch for Hadoop, and it seems the insert side can be handled with that; I'm more worried about the read side. The aggregation framework gives some hope, but I couldn't get it to do what I need: how do I group, sum, and count distinct at the same time? How can I best use Elasticsearch with Hadoop to meet scalability and performance targets for OLAP-style queries? Any help would be appreciated.



2 answers


First, I don't think using Elasticsearch for OLAP-like queries is a good idea. I would recommend a Dremel-like technology (Impala, TEZ, Storm, etc.) that supports the SQL you listed. This has some advantages:

  • You don't need to transfer data from Hadoop to Elasticsearch.
  • You can use SQL.
  • You don't need to parse JSON out of Elasticsearch responses.

Don't get me wrong, I love Elasticsearch/Logstash/Kibana, but for log collection and visualization. Sure, some advanced queries are possible, but it has limitations that I ran into in my own projects.

Also consider Kibana: it's a great tool for visualizing statistics on data stored in Elasticsearch, and you can do a lot with it.

Here is an example query for your case (I have not tested it):

1)



{
  "aggs": {
    "website": {
      "terms": {
        "field": "website"
      },
      "aggs": {
        "sum_stats": {
          "sum": {
            "field": "stats"
          }
        },
        "distinct_user": {
          "cardinality": {
            "field": "user_id",
            "precision_threshold": 100
          }
        }
      }
    }
  }
}

      

Queries 2-6 are similar: take 1) and add different filters, like this:

{
  "aggs": {
    "your_filter": {
      "filter": {
        "term": { "event_src_id": "XXXXXXXXXXX" }
      },
      "aggs": {
        "website": {
          "terms": {
            "field": "website"
          },
          "aggs": {
            "sum_stats": {
              "sum": {
                "field": "stats"
              }
            },
            "distinct_user": {
              "cardinality": {
                "field": "user_id",
                "precision_threshold": 100
              }
            }
          }
        }
      }
    }
  }
}
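For 6), the "order by sum ... limit 10" part can be expressed directly in the terms aggregation by ordering on a metric sub-aggregation. A sketch I have not tested (`top_websites` and `sum_stats` are just names I picked):

```json
{
  "aggs": {
    "top_websites": {
      "terms": {
        "field": "website",
        "size": 10,
        "order": { "sum_stats": "desc" }
      },
      "aggs": {
        "sum_stats": {
          "sum": { "field": "stats" }
        }
      }
    }
  }
}
```

The same shape with `"field": "location"` covers the second statement in 6).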

      

7) DELETE is pretty easy with the delete-by-query API:

{
    "query" : {
        "term" : { "event_src_id" : "XXXXXXXXXXX" }
    }
}

      



how to group and sum and count distinct at the same time

Aggregations can contain sub-aggregations.

First, the group by functionality corresponds to the terms aggregation and (sometimes) the top_hits aggregation. Second, sum(stats) corresponds to the sum aggregation, a simple statistical aggregation. Finally, count(distinct(user_id)) corresponds to the cardinality aggregation, which can be approximate or exact depending on your needs.

7) delete from table where event_src_id=XXXXXXXXXXX; (may delete all 96M records)

There is a delete-by-query API that you can use, but be careful with a high percentage of deleted documents: Lucene and Elasticsearch are not optimized for this, and you will incur overhead from the deleted documents that remain in the index until segments are merged.
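On Elasticsearch 1.x the delete-by-query API is exposed as `DELETE /{index}/_query` (newer versions use `POST /{index}/_delete_by_query` instead); the index name `events` below is a placeholder:

```json
DELETE /events/_query
{
  "query": {
    "term": { "event_src_id": "XXXXXXXXXXX" }
  }
}
```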



Examples

select website, sum(stats), count(distinct(user_id)) from table group by website

GET /_search
{
   "aggs": {
      "website_stats": {
        "terms": {
           "field": "website"
        },
        "aggs": {
           "sum_stats": {
             "sum": {
               "field": "stats"
             }
           },
           "count_users": {
             "cardinality": {
               "field": "user_id"
            }
          }
        }
      }
   }
}

      

select website, sum(stats), count(distinct(user_id)) from table where YEAR(Time) = 2009 group by website, MONTH(Time)

GET /_search
{
   "query": {
     "filtered": {
       "filter": {
         "range": {
            "Time": {
               "gte": "2009-01-01 00:00:00",
               "lt": "2010-01-01 00:00:00"
            }
         }
       }
     }
   },
   "aggs": {
      "monthly_stats": {
        "terms": {
           "field": "website"
        },
        "aggs": {
           "months": {
              "date_histogram": {
                "field": "Time",
                "interval": "month"
              },
              "aggs": {
                "sum_stats": {
                  "sum": {
                    "field": "stats"
                  }
                },
                "count_users": {
                  "cardinality": {
                    "field": "user_id"
                  }
                }
              }
           }
        }
      }
   }
}
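Query 5) (group by website, location) maps to nesting one terms aggregation inside another. A sketch along the same lines (untested; the aggregation names are my own):

```json
GET /_search
{
  "aggs": {
    "website_stats": {
      "terms": { "field": "website" },
      "aggs": {
        "by_location": {
          "terms": { "field": "location" },
          "aggs": {
            "sum_stats": { "sum": { "field": "stats" } },
            "count_users": { "cardinality": { "field": "user_id" } }
          }
        }
      }
    }
  }
}
```

Each location bucket then carries its own sum and distinct-user count, mirroring the two-column GROUP BY.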

      
