Elasticsearch / Logstash / Kibana: How to deal with spikes in log traffic
What is the best way to deal with bursts of log messages being written to an Elasticsearch cluster in a standard ELK setup?
We use a standard ELK setup (Elasticsearch / Logstash / Kibana) in AWS for our website logging needs.
We have an autoscaling group of Logstash instances behind a load balancer, which log to an autoscaling group of Elasticsearch instances behind another load balancer. Then we have a single instance serving Kibana.
For day-to-day work, we run 2 Logstash instances and 2 ElasticSearch instances.
Our site experiences short periods of high traffic during events - our traffic increases by about 2000% during these events. We know in advance about these events.
Currently we just increase the number of Elasticsearch instances temporarily during an event. However, we have had issues where we subsequently scaled down too quickly, which meant we lost shards and corrupted our indexes.
I was thinking of setting the auto_expand_replicas parameter to "1-all" so that each node has a copy of all data, so we don't have to worry about how fast we scale up or down. How significant would the overhead of transferring all the data to new nodes be? We currently only keep about 2 weeks of log data, which comes to only about 50 GB in total.
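For reference, a minimal sketch of how that setting could be applied through the index settings API. The endpoint `http://localhost:9200`, the `logstash-*` index pattern, and the helper name `build_auto_expand_request` are assumptions for illustration only. The code builds the request without sending it.

```python
import json
import urllib.request


def build_auto_expand_request(base_url, index_pattern, replicas_range="1-all"):
    """Build (but do not send) a PUT request that sets index.auto_expand_replicas."""
    body = {"index": {"auto_expand_replicas": replicas_range}}
    return urllib.request.Request(
        url=f"{base_url}/{index_pattern}/_settings",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


# Hypothetical endpoint and index pattern; adjust to the real cluster.
request = build_auto_expand_request("http://localhost:9200", "logstash-*")
# To apply it against a live cluster: urllib.request.urlopen(request)
```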
I've also seen people mention using a separate autoscaling group of non-data nodes to deal with increases in search traffic, while keeping the number of data nodes the same. Would this help in a write-heavy situation, such as the events I mentioned earlier?
My advice
Your best bet is to use Redis as a broker between Logstash and Elasticsearch.
This is documented in some old Logstash docs, but is still pretty relevant.
Yes, you will see some delay between the logs being produced and them eventually landing in Elasticsearch, but it should be minimal, since the latency between Redis and Logstash is relatively small. In my experience, Logstash works through the backlog in Redis pretty quickly.
This type of setup is also more robust: even if Logstash goes down, events are still accepted by Redis and held there until Logstash is back.
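To illustrate why a broker helps with bursts, here is a small simulation of a FIFO buffer sitting in front of a consumer with a fixed throughput. It does not talk to Redis at all. The rates (100 events per tick normally, 2000 per tick during a burst, 500 per tick drained) and the function name `simulate_broker` are made-up assumptions, chosen only to show the backlog growing during the spike and draining afterwards.

```python
from collections import deque


def simulate_broker(arrivals_per_tick, drain_per_tick):
    """Return the backlog size after each tick for a FIFO buffer."""
    queue = deque()
    backlog = []
    for tick, arrivals in enumerate(arrivals_per_tick):
        # Producers push everything they have into the buffer.
        queue.extend((tick, i) for i in range(arrivals))
        # The consumer pulls at most drain_per_tick events per tick.
        for _ in range(min(drain_per_tick, len(queue))):
            queue.popleft()
        backlog.append(len(queue))
    return backlog


# Normal load, a two-tick burst of roughly 20x, then normal load again.
arrivals = [100, 100, 2000, 2000, 100, 100, 100, 100, 100, 100, 100, 100]
backlog = simulate_broker(arrivals, drain_per_tick=500)
print(backlog)
```

The events are delayed during the burst but none are dropped, as long as the buffer has room for the peak backlog.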
Just scaling Elasticsearch
As for your question on whether extra non-data nodes will help in write-heavy periods: I don't believe so, no. Non-data nodes are great when you see a lot of searches (reads) being performed, as they delegate the search to all the data nodes and then aggregate the results before sending them back to the client. They take the load of aggregating the results off the data nodes.
Writing will always use your data nodes.
I don't think adding and removing nodes is a great way to deal with this.
You can try to tune thread pools and queues during peak periods. Let's say you usually have the following:
threadpool:
  index:
    type: fixed
    size: 30
    queue_size: 1000
  search:
    type: fixed
    size: 30
    queue_size: 1000
This way you have an equal number of search and index threads. Just before your peak time, you can change the settings (on the fly) to the following:
threadpool:
  index:
    type: fixed
    size: 50
    queue_size: 2000
  search:
    type: fixed
    size: 10
    queue_size: 500
You now have a lot more threads doing indexing, which allows for faster indexing speeds while searches are pushed to the background. For good measure, I also increased the queue_size to increase the amount of backlog. This may not work as expected, so experimenting and tweaking is recommended.
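A rough sketch of what the on-the-fly change could look like, assuming an older Elasticsearch version in which the `threadpool.*` keys can be updated at runtime through the cluster settings API (as far as I know, later versions renamed them to `thread_pool.*` and made them static, node-level settings). The endpoint and the helper name `build_threadpool_request` are hypothetical. The request is only built here, not sent.

```python
import json
import urllib.request


def build_threadpool_request(base_url, index_size, index_queue, search_size, search_queue):
    """Build (but do not send) a transient cluster settings update for the thread pools."""
    body = {
        "transient": {
            "threadpool.index.size": index_size,
            "threadpool.index.queue_size": index_queue,
            "threadpool.search.size": search_size,
            "threadpool.search.queue_size": search_queue,
        }
    }
    return urllib.request.Request(
        url=f"{base_url}/_cluster/settings",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


# Peak-time values from the example above; the endpoint is an assumption.
peak_request = build_threadpool_request("http://localhost:9200", 50, 2000, 10, 500)
# To apply it against a live cluster: urllib.request.urlopen(peak_request)
```

Transient settings are not kept across a full cluster restart, which suits a temporary change like this, but the normal values still need to be put back after the event.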