What conditions cause a Marathon leader election?

Question


I am using Mesos and Marathon to manage app deployments and ran into this Marathon bug, https://github.com/mesosphere/marathon/issues/3783, in which a leader election during a deployment scales the number of instances to 0. Leader elections were happening very often (about once every 30 minutes), so I hit this problem frequently.

I know that every 30 minutes is abnormal, because I have since upgraded to Marathon 1.3.10 and have had no elections for the past 2 days, but how often is "normal"? Does leadership abdication/election happen under normal circumstances, or should I expect 0 elections if there is no underlying problem? A colleague suggested to me that "leader election is normal" and that "a certain number of elections is normal and expected." I just don't believe this and would like to know for sure.


apache-zookeeper mesos mesosphere marathon

Mike sherov Apr 24 17 at 12:45


1 answer

janisz · Accepted Answer · 2017-04-24T15:22:39+0000

It's not OK if your Marathon re-elects a leader every 30 minutes. Under normal circumstances, Marathon should not abdicate or elect a new leader until maintenance (an upgrade or restart) occurs. If it does happen, it could be caused by four main problems (all results of timeouts):

Marathon performance - When Marathon has performance issues, one symptom is losing leadership. This happens because Marathon does not respond to ZooKeeper within the given interval and is marked as gone.
Marathon-to-ZooKeeper connectivity problems - When network latency is too high (for example, the ZooKeeper cluster is in a different DC than Marathon), some updates may be delayed. This leads to loss of leadership.
ZooKeeper performance - When ZooKeeper has a lot of work to do, requests time out and Marathon loses leadership.
Marathon forced to abdicate - someone or something calls DELETE /v2/leader.
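To tell how often leadership actually changes, one option is to poll Marathon's GET /v2/leader endpoint and record each time the reported leader differs from the previous one. The following is only a minimal sketch: the base URL, the polling interval, and the helper names (fetch_leader, leader_changes, poll) are assumptions made for illustration, not part of Marathon.

```python
import json
import time
import urllib.request
from typing import Iterable, List, Optional, Tuple


def fetch_leader(base_url: str, timeout: float = 5.0) -> Optional[str]:
    """Return the leader reported by GET /v2/leader, or None on any failure."""
    try:
        with urllib.request.urlopen(f"{base_url}/v2/leader", timeout=timeout) as resp:
            return json.load(resp).get("leader")
    except (OSError, ValueError):
        # Network/HTTP errors (e.g. no leader elected right now) or bad JSON.
        return None


def leader_changes(
    samples: Iterable[Tuple[float, Optional[str]]]
) -> List[Tuple[float, str, str]]:
    """Given (timestamp, leader) samples, return (timestamp, old, new) per change."""
    changes = []
    previous = None
    for ts, leader in samples:
        if leader is None:
            continue  # a failed poll alone is not evidence of a new leader
        if previous is not None and leader != previous:
            changes.append((ts, previous, leader))
        previous = leader
    return changes


def poll(base_url: str, interval_s: float, count: int) -> List[Tuple[float, Optional[str]]]:
    """Collect `count` samples, one every `interval_s` seconds."""
    samples = []
    for _ in range(count):
        samples.append((time.time(), fetch_leader(base_url)))
        time.sleep(interval_s)
    return samples
```

For example, leader_changes(poll("http://marathon.example.com:8080", 30, 120)) would cover roughly an hour; the hostname here is a placeholder. On a healthy cluster with no maintenance, the result should be an empty list.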

To troubleshoot performance issues, follow the steps below:

Shard your Marathon.

Monitor - enable metrics, but don't forget to tweak them.

Upgrade to version 1.3.10 or newer.

Minimize Zookeeper communication latency and object size.

Tune JVM - add more heap and CPUs :).

Don't use the event bus - if you really need to, use filtered SSE and accept that it is asynchronous and that events are delivered at most once.

If you need task lifecycle events, use a custom executor.

Prefer batch deployments to many individual ones.
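As an illustration of the last point, Marathon's REST API accepts a list of app definitions in a single PUT /v2/apps call, so several apps can be changed in one deployment instead of one deployment per app. The sketch below only builds such a request; the base URL, the app definitions, and the helper name build_batch_request are assumptions for illustration.

```python
import json
import urllib.request
from typing import Any, Dict, List


def build_batch_request(base_url: str, apps: List[Dict[str, Any]]) -> urllib.request.Request:
    """Build one PUT /v2/apps request carrying all app definitions at once."""
    body = json.dumps(apps).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/v2/apps",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


# Hypothetical app definitions; real ones would carry the full app config.
apps = [
    {"id": "/web", "instances": 4},
    {"id": "/worker", "instances": 2},
]
request = build_batch_request("http://marathon.example.com:8080", apps)
# Sending it is a single call: urllib.request.urlopen(request)
```

Sending one request like this creates one deployment for Marathon to plan and persist, rather than one per app.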
