What conditions cause a Marathon leader election?

I am using Mesos and Marathon to manage app deployments and ran into this Marathon bug, https://github.com/mesosphere/marathon/issues/3783, in which a leader election during a deployment reduces the number of instances to 0. Leader elections were happening very often (about once every 30 minutes), so I hit this problem frequently.

I know that every 30 minutes is abnormal, because I have since upgraded to Marathon 1.3.10 and have had no elections for the past 2 days. But how often is "normal"? Do leader abdications and elections happen under normal circumstances, or should I expect 0 elections when there is no underlying problem? A colleague told me that "leader election is normal" and that "a certain number of elections is normal and expected." I don't believe this and would like to know for sure.

1 answer


It is not normal for Marathon to re-elect its leader every 30 minutes. Under normal circumstances, Marathon should not abdicate or elect a new leader until maintenance (an upgrade or a restart) takes place. If it does so anyway, there are four main possible causes (the first three come down to timeouts):

  • Marathon performance - when Marathon has performance issues, one symptom is losing leadership. This happens because Marathon does not respond to ZooKeeper within the given interval and is marked as gone.
  • Connection problems between Marathon and ZooKeeper - when network latency is too high (for example, the ZooKeeper cluster is in a different DC than Marathon), some updates may be delayed. This leads to loss of leadership.
  • ZooKeeper performance - when ZooKeeper has a lot of work to do, requests can time out, which causes Marathon to lose leadership.
  • Forced abdication - someone called DELETE /v2/leader on Marathon.
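To see how often leadership actually moves, one option is to poll GET /v2/leader (the read-only counterpart of the DELETE /v2/leader call mentioned above) and count the changes. This is only a rough sketch: the Marathon address, the polling interval and the function names are assumptions for illustration. Polling also cannot see an election that hands leadership back to the same node.

```python
import json
import time
import urllib.request

# Hypothetical address; replace with one of your Marathon nodes.
MARATHON_URL = "http://marathon.example.com:8080"


def fetch_leader(base_url=MARATHON_URL, timeout=5.0):
    """Return the current leader as 'host:port' using GET /v2/leader."""
    with urllib.request.urlopen(f"{base_url}/v2/leader", timeout=timeout) as resp:
        return json.load(resp)["leader"]


def count_leader_changes(observations):
    """Count how many times the observed leader differs from the previous one.

    None entries (failed polls, or no leader at that moment) are skipped,
    so only a change between two known leaders is counted.
    """
    changes = 0
    previous = None
    for leader in observations:
        if leader is None:
            continue
        if previous is not None and leader != previous:
            changes += 1
        previous = leader
    return changes


def watch(interval_seconds=30, polls=10):
    """Poll the leader endpoint and return the number of leader changes seen."""
    observations = []
    for _ in range(polls):
        try:
            observations.append(fetch_leader())
        except (OSError, KeyError, ValueError):
            # Network error, HTTP error, or an unexpected response body.
            observations.append(None)
        time.sleep(interval_seconds)
    return count_leader_changes(observations)
```

On a healthy cluster, the count should stay at 0 between maintenance windows.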



To troubleshoot performance issues, follow the steps below:

  • Shard your Marathon.
  • Monitor - enable metrics, but remember to tune them.
  • Upgrade to version 1.3.10 or newer.
  • Minimize ZooKeeper communication latency and object size.
  • Tune the JVM - give it more heap and more CPUs :).
  • Don't use the event bus - if you really need to, use filtered SSE and accept that it is asynchronous and that events are delivered at most once.
  • If you need task lifecycle events, use a custom executor.
  • Prefer one batch deployment to many individual ones.
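As an illustration of the last point, several app changes can be sent as one request instead of one request per app. The sketch below assumes Marathon's PUT /v2/apps endpoint, which takes a JSON array of app definitions. The address, the app IDs and the trimmed-down definitions are hypothetical, so check the API documentation for your Marathon version before relying on it.

```python
import json
import urllib.request

# Hypothetical address; replace with one of your Marathon nodes.
MARATHON_URL = "http://marathon.example.com:8080"


def build_batch_update(apps, base_url=MARATHON_URL):
    """Build a single PUT /v2/apps request carrying all app definitions."""
    body = json.dumps(apps).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/v2/apps",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


# Hypothetical, heavily trimmed app definitions.
apps = [
    {"id": "/web", "instances": 4},
    {"id": "/worker", "instances": 2},
]

request = build_batch_update(apps)
# Sending is left out so the sketch has no side effects:
# urllib.request.urlopen(request)
```

This submits the changes in one request rather than one per app, which means less work for Marathon and fewer writes to ZooKeeper.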