Timeout configurations in Curator
I create a Curator client as follows:
RetryPolicy retryPolicy = new RetryNTimes(3, 1000);
CuratorFramework client = CuratorFrameworkFactory.newClient(zkConnectString,
    15000, // sessionTimeoutMs
    15000, // connectionTimeoutMs
    retryPolicy);
When I run my client program, I simulate a network partition by bringing down the network adapter that Curator uses to communicate with ZooKeeper. I have a few questions based on the behavior I see:
- I see the message
ConnectionStateManager - State change: SUSPENDED
after 10 seconds. Is the time it takes Curator to enter the SUSPENDED state configurable, derived as a percentage of the other timeouts, or always 10 seconds?
- I do not get any notification once the configured 15-second session timeout has elapsed since the last successful heartbeat. I do see the message
ZooKeeper - Session: 0x14adf3f01ef0001 closed
in the log, but this does not seem to surface as an event that I can capture or listen for. Am I missing something here?
- I eventually got the message
ConnectionStateManager - State change: LOST
almost two minutes after the connection was lost. Why so long?
- If my goal is to use InterProcessMutex to prevent split-brain in an HA scenario, it seems the safest approach is for the lock owner to assume he has lost the lock as soon as SUSPENDED is posted, because it is entirely possible that ZooKeeper has granted the lock to someone else on the other side of the network partition without his knowledge. Is this a typical / sane approach?
Right. Assume that leadership is lost on both SUSPENDED and LOST. This is what the Apache Curator recipes do. You might want to consider using one of the Apache Curator recipes instead of implementing your own algorithm. https://curator.apache.org/curator-recipes/index.html
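As a rough illustration of that advice, here is a minimal, dependency-free sketch. LockOwnershipGuard and its methods are hypothetical names, not part of Curator; the enum only mirrors the names of Curator's ConnectionState values. In a real client, onStateChanged would presumably be called from a ConnectionStateListener registered on the CuratorFramework instance.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/**
 * Hypothetical helper (not a Curator class): tracks whether this process
 * may still behave as the lock owner.
 */
class LockOwnershipGuard {
    /** Mirrors the names of Curator's ConnectionState values. */
    enum State { CONNECTED, SUSPENDED, RECONNECTED, LOST, READ_ONLY }

    private final AtomicBoolean owner = new AtomicBoolean(false);

    /** Call once the lock has actually been acquired. */
    void onLockAcquired() {
        owner.set(true);
    }

    /** Call when the lock is released voluntarily. */
    void onLockReleased() {
        owner.set(false);
    }

    /** Call from the connection state callback. */
    void onStateChanged(State newState) {
        // Treat both SUSPENDED and LOST as loss of the lock.
        if (newState == State.SUSPENDED || newState == State.LOST) {
            owner.set(false);
        }
        // RECONNECTED deliberately does not restore ownership:
        // the lock has to be acquired again before acting as owner.
    }

    /** Check this before every action that requires holding the lock. */
    boolean mayActAsOwner() {
        return owner.get();
    }

    public static void main(String[] args) {
        LockOwnershipGuard guard = new LockOwnershipGuard();
        guard.onLockAcquired();
        System.out.println("after acquire:   " + guard.mayActAsOwner());
        guard.onStateChanged(State.SUSPENDED);
        System.out.println("after SUSPENDED: " + guard.mayActAsOwner());
        guard.onStateChanged(State.RECONNECTED);
        System.out.println("after RECONNECTED: " + guard.mayActAsOwner());
    }
}
```

The design choice here is to be pessimistic: ownership is dropped on the first sign of trouble and is only regained by an explicit re-acquire, never by a reconnect alone.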
It depends on which version of Curator you are using (note: I am the main author of Curator)...
In Curator 2.x, the LOST state means that the retry policy has been exhausted. It does not mean that the session has been lost. In ZooKeeper, a session is only determined to be lost after the connection to the ensemble has been re-established. So you get SUSPENDED when Curator sees the first Disconnected message. Then, when an operation fails because the retry policy has been exhausted, you get LOST.
In Curator 3.x, the meaning of LOST has changed. In 3.x, on receiving Disconnected, Curator starts an internal timer. When the timer passes the negotiated session timeout, Curator calls getTestable().injectSessionExpiration() and posts the LOST state change.
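The 3.x timer behavior described above can be modeled roughly as follows. This is only a sketch of the idea, not Curator's actual implementation: SessionLossTimer and its methods are hypothetical names, the clock is passed in by the caller, and treating "elapsed time equal to the timeout" as expired is an assumption.

```java
/**
 * Hypothetical model (not Curator code): decides when LOST should be
 * posted after a Disconnected event, based on the negotiated session timeout.
 */
class SessionLossTimer {
    private final long negotiatedSessionTimeoutMs;
    private long disconnectedAtMs = -1; // -1 means "currently connected"
    private boolean lostPosted = false;

    SessionLossTimer(long negotiatedSessionTimeoutMs) {
        this.negotiatedSessionTimeoutMs = negotiatedSessionTimeoutMs;
    }

    /** Start the timer on the first Disconnected event. */
    void onDisconnected(long nowMs) {
        if (disconnectedAtMs < 0) {
            disconnectedAtMs = nowMs;
        }
    }

    /** Reset the timer when the connection comes back. */
    void onConnected() {
        disconnectedAtMs = -1;
        lostPosted = false;
    }

    /**
     * Returns true once per disconnection, at the first check made after
     * the session timeout has elapsed (assumption: >= counts as elapsed).
     */
    boolean shouldPostLost(long nowMs) {
        if (disconnectedAtMs < 0 || lostPosted) {
            return false;
        }
        if (nowMs - disconnectedAtMs >= negotiatedSessionTimeoutMs) {
            lostPosted = true;
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        SessionLossTimer timer = new SessionLossTimer(15000);
        timer.onDisconnected(0);
        System.out.println("at 10 s: " + timer.shouldPostLost(10000));
        System.out.println("at 15 s: " + timer.shouldPostLost(15000));
    }
}
```

With the 15000 ms session timeout from the question, this model would post LOST about 15 seconds after the Disconnected event, rather than waiting for the retry policy to run out as in 2.x.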