NameNode HA failover time

NameNode HA (NFS, QJM) is available in Hadoop 2.x (HDFS-1623). It provides fast failover for the NameNode, but I cannot find any description of how long recovery from a failure takes. Can anyone tell me?


Thanks for your reply. Basically, I want to know the time it takes to switch between the two nodes (the active NameNode and the standby NameNode). How long does it take?

+3




4 answers


Below are some representative examples of failover times with a standby NameNode:

A 60-node cluster with 6 million blocks using 300 TB of raw storage and 100K files: 30 seconds. Hence, the total failover time ranges from 1 to 3 minutes.

A 200-node cluster with 20 million blocks occupying 1 PB of raw storage and 1 million files: 110 seconds. Hence, the total failover time ranges from 2.5 to 4.5 minutes.

For lightly to moderately loaded clusters, a cold failover adds only 30-120 seconds.



From: http://hortonworks.com/blog/ha-namenode-for-hdfs-with-hadoop-1-0-part-1/

+2




From Hadoop: The Definitive Guide, which I find easy to understand and fairly straightforward:
Failover and fencing

The transition from the active namenode to the standby is managed by a new entity in the system called the failover controller. Failover controllers are pluggable, but the first implementation uses ZooKeeper to ensure that only one namenode is active. Each namenode runs a lightweight failover controller process whose job is to monitor its namenode for failures (using a simple heartbeating mechanism) and trigger a failover should its namenode fail.
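
For reference, the ZooKeeper-based failover controller (ZKFC) is switched on through two configuration properties. A minimal sketch, assuming a ZooKeeper ensemble on hosts zk1-zk3 (hypothetical names); in a real deployment these settings live in hdfs-site.xml and core-site.xml rather than in code:

    import org.apache.hadoop.conf.Configuration;

    public class AutoFailoverConfig {
        public static Configuration build() {
            Configuration conf = new Configuration();
            // Enable automatic failover via the ZooKeeper failover controller
            // (normally set in hdfs-site.xml; shown in code for illustration).
            conf.setBoolean("dfs.ha.automatic-failover.enabled", true);
            // ZooKeeper quorum used by the failover controllers for leader
            // election (normally in core-site.xml; hostnames are hypothetical).
            conf.set("ha.zookeeper.quorum", "zk1:2181,zk2:2181,zk3:2181");
            return conf;
        }
    }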

Failover may also be initiated manually by an administrator, for example during routine maintenance. This is known as a graceful failover, since the failover controller arranges an orderly transition in which both namenodes switch roles.

In the case of an ungraceful failover, however, it is impossible to be sure that the failed namenode has actually stopped running. For example, a slow network or a network partition can trigger a failover even though the previously active namenode is still running and believes it is still the active namenode. The HA implementation goes to great lengths to ensure that the previously active namenode is prevented from doing any damage and causing corruption, a technique known as fencing. The system employs a range of fencing mechanisms, including killing the namenode's process, revoking its access to the shared storage directory (typically with a vendor-specific NFS command), and disabling its network port via a remote management command. As a last resort, the previously active namenode can be fenced with a technique rather graphically known as STONITH, or "shoot the other node in the head", which uses a specialized power distribution unit to forcibly power down the host machine.
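
As an illustration of how those mechanisms are wired up, fencing methods are configured as an ordered list that is tried until one succeeds. A minimal sketch, assuming a hypothetical SSH key path and fencing script; normally this goes in hdfs-site.xml:

    import org.apache.hadoop.conf.Configuration;

    public class FencingConfig {
        public static Configuration build() {
            Configuration conf = new Configuration();
            // Methods are tried in order, one per line: sshfence logs into
            // the old active host and kills the namenode process; the shell
            // method runs a custom script as a fallback. Both the script
            // path and the key file below are hypothetical examples.
            conf.set("dfs.ha.fencing.methods",
                     "sshfence\nshell(/path/to/my-fence-script.sh)");
            conf.set("dfs.ha.fencing.ssh.private-key-files",
                     "/home/hdfs/.ssh/id_rsa");
            return conf;
        }
    }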

Client failover is handled transparently by the client library. The simplest implementation uses client-side configuration to control failover. The HDFS URI uses a logical hostname that is mapped to a pair of namenode addresses (in the configuration file), and the client library tries each namenode address in turn until the operation succeeds.
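
To make that concrete, here is a minimal client-side sketch, assuming a logical nameservice named mycluster and NameNode hosts nn1-host and nn2-host (all hypothetical names). The client only ever sees the logical URI hdfs://mycluster; the proxy provider tries each physical address until a call succeeds:

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HaClientExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Logical name for the HA pair; clients never hard-code a host.
            conf.set("dfs.nameservices", "mycluster");
            conf.set("dfs.ha.namenodes.mycluster", "nn1,nn2");
            // Physical RPC addresses behind the logical name (hypothetical hosts).
            conf.set("dfs.namenode.rpc-address.mycluster.nn1", "nn1-host:8020");
            conf.set("dfs.namenode.rpc-address.mycluster.nn2", "nn2-host:8020");
            // Proxy provider that retries against each namenode until one answers.
            conf.set("dfs.client.failover.proxy.provider.mycluster",
                     "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");
            // Resolve the logical URI; failover is transparent from here on.
            FileSystem fs = FileSystem.get(URI.create("hdfs://mycluster"), conf);
            System.out.println(fs.exists(new Path("/")));
        }
    }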



Hope it helps!

+1




  • Fast failover does not mean recovering the failed NameNode; it means switching over to another NameNode.
  • In NameNode HA, the cluster is configured with multiple NameNodes.
  • If the active NameNode fails, another NameNode becomes active.
  • When the failed NameNode is restarted, it comes back up in standby mode.
0




  • When HA is enabled, multiple NameNodes are started in the cluster, but only one NameNode at a time writes to the shared edit log (see the configuration sketch after this list). So one NameNode is in the active state and the other is in standby.

  • If the active NameNode goes down, the standby NameNode becomes active. This is called failover.
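
To make the single-writer point concrete: with the Quorum Journal Manager, both NameNodes point at the same set of JournalNodes, and the journal accepts edits only from the current active NameNode. A minimal sketch with hypothetical JournalNode hosts jn1-jn3; in practice this lives in hdfs-site.xml:

    import org.apache.hadoop.conf.Configuration;

    public class QjmConfig {
        public static Configuration build() {
            Configuration conf = new Configuration();
            // Shared edit log on a quorum of JournalNodes (QJM). The
            // JournalNodes enforce a single writer, so only the active
            // namenode can append edits. Hostnames are hypothetical.
            conf.set("dfs.namenode.shared.edits.dir",
                     "qjournal://jn1:8485;jn2:8485;jn3:8485/mycluster");
            return conf;
        }
    }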

0








