Posted to user@hadoop.apache.org by "Paul K. Harter, Jr." <Pa...@oracle.com> on 2014/04/22 23:49:13 UTC

Network Partitions and Failover Times

I am trying to understand the mechanisms and timing involved when Hadoop is
faced with a network partition.  Suppose we have a large Hadoop cluster
configured with automatic failover (a minimal configuration sketch follows
the list):

1) Active NameNode
2) Standby NameNode
3) Quorum JournalNodes (which we'll ignore for now)
4) ZooKeeper ensemble with 3 nodes
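
To pin down what "automatic failover" means here, this is roughly the
setup I have in mind.  The property names are the standard HDFS HA keys;
the nameservice, NameNode IDs, hostnames, and timeout value are made up
for illustration:

    import org.apache.hadoop.conf.Configuration;

    public class HaConfigSketch {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // Hypothetical nameservice and NameNode IDs, for illustration.
            conf.set("dfs.nameservices", "mycluster");
            conf.set("dfs.ha.namenodes.mycluster", "nn1,nn2");
            // Let the ZKFCs drive failover instead of a human operator.
            conf.setBoolean("dfs.ha.automatic-failover.enabled", true);
            // The 3-node ZooKeeper ensemble the ZKFCs coordinate through.
            conf.set("ha.zookeeper.quorum", "zk1:2181,zk2:2181,zk3:2181");
            // ZK session timeout requested by the ZKFC, in milliseconds.
            conf.setInt("ha.zookeeper.session-timeout.ms", 20000);
        }
    }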

 

Suppose the ZooKeeper session from the Active NameNode happens to be held
directly with the ZK leader node, and that the system experiences a network
failure resulting in 2 partitions (A and B), with the nodes distributed as
follows:

A) ZooKeeper leader node; Active NameNode

B) 2 ZooKeeper followers; Standby NameNode
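
For completeness, the majority arithmetic I am assuming (a throwaway
sketch):

    public class QuorumMath {
        // Strict majority needed for an ensemble of n voting servers.
        static int quorumSize(int n) {
            return n / 2 + 1;   // integer division: 3 -> 2, 5 -> 3
        }

        public static void main(String[] args) {
            // Partition A holds 1 of 3 ZK servers: 1 < 2, no quorum.
            // Partition B holds 2 of 3 ZK servers: 2 >= 2, keeps quorum.
            System.out.println("majority of 3 = " + quorumSize(3));
        }
    }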

 

QUESTIONS:

It seems the result should be that both ZooKeeper and the NameNode fail over
to partition B, but I wanted to confirm the sequence of actions as outlined
below.  Does this look right?

 

If the network failure occurs at time zero, how long should this whole
sequence take if, for example, syncLimit is 5 ticks and the NameNode
sessionTimeout is 10 ticks?
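
My back-of-envelope model, assuming the common tickTime of 2000 ms and a
guessed election time (both assumptions, not measured numbers):

    public class FailoverTimingSketch {
        public static void main(String[] args) {
            long tickTimeMs = 2000;            // assumed zoo.cfg default
            long detectMs = 5 * tickTimeMs;    // syncLimit: ~10 s for the
                                               // ensemble to notice the split
            long sessionMs = 10 * tickTimeMs;  // 20 s NameNode session timeout
                                               // (within ZK's 2..20-tick
                                               // negotiation bounds)
            long electionMs = 2000;            // pure guess; usually seconds

            // Very rough upper bound: detect the split, elect a new leader,
            // then wait for the old Active's session to expire so its
            // ephemeral lock znode is deleted and the Standby's watch fires.
            long totalMs = detectMs + electionMs + sessionMs;
            System.out.println("~" + totalMs / 1000 + " s end to end");  // ~32 s
        }
    }

Note the detection window and the session-expiry clock partly overlap, so
this is an upper bound rather than an exact figure.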

 

 

FAILOVER SEQUENCE (as I understand it):

 

    1) The leader, which ends up in the minority, loses its connection to
       the remaining servers.

 

    2) After syncLimit, the ZK ensemble realizes there is a problem.
       Normally, if a follower loses its connection, it is dropped by the
       leader and no longer participates in voting.

       In this case, however, the leader no longer has a quorum, so it has
       to relinquish leadership.  It stops responding to client requests,
       enters the LOOKING state, and starts trying to form/join a quorum.
       (It informs the ZK client library, and) all clients are notified
       with a DISCONNECTED event.  (Or is it that the DISCONNECTED event is
       delivered to the client library, which then delivers connection-loss
       exceptions to clients?)

       The remaining nodes on the majority side enter leader election and
       choose a new leader (which starts a new epoch) on the majority side.
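
As far as I can tell from the client API, it is both: session-state
changes arrive on the default watcher as events of type None, while calls
that were already in flight fail with ConnectionLossException.  A sketch
of how I read it:

    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher;

    public class SessionStateWatcher implements Watcher {
        @Override
        public void process(WatchedEvent event) {
            if (event.getType() != Event.EventType.None) {
                return;  // znode watches, not session-state changes
            }
            switch (event.getState()) {
                case Disconnected:
                    // Link to our server lost; the session may still be
                    // alive on the quorum side.  In-flight requests fail
                    // with KeeperException.ConnectionLossException.
                    break;
                case SyncConnected:
                    // Reconnected (possibly to another server) within the
                    // session timeout; the session survived.
                    break;
                case Expired:
                    // The quorum declared the session dead; all of its
                    // ephemeral znodes have been deleted.
                    break;
                default:
                    break;
            }
        }
    }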

 

    3) All clients that were connected to the (now former) leader are told
       to reconnect, and will either fail, if they cannot reach a node on
       the new majority side, or succeed in connecting to a node in the new
       quorum.
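
My understanding is that this reconnection is done by the client library
itself, using the full connect string the handle was created with
(hostnames here are made up):

    import org.apache.zookeeper.ZooKeeper;

    public class ReconnectSketch {
        public static void main(String[] args) throws Exception {
            // On DISCONNECTED the library itself retries the other servers
            // in this list, so a client stranded with the old leader in
            // partition A recovers only if it can still reach zk2 or zk3.
            ZooKeeper zk = new ZooKeeper(
                "zk1:2181,zk2:2181,zk3:2181",   // the whole ensemble
                20000,                          // session timeout, ms
                event -> System.out.println("state: " + event.getState()));
            zk.close();
        }
    }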

 

    4) Meanwhile, when the Active NameNode is informed that its server has
       become disconnected (DISCONNECTED event), it must stop acting as the
       Active NameNode.

       When the ZK quorum re-forms and does not receive heartbeats from the
       (formerly) Active NameNode, it will eventually (after sessionTimeout)
       declare its session dead.  This deletes the ephemeral znode holding
       its lock on its status as "Active" and triggers the watcher for the
       Standby NameNode.

       The Standby then competes in the Active NameNode election, and
       should win and become the new Active.
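
To check my mental model of that last step, here is the ephemeral-lock
pattern as I understand it, written against the plain ZooKeeper API.  The
lock path and znode data are made up, and the real ZKFC logic lives in its
own elector class and also does fencing, which I am ignoring here:

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;
    import org.apache.zookeeper.data.Stat;

    public class ActiveLockSketch {
        // Hypothetical lock path; the real one lives under /hadoop-ha.
        static final String LOCK = "/hadoop-ha/mycluster/lock";

        // Returns true if we grabbed the lock and may act as Active.
        static boolean tryBecomeActive(ZooKeeper zk) throws Exception {
            try {
                // EPHEMERAL: ZK deletes this znode when our session dies,
                // which is exactly what releases the lock in step 4.
                zk.create(LOCK, "nn2".getBytes(), ZooDefs.Ids.OPEN_ACL_UNSAFE,
                          CreateMode.EPHEMERAL);
                return true;
            } catch (KeeperException.NodeExistsException e) {
                // Someone else is Active: watch the lock znode so that when
                // the holder's session expires and ZK deletes it, we retry.
                Stat stat = zk.exists(LOCK, event -> {
                    if (event.getType() == Watcher.Event.EventType.NodeDeleted) {
                        try { tryBecomeActive(zk); } catch (Exception ignored) { }
                    }
                });
                // Lock vanished between create() and exists(): retry now.
                return stat == null && tryBecomeActive(zk);
            }
        }
    }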