Posted to dev@zookeeper.apache.org by "Flavio Junqueira (JIRA)" <ji...@apache.org> on 2010/11/10 22:03:16 UTC

[jira] Commented: (ZOOKEEPER-928) Follower should stop following and start FLE if it does not receive pings from the leader

    [ https://issues.apache.org/jira/browse/ZOOKEEPER-928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12930774#action_12930774 ] 

Flavio Junqueira commented on ZOOKEEPER-928:
--------------------------------------------

I've just seen the messages on zookeeper-dev, and I'm not sure this is right:

# readPacket is implemented in Learner.java, and the socket read is performed in this line: leaderIs.readRecord(pp, "packet");
# leaderIs is an InputArchive instance instantiated in Learner:connectToLeader;
# The socket used to instantiate leaderIs has its SO_TIMEOUT value set right before in connectToLeader: sock.setSoTimeout(self.tickTime * self.initLimit).

Consequently, the read should not block indefinitely: it should time out after self.tickTime * self.initLimit milliseconds. This discussion on SO_TIMEOUT sounds familiar, huh? ;-)
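
To illustrate the SO_TIMEOUT behaviour described above, here is a minimal, self-contained sketch. It is not ZooKeeper code: the class SoTimeoutDemo, the method readTimesOut, and the tickTime/initLimit values in main are made up for the example. It connects to a local peer that stays connected but never sends anything, and shows that a blocking read on a socket with SO_TIMEOUT set is interrupted by a SocketTimeoutException instead of hanging.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

// Hypothetical demo class, not part of ZooKeeper.
class SoTimeoutDemo {
    /**
     * Connects to a local peer that accepts the connection but never writes,
     * then performs a blocking read with SO_TIMEOUT set to timeoutMs.
     * Returns true if the read was cut short by a SocketTimeoutException.
     */
    static boolean readTimesOut(int timeoutMs) {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        try (ServerSocket silentPeer = new ServerSocket(0, 1, loopback);
             Socket sock = new Socket(loopback, silentPeer.getLocalPort());
             Socket accepted = silentPeer.accept()) {
            // Same call as in the comment above, with a made-up value.
            sock.setSoTimeout(timeoutMs);
            InputStream in = sock.getInputStream();
            try {
                in.read(); // the peer is connected but silent, so this blocks
                return false;
            } catch (SocketTimeoutException e) {
                return true; // the read gave up after roughly timeoutMs
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        int tickTime = 50;  // illustrative values only
        int initLimit = 4;
        System.out.println("timed out: " + readTimesOut(tickTime * initLimit));
    }
}
```

Note that the socket stays usable after the exception; it is up to the caller to decide whether a timeout means the peer should be abandoned.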

> Follower should stop following and start FLE if it does not receive pings from the leader
> -----------------------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-928
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-928
>             Project: Zookeeper
>          Issue Type: Bug
>          Components: quorum, server
>    Affects Versions: 3.3.2
>            Reporter: Vishal K
>            Priority: Critical
>             Fix For: 3.3.3, 3.4.0
>
>
> In Follower.followLeader() after syncing with the leader, the follower does:
>                 while (self.isRunning()) {
>                     readPacket(qp);
>                     processPacket(qp);
>                 }
> It looks like it relies on socket timeout expiry to figure out if the connection with the leader has gone down.  So a follower *with no clients* may never notice a faulty leader if the leader has a software hang but the TCP connections with the peers are still valid. Since it has no clients, it won't heartbeat with the leader. If a majority of followers are not connected to any clients, then FLE will fail even if other followers attempt to elect a new leader.
> We should keep track of pings received from the leader and, if we haven't seen
> a ping packet from the leader for (syncLimit * tickTime) milliseconds, give up
> following the leader.
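
The ping-tracking idea proposed in the description could be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual ZooKeeper fix: the class LeaderPingTracker and its methods are hypothetical, and timestamps are passed in explicitly so the logic is easy to follow. It records when the last ping arrived and reports the leader as expired once more than syncLimit * tickTime milliseconds have passed without one.

```java
// Hypothetical helper, not part of ZooKeeper.
class LeaderPingTracker {
    private final long timeoutMs;
    private long lastPingMs;

    LeaderPingTracker(int tickTime, int syncLimit, long nowMs) {
        // Widen before multiplying to avoid int overflow on large settings.
        this.timeoutMs = (long) tickTime * syncLimit;
        this.lastPingMs = nowMs;
    }

    /** Call whenever a ping packet from the leader is processed. */
    void onPing(long nowMs) {
        lastPingMs = nowMs;
    }

    /** True once more than syncLimit * tickTime ms have passed without a ping. */
    boolean leaderExpired(long nowMs) {
        return nowMs - lastPingMs > timeoutMs;
    }

    public static void main(String[] args) {
        // Illustrative values: tickTime = 2000 ms, syncLimit = 5, so 10 s.
        LeaderPingTracker tracker = new LeaderPingTracker(2000, 5, 0L);
        System.out.println(tracker.leaderExpired(10_000L)); // false
        System.out.println(tracker.leaderExpired(10_001L)); // true
    }
}
```

A check like this only helps if it runs somewhere that is not itself stuck in a blocking read, for example after a read returns or times out.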
