Posted to dev@zookeeper.apache.org by "Hadoop QA (JIRA)" <ji...@apache.org> on 2014/07/08 22:34:05 UTC

[jira] [Commented] (ZOOKEEPER-1865) Fix retry logic in Learner.connectToLeader()

    [ https://issues.apache.org/jira/browse/ZOOKEEPER-1865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14055477#comment-14055477 ] 

Hadoop QA commented on ZOOKEEPER-1865:
--------------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12654667/ZOOKEEPER-1865.patch
  against trunk revision 1608872.

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 3 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    -1 javac.  The patch appears to cause tar ant target to fail.

    +1 findbugs.  The patch does not introduce any new Findbugs (version 1.3.9) warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    -1 core tests.  The patch failed core unit tests.

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2181//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2181//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2181//console

This message is automatically generated.

> Fix retry logic in Learner.connectToLeader() 
> ---------------------------------------------
>
>                 Key: ZOOKEEPER-1865
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1865
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: server
>            Reporter: Thawan Kooburat
>            Assignee: Edward Carter
>             Fix For: 3.5.0, 3.5.1
>
>         Attachments: ZOOKEEPER-1865.patch
>
>
> We discovered a long leader election time today in one of our prod ensembles.
> Here is the description of the event.
> Before the old leader went down, it was able to send its notification message, so 3 out of 5 servers (including the old leader) elected the old leader as the leader for the next epoch. While the old leader was being rebooted, the 2 other machines were trying to connect to it, so the quorum couldn't form until those 2 machines gave up and moved on to the next round of leader election.
> This is because Learner.connectToLeader() uses simple retry logic. The contract for this method is that it should never spend longer than initLimit trying to connect to the leader. In our outage, each sock.connect() call probably blocked for initLimit, and it was called 5 times.
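
For illustration only, here is a minimal sketch of retry logic that keeps the total time inside one budget instead of giving every attempt the full timeout. It is not the attached patch and not the actual Learner code. The class BoundedConnect, its method names, and the parameters budgetMs, maxAttempts and pauseMs are hypothetical; budgetMs is assumed to be derived by the caller from initLimit (which ZooKeeper's configuration expresses in ticks, so roughly initLimit * tickTime).

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not part of ZooKeeper.
public class BoundedConnect {

    /**
     * Timeout in ms for the next attempt: the time left in the budget,
     * capped at perAttemptCapMs. Returns 0 when the budget is used up.
     * Never returns 0 while time remains, because Socket.connect()
     * treats a timeout of 0 as "wait forever".
     */
    public static int nextTimeoutMs(long remainingNanos, int perAttemptCapMs) {
        if (remainingNanos <= 0) {
            return 0;
        }
        long remainingMs = TimeUnit.NANOSECONDS.toMillis(remainingNanos);
        return (int) Math.max(1, Math.min(remainingMs, perAttemptCapMs));
    }

    /**
     * Tries to connect up to maxAttempts times, but never spends more than
     * roughly budgetMs in total across all attempts and pauses.
     */
    public static Socket connectWithinBudget(InetSocketAddress addr, int budgetMs,
                                             int maxAttempts, long pauseMs)
            throws IOException, InterruptedException {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(budgetMs);
        IOException last = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            // Each attempt only gets what is left of the shared budget.
            int timeoutMs = nextTimeoutMs(deadline - System.nanoTime(), budgetMs);
            if (timeoutMs == 0) {
                break; // budget exhausted
            }
            Socket sock = new Socket();
            try {
                sock.connect(addr, timeoutMs);
                return sock;
            } catch (IOException e) {
                last = e;
                sock.close();
            }
            long leftMs = TimeUnit.NANOSECONDS.toMillis(deadline - System.nanoTime());
            if (leftMs <= 0) {
                break;
            }
            // Pause before retrying, without sleeping past the deadline.
            Thread.sleep(Math.min(pauseMs, leftMs));
        }
        throw new IOException("could not connect within " + budgetMs + " ms", last);
    }
}
```

With this shape, a peer that blocks every connect attempt costs about budgetMs in total rather than budgetMs multiplied by the number of attempts, which is the behavior the reported outage points at.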



--
This message was sent by Atlassian JIRA
(v6.2#6252)