Posted to dev@zookeeper.apache.org by "Ian Babrou (JIRA)" <ji...@apache.org> on 2012/07/24 08:02:33 UTC

[jira] [Updated] (ZOOKEEPER-1515) Long reconnect timeout if leader failed.

     [ https://issues.apache.org/jira/browse/ZOOKEEPER-1515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ian Babrou updated ZOOKEEPER-1515:
----------------------------------

    Description: 
In ZooKeeper 3.3.5, src/java/main/org/apache/zookeeper/server/quorum/Learner.java:325 contains Thread.sleep(1000);

This sleep is always hit after a leader failure or restart. ZooKeeper elects a new leader and all followers try to connect to it, but the first attempt always fails with "Connection refused":

{quote}
2012-07-23 18:55:48,159 - WARN  [QuorumPeer:/0.0.0.0:2181:Learner@229] - Unexpected exception, tries=0, connecting to web329.local/192.168.1.74:2888
java.net.ConnectException: Connection refused
	at java.net.PlainSocketImpl.socketConnect(Native Method)
	at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:351)
	at java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:213)
	at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:200)
	at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:366)
	at java.net.Socket.connect(Socket.java:529)
	at org.apache.zookeeper.server.quorum.Learner.connectToLeader(Learner.java:221)
	at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:65)
	at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:645)
{quote}

I propose replacing this line with the following code:

{code:title=Learner.java|borderStyle=solid}
if (tries > 0) {
    Thread.sleep(self.tickTime);
}
{code}

This way the first reconnect attempt is made immediately, and later attempts wait for one tick time (I suppose this is a good semantic change).

The result of this change: leader re-election time dropped from >1500ms to 300-400ms with a 50ms tick time. This is pretty important for our production environment and will not break any existing installations.
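
The retry behaviour proposed above can be sketched in isolation. This is only an illustration, not the actual Learner.java code: the class ReconnectSketch, the method connectWithRetries, and the BooleanSupplier standing in for the socket connect are all hypothetical names, and it assumes the sleep sits at the end of a bounded retry loop. It records the pause taken after each failed attempt so the "no wait after the first failure" rule is visible.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BooleanSupplier;

// Hypothetical sketch; not the real org.apache.zookeeper.server.quorum.Learner.
final class ReconnectSketch {

    /**
     * Calls tryConnect up to maxTries times and returns the pause (in ms)
     * taken after each failed attempt that preceded the successful one.
     */
    static List<Long> connectWithRetries(BooleanSupplier tryConnect, int maxTries, long tickTimeMs) {
        List<Long> pauses = new ArrayList<>();
        for (int tries = 0; tries < maxTries; tries++) {
            if (tryConnect.getAsBoolean()) {
                return pauses; // connected
            }
            // Proposed rule: retry at once after the first failure,
            // wait one tick after every later failure.
            long pause = (tries > 0) ? tickTimeMs : 0L;
            if (pause > 0) {
                try {
                    Thread.sleep(pause);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("interrupted while waiting to retry", e);
                }
            }
            pauses.add(pause);
        }
        throw new IllegalStateException("not connected after " + maxTries + " tries");
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Hypothetical leader that refuses the first two attempts.
        List<Long> pauses = connectWithRetries(() -> ++calls[0] > 2, 5, 50L);
        System.out.println("attempts=" + calls[0] + " pauses=" + pauses);
        // prints: attempts=3 pauses=[0, 50]
    }
}
```

With a leader that starts accepting connections shortly after the election, the immediate second attempt is what removes the fixed one-second delay; later failures still back off by one tick.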

  was:
In ZooKeeper 3.3.5, src/java/main/org/apache/zookeeper/server/quorum/Learner.java:325 contains Thread.sleep(1000);

This sleep is always hit after a leader failure or restart. ZooKeeper elects a new leader and all followers try to connect to it, but the first attempt always fails with "Connection refused":

{quote}
2012-07-23 18:55:48,159 - WARN  [QuorumPeer:/0.0.0.0:2181:Learner@229] - Unexpected exception, tries=0, connecting to web329.local/192.168.1.74:2888
java.net.ConnectException: Connection refused
	at java.net.PlainSocketImpl.socketConnect(Native Method)
	at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:351)
	at java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:213)
	at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:200)
	at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:366)
	at java.net.Socket.connect(Socket.java:529)
	at org.apache.zookeeper.server.quorum.Learner.connectToLeader(Learner.java:221)
	at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:65)
	at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:645)
{quote}

I propose replacing this line with the following code:

{quote}
if (tries > 0) {
    Thread.sleep(self.tickTime);
}
{quote}

This way the first reconnect attempt is made immediately, and later attempts wait for one tick time (I suppose this is a good semantic change).

The result of this change: leader re-election time dropped from >1500ms to 300-400ms with a 50ms tick time. This is pretty important for our production environment and will not break any existing installations.


Fixed code formatting.
                
> Long reconnect timeout if leader failed.
> ----------------------------------------
>
>                 Key: ZOOKEEPER-1515
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1515
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: leaderElection, quorum, server
>    Affects Versions: 3.3.5
>         Environment: Gentoo linux, but every environment is affected.
>            Reporter: Ian Babrou
>              Labels: patch, performance
>
