You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@zookeeper.apache.org by Krzysztof Rybak <kr...@motorolasolutions.com> on 2016/10/20 14:50:25 UTC

Zookeeper exception: Timeout while waiting for epoch from quorum

Hi All,
first mail in the group so sorry for possible inconsistency in advance.
Zookeeper version is zookeeper-3.4.6.

I'm facing a problem when zookeeper is reconfiguring a cluster.

Initial state:
machine A:
zk1(follower)
zk2(leader)
machine B:
zk3(follower)

zk1 and zk2 are stopped (in that order).
zk1 is started on machine B.
zk1 and zk3 are not creating a cluster, status is (using srvr word)
'This ZooKeeper instance is not currently serving requests'

A part of a log is:
[2016-10-20 04:03:10,053] WARN Unexpected exception
(org.apache.zookeeper.server.quorum.QuorumPeer)
java.lang.InterruptedException: Timeout while waiting for epoch from quorum
at
org.apache.zookeeper.server.quorum.Leader.getEpochToPropose(Leader.java:878)
at org.apache.zookeeper.server.quorum.Leader.lead(Leader.java:377)
at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:799)
[2016-10-20 04:03:10,054] INFO Shutting down
(org.apache.zookeeper.server.quorum.Leader)
[2016-10-20 04:03:10,054] INFO Shutdown called
(org.apache.zookeeper.server.quorum.Leader)
java.lang.Exception: shutdown Leader! reason: Forcing shutdown
at org.apache.zookeeper.server.quorum.Leader.shutdown(Leader.java:499)
at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:805)
[2016-10-20 04:03:10,054] INFO shutting down
(org.apache.zookeeper.server.ZooKeeperServer)

What is interesting: when zk2(previous leader) is started on machine B
(instead of zk1) cluster is configured correctly.
The same situation happens when all happen on the single machine.

Issue is similar to this, but algorithm used by me is 3 (by default and
confirmed with electionAlg=3 in .cfg files)
https://issues.apache.org/jira/browse/ZOOKEEPER-2400

thanks,
Krzysztof

Re: Zookeeper exception: Timeout while waiting for epoch from quorum

Posted by Krzysztof Rybak <kr...@motorolasolutions.com>.
Information I provided wasn't complete as I didn't investigate the issue
well regarding connect() function timeouts.
Fortunately the issue seems to be resolved with changing cnxTimeout value
in configuration file.

Scenario is:

zk1(follower) on lo:1
zk2(leader) on lo:2
zk3(follower) on lo:3

step 1.

stop zk1

ifdown lo:1

step 2.

stop zk2
ifdown lo:2

step 3.

ifup lo:1

start zk1


result:
zk1 and zk3 cannot create a cluster of 2 zk instances.
When lo:2 is not down in step 2. the cluster is created (IP address of
former leader).

Observation:
When IP address of lo:2 is unreachable sock.connect() throws exception
'connect timed out' after a configured period of time. When lo:2 is up,
exception is thrown immediately with information 'connection refused': IP
is reachable, but no listener on the port. Socket timeout is set to 5000 by
default (cnxTO).

Solution:
Decreasing cnxTO by changing cnxTimeout in configuration file. Example
value is 1000, the cluster is created.

Question is: is it correct solution and is such a scenario supported by
zookeeper?

On Wed, Oct 26, 2016 at 2:57 PM, Krzysztof Rybak <krzysztof.rybak@
motorolasolutions.com> wrote:

> I'm not sure about the ports, but the configuration is below. Firewall is
> turned off, this shouldn't be a problem.
>
> Configuration file for zk1:
> dataDir=/opt/zookeeper/zk_server_{1,2,3}
> clientPort=1400{1,2,3}
>
> maxClientCnxns=0
> tickTime=2000
>
> initLimit=5
> syncLimit=2
> server.1=10.54.1.51:12001:13001
> server.2=10.54.1.52:12001:13001
> server.3=10.54.1.53:12001:13001
> autopurge.snapRetainCount=3
> autopurge.purgeInterval=1
> dataLogDir=/opt/zookeeper/zk_log_{1,2,3}
>
> {1,2,3} means different values for instances: 1 for zk1, 2 for zk2 and 3
> for zk3.
> 10.54.1.{51,52,53} are IP addresses on local interfaces, can be pinged
> from any machine(problem is observed on one single machine indeed).
> Increasing initLimit results with no success.
> Also changing election and quorum port results with fail
> (
> ...
> server.1=10.54.1.51:12001:13001
> server.2=10.54.1.52:12002:13002
> server.3=10.54.1.53:12003:13003
> ...
> )
>
> Scenario is:
> >>> zk1(follower)
> >>> zk2(leader)
> >>> zk3(follower)
> >>>
> >>> step 1. stop zk1
> >>> step 2. stop zk2
> >>> step 3. start zk1
> >>>
> >>> result:
> >>> zk1 and zk3 cannot create a cluster of 2 zk instances
>
> I found in a log that follower logs another time for leader election than
> new leader (log is attached):
> step 3. start of zk1:
>
> [2016-10-20 04:03:34,627] INFO Reading configuration from:
> /opt/kafka/config/zookeeper.properties.1 (org.apache.zookeeper.server.q
> uorum.QuorumPeerConfig)
>
> after around 10 seconds(!) follower logs time of leader election
> [2016-10-20 04:03:44,991] INFO FOLLOWING - LEADER ELECTION TOOK - 10255
> (org.apache.zookeeper.server.quorum.Learner)
>
> Then follower tries to connect to leader on quorum server communication
> port(12001):
>
> [2016-10-20 04:03:44,996] WARN Unexpected exception, tries=0, connecting
> to /10.54.1.53:12001 (org.apache.zookeeper.server.quorum.Learner)
> java.net.ConnectException: Connection refused
>
> After 4 tries(0,1,2,3) follower is shutdown:
>
> [2016-10-20 04:03:49,002] INFO shutdown called
> (org.apache.zookeeper.server.quorum.Learner)
> java.lang.Exception: shutdown Follower
>
> Then leader seems to resolve the election:
>
> [2016-10-20 04:03:49,764] INFO LEADING - LEADER ELECTION TOOK - 39709
> (org.apache.zookeeper.server.quorum.Leader)
>
> And starts listening on server communication port(12001).
>
> There are many logs informing that communication to zk2 fails
> (10.54.1.52). This is correct as this instance is not active.
> Before starting zk1 communication to zk1 fails ([2016-10-20
> 04:03:30,070]). This is correct as this instance is not yet active.
> In correct scenario information LEADER ELECTION TOOK for leader is near
> the log for the follower(a part of another log):
>
> [2016-10-20 02:48:47,675] INFO FOLLOWING - LEADER ELECTION TOOK - 15283
> (org.apache.zookeeper.server.quorum.Learner)
> [2016-10-20 02:48:47,690] INFO LEADING - LEADER ELECTION TOOK - 174954
> (org.apache.zookeeper.server.quorum.Leader)
>
> regards,
> Krzysztof
>
> On Tue, Oct 25, 2016 at 8:15 PM, Flavio Junqueira <fp...@apache.org> wrote:
>
>> There are two potential reasons I can think of:
>>
>> 1- The configuration is incorrect and it is pointing to the wrong port.
>> Keep in mind that the election port is different from the quorum server
>> communication port.
>> 2- There is something block traffic on the configured port between those
>> two servers.
>>
>> Could it be any of these?
>>
>> -Flavio
>>
>> > On 25 Oct 2016, at 14:33, Krzysztof Rybak <
>> krzysztof.rybak@motorolasolutions.com> wrote:
>> >
>> > Hi,
>> > first log is a little misleading as epoch exception is observed later
>> in a
>> > log and may be a consequence of lack of connection.
>> > The first exception observed in failing scenario is:
>> >
>> > [2016-10-25 07:48:41,107] WARN Unexpected exception, tries=0,
>> connecting to
>> > /10.54.1.53:12001 (org.apache.zookeeper.server.quorum.Learner)
>> > java.net.ConnectException: Connection refused
>> >        at java.net.PlainSocketImpl.socketConnect(Native Method)
>> >        at
>> > java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSock
>> etImpl.java:350)
>> >        at
>> > java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPl
>> ainSocketImpl.java:206)
>> >        at
>> > java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocket
>> Impl.java:188)
>> >        at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
>> >        at java.net.Socket.connect(Socket.java:589)
>> >        at
>> > org.apache.zookeeper.server.quorum.Learner.connectToLeader(L
>> earner.java:225)
>> >        at
>> > org.apache.zookeeper.server.quorum.Follower.followLeader(Fol
>> lower.java:71)
>> >        at
>> > org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:786)
>> >
>> > I think that instances are able to elect a Leader(zookeeper instance on
>> > address 10.54.1.53, leader port 12001-2888 in zookeeper examples) as
>> such
>> > information is logged before exception:
>> >
>> > [2016-10-25 07:48:41,103] INFO FOLLOWING - LEADER ELECTION TOOK - 10249
>> > (org.apache.zookeeper.server.quorum.Learner)
>> > Then follower tries to connect to leader but this fails.
>> >
>> > - If a leader is being elected but the leader isn't being able to sync
>> and
>> > establish itself, then you need to determine the reason. Is the follower
>> > being able to connect to the leader? Is the initialization taking too
>> long
>> > such that it times out? Is the socket read timing out?
>> >
>> > This seems to be an issue that follower cannot connect to a leader on
>> > leader port (12001 in my case).
>> > Exception is related to connect() method:
>> >
>> > at
>> > org.apache.zookeeper.server.quorum.Learner.connectToLeader(L
>> earner.java:225)
>> >
>> > and the exception is
>> >
>> > java.net.ConnectException: Connection refused
>> >
>> > Follower tries to connect just after the leader stops listening on a
>> leader
>> > port (12001 in my case, 2888 in zookeeper examples).
>> > I just don't know why follower starts trying to connect after around 10
>> > seconds after setting a new leader(during this 10 seconds leader
>> listens on
>> > a leader port). There is no timeout in socket connect as leader has
>> already
>> > stopped listening.
>> > This seems to be a reason, but why this happens ?
>> >
>> > I assume this scenario is supported and server should be set.
>> >
>> >> [2016-10-20 04:04:19,996] WARN Unexpected exception, tries=0,
>> connecting
>> > to
>> >> /10.54.1.53:12001 (org.apache.zookeeper.server.quorum.Learner)
>> >> java.net.ConnectException: Connection refused
>> >> at java.net.PlainSocketImpl.socketConnect(Native Method)
>> >> at
>> >> java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSock
>> etImpl.java:
>> > 350)
>> >> at
>> >> java.net.AbstractPlainSocketImpl.connectToAddress(
>> > AbstractPlainSocketImpl.java:206)
>> >> at
>> >> java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocket
>> Impl.java:188)
>> >> at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
>> >> at java.net.Socket.connect(Socket.java:589)
>> >> at
>> >> org.apache.zookeeper.server.quorum.Learner.connectToLeader(L
>> earner.java:
>> > 225)
>> >> at
>> >> org.apache.zookeeper.server.quorum.Follower.followLeader(Fol
>> lower.java:71)
>> >> at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer
>> .java:786)
>> >
>> >
>> >
>> > On Mon, Oct 24, 2016 at 6:57 PM, Flavio Junqueira <fp...@apache.org>
>> wrote:
>> >
>> >> Hi Krzysztof,
>> >>
>> >> Your first message seems to indicate that a leader is being elected,
>> but
>> >> it is not being able to sync with enough followers (in your case 1
>> >> follower). I'm saying this because of this line:
>> >>
>> >>> at org.apache.zookeeper.server.quorum.Leader.lead(Leader.java:377)
>> >>
>> >> You say that the issue is timing in your second message, but it isn't
>> >> entirely clear to me why you think so. I think you need to determine
>> the
>> >> following:
>> >>
>> >> - Are the two remaining servers being able to elect a leader among
>> them?
>> >> If not, is the problem that they can't connect to each other?
>> >> - If a leader is being elected but the leader isn't being able to sync
>> and
>> >> establish itself, then you need to determine the reason. Is the
>> follower
>> >> being able to connect to the leader? Is the initialization taking too
>> long
>> >> such that it times out? Is the socket read timing out?
>> >>
>> >> -Flavio
>> >>
>> >>
>> >>> On 24 Oct 2016, at 11:37, Krzysztof Rybak <krzysztof.rybak@
>> >> motorolasolutions.com> wrote:
>> >>>
>> >>> Hi,
>> >>> I've logged some more information and the problem seems to be a timing
>> >>> issue.
>> >>> Long story short, the scenario is:
>> >>> zk1(follower)
>> >>> zk2(leader)
>> >>> zk3(follower)
>> >>>
>> >>> step 1. stop zk1
>> >>> step 2. stop zk2
>> >>> step 3. start zk1
>> >>>
>> >>> result:
>> >>> zk1 and zk3 cannot create a cluster of 2 zk instances(is such scenario
>> >>> supported by zookeeper?).
>> >>> When zk2 is started again instead of zk1, cluster is created.
>> >>>
>> >>> Observation:
>> >>> After step 3., zk3 is elected to be a leader and starts listening on
>> port
>> >>> 2888
>> >>> ( example port from example
>> >>> http://zookeeper.apache.org/doc/trunk/zookeeperStarted.html ).
>> Listening
>> >>> takes around 10 seconds but during this time zk1 is not trying to
>> >> connect.
>> >>> zk1 tries to connect just after zk3 stops listening.
>> >>>
>> >>> a part of a log is:
>> >>> zk3 stops listening on 04:04:17(testing with netstat) and zk1 starts
>> >> trying
>> >>> to connect at 04:04:19,996 as below.
>> >>>
>> >>> [2016-10-20 04:04:19,996] WARN Unexpected exception, tries=0,
>> connecting
>> >> to
>> >>> /10.54.1.53:12001 (org.apache.zookeeper.server.quorum.Learner)
>> >>> java.net.ConnectException: Connection refused
>> >>> at java.net.PlainSocketImpl.socketConnect(Native Method)
>> >>> at
>> >>> java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSock
>> etImpl.java:
>> >> 350)
>> >>> at
>> >>> java.net.AbstractPlainSocketImpl.connectToAddress(
>> >> AbstractPlainSocketImpl.java:206)
>> >>> at
>> >>> java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocket
>> Impl.java:
>> >> 188)
>> >>> at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
>> >>> at java.net.Socket.connect(Socket.java:589)
>> >>> at
>> >>> org.apache.zookeeper.server.quorum.Learner.connectToLeader(L
>> earner.java:
>> >> 225)
>> >>> at
>> >>> org.apache.zookeeper.server.quorum.Follower.followLeader(
>> >> Follower.java:71)
>> >>> at org.apache.zookeeper.server.quorum.QuorumPeer.run(
>> >> QuorumPeer.java:786)
>> >>>
>> >>> This is strange as I couldn't reproduce this on local VM VirtualBox
>> but
>> >> on
>> >>> VMware reproduction is 100%.
>> >>>
>> >>> thanks,
>> >>> Krzysztof
>> >>>
>> >>> ---------- Forwarded message ----------
>> >>> From: Krzysztof Rybak <kr...@motorolasolutions.com>
>> >>> Date: Thu, Oct 20, 2016 at 4:50 PM
>> >>> Subject: Zookeeper exception: Timeout while waiting for epoch from
>> quorum
>> >>> To: user@zookeeper.apache.org
>> >>>
>> >>>
>> >>> Hi All,
>> >>> first mail in the group so sorry for possible inconsistency in
>> advance.
>> >>> Zookeeper version is zookeeper-3.4.6.
>> >>>
>> >>> I'm facing a problem when zookeeper is reconfiguring a cluster.
>> >>>
>> >>> Initial state:
>> >>> machine A:
>> >>> zk1(follower)
>> >>> zk2(leader)
>> >>> machine B:
>> >>> zk3(follower)
>> >>>
>> >>> zk1 and zk2 are stopped (in that order).
>> >>> zk1 is started on machine B.
>> >>> zk1 and zk3 are not creating a cluster, status is (using srvr word)
>> >>> 'This ZooKeeper instance is not currently serving requests'
>> >>>
>> >>> A part of a log is:
>> >>> [2016-10-20 04:03:10,053] WARN Unexpected exception
>> >>> (org.apache.zookeeper.server.quorum.QuorumPeer)
>> >>> java.lang.InterruptedException: Timeout while waiting for epoch from
>> >> quorum
>> >>> at org.apache.zookeeper.server.quorum.Leader.
>> >> getEpochToPropose(Leader.java:
>> >>> 878)
>> >>> at org.apache.zookeeper.server.quorum.Leader.lead(Leader.java:377)
>> >>> at org.apache.zookeeper.server.quorum.QuorumPeer.run(
>> >> QuorumPeer.java:799)
>> >>> [2016-10-20 04:03:10,054] INFO Shutting down
>> >> (org.apache.zookeeper.server.
>> >>> quorum.Leader)
>> >>> [2016-10-20 04:03:10,054] INFO Shutdown called
>> >> (org.apache.zookeeper.server.
>> >>> quorum.Leader)
>> >>> java.lang.Exception: shutdown Leader! reason: Forcing shutdown
>> >>> at org.apache.zookeeper.server.quorum.Leader.shutdown(Leader.ja
>> va:499)
>> >>> at org.apache.zookeeper.server.quorum.QuorumPeer.run(
>> >> QuorumPeer.java:805)
>> >>> [2016-10-20 04:03:10,054] INFO shutting down
>> >> (org.apache.zookeeper.server.
>> >>> ZooKeeperServer)
>> >>>
>> >>> What is interesting: when zk2(previous leader) is started on machine B
>> >>> (instead of zk1) cluster is configured correctly.
>> >>> The same situation happens when all happen on the single machine.
>> >>>
>> >>> Issue is similar to this, but algorithm used by me is 3 (by default
>> and
>> >>> confirmed with electionAlg=3 in .cfg files)
>> >>> https://issues.apache.org/jira/browse/ZOOKEEPER-2400
>> >>>
>> >>> thanks,
>> >>> Krzysztof
>> >>
>> >>
>>
>>
>

Re: Zookeeper exception: Timeout while waiting for epoch from quorum

Posted by Krzysztof Rybak <kr...@motorolasolutions.com>.
I'm not sure about the ports, but the configuration is below. Firewall is
turned off, this shouldn't be a problem.

Configuration file for zk1:
dataDir=/opt/zookeeper/zk_server_{1,2,3}
clientPort=1400{1,2,3}

maxClientCnxns=0
tickTime=2000

initLimit=5
syncLimit=2
server.1=10.54.1.51:12001:13001
server.2=10.54.1.52:12001:13001
server.3=10.54.1.53:12001:13001
autopurge.snapRetainCount=3
autopurge.purgeInterval=1
dataLogDir=/opt/zookeeper/zk_log_{1,2,3}

{1,2,3} means different values for instances: 1 for zk1, 2 for zk2 and 3
for zk3.
10.54.1.{51,52,53} are IP addresses on local interfaces, can be pinged from
any machine(problem is observed on one single machine indeed).
Increasing initLimit results with no success.
Also changing election and quorum port results with fail
(
...
server.1=10.54.1.51:12001:13001
server.2=10.54.1.52:12002:13002
server.3=10.54.1.53:12003:13003
...
)

Scenario is:
>>> zk1(follower)
>>> zk2(leader)
>>> zk3(follower)
>>>
>>> step 1. stop zk1
>>> step 2. stop zk2
>>> step 3. start zk1
>>>
>>> result:
>>> zk1 and zk3 cannot create a cluster of 2 zk instances

I found in a log that follower logs another time for leader election than
new leader (log is attached):
step 3. start of zk1:

[2016-10-20 04:03:34,627] INFO Reading configuration from:
/opt/kafka/config/zookeeper.properties.1
(org.apache.zookeeper.server.quorum.QuorumPeerConfig)

after around 10 seconds(!) follower logs time of leader election
[2016-10-20 04:03:44,991] INFO FOLLOWING - LEADER ELECTION TOOK - 10255
(org.apache.zookeeper.server.quorum.Learner)

Then follower tries to connect to leader on quorum server communication
port(12001):

[2016-10-20 04:03:44,996] WARN Unexpected exception, tries=0, connecting to
/10.54.1.53:12001 (org.apache.zookeeper.server.quorum.Learner)
java.net.ConnectException: Connection refused

After 4 tries(0,1,2,3) follower is shutdown:

[2016-10-20 04:03:49,002] INFO shutdown called
(org.apache.zookeeper.server.quorum.Learner)
java.lang.Exception: shutdown Follower

Then leader seems to resolve the election:

[2016-10-20 04:03:49,764] INFO LEADING - LEADER ELECTION TOOK - 39709
(org.apache.zookeeper.server.quorum.Leader)

And starts listening on server communication port(12001).

There are many logs informing that communication to zk2 fails (10.54.1.52).
This is correct as this instance is not active.
Before starting zk1 communication to zk1 fails ([2016-10-20 04:03:30,070]).
This is correct as this instance is not yet active.
In correct scenario information LEADER ELECTION TOOK for leader is near the
log for the follower(a part of another log):

[2016-10-20 02:48:47,675] INFO FOLLOWING - LEADER ELECTION TOOK - 15283
(org.apache.zookeeper.server.quorum.Learner)
[2016-10-20 02:48:47,690] INFO LEADING - LEADER ELECTION TOOK - 174954
(org.apache.zookeeper.server.quorum.Leader)

regards,
Krzysztof

On Tue, Oct 25, 2016 at 8:15 PM, Flavio Junqueira <fp...@apache.org> wrote:

> There are two potential reasons I can think of:
>
> 1- The configuration is incorrect and it is pointing to the wrong port.
> Keep in mind that the election port is different from the quorum server
> communication port.
> 2- There is something block traffic on the configured port between those
> two servers.
>
> Could it be any of these?
>
> -Flavio
>
> > On 25 Oct 2016, at 14:33, Krzysztof Rybak <krzysztof.rybak@
> motorolasolutions.com> wrote:
> >
> > Hi,
> > first log is a little misleading as epoch exception is observed later in
> a
> > log and may be a consequence of lack of connection.
> > The first exception observed in failing scenario is:
> >
> > [2016-10-25 07:48:41,107] WARN Unexpected exception, tries=0, connecting
> to
> > /10.54.1.53:12001 (org.apache.zookeeper.server.quorum.Learner)
> > java.net.ConnectException: Connection refused
> >        at java.net.PlainSocketImpl.socketConnect(Native Method)
> >        at
> > java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:
> 350)
> >        at
> > java.net.AbstractPlainSocketImpl.connectToAddress(
> AbstractPlainSocketImpl.java:206)
> >        at
> > java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:
> 188)
> >        at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
> >        at java.net.Socket.connect(Socket.java:589)
> >        at
> > org.apache.zookeeper.server.quorum.Learner.connectToLeader(Learner.java:
> 225)
> >        at
> > org.apache.zookeeper.server.quorum.Follower.followLeader(
> Follower.java:71)
> >        at
> > org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:786)
> >
> > I think that instances are able to elect a Leader(zookeeper instance on
> > address 10.54.1.53, leader port 12001-2888 in zookeeper examples) as such
> > information is logged before exception:
> >
> > [2016-10-25 07:48:41,103] INFO FOLLOWING - LEADER ELECTION TOOK - 10249
> > (org.apache.zookeeper.server.quorum.Learner)
> > Then follower tries to connect to leader but this fails.
> >
> > - If a leader is being elected but the leader isn't being able to sync
> and
> > establish itself, then you need to determine the reason. Is the follower
> > being able to connect to the leader? Is the initialization taking too
> long
> > such that it times out? Is the socket read timing out?
> >
> > This seems to be an issue that follower cannot connect to a leader on
> > leader port (12001 in my case).
> > Exception is related to connect() method:
> >
> > at
> > org.apache.zookeeper.server.quorum.Learner.connectToLeader(Learner.java:
> 225)
> >
> > and the exception is
> >
> > java.net.ConnectException: Connection refused
> >
> > Follower tries to connect just after the leader stops listening on a
> leader
> > port (12001 in my case, 2888 in zookeeper examples).
> > I just don't know why follower starts trying to connect after around 10
> > seconds after setting a new leader(during this 10 seconds leader listens
> on
> > a leader port). There is no timeout in socket connect as leader has
> already
> > stopped listening.
> > This seems to be a reason, but why this happens ?
> >
> > I assume this scenario is supported and server should be set.
> >
> >> [2016-10-20 04:04:19,996] WARN Unexpected exception, tries=0, connecting
> > to
> >> /10.54.1.53:12001 (org.apache.zookeeper.server.quorum.Learner)
> >> java.net.ConnectException: Connection refused
> >> at java.net.PlainSocketImpl.socketConnect(Native Method)
> >> at
> >> java.net.AbstractPlainSocketImpl.doConnect(
> AbstractPlainSocketImpl.java:
> > 350)
> >> at
> >> java.net.AbstractPlainSocketImpl.connectToAddress(
> > AbstractPlainSocketImpl.java:206)
> >> at
> >> java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:
> 188)
> >> at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
> >> at java.net.Socket.connect(Socket.java:589)
> >> at
> >> org.apache.zookeeper.server.quorum.Learner.
> connectToLeader(Learner.java:
> > 225)
> >> at
> >> org.apache.zookeeper.server.quorum.Follower.followLeader(
> Follower.java:71)
> >> at org.apache.zookeeper.server.quorum.QuorumPeer.run(
> QuorumPeer.java:786)
> >
> >
> >
> > On Mon, Oct 24, 2016 at 6:57 PM, Flavio Junqueira <fp...@apache.org>
> wrote:
> >
> >> Hi Krzysztof,
> >>
> >> Your first message seems to indicate that a leader is being elected, but
> >> it is not being able to sync with enough followers (in your case 1
> >> follower). I'm saying this because of this line:
> >>
> >>> at org.apache.zookeeper.server.quorum.Leader.lead(Leader.java:377)
> >>
> >> You say that the issue is timing in your second message, but it isn't
> >> entirely clear to me why you think so. I think you need to determine the
> >> following:
> >>
> >> - Are the two remaining servers being able to elect a leader among them?
> >> If not, is the problem that they can't connect to each other?
> >> - If a leader is being elected but the leader isn't being able to sync
> and
> >> establish itself, then you need to determine the reason. Is the follower
> >> being able to connect to the leader? Is the initialization taking too
> long
> >> such that it times out? Is the socket read timing out?
> >>
> >> -Flavio
> >>
> >>
> >>> On 24 Oct 2016, at 11:37, Krzysztof Rybak <krzysztof.rybak@
> >> motorolasolutions.com> wrote:
> >>>
> >>> Hi,
> >>> I've logged some more information and the problem seems to be a timing
> >>> issue.
> >>> Long story short, the scenario is:
> >>> zk1(follower)
> >>> zk2(leader)
> >>> zk3(follower)
> >>>
> >>> step 1. stop zk1
> >>> step 2. stop zk2
> >>> step 3. start zk1
> >>>
> >>> result:
> >>> zk1 and zk3 cannot create a cluster of 2 zk instances(is such scenario
> >>> supported by zookeeper?).
> >>> When zk2 is started again instead of zk1, cluster is created.
> >>>
> >>> Observation:
> >>> After step 3., zk3 is elected to be a leader and starts listening on
> port
> >>> 2888
> >>> ( example port from example
> >>> http://zookeeper.apache.org/doc/trunk/zookeeperStarted.html ).
> Listening
> >>> takes around 10 seconds but during this time zk1 is not trying to
> >> connect.
> >>> zk1 tries to connect just after zk3 stops listening.
> >>>
> >>> a part of a log is:
> >>> zk3 stops listening on 04:04:17(testing with netstat) and zk1 starts
> >> trying
> >>> to connect at 04:04:19,996 as below.
> >>>
> >>> [2016-10-20 04:04:19,996] WARN Unexpected exception, tries=0,
> connecting
> >> to
> >>> /10.54.1.53:12001 (org.apache.zookeeper.server.quorum.Learner)
> >>> java.net.ConnectException: Connection refused
> >>> at java.net.PlainSocketImpl.socketConnect(Native Method)
> >>> at
> >>> java.net.AbstractPlainSocketImpl.doConnect(
> AbstractPlainSocketImpl.java:
> >> 350)
> >>> at
> >>> java.net.AbstractPlainSocketImpl.connectToAddress(
> >> AbstractPlainSocketImpl.java:206)
> >>> at
> >>> java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:
> >> 188)
> >>> at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
> >>> at java.net.Socket.connect(Socket.java:589)
> >>> at
> >>> org.apache.zookeeper.server.quorum.Learner.
> connectToLeader(Learner.java:
> >> 225)
> >>> at
> >>> org.apache.zookeeper.server.quorum.Follower.followLeader(
> >> Follower.java:71)
> >>> at org.apache.zookeeper.server.quorum.QuorumPeer.run(
> >> QuorumPeer.java:786)
> >>>
> >>> This is strange as I couldn't reproduce this on local VM VirtualBox but
> >> on
> >>> VMware reproduction is 100%.
> >>>
> >>> thanks,
> >>> Krzysztof
> >>>
> >>> ---------- Forwarded message ----------
> >>> From: Krzysztof Rybak <kr...@motorolasolutions.com>
> >>> Date: Thu, Oct 20, 2016 at 4:50 PM
> >>> Subject: Zookeeper exception: Timeout while waiting for epoch from
> quorum
> >>> To: user@zookeeper.apache.org
> >>>
> >>>
> >>> Hi All,
> >>> first mail in the group so sorry for possible inconsistency in advance.
> >>> Zookeeper version is zookeeper-3.4.6.
> >>>
> >>> I'm facing a problem when zookeeper is reconfiguring a cluster.
> >>>
> >>> Initial state:
> >>> machine A:
> >>> zk1(follower)
> >>> zk2(leader)
> >>> machine B:
> >>> zk3(follower)
> >>>
> >>> zk1 and zk2 are stopped (in that order).
> >>> zk1 is started on machine B.
> >>> zk1 and zk3 are not creating a cluster, status is (using srvr word)
> >>> 'This ZooKeeper instance is not currently serving requests'
> >>>
> >>> A part of a log is:
> >>> [2016-10-20 04:03:10,053] WARN Unexpected exception
> >>> (org.apache.zookeeper.server.quorum.QuorumPeer)
> >>> java.lang.InterruptedException: Timeout while waiting for epoch from
> >> quorum
> >>> at org.apache.zookeeper.server.quorum.Leader.
> >> getEpochToPropose(Leader.java:
> >>> 878)
> >>> at org.apache.zookeeper.server.quorum.Leader.lead(Leader.java:377)
> >>> at org.apache.zookeeper.server.quorum.QuorumPeer.run(
> >> QuorumPeer.java:799)
> >>> [2016-10-20 04:03:10,054] INFO Shutting down
> >> (org.apache.zookeeper.server.
> >>> quorum.Leader)
> >>> [2016-10-20 04:03:10,054] INFO Shutdown called
> >> (org.apache.zookeeper.server.
> >>> quorum.Leader)
> >>> java.lang.Exception: shutdown Leader! reason: Forcing shutdown
> >>> at org.apache.zookeeper.server.quorum.Leader.shutdown(Leader.java:499)
> >>> at org.apache.zookeeper.server.quorum.QuorumPeer.run(
> >> QuorumPeer.java:805)
> >>> [2016-10-20 04:03:10,054] INFO shutting down
> >> (org.apache.zookeeper.server.
> >>> ZooKeeperServer)
> >>>
> >>> What is interesting: when zk2(previous leader) is started on machine B
> >>> (instead of zk1) cluster is configured correctly.
> >>> The same situation happens when all happen on the single machine.
> >>>
> >>> Issue is similar to this, but algorithm used by me is 3 (by default and
> >>> confirmed with electionAlg=3 in .cfg files)
> >>> https://issues.apache.org/jira/browse/ZOOKEEPER-2400
> >>>
> >>> thanks,
> >>> Krzysztof
> >>
> >>
>
>

Re: Zookeeper exception: Timeout while waiting for epoch from quorum

Posted by Flavio Junqueira <fp...@apache.org>.
There are two potential reasons I can think of:

1- The configuration is incorrect and it is pointing to the wrong port. Keep in mind that the election port is different from the quorum server communication port.
2- There is something block traffic on the configured port between those two servers.

Could it be any of these?

-Flavio

> On 25 Oct 2016, at 14:33, Krzysztof Rybak <kr...@motorolasolutions.com> wrote:
> 
> Hi,
> first log is a little misleading as epoch exception is observed later in a
> log and may be a consequence of lack of connection.
> The first exception observed in failing scenario is:
> 
> [2016-10-25 07:48:41,107] WARN Unexpected exception, tries=0, connecting to
> /10.54.1.53:12001 (org.apache.zookeeper.server.quorum.Learner)
> java.net.ConnectException: Connection refused
>        at java.net.PlainSocketImpl.socketConnect(Native Method)
>        at
> java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
>        at
> java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
>        at
> java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
>        at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
>        at java.net.Socket.connect(Socket.java:589)
>        at
> org.apache.zookeeper.server.quorum.Learner.connectToLeader(Learner.java:225)
>        at
> org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:71)
>        at
> org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:786)
> 
> I think that instances are able to elect a Leader(zookeeper instance on
> address 10.54.1.53, leader port 12001-2888 in zookeeper examples) as such
> information is logged before exception:
> 
> [2016-10-25 07:48:41,103] INFO FOLLOWING - LEADER ELECTION TOOK - 10249
> (org.apache.zookeeper.server.quorum.Learner)
> Then follower tries to connect to leader but this fails.
> 
> - If a leader is being elected but the leader isn't being able to sync and
> establish itself, then you need to determine the reason. Is the follower
> being able to connect to the leader? Is the initialization taking too long
> such that it times out? Is the socket read timing out?
> 
> This seems to be an issue that follower cannot connect to a leader on
> leader port (12001 in my case).
> Exception is related to connect() method:
> 
> at
> org.apache.zookeeper.server.quorum.Learner.connectToLeader(Learner.java:225)
> 
> and the exception is
> 
> java.net.ConnectException: Connection refused
> 
> Follower tries to connect just after the leader stops listening on a leader
> port (12001 in my case, 2888 in zookeeper examples).
> I just don't know why follower starts trying to connect after around 10
> seconds after setting a new leader(during this 10 seconds leader listens on
> a leader port). There is no timeout in socket connect as leader has already
> stopped listening.
> This seems to be a reason, but why this happens ?
> 
> I assume this scenario is supported and server should be set.
> 
>> [2016-10-20 04:04:19,996] WARN Unexpected exception, tries=0, connecting
> to
>> /10.54.1.53:12001 (org.apache.zookeeper.server.quorum.Learner)
>> java.net.ConnectException: Connection refused
>> at java.net.PlainSocketImpl.socketConnect(Native Method)
>> at
>> java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:
> 350)
>> at
>> java.net.AbstractPlainSocketImpl.connectToAddress(
> AbstractPlainSocketImpl.java:206)
>> at
>> java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
>> at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
>> at java.net.Socket.connect(Socket.java:589)
>> at
>> org.apache.zookeeper.server.quorum.Learner.connectToLeader(Learner.java:
> 225)
>> at
>> org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:71)
>> at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:786)
> 
> 
> 
> On Mon, Oct 24, 2016 at 6:57 PM, Flavio Junqueira <fp...@apache.org> wrote:
> 
>> Hi Krzysztof,
>> 
>> Your first message seems to indicate that a leader is being elected, but
>> it is not being able to sync with enough followers (in your case 1
>> follower). I'm saying this because of this line:
>> 
>>> at org.apache.zookeeper.server.quorum.Leader.lead(Leader.java:377)
>> 
>> You say that the issue is timing in your second message, but it isn't
>> entirely clear to me why you think so. I think you need to determine the
>> following:
>> 
>> - Are the two remaining servers being able to elect a leader among them?
>> If not, is the problem that they can't connect to each other?
>> - If a leader is being elected but the leader isn't being able to sync and
>> establish itself, then you need to determine the reason. Is the follower
>> being able to connect to the leader? Is the initialization taking too long
>> such that it times out? Is the socket read timing out?
>> 
>> -Flavio
>> 
>> 
>>> On 24 Oct 2016, at 11:37, Krzysztof Rybak <krzysztof.rybak@
>> motorolasolutions.com> wrote:
>>> 
>>> Hi,
>>> I've logged some more information and the problem seems to be a timing
>>> issue.
>>> Long story short, the scenario is:
>>> zk1(follower)
>>> zk2(leader)
>>> zk3(follower)
>>> 
>>> step 1. stop zk1
>>> step 2. stop zk2
>>> step 3. start zk1
>>> 
>>> result:
>>> zk1 and zk3 cannot create a cluster of 2 zk instances(is such scenario
>>> supported by zookeeper?).
>>> When zk2 is started again instead of zk1, cluster is created.
>>> 
>>> Observation:
>>> After step 3., zk3 is elected to be a leader and starts listening on port
>>> 2888
>>> ( example port from example
>>> http://zookeeper.apache.org/doc/trunk/zookeeperStarted.html ). Listening
>>> takes around 10 seconds but during this time zk1 is not trying to
>> connect.
>>> zk1 tries to connect just after zk3 stops listening.
>>> 
>>> a part of a log is:
>>> zk3 stops listening on 04:04:17(testing with netstat) and zk1 starts
>> trying
>>> to connect at 04:04:19,996 as below.
>>> 
>>> [2016-10-20 04:04:19,996] WARN Unexpected exception, tries=0, connecting
>> to
>>> /10.54.1.53:12001 (org.apache.zookeeper.server.quorum.Learner)
>>> java.net.ConnectException: Connection refused
>>> at java.net.PlainSocketImpl.socketConnect(Native Method)
>>> at
>>> java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:
>> 350)
>>> at
>>> java.net.AbstractPlainSocketImpl.connectToAddress(
>> AbstractPlainSocketImpl.java:206)
>>> at
>>> java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:
>> 188)
>>> at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
>>> at java.net.Socket.connect(Socket.java:589)
>>> at
>>> org.apache.zookeeper.server.quorum.Learner.connectToLeader(Learner.java:
>> 225)
>>> at
>>> org.apache.zookeeper.server.quorum.Follower.followLeader(
>> Follower.java:71)
>>> at org.apache.zookeeper.server.quorum.QuorumPeer.run(
>> QuorumPeer.java:786)
>>> 
>>> This is strange as I couldn't reproduce this on local VM VirtualBox but
>> on
>>> VMware reproduction is 100%.
>>> 
>>> thanks,
>>> Krzysztof
>>> 
>>> ---------- Forwarded message ----------
>>> From: Krzysztof Rybak <kr...@motorolasolutions.com>
>>> Date: Thu, Oct 20, 2016 at 4:50 PM
>>> Subject: Zookeeper exception: Timeout while waiting for epoch from quorum
>>> To: user@zookeeper.apache.org
>>> 
>>> 
>>> Hi All,
>>> first mail in the group so sorry for possible inconsistency in advance.
>>> Zookeeper version is zookeeper-3.4.6.
>>> 
>>> I'm facing a problem when zookeeper is reconfiguring a cluster.
>>> 
>>> Initial state:
>>> machine A:
>>> zk1(follower)
>>> zk2(leader)
>>> machine B:
>>> zk3(follower)
>>> 
>>> zk1 and zk2 are stopped (in that order).
>>> zk1 is started on machine B.
>>> zk1 and zk3 are not creating a cluster, status is (using srvr word)
>>> 'This ZooKeeper instance is not currently serving requests'
>>> 
>>> A part of a log is:
>>> [2016-10-20 04:03:10,053] WARN Unexpected exception
>>> (org.apache.zookeeper.server.quorum.QuorumPeer)
>>> java.lang.InterruptedException: Timeout while waiting for epoch from
>> quorum
>>> at org.apache.zookeeper.server.quorum.Leader.
>> getEpochToPropose(Leader.java:
>>> 878)
>>> at org.apache.zookeeper.server.quorum.Leader.lead(Leader.java:377)
>>> at org.apache.zookeeper.server.quorum.QuorumPeer.run(
>> QuorumPeer.java:799)
>>> [2016-10-20 04:03:10,054] INFO Shutting down
>> (org.apache.zookeeper.server.
>>> quorum.Leader)
>>> [2016-10-20 04:03:10,054] INFO Shutdown called
>> (org.apache.zookeeper.server.
>>> quorum.Leader)
>>> java.lang.Exception: shutdown Leader! reason: Forcing shutdown
>>> at org.apache.zookeeper.server.quorum.Leader.shutdown(Leader.java:499)
>>> at org.apache.zookeeper.server.quorum.QuorumPeer.run(
>> QuorumPeer.java:805)
>>> [2016-10-20 04:03:10,054] INFO shutting down
>> (org.apache.zookeeper.server.
>>> ZooKeeperServer)
>>> 
>>> What is interesting: when zk2(previous leader) is started on machine B
>>> (instead of zk1) cluster is configured correctly.
>>> The same situation happens when all happen on the single machine.
>>> 
>>> Issue is similar to this, but algorithm used by me is 3 (by default and
>>> confirmed with electionAlg=3 in .cfg files)
>>> https://issues.apache.org/jira/browse/ZOOKEEPER-2400
>>> 
>>> thanks,
>>> Krzysztof
>> 
>> 


Re: Zookeeper exception: Timeout while waiting for epoch from quorum

Posted by Krzysztof Rybak <kr...@motorolasolutions.com>.
Hi,
first log is a little misleading as epoch exception is observed later in a
log and may be a consequence of lack of connection.
The first exception observed in failing scenario is:

[2016-10-25 07:48:41,107] WARN Unexpected exception, tries=0, connecting to
/10.54.1.53:12001 (org.apache.zookeeper.server.quorum.Learner)
java.net.ConnectException: Connection refused
        at java.net.PlainSocketImpl.socketConnect(Native Method)
        at
java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
        at
java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
        at
java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
        at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
        at java.net.Socket.connect(Socket.java:589)
        at
org.apache.zookeeper.server.quorum.Learner.connectToLeader(Learner.java:225)
        at
org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:71)
        at
org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:786)

I think that instances are able to elect a Leader(zookeeper instance on
address 10.54.1.53, leader port 12001-2888 in zookeeper examples) as such
information is logged before exception:

[2016-10-25 07:48:41,103] INFO FOLLOWING - LEADER ELECTION TOOK - 10249
(org.apache.zookeeper.server.quorum.Learner)
Then follower tries to connect to leader but this fails.

- If a leader is being elected but the leader isn't being able to sync and
establish itself, then you need to determine the reason. Is the follower
being able to connect to the leader? Is the initialization taking too long
such that it times out? Is the socket read timing out?

This seems to be an issue that follower cannot connect to a leader on
leader port (12001 in my case).
Exception is related to connect() method:

at
org.apache.zookeeper.server.quorum.Learner.connectToLeader(Learner.java:225)

and the exception is

java.net.ConnectException: Connection refused

Follower tries to connect just after the leader stops listening on a leader
port (12001 in my case, 2888 in zookeeper examples).
I just don't know why follower starts trying to connect after around 10
seconds after setting a new leader(during this 10 seconds leader listens on
a leader port). There is no timeout in socket connect as leader has already
stopped listening.
This seems to be a reason, but why this happens ?

I assume this scenario is supported and server should be set.

> [2016-10-20 04:04:19,996] WARN Unexpected exception, tries=0, connecting
to
> /10.54.1.53:12001 (org.apache.zookeeper.server.quorum.Learner)
> java.net.ConnectException: Connection refused
> at java.net.PlainSocketImpl.socketConnect(Native Method)
> at
> java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:
350)
> at
> java.net.AbstractPlainSocketImpl.connectToAddress(
AbstractPlainSocketImpl.java:206)
> at
> java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
> at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
> at java.net.Socket.connect(Socket.java:589)
> at
> org.apache.zookeeper.server.quorum.Learner.connectToLeader(Learner.java:
225)
> at
> org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:71)
> at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:786)



On Mon, Oct 24, 2016 at 6:57 PM, Flavio Junqueira <fp...@apache.org> wrote:

> Hi Krzysztof,
>
> Your first message seems to indicate that a leader is being elected, but
> it is not being able to sync with enough followers (in your case 1
> follower). I'm saying this because of this line:
>
> > at org.apache.zookeeper.server.quorum.Leader.lead(Leader.java:377)
>
> You say that the issue is timing in your second message, but it isn't
> entirely clear to me why you think so. I think you need to determine the
> following:
>
> - Are the two remaining servers being able to elect a leader among them?
> If not, is the problem that they can't connect to each other?
> - If a leader is being elected but the leader isn't being able to sync and
> establish itself, then you need to determine the reason. Is the follower
> being able to connect to the leader? Is the initialization taking too long
> such that it times out? Is the socket read timing out?
>
> -Flavio
>
>
> > On 24 Oct 2016, at 11:37, Krzysztof Rybak <krzysztof.rybak@
> motorolasolutions.com> wrote:
> >
> > Hi,
> > I've logged some more information and the problem seems to be a timing
> > issue.
> > Long story short, the scenario is:
> > zk1(follower)
> > zk2(leader)
> > zk3(follower)
> >
> > step 1. stop zk1
> > step 2. stop zk2
> > step 3. start zk1
> >
> > result:
> > zk1 and zk3 cannot create a cluster of 2 zk instances(is such scenario
> > supported by zookeeper?).
> > When zk2 is started again instead of zk1, cluster is created.
> >
> > Observation:
> > After step 3., zk3 is elected to be a leader and starts listening on port
> > 2888
> > ( example port from example
> > http://zookeeper.apache.org/doc/trunk/zookeeperStarted.html ). Listening
> > takes around 10 seconds but during this time zk1 is not trying to
> connect.
> > zk1 tries to connect just after zk3 stops listening.
> >
> > a part of a log is:
> > zk3 stops listening on 04:04:17(testing with netstat) and zk1 starts
> trying
> > to connect at 04:04:19,996 as below.
> >
> > [2016-10-20 04:04:19,996] WARN Unexpected exception, tries=0, connecting
> to
> > /10.54.1.53:12001 (org.apache.zookeeper.server.quorum.Learner)
> > java.net.ConnectException: Connection refused
> > at java.net.PlainSocketImpl.socketConnect(Native Method)
> > at
> > java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:
> 350)
> > at
> > java.net.AbstractPlainSocketImpl.connectToAddress(
> AbstractPlainSocketImpl.java:206)
> > at
> > java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:
> 188)
> > at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
> > at java.net.Socket.connect(Socket.java:589)
> > at
> > org.apache.zookeeper.server.quorum.Learner.connectToLeader(Learner.java:
> 225)
> > at
> > org.apache.zookeeper.server.quorum.Follower.followLeader(
> Follower.java:71)
> > at org.apache.zookeeper.server.quorum.QuorumPeer.run(
> QuorumPeer.java:786)
> >
> > This is strange as I couldn't reproduce this on local VM VirtualBox but
> on
> > VMware reproduction is 100%.
> >
> > thanks,
> > Krzysztof
> >
> > ---------- Forwarded message ----------
> > From: Krzysztof Rybak <kr...@motorolasolutions.com>
> > Date: Thu, Oct 20, 2016 at 4:50 PM
> > Subject: Zookeeper exception: Timeout while waiting for epoch from quorum
> > To: user@zookeeper.apache.org
> >
> >
> > Hi All,
> > first mail in the group so sorry for possible inconsistency in advance.
> > Zookeeper version is zookeeper-3.4.6.
> >
> > I'm facing a problem when zookeeper is reconfiguring a cluster.
> >
> > Initial state:
> > machine A:
> > zk1(follower)
> > zk2(leader)
> > machine B:
> > zk3(follower)
> >
> > zk1 and zk2 are stopped (in that order).
> > zk1 is started on machine B.
> > zk1 and zk3 are not creating a cluster, status is (using srvr word)
> > 'This ZooKeeper instance is not currently serving requests'
> >
> > A part of a log is:
> > [2016-10-20 04:03:10,053] WARN Unexpected exception
> > (org.apache.zookeeper.server.quorum.QuorumPeer)
> > java.lang.InterruptedException: Timeout while waiting for epoch from
> quorum
> > at org.apache.zookeeper.server.quorum.Leader.
> getEpochToPropose(Leader.java:
> > 878)
> > at org.apache.zookeeper.server.quorum.Leader.lead(Leader.java:377)
> > at org.apache.zookeeper.server.quorum.QuorumPeer.run(
> QuorumPeer.java:799)
> > [2016-10-20 04:03:10,054] INFO Shutting down
> (org.apache.zookeeper.server.
> > quorum.Leader)
> > [2016-10-20 04:03:10,054] INFO Shutdown called
> (org.apache.zookeeper.server.
> > quorum.Leader)
> > java.lang.Exception: shutdown Leader! reason: Forcing shutdown
> > at org.apache.zookeeper.server.quorum.Leader.shutdown(Leader.java:499)
> > at org.apache.zookeeper.server.quorum.QuorumPeer.run(
> QuorumPeer.java:805)
> > [2016-10-20 04:03:10,054] INFO shutting down
> (org.apache.zookeeper.server.
> > ZooKeeperServer)
> >
> > What is interesting: when zk2(previous leader) is started on machine B
> > (instead of zk1) cluster is configured correctly.
> > The same situation happens when all happen on the single machine.
> >
> > Issue is similar to this, but algorithm used by me is 3 (by default and
> > confirmed with electionAlg=3 in .cfg files)
> > https://issues.apache.org/jira/browse/ZOOKEEPER-2400
> >
> > thanks,
> > Krzysztof
>
>

Re: Zookeeper exception: Timeout while waiting for epoch from quorum

Posted by Flavio Junqueira <fp...@apache.org>.
Hi Krzysztof,

Your first message seems to indicate that a leader is being elected, but it is not being able to sync with enough followers (in your case 1 follower). I'm saying this because of this line:

> at org.apache.zookeeper.server.quorum.Leader.lead(Leader.java:377)

You say that the issue is timing in your second message, but it isn't entirely clear to me why you think so. I think you need to determine the following:

- Are the two remaining servers being able to elect a leader among them? If not, is the problem that they can't connect to each other?
- If a leader is being elected but the leader isn't being able to sync and establish itself, then you need to determine the reason. Is the follower being able to connect to the leader? Is the initialization taking too long such that it times out? Is the socket read timing out?

-Flavio


> On 24 Oct 2016, at 11:37, Krzysztof Rybak <kr...@motorolasolutions.com> wrote:
> 
> Hi,
> I've logged some more information and the problem seems to be a timing
> issue.
> Long story short, the scenario is:
> zk1(follower)
> zk2(leader)
> zk3(follower)
> 
> step 1. stop zk1
> step 2. stop zk2
> step 3. start zk1
> 
> result:
> zk1 and zk3 cannot create a cluster of 2 zk instances(is such scenario
> supported by zookeeper?).
> When zk2 is started again instead of zk1, cluster is created.
> 
> Observation:
> After step 3., zk3 is elected to be a leader and starts listening on port
> 2888
> ( example port from example
> http://zookeeper.apache.org/doc/trunk/zookeeperStarted.html ). Listening
> takes around 10 seconds but during this time zk1 is not trying to connect.
> zk1 tries to connect just after zk3 stops listening.
> 
> a part of a log is:
> zk3 stops listening on 04:04:17(testing with netstat) and zk1 starts trying
> to connect at 04:04:19,996 as below.
> 
> [2016-10-20 04:04:19,996] WARN Unexpected exception, tries=0, connecting to
> /10.54.1.53:12001 (org.apache.zookeeper.server.quorum.Learner)
> java.net.ConnectException: Connection refused
> at java.net.PlainSocketImpl.socketConnect(Native Method)
> at
> java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
> at
> java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
> at
> java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
> at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
> at java.net.Socket.connect(Socket.java:589)
> at
> org.apache.zookeeper.server.quorum.Learner.connectToLeader(Learner.java:225)
> at
> org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:71)
> at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:786)
> 
> This is strange as I couldn't reproduce this on local VM VirtualBox but on
> VMware reproduction is 100%.
> 
> thanks,
> Krzysztof
> 
> ---------- Forwarded message ----------
> From: Krzysztof Rybak <kr...@motorolasolutions.com>
> Date: Thu, Oct 20, 2016 at 4:50 PM
> Subject: Zookeeper exception: Timeout while waiting for epoch from quorum
> To: user@zookeeper.apache.org
> 
> 
> Hi All,
> first mail in the group so sorry for possible inconsistency in advance.
> Zookeeper version is zookeeper-3.4.6.
> 
> I'm facing a problem when zookeeper is reconfiguring a cluster.
> 
> Initial state:
> machine A:
> zk1(follower)
> zk2(leader)
> machine B:
> zk3(follower)
> 
> zk1 and zk2 are stopped (in that order).
> zk1 is started on machine B.
> zk1 and zk3 are not creating a cluster, status is (using srvr word)
> 'This ZooKeeper instance is not currently serving requests'
> 
> A part of a log is:
> [2016-10-20 04:03:10,053] WARN Unexpected exception
> (org.apache.zookeeper.server.quorum.QuorumPeer)
> java.lang.InterruptedException: Timeout while waiting for epoch from quorum
> at org.apache.zookeeper.server.quorum.Leader.getEpochToPropose(Leader.java:
> 878)
> at org.apache.zookeeper.server.quorum.Leader.lead(Leader.java:377)
> at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:799)
> [2016-10-20 04:03:10,054] INFO Shutting down (org.apache.zookeeper.server.
> quorum.Leader)
> [2016-10-20 04:03:10,054] INFO Shutdown called (org.apache.zookeeper.server.
> quorum.Leader)
> java.lang.Exception: shutdown Leader! reason: Forcing shutdown
> at org.apache.zookeeper.server.quorum.Leader.shutdown(Leader.java:499)
> at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:805)
> [2016-10-20 04:03:10,054] INFO shutting down (org.apache.zookeeper.server.
> ZooKeeperServer)
> 
> What is interesting: when zk2(previous leader) is started on machine B
> (instead of zk1) cluster is configured correctly.
> The same situation happens when all happen on the single machine.
> 
> Issue is similar to this, but algorithm used by me is 3 (by default and
> confirmed with electionAlg=3 in .cfg files)
> https://issues.apache.org/jira/browse/ZOOKEEPER-2400
> 
> thanks,
> Krzysztof


Fwd: Zookeeper exception: Timeout while waiting for epoch from quorum

Posted by Krzysztof Rybak <kr...@motorolasolutions.com>.
Hi,
I've logged some more information and the problem seems to be a timing
issue.
Long story short, the scenario is:
zk1(follower)
zk2(leader)
zk3(follower)

step 1. stop zk1
step 2. stop zk2
step 3. start zk1

result:
zk1 and zk3 cannot create a cluster of 2 zk instances(is such scenario
supported by zookeeper?).
When zk2 is started again instead of zk1, cluster is created.

Observation:
After step 3., zk3 is elected to be a leader and starts listening on port
2888
( example port from example
http://zookeeper.apache.org/doc/trunk/zookeeperStarted.html ). Listening
takes around 10 seconds but during this time zk1 is not trying to connect.
zk1 tries to connect just after zk3 stops listening.

a part of a log is:
zk3 stops listening on 04:04:17(testing with netstat) and zk1 starts trying
to connect at 04:04:19,996 as below.

[2016-10-20 04:04:19,996] WARN Unexpected exception, tries=0, connecting to
/10.54.1.53:12001 (org.apache.zookeeper.server.quorum.Learner)
java.net.ConnectException: Connection refused
at java.net.PlainSocketImpl.socketConnect(Native Method)
at
java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
at
java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
at
java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:589)
at
org.apache.zookeeper.server.quorum.Learner.connectToLeader(Learner.java:225)
at
org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:71)
at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:786)

This is strange as I couldn't reproduce this on local VM VirtualBox but on
VMware reproduction is 100%.

thanks,
Krzysztof

---------- Forwarded message ----------
From: Krzysztof Rybak <kr...@motorolasolutions.com>
Date: Thu, Oct 20, 2016 at 4:50 PM
Subject: Zookeeper exception: Timeout while waiting for epoch from quorum
To: user@zookeeper.apache.org


Hi All,
first mail in the group so sorry for possible inconsistency in advance.
Zookeeper version is zookeeper-3.4.6.

I'm facing a problem when zookeeper is reconfiguring a cluster.

Initial state:
machine A:
zk1(follower)
zk2(leader)
machine B:
zk3(follower)

zk1 and zk2 are stopped (in that order).
zk1 is started on machine B.
zk1 and zk3 are not creating a cluster, status is (using srvr word)
'This ZooKeeper instance is not currently serving requests'

A part of a log is:
[2016-10-20 04:03:10,053] WARN Unexpected exception
(org.apache.zookeeper.server.quorum.QuorumPeer)
java.lang.InterruptedException: Timeout while waiting for epoch from quorum
at org.apache.zookeeper.server.quorum.Leader.getEpochToPropose(Leader.java:
878)
at org.apache.zookeeper.server.quorum.Leader.lead(Leader.java:377)
at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:799)
[2016-10-20 04:03:10,054] INFO Shutting down (org.apache.zookeeper.server.
quorum.Leader)
[2016-10-20 04:03:10,054] INFO Shutdown called (org.apache.zookeeper.server.
quorum.Leader)
java.lang.Exception: shutdown Leader! reason: Forcing shutdown
at org.apache.zookeeper.server.quorum.Leader.shutdown(Leader.java:499)
at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:805)
[2016-10-20 04:03:10,054] INFO shutting down (org.apache.zookeeper.server.
ZooKeeperServer)

What is interesting: when zk2(previous leader) is started on machine B
(instead of zk1) cluster is configured correctly.
The same situation happens when all happen on the single machine.

Issue is similar to this, but algorithm used by me is 3 (by default and
confirmed with electionAlg=3 in .cfg files)
https://issues.apache.org/jira/browse/ZOOKEEPER-2400

thanks,
Krzysztof