Posted to user@zookeeper.apache.org by Cameron McKenzie <mc...@gmail.com> on 2014/04/30 09:43:34 UTC

ZOOKEEPER-900 / 901 / 1678

ZooKeeper users,
Does anyone know the status of these issues? Nothing seems to have been
done on them since late 2010.

I think that we're currently experiencing the same issue. If we have a
3 node cluster, for example, and 1 of these nodes is completely dead (i.e.
the entire host is not contactable due to a power outage), I would expect
that a quorum could still be formed, but this does not appear to be the case.

I haven't delved into the code too much, but it appears that blocking IO is
being used for the connect. This doesn't respect the SO timeout set on the
socket, so the connect() call can block for some arbitrary amount of time
(based on the OS level TCP settings?). This in turn means that leader
election will fail because it times out before the socket connect does,
even though there are enough live hosts present to form a quorum.
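
For reference, here is a minimal Java sketch of the distinction being drawn
here. In java.net.Socket, the SO timeout (setSoTimeout) only bounds blocking
reads, while the connect timeout is a separate argument to connect(). The
class and method names are made up for illustration; this is not ZooKeeper
code, and it only demonstrates the API against a local listener.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;

class ConnectTimeoutSketch {
    // Hypothetical helper: connects with an explicit connect timeout and
    // returns the elapsed milliseconds, or -1 if the connect failed or timed out.
    static long timedConnect(String host, int port, int connectTimeoutMs) {
        long start = System.nanoTime();
        try (Socket s = new Socket()) {
            // SO_TIMEOUT bounds blocking reads only; it does not bound connect().
            s.setSoTimeout(connectTimeoutMs);
            // The second argument is what bounds the connect itself.
            s.connect(new InetSocketAddress(host, port), connectTimeoutMs);
        } catch (IOException e) {
            return -1; // includes SocketTimeoutException ("connect timed out")
        }
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) throws IOException {
        // A local listener stands in for a live peer.
        try (ServerSocket server = new ServerSocket(0)) {
            long ms = timedConnect("127.0.0.1", server.getLocalPort(), 1000);
            System.out.println("connect returned after " + ms + " ms");
        }
    }
}
```

Calling connect() without the timeout argument is the case where the wait
would be left to the OS level TCP settings.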

This seems like a fairly fundamental problem, unless I'm missing something.
If a single host goes down due to a power failure, for example, it can
prevent any further hosts from joining the cluster. In addition, if enough
hosts come back online after a power failure to form a quorum, but some
don't, a quorum may still not be able to form.
cheers
Cam

Re: ZOOKEEPER-900 / 901 / 1678

Posted by Cameron McKenzie <mc...@gmail.com>.
More digging!

See the attached screenshot from loggraph. You can see that server1 thinks
that it's the follower, but it can't connect to server3 because server3
doesn't think that it's leader yet. I believe this is because server3 is
blocked for 5 seconds (the connect timeout) while trying to connect to the
dead server (server2). It can't receive any notifications during this
period due to the synchronization on the connectOne() method. Because of
this, it has not yet spawned the Learner thread which opens up a server
socket to accept connections from followers.
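
To illustrate the effect of the synchronization described above, here is a
small Java sketch in which one caller holds an object's monitor inside a
synchronized method while a second caller waits for it. The class is a
hypothetical stand-in, not the real QuorumCnxManager, and a sleep stands in
for the blocked connect to the dead host.

```java
import java.util.concurrent.CountDownLatch;

class SynchronizedConnectSketch {
    // Hypothetical stand-in for a connect that blocks on a dead host.
    // The monitor is held for the whole sleep, as with a synchronized method.
    synchronized void connectOne(long blockMs, CountDownLatch entered) {
        entered.countDown(); // signal that this caller now holds the monitor
        try {
            Thread.sleep(blockMs);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Returns how long (ms) a second caller waits while the first holds the monitor.
    static long secondCallerDelayMs(long blockMs) {
        SynchronizedConnectSketch mgr = new SynchronizedConnectSketch();
        CountDownLatch entered = new CountDownLatch(1);
        Thread slow = new Thread(() -> mgr.connectOne(blockMs, entered));
        slow.start();
        try {
            entered.await(); // wait until the slow caller is inside connectOne()
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        long start = System.nanoTime();
        mgr.connectOne(0, new CountDownLatch(1)); // blocks until the slow caller returns
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        System.out.println("second caller waited ~" + secondCallerDelayMs(500) + " ms");
    }
}
```

The second call does no work of its own, yet it waits for roughly as long as
the first call stays blocked.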

server1 tries to connect to server3 5 times (in the connectToLeader()
method in Learner) in relatively quick succession (with 1 second sleeps in
between), and all of these attempts fail because the server socket is not
yet up. At this point, server1 gives up, closes the Follower class, and
goes back into a LOOKING state, which results in another election occurring.

I can't think of anything that can be done without making the socket
establishment calls non blocking, which is not an insignificant change.

We can reduce the timeout for connection establishment, which should
greatly reduce the likelihood of the issue. The window of opportunity seems
to occur when the leader is blocked trying to connect to the dead host (5
seconds) while the follower is attempting to connect to the leader. At a
minimum, the attempts to connect to the leader will take 4 seconds plus
however long the connection attempts themselves take (the connectToLeader()
method makes 5 attempts to establish a connection to the leader, with a 1
second sleep in between them). So, given that we cannot increase the number
of attempts to communicate with the leader (or the sleep period between
attempts), the only option left to us is to minimize the time that the
leader can be blocked for while attempting to connect to the dead host.
Obviously, reducing this number too much will result in other issues, so a
bit of fine tuning will be required.
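
The timing argument above can be written out as a short Java sketch. It
assumes that each failed connection attempt returns almost immediately (as a
refused connection usually does) and so counts only the sleeps between
attempts. The class and method names are illustrative, not taken from
ZooKeeper.

```java
class ElectionWindowSketch {
    // Minimum time the follower spends retrying: sleeps happen between
    // attempts only, so 5 attempts give 4 sleeps.
    static long followerRetryWindowMs(int attempts, long sleepMs) {
        return (attempts - 1) * sleepMs;
    }

    // True if the leader can still be blocked on the dead host at the point
    // where the follower has used up all of its attempts.
    static boolean leaderMayStillBeBlocked(long leaderBlockMs, int attempts, long sleepMs) {
        return leaderBlockMs > followerRetryWindowMs(attempts, sleepMs);
    }

    public static void main(String[] args) {
        // 5 attempts with 1 second sleeps, as described for connectToLeader().
        System.out.println("retry window: " + followerRetryWindowMs(5, 1000) + " ms");
        System.out.println("5000 ms block: " + leaderMayStillBeBlocked(5000, 5, 1000));
        System.out.println("1000 ms block: " + leaderMayStillBeBlocked(1000, 5, 1000));
    }
}
```

With a 5000 ms block the leader can outlast the 4000 ms retry window, while
with a 1000 ms block it cannot, under the assumption stated above.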

Any other suggestions? I'm still hoping that I'm missing something simple!
cheers
Cam




Re: ZOOKEEPER-900 / 901 / 1678

Posted by Cameron McKenzie <mc...@gmail.com>.
Flavio,
I modified the zookeeper.cnxTimeout system property to 1000ms (it defaults
to 5000ms), and the election succeeds and stays up. The ZK cluster works as
expected. So, there is definitely some interaction between the connection
timeout and the leader election.
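
As a rough illustration of how a Java system property like this is typically
supplied and read: it is passed to the JVM as -Dzookeeper.cnxTimeout=1000 and
read with a fallback default. The class below is a hypothetical sketch, not
the actual QuorumCnxManager code. The only details taken from this thread are
the property name and the 5000ms default.

```java
class CnxTimeoutSketch {
    // Default taken from the value mentioned in this thread (5000ms).
    static final int DEFAULT_CNX_TIMEOUT_MS = 5000;

    // Reads -Dzookeeper.cnxTimeout=<ms>, falling back to the default when the
    // property is unset or is not a valid integer.
    static int cnxTimeoutMs() {
        return Integer.getInteger("zookeeper.cnxTimeout", DEFAULT_CNX_TIMEOUT_MS);
    }

    public static void main(String[] args) {
        System.out.println("connect timeout: " + cnxTimeoutMs() + " ms");
    }
}
```

Running it as "java -Dzookeeper.cnxTimeout=1000 CnxTimeoutSketch" would pick
up the lower value.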

I notice that in QuorumCnxManager the connectOne() method is synchronized,
and that this method also blocks on the socket connect. I also notice that
both the receiveConnection() and toSend() methods call into this method, so
they would be blocked while an attempt to connect to a dead host is in
progress. This is potentially quite a large window to be blocked for (5
seconds by default), but I haven't looked into the code in enough detail to
understand what the implications of this blocking are. This certainly looks
like a possible cause of the issue though.
cheers





Re: ZOOKEEPER-900 / 901 / 1678

Posted by Cameron McKenzie <mc...@gmail.com>.
Debug logs attached.

cheers



RE: ZOOKEEPER-900 / 901 / 1678

Posted by Flavio Junqueira <fp...@yahoo.com>.
Sure, having the logs might help.

-Flavio

-----Original Message-----
From: Cameron McKenzie [mailto:mckenzie.cam@gmail.com] 
Sent: Wednesday, April 30, 2014 11:10 PM
To: user@zookeeper.apache.org
Subject: Re: ZOOKEEPER-900 / 901 / 1678

Thanks Flavio,
The length of the leader election seems directly related to the presence of this dead host in the configuration though. If I remove the dead host from the configuration, a quorum is quickly formed. From the logs it does appear that the election is completing though (after about 15 seconds in most cases), but then another election seems to happen shortly afterwards.

Would it be helpful for me to provide debug level logs?
cheers


On Thu, May 1, 2014 at 8:04 AM, Flavio Junqueira <fp...@yahoo.com>wrote:

> Leader election seems to be taking a long time. The connection 
> attempts from QuorumCnxManager are not causing a new round of leader 
> election. What causes it is the absence of a quorum of supporters, so 
> the elected leader is not getting enough servers to support it.
>
> -Flavio
>
> -----Original Message-----
> From: Cameron McKenzie [mailto:mckenzie.cam@gmail.com]
> Sent: Wednesday, April 30, 2014 10:36 PM
> To: user@zookeeper.apache.org
> Subject: Re: ZOOKEEPER-900 / 901 / 1678
>
> I've done a bit more testing this morning, and it appears that the 
> leader election is actually completing, but then just after the 
> election has completed, the connection attempt to the dead host times 
> out, and this seems to cause another leader election. The same thing 
> happens the next leader election. etc.
>
> 2014-04-30 04:07:25,383 [myid:3] - INFO [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:2183:Leader@358] - LEADING - LEADER ELECTION TOOK - 14662
> 2014-04-30 04:07:25,756 [myid:3] - WARN [WorkerSender[myid=3]:QuorumCnxManager@382] - Cannot open channel to 2 at election address /10.0.0.0:3889
> java.net.SocketTimeoutException: connect timed out
>         at java.net.PlainSocketImpl.socketConnect(Native Method)
>         at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
>         at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
>         at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
>         at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
>         at java.net.Socket.connect(Socket.java:579)
>         at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:368)
>         at org.apache.zookeeper.server.quorum.QuorumCnxManager.toSend(QuorumCnxManager.java:341)
>         at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.process(FastLeaderElection.java:449)
>         at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.run(FastLeaderElection.java:430)
>         at java.lang.Thread.run(Thread.java:744)
> 2014-04-30 04:07:25,757 [myid:3] - INFO [WorkerReceiver[myid=3]:FastLeaderElection@597] - Notification: 1 (message format version), 3 (n.leader), 0xc00000001 (n.zxid), 0xb (n.round), LOOKING (n.state), 3 (n.sid), 0xd (n.peerEpoch) LEADING (my state)
>
> cheers
>
>
>
> On Wed, Apr 30, 2014 at 6:48 PM, Cameron McKenzie <mckenzie.cam@gmail.com> wrote:
>
> > hey Flavio,
> > Thanks for the quick reply.
> >
> > I'm running ZK 3.4.6. Having looked into the code a bit more, I 
> > think that I was slightly presumptuous about the root cause. The 
> > actual socket connects seem to be passing a timeout correctly, and 
> > based on the logs, I can see the timeouts on connect occurring.
> >
> > I can reproduce the issue on a VM running two instances of ZK. These
> > instances are configured in a 3 node cluster (with the 2 real ZK
> > instances, and one bogus IP address that will not resolve to anything
> > useful). Specifically, this bogus host is configured 2nd in the server
> > list. When I configured it third, the cluster would occasionally form a
> > quorum (though still not consistently). I've attached the config and
> > logs from both of the ZK instances.
> >
> > Any help would be much appreciated!
> > cheers
> >
> >
> >
> >
> > On Wed, Apr 30, 2014 at 6:09 PM, FPJ <fp...@yahoo.com> wrote:
> >
> >> Hi Cameron,
> >>
> >> Which version of ZK are you using? Also, if you can share logs, 
> >> then it might be easier for us to help you out.
> >>
> >> -Flavio
> >>


Re: ZOOKEEPER-900 / 901 / 1678

Posted by Cameron McKenzie <mc...@gmail.com>.
Thanks Flavio,
The length of the leader election does seem directly related to the presence
of this dead host in the configuration, though. If I remove the dead host
from the configuration, a quorum forms quickly. From the logs, the election
does appear to complete (after about 15 seconds in most cases), but another
election seems to start shortly afterwards.

Would it be helpful for me to provide debug level logs?
cheers


On Thu, May 1, 2014 at 8:04 AM, Flavio Junqueira <fp...@yahoo.com> wrote:

> Leader election seems to be taking a long time. The connection attempts
> from QuorumCnxManager are not causing a new round of leader election. What
> causes it is the absence of a quorum of supporters, so the elected leader
> is not getting enough servers to support it.
>
> -Flavio

RE: ZOOKEEPER-900 / 901 / 1678

Posted by Flavio Junqueira <fp...@yahoo.com>.
Leader election seems to be taking a long time. The connection attempts from QuorumCnxManager are not causing a new round of leader election. What causes it is the absence of a quorum of supporters, so the elected leader is not getting enough servers to support it.

-Flavio
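
The quorum arithmetic behind this point can be sketched as follows. This is a minimal illustration that assumes the default majority rule (more than half of the configured servers must support the leader); the class and method names are hypothetical and are not ZooKeeper's own code.

```java
class QuorumSketch {
    /** Smallest number of servers that is a strict majority of the ensemble. */
    static int majority(int ensembleSize) {
        return ensembleSize / 2 + 1;
    }

    /** True if the number of supporters (leader included) is a strict majority. */
    static boolean hasQuorum(int ensembleSize, int supporters) {
        return supporters >= majority(ensembleSize);
    }

    public static void main(String[] args) {
        // Three configured servers, one of which is dead.
        System.out.println(majority(3));      // majority needed out of 3
        System.out.println(hasQuorum(3, 1));  // elected leader on its own
        System.out.println(hasQuorum(3, 2));  // leader plus one live follower
    }
}
```

With three configured servers the leader needs one follower besides itself, so if the only other live server does not connect in time, the leader has no quorum of supporters and a new election starts.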

-----Original Message-----
From: Cameron McKenzie [mailto:mckenzie.cam@gmail.com] 
Sent: Wednesday, April 30, 2014 10:36 PM
To: user@zookeeper.apache.org
Subject: Re: ZOOKEEPER-900 / 901 / 1678

I've done a bit more testing this morning, and it appears that the leader election is actually completing, but just after it completes, the connection attempt to the dead host times out, and this seems to trigger another leader election. The same thing then happens on the next election, and so on.

2014-04-30 04:07:25,383 [myid:3] - INFO [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:2183:Leader@358] - LEADING - LEADER ELECTION TOOK - 14662
2014-04-30 04:07:25,756 [myid:3] - WARN [WorkerSender[myid=3]:QuorumCnxManager@382] - Cannot open channel to 2 at election address /10.0.0.0:3889
java.net.SocketTimeoutException: connect timed out
        at java.net.PlainSocketImpl.socketConnect(Native Method)
        at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
        at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
        at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
        at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
        at java.net.Socket.connect(Socket.java:579)
        at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:368)
        at org.apache.zookeeper.server.quorum.QuorumCnxManager.toSend(QuorumCnxManager.java:341)
        at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.process(FastLeaderElection.java:449)
        at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.run(FastLeaderElection.java:430)
        at java.lang.Thread.run(Thread.java:744)
2014-04-30 04:07:25,757 [myid:3] - INFO [WorkerReceiver[myid=3]:FastLeaderElection@597] - Notification: 1 (message format version), 3 (n.leader), 0xc00000001 (n.zxid), 0xb (n.round), LOOKING (n.state), 3 (n.sid), 0xd (n.peerEpoch) LEADING (my state)

cheers





Re: ZOOKEEPER-900 / 901 / 1678

Posted by Cameron McKenzie <mc...@gmail.com>.
I've done a bit more testing this morning, and it appears that the leader
election is actually completing, but just after it completes, the
connection attempt to the dead host times out, and this seems to trigger
another leader election. The same thing then happens on the next election,
and so on.

2014-04-30 04:07:25,383 [myid:3] - INFO [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:2183:Leader@358] - LEADING - LEADER ELECTION TOOK - 14662
2014-04-30 04:07:25,756 [myid:3] - WARN [WorkerSender[myid=3]:QuorumCnxManager@382] - Cannot open channel to 2 at election address /10.0.0.0:3889
java.net.SocketTimeoutException: connect timed out
        at java.net.PlainSocketImpl.socketConnect(Native Method)
        at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
        at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
        at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
        at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
        at java.net.Socket.connect(Socket.java:579)
        at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:368)
        at org.apache.zookeeper.server.quorum.QuorumCnxManager.toSend(QuorumCnxManager.java:341)
        at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.process(FastLeaderElection.java:449)
        at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.run(FastLeaderElection.java:430)
        at java.lang.Thread.run(Thread.java:744)
2014-04-30 04:07:25,757 [myid:3] - INFO [WorkerReceiver[myid=3]:FastLeaderElection@597] - Notification: 1 (message format version), 3 (n.leader), 0xc00000001 (n.zxid), 0xb (n.round), LOOKING (n.state), 3 (n.sid), 0xd (n.peerEpoch) LEADING (my state)

cheers
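
The stack trace above shows the WorkerSender thread sitting inside Socket.connect() when the timeout fires. As a rough way to reason about what that can cost, here is a toy model, not ZooKeeper's code: it assumes a single sender thread that handles outgoing messages strictly in order and blocks for the full connect time on each peer. The class name, the method name and the 10 ms figure for live peers are hypothetical; the 5000 ms figure is the connect timeout mentioned earlier in this thread.

```java
import java.util.Arrays;

class SenderBlockingSketch {
    /**
     * Models one sender thread that works through its peers in order.
     * connectMillis[i] is how long the blocking connect to peer i takes.
     * Returns, per peer, the elapsed milliseconds before its message is handled.
     */
    static long[] deliveryDelays(long[] connectMillis) {
        long[] delays = new long[connectMillis.length];
        long elapsed = 0;
        for (int i = 0; i < connectMillis.length; i++) {
            elapsed += connectMillis[i]; // the sender waits here before moving on
            delays[i] = elapsed;
        }
        return delays;
    }

    public static void main(String[] args) {
        // Dead peer (5000 ms connect timeout) queued second, then queued last.
        long[] deadSecond = deliveryDelays(new long[] {10, 5000, 10});
        long[] deadLast = deliveryDelays(new long[] {10, 10, 5000});
        System.out.println(Arrays.toString(deadSecond)); // [10, 5010, 5020]
        System.out.println(Arrays.toString(deadLast));   // [10, 20, 5020]
    }
}
```

In this model every message queued behind the dead peer waits out the whole timeout, while a message queued ahead of it goes out almost immediately. Whether the real sender queues messages in that order is an assumption that would need checking against the ZooKeeper source.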



On Wed, Apr 30, 2014 at 6:48 PM, Cameron McKenzie <mc...@gmail.com> wrote:

> hey Flavio,
> Thanks for the quick reply.
>
> I'm running ZK 3.4.6. Having looked into the code a bit more, I think that
> I was slightly presumptuous about the root cause. The actual socket
> connects seem to be passing a timeout correctly, and based on the logs, I
> can see the timeouts on connect occurring.
>
> I can reproduce the issue on a VM running two instances of ZK. These
> instances are configured in a 3 node cluster (with the 2 real ZK instances,
> and one bogus IP address that will not resolve to anything useful).
> Specifically, this bogus host is configured 2nd in the server list. When I
> configured it third, the cluster would occasionally form a quorum (though
> still not consistently). I've attached the config and logs from both of the
> ZK instances.
>
> Any help would be much appreciated!
> cheers

Re: ZOOKEEPER-900 / 901 / 1678

Posted by Cameron McKenzie <mc...@gmail.com>.
hey Flavio,
Thanks for the quick reply.

I'm running ZK 3.4.6. Having looked into the code a bit more, I think that
I was slightly presumptuous about the root cause. The actual socket
connects seem to be passing a timeout correctly, and based on the logs, I
can see the timeouts on connect occurring.
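
For reference, this is roughly what a blocking connect with an explicit timeout looks like in plain Java. It is a minimal sketch rather than the QuorumCnxManager code: the class name and the tryConnect helper are hypothetical, and only the java.net calls are real. The connect timeout is passed to Socket.connect() directly, which is separate from SO_TIMEOUT (setSoTimeout), which bounds blocking reads.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;

class ConnectTimeoutSketch {
    /** Attempts a blocking TCP connect bounded by timeoutMillis; returns true on success. */
    static boolean tryConnect(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            // Blocks until connected, refused, or timeoutMillis has elapsed.
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // SocketTimeoutException is an IOException, so a timeout lands here too.
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        // A listening local socket on an ephemeral port: the connect should succeed.
        try (ServerSocket server = new ServerSocket(0)) {
            System.out.println(tryConnect("127.0.0.1", server.getLocalPort(), 5000));
        }
        // The dead election address seen in the logs elsewhere in this thread;
        // this call may block for up to the full 5000 ms before giving up.
        System.out.println(tryConnect("10.0.0.0", 3889, 5000));
    }
}
```

The calling thread is still blocked for up to the timeout on every attempt against an unreachable host, so a bounded connect limits the wait but does not remove it.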

I can reproduce the issue on a VM running two instances of ZK. These
instances are configured in a 3 node cluster (with the 2 real ZK instances,
and one bogus IP address that will not resolve to anything useful).
Specifically, this bogus host is configured 2nd in the server list. When I
configured it third, the cluster would occasionally form a quorum (though
still not consistently). I've attached the config and logs from both of the
ZK instances.

Any help would be much appreciated!
cheers




On Wed, Apr 30, 2014 at 6:09 PM, FPJ <fp...@yahoo.com> wrote:

> Hi Cameron,
>
> Which version of ZK are you using? Also, if you can share logs, then it
> might be easier for us to help you out.
>

RE: ZOOKEEPER-900 / 901 / 1678

Posted by FPJ <fp...@yahoo.com>.
Hi Cameron,

Which version of ZK are you using? Also, if you can share logs, then it might be easier for us to help you out.

-Flavio

> -----Original Message-----
> From: Cameron McKenzie [mailto:mckenzie.cam@gmail.com]
> Sent: 30 April 2014 08:44
> To: zookeeper-user@hadoop.apache.org
> Subject: ZOOKEEPER-900 / 901 / 1678
> 
> ZooKeeper users,
> Does anyone know the status of these issues? They don't seem to have had
> anything done to them since late 2010?
> 
> I think that we're experiencing the same issue currently. If we have a 3 node
> cluster for example, and 1 of these nodes is completely dead (i.e the entire
> host is not contactable due to a power outage), I would expect that a
> quorum could still be formed, but this does not appear to be the case.
> 
> I haven't delved into the code too much, but it appears that blocking IO is
> being used for the connect. This doesn't respect the socket SO timeout being
> set, so it means that the connect() call can block for some arbitrary amount of
> time (based on the OS level TCP settings?). This in turn means that leader
> election will fail because it times out before the socket connect does, even
> though there are enough live hosts present to form a quorum.
> 
> This seems like a fairly fundamental problem, unless I'm missing something.
> If a single host goes down due to a power failure for example, it can prevent
> any further hosts joining the cluster. In addition, if after a power failure,
> enough hosts come back online to form a quorum, but some don't, that a
> quorum may still not be able to be formed.
> cheers
> Cam