Posted to user@zookeeper.apache.org by Raghav <ra...@gmail.com> on 2018/05/11 22:11:45 UTC

Help Needed: Leadership Issue upon ZK Restart (ZooKeeper 3.4.9)

Hi

We have a 3-node ZooKeeper ensemble and a 3-node Kafka cluster. Both are
hosted on the same 3 VMs.

Before Restart
1. We were on Kafka 0.10.2.1

After Restart
1. We moved to Kafka 1.1

We observe that the Kafka brokers report leadership issues, and for a lot of
partitions the leader is -1. I see some logs in ZK that mainly point towards a
connectivity issue around restart time.

*We have been stuck on this one for a while now, and a rolling restart of
ZK is not helping either. Can you please help, or point us to how we can debug this?*

2018-05-11_17:20:49.00305 2018-05-11 17:20:49,002 [myid:1] - INFO [WorkerReceiver[myid=1]:FastLeaderElection@600] - Notification: 1 (message format version), 1 (n.leader), 0x200000112 (n.zxid), 0x1 (n.round), LOOKING (n.state), 1 (n.sid), 0x2 (n.peerEpoch) LOOKING (my state)
2018-05-11_17:20:49.01201 2018-05-11 17:20:49,010 [myid:1] - WARN [WorkerSender[myid=1]:QuorumCnxManager@400] - Cannot open channel to 2 at election address /1.1.1.143:3888
2018-05-11_17:20:49.01203 java.net.ConnectException: Connection refused
2018-05-11_17:20:49.01203       at java.net.PlainSocketImpl.socketConnect(Native Method)
2018-05-11_17:20:49.01203       at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:345)
2018-05-11_17:20:49.01203       at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
2018-05-11_17:20:49.01204       at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
2018-05-11_17:20:49.01204       at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
2018-05-11_17:20:49.01204       at java.net.Socket.connect(Socket.java:589)
2018-05-11_17:20:49.01204       at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:381)
2018-05-11_17:20:49.01204       at org.apache.zookeeper.server.quorum.QuorumCnxManager.toSend(QuorumCnxManager.java:354)
2018-05-11_17:20:49.01205       at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.process(FastLeaderElection.java:452)
2018-05-11_17:20:49.01205       at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.run(FastLeaderElection.java:433)
2018-05-11_17:20:49.01206       at java.lang.Thread.run(Thread.java:745)
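
For anyone reading the notification line in the log above, here is a minimal Python sketch for pulling the n.* fields out of it. It relies on the fact that a ZooKeeper zxid is a 64-bit value whose high 32 bits are the epoch and whose low 32 bits are a counter. The function names (parse_notification, split_zxid) are made up for illustration and are not part of ZooKeeper; the regex only assumes the field labels exactly as they appear in the quoted 3.4.9 line.

```python
import re

# Matches "<value> (<label>)" pairs for the labels seen in the quoted log line.
FIELD_RE = re.compile(
    r"(\S+) \((n\.leader|n\.zxid|n\.round|n\.state|n\.sid|n\.peerEpoch)\)"
)


def parse_notification(line):
    """Return a dict mapping each n.* label found in the line to its raw value."""
    return {label: value for value, label in FIELD_RE.findall(line)}


def split_zxid(zxid_hex):
    """Split a zxid into (epoch, counter): high 32 bits and low 32 bits."""
    zxid = int(zxid_hex, 16)
    return zxid >> 32, zxid & 0xFFFFFFFF


if __name__ == "__main__":
    # The notification text from the log above, joined onto one line.
    line = (
        "Notification: 1 (message format version), 1 (n.leader), "
        "0x200000112 (n.zxid), 0x1 (n.round), LOOKING (n.state), "
        "1 (n.sid), 0x2 (n.peerEpoch) LOOKING (my state)"
    )
    fields = parse_notification(line)
    epoch, counter = split_zxid(fields["n.zxid"])
    print(fields["n.leader"], fields["n.state"], epoch, counter)
```

For the line above, the zxid 0x200000112 splits into epoch 2 and counter 0x112, which agrees with the reported n.peerEpoch of 0x2. Running the same parse over the notifications from all three servers may make it easier to compare which leader each one is voting for.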


Raghav

Re: Help Needed: Leadership Issue upon ZK Restart (ZooKeeper 3.4.9)

Posted by Prasanth Mathialagan <pr...@gmail.com>.

If you can provide the logs from all the servers in the ensemble, it might
be helpful.
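
Alongside the logs, it may help to record what each server reports about its own role. The sketch below sends the "srvr" four-letter command to each server's client port and extracts the "Mode:" line (leader, follower, or standalone). Port 2181 is only the conventional default and is an assumption here, as are the helper names and any hostnames you pass in; none of them come from this thread.

```python
import socket


def four_letter_word(host, port, cmd="srvr", timeout=3.0):
    """Send a four-letter command to a ZooKeeper client port and return the reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(cmd.encode("ascii"))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # server closes the connection after replying
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def server_mode(reply):
    """Return the value of the 'Mode:' line in a srvr reply, or None if absent."""
    for line in reply.splitlines():
        if line.startswith("Mode:"):
            return line.split(":", 1)[1].strip()
    return None


def ensemble_modes(hosts, port=2181):
    """Map each host to its reported mode, or to an 'unreachable' note."""
    modes = {}
    for host in hosts:
        try:
            modes[host] = server_mode(four_letter_word(host, port))
        except OSError as exc:  # covers refused connections and timeouts
            modes[host] = "unreachable: %s" % exc
    return modes
```

Calling ensemble_modes with the three VM addresses should ideally show one leader and two followers. A value of None means the server answered but sent no "Mode:" line, which is what I would expect from a server that has not yet joined a quorum, though that interpretation is an assumption worth checking against the raw reply.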

On Sun, May 13, 2018, 11:41 PM Raghav <ra...@gmail.com> wrote:

> Yes, it seems reachable. After the logs that I pasted, it seems to have
> printed logs that show it is connected.
>
> Anything else I could be missing ? Thanks.
>
> On Fri, May 11, 2018 at 4:35 PM, Prasanth Mathialagan <
> prasanthmathialagan@gmail.com> wrote:
>
> > Is 1.1.1.143:3888 reachable from the host in which you see this error?
>
> --
> Raghav
>

Re: Help Needed: Leadership Issue upon ZK Restart (ZooKeeper 3.4.9)

Posted by Raghav <ra...@gmail.com>.

Yes, it seems reachable. After the log lines that I pasted, the server seems to
have printed further lines showing that it is connected.

Anything else I could be missing? Thanks.

On Fri, May 11, 2018 at 4:35 PM, Prasanth Mathialagan <
prasanthmathialagan@gmail.com> wrote:

> Is 1.1.1.143:3888 reachable from the host in which you see this error?



-- 
Raghav

Re: Help Needed: Leadership Issue upon ZK Restart (ZooKeeper 3.4.9)

Posted by Prasanth Mathialagan <pr...@gmail.com>.

Is 1.1.1.143:3888 reachable from the host on which you see this error?
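
One way to check this without extra tooling is a plain TCP connect from the affected host, sketched below in Python. The can_connect helper is hypothetical, written for this thread. "Connection refused" generally means the host answered but nothing was listening on that port at that moment, which could simply be because the peer was itself restarting; a timeout points more towards a firewall or routing problem.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Try a TCP connection; return (True, None) or (False, reason)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, None
    except ConnectionRefusedError:
        # The host replied, but no process is listening on this port.
        return False, "refused"
    except socket.timeout:
        # No reply at all within the timeout.
        return False, "timeout"
    except OSError as exc:
        # Anything else, e.g. no route to host or name resolution failure.
        return False, str(exc)


if __name__ == "__main__":
    # Address and election port taken from the WARN line in the original post.
    print(can_connect("1.1.1.143", 3888))
```

Running this from each of the three VMs against the other two, a few times around a restart, should show whether the refusal is transient or persistent.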
