Posted to users@kafka.apache.org by "Ahmed H." <ah...@gmail.com> on 2014/01/22 15:24:17 UTC

Kafka rebalancing causes Zookeeper to fail

I have a basic Zookeeper/Kafka setup. I am still on Kafka 0.8 beta 1, and
Zookeeper 3.4.5. The activity on this machine isn't massive...I would say
the Kafka queues get a consistent 1 message every 2-3 seconds, as well as
occasional spikes, but still nothing large enough to push the limits. Both
Kafka and Zookeeper are running on the same machine.

Occasionally, a rebalance is triggered, which causes our Kafka clients to
try reconnecting several times, but the rebalance ultimately fails with the
following error:


04:56:10,020 INFO  [kafka.consumer.ZookeeperConsumerConnector] (alarms.topology.updates_<host>-1383643783747-c7775701_watcher_executor) [alarms.topology.updates_<host>-1383643783747-c7775701], exception during rebalance : org.I0Itec.zkclient.exception.ZkNoNodeException: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /consumers/alarms.topology.updates/ids/alarms.topology.updates_<host>-1383643783747-c7775701
	at org.I0Itec.zkclient.exception.ZkException.create(ZkException.java:47) [zkclient-0.3.jar:0.3]
	at org.I0Itec.zkclient.ZkClient.retryUntilConnected(ZkClient.java:685) [zkclient-0.3.jar:0.3]
	at org.I0Itec.zkclient.ZkClient.readData(ZkClient.java:766) [zkclient-0.3.jar:0.3]
	at org.I0Itec.zkclient.ZkClient.readData(ZkClient.java:761) [zkclient-0.3.jar:0.3]
	at kafka.utils.ZkUtils$.readData(ZkUtils.scala:407) [kafka_2.9.2-0.8.0-SNAPSHOT.jar:0.8.0-SNAPSHOT]
	at kafka.consumer.TopicCount$.constructTopicCount(TopicCount.scala:52) [kafka_2.9.2-0.8.0-SNAPSHOT.jar:0.8.0-SNAPSHOT]
	at kafka.consumer.ZookeeperConsumerConnector$ZKRebalancerListener.kafka$consumer$ZookeeperConsumerConnector$ZKRebalancerListener$$rebalance(ZookeeperConsumerConnector.scala:401) [kafka_2.9.2-0.8.0-SNAPSHOT.jar:0.8.0-SNAPSHOT]
	at kafka.consumer.ZookeeperConsumerConnector$ZKRebalancerListener$$anonfun$syncedRebalance$1.apply$mcVI$sp(ZookeeperConsumerConnector.scala:374) [kafka_2.9.2-0.8.0-SNAPSHOT.jar:0.8.0-SNAPSHOT]
	at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:78) [scala-library-2.9.2.jar:]
	at kafka.consumer.ZookeeperConsumerConnector$ZKRebalancerListener.syncedRebalance(ZookeeperConsumerConnector.scala:369) [kafka_2.9.2-0.8.0-SNAPSHOT.jar:0.8.0-SNAPSHOT]
	at kafka.consumer.ZookeeperConsumerConnector$ZKRebalancerListener$$anon$1.run(ZookeeperConsumerConnector.scala:326) [kafka_2.9.2-0.8.0-SNAPSHOT.jar:0.8.0-SNAPSHOT]
Caused by: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /consumers/alarms.topology.updates/ids/alarms.topology.updates_<host>-1383643783747-c7775701
	at org.apache.zookeeper.KeeperException.create(KeeperException.java:111) [zookeeper-3.4.3.jar:3.4.3-1240972]
	at org.apache.zookeeper.KeeperException.create(KeeperException.java:51) [zookeeper-3.4.3.jar:3.4.3-1240972]
	at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1131) [zookeeper-3.4.3.jar:3.4.3-1240972]
	at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1160) [zookeeper-3.4.3.jar:3.4.3-1240972]
	at org.I0Itec.zkclient.ZkConnection.readData(ZkConnection.java:103) [zkclient-0.3.jar:0.3]
	at org.I0Itec.zkclient.ZkClient$9.call(ZkClient.java:770) [zkclient-0.3.jar:0.3]
	at org.I0Itec.zkclient.ZkClient$9.call(ZkClient.java:766) [zkclient-0.3.jar:0.3]
	at org.I0Itec.zkclient.ZkClient.retryUntilConnected(ZkClient.java:675) [zkclient-0.3.jar:0.3]
	... 9 more


Our Kafka consumers are written in Clojure
(https://github.com/pingles/clj-kafka).

Any ideas on what can cause such behaviour? The rebalances themselves
happen sporadically, but when they do, they sometimes fail and an error
like the one above is shown. I'm not sure if this is a Kafka or Zookeeper
problem at this point, but any help would be appreciated.

Thanks

Re: Kafka rebalancing causes Zookeeper to fail

Posted by Jun Rao <ju...@gmail.com>.
I meant zookeeper.session.timeout.ms in the consumer config.
Thanks,
Jun
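
For readers wondering where that setting goes, here is a minimal sketch in Java, assuming the 0.8 high-level consumer and its property names. The ZooKeeper address (localhost:2181) and the 30-second value are placeholders picked for illustration, not values recommended in this thread, and the class and method names are hypothetical. The group id is taken from the znode path in the error above.

```java
import java.util.Properties;

// Hypothetical helper; the class and method names are illustrative only.
final class ConsumerTimeoutSketch {

    // Builds consumer properties with an explicit ZooKeeper session timeout.
    static Properties buildConsumerProps(String zkConnect, String groupId, int sessionTimeoutMs) {
        Properties props = new Properties();
        props.setProperty("zookeeper.connect", zkConnect);
        props.setProperty("group.id", groupId);
        // The setting named in the reply above: how long ZooKeeper tolerates
        // a silent client before it expires the session.
        props.setProperty("zookeeper.session.timeout.ms", Integer.toString(sessionTimeoutMs));
        return props;
    }

    public static void main(String[] args) {
        // Assumed values: ZooKeeper on the same machine, 30 s timeout.
        Properties props = buildConsumerProps("localhost:2181", "alarms.topology.updates", 30000);
        // In a real consumer these properties would be handed to kafka.consumer.ConsumerConfig.
        System.out.println("zookeeper.session.timeout.ms=" + props.getProperty("zookeeper.session.timeout.ms"));
    }
}
```

A larger timeout makes the session survive longer pauses, at the cost of ZooKeeper taking longer to notice a consumer that has really died.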


On Thu, Jan 23, 2014 at 11:24 AM, Ahmed H. <ah...@gmail.com> wrote:

> When you say "use a larger session timeout", which session timeout do you
> refer to? Is it the zookeeper session timeout variable that we define when
> creating a Kafka consumer? Or is there a different session timeout?
>
> As for downgrading, that is not an option for the time being, so
> I will need better debugging tools to pinpoint the cause.
>
> Thanks

Re: Kafka rebalancing causes Zookeeper to fail

Posted by "Ahmed H." <ah...@gmail.com>.
When you say "use a larger session timeout", which session timeout do you
refer to? Is it the zookeeper session timeout variable that we define when
creating a Kafka consumer? Or is there a different session timeout?

As for downgrading, that is not an option for the time being, so
I will need better debugging tools to pinpoint the cause.

Thanks


On Wed, Jan 22, 2014 at 11:44 PM, Jun Rao <ju...@gmail.com> wrote:

> You can find some of the GC settings in
> https://cwiki.apache.org/confluence/display/KAFKA/Operations
>
> There were some ZK bugs exposed during session expiration, which were fixed
> in 3.3.4. Not sure if 3.4.5 exposes any new issues. The easiest thing is
> probably to avoid GC-induced ZK session timeout in the first place or use a
> larger session timeout.
>
> Thanks,
>
> Jun

Re: Kafka rebalancing causes Zookeeper to fail

Posted by Jun Rao <ju...@gmail.com>.
You can find some of the GC settings in
https://cwiki.apache.org/confluence/display/KAFKA/Operations

There were some ZK bugs exposed during session expiration, which were fixed
in 3.3.4. Not sure if 3.4.5 exposes any new issues. The easiest thing is
probably to avoid GC-induced ZK session timeout in the first place or use a
larger session timeout.

Thanks,

Jun
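
One way to check whether the consumer JVM actually stalls long enough to matter is a small pause probe. The sketch below is an illustration rather than anything from Kafka or ZooKeeper: a thread sleeps for a short interval and measures how much longer than requested the sleep took. A large overshoot suggests the whole JVM was paused (for example by GC). The 6000 ms session timeout is an assumed value to be replaced with whatever the consumer is configured with, and the class name is hypothetical.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical probe; it would need to run inside the consumer JVM to be meaningful.
final class PauseProbe {

    // How much longer than requested the sleep took, in milliseconds (never negative).
    static long overshootMillis(long requestedMs, long startNanos, long endNanos) {
        long elapsedMs = TimeUnit.NANOSECONDS.toMillis(endNanos - startNanos);
        return Math.max(0L, elapsedMs - requestedMs);
    }

    public static void main(String[] args) throws InterruptedException {
        long intervalMs = 100;
        long sessionTimeoutMs = 6000; // assumption: substitute the configured value
        for (int i = 0; i < 20; i++) {
            long start = System.nanoTime();
            Thread.sleep(intervalMs);
            long stall = overshootMillis(intervalMs, start, System.nanoTime());
            // Report stalls that consume at least half of the session timeout.
            if (stall >= sessionTimeoutMs / 2) {
                System.out.println("stall of " + stall + " ms observed");
            }
        }
        System.out.println("probe finished");
    }
}
```

If reported stalls line up with the timestamps of the rebalance errors, that points toward pauses on the client side rather than a ZooKeeper server problem.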


On Wed, Jan 22, 2014 at 8:29 AM, Ahmed H. <ah...@gmail.com> wrote:

> Hello,
>
> I looked at that, not sure if it is applicable or not at this point. We
> used to have frequent rebalances, but that issue was mitigated by
> increasing the zktimeout on the consumer side. With that said, it may still
> be a problem. I haven't collected any metrics concerning rebalances in a
> while. I will certainly take a look at our current GC settings. What are
> typical settings that we should have for GC (I am not sure of what exactly
> I'm looking for)?
>
> As for downgrading the Zookeeper version, would there be any major loss of
> functionality? Version 3.4.5 is currently stable, so I am unsure of how it
> would help. I can try it and let it soak for a while to see if it helps or
> not. The problem is we have many components that tie into Zookeeper and I'm
> worried that downgrading may break some of our API calls to it.
>
> Is there a good way of trying to narrow this problem down further?
>
> Thanks again

Re: Kafka rebalancing causes Zookeeper to fail

Posted by "Ahmed H." <ah...@gmail.com>.
Hello,

I looked at that, not sure if it is applicable or not at this point. We
used to have frequent rebalances, but that issue was mitigated by
increasing the zktimeout on the consumer side. With that said, it may still
be a problem. I haven't collected any metrics concerning rebalances in a
while. I will certainly take a look at our current GC settings. What are
typical settings that we should have for GC (I am not sure of what exactly
I'm looking for)?

As for downgrading the Zookeeper version, would there be any major loss of
functionality? Version 3.4.5 is currently stable, so I am unsure of how it
would help. I can try it and let it soak for a while to see if it helps or
not. The problem is we have many components that tie into Zookeeper and I'm
worried that downgrading may break some of our API calls to it.

Is there a good way of trying to narrow this problem down further?

Thanks again
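
As a starting point for looking at GC activity, the JVM's own management beans can be read from inside the process. This is a minimal sketch using the standard java.lang.management API. The class name is hypothetical, and the totals are cumulative, so they cannot show a single long pause by themselves; sampling them periodically and comparing the deltas around a failed rebalance is one assumed way to use them.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical helper; prints cumulative GC counts and times for the current JVM.
final class GcSummary {

    // Total time spent in GC across all collectors, in milliseconds.
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime(); // -1 when the collector does not report it
            if (t > 0) {
                total += t;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.println(gc.getName()
                    + ": collections=" + gc.getCollectionCount()
                    + ", timeMs=" + gc.getCollectionTime());
        }
        System.out.println("total GC ms: " + totalGcMillis());
    }
}
```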


On Wed, Jan 22, 2014 at 10:15 AM, Jun Rao <ju...@gmail.com> wrote:

> Not sure how stable ZK 3.4.5 is. Could you try 3.3.4? Also, see if
> https://cwiki.apache.org/confluence/display/KAFKA/FAQ#FAQ-Whyaretheremanyrebalancesinmyconsumerlog?
> is applicable.
>
> Thanks,
>
> Jun

Re: Kafka rebalancing causes Zookeeper to fail

Posted by Jun Rao <ju...@gmail.com>.
Not sure how stable ZK 3.4.5 is. Could you try 3.3.4? Also, see if
https://cwiki.apache.org/confluence/display/KAFKA/FAQ#FAQ-Whyaretheremanyrebalancesinmyconsumerlog?
is applicable.

Thanks,

Jun

