Posted to users@kafka.apache.org by Ashwin Jayaprakash <as...@gmail.com> on 2015/03/08 18:03:08 UTC

Re: Repeated failures due to ConsumerRebalanceFailedException

Apologies for not replying sooner. (I have a digest subscription and cannot
reply directly to the email chain I started, so I had not received any
direct replies either.)

@Mayuresh and @Jiangjie, there was nothing very indicative in the Kafka
logs.

But I'm writing to tell you that the issue was "resolved". It was a side
effect of an unrelated ZK path of ours that had more than 14K child nodes.
Curator was trying to sync that part of the ZK tree while Kafka was trying
to rebalance consumer subscriptions. We cleared that sub-tree and the Kafka
consumers immediately booted up without a fuss. So the retries and backoff
interval did not help in this case.

So, the moral of the story is that there can be seemingly unrelated
components in the system that affect each other. This sub-tree turned out
to have 14K+ elements and was around 1.3MB including all the children. One
clue was that the Curator path cache was taking more than a minute to sync
that part of the tree, which suggested that the culprit might be outside
the Kafka code.
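
For anyone who runs into something similar, a quick way to spot an
oversized sub-tree is to count its children and time the initial cache
build. Below is a rough sketch of how that could look with Curator; the
connect string, retry policy, and the "/ourapp/registry" path are
placeholders, not our actual values.

import java.util.List;
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.cache.PathChildrenCache;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class ZkSubTreeCheck {
    public static void main(String[] args) throws Exception {
        // Placeholder connect string and retry policy - substitute your own.
        CuratorFramework client = CuratorFrameworkFactory.newClient(
                "zk1:2181", new ExponentialBackoffRetry(1000, 3));
        client.start();

        // Count the children of the suspect path and add up their payload sizes.
        String path = "/ourapp/registry";  // placeholder path
        List<String> children = client.getChildren().forPath(path);
        long totalBytes = 0;
        for (String child : children) {
            byte[] data = client.getData().forPath(path + "/" + child);
            totalBytes += (data == null) ? 0 : data.length;
        }
        System.out.println(children.size() + " children, ~" + totalBytes + " bytes");

        // Timing the initial PathChildrenCache build gives the same hint;
        // ours took well over a minute on a 14K-child, ~1.3MB sub-tree.
        long start = System.currentTimeMillis();
        PathChildrenCache cache = new PathChildrenCache(client, path, true);
        cache.start(PathChildrenCache.StartMode.BUILD_INITIAL_CACHE);
        System.out.println("Cache build took "
                + (System.currentTimeMillis() - start) + " ms");

        cache.close();
        client.close();
    }
}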

Thanks!


On Thu, Feb 26, 2015 at 7:41 PM, Ashwin Jayaprakash <
ashwin.jayaprakash@gmail.com> wrote:

> Just to give you some more debugging context, we noticed that the "consumers"
> path becomes empty after all the JVMs have exited because of this error.
> So, when we restart, there are no visible entries in ZK.
>
> On Thu, Feb 26, 2015 at 6:04 PM, Ashwin Jayaprakash <
> ashwin.jayaprakash@gmail.com> wrote:
>
>> Hello, we have a set of JVMs that consume messages from Kafka topics.
>> Each JVM creates 4 ConsumerConnectors that are used by 4 separate threads.
>> These JVMs also create and use Curator's PathChildrenCache to watch and
>> keep a sub-tree of ZooKeeper in sync with the other JVMs. This path has
>> several thousand child elements.
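>>
>> Roughly, the per-JVM setup looks like the sketch below. The class name,
>> topic filter, ZK path, and connect string are simplified illustrations,
>> not our actual code.
>>
>> import java.util.ArrayList;
>> import java.util.List;
>> import java.util.Properties;
>> import kafka.consumer.Consumer;
>> import kafka.consumer.ConsumerConfig;
>> import kafka.consumer.Whitelist;
>> import kafka.javaapi.consumer.ConsumerConnector;
>> import org.apache.curator.framework.CuratorFramework;
>> import org.apache.curator.framework.CuratorFrameworkFactory;
>> import org.apache.curator.framework.recipes.cache.PathChildrenCache;
>> import org.apache.curator.retry.ExponentialBackoffRetry;
>>
>> public class ConsumerJvmSketch {
>>     public static void main(String[] args) throws Exception {
>>         Properties props = new Properties();
>>         props.put("zookeeper.connect", "zk1:2181");  // placeholder
>>         props.put("group.id", "group1");
>>
>>         // 4 independent ConsumerConnectors, each driven by its own thread.
>>         List<ConsumerConnector> connectors = new ArrayList<ConsumerConnector>();
>>         for (int i = 0; i < 4; i++) {
>>             final ConsumerConnector connector =
>>                 Consumer.createJavaConsumerConnector(new ConsumerConfig(props));
>>             connectors.add(connector);
>>             new Thread(new Runnable() {
>>                 public void run() {
>>                     // Each thread sets up wildcard streams (iteration omitted here).
>>                     connector.createMessageStreamsByFilter(new Whitelist("topic.*"));
>>                 }
>>             }).start();
>>         }
>>
>>         // The same JVM also keeps a large ZK sub-tree in sync via Curator.
>>         CuratorFramework client = CuratorFrameworkFactory.newClient(
>>                 "zk1:2181", new ExponentialBackoffRetry(1000, 3));
>>         client.start();
>>         PathChildrenCache cache =
>>             new PathChildrenCache(client, "/ourapp/registry", true);
>>         cache.start(PathChildrenCache.StartMode.BUILD_INITIAL_CACHE);
>>     }
>> }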
>>
>> Everything was working perfectly until one day we decided to restart
>> these JVMs. We restart them to roll in new code every few weeks or so and
>> have bounced them before with no issues, but this time the Kafka consumers
>> on these JVMs were suddenly unable to rebalance partitions among
>> themselves.
>>
>> The exception:
>> Caused by: kafka.common.ConsumerRebalanceFailedException:
>> group1-system01-27422-kafka-787 can't rebalance after 12 retries
>>     at kafka.consumer.ZookeeperConsumerConnector$ZKRebalancerListener.syncedRebalance(ZookeeperConsumerConnector.scala:432)
>>     at kafka.consumer.ZookeeperConsumerConnector.kafka$consumer$ZookeeperConsumerConnector$$reinitializeConsumer(ZookeeperConsumerConnector.scala:722)
>>     at kafka.consumer.ZookeeperConsumerConnector$WildcardStreamsHandler.<init>(ZookeeperConsumerConnector.scala:756)
>>     at kafka.consumer.ZookeeperConsumerConnector.createMessageStreamsByFilter(ZookeeperConsumerConnector.scala:145)
>>     at kafka.javaapi.consumer.ZookeeperConsumerConnector.createMessageStreamsByFilter(ZookeeperConsumerConnector.scala:96)
>>     at kafka.javaapi.consumer.ZookeeperConsumerConnector.createMessageStreamsByFilter(ZookeeperConsumerConnector.scala:100)
>>
>> We then set rebalance.max.retries=16 and rebalance.backoff.ms=10000.
>> I've seen the Spark-Kafka issue
>> https://issues.apache.org/jira/browse/SPARK-5505 and Jun's
>> recommendation to increase the backoff property.
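>>
>> Concretely, the properties go onto the consumer config roughly like this
>> (a minimal sketch; the class name and connect string are placeholders):
>>
>> import java.util.Properties;
>> import kafka.consumer.Consumer;
>> import kafka.consumer.ConsumerConfig;
>> import kafka.javaapi.consumer.ConsumerConnector;
>>
>> public class RebalanceTuning {
>>     public static void main(String[] args) {
>>         Properties props = new Properties();
>>         props.put("zookeeper.connect", "zk1:2181");  // placeholder
>>         props.put("group.id", "group1");
>>         props.put("rebalance.max.retries", "16");
>>         props.put("rebalance.backoff.ms", "10000");
>>         ConsumerConnector connector =
>>             Consumer.createJavaConsumerConnector(new ConsumerConfig(props));
>>     }
>> }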
>>
>> We must have tried restarting these JVMs about 20 times now, both with and
>> without the "rebalance.xx" properties, and every time it is the same issue.
>> The one exception was the first time we applied the
>> "rebalance.backoff.ms=10000" property, when all 4 JVMs started! We thought
>> that solved everything, but when we restarted again just to make sure, we
>> were back to square one.
>>
>> If we have only 1 thread create 1 ConsumerConnector instead of 4, it
>> works. This way we can have any number of JVMs running 1 ConsumerConnector
>> and they all behave well and rebalance partitions. It is only when we try
>> to start multiple ConsumerConnectors on the same JVM that this problem
>> occurs. I'd like to remind you that 4 ConsumerConnectors were working for
>> several months, and the ZK sub-tree for the non-Kafka part of our code was
>> small when we started.
>>
>> Does anybody have any thoughts on this? What could be causing this issue?
>> Could there be a Curator/ZK client conflict with the high-level Kafka
>> consumer? Or is the number of nodes that our code puts in ZK causing
>> problems with partition assignment in Kafka, given that the Curator
>> framework keeps syncing data in the background while the Kafka code is
>> creating ConsumerConnectors and rebalancing topics?
>>
>> Thanks,
>> Ashwin Jayaprakash.
>>
>
>