Posted to dev@kafka.apache.org by "Guozhang Wang (JIRA)" <ji...@apache.org> on 2015/07/21 00:55:05 UTC

[jira] [Commented] (KAFKA-2329) Consumers balance fails when multiple consumers are started simultaneously.

    [ https://issues.apache.org/jira/browse/KAFKA-2329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14634222#comment-14634222 ] 

Guozhang Wang commented on KAFKA-2329:
--------------------------------------

Thanks for the patch, [~zklapow]. A couple of comments:

0. Could you upload the patch to the RB system? The current patch contains multiple versions of incremental diffs, which makes it very hard to read.
1. compareAndTriggerRebalance checks whether the consumer cache is the same as the value read from ZK. How will it handle the case where topic changes happen while the consumers do not change?
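
To make the concern in point 1 concrete, here is a minimal Python sketch. It is not the patch's code: `GroupSnapshot`, `snapshot`, `consumers_changed` and `state_changed` are hypothetical names, and it assumes that "topic changes" means a change in a topic's partition count. It shows how a check that compares only the consumer set stays silent when a topic changes, while a check over the combined state does not.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GroupSnapshot:
    """Hypothetical cached view of a consumer group (not taken from the patch)."""
    consumer_ids: frozenset
    partitions_per_topic: tuple  # sorted (topic, partition_count) pairs


def snapshot(consumer_ids, partitions_per_topic):
    """Build a comparable snapshot from plain Python collections."""
    return GroupSnapshot(
        frozenset(consumer_ids),
        tuple(sorted(partitions_per_topic.items())),
    )


def consumers_changed(cached, current):
    """Check that looks only at consumer registrations."""
    return cached.consumer_ids != current.consumer_ids


def state_changed(cached, current):
    """Check that also covers topic metadata."""
    return cached != current


# Same two consumers, but the topic gains partitions (assumed scenario).
cached = snapshot({"c1", "c2"}, {"orders": 4})
current = snapshot({"c1", "c2"}, {"orders": 8})

consumers_only = consumers_changed(cached, current)
full_state = state_changed(cached, current)
print(consumers_only, full_state)
```

In this toy scenario the consumer-only check reports no change, so a rebalance gated on it alone would be skipped even though the partition layout differs.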

> Consumers balance fails when multiple consumers are started simultaneously.
> ---------------------------------------------------------------------------
>
>                 Key: KAFKA-2329
>                 URL: https://issues.apache.org/jira/browse/KAFKA-2329
>             Project: Kafka
>          Issue Type: Bug
>          Components: consumer
>    Affects Versions: 0.8.1.1, 0.8.2.1
>            Reporter: Ze'ev Eli Klapow
>            Assignee: Ze'ev Eli Klapow
>              Labels: consumer, patch
>             Fix For: 0.8.1.2
>
>         Attachments: zookeeper-consumer-connector-epoch-node.patch
>
>
> During consumer startup, a race condition can occur if multiple consumers are started (nearly) simultaneously.
> If a second consumer is started while the first consumer is in the middle of {{zkClient.subscribeChildChanges}}, the first consumer will never see the registration of the second consumer, because the consumer registration node for the second consumer will be unwatched and no new child will be registered later. As a result, the first consumer owns all partitions and never releases ownership, which causes the second consumer to fail rebalancing.
> The attached patch solves this by using an "epoch" node, which all consumers watch and update to trigger a rebalance. When a rebalance is triggered, we check the consumer registrations against a cached state to avoid unnecessary rebalances. For safety, we also periodically check the consumer registrations and rebalance. We have been using this patch in production at HubSpot for a while, and it has eliminated all rebalance issues.
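
The epoch-node idea in the description above can be illustrated with a toy in-memory model. This is only a sketch under stated assumptions, not the attached patch: `EpochNode` and `Consumer` are hypothetical classes, there is no ZooKeeper involved, and the watcher list is persistent, whereas real ZooKeeper watches are one-shot and must be re-registered. Each consumer watches the epoch node, registers itself, and then bumps the epoch; on every epoch change it compares the registrations against its cache and rebalances only if they differ.

```python
class EpochNode:
    """In-memory stand-in for a node that notifies all watchers on update."""

    def __init__(self):
        self.value = 0
        self.watchers = []

    def watch(self, callback):
        self.watchers.append(callback)

    def bump(self):
        self.value += 1
        for callback in list(self.watchers):
            callback(self.value)


class Consumer:
    """Hypothetical consumer that rebalances when the epoch changes."""

    def __init__(self, consumer_id, registry, epoch):
        self.consumer_id = consumer_id
        self.registry = registry    # shared set of registered consumer ids
        self.cached = frozenset()   # registrations seen at the last rebalance
        self.rebalances = 0
        epoch.watch(self.on_epoch_change)  # 1. watch the epoch node
        registry.add(consumer_id)          # 2. register this consumer
        epoch.bump()                       # 3. notify every watcher, itself included

    def on_epoch_change(self, _epoch_value):
        current = frozenset(self.registry)
        if current != self.cached:         # skip rebalances that would change nothing
            self.cached = current
            self.rebalances += 1


registry, epoch = set(), EpochNode()
c1 = Consumer("c1", registry, epoch)
c2 = Consumer("c2", registry, epoch)
epoch.bump()  # epoch update with no membership change: no rebalance expected
```

Because every consumer both watches and updates the same node, a late registration always produces an epoch change that the earlier consumers observe, and the cache comparison keeps the extra notifications from turning into redundant rebalances.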


