Posted to users@kafka.apache.org by Kerry Wei <xw...@salesforce.com> on 2016/07/19 22:29:11 UTC

Rebalance and Failures

Hi all,
I'm a bit confused about rebalances and failures.

(Assuming I understand the rebalance procedure correctly.)
Suppose that in the middle of a rebalance some consumer, C1, hits an
unclean shutdown (i.e. it crashes or gets a kill -9). The coordinator won't
be aware that C1 is dead until {zookeeper.session.timeout.ms} has passed, so
the rebalance will fail because the partitions of this dead consumer can't
be released and distributed to other consumers.
Once it realizes C1 is dead, the coordinator excludes it from the rebalance
loop and starts a second attempt. However, another consumer, C2, hits an
unclean shutdown during the second rebalance, causing the rebalance to fail
again... and if the coordinator exhausts all retries (with
{rebalance.backoff.ms} between each retry), then the rebalance will
not complete.

My question is: what are the consequences of a rebalance that ultimately
fails? For example, will some partitions held by the dead consumers not be
consumed?

If new consumers join the group during a rebalance while existing
consumers crash or get a kill -9, does that mean the rebalance could
continue forever? If so, what would be a good point to stop retrying? For
example, let
{rebalance.max.retries} * {rebalance.backoff.ms} > N * {zookeeper.session.timeout.ms},
where N controls how many consumer crashes you want to survive during a
rebalance.
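
The inequality above can be checked with a few lines of Python. This is only
a sketch of the arithmetic in the question, not a statement about how the
coordinator actually counts retries. The function name and the sample values
are made up for illustration.

```python
def max_survivable_crashes(max_retries, backoff_ms, session_timeout_ms):
    """Largest N such that
    max_retries * backoff_ms > N * session_timeout_ms.

    The arguments stand in for rebalance.max.retries, rebalance.backoff.ms
    and zookeeper.session.timeout.ms. This only evaluates the inequality
    from the question; it does not model the real rebalance logic.
    """
    if session_timeout_ms <= 0:
        raise ValueError("session_timeout_ms must be positive")
    retry_budget_ms = max_retries * backoff_ms
    if retry_budget_ms <= 0:
        return 0
    # Strict inequality: N * timeout must stay below the retry budget.
    return (retry_budget_ms - 1) // session_timeout_ms


# Illustrative values only, not recommendations.
print(max_survivable_crashes(max_retries=10, backoff_ms=2000,
                             session_timeout_ms=6000))
```

With these sample values the retry budget is 20000 ms, so the inequality
holds for N up to 3 and fails for N = 4.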



BTW, how do you search the Kafka email archives?

Thanks!
Kerry

Re: Rebalance and Failures

Posted by Ewen Cheslack-Postava <ew...@confluent.io>.

Since you mention the ZK timeout, I think you might be confusing the new
and old consumer semantics. With the new consumer, there's no ZK
interaction. If one of the members dies after indicating membership but
before the group protocol completes, it will simply be assigned data and not
process it. After the session timeout, it will be removed from the group and
its partitions will be reassigned to other group members so the data can be
processed.

-Ewen




-- 
Thanks,
Ewen