Posted to jira@kafka.apache.org by "Konstantine Karantasis (Jira)" <ji...@apache.org> on 2020/06/10 07:17:00 UTC

[jira] [Updated] (KAFKA-9849) Fix issue with worker.unsync.backoff.ms creating zombie workers when incremental cooperative rebalancing is used

     [ https://issues.apache.org/jira/browse/KAFKA-9849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Konstantine Karantasis updated KAFKA-9849:
------------------------------------------
    Fix Version/s: 2.5.1
                   2.4.2
                   2.6.0
                   2.3.2

> Fix issue with worker.unsync.backoff.ms creating zombie workers when incremental cooperative rebalancing is used
> ----------------------------------------------------------------------------------------------------------------
>
>                 Key: KAFKA-9849
>                 URL: https://issues.apache.org/jira/browse/KAFKA-9849
>             Project: Kafka
>          Issue Type: Bug
>          Components: KafkaConnect
>    Affects Versions: 2.3.1, 2.5.0, 2.4.1
>            Reporter: Konstantine Karantasis
>            Assignee: Konstantine Karantasis
>            Priority: Major
>             Fix For: 2.3.2, 2.6.0, 2.4.2, 2.5.1
>
>
> {{worker.unsync.backoff.ms}} is a property that was introduced a while ago when eager (stop-the-world) rebalancing was the only option for Connect workers. The goal of this property is to avoid triggering consecutive rebalances when a worker fails to catch up with the config topic in time and therefore voluntarily leaves the group with a {{LeaveGroupRequest}}.
> With incremental cooperative rebalancing, this backoff ({{worker.unsync.backoff.ms}}), whose default value equals the default value of {{scheduled.rebalance.max.delay.ms}} (5 min), might end up turning a worker into a zombie worker that retains its tasks but stays out of the group. By backing off from rebalancing, this worker leaves the leader of the group no option but to reassign the missing tasks, which are presumed lost, to other members of the group if the worker does not return before {{scheduled.rebalance.max.delay.ms}} expires.
> Clearly, {{worker.unsync.backoff.ms}} was introduced to avoid rebalancing storms in the presence of intermittent connectivity issues with eager rebalancing. However, when incremental cooperative rebalancing is used, this property might inadvertently make workers operate as zombie workers that keep running tasks while they are out of the group.
> Of course, a good tradeoff needs to be made between making the protocol too eager again and turning workers into zombies when the connection to the broker coordinator is not lost for too long.
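
The timing interaction described above can be illustrated with a rough sketch. This is a simplified model, not Connect's actual implementation: it assumes the leader's {{scheduled.rebalance.max.delay.ms}} timer starts at the moment the worker leaves the group, and the names Timing, rejoin_overhead_ms, returns_in_time and zombie_window_ms are hypothetical, introduced only for illustration.

```python
from dataclasses import dataclass

# 5 minutes: the shared default of both properties, as stated in the issue.
DEFAULT_MS = 5 * 60 * 1000


@dataclass
class Timing:
    # worker.unsync.backoff.ms: how long the worker stays out of the group.
    unsync_backoff_ms: int = DEFAULT_MS
    # scheduled.rebalance.max.delay.ms: how long the leader waits before
    # reassigning tasks it considers lost.
    max_delay_ms: int = DEFAULT_MS
    # Hypothetical extra time the worker needs to rejoin after backing off.
    rejoin_overhead_ms: int = 0


def returns_in_time(t: Timing) -> bool:
    """True if the worker rejoins strictly before the leader's delay expires."""
    return t.unsync_backoff_ms + t.rejoin_overhead_ms < t.max_delay_ms


def zombie_window_ms(t: Timing) -> int:
    """Time during which the worker still runs tasks the leader has reassigned."""
    rejoin_at = t.unsync_backoff_ms + t.rejoin_overhead_ms
    return max(0, rejoin_at - t.max_delay_ms)


if __name__ == "__main__":
    for timing in (Timing(), Timing(rejoin_overhead_ms=3000), Timing(unsync_backoff_ms=60_000)):
        print(timing, returns_in_time(timing), zombie_window_ms(timing))
```

Under this model, the default values leave the worker no margin: it cannot rejoin before the delay expires, and any rejoin overhead turns directly into time spent running tasks that have already been handed to other workers. A backoff well below the delay avoids the overlap in this sketch, at the cost of moving back toward more eager rebalancing.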



--
This message was sent by Atlassian Jira
(v8.3.4#803005)