Posted to jira@kafka.apache.org by "Ismael Juma (Jira)" <ji...@apache.org> on 2020/08/09 22:06:00 UTC

[jira] [Resolved] (KAFKA-10105) Regression in group coordinator dealing with flaky clients joining while leaving

     [ https://issues.apache.org/jira/browse/KAFKA-10105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ismael Juma resolved KAFKA-10105.
---------------------------------
    Resolution: Duplicate

Closing as duplicate of KAFKA-9752. Please reopen if reproduced with 2.5.0.

> Regression in group coordinator dealing with flaky clients joining while leaving
> --------------------------------------------------------------------------------
>
>                 Key: KAFKA-10105
>                 URL: https://issues.apache.org/jira/browse/KAFKA-10105
>             Project: Kafka
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 2.4.1
>         Environment: Kafka 2.4.1 on jre 11 on debian 9 in docker
>            Reporter: William Reynolds
>            Priority: Major
>
> Since upgrading a cluster from 1.1.0 to 2.4.1, the broker no longer correctly handles a consumer sending a join after a leave.
> What happens now is that if a consumer sends a leave and then, while it is shutting down, follows up by trying to send a join again, the group coordinator adds the leaving member to the group but never seems to heartbeat that member.
> Since the consumer is then gone, when it joins again after starting it is added as a new member, but the zombie member is still there and is included in the partition assignment, which means that those partitions never get consumed from. What can also happen is that one of the zombies becomes group leader, so the rebalance gets stuck forever and the group is entirely blocked.
> I have not been able to track down where this was introduced between 1.1.0 and 2.4.1, but I will look further into this. Unfortunately the logs are essentially silent about the zombie members, and I only had INFO level logging on during the issue. By stopping all the consumers in the group and restarting the broker coordinating that group, we could get back to a working state.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)