Posted to dev@kafka.apache.org by "Manikumar (Jira)" <ji...@apache.org> on 2019/11/06 18:13:01 UTC

[jira] [Resolved] (KAFKA-9140) Consumer gets stuck rejoining the group indefinitely

     [ https://issues.apache.org/jira/browse/KAFKA-9140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Manikumar resolved KAFKA-9140.
------------------------------
      Assignee: Guozhang Wang
    Resolution: Fixed

> Consumer gets stuck rejoining the group indefinitely
> ----------------------------------------------------
>
>                 Key: KAFKA-9140
>                 URL: https://issues.apache.org/jira/browse/KAFKA-9140
>             Project: Kafka
>          Issue Type: Bug
>          Components: clients, consumer
>    Affects Versions: 2.4.0
>            Reporter: Sophie Blee-Goldman
>            Assignee: Guozhang Wang
>            Priority: Blocker
>             Fix For: 2.4.0
>
>         Attachments: debug.tgz, info.tgz, kafka-data-logs-1.tgz, kafka-data-logs-2.tgz, server-start-stdout-stderr.log.tgz, streams.log.tgz
>
>
> There seems to be a race condition that can cause a rejoining member to get stuck indefinitely initiating a rejoin. The relevant client logs are attached (streams.log.tgz; all other attachments are broker logs). The consumer repeats this message (and nothing else) continuously until it is killed or shut down:
>  
> {code:java}
> [2019-11-05 01:53:54,699] INFO [Consumer clientId=StreamsUpgradeTest-a4c1cff8-7883-49cd-82da-d2cdfc33a2f0-StreamThread-1-consumer, groupId=StreamsUpgradeTest] Generation data was cleared by heartbeat thread. Initiating rejoin. (org.apache.kafka.clients.consumer.internals.AbstractCoordinator)
> {code}
>  
> This log message was added as part of the bugfix ([PR 7460|https://github.com/apache/kafka/pull/7460]) for a related race condition, KAFKA-8104.
> This issue was uncovered by the Streams version probing upgrade test, which fails intermittently. Here are the failure rates for the system test runs so far:
> trunk (cooperative): 1/1 and 2/10 failures
> 2.4 (cooperative): 0/10 and 1/15 failures
> trunk (eager): 0/10 failures
> I've kicked off some high-repeat runs to complete overnight and hopefully shed more light.
> Note that I have also kicked off runs of both 2.4 and trunk with the PR for KAFKA-8104 reverted. Both of them saw 2/10 failures, due to hitting the bug that was fixed by [PR 7460|https://github.com/apache/kafka/pull/7460]. It is therefore unclear whether [PR 7460|https://github.com/apache/kafka/pull/7460] introduced a new race condition/bug, or merely uncovered an existing one that was previously masked because the test would have failed first due to KAFKA-8104.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)