Posted to dev@kafka.apache.org by "Jason Gustafson (Jira)" <ji...@apache.org> on 2020/04/07 00:09:00 UTC

[jira] [Resolved] (KAFKA-9815) Consumer may never re-join if inconsistent metadata is received once

     [ https://issues.apache.org/jira/browse/KAFKA-9815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Gustafson resolved KAFKA-9815.
------------------------------------
    Fix Version/s: 2.4.2
                   2.5.0
       Resolution: Fixed

> Consumer may never re-join if inconsistent metadata is received once
> --------------------------------------------------------------------
>
>                 Key: KAFKA-9815
>                 URL: https://issues.apache.org/jira/browse/KAFKA-9815
>             Project: Kafka
>          Issue Type: Bug
>          Components: consumer
>            Reporter: Rajini Sivaram
>            Assignee: Rajini Sivaram
>            Priority: Major
>             Fix For: 2.5.0, 2.4.2
>
>
> KAFKA-9797 is the result of an incorrect rolling upgrade test in which a new listener is added to the brokers and set as the inter-broker listener within the same rolling upgrade. As a result, metadata is inconsistent across brokers until the rolling upgrade completes, since inter-broker communication is broken until all brokers have the new listener. The test fails due to consumer timeouts, and sometimes this is simply because the upgrade takes longer than the consumer timeout. However, several logs show an issue in the consumer when a single metadata response received during the upgrade differs from the consumer's cached `assignmentSnapshot`, triggering a rebalance.
> In [https://github.com/apache/kafka/blob/7f640f13b4d486477035c3edb28466734f053beb/clients/src/main/java/org/apache/kafka/clients/consumer/internals/ConsumerCoordinator.java#L750], we return true for `rejoinNeededOrPending()` if `assignmentSnapshot` is not the same as the current `metadataSnapshot`. We don't set `rejoinNeeded` in the instance, but we revoke partitions and send a `JoinGroup` request. If the `JoinGroup` request fails and a subsequent metadata response contains the same snapshot value as the previously cached `assignmentSnapshot`, we never send `JoinGroup` again, since the snapshots match and `rejoinNeeded=false`. Partitions are not assigned to the consumer after this, and the test fails because messages are not received.
> Even though this particular system test isn't a valid upgrade scenario, we should fix the consumer, since temporary metadata differences between brokers can result in this scenario.
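
The sequence described above can be illustrated with a small, self-contained model. This is only a sketch under stated assumptions: the class `RejoinSketch`, its string-valued snapshots, and the `markRejoinOnAttempt` switch are hypothetical and are not Kafka's actual `ConsumerCoordinator` code. The switch shows one possible remedy (remembering that a rejoin was started) and is not claimed to be the fix that was committed.

```java
import java.util.Objects;

// Hypothetical, simplified model of the state described in the issue.
// It is NOT Kafka's ConsumerCoordinator; names mirror the description only.
class RejoinSketch {
    boolean rejoinNeeded = false;      // explicit "must rejoin" flag
    String assignmentSnapshot = "A";   // metadata seen when partitions were assigned
    String metadataSnapshot = "A";     // metadata from the latest response
    boolean hasAssignment = true;      // whether partitions are currently assigned

    boolean rejoinNeededOrPending() {
        // A rejoin is triggered by the flag or by a snapshot mismatch.
        return rejoinNeeded || !Objects.equals(assignmentSnapshot, metadataSnapshot);
    }

    // One poll iteration: if a rejoin looks necessary, revoke and try JoinGroup.
    void poll(boolean joinSucceeds, boolean markRejoinOnAttempt) {
        if (!rejoinNeededOrPending()) {
            return; // nothing to do; any earlier revocation is never repaired
        }
        if (markRejoinOnAttempt) {
            rejoinNeeded = true; // assumed remedy: remember the pending rejoin
        }
        hasAssignment = false;   // partitions are revoked before JoinGroup is sent
        if (joinSucceeds) {
            hasAssignment = true;
            assignmentSnapshot = metadataSnapshot;
            rejoinNeeded = false;
        }
    }

    // Replays the scenario: one inconsistent metadata response, a failed
    // JoinGroup, then metadata returning to its original value.
    static boolean simulate(boolean markRejoinOnAttempt) {
        RejoinSketch c = new RejoinSketch();
        c.metadataSnapshot = "B";            // inconsistent response from one broker
        c.poll(false, markRejoinOnAttempt);  // partitions revoked, JoinGroup fails
        c.metadataSnapshot = "A";            // metadata matches the cached snapshot again
        c.poll(true, markRejoinOnAttempt);   // JoinGroup would succeed if it were sent
        return c.hasAssignment;
    }

    public static void main(String[] args) {
        System.out.println("flag not set on attempt, has assignment: " + simulate(false));
        System.out.println("flag set on attempt, has assignment: " + simulate(true));
    }
}
```

In this model, `simulate(false)` leaves the consumer without partitions because the second poll sees matching snapshots and an unset flag, which mirrors the behaviour reported in the logs. With `simulate(true)`, the second poll still attempts the join and the assignment is restored.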



--
This message was sent by Atlassian Jira
(v8.3.4#803005)