Posted to jira@kafka.apache.org by "A. Sophie Blee-Goldman (Jira)" <ji...@apache.org> on 2021/03/16 05:19:00 UTC

[jira] [Created] (KAFKA-12477) Smart rebalancing with dynamic protocol selection

A. Sophie Blee-Goldman created KAFKA-12477:
----------------------------------------------

             Summary: Smart rebalancing with dynamic protocol selection
                 Key: KAFKA-12477
                 URL: https://issues.apache.org/jira/browse/KAFKA-12477
             Project: Kafka
          Issue Type: Improvement
          Components: consumer
            Reporter: A. Sophie Blee-Goldman
             Fix For: 3.0.0


Users who want to upgrade their applications and enable the COOPERATIVE rebalancing protocol in their consumer apps are required to follow a double rolling bounce upgrade path. The reason for this is laid out in the [Consumer Upgrades|https://cwiki.apache.org/confluence/display/KAFKA/KIP-429%3A+Kafka+Consumer+Incremental+Rebalance+Protocol#KIP429:KafkaConsumerIncrementalRebalanceProtocol-Consumer] section of KIP-429. Basically, the ConsumerCoordinator picks a rebalancing protocol in its constructor based on the list of supported partition assignors. The protocol is selected as the highest protocol that is commonly supported by all assignors in the list, and never changes after that.
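The selection rule described above can be sketched as follows. This is a simplified, standalone model and not the actual ConsumerCoordinator code: the class name, the method name highestCommonProtocol and the local RebalanceProtocol enum are illustrative assumptions, and each assignor is reduced to the set of protocols it supports.

```java
import java.util.Collections;
import java.util.EnumSet;
import java.util.List;
import java.util.Set;

// Hypothetical model of "pick the highest protocol supported by every assignor".
final class ProtocolSelectionSketch {
    // Declared lowest to highest, so enum ordering doubles as protocol ordering.
    enum RebalanceProtocol { EAGER, COOPERATIVE }

    static RebalanceProtocol highestCommonProtocol(List<Set<RebalanceProtocol>> supportedPerAssignor) {
        if (supportedPerAssignor.isEmpty()) {
            throw new IllegalArgumentException("at least one assignor is required");
        }
        // Start from every protocol and keep only those each assignor supports.
        EnumSet<RebalanceProtocol> common = EnumSet.allOf(RebalanceProtocol.class);
        for (Set<RebalanceProtocol> supported : supportedPerAssignor) {
            common.retainAll(supported);
        }
        if (common.isEmpty()) {
            throw new IllegalArgumentException("assignors share no rebalance protocol");
        }
        return Collections.max(common);
    }

    public static void main(String[] args) {
        Set<RebalanceProtocol> cooperativeSticky =
                EnumSet.of(RebalanceProtocol.EAGER, RebalanceProtocol.COOPERATIVE);
        Set<RebalanceProtocol> range = EnumSet.of(RebalanceProtocol.EAGER);

        // With both assignors configured, the common protocol is EAGER.
        System.out.println(highestCommonProtocol(List.of(cooperativeSticky, range)));
        // With only the cooperative assignor left, COOPERATIVE is selected.
        System.out.println(highestCommonProtocol(List.of(cooperativeSticky)));
    }
}
```

Because the result is computed once from the configured list, the presence of a single EAGER-only assignor pins the consumer to EAGER in this model, regardless of which assignor the group ends up using.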

This is unfortunate because the consumer may end up using an older protocol even after every member in the group has been updated to support the newer one. After the first rolling bounce of the upgrade, all members will have two assignors: "cooperative-sticky" and "range" (or sticky, round-robin, etc.). At this point the EAGER protocol will still be selected due to the presence of the "range" assignor, but it's the "cooperative-sticky" assignor that will ultimately be selected for use in rebalances if that assignor is preferred (i.e. positioned first in the list). The only reason for the second rolling bounce is to strip off the "range" assignor and allow the upgraded members to switch over to COOPERATIVE. We can't allow them to use cooperative rebalancing until everyone has been upgraded, but once they have, it's safe to do so.
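For illustration, the consumer configuration at each stage of the double rolling bounce might look like the sketch below. It uses plain java.util.Properties so that it runs without the Kafka client on the classpath. The class and method names are hypothetical, and "range" is assumed to be the assignor the application used before the upgrade.

```java
import java.util.Properties;

// Hypothetical helper showing the assignor list after each rolling bounce.
final class UpgradeConfigSketch {
    static final String STRATEGY_KEY = "partition.assignment.strategy";
    static final String COOPERATIVE_STICKY =
            "org.apache.kafka.clients.consumer.CooperativeStickyAssignor";
    static final String RANGE = "org.apache.kafka.clients.consumer.RangeAssignor";

    // First bounce: add the cooperative assignor in front of the old one.
    // The old assignor stays in the list, so EAGER is still the common protocol.
    static Properties firstBounce() {
        Properties props = new Properties();
        props.setProperty(STRATEGY_KEY, COOPERATIVE_STICKY + "," + RANGE);
        return props;
    }

    // Second bounce: drop the old assignor so the consumer can use COOPERATIVE.
    static Properties secondBounce() {
        Properties props = new Properties();
        props.setProperty(STRATEGY_KEY, COOPERATIVE_STICKY);
        return props;
    }

    public static void main(String[] args) {
        System.out.println("first bounce:  " + firstBounce().getProperty(STRATEGY_KEY));
        System.out.println("second bounce: " + secondBounce().getProperty(STRATEGY_KEY));
    }
}
```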

And there is already a way for the client to detect that everyone is on the new bytecode: if the CooperativeStickyAssignor is selected by the group coordinator, then that means it is supported by all consumers in the group and therefore everyone must be upgraded.
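A rough model of why the chosen assignor carries this information: the group coordinator can only pick an assignor that every member advertises. The sketch below assumes a simplified selection scheme (intersect the members' lists, then let each member vote for its most preferred candidate). The class name, method name and tie-breaking rule are assumptions of this sketch, not the broker's actual code.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical model of group-wide assignor selection.
final class AssignorSelectionSketch {
    static String selectAssignor(List<List<String>> preferencesPerMember) {
        if (preferencesPerMember.isEmpty()) {
            throw new IllegalArgumentException("group has no members");
        }
        // Candidates are the assignors advertised by every member.
        Set<String> candidates = new LinkedHashSet<>(preferencesPerMember.get(0));
        for (List<String> preferences : preferencesPerMember) {
            candidates.retainAll(preferences);
        }
        if (candidates.isEmpty()) {
            throw new IllegalStateException("members share no assignor");
        }
        // Each member votes for the first candidate in its own preference order.
        Map<String, Integer> votes = new LinkedHashMap<>();
        for (List<String> preferences : preferencesPerMember) {
            for (String name : preferences) {
                if (candidates.contains(name)) {
                    votes.merge(name, 1, Integer::sum);
                    break;
                }
            }
        }
        // Most votes wins; ties go to whichever candidate was voted for first
        // (an arbitrary choice made for this sketch).
        String winner = null;
        int best = 0;
        for (Map.Entry<String, Integer> entry : votes.entrySet()) {
            if (entry.getValue() > best) {
                winner = entry.getKey();
                best = entry.getValue();
            }
        }
        return winner;
    }

    public static void main(String[] args) {
        List<String> upgraded = List.of("cooperative-sticky", "range");
        List<String> notUpgraded = List.of("range");

        // Mid-upgrade: one member lacks cooperative-sticky, so it cannot be chosen.
        System.out.println(selectAssignor(List.of(upgraded, notUpgraded)));
        // Fully upgraded: every member supports and prefers cooperative-sticky.
        System.out.println(selectAssignor(List.of(upgraded, upgraded)));
    }
}
```

In this model "cooperative-sticky" can only win once it appears in every member's list, which is the signal the description relies on.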

We may be able to save the second rolling bounce by dynamically updating the rebalancing protocol inside the ConsumerCoordinator as "the highest protocol supported by the assignor chosen by the group coordinator". This means we'll still be using EAGER at the first rebalance, since we of course need to wait for this initial rebalance to get the response from the group coordinator. But we should take the hint from the chosen assignor rather than dropping this information on the floor and sticking with the original protocol.
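A minimal sketch of the proposed behavior, under the assumption that the coordinator keeps its assignors keyed by name and is told which one the group coordinator chose once the rebalance completes. Everything here (the class, the onAssignorChosen hook, the local enum) is hypothetical and only illustrates the idea. It is not a patch against the real ConsumerCoordinator.

```java
import java.util.Collections;
import java.util.EnumSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical coordinator model that revises its protocol after each rebalance.
final class DynamicProtocolSketch {
    enum RebalanceProtocol { EAGER, COOPERATIVE }

    private final Map<String, Set<RebalanceProtocol>> supportedByAssignor;
    private RebalanceProtocol protocol;

    DynamicProtocolSketch(Map<String, Set<RebalanceProtocol>> supportedByAssignor) {
        if (supportedByAssignor.isEmpty()) {
            throw new IllegalArgumentException("at least one assignor is required");
        }
        this.supportedByAssignor = new LinkedHashMap<>(supportedByAssignor);
        // Initial choice, as today: highest protocol common to all configured assignors.
        EnumSet<RebalanceProtocol> common = EnumSet.allOf(RebalanceProtocol.class);
        for (Set<RebalanceProtocol> supported : supportedByAssignor.values()) {
            common.retainAll(supported);
        }
        if (common.isEmpty()) {
            throw new IllegalArgumentException("assignors share no rebalance protocol");
        }
        this.protocol = Collections.max(common);
    }

    // Proposed change: adopt the highest protocol of the assignor the group chose.
    void onAssignorChosen(String assignorName) {
        Set<RebalanceProtocol> supported = supportedByAssignor.get(assignorName);
        if (supported == null || supported.isEmpty()) {
            throw new IllegalArgumentException("unknown assignor: " + assignorName);
        }
        protocol = Collections.max(supported);
    }

    RebalanceProtocol protocol() {
        return protocol;
    }

    public static void main(String[] args) {
        Map<String, Set<RebalanceProtocol>> assignors = new LinkedHashMap<>();
        assignors.put("cooperative-sticky",
                EnumSet.of(RebalanceProtocol.EAGER, RebalanceProtocol.COOPERATIVE));
        assignors.put("range", EnumSet.of(RebalanceProtocol.EAGER));

        DynamicProtocolSketch coordinator = new DynamicProtocolSketch(assignors);
        System.out.println(coordinator.protocol()); // EAGER for the first rebalance

        coordinator.onAssignorChosen("cooperative-sticky");
        System.out.println(coordinator.protocol()); // COOPERATIVE from then on
    }
}
```

Note that in this sketch the protocol would drop back to EAGER if a later rebalance chose "range", for example after a not-yet-upgraded member joined the group. How the real implementation should handle such transitions is left open here.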



--
This message was sent by Atlassian Jira
(v8.3.4#803005)