Posted to jira@kafka.apache.org by "Guozhang Wang (Jira)" <ji...@apache.org> on 2021/03/16 06:56:00 UTC

[jira] [Comment Edited] (KAFKA-12477) Smart rebalancing with dynamic protocol selection

    [ https://issues.apache.org/jira/browse/KAFKA-12477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17302260#comment-17302260 ] 

Guozhang Wang edited comment on KAFKA-12477 at 3/16/21, 6:55 AM:
-----------------------------------------------------------------

I think this is a reasonable idea. Just to clarify one thing: in the original design, the second rebalance with the `range` assignor removed actually serves as a "no turning back" barrier for users. That is, it is the user's call to remove the range assignor once she's certain that, e.g., no new members will join the group with only the old `range` assignor, which would cause the whole rebalance to fail because the broker's coordinator can no longer pick a common assignor. Overall, we're putting that responsibility on the user's shoulders, and if they shoot themselves in the foot they have no one to blame :)

By removing this second rebalance we are kind of making that call for users -- "there's no turning back after you upgrade to 3.0 with the new assignor added". Personally I'm on the fence about whether we should take that responsibility off the user's shoulders, but if we feel this is worthwhile and we write very good docs explaining it, I can also be convinced.
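
To make the failure mode above concrete, here is a minimal sketch of why a late-joining member that only knows `range` breaks the group once everyone else has dropped it. The class and method names are hypothetical, and the real broker-side selection is more involved than a plain intersection; this only illustrates the "no common assignor" case.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

// Hypothetical illustration, not the broker's actual coordinator code.
final class CommonAssignorSketch {

    // Returns an assignor supported by every member, or empty if there is none.
    // Simplifying assumption: ties are broken by the first member's preference order.
    static Optional<String> pickCommonAssignor(List<List<String>> memberAssignors) {
        if (memberAssignors.isEmpty()) {
            return Optional.empty();
        }
        List<String> candidates = new ArrayList<>(memberAssignors.get(0));
        for (List<String> supported : memberAssignors) {
            candidates.retainAll(supported); // keep only assignors this member also supports
        }
        return candidates.stream().findFirst();
    }

    public static void main(String[] args) {
        List<String> upgraded = Arrays.asList("cooperative-sticky"); // `range` already removed
        List<String> oldMember = Arrays.asList("range");             // never upgraded

        // All members upgraded: a common assignor exists.
        System.out.println(pickCommonAssignor(Arrays.asList(upgraded, upgraded)));
        // An old member joins: nothing in common, so the rebalance cannot proceed.
        System.out.println(pickCommonAssignor(Arrays.asList(upgraded, oldMember)));
    }
}
```

As long as members keep `range` in their list, the old member would still find a common assignor, which is why removing it acts as the barrier.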


> Smart rebalancing with dynamic protocol selection
> -------------------------------------------------
>
>                 Key: KAFKA-12477
>                 URL: https://issues.apache.org/jira/browse/KAFKA-12477
>             Project: Kafka
>          Issue Type: Improvement
>          Components: consumer
>            Reporter: A. Sophie Blee-Goldman
>            Priority: Major
>             Fix For: 3.0.0
>
>
> Users who want to upgrade their applications and enable the COOPERATIVE rebalancing protocol in their consumer apps are required to follow a double rolling bounce upgrade path. The reason for this is laid out in the [Consumer Upgrades|https://cwiki.apache.org/confluence/display/KAFKA/KIP-429%3A+Kafka+Consumer+Incremental+Rebalance+Protocol#KIP429:KafkaConsumerIncrementalRebalanceProtocol-Consumer] section of KIP-429. Basically, the ConsumerCoordinator picks a rebalancing protocol in its constructor based on the list of supported partition assignors. The protocol is selected as the highest protocol that is commonly supported by all assignors in the list, and never changes after that.
> This is a bit unfortunate because it may end up using an older protocol even after every member in the group has been updated to support the newer protocol. After the first rolling bounce of the upgrade, all members will have two assignors: "cooperative-sticky" and "range" (or sticky/round-robin/etc). At this point the EAGER protocol will still be selected due to the presence of the "range" assignor, but it's the "cooperative-sticky" assignor that will ultimately be selected for use in rebalances if that assignor is preferred (i.e., positioned first in the list). The only reason for the second rolling bounce is to strip off the "range" assignor and allow the upgraded members to switch over to COOPERATIVE. We can't allow them to use cooperative rebalancing until everyone has been upgraded, but once they have it's safe to do so.
> And there is already a way for the client to detect that everyone is on the new bytecode: if the CooperativeStickyAssignor is selected by the group coordinator, then that means it is supported by all consumers in the group and therefore everyone must be upgraded.
> We may be able to save the second rolling bounce by dynamically updating the rebalancing protocol inside the ConsumerCoordinator as "the highest protocol supported by the assignor chosen by the group coordinator". This means we'll still be using EAGER at the first rebalance, since we of course need to wait for this initial rebalance to get the response from the group coordinator. But we should take the hint from the chosen assignor rather than dropping this information on the floor and sticking with the original protocol.
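
The difference between the two selection rules described in the issue can be sketched roughly as follows. The types here (`RebalanceProtocol`, `Assignor`, `ProtocolSelectionSketch`) are simplified, hypothetical stand-ins rather than the actual client classes, and the sketch assumes that "cooperative-sticky" supports both protocols while "range" supports only EAGER.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in; declaration order defines "highest": EAGER < COOPERATIVE.
enum RebalanceProtocol { EAGER, COOPERATIVE }

// Hypothetical stand-in for a partition assignor and the protocols it supports.
final class Assignor {
    final String name;
    final List<RebalanceProtocol> supportedProtocols;

    Assignor(String name, List<RebalanceProtocol> supportedProtocols) {
        this.name = name;
        this.supportedProtocols = supportedProtocols;
    }
}

final class ProtocolSelectionSketch {

    // Rule described as current behavior: the highest protocol supported by
    // ALL configured assignors, decided once up front.
    static RebalanceProtocol staticProtocol(List<Assignor> configured) {
        RebalanceProtocol best = null;
        for (RebalanceProtocol p : RebalanceProtocol.values()) { // ascending order
            boolean supportedByAll = true;
            for (Assignor a : configured) {
                if (!a.supportedProtocols.contains(p)) {
                    supportedByAll = false;
                }
            }
            if (supportedByAll) {
                best = p; // a later match is a higher protocol
            }
        }
        if (best == null) {
            throw new IllegalArgumentException("no protocol common to all assignors");
        }
        return best;
    }

    // Proposed rule: the highest protocol supported by the assignor that the
    // group coordinator actually chose in the last rebalance.
    static RebalanceProtocol dynamicProtocol(Assignor chosenByCoordinator) {
        return Collections.max(chosenByCoordinator.supportedProtocols);
    }

    public static void main(String[] args) {
        Assignor coop = new Assignor("cooperative-sticky",
                Arrays.asList(RebalanceProtocol.COOPERATIVE, RebalanceProtocol.EAGER));
        Assignor range = new Assignor("range",
                Arrays.asList(RebalanceProtocol.EAGER));

        // After the first rolling bounce every member lists both assignors.
        System.out.println(staticProtocol(Arrays.asList(coop, range))); // EAGER
        // If the coordinator picked "cooperative-sticky", everyone supports it.
        System.out.println(dynamicProtocol(coop)); // COOPERATIVE
    }
}
```

Under these assumptions the static rule stays on EAGER for as long as "range" is configured, whereas the dynamic rule can move to COOPERATIVE after the first rebalance in which "cooperative-sticky" is chosen.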


