Posted to jira@kafka.apache.org by "A. Sophie Blee-Goldman (Jira)" <ji...@apache.org> on 2022/11/09 00:48:00 UTC

[jira] [Reopened] (KAFKA-13891) sync group failed with rebalanceInProgress error cause rebalance many rounds in cooperative

     [ https://issues.apache.org/jira/browse/KAFKA-13891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

A. Sophie Blee-Goldman reopened KAFKA-13891:
--------------------------------------------

Reopening: the original fix was reverted. We should instead fix this on the assignor side by making the assignor smarter about partition ownership across generations. Basically, for each partition, it should take as the previous owner whichever consumer has the highest generation among those that claim the partition in their owned partitions.
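
As a rough illustration of that rule, here is a minimal sketch in plain Java. It is not the actual assignor code: the `MemberClaim` class, the `resolvePreviousOwners` method, and the claims made by consumers B and C are all hypothetical and exist only for this example. It assumes each member reports the generation of its last assignment together with its owned partitions, and it arbitrarily keeps the first claimant when two members report the same generation.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch only; none of these names come from the Kafka code base.
class PreviousOwnerSketch {

    // What a member reports when joining: its id, the generation of its last
    // assignment, and the partitions it claims to own.
    static final class MemberClaim {
        final String memberId;
        final int generation;
        final List<String> ownedPartitions;

        MemberClaim(String memberId, int generation, List<String> ownedPartitions) {
            this.memberId = memberId;
            this.generation = generation;
            this.ownedPartitions = ownedPartitions;
        }
    }

    // For every claimed partition, pick the claimant with the highest generation.
    // Assumption of this sketch: on a tie, the first claimant seen is kept.
    static Map<String, String> resolvePreviousOwners(List<MemberClaim> claims) {
        Map<String, String> owner = new TreeMap<>(); // sorted for stable output
        Map<String, Integer> ownerGeneration = new HashMap<>();
        for (MemberClaim claim : claims) {
            for (String partition : claim.ownedPartitions) {
                Integer best = ownerGeneration.get(partition);
                if (best == null || claim.generation > best) {
                    owner.put(partition, claim.memberId);
                    ownerGeneration.put(partition, claim.generation);
                }
            }
        }
        return owner;
    }

    public static void main(String[] args) {
        // Consumer A re-joins with a stale generation-1 claim on P1/P2, while
        // B and C (illustrative) claim the same partitions from generation 2.
        List<MemberClaim> claims = Arrays.asList(
            new MemberClaim("A", 1, Arrays.asList("P1", "P2")),
            new MemberClaim("B", 2, Arrays.asList("P1")),
            new MemberClaim("C", 2, Arrays.asList("P2")));
        System.out.println(resolvePreviousOwners(claims)); // prints {P1=B, P2=C}
    }
}
```

In this sketch a stale claim only loses when a newer claim exists for the same partition. If nobody else claims P1/P2, consumer A would still be treated as their previous owner.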

 

[~showuon] I probably won't be able to get to this within the next few days, so if you're interested in picking up this fix, go ahead and I'll find time to review. Otherwise, I will try to get to it in time for the 3.4 release.

> sync group failed with rebalanceInProgress error cause rebalance many rounds in cooperative
> -------------------------------------------------------------------------------------------
>
>                 Key: KAFKA-13891
>                 URL: https://issues.apache.org/jira/browse/KAFKA-13891
>             Project: Kafka
>          Issue Type: Bug
>          Components: clients
>    Affects Versions: 3.0.0
>            Reporter: Shawn Wang
>            Priority: Major
>             Fix For: 3.3.0, 3.2.4
>
>
> This issue was first found in [KAFKA-13419|https://issues.apache.org/jira/browse/KAFKA-13419]
> However, the previous PR did not reset the generation when sync group failed with a rebalanceInProgress error, so the original bug still exists. It may cause a consumer to go through many rounds of rebalancing before the group finally becomes stable.
> Here's the example ({*}bold is added{*}):
>  # Consumer A joined and synced the group successfully in generation 1 *(with ownedPartition P1/P2)*.
>  # A new rebalance started with generation 2. Consumer A joined successfully but, for some reason, did not send its sync group request immediately.
>  # The other consumers completed sync group successfully in generation 2; consumer A did not.
>  # By the time consumer A sent its sync group request, a new rebalance had started with generation 3, so consumer A got a REBALANCE_IN_PROGRESS error in the sync group response.
>  # On receiving REBALANCE_IN_PROGRESS, consumer A re-joined the group in generation 3, sending the assignment (ownedPartition) it held in generation 1.
>  # As a result, an out-of-date ownedPartition was sent, which led to unexpected results.
>  # *After the generation-3 rebalance, consumer A was assigned P3/P4. Its ownedPartition was ignored because it came from an old generation.*
>  # *Consumer A revoked P1/P2 and re-joined, starting a new round of rebalancing.*
>  # *If some other consumer C fails to sync group before consumer A's join group, the same issue happens again, resulting in many rounds of rebalancing before the group is stable.*
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)