Posted to jira@kafka.apache.org by "Shawn Wang (Jira)" <ji...@apache.org> on 2022/06/17 06:12:00 UTC

[jira] [Comment Edited] (KAFKA-13419) sync group failed with rebalanceInProgress error might cause out-of-date ownedPartition in Cooperative protocol

    [ https://issues.apache.org/jira/browse/KAFKA-13419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17555390#comment-17555390 ] 

Shawn Wang edited comment on KAFKA-13419 at 6/17/22 6:11 AM:
-------------------------------------------------------------

Hi [~showuon] 

After I applied this fix, together with my earlier change that makes the fix work ([Pull Request|https://github.com/apache/kafka/pull/12140]), we are seeing that a consumer will sometimes revoke almost all of its partitions even with the cooperative protocol enabled.

Details:
 * We have more than 1000 consumers using cooperative rebalancing.
 * Just as in the example in this JIRA: with cooperative rebalancing, some consumers re-join very quickly after receiving the SyncGroupResponse. Any consumer that has not yet sent its SyncGroupRequest at that point will do a revoke-all and re-join.
 * With this change applied, the problem of rebalancing for many rounds is solved.
 * However, a single very fast re-joining consumer can cause many partitions to be revoked, which makes cooperative rebalancing behave almost the same as eager rebalancing.

So instead of "{*}resetStateAndRejoin{*} when *RebalanceInProgressException* errors happen in {*}sync group{*}", can we just treat the ownedPartition from the previous generation as valid, as long as no other member claims the same partition?
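
To make the proposal concrete, here is a minimal sketch of the check I have in mind. It is not code from the Kafka client: the class, the method, and the idea that each member reports the generation its ownedPartition came from are all hypothetical and used only for illustration. A claim from an older generation is accepted as long as nobody else claims the same partition. When two members claim the same partition, the sketch assumes the claim from the newer generation wins, and that nobody keeps the partition if the generations are equal.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical sketch; none of these names exist in the Kafka client. */
public class OwnedPartitionsSketch {

    /**
     * Returns, per member, the owned partitions that would be accepted.
     * Assumes every member in ownedByMember also has an entry in generationByMember.
     */
    static Map<String, Set<String>> acceptedOwnedPartitions(
            Map<String, Integer> generationByMember,
            Map<String, Set<String>> ownedByMember) {

        // Pass 1: for each partition, find the newest generation that claims it
        // and how many members claim it at that generation.
        Map<String, Integer> newestGeneration = new HashMap<>();
        Map<String, Integer> claimsAtNewest = new HashMap<>();
        for (Map.Entry<String, Set<String>> e : ownedByMember.entrySet()) {
            int generation = generationByMember.get(e.getKey());
            for (String partition : e.getValue()) {
                int newest = newestGeneration.getOrDefault(partition, Integer.MIN_VALUE);
                if (generation > newest) {
                    newestGeneration.put(partition, generation);
                    claimsAtNewest.put(partition, 1);
                } else if (generation == newest) {
                    claimsAtNewest.merge(partition, 1, Integer::sum);
                }
            }
        }

        // Pass 2: a member keeps a partition only if its claim is the single newest one.
        // An unclaimed-by-others partition is kept even if the member's generation is old.
        Map<String, Set<String>> accepted = new TreeMap<>();
        for (Map.Entry<String, Set<String>> e : ownedByMember.entrySet()) {
            int generation = generationByMember.get(e.getKey());
            Set<String> kept = new TreeSet<>();
            for (String partition : e.getValue()) {
                boolean isNewest = generation == newestGeneration.get(partition);
                boolean isOnlyNewest = claimsAtNewest.get(partition) == 1;
                if (isNewest && isOnlyNewest) {
                    kept.add(partition);
                }
            }
            accepted.put(e.getKey(), kept);
        }
        return accepted;
    }

    public static void main(String[] args) {
        // Consumer A missed generation 2 and still reports its generation-1 partitions.
        Map<String, Integer> generations = new HashMap<>();
        generations.put("A", 1);
        generations.put("B", 2);

        Map<String, Set<String>> owned = new HashMap<>();
        owned.put("A", Set.of("t-0", "t-1"));
        owned.put("B", Set.of("t-1", "t-2"));

        // A keeps t-0 (nobody else claims it) and loses only t-1 (B's claim is newer).
        System.out.println(acceptedOwnedPartitions(generations, owned));
    }
}
```

In this example consumer A would give up only the one partition that conflicts with a newer claim, instead of revoking everything as it does with resetStateAndRejoin.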

 

What do you think?

Thanks a lot!


> sync group failed with rebalanceInProgress error might cause out-of-date ownedPartition in Cooperative protocol
> ---------------------------------------------------------------------------------------------------------------
>
>                 Key: KAFKA-13419
>                 URL: https://issues.apache.org/jira/browse/KAFKA-13419
>             Project: Kafka
>          Issue Type: Bug
>          Components: clients
>    Affects Versions: 3.0.0
>            Reporter: Luke Chen
>            Assignee: Luke Chen
>            Priority: Major
>             Fix For: 3.1.0
>
>
> In KAFKA-13406, we found that a user got stuck while rebalancing with the cooperative sticky assignor. The reason is that the "ownedPartition" was out-of-date, so it failed the cooperative assignment validation.
> Investigating deeper, I found the root cause is that we didn't reset the generation and state after a sync group failure. In KAFKA-12983, we fixed the issue that onJoinPrepare was not called in the resetStateAndRejoin method, which caused the ownedPartition not to be cleared. But there's another case in which the ownedPartition will be out-of-date. Here's the example:
>  # Consumer A joined and synced the group successfully in generation 1.
>  # A new rebalance started with generation 2. Consumer A joined successfully but, for some reason, did not send out its sync group request immediately.
>  # The other consumers completed sync group successfully in generation 2; consumer A did not.
>  # After consumer A sent out its sync group request, a new rebalance started with generation 3, so consumer A got a REBALANCE_IN_PROGRESS error in the sync group response.
>  # On receiving REBALANCE_IN_PROGRESS, consumer A re-joins the group in generation 3 with the assignment (ownedPartition) from generation 1.
>  # So we now send an out-of-date ownedPartition, which leads to unexpected results.
>  
> We might want to do *resetStateAndRejoin* when *RebalanceInProgressException* errors happen in *sync group*, because a sync group error means that join group passed and the other consumers (and the leader) might have already completed this round of rebalance. The assignment this consumer holds is already out-of-date.
>  



--
This message was sent by Atlassian Jira
(v8.20.7#820007)