Posted to jira@kafka.apache.org by "Shawn Wang (Jira)" <ji...@apache.org> on 2022/06/23 12:32:00 UTC
[jira] [Updated] (KAFKA-14016) Revoke more partitions than expected in Cooperative rebalance
[ https://issues.apache.org/jira/browse/KAFKA-14016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Shawn Wang updated KAFKA-14016:
-------------------------------
Description:
In https://issues.apache.org/jira/browse/KAFKA-13419 we found that some consumers didn't reset their generation and state after a sync group request failed with the REBALANCE_IN_PROGRESS error.
So we fixed it by resetting the generationId (but not the memberId) when sync group fails with the REBALANCE_IN_PROGRESS error.
But that change missed the reset part, so another change made in https://issues.apache.org/jira/browse/KAFKA-13891 made it work.
After applying this change, we found that consumers will sometimes revoke almost 1/3 of the partitions with cooperative rebalancing enabled. This is because if one consumer does a very quick re-join, the other consumers will get REBALANCE_IN_PROGRESS in syncGroup and revoke their partitions before re-joining. Example:
# Consumers A1-A10 (ten consumers) joined and synced the group successfully with generation 1
# A new consumer B1 joined and started a rebalance
# All consumers joined successfully, and then A1 needed to revoke a partition to transfer it to B1
# A1 did a very quick syncGroup and re-join, because it had revoked a partition
# A2-A10 didn't send syncGroup before A1 re-joined, so when they did send syncGroup, they got REBALANCE_IN_PROGRESS
# A2-A10 revoked their partitions and re-joined
So in this rebalance almost every partition was revoked, which greatly reduces the benefit of cooperative rebalancing.
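To make the scenario above concrete, here is a rough toy model of it in Python. It is not Kafka code: the function name {{simulate}}, the parameters, the partition counts (10 per consumer) and the assumption that only one partition has to move from A1 to B1 are all made up for illustration. It only counts how many partitions are revoked when every consumer that sees REBALANCE_IN_PROGRESS in syncGroup resets its state.

```python
REBALANCE_IN_PROGRESS = "REBALANCE_IN_PROGRESS"


def simulate(partitions_per_consumer=10, moved_to_b1=1, reset_on_error=True):
    """Return (revoked, total) partition counts for the toy scenario.

    All names and numbers here are hypothetical; this only mirrors the
    six steps of the example, not the real consumer or coordinator.
    """
    names = [f"A{i}" for i in range(1, 11)]
    # Step 1: A1-A10 each own a disjoint block of partitions.
    owned = {
        name: set(range(idx * partitions_per_consumer,
                        (idx + 1) * partitions_per_consumer))
        for idx, name in enumerate(names)
    }
    total = sum(len(parts) for parts in owned.values())
    revoked = 0

    # Steps 2-4: B1 joins; A1 syncs first, revokes what must move to B1,
    # and re-joins at once, which puts the group back into PreparingRebalance.
    for partition in sorted(owned["A1"])[:moved_to_b1]:
        owned["A1"].remove(partition)
        revoked += 1
    group_state = "PreparingRebalance"

    # Steps 5-6: A2-A10 send syncGroup only now, so each of them gets
    # REBALANCE_IN_PROGRESS. With the reset behaviour they drop everything.
    for name in names[1:]:
        error = REBALANCE_IN_PROGRESS if group_state == "PreparingRebalance" else None
        if error == REBALANCE_IN_PROGRESS and reset_on_error:
            revoked += len(owned[name])
            owned[name].clear()

    return revoked, total
```

Under these assumptions, calling {{simulate()}} and {{simulate(reset_on_error=False)}} side by side shows the gap between "only A1 gives up a partition" and "A2-A10 give up everything they own".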
So I think that instead of "{*}resetStateAndRejoin{*} when a *RebalanceInProgressException* error happens in {*}sync group{*}" we need another way to fix it.
Here is my proposal:
# Revert the change in https://issues.apache.org/jira/browse/KAFKA-13891
# In the server-side coordinator's handleSyncGroup, when the generationId has been checked and the group state is PreparingRebalance, we can send the assignment along with the error code REBALANCE_IN_PROGRESS.
# When the client gets the REBALANCE_IN_PROGRESS error, it tries to apply the assignment first and then sets rejoinNeeded = true so that it re-joins immediately
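A minimal sketch of steps 2 and 3 of the proposal, again as a toy model in Python rather than real Kafka code. {{SyncResponse}}, {{handle_sync_group}} and {{ToyConsumer}} are hypothetical names invented for this example. The sketch assumes the coordinator can attach the assignment to an error response, which is exactly the behaviour being proposed here.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, Optional, Set


@dataclass
class SyncResponse:
    error: str
    assignment: Optional[FrozenSet[int]]  # None when nothing is sent back


def handle_sync_group(group_state, generation_ok, assignment):
    """Toy coordinator side (step 2); not the real handleSyncGroup."""
    if not generation_ok:
        return SyncResponse("ILLEGAL_GENERATION", None)
    if group_state == "PreparingRebalance":
        # Proposed behaviour: still hand out the assignment with the error.
        return SyncResponse("REBALANCE_IN_PROGRESS", frozenset(assignment))
    return SyncResponse("NONE", frozenset(assignment))


@dataclass
class ToyConsumer:
    owned: Set[int]
    rejoin_needed: bool = False
    revoked: Set[int] = field(default_factory=set)

    def on_sync_response(self, resp):
        """Toy client side (step 3)."""
        if resp.assignment is not None:
            # Apply the assignment first: revoke only partitions that are
            # no longer assigned to this member, keep everything else.
            self.revoked |= self.owned - resp.assignment
            self.owned = set(resp.assignment)
        if resp.error == "REBALANCE_IN_PROGRESS":
            # Then re-join immediately instead of resetting all state.
            self.rejoin_needed = True
```

In this model a consumer whose assignment did not change keeps all of its partitions and only flags that it must re-join, which is the outcome the proposal is aiming for.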
was:
In https://issues.apache.org/jira/browse/KAFKA-13419 we found that some consumers didn't reset their generation and state after a sync group request failed with the REBALANCE_IN_PROGRESS error.
So we fixed it by resetting the generationId (but not the memberId) when sync group fails with the REBALANCE_IN_PROGRESS error.
But that change missed the reset part, so another change made in https://issues.apache.org/jira/browse/KAFKA-13891 made it work.
After applying this change, we found that consumers will sometimes revoke almost 1/3 of the partitions with cooperative rebalancing enabled. This is because if one consumer does a very quick re-join, the other consumers will get REBALANCE_IN_PROGRESS in syncGroup and revoke their partitions before re-joining.
I think the whole history is: in cooperative rebalancing, we found a duplicate assignment bug https://issues.apache.org/jira/browse/KAFKA-12984 and had several fixes:
#
https://issues.apache.org/jira/browse/KAFKA-12983
*
**
> Revoke more partitions than expected in Cooperative rebalance
> -------------------------------------------------------------
>
> Key: KAFKA-14016
> URL: https://issues.apache.org/jira/browse/KAFKA-14016
> Project: Kafka
> Issue Type: Bug
> Components: clients
> Affects Versions: 3.3.0
> Reporter: Shawn Wang
> Priority: Major