Posted to jira@kafka.apache.org by "Guozhang Wang (Jira)" <ji...@apache.org> on 2021/02/21 02:56:00 UTC

[jira] [Created] (KAFKA-12352) Improve debuggability with continuous consumer rebalances

Guozhang Wang created KAFKA-12352:
-------------------------------------

Summary: Improve debuggability with continuous consumer rebalances
Key: KAFKA-12352
URL: https://issues.apache.org/jira/browse/KAFKA-12352
Project: Kafka
Issue Type: Improvement
Components: consumer, streams
Reporter: Guozhang Wang
Assignee: Guozhang Wang

There are several scenarios where a consumer/streams client can fall into continuous rebalances and hence make no progress. Today when this happens, developers usually need to do a lot of digging to understand what is going on. Here's a short summary of the different scenarios where we (re-)trigger rebalances:

1. Group member kicked out of the group: after the coordinator has kicked out the member, the member's next join / sync / heartbeat / offset-commit request will fail and the member will try to re-join. If the member keeps calling poll too late, it will continuously fall into this scenario and not make progress.

2. Group is rebalancing: if the group is rebalancing at the moment, the member's heartbeat / offset commit / sync-group will fail. In this case the member rejoining the group is not the root cause of the rebalance.

3. Caller enforces a rebalance via `enforceRebalance`. This is one-off and should not cause rebalance storms.

4. After a rebalance is completed, the member finds out that a) its subscription has changed or b) the number of partitions of its subscribed topics has changed since the re-join request was sent. In this case it needs to re-trigger the rebalance in order to get the new assignment. Since the subscription change is one-off, it should not cause rebalance storms; topic metadata changes should also be infrequent, but there are some rare cases where topic metadata keeps "vibrating" due to broker-side issues.

5. After a rebalance is completed, the member needs to revoke some partitions as indicated by the assignment. After the revocation it needs to re-join the group. This may cause rebalance storms when the partition assignor is sub-optimal in determining the assignment, so that partitions keep migrating around and rebalances are triggered continuously.
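To make scenario 1 more concrete, here is a minimal sketch of how an application could notice that it is calling poll too late. It assumes the eviction is caused by exceeding the consumer's `max.poll.interval.ms` setting; `PollGapTracker` and its methods are hypothetical names made up for illustration and are not part of the Kafka client.

```java
import java.time.Duration;

/** Hypothetical helper (not a Kafka API): flags poll calls that arrive too late. */
final class PollGapTracker {
    private final long maxPollIntervalMs;
    private long lastPollMs = -1L; // -1 means no poll has been recorded yet
    private int latePolls = 0;

    PollGapTracker(Duration maxPollInterval) {
        this.maxPollIntervalMs = maxPollInterval.toMillis();
    }

    /** Records a poll at nowMs; returns true if the gap since the previous poll exceeded the limit. */
    boolean recordPoll(long nowMs) {
        boolean late = lastPollMs >= 0 && nowMs - lastPollMs > maxPollIntervalMs;
        if (late) {
            latePolls++;
        }
        lastPollMs = nowMs;
        return late;
    }

    /** Number of late polls seen so far; a steadily growing value hints at scenario 1. */
    int latePolls() {
        return latePolls;
    }

    public static void main(String[] args) {
        // 5 minutes (300000 ms) is the documented default of max.poll.interval.ms.
        PollGapTracker tracker = new PollGapTracker(Duration.ofMinutes(5));
        // Illustrative timestamps: the third poll comes 6 minutes after the second.
        long[] pollTimesMs = {0L, 60_000L, 420_000L};
        for (long now : pollTimesMs) {
            if (tracker.recordPoll(now)) {
                System.out.println("poll at " + now + " ms exceeded max.poll.interval.ms");
            }
        }
    }
}
```

In a real application one would call `recordPoll(System.currentTimeMillis())` right before each `consumer.poll(...)`; the sketch takes the timestamp as a parameter only to keep it deterministic.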

As we can see, scenarios 1 and 5 above could potentially cause rebalance storms, while 2, 3 and 4 should not in normal cases. In all of these scenarios, we should expose exactly why the member is re-joining the group, and whether this re-join would trigger the rebalance or whether the group is already in a rebalance (in which case the join-group is not the cause of the rebalance but a result of it). This could help operators quickly nail down which of the above may be the root cause of continuous rebalances.
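One possible shape for this categorization is sketched below as a plain enum. This is only an illustration of the idea, not the client's actual implementation: `RejoinReason`, its constants, its flags, and the log format are all hypothetical, and the flag values reflect my reading of the five scenarios above.

```java
/** Small demo entry point for the hypothetical RejoinReason enum below. */
final class RejoinReasonDemo {
    public static void main(String[] args) {
        // Print one line per scenario, in the format a client log could use.
        for (RejoinReason reason : RejoinReason.values()) {
            System.out.println(reason.describe());
        }
    }
}

/** Hypothetical categorization of why a member re-joins the group (scenarios 1 to 5). */
enum RejoinReason {
    // (triggersRebalance, stormRisk)
    MEMBER_EVICTED(true, true),                    // scenario 1
    GROUP_ALREADY_REBALANCING(false, false),       // scenario 2: the re-join is a result, not the cause
    ENFORCED_BY_CALLER(true, false),               // scenario 3: one-off via enforceRebalance
    SUBSCRIPTION_OR_METADATA_CHANGED(true, false), // scenario 4: normally infrequent
    PARTITIONS_REVOKED(true, true);                // scenario 5: depends on the assignor

    private final boolean triggersRebalance;
    private final boolean stormRisk;

    RejoinReason(boolean triggersRebalance, boolean stormRisk) {
        this.triggersRebalance = triggersRebalance;
        this.stormRisk = stormRisk;
    }

    /** True if this re-join itself starts a rebalance rather than following one. */
    boolean triggersRebalance() {
        return triggersRebalance;
    }

    /** True if the scenario can repeat continuously according to the summary above. */
    boolean stormRisk() {
        return stormRisk;
    }

    /** Formats the reason as a single log-friendly line. */
    String describe() {
        return String.format("rejoin reason=%s, triggersRebalance=%b, stormRisk=%b",
                name(), triggersRebalance, stormRisk);
    }
}
```

Keeping the reason as a single enumerated value would also make it straightforward to expose later as a state metric, since each constant maps to one well-defined state.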

I'd suggest we first go through the log4j hierarchy to make sure this is the right place, and maybe in the future we can expose a single state metric on top of the logging categorization for even more convenient troubleshooting.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)