Posted to jira@kafka.apache.org by "A. Sophie Blee-Goldman (Jira)" <ji...@apache.org> on 2021/03/16 02:06:00 UTC

[jira] [Commented] (KAFKA-12472) Add a Consumer / Streams metric to indicate the current rebalance status

    [ https://issues.apache.org/jira/browse/KAFKA-12472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17302155#comment-17302155 ] 

A. Sophie Blee-Goldman commented on KAFKA-12472:
------------------------------------------------

This sounds like a great (and much needed) improvement to the rebalancing observability. Thanks for the proposal!

A few thoughts I had while reading this:
 # We should distinguish between a shutdown/unsubscribe LeaveGroup and a user-triggered rebalance via enforceRebalance(). Especially within Streams, these have very different purposes and meanings (I realize Streams will have its own enums for all cases of enforceRebalance() and won't fall back on this one, this just highlights that they are semantically quite distinct).
 # Similarly, should we have a separate enum for the “Consumer being closed” and “Unsubscribed all partitions” cases of a LeaveGroup? This one I think is less important to distinguish than the enforceRebalance case, but it’s worth considering.
 # If you can only transition to 0 or to a higher code, should the CoordinatorRequested enum have a higher value? It seems like NewMember should be first after None. I know this is addressed at the end of the ticket, but I’m not sure I buy it — also the explanation seems only relevant to Streams, and we should not design the plain consumer client around Kafka Streams unless it also makes sense for the plain consumer (which I’m not sure it does).
 # Just wondering, when do we get an UnknownMemberId vs. an IllegalGeneration error after being kicked out of the group?
 # For Streams, I think we should still report at the consumer/thread level and let users roll up the metrics themselves, or else provide an aggregated client-level metric _in addition_ to the thread-level ones. We have seen many examples where one thread is continuously rebalancing for whatever reason, eg dropping out on the max.poll.interval.ms due to a particular task. Reporting these at the thread-level will greatly help with visibility into this sort of thing within Streams.

> Add a Consumer / Streams metric to indicate the current rebalance status
> ------------------------------------------------------------------------
>
>                 Key: KAFKA-12472
>                 URL: https://issues.apache.org/jira/browse/KAFKA-12472
>             Project: Kafka
>          Issue Type: Improvement
>          Components: consumer, streams
>            Reporter: Guozhang Wang
>            Priority: Major
>              Labels: needs-kip
>
> Today, to troubleshoot a rebalance issue, operators need to perform a lot of manual steps: locating the problematic members, searching the log entries, and looking at related metrics. It would be great to add a single metric that covers all these manual steps, so operators would only need to check this one signal to determine the root cause. A concrete idea is to expose two enum gauge metrics, on the consumer and on streams, respectively:
> * Consumer level (the order below is by-design, see Streams level for details):
>   0. *None* => there is no rebalance ongoing.
>   1. *CoordinatorRequested* => any coordinator response contains a RebalanceInProgress error code.
>   2. *NewMember* => when the join group response has a MemberIdRequired error code.
>   3. *UnknownMember* => any coordinator response contains an UnknownMember error code, indicating this member has already been kicked out of the group.
>   4. *StaleMember* => any coordinator response contains an IllegalGeneration error code.
>   5. *DroppedGroup* => the heartbeat thread decides to call leaveGroup because the heartbeat session has expired.
>   6. *UserRequested* => leaveGroup is triggered by the shutdown / unsubscribe-all APIs, or by calling the enforceRebalance API.
>   7. *MetadataChanged* => requestRejoin triggered since metadata has changed.
>   8. *SubscriptionChanged* => requestRejoin triggered since subscription has changed.
>   9. *RetryOnError* => a join/syncGroup response contains a retriable error which would cause the consumer to back off and retry.
>  10. *RevocationNeeded* => requestRejoin triggered since the set of revoked partitions is not empty.
> The transition rule is that a non-zero status code can only transition to zero or to a higher code, never to a lower code (the same holds for streams; see the rationales below).
> * Streams level: today a streams client can have multiple consumers. We introduce some new enum states as well as aggregation rules across consumers: if there are no streams-layer events (listed below) that transition its status (i.e. the streams layer thinks it is 0), then we aggregate across all the embedded consumers and take the largest status code value as the streams metric; if there are streams-layer events that put its status at 10 or above, then all embedded consumer-layer status codes are ignored, since the streams layer takes higher precedence. In addition, when creating an aggregated metric across streams instances (a.k.a. at the app level, which is usually what we would care about and alert on), we follow the same aggregation rule, e.g. if there are two streams instances where one instance's status code is 1) and the other's is 10), then the app's status is 10).
>  10. *RevocationNeeded* => the definition is extended from the consumer-level 10) above: it also covers the case where the leader decides to revoke either active or standby tasks and hence schedules a follow-up rebalance.
>  11. *AssignmentProbing* => leader decides to schedule follow-ups since the current assignment is unstable.
>  12. *VersionProbing* => leader decides to schedule follow-ups due to version probing.
>  13. *EndpointUpdate* => anyone decides to schedule follow-ups due to endpoint updates.
> The main motivations of the above proposed precedence order are the following:
> 1. When a rebalance is triggered by one member, all other members would only know it is due to CoordinatorRequested from coordinator error codes, and hence CoordinatorRequested should be overridden by any other status when aggregating across clients.
> 2. DroppedGroup could cause unknown/stale members that would fail and retry immediately, and hence should take higher precedence.
> 3. Revocation definition is extended in Streams, and hence it needs to take the highest precedence among all consumer-only status so that it would not be overridden by any of the consumer-only status.
> 4. In general, more rare events get higher precedence.
> This is proposed on top of KAFKA-12352. Any comments on the precedence rules / categorization are more than welcome!
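To make the proposed consumer-level codes and the "only transition to zero or a higher code" rule concrete, here is a minimal Java sketch. The class and method names are illustrative only and do not come from the Kafka codebase; the enum values mirror the list in the ticket.

```java
// Hypothetical sketch of the proposed consumer-level rebalance status codes
// and the transition rule; names are illustrative, not actual Kafka APIs.
public class RebalanceStatusSketch {

    enum RebalanceStatus {
        NONE(0),
        COORDINATOR_REQUESTED(1),
        NEW_MEMBER(2),
        UNKNOWN_MEMBER(3),
        STALE_MEMBER(4),
        DROPPED_GROUP(5),
        USER_REQUESTED(6),
        METADATA_CHANGED(7),
        SUBSCRIPTION_CHANGED(8),
        RETRY_ON_ERROR(9),
        REVOCATION_NEEDED(10);

        final int code;
        RebalanceStatus(int code) { this.code = code; }
    }

    private RebalanceStatus current = RebalanceStatus.NONE;

    // A non-zero status may only move back to NONE (rebalance completed)
    // or forward to a strictly higher code; anything else is rejected.
    boolean transitionTo(RebalanceStatus next) {
        if (next == RebalanceStatus.NONE || next.code > current.code) {
            current = next;
            return true;
        }
        return false; // equal or lower non-zero codes are ignored
    }

    RebalanceStatus current() { return current; }
}
```

Under this sketch, a gauge reporting `current().code` would only ever reset to 0 or climb within a single rebalance episode, which is what makes the "take the max" aggregation described for Streams well defined.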
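The Streams-level aggregation rule described above (streams-layer events in the 10+ range win; otherwise take the maximum code across the embedded consumers, and the same max rule rolls instance codes up to the app level) can be sketched as follows. This is an illustration of the proposed rule, not actual Kafka Streams code.

```java
import java.util.Collections;
import java.util.List;

// Illustrative sketch of the proposed Streams aggregation rule: if the
// streams layer itself has set a status of 10 or above, it takes
// precedence; otherwise report the maximum status code across the
// embedded consumers (0 when there are none).
public class StreamsStatusAggregation {

    static int aggregate(int streamsLayerCode, List<Integer> consumerCodes) {
        if (streamsLayerCode >= 10) {
            return streamsLayerCode; // streams-layer events win
        }
        return consumerCodes.isEmpty()
                ? 0
                : Collections.max(consumerCodes);
    }
}
```

For example, a client whose streams layer is at 0 with embedded consumers at codes 1 and 5 would report 5, while a streams-layer VersionProbing status of 12 would mask any consumer-level code; applying the same max rule across two instances reporting 1 and 10 yields an app-level status of 10.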



--
This message was sent by Atlassian Jira
(v8.3.4#803005)