Posted to jira@kafka.apache.org by "Sagar Rao (Jira)" <ji...@apache.org> on 2021/09/04 17:23:00 UTC

[jira] [Commented] (KAFKA-12550) Introduce RESTORING state to the KafkaStreams FSM

    [ https://issues.apache.org/jira/browse/KAFKA-12550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17410010#comment-17410010 ] 

Sagar Rao commented on KAFKA-12550:
-----------------------------------

[~ableegoldman]/[~mjsax], thanks, it makes sense now. I like the idea of removing the PARTITIONS_ASSIGNED/PARTITIONS_REVOKED states from StreamThread and adding REBALANCING/RESTORING instead, given cooperative rebalancing.

I also agree that REBALANCING should take precedence over RESTORING. 
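
To make the precedence concrete, here is a minimal sketch of how a client-level state could be derived from per-thread states. The class name, the reduced set of states, and the aggregate() helper are all hypothetical and only illustrate the ordering discussed here; this is not the actual KafkaStreams or StreamThread code.

```java
import java.util.List;

// Hypothetical sketch: names are illustrative, not the Kafka Streams API.
final class ClientStateSketch {

    // Simplified set of states; the real FSMs have more states than this.
    enum State { RUNNING, RESTORING, REBALANCING }

    // Derive a client-level state from the states of all threads.
    // REBALANCING wins over RESTORING, which wins over RUNNING.
    // An empty list falls through to RUNNING (an assumption of this sketch).
    static State aggregate(List<State> threadStates) {
        if (threadStates.contains(State.REBALANCING)) {
            return State.REBALANCING;
        }
        if (threadStates.contains(State.RESTORING)) {
            return State.RESTORING;
        }
        return State.RUNNING;
    }

    public static void main(String[] args) {
        // One thread is still rebalancing, so the client reports REBALANCING
        // even though another thread has already moved on to restoring.
        List<State> threads = List.of(State.RUNNING, State.RESTORING, State.REBALANCING);
        System.out.println(aggregate(threads));
    }
}
```

With this ordering, the client would only report RESTORING once no thread is rebalancing any more, which matches the "time between the completion of a stable rebalance and the completion of restoration" in the description.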

> Introduce RESTORING state to the KafkaStreams FSM
> -------------------------------------------------
>
>                 Key: KAFKA-12550
>                 URL: https://issues.apache.org/jira/browse/KAFKA-12550
>             Project: Kafka
>          Issue Type: Improvement
>          Components: streams
>            Reporter: A. Sophie Blee-Goldman
>            Assignee: Sagar Rao
>            Priority: Major
>              Labels: needs-kip
>             Fix For: 4.0.0
>
>
> We should consider adding a new state to the KafkaStreams FSM: RESTORING
> This would cover the time between the completion of a stable rebalance and the completion of restoration across the client. Currently, Streams reports the state during this time as REBALANCING, even though in most cases it spends much more time restoring than rebalancing.
> There are a few motivations/benefits behind this idea:
> # Observability is a big one: using the umbrella REBALANCING state to cover all aspects of rebalancing -> task initialization -> restoring has been a common source of confusion in the past. It has also proved to be a time sink for us during escalations, incidents, mailing list questions, and bug reports. It often adds latency to escalations in particular, as we have to go through GTS and wait for the customer to clarify whether their “Kafka Streams is stuck rebalancing” ticket means that it’s literally rebalancing, or just in the REBALANCING state and actually stuck elsewhere in Streams.
> # Prereq for global thread improvements: for example [KIP-406: GlobalStreamThread should honor custom reset policy |https://cwiki.apache.org/confluence/display/KAFKA/KIP-406%3A+GlobalStreamThread+should+honor+custom+reset+policy] was ultimately blocked on this as we needed to pause the Streams app while the global thread restored from the appropriate offset. Since there’s absolutely no rebalancing involved in this case, piggybacking on the REBALANCING state would just be shooting ourselves in the foot.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)