Posted to dev@kafka.apache.org by Guozhang Wang <gu...@gmail.com> on 2022/10/12 17:02:02 UTC

[VOTE] KIP-869: Improve Streams State Restoration Visibility

Hello all,

I'd like to start a vote for the following KIP, aiming to improve Kafka
Streams' restoration visibility via new metrics and callback methods:

https://cwiki.apache.org/confluence/display/KAFKA/KIP-869%3A+Improve+Streams+State+Restoration+Visibility


Thanks!
-- Guozhang

Re: [VOTE] KIP-869: Improve Streams State Restoration Visibility

Posted by Walker Carlson <wc...@confluent.io.INVALID>.

Hey, I'm changing my vote to binding now :)

On Mon, Jan 23, 2023 at 9:38 PM Matthias J. Sax <mj...@apache.org> wrote:

> Thanks Guozhang. Couple of clarifications and follow up questions.
>
>
> >> I'm not aware of a discussion to rename the call name to "suspend" for
> >> KIP-834. Could you point me to the reference?
>
> My comment was not about KIP-834, but about this KIP. You originally
> proposed to call the new callback `onRestorePause` but to avoid
> confusion it was improved and renamed to `onRestoreSuspended`.
>
>
> > The only one so far that I feel is probably better, is
> > "state-update-ratio". If folks feel this one is better than
> > "restore-ratio" I'm happy to update.
>
> Could we actually report two metrics, one for the restore phase
> (restore-ratio), and one for steady state ([standby-]update-ratio)?
>
> I could live with `state-update-ratio` if we want to have a single
> metric for both, but splitting them sounds useful to me.
>
>
> > (4) `restore-call-rate`
>
> Maybe we can clarify in the description a little bit. I agree it's very
> low level but if you think it could be useful to debugging, I have no
> objection.
>
>
> > The rationale behind it is the general principle in metrics design
> > that "Kafka would provide the lowest necessary metrics levels, and
> > users can do the roll-ups however they want".
>
> That's fair, but it seems to be a rather important metric, and having it
> only at DEBUG level seems not ideal? Could we make it INFO level, even
> if it's a task level (ie, apply an exception to the rule).
>
>
>
> -Matthias
>
>
>
> On 1/19/23 2:35 PM, Guozhang Wang wrote:
> > Hello Matthias,
> >
> > Thanks for the feedback. I was on vacation for a while. Pardon for the
> > late replies. Please see them inline below
> >
> > On Thu, Dec 1, 2022 at 11:23 PM Matthias J. Sax <mj...@apache.org> wrote:
> >>
> >> Seems I am late to the party... Great KIP. Couple of questions from my side:
> >>
> >> (1) What is the purpose of `standby-updating-tasks`? It seems to be the
> >> same as the number of assigned standby task? Not sure how useful it
> >> would be?
> >>
> > In general, yes, it is the number of assigned standby tasks --- there
> > will be transit times when the assigned standby tasks are not yet
> > being updated but it would not last long --- but we do not yet have a
> > direct gauge to expose this before, and users have to infer this from
> > other indirect metrics.
> >
> >>
> >>
> >> (2) `active-paused-tasks` / `standby-paused-tasks` -- what does "paused"
> >> exactly mean? There was a discussion about renaming the callback method
> >> from pause to suspended. So should this be called `suspended`, too? And
> >> if yes, how is it useful for users?
> >>
> > Pausing here refers to "KIP-834: Pause / Resume KafkaStreams
> > Topologies" (https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=211882832).
> > When a topology is paused, all its tasks including standbys will be
> > paused too.
> >
> > I'm not aware of a discussion to rename the call name to "suspend" for
> > KIP-834. Could you point me to the reference?
> >
> >>
> >>
> >> (3) `restore-ratio`: the description says
> >>
> >>> The fraction of time the thread spent on restoring active or standby tasks
> >>
> >> I find the term "restoring" does only apply to active tasks, but not to
> >> standbys. Can we reword this?
> >>
> > Yeah I have been discussing this with others in the community a bit as
> > well, but so far I have not been convinced of a better name than it.
> > Some other alternatives being discussed but not win everyone's love is
> > "restore-or-update-ratio", "process-ratio" (for the restore thread
> > that means restoring or updating), and "io-ratio".
> >
> > The only one so far that I feel is probably better, is
> > "state-update-ratio". If folks feel this one is better than
> > "restore-ratio" I'm happy to update.
> >
> >>
> >> (4) `restore-call-rate`: not sure what you exactly mean by "restore calls"?
> >>
> > This is similar to the "io-calls-rate" in the selector classes, i.e.
> > the number of "restore" function calls made. It's arguably a very
> > low-level metric but I included it since it could be useful in some
> > debugging scenarios.
> >
> >>
> >> (5) `restore-remaining-records-total` -- why is this a task metric?
> >> Seems we could roll it up into a thread metric that we report at INFO
> >> level (we could still have per-task DEBUG level metric for it in addition).
> >>
> > The rationale behind it is the general principle in metrics design
> > that "Kafka would provide the lowest necessary metrics levels, and
> > users can do the roll-ups however they want".
> >
> >>
> >> (6) What about "warmup tasks"? Internally, we treat them as standbys,
> >> but it seems it's hard for users to reason about it in the scale-out
> >> warm-up case. Would it be helpful (and possible) to report "warmup
> >> progress" explicitly?
> >>
> > At the restore thread level, we cannot differentiate standby tasks
> > from warmup tasks since the latter are created exactly like the
> > former. But I do agree this is a visibility issue that is worth
> > addressing; I think another KIP would be needed to first consider
> > distinguishing these two at the class level.
> >
> >>
> >> -Matthias
> >>
> >>
> >> On 11/1/22 2:44 AM, Lucas Brutschy wrote:
> >>> We need this!
> >>>
> >>> + 1 non binding
> >>>
> >>> Cheers,
> >>> Lucas
> >>>
> >>> On Tue, Nov 1, 2022 at 10:01 AM Bruno Cadonna <ca...@apache.org> wrote:
> >>>>
> >>>> Guozhang,
> >>>>
> >>>> Thanks for the KIP!
> >>>>
> >>>> +1 (binding)
> >>>>
> >>>> Best,
> >>>> Bruno
> >>>>
> >>>> On 25.10.22 22:07, Walker Carlson wrote:
> >>>>> +1 non binding
> >>>>>
> >>>>> Thanks for the kip!
> >>>>>
> >>>>> On Thu, Oct 20, 2022 at 10:25 PM John Roesler <vv...@apache.org> wrote:
> >>>>>
> >>>>>> Thanks for the KIP, Guozhang!
> >>>>>>
> >>>>>> I'm +1 (binding)
> >>>>>>
> >>>>>> -John
> >>>>>>
> >>>>>> On Wed, Oct 12, 2022, at 16:36, Nick Telford wrote:
> >>>>>>> Can't wait!
> >>>>>>> +1 (non-binding)
> >>>>>>>
> >>>>>>> On Wed, 12 Oct 2022, 18:02 Guozhang Wang, <guozhang.wang.us@gmail.com>
> >>>>>>> wrote:
> >>>>>>>
> >>>>>>>> Hello all,
> >>>>>>>>
> >>>>>>>> I'd like to start a vote for the following KIP, aiming to improve Kafka
> >>>>>>>> Stream's restoration visibility via new metrics and callback methods:
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-869%3A+Improve+Streams+State+Restoration+Visibility
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> Thanks!
> >>>>>>>> -- Guozhang
> >>>>>>>>
> >>>>>>
> >>>>>
>

Re: [VOTE] KIP-869: Improve Streams State Restoration Visibility

Posted by "Matthias J. Sax" <mj...@apache.org>.

Thanks Guozhang. Couple of clarifications and follow up questions.


>> I'm not aware of a discussion to rename the call name to "suspend" for
>> KIP-834. Could you point me to the reference?

My comment was not about KIP-834, but about this KIP. You originally
proposed to call the new callback `onRestorePause` but to avoid
confusion it was improved and renamed to `onRestoreSuspended`.


> The only one so far that I feel is probably better, is
> "state-update-ratio". If folks feel this one is better than
> "restore-ratio" I'm happy to update.

Could we actually report two metrics, one for the restore phase
(restore-ratio), and one for steady state ([standby-]update-ratio)?

I could live with `state-update-ratio` if we want to have a single
metric for both, but splitting them sounds useful to me.
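
To make the difference between a single ratio and two split ratios concrete, here is a minimal sketch. It assumes the thread can attribute each slice of its time to restoring active tasks, updating standby tasks, or idling; the class and method names are hypothetical and are not part of Kafka Streams or the KIP.

```java
// Hypothetical sketch: these class and method names are illustrative only
// and are not part of Kafka Streams.
final class StateUpdaterTimeSketch {
    private long restoreNanos;  // time spent restoring active tasks
    private long standbyNanos;  // time spent updating standby tasks
    private long totalNanos;    // all observed time, including idle time

    void recordRestore(long nanos) {
        restoreNanos += nanos;
        totalNanos += nanos;
    }

    void recordStandbyUpdate(long nanos) {
        standbyNanos += nanos;
        totalNanos += nanos;
    }

    void recordIdle(long nanos) {
        totalNanos += nanos;
    }

    // Fraction of observed time spent restoring active tasks.
    double restoreRatio() {
        return totalNanos == 0 ? 0.0 : (double) restoreNanos / totalNanos;
    }

    // Fraction of observed time spent updating standby tasks.
    double standbyUpdateRatio() {
        return totalNanos == 0 ? 0.0 : (double) standbyNanos / totalNanos;
    }

    // Single combined ratio covering both kinds of work.
    double stateUpdateRatio() {
        return totalNanos == 0 ? 0.0 : (double) (restoreNanos + standbyNanos) / totalNanos;
    }

    public static void main(String[] args) {
        StateUpdaterTimeSketch sketch = new StateUpdaterTimeSketch();
        sketch.recordRestore(600);
        sketch.recordStandbyUpdate(300);
        sketch.recordIdle(100);
        System.out.println("restore-ratio        = " + sketch.restoreRatio());
        System.out.println("standby-update-ratio = " + sketch.standbyUpdateRatio());
        System.out.println("state-update-ratio   = " + sketch.stateUpdateRatio());
    }
}
```

With the split, the combined value can still be derived by adding the two ratios, whereas a single combined metric cannot be split back apart.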


> (4) `restore-call-rate`

Maybe we can clarify in the description a little bit. I agree it's very
low level but if you think it could be useful for debugging, I have no
objection.


> The rationale behind it is the general principle in metrics design
> that "Kafka would provide the lowest necessary metrics levels, and
> users can do the roll-ups however they want".

That's fair, but it seems to be a rather important metric, and having it
only at DEBUG level seems not ideal? Could we make it INFO level, even
if it's a task-level metric (i.e., make an exception to the rule)?
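
For reference, the user-side roll-up that the quoted principle refers to could look roughly like the sketch below. It assumes per-task samples of the remaining-records value (the `restore-remaining-records-total` metric from point (5)) have already been collected; the `TaskSample` type and the thread and task ids are hypothetical and do not reflect how Kafka Streams exposes its metrics.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of a user-side roll-up; not a Kafka Streams API.
final class RemainingRecordsRollupSketch {

    // One per-task sample of the remaining-records value (illustrative type).
    static final class TaskSample {
        final String threadId;
        final String taskId;
        final long remainingRecords;

        TaskSample(String threadId, String taskId, long remainingRecords) {
            this.threadId = threadId;
            this.taskId = taskId;
            this.remainingRecords = remainingRecords;
        }
    }

    // Sums the per-task values into one total per thread.
    static Map<String, Long> rollUpByThread(List<TaskSample> samples) {
        Map<String, Long> totals = new TreeMap<>();
        for (TaskSample sample : samples) {
            totals.merge(sample.threadId, sample.remainingRecords, Long::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        List<TaskSample> samples = Arrays.asList(
            new TaskSample("thread-1", "0_0", 100L),
            new TaskSample("thread-1", "0_1", 50L),
            new TaskSample("thread-2", "1_0", 7L));
        System.out.println(rollUpByThread(samples));
    }
}
```

The roll-up itself is simple; the open question in the thread is only which recording level the per-task value is reported at.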



-Matthias



On 1/19/23 2:35 PM, Guozhang Wang wrote:
> Hello Matthias,
> 
> Thanks for the feedback. I was on vacation for a while. Pardon for the
> late replies. Please see them inline below
> 
> On Thu, Dec 1, 2022 at 11:23 PM Matthias J. Sax <mj...@apache.org> wrote:
>>
>> Seems I am late to the party... Great KIP. Couple of questions from my side:
>>
>> (1) What is the purpose of `standby-updating-tasks`? It seems to be the
>> same as the number of assigned standby task? Not sure how useful it
>> would be?
>>
> In general, yes, it is the number of assigned standby tasks --- there
> will be transit times when the assigned standby tasks are not yet
> being updated but it would not last long --- but we do not yet have a
> direct gauge to expose this before, and users have to infer this from
> other indirect metrics.
> 
>>
>>
>> (2) `active-paused-tasks` / `standby-paused-tasks` -- what does "paused"
>> exactly mean? There was a discussion about renaming the callback method
>> from pause to suspended. So should this be called `suspended`, too? And
>> if yes, how is it useful for users?
>>
> Pausing here refers to "KIP-834: Pause / Resume KafkaStreams
> Topologies" (https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=211882832).
> When a topology is paused, all its tasks including standbys will be
> paused too.
> 
> I'm not aware of a discussion to rename the call name to "suspend" for
> KIP-834. Could you point me to the reference?
> 
>>
>>
>> (3) `restore-ratio`: the description says
>>
>>> The fraction of time the thread spent on restoring active or standby tasks
>>
>> I find the term "restoring" does only apply to active tasks, but not to
>> standbys. Can we reword this?
>>
> Yeah I have been discussing this with others in the community a bit as
> well, but so far I have not been convinced of a better name than it.
> Some other alternatives being discussed but not win everyone's love is
> "restore-or-update-ratio", "process-ratio" (for the restore thread
> that means restoring or updating), and "io-ratio".
> 
> The only one so far that I feel is probably better, is
> "state-update-ratio". If folks feel this one is better than
> "restore-ratio" I'm happy to update.
> 
>>
>> (4) `restore-call-rate`: not sure what you exactly mean by "restore calls"?
>>
> This is similar to the "io-calls-rate" in the selector classes, i.e.
> the number of "restore" function calls made. It's arguably a very
> low-level metric but I included it since it could be useful in some
> debugging scenarios.
> 
>>
>> (5) `restore-remaining-records-total` -- why is this a task metric?
>> Seems we could roll it up into a thread metric that we report at INFO
>> level (we could still have per-task DEBUG level metric for it in addition).
>>
> The rationale behind it is the general principle in metrics design
> that "Kafka would provide the lowest necessary metrics levels, and
> users can do the roll-ups however they want".
> 
>>
>> (6) What about "warmup tasks"? Internally, we treat them as standbys,
>> but it seems it's hard for users to reason about it in the scale-out
>> warm-up case. Would it be helpful (and possible) to report "warmup
>> progress" explicitly?
>>
> At the restore thread level, we cannot differentiate standby tasks
> from warmup tasks since the latter are created exactly like the
> former. But I do agree this is a visibility issue that is worth
> addressing; I think another KIP would be needed to first consider
> distinguishing these two at the class level.
> 
>>
>> -Matthias
>>
>>
>> On 11/1/22 2:44 AM, Lucas Brutschy wrote:
>>> We need this!
>>>
>>> + 1 non binding
>>>
>>> Cheers,
>>> Lucas
>>>
>>> On Tue, Nov 1, 2022 at 10:01 AM Bruno Cadonna <ca...@apache.org> wrote:
>>>>
>>>> Guozhang,
>>>>
>>>> Thanks for the KIP!
>>>>
>>>> +1 (binding)
>>>>
>>>> Best,
>>>> Bruno
>>>>
>>>> On 25.10.22 22:07, Walker Carlson wrote:
>>>>> +1 non binding
>>>>>
>>>>> Thanks for the kip!
>>>>>
>>>>> On Thu, Oct 20, 2022 at 10:25 PM John Roesler <vv...@apache.org> wrote:
>>>>>
>>>>>> Thanks for the KIP, Guozhang!
>>>>>>
>>>>>> I'm +1 (binding)
>>>>>>
>>>>>> -John
>>>>>>
>>>>>> On Wed, Oct 12, 2022, at 16:36, Nick Telford wrote:
>>>>>>> Can't wait!
>>>>>>> +1 (non-binding)
>>>>>>>
>>>>>>> On Wed, 12 Oct 2022, 18:02 Guozhang Wang, <gu...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hello all,
>>>>>>>>
>>>>>>>> I'd like to start a vote for the following KIP, aiming to improve Kafka
>>>>>>>> Stream's restoration visibility via new metrics and callback methods:
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-869%3A+Improve+Streams+State+Restoration+Visibility
>>>>>>>>
>>>>>>>>
>>>>>>>> Thanks!
>>>>>>>> -- Guozhang
>>>>>>>>
>>>>>>
>>>>>

Re: [VOTE] KIP-869: Improve Streams State Restoration Visibility

Posted by Guozhang Wang <gu...@gmail.com>.

Thanks for all the very helpful discussions, I'm closing the vote with
a tally here:

+1: 7 (Nick, John, Walker, Bruno, Lucas, Matthias, Guozhang), with 5
binding votes and 2 non-binding votes.
-1: 0


Guozhang

On Wed, Jan 25, 2023 at 5:48 PM Matthias J. Sax <mj...@apache.org> wrote:
>
> Thanks!
>
> +1 (binding)
>
> -Matthias
>
> On 1/24/23 1:17 PM, Guozhang Wang wrote:
> > Hi Matthias:
> >
> > re "paused" -> "suspended": I got your point now, thanks. Just to
> > clarify the two functions are a bit different: "paused" tasks are
> > because of the topology being paused, i.e. from KIP-834; whereas
> > "suspended" tasks are when a restoring tasks are being removed before
> > it completes due to a follow-up rebalance, and this is to distinguish
> > with "onRestoreEnd", as described in KAFKA-10575. A suspended task is
> > no longer owned by the thread and hence there's no need to measure the
> > number of such tasks.
> >
> > re: "restore-ratio": that's a good point. I like it to function in the
> > same way as the "records-rate" metrics. Will update the wiki.
> >
> > re: making "restore-remaining-records-total" at INFO level: sounds
> > good to me too. I will also update the metric name a bit to be more
> > specific.
> >
> >
> >
> > On Thu, Jan 19, 2023 at 2:35 PM Guozhang Wang
> > <gu...@gmail.com> wrote:
> >>
> >> Hello Matthias,
> >>
> >> Thanks for the feedback. I was on vacation for a while. Pardon for the
> >> late replies. Please see them inline below
> >>
> >> On Thu, Dec 1, 2022 at 11:23 PM Matthias J. Sax <mj...@apache.org> wrote:
> >>>
> >>> Seems I am late to the party... Great KIP. Couple of questions from my side:
> >>>
> >>> (1) What is the purpose of `standby-updating-tasks`? It seems to be the
> >>> same as the number of assigned standby task? Not sure how useful it
> >>> would be?
> >>>
> >> In general, yes, it is the number of assigned standby tasks --- there
> >> will be transit times when the assigned standby tasks are not yet
> >> being updated but it would not last long --- but we do not yet have a
> >> direct gauge to expose this before, and users have to infer this from
> >> other indirect metrics.
> >>
> >>>
> >>>
> >>> (2) `active-paused-tasks` / `standby-paused-tasks` -- what does "paused"
> >>> exactly mean? There was a discussion about renaming the callback method
> >>> from pause to suspended. So should this be called `suspended`, too? And
> >>> if yes, how is it useful for users?
> >>>
> >> Pausing here refers to "KIP-834: Pause / Resume KafkaStreams
> >> Topologies" (https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=211882832).
> >> When a topology is paused, all its tasks including standbys will be
> >> paused too.
> >>
> >> I'm not aware of a discussion to rename the call name to "suspend" for
> >> KIP-834. Could you point me to the reference?
> >>
> >>>
> >>>
> >>> (3) `restore-ratio`: the description says
> >>>
> >>>> The fraction of time the thread spent on restoring active or standby tasks
> >>>
> >>> I find the term "restoring" does only apply to active tasks, but not to
> >>> standbys. Can we reword this?
> >>>
> >> Yeah I have been discussing this with others in the community a bit as
> >> well, but so far I have not been convinced of a better name than it.
> >> Some other alternatives being discussed but not win everyone's love is
> >> "restore-or-update-ratio", "process-ratio" (for the restore thread
> >> that means restoring or updating), and "io-ratio".
> >>
> >> The only one so far that I feel is probably better, is
> >> "state-update-ratio". If folks feel this one is better than
> >> "restore-ratio" I'm happy to update.
> >>
> >>>
> >>> (4) `restore-call-rate`: not sure what you exactly mean by "restore calls"?
> >>>
> >> This is similar to the "io-calls-rate" in the selector classes, i.e.
> >> the number of "restore" function calls made. It's arguably a very
> >> low-level metric but I included it since it could be useful in some
> >> debugging scenarios.
> >>
> >>>
> >>> (5) `restore-remaining-records-total` -- why is this a task metric?
> >>> Seems we could roll it up into a thread metric that we report at INFO
> >>> level (we could still have per-task DEBUG level metric for it in addition).
> >>>
> >> The rationale behind it is the general principle in metrics design
> >> that "Kafka would provide the lowest necessary metrics levels, and
> >> users can do the roll-ups however they want".
> >>
> >>>
> >>> (6) What about "warmup tasks"? Internally, we treat them as standbys,
> >>> but it seems it's hard for users to reason about it in the scale-out
> >>> warm-up case. Would it be helpful (and possible) to report "warmup
> >>> progress" explicitly?
> >>>
> >> At the restore thread level, we cannot differentiate standby tasks
> >> from warmup tasks since the latter are created exactly like the
> >> former. But I do agree this is a visibility issue that is worth
> >> addressing; I think another KIP would be needed to first consider
> >> distinguishing these two at the class level.
> >>
> >>>
> >>> -Matthias
> >>>
> >>>
> >>> On 11/1/22 2:44 AM, Lucas Brutschy wrote:
> >>>> We need this!
> >>>>
> >>>> + 1 non binding
> >>>>
> >>>> Cheers,
> >>>> Lucas
> >>>>
> >>>> On Tue, Nov 1, 2022 at 10:01 AM Bruno Cadonna <ca...@apache.org> wrote:
> >>>>>
> >>>>> Guozhang,
> >>>>>
> >>>>> Thanks for the KIP!
> >>>>>
> >>>>> +1 (binding)
> >>>>>
> >>>>> Best,
> >>>>> Bruno
> >>>>>
> >>>>> On 25.10.22 22:07, Walker Carlson wrote:
> >>>>>> +1 non binding
> >>>>>>
> >>>>>> Thanks for the kip!
> >>>>>>
> >>>>>> On Thu, Oct 20, 2022 at 10:25 PM John Roesler <vv...@apache.org> wrote:
> >>>>>>
> >>>>>>> Thanks for the KIP, Guozhang!
> >>>>>>>
> >>>>>>> I'm +1 (binding)
> >>>>>>>
> >>>>>>> -John
> >>>>>>>
> >>>>>>> On Wed, Oct 12, 2022, at 16:36, Nick Telford wrote:
> >>>>>>>> Can't wait!
> >>>>>>>> +1 (non-binding)
> >>>>>>>>
> >>>>>>>> On Wed, 12 Oct 2022, 18:02 Guozhang Wang, <gu...@gmail.com>
> >>>>>>>> wrote:
> >>>>>>>>
> >>>>>>>>> Hello all,
> >>>>>>>>>
> >>>>>>>>> I'd like to start a vote for the following KIP, aiming to improve Kafka
> >>>>>>>>> Stream's restoration visibility via new metrics and callback methods:
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-869%3A+Improve+Streams+State+Restoration+Visibility
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Thanks!
> >>>>>>>>> -- Guozhang
> >>>>>>>>>
> >>>>>>>
> >>>>>>

Re: [VOTE] KIP-869: Improve Streams State Restoration Visibility

Posted by "Matthias J. Sax" <mj...@apache.org>.

Thanks!

+1 (binding)

-Matthias

On 1/24/23 1:17 PM, Guozhang Wang wrote:
> Hi Matthias:
> 
> re "paused" -> "suspended": I got your point now, thanks. Just to
> clarify the two functions are a bit different: "paused" tasks are
> because of the topology being paused, i.e. from KIP-834; whereas
> "suspended" tasks are when a restoring tasks are being removed before
> it completes due to a follow-up rebalance, and this is to distinguish
> with "onRestoreEnd", as described in KAFKA-10575. A suspended task is
> no longer owned by the thread and hence there's no need to measure the
> number of such tasks.
> 
> re: "restore-ratio": that's a good point. I like it to function in the
> same way as the "records-rate" metrics. Will update the wiki.
> 
> re: making "restore-remaining-records-total" at INFO level: sounds
> good to me too. I will also update the metric name a bit to be more
> specific.
> 
> 
> 
> On Thu, Jan 19, 2023 at 2:35 PM Guozhang Wang
> <gu...@gmail.com> wrote:
>>
>> Hello Matthias,
>>
>> Thanks for the feedback. I was on vacation for a while. Pardon for the
>> late replies. Please see them inline below
>>
>> On Thu, Dec 1, 2022 at 11:23 PM Matthias J. Sax <mj...@apache.org> wrote:
>>>
>>> Seems I am late to the party... Great KIP. Couple of questions from my side:
>>>
>>> (1) What is the purpose of `standby-updating-tasks`? It seems to be the
>>> same as the number of assigned standby task? Not sure how useful it
>>> would be?
>>>
>> In general, yes, it is the number of assigned standby tasks --- there
>> will be transit times when the assigned standby tasks are not yet
>> being updated but it would not last long --- but we do not yet have a
>> direct gauge to expose this before, and users have to infer this from
>> other indirect metrics.
>>
>>>
>>>
>>> (2) `active-paused-tasks` / `standby-paused-tasks` -- what does "paused"
>>> exactly mean? There was a discussion about renaming the callback method
>>> from pause to suspended. So should this be called `suspended`, too? And
>>> if yes, how is it useful for users?
>>>
>> Pausing here refers to "KIP-834: Pause / Resume KafkaStreams
>> Topologies" (https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=211882832).
>> When a topology is paused, all its tasks including standbys will be
>> paused too.
>>
>> I'm not aware of a discussion to rename the call name to "suspend" for
>> KIP-834. Could you point me to the reference?
>>
>>>
>>>
>>> (3) `restore-ratio`: the description says
>>>
>>>> The fraction of time the thread spent on restoring active or standby tasks
>>>
>>> I find the term "restoring" does only apply to active tasks, but not to
>>> standbys. Can we reword this?
>>>
>> Yeah I have been discussing this with others in the community a bit as
>> well, but so far I have not been convinced of a better name than it.
>> Some other alternatives being discussed but not win everyone's love is
>> "restore-or-update-ratio", "process-ratio" (for the restore thread
>> that means restoring or updating), and "io-ratio".
>>
>> The only one so far that I feel is probably better, is
>> "state-update-ratio". If folks feel this one is better than
>> "restore-ratio" I'm happy to update.
>>
>>>
>>> (4) `restore-call-rate`: not sure what you exactly mean by "restore calls"?
>>>
>> This is similar to the "io-calls-rate" in the selector classes, i.e.
>> the number of "restore" function calls made. It's arguably a very
>> low-level metric but I included it since it could be useful in some
>> debugging scenarios.
>>
>>>
>>> (5) `restore-remaining-records-total` -- why is this a task metric?
>>> Seems we could roll it up into a thread metric that we report at INFO
>>> level (we could still have per-task DEBUG level metric for it in addition).
>>>
>> The rationale behind it is the general principle in metrics design
>> that "Kafka would provide the lowest necessary metrics levels, and
>> users can do the roll-ups however they want".
>>
>>>
>>> (6) What about "warmup tasks"? Internally, we treat them as standbys,
>>> but it seems it's hard for users to reason about it in the scale-out
>>> warm-up case. Would it be helpful (and possible) to report "warmup
>>> progress" explicitly?
>>>
>> At the restore thread level, we cannot differentiate standby tasks
>> from warmup tasks since the latter are created exactly like the
>> former. But I do agree this is a visibility issue that is worth
>> addressing; I think another KIP would be needed to first consider
>> distinguishing these two at the class level.
>>
>>>
>>> -Matthias
>>>
>>>
>>> On 11/1/22 2:44 AM, Lucas Brutschy wrote:
>>>> We need this!
>>>>
>>>> + 1 non binding
>>>>
>>>> Cheers,
>>>> Lucas
>>>>
>>>> On Tue, Nov 1, 2022 at 10:01 AM Bruno Cadonna <ca...@apache.org> wrote:
>>>>>
>>>>> Guozhang,
>>>>>
>>>>> Thanks for the KIP!
>>>>>
>>>>> +1 (binding)
>>>>>
>>>>> Best,
>>>>> Bruno
>>>>>
>>>>> On 25.10.22 22:07, Walker Carlson wrote:
>>>>>> +1 non binding
>>>>>>
>>>>>> Thanks for the kip!
>>>>>>
>>>>>> On Thu, Oct 20, 2022 at 10:25 PM John Roesler <vv...@apache.org> wrote:
>>>>>>
>>>>>>> Thanks for the KIP, Guozhang!
>>>>>>>
>>>>>>> I'm +1 (binding)
>>>>>>>
>>>>>>> -John
>>>>>>>
>>>>>>> On Wed, Oct 12, 2022, at 16:36, Nick Telford wrote:
>>>>>>>> Can't wait!
>>>>>>>> +1 (non-binding)
>>>>>>>>
>>>>>>>> On Wed, 12 Oct 2022, 18:02 Guozhang Wang, <gu...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hello all,
>>>>>>>>>
>>>>>>>>> I'd like to start a vote for the following KIP, aiming to improve Kafka
>>>>>>>>> Stream's restoration visibility via new metrics and callback methods:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-869%3A+Improve+Streams+State+Restoration+Visibility
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Thanks!
>>>>>>>>> -- Guozhang
>>>>>>>>>
>>>>>>>
>>>>>>

Re: [VOTE] KIP-869: Improve Streams State Restoration Visibility

Posted by Guozhang Wang <gu...@gmail.com>.

Hi Matthias:

re "paused" -> "suspended": I got your point now, thanks. Just to
clarify, the two are a bit different: "paused" tasks are paused
because the topology is paused, i.e. from KIP-834; whereas a
"suspended" task is a restoring task that is removed before its
restoration completes due to a follow-up rebalance, and the callback
is there to distinguish this case from "onRestoreEnd", as described in
KAFKA-10575. A suspended task is no longer owned by the thread and
hence there's no need to measure the number of such tasks.
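
As an illustration of the distinction between the two callbacks, here is a minimal, self-contained sketch. The interface below is a simplified, hypothetical stand-in for a restore listener: only the callback names `onRestoreEnd` and `onRestoreSuspended` come from this thread, while the parameter lists and everything else are assumptions and not the actual Kafka Streams signatures.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified stand-in for a restore listener. The parameter
// lists are assumptions made for illustration only.
interface RestoreCallbacksSketch {
    void onRestoreStart(String storeName, long startingOffset, long endingOffset);

    // Restoration ran to completion.
    void onRestoreEnd(String storeName, long totalRestored);

    // The restoring task was taken away (e.g. by a follow-up rebalance)
    // before restoration completed; no-op by default.
    default void onRestoreSuspended(String storeName, long totalRestored) { }
}

final class LoggingRestoreListenerSketch implements RestoreCallbacksSketch {
    final List<String> events = new ArrayList<>();

    @Override
    public void onRestoreStart(String storeName, long startingOffset, long endingOffset) {
        events.add("START " + storeName);
    }

    @Override
    public void onRestoreEnd(String storeName, long totalRestored) {
        events.add("END " + storeName + " after " + totalRestored + " records");
    }

    @Override
    public void onRestoreSuspended(String storeName, long totalRestored) {
        events.add("SUSPENDED " + storeName + " after " + totalRestored + " records");
    }

    public static void main(String[] args) {
        LoggingRestoreListenerSketch listener = new LoggingRestoreListenerSketch();
        listener.onRestoreStart("store-a", 0L, 100L);
        // The task is removed before restoration finishes, so the
        // suspended callback fires instead of onRestoreEnd.
        listener.onRestoreSuspended("store-a", 40L);
        listener.events.forEach(System.out::println);
    }
}
```

In this sketch a store that is started but then suspended never produces an END event, which is the case the separate callback is meant to make visible.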

re: "restore-ratio": that's a good point. I'd like it to function in the
same way as the "records-rate" metrics. Will update the wiki.

re: making "restore-remaining-records-total" at INFO level: sounds
good to me too. I will also update the metric name a bit to be more
specific.



On Thu, Jan 19, 2023 at 2:35 PM Guozhang Wang
<gu...@gmail.com> wrote:
>
> Hello Matthias,
>
> Thanks for the feedback. I was on vacation for a while. Pardon for the
> late replies. Please see them inline below
>
> On Thu, Dec 1, 2022 at 11:23 PM Matthias J. Sax <mj...@apache.org> wrote:
> >
> > Seems I am late to the party... Great KIP. Couple of questions from my side:
> >
> > (1) What is the purpose of `standby-updating-tasks`? It seems to be the
> > same as the number of assigned standby task? Not sure how useful it
> > would be?
> >
> In general, yes, it is the number of assigned standby tasks --- there
> will be transit times when the assigned standby tasks are not yet
> being updated but it would not last long --- but we do not yet have a
> direct gauge to expose this before, and users have to infer this from
> other indirect metrics.
>
> >
> >
> > (2) `active-paused-tasks` / `standby-paused-tasks` -- what does "paused"
> > exactly mean? There was a discussion about renaming the callback method
> > from pause to suspended. So should this be called `suspended`, too? And
> > if yes, how is it useful for users?
> >
> Pausing here refers to "KIP-834: Pause / Resume KafkaStreams
> Topologies" (https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=211882832).
> When a topology is paused, all its tasks including standbys will be
> paused too.
>
> I'm not aware of a discussion to rename the call name to "suspend" for
> KIP-834. Could you point me to the reference?
>
> >
> >
> > (3) `restore-ratio`: the description says
> >
> > > The fraction of time the thread spent on restoring active or standby tasks
> >
> > I find the term "restoring" does only apply to active tasks, but not to
> > standbys. Can we reword this?
> >
> Yeah I have been discussing this with others in the community a bit as
> well, but so far I have not been convinced of a better name than it.
> Some other alternatives being discussed but not win everyone's love is
> "restore-or-update-ratio", "process-ratio" (for the restore thread
> that means restoring or updating), and "io-ratio".
>
> The only one so far that I feel is probably better, is
> "state-update-ratio". If folks feel this one is better than
> "restore-ratio" I'm happy to update.
>
> >
> > (4) `restore-call-rate`: not sure what you exactly mean by "restore calls"?
> >
> This is similar to the "io-calls-rate" in the selector classes, i.e.
> the number of "restore" function calls made. It's arguably a very
> low-level metric but I included it since it could be useful in some
> debugging scenarios.
>
> >
> > (5) `restore-remaining-records-total` -- why is this a task metric?
> > Seems we could roll it up into a thread metric that we report at INFO
> > level (we could still have per-task DEBUG level metric for it in addition).
> >
> The rationale behind it is the general principle in metrics design
> that "Kafka would provide the lowest necessary metrics levels, and
> users can do the roll-ups however they want".
>
> >
> > (6) What about "warmup tasks"? Internally, we treat them as standbys,
> > but it seems it's hard for users to reason about it in the scale-out
> > warm-up case. Would it be helpful (and possible) to report "warmup
> > progress" explicitly?
> >
> At the restore thread level, we cannot differentiate standby tasks
> from warmup tasks since the latter are created exactly like the
> former. But I do agree this is a visibility issue that is worth
> addressing; I think another KIP would be needed to first consider
> distinguishing these two at the class level.
>
> >
> > -Matthias
> >
> >
> > On 11/1/22 2:44 AM, Lucas Brutschy wrote:
> > > We need this!
> > >
> > > + 1 non binding
> > >
> > > Cheers,
> > > Lucas
> > >
> > > On Tue, Nov 1, 2022 at 10:01 AM Bruno Cadonna <ca...@apache.org> wrote:
> > >>
> > >> Guozhang,
> > >>
> > >> Thanks for the KIP!
> > >>
> > >> +1 (binding)
> > >>
> > >> Best,
> > >> Bruno
> > >>
> > >> On 25.10.22 22:07, Walker Carlson wrote:
> > >>> +1 non binding
> > >>>
> > >>> Thanks for the kip!
> > >>>
> > >>> On Thu, Oct 20, 2022 at 10:25 PM John Roesler <vv...@apache.org> wrote:
> > >>>
> > >>>> Thanks for the KIP, Guozhang!
> > >>>>
> > >>>> I'm +1 (binding)
> > >>>>
> > >>>> -John
> > >>>>
> > >>>> On Wed, Oct 12, 2022, at 16:36, Nick Telford wrote:
> > >>>>> Can't wait!
> > >>>>> +1 (non-binding)
> > >>>>>
> > >>>>> On Wed, 12 Oct 2022, 18:02 Guozhang Wang, <gu...@gmail.com>
> > >>>>> wrote:
> > >>>>>
> > >>>>>> Hello all,
> > >>>>>>
> > >>>>>> I'd like to start a vote for the following KIP, aiming to improve Kafka
> > >>>>>> Stream's restoration visibility via new metrics and callback methods:
> > >>>>>>
> > >>>>>>
> > >>>>>>
> > >>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-869%3A+Improve+Streams+State+Restoration+Visibility
> > >>>>>>
> > >>>>>>
> > >>>>>> Thanks!
> > >>>>>> -- Guozhang
> > >>>>>>
> > >>>>
> > >>>

Re: [VOTE] KIP-869: Improve Streams State Restoration Visibility

Posted by Guozhang Wang <gu...@gmail.com>.

Hello Matthias,

Thanks for the feedback. I was on vacation for a while. Pardon the
late replies. Please see them inline below.

On Thu, Dec 1, 2022 at 11:23 PM Matthias J. Sax <mj...@apache.org> wrote:
>
> Seems I am late to the party... Great KIP. Couple of questions from my side:
>
> (1) What is the purpose of `standby-updating-tasks`? It seems to be the
> same as the number of assigned standby task? Not sure how useful it
> would be?
>
In general, yes, it is the number of assigned standby tasks --- there
will be transient periods when the assigned standby tasks are not yet
being updated, but they would not last long --- but until now we have
not had a direct gauge to expose this, and users have had to infer it
from other indirect metrics.

>
>
> (2) `active-paused-tasks` / `standby-paused-tasks` -- what does "paused"
> exactly mean? There was a discussion about renaming the callback method
> from pause to suspended. So should this be called `suspended`, too? And
> if yes, how is it useful for users?
>
Pausing here refers to "KIP-834: Pause / Resume KafkaStreams
Topologies" (https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=211882832).
When a topology is paused, all its tasks including standbys will be
paused too.

I'm not aware of a discussion to rename the call to "suspend" for
KIP-834. Could you point me to the reference?

>
>
> (3) `restore-ratio`: the description says
>
> > The fraction of time the thread spent on restoring active or standby tasks
>
> I find the term "restoring" does only apply to active tasks, but not to
> standbys. Can we reword this?
>
Yeah, I have been discussing this with others in the community a bit as
well, but so far I have not been convinced that there is a better name.
Some alternatives that have been discussed, but have not won everyone
over, are "restore-or-update-ratio", "process-ratio" (for the restore
thread that means restoring or updating), and "io-ratio".
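
Whatever the name ends up being, the quantity itself is just a fraction
of time. As a rough sketch only (the class, method and parameter names
below are made up for illustration and are not part of the KIP or of the
Kafka Streams API), it could be computed like this:

```java
public class RestoreRatioSketch {
    /**
     * Fraction of a measurement window that a thread spent restoring or
     * updating state. Both arguments are illustrative inputs, not values
     * read from any Kafka Streams API.
     */
    public static double ratio(long busyNanos, long windowNanos) {
        if (windowNanos <= 0) {
            return 0.0; // empty window: avoid dividing by zero
        }
        double raw = (double) busyNanos / windowNanos;
        return Math.max(0.0, Math.min(1.0, raw)); // clamp to [0, 1]
    }

    public static void main(String[] args) {
        // 250 ms of restore/update work inside a 1 s window
        System.out.println(ratio(250_000_000L, 1_000_000_000L));
    }
}
```

The clamping is a defensive choice for the sketch, so that clock jitter
cannot produce a value outside [0, 1].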

The only one so far that I feel is probably better is
"state-update-ratio". If folks feel this one is better than
"restore-ratio", I'm happy to update.

>
> (4) `restore-call-rate`: not sure what you exactly mean by "restore calls"?
>
This is similar to the "io-calls-rate" in the selector classes, i.e.
the number of "restore" function calls made. It's arguably a very
low-level metric, but I included it since it could be useful in some
debugging scenarios.

>
> (5) `restore-remaining-records-total` -- why is this a task metric?
> Seems we could roll it up into a thread metric that we report at INFO
> level (we could still have per-task DEBUG level metric for it in addition).
>
The rationale behind it is the general principle in metrics design
that "Kafka would provide the lowest necessary metrics levels, and
users can do the roll-ups however they want".
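
To make the roll-up point concrete, here is a minimal sketch of how a
user could aggregate task-level values into a per-thread total on their
side. The `TaskSample` holder and the thread/task ids are hypothetical,
used only for illustration; in practice the values would come from
whatever metrics reporter the application already uses.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class RestoreRollUp {
    /** Hypothetical holder for one task-level sample; not a Kafka Streams class. */
    public static final class TaskSample {
        final String threadId;
        final String taskId;
        final long remainingRecords;

        public TaskSample(String threadId, String taskId, long remainingRecords) {
            this.threadId = threadId;
            this.taskId = taskId;
            this.remainingRecords = remainingRecords;
        }
    }

    /** Sums the task-level remaining-record counts per thread. */
    public static Map<String, Long> remainingByThread(List<TaskSample> samples) {
        Map<String, Long> totals = new LinkedHashMap<>();
        for (TaskSample s : samples) {
            totals.merge(s.threadId, s.remainingRecords, Long::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        List<TaskSample> samples = List.of(
            new TaskSample("thread-1", "0_0", 1_000L),
            new TaskSample("thread-1", "0_1", 250L),
            new TaskSample("thread-2", "1_0", 40L));
        System.out.println(remainingByThread(samples));
    }
}
```

The same per-task samples could just as well be grouped by instance or
by sub-topology, which is the flexibility the principle is after.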

>
> (6) What about "warmup tasks"? Internally, we treat them as standbys,
> but it seems it's hard for users to reason about it in the scale-out
> warm-up case. Would it be helpful (and possible) to report "warmup
> progress" explicitly?
>
At the restore thread level, we cannot differentiate standby tasks
from warmup tasks since the latter are created exactly like the
former. But I do agree this is a visibility issue worth
addressing; I think another KIP would be needed to first consider
distinguishing the two at the class level.

>
> -Matthias
>
>

Re: [VOTE] KIP-869: Improve Streams State Restoration Visibility

Posted by "Matthias J. Sax" <mj...@apache.org>.

Seems I am late to the party... Great KIP. Couple of questions from my side:

(1) What is the purpose of `standby-updating-tasks`? It seems to be the
same as the number of assigned standby tasks? Not sure how useful it
would be.

(2) `active-paused-tasks` / `standby-paused-tasks` -- what does "paused" 
exactly mean? There was a discussion about renaming the callback method 
from pause to suspended. So should this be called `suspended`, too? And 
if yes, how is it useful for users?

(3) `restore-ratio`: the description says

> The fraction of time the thread spent on restoring active or standby tasks

I find that the term "restoring" applies only to active tasks, not to
standbys. Can we reword this?

(4) `restore-call-rate`: not sure what you exactly mean by "restore calls"?

(5) `restore-remaining-records-total` -- why is this a task metric?
Seems we could roll it up into a thread metric that we report at INFO
level (we could still have a per-task DEBUG-level metric for it in addition).

(6) What about "warmup tasks"? Internally, we treat them as standbys, 
but it seems it's hard for users to reason about it in the scale-out 
warm-up case. Would it be helpful (and possible) to report "warmup 
progress" explicitly?

-Matthias

On 11/1/22 2:44 AM, Lucas Brutschy wrote:
> We need this!
> 
> + 1 non binding
> 
> Cheers,
> Lucas
> 
> On Tue, Nov 1, 2022 at 10:01 AM Bruno Cadonna <ca...@apache.org> wrote:
>>
>> Guozhang,
>>
>> Thanks for the KIP!
>>
>> +1 (binding)
>>
>> Best,
>> Bruno
>>
>> On 25.10.22 22:07, Walker Carlson wrote:
>>> +1 non binding
>>>
>>> Thanks for the kip!
>>>
>>> On Thu, Oct 20, 2022 at 10:25 PM John Roesler <vv...@apache.org> wrote:
>>>
>>>> Thanks for the KIP, Guozhang!
>>>>
>>>> I'm +1 (binding)
>>>>
>>>> -John
>>>>
>>>> On Wed, Oct 12, 2022, at 16:36, Nick Telford wrote:
>>>>> Can't wait!
>>>>> +1 (non-binding)
>>>>>
>>>>> On Wed, 12 Oct 2022, 18:02 Guozhang Wang, <gu...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hello all,
>>>>>>
>>>>>> I'd like to start a vote for the following KIP, aiming to improve Kafka
>>>>>> Stream's restoration visibility via new metrics and callback methods:
>>>>>>
>>>>>>
>>>>>>
>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-869%3A+Improve+Streams+State+Restoration+Visibility
>>>>>>
>>>>>>
>>>>>> Thanks!
>>>>>> -- Guozhang
>>>>>>
>>>>
>>>

Re: [VOTE] KIP-869: Improve Streams State Restoration Visibility

Posted by Lucas Brutschy <lb...@confluent.io.INVALID>.

We need this!

+ 1 non binding

Cheers,
Lucas

On Tue, Nov 1, 2022 at 10:01 AM Bruno Cadonna <ca...@apache.org> wrote:
>
> Guozhang,
>
> Thanks for the KIP!
>
> +1 (binding)
>
> Best,
> Bruno
>
> On 25.10.22 22:07, Walker Carlson wrote:
> > +1 non binding
> >
> > Thanks for the kip!
> >
> > On Thu, Oct 20, 2022 at 10:25 PM John Roesler <vv...@apache.org> wrote:
> >
> >> Thanks for the KIP, Guozhang!
> >>
> >> I'm +1 (binding)
> >>
> >> -John
> >>
> >> On Wed, Oct 12, 2022, at 16:36, Nick Telford wrote:
> >>> Can't wait!
> >>> +1 (non-binding)
> >>>
> >>> On Wed, 12 Oct 2022, 18:02 Guozhang Wang, <gu...@gmail.com>
> >>> wrote:
> >>>
> >>>> Hello all,
> >>>>
> >>>> I'd like to start a vote for the following KIP, aiming to improve Kafka
> >>>> Stream's restoration visibility via new metrics and callback methods:
> >>>>
> >>>>
> >>>>
> >> https://cwiki.apache.org/confluence/display/KAFKA/KIP-869%3A+Improve+Streams+State+Restoration+Visibility
> >>>>
> >>>>
> >>>> Thanks!
> >>>> -- Guozhang
> >>>>
> >>
> >

Re: [VOTE] KIP-869: Improve Streams State Restoration Visibility

Posted by Bruno Cadonna <ca...@apache.org>.

Guozhang,

Thanks for the KIP!

+1 (binding)

Best,
Bruno

On 25.10.22 22:07, Walker Carlson wrote:
> +1 non binding
> 
> Thanks for the kip!
> 
> On Thu, Oct 20, 2022 at 10:25 PM John Roesler <vv...@apache.org> wrote:
> 
>> Thanks for the KIP, Guozhang!
>>
>> I'm +1 (binding)
>>
>> -John
>>
>> On Wed, Oct 12, 2022, at 16:36, Nick Telford wrote:
>>> Can't wait!
>>> +1 (non-binding)
>>>
>>> On Wed, 12 Oct 2022, 18:02 Guozhang Wang, <gu...@gmail.com>
>>> wrote:
>>>
>>>> Hello all,
>>>>
>>>> I'd like to start a vote for the following KIP, aiming to improve Kafka
>>>> Stream's restoration visibility via new metrics and callback methods:
>>>>
>>>>
>>>>
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-869%3A+Improve+Streams+State+Restoration+Visibility
>>>>
>>>>
>>>> Thanks!
>>>> -- Guozhang
>>>>
>>
>

Re: [VOTE] KIP-869: Improve Streams State Restoration Visibility

Posted by Walker Carlson <wc...@confluent.io.INVALID>.

+1 non binding

Thanks for the kip!

On Thu, Oct 20, 2022 at 10:25 PM John Roesler <vv...@apache.org> wrote:

> Thanks for the KIP, Guozhang!
>
> I'm +1 (binding)
>
> -John
>
> On Wed, Oct 12, 2022, at 16:36, Nick Telford wrote:
> > Can't wait!
> > +1 (non-binding)
> >
> > On Wed, 12 Oct 2022, 18:02 Guozhang Wang, <gu...@gmail.com>
> > wrote:
> >
> >> Hello all,
> >>
> >> I'd like to start a vote for the following KIP, aiming to improve Kafka
> >> Stream's restoration visibility via new metrics and callback methods:
> >>
> >>
> >>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-869%3A+Improve+Streams+State+Restoration+Visibility
> >>
> >>
> >> Thanks!
> >> -- Guozhang
> >>
>

Re: [VOTE] KIP-869: Improve Streams State Restoration Visibility

Posted by John Roesler <vv...@apache.org>.

Thanks for the KIP, Guozhang!

I'm +1 (binding)

-John

On Wed, Oct 12, 2022, at 16:36, Nick Telford wrote:
> Can't wait!
> +1 (non-binding)
>
> On Wed, 12 Oct 2022, 18:02 Guozhang Wang, <gu...@gmail.com>
> wrote:
>
>> Hello all,
>>
>> I'd like to start a vote for the following KIP, aiming to improve Kafka
>> Stream's restoration visibility via new metrics and callback methods:
>>
>>
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-869%3A+Improve+Streams+State+Restoration+Visibility
>>
>>
>> Thanks!
>> -- Guozhang
>>

Re: [VOTE] KIP-869: Improve Streams State Restoration Visibility

Posted by Nick Telford <ni...@gmail.com>.

Can't wait!
+1 (non-binding)

On Wed, 12 Oct 2022, 18:02 Guozhang Wang, <gu...@gmail.com>
wrote:

> Hello all,
>
> I'd like to start a vote for the following KIP, aiming to improve Kafka
> Stream's restoration visibility via new metrics and callback methods:
>
>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-869%3A+Improve+Streams+State+Restoration+Visibility
>
>
> Thanks!
> -- Guozhang
>