You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@kafka.apache.org by Guozhang Wang <gu...@gmail.com> on 2023/01/19 22:35:57 UTC

Re: [VOTE] KIP-869: Improve Streams State Restoration Visibility

Hello Matthias,

Thanks for the feedback. I was on vacation for a while. Pardon for the
late replies. Please see them inline below

On Thu, Dec 1, 2022 at 11:23 PM Matthias J. Sax <mj...@apache.org> wrote:
>
> Seems I am late to the party... Great KIP. Couple of questions from my side:
>
> (1) What is the purpose of `standby-updating-tasks`? It seems to be the
> same as the number of assigned standby task? Not sure how useful it
> would be?
>
In general, yes, it is the number of assigned standby tasks --- there
will be transit times when the assigned standby tasks are not yet
being updated but it would not last long --- but we do not yet have a
direct gauge to expose this before, and users have to infer this from
other indirect metrics.

>
>
> (2) `active-paused-tasks` / `standby-paused-tasks` -- what does "paused"
> exactly mean? There was a discussion about renaming the callback method
> from pause to suspended. So should this be called `suspended`, too? And
> if yes, how is it useful for users?
>
Pausing here refers to "KIP-834: Pause / Resume KafkaStreams
Topologies" (https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=211882832).
When a topology is paused, all its tasks including standbys will be
paused too.

I'm not aware of a discussion to rename the call name to "suspend" for
KIP-834. Could you point me to the reference?

>
>
> (3) `restore-ratio`: the description says
>
> > The fraction of time the thread spent on restoring active or standby tasks
>
> I find the term "restoring" does only apply to active tasks, but not to
> standbys. Can we reword this?
>
Yeah I have been discussing this with others in the community a bit as
well, but so far I have not been convinced of a better name than it.
Some other alternatives being discussed but not win everyone's love is
"restore-or-update-ratio", "process-ratio" (for the restore thread
that means restoring or updating), and "io-ratio".

The only one so far that I feel is probably better, is
"state-update-ratio". If folks feel this one is better than
"restore-ratio" I'm happy to update.

>
> (4) `restore-call-rate`: not sure what you exactly mean by "restore calls"?
>
This is similar to the "io-calls-rate" in the selector classes, i.e.
the number of "restore" function calls made. It's argurably a very
low-level metrics but I included it since it could be useful in some
debugging scenarios.

>
> (5) `restore-remaining-records-total` -- why is this a task metric?
> Seems we could roll it up into a thread metric that we report at INFO
> level (we could still have per-task DEBUG level metric for it in addition).
>
The rationale behind it is the general principle in metrics design
that "Kafka would provide the lowest necessary metrics levels, and
users can do the roll-ups however they want".

>
> (6) What about "warmup tasks"? Internally, we treat them as standbys,
> but it seems it's hard for users to reason about it in the scale-out
> warm-up case. Would it be helpful (and possible) to report "warmup
> progress" explicitly?
>
At the restore thread level, we cannot differentiate standby tasks
from warmup tasks since the latter is created exactly just like the
former. But I do agree this is an issue for visibility that worth
addressing, I think another KIP would be needed to first consider
distinguishing these two at the class level.

>
> -Matthias
>
>
> On 11/1/22 2:44 AM, Lucas Brutschy wrote:
> > We need this!
> >
> > + 1 non binding
> >
> > Cheers,
> > Lucas
> >
> > On Tue, Nov 1, 2022 at 10:01 AM Bruno Cadonna <ca...@apache.org> wrote:
> >>
> >> Guozhang,
> >>
> >> Thanks for the KIP!
> >>
> >> +1 (binding)
> >>
> >> Best,
> >> Bruno
> >>
> >> On 25.10.22 22:07, Walker Carlson wrote:
> >>> +1 non binding
> >>>
> >>> Thanks for the kip!
> >>>
> >>> On Thu, Oct 20, 2022 at 10:25 PM John Roesler <vv...@apache.org> wrote:
> >>>
> >>>> Thanks for the KIP, Guozhang!
> >>>>
> >>>> I'm +1 (binding)
> >>>>
> >>>> -John
> >>>>
> >>>> On Wed, Oct 12, 2022, at 16:36, Nick Telford wrote:
> >>>>> Can't wait!
> >>>>> +1 (non-binding)
> >>>>>
> >>>>> On Wed, 12 Oct 2022, 18:02 Guozhang Wang, <gu...@gmail.com>
> >>>>> wrote:
> >>>>>
> >>>>>> Hello all,
> >>>>>>
> >>>>>> I'd like to start a vote for the following KIP, aiming to improve Kafka
> >>>>>> Stream's restoration visibility via new metrics and callback methods:
> >>>>>>
> >>>>>>
> >>>>>>
> >>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-869%3A+Improve+Streams+State+Restoration+Visibility
> >>>>>>
> >>>>>>
> >>>>>> Thanks!
> >>>>>> -- Guozhang
> >>>>>>
> >>>>
> >>>

Re: [VOTE] KIP-869: Improve Streams State Restoration Visibility

Posted by Walker Carlson <wc...@confluent.io.INVALID>.

Hey, I'm changing my vote to binding now :)

On Mon, Jan 23, 2023 at 9:38 PM Matthias J. Sax <mj...@apache.org> wrote:

> Thanks Guozhang. Couple of clarifications and follow up questions.
>
>
> >> I'm not aware of a discussion to rename the call name to "suspend" for
> >> KIP-834. Could you point me to the reference?
>
> My commend was not about KIP-834, but about this KIP. You originally
> proposed to call the new call-back `onRestorePause` but to avoid
> confusion it was improved and renamed to `onRestoreSuspended`.
>
>
> > The only one so far that I feel is probably better, is
> > "state-update-ratio". If folks feel this one is better than
> > "restore-ratio" I'm happy to update.
>
> Could we actually report two metric, one for the restore phase
> (restore-ration), and one for steady state ([standby-]update-ratio)?
>
> I could like with `state-update-ratio` if we want to have a single
> metric for both, but splitting them sound useful to me.
>
>
> > (4) `restore-call-rate`
>
> Maybe we can clarify in the description a little bit. I agree it's very
> low level but if you think it could be useful to debugging, I have no
> objection.
>
>
> > The rationale behind it is the general principle in metrics design
> > that "Kafka would provide the lowest necessary metrics levels, and
> > users can do the roll-ups however they want".
>
> That's fair, but it seems to be a rather important metric, and having it
> only at DEBUG level seems not ideal? Could we make it INFO level, even
> if it's a task level (ie, apply an exception to the rule).
>
>
>
> -Matthias
>
>
>
> On 1/19/23 2:35 PM, Guozhang Wang wrote:
> > Hello Matthias,
> >
> > Thanks for the feedback. I was on vacation for a while. Pardon for the
> > late replies. Please see them inline below
> >
> > On Thu, Dec 1, 2022 at 11:23 PM Matthias J. Sax <mj...@apache.org>
> wrote:
> >>
> >> Seems I am late to the party... Great KIP. Couple of questions from my
> side:
> >>
> >> (1) What is the purpose of `standby-updating-tasks`? It seems to be the
> >> same as the number of assigned standby task? Not sure how useful it
> >> would be?
> >>
> > In general, yes, it is the number of assigned standby tasks --- there
> > will be transit times when the assigned standby tasks are not yet
> > being updated but it would not last long --- but we do not yet have a
> > direct gauge to expose this before, and users have to infer this from
> > other indirect metrics.
> >
> >>
> >>
> >> (2) `active-paused-tasks` / `standby-paused-tasks` -- what does "paused"
> >> exactly mean? There was a discussion about renaming the callback method
> >> from pause to suspended. So should this be called `suspended`, too? And
> >> if yes, how is it useful for users?
> >>
> > Pausing here refers to "KIP-834: Pause / Resume KafkaStreams
> > Topologies" (
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=211882832
> ).
> > When a topology is paused, all its tasks including standbys will be
> > paused too.
> >
> > I'm not aware of a discussion to rename the call name to "suspend" for
> > KIP-834. Could you point me to the reference?
> >
> >>
> >>
> >> (3) `restore-ratio`: the description says
> >>
> >>> The fraction of time the thread spent on restoring active or standby
> tasks
> >>
> >> I find the term "restoring" does only apply to active tasks, but not to
> >> standbys. Can we reword this?
> >>
> > Yeah I have been discussing this with others in the community a bit as
> > well, but so far I have not been convinced of a better name than it.
> > Some other alternatives being discussed but not win everyone's love is
> > "restore-or-update-ratio", "process-ratio" (for the restore thread
> > that means restoring or updating), and "io-ratio".
> >
> > The only one so far that I feel is probably better, is
> > "state-update-ratio". If folks feel this one is better than
> > "restore-ratio" I'm happy to update.
> >
> >>
> >> (4) `restore-call-rate`: not sure what you exactly mean by "restore
> calls"?
> >>
> > This is similar to the "io-calls-rate" in the selector classes, i.e.
> > the number of "restore" function calls made. It's argurably a very
> > low-level metrics but I included it since it could be useful in some
> > debugging scenarios.
> >
> >>
> >> (5) `restore-remaining-records-total` -- why is this a task metric?
> >> Seems we could roll it up into a thread metric that we report at INFO
> >> level (we could still have per-task DEBUG level metric for it in
> addition).
> >>
> > The rationale behind it is the general principle in metrics design
> > that "Kafka would provide the lowest necessary metrics levels, and
> > users can do the roll-ups however they want".
> >
> >>
> >> (6) What about "warmup tasks"? Internally, we treat them as standbys,
> >> but it seems it's hard for users to reason about it in the scale-out
> >> warm-up case. Would it be helpful (and possible) to report "warmup
> >> progress" explicitly?
> >>
> > At the restore thread level, we cannot differentiate standby tasks
> > from warmup tasks since the latter is created exactly just like the
> > former. But I do agree this is an issue for visibility that worth
> > addressing, I think another KIP would be needed to first consider
> > distinguishing these two at the class level.
> >
> >>
> >> -Matthias
> >>
> >>
> >> On 11/1/22 2:44 AM, Lucas Brutschy wrote:
> >>> We need this!
> >>>
> >>> + 1 non binding
> >>>
> >>> Cheers,
> >>> Lucas
> >>>
> >>> On Tue, Nov 1, 2022 at 10:01 AM Bruno Cadonna <ca...@apache.org>
> wrote:
> >>>>
> >>>> Guozhang,
> >>>>
> >>>> Thanks for the KIP!
> >>>>
> >>>> +1 (binding)
> >>>>
> >>>> Best,
> >>>> Bruno
> >>>>
> >>>> On 25.10.22 22:07, Walker Carlson wrote:
> >>>>> +1 non binding
> >>>>>
> >>>>> Thanks for the kip!
> >>>>>
> >>>>> On Thu, Oct 20, 2022 at 10:25 PM John Roesler <vv...@apache.org>
> wrote:
> >>>>>
> >>>>>> Thanks for the KIP, Guozhang!
> >>>>>>
> >>>>>> I'm +1 (binding)
> >>>>>>
> >>>>>> -John
> >>>>>>
> >>>>>> On Wed, Oct 12, 2022, at 16:36, Nick Telford wrote:
> >>>>>>> Can't wait!
> >>>>>>> +1 (non-binding)
> >>>>>>>
> >>>>>>> On Wed, 12 Oct 2022, 18:02 Guozhang Wang, <
> guozhang.wang.us@gmail.com>
> >>>>>>> wrote:
> >>>>>>>
> >>>>>>>> Hello all,
> >>>>>>>>
> >>>>>>>> I'd like to start a vote for the following KIP, aiming to improve
> Kafka
> >>>>>>>> Stream's restoration visibility via new metrics and callback
> methods:
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-869%3A+Improve+Streams+State+Restoration+Visibility
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> Thanks!
> >>>>>>>> -- Guozhang
> >>>>>>>>
> >>>>>>
> >>>>>
>

Re: [VOTE] KIP-869: Improve Streams State Restoration Visibility

Posted by "Matthias J. Sax" <mj...@apache.org>.

Thanks Guozhang. Couple of clarifications and follow up questions.


>> I'm not aware of a discussion to rename the call name to "suspend" for
>> KIP-834. Could you point me to the reference?

My commend was not about KIP-834, but about this KIP. You originally 
proposed to call the new call-back `onRestorePause` but to avoid 
confusion it was improved and renamed to `onRestoreSuspended`.


> The only one so far that I feel is probably better, is
> "state-update-ratio". If folks feel this one is better than
> "restore-ratio" I'm happy to update.

Could we actually report two metric, one for the restore phase 
(restore-ration), and one for steady state ([standby-]update-ratio)?

I could like with `state-update-ratio` if we want to have a single 
metric for both, but splitting them sound useful to me.


> (4) `restore-call-rate`

Maybe we can clarify in the description a little bit. I agree it's very 
low level but if you think it could be useful to debugging, I have no 
objection.


> The rationale behind it is the general principle in metrics design
> that "Kafka would provide the lowest necessary metrics levels, and
> users can do the roll-ups however they want".

That's fair, but it seems to be a rather important metric, and having it 
only at DEBUG level seems not ideal? Could we make it INFO level, even 
if it's a task level (ie, apply an exception to the rule).



-Matthias



On 1/19/23 2:35 PM, Guozhang Wang wrote:
> Hello Matthias,
> 
> Thanks for the feedback. I was on vacation for a while. Pardon for the
> late replies. Please see them inline below
> 
> On Thu, Dec 1, 2022 at 11:23 PM Matthias J. Sax <mj...@apache.org> wrote:
>>
>> Seems I am late to the party... Great KIP. Couple of questions from my side:
>>
>> (1) What is the purpose of `standby-updating-tasks`? It seems to be the
>> same as the number of assigned standby task? Not sure how useful it
>> would be?
>>
> In general, yes, it is the number of assigned standby tasks --- there
> will be transit times when the assigned standby tasks are not yet
> being updated but it would not last long --- but we do not yet have a
> direct gauge to expose this before, and users have to infer this from
> other indirect metrics.
> 
>>
>>
>> (2) `active-paused-tasks` / `standby-paused-tasks` -- what does "paused"
>> exactly mean? There was a discussion about renaming the callback method
>> from pause to suspended. So should this be called `suspended`, too? And
>> if yes, how is it useful for users?
>>
> Pausing here refers to "KIP-834: Pause / Resume KafkaStreams
> Topologies" (https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=211882832).
> When a topology is paused, all its tasks including standbys will be
> paused too.
> 
> I'm not aware of a discussion to rename the call name to "suspend" for
> KIP-834. Could you point me to the reference?
> 
>>
>>
>> (3) `restore-ratio`: the description says
>>
>>> The fraction of time the thread spent on restoring active or standby tasks
>>
>> I find the term "restoring" does only apply to active tasks, but not to
>> standbys. Can we reword this?
>>
> Yeah I have been discussing this with others in the community a bit as
> well, but so far I have not been convinced of a better name than it.
> Some other alternatives being discussed but not win everyone's love is
> "restore-or-update-ratio", "process-ratio" (for the restore thread
> that means restoring or updating), and "io-ratio".
> 
> The only one so far that I feel is probably better, is
> "state-update-ratio". If folks feel this one is better than
> "restore-ratio" I'm happy to update.
> 
>>
>> (4) `restore-call-rate`: not sure what you exactly mean by "restore calls"?
>>
> This is similar to the "io-calls-rate" in the selector classes, i.e.
> the number of "restore" function calls made. It's argurably a very
> low-level metrics but I included it since it could be useful in some
> debugging scenarios.
> 
>>
>> (5) `restore-remaining-records-total` -- why is this a task metric?
>> Seems we could roll it up into a thread metric that we report at INFO
>> level (we could still have per-task DEBUG level metric for it in addition).
>>
> The rationale behind it is the general principle in metrics design
> that "Kafka would provide the lowest necessary metrics levels, and
> users can do the roll-ups however they want".
> 
>>
>> (6) What about "warmup tasks"? Internally, we treat them as standbys,
>> but it seems it's hard for users to reason about it in the scale-out
>> warm-up case. Would it be helpful (and possible) to report "warmup
>> progress" explicitly?
>>
> At the restore thread level, we cannot differentiate standby tasks
> from warmup tasks since the latter is created exactly just like the
> former. But I do agree this is an issue for visibility that worth
> addressing, I think another KIP would be needed to first consider
> distinguishing these two at the class level.
> 
>>
>> -Matthias
>>
>>
>> On 11/1/22 2:44 AM, Lucas Brutschy wrote:
>>> We need this!
>>>
>>> + 1 non binding
>>>
>>> Cheers,
>>> Lucas
>>>
>>> On Tue, Nov 1, 2022 at 10:01 AM Bruno Cadonna <ca...@apache.org> wrote:
>>>>
>>>> Guozhang,
>>>>
>>>> Thanks for the KIP!
>>>>
>>>> +1 (binding)
>>>>
>>>> Best,
>>>> Bruno
>>>>
>>>> On 25.10.22 22:07, Walker Carlson wrote:
>>>>> +1 non binding
>>>>>
>>>>> Thanks for the kip!
>>>>>
>>>>> On Thu, Oct 20, 2022 at 10:25 PM John Roesler <vv...@apache.org> wrote:
>>>>>
>>>>>> Thanks for the KIP, Guozhang!
>>>>>>
>>>>>> I'm +1 (binding)
>>>>>>
>>>>>> -John
>>>>>>
>>>>>> On Wed, Oct 12, 2022, at 16:36, Nick Telford wrote:
>>>>>>> Can't wait!
>>>>>>> +1 (non-binding)
>>>>>>>
>>>>>>> On Wed, 12 Oct 2022, 18:02 Guozhang Wang, <gu...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hello all,
>>>>>>>>
>>>>>>>> I'd like to start a vote for the following KIP, aiming to improve Kafka
>>>>>>>> Stream's restoration visibility via new metrics and callback methods:
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-869%3A+Improve+Streams+State+Restoration+Visibility
>>>>>>>>
>>>>>>>>
>>>>>>>> Thanks!
>>>>>>>> -- Guozhang
>>>>>>>>
>>>>>>
>>>>>

Re: [VOTE] KIP-869: Improve Streams State Restoration Visibility

Posted by Guozhang Wang <gu...@gmail.com>.

Thanks for all the very helpful discussions, I'm closing the vote with
a tally here:

+1: 7 (Nick, John, Walker, Bruno, Lucas, Matthias, Guozhang), with 5
binding votes and 2 non-binding votes.
-1: 0


Guozhang

On Wed, Jan 25, 2023 at 5:48 PM Matthias J. Sax <mj...@apache.org> wrote:
>
> Thanks!
>
> +1 (binding)
>
> -Matthias
>
> On 1/24/23 1:17 PM, Guozhang Wang wrote:
> > Hi Matthias:
> >
> > re "paused" -> "suspended": I got your point now, thanks. Just to
> > clarify the two functions are a bit different: "paused" tasks are
> > because of the topology being paused, i.e. from KIP-834; whereas
> > "suspended" tasks are when a restoring tasks are being removed before
> > it completes due to a follow-up rebalance, and this is to distinguish
> > with "onRestoreEnd", as described in KAFKA-10575. A suspended task is
> > no longer owned by the thread and hence there's no need to measure the
> > number of such tasks.
> >
> > re: "restore-ratio": that's a good point. I like it to function in the
> > same way as the "records-rate" metrics. Will update the wiki.
> >
> > re: making "restore-remaining-records-total" at INFO level: sounds
> > good to me too. I will also update the metric name a bit to be more
> > specific.
> >
> >
> >
> > On Thu, Jan 19, 2023 at 2:35 PM Guozhang Wang
> > <gu...@gmail.com> wrote:
> >>
> >> Hello Matthias,
> >>
> >> Thanks for the feedback. I was on vacation for a while. Pardon for the
> >> late replies. Please see them inline below
> >>
> >> On Thu, Dec 1, 2022 at 11:23 PM Matthias J. Sax <mj...@apache.org> wrote:
> >>>
> >>> Seems I am late to the party... Great KIP. Couple of questions from my side:
> >>>
> >>> (1) What is the purpose of `standby-updating-tasks`? It seems to be the
> >>> same as the number of assigned standby task? Not sure how useful it
> >>> would be?
> >>>
> >> In general, yes, it is the number of assigned standby tasks --- there
> >> will be transit times when the assigned standby tasks are not yet
> >> being updated but it would not last long --- but we do not yet have a
> >> direct gauge to expose this before, and users have to infer this from
> >> other indirect metrics.
> >>
> >>>
> >>>
> >>> (2) `active-paused-tasks` / `standby-paused-tasks` -- what does "paused"
> >>> exactly mean? There was a discussion about renaming the callback method
> >>> from pause to suspended. So should this be called `suspended`, too? And
> >>> if yes, how is it useful for users?
> >>>
> >> Pausing here refers to "KIP-834: Pause / Resume KafkaStreams
> >> Topologies" (https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=211882832).
> >> When a topology is paused, all its tasks including standbys will be
> >> paused too.
> >>
> >> I'm not aware of a discussion to rename the call name to "suspend" for
> >> KIP-834. Could you point me to the reference?
> >>
> >>>
> >>>
> >>> (3) `restore-ratio`: the description says
> >>>
> >>>> The fraction of time the thread spent on restoring active or standby tasks
> >>>
> >>> I find the term "restoring" does only apply to active tasks, but not to
> >>> standbys. Can we reword this?
> >>>
> >> Yeah I have been discussing this with others in the community a bit as
> >> well, but so far I have not been convinced of a better name than it.
> >> Some other alternatives being discussed but not win everyone's love is
> >> "restore-or-update-ratio", "process-ratio" (for the restore thread
> >> that means restoring or updating), and "io-ratio".
> >>
> >> The only one so far that I feel is probably better, is
> >> "state-update-ratio". If folks feel this one is better than
> >> "restore-ratio" I'm happy to update.
> >>
> >>>
> >>> (4) `restore-call-rate`: not sure what you exactly mean by "restore calls"?
> >>>
> >> This is similar to the "io-calls-rate" in the selector classes, i.e.
> >> the number of "restore" function calls made. It's argurably a very
> >> low-level metrics but I included it since it could be useful in some
> >> debugging scenarios.
> >>
> >>>
> >>> (5) `restore-remaining-records-total` -- why is this a task metric?
> >>> Seems we could roll it up into a thread metric that we report at INFO
> >>> level (we could still have per-task DEBUG level metric for it in addition).
> >>>
> >> The rationale behind it is the general principle in metrics design
> >> that "Kafka would provide the lowest necessary metrics levels, and
> >> users can do the roll-ups however they want".
> >>
> >>>
> >>> (6) What about "warmup tasks"? Internally, we treat them as standbys,
> >>> but it seems it's hard for users to reason about it in the scale-out
> >>> warm-up case. Would it be helpful (and possible) to report "warmup
> >>> progress" explicitly?
> >>>
> >> At the restore thread level, we cannot differentiate standby tasks
> >> from warmup tasks since the latter is created exactly just like the
> >> former. But I do agree this is an issue for visibility that worth
> >> addressing, I think another KIP would be needed to first consider
> >> distinguishing these two at the class level.
> >>
> >>>
> >>> -Matthias
> >>>
> >>>
> >>> On 11/1/22 2:44 AM, Lucas Brutschy wrote:
> >>>> We need this!
> >>>>
> >>>> + 1 non binding
> >>>>
> >>>> Cheers,
> >>>> Lucas
> >>>>
> >>>> On Tue, Nov 1, 2022 at 10:01 AM Bruno Cadonna <ca...@apache.org> wrote:
> >>>>>
> >>>>> Guozhang,
> >>>>>
> >>>>> Thanks for the KIP!
> >>>>>
> >>>>> +1 (binding)
> >>>>>
> >>>>> Best,
> >>>>> Bruno
> >>>>>
> >>>>> On 25.10.22 22:07, Walker Carlson wrote:
> >>>>>> +1 non binding
> >>>>>>
> >>>>>> Thanks for the kip!
> >>>>>>
> >>>>>> On Thu, Oct 20, 2022 at 10:25 PM John Roesler <vv...@apache.org> wrote:
> >>>>>>
> >>>>>>> Thanks for the KIP, Guozhang!
> >>>>>>>
> >>>>>>> I'm +1 (binding)
> >>>>>>>
> >>>>>>> -John
> >>>>>>>
> >>>>>>> On Wed, Oct 12, 2022, at 16:36, Nick Telford wrote:
> >>>>>>>> Can't wait!
> >>>>>>>> +1 (non-binding)
> >>>>>>>>
> >>>>>>>> On Wed, 12 Oct 2022, 18:02 Guozhang Wang, <gu...@gmail.com>
> >>>>>>>> wrote:
> >>>>>>>>
> >>>>>>>>> Hello all,
> >>>>>>>>>
> >>>>>>>>> I'd like to start a vote for the following KIP, aiming to improve Kafka
> >>>>>>>>> Stream's restoration visibility via new metrics and callback methods:
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-869%3A+Improve+Streams+State+Restoration+Visibility
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Thanks!
> >>>>>>>>> -- Guozhang
> >>>>>>>>>
> >>>>>>>
> >>>>>>

Re: [VOTE] KIP-869: Improve Streams State Restoration Visibility

Posted by "Matthias J. Sax" <mj...@apache.org>.

Thanks!

+1 (binding)

-Matthias

On 1/24/23 1:17 PM, Guozhang Wang wrote:
> Hi Matthias:
> 
> re "paused" -> "suspended": I got your point now, thanks. Just to
> clarify the two functions are a bit different: "paused" tasks are
> because of the topology being paused, i.e. from KIP-834; whereas
> "suspended" tasks are when a restoring tasks are being removed before
> it completes due to a follow-up rebalance, and this is to distinguish
> with "onRestoreEnd", as described in KAFKA-10575. A suspended task is
> no longer owned by the thread and hence there's no need to measure the
> number of such tasks.
> 
> re: "restore-ratio": that's a good point. I like it to function in the
> same way as the "records-rate" metrics. Will update the wiki.
> 
> re: making "restore-remaining-records-total" at INFO level: sounds
> good to me too. I will also update the metric name a bit to be more
> specific.
> 
> 
> 
> On Thu, Jan 19, 2023 at 2:35 PM Guozhang Wang
> <gu...@gmail.com> wrote:
>>
>> Hello Matthias,
>>
>> Thanks for the feedback. I was on vacation for a while. Pardon for the
>> late replies. Please see them inline below
>>
>> On Thu, Dec 1, 2022 at 11:23 PM Matthias J. Sax <mj...@apache.org> wrote:
>>>
>>> Seems I am late to the party... Great KIP. Couple of questions from my side:
>>>
>>> (1) What is the purpose of `standby-updating-tasks`? It seems to be the
>>> same as the number of assigned standby task? Not sure how useful it
>>> would be?
>>>
>> In general, yes, it is the number of assigned standby tasks --- there
>> will be transit times when the assigned standby tasks are not yet
>> being updated but it would not last long --- but we do not yet have a
>> direct gauge to expose this before, and users have to infer this from
>> other indirect metrics.
>>
>>>
>>>
>>> (2) `active-paused-tasks` / `standby-paused-tasks` -- what does "paused"
>>> exactly mean? There was a discussion about renaming the callback method
>>> from pause to suspended. So should this be called `suspended`, too? And
>>> if yes, how is it useful for users?
>>>
>> Pausing here refers to "KIP-834: Pause / Resume KafkaStreams
>> Topologies" (https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=211882832).
>> When a topology is paused, all its tasks including standbys will be
>> paused too.
>>
>> I'm not aware of a discussion to rename the call name to "suspend" for
>> KIP-834. Could you point me to the reference?
>>
>>>
>>>
>>> (3) `restore-ratio`: the description says
>>>
>>>> The fraction of time the thread spent on restoring active or standby tasks
>>>
>>> I find the term "restoring" does only apply to active tasks, but not to
>>> standbys. Can we reword this?
>>>
>> Yeah I have been discussing this with others in the community a bit as
>> well, but so far I have not been convinced of a better name than it.
>> Some other alternatives being discussed but not win everyone's love is
>> "restore-or-update-ratio", "process-ratio" (for the restore thread
>> that means restoring or updating), and "io-ratio".
>>
>> The only one so far that I feel is probably better, is
>> "state-update-ratio". If folks feel this one is better than
>> "restore-ratio" I'm happy to update.
>>
>>>
>>> (4) `restore-call-rate`: not sure what you exactly mean by "restore calls"?
>>>
>> This is similar to the "io-calls-rate" in the selector classes, i.e.
>> the number of "restore" function calls made. It's argurably a very
>> low-level metrics but I included it since it could be useful in some
>> debugging scenarios.
>>
>>>
>>> (5) `restore-remaining-records-total` -- why is this a task metric?
>>> Seems we could roll it up into a thread metric that we report at INFO
>>> level (we could still have per-task DEBUG level metric for it in addition).
>>>
>> The rationale behind it is the general principle in metrics design
>> that "Kafka would provide the lowest necessary metrics levels, and
>> users can do the roll-ups however they want".
>>
>>>
>>> (6) What about "warmup tasks"? Internally, we treat them as standbys,
>>> but it seems it's hard for users to reason about it in the scale-out
>>> warm-up case. Would it be helpful (and possible) to report "warmup
>>> progress" explicitly?
>>>
>> At the restore thread level, we cannot differentiate standby tasks
>> from warmup tasks since the latter is created exactly just like the
>> former. But I do agree this is an issue for visibility that worth
>> addressing, I think another KIP would be needed to first consider
>> distinguishing these two at the class level.
>>
>>>
>>> -Matthias
>>>
>>>
>>> On 11/1/22 2:44 AM, Lucas Brutschy wrote:
>>>> We need this!
>>>>
>>>> + 1 non binding
>>>>
>>>> Cheers,
>>>> Lucas
>>>>
>>>> On Tue, Nov 1, 2022 at 10:01 AM Bruno Cadonna <ca...@apache.org> wrote:
>>>>>
>>>>> Guozhang,
>>>>>
>>>>> Thanks for the KIP!
>>>>>
>>>>> +1 (binding)
>>>>>
>>>>> Best,
>>>>> Bruno
>>>>>
>>>>> On 25.10.22 22:07, Walker Carlson wrote:
>>>>>> +1 non binding
>>>>>>
>>>>>> Thanks for the kip!
>>>>>>
>>>>>> On Thu, Oct 20, 2022 at 10:25 PM John Roesler <vv...@apache.org> wrote:
>>>>>>
>>>>>>> Thanks for the KIP, Guozhang!
>>>>>>>
>>>>>>> I'm +1 (binding)
>>>>>>>
>>>>>>> -John
>>>>>>>
>>>>>>> On Wed, Oct 12, 2022, at 16:36, Nick Telford wrote:
>>>>>>>> Can't wait!
>>>>>>>> +1 (non-binding)
>>>>>>>>
>>>>>>>> On Wed, 12 Oct 2022, 18:02 Guozhang Wang, <gu...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hello all,
>>>>>>>>>
>>>>>>>>> I'd like to start a vote for the following KIP, aiming to improve Kafka
>>>>>>>>> Stream's restoration visibility via new metrics and callback methods:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-869%3A+Improve+Streams+State+Restoration+Visibility
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Thanks!
>>>>>>>>> -- Guozhang
>>>>>>>>>
>>>>>>>
>>>>>>

Re: [VOTE] KIP-869: Improve Streams State Restoration Visibility

Posted by Guozhang Wang <gu...@gmail.com>.

Hi Matthias:

re "paused" -> "suspended": I got your point now, thanks. Just to
clarify the two functions are a bit different: "paused" tasks are
because of the topology being paused, i.e. from KIP-834; whereas
"suspended" tasks are when a restoring tasks are being removed before
it completes due to a follow-up rebalance, and this is to distinguish
with "onRestoreEnd", as described in KAFKA-10575. A suspended task is
no longer owned by the thread and hence there's no need to measure the
number of such tasks.

re: "restore-ratio": that's a good point. I like it to function in the
same way as the "records-rate" metrics. Will update the wiki.

re: making "restore-remaining-records-total" at INFO level: sounds
good to me too. I will also update the metric name a bit to be more
specific.



On Thu, Jan 19, 2023 at 2:35 PM Guozhang Wang
<gu...@gmail.com> wrote:
>
> Hello Matthias,
>
> Thanks for the feedback. I was on vacation for a while. Pardon for the
> late replies. Please see them inline below
>
> On Thu, Dec 1, 2022 at 11:23 PM Matthias J. Sax <mj...@apache.org> wrote:
> >
> > Seems I am late to the party... Great KIP. Couple of questions from my side:
> >
> > (1) What is the purpose of `standby-updating-tasks`? It seems to be the
> > same as the number of assigned standby task? Not sure how useful it
> > would be?
> >
> In general, yes, it is the number of assigned standby tasks --- there
> will be transit times when the assigned standby tasks are not yet
> being updated but it would not last long --- but we do not yet have a
> direct gauge to expose this before, and users have to infer this from
> other indirect metrics.
>
> >
> >
> > (2) `active-paused-tasks` / `standby-paused-tasks` -- what does "paused"
> > exactly mean? There was a discussion about renaming the callback method
> > from pause to suspended. So should this be called `suspended`, too? And
> > if yes, how is it useful for users?
> >
> Pausing here refers to "KIP-834: Pause / Resume KafkaStreams
> Topologies" (https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=211882832).
> When a topology is paused, all its tasks including standbys will be
> paused too.
>
> I'm not aware of a discussion to rename the call name to "suspend" for
> KIP-834. Could you point me to the reference?
>
> >
> >
> > (3) `restore-ratio`: the description says
> >
> > > The fraction of time the thread spent on restoring active or standby tasks
> >
> > I find the term "restoring" does only apply to active tasks, but not to
> > standbys. Can we reword this?
> >
> Yeah I have been discussing this with others in the community a bit as
> well, but so far I have not been convinced of a better name than it.
> Some other alternatives being discussed but not win everyone's love is
> "restore-or-update-ratio", "process-ratio" (for the restore thread
> that means restoring or updating), and "io-ratio".
>
> The only one so far that I feel is probably better, is
> "state-update-ratio". If folks feel this one is better than
> "restore-ratio" I'm happy to update.
>
> >
> > (4) `restore-call-rate`: not sure what you exactly mean by "restore calls"?
> >
> This is similar to the "io-calls-rate" in the selector classes, i.e.
> the number of "restore" function calls made. It's argurably a very
> low-level metrics but I included it since it could be useful in some
> debugging scenarios.
>
> >
> > (5) `restore-remaining-records-total` -- why is this a task metric?
> > Seems we could roll it up into a thread metric that we report at INFO
> > level (we could still have per-task DEBUG level metric for it in addition).
> >
> The rationale behind it is the general principle in metrics design
> that "Kafka would provide the lowest necessary metrics levels, and
> users can do the roll-ups however they want".
>
> >
> > (6) What about "warmup tasks"? Internally, we treat them as standbys,
> > but it seems it's hard for users to reason about it in the scale-out
> > warm-up case. Would it be helpful (and possible) to report "warmup
> > progress" explicitly?
> >
> At the restore thread level, we cannot differentiate standby tasks
> from warmup tasks since the latter is created exactly just like the
> former. But I do agree this is an issue for visibility that worth
> addressing, I think another KIP would be needed to first consider
> distinguishing these two at the class level.
>
> >
> > -Matthias
> >
> >
> > On 11/1/22 2:44 AM, Lucas Brutschy wrote:
> > > We need this!
> > >
> > > + 1 non binding
> > >
> > > Cheers,
> > > Lucas
> > >
> > > On Tue, Nov 1, 2022 at 10:01 AM Bruno Cadonna <ca...@apache.org> wrote:
> > >>
> > >> Guozhang,
> > >>
> > >> Thanks for the KIP!
> > >>
> > >> +1 (binding)
> > >>
> > >> Best,
> > >> Bruno
> > >>
> > >> On 25.10.22 22:07, Walker Carlson wrote:
> > >>> +1 non binding
> > >>>
> > >>> Thanks for the kip!
> > >>>
> > >>> On Thu, Oct 20, 2022 at 10:25 PM John Roesler <vv...@apache.org> wrote:
> > >>>
> > >>>> Thanks for the KIP, Guozhang!
> > >>>>
> > >>>> I'm +1 (binding)
> > >>>>
> > >>>> -John
> > >>>>
> > >>>> On Wed, Oct 12, 2022, at 16:36, Nick Telford wrote:
> > >>>>> Can't wait!
> > >>>>> +1 (non-binding)
> > >>>>>
> > >>>>> On Wed, 12 Oct 2022, 18:02 Guozhang Wang, <gu...@gmail.com>
> > >>>>> wrote:
> > >>>>>
> > >>>>>> Hello all,
> > >>>>>>
> > >>>>>> I'd like to start a vote for the following KIP, aiming to improve Kafka
> > >>>>>> Stream's restoration visibility via new metrics and callback methods:
> > >>>>>>
> > >>>>>>
> > >>>>>>
> > >>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-869%3A+Improve+Streams+State+Restoration+Visibility
> > >>>>>>
> > >>>>>>
> > >>>>>> Thanks!
> > >>>>>> -- Guozhang
> > >>>>>>
> > >>>>
> > >>>