You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@kafka.apache.org by Guozhang Wang <wa...@gmail.com> on 2022/09/16 17:33:42 UTC

[DISCUSS] KIP-869: Improve Streams State Restoration Visibility

Hello everyone,

I'd like to start a discussion for the following KIP, aiming to improve
Kafka Stream's restoration visibility via new metrics and callback methods:

https://cwiki.apache.org/confluence/display/KAFKA/KIP-869%3A+Improve+Streams+State+Restoration+Visibility


Thanks!
-- Guozhang

Re: [DISCUSS] KIP-869: Improve Streams State Restoration Visibility

Posted by Guozhang Wang <wa...@gmail.com>.

I've updated the KIP doc and would start calling for a vote for this KIP
now.

On Tue, Oct 11, 2022 at 4:26 AM Nick Telford <ni...@gmail.com> wrote:

> Hi Guozhang,
>
> What you propose sounds good to me. Having the more detailed Task-level
> metrics at DEBUG makes sense.
>
> Regards,
>
> Nick
>

-- 
-- Guozhang

Re: [DISCUSS] KIP-869: Improve Streams State Restoration Visibility

Posted by Nick Telford <ni...@gmail.com>.

Hi Guozhang,

What you propose sounds good to me. Having the more detailed Task-level
metrics at DEBUG makes sense.

Regards,

Nick

Re: [DISCUSS] KIP-869: Improve Streams State Restoration Visibility

Posted by Guozhang Wang <wa...@gmail.com>.

Hello Bruno / Nick,

Thanks for the great feedback (again)! I want to give my thoughts to them
and reply together here since some of them are correlated.

1) This KIP is indeed based on the separated restoration thread work,
trying to give more visibility as we move restoration to dedicated threads.
At the moment the number of restoration threads is not configurable, and is
correlated to the num.main threads, but in the future we would make the
num.restore.threads either configurable or at least not correlated to
num.main threads. So from that perspective it's indeed not very reasonable
to have those metrics on the thread-level since Streams will try to
abstract away from users. However, from our initial framework on Streams
metrics (
https://cwiki.apache.org/confluence/display/KAFKA/KIP-444%3A+Augment+metrics+for+Kafka+Streams),
task-level metrics would by default be on the DEBUG reporting level, which
is contradicting from my intention to make restoration more visible to
users by default. Hence I'm proposing it at the restore-thread level.

With all those rationales in mind, I think there are two options on the
high level that we can provide: 1) provide the finer-grained level metrics
from Streams i.e. task-level (the reason I did not consider store-level is
that the internal physical state stores may evolve and changed), and let
users to do their aggregations themselves if they like; but this require
users to first turn on DEBUG level reporting being aware it's going to
increasing the metrics reporting overhead; 2) provide the
coarse-grained level metrics from Streams on the thread-level, and just get
users to watch on that coarse-grained level.

After some thought about that, my current preference is to split the group
of metrics, and report rate metrics on the thread level, while reporting
remaining and total num.records on the task level, but keeping the
remaining.num.records at INFO level (btw just to answer Bruno's question,
reporting remaining num.records does not require additional requests since
we are already sending committed / log-end requests periodically to
consider when we should complete / pause the restoration).

2) At the restoration thread, it would only either try to restore active
tasks, or update standby tasks, with a priority to do the first as long as
there are still active tasks needing restoration, at any given time. And
hence I was originally thinking that we only need one rate metric, since if
we split it then only one of them would ever have a non-zero value anyways.
But after some thought I think it's still okay to split the metric since we
do not have a state indicating if we are in "active-restore" or
"standby-update" state anyways, and having those two metrics where only one
of them would be non-zero would be valuable for users to tell when the
state switched inside the thread.

As a result, the proposed metrics would look like this:

Thread-level (all INFO level):

active-restoring-tasks

standby-updating-tasks

active-paused-tasks

standby-paused-tasks

idle-ratio

restore-ratio

checkpoint-ratio

restore-call-rate

active-restore-records-rate

standby-update-records-rate



Task-level:

restored-total (DEBUG, as to align with process-total DEBUG)

restored-rate (DEBUG, as to align with process-rate DEBUG)

remaining-restore-records (INFO)



Please let me know what you think.


Guozhang


On Mon, Oct 10, 2022 at 3:53 AM Nick Telford <ni...@gmail.com> wrote:

> Hi Guozhang,
>
> The metrics you've described are currently "thread-level" metrics. I know
> you're planning on splitting out the restoration from the StreamThread, so
> I question the utility of some of these metrics being thread-level, as they
> won't correlate with other thread-level metrics, which correspond to a
> StreamThread.
>
> Perhaps the record counting metrics would be more useful as
> task/store-level metrics? That way, they can be aggregated by users to
> determine things like total records remaining to restore by store, across
> the entire app etc.
>
> Regards,
>
> Nick
>
> On Thu, 22 Sept 2022 at 13:21, Bruno Cadonna <ca...@apache.org> wrote:
>
> > Hi Guozhang!
> >
> > Thanks for the updates!
> >
> > 1.
> > Why do you distinguish between active and standby for the total amount
> > of restored/updated records but not for the rate of restored/updated
> > records?
> >
> > 2.
> > Regarding "standby-records-remaining" I am not sure how useful this
> > metric is and I am not sure how hard it will be to record. I see the
> > usefulness of monitoring the lag of a single standby state store, but I
> > am not sure how useful it is to monitor the "lag" of a state updater
> > thread that might potentially contain multiple state stores.
> > Furthermore, do we need to issue a remote call to record this metric or
> > do we get this information from each poll?
> >
> > 3.
> > Could you please give an example where "active-records-restored-total"
> > and "standby-records-updated-total" are useful?
> >
> > Best,
> > Bruno
> >
> > On 20.09.22 22:45, Guozhang Wang wrote:
> > > Hello Bruno/Nick,
> > >
> > > I've updated the KIP wiki to reflect the incorporated comments from
> you,
> > > please feel free to take another look and let me know what you think.
> > >
> > >
> > > Guozhang
> > >
> > > On Tue, Sep 20, 2022 at 9:37 AM Guozhang Wang <wa...@gmail.com>
> > wrote:
> > >
> > >> Hi Nick,
> > >>
> > >> Thanks for the reviews, and I think these are good suggestions. Note
> > that
> > >> currently the `restore-records-total/rate` would include both
> restoring
> > >> active tasks as well as updating standby tasks, I think for your
> > purposes
> > >> you'd be more interested in active restoring tasks since they could
> > >> complete, while updating standby tasks would not complete even if they
> > have
> > >> caught up with the active. At the same time, the changelog reader
> would
> > >> only be restoring either active or standby at a given time, and active
> > >> tasks has a higher priority such that as long as there is at least one
> > >> active task still restoring, we would not restore any standby tasks.
> > From
> > >> your suggestion, I'm thinking that maybe I should break up the rate /
> > total
> > >> metric for active and standby tasks separately.
> > >>
> > >> For deriving estimated time remaining though, the `total` metric may
> not
> > >> be helpful since they will not be "reset" after rebalances, i.e. they
> > will
> > >> be an ever-increasing number and record the total number of records
> for
> > the
> > >> lifetime of the app. But still, just the remaining records alone, with
> > the
> > >> time elapsed monitored by the apps, should be sufficient to get the
> > >> estimated time remaining.
> > >>
> > >>
> > >> Guozhang
> > >>
> > >>
> > >> On Tue, Sep 20, 2022 at 3:10 AM Nick Telford <ni...@gmail.com>
> > >> wrote:
> > >>
> > >>> Hi Guozhang,
> > >>>
> > >>> KIP looks great, I have one suggestion: in addition to
> > >>> "restore-records-total", it would also be useful to track the number
> of
> > >>> records *remaining*, that is, the records that have not yet been
> > restored.
> > >>> This is actually the metric I was attempting to implement in the
> > >>> StateRestoreListener that bumped me up against KAFKA-10575 :-)
> > >>>
> > >>> With both a "restore-records-total" and a "restore-remaining-total"
> (or
> > >>> similar) metric, it's possible to derive useful information like the
> > >>> estimated time remaining for restoration (by dividing the remaining
> > total
> > >>> by the restoration rate).
> > >>>
> > >>> Regards,
> > >>>
> > >>> Nick
> > >>>
> > >>> On Mon, 19 Sept 2022 at 19:57, Guozhang Wang <wa...@gmail.com>
> > wrote:
> > >>>
> > >>>> Hello Bruno,
> > >>>>
> > >>>> Thanks for your comments!
> > >>>>
> > >>>> 1. Regarding the metrics group name: originally I put
> > >>>> "stream-state-metrics" as it's related to state store restorations,
> > but
> > >>>> after a second thought I think I agree with you that this is quite
> > >>>> confusing and not right. About the metrics groups, right now I have
> > two
> > >>>> ideas:
> > >>>>
> > >>>> a) Still use the metric group name "stream-thread-metrics", but
> > >>>> differentiate with the processing threads on the thread id. The pros
> > is
> > >>>> that we do not introduce a new group, the cons is that users who
> want
> > to
> > >>>> separate processing from restoration/updating in the future needs to
> > do
> > >>>> that on the thread id labels.
> > >>>> b) Introduce a new group name, for example
> > >>> "stream-state-updater-metrics"
> > >>>> and still have a thread-id label. We would be introducing a new
> group
> > >>> which
> > >>>> still have a thread-id, which may sound a bit weird (maybe if we do
> > >>> that we
> > >>>> could change the existing "stream-thread-metrics" into
> > >>>> "stream-processor-metrics").
> > >>>>
> > >>>> Right now I'm leaning towards b) and maybe in the future rename
> > >>>> "thread-metrics" to "processor-metrics", LMK what do you think.
> > >>>>
> > >>>> 2. Regarding the metric names: today we may also pause a standby
> > tasks,
> > >>> and
> > >>>> hence I'm trying to differentiate updating standbys from paused
> > >>> standbys.
> > >>>> Right now I'm thinking we can change "restoring-standby-tasks" to
> > >>>> "updating-standby-tasks", and change all other metrics' "restore"
> > >>> (except
> > >>>> the "restoring-active-tasks") to "state-update", a.k.a
> > >>>> "state-update-ratio", "state-update-records-total",
> > >>>> "updating-standby-tasks" etc.
> > >>>>
> > >>>> 3. Regarding the function name: yeah I think that's a valid
> concern, I
> > >>> can
> > >>>> change it to "onRestoreSuspended" since "Aborted" may confuse people
> > >>> that
> > >>>> previously called "onBatchRestored" are undone as part of the
> > abortion.
> > >>>>
> > >>>>
> > >>>> Guozhang
> > >>>>
> > >>>>
> > >>>>
> > >>>> On Mon, Sep 19, 2022 at 10:47 AM Bruno Cadonna <ca...@apache.org>
> > >>> wrote:
> > >>>>
> > >>>>> Hi Guozhang,
> > >>>>>
> > >>>>> Thanks for the KIP! I think this KIP is a really nice addition to
> > >>> better
> > >>>>> understand what is going on in a Kafka Streams application.
> > >>>>>
> > >>>>> 1.
> > >>>>> The metric names "paused-active-tasks" and "paused-standby-tasks"
> > >>> might
> > >>>>> be a bit confusing since at least active tasks can be paused also
> > >>>>> outside of restoration.
> > >>>>>
> > >>>>> 2.
> > >>>>> Why is the type of the metrics "stream-state-metrics"? I would have
> > >>>>> expected "stream-thread-metrics" as the type.
> > >>>>>
> > >>>>> 3.
> > >>>>> Isn't the value of the metric "restoring-standby-tasks" simply the
> > >>>>> number of standby tasks since standby tasks are basically always
> > >>>>> updating (aka restoring)?
> > >>>>>
> > >>>>> 4.
> > >>>>> "idle-ratio", "restore-ratio", and "checkpoint-ratio" seem metrics
> > >>>>> tailored to the upcoming state updater. They do not make much sense
> > >>> with
> > >>>>> a stream thread. Would it be better to introduce a new metrics
> level
> > >>>>> specifically for the state updater?
> > >>>>>
> > >>>>> 5.
> > >>>>> Personally, I do not like to use the word "restoration" together
> with
> > >>>>> standbys since restoration somehow implies that there is an offset
> > for
> > >>>>> which the active task is considered restored and active processing
> > can
> > >>>>> start. In other words, restoration is finite. Standby tasks rather
> > >>>>> update continuously their states. They can be up-to-date or
> lagging.
> > I
> > >>>>> see that you could say "restored" instead of "up-to-date" and "not
> > >>>>> restored" instead of "lagging", but IMO it does not describe well
> the
> > >>>>> situation. That is a rather minor point. I just wanted to mention
> it.
> > >>>>>
> > >>>>> 6.
> > >>>>> The name "onRestorePaused()" might be confusing since in Kafka
> > Streams
> > >>>>> users can also pause tasks. What about "onRestoreAborted()" or
> > >>>>> "onRestoreSuspended"?
> > >>>>>
> > >>>>> Best,
> > >>>>> Bruno
> > >>>>>
> > >>>>>
> > >>>>> On 16.09.22 19:33, Guozhang Wang wrote:
> > >>>>>> Hello everyone,
> > >>>>>>
> > >>>>>> I'd like to start a discussion for the following KIP, aiming to
> > >>> improve
> > >>>>>> Kafka Stream's restoration visibility via new metrics and callback
> > >>>>> methods:
> > >>>>>>
> > >>>>>>
> > >>>>>
> > >>>>
> > >>>
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-869%3A+Improve+Streams+State+Restoration+Visibility
> > >>>>>>
> > >>>>>>
> > >>>>>> Thanks!
> > >>>>>> -- Guozhang
> > >>>>>>
> > >>>>>
> > >>>>
> > >>>>
> > >>>> --
> > >>>> -- Guozhang
> > >>>>
> > >>>
> > >>
> > >>
> > >> --
> > >> -- Guozhang
> > >>
> > >
> > >
> >
>


-- 
-- Guozhang

Re: [DISCUSS] KIP-869: Improve Streams State Restoration Visibility

Posted by Nick Telford <ni...@gmail.com>.

Hi Guozhang,

The metrics you've described are currently "thread-level" metrics. I know
you're planning on splitting out the restoration from the StreamThread, so
I question the utility of some of these metrics being thread-level, as they
won't correlate with other thread-level metrics, which correspond to a
StreamThread.

Perhaps the record counting metrics would be more useful as
task/store-level metrics? That way, they can be aggregated by users to
determine things like total records remaining to restore by store, across
the entire app etc.

Regards,

Nick

On Thu, 22 Sept 2022 at 13:21, Bruno Cadonna <ca...@apache.org> wrote:

> Hi Guozhang!
>
> Thanks for the updates!
>
> 1.
> Why do you distinguish between active and standby for the total amount
> of restored/updated records but not for the rate of restored/updated
> records?
>
> 2.
> Regarding "standby-records-remaining" I am not sure how useful this
> metric is and I am not sure how hard it will be to record. I see the
> usefulness of monitoring the lag of a single standby state store, but I
> am not sure how useful it is to monitor the "lag" of a state updater
> thread that might potentially contain multiple state stores.
> Furthermore, do we need to issue a remote call to record this metric or
> do we get this information from each poll?
>
> 3.
> Could you please give an example where "active-records-restored-total"
> and "standby-records-updated-total" are useful?
>
> Best,
> Bruno
>
> On 20.09.22 22:45, Guozhang Wang wrote:
> > Hello Bruno/Nick,
> >
> > I've updated the KIP wiki to reflect the incorporated comments from you,
> > please feel free to take another look and let me know what you think.
> >
> >
> > Guozhang
> >
> > On Tue, Sep 20, 2022 at 9:37 AM Guozhang Wang <wa...@gmail.com>
> wrote:
> >
> >> Hi Nick,
> >>
> >> Thanks for the reviews, and I think these are good suggestions. Note
> that
> >> currently the `restore-records-total/rate` would include both restoring
> >> active tasks as well as updating standby tasks, I think for your
> purposes
> >> you'd be more interested in active restoring tasks since they could
> >> complete, while updating standby tasks would not complete even if they
> have
> >> caught up with the active. At the same time, the changelog reader would
> >> only be restoring either active or standby at a given time, and active
> >> tasks has a higher priority such that as long as there is at least one
> >> active task still restoring, we would not restore any standby tasks.
> From
> >> your suggestion, I'm thinking that maybe I should break up the rate /
> total
> >> metric for active and standby tasks separately.
> >>
> >> For deriving estimated time remaining though, the `total` metric may not
> >> be helpful since they will not be "reset" after rebalances, i.e. they
> will
> >> be an ever-increasing number and record the total number of records for
> the
> >> lifetime of the app. But still, just the remaining records alone, with
> the
> >> time elapsed monitored by the apps, should be sufficient to get the
> >> estimated time remaining.
> >>
> >>
> >> Guozhang
> >>
> >>
> >> On Tue, Sep 20, 2022 at 3:10 AM Nick Telford <ni...@gmail.com>
> >> wrote:
> >>
> >>> Hi Guozhang,
> >>>
> >>> KIP looks great, I have one suggestion: in addition to
> >>> "restore-records-total", it would also be useful to track the number of
> >>> records *remaining*, that is, the records that have not yet been
> restored.
> >>> This is actually the metric I was attempting to implement in the
> >>> StateRestoreListener that bumped me up against KAFKA-10575 :-)
> >>>
> >>> With both a "restore-records-total" and a "restore-remaining-total" (or
> >>> similar) metric, it's possible to derive useful information like the
> >>> estimated time remaining for restoration (by dividing the remaining
> total
> >>> by the restoration rate).
> >>>
> >>> Regards,
> >>>
> >>> Nick
> >>>
> >>> On Mon, 19 Sept 2022 at 19:57, Guozhang Wang <wa...@gmail.com>
> wrote:
> >>>
> >>>> Hello Bruno,
> >>>>
> >>>> Thanks for your comments!
> >>>>
> >>>> 1. Regarding the metrics group name: originally I put
> >>>> "stream-state-metrics" as it's related to state store restorations,
> but
> >>>> after a second thought I think I agree with you that this is quite
> >>>> confusing and not right. About the metrics groups, right now I have
> two
> >>>> ideas:
> >>>>
> >>>> a) Still use the metric group name "stream-thread-metrics", but
> >>>> differentiate with the processing threads on the thread id. The pros
> is
> >>>> that we do not introduce a new group, the cons is that users who want
> to
> >>>> separate processing from restoration/updating in the future needs to
> do
> >>>> that on the thread id labels.
> >>>> b) Introduce a new group name, for example
> >>> "stream-state-updater-metrics"
> >>>> and still have a thread-id label. We would be introducing a new group
> >>> which
> >>>> still have a thread-id, which may sound a bit weird (maybe if we do
> >>> that we
> >>>> could change the existing "stream-thread-metrics" into
> >>>> "stream-processor-metrics").
> >>>>
> >>>> Right now I'm leaning towards b) and maybe in the future rename
> >>>> "thread-metrics" to "processor-metrics", LMK what do you think.
> >>>>
> >>>> 2. Regarding the metric names: today we may also pause a standby
> tasks,
> >>> and
> >>>> hence I'm trying to differentiate updating standbys from paused
> >>> standbys.
> >>>> Right now I'm thinking we can change "restoring-standby-tasks" to
> >>>> "updating-standby-tasks", and change all other metrics' "restore"
> >>> (except
> >>>> the "restoring-active-tasks") to "state-update", a.k.a
> >>>> "state-update-ratio", "state-update-records-total",
> >>>> "updating-standby-tasks" etc.
> >>>>
> >>>> 3. Regarding the function name: yeah I think that's a valid concern, I
> >>> can
> >>>> change it to "onRestoreSuspended" since "Aborted" may confuse people
> >>> that
> >>>> previously called "onBatchRestored" are undone as part of the
> abortion.
> >>>>
> >>>>
> >>>> Guozhang
> >>>>
> >>>>
> >>>>
> >>>> On Mon, Sep 19, 2022 at 10:47 AM Bruno Cadonna <ca...@apache.org>
> >>> wrote:
> >>>>
> >>>>> Hi Guozhang,
> >>>>>
> >>>>> Thanks for the KIP! I think this KIP is a really nice addition to
> >>> better
> >>>>> understand what is going on in a Kafka Streams application.
> >>>>>
> >>>>> 1.
> >>>>> The metric names "paused-active-tasks" and "paused-standby-tasks"
> >>> might
> >>>>> be a bit confusing since at least active tasks can be paused also
> >>>>> outside of restoration.
> >>>>>
> >>>>> 2.
> >>>>> Why is the type of the metrics "stream-state-metrics"? I would have
> >>>>> expected "stream-thread-metrics" as the type.
> >>>>>
> >>>>> 3.
> >>>>> Isn't the value of the metric "restoring-standby-tasks" simply the
> >>>>> number of standby tasks since standby tasks are basically always
> >>>>> updating (aka restoring)?
> >>>>>
> >>>>> 4.
> >>>>> "idle-ratio", "restore-ratio", and "checkpoint-ratio" seem metrics
> >>>>> tailored to the upcoming state updater. They do not make much sense
> >>> with
> >>>>> a stream thread. Would it be better to introduce a new metrics level
> >>>>> specifically for the state updater?
> >>>>>
> >>>>> 5.
> >>>>> Personally, I do not like to use the word "restoration" together with
> >>>>> standbys since restoration somehow implies that there is an offset
> for
> >>>>> which the active task is considered restored and active processing
> can
> >>>>> start. In other words, restoration is finite. Standby tasks rather
> >>>>> update continuously their states. They can be up-to-date or lagging.
> I
> >>>>> see that you could say "restored" instead of "up-to-date" and "not
> >>>>> restored" instead of "lagging", but IMO it does not describe well the
> >>>>> situation. That is a rather minor point. I just wanted to mention it.
> >>>>>
> >>>>> 6.
> >>>>> The name "onRestorePaused()" might be confusing since in Kafka
> Streams
> >>>>> users can also pause tasks. What about "onRestoreAborted()" or
> >>>>> "onRestoreSuspended"?
> >>>>>
> >>>>> Best,
> >>>>> Bruno
> >>>>>
> >>>>>
> >>>>> On 16.09.22 19:33, Guozhang Wang wrote:
> >>>>>> Hello everyone,
> >>>>>>
> >>>>>> I'd like to start a discussion for the following KIP, aiming to
> >>> improve
> >>>>>> Kafka Stream's restoration visibility via new metrics and callback
> >>>>> methods:
> >>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-869%3A+Improve+Streams+State+Restoration+Visibility
> >>>>>>
> >>>>>>
> >>>>>> Thanks!
> >>>>>> -- Guozhang
> >>>>>>
> >>>>>
> >>>>
> >>>>
> >>>> --
> >>>> -- Guozhang
> >>>>
> >>>
> >>
> >>
> >> --
> >> -- Guozhang
> >>
> >
> >
>

Re: [DISCUSS] KIP-869: Improve Streams State Restoration Visibility

Posted by Bruno Cadonna <ca...@apache.org>.

Hi Guozhang!

Thanks for the updates!

1.
Why do you distinguish between active and standby for the total amount 
of restored/updated records but not for the rate of restored/updated 
records?

2.
Regarding "standby-records-remaining" I am not sure how useful this 
metric is and I am not sure how hard it will be to record. I see the 
usefulness of monitoring the lag of a single standby state store, but I 
am not sure how useful it is to monitor the "lag" of a state updater 
thread that might potentially contain multiple state stores. 
Furthermore, do we need to issue a remote call to record this metric or 
do we get this information from each poll?

3.
Could you please give an example where "active-records-restored-total" 
and "standby-records-updated-total" are useful?

Best,
Bruno

On 20.09.22 22:45, Guozhang Wang wrote:
> Hello Bruno/Nick,
> 
> I've updated the KIP wiki to reflect the incorporated comments from you,
> please feel free to take another look and let me know what you think.
> 
> 
> Guozhang
> 
> On Tue, Sep 20, 2022 at 9:37 AM Guozhang Wang <wa...@gmail.com> wrote:
> 
>> Hi Nick,
>>
>> Thanks for the reviews, and I think these are good suggestions. Note that
>> currently the `restore-records-total/rate` would include both restoring
>> active tasks as well as updating standby tasks, I think for your purposes
>> you'd be more interested in active restoring tasks since they could
>> complete, while updating standby tasks would not complete even if they have
>> caught up with the active. At the same time, the changelog reader would
>> only be restoring either active or standby at a given time, and active
>> tasks has a higher priority such that as long as there is at least one
>> active task still restoring, we would not restore any standby tasks. From
>> your suggestion, I'm thinking that maybe I should break up the rate / total
>> metric for active and standby tasks separately.
>>
>> For deriving estimated time remaining though, the `total` metric may not
>> be helpful since they will not be "reset" after rebalances, i.e. they will
>> be an ever-increasing number and record the total number of records for the
>> lifetime of the app. But still, just the remaining records alone, with the
>> time elapsed monitored by the apps, should be sufficient to get the
>> estimated time remaining.
>>
>>
>> Guozhang
>>
>>
>> On Tue, Sep 20, 2022 at 3:10 AM Nick Telford <ni...@gmail.com>
>> wrote:
>>
>>> Hi Guozhang,
>>>
>>> KIP looks great, I have one suggestion: in addition to
>>> "restore-records-total", it would also be useful to track the number of
>>> records *remaining*, that is, the records that have not yet been restored.
>>> This is actually the metric I was attempting to implement in the
>>> StateRestoreListener that bumped me up against KAFKA-10575 :-)
>>>
>>> With both a "restore-records-total" and a "restore-remaining-total" (or
>>> similar) metric, it's possible to derive useful information like the
>>> estimated time remaining for restoration (by dividing the remaining total
>>> by the restoration rate).
>>>
>>> Regards,
>>>
>>> Nick
>>>
>>> On Mon, 19 Sept 2022 at 19:57, Guozhang Wang <wa...@gmail.com> wrote:
>>>
>>>> Hello Bruno,
>>>>
>>>> Thanks for your comments!
>>>>
>>>> 1. Regarding the metrics group name: originally I put
>>>> "stream-state-metrics" as it's related to state store restorations, but
>>>> after a second thought I think I agree with you that this is quite
>>>> confusing and not right. About the metrics groups, right now I have two
>>>> ideas:
>>>>
>>>> a) Still use the metric group name "stream-thread-metrics", but
>>>> differentiate with the processing threads on the thread id. The pros is
>>>> that we do not introduce a new group, the cons is that users who want to
>>>> separate processing from restoration/updating in the future needs to do
>>>> that on the thread id labels.
>>>> b) Introduce a new group name, for example
>>> "stream-state-updater-metrics"
>>>> and still have a thread-id label. We would be introducing a new group
>>> which
>>>> still have a thread-id, which may sound a bit weird (maybe if we do
>>> that we
>>>> could change the existing "stream-thread-metrics" into
>>>> "stream-processor-metrics").
>>>>
>>>> Right now I'm leaning towards b) and maybe in the future rename
>>>> "thread-metrics" to "processor-metrics", LMK what do you think.
>>>>
>>>> 2. Regarding the metric names: today we may also pause a standby tasks,
>>> and
>>>> hence I'm trying to differentiate updating standbys from paused
>>> standbys.
>>>> Right now I'm thinking we can change "restoring-standby-tasks" to
>>>> "updating-standby-tasks", and change all other metrics' "restore"
>>> (except
>>>> the "restoring-active-tasks") to "state-update", a.k.a
>>>> "state-update-ratio", "state-update-records-total",
>>>> "updating-standby-tasks" etc.
>>>>
>>>> 3. Regarding the function name: yeah I think that's a valid concern, I
>>> can
>>>> change it to "onRestoreSuspended" since "Aborted" may confuse people
>>> that
>>>> previously called "onBatchRestored" are undone as part of the abortion.
>>>>
>>>>
>>>> Guozhang
>>>>
>>>>
>>>>
>>>> On Mon, Sep 19, 2022 at 10:47 AM Bruno Cadonna <ca...@apache.org>
>>> wrote:
>>>>
>>>>> Hi Guozhang,
>>>>>
>>>>> Thanks for the KIP! I think this KIP is a really nice addition to
>>> better
>>>>> understand what is going on in a Kafka Streams application.
>>>>>
>>>>> 1.
>>>>> The metric names "paused-active-tasks" and "paused-standby-tasks"
>>> might
>>>>> be a bit confusing since at least active tasks can be paused also
>>>>> outside of restoration.
>>>>>
>>>>> 2.
>>>>> Why is the type of the metrics "stream-state-metrics"? I would have
>>>>> expected "stream-thread-metrics" as the type.
>>>>>
>>>>> 3.
>>>>> Isn't the value of the metric "restoring-standby-tasks" simply the
>>>>> number of standby tasks since standby tasks are basically always
>>>>> updating (aka restoring)?
>>>>>
>>>>> 4.
>>>>> "idle-ratio", "restore-ratio", and "checkpoint-ratio" seem metrics
>>>>> tailored to the upcoming state updater. They do not make much sense
>>> with
>>>>> a stream thread. Would it be better to introduce a new metrics level
>>>>> specifically for the state updater?
>>>>>
>>>>> 5.
>>>>> Personally, I do not like to use the word "restoration" together with
>>>>> standbys since restoration somehow implies that there is an offset for
>>>>> which the active task is considered restored and active processing can
>>>>> start. In other words, restoration is finite. Standby tasks rather
>>>>> update continuously their states. They can be up-to-date or lagging. I
>>>>> see that you could say "restored" instead of "up-to-date" and "not
>>>>> restored" instead of "lagging", but IMO it does not describe well the
>>>>> situation. That is a rather minor point. I just wanted to mention it.
>>>>>
>>>>> 6.
>>>>> The name "onRestorePaused()" might be confusing since in Kafka Streams
>>>>> users can also pause tasks. What about "onRestoreAborted()" or
>>>>> "onRestoreSuspended"?
>>>>>
>>>>> Best,
>>>>> Bruno
>>>>>
>>>>>
>>>>> On 16.09.22 19:33, Guozhang Wang wrote:
>>>>>> Hello everyone,
>>>>>>
>>>>>> I'd like to start a discussion for the following KIP, aiming to
>>> improve
>>>>>> Kafka Stream's restoration visibility via new metrics and callback
>>>>> methods:
>>>>>>
>>>>>>
>>>>>
>>>>
>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-869%3A+Improve+Streams+State+Restoration+Visibility
>>>>>>
>>>>>>
>>>>>> Thanks!
>>>>>> -- Guozhang
>>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> -- Guozhang
>>>>
>>>
>>
>>
>> --
>> -- Guozhang
>>
> 
>

Re: [DISCUSS] KIP-869: Improve Streams State Restoration Visibility

Posted by Guozhang Wang <wa...@gmail.com>.

Hello Bruno/Nick,

I've updated the KIP wiki to reflect the incorporated comments from you,
please feel free to take another look and let me know what you think.


Guozhang

On Tue, Sep 20, 2022 at 9:37 AM Guozhang Wang <wa...@gmail.com> wrote:

> Hi Nick,
>
> Thanks for the reviews, and I think these are good suggestions. Note that
> currently the `restore-records-total/rate` would include both restoring
> active tasks as well as updating standby tasks, I think for your purposes
> you'd be more interested in active restoring tasks since they could
> complete, while updating standby tasks would not complete even if they have
> caught up with the active. At the same time, the changelog reader would
> only be restoring either active or standby at a given time, and active
> tasks has a higher priority such that as long as there is at least one
> active task still restoring, we would not restore any standby tasks. From
> your suggestion, I'm thinking that maybe I should break up the rate / total
> metric for active and standby tasks separately.
>
> For deriving estimated time remaining though, the `total` metric may not
> be helpful since they will not be "reset" after rebalances, i.e. they will
> be an ever-increasing number and record the total number of records for the
> lifetime of the app. But still, just the remaining records alone, with the
> time elapsed monitored by the apps, should be sufficient to get the
> estimated time remaining.
>
>
> Guozhang
>
>
> On Tue, Sep 20, 2022 at 3:10 AM Nick Telford <ni...@gmail.com>
> wrote:
>
>> Hi Guozhang,
>>
>> KIP looks great, I have one suggestion: in addition to
>> "restore-records-total", it would also be useful to track the number of
>> records *remaining*, that is, the records that have not yet been restored.
>> This is actually the metric I was attempting to implement in the
>> StateRestoreListener that bumped me up against KAFKA-10575 :-)
>>
>> With both a "restore-records-total" and a "restore-remaining-total" (or
>> similar) metric, it's possible to derive useful information like the
>> estimated time remaining for restoration (by dividing the remaining total
>> by the restoration rate).
>>
>> Regards,
>>
>> Nick
>>
>> On Mon, 19 Sept 2022 at 19:57, Guozhang Wang <wa...@gmail.com> wrote:
>>
>> > Hello Bruno,
>> >
>> > Thanks for your comments!
>> >
>> > 1. Regarding the metrics group name: originally I put
>> > "stream-state-metrics" as it's related to state store restorations, but
>> > after a second thought I think I agree with you that this is quite
>> > confusing and not right. About the metrics groups, right now I have two
>> > ideas:
>> >
>> > a) Still use the metric group name "stream-thread-metrics", but
>> > differentiate with the processing threads on the thread id. The pros is
>> > that we do not introduce a new group, the cons is that users who want to
>> > separate processing from restoration/updating in the future needs to do
>> > that on the thread id labels.
>> > b) Introduce a new group name, for example
>> "stream-state-updater-metrics"
>> > and still have a thread-id label. We would be introducing a new group
>> which
>> > still have a thread-id, which may sound a bit weird (maybe if we do
>> that we
>> > could change the existing "stream-thread-metrics" into
>> > "stream-processor-metrics").
>> >
>> > Right now I'm leaning towards b) and maybe in the future rename
>> > "thread-metrics" to "processor-metrics", LMK what do you think.
>> >
>> > 2. Regarding the metric names: today we may also pause a standby tasks,
>> and
>> > hence I'm trying to differentiate updating standbys from paused
>> standbys.
>> > Right now I'm thinking we can change "restoring-standby-tasks" to
>> > "updating-standby-tasks", and change all other metrics' "restore"
>> (except
>> > the "restoring-active-tasks") to "state-update", a.k.a
>> > "state-update-ratio", "state-update-records-total",
>> > "updating-standby-tasks" etc.
>> >
>> > 3. Regarding the function name: yeah I think that's a valid concern, I
>> can
>> > change it to "onRestoreSuspended" since "Aborted" may confuse people
>> that
>> > previously called "onBatchRestored" are undone as part of the abortion.
>> >
>> >
>> > Guozhang
>> >
>> >
>> >
>> > On Mon, Sep 19, 2022 at 10:47 AM Bruno Cadonna <ca...@apache.org>
>> wrote:
>> >
>> > > Hi Guozhang,
>> > >
>> > > Thanks for the KIP! I think this KIP is a really nice addition to
>> better
>> > > understand what is going on in a Kafka Streams application.
>> > >
>> > > 1.
>> > > The metric names "paused-active-tasks" and "paused-standby-tasks"
>> might
>> > > be a bit confusing since at least active tasks can be paused also
>> > > outside of restoration.
>> > >
>> > > 2.
>> > > Why is the type of the metrics "stream-state-metrics"? I would have
>> > > expected "stream-thread-metrics" as the type.
>> > >
>> > > 3.
>> > > Isn't the value of the metric "restoring-standby-tasks" simply the
>> > > number of standby tasks since standby tasks are basically always
>> > > updating (aka restoring)?
>> > >
>> > > 4.
>> > > "idle-ratio", "restore-ratio", and "checkpoint-ratio" seem metrics
>> > > tailored to the upcoming state updater. They do not make much sense
>> with
>> > > a stream thread. Would it be better to introduce a new metrics level
>> > > specifically for the state updater?
>> > >
>> > > 5.
>> > > Personally, I do not like to use the word "restoration" together with
>> > > standbys since restoration somehow implies that there is an offset for
>> > > which the active task is considered restored and active processing can
>> > > start. In other words, restoration is finite. Standby tasks rather
>> > > update continuously their states. They can be up-to-date or lagging. I
>> > > see that you could say "restored" instead of "up-to-date" and "not
>> > > restored" instead of "lagging", but IMO it does not describe well the
>> > > situation. That is a rather minor point. I just wanted to mention it.
>> > >
>> > > 6.
>> > > The name "onRestorePaused()" might be confusing since in Kafka Streams
>> > > users can also pause tasks. What about "onRestoreAborted()" or
>> > > "onRestoreSuspended"?
>> > >
>> > > Best,
>> > > Bruno
>> > >
>> > >
>> > > On 16.09.22 19:33, Guozhang Wang wrote:
>> > > > Hello everyone,
>> > > >
>> > > > I'd like to start a discussion for the following KIP, aiming to
>> improve
>> > > > Kafka Stream's restoration visibility via new metrics and callback
>> > > methods:
>> > > >
>> > > >
>> > >
>> >
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-869%3A+Improve+Streams+State+Restoration+Visibility
>> > > >
>> > > >
>> > > > Thanks!
>> > > > -- Guozhang
>> > > >
>> > >
>> >
>> >
>> > --
>> > -- Guozhang
>> >
>>
>
>
> --
> -- Guozhang
>


-- 
-- Guozhang

Re: [DISCUSS] KIP-869: Improve Streams State Restoration Visibility

Posted by Guozhang Wang <wa...@gmail.com>.

Hi Nick,

Thanks for the reviews, and I think these are good suggestions. Note that
currently the `restore-records-total/rate` would include both restoring
active tasks as well as updating standby tasks, I think for your purposes
you'd be more interested in active restoring tasks since they could
complete, while updating standby tasks would not complete even if they have
caught up with the active. At the same time, the changelog reader would
only be restoring either active or standby at a given time, and active
tasks has a higher priority such that as long as there is at least one
active task still restoring, we would not restore any standby tasks. From
your suggestion, I'm thinking that maybe I should break up the rate / total
metric for active and standby tasks separately.

For deriving estimated time remaining though, the `total` metric may not be
helpful since they will not be "reset" after rebalances, i.e. they will be
an ever-increasing number and record the total number of records for the
lifetime of the app. But still, just the remaining records alone, with the
time elapsed monitored by the apps, should be sufficient to get the
estimated time remaining.


Guozhang


On Tue, Sep 20, 2022 at 3:10 AM Nick Telford <ni...@gmail.com> wrote:

> Hi Guozhang,
>
> KIP looks great, I have one suggestion: in addition to
> "restore-records-total", it would also be useful to track the number of
> records *remaining*, that is, the records that have not yet been restored.
> This is actually the metric I was attempting to implement in the
> StateRestoreListener that bumped me up against KAFKA-10575 :-)
>
> With both a "restore-records-total" and a "restore-remaining-total" (or
> similar) metric, it's possible to derive useful information like the
> estimated time remaining for restoration (by dividing the remaining total
> by the restoration rate).
>
> Regards,
>
> Nick
>
> On Mon, 19 Sept 2022 at 19:57, Guozhang Wang <wa...@gmail.com> wrote:
>
> > Hello Bruno,
> >
> > Thanks for your comments!
> >
> > 1. Regarding the metrics group name: originally I put
> > "stream-state-metrics" as it's related to state store restorations, but
> > after a second thought I think I agree with you that this is quite
> > confusing and not right. About the metrics groups, right now I have two
> > ideas:
> >
> > a) Still use the metric group name "stream-thread-metrics", but
> > differentiate with the processing threads on the thread id. The pros is
> > that we do not introduce a new group, the cons is that users who want to
> > separate processing from restoration/updating in the future needs to do
> > that on the thread id labels.
> > b) Introduce a new group name, for example "stream-state-updater-metrics"
> > and still have a thread-id label. We would be introducing a new group
> which
> > still have a thread-id, which may sound a bit weird (maybe if we do that
> we
> > could change the existing "stream-thread-metrics" into
> > "stream-processor-metrics").
> >
> > Right now I'm leaning towards b) and maybe in the future rename
> > "thread-metrics" to "processor-metrics", LMK what do you think.
> >
> > 2. Regarding the metric names: today we may also pause a standby tasks,
> and
> > hence I'm trying to differentiate updating standbys from paused standbys.
> > Right now I'm thinking we can change "restoring-standby-tasks" to
> > "updating-standby-tasks", and change all other metrics' "restore" (except
> > the "restoring-active-tasks") to "state-update", a.k.a
> > "state-update-ratio", "state-update-records-total",
> > "updating-standby-tasks" etc.
> >
> > 3. Regarding the function name: yeah I think that's a valid concern, I
> can
> > change it to "onRestoreSuspended" since "Aborted" may confuse people that
> > previously called "onBatchRestored" are undone as part of the abortion.
> >
> >
> > Guozhang
> >
> >
> >
> > On Mon, Sep 19, 2022 at 10:47 AM Bruno Cadonna <ca...@apache.org>
> wrote:
> >
> > > Hi Guozhang,
> > >
> > > Thanks for the KIP! I think this KIP is a really nice addition to
> better
> > > understand what is going on in a Kafka Streams application.
> > >
> > > 1.
> > > The metric names "paused-active-tasks" and "paused-standby-tasks" might
> > > be a bit confusing since at least active tasks can be paused also
> > > outside of restoration.
> > >
> > > 2.
> > > Why is the type of the metrics "stream-state-metrics"? I would have
> > > expected "stream-thread-metrics" as the type.
> > >
> > > 3.
> > > Isn't the value of the metric "restoring-standby-tasks" simply the
> > > number of standby tasks since standby tasks are basically always
> > > updating (aka restoring)?
> > >
> > > 4.
> > > "idle-ratio", "restore-ratio", and "checkpoint-ratio" seem metrics
> > > tailored to the upcoming state updater. They do not make much sense
> with
> > > a stream thread. Would it be better to introduce a new metrics level
> > > specifically for the state updater?
> > >
> > > 5.
> > > Personally, I do not like to use the word "restoration" together with
> > > standbys since restoration somehow implies that there is an offset for
> > > which the active task is considered restored and active processing can
> > > start. In other words, restoration is finite. Standby tasks rather
> > > update continuously their states. They can be up-to-date or lagging. I
> > > see that you could say "restored" instead of "up-to-date" and "not
> > > restored" instead of "lagging", but IMO it does not describe well the
> > > situation. That is a rather minor point. I just wanted to mention it.
> > >
> > > 6.
> > > The name "onRestorePaused()" might be confusing since in Kafka Streams
> > > users can also pause tasks. What about "onRestoreAborted()" or
> > > "onRestoreSuspended"?
> > >
> > > Best,
> > > Bruno
> > >
> > >
> > > On 16.09.22 19:33, Guozhang Wang wrote:
> > > > Hello everyone,
> > > >
> > > > I'd like to start a discussion for the following KIP, aiming to
> improve
> > > > Kafka Stream's restoration visibility via new metrics and callback
> > > methods:
> > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-869%3A+Improve+Streams+State+Restoration+Visibility
> > > >
> > > >
> > > > Thanks!
> > > > -- Guozhang
> > > >
> > >
> >
> >
> > --
> > -- Guozhang
> >
>


-- 
-- Guozhang

Re: [DISCUSS] KIP-869: Improve Streams State Restoration Visibility

Posted by Nick Telford <ni...@gmail.com>.

Hi Guozhang,

KIP looks great, I have one suggestion: in addition to
"restore-records-total", it would also be useful to track the number of
records *remaining*, that is, the records that have not yet been restored.
This is actually the metric I was attempting to implement in the
StateRestoreListener that bumped me up against KAFKA-10575 :-)

With both a "restore-records-total" and a "restore-remaining-total" (or
similar) metric, it's possible to derive useful information like the
estimated time remaining for restoration (by dividing the remaining total
by the restoration rate).

Regards,

Nick

On Mon, 19 Sept 2022 at 19:57, Guozhang Wang <wa...@gmail.com> wrote:

> Hello Bruno,
>
> Thanks for your comments!
>
> 1. Regarding the metrics group name: originally I put
> "stream-state-metrics" as it's related to state store restorations, but
> after a second thought I think I agree with you that this is quite
> confusing and not right. About the metrics groups, right now I have two
> ideas:
>
> a) Still use the metric group name "stream-thread-metrics", but
> differentiate with the processing threads on the thread id. The pros is
> that we do not introduce a new group, the cons is that users who want to
> separate processing from restoration/updating in the future needs to do
> that on the thread id labels.
> b) Introduce a new group name, for example "stream-state-updater-metrics"
> and still have a thread-id label. We would be introducing a new group which
> still have a thread-id, which may sound a bit weird (maybe if we do that we
> could change the existing "stream-thread-metrics" into
> "stream-processor-metrics").
>
> Right now I'm leaning towards b) and maybe in the future rename
> "thread-metrics" to "processor-metrics", LMK what do you think.
>
> 2. Regarding the metric names: today we may also pause a standby tasks, and
> hence I'm trying to differentiate updating standbys from paused standbys.
> Right now I'm thinking we can change "restoring-standby-tasks" to
> "updating-standby-tasks", and change all other metrics' "restore" (except
> the "restoring-active-tasks") to "state-update", a.k.a
> "state-update-ratio", "state-update-records-total",
> "updating-standby-tasks" etc.
>
> 3. Regarding the function name: yeah I think that's a valid concern, I can
> change it to "onRestoreSuspended" since "Aborted" may confuse people that
> previously called "onBatchRestored" are undone as part of the abortion.
>
>
> Guozhang
>
>
>
> On Mon, Sep 19, 2022 at 10:47 AM Bruno Cadonna <ca...@apache.org> wrote:
>
> > Hi Guozhang,
> >
> > Thanks for the KIP! I think this KIP is a really nice addition to better
> > understand what is going on in a Kafka Streams application.
> >
> > 1.
> > The metric names "paused-active-tasks" and "paused-standby-tasks" might
> > be a bit confusing since at least active tasks can be paused also
> > outside of restoration.
> >
> > 2.
> > Why is the type of the metrics "stream-state-metrics"? I would have
> > expected "stream-thread-metrics" as the type.
> >
> > 3.
> > Isn't the value of the metric "restoring-standby-tasks" simply the
> > number of standby tasks since standby tasks are basically always
> > updating (aka restoring)?
> >
> > 4.
> > "idle-ratio", "restore-ratio", and "checkpoint-ratio" seem metrics
> > tailored to the upcoming state updater. They do not make much sense with
> > a stream thread. Would it be better to introduce a new metrics level
> > specifically for the state updater?
> >
> > 5.
> > Personally, I do not like to use the word "restoration" together with
> > standbys since restoration somehow implies that there is an offset for
> > which the active task is considered restored and active processing can
> > start. In other words, restoration is finite. Standby tasks rather
> > update continuously their states. They can be up-to-date or lagging. I
> > see that you could say "restored" instead of "up-to-date" and "not
> > restored" instead of "lagging", but IMO it does not describe well the
> > situation. That is a rather minor point. I just wanted to mention it.
> >
> > 6.
> > The name "onRestorePaused()" might be confusing since in Kafka Streams
> > users can also pause tasks. What about "onRestoreAborted()" or
> > "onRestoreSuspended"?
> >
> > Best,
> > Bruno
> >
> >
> > On 16.09.22 19:33, Guozhang Wang wrote:
> > > Hello everyone,
> > >
> > > I'd like to start a discussion for the following KIP, aiming to improve
> > > Kafka Stream's restoration visibility via new metrics and callback
> > methods:
> > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-869%3A+Improve+Streams+State+Restoration+Visibility
> > >
> > >
> > > Thanks!
> > > -- Guozhang
> > >
> >
>
>
> --
> -- Guozhang
>

Re: [DISCUSS] KIP-869: Improve Streams State Restoration Visibility

Posted by Guozhang Wang <wa...@gmail.com>.

Hello Bruno,

Thanks for your comments!

1. Regarding the metrics group name: originally I put
"stream-state-metrics" as it's related to state store restorations, but
after a second thought I think I agree with you that this is quite
confusing and not right. About the metrics groups, right now I have two
ideas:

a) Still use the metric group name "stream-thread-metrics", but
differentiate with the processing threads on the thread id. The pros is
that we do not introduce a new group, the cons is that users who want to
separate processing from restoration/updating in the future needs to do
that on the thread id labels.
b) Introduce a new group name, for example "stream-state-updater-metrics"
and still have a thread-id label. We would be introducing a new group which
still have a thread-id, which may sound a bit weird (maybe if we do that we
could change the existing "stream-thread-metrics" into
"stream-processor-metrics").

Right now I'm leaning towards b) and maybe in the future rename
"thread-metrics" to "processor-metrics", LMK what do you think.

2. Regarding the metric names: today we may also pause a standby tasks, and
hence I'm trying to differentiate updating standbys from paused standbys.
Right now I'm thinking we can change "restoring-standby-tasks" to
"updating-standby-tasks", and change all other metrics' "restore" (except
the "restoring-active-tasks") to "state-update", a.k.a
"state-update-ratio", "state-update-records-total",
"updating-standby-tasks" etc.

3. Regarding the function name: yeah I think that's a valid concern, I can
change it to "onRestoreSuspended" since "Aborted" may confuse people that
previously called "onBatchRestored" are undone as part of the abortion.

Guozhang

On Mon, Sep 19, 2022 at 10:47 AM Bruno Cadonna <ca...@apache.org> wrote:

> Hi Guozhang,
>
> Thanks for the KIP! I think this KIP is a really nice addition to better
> understand what is going on in a Kafka Streams application.
>
> 1.
> The metric names "paused-active-tasks" and "paused-standby-tasks" might
> be a bit confusing since at least active tasks can be paused also
> outside of restoration.
>
> 2.
> Why is the type of the metrics "stream-state-metrics"? I would have
> expected "stream-thread-metrics" as the type.
>
> 3.
> Isn't the value of the metric "restoring-standby-tasks" simply the
> number of standby tasks since standby tasks are basically always
> updating (aka restoring)?
>
> 4.
> "idle-ratio", "restore-ratio", and "checkpoint-ratio" seem metrics
> tailored to the upcoming state updater. They do not make much sense with
> a stream thread. Would it be better to introduce a new metrics level
> specifically for the state updater?
>
> 5.
> Personally, I do not like to use the word "restoration" together with
> standbys since restoration somehow implies that there is an offset for
> which the active task is considered restored and active processing can
> start. In other words, restoration is finite. Standby tasks rather
> update continuously their states. They can be up-to-date or lagging. I
> see that you could say "restored" instead of "up-to-date" and "not
> restored" instead of "lagging", but IMO it does not describe well the
> situation. That is a rather minor point. I just wanted to mention it.
>
> 6.
> The name "onRestorePaused()" might be confusing since in Kafka Streams
> users can also pause tasks. What about "onRestoreAborted()" or
> "onRestoreSuspended"?
>
> Best,
> Bruno
>
>
> On 16.09.22 19:33, Guozhang Wang wrote:
> > Hello everyone,
> >
> > I'd like to start a discussion for the following KIP, aiming to improve
> > Kafka Stream's restoration visibility via new metrics and callback
> methods:
> >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-869%3A+Improve+Streams+State+Restoration+Visibility
> >
> >
> > Thanks!
> > -- Guozhang
> >
>

-- 
-- Guozhang

Re: [DISCUSS] KIP-869: Improve Streams State Restoration Visibility

Posted by Bruno Cadonna <ca...@apache.org>.

Hi Guozhang,

Thanks for the KIP! I think this KIP is a really nice addition to better 
understand what is going on in a Kafka Streams application.

1.
The metric names "paused-active-tasks" and "paused-standby-tasks" might 
be a bit confusing since at least active tasks can be paused also 
outside of restoration.

2.
Why is the type of the metrics "stream-state-metrics"? I would have 
expected "stream-thread-metrics" as the type.

3.
Isn't the value of the metric "restoring-standby-tasks" simply the 
number of standby tasks since standby tasks are basically always 
updating (aka restoring)?

4.
"idle-ratio", "restore-ratio", and "checkpoint-ratio" seem metrics 
tailored to the upcoming state updater. They do not make much sense with 
a stream thread. Would it be better to introduce a new metrics level 
specifically for the state updater?

5.
Personally, I do not like to use the word "restoration" together with 
standbys since restoration somehow implies that there is an offset for 
which the active task is considered restored and active processing can 
start. In other words, restoration is finite. Standby tasks rather 
update continuously their states. They can be up-to-date or lagging. I 
see that you could say "restored" instead of "up-to-date" and "not 
restored" instead of "lagging", but IMO it does not describe well the 
situation. That is a rather minor point. I just wanted to mention it.

6.
The name "onRestorePaused()" might be confusing since in Kafka Streams 
users can also pause tasks. What about "onRestoreAborted()" or 
"onRestoreSuspended"?

Best,
Bruno

On 16.09.22 19:33, Guozhang Wang wrote:
> Hello everyone,
> 
> I'd like to start a discussion for the following KIP, aiming to improve
> Kafka Stream's restoration visibility via new metrics and callback methods:
> 
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-869%3A+Improve+Streams+State+Restoration+Visibility
> 
> 
> Thanks!
> -- Guozhang
>