You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@kafka.apache.org by Rohan Desai <de...@gmail.com> on 2021/07/12 19:00:41 UTC

[DISCUSS] KIP-761: Add total blocked time metric to streams

https://cwiki.apache.org/confluence/display/KAFKA/KIP-761%3A+Add+Total+Blocked+Time+Metric+to+Streams

Re: [DISCUSS] KIP-761: Add total blocked time metric to streams

Posted by Guozhang Wang <wa...@gmail.com>.

Thanks Rohan,

I've reviewed the new PR and had a question regarding whether we should
have the new metric in ms or ns, maybe we can first discuss about that
before we finalize the KIP?


Guozhang

On Fri, Feb 25, 2022 at 5:48 AM Bruno Cadonna <ca...@apache.org> wrote:

> Hi Rohan,
>
> Thank you for the heads up!
>
> Yes, please update the KIP and send this message also to the VOTE thread
> of the KIP.
>
> Best,
> Bruno
>
> On 25.02.22 04:01, Rohan Desai wrote:
> > Hello,
> >
> > I discovered a bug in the design of this metric. The bug is documented
> > here: https://github.com/apache/kafka/pull/11805. We need to include
> time
> > the producer spends waiting on topic metadata into the total blocked
> time.
> > But to do this we would need to add a new producer metric that tracks the
> > total time spent blocked on metadata. I've implemented this in a patch
> > here: https://github.com/apache/kafka/pull/11805
> >
> > I'm hoping I can just update this KIP to include the new producer metric.
> >
> > On Mon, Jul 12, 2021 at 12:00 PM Rohan Desai <de...@gmail.com>
> > wrote:
> >
> >>
> >>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-761%3A+Add+Total+Blocked+Time+Metric+to+Streams
> >>
> >
>


-- 
-- Guozhang

Re: [DISCUSS] KIP-761: Add total blocked time metric to streams

Posted by Bruno Cadonna <ca...@apache.org>.

Hi Rohan,

Thank you for the heads up!

Yes, please update the KIP and send this message also to the VOTE thread 
of the KIP.

Best,
Bruno

On 25.02.22 04:01, Rohan Desai wrote:
> Hello,
> 
> I discovered a bug in the design of this metric. The bug is documented
> here: https://github.com/apache/kafka/pull/11805. We need to include time
> the producer spends waiting on topic metadata into the total blocked time.
> But to do this we would need to add a new producer metric that tracks the
> total time spent blocked on metadata. I've implemented this in a patch
> here: https://github.com/apache/kafka/pull/11805
> 
> I'm hoping I can just update this KIP to include the new producer metric.
> 
> On Mon, Jul 12, 2021 at 12:00 PM Rohan Desai <de...@gmail.com>
> wrote:
> 
>>
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-761%3A+Add+Total+Blocked+Time+Metric+to+Streams
>>
>

Re: [DISCUSS] KIP-761: Add total blocked time metric to streams

Posted by Rohan Desai <de...@gmail.com>.

Hello,

I discovered a bug in the design of this metric. The bug is documented
here: https://github.com/apache/kafka/pull/11805. We need to include time
the producer spends waiting on topic metadata into the total blocked time.
But to do this we would need to add a new producer metric that tracks the
total time spent blocked on metadata. I've implemented this in a patch
here: https://github.com/apache/kafka/pull/11805

I'm hoping I can just update this KIP to include the new producer metric.

On Mon, Jul 12, 2021 at 12:00 PM Rohan Desai <de...@gmail.com>
wrote:

>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-761%3A+Add+Total+Blocked+Time+Metric+to+Streams
>

Re: [DISCUSS] KIP-761: Add total blocked time metric to streams

Posted by Guozhang Wang <wa...@gmail.com>.

Hey Rohan,

Thanks for putting up the KIP. Maybe you can also briefly talk about the
context in the email message as well (they are very well explained inside
the KIP doc though).

I have a few meta and detailed thoughts after reading it.

1) I agree that the current ratio metrics is just "snapshot in point", and
more flexible metrics that would allow reporters to calculate based on
window intervals are better. However, the current mechanism of the proposed
metrics assumes the thread->clients mapping as of today, where each thread
would own exclusively one main consumer, restore consumer, producer and an
admin client. But this mapping may be subject to change in the future. Have
you thought about how this metric can be extended when, e.g. the embedded
clients and stream threads are de-coupled?

2) [This and all below are minor comments] The "flush-time-total" may
better be a producer client metric, as "flush-wait-time-total", than a
streams metric, though the streams-level "total-blocked" can still leverage
it. Similarly, I think "txn-commit-time-total" and
"offset-commit-time-total" may better be inside producer and consumer
clients respectively.

3) The doc was not very clear on how "thread-start-time" would be needed
when calculating streams utilization along with total-blocked time, could
you elaborate a bit more in the KIP?

4) For "txn-commit-time-total" specifically, besides producer.commitTxn,
other txn-related calls may also be blocking, including
producer.beginTxn/abortTxn, I saw you mentioned "txn-begin-time-total"
later in the doc, but did not include it as a separate metric, and
similarly, should we have a `txn-abort-time-total` as well? If yes, could
you update the KIP page accordingly.

5) Not a suggestion, but just wanted to bring up that the producer related
metrics are a bit "coarsen" compared with the consumer/admin clients since
their IO mechanisms are a bit different: for producer the caller thread
does not do any IOs since it's all delegated to the background sender
thread, while for consumer/admin the caller thread would still need to do
some IOs, and hence the selector-level metrics would make sense. On top of
my head I cannot think of a better measuring mechanism for producers
either, especially for txn-related ones, we may need to experiment and see
if the generated ratio is relatively accurate and reasonable with the
reflected "block time".

Guozhang

On Mon, Jul 12, 2021 at 12:01 PM Rohan Desai <de...@gmail.com>
wrote:

>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-761%3A+Add+Total+Blocked+Time+Metric+to+Streams
>

-- 
-- Guozhang

Re: [DISCUSS] KIP-761: Add total blocked time metric to streams

Posted by Guozhang Wang <wa...@gmail.com>.

Thanks Rohan.

I remember now that we moved the round-trip PID's txn completion logic into
init-transaction and commit/abort-transaction. So I think we'd count time
as in StreamsProducer#initTransaction as well (admittedly it is in most
cases a one-time thing).

Other than that, I do not have any further comments.

Guozhang

On Tue, Jul 20, 2021 at 11:59 AM Rohan Desai <de...@gmail.com>
wrote:

> > Similarly, I think "txn-commit-time-total" and
> "offset-commit-time-total" may better be inside producer and consumer
> clients respectively.
>
> I agree for offset-commit-time-total. For txn-commit-time-total I'm
> proposing we measure `StreamsProducer.commitTransaction`, which wraps
> multiple producer calls (sendOffsets, commitTransaction)
>
> > > For "txn-commit-time-total" specifically, besides producer.commitTxn.
> other txn-related calls may also be blocking, including
> producer.beginTxn/abortTxn, I saw you mentioned "txn-begin-time-total"
> later in the doc, but did not include it as a separate metric, and
> similarly, should we have a `txn-abort-time-total` as well? If yes, could
> you update the KIP page accordingly.
>
> `beginTransaction` is not blocking - I meant to remove that from that doc.
> I'll add something for abort.
>
> On Mon, Jul 19, 2021 at 11:55 PM Rohan Desai <de...@gmail.com>
> wrote:
>
> > Thanks for the review Guozhang! responding to your feedback inline:
> >
> > > 1) I agree that the current ratio metrics is just "snapshot in point",
> > and
> > more flexible metrics that would allow reporters to calculate based on
> > window intervals are better. However, the current mechanism of the
> proposed
> > metrics assumes the thread->clients mapping as of today, where each
> thread
> > would own exclusively one main consumer, restore consumer, producer and
> an
> > admin client. But this mapping may be subject to change in the future.
> Have
> > you thought about how this metric can be extended when, e.g. the embedded
> > clients and stream threads are de-coupled?
> >
> > Of course this depends on how exactly we refactor the runtime - assuming
> > that we plan to factor out consumers into an "I/O" layer that is
> > responsible for receiving records and enqueuing them to be processed by
> > processing threads, then I think it should be reasonable to count the
> time
> > we spend blocked on this internal queue(s) as blocked. The main concern
> > there to me is that the I/O layer would be doing something expensive like
> > decompression that shouldn't be counted as "blocked". But if that really
> is
> > so expensive that it starts to throw off our ratios then it's probably
> > indicative of a larger problem that the "i/o layer" is a bottleneck and
> it
> > would be worth refactoring so that decompression (or insert other
> expensive
> > thing here) can also be done on the processing threads.
> >
> > > 2) [This and all below are minor comments] The "flush-time-total" may
> > better be a producer client metric, as "flush-wait-time-total", than a
> > streams metric, though the streams-level "total-blocked" can still
> leverage
> > it. Similarly, I think "txn-commit-time-total" and
> > "offset-commit-time-total" may better be inside producer and consumer
> > clients respectively.
> >
> > Good call - I'll update the KIP
> >
> > > 3) The doc was not very clear on how "thread-start-time" would be
> needed
> > when calculating streams utilization along with total-blocked time, could
> > you elaborate a bit more in the KIP?
> >
> > Yes, will do.
> >
> > > For "txn-commit-time-total" specifically, besides producer.commitTxn.
> > other txn-related calls may also be blocking, including
> > producer.beginTxn/abortTxn, I saw you mentioned "txn-begin-time-total"
> > later in the doc, but did not include it as a separate metric, and
> > similarly, should we have a `txn-abort-time-total` as well? If yes, could
> > you update the KIP page accordingly.
> >
> > Ack.
> >
> > On Mon, Jul 12, 2021 at 11:29 PM Rohan Desai <de...@gmail.com>
> > wrote:
> >
> >> Hello All,
> >>
> >> I'd like to start a discussion on the KIP linked above which proposes
> >> some metrics that we would find useful to help measure whether a Kafka
> >> Streams application is saturated. The motivation section in the KIP goes
> >> into some more detail on why we think this is a useful addition to the
> >> metrics already implemented. Thanks in advance for your feedback!
> >>
> >> Best Regards,
> >>
> >> Rohan
> >>
> >> On Mon, Jul 12, 2021 at 12:00 PM Rohan Desai <de...@gmail.com>
> >> wrote:
> >>
> >>>
> >>>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-761%3A+Add+Total+Blocked+Time+Metric+to+Streams
> >>>
> >>
>


-- 
-- Guozhang

Re: [DISCUSS] KIP-761: Add total blocked time metric to streams

Posted by Sophie Blee-Goldman <so...@confluent.io.INVALID>.

Hey Rohan,

The current metrics proposed in the KIP all LGTM, but if the goal is to
include *all* time spent
blocking on any client API then there are a few that might need to be added
to this list. For
example Consumer#committed is called in StreamTask to get the offsets
during initialization,
and several blocking AdminClient APIs are used by Streams such as
#listOffsets (as well as
 #createTopics and #describeTopics, but only during a rebalance to
create/validate the internal
topics)

One thing that wasn't quite clear to me was whether this metric is intended
to reflect only the
time spent blocking on Kafka *while the StreamThread is in RUNNING*, or
just the total time
spent inside any Kafka client API after the thread has started up. If we
only care about the
former, then the current proposal is sufficient, but if we're actually
targeting the latter then
we need to consider those additional APIs listed above.

Personally, I feel that both could actually be useful -- ie, we expose one
metric that only
corresponds to time we spend blocked on Kafka that could otherwise be spent
processing
records from active tasks, and a second metric that reports all time spent
blocking on Kafka,
whether the thread was still initializing (ie in the PARTITIONS_ASSIGNED
state) or actively
processing (ie in RUNNING). WDYT?


On Tue, Jul 20, 2021 at 11:49 PM Rohan Desai <de...@gmail.com>
wrote:

> > I remember now that we moved the round-trip PID's txn completion logic
> into
> init-transaction and commit/abort-transaction. So I think we'd count time
> as in StreamsProducer#initTransaction as well (admittedly it is in most
> cases a one-time thing).
>
> Makes sense - I'll update the KIP
>
> On Tue, Jul 20, 2021 at 11:48 PM Rohan Desai <de...@gmail.com>
> wrote:
>
> >
> > > I had a question - it seems like from the descriptionsof
> > `txn-commit-time-total` and `offset-commit-time-total` that they measure
> > similar processes for ALOS and EOS, but only `txn-commit-time-total` is
> > included in `blocked-time-total`. Why isn't `offset-commit-time-total`
> also
> > included?
> >
> > I've updated the KIP to include it.
> >
> > > Aside from `flush-time-total`, `txn-commit-time-total` and
> > `offset-commit-time-total`, which will be producer/consumer client
> metrics,
> > the rest of the metrics will be streams metrics that will be thread
> level,
> > is that right?
> >
> > Based on the feedback from Guozhang, I've updated the KIP to reflect that
> > the lower-level metrics are all client metrics that are then summed to
> > compute the blocked time metric, which is a Streams metric.
> >
> > On Tue, Jul 20, 2021 at 11:58 AM Rohan Desai <de...@gmail.com>
> > wrote:
> >
> >> > Similarly, I think "txn-commit-time-total" and
> >> "offset-commit-time-total" may better be inside producer and consumer
> >> clients respectively.
> >>
> >> I agree for offset-commit-time-total. For txn-commit-time-total I'm
> >> proposing we measure `StreamsProducer.commitTransaction`, which wraps
> >> multiple producer calls (sendOffsets, commitTransaction)
> >>
> >> > > For "txn-commit-time-total" specifically, besides
> producer.commitTxn.
> >> other txn-related calls may also be blocking, including
> >> producer.beginTxn/abortTxn, I saw you mentioned "txn-begin-time-total"
> >> later in the doc, but did not include it as a separate metric, and
> >> similarly, should we have a `txn-abort-time-total` as well? If yes,
> could
> >> you update the KIP page accordingly.
> >>
> >> `beginTransaction` is not blocking - I meant to remove that from that
> >> doc. I'll add something for abort.
> >>
> >> On Mon, Jul 19, 2021 at 11:55 PM Rohan Desai <de...@gmail.com>
> >> wrote:
> >>
> >>> Thanks for the review Guozhang! responding to your feedback inline:
> >>>
> >>> > 1) I agree that the current ratio metrics is just "snapshot in
> >>> point", and
> >>> more flexible metrics that would allow reporters to calculate based on
> >>> window intervals are better. However, the current mechanism of the
> >>> proposed
> >>> metrics assumes the thread->clients mapping as of today, where each
> >>> thread
> >>> would own exclusively one main consumer, restore consumer, producer and
> >>> an
> >>> admin client. But this mapping may be subject to change in the future.
> >>> Have
> >>> you thought about how this metric can be extended when, e.g. the
> embedded
> >>> clients and stream threads are de-coupled?
> >>>
> >>> Of course this depends on how exactly we refactor the runtime -
> assuming
> >>> that we plan to factor out consumers into an "I/O" layer that is
> >>> responsible for receiving records and enqueuing them to be processed by
> >>> processing threads, then I think it should be reasonable to count the
> time
> >>> we spend blocked on this internal queue(s) as blocked. The main concern
> >>> there to me is that the I/O layer would be doing something expensive
> like
> >>> decompression that shouldn't be counted as "blocked". But if that
> really is
> >>> so expensive that it starts to throw off our ratios then it's probably
> >>> indicative of a larger problem that the "i/o layer" is a bottleneck
> and it
> >>> would be worth refactoring so that decompression (or insert other
> expensive
> >>> thing here) can also be done on the processing threads.
> >>>
> >>> > 2) [This and all below are minor comments] The "flush-time-total" may
> >>> better be a producer client metric, as "flush-wait-time-total", than a
> >>> streams metric, though the streams-level "total-blocked" can still
> >>> leverage
> >>> it. Similarly, I think "txn-commit-time-total" and
> >>> "offset-commit-time-total" may better be inside producer and consumer
> >>> clients respectively.
> >>>
> >>> Good call - I'll update the KIP
> >>>
> >>> > 3) The doc was not very clear on how "thread-start-time" would be
> >>> needed
> >>> when calculating streams utilization along with total-blocked time,
> could
> >>> you elaborate a bit more in the KIP?
> >>>
> >>> Yes, will do.
> >>>
> >>> > For "txn-commit-time-total" specifically, besides producer.commitTxn.
> >>> other txn-related calls may also be blocking, including
> >>> producer.beginTxn/abortTxn, I saw you mentioned "txn-begin-time-total"
> >>> later in the doc, but did not include it as a separate metric, and
> >>> similarly, should we have a `txn-abort-time-total` as well? If yes,
> >>> could
> >>> you update the KIP page accordingly.
> >>>
> >>> Ack.
> >>>
> >>> On Mon, Jul 12, 2021 at 11:29 PM Rohan Desai <de...@gmail.com>
> >>> wrote:
> >>>
> >>>> Hello All,
> >>>>
> >>>> I'd like to start a discussion on the KIP linked above which proposes
> >>>> some metrics that we would find useful to help measure whether a Kafka
> >>>> Streams application is saturated. The motivation section in the KIP
> goes
> >>>> into some more detail on why we think this is a useful addition to the
> >>>> metrics already implemented. Thanks in advance for your feedback!
> >>>>
> >>>> Best Regards,
> >>>>
> >>>> Rohan
> >>>>
> >>>> On Mon, Jul 12, 2021 at 12:00 PM Rohan Desai <desai.p.rohan@gmail.com
> >
> >>>> wrote:
> >>>>
> >>>>>
> >>>>>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-761%3A+Add+Total+Blocked+Time+Metric+to+Streams
> >>>>>
> >>>>
>

Re: [DISCUSS] KIP-761: Add total blocked time metric to streams

Posted by Bruno Cadonna <ca...@apache.org>.

Hi,

With "group", I actually meant the "type" tag. A thread-level metrics in 
Streams has the following tag "type=stream-thread-metrics" which we also 
call "group" in the code.
I realized now that I mentioned "type" and "group" in my previous e-mail 
which are synonyms for the same concept.

For example, the blocked-time-total metric in Rohan's KIP should be 
defined as:

blocked-time-total

tags:
type = stream-thread-metrics
thread-id = [stream thread ID]

level: INFO

description: the total time the Kafka Streams thread spent blocked on Kafka.

Best,
Bruno


On 27.07.21 10:13, Sophie Blee-Goldman wrote:
> Thanks for the clarifications, that all makes sense.
> 
> I'm ready to vote on the KIP, but can you just update the KIP first to
> address Bruno's feedback? Ie just fix the tags and fill in the missing
> fields.
> 
> For example it sounds like these would be thread-level metrics. You should
> be able to figure out what the values should be from the KIP-444 doc,
> there's a chart with the type and tags for all thread-level metrics.
> 
> Not 100% sure what Bruno meant by "group" but my guess would be whether
> it's INFO/DEBUG/TRACE. This is probably one of the most important
> things to include in a KIP that's introducing new metrics: how big is the
> potential performance impact of recording these metrics? How big is the
> intended audience, would these be useful to almost everyone or are they
> more "niche"?
> 
> It sounds like maybe DEBUG would be most appropriate here -- WDYT?
> 
> -Sophie
> 
> On Thu, Jul 22, 2021 at 9:01 AM Bruno Cadonna <ca...@apache.org> wrote:
> 
>> Hi Rohan,
>>
>> Thank you for the KIP!
>>
>> I agree that the KIP is well-motivated.
>>
>> What is not very clear is the metadata like type, group, and tags of the
>> metrics. For example, there is not application-id tag in Streams and
>> there is also no producer-id tag. The clients, i.e., producer, admin,
>> consumer, and also Streams have a client-id tag, that corresponds to the
>> producer-id, consumer-id, etc you use in the KIP.
>>
>> For examples of metadata used in Streams you can look at the following
>> KIPs:
>>
>> -
>>
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-444%3A+Augment+metrics+for+Kafka+Streams
>> -
>>
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-471%3A+Expose+RocksDB+Metrics+in+Kafka+Streams
>> -
>>
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-607%3A+Add+Metrics+to+Kafka+Streams+to+Report+Properties+of+RocksDB
>> -
>>
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-613%3A+Add+end-to-end+latency+metrics+to+Streams
>>
>>
>> Best,
>> Bruno
>>
>> On 22.07.21 09:42, Rohan Desai wrote:
>>> re sophie:
>>>
>>> The intent here was to include all blocked time (not just `RUNNING`). The
>>> caller can window the total blocked time themselves, and that can be
>>> compared with a timeseries of the state to understand the ratio in
>>> different states. I'll update the KIP to include `committed`. The admin
>> API
>>> calls should be accounted for by the admin client iotime/iowaittime
>>> metrics.
>>>
>>> On Tue, Jul 20, 2021 at 11:49 PM Rohan Desai <de...@gmail.com>
>>> wrote:
>>>
>>>>> I remember now that we moved the round-trip PID's txn completion logic
>>>> into
>>>> init-transaction and commit/abort-transaction. So I think we'd count
>> time
>>>> as in StreamsProducer#initTransaction as well (admittedly it is in most
>>>> cases a one-time thing).
>>>>
>>>> Makes sense - I'll update the KIP
>>>>
>>>> On Tue, Jul 20, 2021 at 11:48 PM Rohan Desai <de...@gmail.com>
>>>> wrote:
>>>>
>>>>>
>>>>>> I had a question - it seems like from the descriptionsof
>>>>> `txn-commit-time-total` and `offset-commit-time-total` that they
>> measure
>>>>> similar processes for ALOS and EOS, but only `txn-commit-time-total` is
>>>>> included in `blocked-time-total`. Why isn't `offset-commit-time-total`
>> also
>>>>> included?
>>>>>
>>>>> I've updated the KIP to include it.
>>>>>
>>>>>> Aside from `flush-time-total`, `txn-commit-time-total` and
>>>>> `offset-commit-time-total`, which will be producer/consumer client
>>>>> metrics,
>>>>> the rest of the metrics will be streams metrics that will be thread
>> level,
>>>>> is that right?
>>>>>
>>>>> Based on the feedback from Guozhang, I've updated the KIP to reflect
>> that
>>>>> the lower-level metrics are all client metrics that are then summed to
>>>>> compute the blocked time metric, which is a Streams metric.
>>>>>
>>>>> On Tue, Jul 20, 2021 at 11:58 AM Rohan Desai <de...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>>> Similarly, I think "txn-commit-time-total" and
>>>>>> "offset-commit-time-total" may better be inside producer and consumer
>>>>>> clients respectively.
>>>>>>
>>>>>> I agree for offset-commit-time-total. For txn-commit-time-total I'm
>>>>>> proposing we measure `StreamsProducer.commitTransaction`, which wraps
>>>>>> multiple producer calls (sendOffsets, commitTransaction)
>>>>>>
>>>>>>>> For "txn-commit-time-total" specifically, besides
>>>>>> producer.commitTxn.
>>>>>> other txn-related calls may also be blocking, including
>>>>>> producer.beginTxn/abortTxn, I saw you mentioned "txn-begin-time-total"
>>>>>> later in the doc, but did not include it as a separate metric, and
>>>>>> similarly, should we have a `txn-abort-time-total` as well? If yes,
>>>>>> could
>>>>>> you update the KIP page accordingly.
>>>>>>
>>>>>> `beginTransaction` is not blocking - I meant to remove that from that
>>>>>> doc. I'll add something for abort.
>>>>>>
>>>>>> On Mon, Jul 19, 2021 at 11:55 PM Rohan Desai <desai.p.rohan@gmail.com
>>>
>>>>>> wrote:
>>>>>>
>>>>>>> Thanks for the review Guozhang! responding to your feedback inline:
>>>>>>>
>>>>>>>> 1) I agree that the current ratio metrics is just "snapshot in
>>>>>>> point", and
>>>>>>> more flexible metrics that would allow reporters to calculate based
>> on
>>>>>>> window intervals are better. However, the current mechanism of the
>>>>>>> proposed
>>>>>>> metrics assumes the thread->clients mapping as of today, where each
>>>>>>> thread
>>>>>>> would own exclusively one main consumer, restore consumer, producer
>> and
>>>>>>> an
>>>>>>> admin client. But this mapping may be subject to change in the
>> future.
>>>>>>> Have
>>>>>>> you thought about how this metric can be extended when, e.g. the
>>>>>>> embedded
>>>>>>> clients and stream threads are de-coupled?
>>>>>>>
>>>>>>> Of course this depends on how exactly we refactor the runtime -
>>>>>>> assuming that we plan to factor out consumers into an "I/O" layer
>> that is
>>>>>>> responsible for receiving records and enqueuing them to be processed
>> by
>>>>>>> processing threads, then I think it should be reasonable to count
>> the time
>>>>>>> we spend blocked on this internal queue(s) as blocked. The main
>> concern
>>>>>>> there to me is that the I/O layer would be doing something expensive
>> like
>>>>>>> decompression that shouldn't be counted as "blocked". But if that
>> really is
>>>>>>> so expensive that it starts to throw off our ratios then it's
>> probably
>>>>>>> indicative of a larger problem that the "i/o layer" is a bottleneck
>> and it
>>>>>>> would be worth refactoring so that decompression (or insert other
>> expensive
>>>>>>> thing here) can also be done on the processing threads.
>>>>>>>
>>>>>>>> 2) [This and all below are minor comments] The "flush-time-total"
>> may
>>>>>>> better be a producer client metric, as "flush-wait-time-total", than
>> a
>>>>>>> streams metric, though the streams-level "total-blocked" can still
>>>>>>> leverage
>>>>>>> it. Similarly, I think "txn-commit-time-total" and
>>>>>>> "offset-commit-time-total" may better be inside producer and consumer
>>>>>>> clients respectively.
>>>>>>>
>>>>>>> Good call - I'll update the KIP
>>>>>>>
>>>>>>>> 3) The doc was not very clear on how "thread-start-time" would be
>>>>>>> needed
>>>>>>> when calculating streams utilization along with total-blocked time,
>>>>>>> could
>>>>>>> you elaborate a bit more in the KIP?
>>>>>>>
>>>>>>> Yes, will do.
>>>>>>>
>>>>>>>> For "txn-commit-time-total" specifically, besides
>> producer.commitTxn.
>>>>>>> other txn-related calls may also be blocking, including
>>>>>>> producer.beginTxn/abortTxn, I saw you mentioned
>> "txn-begin-time-total"
>>>>>>> later in the doc, but did not include it as a separate metric, and
>>>>>>> similarly, should we have a `txn-abort-time-total` as well? If yes,
>>>>>>> could
>>>>>>> you update the KIP page accordingly.
>>>>>>>
>>>>>>> Ack.
>>>>>>>
>>>>>>> On Mon, Jul 12, 2021 at 11:29 PM Rohan Desai <
>> desai.p.rohan@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hello All,
>>>>>>>>
>>>>>>>> I'd like to start a discussion on the KIP linked above which
>> proposes
>>>>>>>> some metrics that we would find useful to help measure whether a
>> Kafka
>>>>>>>> Streams application is saturated. The motivation section in the KIP
>> goes
>>>>>>>> into some more detail on why we think this is a useful addition to
>> the
>>>>>>>> metrics already implemented. Thanks in advance for your feedback!
>>>>>>>>
>>>>>>>> Best Regards,
>>>>>>>>
>>>>>>>> Rohan
>>>>>>>>
>>>>>>>> On Mon, Jul 12, 2021 at 12:00 PM Rohan Desai <
>> desai.p.rohan@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-761%3A+Add+Total+Blocked+Time+Metric+to+Streams
>>>>>>>>>
>>>>>>>>
>>>
>>
>

Re: [DISCUSS] KIP-761: Add total blocked time metric to streams

Posted by Sophie Blee-Goldman <so...@confluent.io.INVALID>.

Thanks for the clarifications, that all makes sense.

I'm ready to vote on the KIP, but can you just update the KIP first to
address Bruno's feedback? Ie just fix the tags and fill in the missing
fields.

For example it sounds like these would be thread-level metrics. You should
be able to figure out what the values should be from the KIP-444 doc,
there's a chart with the type and tags for all thread-level metrics.

Not 100% sure what Bruno meant by "group" but my guess would be whether
it's INFO/DEBUG/TRACE. This is probably one of the most important
things to include in a KIP that's introducing new metrics: how big is the
potential performance impact of recording these metrics? How big is the
intended audience, would these be useful to almost everyone or are they
more "niche"?

It sounds like maybe DEBUG would be most appropriate here -- WDYT?

-Sophie

On Thu, Jul 22, 2021 at 9:01 AM Bruno Cadonna <ca...@apache.org> wrote:

> Hi Rohan,
>
> Thank you for the KIP!
>
> I agree that the KIP is well-motivated.
>
> What is not very clear is the metadata like type, group, and tags of the
> metrics. For example, there is not application-id tag in Streams and
> there is also no producer-id tag. The clients, i.e., producer, admin,
> consumer, and also Streams have a client-id tag, that corresponds to the
> producer-id, consumer-id, etc you use in the KIP.
>
> For examples of metadata used in Streams you can look at the following
> KIPs:
>
> -
>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-444%3A+Augment+metrics+for+Kafka+Streams
> -
>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-471%3A+Expose+RocksDB+Metrics+in+Kafka+Streams
> -
>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-607%3A+Add+Metrics+to+Kafka+Streams+to+Report+Properties+of+RocksDB
> -
>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-613%3A+Add+end-to-end+latency+metrics+to+Streams
>
>
> Best,
> Bruno
>
> On 22.07.21 09:42, Rohan Desai wrote:
> > re sophie:
> >
> > The intent here was to include all blocked time (not just `RUNNING`). The
> > caller can window the total blocked time themselves, and that can be
> > compared with a timeseries of the state to understand the ratio in
> > different states. I'll update the KIP to include `committed`. The admin
> API
> > calls should be accounted for by the admin client iotime/iowaittime
> > metrics.
> >
> > On Tue, Jul 20, 2021 at 11:49 PM Rohan Desai <de...@gmail.com>
> > wrote:
> >
> >>> I remember now that we moved the round-trip PID's txn completion logic
> >> into
> >> init-transaction and commit/abort-transaction. So I think we'd count
> time
> >> as in StreamsProducer#initTransaction as well (admittedly it is in most
> >> cases a one-time thing).
> >>
> >> Makes sense - I'll update the KIP
> >>
> >> On Tue, Jul 20, 2021 at 11:48 PM Rohan Desai <de...@gmail.com>
> >> wrote:
> >>
> >>>
> >>>> I had a question - it seems like from the descriptionsof
> >>> `txn-commit-time-total` and `offset-commit-time-total` that they
> measure
> >>> similar processes for ALOS and EOS, but only `txn-commit-time-total` is
> >>> included in `blocked-time-total`. Why isn't `offset-commit-time-total`
> also
> >>> included?
> >>>
> >>> I've updated the KIP to include it.
> >>>
> >>>> Aside from `flush-time-total`, `txn-commit-time-total` and
> >>> `offset-commit-time-total`, which will be producer/consumer client
> >>> metrics,
> >>> the rest of the metrics will be streams metrics that will be thread
> level,
> >>> is that right?
> >>>
> >>> Based on the feedback from Guozhang, I've updated the KIP to reflect
> that
> >>> the lower-level metrics are all client metrics that are then summed to
> >>> compute the blocked time metric, which is a Streams metric.
> >>>
> >>> On Tue, Jul 20, 2021 at 11:58 AM Rohan Desai <de...@gmail.com>
> >>> wrote:
> >>>
> >>>>> Similarly, I think "txn-commit-time-total" and
> >>>> "offset-commit-time-total" may better be inside producer and consumer
> >>>> clients respectively.
> >>>>
> >>>> I agree for offset-commit-time-total. For txn-commit-time-total I'm
> >>>> proposing we measure `StreamsProducer.commitTransaction`, which wraps
> >>>> multiple producer calls (sendOffsets, commitTransaction)
> >>>>
> >>>>>> For "txn-commit-time-total" specifically, besides
> >>>> producer.commitTxn.
> >>>> other txn-related calls may also be blocking, including
> >>>> producer.beginTxn/abortTxn, I saw you mentioned "txn-begin-time-total"
> >>>> later in the doc, but did not include it as a separate metric, and
> >>>> similarly, should we have a `txn-abort-time-total` as well? If yes,
> >>>> could
> >>>> you update the KIP page accordingly.
> >>>>
> >>>> `beginTransaction` is not blocking - I meant to remove that from that
> >>>> doc. I'll add something for abort.
> >>>>
> >>>> On Mon, Jul 19, 2021 at 11:55 PM Rohan Desai <desai.p.rohan@gmail.com
> >
> >>>> wrote:
> >>>>
> >>>>> Thanks for the review Guozhang! responding to your feedback inline:
> >>>>>
> >>>>>> 1) I agree that the current ratio metrics is just "snapshot in
> >>>>> point", and
> >>>>> more flexible metrics that would allow reporters to calculate based
> on
> >>>>> window intervals are better. However, the current mechanism of the
> >>>>> proposed
> >>>>> metrics assumes the thread->clients mapping as of today, where each
> >>>>> thread
> >>>>> would own exclusively one main consumer, restore consumer, producer
> and
> >>>>> an
> >>>>> admin client. But this mapping may be subject to change in the
> future.
> >>>>> Have
> >>>>> you thought about how this metric can be extended when, e.g. the
> >>>>> embedded
> >>>>> clients and stream threads are de-coupled?
> >>>>>
> >>>>> Of course this depends on how exactly we refactor the runtime -
> >>>>> assuming that we plan to factor out consumers into an "I/O" layer
> that is
> >>>>> responsible for receiving records and enqueuing them to be processed
> by
> >>>>> processing threads, then I think it should be reasonable to count
> the time
> >>>>> we spend blocked on this internal queue(s) as blocked. The main
> concern
> >>>>> there to me is that the I/O layer would be doing something expensive
> like
> >>>>> decompression that shouldn't be counted as "blocked". But if that
> really is
> >>>>> so expensive that it starts to throw off our ratios then it's
> probably
> >>>>> indicative of a larger problem that the "i/o layer" is a bottleneck
> and it
> >>>>> would be worth refactoring so that decompression (or insert other
> expensive
> >>>>> thing here) can also be done on the processing threads.
> >>>>>
> >>>>>> 2) [This and all below are minor comments] The "flush-time-total"
> may
> >>>>> better be a producer client metric, as "flush-wait-time-total", than
> a
> >>>>> streams metric, though the streams-level "total-blocked" can still
> >>>>> leverage
> >>>>> it. Similarly, I think "txn-commit-time-total" and
> >>>>> "offset-commit-time-total" may better be inside producer and consumer
> >>>>> clients respectively.
> >>>>>
> >>>>> Good call - I'll update the KIP
> >>>>>
> >>>>>> 3) The doc was not very clear on how "thread-start-time" would be
> >>>>> needed
> >>>>> when calculating streams utilization along with total-blocked time,
> >>>>> could
> >>>>> you elaborate a bit more in the KIP?
> >>>>>
> >>>>> Yes, will do.
> >>>>>
> >>>>>> For "txn-commit-time-total" specifically, besides
> producer.commitTxn.
> >>>>> other txn-related calls may also be blocking, including
> >>>>> producer.beginTxn/abortTxn, I saw you mentioned
> "txn-begin-time-total"
> >>>>> later in the doc, but did not include it as a separate metric, and
> >>>>> similarly, should we have a `txn-abort-time-total` as well? If yes,
> >>>>> could
> >>>>> you update the KIP page accordingly.
> >>>>>
> >>>>> Ack.
> >>>>>
> >>>>> On Mon, Jul 12, 2021 at 11:29 PM Rohan Desai <
> desai.p.rohan@gmail.com>
> >>>>> wrote:
> >>>>>
> >>>>>> Hello All,
> >>>>>>
> >>>>>> I'd like to start a discussion on the KIP linked above which
> proposes
> >>>>>> some metrics that we would find useful to help measure whether a
> Kafka
> >>>>>> Streams application is saturated. The motivation section in the KIP
> goes
> >>>>>> into some more detail on why we think this is a useful addition to
> the
> >>>>>> metrics already implemented. Thanks in advance for your feedback!
> >>>>>>
> >>>>>> Best Regards,
> >>>>>>
> >>>>>> Rohan
> >>>>>>
> >>>>>> On Mon, Jul 12, 2021 at 12:00 PM Rohan Desai <
> desai.p.rohan@gmail.com>
> >>>>>> wrote:
> >>>>>>
> >>>>>>>
> >>>>>>>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-761%3A+Add+Total+Blocked+Time+Metric+to+Streams
> >>>>>>>
> >>>>>>
> >
>

Re: [DISCUSS] KIP-761: Add total blocked time metric to streams

Posted by Bruno Cadonna <ca...@apache.org>.

Hi Rohan,

Thank you for the KIP!

I agree that the KIP is well-motivated.

What is not very clear is the metadata like type, group, and tags of the 
metrics. For example, there is not application-id tag in Streams and 
there is also no producer-id tag. The clients, i.e., producer, admin, 
consumer, and also Streams have a client-id tag, that corresponds to the 
producer-id, consumer-id, etc you use in the KIP.

For examples of metadata used in Streams you can look at the following KIPs:

- 
https://cwiki.apache.org/confluence/display/KAFKA/KIP-444%3A+Augment+metrics+for+Kafka+Streams
- 
https://cwiki.apache.org/confluence/display/KAFKA/KIP-471%3A+Expose+RocksDB+Metrics+in+Kafka+Streams
- 
https://cwiki.apache.org/confluence/display/KAFKA/KIP-607%3A+Add+Metrics+to+Kafka+Streams+to+Report+Properties+of+RocksDB
- 
https://cwiki.apache.org/confluence/display/KAFKA/KIP-613%3A+Add+end-to-end+latency+metrics+to+Streams


Best,
Bruno

On 22.07.21 09:42, Rohan Desai wrote:
> re sophie:
> 
> The intent here was to include all blocked time (not just `RUNNING`). The
> caller can window the total blocked time themselves, and that can be
> compared with a timeseries of the state to understand the ratio in
> different states. I'll update the KIP to include `committed`. The admin API
> calls should be accounted for by the admin client iotime/iowaittime
> metrics.
> 
> On Tue, Jul 20, 2021 at 11:49 PM Rohan Desai <de...@gmail.com>
> wrote:
> 
>>> I remember now that we moved the round-trip PID's txn completion logic
>> into
>> init-transaction and commit/abort-transaction. So I think we'd count time
>> as in StreamsProducer#initTransaction as well (admittedly it is in most
>> cases a one-time thing).
>>
>> Makes sense - I'll update the KIP
>>
>> On Tue, Jul 20, 2021 at 11:48 PM Rohan Desai <de...@gmail.com>
>> wrote:
>>
>>>
>>>> I had a question - it seems like from the descriptionsof
>>> `txn-commit-time-total` and `offset-commit-time-total` that they measure
>>> similar processes for ALOS and EOS, but only `txn-commit-time-total` is
>>> included in `blocked-time-total`. Why isn't `offset-commit-time-total` also
>>> included?
>>>
>>> I've updated the KIP to include it.
>>>
>>>> Aside from `flush-time-total`, `txn-commit-time-total` and
>>> `offset-commit-time-total`, which will be producer/consumer client
>>> metrics,
>>> the rest of the metrics will be streams metrics that will be thread level,
>>> is that right?
>>>
>>> Based on the feedback from Guozhang, I've updated the KIP to reflect that
>>> the lower-level metrics are all client metrics that are then summed to
>>> compute the blocked time metric, which is a Streams metric.
>>>
>>> On Tue, Jul 20, 2021 at 11:58 AM Rohan Desai <de...@gmail.com>
>>> wrote:
>>>
>>>>> Similarly, I think "txn-commit-time-total" and
>>>> "offset-commit-time-total" may better be inside producer and consumer
>>>> clients respectively.
>>>>
>>>> I agree for offset-commit-time-total. For txn-commit-time-total I'm
>>>> proposing we measure `StreamsProducer.commitTransaction`, which wraps
>>>> multiple producer calls (sendOffsets, commitTransaction)
>>>>
>>>>>> For "txn-commit-time-total" specifically, besides
>>>> producer.commitTxn.
>>>> other txn-related calls may also be blocking, including
>>>> producer.beginTxn/abortTxn, I saw you mentioned "txn-begin-time-total"
>>>> later in the doc, but did not include it as a separate metric, and
>>>> similarly, should we have a `txn-abort-time-total` as well? If yes,
>>>> could
>>>> you update the KIP page accordingly.
>>>>
>>>> `beginTransaction` is not blocking - I meant to remove that from that
>>>> doc. I'll add something for abort.
>>>>
>>>> On Mon, Jul 19, 2021 at 11:55 PM Rohan Desai <de...@gmail.com>
>>>> wrote:
>>>>
>>>>> Thanks for the review Guozhang! responding to your feedback inline:
>>>>>
>>>>>> 1) I agree that the current ratio metrics is just "snapshot in
>>>>> point", and
>>>>> more flexible metrics that would allow reporters to calculate based on
>>>>> window intervals are better. However, the current mechanism of the
>>>>> proposed
>>>>> metrics assumes the thread->clients mapping as of today, where each
>>>>> thread
>>>>> would own exclusively one main consumer, restore consumer, producer and
>>>>> an
>>>>> admin client. But this mapping may be subject to change in the future.
>>>>> Have
>>>>> you thought about how this metric can be extended when, e.g. the
>>>>> embedded
>>>>> clients and stream threads are de-coupled?
>>>>>
>>>>> Of course this depends on how exactly we refactor the runtime -
>>>>> assuming that we plan to factor out consumers into an "I/O" layer that is
>>>>> responsible for receiving records and enqueuing them to be processed by
>>>>> processing threads, then I think it should be reasonable to count the time
>>>>> we spend blocked on this internal queue(s) as blocked. The main concern
>>>>> there to me is that the I/O layer would be doing something expensive like
>>>>> decompression that shouldn't be counted as "blocked". But if that really is
>>>>> so expensive that it starts to throw off our ratios then it's probably
>>>>> indicative of a larger problem that the "i/o layer" is a bottleneck and it
>>>>> would be worth refactoring so that decompression (or insert other expensive
>>>>> thing here) can also be done on the processing threads.
>>>>>
>>>>>> 2) [This and all below are minor comments] The "flush-time-total" may
>>>>> better be a producer client metric, as "flush-wait-time-total", than a
>>>>> streams metric, though the streams-level "total-blocked" can still
>>>>> leverage
>>>>> it. Similarly, I think "txn-commit-time-total" and
>>>>> "offset-commit-time-total" may better be inside producer and consumer
>>>>> clients respectively.
>>>>>
>>>>> Good call - I'll update the KIP
>>>>>
>>>>>> 3) The doc was not very clear on how "thread-start-time" would be
>>>>> needed
>>>>> when calculating streams utilization along with total-blocked time,
>>>>> could
>>>>> you elaborate a bit more in the KIP?
>>>>>
>>>>> Yes, will do.
>>>>>
>>>>>> For "txn-commit-time-total" specifically, besides producer.commitTxn.
>>>>> other txn-related calls may also be blocking, including
>>>>> producer.beginTxn/abortTxn, I saw you mentioned "txn-begin-time-total"
>>>>> later in the doc, but did not include it as a separate metric, and
>>>>> similarly, should we have a `txn-abort-time-total` as well? If yes,
>>>>> could
>>>>> you update the KIP page accordingly.
>>>>>
>>>>> Ack.
>>>>>
>>>>> On Mon, Jul 12, 2021 at 11:29 PM Rohan Desai <de...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hello All,
>>>>>>
>>>>>> I'd like to start a discussion on the KIP linked above which proposes
>>>>>> some metrics that we would find useful to help measure whether a Kafka
>>>>>> Streams application is saturated. The motivation section in the KIP goes
>>>>>> into some more detail on why we think this is a useful addition to the
>>>>>> metrics already implemented. Thanks in advance for your feedback!
>>>>>>
>>>>>> Best Regards,
>>>>>>
>>>>>> Rohan
>>>>>>
>>>>>> On Mon, Jul 12, 2021 at 12:00 PM Rohan Desai <de...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>>
>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-761%3A+Add+Total+Blocked+Time+Metric+to+Streams
>>>>>>>
>>>>>>
>

Re: [DISCUSS] KIP-761: Add total blocked time metric to streams

Posted by Rohan Desai <de...@gmail.com>.

re sophie:

The intent here was to include all blocked time (not just `RUNNING`). The
caller can window the total blocked time themselves, and that can be
compared with a timeseries of the state to understand the ratio in
different states. I'll update the KIP to include `committed`. The admin API
calls should be accounted for by the admin client iotime/iowaittime
metrics.

On Tue, Jul 20, 2021 at 11:49 PM Rohan Desai <de...@gmail.com>
wrote:

> > I remember now that we moved the round-trip PID's txn completion logic
> into
> init-transaction and commit/abort-transaction. So I think we'd count time
> as in StreamsProducer#initTransaction as well (admittedly it is in most
> cases a one-time thing).
>
> Makes sense - I'll update the KIP
>
> On Tue, Jul 20, 2021 at 11:48 PM Rohan Desai <de...@gmail.com>
> wrote:
>
>>
>> > I had a question - it seems like from the descriptionsof
>> `txn-commit-time-total` and `offset-commit-time-total` that they measure
>> similar processes for ALOS and EOS, but only `txn-commit-time-total` is
>> included in `blocked-time-total`. Why isn't `offset-commit-time-total` also
>> included?
>>
>> I've updated the KIP to include it.
>>
>> > Aside from `flush-time-total`, `txn-commit-time-total` and
>> `offset-commit-time-total`, which will be producer/consumer client
>> metrics,
>> the rest of the metrics will be streams metrics that will be thread level,
>> is that right?
>>
>> Based on the feedback from Guozhang, I've updated the KIP to reflect that
>> the lower-level metrics are all client metrics that are then summed to
>> compute the blocked time metric, which is a Streams metric.
>>
>> On Tue, Jul 20, 2021 at 11:58 AM Rohan Desai <de...@gmail.com>
>> wrote:
>>
>>> > Similarly, I think "txn-commit-time-total" and
>>> "offset-commit-time-total" may better be inside producer and consumer
>>> clients respectively.
>>>
>>> I agree for offset-commit-time-total. For txn-commit-time-total I'm
>>> proposing we measure `StreamsProducer.commitTransaction`, which wraps
>>> multiple producer calls (sendOffsets, commitTransaction)
>>>
>>> > > For "txn-commit-time-total" specifically, besides
>>> producer.commitTxn.
>>> other txn-related calls may also be blocking, including
>>> producer.beginTxn/abortTxn, I saw you mentioned "txn-begin-time-total"
>>> later in the doc, but did not include it as a separate metric, and
>>> similarly, should we have a `txn-abort-time-total` as well? If yes,
>>> could
>>> you update the KIP page accordingly.
>>>
>>> `beginTransaction` is not blocking - I meant to remove that from that
>>> doc. I'll add something for abort.
>>>
>>> On Mon, Jul 19, 2021 at 11:55 PM Rohan Desai <de...@gmail.com>
>>> wrote:
>>>
>>>> Thanks for the review Guozhang! responding to your feedback inline:
>>>>
>>>> > 1) I agree that the current ratio metrics is just "snapshot in
>>>> point", and
>>>> more flexible metrics that would allow reporters to calculate based on
>>>> window intervals are better. However, the current mechanism of the
>>>> proposed
>>>> metrics assumes the thread->clients mapping as of today, where each
>>>> thread
>>>> would own exclusively one main consumer, restore consumer, producer and
>>>> an
>>>> admin client. But this mapping may be subject to change in the future.
>>>> Have
>>>> you thought about how this metric can be extended when, e.g. the
>>>> embedded
>>>> clients and stream threads are de-coupled?
>>>>
>>>> Of course this depends on how exactly we refactor the runtime -
>>>> assuming that we plan to factor out consumers into an "I/O" layer that is
>>>> responsible for receiving records and enqueuing them to be processed by
>>>> processing threads, then I think it should be reasonable to count the time
>>>> we spend blocked on this internal queue(s) as blocked. The main concern
>>>> there to me is that the I/O layer would be doing something expensive like
>>>> decompression that shouldn't be counted as "blocked". But if that really is
>>>> so expensive that it starts to throw off our ratios then it's probably
>>>> indicative of a larger problem that the "i/o layer" is a bottleneck and it
>>>> would be worth refactoring so that decompression (or insert other expensive
>>>> thing here) can also be done on the processing threads.
>>>>
>>>> > 2) [This and all below are minor comments] The "flush-time-total" may
>>>> better be a producer client metric, as "flush-wait-time-total", than a
>>>> streams metric, though the streams-level "total-blocked" can still
>>>> leverage
>>>> it. Similarly, I think "txn-commit-time-total" and
>>>> "offset-commit-time-total" may better be inside producer and consumer
>>>> clients respectively.
>>>>
>>>> Good call - I'll update the KIP
>>>>
>>>> > 3) The doc was not very clear on how "thread-start-time" would be
>>>> needed
>>>> when calculating streams utilization along with total-blocked time,
>>>> could
>>>> you elaborate a bit more in the KIP?
>>>>
>>>> Yes, will do.
>>>>
>>>> > For "txn-commit-time-total" specifically, besides producer.commitTxn.
>>>> other txn-related calls may also be blocking, including
>>>> producer.beginTxn/abortTxn, I saw you mentioned "txn-begin-time-total"
>>>> later in the doc, but did not include it as a separate metric, and
>>>> similarly, should we have a `txn-abort-time-total` as well? If yes,
>>>> could
>>>> you update the KIP page accordingly.
>>>>
>>>> Ack.
>>>>
>>>> On Mon, Jul 12, 2021 at 11:29 PM Rohan Desai <de...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hello All,
>>>>>
>>>>> I'd like to start a discussion on the KIP linked above which proposes
>>>>> some metrics that we would find useful to help measure whether a Kafka
>>>>> Streams application is saturated. The motivation section in the KIP goes
>>>>> into some more detail on why we think this is a useful addition to the
>>>>> metrics already implemented. Thanks in advance for your feedback!
>>>>>
>>>>> Best Regards,
>>>>>
>>>>> Rohan
>>>>>
>>>>> On Mon, Jul 12, 2021 at 12:00 PM Rohan Desai <de...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>>
>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-761%3A+Add+Total+Blocked+Time+Metric+to+Streams
>>>>>>
>>>>>

Re: [DISCUSS] KIP-761: Add total blocked time metric to streams

Posted by Rohan Desai <de...@gmail.com>.

> I remember now that we moved the round-trip PID's txn completion logic
into
init-transaction and commit/abort-transaction. So I think we'd count time
as in StreamsProducer#initTransaction as well (admittedly it is in most
cases a one-time thing).

Makes sense - I'll update the KIP

On Tue, Jul 20, 2021 at 11:48 PM Rohan Desai <de...@gmail.com>
wrote:

>
> > I had a question - it seems like from the descriptionsof
> `txn-commit-time-total` and `offset-commit-time-total` that they measure
> similar processes for ALOS and EOS, but only `txn-commit-time-total` is
> included in `blocked-time-total`. Why isn't `offset-commit-time-total` also
> included?
>
> I've updated the KIP to include it.
>
> > Aside from `flush-time-total`, `txn-commit-time-total` and
> `offset-commit-time-total`, which will be producer/consumer client metrics,
> the rest of the metrics will be streams metrics that will be thread level,
> is that right?
>
> Based on the feedback from Guozhang, I've updated the KIP to reflect that
> the lower-level metrics are all client metrics that are then summed to
> compute the blocked time metric, which is a Streams metric.
>
> On Tue, Jul 20, 2021 at 11:58 AM Rohan Desai <de...@gmail.com>
> wrote:
>
>> > Similarly, I think "txn-commit-time-total" and
>> "offset-commit-time-total" may better be inside producer and consumer
>> clients respectively.
>>
>> I agree for offset-commit-time-total. For txn-commit-time-total I'm
>> proposing we measure `StreamsProducer.commitTransaction`, which wraps
>> multiple producer calls (sendOffsets, commitTransaction)
>>
>> > > For "txn-commit-time-total" specifically, besides producer.commitTxn.
>> other txn-related calls may also be blocking, including
>> producer.beginTxn/abortTxn, I saw you mentioned "txn-begin-time-total"
>> later in the doc, but did not include it as a separate metric, and
>> similarly, should we have a `txn-abort-time-total` as well? If yes, could
>> you update the KIP page accordingly.
>>
>> `beginTransaction` is not blocking - I meant to remove that from that
>> doc. I'll add something for abort.
>>
>> On Mon, Jul 19, 2021 at 11:55 PM Rohan Desai <de...@gmail.com>
>> wrote:
>>
>>> Thanks for the review Guozhang! responding to your feedback inline:
>>>
>>> > 1) I agree that the current ratio metrics is just "snapshot in
>>> point", and
>>> more flexible metrics that would allow reporters to calculate based on
>>> window intervals are better. However, the current mechanism of the
>>> proposed
>>> metrics assumes the thread->clients mapping as of today, where each
>>> thread
>>> would own exclusively one main consumer, restore consumer, producer and
>>> an
>>> admin client. But this mapping may be subject to change in the future.
>>> Have
>>> you thought about how this metric can be extended when, e.g. the embedded
>>> clients and stream threads are de-coupled?
>>>
>>> Of course this depends on how exactly we refactor the runtime - assuming
>>> that we plan to factor out consumers into an "I/O" layer that is
>>> responsible for receiving records and enqueuing them to be processed by
>>> processing threads, then I think it should be reasonable to count the time
>>> we spend blocked on this internal queue(s) as blocked. The main concern
>>> there to me is that the I/O layer would be doing something expensive like
>>> decompression that shouldn't be counted as "blocked". But if that really is
>>> so expensive that it starts to throw off our ratios then it's probably
>>> indicative of a larger problem that the "i/o layer" is a bottleneck and it
>>> would be worth refactoring so that decompression (or insert other expensive
>>> thing here) can also be done on the processing threads.
>>>
>>> > 2) [This and all below are minor comments] The "flush-time-total" may
>>> better be a producer client metric, as "flush-wait-time-total", than a
>>> streams metric, though the streams-level "total-blocked" can still
>>> leverage
>>> it. Similarly, I think "txn-commit-time-total" and
>>> "offset-commit-time-total" may better be inside producer and consumer
>>> clients respectively.
>>>
>>> Good call - I'll update the KIP
>>>
>>> > 3) The doc was not very clear on how "thread-start-time" would be
>>> needed
>>> when calculating streams utilization along with total-blocked time, could
>>> you elaborate a bit more in the KIP?
>>>
>>> Yes, will do.
>>>
>>> > For "txn-commit-time-total" specifically, besides producer.commitTxn.
>>> other txn-related calls may also be blocking, including
>>> producer.beginTxn/abortTxn, I saw you mentioned "txn-begin-time-total"
>>> later in the doc, but did not include it as a separate metric, and
>>> similarly, should we have a `txn-abort-time-total` as well? If yes,
>>> could
>>> you update the KIP page accordingly.
>>>
>>> Ack.
>>>
>>> On Mon, Jul 12, 2021 at 11:29 PM Rohan Desai <de...@gmail.com>
>>> wrote:
>>>
>>>> Hello All,
>>>>
>>>> I'd like to start a discussion on the KIP linked above which proposes
>>>> some metrics that we would find useful to help measure whether a Kafka
>>>> Streams application is saturated. The motivation section in the KIP goes
>>>> into some more detail on why we think this is a useful addition to the
>>>> metrics already implemented. Thanks in advance for your feedback!
>>>>
>>>> Best Regards,
>>>>
>>>> Rohan
>>>>
>>>> On Mon, Jul 12, 2021 at 12:00 PM Rohan Desai <de...@gmail.com>
>>>> wrote:
>>>>
>>>>>
>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-761%3A+Add+Total+Blocked+Time+Metric+to+Streams
>>>>>
>>>>

Re: [DISCUSS] KIP-761: Add total blocked time metric to streams

Posted by Rohan Desai <de...@gmail.com>.

> I had a question - it seems like from the descriptionsof
`txn-commit-time-total` and `offset-commit-time-total` that they measure
similar processes for ALOS and EOS, but only `txn-commit-time-total` is
included in `blocked-time-total`. Why isn't `offset-commit-time-total` also
included?

I've updated the KIP to include it.

> Aside from `flush-time-total`, `txn-commit-time-total` and
`offset-commit-time-total`, which will be producer/consumer client metrics,
the rest of the metrics will be streams metrics that will be thread level,
is that right?

Based on the feedback from Guozhang, I've updated the KIP to reflect that
the lower-level metrics are all client metrics that are then summed to
compute the blocked time metric, which is a Streams metric.

On Tue, Jul 20, 2021 at 11:58 AM Rohan Desai <de...@gmail.com>
wrote:

> > Similarly, I think "txn-commit-time-total" and
> "offset-commit-time-total" may better be inside producer and consumer
> clients respectively.
>
> I agree for offset-commit-time-total. For txn-commit-time-total I'm
> proposing we measure `StreamsProducer.commitTransaction`, which wraps
> multiple producer calls (sendOffsets, commitTransaction)
>
> > > For "txn-commit-time-total" specifically, besides producer.commitTxn.
> other txn-related calls may also be blocking, including
> producer.beginTxn/abortTxn, I saw you mentioned "txn-begin-time-total"
> later in the doc, but did not include it as a separate metric, and
> similarly, should we have a `txn-abort-time-total` as well? If yes, could
> you update the KIP page accordingly.
>
> `beginTransaction` is not blocking - I meant to remove that from that doc.
> I'll add something for abort.
>
> On Mon, Jul 19, 2021 at 11:55 PM Rohan Desai <de...@gmail.com>
> wrote:
>
>> Thanks for the review Guozhang! responding to your feedback inline:
>>
>> > 1) I agree that the current ratio metrics is just "snapshot in point",
>> and
>> more flexible metrics that would allow reporters to calculate based on
>> window intervals are better. However, the current mechanism of the
>> proposed
>> metrics assumes the thread->clients mapping as of today, where each thread
>> would own exclusively one main consumer, restore consumer, producer and an
>> admin client. But this mapping may be subject to change in the future.
>> Have
>> you thought about how this metric can be extended when, e.g. the embedded
>> clients and stream threads are de-coupled?
>>
>> Of course this depends on how exactly we refactor the runtime - assuming
>> that we plan to factor out consumers into an "I/O" layer that is
>> responsible for receiving records and enqueuing them to be processed by
>> processing threads, then I think it should be reasonable to count the time
>> we spend blocked on this internal queue(s) as blocked. The main concern
>> there to me is that the I/O layer would be doing something expensive like
>> decompression that shouldn't be counted as "blocked". But if that really is
>> so expensive that it starts to throw off our ratios then it's probably
>> indicative of a larger problem that the "i/o layer" is a bottleneck and it
>> would be worth refactoring so that decompression (or insert other expensive
>> thing here) can also be done on the processing threads.
>>
>> > 2) [This and all below are minor comments] The "flush-time-total" may
>> better be a producer client metric, as "flush-wait-time-total", than a
>> streams metric, though the streams-level "total-blocked" can still
>> leverage
>> it. Similarly, I think "txn-commit-time-total" and
>> "offset-commit-time-total" may better be inside producer and consumer
>> clients respectively.
>>
>> Good call - I'll update the KIP
>>
>> > 3) The doc was not very clear on how "thread-start-time" would be
>> needed
>> when calculating streams utilization along with total-blocked time, could
>> you elaborate a bit more in the KIP?
>>
>> Yes, will do.
>>
>> > For "txn-commit-time-total" specifically, besides producer.commitTxn.
>> other txn-related calls may also be blocking, including
>> producer.beginTxn/abortTxn, I saw you mentioned "txn-begin-time-total"
>> later in the doc, but did not include it as a separate metric, and
>> similarly, should we have a `txn-abort-time-total` as well? If yes, could
>> you update the KIP page accordingly.
>>
>> Ack.
>>
>> On Mon, Jul 12, 2021 at 11:29 PM Rohan Desai <de...@gmail.com>
>> wrote:
>>
>>> Hello All,
>>>
>>> I'd like to start a discussion on the KIP linked above which proposes
>>> some metrics that we would find useful to help measure whether a Kafka
>>> Streams application is saturated. The motivation section in the KIP goes
>>> into some more detail on why we think this is a useful addition to the
>>> metrics already implemented. Thanks in advance for your feedback!
>>>
>>> Best Regards,
>>>
>>> Rohan
>>>
>>> On Mon, Jul 12, 2021 at 12:00 PM Rohan Desai <de...@gmail.com>
>>> wrote:
>>>
>>>>
>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-761%3A+Add+Total+Blocked+Time+Metric+to+Streams
>>>>
>>>

Re: [DISCUSS] KIP-761: Add total blocked time metric to streams

Posted by Rohan Desai <de...@gmail.com>.

> Similarly, I think "txn-commit-time-total" and
"offset-commit-time-total" may better be inside producer and consumer
clients respectively.

I agree for offset-commit-time-total. For txn-commit-time-total I'm
proposing we measure `StreamsProducer.commitTransaction`, which wraps
multiple producer calls (sendOffsets, commitTransaction)

> > For "txn-commit-time-total" specifically, besides producer.commitTxn.
other txn-related calls may also be blocking, including
producer.beginTxn/abortTxn, I saw you mentioned "txn-begin-time-total"
later in the doc, but did not include it as a separate metric, and
similarly, should we have a `txn-abort-time-total` as well? If yes, could
you update the KIP page accordingly.

`beginTransaction` is not blocking - I meant to remove that from that doc.
I'll add something for abort.

On Mon, Jul 19, 2021 at 11:55 PM Rohan Desai <de...@gmail.com>
wrote:

> Thanks for the review Guozhang! responding to your feedback inline:
>
> > 1) I agree that the current ratio metrics is just "snapshot in point",
> and
> more flexible metrics that would allow reporters to calculate based on
> window intervals are better. However, the current mechanism of the proposed
> metrics assumes the thread->clients mapping as of today, where each thread
> would own exclusively one main consumer, restore consumer, producer and an
> admin client. But this mapping may be subject to change in the future. Have
> you thought about how this metric can be extended when, e.g. the embedded
> clients and stream threads are de-coupled?
>
> Of course this depends on how exactly we refactor the runtime - assuming
> that we plan to factor out consumers into an "I/O" layer that is
> responsible for receiving records and enqueuing them to be processed by
> processing threads, then I think it should be reasonable to count the time
> we spend blocked on this internal queue(s) as blocked. The main concern
> there to me is that the I/O layer would be doing something expensive like
> decompression that shouldn't be counted as "blocked". But if that really is
> so expensive that it starts to throw off our ratios then it's probably
> indicative of a larger problem that the "i/o layer" is a bottleneck and it
> would be worth refactoring so that decompression (or insert other expensive
> thing here) can also be done on the processing threads.
>
> > 2) [This and all below are minor comments] The "flush-time-total" may
> better be a producer client metric, as "flush-wait-time-total", than a
> streams metric, though the streams-level "total-blocked" can still leverage
> it. Similarly, I think "txn-commit-time-total" and
> "offset-commit-time-total" may better be inside producer and consumer
> clients respectively.
>
> Good call - I'll update the KIP
>
> > 3) The doc was not very clear on how "thread-start-time" would be needed
> when calculating streams utilization along with total-blocked time, could
> you elaborate a bit more in the KIP?
>
> Yes, will do.
>
> > For "txn-commit-time-total" specifically, besides producer.commitTxn.
> other txn-related calls may also be blocking, including
> producer.beginTxn/abortTxn, I saw you mentioned "txn-begin-time-total"
> later in the doc, but did not include it as a separate metric, and
> similarly, should we have a `txn-abort-time-total` as well? If yes, could
> you update the KIP page accordingly.
>
> Ack.
>
> On Mon, Jul 12, 2021 at 11:29 PM Rohan Desai <de...@gmail.com>
> wrote:
>
>> Hello All,
>>
>> I'd like to start a discussion on the KIP linked above which proposes
>> some metrics that we would find useful to help measure whether a Kafka
>> Streams application is saturated. The motivation section in the KIP goes
>> into some more detail on why we think this is a useful addition to the
>> metrics already implemented. Thanks in advance for your feedback!
>>
>> Best Regards,
>>
>> Rohan
>>
>> On Mon, Jul 12, 2021 at 12:00 PM Rohan Desai <de...@gmail.com>
>> wrote:
>>
>>>
>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-761%3A+Add+Total+Blocked+Time+Metric+to+Streams
>>>
>>

Re: [DISCUSS] KIP-761: Add total blocked time metric to streams

Posted by Leah Thomas <lt...@confluent.io.INVALID>.

Thanks for the KIP Rohan.

I had a question - it seems like from the descriptions
of `txn-commit-time-total` and `offset-commit-time-total` that they measure
similar processes for ALOS and EOS, but only `txn-commit-time-total` is
included in `blocked-time-total`. Why isn't `offset-commit-time-total` also
included?

Aside from `flush-time-total`, `txn-commit-time-total` and
`offset-commit-time-total`, which will be producer/consumer client metrics,
the rest of the metrics will be streams metrics that will be thread level,
is that right?

Cheers,
Leah

On Tue, Jul 20, 2021 at 2:00 AM Rohan Desai <de...@gmail.com> wrote:

> Thanks for the review Guozhang! responding to your feedback inline:
>
> > 1) I agree that the current ratio metrics is just "snapshot in point",
> and
> more flexible metrics that would allow reporters to calculate based on
> window intervals are better. However, the current mechanism of the proposed
> metrics assumes the thread->clients mapping as of today, where each thread
> would own exclusively one main consumer, restore consumer, producer and an
> admin client. But this mapping may be subject to change in the future. Have
> you thought about how this metric can be extended when, e.g. the embedded
> clients and stream threads are de-coupled?
>
> Of course this depends on how exactly we refactor the runtime - assuming
> that we plan to factor out consumers into an "I/O" layer that is
> responsible for receiving records and enqueuing them to be processed by
> processing threads, then I think it should be reasonable to count the time
> we spend blocked on this internal queue(s) as blocked. The main concern
> there to me is that the I/O layer would be doing something expensive like
> decompression that shouldn't be counted as "blocked". But if that really is
> so expensive that it starts to throw off our ratios then it's probably
> indicative of a larger problem that the "i/o layer" is a bottleneck and it
> would be worth refactoring so that decompression (or insert other expensive
> thing here) can also be done on the processing threads.
>
> > 2) [This and all below are minor comments] The "flush-time-total" may
> better be a producer client metric, as "flush-wait-time-total", than a
> streams metric, though the streams-level "total-blocked" can still leverage
> it. Similarly, I think "txn-commit-time-total" and
> "offset-commit-time-total" may better be inside producer and consumer
> clients respectively.
>
> Good call - I'll update the KIP
>
> > 3) The doc was not very clear on how "thread-start-time" would be needed
> when calculating streams utilization along with total-blocked time, could
> you elaborate a bit more in the KIP?
>
> Yes, will do.
>
> > For "txn-commit-time-total" specifically, besides producer.commitTxn.
> other txn-related calls may also be blocking, including
> producer.beginTxn/abortTxn, I saw you mentioned "txn-begin-time-total"
> later in the doc, but did not include it as a separate metric, and
> similarly, should we have a `txn-abort-time-total` as well? If yes, could
> you update the KIP page accordingly.
>
> Ack.
>
> On Mon, Jul 12, 2021 at 11:29 PM Rohan Desai <de...@gmail.com>
> wrote:
>
> > Hello All,
> >
> > I'd like to start a discussion on the KIP linked above which proposes
> some
> > metrics that we would find useful to help measure whether a Kafka Streams
> > application is saturated. The motivation section in the KIP goes into
> some
> > more detail on why we think this is a useful addition to the metrics
> > already implemented. Thanks in advance for your feedback!
> >
> > Best Regards,
> >
> > Rohan
> >
> > On Mon, Jul 12, 2021 at 12:00 PM Rohan Desai <de...@gmail.com>
> > wrote:
> >
> >>
> >>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-761%3A+Add+Total+Blocked+Time+Metric+to+Streams
> >>
> >
>

Re: [DISCUSS] KIP-761: Add total blocked time metric to streams

Posted by Rohan Desai <de...@gmail.com>.

Thanks for the review Guozhang! responding to your feedback inline:

> 1) I agree that the current ratio metrics is just "snapshot in point", and
more flexible metrics that would allow reporters to calculate based on
window intervals are better. However, the current mechanism of the proposed
metrics assumes the thread->clients mapping as of today, where each thread
would own exclusively one main consumer, restore consumer, producer and an
admin client. But this mapping may be subject to change in the future. Have
you thought about how this metric can be extended when, e.g. the embedded
clients and stream threads are de-coupled?

Of course this depends on how exactly we refactor the runtime - assuming
that we plan to factor out consumers into an "I/O" layer that is
responsible for receiving records and enqueuing them to be processed by
processing threads, then I think it should be reasonable to count the time
we spend blocked on this internal queue(s) as blocked. The main concern
there to me is that the I/O layer would be doing something expensive like
decompression that shouldn't be counted as "blocked". But if that really is
so expensive that it starts to throw off our ratios then it's probably
indicative of a larger problem that the "i/o layer" is a bottleneck and it
would be worth refactoring so that decompression (or insert other expensive
thing here) can also be done on the processing threads.

> 2) [This and all below are minor comments] The "flush-time-total" may
better be a producer client metric, as "flush-wait-time-total", than a
streams metric, though the streams-level "total-blocked" can still leverage
it. Similarly, I think "txn-commit-time-total" and
"offset-commit-time-total" may better be inside producer and consumer
clients respectively.

Good call - I'll update the KIP

> 3) The doc was not very clear on how "thread-start-time" would be needed
when calculating streams utilization along with total-blocked time, could
you elaborate a bit more in the KIP?

Yes, will do.

> For "txn-commit-time-total" specifically, besides producer.commitTxn.
other txn-related calls may also be blocking, including
producer.beginTxn/abortTxn, I saw you mentioned "txn-begin-time-total"
later in the doc, but did not include it as a separate metric, and
similarly, should we have a `txn-abort-time-total` as well? If yes, could
you update the KIP page accordingly.

Ack.

On Mon, Jul 12, 2021 at 11:29 PM Rohan Desai <de...@gmail.com>
wrote:

> Hello All,
>
> I'd like to start a discussion on the KIP linked above which proposes some
> metrics that we would find useful to help measure whether a Kafka Streams
> application is saturated. The motivation section in the KIP goes into some
> more detail on why we think this is a useful addition to the metrics
> already implemented. Thanks in advance for your feedback!
>
> Best Regards,
>
> Rohan
>
> On Mon, Jul 12, 2021 at 12:00 PM Rohan Desai <de...@gmail.com>
> wrote:
>
>>
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-761%3A+Add+Total+Blocked+Time+Metric+to+Streams
>>
>

Re: [DISCUSS] KIP-761: Add total blocked time metric to streams

Posted by Rohan Desai <de...@gmail.com>.

Hello All,

I'd like to start a discussion on the KIP linked above which proposes some
metrics that we would find useful to help measure whether a Kafka Streams
application is saturated. The motivation section in the KIP goes into some
more detail on why we think this is a useful addition to the metrics
already implemented. Thanks in advance for your feedback!

Best Regards,

Rohan

On Mon, Jul 12, 2021 at 12:00 PM Rohan Desai <de...@gmail.com>
wrote:

>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-761%3A+Add+Total+Blocked+Time+Metric+to+Streams
>