You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@kafka.apache.org by Bruno Cadonna <ca...@apache.org> on 2021/08/09 09:19:30 UTC

Re: [DISCUSS] KIP-761: Add total blocked time metric to streams

Hi,

With "group", I actually meant the "type" tag. A thread-level metrics in 
Streams has the following tag "type=stream-thread-metrics" which we also 
call "group" in the code.
I realized now that I mentioned "type" and "group" in my previous e-mail 
which are synonyms for the same concept.

For example, the blocked-time-total metric in Rohan's KIP should be 
defined as:

blocked-time-total

tags:
type = stream-thread-metrics
thread-id = [stream thread ID]

level: INFO

description: the total time the Kafka Streams thread spent blocked on Kafka.

Best,
Bruno


On 27.07.21 10:13, Sophie Blee-Goldman wrote:
> Thanks for the clarifications, that all makes sense.
> 
> I'm ready to vote on the KIP, but can you just update the KIP first to
> address Bruno's feedback? Ie just fix the tags and fill in the missing
> fields.
> 
> For example it sounds like these would be thread-level metrics. You should
> be able to figure out what the values should be from the KIP-444 doc,
> there's a chart with the type and tags for all thread-level metrics.
> 
> Not 100% sure what Bruno meant by "group" but my guess would be whether
> it's INFO/DEBUG/TRACE. This is probably one of the most important
> things to include in a KIP that's introducing new metrics: how big is the
> potential performance impact of recording these metrics? How big is the
> intended audience, would these be useful to almost everyone or are they
> more "niche"?
> 
> It sounds like maybe DEBUG would be most appropriate here -- WDYT?
> 
> -Sophie
> 
> On Thu, Jul 22, 2021 at 9:01 AM Bruno Cadonna <ca...@apache.org> wrote:
> 
>> Hi Rohan,
>>
>> Thank you for the KIP!
>>
>> I agree that the KIP is well-motivated.
>>
>> What is not very clear is the metadata like type, group, and tags of the
>> metrics. For example, there is not application-id tag in Streams and
>> there is also no producer-id tag. The clients, i.e., producer, admin,
>> consumer, and also Streams have a client-id tag, that corresponds to the
>> producer-id, consumer-id, etc you use in the KIP.
>>
>> For examples of metadata used in Streams you can look at the following
>> KIPs:
>>
>> -
>>
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-444%3A+Augment+metrics+for+Kafka+Streams
>> -
>>
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-471%3A+Expose+RocksDB+Metrics+in+Kafka+Streams
>> -
>>
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-607%3A+Add+Metrics+to+Kafka+Streams+to+Report+Properties+of+RocksDB
>> -
>>
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-613%3A+Add+end-to-end+latency+metrics+to+Streams
>>
>>
>> Best,
>> Bruno
>>
>> On 22.07.21 09:42, Rohan Desai wrote:
>>> re sophie:
>>>
>>> The intent here was to include all blocked time (not just `RUNNING`). The
>>> caller can window the total blocked time themselves, and that can be
>>> compared with a timeseries of the state to understand the ratio in
>>> different states. I'll update the KIP to include `committed`. The admin
>> API
>>> calls should be accounted for by the admin client iotime/iowaittime
>>> metrics.
>>>
>>> On Tue, Jul 20, 2021 at 11:49 PM Rohan Desai <de...@gmail.com>
>>> wrote:
>>>
>>>>> I remember now that we moved the round-trip PID's txn completion logic
>>>> into
>>>> init-transaction and commit/abort-transaction. So I think we'd count
>> time
>>>> as in StreamsProducer#initTransaction as well (admittedly it is in most
>>>> cases a one-time thing).
>>>>
>>>> Makes sense - I'll update the KIP
>>>>
>>>> On Tue, Jul 20, 2021 at 11:48 PM Rohan Desai <de...@gmail.com>
>>>> wrote:
>>>>
>>>>>
>>>>>> I had a question - it seems like from the descriptionsof
>>>>> `txn-commit-time-total` and `offset-commit-time-total` that they
>> measure
>>>>> similar processes for ALOS and EOS, but only `txn-commit-time-total` is
>>>>> included in `blocked-time-total`. Why isn't `offset-commit-time-total`
>> also
>>>>> included?
>>>>>
>>>>> I've updated the KIP to include it.
>>>>>
>>>>>> Aside from `flush-time-total`, `txn-commit-time-total` and
>>>>> `offset-commit-time-total`, which will be producer/consumer client
>>>>> metrics,
>>>>> the rest of the metrics will be streams metrics that will be thread
>> level,
>>>>> is that right?
>>>>>
>>>>> Based on the feedback from Guozhang, I've updated the KIP to reflect
>> that
>>>>> the lower-level metrics are all client metrics that are then summed to
>>>>> compute the blocked time metric, which is a Streams metric.
>>>>>
>>>>> On Tue, Jul 20, 2021 at 11:58 AM Rohan Desai <de...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>>> Similarly, I think "txn-commit-time-total" and
>>>>>> "offset-commit-time-total" may better be inside producer and consumer
>>>>>> clients respectively.
>>>>>>
>>>>>> I agree for offset-commit-time-total. For txn-commit-time-total I'm
>>>>>> proposing we measure `StreamsProducer.commitTransaction`, which wraps
>>>>>> multiple producer calls (sendOffsets, commitTransaction)
>>>>>>
>>>>>>>> For "txn-commit-time-total" specifically, besides
>>>>>> producer.commitTxn.
>>>>>> other txn-related calls may also be blocking, including
>>>>>> producer.beginTxn/abortTxn, I saw you mentioned "txn-begin-time-total"
>>>>>> later in the doc, but did not include it as a separate metric, and
>>>>>> similarly, should we have a `txn-abort-time-total` as well? If yes,
>>>>>> could
>>>>>> you update the KIP page accordingly.
>>>>>>
>>>>>> `beginTransaction` is not blocking - I meant to remove that from that
>>>>>> doc. I'll add something for abort.
>>>>>>
>>>>>> On Mon, Jul 19, 2021 at 11:55 PM Rohan Desai <desai.p.rohan@gmail.com
>>>
>>>>>> wrote:
>>>>>>
>>>>>>> Thanks for the review Guozhang! responding to your feedback inline:
>>>>>>>
>>>>>>>> 1) I agree that the current ratio metrics is just "snapshot in
>>>>>>> point", and
>>>>>>> more flexible metrics that would allow reporters to calculate based
>> on
>>>>>>> window intervals are better. However, the current mechanism of the
>>>>>>> proposed
>>>>>>> metrics assumes the thread->clients mapping as of today, where each
>>>>>>> thread
>>>>>>> would own exclusively one main consumer, restore consumer, producer
>> and
>>>>>>> an
>>>>>>> admin client. But this mapping may be subject to change in the
>> future.
>>>>>>> Have
>>>>>>> you thought about how this metric can be extended when, e.g. the
>>>>>>> embedded
>>>>>>> clients and stream threads are de-coupled?
>>>>>>>
>>>>>>> Of course this depends on how exactly we refactor the runtime -
>>>>>>> assuming that we plan to factor out consumers into an "I/O" layer
>> that is
>>>>>>> responsible for receiving records and enqueuing them to be processed
>> by
>>>>>>> processing threads, then I think it should be reasonable to count
>> the time
>>>>>>> we spend blocked on this internal queue(s) as blocked. The main
>> concern
>>>>>>> there to me is that the I/O layer would be doing something expensive
>> like
>>>>>>> decompression that shouldn't be counted as "blocked". But if that
>> really is
>>>>>>> so expensive that it starts to throw off our ratios then it's
>> probably
>>>>>>> indicative of a larger problem that the "i/o layer" is a bottleneck
>> and it
>>>>>>> would be worth refactoring so that decompression (or insert other
>> expensive
>>>>>>> thing here) can also be done on the processing threads.
>>>>>>>
>>>>>>>> 2) [This and all below are minor comments] The "flush-time-total"
>> may
>>>>>>> better be a producer client metric, as "flush-wait-time-total", than
>> a
>>>>>>> streams metric, though the streams-level "total-blocked" can still
>>>>>>> leverage
>>>>>>> it. Similarly, I think "txn-commit-time-total" and
>>>>>>> "offset-commit-time-total" may better be inside producer and consumer
>>>>>>> clients respectively.
>>>>>>>
>>>>>>> Good call - I'll update the KIP
>>>>>>>
>>>>>>>> 3) The doc was not very clear on how "thread-start-time" would be
>>>>>>> needed
>>>>>>> when calculating streams utilization along with total-blocked time,
>>>>>>> could
>>>>>>> you elaborate a bit more in the KIP?
>>>>>>>
>>>>>>> Yes, will do.
>>>>>>>
>>>>>>>> For "txn-commit-time-total" specifically, besides
>> producer.commitTxn.
>>>>>>> other txn-related calls may also be blocking, including
>>>>>>> producer.beginTxn/abortTxn, I saw you mentioned
>> "txn-begin-time-total"
>>>>>>> later in the doc, but did not include it as a separate metric, and
>>>>>>> similarly, should we have a `txn-abort-time-total` as well? If yes,
>>>>>>> could
>>>>>>> you update the KIP page accordingly.
>>>>>>>
>>>>>>> Ack.
>>>>>>>
>>>>>>> On Mon, Jul 12, 2021 at 11:29 PM Rohan Desai <
>> desai.p.rohan@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hello All,
>>>>>>>>
>>>>>>>> I'd like to start a discussion on the KIP linked above which
>> proposes
>>>>>>>> some metrics that we would find useful to help measure whether a
>> Kafka
>>>>>>>> Streams application is saturated. The motivation section in the KIP
>> goes
>>>>>>>> into some more detail on why we think this is a useful addition to
>> the
>>>>>>>> metrics already implemented. Thanks in advance for your feedback!
>>>>>>>>
>>>>>>>> Best Regards,
>>>>>>>>
>>>>>>>> Rohan
>>>>>>>>
>>>>>>>> On Mon, Jul 12, 2021 at 12:00 PM Rohan Desai <
>> desai.p.rohan@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-761%3A+Add+Total+Blocked+Time+Metric+to+Streams
>>>>>>>>>
>>>>>>>>
>>>
>>
>