Posted to dev@kafka.apache.org by Jun Rao <ju...@confluent.io.INVALID> on 2022/05/18 17:57:07 UTC

Re: [DISCUSS] KIP-714: Client metrics and observability

Hi, Magnus,

Thanks for the updated KIP. Just a couple of more comments.

50. To troubleshoot a particular client issue, I imagine that the client
needs to identify its client_instance_id. How does the client find this
out? Do we plan to include client_instance_id in the client log, expose it
as a metric or something else?

51. The KIP lists a bunch of metrics that need to be collected at the
client side. However, it seems quite a few useful java client metrics like
the following are missing.
    buffer-total-bytes
    buffer-available-bytes
    bufferpool-wait-time
    batch-size-avg
    batch-size-max
    io-wait-ratio
    io-ratio

Thanks,

Jun

On Mon, Apr 4, 2022 at 10:01 AM Jun Rao <ju...@confluent.io> wrote:

> Hi, Xavier,
>
> Thanks for the reply.
>
> 28. It does seem that we have started using KafkaMetrics on the broker
> side. Then, my only concern is on the usage of Histogram in KafkaMetrics.
> Histogram in KafkaMetrics statically divides the value space into a fixed
> number of buckets and only returns values on the bucket boundary. So, the
> returned histogram value may never show up in a recorded value. Yammer
> Histogram, on the other hand, uses reservoir sampling. The reported value
> is always one of the recorded values. So, I am not sure that Histogram in
> KafkaMetrics is as good as Yammer Histogram. ClientMetricsPluginExportTime
> uses Histogram.
>
> Thanks,
>
> Jun
>
> On Thu, Mar 31, 2022 at 5:21 PM Xavier Léauté <xa...@confluent.io.invalid>
> wrote:
>
>> >
>> > 28. On the broker, we typically use Yammer metrics. Only for metrics
>> that
>> > depend on Kafka metric features (e.g., quota), we use the Kafka metric.
>> > Yammer metrics have 4 types: gauge, meter, histogram and timer. meter
>> > calculates a rate, but also exposes an accumulated value.
>> >
>>
>> I don't see a good reason we should limit ourselves to Yammer metrics on
>> the broker. KafkaMetrics was written
>> to replace Yammer metrics and is used for all new components (clients,
>> streams, connect, etc.)
>> My understanding is that the original goal was to retire Yammer metrics in
>> the broker in favor of KafkaMetrics.
>> We just haven't done so out of backwards compatibility concerns.
>> There are other broker metrics such as group coordinator, transaction
>> state
>> manager, and various socket server metrics
>> already using KafkaMetrics that don't need specific Kafka metric features,
>> so I don't see why we should refrain from using
>> Kafka metrics on the broker unless there are real compatibility concerns
>> or
>> where implementation specifics could lead to confusion when comparing
>> metrics using different implementations.
>>
>> In my opinion we should encourage people to use KafkaMetrics going forward
>> on the broker as well, for three reasons:
>> a) yammer metrics is long deprecated and no longer maintained
>> b) yammer metrics are much less expressive
>> c) we don't have a proper API to expose yammer metrics outside of JMX
>> (MetricsReporter only exposes KafkaMetrics)
>>
>
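
To make the difference described in point 28 concrete, here is a minimal Java sketch. It is an illustration under stated assumptions, not the KafkaMetrics or Yammer implementation: the class names BucketedHistogram and ReservoirHistogram, the bucket layout, and the quantile rules are all hypothetical. It only shows why a fixed-bucket histogram can report a value that was never recorded, while a reservoir that picks (rather than interpolates) reports one of the recorded samples.

```java
import java.util.Arrays;
import java.util.Random;

/** Demo entry point: feeds the same samples to both hypothetical histograms. */
class HistogramComparison {
    public static void main(String[] args) {
        double[] samples = {3.0, 7.0, 12.5, 41.0};
        BucketedHistogram bucketed = new BucketedHistogram(0.0, 100.0, 10);
        ReservoirHistogram reservoir = new ReservoirHistogram(16, 42L);
        for (double s : samples) {
            bucketed.record(s);
            reservoir.record(s);
        }
        System.out.println("bucketed median:  " + bucketed.quantile(0.5));
        System.out.println("reservoir median: " + reservoir.quantile(0.5));
    }
}

/** Hypothetical fixed-bucket histogram: a quantile is reported as a bucket boundary. */
class BucketedHistogram {
    private final double min;
    private final double bucketWidth;
    private final long[] counts;
    private long total;

    BucketedHistogram(double min, double max, int buckets) {
        this.min = min;
        this.bucketWidth = (max - min) / buckets;
        this.counts = new long[buckets];
    }

    void record(double value) {
        int idx = (int) ((value - min) / bucketWidth);
        idx = Math.max(0, Math.min(counts.length - 1, idx)); // clamp out-of-range values
        counts[idx]++;
        total++;
    }

    /** Upper boundary of the first bucket whose cumulative count reaches the quantile. */
    double quantile(double q) {
        if (total == 0) throw new IllegalStateException("no samples recorded");
        long threshold = Math.max(1, (long) Math.ceil(q * total));
        long seen = 0;
        for (int i = 0; i < counts.length; i++) {
            seen += counts[i];
            if (seen >= threshold) return min + (i + 1) * bucketWidth;
        }
        return min + counts.length * bucketWidth; // only reached if q > 1
    }
}

/** Hypothetical reservoir (Algorithm R): a quantile is one of the retained samples. */
class ReservoirHistogram {
    private final double[] reservoir;
    private final Random random;
    private long seen;

    ReservoirHistogram(int capacity, long seed) {
        this.reservoir = new double[capacity];
        this.random = new Random(seed);
    }

    void record(double value) {
        seen++;
        if (seen <= reservoir.length) {
            reservoir[(int) (seen - 1)] = value; // fill phase: keep every sample
        } else {
            long j = (long) (random.nextDouble() * seen); // uniform in [0, seen)
            if (j < reservoir.length) reservoir[(int) j] = value;
        }
    }

    double quantile(double q) {
        int n = (int) Math.min(seen, reservoir.length);
        if (n == 0) throw new IllegalStateException("no samples recorded");
        double[] sorted = Arrays.copyOf(reservoir, n);
        Arrays.sort(sorted);
        return sorted[Math.min(n - 1, (int) Math.floor(q * n))];
    }
}
```

With these four samples the bucketed median comes out as 10.0, a bucket boundary that no caller ever recorded, while the reservoir median is 12.5, one of the inputs. A real reservoir implementation may interpolate between samples; this sketch deliberately does not.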

Re: [DISCUSS] KIP-714: Client metrics and observability

Posted by Kirk True <ki...@kirktrue.pro>.

Hi Jun,

On Tue, Jun 21, 2022, at 5:24 PM, Jun Rao wrote:
> Hi, Magnus, Kirk,
> 
> Thanks for the reply. A few more comments on your reply.
> 
> 100. I agree there are some benefits of having a set of standard metrics
> across all clients, but I am just wondering how practical it is, given that
> the proposal doesn't require this set like the Kafka protocol.
> 100.1 A client may not implement all or some of the standard metrics. Then,
> we won't have complete standardized names across clients.

True, a client need not implement all the metrics from the KIP. However, those that it does implement will use the names specified in the KIP. The rest of the metrics that a client doesn't implement should be considered as "reserved for future use."

> 100.2 The set of standard metrics needs to be common across all clients.
> For example, client.consumer.poll.latency implies that all clients
> implement a poll() interface. Is that true for all clients?
> client.producer.record.queue.bytes. Do all producers have queues? We
> probably need to make a pass of those metrics to see if they are indeed
> common across all clients.

There are certainly metrics that are not applicable for all client implementations. For example, some of the host-specific CPU timing metrics are "hard" to get on a JVM using standard Java APIs. Ultimately the client author must make a judgement call whether or not to implement a metric. If a given metric from the KIP is truly non-applicable for a client, the author would likely omit it from the client.

Regarding the request to "make a pass" of the clients, are there any client implementations in particular that I should consider reviewing?

I will make an effort to look at some of the more common clients to determine which metrics they expose. I'm a little concerned that could take an outsized amount of effort, depending on the clients' documentation. Researching the code base of each client to ascertain the exposed metrics sounds very time-consuming.

> Also, a bunch of standard metrics have type
> Histogram. Java client doesn't have good Histogram support yet. I am also
> not sure if all clients support Histogram. Should we avoid Histogram type
> in standardized metrics?

That's a good question. I can try to get a feel for the existing histogram support in the ecosystem clients and report back.

The KIP does specify an alternate means to report histogram data using time-based averages:

"For [simplicity] a client implementation may choose to provide an average value as [a] Gauge instead of a Histogram. These averages should be using the original Histogram metric name + '.avg', e.g., 'client.request.rtt.avg'."

This approach offers lower fidelity, of course, but it's hopefully more useful in general to have _some_ data than _no_ data?

Perhaps we should replace histograms with this simplified implementation in the KIP, deferring proper histogram support to a future revision?
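
As a rough illustration of the '.avg' fallback quoted above, here is a small Java sketch. It assumes nothing about the real client code: AverageGauge is a hypothetical class, and it keeps a plain running average rather than the time-windowed average a real client would more likely use. The only detail taken from the KIP text is the naming rule (original Histogram metric name + '.avg').

```java
/** Demo entry point for the hypothetical AverageGauge below. */
class AverageGaugeDemo {
    public static void main(String[] args) {
        AverageGauge rtt = new AverageGauge("client.request.rtt");
        rtt.record(10.0);
        rtt.record(20.0);
        rtt.record(30.0);
        System.out.println(rtt.name() + " = " + rtt.value());
    }
}

/**
 * Hypothetical stand-in for a Histogram metric: it exposes a single average
 * value under the Histogram's name with ".avg" appended.
 */
class AverageGauge {
    private final String name;
    private double sum;
    private long count;

    AverageGauge(String histogramName) {
        this.name = histogramName + ".avg"; // naming rule quoted from the KIP
    }

    String name() {
        return name;
    }

    synchronized void record(double value) {
        sum += value;
        count++;
    }

    /** Running average of all recorded values, or NaN if nothing was recorded. */
    synchronized double value() {
        return count == 0 ? Double.NaN : sum / count;
    }
}
```

The trade-off is visible in the shape of the class: one number is cheap to compute and ship, but the distribution (tail latency in particular) is lost.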

> 100.3 For a subset of metrics that are truly common across clients, it
> would be confusing for each client to maintain two sets of metrics for the
> same thing. We could document them, but that means every user of every
> client needs to remember this mapping. This is a much bigger
> inconvenience than standardizing the metric names on the server side. If we
> want to go this route, my preference is to deprecate the existing metric
> names that are covered by the standard metric names.

Ah, good point. I admit my focus is too Java-centric.

I want to make sure I understand more specifically what "the server" is in your point regarding 'standardizing the metric names on the server.' At some point there needs to be code that executes on the server that has knowledge of all the clients' metric names as well as a given organization's preferred metric names. Would this code live in the main Apache Kafka repo? Or is it in the organization's ClientTelemetryReceiver implementation? Or somewhere else?

How about introducing a new pluggable mechanism/interface that the broker invokes to determine the metric name mapping? We could provide two out-of-the-box implementations: 1) a default no-op mapper, and 2) a configuration file-based mapper that operates off something akin to a set of Java properties files (one mapping file for each known client). The implementation of the mapper is configured by the cluster administrator and, of course, each organization can provide their own implementation.
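
To make the idea a bit more tangible, here is a minimal Java sketch of what such a mapper could look like. Everything in it is hypothetical: the interface name MetricNameMapper, the two implementations, and the mapping file contents are assumptions for discussion, not an existing or proposed Kafka API. The metric names are the examples used elsewhere in this thread.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

/** Demo entry point: maps two client-specific names and one unknown name. */
class MetricNameMapperDemo {
    public static void main(String[] args) {
        MetricNameMapper mapper = PropertiesMetricNameMapper.fromString(
            "connections.open=org.apache.kafka.client.connection.active\n"
          + "Connections/Alive=org.apache.kafka.client.connection.active\n");
        System.out.println(mapper.map("connections.open"));
        System.out.println(mapper.map("Connections/Alive"));
        System.out.println(mapper.map("some.unmapped.metric"));
    }
}

/** Hypothetical plugin interface the broker would invoke per incoming metric. */
interface MetricNameMapper {
    String map(String clientMetricName);
}

/** Default mapper: leaves every name untouched. */
class NoOpMetricNameMapper implements MetricNameMapper {
    @Override
    public String map(String clientMetricName) {
        return clientMetricName;
    }
}

/** Mapper driven by a Java properties source: clientName=preferredName. */
class PropertiesMetricNameMapper implements MetricNameMapper {
    private final Properties mappings = new Properties();

    PropertiesMetricNameMapper(Reader source) throws IOException {
        mappings.load(source);
    }

    /** Convenience factory for in-memory mapping text (used by the demo). */
    static PropertiesMetricNameMapper fromString(String text) {
        try {
            return new PropertiesMetricNameMapper(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    @Override
    public String map(String clientMetricName) {
        // Names without an entry pass through unchanged.
        return mappings.getProperty(clientMetricName, clientMetricName);
    }
}
```

Passing unmapped names through unchanged keeps client-specific metrics flowing even when the mapping file lags behind a client release; an implementation could just as well drop or flag them.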

> 101. "or if the client-specific metrics are not converted to some common
> form, name, semantic, etc, it'll make creating meaningful aggregations and
> monitoring more complex in the upstream telemetry system with a scattered
> plethora of custom metrics." There will always be client specific metrics.
> So, it seems that we have to deal with scattered custom metrics even with a
> set of standard metrics.

Yes, this is true.

I do believe the KIP should establish a clear means to communicate about the different metrics and their meaning.

When a team is troubleshooting a high-severity incident, these client metrics provide a powerful tool to understand, remediate, and resolve those incidents. The goal of standardizing the metric names is to minimize communication roadblocks in that effort.

> 102. "However, in, let's say the Apache Kafka 3.7 release, the metric name
> is changed to "connections.open.count." At this point, there are two names
> and machine-to-machine communication will likely be affected. With that
> change, all client telemetry plugin(s) used in an organization must be
> updated to reflect that change, else data loss or bugs could be
> introduced." The standard metric names could change too in the future,
> right? So, we need to deal with a similar problem if that happens.

Also true :)

But the metric names, when standardized via a KIP, would undergo a well-known process when being changed in the future. Any metric name changes would be required to be included in a KIP and would require the old and new metric names to co-exist for a period of X releases. This would give teams that are upgrading to newer Kafka versions clear and consistent advance notice to make the needed changes on their end.

Granted, custom, client-specific metrics don't go through the KIP process. We don't "own" that code or their processes, so any usage of client-specific metrics runs the risk of a caveat emptor situation.

> 103. "Are there any inobvious security/privacy-related edge cases where
> shipping certain metrics to the broker would be "bad?"" I am not sure. But
> if a metric can be shipped to the server, it would be useful for the same
> metric to be visible on the client side.

Agreed. The question is, does the reverse hold true?

Thanks Jun!!!!

Kirk

> 
> Thanks,
> 
> Jun
> 
> 
> On Tue, Jun 21, 2022 at 8:19 AM Kirk True <ki...@kirktrue.pro> wrote:
> 
> > Hi Jun,
> >
> > Thank you for all your continued interest in shaping the KIP :)
> >
> > On Thu, Jun 16, 2022, at 2:38 PM, Jun Rao wrote:
> > > Hi, Kirk,
> > >
> > > Thanks for the reply. A couple of more comments.
> > >
> > > (1) "Another perspective is that these two sets of metrics serve
> > different
> > > purposes and/or have different audiences, which argues that they should
> > > maintain their individuality and purpose. " Hmm, I am wondering if those
> > > metrics are really for different audiences and purposes? For example, if
> > > the operator detected an issue through a client metric collected through
> > > the server, the operator may need to communicate that back to the client.
> > > It would be weird if that same metric is not visible on the client side.
> >
> > I agree in principle that all client metrics visible on the client can
> > also be available to be sent to the broker.
> >
> > Are there any inobvious security/privacy-related edge cases where shipping
> > certain metrics to the broker would be "bad?"
> >
> > > (2) If we could standardize the names on the server side, do we need to
> > > enforce a naming convention for all clients?
> >
> > "Enforce" is such an ugly word :P
> >
> > But yes, I do feel that a consistent naming convention across all clients
> > provides communication benefits between two entities:
> >
> >  1. Human-to-human communication. Ecosystem-wide agreement and
> > understanding of metrics helps all to communicate more efficiently.
> >  2. Machine-to-machine communication. Defining the names via the KIP
> > mechanism helps to ensure stabilization across releases of a given client.
> >
> > Point 1: Human-to-human Communication
> >
> > There are quite a handful of parties that must communicate effectively
> > across the Kafka ecosystem. Here are the ones I can think of off the top of
> > my head:
> >
> >  1. Kafka client authors
> >  2. Kafka client users
> >  3. Kafka client telemetry plugin authors
> >  4. Support teams (within an organization or vendor-supplied across
> > organizations)
> >  5. Kafka cluster operators
> >
> > There should be a standard so that these parties can understand the
> > metrics' meaning and be able to correlate that across all clients.
> >
> > As a concrete example, KIP-714 includes a metric for tracking the number
> > of active client connections to a cluster, named
> > "org.apache.kafka.client.connection.active." Given this name, all client
> > implementations can communicate this name and its value to all parties
> > consistently. Without a standard naming convention, the metric might be
> > named "connections.open" in the Java client and "Connections/Alive" in
> > librdkafka. This inconsistency of naming would impact the discussions
> > between one or more of the parties involved.
> >
> > To your point, it's absolutely a design choice to keep the naming
> > convention the same between each client. We can change that if it makes
> > sense.
> >
> > Point 2: Machine-to-machine Communication
> >
> > Standardization at the client level provides stability through an implied
> > contract that a client should not introduce a breaking name change between
> > releases. Otherwise, the ability for the metrics to be "understood" in a
> > machine-to-machine context would be forfeit.
> >
> > For example, let's say that we give the clients the latitude to name
> > metrics as they wish. In this example, let's say that the Apache Kafka 3.4
> > release decides to name this metric "connections.open." It's a good name!
> > It says what it is. However, in, let's say the Apache Kafka 3.7 release,
> > the metric name is changed to "connections.open.count." At this point,
> > there are two names and machine-to-machine communication will likely be
> > affected. With that change, all client telemetry plugin(s) used in an
> > organization must be updated to reflect that change, else data loss or bugs
> > could be introduced.
> >
> > That the KIP defines the names of the metrics does, admittedly, constrain
> > the options of authors of the different clients. The metric named
> > "org.apache.kafka.client.connection.active" may be confusing in some client
> > implementations. For whatever reason, a client author may even find it
> > "undesirable" to include a reference that includes "Apache" in their code.
> >
> > There's also the precedent set by the existing (JMX-based) client metrics.
> > Though these are applicable only to the Java client, we can see that having
> > a standardized naming convention there has helped with communication.
> >
> > So, IMO, it makes sense to define the metric names via the KIP mechanism
> > and--let's say, "ask"--that client implementations abide by those.
> >
> > Thanks,
> > Kirk
> >
> > >
> > > Thanks,
> > >
> > > Jun
> > >
> > > On Thu, Jun 16, 2022 at 12:00 PM Kirk True <ki...@kirktrue.pro> wrote:
> > >
> > > > Hi Jun,
> > > >
> > > > I'll try to answer the questions posed...
> > > >
> > > > On Tue, Jun 7, 2022, at 4:32 PM, Jun Rao wrote:
> > > > > Hi, Magnus,
> > > > >
> > > > > Thanks for the reply.
> > > > >
> > > > > So, the standard set of generic metrics is just a recommendation and
> > not
> > > > a
> > > > > requirement? This sounds good to me since it makes the adoption of
> > the
> > > > KIP
> > > > > easier.
> > > >
> > > > I believe that was the intent, yes.
> > > >
> > > > > Regarding the metric names, I have two concerns.
> > > >
> > > > (I'm splitting these two up for readability...)
> > > >
> > > > > (1) If a client already
> > > > > has an existing metric similar to the standard one, duplicating the
> > > > metric
> > > > > seems to be confusing.
> > > >
> > > > Agreed. I'm dealing with that situation as I write the Java client
> > > > implementation.
> > > >
> > > > The existing Java client exposes a set of metrics via JMX. The updated
> > > > Java client will introduce a second set of metrics, which instead are
> > > > exposed via sending them to the broker. There is substantial overlap
> > between
> > > > the two sets of metrics, and in a few places in the code under
> > development,
> > > > there are essentially two separate calls to update metrics: one for the
> > > > JMX-bound metrics and one for the broker-bound metrics.
> > > >
> > > > To be candid, I have gone back-and-forth on that design. From one
> > > > perspective, it could be argued that the set of client metrics should
> > be
> > > > standardized across a given client, regardless of how those metrics are
> > > > exposed for consumption. Another perspective is that these two sets of
> > > > metrics serve different purposes and/or have different audiences, which
> > > > argues that they should maintain their individuality and purpose. Your
> > > > inputs/suggestions are certainly welcome!
> > > >
> > > > > (2) If a client needs to implement a standard metric
> > > > > that doesn't exist yet, using a naming convention (e.g., using dash
> > vs
> > > > dot)
> > > > > different from other existing metrics also seems a bit confusing. It
> > > > seems
> > > > > that the main benefit of having standard metric names across clients
> > is
> > > > for
> > > > > better server side monitoring. Could we do the standardization in the
> > > > > plugin on the server?
> > > >
> > > > I think the expectation is that the plugin implementation will perform
> > > > transformation of metric names, if needed, to fit in with an
> > organization's
> > > > monitoring naming standards. Perhaps we need to call that out in the
> > KIP
> > > > itself.
> > > >
> > > > Thanks,
> > > > Kirk
> > > >
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Jun
> > > > >
> > > > >
> > > > >
> > > > > On Tue, Jun 7, 2022 at 6:53 AM Magnus Edenhill <ma...@edenhill.se>
> > > > wrote:
> > > > >
> > > > > > Hey Jun,
> > > > > >
> > > > > > I've clarified the scope of the standard metrics in the KIP, but
> > > > basically:
> > > > > >
> > > > > >  * We define a standard set of generic metrics that should be
> > relevant
> > > > to
> > > > > > most client implementations, e.g., each producer implementation
> > > > probably
> > > > > > has some sort of per-partition message queue.
> > > > > >  * A client implementation should strive to implement as many of
> > the
> > > > > > standard metrics as possible, but only the ones that make sense.
> > > > > >  * For metrics that are not in the standard set, a client
> > maintainer
> > > > can
> > > > > > choose to either submit a KIP to add additional standard metrics -
> > if
> > > > > > they're relevant, or go ahead and add custom metrics that are
> > specific
> > > > to
> > > > > > that client implementation. These custom metrics will have a prefix
> > > > > > specific to that client implementation, as opposed to the standard
> > > > metric
> > > > > > set that resides under "org.apache.kafka...". E.g.,
> > > > > > "se.edenhill.librdkafka" or whatever.
> > > > > >  * Existing non-KIP-714 metrics should remain untouched. In some
> > cases
> > > > we
> > > > > > might be able to use the same meter given it is compatible with the
> > > > > > standard metric set definition, in other cases a semi-duplicate
> > meter
> > > > may
> > > > > > be needed. Thus this will not affect the metrics exposed through
> > JMX,
> > > > or
> > > > > > vice versa.
> > > > > >
> > > > > > Thanks,
> > > > > > Magnus
> > > > > >
> > > > > >
> > > > > >
> > > > > > > On Wed, Jun 1, 2022 at 18:55, Jun Rao
> > <ju...@confluent.io.invalid>:
> > > > > >
> > > > > > > Hi, Magnus,
> > > > > > >
> > > > > > > 51. Just to clarify my question.  (1) Are standard metrics
> > required
> > > > for
> > > > > > > every client for this KIP to function?  (2) Are we converting
> > > > existing
> > > > > > java
> > > > > > > metrics to the standard metrics and deprecating the old ones? If
> > so,
> > > > > > could
> > > > > > > we list all existing java metrics that need to be renamed and the
> > > > > > > corresponding new name?
> > > > > > >
> > > > > > > Thanks,
> > > > > > >
> > > > > > > Jun
> > > > > > >
> > > > > > > On Tue, May 31, 2022 at 3:29 PM Jun Rao <ju...@confluent.io>
> > wrote:
> > > > > > >
> > > > > > > > Hi, Magnus,
> > > > > > > >
> > > > > > > > Thanks for the reply.
> > > > > > > >
> > > > > > > > 51. I think it's fine to have a list of recommended metrics for
> > > > every
> > > > > > > > client to implement. I am just not sure that standardizing on
> > the
> > > > > > metric
> > > > > > > > names across all clients is practical. The list of common
> > metrics
> > > > in
> > > > > > the
> > > > > > > > > KIP has completely different names from the java metric names.
> > > > Some of
> > > > > > > > them have different types. For example, some of the common
> > metrics
> > > > > > have a
> > > > > > > > type of histogram, but the java client metrics don't use
> > histogram
> > > > in
> > > > > > > > general. Requiring the operator to translate those names and
> > > > understand
> > > > > > > the
> > > > > > > > > subtle differences across clients seems to cause more confusion
> > > > during
> > > > > > > > troubleshooting.
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > >
> > > > > > > > Jun
> > > > > > > >
> > > > > > > > On Tue, May 31, 2022 at 5:02 AM Magnus Edenhill <
> > > > magnus@edenhill.se>
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > >> On Fri, May 20, 2022 at 01:23, Jun Rao
> > > > <jun@confluent.io.invalid
> > > > > > >:
> > > > > > > >>
> > > > > > > > >> > Hi, Magnus,
> > > > > > > >> >
> > > > > > > >> > Thanks for the reply.
> > > > > > > >> >
> > > > > > > >> > 50. Sounds good.
> > > > > > > >> >
> > > > > > > > >> > 51. I misunderstood the proposal in the KIP then. The
> > > > proposal is
> > > > > > to
> > > > > > > >> > define a set of common metric names that every client should
> > > > > > > implement.
> > > > > > > >> The
> > > > > > > >> > problem is that every client already has its own set of
> > metrics
> > > > with
> > > > > > > its
> > > > > > > >> > own names. I am not sure that we could easily agree upon a
> > > > common
> > > > > > set
> > > > > > > of
> > > > > > > >> > metrics that work with all clients. There are likely to be
> > some
> > > > > > > metrics
> > > > > > > >> > that are client specific. Translating between the common
> > name
> > > > and
> > > > > > > client
> > > > > > > >> > specific name is probably going to add more confusion. As
> > > > mentioned
> > > > > > in
> > > > > > > >> the
> > > > > > > >> > KIP, similar metrics from different clients could have
> > subtle
> > > > > > > >> > semantic differences. Could we just let each client use its
> > own
> > > > set
> > > > > > of
> > > > > > > >> > metric names?
> > > > > > > >> >
> > > > > > > >>
> > > > > > > >> We identified a common set of metrics that should be relevant
> > for
> > > > most
> > > > > > > >> client implementations,
> > > > > > > >> they're the ones listed in the KIP.
> > > > > > > >> A supporting client does not have to implement all those
> > metrics,
> > > > only
> > > > > > > the
> > > > > > > > >> ones that make sense
> > > > > > > >> based on that client implementation, and a client may
> > implement
> > > > other
> > > > > > > >> metrics that are not listed
> > > > > > > >> in the KIP under its own namespace.
> > > > > > > >> This approach has two benefits:
> > > > > > > >>  - there will be a common set of metrics that most/all clients
> > > > > > > implement,
> > > > > > > >> which makes monitoring
> > > > > > > >>   and troubleshooting easier across fleets with multiple Kafka
> > > > client
> > > > > > > >> languages/implementations.
> > > > > > > >>  - client-specific metrics are still possible, so if there is
> > no
> > > > > > > suitable
> > > > > > > >> standard metric a client can still
> > > > > > > >>    provide what special metrics it has.
> > > > > > > >>
> > > > > > > >>
> > > > > > > >> Thanks,
> > > > > > > >> Magnus
> > > > > > > >>
> > > > > > > >>
> > > > > > > >> On Thu, May 19, 2022 at 10:39 AM Magnus Edenhill <
> > > > magnus@edenhill.se>
> > > > > > > >> wrote:
> > > > > > > >> >
> > > > > > > > >> > > On Wed, May 18, 2022 at 19:57, Jun Rao
> > > > > > <jun@confluent.io.invalid
> > > > > > > >> >:
> > > > > > > >> > >
> > > > > > > >> > > > Hi, Magnus,
> > > > > > > >> > > >
> > > > > > > >> > >
> > > > > > > >> > > Hi Jun
> > > > > > > >> > >
> > > > > > > >> > >
> > > > > > > >> > > >
> > > > > > > >> > > > Thanks for the updated KIP. Just a couple of more
> > comments.
> > > > > > > >> > > >
> > > > > > > >> > > > 50. To troubleshoot a particular client issue, I imagine
> > > > that
> > > > > > the
> > > > > > > >> > client
> > > > > > > >> > > > needs to identify its client_instance_id. How does the
> > > > client
> > > > > > find
> > > > > > > >> this
> > > > > > > >> > > > out? Do we plan to include client_instance_id in the
> > client
> > > > log,
> > > > > > > >> expose
> > > > > > > >> > > it
> > > > > > > >> > > > as a metric or something else?
> > > > > > > >> > > >
> > > > > > > >> > >
> > > > > > > >> > > The KIP suggests that client implementations emit an
> > > > informative
> > > > > > log
> > > > > > > >> > > message
> > > > > > > >> > > with the assigned client-instance-id once it is retrieved
> > > > (once
> > > > > > per
> > > > > > > >> > client
> > > > > > > >> > > instance lifetime).
> > > > > > > >> > > There's also a clientInstanceId() method that an
> > application
> > > > can
> > > > > > use
> > > > > > > >> to
> > > > > > > >> > > retrieve
> > > > > > > >> > > the client instance id and emit through whatever side
> > channels
> > > > > > > > >> > > make
> > > > > > > >> > sense.
> > > > > > > >> > >
> > > > > > > >> > >
> > > > > > > >> > >
> > > > > > > >> > > > 51. The KIP lists a bunch of metrics that need to be
> > > > collected
> > > > > > at
> > > > > > > >> the
> > > > > > > >> > > > client side. However, it seems quite a few useful java
> > > > client
> > > > > > > >> metrics
> > > > > > > >> > > like
> > > > > > > >> > > > the following are missing.
> > > > > > > >> > > >     buffer-total-bytes
> > > > > > > >> > > >     buffer-available-bytes
> > > > > > > >> > > >
> > > > > > > >> > >
> > > > > > > >> > > These are covered by client.producer.record.queue.bytes
> > and
> > > > > > > >> > > client.producer.record.queue.max.bytes.
> > > > > > > >> > >
> > > > > > > >> > >
> > > > > > > >> > > >     bufferpool-wait-time
> > > > > > > >> > > >
> > > > > > > >> > >
> > > > > > > >> > > Missing, but somewhat implementation specific.
> > > > > > > >> > > If it was up to me we would add this later if there's a
> > need.
> > > > > > > >> > >
> > > > > > > >> > >
> > > > > > > >> > >
> > > > > > > >> > > >     batch-size-avg
> > > > > > > >> > > >     batch-size-max
> > > > > > > >> > > >
> > > > > > > >> > >
> > > > > > > >> > > These are missing and would be suitably represented as a
> > > > > > histogram.
> > > > > > > >> I'll
> > > > > > > >> > > add them.
> > > > > > > >> > >
> > > > > > > >> > >
> > > > > > > >> > >
> > > > > > > >> > > >     io-wait-ratio
> > > > > > > >> > > >     io-ratio
> > > > > > > >> > > >
> > > > > > > >> > >
> > > > > > > >> > > There's client.io.wait.time which should cover
> > io-wait-ratio.
> > > > > > > >> > > We could add a client.io.time as well, now or in a later
> > KIP.
> > > > > > > >> > >
> > > > > > > >> > > Thanks,
> > > > > > > >> > > Magnus
> > > > > > > >> > >
> > > > > > > >> > >
> > > > > > > >> > >
> > > > > > > >> > >
> > > > > > > >> > > >
> > > > > > > >> > > > Thanks,
> > > > > > > >> > > >
> > > > > > > >> > > > Jun
> > > > > > > >> > > >
> > > > > > > >> > > > On Mon, Apr 4, 2022 at 10:01 AM Jun Rao <
> > jun@confluent.io>
> > > > > > wrote:
> > > > > > > >> > > >
> > > > > > > >> > > > > Hi, Xavier,
> > > > > > > >> > > > >
> > > > > > > >> > > > > Thanks for the reply.
> > > > > > > >> > > > >
> > > > > > > >> > > > > 28. It does seem that we have started using
> > KafkaMetrics
> > > > on
> > > > > > the
> > > > > > > >> > broker
> > > > > > > >> > > > > side. Then, my only concern is on the usage of
> > Histogram
> > > > in
> > > > > > > >> > > KafkaMetrics.
> > > > > > > >> > > > > Histogram in KafkaMetrics statically divides the value
> > > > space
> > > > > > > into
> > > > > > > >> a
> > > > > > > >> > > fixed
> > > > > > > >> > > > > number of buckets and only returns values on the
> > bucket
> > > > > > > boundary.
> > > > > > > >> So,
> > > > > > > >> > > the
> > > > > > > >> > > > > returned histogram value may never show up in a
> > recorded
> > > > > > value.
> > > > > > > >> > Yammer
> > > > > > > >> > > > > Histogram, on the other hand, uses reservoir
> > sampling. The
> > > > > > > >> reported
> > > > > > > >> > > value
> > > > > > > >> > > > > is always one of the recorded values. So, I am not
> > sure
> > > > that
> > > > > > > >> > Histogram
> > > > > > > >> > > in
> > > > > > > >> > > > > KafkaMetrics is as good as Yammer Histogram.
> > > > > > > >> > > > ClientMetricsPluginExportTime
> > > > > > > >> > > > > uses Histogram.
> > > > > > > >> > > > >
> > > > > > > >> > > > > Thanks,
> > > > > > > >> > > > >
> > > > > > > >> > > > > Jun
> > > > > > > >> > > > >
> > > > > > > >> > > > > On Thu, Mar 31, 2022 at 5:21 PM Xavier Léauté
> > > > > > > >> > > > <xa...@confluent.io.invalid>
> > > > > > > >> > > > > wrote:
> > > > > > > >> > > > >
> > > > > > > >> > > > >> >
> > > > > > > >> > > > >> > 28. On the broker, we typically use Yammer metrics.
> > > > Only
> > > > > > for
> > > > > > > >> > metrics
> > > > > > > >> > > > >> that
> > > > > > > >> > > > >> > depend on Kafka metric features (e.g., quota), we
> > use
> > > > the
> > > > > > > Kafka
> > > > > > > >> > > > metric.
> > > > > > > >> > > > >> > Yammer metrics have 4 types: gauge, meter,
> > histogram
> > > > and
> > > > > > > timer.
> > > > > > > >> > > meter
> > > > > > > >> > > > >> > calculates a rate, but also exposes an accumulated
> > > > value.
> > > > > > > >> > > > >> >
> > > > > > > >> > > > >>
> > > > > > > >> > > > >> I don't see a good reason we should limit ourselves
> > to
> > > > Yammer
> > > > > > > >> > metrics
> > > > > > > >> > > on
> > > > > > > >> > > > >> the broker. KafkaMetrics was written
> > > > > > > >> > > > >> to replace Yammer metrics and is used for all new
> > > > components
> > > > > > > >> > (clients,
> > > > > > > >> > > > >> streams, connect, etc.)
> > > > > > > >> > > > >> My understanding is that the original goal was to
> > retire
> > > > > > Yammer
> > > > > > > >> > > metrics
> > > > > > > >> > > > in
> > > > > > > >> > > > >> the broker in favor of KafkaMetrics.
> > > > > > > >> > > > >> We just haven't done so out of backwards
> > compatibility
> > > > > > > concerns.
> > > > > > > >> > > > >> There are other broker metrics such as group
> > coordinator,
> > > > > > > >> > transaction
> > > > > > > >> > > > >> state
> > > > > > > >> > > > >> manager, and various socket server metrics
> > > > > > > >> > > > >> already using KafkaMetrics that don't need specific
> > Kafka
> > > > > > > metric
> > > > > > > >> > > > features,
> > > > > > > >> > > > >> so I don't see why we should refrain from using
> > > > > > > >> > > > >> Kafka metrics on the broker unless there are real
> > > > > > compatibility
> > > > > > > >> > > concerns
> > > > > > > >> > > > >> or
> > > > > > > >> > > > >> where implementation specifics could lead to
> > confusion
> > > > when
> > > > > > > >> > comparing
> > > > > > > >> > > > >> metrics using different implementations.
> > > > > > > >> > > > >>
> > > > > > > >> > > > >> In my opinion we should encourage people to use
> > > > KafkaMetrics
> > > > > > > >> going
> > > > > > > >> > > > forward
> > > > > > > > >> > > > >> on the broker as well, for three reasons:
> > > > > > > >> > > > >> a) yammer metrics is long deprecated and no longer
> > > > maintained
> > > > > > > >> > > > >> b) yammer metrics are much less expressive
> > > > > > > >> > > > >> c) we don't have a proper API to expose yammer
> > metrics
> > > > > > outside
> > > > > > > of
> > > > > > > >> > JMX
> > > > > > > >> > > > >> (MetricsReporter only exposes KafkaMetrics)
> > > > > > > >> > > > >>

Re: [DISCUSS] KIP-714: Client metrics and observability

Posted by Jun Rao <ju...@confluent.io.INVALID>.

Hi, Magnus, Kirk,

Thanks for the reply. A few more comments on your reply.

100. I agree there are some benefits to having a set of standard metrics
across all clients, but I am just wondering how practical it is, given that
the proposal doesn't make this set mandatory the way the Kafka protocol is.
100.1 A client may not implement some or all of the standard metrics. Then,
we won't have complete standardized names across clients.
100.2 The set of standard metrics needs to be common across all clients.
For example, client.consumer.poll.latency implies that all clients
implement a poll() interface. Is that true for all clients? Similarly, for
client.producer.record.queue.bytes: do all producers have queues? We
probably need to make a pass over those metrics to see if they are indeed
common across all clients. Also, a bunch of standard metrics have type
Histogram. The Java client doesn't have good Histogram support yet. I am
also not sure if all clients support Histogram. Should we avoid the
Histogram type in standardized metrics?
100.3 For a subset of metrics that are truly common across clients, it
would be confusing for each client to maintain two sets of metrics for the
same thing. We could document them, but that means every user of every
client needs to remember this mapping. This is a much bigger
inconvenience than standardizing the metric names on the server side. If we
want to go this route, my preference is to deprecate the existing metric
names that are covered by the standard metric names.

101. "or if the client-specific metrics are not converted to some common
form, name, semantic, etc, it'll make creating meaningful aggregations and
monitoring more complex in the upstream telemetry system with a scattered
plethora of custom metrics." There will always be client specific metrics.
So, it seems that we have to deal with scattered custom metrics even with a
set of standard metrics.

102. "However, in, let's say the Apache Kafka 3.7 release, the metric name
is changed to "connections.open.count." At this point, there are two names
and machine-to-machine communication will likely be affected. With that
change, all client telemetry plugin(s) used in an organization must be
updated to reflect that change, else data loss or bugs could be
introduced." The standard metric names could change too in the future,
right? So, we need to deal with a similar problem if that happens.

103. "Are there any non-obvious security/privacy-related edge cases where
shipping certain metrics to the broker would be "bad?"" I am not sure. But
if a metric can be shipped to the server, it would be useful for the same
metric to be visible on the client side.

Thanks,

Jun


On Tue, Jun 21, 2022 at 8:19 AM Kirk True <ki...@kirktrue.pro> wrote:

> Hi Jun,
>
> Thank you for all your continued interest in shaping the KIP :)
>
> On Thu, Jun 16, 2022, at 2:38 PM, Jun Rao wrote:
> > Hi, Kirk,
> >
> > Thanks for the reply. A couple of more comments.
> >
> > (1) "Another perspective is that these two sets of metrics serve
> > different purposes and/or have different audiences, which argues that they should
> > maintain their individuality and purpose. " Hmm, I am wondering if those
> > metrics are really for different audiences and purposes? For example, if
> > the operator detected an issue through a client metric collected through
> > the server, the operator may need to communicate that back to the client.
> > It would be weird if that same metric is not visible on the client side.
>
> I agree in principle that all client metrics visible on the client can
> also be made available to be sent to the broker.
>
> Are there any non-obvious security/privacy-related edge cases where shipping
> certain metrics to the broker would be "bad?"
>
> > (2) If we could standardize the names on the server side, do we need to
> > enforce a naming convention for all clients?
>
> "Enforce" is such an ugly word :P
>
> But yes, I do feel that a consistent naming convention across all clients
> provides communication benefits between two entities:
>
>  1. Human-to-human communication. Ecosystem-wide agreement and
> understanding of metrics helps all to communicate more efficiently.
>  2. Machine-to-machine communication. Defining the names via the KIP
> mechanism helps to ensure stability across releases of a given client.
>
> Point 1: Human-to-human Communication
>
> There are quite a few parties that must communicate effectively
> across the Kafka ecosystem. Here are the ones I can think of off the top of
> my head:
>
>  1. Kafka client authors
>  2. Kafka client users
>  3. Kafka client telemetry plugin authors
>  4. Support teams (within an organization or vendor-supplied across
> organizations)
>  5. Kafka cluster operators
>
> There should be a standard so that these parties can understand the
> metrics' meaning and be able to correlate that across all clients.
>
> As a concrete example, KIP-714 includes a metric for tracking the number
> of active client connections to a cluster, named
> "org.apache.kafka.client.connection.active." Given this name, all client
> implementations can communicate this name and its value to all parties
> consistently. Without a standard naming convention, the metric might be
> named "connections.open" in the Java client and "Connections/Alive" in
> librdkafka. This inconsistency of naming would impact the discussions
> between one or more of the parties involved.
>
> To your point, it's absolutely a design choice to keep the naming
> convention the same between each client. We can change that if it makes
> sense.
>
> Point 2: Machine-to-machine Communication
>
> Standardization at the client level provides stability through an implied
> contract that a client should not introduce a breaking name change between
> releases. Otherwise, the ability for the metrics to be "understood" in a
> machine-to-machine context would be forfeit.
>
> For example, let's say that we give the clients the latitude to name
> metrics as they wish. In this example, let's say that the Apache Kafka 3.4
> release decides to name this metric "connections.open." It's a good name!
> It says what it is. However, in, let's say the Apache Kafka 3.7 release,
> the metric name is changed to "connections.open.count." At this point,
> there are two names and machine-to-machine communication will likely be
> affected. With that change, all client telemetry plugin(s) used in an
> organization must be updated to reflect that change, else data loss or bugs
> could be introduced.
>
> That the KIP defines the names of the metrics does, admittedly, constrain
> the options of authors of the different clients. The metric named
> "org.apache.kafka.client.connection.active" may be confusing in some client
> implementations. For whatever reason, a client author may even find it
> "undesirable" to include a reference that includes "Apache" in their code.
>
> There's also the precedent set by the existing (JMX-based) client metrics.
> Though these are applicable only to the Java client, we can see that having
> a standardized naming convention there has helped with communication.
>
> So, IMO, it makes sense to define the metric names via the KIP mechanism
> and--let's say, "ask"--that client implementations abide by those.
>
> Thanks,
> Kirk
>
> >
> > Thanks,
> >
> > Jun
> >
> > On Thu, Jun 16, 2022 at 12:00 PM Kirk True <ki...@kirktrue.pro> wrote:
> >
> > > Hi Jun,
> > >
> > > I'll try to answer the questions posed...
> > >
> > > On Tue, Jun 7, 2022, at 4:32 PM, Jun Rao wrote:
> > > > Hi, Magnus,
> > > >
> > > > Thanks for the reply.
> > > >
> > > > So, the standard set of generic metrics is just a recommendation and
> > > > not a requirement? This sounds good to me since it makes the adoption
> > > > of the KIP easier.
> > >
> > > I believe that was the intent, yes.
> > >
> > > > Regarding the metric names, I have two concerns.
> > >
> > > (I'm splitting these two up for readability...)
> > >
> > > > (1) If a client already
> > > > has an existing metric similar to the standard one, duplicating the
> > > > metric seems to be confusing.
> > >
> > > Agreed. I'm dealing with that situation as I write the Java client
> > > implementation.
> > >
> > > The existing Java client exposes a set of metrics via JMX. The updated
> > > Java client will introduce a second set of metrics, which instead are
> > > exposed via sending them to the broker. There is substantial overlap
> > > between the two sets of metrics, and in a few places in the code under
> > > development, there are essentially two separate calls to update
> > > metrics: one for the JMX-bound metrics and one for the broker-bound
> > > metrics.
> > >
> > > To be candid, I have gone back-and-forth on that design. From one
> > > perspective, it could be argued that the set of client metrics should
> > > be standardized across a given client, regardless of how those metrics
> > > are exposed for consumption. Another perspective is that these two sets
> > > of metrics serve different purposes and/or have different audiences,
> > > which argues that they should maintain their individuality and purpose.
> > > Your inputs/suggestions are certainly welcome!
> > >
> > > > (2) If a client needs to implement a standard metric
> > > > that doesn't exist yet, using a naming convention (e.g., using dash
> > > > vs dot) different from other existing metrics also seems a bit
> > > > confusing. It seems that the main benefit of having standard metric
> > > > names across clients is for better server side monitoring. Could we
> > > > do the standardization in the plugin on the server?
> > >
> > > I think the expectation is that the plugin implementation will perform
> > > transformation of metric names, if needed, to fit in with an
> > > organization's monitoring naming standards. Perhaps we need to call
> > > that out in the KIP itself.
> > >
> > > Thanks,
> > > Kirk
> > >
> > > >
> > > > Thanks,
> > > >
> > > > Jun
> > > >
> > > >
> > > >
> > > > On Tue, Jun 7, 2022 at 6:53 AM Magnus Edenhill <ma...@edenhill.se>
> > > wrote:
> > > >
> > > > > Hey Jun,
> > > > >
> > > > > I've clarified the scope of the standard metrics in the KIP, but
> > > basically:
> > > > >
> > > > >  * We define a standard set of generic metrics that should be
> relevant
> > > to
> > > > > most client implementations, e.g., each producer implementation
> > > probably
> > > > > has some sort of per-partition message queue.
> > > > >  * A client implementation should strive to implement as many of
> the
> > > > > standard metrics as possible, but only the ones that make sense.
> > > > >  * For metrics that are not in the standard set, a client
> maintainer
> > > can
> > > > > choose to either submit a KIP to add additional standard metrics -
> if
> > > > > they're relevant, or go ahead and add custom metrics that are
> specific
> > > to
> > > > > that client implementation. These custom metrics will have a prefix
> > > > > specific to that client implementation, as opposed to the standard
> > > metric
> > > > > set that resides under "org.apache.kafka...". E.g.,
> > > > > "se.edenhill.librdkafka" or whatever.
> > > > >  * Existing non-KIP-714 metrics should remain untouched. In some
> cases
> > > we
> > > > > might be able to use the same meter given it is compatible with the
> > > > > standard metric set definition, in other cases a semi-duplicate
> meter
> > > may
> > > > > be needed. Thus this will not affect the metrics exposed through
> JMX,
> > > or
> > > > > vice versa.
> > > > >
> > > > > Thanks,
> > > > > Magnus
> > > > >
> > > > >
> > > > >
> > > > > Den ons 1 juni 2022 kl 18:55 skrev Jun Rao
> <ju...@confluent.io.invalid>:
> > > > >
> > > > > > Hi, Magnus,
> > > > > >
> > > > > > 51. Just to clarify my question.  (1) Are standard metrics
> required
> > > for
> > > > > > every client for this KIP to function?  (2) Are we converting
> > > existing
> > > > > java
> > > > > > metrics to the standard metrics and deprecating the old ones? If
> so,
> > > > > could
> > > > > > we list all existing java metrics that need to be renamed and the
> > > > > > corresponding new name?
> > > > > >
> > > > > > Thanks,
> > > > > >
> > > > > > Jun
> > > > > >
> > > > > > On Tue, May 31, 2022 at 3:29 PM Jun Rao <ju...@confluent.io>
> wrote:
> > > > > >
> > > > > > > Hi, Magnus,
> > > > > > >
> > > > > > > Thanks for the reply.
> > > > > > >
> > > > > > > 51. I think it's fine to have a list of recommended metrics for
> > > every
> > > > > > > client to implement. I am just not sure that standardizing on
> the
> > > > > metric
> > > > > > > names across all clients is practical. The list of common
> metrics
> > > in
> > > > > the
> > > > > > > KIP have completely different names from the java metric names.
> > > Some of
> > > > > > > them have different types. For example, some of the common
> metrics
> > > > > have a
> > > > > > > type of histogram, but the java client metrics don't use
> histogram
> > > in
> > > > > > > general. Requiring the operator to translate those names and
> > > understand
> > > > > > the
> > > > > > > subtle differences across clients seem to cause more confusion
> > > during
> > > > > > > troubleshooting.
> > > > > > >
> > > > > > > Thanks,
> > > > > > >
> > > > > > > Jun
> > > > > > >
> > > > > > > On Tue, May 31, 2022 at 5:02 AM Magnus Edenhill <
> > > magnus@edenhill.se>
> > > > > > > wrote:
> > > > > > >
> > > > > > >> Den fre 20 maj 2022 kl 01:23 skrev Jun Rao
> > > <jun@confluent.io.invalid
> > > > > >:
> > > > > > >>
> > > > > > > >> > Hi, Magnus,
> > > > > > >> >
> > > > > > >> > Thanks for the reply.
> > > > > > >> >
> > > > > > >> > 50. Sounds good.
> > > > > > >> >
> > > > > > > >> > 51. I misunderstood the proposal in the KIP then. The
> > > proposal is
> > > > > to
> > > > > > >> > define a set of common metric names that every client should
> > > > > > implement.
> > > > > > >> The
> > > > > > >> > problem is that every client already has its own set of
> metrics
> > > with
> > > > > > its
> > > > > > >> > own names. I am not sure that we could easily agree upon a
> > > common
> > > > > set
> > > > > > of
> > > > > > >> > metrics that work with all clients. There are likely to be
> some
> > > > > > metrics
> > > > > > >> > that are client specific. Translating between the common
> name
> > > and
> > > > > > client
> > > > > > >> > specific name is probably going to add more confusion. As
> > > mentioned
> > > > > in
> > > > > > >> the
> > > > > > >> > KIP, similar metrics from different clients could have
> subtle
> > > > > > >> > semantic differences. Could we just let each client use its
> own
> > > set
> > > > > of
> > > > > > >> > metric names?
> > > > > > >> >
> > > > > > >>
> > > > > > >> We identified a common set of metrics that should be relevant
> for
> > > most
> > > > > > >> client implementations,
> > > > > > >> they're the ones listed in the KIP.
> > > > > > >> A supporting client does not have to implement all those
> metrics,
> > > only
> > > > > > the
> > > > > > >> ones that makes sense
> > > > > > >> based on that client implementation, and a client may
> implement
> > > other
> > > > > > >> metrics that are not listed
> > > > > > >> in the KIP under its own namespace.
> > > > > > >> This approach has two benefits:
> > > > > > >>  - there will be a common set of metrics that most/all clients
> > > > > > implement,
> > > > > > >> which makes monitoring
> > > > > > >>   and troubleshooting easier across fleets with multiple Kafka
> > > client
> > > > > > >> languages/implementations.
> > > > > > >>  - client-specific metrics are still possible, so if there is
> no
> > > > > > suitable
> > > > > > >> standard metric a client can still
> > > > > > >>    provide what special metrics it has.
> > > > > > >>
> > > > > > >>
> > > > > > >> Thanks,
> > > > > > >> Magnus
> > > > > > >>
> > > > > > >>
> > > > > > >> On Thu, May 19, 2022 at 10:39 AM Magnus Edenhill <
> > > magnus@edenhill.se>
> > > > > > >> wrote:
> > > > > > >> >
> > > > > > >> > > Den ons 18 maj 2022 kl 19:57 skrev Jun Rao
> > > > > <jun@confluent.io.invalid
> > > > > > >> >:
> > > > > > >> > >
> > > > > > >> > > > Hi, Magnus,
> > > > > > >> > > >
> > > > > > >> > >
> > > > > > >> > > Hi Jun
> > > > > > >> > >
> > > > > > >> > >
> > > > > > >> > > >
> > > > > > >> > > > Thanks for the updated KIP. Just a couple of more
> comments.
> > > > > > >> > > >
> > > > > > >> > > > 50. To troubleshoot a particular client issue, I imagine
> > > that
> > > > > the
> > > > > > >> > client
> > > > > > >> > > > needs to identify its client_instance_id. How does the
> > > client
> > > > > find
> > > > > > >> this
> > > > > > >> > > > out? Do we plan to include client_instance_id in the
> client
> > > log,
> > > > > > >> expose
> > > > > > >> > > it
> > > > > > >> > > > as a metric or something else?
> > > > > > >> > > >
> > > > > > >> > >
> > > > > > >> > > The KIP suggests that client implementations emit an
> > > informative
> > > > > log
> > > > > > >> > > message
> > > > > > >> > > with the assigned client-instance-id once it is retrieved
> > > (once
> > > > > per
> > > > > > >> > client
> > > > > > >> > > instance lifetime).
> > > > > > >> > > There's also a clientInstanceId() method that an
> application
> > > can
> > > > > use
> > > > > > >> to
> > > > > > >> > > retrieve
> > > > > > >> > > the client instance id and emit through whatever side
> channels
> > > > > makes
> > > > > > >> > sense.
> > > > > > >> > >
> > > > > > >> > >
> > > > > > >> > >
> > > > > > >> > > > 51. The KIP lists a bunch of metrics that need to be
> > > collected
> > > > > at
> > > > > > >> the
> > > > > > >> > > > client side. However, it seems quite a few useful java
> > > client
> > > > > > >> metrics
> > > > > > >> > > like
> > > > > > >> > > > the following are missing.
> > > > > > >> > > >     buffer-total-bytes
> > > > > > >> > > >     buffer-available-bytes
> > > > > > >> > > >
> > > > > > >> > >
> > > > > > >> > > These are covered by client.producer.record.queue.bytes
> and
> > > > > > >> > > client.producer.record.queue.max.bytes.
> > > > > > >> > >
> > > > > > >> > >
> > > > > > >> > > >     bufferpool-wait-time
> > > > > > >> > > >
> > > > > > >> > >
> > > > > > >> > > Missing, but somewhat implementation specific.
> > > > > > >> > > If it was up to me we would add this later if there's a
> need.
> > > > > > >> > >
> > > > > > >> > >
> > > > > > >> > >
> > > > > > >> > > >     batch-size-avg
> > > > > > >> > > >     batch-size-max
> > > > > > >> > > >
> > > > > > >> > >
> > > > > > >> > > These are missing and would be suitably represented as a
> > > > > histogram.
> > > > > > >> I'll
> > > > > > >> > > add them.
> > > > > > >> > >
> > > > > > >> > >
> > > > > > >> > >
> > > > > > >> > > >     io-wait-ratio
> > > > > > >> > > >     io-ratio
> > > > > > >> > > >
> > > > > > >> > >
> > > > > > >> > > There's client.io.wait.time which should cover
> io-wait-ratio.
> > > > > > >> > > We could add a client.io.time as well, now or in a later
> KIP.
> > > > > > >> > >
> > > > > > >> > > Thanks,
> > > > > > >> > > Magnus
> > > > > > >> > >
> > > > > > >> > >
> > > > > > >> > >
> > > > > > >> > >
> > > > > > >> > > >
> > > > > > >> > > > Thanks,
> > > > > > >> > > >
> > > > > > >> > > > Jun
> > > > > > >> > > >
> > > > > > >> > > > On Mon, Apr 4, 2022 at 10:01 AM Jun Rao <
> jun@confluent.io>
> > > > > wrote:
> > > > > > >> > > >
> > > > > > >> > > > > Hi, Xavier,
> > > > > > >> > > > >
> > > > > > >> > > > > Thanks for the reply.
> > > > > > >> > > > >
> > > > > > >> > > > > 28. It does seem that we have started using
> KafkaMetrics
> > > on
> > > > > the
> > > > > > >> > broker
> > > > > > >> > > > > side. Then, my only concern is on the usage of
> Histogram
> > > in
> > > > > > >> > > KafkaMetrics.
> > > > > > >> > > > > Histogram in KafkaMetrics statically divides the value
> > > space
> > > > > > into
> > > > > > >> a
> > > > > > >> > > fixed
> > > > > > >> > > > > number of buckets and only returns values on the
> bucket
> > > > > > boundary.
> > > > > > >> So,
> > > > > > >> > > the
> > > > > > >> > > > > returned histogram value may never show up in a
> recorded
> > > > > value.
> > > > > > >> > Yammer
> > > > > > >> > > > > Histogram, on the other hand, uses reservoir
> sampling. The
> > > > > > >> reported
> > > > > > >> > > value
> > > > > > >> > > > > is always one of the recorded values. So, I am not
> sure
> > > that
> > > > > > >> > Histogram
> > > > > > >> > > in
> > > > > > >> > > > > KafkaMetrics is as good as Yammer Histogram.
> > > > > > >> > > > ClientMetricsPluginExportTime
> > > > > > >> > > > > uses Histogram.
> > > > > > >> > > > >
> > > > > > >> > > > > Thanks,
> > > > > > >> > > > >
> > > > > > >> > > > > Jun
> > > > > > >> > > > >
> > > > > > >> > > > > On Thu, Mar 31, 2022 at 5:21 PM Xavier Léauté
> > > > > > >> > > > <xa...@confluent.io.invalid>
> > > > > > >> > > > > wrote:
> > > > > > >> > > > >
> > > > > > >> > > > >> >
> > > > > > >> > > > >> > 28. On the broker, we typically use Yammer metrics.
> > > Only
> > > > > for
> > > > > > >> > metrics
> > > > > > >> > > > >> that
> > > > > > >> > > > >> > depend on Kafka metric features (e.g., quota), we
> use
> > > the
> > > > > > Kafka
> > > > > > >> > > > metric.
> > > > > > >> > > > >> > Yammer metrics have 4 types: gauge, meter,
> histogram
> > > and
> > > > > > timer.
> > > > > > >> > > meter
> > > > > > >> > > > >> > calculates a rate, but also exposes an accumulated
> > > value.
> > > > > > >> > > > >> >
> > > > > > >> > > > >>
> > > > > > >> > > > >> I don't see a good reason we should limit ourselves
> to
> > > Yammer
> > > > > > >> > metrics
> > > > > > >> > > on
> > > > > > >> > > > >> the broker. KafkaMetrics was written
> > > > > > >> > > > >> to replace Yammer metrics and is used for all new
> > > components
> > > > > > >> > (clients,
> > > > > > >> > > > >> streams, connect, etc.)
> > > > > > >> > > > >> My understanding is that the original goal was to
> retire
> > > > > Yammer
> > > > > > >> > > metrics
> > > > > > >> > > > in
> > > > > > >> > > > >> the broker in favor of KafkaMetrics.
> > > > > > >> > > > >> We just haven't done so out of backwards
> compatibility
> > > > > > concerns.
> > > > > > >> > > > >> There are other broker metrics such as group
> coordinator,
> > > > > > >> > transaction
> > > > > > >> > > > >> state
> > > > > > >> > > > >> manager, and various socket server metrics
> > > > > > >> > > > >> already using KafkaMetrics that don't need specific
> Kafka
> > > > > > metric
> > > > > > >> > > > features,
> > > > > > >> > > > >> so I don't see why we should refrain from using
> > > > > > >> > > > >> Kafka metrics on the broker unless there are real
> > > > > compatibility
> > > > > > >> > > concerns
> > > > > > >> > > > >> or
> > > > > > >> > > > >> where implementation specifics could lead to
> confusion
> > > when
> > > > > > >> > comparing
> > > > > > >> > > > >> metrics using different implementations.
> > > > > > >> > > > >>
> > > > > > >> > > > >> In my opinion we should encourage people to use
> > > KafkaMetrics
> > > > > > >> going
> > > > > > >> > > > forward
> > > > > > > >> > > > >> on the broker as well, for three reasons:
> > > > > > >> > > > >> a) yammer metrics is long deprecated and no longer
> > > maintained
> > > > > > >> > > > >> b) yammer metrics are much less expressive
> > > > > > >> > > > >> c) we don't have a proper API to expose yammer
> metrics
> > > > > outside
> > > > > > of
> > > > > > >> > JMX
> > > > > > >> > > > >> (MetricsReporter only exposes KafkaMetrics)
> > > > > > >> > > > >>
> > > > > > >> > > > >
> > > > > > >> > > >
> > > > > > >> > >
> > > > > > >> >
> > > > > > >>
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] KIP-714: Client metrics and observability

Posted by Kirk True <ki...@kirktrue.pro>.

Hi Jun,

Thank you for all your continued interest in shaping the KIP :)

On Thu, Jun 16, 2022, at 2:38 PM, Jun Rao wrote:
> Hi, Kirk,
> 
> Thanks for the reply. A couple of more comments.
> 
> (1) "Another perspective is that these two sets of metrics serve different
> purposes and/or have different audiences, which argues that they should
> maintain their individuality and purpose. " Hmm, I am wondering if those
> metrics are really for different audiences and purposes? For example, if
> the operator detected an issue through a client metric collected through
> the server, the operator may need to communicate that back to the client.
> It would be weird if that same metric is not visible on the client side.

I agree in principle that all client metrics visible on the client can also be made available to be sent to the broker.

Are there any non-obvious security/privacy-related edge cases where shipping certain metrics to the broker would be "bad?"

> (2) If we could standardize the names on the server side, do we need to
> enforce a naming convention for all clients?

"Enforce" is such an ugly word :P

But yes, I do feel that a consistent naming convention across all clients provides communication benefits between two entities:

 1. Human-to-human communication. Ecosystem-wide agreement and understanding of metrics helps all to communicate more efficiently.
 2. Machine-to-machine communication. Defining the names via the KIP mechanism helps to ensure stability across releases of a given client.

Point 1: Human-to-human Communication

There are quite a few parties that must communicate effectively across the Kafka ecosystem. Here are the ones I can think of off the top of my head:

 1. Kafka client authors
 2. Kafka client users
 3. Kafka client telemetry plugin authors
 4. Support teams (within an organization or vendor-supplied across organizations)
 5. Kafka cluster operators

There should be a standard so that these parties can understand the metrics' meaning and be able to correlate that across all clients.

As a concrete example, KIP-714 includes a metric for tracking the number of active client connections to a cluster, named "org.apache.kafka.client.connection.active." Given this name, all client implementations can communicate this name and its value to all parties consistently. Without a standard naming convention, the metric might be named "connections.open" in the Java client and "Connections/Alive" in librdkafka. This inconsistency of naming would impact the discussions between one or more of the parties involved.
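
To illustrate what the alternative would look like, here is a minimal sketch of the translation table that a server-side plugin (or a human) would otherwise have to maintain. The three metric names are the ones from the example above; the table layout, the client labels and the normalize() helper are hypothetical and not part of the KIP.

```python
# Hypothetical lookup table: (client implementation, native metric name)
# -> standard name. The three names come from the example above; the table
# layout and the normalize() helper are assumptions for illustration only.
STANDARD_NAME = "org.apache.kafka.client.connection.active"

NATIVE_TO_STANDARD = {
    ("java", "connections.open"): STANDARD_NAME,
    ("librdkafka", "Connections/Alive"): STANDARD_NAME,
}


def normalize(client, metric_name):
    """Return the standard name for a client-native metric name.

    Names that are already standard, or that have no known mapping, are
    returned unchanged so client-specific metrics still pass through.
    """
    return NATIVE_TO_STANDARD.get((client, metric_name), metric_name)


if __name__ == "__main__":
    samples = [
        ("java", "connections.open"),
        ("librdkafka", "Connections/Alive"),
        ("librdkafka", "some.custom.metric"),
    ]
    for client, name in samples:
        print(f"{client}: {name} -> {normalize(client, name)}")
```

Every new client implementation would add rows to such a table, which is the bookkeeping a shared naming convention is meant to avoid.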

To your point, it's absolutely a design choice to keep the naming convention the same between each client. We can change that if it makes sense.

Point 2: Machine-to-machine Communication

Standardization at the client level provides stability through an implied contract that a client should not introduce a breaking name change between releases. Otherwise, the ability for the metrics to be "understood" in a machine-to-machine context would be forfeit.

For example, let's say that we give the clients the latitude to name metrics as they wish. In this example, let's say that the Apache Kafka 3.4 release decides to name this metric "connections.open." It's a good name! It says what it is. However, in, let's say the Apache Kafka 3.7 release, the metric name is changed to "connections.open.count." At this point, there are two names and machine-to-machine communication will likely be affected. With that change, all client telemetry plugin(s) used in an organization must be updated to reflect that change, else data loss or bugs could be introduced.
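
A rough sketch of how that could go wrong, assuming a plugin that filters on a hard-coded set of metric names. The two metric names are the hypothetical ones from the example above; the function names and the payload shape are made up for illustration and do not reflect any real plugin API.

```python
# Hypothetical plugin-side filter. The plugin was written against the
# (made-up) 3.4 name and knows nothing about the (made-up) 3.7 rename.
KNOWN_NAMES = frozenset({"connections.open"})


def export(payload, known=KNOWN_NAMES):
    """Forward only the metrics whose names the plugin recognizes."""
    return {name: value for name, value in payload.items() if name in known}


def unrecognized(payload, known=KNOWN_NAMES):
    """List the metric names the plugin would silently drop."""
    return sorted(name for name in payload if name not in known)


if __name__ == "__main__":
    before_rename = {"connections.open": 12}
    after_rename = {"connections.open.count": 12}

    print(export(before_rename))       # the metric is forwarded
    print(export(after_rename))        # nothing is forwarded
    print(unrecognized(after_rename))  # the renamed metric is what got lost
```

Nothing fails loudly here: the renamed metric simply stops arriving upstream until someone updates the plugin's name list.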

That the KIP defines the names of the metrics does, admittedly, constrain the options of authors of the different clients. The metric named "org.apache.kafka.client.connection.active" may be confusing in some client implementations. For whatever reason, a client author may even find it "undesirable" to include a reference that includes "Apache" in their code.

There's also the precedent set by the existing (JMX-based) client metrics. Though these are applicable only to the Java client, we can see that having a standardized naming convention there has helped with communication.

So, IMO, it makes sense to define the metric names via the KIP mechanism and--let's say, "ask"--that client implementations abide by those.

Thanks,
Kirk

> 
> Thanks,
> 
> Jun
> 
> On Thu, Jun 16, 2022 at 12:00 PM Kirk True <ki...@kirktrue.pro> wrote:
> 
> > Hi Jun,
> >
> > I'll try to answer the questions posed...
> >
> > On Tue, Jun 7, 2022, at 4:32 PM, Jun Rao wrote:
> > > Hi, Magnus,
> > >
> > > Thanks for the reply.
> > >
> > > So, the standard set of generic metrics is just a recommendation and not
> > > a requirement? This sounds good to me since it makes the adoption of the
> > > KIP easier.
> >
> > I believe that was the intent, yes.
> >
> > > Regarding the metric names, I have two concerns.
> >
> > (I'm splitting these two up for readability...)
> >
> > > (1) If a client already
> > > has an existing metric similar to the standard one, duplicating the
> > > metric seems to be confusing.
> >
> > Agreed. I'm dealing with that situation as I write the Java client
> > implementation.
> >
> > The existing Java client exposes a set of metrics via JMX. The updated
> > Java client will introduce a second set of metrics, which instead are
> > exposed via sending them to the broker. There is substantial overlap between
> > the two sets of metrics, and in a few places in the code under development,
> > there are essentially two separate calls to update metrics: one for the
> > JMX-bound metrics and one for the broker-bound metrics.
> >
> > To be candid, I have gone back-and-forth on that design. From one
> > perspective, it could be argued that the set of client metrics should be
> > standardized across a given client, regardless of how those metrics are
> > exposed for consumption. Another perspective is that these two sets of
> > metrics serve different purposes and/or have different audiences, which
> > argues that they should maintain their individuality and purpose. Your
> > inputs/suggestions are certainly welcome!
> >
> > > (2) If a client needs to implement a standard metric
> > > that doesn't exist yet, using a naming convention (e.g., using dash vs
> > > dot) different from other existing metrics also seems a bit confusing.
> > > It seems that the main benefit of having standard metric names across
> > > clients is for better server side monitoring. Could we do the
> > > standardization in the plugin on the server?
> >
> > I think the expectation is that the plugin implementation will perform
> > transformation of metric names, if needed, to fit in with an organization's
> > monitoring naming standards. Perhaps we need to call that out in the KIP
> > itself.
> >
> > Thanks,
> > Kirk
> >
> > >
> > > Thanks,
> > >
> > > Jun
> > >
> > >
> > >
> > > On Tue, Jun 7, 2022 at 6:53 AM Magnus Edenhill <ma...@edenhill.se>
> > wrote:
> > >
> > > > Hey Jun,
> > > >
> > > > I've clarified the scope of the standard metrics in the KIP, but
> > basically:
> > > >
> > > >  * We define a standard set of generic metrics that should be relevant
> > to
> > > > most client implementations, e.g., each producer implementation
> > probably
> > > > has some sort of per-partition message queue.
> > > >  * A client implementation should strive to implement as many of the
> > > > standard metrics as possible, but only the ones that make sense.
> > > >  * For metrics that are not in the standard set, a client maintainer
> > can
> > > > choose to either submit a KIP to add additional standard metrics - if
> > > > they're relevant, or go ahead and add custom metrics that are specific
> > to
> > > > that client implementation. These custom metrics will have a prefix
> > > > specific to that client implementation, as opposed to the standard
> > metric
> > > > set that resides under "org.apache.kafka...". E.g.,
> > > > "se.edenhill.librdkafka" or whatever.
> > > >  * Existing non-KIP-714 metrics should remain untouched. In some cases
> > we
> > > > might be able to use the same meter given it is compatible with the
> > > > standard metric set definition, in other cases a semi-duplicate meter
> > may
> > > > be needed. Thus this will not affect the metrics exposed through JMX,
> > or
> > > > vice versa.
> > > >
> > > > Thanks,
> > > > Magnus
> > > >
> > > >
> > > >
> > > > > On Wed, Jun 1, 2022 at 18:55, Jun Rao <ju...@confluent.io.invalid> wrote:
> > > >
> > > > > Hi, Magnus,
> > > > >
> > > > > 51. Just to clarify my question.  (1) Are standard metrics required
> > for
> > > > > every client for this KIP to function?  (2) Are we converting
> > existing
> > > > java
> > > > > metrics to the standard metrics and deprecating the old ones? If so,
> > > > could
> > > > > we list all existing java metrics that need to be renamed and the
> > > > > corresponding new name?
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Jun
> > > > >
> > > > > On Tue, May 31, 2022 at 3:29 PM Jun Rao <ju...@confluent.io> wrote:
> > > > >
> > > > > > Hi, Magnus,
> > > > > >
> > > > > > Thanks for the reply.
> > > > > >
> > > > > > 51. I think it's fine to have a list of recommended metrics for
> > every
> > > > > > client to implement. I am just not sure that standardizing on the
> > > > metric
> > > > > > names across all clients is practical. The list of common metrics
> > in
> > > > the
> > > > > > KIP have completely different names from the java metric names.
> > Some of
> > > > > > them have different types. For example, some of the common metrics
> > > > have a
> > > > > > type of histogram, but the java client metrics don't use histogram
> > in
> > > > > > general. Requiring the operator to translate those names and
> > understand
> > > > > the
> > > > > > subtle differences across clients seem to cause more confusion
> > during
> > > > > > troubleshooting.
> > > > > >
> > > > > > Thanks,
> > > > > >
> > > > > > Jun
> > > > > >
> > > > > > On Tue, May 31, 2022 at 5:02 AM Magnus Edenhill <
> > magnus@edenhill.se>
> > > > > > wrote:
> > > > > >
> > > > > > >> On Fri, May 20, 2022 at 01:23, Jun Rao <jun@confluent.io.invalid> wrote:
> > > > > >>
> > > > > > >> > Hi, Magnus,
> > > > > >> >
> > > > > >> > Thanks for the reply.
> > > > > >> >
> > > > > >> > 50. Sounds good.
> > > > > >> >
> > > > > > >> > 51. I misunderstood the proposal in the KIP then. The
> > proposal is
> > > > to
> > > > > >> > define a set of common metric names that every client should
> > > > > implement.
> > > > > >> The
> > > > > >> > problem is that every client already has its own set of metrics
> > with
> > > > > its
> > > > > >> > own names. I am not sure that we could easily agree upon a
> > common
> > > > set
> > > > > of
> > > > > >> > metrics that work with all clients. There are likely to be some
> > > > > metrics
> > > > > >> > that are client specific. Translating between the common name
> > and
> > > > > client
> > > > > >> > specific name is probably going to add more confusion. As
> > mentioned
> > > > in
> > > > > >> the
> > > > > >> > KIP, similar metrics from different clients could have subtle
> > > > > >> > semantic differences. Could we just let each client use its own
> > set
> > > > of
> > > > > >> > metric names?
> > > > > >> >
> > > > > >>
> > > > > >> We identified a common set of metrics that should be relevant for
> > most
> > > > > >> client implementations,
> > > > > >> they're the ones listed in the KIP.
> > > > > >> A supporting client does not have to implement all those metrics,
> > only
> > > > > the
> > > > > > >> ones that make sense
> > > > > >> based on that client implementation, and a client may implement
> > other
> > > > > >> metrics that are not listed
> > > > > >> in the KIP under its own namespace.
> > > > > >> This approach has two benefits:
> > > > > >>  - there will be a common set of metrics that most/all clients
> > > > > implement,
> > > > > >> which makes monitoring
> > > > > >>   and troubleshooting easier across fleets with multiple Kafka
> > client
> > > > > >> languages/implementations.
> > > > > >>  - client-specific metrics are still possible, so if there is no
> > > > > suitable
> > > > > >> standard metric a client can still
> > > > > >>    provide what special metrics it has.
> > > > > >>
> > > > > >>
> > > > > >> Thanks,
> > > > > >> Magnus
> > > > > >>
> > > > > >>
> > > > > >> On Thu, May 19, 2022 at 10:39 AM Magnus Edenhill <
> > magnus@edenhill.se>
> > > > > >> wrote:
> > > > > >> >
> > > > > > >> > > On Wed, May 18, 2022 at 19:57, Jun Rao <jun@confluent.io.invalid> wrote:
> > > > > >> > >
> > > > > >> > > > Hi, Magnus,
> > > > > >> > > >
> > > > > >> > >
> > > > > >> > > Hi Jun
> > > > > >> > >
> > > > > >> > >
> > > > > >> > > >
> > > > > >> > > > Thanks for the updated KIP. Just a couple of more comments.
> > > > > >> > > >
> > > > > >> > > > 50. To troubleshoot a particular client issue, I imagine
> > that
> > > > the
> > > > > >> > client
> > > > > >> > > > needs to identify its client_instance_id. How does the
> > client
> > > > find
> > > > > >> this
> > > > > >> > > > out? Do we plan to include client_instance_id in the client
> > log,
> > > > > >> expose
> > > > > >> > > it
> > > > > >> > > > as a metric or something else?
> > > > > >> > > >
> > > > > >> > >
> > > > > >> > > The KIP suggests that client implementations emit an
> > informative
> > > > log
> > > > > >> > > message
> > > > > >> > > with the assigned client-instance-id once it is retrieved
> > (once
> > > > per
> > > > > >> > client
> > > > > >> > > instance lifetime).
> > > > > >> > > There's also a clientInstanceId() method that an application
> > can
> > > > use
> > > > > >> to
> > > > > >> > > retrieve
> > > > > >> > > the client instance id and emit through whatever side channels
> > > > makes
> > > > > >> > sense.
> > > > > >> > >
> > > > > >> > >
> > > > > >> > >
> > > > > >> > > > 51. The KIP lists a bunch of metrics that need to be
> > collected
> > > > at
> > > > > >> the
> > > > > >> > > > client side. However, it seems quite a few useful java
> > client
> > > > > >> metrics
> > > > > >> > > like
> > > > > >> > > > the following are missing.
> > > > > >> > > >     buffer-total-bytes
> > > > > >> > > >     buffer-available-bytes
> > > > > >> > > >
> > > > > >> > >
> > > > > >> > > These are covered by client.producer.record.queue.bytes and
> > > > > >> > > client.producer.record.queue.max.bytes.
> > > > > >> > >
> > > > > >> > >
> > > > > >> > > >     bufferpool-wait-time
> > > > > >> > > >
> > > > > >> > >
> > > > > >> > > Missing, but somewhat implementation specific.
> > > > > >> > > If it was up to me we would add this later if there's a need.
> > > > > >> > >
> > > > > >> > >
> > > > > >> > >
> > > > > >> > > >     batch-size-avg
> > > > > >> > > >     batch-size-max
> > > > > >> > > >
> > > > > >> > >
> > > > > >> > > These are missing and would be suitably represented as a
> > > > histogram.
> > > > > >> I'll
> > > > > >> > > add them.
> > > > > >> > >
> > > > > >> > >
> > > > > >> > >
> > > > > >> > > >     io-wait-ratio
> > > > > >> > > >     io-ratio
> > > > > >> > > >
> > > > > >> > >
> > > > > >> > > There's client.io.wait.time which should cover io-wait-ratio.
> > > > > >> > > We could add a client.io.time as well, now or in a later KIP.
> > > > > >> > >
> > > > > >> > > Thanks,
> > > > > >> > > Magnus
> > > > > >> > >
> > > > > >> > >
> > > > > >> > >
> > > > > >> > >
> > > > > >> > > >
> > > > > >> > > > Thanks,
> > > > > >> > > >
> > > > > >> > > > Jun
> > > > > >> > > >
> > > > > >> > > > On Mon, Apr 4, 2022 at 10:01 AM Jun Rao <ju...@confluent.io>
> > > > wrote:
> > > > > >> > > >
> > > > > >> > > > > Hi, Xavier,
> > > > > >> > > > >
> > > > > >> > > > > Thanks for the reply.
> > > > > >> > > > >
> > > > > >> > > > > 28. It does seem that we have started using KafkaMetrics
> > on
> > > > the
> > > > > >> > broker
> > > > > >> > > > > side. Then, my only concern is on the usage of Histogram
> > in
> > > > > >> > > KafkaMetrics.
> > > > > >> > > > > Histogram in KafkaMetrics statically divides the value
> > space
> > > > > into
> > > > > >> a
> > > > > >> > > fixed
> > > > > >> > > > > number of buckets and only returns values on the bucket
> > > > > boundary.
> > > > > >> So,
> > > > > >> > > the
> > > > > >> > > > > returned histogram value may never show up in a recorded
> > > > value.
> > > > > >> > Yammer
> > > > > >> > > > > Histogram, on the other hand, uses reservoir sampling. The
> > > > > >> reported
> > > > > >> > > value
> > > > > >> > > > > is always one of the recorded values. So, I am not sure
> > that
> > > > > >> > Histogram
> > > > > >> > > in
> > > > > >> > > > > KafkaMetrics is as good as Yammer Histogram.
> > > > > >> > > > ClientMetricsPluginExportTime
> > > > > >> > > > > uses Histogram.
> > > > > >> > > > >
> > > > > >> > > > > Thanks,
> > > > > >> > > > >
> > > > > >> > > > > Jun
> > > > > >> > > > >
> > > > > >> > > > > On Thu, Mar 31, 2022 at 5:21 PM Xavier Léauté
> > > > > >> > > > <xa...@confluent.io.invalid>
> > > > > >> > > > > wrote:
> > > > > >> > > > >
> > > > > >> > > > >> >
> > > > > >> > > > >> > 28. On the broker, we typically use Yammer metrics.
> > Only
> > > > for
> > > > > >> > metrics
> > > > > >> > > > >> that
> > > > > >> > > > >> > depend on Kafka metric features (e.g., quota), we use
> > the
> > > > > Kafka
> > > > > >> > > > metric.
> > > > > >> > > > >> > Yammer metrics have 4 types: gauge, meter, histogram
> > and
> > > > > timer.
> > > > > >> > > meter
> > > > > >> > > > >> > calculates a rate, but also exposes an accumulated
> > value.
> > > > > >> > > > >> >
> > > > > >> > > > >>
> > > > > >> > > > >> I don't see a good reason we should limit ourselves to
> > Yammer
> > > > > >> > metrics
> > > > > >> > > on
> > > > > >> > > > >> the broker. KafkaMetrics was written
> > > > > >> > > > >> to replace Yammer metrics and is used for all new
> > components
> > > > > >> > (clients,
> > > > > >> > > > >> streams, connect, etc.)
> > > > > >> > > > >> My understanding is that the original goal was to retire
> > > > Yammer
> > > > > >> > > metrics
> > > > > >> > > > in
> > > > > >> > > > >> the broker in favor of KafkaMetrics.
> > > > > >> > > > >> We just haven't done so out of backwards compatibility
> > > > > concerns.
> > > > > >> > > > >> There are other broker metrics such as group coordinator,
> > > > > >> > transaction
> > > > > >> > > > >> state
> > > > > >> > > > >> manager, and various socket server metrics
> > > > > >> > > > >> already using KafkaMetrics that don't need specific Kafka
> > > > > metric
> > > > > >> > > > features,
> > > > > >> > > > >> so I don't see why we should refrain from using
> > > > > >> > > > >> Kafka metrics on the broker unless there are real
> > > > compatibility
> > > > > >> > > concerns
> > > > > >> > > > >> or
> > > > > >> > > > >> where implementation specifics could lead to confusion
> > when
> > > > > >> > comparing
> > > > > >> > > > >> metrics using different implementations.
> > > > > >> > > > >>
> > > > > >> > > > >> In my opinion we should encourage people to use
> > KafkaMetrics
> > > > > >> going
> > > > > >> > > > forward
> > > > > > >> > > > >> on the broker as well, for three reasons:
> > > > > >> > > > >> a) yammer metrics is long deprecated and no longer
> > maintained
> > > > > >> > > > >> b) yammer metrics are much less expressive
> > > > > >> > > > >> c) we don't have a proper API to expose yammer metrics
> > > > outside
> > > > > of
> > > > > >> > JMX
> > > > > >> > > > >> (MetricsReporter only exposes KafkaMetrics)
> > > > > >> > > > >>
> > > > > >> > > > >
> > > > > >> > > >
> > > > > >> > >
> > > > > >> >
> > > > > >>
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] KIP-714: Client metrics and observability

Posted by Magnus Edenhill <ma...@edenhill.se>.

Hey Jun and Kirk,


I see that there's a lot of focus on the existing metrics in the Java
clients, which makes sense, but the KIP aims to approach the problem space
from a higher and more generic level by defining:
1) a standard protocol for subscribing to and pushing metrics,
2) an existing industry-standard encoding and semantics for those metrics
(OTLP),
3) a standard set of metrics that we believe are relevant to most/all
client implementations.


The alternatives to these points, which have come up before in various
forms during the KIP discussions (see the rejected alternatives in the
KIP), are:
1) use an existing out-of-band protocol,
2) use Kafka protocol encoding for the metrics,
3) let each client implementation provide their own set of metrics.

So why is the KIP not suggesting this approach? Well, in short:
 1) defies the zero-conf/always-available requirement: clients, networks,
firewalls, etc. would have to be specifically configured, which will not be
feasible.
 2) we would need to duplicate the work of the industry-leading telemetry
people (OpenTelemetry), reaping no benefits from their existing and future
work, and making integration with upstream telemetry systems harder.
 3a) these client-specific metrics would either need to be converted to
some common form, which is not only costly in CPU and memory but also hard
from an operational standpoint:
     someone (is it the Kafka operator?) would need to understand which
client-specific metrics are available and what their semantics are, and
then for each such client implementation write translation code in the
broker-side plugin to try to mangle the custom metrics into a standard set
of metrics that can be monitored with a single upstream metric. With seven
or eight different client implementations in the wild, all with new
releases coming out every now and then, some perhaps without per-metric
documentation, that seems like a daunting task that will be hard to win.
 3b) or, if the client-specific metrics are not converted to some common
form, name, semantic, etc., creating meaningful aggregations and monitoring
in the upstream telemetry system becomes more complex, with a scattered
plethora of custom metrics.
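
To make the cost in 3a concrete, here is a minimal sketch of what such
broker-side translation could look like. Everything in it is hypothetical:
the ClientMetricTranslator class, the client names and the native metric
name for "client-b" are made up for illustration and are not part of the
KIP or of any plugin API. Only io-wait-ratio and client.io.wait.time are
names taken from this thread.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of the KIP or any Kafka plugin API.
public class ClientMetricTranslator {
    // One hand-maintained table per client implementation:
    // client name -> (native metric name -> common metric name).
    private final Map<String, Map<String, String>> tablesByClient = new HashMap<>();

    // Registers (or replaces) the translation table for one client implementation.
    public void register(String clientName, Map<String, String> nativeToCommon) {
        tablesByClient.put(clientName, Map.copyOf(nativeToCommon));
    }

    // Returns the common name if a rule exists; otherwise the native name
    // prefixed with "untranslated." so the gap is visible upstream.
    public String translate(String clientName, String nativeName) {
        Map<String, String> table = tablesByClient.getOrDefault(clientName, Map.of());
        return table.getOrDefault(nativeName, "untranslated." + nativeName);
    }

    public static void main(String[] args) {
        ClientMetricTranslator translator = new ClientMetricTranslator();
        // "client-a" and "client-b" are placeholder client names.
        translator.register("client-a", Map.of("io-wait-ratio", "client.io.wait.time"));
        // "example_io_wait" is an invented native name, for illustration only.
        translator.register("client-b", Map.of("example_io_wait", "client.io.wait.time"));

        System.out.println(translator.translate("client-a", "io-wait-ratio"));
        System.out.println(translator.translate("client-b", "example_io_wait"));
        System.out.println(translator.translate("client-b", "some-new-metric"));
    }
}
```

Each new client implementation, and potentially each new client release,
would need another table like the ones registered in main(), and every
metric without a rule ends up in the "untranslated" bucket.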

Additionally, the proposed standard set of metrics is derived from what is
available in existing clients, and while the fit with existing metrics may
not be perfect, it won't be too far off.
Moreover, having a standard set of metrics to implement makes it easier for
client maintainers to know which metrics they should expose and which are
considered relevant to monitoring and troubleshooting.

As for manually mapping KIP-714 metric names to JMX during troubleshooting:
I agree that is not perfect, but it could be solved quite easily through
documentation, e.g., "MetricA is also known as metric.foo.a in OTLP".
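
As a rough illustration of that kind of documentation, the mapping could be
kept as a small lookup table. This is only a sketch: the MetricNameLookup
class is hypothetical, and the pairings are assumptions based on the names
mentioned earlier in this thread (in particular, pairing buffer-total-bytes
with client.producer.record.queue.max.bytes is my guess at the closest
match), not an official mapping.

```java
import java.util.Map;
import java.util.Optional;

// Hypothetical documentation helper, not part of the KIP.
public class MetricNameLookup {
    // Illustrative pairings only: existing Java (JMX) name -> standard name.
    private static final Map<String, String> JMX_TO_STANDARD = Map.of(
        "io-wait-ratio", "client.io.wait.time",
        "buffer-total-bytes", "client.producer.record.queue.max.bytes"
    );

    // Looks up the standard name for an existing JMX metric name, if documented.
    public static Optional<String> standardNameFor(String jmxName) {
        return Optional.ofNullable(JMX_TO_STANDARD.get(jmxName));
    }

    // Renders one line of documentation for the given JMX metric name.
    public static String describe(String jmxName) {
        return standardNameFor(jmxName)
            .map(name -> jmxName + " is also known as " + name + " in OTLP")
            .orElse(jmxName + " has no documented standard equivalent");
    }

    public static void main(String[] args) {
        System.out.println(describe("io-wait-ratio"));
        System.out.println(describe("bufferpool-wait-time"));
    }
}
```

A metric such as bufferpool-wait-time, which has no standard counterpart
yet, simply shows up as undocumented rather than being forced into a
mapping.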

Another point worth mentioning is that, while the KIP does not cover it, a
future enhancement to the clients is to also expose the OTLP metrics
directly to the application as an alternative to JMX (or whatever the
client currently exposes, e.g. JSON), which makes integration with upstream
metrics systems easier.


Thanks,
Magnus







On Thu, Jun 16, 2022 at 23:38, Jun Rao <ju...@confluent.io.invalid> wrote:

> Hi, Kirk,
>
> Thanks for the reply. A couple of more comments.
>
> (1) "Another perspective is that these two sets of metrics serve different
> purposes and/or have different audiences, which argues that they should
> maintain their individuality and purpose. " Hmm, I am wondering if those
> metrics are really for different audiences and purposes? For example, if
> the operator detected an issue through a client metric collected through
> the server, the operator may need to communicate that back to the client.
> It would be weird if that same metric is not visible on the client side.
>
> (2) If we could standardize the names on the server side, do we need to
> enforce a naming convention for all clients?


> Thanks,
>
> Jun
>
> On Thu, Jun 16, 2022 at 12:00 PM Kirk True <ki...@kirktrue.pro> wrote:
>
> > Hi Jun,
> >
> > I'll try to answer the questions posed...
> >
> > On Tue, Jun 7, 2022, at 4:32 PM, Jun Rao wrote:
> > > Hi, Magnus,
> > >
> > > Thanks for the reply.
> > >
> > > So, the standard set of generic metrics is just a recommendation and
> not
> > a
> > > requirement? This sounds good to me since it makes the adoption of the
> > KIP
> > > easier.
> >
> > I believe that was the intent, yes.
> >
> > > Regarding the metric names, I have two concerns.
> >
> > (I'm splitting these two up for readability...)
> >
> > > (1) If a client already
> > > has an existing metric similar to the standard one, duplicating the
> > metric
> > > seems to be confusing.
> >
> > Agreed. I'm dealing with that situation as I write the Java client
> > implementation.
> >
> > The existing Java client exposes a set of metrics via JMX. The updated
> > Java client will introduce a second set of metrics, which instead are
> > exposed via sending them to the broker. There is substantial overlap with
> > the two set of metrics and in a few places in the code under development,
> > there are essentially two separate calls to update metrics: one for the
> > JMX-bound metrics and one for the broker-bound metrics.
> >
> > To be candid, I have gone back-and-forth on that design. From one
> > perspective, it could be argued that the set of client metrics should be
> > standardized across a given client, regardless of how those metrics are
> > exposed for consumption. Another perspective is that these two sets of
> > metrics serve different purposes and/or have different audiences, which
> > argues that they should maintain their individuality and purpose. Your
> > inputs/suggestions are certainly welcome!
> >
> > > (2) If a client needs to implement a standard metric
> > > that doesn't exist yet, using a naming convention (e.g., using dash vs
> > dot)
> > > different from other existing metrics also seems a bit confusing. It
> > seems
> > > that the main benefit of having standard metric names across clients is
> > for
> > > better server side monitoring. Could we do the standardization in the
> > > plugin on the server?
> >
> > I think the expectation is that the plugin implementation will perform
> > transformation of metric names, if needed, to fit in with an
> organization's
> > monitoring naming standards. Perhaps we need to call that out in the KIP
> > itself.
> >
> > Thanks,
> > Kirk
> >
> > >
> > > Thanks,
> > >
> > > Jun
> > >
> > >
> > >
> > > On Tue, Jun 7, 2022 at 6:53 AM Magnus Edenhill <ma...@edenhill.se>
> > wrote:
> > >
> > > > Hey Jun,
> > > >
> > > > I've clarified the scope of the standard metrics in the KIP, but
> > basically:
> > > >
> > > >  * We define a standard set of generic metrics that should be
> relevant
> > to
> > > > most client implementations, e.g., each producer implementation
> > probably
> > > > has some sort of per-partition message queue.
> > > >  * A client implementation should strive to implement as many of the
> > > > standard metrics as possible, but only the ones that make sense.
> > > >  * For metrics that are not in the standard set, a client maintainer
> > can
> > > > choose to either submit a KIP to add additional standard metrics - if
> > > > they're relevant, or go ahead and add custom metrics that are
> specific
> > to
> > > > that client implementation. These custom metrics will have a prefix
> > > > specific to that client implementation, as opposed to the standard
> > metric
> > > > set that resides under "org.apache.kafka...". E.g.,
> > > > "se.edenhill.librdkafka" or whatever.
> > > >  * Existing non-KIP-714 metrics should remain untouched. In some
> cases
> > we
> > > > might be able to use the same meter given it is compatible with the
> > > > standard metric set definition, in other cases a semi-duplicate meter
> > may
> > > > be needed. Thus this will not affect the metrics exposed through JMX,
> > or
> > > > vice versa.
> > > >
> > > > Thanks,
> > > > Magnus
> > > >
> > > >
> > > >
> > > > > On Wed, Jun 1, 2022 at 18:55, Jun Rao <jun@confluent.io.invalid> wrote:
> > > >
> > > > > Hi, Magnus,
> > > > >
> > > > > 51. Just to clarify my question.  (1) Are standard metrics required
> > for
> > > > > every client for this KIP to function?  (2) Are we converting
> > existing
> > > > java
> > > > > metrics to the standard metrics and deprecating the old ones? If
> so,
> > > > could
> > > > > we list all existing java metrics that need to be renamed and the
> > > > > corresponding new name?
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Jun
> > > > >
> > > > > On Tue, May 31, 2022 at 3:29 PM Jun Rao <ju...@confluent.io> wrote:
> > > > >
> > > > > > Hi, Magnus,
> > > > > >
> > > > > > Thanks for the reply.
> > > > > >
> > > > > > 51. I think it's fine to have a list of recommended metrics for
> > every
> > > > > > client to implement. I am just not sure that standardizing on the
> > > > metric
> > > > > > names across all clients is practical. The list of common metrics
> > in
> > > > the
> > > > > > KIP have completely different names from the java metric names.
> > Some of
> > > > > > them have different types. For example, some of the common
> metrics
> > > > have a
> > > > > > type of histogram, but the java client metrics don't use
> histogram
> > in
> > > > > > general. Requiring the operator to translate those names and
> > understand
> > > > > the
> > > > > > subtle differences across clients seem to cause more confusion
> > during
> > > > > > troubleshooting.
> > > > > >
> > > > > > Thanks,
> > > > > >
> > > > > > Jun
> > > > > >
> > > > > > On Tue, May 31, 2022 at 5:02 AM Magnus Edenhill <
> > magnus@edenhill.se>
> > > > > > wrote:
> > > > > >
> > > > > > >> On Fri, May 20, 2022 at 01:23, Jun Rao <jun@confluent.io.invalid> wrote:
> > > > > >>
> > > > > > >> > Hi, Magnus,
> > > > > >> >
> > > > > >> > Thanks for the reply.
> > > > > >> >
> > > > > >> > 50. Sounds good.
> > > > > >> >
> > > > > > >> > 51. I misunderstood the proposal in the KIP then. The
> > proposal is
> > > > to
> > > > > >> > define a set of common metric names that every client should
> > > > > implement.
> > > > > >> The
> > > > > >> > problem is that every client already has its own set of
> metrics
> > with
> > > > > its
> > > > > >> > own names. I am not sure that we could easily agree upon a
> > common
> > > > set
> > > > > of
> > > > > >> > metrics that work with all clients. There are likely to be
> some
> > > > > metrics
> > > > > >> > that are client specific. Translating between the common name
> > and
> > > > > client
> > > > > >> > specific name is probably going to add more confusion. As
> > mentioned
> > > > in
> > > > > >> the
> > > > > >> > KIP, similar metrics from different clients could have subtle
> > > > > >> > semantic differences. Could we just let each client use its
> own
> > set
> > > > of
> > > > > >> > metric names?
> > > > > >> >
> > > > > >>
> > > > > >> We identified a common set of metrics that should be relevant
> for
> > most
> > > > > >> client implementations,
> > > > > >> they're the ones listed in the KIP.
> > > > > >> A supporting client does not have to implement all those
> metrics,
> > only
> > > > > the
> > > > > > >> ones that make sense
> > > > > >> based on that client implementation, and a client may implement
> > other
> > > > > >> metrics that are not listed
> > > > > >> in the KIP under its own namespace.
> > > > > >> This approach has two benefits:
> > > > > >>  - there will be a common set of metrics that most/all clients
> > > > > implement,
> > > > > >> which makes monitoring
> > > > > >>   and troubleshooting easier across fleets with multiple Kafka
> > client
> > > > > >> languages/implementations.
> > > > > >>  - client-specific metrics are still possible, so if there is no
> > > > > suitable
> > > > > >> standard metric a client can still
> > > > > >>    provide what special metrics it has.
> > > > > >>
> > > > > >>
> > > > > >> Thanks,
> > > > > >> Magnus
> > > > > >>
> > > > > >>
> > > > > >> On Thu, May 19, 2022 at 10:39 AM Magnus Edenhill <
> > magnus@edenhill.se>
> > > > > >> wrote:
> > > > > >> >
> > > > > > >> > > On Wed, May 18, 2022 at 19:57, Jun Rao <jun@confluent.io.invalid> wrote:
> > > > > >> > >
> > > > > >> > > > Hi, Magnus,
> > > > > >> > > >
> > > > > >> > >
> > > > > >> > > Hi Jun
> > > > > >> > >
> > > > > >> > >
> > > > > >> > > >
> > > > > >> > > > Thanks for the updated KIP. Just a couple of more
> comments.
> > > > > >> > > >
> > > > > >> > > > 50. To troubleshoot a particular client issue, I imagine
> > that
> > > > the
> > > > > >> > client
> > > > > >> > > > needs to identify its client_instance_id. How does the
> > client
> > > > find
> > > > > >> this
> > > > > >> > > > out? Do we plan to include client_instance_id in the
> client
> > log,
> > > > > >> expose
> > > > > >> > > it
> > > > > >> > > > as a metric or something else?
> > > > > >> > > >
> > > > > >> > >
> > > > > >> > > The KIP suggests that client implementations emit an
> > informative
> > > > log
> > > > > >> > > message
> > > > > >> > > with the assigned client-instance-id once it is retrieved
> > (once
> > > > per
> > > > > >> > client
> > > > > >> > > instance lifetime).
> > > > > >> > > There's also a clientInstanceId() method that an application
> > can
> > > > use
> > > > > >> to
> > > > > >> > > retrieve
> > > > > >> > > the client instance id and emit through whatever side
> channels
> > > > makes
> > > > > >> > sense.
> > > > > >> > >
> > > > > >> > >
> > > > > >> > >
> > > > > >> > > > 51. The KIP lists a bunch of metrics that need to be
> > collected
> > > > at
> > > > > >> the
> > > > > >> > > > client side. However, it seems quite a few useful java
> > client
> > > > > >> metrics
> > > > > >> > > like
> > > > > >> > > > the following are missing.
> > > > > >> > > >     buffer-total-bytes
> > > > > >> > > >     buffer-available-bytes
> > > > > >> > > >
> > > > > >> > >
> > > > > >> > > These are covered by client.producer.record.queue.bytes and
> > > > > >> > > client.producer.record.queue.max.bytes.
> > > > > >> > >
> > > > > >> > >
> > > > > >> > > >     bufferpool-wait-time
> > > > > >> > > >
> > > > > >> > >
> > > > > >> > > Missing, but somewhat implementation specific.
> > > > > >> > > If it was up to me we would add this later if there's a
> need.
> > > > > >> > >
> > > > > >> > >
> > > > > >> > >
> > > > > >> > > >     batch-size-avg
> > > > > >> > > >     batch-size-max
> > > > > >> > > >
> > > > > >> > >
> > > > > >> > > These are missing and would be suitably represented as a
> > > > histogram.
> > > > > >> I'll
> > > > > >> > > add them.
> > > > > >> > >
> > > > > >> > >
> > > > > >> > >
> > > > > >> > > >     io-wait-ratio
> > > > > >> > > >     io-ratio
> > > > > >> > > >
> > > > > >> > >
> > > > > >> > > There's client.io.wait.time which should cover
> io-wait-ratio.
> > > > > >> > > We could add a client.io.time as well, now or in a later
> KIP.
> > > > > >> > >
> > > > > >> > > Thanks,
> > > > > >> > > Magnus
> > > > > >> > >
> > > > > >> > >
> > > > > >> > >
> > > > > >> > >
> > > > > >> > > >
> > > > > >> > > > Thanks,
> > > > > >> > > >
> > > > > >> > > > Jun
> > > > > >> > > >
> > > > > >> > > > On Mon, Apr 4, 2022 at 10:01 AM Jun Rao <jun@confluent.io
> >
> > > > wrote:
> > > > > >> > > >
> > > > > >> > > > > Hi, Xavier,
> > > > > >> > > > >
> > > > > >> > > > > Thanks for the reply.
> > > > > >> > > > >
> > > > > >> > > > > 28. It does seem that we have started using KafkaMetrics
> > on
> > > > the
> > > > > >> > broker
> > > > > >> > > > > side. Then, my only concern is on the usage of Histogram
> > in
> > > > > >> > > KafkaMetrics.
> > > > > >> > > > > Histogram in KafkaMetrics statically divides the value
> > space
> > > > > into
> > > > > >> a
> > > > > >> > > fixed
> > > > > >> > > > > number of buckets and only returns values on the bucket
> > > > > boundary.
> > > > > >> So,
> > > > > >> > > the
> > > > > >> > > > > returned histogram value may never show up in a recorded
> > > > value.
> > > > > >> > Yammer
> > > > > >> > > > > Histogram, on the other hand, uses reservoir sampling.
> The
> > > > > >> reported
> > > > > >> > > value
> > > > > >> > > > > is always one of the recorded values. So, I am not sure
> > that
> > > > > >> > Histogram
> > > > > >> > > in
> > > > > >> > > > > KafkaMetrics is as good as Yammer Histogram.
> > > > > >> > > > ClientMetricsPluginExportTime
> > > > > >> > > > > uses Histogram.
> > > > > >> > > > >
> > > > > >> > > > > Thanks,
> > > > > >> > > > >
> > > > > >> > > > > Jun
> > > > > >> > > > >
> > > > > >> > > > > On Thu, Mar 31, 2022 at 5:21 PM Xavier Léauté
> > > > > >> > > > <xa...@confluent.io.invalid>
> > > > > >> > > > > wrote:
> > > > > >> > > > >
> > > > > >> > > > >> >
> > > > > >> > > > >> > 28. On the broker, we typically use Yammer metrics.
> > Only
> > > > for
> > > > > >> > metrics
> > > > > >> > > > >> that
> > > > > >> > > > >> > depend on Kafka metric features (e.g., quota), we use
> > the
> > > > > Kafka
> > > > > >> > > > metric.
> > > > > >> > > > >> > Yammer metrics have 4 types: gauge, meter, histogram
> > and
> > > > > timer.
> > > > > >> > > meter
> > > > > >> > > > >> > calculates a rate, but also exposes an accumulated
> > value.
> > > > > >> > > > >> >
> > > > > >> > > > >>
> > > > > >> > > > >> I don't see a good reason we should limit ourselves to
> > Yammer
> > > > > >> > metrics
> > > > > >> > > on
> > > > > >> > > > >> the broker. KafkaMetrics was written
> > > > > >> > > > >> to replace Yammer metrics and is used for all new
> > components
> > > > > >> > (clients,
> > > > > >> > > > >> streams, connect, etc.)
> > > > > >> > > > >> My understanding is that the original goal was to
> retire
> > > > Yammer
> > > > > >> > > metrics
> > > > > >> > > > in
> > > > > >> > > > >> the broker in favor of KafkaMetrics.
> > > > > >> > > > >> We just haven't done so out of backwards compatibility
> > > > > concerns.
> > > > > >> > > > >> There are other broker metrics such as group
> coordinator,
> > > > > >> > transaction
> > > > > >> > > > >> state
> > > > > >> > > > >> manager, and various socket server metrics
> > > > > >> > > > >> already using KafkaMetrics that don't need specific
> Kafka
> > > > > metric
> > > > > >> > > > features,
> > > > > >> > > > >> so I don't see why we should refrain from using
> > > > > >> > > > >> Kafka metrics on the broker unless there are real
> > > > compatibility
> > > > > >> > > concerns
> > > > > >> > > > >> or
> > > > > >> > > > >> where implementation specifics could lead to confusion
> > when
> > > > > >> > comparing
> > > > > >> > > > >> metrics using different implementations.
> > > > > >> > > > >>
> > > > > >> > > > >> In my opinion we should encourage people to use
> > KafkaMetrics
> > > > > >> going
> > > > > >> > > > forward
> > > > > > >> > > > >> on the broker as well, for three reasons:
> > > > > >> > > > >> a) yammer metrics is long deprecated and no longer
> > maintained
> > > > > >> > > > >> b) yammer metrics are much less expressive
> > > > > >> > > > >> c) we don't have a proper API to expose yammer metrics
> > > > outside
> > > > > of
> > > > > >> > JMX
> > > > > >> > > > >> (MetricsReporter only exposes KafkaMetrics)
> > > > > >> > > > >>
> > > > > >> > > > >
> > > > > >> > > >
> > > > > >> > >
> > > > > >> >
> > > > > >>
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] KIP-714: Client metrics and observability

Posted by Jun Rao <ju...@confluent.io.INVALID>.

Hi, Kirk,

Thanks for the reply. A couple more comments.

(1) "Another perspective is that these two sets of metrics serve different
purposes and/or have different audiences, which argues that they should
maintain their individuality and purpose." Hmm, I am wondering whether those
metrics are really for different audiences and purposes. For example, if
the operator detects an issue through a client metric collected through
the server, the operator may need to communicate that back to the client.
It would be weird if that same metric were not visible on the client side.

(2) If we could standardize the names on the server side, do we need to
enforce a naming convention for all clients?

Thanks,

Jun

On Thu, Jun 16, 2022 at 12:00 PM Kirk True <ki...@kirktrue.pro> wrote:

> Hi Jun,
>
> I'll try to answer the questions posed...
>
> On Tue, Jun 7, 2022, at 4:32 PM, Jun Rao wrote:
> > Hi, Magnus,
> >
> > Thanks for the reply.
> >
> > So, the standard set of generic metrics is just a recommendation and not
> a
> > requirement? This sounds good to me since it makes the adoption of the
> KIP
> > easier.
>
> I believe that was the intent, yes.
>
> > Regarding the metric names, I have two concerns.
>
> (I'm splitting these two up for readability...)
>
> > (1) If a client already
> > has an existing metric similar to the standard one, duplicating the
> metric
> > seems to be confusing.
>
> Agreed. I'm dealing with that situation as I write the Java client
> implementation.
>
> The existing Java client exposes a set of metrics via JMX. The updated
> Java client will introduce a second set of metrics, which instead are
> exposed via sending them to the broker. There is substantial overlap between
> the two sets of metrics, and in a few places in the code under development,
> there are essentially two separate calls to update metrics: one for the
> JMX-bound metrics and one for the broker-bound metrics.
>
> To be candid, I have gone back-and-forth on that design. From one
> perspective, it could be argued that the set of client metrics should be
> standardized across a given client, regardless of how those metrics are
> exposed for consumption. Another perspective is that these two sets of
> metrics serve different purposes and/or have different audiences, which
> argues that they should maintain their individuality and purpose. Your
> inputs/suggestions are certainly welcome!
>
> > (2) If a client needs to implement a standard metric
> > that doesn't exist yet, using a naming convention (e.g., using dash vs
> dot)
> > different from other existing metrics also seems a bit confusing. It
> seems
> > that the main benefit of having standard metric names across clients is
> for
> > better server side monitoring. Could we do the standardization in the
> > plugin on the server?
>
> I think the expectation is that the plugin implementation will perform
> transformation of metric names, if needed, to fit in with an organization's
> monitoring naming standards. Perhaps we need to call that out in the KIP
> itself.
>
> Thanks,
> Kirk
>
> >
> > Thanks,
> >
> > Jun
> >
> >
> >
> > On Tue, Jun 7, 2022 at 6:53 AM Magnus Edenhill <ma...@edenhill.se>
> wrote:
> >
> > > Hey Jun,
> > >
> > > I've clarified the scope of the standard metrics in the KIP, but
> basically:
> > >
> > >  * We define a standard set of generic metrics that should be relevant
> to
> > > most client implementations, e.g., each producer implementation
> probably
> > > has some sort of per-partition message queue.
> > >  * A client implementation should strive to implement as many of the
> > > standard metrics as possible, but only the ones that make sense.
> > >  * For metrics that are not in the standard set, a client maintainer
> can
> > > choose to either submit a KIP to add additional standard metrics - if
> > > they're relevant, or go ahead and add custom metrics that are specific
> to
> > > that client implementation. These custom metrics will have a prefix
> > > specific to that client implementation, as opposed to the standard
> metric
> > > set that resides under "org.apache.kafka...". E.g.,
> > > "se.edenhill.librdkafka" or whatever.
> > >  * Existing non-KIP-714 metrics should remain untouched. In some cases
> we
> > > might be able to use the same meter given it is compatible with the
> > > standard metric set definition, in other cases a semi-duplicate meter
> may
> > > be needed. Thus this will not affect the metrics exposed through JMX,
> or
> > > vice versa.
> > >
> > > Thanks,
> > > Magnus
> > >
> > >
> > >
> > > > On Wed, Jun 1, 2022 at 18:55, Jun Rao <ju...@confluent.io.invalid> wrote:
> > >
> > > > Hi, Magnus,
> > > >
> > > > 51. Just to clarify my question.  (1) Are standard metrics required
> for
> > > > every client for this KIP to function?  (2) Are we converting
> existing
> > > java
> > > > metrics to the standard metrics and deprecating the old ones? If so,
> > > could
> > > > we list all existing java metrics that need to be renamed and the
> > > > corresponding new name?
> > > >
> > > > Thanks,
> > > >
> > > > Jun
> > > >
> > > > On Tue, May 31, 2022 at 3:29 PM Jun Rao <ju...@confluent.io> wrote:
> > > >
> > > > > Hi, Magnus,
> > > > >
> > > > > Thanks for the reply.
> > > > >
> > > > > 51. I think it's fine to have a list of recommended metrics for
> every
> > > > > client to implement. I am just not sure that standardizing on the
> > > metric
> > > > > names across all clients is practical. The list of common metrics
> in
> > > the
> > > > > > KIP has completely different names from the Java metric names.
> Some of
> > > > > them have different types. For example, some of the common metrics
> > > have a
> > > > > type of histogram, but the java client metrics don't use histogram
> in
> > > > > general. Requiring the operator to translate those names and
> understand
> > > > the
> > > > > > subtle differences across clients seems to cause more confusion
> during
> > > > > troubleshooting.
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Jun
> > > > >
> > > > > On Tue, May 31, 2022 at 5:02 AM Magnus Edenhill <
> magnus@edenhill.se>
> > > > > wrote:
> > > > >
> > > > > >> On Fri, May 20, 2022 at 01:23, Jun Rao <jun@confluent.io.invalid> wrote:
> > > > > >>
> > > > > >> > Hi, Magnus,
> > > > > >> >
> > > > > >> > Thanks for the reply.
> > > > > >> >
> > > > > >> > 50. Sounds good.
> > > > > >> >
> > > > > >> > 51. I misunderstood the proposal in the KIP then. The
> proposal is
> > > to
> > > > >> > define a set of common metric names that every client should
> > > > implement.
> > > > >> The
> > > > >> > problem is that every client already has its own set of metrics
> with
> > > > its
> > > > >> > own names. I am not sure that we could easily agree upon a
> common
> > > set
> > > > of
> > > > >> > metrics that work with all clients. There are likely to be some
> > > > metrics
> > > > >> > that are client specific. Translating between the common name
> and
> > > > client
> > > > >> > specific name is probably going to add more confusion. As
> mentioned
> > > in
> > > > >> the
> > > > >> > KIP, similar metrics from different clients could have subtle
> > > > >> > semantic differences. Could we just let each client use its own
> set
> > > of
> > > > >> > metric names?
> > > > >> >
> > > > >>
> > > > >> We identified a common set of metrics that should be relevant for
> most
> > > > >> client implementations,
> > > > >> they're the ones listed in the KIP.
> > > > >> A supporting client does not have to implement all those metrics,
> only
> > > > the
> > > > > >> ones that make sense
> > > > >> based on that client implementation, and a client may implement
> other
> > > > >> metrics that are not listed
> > > > >> in the KIP under its own namespace.
> > > > >> This approach has two benefits:
> > > > >>  - there will be a common set of metrics that most/all clients
> > > > implement,
> > > > >> which makes monitoring
> > > > >>   and troubleshooting easier across fleets with multiple Kafka
> client
> > > > >> languages/implementations.
> > > > >>  - client-specific metrics are still possible, so if there is no
> > > > suitable
> > > > >> standard metric a client can still
> > > > >>    provide what special metrics it has.
> > > > >>
> > > > >>
> > > > >> Thanks,
> > > > >> Magnus
> > > > >>
> > > > >>
> > > > >> On Thu, May 19, 2022 at 10:39 AM Magnus Edenhill <
> magnus@edenhill.se>
> > > > >> wrote:
> > > > >> >
> > > > > >> > > On Wed, May 18, 2022 at 19:57, Jun Rao <jun@confluent.io.invalid> wrote:
> > > > >> > >
> > > > >> > > > Hi, Magnus,
> > > > >> > > >
> > > > >> > >
> > > > >> > > Hi Jun
> > > > >> > >
> > > > >> > >
> > > > >> > > >
> > > > >> > > > Thanks for the updated KIP. Just a couple of more comments.
> > > > >> > > >
> > > > >> > > > 50. To troubleshoot a particular client issue, I imagine
> that
> > > the
> > > > >> > client
> > > > >> > > > needs to identify its client_instance_id. How does the
> client
> > > find
> > > > >> this
> > > > >> > > > out? Do we plan to include client_instance_id in the client
> log,
> > > > >> expose
> > > > >> > > it
> > > > >> > > > as a metric or something else?
> > > > >> > > >
> > > > >> > >
> > > > >> > > The KIP suggests that client implementations emit an
> informative
> > > log
> > > > >> > > message
> > > > >> > > with the assigned client-instance-id once it is retrieved
> (once
> > > per
> > > > >> > client
> > > > >> > > instance lifetime).
> > > > >> > > There's also a clientInstanceId() method that an application
> can
> > > use
> > > > >> to
> > > > >> > > retrieve
> > > > >> > > the client instance id and emit through whatever side channels
> > > > make
> > > > >> > sense.
> > > > >> > >
> > > > >> > >
> > > > >> > >
> > > > >> > > > 51. The KIP lists a bunch of metrics that need to be
> collected
> > > at
> > > > >> the
> > > > >> > > > client side. However, it seems quite a few useful java
> client
> > > > >> metrics
> > > > >> > > like
> > > > >> > > > the following are missing.
> > > > >> > > >     buffer-total-bytes
> > > > >> > > >     buffer-available-bytes
> > > > >> > > >
> > > > >> > >
> > > > >> > > These are covered by client.producer.record.queue.bytes and
> > > > >> > > client.producer.record.queue.max.bytes.
> > > > >> > >
> > > > >> > >
> > > > >> > > >     bufferpool-wait-time
> > > > >> > > >
> > > > >> > >
> > > > >> > > Missing, but somewhat implementation specific.
> > > > >> > > If it was up to me we would add this later if there's a need.
> > > > >> > >
> > > > >> > >
> > > > >> > >
> > > > >> > > >     batch-size-avg
> > > > >> > > >     batch-size-max
> > > > >> > > >
> > > > >> > >
> > > > >> > > These are missing and would be suitably represented as a
> > > histogram.
> > > > >> I'll
> > > > >> > > add them.
> > > > >> > >
> > > > >> > >
> > > > >> > >
> > > > >> > > >     io-wait-ratio
> > > > >> > > >     io-ratio
> > > > >> > > >
> > > > >> > >
> > > > >> > > There's client.io.wait.time which should cover io-wait-ratio.
> > > > >> > > We could add a client.io.time as well, now or in a later KIP.
> > > > >> > >
> > > > >> > > Thanks,
> > > > >> > > Magnus
> > > > >> > >
> > > > >> > >
> > > > >> > >
> > > > >> > >
> > > > >> > > >
> > > > >> > > > Thanks,
> > > > >> > > >
> > > > >> > > > Jun
> > > > >> > > >
> > > > >> > > > On Mon, Apr 4, 2022 at 10:01 AM Jun Rao <ju...@confluent.io>
> > > wrote:
> > > > >> > > >
> > > > >> > > > > Hi, Xavier,
> > > > >> > > > >
> > > > >> > > > > Thanks for the reply.
> > > > >> > > > >
> > > > >> > > > > 28. It does seem that we have started using KafkaMetrics
> on
> > > the
> > > > >> > broker
> > > > >> > > > > side. Then, my only concern is on the usage of Histogram
> in
> > > > >> > > KafkaMetrics.
> > > > >> > > > > Histogram in KafkaMetrics statically divides the value
> space
> > > > into
> > > > >> a
> > > > >> > > fixed
> > > > >> > > > > number of buckets and only returns values on the bucket
> > > > boundary.
> > > > >> So,
> > > > >> > > the
> > > > >> > > > > returned histogram value may never show up in a recorded
> > > value.
> > > > >> > Yammer
> > > > >> > > > > Histogram, on the other hand, uses reservoir sampling. The
> > > > >> reported
> > > > >> > > value
> > > > >> > > > > is always one of the recorded values. So, I am not sure
> that
> > > > >> > Histogram
> > > > >> > > in
> > > > >> > > > > KafkaMetrics is as good as Yammer Histogram.
> > > > >> > > > ClientMetricsPluginExportTime
> > > > >> > > > > uses Histogram.
> > > > >> > > > >
> > > > >> > > > > Thanks,
> > > > >> > > > >
> > > > >> > > > > Jun
> > > > >> > > > >
> > > > >> > > > > On Thu, Mar 31, 2022 at 5:21 PM Xavier Léauté
> > > > >> > > > <xa...@confluent.io.invalid>
> > > > >> > > > > wrote:
> > > > >> > > > >
> > > > >> > > > >> >
> > > > >> > > > >> > 28. On the broker, we typically use Yammer metrics.
> Only
> > > for
> > > > >> > metrics
> > > > >> > > > >> that
> > > > >> > > > >> > depend on Kafka metric features (e.g., quota), we use
> the
> > > > Kafka
> > > > >> > > > metric.
> > > > >> > > > >> > Yammer metrics have 4 types: gauge, meter, histogram
> and
> > > > timer.
> > > > >> > > meter
> > > > >> > > > >> > calculates a rate, but also exposes an accumulated
> value.
> > > > >> > > > >> >
> > > > >> > > > >>
> > > > >> > > > >> I don't see a good reason we should limit ourselves to
> Yammer
> > > > >> > metrics
> > > > >> > > on
> > > > >> > > > >> the broker. KafkaMetrics was written
> > > > >> > > > >> to replace Yammer metrics and is used for all new
> components
> > > > >> > (clients,
> > > > >> > > > >> streams, connect, etc.)
> > > > >> > > > >> My understanding is that the original goal was to retire
> > > Yammer
> > > > >> > > metrics
> > > > >> > > > in
> > > > >> > > > >> the broker in favor of KafkaMetrics.
> > > > >> > > > >> We just haven't done so out of backwards compatibility
> > > > concerns.
> > > > >> > > > >> There are other broker metrics such as group coordinator,
> > > > >> > transaction
> > > > >> > > > >> state
> > > > >> > > > >> manager, and various socket server metrics
> > > > >> > > > >> already using KafkaMetrics that don't need specific Kafka
> > > > metric
> > > > >> > > > features,
> > > > >> > > > >> so I don't see why we should refrain from using
> > > > >> > > > >> Kafka metrics on the broker unless there are real
> > > compatibility
> > > > >> > > concerns
> > > > >> > > > >> or
> > > > >> > > > >> where implementation specifics could lead to confusion
> when
> > > > >> > comparing
> > > > >> > > > >> metrics using different implementations.
> > > > >> > > > >>
> > > > >> > > > >> In my opinion we should encourage people to use
> KafkaMetrics
> > > > >> going
> > > > >> > > > forward
> > > > > >> > > > >> on the broker as well, for three reasons:
> > > > >> > > > >> a) yammer metrics is long deprecated and no longer
> maintained
> > > > >> > > > >> b) yammer metrics are much less expressive
> > > > >> > > > >> c) we don't have a proper API to expose yammer metrics
> > > outside
> > > > of
> > > > >> > JMX
> > > > >> > > > >> (MetricsReporter only exposes KafkaMetrics)
> > > > >> > > > >>
> > > > >> > > > >
> > > > >> > > >
> > > > >> > >
> > > > >> >
> > > > >>
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] KIP-714: Client metrics and observability

Posted by Kirk True <ki...@kirktrue.pro>.

Hi Jun,

I'll try to answer the questions posed...

On Tue, Jun 7, 2022, at 4:32 PM, Jun Rao wrote:
> Hi, Magnus,
> 
> Thanks for the reply.
> 
> So, the standard set of generic metrics is just a recommendation and not a
> requirement? This sounds good to me since it makes the adoption of the KIP
> easier.

I believe that was the intent, yes.

> Regarding the metric names, I have two concerns.

(I'm splitting these two up for readability...)

> (1) If a client already
> has an existing metric similar to the standard one, duplicating the metric
> seems to be confusing.

Agreed. I'm dealing with that situation as I write the Java client implementation.

The existing Java client exposes a set of metrics via JMX. The updated Java client will introduce a second set of metrics, which are instead exposed by sending them to the broker. There is substantial overlap between the two sets of metrics, and in a few places in the code under development there are essentially two separate calls to update metrics: one for the JMX-bound metrics and one for the broker-bound metrics.

To be candid, I have gone back and forth on that design. From one perspective, it could be argued that the set of client metrics should be standardized across a given client, regardless of how those metrics are exposed for consumption. Another perspective is that these two sets of metrics serve different purposes and/or have different audiences, which argues that they should maintain their individuality and purpose. Your input and suggestions are certainly welcome!
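
For illustration only, the dual-update pattern mentioned above could be funneled through one call site, roughly as in the Python sketch below. The class name, the in-memory dictionaries standing in for the JMX-bound and broker-bound registries, and the fully qualified standard metric name are all assumptions made for this example; none of it is taken from the Java client code under development.

```python
class DualMetricsRecorder:
    """Hypothetical helper that records one measurement under two names."""

    def __init__(self):
        # Plain dictionaries stand in for the JMX-bound and the
        # broker-bound metric registries (assumption for this sketch).
        self.jmx_metrics = {}
        self.broker_metrics = {}

    def record_gauge(self, jmx_name, standard_name, value):
        # One call updates both registries, so overlapping metrics
        # cannot drift apart between the two sets.
        self.jmx_metrics[jmx_name] = value
        self.broker_metrics[standard_name] = value


recorder = DualMetricsRecorder()
recorder.record_gauge(
    "buffer-total-bytes",
    # Assumed full name: the "org.apache.kafka" prefix plus a standard name.
    "org.apache.kafka.client.producer.record.queue.max.bytes",
    33554432,
)
print(recorder.jmx_metrics)
print(recorder.broker_metrics)
```

Whether a shared call site like this is preferable to two independent calls is exactly the design question above; the sketch only shows that the overlap can be expressed in one place.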

> (2) If a client needs to implement a standard metric
> that doesn't exist yet, using a naming convention (e.g., using dash vs dot)
> different from other existing metrics also seems a bit confusing. It seems
> that the main benefit of having standard metric names across clients is for
> better server side monitoring. Could we do the standardization in the
> plugin on the server?

I think the expectation is that the plugin implementation will perform transformation of metric names, if needed, to fit in with an organization's monitoring naming standards. Perhaps we need to call that out in the KIP itself.
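
As a rough sketch of what such a transformation might look like, the Python snippet below rewrites a dotted metric name into an underscore-separated one. The function name, the "kafka" site prefix, and the underscore convention are hypothetical; a real plugin would apply whatever convention the organization already uses.

```python
# Prefix under which the standard metric set resides, per the discussion
# in this thread.
STANDARD_PREFIX = "org.apache.kafka."


def to_site_name(metric_name, site_prefix="kafka"):
    """Rewrite a metric name into a hypothetical underscore convention."""
    # Drop the standard namespace if present; client-specific namespaces
    # (e.g. "se.edenhill.librdkafka") are kept so they stay distinguishable.
    if metric_name.startswith(STANDARD_PREFIX):
        metric_name = metric_name[len(STANDARD_PREFIX):]
    normalized = metric_name.replace(".", "_").replace("-", "_")
    return site_prefix + "_" + normalized


print(to_site_name("org.apache.kafka.client.io.wait.time"))
```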

Thanks,
Kirk

> 
> Thanks,
> 
> Jun
> 
> 
> 
> On Tue, Jun 7, 2022 at 6:53 AM Magnus Edenhill <ma...@edenhill.se> wrote:
> 
> > Hey Jun,
> >
> > I've clarified the scope of the standard metrics in the KIP, but basically:
> >
> >  * We define a standard set of generic metrics that should be relevant to
> > most client implementations, e.g., each producer implementation probably
> > has some sort of per-partition message queue.
> >  * A client implementation should strive to implement as many of the
> > standard metrics as possible, but only the ones that make sense.
> >  * For metrics that are not in the standard set, a client maintainer can
> > choose to either submit a KIP to add additional standard metrics - if
> > they're relevant, or go ahead and add custom metrics that are specific to
> > that client implementation. These custom metrics will have a prefix
> > specific to that client implementation, as opposed to the standard metric
> > set that resides under "org.apache.kafka...". E.g.,
> > "se.edenhill.librdkafka" or whatever.
> >  * Existing non-KIP-714 metrics should remain untouched. In some cases we
> > might be able to use the same meter given it is compatible with the
> > standard metric set definition, in other cases a semi-duplicate meter may
> > be needed. Thus this will not affect the metrics exposed through JMX, or
> > vice versa.
> >
> > Thanks,
> > Magnus
> >
> >
> >
> > On Wed, Jun 1, 2022 at 18:55, Jun Rao <ju...@confluent.io.invalid> wrote:
> >
> > > Hi, Magnus,
> > >
> > > 51. Just to clarify my question.  (1) Are standard metrics required for
> > > every client for this KIP to function?  (2) Are we converting existing
> > java
> > > metrics to the standard metrics and deprecating the old ones? If so,
> > could
> > > we list all existing java metrics that need to be renamed and the
> > > corresponding new name?
> > >
> > > Thanks,
> > >
> > > Jun
> > >
> > > On Tue, May 31, 2022 at 3:29 PM Jun Rao <ju...@confluent.io> wrote:
> > >
> > > > Hi, Magnus,
> > > >
> > > > Thanks for the reply.
> > > >
> > > > 51. I think it's fine to have a list of recommended metrics for every
> > > > client to implement. I am just not sure that standardizing on the
> > metric
> > > > names across all clients is practical. The list of common metrics in
> > the
> > > > > KIP has completely different names from the Java metric names. Some of
> > > > them have different types. For example, some of the common metrics
> > have a
> > > > type of histogram, but the java client metrics don't use histogram in
> > > > general. Requiring the operator to translate those names and understand
> > > the
> > > > > subtle differences across clients seems to cause more confusion during
> > > > troubleshooting.
> > > >
> > > > Thanks,
> > > >
> > > > Jun
> > > >
> > > > On Tue, May 31, 2022 at 5:02 AM Magnus Edenhill <ma...@edenhill.se>
> > > > wrote:
> > > >
> > > > >> On Fri, May 20, 2022 at 01:23, Jun Rao <jun@confluent.io.invalid> wrote:
> > > > >>
> > > > >> > Hi, Magnus,
> > > > >> >
> > > > >> > Thanks for the reply.
> > > > >> >
> > > > >> > 50. Sounds good.
> > > > >> >
> > > > >> > 51. I misunderstood the proposal in the KIP then. The proposal is
> > to
> > > >> > define a set of common metric names that every client should
> > > implement.
> > > >> The
> > > >> > problem is that every client already has its own set of metrics with
> > > its
> > > >> > own names. I am not sure that we could easily agree upon a common
> > set
> > > of
> > > >> > metrics that work with all clients. There are likely to be some
> > > metrics
> > > >> > that are client specific. Translating between the common name and
> > > client
> > > >> > specific name is probably going to add more confusion. As mentioned
> > in
> > > >> the
> > > >> > KIP, similar metrics from different clients could have subtle
> > > >> > semantic differences. Could we just let each client use its own set
> > of
> > > >> > metric names?
> > > >> >
> > > >>
> > > >> We identified a common set of metrics that should be relevant for most
> > > >> client implementations,
> > > >> they're the ones listed in the KIP.
> > > >> A supporting client does not have to implement all those metrics, only
> > > the
> > > > >> ones that make sense
> > > >> based on that client implementation, and a client may implement other
> > > >> metrics that are not listed
> > > >> in the KIP under its own namespace.
> > > >> This approach has two benefits:
> > > >>  - there will be a common set of metrics that most/all clients
> > > implement,
> > > >> which makes monitoring
> > > >>   and troubleshooting easier across fleets with multiple Kafka client
> > > >> languages/implementations.
> > > >>  - client-specific metrics are still possible, so if there is no
> > > suitable
> > > >> standard metric a client can still
> > > >>    provide what special metrics it has.
> > > >>
> > > >>
> > > >> Thanks,
> > > >> Magnus
> > > >>
> > > >>
> > > >> On Thu, May 19, 2022 at 10:39 AM Magnus Edenhill <ma...@edenhill.se>
> > > >> wrote:
> > > >> >
> > > > >> > > On Wed, May 18, 2022 at 19:57, Jun Rao <jun@confluent.io.invalid> wrote:
> > > >> > >
> > > >> > > > Hi, Magnus,
> > > >> > > >
> > > >> > >
> > > >> > > Hi Jun
> > > >> > >
> > > >> > >
> > > >> > > >
> > > >> > > > Thanks for the updated KIP. Just a couple of more comments.
> > > >> > > >
> > > >> > > > 50. To troubleshoot a particular client issue, I imagine that
> > the
> > > >> > client
> > > >> > > > needs to identify its client_instance_id. How does the client
> > find
> > > >> this
> > > >> > > > out? Do we plan to include client_instance_id in the client log,
> > > >> expose
> > > >> > > it
> > > >> > > > as a metric or something else?
> > > >> > > >
> > > >> > >
> > > >> > > The KIP suggests that client implementations emit an informative
> > log
> > > >> > > message
> > > >> > > with the assigned client-instance-id once it is retrieved (once
> > per
> > > >> > client
> > > >> > > instance lifetime).
> > > >> > > There's also a clientInstanceId() method that an application can
> > use
> > > >> to
> > > >> > > retrieve
> > > >> > > the client instance id and emit through whatever side channels
> > make
> > > >> > sense.
> > > >> > >
> > > >> > >
> > > >> > >
> > > >> > > > 51. The KIP lists a bunch of metrics that need to be collected
> > at
> > > >> the
> > > >> > > > client side. However, it seems quite a few useful java client
> > > >> metrics
> > > >> > > like
> > > >> > > > the following are missing.
> > > >> > > >     buffer-total-bytes
> > > >> > > >     buffer-available-bytes
> > > >> > > >
> > > >> > >
> > > >> > > These are covered by client.producer.record.queue.bytes and
> > > >> > > client.producer.record.queue.max.bytes.
> > > >> > >
> > > >> > >
> > > >> > > >     bufferpool-wait-time
> > > >> > > >
> > > >> > >
> > > >> > > Missing, but somewhat implementation specific.
> > > >> > > If it was up to me we would add this later if there's a need.
> > > >> > >
> > > >> > >
> > > >> > >
> > > >> > > >     batch-size-avg
> > > >> > > >     batch-size-max
> > > >> > > >
> > > >> > >
> > > >> > > These are missing and would be suitably represented as a
> > histogram.
> > > >> I'll
> > > >> > > add them.
> > > >> > >
> > > >> > >
> > > >> > >
> > > >> > > >     io-wait-ratio
> > > >> > > >     io-ratio
> > > >> > > >
> > > >> > >
> > > >> > > There's client.io.wait.time which should cover io-wait-ratio.
> > > >> > > We could add a client.io.time as well, now or in a later KIP.
> > > >> > >
> > > >> > > Thanks,
> > > >> > > Magnus
> > > >> > >
> > > >> > >
> > > >> > >
> > > >> > >
> > > >> > > >
> > > >> > > > Thanks,
> > > >> > > >
> > > >> > > > Jun
> > > >> > > >
> > > >> > > > On Mon, Apr 4, 2022 at 10:01 AM Jun Rao <ju...@confluent.io>
> > wrote:
> > > >> > > >
> > > >> > > > > Hi, Xavier,
> > > >> > > > >
> > > >> > > > > Thanks for the reply.
> > > >> > > > >
> > > >> > > > > 28. It does seem that we have started using KafkaMetrics on
> > the
> > > >> > broker
> > > >> > > > > side. Then, my only concern is on the usage of Histogram in
> > > >> > > KafkaMetrics.
> > > >> > > > > Histogram in KafkaMetrics statically divides the value space
> > > into
> > > >> a
> > > >> > > fixed
> > > >> > > > > number of buckets and only returns values on the bucket
> > > boundary.
> > > >> So,
> > > >> > > the
> > > >> > > > > returned histogram value may never show up in a recorded
> > value.
> > > >> > Yammer
> > > >> > > > > Histogram, on the other hand, uses reservoir sampling. The
> > > >> reported
> > > >> > > value
> > > >> > > > > is always one of the recorded values. So, I am not sure that
> > > >> > Histogram
> > > >> > > in
> > > >> > > > > KafkaMetrics is as good as Yammer Histogram.
> > > >> > > > ClientMetricsPluginExportTime
> > > >> > > > > uses Histogram.
> > > >> > > > >
> > > >> > > > > Thanks,
> > > >> > > > >
> > > >> > > > > Jun
> > > >> > > > >
> > > >> > > > > On Thu, Mar 31, 2022 at 5:21 PM Xavier Léauté
> > > >> > > > <xa...@confluent.io.invalid>
> > > >> > > > > wrote:
> > > >> > > > >
> > > >> > > > >> >
> > > >> > > > >> > 28. On the broker, we typically use Yammer metrics. Only
> > for
> > > >> > metrics
> > > >> > > > >> that
> > > >> > > > >> > depend on Kafka metric features (e.g., quota), we use the
> > > Kafka
> > > >> > > > metric.
> > > >> > > > >> > Yammer metrics have 4 types: gauge, meter, histogram and
> > > timer.
> > > >> > > meter
> > > >> > > > >> > calculates a rate, but also exposes an accumulated value.
> > > >> > > > >> >
> > > >> > > > >>
> > > >> > > > >> I don't see a good reason we should limit ourselves to Yammer
> > > >> > metrics
> > > >> > > on
> > > >> > > > >> the broker. KafkaMetrics was written
> > > >> > > > >> to replace Yammer metrics and is used for all new components
> > > >> > (clients,
> > > >> > > > >> streams, connect, etc.)
> > > >> > > > >> My understanding is that the original goal was to retire
> > Yammer
> > > >> > > metrics
> > > >> > > > in
> > > >> > > > >> the broker in favor of KafkaMetrics.
> > > >> > > > >> We just haven't done so out of backwards compatibility
> > > concerns.
> > > >> > > > >> There are other broker metrics such as group coordinator,
> > > >> > transaction
> > > >> > > > >> state
> > > >> > > > >> manager, and various socket server metrics
> > > >> > > > >> already using KafkaMetrics that don't need specific Kafka
> > > metric
> > > >> > > > features,
> > > >> > > > >> so I don't see why we should refrain from using
> > > >> > > > >> Kafka metrics on the broker unless there are real
> > compatibility
> > > >> > > concerns
> > > >> > > > >> or
> > > >> > > > >> where implementation specifics could lead to confusion when
> > > >> > comparing
> > > >> > > > >> metrics using different implementations.
> > > >> > > > >>
> > > >> > > > >> In my opinion we should encourage people to use KafkaMetrics
> > > >> going
> > > >> > > > forward
> > > > >> > > > >> on the broker as well, for three reasons:
> > > >> > > > >> a) yammer metrics is long deprecated and no longer maintained
> > > >> > > > >> b) yammer metrics are much less expressive
> > > >> > > > >> c) we don't have a proper API to expose yammer metrics
> > outside
> > > of
> > > >> > JMX
> > > >> > > > >> (MetricsReporter only exposes KafkaMetrics)
> > > >> > > > >>
> > > >> > > > >
> > > >> > > >
> > > >> > >
> > > >> >
> > > >>
> > > >
> > >
> >
>

Re: [DISCUSS] KIP-714: Client metrics and observability

Posted by Jun Rao <ju...@confluent.io.INVALID>.

Hi, Magnus,

Thanks for the reply.

So, the standard set of generic metrics is just a recommendation and not a
requirement? This sounds good to me since it makes the adoption of the KIP
easier.

Regarding the metric names, I have two concerns. (1) If a client already
has an existing metric similar to the standard one, duplicating the metric
seems to be confusing. (2) If a client needs to implement a standard metric
that doesn't exist yet, using a naming convention (e.g., using dash vs dot)
different from other existing metrics also seems a bit confusing. It seems
that the main benefit of having standard metric names across clients is for
better server side monitoring. Could we do the standardization in the
plugin on the server?
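
To make the question concrete, server-side standardization could amount to a translation table applied by the plugin, along the lines of the Python sketch below. The table contents, the function name, and the assumption that queued bytes equal total minus available buffer bytes are all illustrative guesses based on the names mentioned earlier in this thread, not anything defined by the KIP.

```python
# Hypothetical table from existing Java client names to standard names.
JAVA_TO_STANDARD = {
    "buffer-total-bytes": "client.producer.record.queue.max.bytes",
}


def standardize(client_metrics):
    """Translate client-native metric names inside a server-side plugin."""
    result = {}
    for name, value in client_metrics.items():
        if name == "buffer-available-bytes":
            # Handled below as a derived value rather than a rename.
            continue
        # Names without a known translation pass through unchanged.
        result[JAVA_TO_STANDARD.get(name, name)] = value

    total = client_metrics.get("buffer-total-bytes")
    available = client_metrics.get("buffer-available-bytes")
    if total is not None and available is not None:
        # Assumption: bytes currently queued = total - available.
        result["client.producer.record.queue.bytes"] = total - available
    return result


reported = {
    "buffer-total-bytes": 100,
    "buffer-available-bytes": 40,
    "batch-size-avg": 7,
}
print(standardize(reported))
```

With this approach each client keeps its own naming convention, and only the plugin needs to know the translation.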

Thanks,

Jun



On Tue, Jun 7, 2022 at 6:53 AM Magnus Edenhill <ma...@edenhill.se> wrote:

> Hey Jun,
>
> I've clarified the scope of the standard metrics in the KIP, but basically:
>
>  * We define a standard set of generic metrics that should be relevant to
> most client implementations, e.g., each producer implementation probably
> has some sort of per-partition message queue.
>  * A client implementation should strive to implement as many of the
> standard metrics as possible, but only the ones that make sense.
>  * For metrics that are not in the standard set, a client maintainer can
> choose to either submit a KIP to add additional standard metrics - if
> they're relevant, or go ahead and add custom metrics that are specific to
> that client implementation. These custom metrics will have a prefix
> specific to that client implementation, as opposed to the standard metric
> set that resides under "org.apache.kafka...". E.g.,
> "se.edenhill.librdkafka" or whatever.
>  * Existing non-KIP-714 metrics should remain untouched. In some cases we
> might be able to use the same meter given it is compatible with the
> standard metric set definition, in other cases a semi-duplicate meter may
> be needed. Thus this will not affect the metrics exposed through JMX, or
> vice versa.
>
> Thanks,
> Magnus
>
>
>
> On Wed, Jun 1, 2022 at 18:55, Jun Rao <ju...@confluent.io.invalid> wrote:
>
> > Hi, Magnus,
> >
> > 51. Just to clarify my question.  (1) Are standard metrics required for
> > every client for this KIP to function?  (2) Are we converting existing
> java
> > metrics to the standard metrics and deprecating the old ones? If so,
> could
> > we list all existing java metrics that need to be renamed and the
> > corresponding new name?
> >
> > Thanks,
> >
> > Jun
> >
> > On Tue, May 31, 2022 at 3:29 PM Jun Rao <ju...@confluent.io> wrote:
> >
> > > Hi, Magnus,
> > >
> > > Thanks for the reply.
> > >
> > > 51. I think it's fine to have a list of recommended metrics for every
> > > client to implement. I am just not sure that standardizing on the
> metric
> > > names across all clients is practical. The list of common metrics in
> the
> > > > KIP has completely different names from the Java metric names. Some of
> > > them have different types. For example, some of the common metrics
> have a
> > > type of histogram, but the java client metrics don't use histogram in
> > > general. Requiring the operator to translate those names and understand
> > the
> > > > subtle differences across clients seems to cause more confusion during
> > > troubleshooting.
> > >
> > > Thanks,
> > >
> > > Jun
> > >
> > > On Tue, May 31, 2022 at 5:02 AM Magnus Edenhill <ma...@edenhill.se>
> > > wrote:
> > >
> > >> On Fri, May 20, 2022 at 01:23, Jun Rao <jun@confluent.io.invalid> wrote:
> > >>
> > >> > Hi, Magnus,
> > >> >
> > >> > Thanks for the reply.
> > >> >
> > >> > 50. Sounds good.
> > >> >
> > >> > 51. I misunderstood the proposal in the KIP then. The proposal is
> to
> > >> > define a set of common metric names that every client should
> > implement.
> > >> The
> > >> > problem is that every client already has its own set of metrics with
> > its
> > >> > own names. I am not sure that we could easily agree upon a common
> set
> > of
> > >> > metrics that work with all clients. There are likely to be some
> > metrics
> > >> > that are client specific. Translating between the common name and
> > client
> > >> > specific name is probably going to add more confusion. As mentioned
> in
> > >> the
> > >> > KIP, similar metrics from different clients could have subtle
> > >> > semantic differences. Could we just let each client use its own set
> of
> > >> > metric names?
> > >> >
> > >>
> > >> We identified a common set of metrics that should be relevant for most
> > >> client implementations,
> > >> they're the ones listed in the KIP.
> > >> A supporting client does not have to implement all those metrics, only
> > the
> > >> ones that make sense
> > >> based on that client implementation, and a client may implement other
> > >> metrics that are not listed
> > >> in the KIP under its own namespace.
> > >> This approach has two benefits:
> > >>  - there will be a common set of metrics that most/all clients
> > implement,
> > >> which makes monitoring
> > >>   and troubleshooting easier across fleets with multiple Kafka client
> > >> languages/implementations.
> > >>  - client-specific metrics are still possible, so if there is no
> > suitable
> > >> standard metric a client can still
> > >>    provide what special metrics it has.
> > >>
> > >>
> > >> Thanks,
> > >> Magnus
> > >>
> > >>
> > >> On Thu, May 19, 2022 at 10:39 AM Magnus Edenhill <ma...@edenhill.se>
> > >> wrote:
> > >> >
> > >> > > On Wed, May 18, 2022 at 19:57, Jun Rao <jun@confluent.io.invalid> wrote:
> > >> > >
> > >> > > > Hi, Magnus,
> > >> > > >
> > >> > >
> > >> > > Hi Jun
> > >> > >
> > >> > >
> > >> > > >
> > >> > > > Thanks for the updated KIP. Just a couple of more comments.
> > >> > > >
> > >> > > > 50. To troubleshoot a particular client issue, I imagine that
> the
> > >> > client
> > >> > > > needs to identify its client_instance_id. How does the client
> find
> > >> this
> > >> > > > out? Do we plan to include client_instance_id in the client log,
> > >> expose
> > >> > > it
> > >> > > > as a metric or something else?
> > >> > > >
> > >> > >
> > >> > > The KIP suggests that client implementations emit an informative
> log
> > >> > > message
> > >> > > with the assigned client-instance-id once it is retrieved (once
> per
> > >> > client
> > >> > > instance lifetime).
> > >> > > There's also a clientInstanceId() method that an application can
> use
> > >> to
> > >> > > retrieve
> > >> > > the client instance id and emit through whatever side channels make
> > >> > > sense.
> > >> > >
> > >> > >
> > >> > >
> > >> > > > 51. The KIP lists a bunch of metrics that need to be collected
> at
> > >> the
> > >> > > > client side. However, it seems quite a few useful java client
> > >> metrics
> > >> > > like
> > >> > > > the following are missing.
> > >> > > >     buffer-total-bytes
> > >> > > >     buffer-available-bytes
> > >> > > >
> > >> > >
> > >> > > These are covered by client.producer.record.queue.bytes and
> > >> > > client.producer.record.queue.max.bytes.
> > >> > >
> > >> > >
> > >> > > >     bufferpool-wait-time
> > >> > > >
> > >> > >
> > >> > > Missing, but somewhat implementation specific.
> > >> > > If it was up to me we would add this later if there's a need.
> > >> > >
> > >> > >
> > >> > >
> > >> > > >     batch-size-avg
> > >> > > >     batch-size-max
> > >> > > >
> > >> > >
> > >> > > These are missing and would be suitably represented as a
> histogram.
> > >> I'll
> > >> > > add them.
> > >> > >
> > >> > >
> > >> > >
> > >> > > >     io-wait-ratio
> > >> > > >     io-ratio
> > >> > > >
> > >> > >
> > >> > > There's client.io.wait.time which should cover io-wait-ratio.
> > >> > > We could add a client.io.time as well, now or in a later KIP.
> > >> > >
> > >> > > Thanks,
> > >> > > Magnus
> > >> > >
> > >> > >
> > >> > >
> > >> > >
> > >> > > >
> > >> > > > Thanks,
> > >> > > >
> > >> > > > Jun
> > >> > > >
> > >> > > > On Mon, Apr 4, 2022 at 10:01 AM Jun Rao <ju...@confluent.io>
> wrote:
> > >> > > >
> > >> > > > > Hi, Xavier,
> > >> > > > >
> > >> > > > > Thanks for the reply.
> > >> > > > >
> > >> > > > > 28. It does seem that we have started using KafkaMetrics on
> the
> > >> > broker
> > >> > > > > side. Then, my only concern is on the usage of Histogram in
> > >> > > KafkaMetrics.
> > >> > > > > Histogram in KafkaMetrics statically divides the value space
> > into
> > >> a
> > >> > > fixed
> > >> > > > > number of buckets and only returns values on the bucket
> > boundary.
> > >> So,
> > >> > > the
> > >> > > > > returned histogram value may never show up in a recorded
> value.
> > >> > Yammer
> > >> > > > > Histogram, on the other hand, uses reservoir sampling. The
> > >> reported
> > >> > > value
> > >> > > > > is always one of the recorded values. So, I am not sure that
> > >> > Histogram
> > >> > > in
> > >> > > > > KafkaMetrics is as good as Yammer Histogram.
> > >> > > > ClientMetricsPluginExportTime
> > >> > > > > uses Histogram.
> > >> > > > >
> > >> > > > > Thanks,
> > >> > > > >
> > >> > > > > Jun
> > >> > > > >
> > >> > > > > On Thu, Mar 31, 2022 at 5:21 PM Xavier Léauté
> > >> > > > <xa...@confluent.io.invalid>
> > >> > > > > wrote:
> > >> > > > >
> > >> > > > >> >
> > >> > > > >> > 28. On the broker, we typically use Yammer metrics. Only
> for
> > >> > metrics
> > >> > > > >> that
> > >> > > > >> > depend on Kafka metric features (e.g., quota), we use the
> > Kafka
> > >> > > > metric.
> > >> > > > >> > Yammer metrics have 4 types: gauge, meter, histogram and
> > timer.
> > >> > > meter
> > >> > > > >> > calculates a rate, but also exposes an accumulated value.
> > >> > > > >> >
> > >> > > > >>
> > >> > > > >> I don't see a good reason we should limit ourselves to Yammer
> > >> > metrics
> > >> > > on
> > >> > > > >> the broker. KafkaMetrics was written
> > >> > > > >> to replace Yammer metrics and is used for all new components
> > >> > (clients,
> > >> > > > >> streams, connect, etc.)
> > >> > > > >> My understanding is that the original goal was to retire
> Yammer
> > >> > > metrics
> > >> > > > in
> > >> > > > >> the broker in favor of KafkaMetrics.
> > >> > > > >> We just haven't done so out of backwards compatibility
> > concerns.
> > >> > > > >> There are other broker metrics such as group coordinator,
> > >> > transaction
> > >> > > > >> state
> > >> > > > >> manager, and various socket server metrics
> > >> > > > >> already using KafkaMetrics that don't need specific Kafka
> > metric
> > >> > > > features,
> > >> > > > >> so I don't see why we should refrain from using
> > >> > > > >> Kafka metrics on the broker unless there are real
> compatibility
> > >> > > concerns
> > >> > > > >> or
> > >> > > > >> where implementation specifics could lead to confusion when
> > >> > comparing
> > >> > > > >> metrics using different implementations.
> > >> > > > >>
> > >> > > > >> In my opinion we should encourage people to use KafkaMetrics
> > >> going
> > >> > > > forward
> > >> > > > >> on the broker as well, for three reasons:
> > >> > > > >> a) yammer metrics is long deprecated and no longer maintained
> > >> > > > >> b) yammer metrics are much less expressive
> > >> > > > >> c) we don't have a proper API to expose yammer metrics
> outside
> > of
> > >> > JMX
> > >> > > > >> (MetricsReporter only exposes KafkaMetrics)
> > >> > > > >>
> > >> > > > >
> > >> > > >
> > >> > >
> > >> >
> > >>
> > >
> >
>

Re: [DISCUSS] KIP-714: Client metrics and observability

Posted by Magnus Edenhill <ma...@edenhill.se>.

Hey Jun,

I've clarified the scope of the standard metrics in the KIP, but basically:

 * We define a standard set of generic metrics that should be relevant to
most client implementations, e.g., each producer implementation probably
has some sort of per-partition message queue.
 * A client implementation should strive to implement as many of the
standard metrics as possible, but only the ones that make sense.
 * For metrics that are not in the standard set, a client maintainer can
either submit a KIP to add further standard metrics (if they're relevant),
or go ahead and add custom metrics that are specific to that client
implementation. These custom metrics will have a prefix specific to that
client implementation, e.g., "se.edenhill.librdkafka", as opposed to the
standard metric set that resides under "org.apache.kafka...".
 * Existing non-KIP-714 metrics should remain untouched. In some cases we
might be able to use the same meter, provided it is compatible with the
standard metric set definition; in other cases a semi-duplicate meter may
be needed. Thus this will not affect the metrics exposed through JMX, or
vice versa.
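
As a rough sketch of the namespace split described in the list above: only
the two prefixes come from this message. The helper name classify_metric,
the full example metric names, and the way a prefix is joined to a metric
name are all assumptions made for illustration.

```python
# Prefix of the standard metric set, as quoted in the message above.
STANDARD_PREFIX = "org.apache.kafka."


def classify_metric(name: str) -> str:
    """Return 'standard' for names under the shared namespace,
    otherwise 'client-specific'."""
    if name.startswith(STANDARD_PREFIX):
        return "standard"
    return "client-specific"


# Hypothetical fully qualified names, used only to exercise the helper.
example_names = [
    "org.apache.kafka.client.producer.record.queue.bytes",
    "se.edenhill.librdkafka.some.custom.metric",
]

for metric_name in example_names:
    print(metric_name, "->", classify_metric(metric_name))
```

A collector could use a check like this to decide which metrics it can
compare across client implementations and which need per-client handling.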

Thanks,
Magnus



On Wed, Jun 1, 2022 at 18:55, Jun Rao <ju...@confluent.io.invalid> wrote:

> Hi, Magnus,
>
> 51. Just to clarify my question.  (1) Are standard metrics required for
> every client for this KIP to function?  (2) Are we converting existing java
> metrics to the standard metrics and deprecating the old ones? If so, could
> we list all existing java metrics that need to be renamed and the
> corresponding new name?
>
> Thanks,
>
> Jun
>
> On Tue, May 31, 2022 at 3:29 PM Jun Rao <ju...@confluent.io> wrote:
>
> > Hi, Magnus,
> >
> > Thanks for the reply.
> >
> > 51. I think it's fine to have a list of recommended metrics for every
> > client to implement. I am just not sure that standardizing on the metric
> > names across all clients is practical. The list of common metrics in the
> > KIP has completely different names from the Java metric names. Some of
> > them have different types. For example, some of the common metrics have a
> > type of histogram, but the Java client metrics don't use histograms in
> > general. Requiring the operator to translate those names and understand the
> > subtle differences across clients seems likely to cause more confusion
> > during troubleshooting.
> >
> > Thanks,
> >
> > Jun
> >
> > On Tue, May 31, 2022 at 5:02 AM Magnus Edenhill <ma...@edenhill.se>
> > wrote:
> >
> >> On Fri, May 20, 2022 at 01:23, Jun Rao <ju...@confluent.io.invalid> wrote:
> >>
> >> > Hi, Magnus,
> >> >
> >> > Thanks for the reply.
> >> >
> >> > 50. Sounds good.
> >> >
> >> > 51. I misunderstood the proposal in the KIP then. The proposal is to
> >> > define a set of common metric names that every client should
> implement.
> >> The
> >> > problem is that every client already has its own set of metrics with
> its
> >> > own names. I am not sure that we could easily agree upon a common set
> of
> >> > metrics that work with all clients. There are likely to be some
> metrics
> >> > that are client specific. Translating between the common name and
> client
> >> > specific name is probably going to add more confusion. As mentioned in
> >> the
> >> > KIP, similar metrics from different clients could have subtle
> >> > semantic differences. Could we just let each client use its own set of
> >> > metric names?
> >> >
> >>
> >> We identified a common set of metrics that should be relevant for most
> >> client implementations,
> >> they're the ones listed in the KIP.
> >> A supporting client does not have to implement all those metrics, only
> the
> >> ones that make sense
> >> based on that client implementation, and a client may implement other
> >> metrics that are not listed
> >> in the KIP under its own namespace.
> >> This approach has two benefits:
> >>  - there will be a common set of metrics that most/all clients
> implement,
> >> which makes monitoring
> >>   and troubleshooting easier across fleets with multiple Kafka client
> >> languages/implementations.
> >>  - client-specific metrics are still possible, so if there is no
> suitable
> >> standard metric a client can still
> >>    provide what special metrics it has.
> >>
> >>
> >> Thanks,
> >> Magnus
> >>
> >>
> >> On Thu, May 19, 2022 at 10:39 AM Magnus Edenhill <ma...@edenhill.se>
> >> wrote:
> >> >
> >> > > On Wed, May 18, 2022 at 19:57, Jun Rao <jun@confluent.io.invalid> wrote:
> >> > >
> >> > > > Hi, Magnus,
> >> > > >
> >> > >
> >> > > Hi Jun
> >> > >
> >> > >
> >> > > >
> >> > > > Thanks for the updated KIP. Just a couple of more comments.
> >> > > >
> >> > > > 50. To troubleshoot a particular client issue, I imagine that the
> >> > client
> >> > > > needs to identify its client_instance_id. How does the client find
> >> this
> >> > > > out? Do we plan to include client_instance_id in the client log,
> >> expose
> >> > > it
> >> > > > as a metric or something else?
> >> > > >
> >> > >
> >> > > The KIP suggests that client implementations emit an informative log
> >> > > message
> >> > > with the assigned client-instance-id once it is retrieved (once per
> >> > client
> >> > > instance lifetime).
> >> > > There's also a clientInstanceId() method that an application can use
> >> to
> >> > > retrieve
> >> > > the client instance id and emit through whatever side channels make
> >> > sense.
> >> > >
> >> > >
> >> > >
> >> > > > 51. The KIP lists a bunch of metrics that need to be collected at
> >> the
> >> > > > client side. However, it seems quite a few useful java client
> >> metrics
> >> > > like
> >> > > > the following are missing.
> >> > > >     buffer-total-bytes
> >> > > >     buffer-available-bytes
> >> > > >
> >> > >
> >> > > These are covered by client.producer.record.queue.bytes and
> >> > > client.producer.record.queue.max.bytes.
> >> > >
> >> > >
> >> > > >     bufferpool-wait-time
> >> > > >
> >> > >
> >> > > Missing, but somewhat implementation specific.
> >> > > If it was up to me we would add this later if there's a need.
> >> > >
> >> > >
> >> > >
> >> > > >     batch-size-avg
> >> > > >     batch-size-max
> >> > > >
> >> > >
> >> > > These are missing and would be suitably represented as a histogram.
> >> I'll
> >> > > add them.
> >> > >
> >> > >
> >> > >
> >> > > >     io-wait-ratio
> >> > > >     io-ratio
> >> > > >
> >> > >
> >> > > There's client.io.wait.time which should cover io-wait-ratio.
> >> > > We could add a client.io.time as well, now or in a later KIP.
> >> > >
> >> > > Thanks,
> >> > > Magnus
> >> > >
> >> > >
> >> > >
> >> > >
> >> > > >
> >> > > > Thanks,
> >> > > >
> >> > > > Jun
> >> > > >
> >> > > > On Mon, Apr 4, 2022 at 10:01 AM Jun Rao <ju...@confluent.io> wrote:
> >> > > >
> >> > > > > Hi, Xavier,
> >> > > > >
> >> > > > > Thanks for the reply.
> >> > > > >
> >> > > > > 28. It does seem that we have started using KafkaMetrics on the
> >> > broker
> >> > > > > side. Then, my only concern is on the usage of Histogram in
> >> > > KafkaMetrics.
> >> > > > > Histogram in KafkaMetrics statically divides the value space
> into
> >> a
> >> > > fixed
> >> > > > > number of buckets and only returns values on the bucket
> boundary.
> >> So,
> >> > > the
> >> > > > > returned histogram value may never show up in a recorded value.
> >> > Yammer
> >> > > > > Histogram, on the other hand, uses reservoir sampling. The
> >> reported
> >> > > value
> >> > > > > is always one of the recorded values. So, I am not sure that
> >> > Histogram
> >> > > in
> >> > > > > KafkaMetrics is as good as Yammer Histogram.
> >> > > > ClientMetricsPluginExportTime
> >> > > > > uses Histogram.
> >> > > > >
> >> > > > > Thanks,
> >> > > > >
> >> > > > > Jun
> >> > > > >
> >> > > > > On Thu, Mar 31, 2022 at 5:21 PM Xavier Léauté
> >> > > > <xa...@confluent.io.invalid>
> >> > > > > wrote:
> >> > > > >
> >> > > > >> >
> >> > > > >> > 28. On the broker, we typically use Yammer metrics. Only for
> >> > metrics
> >> > > > >> that
> >> > > > >> > depend on Kafka metric features (e.g., quota), we use the
> Kafka
> >> > > > metric.
> >> > > > >> > Yammer metrics have 4 types: gauge, meter, histogram and
> timer.
> >> > > meter
> >> > > > >> > calculates a rate, but also exposes an accumulated value.
> >> > > > >> >
> >> > > > >>
> >> > > > >> I don't see a good reason we should limit ourselves to Yammer
> >> > metrics
> >> > > on
> >> > > > >> the broker. KafkaMetrics was written
> >> > > > >> to replace Yammer metrics and is used for all new components
> >> > (clients,
> >> > > > >> streams, connect, etc.)
> >> > > > >> My understanding is that the original goal was to retire Yammer
> >> > > metrics
> >> > > > in
> >> > > > >> the broker in favor of KafkaMetrics.
> >> > > > >> We just haven't done so out of backwards compatibility
> concerns.
> >> > > > >> There are other broker metrics such as group coordinator,
> >> > transaction
> >> > > > >> state
> >> > > > >> manager, and various socket server metrics
> >> > > > >> already using KafkaMetrics that don't need specific Kafka
> metric
> >> > > > features,
> >> > > > >> so I don't see why we should refrain from using
> >> > > > >> Kafka metrics on the broker unless there are real compatibility
> >> > > concerns
> >> > > > >> or
> >> > > > >> where implementation specifics could lead to confusion when
> >> > comparing
> >> > > > >> metrics using different implementations.
> >> > > > >>
> >> > > > >> In my opinion we should encourage people to use KafkaMetrics
> >> going
> >> > > > forward
> >> > > > >> on the broker as well, for three reasons:
> >> > > > >> a) yammer metrics is long deprecated and no longer maintained
> >> > > > >> b) yammer metrics are much less expressive
> >> > > > >> c) we don't have a proper API to expose yammer metrics outside
> of
> >> > JMX
> >> > > > >> (MetricsReporter only exposes KafkaMetrics)
> >> > > > >>
> >> > > > >
> >> > > >
> >> > >
> >> >
> >>
> >
>

Re: [DISCUSS] KIP-714: Client metrics and observability

Posted by Jun Rao <ju...@confluent.io.INVALID>.

Hi, Magnus,

51. Just to clarify my question: (1) Are the standard metrics required for
every client for this KIP to function? (2) Are we converting the existing
Java metrics to the standard metrics and deprecating the old ones? If so,
could we list all existing Java metrics that need to be renamed, along with
their corresponding new names?
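
One way to keep such a listing is a small lookup table. The sketch below
records only the correspondences already mentioned earlier in this thread;
it is not a confirmed rename plan, and the table and function names are
hypothetical.

```python
# Standard metrics named in the thread as covering the producer buffer metrics.
QUEUE_METRICS = [
    "client.producer.record.queue.bytes",
    "client.producer.record.queue.max.bytes",
]

# Java producer metric name -> standard metric names said to cover it.
# An empty list means the thread describes it as not (yet) covered.
COVERAGE = {
    # The thread gives these two as covered jointly by the pair above.
    "buffer-total-bytes": QUEUE_METRICS,
    "buffer-available-bytes": QUEUE_METRICS,
    "bufferpool-wait-time": [],  # described as implementation specific
    "batch-size-avg": [],        # to be added as a histogram; no name given
    "batch-size-max": [],
    "io-wait-ratio": ["client.io.wait.time"],
    "io-ratio": [],              # client.io.time was only floated as an option
}


def standard_equivalents(java_name):
    """Return the standard metric names covering java_name (possibly empty)."""
    return list(COVERAGE[java_name])


def uncovered():
    """Java metric names with no standard counterpart recorded in the table."""
    return [name for name, std in COVERAGE.items() if not std]


print(standard_equivalents("io-wait-ratio"))
print(uncovered())
```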

Thanks,

Jun

On Tue, May 31, 2022 at 3:29 PM Jun Rao <ju...@confluent.io> wrote:

> Hi, Magnus,
>
> Thanks for the reply.
>
> 51. I think it's fine to have a list of recommended metrics for every
> client to implement. I am just not sure that standardizing on the metric
> names across all clients is practical. The list of common metrics in the
> KIP has completely different names from the Java metric names. Some of
> them have different types. For example, some of the common metrics have a
> type of histogram, but the Java client metrics don't use histograms in
> general. Requiring the operator to translate those names and understand the
> subtle differences across clients seems likely to cause more confusion during
> troubleshooting.
>
> Thanks,
>
> Jun
>
> On Tue, May 31, 2022 at 5:02 AM Magnus Edenhill <ma...@edenhill.se>
> wrote:
>
>> On Fri, May 20, 2022 at 01:23, Jun Rao <ju...@confluent.io.invalid> wrote:
>>
>> > Hi, Magnus,
>> >
>> > Thanks for the reply.
>> >
>> > 50. Sounds good.
>> >
>> > 51. I misunderstood the proposal in the KIP then. The proposal is to
>> > define a set of common metric names that every client should implement.
>> The
>> > problem is that every client already has its own set of metrics with its
>> > own names. I am not sure that we could easily agree upon a common set of
>> > metrics that work with all clients. There are likely to be some metrics
>> > that are client specific. Translating between the common name and client
>> > specific name is probably going to add more confusion. As mentioned in
>> the
>> > KIP, similar metrics from different clients could have subtle
>> > semantic differences. Could we just let each client use its own set of
>> > metric names?
>> >
>>
>> We identified a common set of metrics that should be relevant for most
>> client implementations,
>> they're the ones listed in the KIP.
>> A supporting client does not have to implement all those metrics, only the
>> ones that make sense
>> based on that client implementation, and a client may implement other
>> metrics that are not listed
>> in the KIP under its own namespace.
>> This approach has two benefits:
>>  - there will be a common set of metrics that most/all clients implement,
>> which makes monitoring
>>   and troubleshooting easier across fleets with multiple Kafka client
>> languages/implementations.
>>  - client-specific metrics are still possible, so if there is no suitable
>> standard metric a client can still
>>    provide what special metrics it has.
>>
>>
>> Thanks,
>> Magnus
>>
>>
>> On Thu, May 19, 2022 at 10:39 AM Magnus Edenhill <ma...@edenhill.se>
>> wrote:
>> >
>> > > On Wed, May 18, 2022 at 19:57, Jun Rao <jun@confluent.io.invalid> wrote:
>> > >
>> > > > Hi, Magnus,
>> > > >
>> > >
>> > > Hi Jun
>> > >
>> > >
>> > > >
>> > > > Thanks for the updated KIP. Just a couple of more comments.
>> > > >
>> > > > 50. To troubleshoot a particular client issue, I imagine that the
>> > client
>> > > > needs to identify its client_instance_id. How does the client find
>> this
>> > > > out? Do we plan to include client_instance_id in the client log,
>> expose
>> > > it
>> > > > as a metric or something else?
>> > > >
>> > >
>> > > The KIP suggests that client implementations emit an informative log
>> > > message
>> > > with the assigned client-instance-id once it is retrieved (once per
>> > client
>> > > instance lifetime).
>> > > There's also a clientInstanceId() method that an application can use
>> to
>> > > retrieve
>> > > the client instance id and emit through whatever side channels make
>> > sense.
>> > >
>> > >
>> > >
>> > > > 51. The KIP lists a bunch of metrics that need to be collected at
>> the
>> > > > client side. However, it seems quite a few useful java client
>> metrics
>> > > like
>> > > > the following are missing.
>> > > >     buffer-total-bytes
>> > > >     buffer-available-bytes
>> > > >
>> > >
>> > > These are covered by client.producer.record.queue.bytes and
>> > > client.producer.record.queue.max.bytes.
>> > >
>> > >
>> > > >     bufferpool-wait-time
>> > > >
>> > >
>> > > Missing, but somewhat implementation specific.
>> > > If it was up to me we would add this later if there's a need.
>> > >
>> > >
>> > >
>> > > >     batch-size-avg
>> > > >     batch-size-max
>> > > >
>> > >
>> > > These are missing and would be suitably represented as a histogram.
>> I'll
>> > > add them.
>> > >
>> > >
>> > >
>> > > >     io-wait-ratio
>> > > >     io-ratio
>> > > >
>> > >
>> > > There's client.io.wait.time which should cover io-wait-ratio.
>> > > We could add a client.io.time as well, now or in a later KIP.
>> > >
>> > > Thanks,
>> > > Magnus
>> > >
>> > >
>> > >
>> > >
>> > > >
>> > > > Thanks,
>> > > >
>> > > > Jun
>> > > >
>> > > > On Mon, Apr 4, 2022 at 10:01 AM Jun Rao <ju...@confluent.io> wrote:
>> > > >
>> > > > > Hi, Xavier,
>> > > > >
>> > > > > Thanks for the reply.
>> > > > >
>> > > > > 28. It does seem that we have started using KafkaMetrics on the
>> > broker
>> > > > > side. Then, my only concern is on the usage of Histogram in
>> > > KafkaMetrics.
>> > > > > Histogram in KafkaMetrics statically divides the value space into
>> a
>> > > fixed
>> > > > > number of buckets and only returns values on the bucket boundary.
>> So,
>> > > the
>> > > > > returned histogram value may never show up in a recorded value.
>> > Yammer
>> > > > > Histogram, on the other hand, uses reservoir sampling. The
>> reported
>> > > value
>> > > > > is always one of the recorded values. So, I am not sure that
>> > Histogram
>> > > in
>> > > > > KafkaMetrics is as good as Yammer Histogram.
>> > > > ClientMetricsPluginExportTime
>> > > > > uses Histogram.
>> > > > >
>> > > > > Thanks,
>> > > > >
>> > > > > Jun
>> > > > >
>> > > > > On Thu, Mar 31, 2022 at 5:21 PM Xavier Léauté
>> > > > <xa...@confluent.io.invalid>
>> > > > > wrote:
>> > > > >
>> > > > >> >
>> > > > >> > 28. On the broker, we typically use Yammer metrics. Only for
>> > metrics
>> > > > >> that
>> > > > >> > depend on Kafka metric features (e.g., quota), we use the Kafka
>> > > > metric.
>> > > > >> > Yammer metrics have 4 types: gauge, meter, histogram and timer.
>> > > meter
>> > > > >> > calculates a rate, but also exposes an accumulated value.
>> > > > >> >
>> > > > >>
>> > > > >> I don't see a good reason we should limit ourselves to Yammer
>> > metrics
>> > > on
>> > > > >> the broker. KafkaMetrics was written
>> > > > >> to replace Yammer metrics and is used for all new components
>> > (clients,
>> > > > >> streams, connect, etc.)
>> > > > >> My understanding is that the original goal was to retire Yammer
>> > > metrics
>> > > > in
>> > > > >> the broker in favor of KafkaMetrics.
>> > > > >> We just haven't done so out of backwards compatibility concerns.
>> > > > >> There are other broker metrics such as group coordinator,
>> > transaction
>> > > > >> state
>> > > > >> manager, and various socket server metrics
>> > > > >> already using KafkaMetrics that don't need specific Kafka metric
>> > > > features,
>> > > > >> so I don't see why we should refrain from using
>> > > > >> Kafka metrics on the broker unless there are real compatibility
>> > > concerns
>> > > > >> or
>> > > > >> where implementation specifics could lead to confusion when
>> > comparing
>> > > > >> metrics using different implementations.
>> > > > >>
>> > > > >> In my opinion we should encourage people to use KafkaMetrics
>> going
>> > > > forward
>> > > > >> on the broker as well, for three reasons:
>> > > > >> a) yammer metrics is long deprecated and no longer maintained
>> > > > >> b) yammer metrics are much less expressive
>> > > > >> c) we don't have a proper API to expose yammer metrics outside of
>> > JMX
>> > > > >> (MetricsReporter only exposes KafkaMetrics)
>> > > > >>
>> > > > >
>> > > >
>> > >
>> >
>>
>

Re: [DISCUSS] KIP-714: Client metrics and observability

Posted by Jun Rao <ju...@confluent.io.INVALID>.

Hi, Magnus,

Thanks for the reply.

51. I think it's fine to have a list of recommended metrics for every
client to implement. I am just not sure that standardizing on the metric
names across all clients is practical. The list of common metrics in the
KIP has completely different names from the Java metric names. Some of
them have different types. For example, some of the common metrics have a
type of histogram, but the Java client metrics don't use histograms in
general. Requiring the operator to translate those names and understand the
subtle differences across clients seems likely to cause more confusion
during troubleshooting.
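
To make the type difference concrete, here is a minimal sketch of what an
operator would face when one client reports an average and a maximum while
another reports a histogram. It assumes a histogram that carries bucket
counts plus a running sum and count; the bucket bounds, sample values, and
function names are invented for illustration.

```python
from bisect import bisect_left

# Hypothetical bucket upper bounds, in bytes (each bound is inclusive).
BOUNDS = [1024, 4096, 16384, 65536]


def to_histogram(values):
    """Summarize values as bucket counts plus an exact sum and count."""
    counts = [0] * (len(BOUNDS) + 1)  # last slot is the overflow bucket
    for v in values:
        counts[bisect_left(BOUNDS, v)] += 1
    return {"counts": counts, "sum": sum(values), "count": len(values)}


def avg_from_histogram(h):
    """The average is exact because sum and count are carried along."""
    return h["sum"] / h["count"]


def max_upper_bound(h):
    """Upper bound of the highest non-empty bucket (None for overflow or empty)."""
    for i in range(len(h["counts"]) - 1, -1, -1):
        if h["counts"][i]:
            return BOUNDS[i] if i < len(BOUNDS) else None
    return None


batch_sizes = [900, 3000, 3500, 12000]
h = to_histogram(batch_sizes)
print("avg:", avg_from_histogram(h))          # exact average
print("true max:", max(batch_sizes))          # what an avg/max metric reports
print("max bound:", max_upper_bound(h))       # all the histogram can say
```

Under these assumptions the average survives the change of type, but the
maximum is only known up to a bucket boundary, which is the kind of subtle
difference an operator would need to keep in mind.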

Thanks,

Jun

On Tue, May 31, 2022 at 5:02 AM Magnus Edenhill <ma...@edenhill.se> wrote:

> On Fri, May 20, 2022 at 01:23, Jun Rao <ju...@confluent.io.invalid> wrote:
>
> > Hi, Magnus,
> >
> > Thanks for the reply.
> >
> > 50. Sounds good.
> >
> > 51. I misunderstood the proposal in the KIP then. The proposal is to
> > define a set of common metric names that every client should implement.
> The
> > problem is that every client already has its own set of metrics with its
> > own names. I am not sure that we could easily agree upon a common set of
> > metrics that work with all clients. There are likely to be some metrics
> > that are client specific. Translating between the common name and client
> > specific name is probably going to add more confusion. As mentioned in
> the
> > KIP, similar metrics from different clients could have subtle
> > semantic differences. Could we just let each client use its own set of
> > metric names?
> >
>
> We identified a common set of metrics that should be relevant for most
> client implementations,
> they're the ones listed in the KIP.
> A supporting client does not have to implement all those metrics, only the
> ones that make sense
> based on that client implementation, and a client may implement other
> metrics that are not listed
> in the KIP under its own namespace.
> This approach has two benefits:
>  - there will be a common set of metrics that most/all clients implement,
> which makes monitoring
>   and troubleshooting easier across fleets with multiple Kafka client
> languages/implementations.
>  - client-specific metrics are still possible, so if there is no suitable
> standard metric a client can still
>    provide what special metrics it has.
>
>
> Thanks,
> Magnus
>
>
> On Thu, May 19, 2022 at 10:39 AM Magnus Edenhill <ma...@edenhill.se>
> wrote:
> >
> > > On Wed, May 18, 2022 at 19:57, Jun Rao <ju...@confluent.io.invalid> wrote:
> > >
> > > > Hi, Magnus,
> > > >
> > >
> > > Hi Jun
> > >
> > >
> > > >
> > > > Thanks for the updated KIP. Just a couple of more comments.
> > > >
> > > > 50. To troubleshoot a particular client issue, I imagine that the
> > client
> > > > needs to identify its client_instance_id. How does the client find
> this
> > > > out? Do we plan to include client_instance_id in the client log,
> expose
> > > it
> > > > as a metric or something else?
> > > >
> > >
> > > The KIP suggests that client implementations emit an informative log
> > > message
> > > with the assigned client-instance-id once it is retrieved (once per
> > client
> > > instance lifetime).
> > > There's also a clientInstanceId() method that an application can use to
> > > retrieve
> > > the client instance id and emit through whatever side channels make
> > sense.
> > >
> > >
> > >
> > > > 51. The KIP lists a bunch of metrics that need to be collected at the
> > > > client side. However, it seems quite a few useful java client metrics
> > > like
> > > > the following are missing.
> > > >     buffer-total-bytes
> > > >     buffer-available-bytes
> > > >
> > >
> > > These are covered by client.producer.record.queue.bytes and
> > > client.producer.record.queue.max.bytes.
> > >
> > >
> > > >     bufferpool-wait-time
> > > >
> > >
> > > Missing, but somewhat implementation specific.
> > > If it was up to me we would add this later if there's a need.
> > >
> > >
> > >
> > > >     batch-size-avg
> > > >     batch-size-max
> > > >
> > >
> > > These are missing and would be suitably represented as a histogram.
> I'll
> > > add them.
> > >
> > >
> > >
> > > >     io-wait-ratio
> > > >     io-ratio
> > > >
> > >
> > > There's client.io.wait.time which should cover io-wait-ratio.
> > > We could add a client.io.time as well, now or in a later KIP.
> > >
> > > Thanks,
> > > Magnus
> > >
> > >
> > >
> > >
> > > >
> > > > Thanks,
> > > >
> > > > Jun
> > > >
> > > > On Mon, Apr 4, 2022 at 10:01 AM Jun Rao <ju...@confluent.io> wrote:
> > > >
> > > > > Hi, Xavier,
> > > > >
> > > > > Thanks for the reply.
> > > > >
> > > > > 28. It does seem that we have started using KafkaMetrics on the
> > broker
> > > > > side. Then, my only concern is on the usage of Histogram in
> > > KafkaMetrics.
> > > > > Histogram in KafkaMetrics statically divides the value space into a
> > > fixed
> > > > > number of buckets and only returns values on the bucket boundary.
> So,
> > > the
> > > > > returned histogram value may never show up in a recorded value.
> > Yammer
> > > > > Histogram, on the other hand, uses reservoir sampling. The reported
> > > value
> > > > > is always one of the recorded values. So, I am not sure that
> > Histogram
> > > in
> > > > > KafkaMetrics is as good as Yammer Histogram.
> > > > ClientMetricsPluginExportTime
> > > > > uses Histogram.
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Jun
> > > > >
> > > > > On Thu, Mar 31, 2022 at 5:21 PM Xavier Léauté <xa...@confluent.io.invalid>
> > > > > wrote:
> > > > >
> > > > >> >
> > > > >> > 28. On the broker, we typically use Yammer metrics. Only for metrics
> > > > >> > that depend on Kafka metric features (e.g., quota), we use the Kafka
> > > > >> > metric. Yammer metrics have 4 types: gauge, meter, histogram and timer.
> > > > >> > meter calculates a rate, but also exposes an accumulated value.
> > > > >> >
> > > > >>
> > > > >> I don't see a good reason we should limit ourselves to Yammer metrics on
> > > > >> the broker. KafkaMetrics was written to replace Yammer metrics and is
> > > > >> used for all new components (clients, streams, connect, etc.)
> > > > >> My understanding is that the original goal was to retire Yammer metrics
> > > > >> in the broker in favor of KafkaMetrics.
> > > > >> We just haven't done so out of backwards compatibility concerns.
> > > > >> There are other broker metrics such as group coordinator, transaction
> > > > >> state manager, and various socket server metrics already using
> > > > >> KafkaMetrics that don't need specific Kafka metric features, so I don't
> > > > >> see why we should refrain from using Kafka metrics on the broker unless
> > > > >> there are real compatibility concerns or cases where implementation
> > > > >> specifics could lead to confusion when comparing metrics using different
> > > > >> implementations.
> > > > >>
> > > > >> In my opinion we should encourage people to use KafkaMetrics going
> > > > >> forward on the broker as well, for three reasons:
> > > > >> a) yammer metrics is long deprecated and no longer maintained
> > > > >> b) yammer metrics are much less expressive
> > > > >> c) we don't have a proper API to expose yammer metrics outside of JMX
> > > > >> (MetricsReporter only exposes KafkaMetrics)
> > > > >>
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] KIP-714: Client metrics and observability

Posted by Magnus Edenhill <ma...@edenhill.se>.

On Fri, 20 May 2022 at 01:23, Jun Rao <ju...@confluent.io.invalid> wrote:

> Hi, Magnus,
>
> Thanks for the reply.
>
> 50. Sounds good.
>
> 51. I misunderstood the proposal in the KIP then. The proposal is to
> define a set of common metric names that every client should implement. The
> problem is that every client already has its own set of metrics with its
> own names. I am not sure that we could easily agree upon a common set of
> metrics that work with all clients. There are likely to be some metrics
> that are client-specific. Translating between the common name and the
> client-specific name is probably going to add more confusion. As mentioned
> in the KIP, similar metrics from different clients could have subtle
> semantic differences. Could we just let each client use its own set of
> metric names?
>

We identified a common set of metrics that should be relevant for most
client implementations; they are the ones listed in the KIP.
A supporting client does not have to implement all of those metrics, only
the ones that make sense for that client implementation, and a client may
implement other metrics that are not listed in the KIP under its own
namespace.
This approach has two benefits:
 - there will be a common set of metrics that most or all clients implement,
   which makes monitoring and troubleshooting easier across fleets with
   multiple Kafka client languages/implementations.
 - client-specific metrics are still possible, so if there is no suitable
   standard metric, a client can still provide whatever special metrics it
   has.
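
A minimal Python sketch of that split, under stated assumptions: the
STANDARD_METRICS set below contains only names mentioned in this thread (the
KIP's list is longer), while the "myclient." namespace prefix and the
classify() helper are hypothetical and only illustrate the idea.

```python
# Standard names taken from this thread; the KIP's full list is longer.
STANDARD_METRICS = {
    "client.producer.record.queue.bytes",
    "client.producer.record.queue.max.bytes",
    "client.io.wait.time",
}

# Hypothetical namespace for metrics that only this client implements.
CLIENT_NAMESPACE = "myclient."


def classify(metric_names):
    """Split a client's metric names into standard, client-specific, unknown."""
    result = {"standard": [], "client_specific": [], "unknown": []}
    for name in sorted(metric_names):
        if name in STANDARD_METRICS:
            result["standard"].append(name)
        elif name.startswith(CLIENT_NAMESPACE):
            result["client_specific"].append(name)
        else:
            result["unknown"].append(name)
    return result
```

A client that implements only part of the standard set simply yields a
shorter "standard" list; nothing in this sketch requires the full set.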


Thanks,
Magnus


> On Thu, May 19, 2022 at 10:39 AM Magnus Edenhill <ma...@edenhill.se> wrote:
>
> > On Wed, 18 May 2022 at 19:57, Jun Rao <ju...@confluent.io.invalid> wrote:
> >
> > > Hi, Magnus,
> > >
> >
> > Hi Jun
> >
> >
> > >
> > > Thanks for the updated KIP. Just a couple of more comments.
> > >
> > > 50. To troubleshoot a particular client issue, I imagine that the client
> > > needs to identify its client_instance_id. How does the client find this
> > > out? Do we plan to include client_instance_id in the client log, expose it
> > > as a metric or something else?
> > >
> >
> > The KIP suggests that client implementations emit an informative log
> > message with the assigned client-instance-id once it is retrieved (once
> > per client instance lifetime).
> > There's also a clientInstanceId() method that an application can use to
> > retrieve the client instance id and emit it through whatever side
> > channels make sense.
> >
> >
> >
> > > 51. The KIP lists a bunch of metrics that need to be collected at the
> > > client side. However, it seems quite a few useful java client metrics like
> > > the following are missing.
> > >     buffer-total-bytes
> > >     buffer-available-bytes
> > >
> >
> > These are covered by client.producer.record.queue.bytes and
> > client.producer.record.queue.max.bytes.
> >
> >
> > >     bufferpool-wait-time
> > >
> >
> > Missing, but somewhat implementation specific.
> > If it was up to me we would add this later if there's a need.
> >
> >
> >
> > >     batch-size-avg
> > >     batch-size-max
> > >
> >
> > These are missing and would be suitably represented as a histogram. I'll
> > add them.
> >
> >
> >
> > >     io-wait-ratio
> > >     io-ratio
> > >
> >
> > There's client.io.wait.time which should cover io-wait-ratio.
> > We could add a client.io.time as well, now or in a later KIP.
> >
> > Thanks,
> > Magnus
> >
> >
> >
> >
> > >
> > > Thanks,
> > >
> > > Jun
> > >
> > > On Mon, Apr 4, 2022 at 10:01 AM Jun Rao <ju...@confluent.io> wrote:
> > >
> > > > Hi, Xavier,
> > > >
> > > > Thanks for the reply.
> > > >
> > > > 28. It does seem that we have started using KafkaMetrics on the broker
> > > > side. Then, my only concern is on the usage of Histogram in KafkaMetrics.
> > > > Histogram in KafkaMetrics statically divides the value space into a fixed
> > > > number of buckets and only returns values on the bucket boundary. So, the
> > > > returned histogram value may never show up in a recorded value. Yammer
> > > > Histogram, on the other hand, uses reservoir sampling. The reported value
> > > > is always one of the recorded values. So, I am not sure that Histogram in
> > > > KafkaMetrics is as good as Yammer Histogram. ClientMetricsPluginExportTime
> > > > uses Histogram.
> > > >
> > > > Thanks,
> > > >
> > > > Jun
> > > >
> > > > On Thu, Mar 31, 2022 at 5:21 PM Xavier Léauté <xa...@confluent.io.invalid>
> > > > wrote:
> > > >
> > > >> >
> > > >> > 28. On the broker, we typically use Yammer metrics. Only for metrics
> > > >> > that depend on Kafka metric features (e.g., quota), we use the Kafka
> > > >> > metric. Yammer metrics have 4 types: gauge, meter, histogram and timer.
> > > >> > meter calculates a rate, but also exposes an accumulated value.
> > > >> >
> > > >>
> > > >> I don't see a good reason we should limit ourselves to Yammer metrics on
> > > >> the broker. KafkaMetrics was written to replace Yammer metrics and is
> > > >> used for all new components (clients, streams, connect, etc.)
> > > >> My understanding is that the original goal was to retire Yammer metrics
> > > >> in the broker in favor of KafkaMetrics.
> > > >> We just haven't done so out of backwards compatibility concerns.
> > > >> There are other broker metrics such as group coordinator, transaction
> > > >> state manager, and various socket server metrics already using
> > > >> KafkaMetrics that don't need specific Kafka metric features, so I don't
> > > >> see why we should refrain from using Kafka metrics on the broker unless
> > > >> there are real compatibility concerns or cases where implementation
> > > >> specifics could lead to confusion when comparing metrics using different
> > > >> implementations.
> > > >>
> > > >> In my opinion we should encourage people to use KafkaMetrics going
> > > >> forward on the broker as well, for three reasons:
> > > >> a) yammer metrics is long deprecated and no longer maintained
> > > >> b) yammer metrics are much less expressive
> > > >> c) we don't have a proper API to expose yammer metrics outside of JMX
> > > >> (MetricsReporter only exposes KafkaMetrics)
> > > >>
> > > >
> > >
> >
>

Re: [DISCUSS] KIP-714: Client metrics and observability

Posted by Jun Rao <ju...@confluent.io.INVALID>.

Hi, Magnus,

Thanks for the reply.

50. Sounds good.

51. I misunderstood the proposal in the KIP then. The proposal is to
define a set of common metric names that every client should implement. The
problem is that every client already has its own set of metrics with its
own names. I am not sure that we could easily agree upon a common set of
metrics that work with all clients. There are likely to be some metrics
that are client-specific. Translating between the common name and the
client-specific name is probably going to add more confusion. As mentioned
in the KIP, similar metrics from different clients could have subtle
semantic differences. Could we just let each client use its own set of
metric names?

Thanks,

Jun

On Thu, May 19, 2022 at 10:39 AM Magnus Edenhill <ma...@edenhill.se> wrote:

> On Wed, 18 May 2022 at 19:57, Jun Rao <ju...@confluent.io.invalid> wrote:
>
> > Hi, Magnus,
> >
>
> Hi Jun
>
>
> >
> > Thanks for the updated KIP. Just a couple of more comments.
> >
> > 50. To troubleshoot a particular client issue, I imagine that the client
> > needs to identify its client_instance_id. How does the client find this
> > out? Do we plan to include client_instance_id in the client log, expose it
> > as a metric or something else?
> >
>
> The KIP suggests that client implementations emit an informative log
> message with the assigned client-instance-id once it is retrieved (once per
> client instance lifetime).
> There's also a clientInstanceId() method that an application can use to
> retrieve the client instance id and emit it through whatever side channels
> make sense.
>
>
>
> > 51. The KIP lists a bunch of metrics that need to be collected at the
> > client side. However, it seems quite a few useful java client metrics like
> > the following are missing.
> >     buffer-total-bytes
> >     buffer-available-bytes
> >
>
> These are covered by client.producer.record.queue.bytes and
> client.producer.record.queue.max.bytes.
>
>
> >     bufferpool-wait-time
> >
>
> Missing, but somewhat implementation specific.
> If it was up to me we would add this later if there's a need.
>
>
>
> >     batch-size-avg
> >     batch-size-max
> >
>
> These are missing and would be suitably represented as a histogram. I'll
> add them.
>
>
>
> >     io-wait-ratio
> >     io-ratio
> >
>
> There's client.io.wait.time which should cover io-wait-ratio.
> We could add a client.io.time as well, now or in a later KIP.
>
> Thanks,
> Magnus
>
>
>
>
> >
> > Thanks,
> >
> > Jun
> >
> > On Mon, Apr 4, 2022 at 10:01 AM Jun Rao <ju...@confluent.io> wrote:
> >
> > > Hi, Xavier,
> > >
> > > Thanks for the reply.
> > >
> > > 28. It does seem that we have started using KafkaMetrics on the broker
> > > side. Then, my only concern is on the usage of Histogram in KafkaMetrics.
> > > Histogram in KafkaMetrics statically divides the value space into a fixed
> > > number of buckets and only returns values on the bucket boundary. So, the
> > > returned histogram value may never show up in a recorded value. Yammer
> > > Histogram, on the other hand, uses reservoir sampling. The reported value
> > > is always one of the recorded values. So, I am not sure that Histogram in
> > > KafkaMetrics is as good as Yammer Histogram. ClientMetricsPluginExportTime
> > > uses Histogram.
> > >
> > > Thanks,
> > >
> > > Jun
> > >
> > > On Thu, Mar 31, 2022 at 5:21 PM Xavier Léauté <xa...@confluent.io.invalid>
> > > wrote:
> > >
> > >> >
> > >> > 28. On the broker, we typically use Yammer metrics. Only for metrics
> > >> > that depend on Kafka metric features (e.g., quota), we use the Kafka
> > >> > metric. Yammer metrics have 4 types: gauge, meter, histogram and timer.
> > >> > meter calculates a rate, but also exposes an accumulated value.
> > >> >
> > >>
> > >> I don't see a good reason we should limit ourselves to Yammer metrics on
> > >> the broker. KafkaMetrics was written to replace Yammer metrics and is
> > >> used for all new components (clients, streams, connect, etc.)
> > >> My understanding is that the original goal was to retire Yammer metrics
> > >> in the broker in favor of KafkaMetrics.
> > >> We just haven't done so out of backwards compatibility concerns.
> > >> There are other broker metrics such as group coordinator, transaction
> > >> state manager, and various socket server metrics already using
> > >> KafkaMetrics that don't need specific Kafka metric features, so I don't
> > >> see why we should refrain from using Kafka metrics on the broker unless
> > >> there are real compatibility concerns or cases where implementation
> > >> specifics could lead to confusion when comparing metrics using different
> > >> implementations.
> > >>
> > >> In my opinion we should encourage people to use KafkaMetrics going
> > >> forward on the broker as well, for three reasons:
> > >> a) yammer metrics is long deprecated and no longer maintained
> > >> b) yammer metrics are much less expressive
> > >> c) we don't have a proper API to expose yammer metrics outside of JMX
> > >> (MetricsReporter only exposes KafkaMetrics)
> > >>
> > >
> >
>

Re: [DISCUSS] KIP-714: Client metrics and observability

Posted by Magnus Edenhill <ma...@edenhill.se>.

On Wed, 18 May 2022 at 19:57, Jun Rao <ju...@confluent.io.invalid> wrote:

> Hi, Magnus,
>

Hi Jun


>
> Thanks for the updated KIP. Just a couple of more comments.
>
> 50. To troubleshoot a particular client issue, I imagine that the client
> needs to identify its client_instance_id. How does the client find this
> out? Do we plan to include client_instance_id in the client log, expose it
> as a metric or something else?
>

The KIP suggests that client implementations emit an informative log message
with the assigned client-instance-id once it is retrieved (once per client
instance lifetime).
There's also a clientInstanceId() method that an application can use to
retrieve the client instance id and emit it through whatever side channels
make sense.
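
As a rough illustration of those two mechanisms (a one-time log line plus an
accessor), here is a minimal Python sketch. The SketchClient class, its
_on_instance_id_assigned() hook and the logger name are hypothetical and are
not part of any Kafka client; only the idea of logging once and exposing the
id through a method comes from the discussion above.

```python
import logging
import uuid
from typing import Optional

log = logging.getLogger("sketch.client")


class SketchClient:
    """Hypothetical client used for illustration; not a real Kafka API."""

    def __init__(self) -> None:
        self._instance_id: Optional[uuid.UUID] = None

    def _on_instance_id_assigned(self, instance_id: uuid.UUID) -> None:
        # Keep and log only the first assignment, i.e. once per client
        # instance lifetime; later calls are ignored.
        if self._instance_id is None:
            self._instance_id = instance_id
            log.info("client instance id assigned: %s", instance_id)

    def client_instance_id(self) -> Optional[uuid.UUID]:
        # Lets the application forward the id through its own side channel.
        return self._instance_id
```

An application could call client_instance_id() and attach the value to its
own logs or traces, which is the "side channel" use mentioned above.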



> 51. The KIP lists a bunch of metrics that need to be collected at the
> client side. However, it seems quite a few useful java client metrics like
> the following are missing.
>     buffer-total-bytes
>     buffer-available-bytes
>

These are covered by client.producer.record.queue.bytes and
client.producer.record.queue.max.bytes.
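
One way to read this mapping, sketched in Python: treat the Java producer's
total buffer size as the queue maximum and the used part of the buffer as the
current queue size. That correspondence is an assumption made for this
example, not something spelled out in the thread, and the helper name is
hypothetical.

```python
def to_standard_queue_metrics(buffer_total_bytes: int,
                              buffer_available_bytes: int) -> dict:
    """Derive the two standard queue metrics from the Java buffer gauges.

    Assumption: queue.bytes is the memory currently in use (total minus
    available) and queue.max.bytes is the configured total.
    """
    if not 0 <= buffer_available_bytes <= buffer_total_bytes:
        raise ValueError("available bytes must be between 0 and total bytes")
    return {
        "client.producer.record.queue.bytes":
            buffer_total_bytes - buffer_available_bytes,
        "client.producer.record.queue.max.bytes": buffer_total_bytes,
    }
```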


>     bufferpool-wait-time
>

Missing, but somewhat implementation specific.
If it was up to me we would add this later if there's a need.



>     batch-size-avg
>     batch-size-max
>

These are missing and would be suitably represented as a histogram. I'll
add them.
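
To see why a single histogram could stand in for both batch-size-avg and
batch-size-max, here is a small Python sketch of a fixed-bucket histogram
that also tracks sum, count and max. The bucket bounds and the class name are
made up for illustration; how the KIP encodes histograms is not described in
this thread.

```python
import bisect
from typing import List, Sequence


class BatchSizeHistogram:
    """Illustrative fixed-bucket histogram; bounds are arbitrary assumptions."""

    def __init__(self, upper_bounds: Sequence[int]) -> None:
        self.upper_bounds: List[int] = sorted(upper_bounds)
        # One extra bucket catches values above the last bound.
        self.counts: List[int] = [0] * (len(self.upper_bounds) + 1)
        self.total = 0
        self.count = 0
        self.max = 0

    def record(self, batch_size: int) -> None:
        # bisect_left finds the first bucket whose upper bound >= batch_size.
        self.counts[bisect.bisect_left(self.upper_bounds, batch_size)] += 1
        self.total += batch_size
        self.count += 1
        self.max = max(self.max, batch_size)

    def average(self) -> float:
        return self.total / self.count if self.count else 0.0
```

With sum and count carried alongside the buckets, the average is exact, and
the tracked max gives the equivalent of batch-size-max without a separate
metric.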



>     io-wait-ratio
>     io-ratio
>

There's client.io.wait.time which should cover io-wait-ratio.
We could add a client.io.time as well, now or in a later KIP.
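
For instance, assuming client.io.wait.time can be read as a cumulative total
of time spent waiting (an assumption; the thread does not say how it is
encoded), a value comparable to io-wait-ratio could be derived from two
successive samples. The function and parameter names below are hypothetical.

```python
def wait_ratio(prev_wait_s: float, curr_wait_s: float,
               prev_wall_s: float, curr_wall_s: float) -> float:
    """Fraction of the sampling interval spent waiting for I/O."""
    elapsed = curr_wall_s - prev_wall_s
    if elapsed <= 0:
        raise ValueError("samples must be taken at increasing wall-clock times")
    waited = curr_wait_s - prev_wait_s
    # Clamp to [0, 1] to absorb small clock or sampling skew.
    return min(1.0, max(0.0, waited / elapsed))
```

The same calculation applied to a client.io.time total would give the
counterpart of io-ratio.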

Thanks,
Magnus




>
> Thanks,
>
> Jun
>
> On Mon, Apr 4, 2022 at 10:01 AM Jun Rao <ju...@confluent.io> wrote:
>
> > Hi, Xavier,
> >
> > Thanks for the reply.
> >
> > 28. It does seem that we have started using KafkaMetrics on the broker
> > side. Then, my only concern is on the usage of Histogram in KafkaMetrics.
> > Histogram in KafkaMetrics statically divides the value space into a fixed
> > number of buckets and only returns values on the bucket boundary. So, the
> > returned histogram value may never show up in a recorded value. Yammer
> > Histogram, on the other hand, uses reservoir sampling. The reported value
> > is always one of the recorded values. So, I am not sure that Histogram in
> > KafkaMetrics is as good as Yammer Histogram. ClientMetricsPluginExportTime
> > uses Histogram.
> >
> > Thanks,
> >
> > Jun
> >
> > On Thu, Mar 31, 2022 at 5:21 PM Xavier Léauté <xa...@confluent.io.invalid>
> > wrote:
> >
> >> >
> >> > 28. On the broker, we typically use Yammer metrics. Only for metrics
> >> > that depend on Kafka metric features (e.g., quota), we use the Kafka
> >> > metric. Yammer metrics have 4 types: gauge, meter, histogram and timer.
> >> > meter calculates a rate, but also exposes an accumulated value.
> >> >
> >>
> >> I don't see a good reason we should limit ourselves to Yammer metrics on
> >> the broker. KafkaMetrics was written to replace Yammer metrics and is
> >> used for all new components (clients, streams, connect, etc.)
> >> My understanding is that the original goal was to retire Yammer metrics
> >> in the broker in favor of KafkaMetrics.
> >> We just haven't done so out of backwards compatibility concerns.
> >> There are other broker metrics such as group coordinator, transaction
> >> state manager, and various socket server metrics already using
> >> KafkaMetrics that don't need specific Kafka metric features, so I don't
> >> see why we should refrain from using Kafka metrics on the broker unless
> >> there are real compatibility concerns or cases where implementation
> >> specifics could lead to confusion when comparing metrics using different
> >> implementations.
> >>
> >> In my opinion we should encourage people to use KafkaMetrics going
> >> forward on the broker as well, for three reasons:
> >> a) yammer metrics is long deprecated and no longer maintained
> >> b) yammer metrics are much less expressive
> >> c) we don't have a proper API to expose yammer metrics outside of JMX
> >> (MetricsReporter only exposes KafkaMetrics)
> >>
> >
>