Posted to dev@kafka.apache.org by Magnus Edenhill <ma...@edenhill.se> on 2021/06/02 12:45:45 UTC

[DISCUSS] KIP-714: Client metrics and observability

Hey all,

I'm proposing KIP-714 to add remote Client metrics and observability.
This functionality will allow centralized monitoring and troubleshooting of
clients and their internals.

Please see
https://cwiki.apache.org/confluence/display/KAFKA/KIP-714%3A+Client+metrics+and+observability

Looking forward to your feedback!

Regards,
Magnus

Re: [DISCUSS] KIP-714: Client metrics and observability

Posted by Travis Bischel <tr...@gmail.com>.

Apologies for this duplicate reply, I did not notice the success confirmation on the first submission.

On 2021/06/14 04:52:11, Travis Bischel <tr...@gmail.com> wrote: 
> Hi! I have a few thoughts on this KIP. First, I'd like to thank you for your work
> and writeup, it's clear that a lot of thought went into this and it's very thorough!
> However, I'm not convinced it's the right approach from a fundamental level.
> 
> Fundamentally, this KIP seems like somewhat of a solution to an organizational
> problem. Metrics are organizational concerns, not Kafka operator concerns.
> Clients should make it easy to plug in metrics (this is the approach I take in
> my own client), and organizations should have processes such that all clients
> gather and ship metrics how that organization desires. If an organization is
> set up correctly, there is no reason for metrics to be forwarded through Kafka.
> This feels like a solution to an organization not properly setting up how
> processes ship metrics, and in some ways, it's an overbroad solution, and in
> other ways, it doesn't cover the entire problem.
> 
> From the perspective of Kafka operators, it is easy to see that this KIP is
> nice in that it just dictates what clients should support for metrics and that
> the metrics should ship through Kafka. But, from the perspective of an
> observability team, this workflow is basically hijacking the standard flow that
> organizations may have. I would rather have applications collect metrics and
> ship them the same way every other application does. I'd rather not have to
> configure additional plugins within Kafka to take metrics and forward them.
> 
> More importantly, this KIP prescribes cardinality problems, requires that to
> officially support the KIP a client must support all relevant metrics within
> the KIP, and requires that a client cannot support other metrics unless those
> other metrics also go through a KIP process. It is difficult to imagine all of
> these metrics being relevant to every organization, and there is no way for an
> organization to filter what is relevant within the client. Instead, the
> filtering is pushed downwards, meaning more network IO and more CPU costs to
> filter what is irrelevant and aggregate what needs to be aggregated, and more
> time for an organization to set up whatever it is that will be doing this
> filtering and aggregating. Contrast this with a client that enables hooking in
> to capture numbers that are relevant within an org itself: the org can gather
> what they want, ship only what they want, and ship directly to the
> observability system they have already set up. As an aside, it may also be
> wise to avoid shipping metrics through Kafka about client interaction with
> Kafka, because if Kafka is having problems, then orgs lose insight into those
> problems. This would be like statuspage using itself for status on its own
> systems.
> 
> Another downside is that by dictating the important metrics, this KIP
> has two choices: try to choose what is important to every org, and inevitably
> leave out something important to somebody else, or just add everything and let
> the orgs filter. This KIP mostly looks to go with the latter approach, meaning
> orgs will be shipping & filtering. With hooks, an org would be able to gather
> exactly what they want.
> 
> As well, I expect that org applications have metrics on the state of the
> applications outside of the Kafka client. Applications are already sending
> non-Kafka-client related metrics outbound to observability systems. If a Kafka
> client provided hooks, then users could just gather the additional relevant
> Kafka client metrics and ship those metrics the same way they do all of their
> other metrics. It feels a bit odd for a Kafka client to have its own separate
> way of forwarding metrics. Another benefit of hooks in clients is that
> organizations do not _have_ to set up additional plugins to forward metrics
> from Kafka. Hooks avoid extra organizational work.
> 
> The option that the KIP provides for users of clients to opt out of metrics may
> avoid some of the above issues (by just disabling things at the user level),
> but that's not really great from the perspective of client authors, because the
> existence of this KIP forces authors to either just not implement the KIP, or
> increase complexity within the KIP. Further, from an operator perspective, if I
> would prefer clients to ship metrics through the systems they already have in
> place, now I have to expect that anything that uses librdkafka or the official
> Java client will be shipping me metrics that I have to deal with (since the KIP
> is default enabled).
> 
> Lastly, I'm a little wary that this KIP may stem from a product goal of
> Confluent: since most everything uses librdkafka or the Java client, then by
> defaulting clients sending metrics, Confluent gets an easy way to provide
> metric panels for a nice cloud UI. If any client does not want to support these
> metrics, and then a user wonders why these hypothetical panels have no metrics,
> then Confluent can just reply "use a supported client".  Even if this
> (potentially unlikely) scenario is true, then hooks would still be a great
> alternative, because then Confluent could provide drop-in hooks for any client
> and the end result of easy-panels would be the same.
> 
> In summary,
> 
> - Metrics are more of an organizational concern, not specifically a broker
>   operator concern.
> 
> - The proposal seems to hijack how metrics are gathered within organizations
> 
> - I don't think KIPs should dictate which metrics should be gathered and which
>   should not. Clients instead should make it easy for users to gather anything
>   they could be interested in, and ignore anything they are not.
> 
> - I think hooks are more extensible, more exact, and fit better into
>   organizational workflows.
> 
> On 2021/06/02 12:45:45, Magnus Edenhill <ma...@edenhill.se> wrote: 
> > Hey all,
> > 
> > I'm proposing KIP-714 to add remote Client metrics and observability.
> > This functionality will allow centralized monitoring and troubleshooting of
> > clients and their internals.
> > 
> > Please see
> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-714%3A+Client+metrics+and+observability
> > 
> > Looking forward to your feedback!
> > 
> > Regards,
> > Magnus
> > 
>

Re: [DISCUSS] KIP-714: Client metrics and observability

Posted by Travis Bischel <tr...@gmail.com>.

Hi! I have a few thoughts on this KIP. First, I'd like to thank you for your work
and writeup, it's clear that a lot of thought went into this and it's very thorough!
However, I'm not convinced it's the right approach from a fundamental level.

Fundamentally, this KIP seems like somewhat of a solution to an organizational
problem. Metrics are organizational concerns, not Kafka operator concerns.
Clients should make it easy to plug in metrics (this is the approach I take in
my own client), and organizations should have processes such that all clients
gather and ship metrics how that organization desires. If an organization is
set up correctly, there is no reason for metrics to be forwarded through Kafka.
This feels like a solution to an organization not properly setting up how
processes ship metrics, and in some ways, it's an overbroad solution, and in
other ways, it doesn't cover the entire problem.

From the perspective of Kafka operators, it is easy to see that this KIP is
nice in that it just dictates what clients should support for metrics and that
the metrics should ship through Kafka. But, from the perspective of an
observability team, this workflow is basically hijacking the standard flow that
organizations may have. I would rather have applications collect metrics and
ship them the same way every other application does. I'd rather not have to
configure additional plugins within Kafka to take metrics and forward them.

More importantly, this KIP prescribes cardinality problems, requires that to
officially support the KIP a client must support all relevant metrics within
the KIP, and requires that a client cannot support other metrics unless those
other metrics also go through a KIP process. It is difficult to imagine all of
these metrics being relevant to every organization, and there is no way for an
organization to filter what is relevant within the client. Instead, the
filtering is pushed downwards, meaning more network IO and more CPU costs to
filter what is irrelevant and aggregate what needs to be aggregated, and more
time for an organization to set up whatever it is that will be doing this
filtering and aggregating. Contrast this with a client that enables hooking in
to capture numbers that are relevant within an org itself: the org can gather
what they want, ship only what they want, and ship directly to the
observability system they have already set up. As an aside, it may also be
wise to avoid shipping metrics through Kafka about client interaction with
Kafka, because if Kafka is having problems, then orgs lose insight into those
problems. This would be like statuspage using itself for status on its own
systems.

Another downside is that by dictating the important metrics, this KIP
has two choices: try to choose what is important to every org, and inevitably
leave out something important to somebody else, or just add everything and let
the orgs filter. This KIP mostly looks to go with the latter approach, meaning
orgs will be shipping & filtering. With hooks, an org would be able to gather
exactly what they want.

As well, I expect that org applications have metrics on the state of the
applications outside of the Kafka client. Applications are already sending
non-Kafka-client related metrics outbound to observability systems. If a Kafka
client provided hooks, then users could just gather the additional relevant
Kafka client metrics and ship those metrics the same way they do all of their
other metrics. It feels a bit odd for a Kafka client to have its own separate
way of forwarding metrics. Another benefit of hooks in clients is that
organizations do not _have_ to set up additional plugins to forward metrics
from Kafka. Hooks avoid extra organizational work.
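
To make the hook idea concrete, here is a minimal sketch of what "hooking in" could look like. It is not the API of any real Kafka client: `SketchClient`, the metric names, and the `(name, value, tags)` hook signature are all assumptions made up for illustration. The point is only that the application picks which numbers to keep and forwards them through whatever pipeline it already uses.

```python
from collections import Counter
from typing import Callable, Dict, List

# Hypothetical hook signature: (metric_name, value, tags) -> None.
Hook = Callable[[str, float, Dict[str, str]], None]


class SketchClient:
    """Stand-in for a Kafka client; not a real client API."""

    def __init__(self, hooks: List[Hook]) -> None:
        self._hooks = hooks

    def _emit(self, name: str, value: float, **tags: str) -> None:
        # Every registered hook sees every event and decides what to keep.
        for hook in self._hooks:
            hook(name, value, tags)

    def produce(self, topic: str, payload: bytes) -> None:
        # A real client would write to a broker here; the sketch only emits events.
        self._emit("produce.bytes", len(payload), topic=topic)
        self._emit("produce.records", 1, topic=topic)


class BytesPerTopic:
    """Org-side hook that keeps only the one metric this org cares about."""

    def __init__(self) -> None:
        self.totals: Counter = Counter()

    def __call__(self, name: str, value: float, tags: Dict[str, str]) -> None:
        if name == "produce.bytes":
            self.totals[tags["topic"]] += value


hook = BytesPerTopic()
client = SketchClient([hook])
client.produce("orders", b"hello")
client.produce("orders", b"world!")
print(dict(hook.totals))
```

Filtering happens inside the process, so nothing the org does not care about (here, the record count) ever leaves the client.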

The option that the KIP provides for users of clients to opt out of metrics may
avoid some of the above issues (by just disabling things at the user level),
but that's not really great from the perspective of client authors, because the
existence of this KIP forces authors to either just not implement the KIP, or
increase complexity within the KIP. Further, from an operator perspective, if I
would prefer clients to ship metrics through the systems they already have in
place, now I have to expect that anything that uses librdkafka or the official
Java client will be shipping me metrics that I have to deal with (since the KIP
is default enabled).

Lastly, I'm a little wary that this KIP may stem from a product goal of
Confluent: since most everything uses librdkafka or the Java client, then by
defaulting clients sending metrics, Confluent gets an easy way to provide
metric panels for a nice cloud UI. If any client does not want to support these
metrics, and then a user wonders why these hypothetical panels have no metrics,
then Confluent can just reply "use a supported client".  Even if this
(potentially unlikely) scenario is true, then hooks would still be a great
alternative, because then Confluent could provide drop-in hooks for any client
and the end result of easy-panels would be the same.

In summary,

- Metrics are more of an organizational concern, not specifically a broker
  operator concern.

- The proposal seems to hijack how metrics are gathered within organizations

- I don't think KIPs should dictate which metrics should be gathered and which
  should not. Clients instead should make it easy for users to gather anything
  they could be interested in, and ignore anything they are not.

- I think hooks are more extensible, more exact, and fit better into
  organizational workflows.

On 2021/06/02 12:45:45, Magnus Edenhill <ma...@edenhill.se> wrote: 
> Hey all,
> 
> I'm proposing KIP-714 to add remote Client metrics and observability.
> This functionality will allow centralized monitoring and troubleshooting of
> clients and their internals.
> 
> Please see
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-714%3A+Client+metrics+and+observability
> 
> Looking forward to your feedback!
> 
> Regards,
> Magnus
>

Re: [DISCUSS] KIP-714: Client metrics and observability

Posted by Colin McCabe <cm...@apache.org>.

On Thu, Jun 17, 2021, at 12:13, Ryanne Dolan wrote:
> Colin,
> 
> > lack of support for collecting client metrics
> 
> ...but Kafka is not a metrics collector. There are lots of things Kafka
> doesn't support. Should it also collect clients' logs for the same reasons?
> What other side channels should it proxy through brokers?
> 

Hi Ryanne,

Kafka already is a metrics collector. 

Take a look at KIP-511: "Collect and Expose Client's Name and Version in the Brokers," which aggregates metrics from various clients and re-exposes them as a broker metric. Or KIP-607: "Add Metrics to Kafka Streams to Report Properties of RocksDB," which aggregates metrics from the local RocksDB instances and re-exposes them. Or KIP-608: "Expose Kafka Metrics in Authorizer." Or lots of other KIPs.

This has been the direction we've been moving for a while. It's a direction motivated by our experiences in the field with users, who find it cumbersome to set up dedicated infra to monitor individual Kafka clients. Magnus, especially, has a huge amount of experience here.

>
> > He mentioned the fact that configuring client metrics usually involves
> > setting up a separate metrics collection infrastructure.
> 
> This is not changed with the KIP. It's just a matter of who owns that
> infra, which I don't think should matter to Apache Kafka.
> 

Magnus and I explained a few times the reasons why it does matter. Within most organizations, there are usually several teams using clients, which are separate from the team which maintains the Kafka cluster. The Kafka team has the Kafka experts, which makes it the best place to centralize collecting and analyzing Kafka metrics.

In a sense the whole concept of cloud computing is "just a matter of who owns infra." It is quite important to users.

> We already have MetricsReporter. I still don't see specific motivation
> beyond the "opt-out" part?
> 
> I think we need exceptional motivation for such a proposal.
> 

As I've said earlier, if you are happy with the current metrics setup, then you can continue using it -- nothing in this KIP means you have to change what you're doing.

best,
Colin


> On Thu, Jun 17, 2021, 1:43 PM Colin McCabe <cm...@apache.org> wrote:
> 
> > Hi Ryanne,
> >
> > These are not "arguments for observability in general" but descriptions of
> > specific issues that come up due to Kafka's lack of support for collecting
> > client metrics. He mentioned the fact that configuring client metrics
> > usually involves setting up a separate metrics collection infrastructure.
> > Even if this is easy and straightforward to do (which is not the case for
> > most organizations), it still requires reconfiguring and restarting the
> > application, which is disruptive. Correlating client metrics with server
> > metrics is also often hard. These issues are all mitigated by centralizing
> > metrics collection on the broker.
> >
> > best,
> > Colin
> >
> >
> > On Wed, Jun 16, 2021, at 19:03, Ryanne Dolan wrote:
> > > Magnus, I think these are arguments for observability in general, but not
> > > why Kafka should sit between a client and a metrics collector.
> > >
> > > Ryanne
> > >
> > > On Wed, Jun 16, 2021, 10:27 AM Magnus Edenhill <ma...@edenhill.se>
> > wrote:
> > >
> > > > Hi Ryanne,
> > > >
> > > > this proposal stems from a need to improve troubleshooting Kafka
> > issues.
> > > >
> > > > As it currently stands, when an application team is experiencing Kafka
> > > > service degradation, or the Kafka operator is seeing misbehaving clients,
> > > > there are plenty of steps that need to be taken before any client-side
> > > > metrics can be observed, if at all:
> > > >  - Is the application even collecting client metrics? If not, it needs to
> > > >    be reconfigured or implemented, and restarted; a restart may have
> > > >    business impact, and may also (temporarily?) remedy the problem without
> > > >    giving any further insight into what was wrong.
> > > >  - Are the desired metrics collected? Where are they stored? For how long?
> > > >    Is there enough correlating information to map it to cluster-side
> > > >    metrics and events? Does the application on-call know how to find the
> > > >    collected metrics?
> > > >  - Export and send these metrics to whoever knows how to interpret them.
> > > >    In what format? Are all relevant metadata fields provided?
> > > >
> > > > The KIP aims to solve all these obstacles by giving the Kafka operator
> > the
> > > > tools to collect this information.
> > > >
> > > > Regards,
> > > > Magnus
> > > >
> > > >
> > > > Den tis 15 juni 2021 kl 02:37 skrev Ryanne Dolan <
> > ryannedolan@gmail.com>:
> > > >
> > > > > Magnus, I think such a substantial change requires more motivation
> > than
> > > > is
> > > > > currently provided. As I read it, the motivation boils down to this:
> > you
> > > > > want your clients to phone-home unless they opt-out. As stated in the
> > > > KIP,
> > > > > "there are plenty of existing solutions [...] to send metrics [...]
> > to a
> > > > > collector", so the opt-out appears to be the only motivation. Am I
> > > > missing
> > > > > something?
> > > > >
> > > > > Ryanne
> > > > >
> > > > > On Wed, Jun 2, 2021 at 7:46 AM Magnus Edenhill <ma...@edenhill.se>
> > > > wrote:
> > > > >
> > > > > > Hey all,
> > > > > >
> > > > > > I'm proposing KIP-714 to add remote Client metrics and
> > observability.
> > > > > > This functionality will allow centralized monitoring and
> > > > troubleshooting
> > > > > of
> > > > > > clients and their internals.
> > > > > >
> > > > > > Please see
> > > > > >
> > > > > >
> > > > >
> > > >
> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-714%3A+Client+metrics+and+observability
> > > > > >
> > > > > > Looking forward to your feedback!
> > > > > >
> > > > > > Regards,
> > > > > > Magnus
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] KIP-714: Client metrics and observability

Posted by Ryanne Dolan <ry...@gmail.com>.

Colin,

> lack of support for collecting client metrics

...but Kafka is not a metrics collector. There are lots of things Kafka
doesn't support. Should it also collect clients' logs for the same reasons?
What other side channels should it proxy through brokers?

> He mentioned the fact that configuring client metrics usually involves
> setting up a separate metrics collection infrastructure.

This is not changed with the KIP. It's just a matter of who owns that
infra, which I don't think should matter to Apache Kafka.

We already have MetricsReporter. I still don't see specific motivation
beyond the "opt-out" part?

I think we need exceptional motivation for such a proposal.

On Thu, Jun 17, 2021, 1:43 PM Colin McCabe <cm...@apache.org> wrote:

> Hi Ryanne,
>
> These are not "arguments for observability in general" but descriptions of
> specific issues that come up due to Kafka's lack of support for collecting
> client metrics. He mentioned the fact that configuring client metrics
> usually involves setting up a separate metrics collection infrastructure.
> Even if this is easy and straightforward to do (which is not the case for
> most organizations), it still requires reconfiguring and restarting the
> application, which is disruptive. Correlating client metrics with server
> metrics is also often hard. These issues are all mitigated by centralizing
> metrics collection on the broker.
>
> best,
> Colin
>
>
> On Wed, Jun 16, 2021, at 19:03, Ryanne Dolan wrote:
> > Magnus, I think these are arguments for observability in general, but not
> > why Kafka should sit between a client and a metrics collector.
> >
> > Ryanne
> >
> > On Wed, Jun 16, 2021, 10:27 AM Magnus Edenhill <ma...@edenhill.se>
> wrote:
> >
> > > Hi Ryanne,
> > >
> > > this proposal stems from a need to improve troubleshooting Kafka
> issues.
> > >
> > > As it currently stands, when an application team is experiencing Kafka
> > > service degradation, or the Kafka operator is seeing misbehaving clients,
> > > there are plenty of steps that need to be taken before any client-side
> > > metrics can be observed, if at all:
> > >  - Is the application even collecting client metrics? If not, it needs to
> > >    be reconfigured or implemented, and restarted; a restart may have
> > >    business impact, and may also (temporarily?) remedy the problem without
> > >    giving any further insight into what was wrong.
> > >  - Are the desired metrics collected? Where are they stored? For how long?
> > >    Is there enough correlating information to map it to cluster-side
> > >    metrics and events? Does the application on-call know how to find the
> > >    collected metrics?
> > >  - Export and send these metrics to whoever knows how to interpret them.
> > >    In what format? Are all relevant metadata fields provided?
> > >
> > > The KIP aims to solve all these obstacles by giving the Kafka operator
> the
> > > tools to collect this information.
> > >
> > > Regards,
> > > Magnus
> > >
> > >
> > > Den tis 15 juni 2021 kl 02:37 skrev Ryanne Dolan <
> ryannedolan@gmail.com>:
> > >
> > > > Magnus, I think such a substantial change requires more motivation
> than
> > > is
> > > > currently provided. As I read it, the motivation boils down to this:
> you
> > > > want your clients to phone-home unless they opt-out. As stated in the
> > > KIP,
> > > > "there are plenty of existing solutions [...] to send metrics [...]
> to a
> > > > collector", so the opt-out appears to be the only motivation. Am I
> > > missing
> > > > something?
> > > >
> > > > Ryanne
> > > >
> > > > On Wed, Jun 2, 2021 at 7:46 AM Magnus Edenhill <ma...@edenhill.se>
> > > wrote:
> > > >
> > > > > Hey all,
> > > > >
> > > > > I'm proposing KIP-714 to add remote Client metrics and
> observability.
> > > > > This functionality will allow centralized monitoring and
> > > troubleshooting
> > > > of
> > > > > clients and their internals.
> > > > >
> > > > > Please see
> > > > >
> > > > >
> > > >
> > >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-714%3A+Client+metrics+and+observability
> > > > >
> > > > > Looking forward to your feedback!
> > > > >
> > > > > Regards,
> > > > > Magnus
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] KIP-714: Client metrics and observability

Posted by Colin McCabe <cm...@apache.org>.

Hi Ryanne,

These are not "arguments for observability in general" but descriptions of specific issues that come up due to Kafka's lack of support for collecting client metrics. He mentioned the fact that configuring client metrics usually involves setting up a separate metrics collection infrastructure. Even if this is easy and straightforward to do (which is not the case for most organizations), it still requires reconfiguring and restarting the application, which is disruptive. Correlating client metrics with server metrics is also often hard. These issues are all mitigated by centralizing metrics collection on the broker.

best,
Colin


On Wed, Jun 16, 2021, at 19:03, Ryanne Dolan wrote:
> Magnus, I think these are arguments for observability in general, but not
> why Kafka should sit between a client and a metrics collector.
> 
> Ryanne
> 
> On Wed, Jun 16, 2021, 10:27 AM Magnus Edenhill <ma...@edenhill.se> wrote:
> 
> > Hi Ryanne,
> >
> > this proposal stems from a need to improve troubleshooting Kafka issues.
> >
> > As it currently stands, when an application team is experiencing Kafka
> > service degradation, or the Kafka operator is seeing misbehaving clients,
> > there are plenty of steps that need to be taken before any client-side
> > metrics can be observed, if at all:
> >  - Is the application even collecting client metrics? If not, it needs to
> >    be reconfigured or implemented, and restarted; a restart may have
> >    business impact, and may also (temporarily?) remedy the problem without
> >    giving any further insight into what was wrong.
> >  - Are the desired metrics collected? Where are they stored? For how long?
> >    Is there enough correlating information to map it to cluster-side
> >    metrics and events? Does the application on-call know how to find the
> >    collected metrics?
> >  - Export and send these metrics to whoever knows how to interpret them.
> >    In what format? Are all relevant metadata fields provided?
> >
> > The KIP aims to solve all these obstacles by giving the Kafka operator the
> > tools to collect this information.
> >
> > Regards,
> > Magnus
> >
> >
> > Den tis 15 juni 2021 kl 02:37 skrev Ryanne Dolan <ry...@gmail.com>:
> >
> > > Magnus, I think such a substantial change requires more motivation than
> > is
> > > currently provided. As I read it, the motivation boils down to this: you
> > > want your clients to phone-home unless they opt-out. As stated in the
> > KIP,
> > > "there are plenty of existing solutions [...] to send metrics [...] to a
> > > collector", so the opt-out appears to be the only motivation. Am I
> > missing
> > > something?
> > >
> > > Ryanne
> > >
> > > On Wed, Jun 2, 2021 at 7:46 AM Magnus Edenhill <ma...@edenhill.se>
> > wrote:
> > >
> > > > Hey all,
> > > >
> > > > I'm proposing KIP-714 to add remote Client metrics and observability.
> > > > This functionality will allow centralized monitoring and
> > troubleshooting
> > > of
> > > > clients and their internals.
> > > >
> > > > Please see
> > > >
> > > >
> > >
> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-714%3A+Client+metrics+and+observability
> > > >
> > > > Looking forward to your feedback!
> > > >
> > > > Regards,
> > > > Magnus
> > > >
> > >
> >
>

Re: [DISCUSS] KIP-714: Client metrics and observability

Posted by Ryanne Dolan <ry...@gmail.com>.

Magnus, I think these are arguments for observability in general, but not
why Kafka should sit between a client and a metrics collector.

Ryanne

On Wed, Jun 16, 2021, 10:27 AM Magnus Edenhill <ma...@edenhill.se> wrote:

> Hi Ryanne,
>
> this proposal stems from a need to improve troubleshooting Kafka issues.
>
> As it currently stands, when an application team is experiencing Kafka
> service degradation, or the Kafka operator is seeing misbehaving clients,
> there are plenty of steps that need to be taken before any client-side
> metrics can be observed, if at all:
>  - Is the application even collecting client metrics? If not, it needs to
>    be reconfigured or implemented, and restarted; a restart may have
>    business impact, and may also (temporarily?) remedy the problem without
>    giving any further insight into what was wrong.
>  - Are the desired metrics collected? Where are they stored? For how long?
>    Is there enough correlating information to map it to cluster-side
>    metrics and events? Does the application on-call know how to find the
>    collected metrics?
>  - Export and send these metrics to whoever knows how to interpret them.
>    In what format? Are all relevant metadata fields provided?
>
> The KIP aims to solve all these obstacles by giving the Kafka operator the
> tools to collect this information.
>
> Regards,
> Magnus
>
>
> Den tis 15 juni 2021 kl 02:37 skrev Ryanne Dolan <ry...@gmail.com>:
>
> > Magnus, I think such a substantial change requires more motivation than
> is
> > currently provided. As I read it, the motivation boils down to this: you
> > want your clients to phone-home unless they opt-out. As stated in the
> KIP,
> > "there are plenty of existing solutions [...] to send metrics [...] to a
> > collector", so the opt-out appears to be the only motivation. Am I
> missing
> > something?
> >
> > Ryanne
> >
> > On Wed, Jun 2, 2021 at 7:46 AM Magnus Edenhill <ma...@edenhill.se>
> wrote:
> >
> > > Hey all,
> > >
> > > I'm proposing KIP-714 to add remote Client metrics and observability.
> > > This functionality will allow centralized monitoring and
> troubleshooting
> > of
> > > clients and their internals.
> > >
> > > Please see
> > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-714%3A+Client+metrics+and+observability
> > >
> > > Looking forward to your feedback!
> > >
> > > Regards,
> > > Magnus
> > >
> >
>

Re: [DISCUSS] KIP-714: Client metrics and observability

Posted by Feng Min <fm...@confluent.io.INVALID>.

On Wed, Jul 21, 2021 at 6:17 PM Colin McCabe <cm...@apache.org> wrote:

> On Tue, Jun 29, 2021, at 07:22, Magnus Edenhill wrote:
> > Den tors 17 juni 2021 kl 00:52 skrev Colin McCabe <cm...@apache.org>:
> > > A few critiques:
> > >
> > > - As I wrote above, I think this could benefit a lot from being split
> > > into several RPCs. A registration RPC, a report RPC, and an unregister
> > > RPC seem like logical choices.
> > >
> >
> > Responded to this in your previous mail, but in short I think a single
> > request is sufficient and keeps the implementation complexity / state
> > down.
> >
>
> Hi Magnus,
>
> I still suspect that trying to do everything with a single RPC is more
> complex than using multiple RPCs.
>
> Can you go into more detail about how the client learns what metrics it
> should send? This was the purpose of the "registration" step in my scheme
> above.
>
> It seems quite awkward to combine an RPC for reporting metrics with an
> RPC for finding out what metrics are configured to be reported. For
> example, how would you build a tool to check what metrics are configured to
> be reported? Does the tool have to report fake metrics, just because
> there's no other way to get back that information? Seems wrong. (It would
> be a bit like combining createTopics and listTopics for "simplicity")
>

+1 on separate RPCs for metric discovery and metric reporting. I actually
think it keeps complexity/state down compared with a single RPC.
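
As a rough illustration of the split being discussed, the sketch below models registration, reporting, and unregistration as three separate calls on an in-memory stand-in for a broker. None of this is the KIP's wire protocol: the class, method names, and metric names are hypothetical, and a real design would involve request/response schemas, throttling, and authorization that are left out here.

```python
import uuid
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class SketchBroker:
    """In-memory stand-in for a broker; all names here are hypothetical."""

    configured: List[str]
    sessions: Dict[uuid.UUID, List[str]] = field(default_factory=dict)
    received: Dict[uuid.UUID, Dict[str, float]] = field(default_factory=dict)

    def register(self) -> Tuple[uuid.UUID, List[str]]:
        """Registration call: hands back an instance id and the metrics to report."""
        instance_id = uuid.uuid4()
        self.sessions[instance_id] = list(self.configured)
        return instance_id, list(self.configured)

    def report(self, instance_id: uuid.UUID, values: Dict[str, float]) -> int:
        """Report call: stores only metrics the session was asked for."""
        if instance_id not in self.sessions:
            raise KeyError("unknown client instance")
        wanted = set(self.sessions[instance_id])
        accepted = {k: v for k, v in values.items() if k in wanted}
        self.received.setdefault(instance_id, {}).update(accepted)
        return len(accepted)

    def unregister(self, instance_id: uuid.UUID) -> None:
        """Unregister call: drops the session; later reports are rejected."""
        self.sessions.pop(instance_id, None)


broker = SketchBroker(configured=["request.latency"])
client_id, wanted = broker.register()
print(wanted)  # discovery needs no fake metrics
print(broker.report(client_id, {"request.latency": 12.5, "other": 1.0}))
broker.unregister(client_id)
```

With the calls separated, a tool that only wants to know what is configured can stop after the first call, which is the case the createTopics/listTopics comparison is pointing at.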


>
> > > - I don't think the client should be able to choose its own UUID. This
> > > adds complexity and introduces a chance that clients will choose an ID
> > > that is not unique. We already have an ID that the client itself
> > > supplies (clientID) so there is no need to introduce another such ID.
> > >
> >
> > The CLIENT_INSTANCE_ID (which is a combination of the client.id and a
> > UUID) is actually generated by the receiving broker on first contact.
> > The need for a new unique semi-random id is outlined in the KIP, but in
> > short: the client.id is not unique, and we need something unique that
> > is still prefix-matchable to the client.id so that we can add
> > subscriptions either using prefix-matching of just the client.id (which
> > may match one or more client instances), or exact matching, which will
> > match one specific client instance.
>
> Hmm... the client id is already sent in every RPC as part of the header.
> It's not necessary to send it again as part of one of the other RPC fields,
> right?
>
> More generally, why does the client instance ID need to be
> prefix-matchable? That seems like an implementation detail of the metrics
> collection system used on the broker side. Maybe someone wants to group by
> things other than client IDs -- perhaps client versions, for instance. By
> the same argument, we should put the client version string in the client
> instance ID, since someone might want to group by that. Or maybe we should
> include the hostname, and the IP, and, and, and.... You see the issue here.
> I think we shouldn't get involved in this kind of decision -- if we just
> pass a UUID, the broker-side software can group it or prefix it however it
> wants internally.
>
> > > - In general the schema seems to have a bad case of string-itis. UUID,
> > > content type, and requested metrics are all strings. Since these
> messages
> > > will be sent very frequently, it's quite costly to use strings for all
> > > these things. We have a type for UUID, which uses 16 bytes -- let's use
> > > that type for client instance ID, rather than a string which will be
> much
> > > larger. Also, since we already send clientID in the message header,
> there
> > > is no need to include it again in the instance ID.
> > >
> >
> > As explained above we need the client.id in the CLIENT_INSTANCE_ID. And
> I
> > don't think the overhead of this one string per request is going to be
> much
> > of an issue,
> > typical metric push intervals are probably in the >60s range.
> > If this becomes a problem we could use a per-connection identifier that
> the
> > broker translates to the client instance id before pushing metrics
> upwards
> > in the system.
> >
>
> This is actually an interesting design question -- why not use a
> per-TCP-connection identifier, rather than a per-client-instance
> identifier? If we are grouping by other things anyway (clientID, principal,
> etc.) on the server side, do we need to maintain a per-process identifier
> rather than a per-connection one?
>
> >
> > > - I think it would also be nice to have an enum or something for
> > > AcceptedContentTypes, RequestedMetrics, etc. We know that new
> additions to
> > > these categories will require KIPs, so it should be straightforward
> for the
> > > project to just have an enum that allows us to communicate these as
> ints.
> > >
> >
> > I'm thinking this might be overly constraining. The broker doesn't parse
> or
> > handle the received metrics data itself but just pushes it to the metrics
> > plugin, using an enum would require a KIP and broker upgrade if the
> metrics plugin
> > supports a newer version of OTLP.
> > It is probably better if we don't strictly control the metric format
> itself.
> >
>
> Unfortunately, we have to strictly control the metrics format, because
> otherwise clients can't implement it. I agree that we don't need to specify
> how the broker-side code works, since that is pluggable. It's also
> reasonable for the clients to have pluggable extensions as well, but this
> KIP won't be of much use if we don't at least define a basic set of metrics
> that most clients can understand how to send. The open source clients will
> not implement anything more than what is specified in the KIP (or at least
> the AK one won't...)
>
> >
> >
> > > - Can you talk about whether you are adding any new library
> dependencies
> > > to the Kafka client? It seems like you'd want to add opencensus /
> > > opentelemetry, if we are using that format here.
> > >
> >
> > Yeah, as we get closer to concensus more implementation specific details
> > will be added to the KIP.
> >
>
> I'm not sure if OpenCensus adds any value to this KIP, to be honest. Their
> primary focus was never on the format of the data being sent (in fact, the
> last time they checked, they left the format up to each OpenCensus
> implementation). That may have changed, but I think it still has limited
> usefulness to us, since we have our own format which we have to use anyway.
>
> >
> > >
> > > - Standard client resource labels: can we send these only in the
> > > registration RPC?
> > >
> >
> > These labels are part of the serialized OTLP data, which means it would
> > need to be unpacked and repacked (including compression) by the broker
> (or
> > metrics plugin), which I believe is more costly than sending them for
> each request.
> >
>
> Hmm, that data is about 10 fields, most of which are strings. It certainly
> adds a lot of overhead to resend it each time.
>
> I don't follow the comment about unpacking and repacking -- since the
> client registered with the broker it already knows all this information, so
> there's nothing to unpack or repack, except from memory. If it's more
> convenient to serialize it once rather than multiple times, that is an
> implementation detail of the broker side plugin, which we are not
> specifying here anyway.
>
> best,
> Colin
>
> > Thanks,
> > Magnus
> >
> > >
> > >
> >
>


-- 
Best,
Feng

Re: [DISCUSS] KIP-714: Client metrics and observability

Posted by Feng Min <fm...@confluent.io.INVALID>.

LGTM in terms of RPC separation and the new SubscriptionId to detect target
metric change on the server side.

On Tue, Sep 14, 2021 at 12:48 AM Magnus Edenhill <ma...@edenhill.se> wrote:

> Thanks for your feedback Colin, see my updated proposal below.
>
>
> Den tors 22 juli 2021 kl 03:17 skrev Colin McCabe <cm...@apache.org>:
>
> > On Tue, Jun 29, 2021, at 07:22, Magnus Edenhill wrote:
> > > Den tors 17 juni 2021 kl 00:52 skrev Colin McCabe <cmccabe@apache.org
> >:
> > > > A few critiques:
> > > >
> > > > - As I wrote above, I think this could benefit a lot by being split
> > into
> > > > several RPCs. A registration RPC, a report RPC, and an unregister RPC
> > seem
> > > > like logical choices.
> > > >
> > >
> > > Responded to this in your previous mail, but in short I think a single
> > > request is sufficient and keeps the implementation complexity / state
> > down.
> > >
> >
> > Hi Magnus,
> >
> > I still suspect that trying to do everything with a single RPC is more
> > complex than using multiple RPCs.
> >
> > Can you go into more detail about how the client learns what metrics it
> > should send? This was the purpose of the "registration" step in my scheme
> > above.
> >
> > It seems quite awkward to combine an RPC for reporting metrics with an
> > RPC for finding out what metrics are configured to be reported. For
> > example, how would you build a tool to check what metrics are configured
> to
> > be reported? Does the tool have to report fake metrics, just because
> > there's no other way to get back that information? Seems wrong. (It would
> > be a bit like combining createTopics and listTopics for "simplicity")
> >
>
>
>
> Splitting up the API into separate data and control requests makes sense.
> With a split we would have one API for querying the broker for configured
> metrics subscriptions,
> and one API for pushing the collected metrics to the broker.
>
> A mechanism is still needed to notify the client when the subscription is
> changed. I’ve added a SubscriptionId for this purpose (which could be a
> checksum of the configured metrics subscription). This id is sent to the
> client along with the metrics subscription, and the client sends it back to
> the broker when pushing metrics. If the broker finds that the pushed
> subscription id differs from what is expected, it returns an error to the
> client, which triggers the client to retrieve the new subscribed metrics and
> an updated subscription id. The generation of the SubscriptionId is opaque to
> the client.
>
>
> Something like this:
>
> // Get the configured metrics subscription.
> GetTelemetrySubscriptionsRequest {
>    // Null on first invocation to retrieve a newly generated instance id
>    // from the broker.
>    StrNull  ClientInstanceId
> }
>
> GetTelemetrySubscriptionsResponse {
>   Int16  ErrorCode
>   // Used for comparison in PushTelemetryRequest. Could be a crc32 of the
>   // subscription.
>   Int32  SubscriptionId
>   Str    ClientInstanceId
>   Int8   AcceptedContentTypes
>   Array  SubscribedMetrics[] {
>       String MetricsPrefix
>       Int32  IntervalMs
>   }
> }
>
>
> The ContentType is a bitmask in this new proposal, high bits indicate
> compression:
>   0x01   OTLPv08
>   0x10   GZIP
>   0x40   ZSTD
>   0x80   LZ4
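
As a rough illustration of the ContentType bitmask above, here is a minimal Python sketch. The four constant values are taken from the table; the helper name pick_content_type and the codec preference order (ZSTD, then LZ4, then GZIP) are assumptions made for the example and are not part of the proposal.

```python
# Bit values copied from the proposal's ContentType table.
OTLPV08 = 0x01
GZIP = 0x10
ZSTD = 0x40
LZ4 = 0x80


def pick_content_type(accepted, preferred=(ZSTD, LZ4, GZIP)):
    """Pick the format plus at most one compression codec the broker accepts.

    `accepted` is the AcceptedContentTypes bitmask from the broker.
    The preference order is an assumption for this sketch.
    """
    if not accepted & OTLPV08:
        raise ValueError("broker does not accept OTLPv08")
    chosen = OTLPV08
    for codec in preferred:
        if accepted & codec:
            chosen |= codec  # combine format bit and codec bit
            break
    return chosen
```

With AcceptedContentTypes = OTLPv08|ZSTD|LZ4 as in the example run, this sketch would select OTLPv08|ZSTD; with no codec bits set it falls back to uncompressed OTLPv08.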
>
>
> // Push metrics
> PushTelemetryRequest {
>    Str    ClientInstanceId
>    // The collected metrics in this request are based on the subscription
>    // with this Id.
>    Int32  SubscriptionId
>    Int8   ContentType       // E.g., OTLPv08|ZSTD
>    Bool   Terminating
>    Binary Metrics
> }
>
>
> PushTelemetryResponse {
>    Int32 ThrottleTime
>    Int16 ErrorCode
> }
>
>
> An example run:
>
> 1. Client instance starts, connects to broker.
> 2. > GetTelemetrySubscriptionsRequest{ ClientInstanceId=Null } // Requests
> an instance id and the subscribed metrics.
> 3. < GetTelemetrySubscriptionsResponse{
>       ErrorCode = 0,
>       SubscriptionId = 0x234adf34,
>       ClientInstanceId = f00d-feed-deff-ceff-ffff-…,
>       AcceptedContentTypes = OTLPv08|ZSTD|LZ4,
>       SubscribedMetrics[] = {
>          { "client.producer.tx.", 60000 },
>          { "client.memory.rss", 900000 },
>       }
>    }
> 4. Client updates its metrics subscription, next push to fire in 60
> seconds.
> 5. 60 seconds pass
> 6. > PushTelemetryRequest{
>        ClientInstanceId = f00d-feed-deff-ceff-ffff-….,
>        SubscriptionId = 0x234adf34,
>        ContentType = OTLPv08|ZSTD,
>        Terminating = False,
>        Metrics = …// zstd-compressed OTLPv08-protobuf-serialized metrics
>   }
> 7. < PushTelemetryResponse{ 0, NO_ERROR }
> 8. 60 seconds pass
> 9. > PushTelemetryRequest…
> …
> 56. The operator changes the configured metrics subscriptions (through
> Admin API).
> 57. > PushTelemetryRequest{ .. SubscriptionId = 0x234adf34 .. }
> 58. The subscriptionId no longer matches since the subscription has been
> updated, broker responds with an error:
> 59. < PushTelemetryResponse{ 0,   ERR_INVALID_SUBSCRIPTION_ID }
> 60. The error triggers the client to request the subscriptions again.
> 61. > GetTelemetrySubscriptionsRequest{..}
> 62. < GetTelemetrySubscriptionsResponse { .. SubscriptionId = 0x77772211,
> SubscribedMetrics[] = .. }
> 63. Client updates its subscription and continues to push metrics
> accordingly.
> …
>
>
> If the broker connection goes down or the connection is to be used for
> other purposes (e.g., blocking FetchRequests), the client will send
> PushTelemetryRequests to any other broker in the cluster, using the same
> ClientInstanceId and SubscriptionId as received in the latest
> GetTelemetrySubscriptionsResponse.
>
> While the subscriptionId may change during the lifetime of the client
> instance (when metric subscriptions are updated), the ClientInstanceId is
> only acquired once and must not change (as it is used to identify the
> unique client instance).
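
The SubscriptionId handshake in the example run can be sketched as a small in-memory simulation. This is only an illustration under stated assumptions: ToyBroker and ToyClient are hypothetical names, the numeric value of ERR_INVALID_SUBSCRIPTION_ID is a placeholder, and a crc32 over a canonical string is just one way to derive the id (the proposal only says it "could be" a crc32).

```python
import zlib

NO_ERROR = 0
ERR_INVALID_SUBSCRIPTION_ID = 1  # placeholder value, not a real Kafka error code


def subscription_id(subscribed):
    """Derive an id from a list of (metrics_prefix, interval_ms) pairs.

    zlib.crc32 returns an unsigned 32-bit value; a real Int32 field would
    need a signed representation. That detail is ignored in this sketch.
    """
    canonical = ";".join(f"{prefix}:{interval}" for prefix, interval in sorted(subscribed))
    return zlib.crc32(canonical.encode("utf-8"))


class ToyBroker:
    def __init__(self, subscribed):
        self.subscribed = list(subscribed)

    def get_subscriptions(self):
        # Stands in for GetTelemetrySubscriptionsResponse.
        return subscription_id(self.subscribed), list(self.subscribed)

    def push(self, sub_id, metrics):
        # Stands in for PushTelemetryRequest handling: reject stale ids.
        if sub_id != subscription_id(self.subscribed):
            return ERR_INVALID_SUBSCRIPTION_ID
        return NO_ERROR


class ToyClient:
    def __init__(self, broker):
        self.broker = broker
        self.sub_id, self.subscribed = broker.get_subscriptions()
        self.refetches = 0

    def push_once(self, metrics=b""):
        err = self.broker.push(self.sub_id, metrics)
        if err == ERR_INVALID_SUBSCRIPTION_ID:
            # The error triggers a re-fetch of the subscription and its id.
            self.sub_id, self.subscribed = self.broker.get_subscriptions()
            self.refetches += 1
        return err
```

Because the broker in this sketch recomputes the id from its current configuration on every push, it keeps no per-client state, which matches the intent that a client may push to any broker using the id from its latest GetTelemetrySubscriptionsResponse.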
>
>
> >
> > > > - I don't think the client should be able to choose its own UUID.
> This
> > > > adds complexity and introduces a chance that clients will choose an
> ID
> > that
> > > > is not unique. We already have an ID that the client itself supplies
> > > > (clientID) so there is no need to introduce another such ID.
> > > >
> > >
> > > The CLIENT_INSTANCE_ID (which is a combination of the client.id and a
> > UUID)
> > > is actually generated by the receiving broker on first contact.
> > > The need for a new unique semi-random id is outlined in the KIP, but in
> > > short; the client.id is not unique, and we need something unique that
> > still
> > > is prefix-matchable to the client.id so that we can add subscriptions
> > > either using prefix-matching of just the client.id (which may match
> one
> > or
> > > more client instances), and exact matching which will match one
> > specific
> > > client instance.
> >
> > Hmm... the client id is already sent in every RPC as part of the header.
> > It's not necessary to send it again as part of one of the other RPC
> fields,
> > right?
> >
> > More generally, why does the client instance ID need to be
> > prefix-matchable? That seems like an implementation detail of the metrics
> > collection system used on the broker side. Maybe someone wants to group
> by
> > things other than client IDs -- perhaps client versions, for instance. By
> > the same argument, we should put the client version string in the client
> > instance ID, since someone might want to group by that. Or maybe we
> should
> > include the hostname, and the IP, and, and, and.... You see the issue
> here.
> > I think we shouldn't get involved in this kind of decision -- if we just
> > pass a UUID, the broker-side software can group it or prefix it however
> it
> > wants internally.
> >
>
> Yes, I agree, other selectors will indeed be needed eventually.
> I'll remove the client.id from the CLIENT_INSTANCE_ID and only keep the
> UUID part.
> My assumption is that the set of subscribed metrics prefixes throughout a
> cluster will be quite small initially, so maybe we could leave fine-grained
> selectors out of this proposal
> and address it later when an actual need arises (maybe ACLs can be used for
> selector matching).
> There is also no harm in a client having a metrics subscription that includes
> metrics it does not provide (e.g., consumer metrics for a producer, or vice
> versa): the client simply ignores any subscribed prefix that doesn't match a
> metric it can provide.
>
> What we do want though is the ability to single out a specific client instance
> to give it a more fine-grained subscription for troubleshooting, and
> we can do that with the current proposal with matching solely on the
> CLIENT_INSTANCE_ID.
> In other words; all clients will have the same standard metrics
> subscription, but specific client instances can have alternate
> subscriptions.
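
To make the two ideas above concrete (a standard subscription for everyone with per-instance overrides, and clients ignoring prefixes they cannot serve), here is a minimal Python sketch. The function names, the override map keyed by client instance id, and the concrete metric names are all hypothetical; only the prefix-matching idea comes from the proposal.

```python
def resolve_subscription(client_instance_id, default_prefixes, overrides):
    """Return the per-instance override if one exists, else the default.

    `overrides` maps a client instance id to a list of metric prefixes.
    """
    return overrides.get(client_instance_id, default_prefixes)


def metrics_to_push(available, subscribed_prefixes):
    """Keep only the metrics whose name starts with a subscribed prefix.

    Subscribed prefixes that match nothing the client provides are
    silently ignored, as described in the text.
    """
    return sorted(
        name
        for name in available
        if any(name.startswith(prefix) for prefix in subscribed_prefixes)
    )
```

In this sketch a producer that receives a subscription containing a consumer prefix simply ends up with no matching metrics for that prefix, so nothing extra is sent.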
>
>
> > > > - In general the schema seems to have a bad case of string-itis. UUID,
> > > > content type, and requested metrics are all strings. Since these
> > messages
> > > > will be sent very frequently, it's quite costly to use strings for
> all
> > > > these things. We have a type for UUID, which uses 16 bytes -- let's
> use
> > > > that type for client instance ID, rather than a string which will be
> > much
> > > > larger. Also, since we already send clientID in the message header,
> > there
> > > > is no need to include it again in the instance ID.
> > > >
> > >
> > > As explained above we need the client.id in the CLIENT_INSTANCE_ID.
> And
> > I
> > > don't think the overhead of this one string per request is going to be
> > much
> > > of an issue,
> > > typical metric push intervals are probably in the >60s range.
> > > If this becomes a problem we could use a per-connection identifier that
> > the
> > > broker translates to the client instance id before pushing metrics
> > upwards
> > > in the system.
> > >
> >
> > This is actually an interesting design question -- why not use a
> > per-TCP-connection identifier, rather than a per-client-instance
> > identifier? If we are grouping by other things anyway (clientID,
> principal,
> > etc.) on the server side, do we need to maintain a per-process identifier
> > rather than a per-connection one?
> >
>
>
> The metrics collector/tsdb/whatever will need to identify a single client
> instance, regardless of which broker received the metrics.
> The chapter on CLIENT_INSTANCE_ID motivates why we need a unique
> identifier, basically because none of clientID, principal, remote
> address:port, etc., can be used to identify a single client instance.
>
>
>
>
> > >
> > > > - I think it would also be nice to have an enum or something for
> > > > AcceptedContentTypes, RequestedMetrics, etc. We know that new
> > additions to
> > > > these categories will require KIPs, so it should be straightforward
> > for the
> > > > project to just have an enum that allows us to communicate these as
> > ints.
> > > >
> > >
> > > I'm thinking this might be overly constraining. The broker doesn't
> parse
> > or
> > > handle the received metrics data itself but just pushes it to the
> metrics
> > > plugin, using an enum would require a KIP and broker upgrade if the
> > metrics plugin
> > > supports a newer version of OTLP.
> > > It is probably better if we don't strictly control the metric format
> > itself.
> > >
> >
> > Unfortunately, we have to strictly control the metrics format, because
> > otherwise clients can't implement it. I agree that we don't need to
> specify
> > how the broker-side code works, since that is pluggable. It's also
> > reasonable for the clients to have pluggable extensions as well, but this
> > KIP won't be of much use if we don't at least define a basic set of
> metrics
> > that most clients can understand how to send. The open source clients
> will
> > not implement anything more than what is specified in the KIP (or at
> least
> > the AK one won't...)
> >
>
> Makes sense, in the updated proposal above I changed ContentType to a
> bitmask.
>
>
> >
> > >
> > >
> > > > - Can you talk about whether you are adding any new library
> > dependencies
> > > > to the Kafka client? It seems like you'd want to add opencensus /
> > > > opentelemetry, if we are using that format here.
> > > >
> > >
> > > Yeah, as we get closer to concensus more implementation specific
> details
> > > will be added to the KIP.
> > >
> >
> > I'm not sure if OpenCensus adds any value to this KIP, to be honest.
> Their
> > primary focus was never on the format of the data being sent (in fact,
> the
> > last time they checked, they left the format up to each OpenCensus
> > implementation). That may have changed, but I think it still has limited
> > usefulness to us, since we have our own format which we have to use
> anyway.
> >
>
> Oh, I meant concensus as in kafka-dev agreement :)
>
> Feng is looking into the implementation details of the Java client and will
> update the KIP with regards to dependencies.
>
>
>
> >
> > >
> > > >
> > > > - Standard client resource labels: can we send these only in the
> > > > registration RPC?
> > > >
> > >
> > > These labels are part of the serialized OTLP data, which means it would
> > > need to be unpacked and repacked (including compression) by the broker
> > (or
> > > metrics plugin), which I believe is more costly than sending them for
> > each request.
> > >
> >
> > Hmm, that data is about 10 fields, most of which are strings. It
> certainly
> > adds a lot of overhead to resend it each time.
> >
> > I don't follow the comment about unpacking and repacking -- since the
> > client registered with the broker it already knows all this information,
> so
> > there's nothing to unpack or repack, except from memory. If it's more
> > convenient to serialize it once rather than multiple times, that is an
> > implementation detail of the broker side plugin, which we are not
> > specifying here anyway.
> >
>
> The current proposal is pretty much stateless on the broker: it does not
> need to hold any state for a client (instance), and no state
> synchronization is needed
> between brokers in the cluster, which allows a client to seamlessly send
> metrics to any broker it wants and keeps the API overhead down (no need to
> re-register when
> switching brokers for instance).
>
> We could remove the labels that are already available to the broker on a
> per-request basis or that it already maintains state for:
>  - client_id
>  - client_instance_id
>  - client_software_*
>
> Leaving the following to still be included:
>  - group_id
>  - group_instance_id
>  - transactional_id
>   etc..
>
> What do you think of that?
>
>
> Thanks,
> Magnus
>
>
>
> >
> > best,
> > Colin
> >
> > > Thanks,
> > > Magnus
> > >
> > > >
> > > >
> > >
> >
>


-- 
Best,
Feng

Re: [DISCUSS] KIP-714: Client metrics and observability

Posted by Tom Bentley <tb...@redhat.com>.

Hi Ashenafi,

You'll need to unsubscribe from the dev mailing list by sending an email to
dev-unsubscribe@kafka.apache.org. No one else can do this for you.

Kind regards,

Tom

On Tue, 8 Mar 2022 at 04:40, Ashenafi Marcos <as...@gmail.com> wrote:

> Hi,
> Can you please take out my email ID so that I will not receive
> any mail from you.
> Thank you
>
> On Tue, Oct 19, 2021 at 1:30 PM Mickael Maison <mi...@gmail.com>
> wrote:
>
> > Hi Magnus,
> >
> > Thanks for the proposal.
> >
> > 1. Looking at the protocol section, isn't "ClientInstanceId" expected
> > to be a field in GetTelemetrySubscriptionsResponseV0? Otherwise, how
> > does a client retrieve this value?
> >
> > 2. In the client API section, you mention a new method
> > "clientInstanceId()". Can you clarify which interfaces are affected?
> > Is it only Consumer and Producer?
> >
> > 3. I'm a bit concerned this is enabled by default. Even if the data
> > collected is supposed to be not sensitive, I think this can be
> > problematic in some environments. Also users don't seem to have the
> > choice to only expose some metrics. Knowing how much data transits
> > through some applications can be considered critical.
> >
> > 4. As a user, how do you know if your application is actively sending
> > metrics? Are there new metrics exposing what's going on, like how much
> > data is being sent?
> >
> > 5. If all metrics are enabled on a regular Consumer or Producer, do
> > you have an idea how much throughput this would use?
> >
> > Thanks
> >
> > On Tue, Oct 19, 2021 at 5:06 PM Magnus Edenhill <ma...@edenhill.se>
> > wrote:
> > >
> > > Den tis 19 okt. 2021 kl 13:22 skrev Tom Bentley <tb...@redhat.com>:
> > >
> > > > Hi Magnus,
> > > >
> > > > I reviewed the KIP since you called the vote (sorry for not reviewing
> > when
> > > > you announced your intention to call the vote). I have a few
> questions
> > on
> > > > some of the details.
> > > >
> > > > 1. There's no Javadoc on ClientTelemetryPayload.data(), so I don't
> know
> > > > whether the payload is exposed through this method as compressed or
> > not.
> > > > Later on you say "Decompression of the payloads will be handled by
> the
> > > > broker metrics plugin, the broker should expose a suitable
> > decompression
> > > > API to the metrics plugin for this purpose.", which suggests it's the
> > > > compressed data in the buffer, but then we don't know which codec was
> > used,
> > > > nor the API via which the plugin should decompress it if required for
> > > > forwarding to the ultimate metrics store. Should the
> > ClientTelemetryPayload
> > > > expose a method to get the compression and a decompressor?
> > > >
> > >
> > > Good point, updated.
> > >
> > >
> > >
> > > > 2. The client-side API is expressed as StringOrError
> > > > ClientInstance::ClientInstanceId(int timeout_ms). I understand that
> > you're
> > > > thinking about the librdkafka implementation, but it would be good to
> > show
> > > > the API as it would appear on the Apache Kafka clients.
> > > >
> > >
> > > This was meant as pseudo-code, but I changed it to Java.
> > >
> > >
> > > > 3. "PushTelemetryRequest|Response - protocol request used by the
> > client to
> > > > send metrics to any broker it is connected to." To be clear, this
> means
> > > > that the client can choose any of the connected brokers and push to
> > just
> > > > one of them? What should a supporting client do if it gets an error
> > when
> > > > pushing metrics to a broker, retry sending to the same broker or try
> > > > pushing to another broker, or drop the metrics? Should supporting
> > clients
> > > > send successive requests to a single broker, or round robin, or is
> > that up
> > > > to the client author? I'm guessing the behaviour should be sticky to
> > > > support the rate limiting features, but I think it would be good for
> > client
> > > > authors if this section were explicit on the recommended behaviour.
> > > >
> > >
> > > You are right, I've updated the KIP to make this clearer.
> > >
> > >
> > > > 4. "Mapping the client instance id to an actual application instance
> > > > running on a (virtual) machine can be done by inspecting the metrics
> > > > resource labels, such as the client source address and source port,
> or
> > > > security principal, all of which are added by the receiving broker.
> > This
> > > > will allow the operator together with the user to identify the actual
> > > > application instance." Is this really always true? The source IP and
> > port
> > > > might be a loadbalancer/proxy in some setups. The principal, as
> already
> > > > mentioned in the KIP, might be shared between multiple applications.
> > So at
> > > > worst the organization running the clients might have to consult the
> > logs
> > > > of a set of client applications, right?
> > > >
> > >
> > > Yes, that's correct. There's no guaranteed mapping from
> > client_instance_id
> > > to
> > > an actual instance, that's why the KIP recommends client
> implementations
> > to
> > > log the client instance id
> > > upon retrieval, and also provide an API for the application to retrieve
> > the
> > > instance id programmatically
> > > if it has a better way of exposing it.
> > >
> > >
> > > > 5. "Tests indicate that a compression ratio up to 10x is possible for
> the
> > > > standard metrics." Client authors might appreciate your mentioning
> > which
> > > > compression codec got these results.
> > > >
> > >
> > > Good point. Updated.
> > >
> > >
> > > > 6. "Should the client send a push request prior to expiry of the
> > previously
> > > > calculated PushIntervalMs the broker will discard the metrics and
> > return a
> > > > PushTelemetryResponse with the ErrorCode set to RateLimited." Is this
> > > > RATE_LIMITED a new error code? It's not mentioned in the "New Error
> > Codes"
> > > > section.
> > > >
> > >
> > > That's a leftover, it should be using the standard ThrottleTime
> > mechanism.
> > > Fixed.
> > >
> > >
> > > > 7. In the section "Standard client resource labels" application_id is
> > > > described as Kafka Streams only, but the section of "Client
> > Identification"
> > > > talks about "application instance id as an optional future
> nice-to-have
> > > > that may be included as a metrics label if it has been set by the
> > user", so
> > > > I'm confused whether non-Kafka Streams clients should set an
> > application_id
> > > > or not.
> > > >
> > >
> > > I'll clarify this in the KIP, but basically we would need to add an `
> > > application.id` config
> > > property for non-streams clients for this purpose, and that's outside
> the
> > > scope of this KIP since we want to make it zero-conf:ish on the client
> > side.
> > >
> > >
> > > >
> > > > Kind regards,
> > > >
> > > > Tom
> > > >
> > >
> > > Thanks for the review,
> > > Magnus
> > >
> > >
> > >
> > > >
> > > > On Thu, Oct 7, 2021 at 5:26 PM Magnus Edenhill <ma...@edenhill.se>
> > wrote:
> > > >
> > > > > Hi all,
> > > > >
> > > > > I've updated the KIP following our recent discussions on the
> mailing
> > > > list:
> > > > >  - split the protocol in two, one for getting the metrics
> > subscriptions,
> > > > > and one for pushing the metrics.
> > > > >  - simplifications: initially only one supported metrics format, no
> > > > > client.id in the instance id, etc.
> > > > >  - made CLIENT_METRICS subscription configuration entries more
> > structured
> > > > >    and allowing better client matching selectors (not only on the
> > > > instance
> > > > > id, but also the other
> > > > >    client resource labels, such as client_software_name, etc.).
> > > > >
> > > > > Unless there are further comments I'll call the vote in a day or
> two.
> > > > >
> > > > > Regards,
> > > > > Magnus
> > > > >
> > > > >
> > > > >
> > > > > Den mån 4 okt. 2021 kl 20:57 skrev Magnus Edenhill <
> > magnus@edenhill.se>:
> > > > >
> > > > > > Hi Gwen,
> > > > > >
> > > > > > I'm finishing up the KIP based on the last couple of discussion
> > points
> > > > in
> > > > > > this thread
> > > > > > and will call the Vote later this week.
> > > > > >
> > > > > > Best,
> > > > > > Magnus
> > > > > >
> > > > > > Den lör 2 okt. 2021 kl 02:01 skrev Gwen Shapira
> > > > > <gwen@confluent.io.invalid
> > > > > > >:
> > > > > >
> > > > > >> Hey,
> > > > > >>
> > > > > >> I noticed that there was no discussion for the last 10 days,
> but I
> > > > > >> couldn't
> > > > > >> find the vote thread. Is there one that I'm missing?
> > > > > >>
> > > > > >> Gwen
> > > > > >>
> > > > > >> On Wed, Sep 22, 2021 at 4:58 AM Magnus Edenhill <
> > magnus@edenhill.se>
> > > > > >> wrote:
> > > > > >>
> > > > > >> > Den tis 21 sep. 2021 kl 06:58 skrev Colin McCabe <
> > > > cmccabe@apache.org
> > > > > >:
> > > > > >> >
> > > > > >> > > On Mon, Sep 20, 2021, at 17:35, Feng Min wrote:
> > > > > >> > > > Thanks Magnus & Colin for the discussion.
> > > > > >> > > >
> > > > > >> > > > Based on KIP-714's stateless design, Client can pretty
> much
> > use
> > > > > any
> > > > > >> > > > connection to any broker to send metrics. We are not
> > associating
> > > > > >> > > connection
> > > > > >> > > > with client metric state. Is my understanding correct? If
> > yes,
> > > > > how
> > > > > >> > about
> > > > > >> > > > the following two scenarios
> > > > > >> > > >
> > > > > >> > > > 1) One Client (Client-ID) registers two different client
> > > > instance
> > > > > id
> > > > > >> > via
> > > > > >> > > > separate registration. Is it permitted? If OK, how to
> > > > distinguish
> > > > > >> them
> > > > > >> > > from
> > > > > >> > > > the case 2 below.
> > > > > >> > > >
> > > > > >> > >
> > > > > >> > > Hi Feng,
> > > > > >> > >
> > > > > >> > > My understanding, which Magnus can clarify I guess, is that
> > you
> > > > > could
> > > > > >> > have
> > > > > >> > > something like two Producer instances running with the same
> > > > > client.id
> > > > > >> > > (perhaps because they're using the same config file, for
> > example).
> > > > > >> They
> > > > > >> > > could even be in the same process. But they would get
> separate
> > > > > UUIDs.
> > > > > >> > >
> > > > > >> > > I believe Magnus used the term client to mean "Producer or
> > > > > Consumer".
> > > > > >> So
> > > > > >> > > if you have both a Producer and a Consumer in your
> > application I
> > > > > would
> > > > > >> > > expect you'd get separate UUIDs for both. Again Magnus can
> > chime
> > > > in
> > > > > >> > here, I
> > > > > >> > > guess.
> > > > > >> > >
> > > > > >> >
> > > > > >> > That's correct.
> > > > > >> >
> > > > > >> >
> > > > > >> > >
> > > > > >> > > > 2) How about the client restarting? What's the
> expectation?
> > > > Should
> > > > > >> the
> > > > > >> > > > server expect the client to carry a persisted client
> > instance id
> > > > > or
> > > > > >> > > should
> > > > > >> > > > the client be treated as a new instance?
> > > > > >> > >
> > > > > >> > > The KIP doesn't describe any mechanism for persistence, so I
> > would
> > > > > >> assume
> > > > > >> > > that when you restart the client you get a new UUID. I agree
> > that
> > > > it
> > > > > >> > would
> > > > > >> > > be good to spell this out.
> > > > > >> > >
> > > > > >> > >
> > > > > >> > Right, it will not be persisted since a client instance can't
> be
> > > > > >> restarted.
> > > > > >> >
> > > > > >> > Will update the KIP to make this clearer.
> > > > > >> >
> > > > > >> > /Magnus
> > > > > >> >
> > > > > >>
> > > > > >>
> > > > > >> --
> > > > > >> Gwen Shapira
> > > > > >> Engineering Manager | Confluent
> > > > > >> 650.450.2760 | @gwenshap
> > > > > >> Follow us: Twitter | blog
> > > > > >>
> > > > > >
> > > > >
> > > >
> >
>

Re: [DISCUSS] KIP-714: Client metrics and observability

Posted by Ashenafi Marcos <as...@gmail.com>.

Hi,
Can you please take out my email ID so that I will not receive
any mail from you.
Thank you

On Tue, Oct 19, 2021 at 1:30 PM Mickael Maison <mi...@gmail.com>
wrote:

> Hi Magnus,
>
> Thanks for the proposal.
>
> 1. Looking at the protocol section, isn't "ClientInstanceId" expected
> to be a field in GetTelemetrySubscriptionsResponseV0? Otherwise, how
> does a client retrieve this value?
>
> 2. In the client API section, you mention a new method
> "clientInstanceId()". Can you clarify which interfaces are affected?
> Is it only Consumer and Producer?
>
> 3. I'm a bit concerned this is enabled by default. Even if the data
> collected is supposed to be not sensitive, I think this can be
> problematic in some environments. Also users don't seem to have the
> choice to only expose some metrics. Knowing how much data transits
> through some applications can be considered critical.
>
> 4. As a user, how do you know if your application is actively sending
> metrics? Are there new metrics exposing what's going on, like how much
> data is being sent?
>
> 5. If all metrics are enabled on a regular Consumer or Producer, do
> you have an idea how much throughput this would use?
>
> Thanks
>
> On Tue, Oct 19, 2021 at 5:06 PM Magnus Edenhill <ma...@edenhill.se>
> wrote:
> >
> > Den tis 19 okt. 2021 kl 13:22 skrev Tom Bentley <tb...@redhat.com>:
> >
> > > Hi Magnus,
> > >
> > > I reviewed the KIP since you called the vote (sorry for not reviewing
> when
> > > you announced your intention to call the vote). I have a few questions
> on
> > > some of the details.
> > >
> > > 1. There's no Javadoc on ClientTelemetryPayload.data(), so I don't know
> > > whether the payload is exposed through this method as compressed or
> not.
> > > Later on you say "Decompression of the payloads will be handled by the
> > > broker metrics plugin, the broker should expose a suitable
> decompression
> > > API to the metrics plugin for this purpose.", which suggests it's the
> > > compressed data in the buffer, but then we don't know which codec was
> used,
> > > nor the API via which the plugin should decompress it if required for
> > > forwarding to the ultimate metrics store. Should the
> ClientTelemetryPayload
> > > expose a method to get the compression and a decompressor?
> > >
> >
> > Good point, updated.
> >
> >
> >
> > > 2. The client-side API is expressed as StringOrError
> > > ClientInstance::ClientInstanceId(int timeout_ms). I understand that
> you're
> > > thinking about the librdkafka implementation, but it would be good to
> show
> > > the API as it would appear on the Apache Kafka clients.
> > >
> >
> > This was meant as pseudo-code, but I changed it to Java.
> >
> >
> > > 3. "PushTelemetryRequest|Response - protocol request used by the
> client to
> > > send metrics to any broker it is connected to." To be clear, this means
> > > that the client can choose any of the connected brokers and push to
> just
> > > one of them? What should a supporting client do if it gets an error
> when
> > > pushing metrics to a broker, retry sending to the same broker or try
> > > pushing to another broker, or drop the metrics? Should supporting
> clients
> > > send successive requests to a single broker, or round robin, or is
> that up
> > > to the client author? I'm guessing the behaviour should be sticky to
> > > support the rate limiting features, but I think it would be good for
> client
> > > authors if this section were explicit on the recommended behaviour.
> > >
> >
> > You are right, I've updated the KIP to make this clearer.
> >
> >
> > > 4. "Mapping the client instance id to an actual application instance
> > > running on a (virtual) machine can be done by inspecting the metrics
> > > resource labels, such as the client source address and source port, or
> > > security principal, all of which are added by the receiving broker.
> This
> > > will allow the operator together with the user to identify the actual
> > > application instance." Is this really always true? The source IP and
> port
> > > might be a loadbalancer/proxy in some setups. The principal, as already
> > > mentioned in the KIP, might be shared between multiple applications.
> So at
> > > worst the organization running the clients might have to consult the
> logs
> > > of a set of client applications, right?
> > >
> >
> > Yes, that's correct. There's no guaranteed mapping from
> client_instance_id
> > to
> > an actual instance, that's why the KIP recommends client implementations
> to
> > log the client instance id
> > upon retrieval, and also provide an API for the application to retrieve
> the
> > instance id programmatically
> > if it has a better way of exposing it.
> >
> >
> > > 5. "Tests indicate that a compression ratio up to 10x is possible for the
> > > standard metrics." Client authors might appreciate your mentioning
> which
> > > compression codec got these results.
> > >
> >
> > Good point. Updated.
> >
> >
> > > 6. "Should the client send a push request prior to expiry of the
> previously
> > > calculated PushIntervalMs the broker will discard the metrics and
> return a
> > > PushTelemetryResponse with the ErrorCode set to RateLimited." Is this
> > > RATE_LIMITED a new error code? It's not mentioned in the "New Error
> Codes"
> > > section.
> > >
> >
> > That's a leftover, it should be using the standard ThrottleTime
> mechanism.
> > Fixed.
> >
> >
> > > 7. In the section "Standard client resource labels" application_id is
> > > described as Kafka Streams only, but the section of "Client
> Identification"
> > > talks about "application instance id as an optional future nice-to-have
> > > that may be included as a metrics label if it has been set by the
> user", so
> > > I'm confused whether non-Kafka Streams clients should set an
> application_id
> > > or not.
> > >
> >
> > I'll clarify this in the KIP, but basically we would need to add an `
> > application.id` config
> > property for non-streams clients for this purpose, and that's outside the
> > scope of this KIP since we want to make it zero-conf:ish on the client
> side.
> >
> >
> > >
> > > Kind regards,
> > >
> > > Tom
> > >
> >
> > Thanks for the review,
> > Magnus
> >
> >
> >
> > >
> > > On Thu, Oct 7, 2021 at 5:26 PM Magnus Edenhill <ma...@edenhill.se>
> wrote:
> > >
> > > > Hi all,
> > > >
> > > > I've updated the KIP following our recent discussions on the mailing
> > > list:
> > > >  - split the protocol in two, one for getting the metrics
> subscriptions,
> > > > and one for pushing the metrics.
> > > >  - simplifications: initially only one supported metrics format, no
> > > > client.id in the instance id, etc.
> > > >  - made CLIENT_METRICS subscription configuration entries more
> structured
> > > >    and allowing better client matching selectors (not only on the
> > > instance
> > > > id, but also the other
> > > >    client resource labels, such as client_software_name, etc.).
> > > >
> > > > Unless there are further comments I'll call the vote in a day or two.
> > > >
> > > > Regards,
> > > > Magnus
> > > >
> > > >
> > > >
> > > > Den mån 4 okt. 2021 kl 20:57 skrev Magnus Edenhill <
> magnus@edenhill.se>:
> > > >
> > > > > Hi Gwen,
> > > > >
> > > > > I'm finishing up the KIP based on the last couple of discussion
> points
> > > in
> > > > > this thread
> > > > > and will call the Vote later this week.
> > > > >
> > > > > Best,
> > > > > Magnus
> > > > >
> > > > > Den lör 2 okt. 2021 kl 02:01 skrev Gwen Shapira
> > > > <gwen@confluent.io.invalid
> > > > > >:
> > > > >
> > > > >> Hey,
> > > > >>
> > > > >> I noticed that there was no discussion for the last 10 days, but I
> > > > >> couldn't
> > > > >> find the vote thread. Is there one that I'm missing?
> > > > >>
> > > > >> Gwen
> > > > >>
> > > > >> On Wed, Sep 22, 2021 at 4:58 AM Magnus Edenhill <
> magnus@edenhill.se>
> > > > >> wrote:
> > > > >>
> > > > >> > Den tis 21 sep. 2021 kl 06:58 skrev Colin McCabe <
> > > cmccabe@apache.org
> > > > >:
> > > > >> >
> > > > >> > > On Mon, Sep 20, 2021, at 17:35, Feng Min wrote:
> > > > >> > > > Thanks Magnus & Colin for the discussion.
> > > > >> > > >
> > > > >> > > > Based on KIP-714's stateless design, Client can pretty much
> use
> > > > any
> > > > >> > > > connection to any broker to send metrics. We are not
> associating
> > > > >> > > connection
> > > > >> > > > with client metric state. Is my understanding correct? If
> yes,
> > > > how
> > > > >> > about
> > > > >> > > > the following two scenarios
> > > > >> > > >
> > > > >> > > > 1) One Client (Client-ID) registers two different client
> > > instance
> > > > id
> > > > >> > via
> > > > >> > > > separate registration. Is it permitted? If OK, how to
> > > distinguish
> > > > >> them
> > > > >> > > from
> > > > >> > > > the case 2 below.
> > > > >> > > >
> > > > >> > >
> > > > >> > > Hi Feng,
> > > > >> > >
> > > > >> > > My understanding, which Magnus can clarify I guess, is that
> you
> > > > could
> > > > >> > have
> > > > >> > > something like two Producer instances running with the same
> > > > client.id
> > > > >> > > (perhaps because they're using the same config file, for
> example).
> > > > >> They
> > > > >> > > could even be in the same process. But they would get separate
> > > > UUIDs.
> > > > >> > >
> > > > >> > > I believe Magnus used the term client to mean "Producer or
> > > > Consumer".
> > > > >> So
> > > > >> > > if you have both a Producer and a Consumer in your
> application I
> > > > would
> > > > >> > > expect you'd get separate UUIDs for both. Again Magnus can
> chime
> > > in
> > > > >> > here, I
> > > > >> > > guess.
> > > > >> > >
> > > > >> >
> > > > >> > That's correct.
> > > > >> >
> > > > >> >
> > > > >> > >
> > > > >> > > > 2) How about the client restarting? What's the expectation?
> > > Should
> > > > >> the
> > > > >> > > > server expect the client to carry a persisted client
> instance id
> > > > or
> > > > >> > > should
> > > > >> > > > the client be treated as a new instance?
> > > > >> > >
> > > > >> > > The KIP doesn't describe any mechanism for persistence, so I
> would
> > > > >> assume
> > > > >> > > that when you restart the client you get a new UUID. I agree
> that
> > > it
> > > > >> > would
> > > > >> > > be good to spell this out.
> > > > >> > >
> > > > >> > >
> > > > >> > Right, it will not be persisted since a client instance can't be
> > > > >> restarted.
> > > > >> >
> > > > >> > Will update the KIP to make this clearer.
> > > > >> >
> > > > >> > /Magnus
> > > > >> >
> > > > >>
> > > > >>
> > > > >> --
> > > > >> Gwen Shapira
> > > > >> Engineering Manager | Confluent
> > > > >> 650.450.2760 | @gwenshap
> > > > >> Follow us: Twitter | blog
> > > > >>
> > > > >
> > > >
> > >
>

Re: [DISCUSS] KIP-714: Client metrics and observability

Posted by Xavier Léauté <xa...@confluent.io.INVALID>.

> 24.2 Does delta only apply to Counter type?

> 24.3 In the delta representation, the first request needs to send the full
> value, how does the broker plugin know whether a value is full or delta?
>

The temporality semantics are defined by the OpenTelemetry data model.
Deferring to OpenTelemetry avoids having to redefine all those semantics in
the Kafka protocol.
https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/metrics/datamodel.md
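
A minimal sketch of what that deferral means for a receiver, assuming a simplified model of the data: in the OpenTelemetry data model, Sum and Histogram points carry an aggregation temporality (delta or cumulative), while Gauge points do not, so a plugin can read the temporality from the data rather than infer it from the first request. The names `Temporality`, `SumPoint` and `to_cumulative` below are hypothetical stand-ins for illustration only; they are not OpenTelemetry or Kafka APIs.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Iterable, List


class Temporality(Enum):
    """The two aggregation temporalities named in the OpenTelemetry data model."""
    DELTA = "delta"
    CUMULATIVE = "cumulative"


@dataclass(frozen=True)
class SumPoint:
    """Hypothetical, simplified stand-in for one Sum data point."""
    name: str
    value: float
    temporality: Temporality


def to_cumulative(points: Iterable[SumPoint]) -> List[float]:
    """Return the running total after each point, tracked per metric name.

    A delta point is added to the running total; a cumulative point
    replaces it, because it already carries the full value.
    """
    totals: Dict[str, float] = {}
    out: List[float] = []
    for p in points:
        if p.temporality is Temporality.DELTA:
            totals[p.name] = totals.get(p.name, 0.0) + p.value
        else:
            totals[p.name] = p.value
        out.append(totals[p.name])
    return out
```

Under this model the receiver never has to treat the first value specially: each point states its own temporality, and the conversion follows from that field alone.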

Hopefully that clarifies things,
Xavier

Re: [DISCUSS] KIP-714: Client metrics and observability

Posted by Sarat Kakarla <sk...@confluent.io.INVALID>.

Jun

Following are the answers to some of the questions you raised.

>> 26. client-metrics entity:
>> 26.1 It seems that we could add multiple entities that match to the same client. Which one takes precedence?

All the matching client metrics would be compiled into a single list and sent to the client.
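
A rough sketch of that merge, under stated assumptions: each entity is modelled as a set of selector labels plus a list of metric names, selectors are compared by exact equality (the KIP may define richer matching), and the merged list keeps first-seen order. `Subscription`, `matches` and `compile_metrics` are hypothetical names used for illustration, not broker code.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Subscription:
    """Hypothetical model of one client-metrics entity."""
    name: str
    match: Dict[str, str]  # selector label -> required value
    metrics: List[str] = field(default_factory=list)  # subscribed metric names


def matches(sub: Subscription, labels: Dict[str, str]) -> bool:
    """True when every selector equals the client's label (exact match assumed)."""
    return all(labels.get(key) == value for key, value in sub.match.items())


def compile_metrics(subs: List[Subscription], labels: Dict[str, str]) -> List[str]:
    """Union the metric lists of all matching entities, without duplicates."""
    seen = set()
    merged: List[str] = []
    for sub in subs:
        if not matches(sub, labels):
            continue
        for metric in sub.metrics:
            if metric not in seen:
                seen.add(metric)
                merged.append(metric)
    return merged
```

In this sketch no entity takes precedence over another: a client matched by two entities simply receives the union of both metric lists.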

>> 26.2 How do we persist the new client metrics entities? Do we need to add new ZK paths and new records in KRaft?

The idea is to add a new ConfigResourceType:CLIENT_METRICS and follow the same code paths as the other config resources as described in ConfigResource.Type, which means new ZK paths and new KRaft records would be added.

Thanks
Sarat




    On Wed, Jan 26, 2022 at 6:32 AM Magnus Edenhill <ma...@edenhill.se> wrote:

    > Hi all,
    >
    > I've updated the KIP with responses to the latest comments: Java client
    > dependencies (Thanks Kirk!), alternate designs (separate cluster, separate
    > producer, etc), etc.
    >
    > I will revive the vote thread.
    >
    > Thanks,
    > Magnus
    >
    >
    > Den mån 13 dec. 2021 kl 22:32 skrev Ryanne Dolan <ry...@gmail.com>:
    >
    > > I think we should be very careful about introducing new runtime
    > > dependencies into the clients. Historically this has been rare and
    > > essentially necessary (e.g. compression libs).
    > >
    > > Ryanne
    > >
    > > On Mon, Dec 13, 2021, 1:06 PM Kirk True <ki...@mustardgrain.com> wrote:
    > >
    > > > Hi Jun,
    > > >
    > > > On Thu, Dec 9, 2021, at 2:28 PM, Jun Rao wrote:
    > > > > 13. Using OpenTelemetry. Does that require runtime dependency
    > > > > on OpenTelemetry library? How good is the compatibility story
    > > > > of OpenTelemetry? This is important since an application could have
    > > other
    > > > > OpenTelemetry dependencies than the Kafka client.
    > > >
    > > > The current design is that the OpenTelemetry JARs would ship with the
    > > > client. Perhaps we can design the client such that the JARs aren't even
    > > > loaded if the user has opted out. The user could even exclude the JARs
    > > from
    > > > their dependencies if they so wished.
    > > >
    > > > I can't speak to the compatibility of the libraries. Is it possible
    > that
    > > > we include a shaded version?
    > > >
    > > > Thanks,
    > > > Kirk
    > > >
    > > > >
    > > > > 14. The proposal listed idempotence=true. This is more of a
    > > configuration
    > > > > than a metric. Are we including that as a metric? What other
    > > > configurations
    > > > > are we including? Should we separate the configurations from the
    > > metrics?
    > > > >
    > > > > Thanks,
    > > > >
    > > > > Jun
    > > > >
    > > > > On Mon, Nov 29, 2021 at 7:34 AM Magnus Edenhill <ma...@edenhill.se>
    > > > wrote:
    > > > >
    > > > > > Hey Bob,
    > > > > >
    > > > > > That's a good point.
    > > > > >
    > > > > > Request type labels were considered but since they're already
    > tracked
    > > > by
    > > > > > broker-side metrics
    > > > > > they were left out as to avoid metric duplication, however those
    > > > metrics
    > > > > > are not per connection,
    > > > > > so they won't be that useful in practice for troubleshooting
    > specific
    > > > > > client instances.
    > > > > >
    > > > > > I'll add the request_type label to the relevant metrics.
    > > > > >
    > > > > > Thanks,
    > > > > > Magnus
    > > > > >
    > > > > >
    > > > > > Den tis 23 nov. 2021 kl 19:20 skrev Bob Barrett
    > > > > > <bo...@confluent.io.invalid>:
    > > > > >
    > > > > > > Hi Magnus,
    > > > > > >
    > > > > > > Thanks for the thorough KIP, this seems very useful.
    > > > > > >
    > > > > > > Would it make sense to include the request type as a label for
    > the
    > > > > > > `client.request.success`, `client.request.errors` and
    > > > > > `client.request.rtt`
    > > > > > > metrics? I think it would be very useful to see which specific
    > > > requests
    > > > > > are
    > > > > > > succeeding and failing for a client. One specific case I can
    > think
    > > of
    > > > > > where
    > > > > > > this could be useful is producer batch timeouts. If a Java
    > > > application
    > > > > > does
    > > > > > > not enable producer client logs (unfortunately, in my experience
    > > this
    > > > > > > happens more often than it should), the application logs will
    > only
    > > > > > contain
    > > > > > > the expiration error message, but no information about what is
    > > > causing
    > > > > > the
    > > > > > > timeout. The requests might all be succeeding but taking too long
    > > to
    > > > > > > process batches, or metadata requests might be failing, or some
    > or
    > > > all
    > > > > > > produce requests might be failing (if the bootstrap servers are
    > > > reachable
    > > > > > > from the client but one or more other brokers are not, for
    > > example).
    > > > If
    > > > > > the
    > > > > > > cluster operator is able to identify the specific requests that
    > are
    > > > slow
    > > > > > or
    > > > > > > failing for a client, they will be better able to diagnose the
    > > issue
    > > > > > > causing batch timeouts.
    > > > > > >
    > > > > > > One drawback I can think of is that this will increase the
    > > > cardinality of
    > > > > > > the request metrics. But any given client is only going to use a
    > > > small
    > > > > > > subset of the request types, and since we already have partition
    > > > labels
    > > > > > for
    > > > > > > the topic-level metrics, I think request labels will still make
    > up
    > > a
    > > > > > > relatively small percentage of the set of metrics.
    > > > > > >
    > > > > > > Thanks,
    > > > > > > Bob
    > > > > > >
    > > > > > > On Mon, Nov 22, 2021 at 2:08 AM Viktor Somogyi-Vass <
    > > > > > > viktorsomogyi@gmail.com>
    > > > > > > wrote:
    > > > > > >
    > > > > > > > Hi Magnus,
    > > > > > > >
    > > > > > > > I think this is a very useful addition. We also have a similar
    > > (but
    > > > > > much
    > > > > > > > more simplistic) implementation of this. Maybe I missed it in
    > the
    > > > KIP
    > > > > > but
    > > > > > > > what about adding metrics about the subscription cache itself?
    > > > That I
    > > > > > > think
    > > > > > > > would improve its usability and debuggability as we'd be able
    > to
    > > > see
    > > > > > its
    > > > > > > > performance, hit/miss rates, eviction counts and others.
    > > > > > > >
    > > > > > > > Best,
    > > > > > > > Viktor
    > > > > > > >
    > > > > > > > On Thu, Nov 18, 2021 at 5:12 PM Magnus Edenhill <
    > > > magnus@edenhill.se>
    > > > > > > > wrote:
    > > > > > > >
    > > > > > > > > Hi Mickael,
    > > > > > > > >
    > > > > > > > > see inline.
    > > > > > > > >
    > > > > > > > > Den ons 10 nov. 2021 kl 15:21 skrev Mickael Maison <
    > > > > > > > > mickael.maison@gmail.com
    > > > > > > > > >:
    > > > > > > > >
    > > > > > > > > > Hi Magnus,
    > > > > > > > > >
    > > > > > > > > > I see you've addressed some of the points I raised above
    > but
    > > > some
    > > > > > (4,
    > > > > > > > > > 5) have not been addressed yet.
    > > > > > > > > >
    > > > > > > > >
    > > > > > > > > Re 4) How will the user/app know metrics are being sent.
    > > > > > > > >
    > > > > > > > > One possibility is to add a JMX metric (thus for user
    > > > consumption)
    > > > > > for
    > > > > > > > the
    > > > > > > > > number of metric pushes the
    > > > > > > > > client has performed, or perhaps the number of metrics
    > > > subscriptions
    > > > > > > > > currently being collected.
    > > > > > > > > Would that be sufficient?
    > > > > > > > >
    > > > > > > > > Re 5) Metric sizes and rates
    > > > > > > > >
    > > > > > > > > A worst case scenario for a producer that is producing to 50
    > > > unique
    > > > > > > > topics
    > > > > > > > > and emitting all standard metrics yields
    > > > > > > > > a serialized size of around 100KB prior to compression, which
    > > > > > > compresses
    > > > > > > > > down to about 20-30% of that depending
    > > > > > > > > on compression type and topic name uniqueness.
    > > > > > > > > The numbers for a consumer would be similar.
    > > > > > > > >
    > > > > > > > > In practice the number of unique topics would be far less,
    > and
    > > > the
    > > > > > > > > subscription set would typically be for a subset of metrics.
    > > > > > > > > So we're probably closer to 1KB, or less, compressed size per
    > > > client
    > > > > > > per
    > > > > > > > > push interval.
    > > > > > > > >
    > > > > > > > > As both the subscription set and push intervals are
    > controlled
    > > > by the
    > > > > > > > > cluster operator it shouldn't be too hard
    > > > > > > > > to strike a good balance between metrics overhead and
    > > > granularity.
    > > > > > > > >
    > > > > > > > >
    > > > > > > > >
    > > > > > > > > >
    > > > > > > > > > I'm really uneasy with this being enabled by default on the
    > > > client
    > > > > > > > > > side. When collecting data, I think the best practice is to
    > > > ensure
    > > > > > > > > > users are explicitly enabling it.
    > > > > > > > > >
    > > > > > > > >
    > > > > > > > > Requiring metrics to be explicitly enabled on clients
    > severely
    > > > > > cripples
    > > > > > > > its
    > > > > > > > > usability and value.
    > > > > > > > >
    > > > > > > > > One of the problems that this KIP aims to solve is for useful
    > > > metrics
    > > > > > > to
    > > > > > > > be
    > > > > > > > > available on demand
    > > > > > > > > regardless of the technical expertise of the user. As Ryanne points out,
    > > > > > > > > a savvy user/organization
    > > > > > > > > will typically have metrics collection and monitoring in
    > place
    > > > > > already,
    > > > > > > > and
    > > > > > > > > the benefits of this KIP
    > > > > > > > > are then more of a common set and format of metrics across client
    > > > > > > > > implementations and languages.
    > > > > > > > > But that is not the typical Kafka user in my experience,
    > > they're
    > > > not
    > > > > > > > Kafka
    > > > > > > > > experts and they don't have the
    > > > > > > > > knowledge of how to best instrument their clients.
    > > > > > > > > Having metrics enabled by default for this user base allows
    > the
    > > > Kafka
    > > > > > > > > operators to proactively and reactively
    > > > > > > > > monitor and troubleshoot client issues, without the need for
    > > the
    > > > less
    > > > > > > > savvy
    > > > > > > > > user to do anything.
    > > > > > > > > It is often too late to tell a user to enable metrics when
    > the
    > > > > > problem
    > > > > > > > has
    > > > > > > > > already occurred.
    > > > > > > > >
    > > > > > > > > Now, to be clear, even though metrics are enabled by default
    > on
    > > > > > clients
    > > > > > > > it
    > > > > > > > > is not enabled by default
    > > > > > > > > on the brokers; the Kafka operator needs to build and set up
    > a
    > > > > > metrics
    > > > > > > > > plugin and add metrics subscriptions
    > > > > > > > > before anything is sent from the client.
    > > > > > > > > It is opt-out on the clients and opt-in on the broker.
    > > > > > > > >
    > > > > > > > >
    > > > > > > > >
    > > > > > > > >
    > > > > > > > > > You mentioned brokers already have
    > > > > > > > > > some(most?) of the information contained in metrics, if so
    > > > then why
    > > > > > > > > > are we collecting it again? Surely there must be some new
    > > > > > information
    > > > > > > > > > in the client metrics.
    > > > > > > > > >
    > > > > > > > >
    > > > > > > > > From the user's perspective the Kafka infrastructure extends
    > > from
    > > > > > > > > producer.send() to
    > > > > > > > > messages being returned from consumer.poll(), a giant black
    > box
    > > > where
    > > > > > > > > there's a lot going on between those
    > > > > > > > > two points. The brokers currently only see what happens once those
    > > > > > > > > requests and messages hit the broker,
    > > > > > > > > but as Kafka clients are complex pieces of machinery there's
    > a
    > > > myriad
    > > > > > > of
    > > > > > > > > queues, timers, and state
    > > > > > > > > that's critical to the operation and infrastructure that's
    > not
    > > > > > > currently
    > > > > > > > > visible to the operator.
    > > > > > > > > Relying on the user to accurately and timely provide this
    > > missing
    > > > > > > > > information is not generally feasible.
    > > > > > > > >
    > > > > > > > >
    > > > > > > > > Most of the standard metrics listed in the KIP are data
    > points
    > > > that
    > > > > > the
    > > > > > > > > broker does not have.
    > > > > > > > > Only a small number of metrics are duplicates (like the
    > request
    > > > > > counts
    > > > > > > > and
    > > > > > > > > sizes), but they are included
    > > > > > > > > to ease correlation when inspecting these client metrics.
    > > > > > > > >
    > > > > > > > >
    > > > > > > > >
    > > > > > > > > > Moreover this is a brand new feature so it's even harder to
    > > > justify
    > > > > > > > > > enabling it and forcing onto all our users. If disabled by
    > > > default,
    > > > > > > > > > it's relatively easy to enable in a new release if we
    > decide
    > > > to,
    > > > > > but
    > > > > > > > > > once enabled by default it's much harder to disable. Also
    > > this
    > > > > > > feature
    > > > > > > > > > will apply to all future metrics we will add.
    > > > > > > > > >
    > > > > > > > >
    > > > > > > > > I think maturity of a feature implementation should be the
    > > > deciding
    > > > > > > > factor,
    > > > > > > > > rather than
    > > > > > > > > the design of it (which this KIP is). I.e., if the
    > > > implementation is
    > > > > > > not
    > > > > > > > > deemed mature enough
    > > > > > > > > for release X.Y it will be disabled.
    > > > > > > > >
    > > > > > > > >
    > > > > > > > >
    > > > > > > > > > Overall I think it's an interesting feature but I'd prefer
    > to
    > > > be
    > > > > > > > > > slightly defensive and see how it works in practice before
    > > > enabling
    > > > > > > it
    > > > > > > > > > everywhere.
    > > > > > > > > >
    > > > > > > > >
    > > > > > > > > Right, and I agree on being defensive, but since this feature
    > > > still
    > > > > > > > > requires manual
    > > > > > > > > enabling on the brokers before actually being used, I think
    > > that
    > > > > > gives
    > > > > > > > > enough control
    > > > > > > > > to opt-in or out of this feature as needed.
    > > > > > > > >
    > > > > > > > > Thanks for your comments!
    > > > > > > > >
    > > > > > > > > Regards,
    > > > > > > > > Magnus
    > > > > > > > >
    > > > > > > > >
    > > > > > > > >
    > > > > > > > > > Thanks,
    > > > > > > > > > Mickael
    > > > > > > > > >
    > > > > > > > > > On Mon, Nov 8, 2021 at 7:55 AM Magnus Edenhill <
    > > > magnus@edenhill.se
    > > > > > >
    > > > > > > > > wrote:
    > > > > > > > > > >
    > > > > > > > > > > Thanks David for pointing this out,
    > > > > > > > > > > I've updated the KIP to include client_id as a matching
    > > > selector.
    > > > > > > > > > >
    > > > > > > > > > > Regards,
    > > > > > > > > > > Magnus
    > > > > > > > > > >
    > > > > > > > > > > Den tors 4 nov. 2021 kl 18:01 skrev David Mao
    > > > > > > > > <dmao@confluent.io.invalid
    > > > > > > > > > >:
    > > > > > > > > > >
    > > > > > > > > > > > Hey Magnus,
    > > > > > > > > > > >
    > > > > > > > > > > > I noticed that the KIP outlines the initial selectors
    > > > supported
    > > > > > > as:
    > > > > > > > > > > >
    > > > > > > > > > > >    - client_instance_id - CLIENT_INSTANCE_ID UUID
    > string
    > > > > > > > > > representation.
    > > > > > > > > > > >    - client_software_name  - client software
    > > implementation
    > > > > > name.
    > > > > > > > > > > >    - client_software_version  - client software
    > > > implementation
    > > > > > > > > version.
    > > > > > > > > > > >
    > > > > > > > > > > > In the given reactive monitoring workflow, we mention
    > > that
    > > > the
    > > > > > > > > > application
    > > > > > > > > > > > user does not know their client's client instance ID,
    > but
    > > > it's
    > > > > > > > > outlined
    > > > > > > > > > > > that the operator can add a metrics subscription
    > > selecting
    > > > for
    > > > > > > > > > clientId. I
    > > > > > > > > > > > don't see clientId as one of the supported selectors.
    > > > > > > > > > > > I can see how this would have made sense in a previous
    > > > > > iteration
    > > > > > > > > given
    > > > > > > > > > that
    > > > > > > > > > > > the previous client instance ID proposal was to
    > construct
    > > > the
    > > > > > > > client
    > > > > > > > > > > > instance ID using clientId as a prefix. Now that the
    > > client
    > > > > > > > instance
    > > > > > > > > > ID is
    > > > > > > > > > > > a UUID, would we want to add clientId as a supported
    > > > selector?
    > > > > > > > > > > > Let me know what you think.
    > > > > > > > > > > >
    > > > > > > > > > > > David
    > > > > > > > > > > >
    > > > > > > > > > > > On Tue, Oct 19, 2021 at 12:33 PM Magnus Edenhill <
    > > > > > > > magnus@edenhill.se
    > > > > > > > > >
    > > > > > > > > > > > wrote:
    > > > > > > > > > > >
    > > > > > > > > > > > > Hi Mickael!
    > > > > > > > > > > > >
    > > > > > > > > > > > > Den tis 19 okt. 2021 kl 19:30 skrev Mickael Maison <
    > > > > > > > > > > > > mickael.maison@gmail.com
    > > > > > > > > > > > > >:
    > > > > > > > > > > > >
    > > > > > > > > > > > > > Hi Magnus,
    > > > > > > > > > > > > >
    > > > > > > > > > > > > > Thanks for the proposal.
    > > > > > > > > > > > > >
    > > > > > > > > > > > > > 1. Looking at the protocol section, isn't
    > > > > > "ClientInstanceId"
    > > > > > > > > > expected
    > > > > > > > > > > > > > to be a field in
    > GetTelemetrySubscriptionsResponseV0?
    > > > > > > > Otherwise,
    > > > > > > > > > how
    > > > > > > > > > > > > > does a client retrieve this value?
    > > > > > > > > > > > > >
    > > > > > > > > > > > >
    > > > > > > > > > > > > Good catch, it got removed by mistake in one of the
    > > > edits.
    > > > > > > > > > > > >
    > > > > > > > > > > > >
    > > > > > > > > > > > > >
    > > > > > > > > > > > > > 2. In the client API section, you mention a new
    > > method
    > > > > > > > > > > > > > "clientInstanceId()". Can you clarify which
    > > interfaces
    > > > are
    > > > > > > > > > affected?
    > > > > > > > > > > > > > Is it only Consumer and Producer?
    > > > > > > > > > > > > >
    > > > > > > > > > > > >
    > > > > > > > > > > > > And Admin. Will update the KIP.
    > > > > > > > > > > > >
    > > > > > > > > > > > >
    > > > > > > > > > > > >
    > > > > > > > > > > > > > 3. I'm a bit concerned this is enabled by default.
    > > > Even if
    > > > > > > the
    > > > > > > > > data
    > > > > > > > > > > > > > collected is supposed to be not sensitive, I think
    > > > this can
    > > > > > > be
    > > > > > > > > > > > > > problematic in some environments. Also users don't
    > > > seem to
    > > > > > > have
    > > > > > > > > the
    > > > > > > > > > > > > > choice to only expose some metrics. Knowing how
    > much
    > > > data
    > > > > > > > transit
    > > > > > > > > > > > > > through some applications can be considered
    > critical.
    > > > > > > > > > > > > >
    > > > > > > > > > > > >
    > > > > > > > > > > > > The broker already knows how much data transits
    > through
    > > > the
    > > > > > > > client
    > > > > > > > > > > > though,
    > > > > > > > > > > > > right?
    > > > > > > > > > > > > Care has been taken not to expose information in the
    > > > standard
    > > > > > > > > metrics
    > > > > > > > > > > > that
    > > > > > > > > > > > > might
    > > > > > > > > > > > > reveal sensitive information.
    > > > > > > > > > > > >
    > > > > > > > > > > > > Do you have an example of how the proposed metrics
    > > could
    > > > leak
    > > > > > > > > > sensitive
    > > > > > > > > > > > > information?
    > > > > > > > > > > > > As for limiting what metrics to export; I guess
    > > that
    > > > > > could
    > > > > > > > make
    > > > > > > > > > sense
    > > > > > > > > > > > > in some
    > > > > > > > > > > > > very sensitive use-cases, but those users might
    > disable
    > > > > > metrics
    > > > > > > > > > > > altogether
    > > > > > > > > > > > > for now.
    > > > > > > > > > > > > Could these concerns be addressed by a later KIP?
    > > > > > > > > > > > >
    > > > > > > > > > > > >
    > > > > > > > > > > > >
    > > > > > > > > > > > > >
    > > > > > > > > > > > > > 4. As a user, how do you know if your application
    > is
    > > > > > actively
    > > > > > > > > > sending
    > > > > > > > > > > > > > metrics? Are there new metrics exposing what's
    > going
    > > > on,
    > > > > > like
    > > > > > > > how
    > > > > > > > > > much
    > > > > > > > > > > > > > data is being sent?
    > > > > > > > > > > > > >
    > > > > > > > > > > > >
    > > > > > > > > > > > > That's a good question.
    > > > > > > > > > > > > Since the proposed metrics interface is not aimed at,
    > > or
    > > > > > > directly
    > > > > > > > > > > > available
    > > > > > > > > > > > > to, the application
    > > > > > > > > > > > > I guess there's little point of adding it here, but
    > > > instead
    > > > > > > > adding
    > > > > > > > > > > > > something to the
    > > > > > > > > > > > > existing JMX metrics?
    > > > > > > > > > > > > Do you have any suggestions?
    > > > > > > > > > > > >
    > > > > > > > > > > > >
    > > > > > > > > > > > >
    > > > > > > > > > > > > > 5. If all metrics are enabled on a regular Consumer
    > > or
    > > > > > > > Producer,
    > > > > > > > > do
    > > > > > > > > > > > > > you have an idea how much throughput this would
    > use?
    > > > > > > > > > > > > >
    > > > > > > > > > > > >
    > > > > > > > > > > > > It depends on the number of partition/topics/etc the
    > > > client
    > > > > > is
    > > > > > > > > > producing
    > > > > > > > > > > > > to/consuming from.
    > > > > > > > > > > > > I'll add some sizes to the KIP for some typical
    > > > use-cases.
    > > > > > > > > > > > >
    > > > > > > > > > > > >
    > > > > > > > > > > > > Thanks,
    > > > > > > > > > > > > Magnus
    > > > > > > > > > > > >
    > > > > > > > > > > > >
    > > > > > > > > > > > > > Thanks
    > > > > > > > > > > > > >
    > > > > > > > > > > > > > On Tue, Oct 19, 2021 at 5:06 PM Magnus Edenhill <
    > > > > > > > > > magnus@edenhill.se>
    > > > > > > > > > > > > > wrote:
    > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > On Tue, 19 Oct 2021 at 13:22, Tom Bentley <
    > > > > > > > > > tbentley@redhat.com
    > > > > > > > > > > > > wrote:
    > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > Hi Magnus,
    > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > I reviewed the KIP since you called the vote
    > > > (sorry for
    > > > > > > not
    > > > > > > > > > > > reviewing
    > > > > > > > > > > > > > when
    > > > > > > > > > > > > > > > you announced your intention to call the
    > vote). I
    > > > have
    > > > > > a
    > > > > > > > few
    > > > > > > > > > > > > questions
    > > > > > > > > > > > > > on
    > > > > > > > > > > > > > > > some of the details.
    > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > 1. There's no Javadoc on
    > > > ClientTelemetryPayload.data(),
    > > > > > > so
    > > > > > > > I
    > > > > > > > > > don't
    > > > > > > > > > > > > know
    > > > > > > > > > > > > > > > whether the payload is exposed through this
    > > method
    > > > as
    > > > > > > > > > compressed or
    > > > > > > > > > > > > > not.
    > > > > > > > > > > > > > > > Later on you say "Decompression of the payloads
    > > > will be
    > > > > > > > > > handled by
    > > > > > > > > > > > > the
    > > > > > > > > > > > > > > > broker metrics plugin, the broker should
    > expose a
    > > > > > > suitable
    > > > > > > > > > > > > > decompression
    > > > > > > > > > > > > > > > API to the metrics plugin for this purpose.",
    > > which
    > > > > > > > suggests
    > > > > > > > > > it's
    > > > > > > > > > > > the
    > > > > > > > > > > > > > > > compressed data in the buffer, but then we
    > don't
    > > > know
    > > > > > > which
    > > > > > > > > > codec
    > > > > > > > > > > > was
    > > > > > > > > > > > > > used,
    > > > > > > > > > > > > > > > nor the API via which the plugin should
    > > decompress
    > > > it
    > > > > > if
    > > > > > > > > > required
    > > > > > > > > > > > for
    > > > > > > > > > > > > > > > forwarding to the ultimate metrics store.
    > Should
    > > > the
    > > > > > > > > > > > > > ClientTelemetryPayload
    > > > > > > > > > > > > > > > expose a method to get the compression and a
    > > > > > > decompressor?
    > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > Good point, updated.
    > > > > > > > > > > > > > >
    > > > > > > > > > > > > > >
    > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > 2. The client-side API is expressed as
    > > > StringOrError
    > > > > > > > > > > > > > > > ClientInstance::ClientInstanceId(int
    > > timeout_ms). I
    > > > > > > > > understand
    > > > > > > > > > that
    > > > > > > > > > > > > > you're
    > > > > > > > > > > > > > > > thinking about the librdkafka implementation,
    > but
    > > > it
    > > > > > > would
    > > > > > > > be
    > > > > > > > > > good
    > > > > > > > > > > > to
    > > > > > > > > > > > > > show
    > > > > > > > > > > > > > > > the API as it would appear on the Apache Kafka
    > > > clients.
    > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > This was meant as pseudo-code, but I changed it
    > to
    > > > Java.
    > > > > > > > > > > > > > >
    > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > 3. "PushTelemetryRequest|Response - protocol
    > > > request
    > > > > > used
    > > > > > > > by
    > > > > > > > > > the
    > > > > > > > > > > > > > client to
    > > > > > > > > > > > > > > > send metrics to any broker it is connected to."
    > > To
    > > > be
    > > > > > > > clear,
    > > > > > > > > > this
    > > > > > > > > > > > > means
    > > > > > > > > > > > > > > > that the client can choose any of the connected
    > > > brokers
    > > > > > > and
    > > > > > > > > > push to
    > > > > > > > > > > > > > just
    > > > > > > > > > > > > > > > one of them? What should a supporting client do
    > > if
    > > > it
    > > > > > > gets
    > > > > > > > an
    > > > > > > > > > error
    > > > > > > > > > > > > > when
    > > > > > > > > > > > > > > > pushing metrics to a broker, retry sending to
    > the
    > > > same
    > > > > > > > broker
    > > > > > > > > > or
    > > > > > > > > > > > try
    > > > > > > > > > > > > > > > pushing to another broker, or drop the metrics?
    > > > Should
    > > > > > > > > > supporting
    > > > > > > > > > > > > > clients
    > > > > > > > > > > > > > > > send successive requests to a single broker, or
    > > > round
    > > > > > > > robin,
    > > > > > > > > > or is
    > > > > > > > > > > > > > that up
    > > > > > > > > > > > > > > > to the client author? I'm guessing the
    > behaviour
    > > > should
    > > > > > > be
    > > > > > > > > > sticky
    > > > > > > > > > > > to
    > > > > > > > > > > > > > > > support the rate limiting features, but I think
    > > it
    > > > > > would
    > > > > > > be
    > > > > > > > > > good
    > > > > > > > > > > > for
    > > > > > > > > > > > > > client
    > > > > > > > > > > > > > > > authors if this section were explicit on the
    > > > > > recommended
    > > > > > > > > > behaviour.
    > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > You are right, I've updated the KIP to make this
    > > > clearer.
    > > > > > > > > > > > > > >
    > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > 4. "Mapping the client instance id to an actual
    > > > > > > application
    > > > > > > > > > > > instance
    > > > > > > > > > > > > > > > running on a (virtual) machine can be done by
    > > > > > inspecting
    > > > > > > > the
    > > > > > > > > > > > metrics
    > > > > > > > > > > > > > > > resource labels, such as the client source
    > > address
    > > > and
    > > > > > > > source
    > > > > > > > > > port,
    > > > > > > > > > > > > or
    > > > > > > > > > > > > > > > security principal, all of which are added by
    > the
    > > > > > > receiving
    > > > > > > > > > broker.
    > > > > > > > > > > > > > This
    > > > > > > > > > > > > > > > will allow the operator together with the user
    > to
    > > > > > > identify
    > > > > > > > > the
    > > > > > > > > > > > actual
    > > > > > > > > > > > > > > > application instance." Is this really always
    > > true?
    > > > The
    > > > > > > > source
    > > > > > > > > > IP
    > > > > > > > > > > > and
    > > > > > > > > > > > > > port
    > > > > > > > > > > > > > > > might be a loadbalancer/proxy in some setups.
    > The
    > > > > > > > principal,
    > > > > > > > > as
    > > > > > > > > > > > > already
    > > > > > > > > > > > > > > > mentioned in the KIP, might be shared between
    > > > multiple
    > > > > > > > > > > > applications.
    > > > > > > > > > > > > > So at
    > > > > > > > > > > > > > > > worst the organization running the clients
    > might
    > > > have
    > > > > > to
    > > > > > > > > > consult
    > > > > > > > > > > > the
    > > > > > > > > > > > > > logs
    > > > > > > > > > > > > > > > of a set of client applications, right?
    > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > Yes, that's correct. There's no guaranteed
    > mapping
    > > > from
    > > > > > > > > > > > > > client_instance_id
    > > > > > > > > > > > > > > to
    > > > > > > > > > > > > > > an actual instance, that's why the KIP recommends
    > > > client
    > > > > > > > > > > > > implementations
    > > > > > > > > > > > > > to
    > > > > > > > > > > > > > > log the client instance id
    > > > > > > > > > > > > > > upon retrieval, and also provide an API for the
    > > > > > application
    > > > > > > > to
    > > > > > > > > > > > retrieve
    > > > > > > > > > > > > > the
    > > > > > > > > > > > > > > instance id programmatically
    > > > > > > > > > > > > > > if it has a better way of exposing it.
    > > > > > > > > > > > > > >
    > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > 5. "Tests indicate that a compression ratio up to
    > > > 10x is
    > > > > > > > > > possible for
    > > > > > > > > > > > > the
    > > > > > > > > > > > > > > > standard metrics." Client authors might
    > > appreciate
    > > > your
    > > > > > > > > > mentioning
    > > > > > > > > > > > > > which
    > > > > > > > > > > > > > > > compression codec got these results.
    > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > Good point. Updated.
    > > > > > > > > > > > > > >
    > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > 6. "Should the client send a push request prior
    > > to
    > > > > > expiry
    > > > > > > > of
    > > > > > > > > > the
    > > > > > > > > > > > > > previously
    > > > > > > > > > > > > > > > calculated PushIntervalMs the broker will
    > discard
    > > > the
    > > > > > > > metrics
    > > > > > > > > > and
    > > > > > > > > > > > > > return a
    > > > > > > > > > > > > > > > PushTelemetryResponse with the ErrorCode set to
    > > > > > > > RateLimited."
    > > > > > > > > > Is
    > > > > > > > > > > > this
    > > > > > > > > > > > > > > > RATE_LIMITED a new error code? It's not
    > mentioned
    > > > in
    > > > > > the
    > > > > > > > "New
    > > > > > > > > > Error
    > > > > > > > > > > > > > Codes"
    > > > > > > > > > > > > > > > section.
    > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > That's a leftover, it should be using the
    > standard
    > > > > > > > ThrottleTime
    > > > > > > > > > > > > > mechanism.
    > > > > > > > > > > > > > > Fixed.
    > > > > > > > > > > > > > >
    > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > 7. In the section "Standard client resource
    > > labels"
    > > > > > > > > > application_id
    > > > > > > > > > > > is
    > > > > > > > > > > > > > > > described as Kafka Streams only, but the
    > section
    > > of
    > > > > > > "Client
    > > > > > > > > > > > > > Identification"
    > > > > > > > > > > > > > > > talks about "application instance id as an
    > > optional
    > > > > > > future
    > > > > > > > > > > > > nice-to-have
    > > > > > > > > > > > > > > > that may be included as a metrics label if it
    > has
    > > > been
    > > > > > > set
    > > > > > > > by
    > > > > > > > > > the
    > > > > > > > > > > > > > user", so
    > > > > > > > > > > > > > > > I'm confused whether non-Kafka Streams clients
    > > > should
    > > > > > set
    > > > > > > > an
    > > > > > > > > > > > > > application_id
    > > > > > > > > > > > > > > > or not.
    > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > I'll clarify this in the KIP, but basically we
    > > would
    > > > need
    > > > > > > to
    > > > > > > > > add
    > > > > > > > > > an `
    > > > > > > > > > > > > > > application.id` config
    > > > > > > > > > > > > > > property for non-streams clients for this
    > purpose,
    > > > and
    > > > > > > that's
    > > > > > > > > > outside
    > > > > > > > > > > > > the
    > > > > > > > > > > > > > > scope of this KIP since we want to make it
    > > > zero-conf:ish
    > > > > > on
    > > > > > > > the
    > > > > > > > > > > > client
    > > > > > > > > > > > > > side.
    > > > > > > > > > > > > > >
    > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > Kind regards,
    > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > Tom
    > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > Thanks for the review,
    > > > > > > > > > > > > > > Magnus
    > > > > > > > > > > > > > >
    > > > > > > > > > > > > > >
    > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > On Thu, Oct 7, 2021 at 5:26 PM Magnus Edenhill
    > <
    > > > > > > > > > magnus@edenhill.se
    > > > > > > > > > > > >
    > > > > > > > > > > > > > wrote:
    > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > > Hi all,
    > > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > > I've updated the KIP following our recent
    > > > discussions
    > > > > > > on
    > > > > > > > > the
    > > > > > > > > > > > > mailing
    > > > > > > > > > > > > > > > list:
    > > > > > > > > > > > > > > > >  - split the protocol in two, one for getting
    > > the
    > > > > > > metrics
    > > > > > > > > > > > > > subscriptions,
    > > > > > > > > > > > > > > > > and one for pushing the metrics.
    > > > > > > > > > > > > > > > >  - simplifications: initially only one
    > > supported
    > > > > > > metrics
    > > > > > > > > > format,
    > > > > > > > > > > > no
    > > > > > > > > > > > > > > > > client.id in the instance id, etc.
    > > > > > > > > > > > > > > > >  - made CLIENT_METRICS subscription
    > > configuration
    > > > > > > entries
    > > > > > > > > > more
    > > > > > > > > > > > > > structured
    > > > > > > > > > > > > > > > >    and allowing better client matching
    > > selectors
    > > > (not
    > > > > > > > only
    > > > > > > > > > on the
    > > > > > > > > > > > > > > > instance
    > > > > > > > > > > > > > > > > id, but also the other
    > > > > > > > > > > > > > > > >    client resource labels, such as
    > > > > > > client_software_name,
    > > > > > > > > > etc.).
    > > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > > Unless there are further comments I'll call
    > the
    > > > vote
    > > > > > > in a
    > > > > > > > > > day or
    > > > > > > > > > > > > two.
    > > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > > Regards,
    > > > > > > > > > > > > > > > > Magnus
    > > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > > On Mon, 4 Oct 2021 at 20:57, Magnus
    > > > Edenhill <
    > > > > > > > > > > > > > magnus@edenhill.se> wrote:
    > > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > > > Hi Gwen,
    > > > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > > > I'm finishing up the KIP based on the last
    > > > couple
    > > > > > of
    > > > > > > > > > discussion
    > > > > > > > > > > > > > points
    > > > > > > > > > > > > > > > in
    > > > > > > > > > > > > > > > > > this thread
    > > > > > > > > > > > > > > > > > and will call the Vote later this week.
    > > > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > > > Best,
    > > > > > > > > > > > > > > > > > Magnus
    > > > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > > > On Sat, 2 Oct 2021 at 02:01, Gwen
    > > Shapira
    > > > > > > > > > > > > > > > > <gwen@confluent.io.invalid
    > > > > > > > > > > > > > > > > > > wrote:
    > > > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > > >> Hey,
    > > > > > > > > > > > > > > > > >>
    > > > > > > > > > > > > > > > > >> I noticed that there was no discussion for
    > > the
    > > > > > last
    > > > > > > 10
    > > > > > > > > > days,
    > > > > > > > > > > > > but I
    > > > > > > > > > > > > > > > > >> couldn't
    > > > > > > > > > > > > > > > > >> find the vote thread. Is there one that
    > I'm
    > > > > > missing?
    > > > > > > > > > > > > > > > > >>
    > > > > > > > > > > > > > > > > >> Gwen
    > > > > > > > > > > > > > > > > >>
    > > > > > > > > > > > > > > > > >> On Wed, Sep 22, 2021 at 4:58 AM Magnus
    > > > Edenhill <
    > > > > > > > > > > > > > magnus@edenhill.se>
    > > > > > > > > > > > > > > > > >> wrote:
    > > > > > > > > > > > > > > > > >>
    > > > > > > > > > > > > > > > > >> > On Tue, 21 Sep 2021 at 06:58,
    > Colin
    > > > > > McCabe <
    > > > > > > > > > > > > > > > cmccabe@apache.org
    > > > > > > > > > > > > > > > > > wrote:
    > > > > > > > > > > > > > > > > >> >
    > > > > > > > > > > > > > > > > >> > > On Mon, Sep 20, 2021, at 17:35, Feng
    > Min
    > > > > > wrote:
    > > > > > > > > > > > > > > > > >> > > > Thanks Magnus & Colin for the
    > > > discussion.
    > > > > > > > > > > > > > > > > >> > > >
    > > > > > > > > > > > > > > > > >> > > > Based on KIP-714's stateless design,
    > > > Client
    > > > > > > can
    > > > > > > > > > pretty
    > > > > > > > > > > > > much
    > > > > > > > > > > > > > use
    > > > > > > > > > > > > > > > > any
    > > > > > > > > > > > > > > > > >> > > > connection to any broker to send
    > > > metrics. We
    > > > > > > are
    > > > > > > > > not
    > > > > > > > > > > > > > associating
    > > > > > > > > > > > > > > > > >> > > connection
    > > > > > > > > > > > > > > > > >> > > > with client metric state. Is my
    > > > > > understanding
    > > > > > > > > > correct?
    > > > > > > > > > > > If
    > > > > > > > > > > > > > yes,
    > > > > > > > > > > > > > > > > how
    > > > > > > > > > > > > > > > > >> > about
    > > > > > > > > > > > > > > > > >> > > > the following two scenarios
    > > > > > > > > > > > > > > > > >> > > >
    > > > > > > > > > > > > > > > > >> > > > 1) One Client (Client-ID) registers
    > > two
    > > > > > > > different
    > > > > > > > > > client
    > > > > > > > > > > > > > > > instance
    > > > > > > > > > > > > > > > > id
    > > > > > > > > > > > > > > > > >> > via
    > > > > > > > > > > > > > > > > >> > > > separate registration. Is it
    > > permitted?
    > > > If
    > > > > > OK,
    > > > > > > > how
    > > > > > > > > > to
    > > > > > > > > > > > > > > > distinguish
    > > > > > > > > > > > > > > > > >> them
    > > > > > > > > > > > > > > > > >> > > from
    > > > > > > > > > > > > > > > > >> > > > the case 2 below.
    > > > > > > > > > > > > > > > > >> > > >
    > > > > > > > > > > > > > > > > >> > >
    > > > > > > > > > > > > > > > > >> > > Hi Feng,
    > > > > > > > > > > > > > > > > >> > >
    > > > > > > > > > > > > > > > > >> > > My understanding, which Magnus can
    > > > clarify I
    > > > > > > > guess,
    > > > > > > > > is
    > > > > > > > > > > > that
    > > > > > > > > > > > > > you
    > > > > > > > > > > > > > > > > could
    > > > > > > > > > > > > > > > > >> > have
    > > > > > > > > > > > > > > > > >> > > something like two Producer instances
    > > > running
    > > > > > > with
    > > > > > > > > the
    > > > > > > > > > > > same
    > > > > > > > > > > > > > > > > client.id
    > > > > > > > > > > > > > > > > >> > > (perhaps because they're using the
    > same
    > > > config
    > > > > > > > file,
    > > > > > > > > > for
    > > > > > > > > > > > > > example).
    > > > > > > > > > > > > > > > > >> They
    > > > > > > > > > > > > > > > > >> > > could even be in the same process. But
    > > > they
    > > > > > > would
    > > > > > > > > get
    > > > > > > > > > > > > separate
    > > > > > > > > > > > > > > > > UUIDs.
    > > > > > > > > > > > > > > > > >> > >
    > > > > > > > > > > > > > > > > >> > > I believe Magnus used the term client
    > to
    > > > mean
    > > > > > > > > > "Producer or
    > > > > > > > > > > > > > > > > Consumer".
    > > > > > > > > > > > > > > > > >> So
    > > > > > > > > > > > > > > > > >> > > if you have both a Producer and a
    > > > Consumer in
    > > > > > > your
    > > > > > > > > > > > > > application I
    > > > > > > > > > > > > > > > > would
    > > > > > > > > > > > > > > > > >> > > expect you'd get separate UUIDs for
    > > both.
    > > > > > Again
    > > > > > > > > > Magnus can
    > > > > > > > > > > > > > chime
    > > > > > > > > > > > > > > > in
    > > > > > > > > > > > > > > > > >> > here, I
    > > > > > > > > > > > > > > > > >> > > guess.
    > > > > > > > > > > > > > > > > >> > >
    > > > > > > > > > > > > > > > > >> >
    > > > > > > > > > > > > > > > > >> > That's correct.
    > > > > > > > > > > > > > > > > >> >
    > > > > > > > > > > > > > > > > >> >
    > > > > > > > > > > > > > > > > >> > >
    > > > > > > > > > > > > > > > > >> > > > 2) How about the client restarting?
    > > > What's
    > > > > > the
    > > > > > > > > > > > > expectation?
    > > > > > > > > > > > > > > > Should
    > > > > > > > > > > > > > > > > >> the
    > > > > > > > > > > > > > > > > >> > > > server expect the client to carry a
    > > > > > persisted
    > > > > > > > > client
    > > > > > > > > > > > > > instance id
    > > > > > > > > > > > > > > > > or
    > > > > > > > > > > > > > > > > >> > > should
    > > > > > > > > > > > > > > > > >> > > > the client be treated as a new
    > > instance?
    > > > > > > > > > > > > > > > > >> > >
    > > > > > > > > > > > > > > > > >> > > The KIP doesn't describe any mechanism
    > > for
    > > > > > > > > > persistence,
    > > > > > > > > > > > so I
    > > > > > > > > > > > > > would
    > > > > > > > > > > > > > > > > >> assume
    > > > > > > > > > > > > > > > > >> > > that when you restart the client you
    > get
    > > > a new
    > > > > > > > > UUID. I
    > > > > > > > > > > > agree
    > > > > > > > > > > > > > that
    > > > > > > > > > > > > > > > it
    > > > > > > > > > > > > > > > > >> > would
    > > > > > > > > > > > > > > > > >> > > be good to spell this out.
    > > > > > > > > > > > > > > > > >> > >
    > > > > > > > > > > > > > > > > >> > >
    > > > > > > > > > > > > > > > > >> > Right, it will not be persisted since a
    > > > client
    > > > > > > > > instance
    > > > > > > > > > > > can't
    > > > > > > > > > > > > be
    > > > > > > > > > > > > > > > > >> restarted.
    > > > > > > > > > > > > > > > > >> >
    > > > > > > > > > > > > > > > > >> > Will update the KIP to make this
    > clearer.
    > > > > > > > > > > > > > > > > >> >
    > > > > > > > > > > > > > > > > >> > /Magnus
    > > > > > > > > > > > > > > > > >> >
    > > > > > > > > > > > > > > > > >>
    > > > > > > > > > > > > > > > > >>
    > > > > > > > > > > > > > > > > >> --
    > > > > > > > > > > > > > > > > >> Gwen Shapira
    > > > > > > > > > > > > > > > > >> Engineering Manager | Confluent
    > > > > > > > > > > > > > > > > >> 650.450.2760 | @gwenshap
    > > > > > > > > > > > > > > > > >> Follow us: Twitter | blog
    > > > > > > > > > > > > > > > > >>
    > > > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > >
    > > > > > > > > > > > > >
    > > > > > > > > > > > >
    > > > > > > > > > > >
    > > > > > > > > >
    > > > > > > > >
    > > > > > > >
    > > > > > >
    > > > > >
    > > > >
    > > >
    > >
    >

Re: [DISCUSS] KIP-714: Client metrics and observability

Posted by Magnus Edenhill <ma...@edenhill.se>.

Hey Ismael,


> > The PushTelemetryRequest handler decompresses the payload before passing
> it
> > to the metrics plugin.
> > This was done to avoid having to expose a public decompression interface
> to
> > metrics plugin developers.
> >
>
> Are there cases where the metrics plugin developers would want to forward
> the compressed payload without decompressing?
>

Maybe, but most plugins probably want to either add some extra information
(e.g., from the auth context), or convert to another format, so the original
compressed blob is most likely not that interesting.
In any case the plugin will want to inspect the uncompressed metrics data to
verify it is not garbage before forwarding it upstream.

We could always add an option later to allow passing the metrics payload
verbatim if the need arises.

Thanks,
Magnus

Re: [DISCUSS] KIP-714: Client metrics and observability

Posted by Xavier Léauté <xa...@confluent.io.INVALID>.

>
> Are there cases where the metrics plugin developers would want to forward
> the compressed payload without decompressing?


The only interoperable use-case I can think of would be to forward the
payloads directly to an OpenTelemetry collector backend.
Today OTLP only mandates gzip/none compression support for gRPC and HTTP
protocols, so this might only work for a limited set
of compression formats (or no compression) out of the box.

see
https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/protocol/otlp.md#protocol-details

Maybe we could consider exposing the raw uncompressed bytes regardless of
client-side compression. Someone who wanted to avoid the cost of
deserializing the payload could then always forward it as-is and let the
OpenTelemetry collector add tags relevant to the broker originating those
client metrics.

Re: [DISCUSS] KIP-714: Client metrics and observability

Posted by Ismael Juma <is...@juma.me.uk>.

On Wed, Mar 30, 2022 at 4:08 AM Magnus Edenhill <ma...@edenhill.se> wrote:

> > 41. We include CompressionType in PushTelemetryRequestV0, but not in
> > ClientTelemetryPayload. How would the implementer know the compression
> type
> > for the telemetry payload?
> The PushTelemetryRequest handler decompresses the payload before passing it
> to the metrics plugin.
> This was done to avoid having to expose a public decompression interface to
> metrics plugin developers.
>

Are there cases where the metrics plugin developers would want to forward
the compressed payload without decompressing?

Ismael

Re: [DISCUSS] KIP-714: Client metrics and observability

Posted by Magnus Edenhill <ma...@edenhill.se>.

Hey Jun,

see response inline:

On Mon, 21 Mar 2022 at 19:31, Jun Rao <ju...@confluent.io.invalid> wrote:

> Hi, Kirk, Sarat,
>
> A few more comments.
>
> 40. GetTelemetrySubscriptionsResponseV0 : RequestedMetrics Array[string]
> uses "Array[0] empty string" to represent all metrics subscribed. We had a
> similar issue with the topics field in MetadataRequest and used the
> following convention.
> In version 1 and higher, an empty array indicates "request metadata for no
> topics," and a null array is used to indicate "request metadata for all
> topics."
> Should we use the same convention in GetTelemetrySubscriptionsResponseV0?
>

Right, I considered this but chose the current design because the
subscriptions are prefix-matched,
so an empty string will automatically match everything.
It is not critical in any way, so if you feel it is better to follow the
way MetadataRequest does it, I can change it?
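
To illustrate the prefix matching, here is a minimal Java sketch. It assumes
the client simply checks each metric name against the requested prefixes; the
class, method, and metric names are hypothetical and not taken from the KIP.

```java
// Hypothetical helper for illustration; not part of the KIP or Kafka itself.
final class SubscriptionPrefixSketch {

    // True if metricName starts with at least one of the requested prefixes.
    static boolean isSubscribed(java.util.List<String> requestedMetrics, String metricName) {
        for (String prefix : requestedMetrics) {
            if (metricName.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Metric names below are made up for the example.
        java.util.List<String> everything = java.util.List.of("");
        java.util.List<String> producerOnly = java.util.List.of("org.apache.kafka.producer.");

        System.out.println(isSubscribed(everything, "org.apache.kafka.consumer.poll.latency"));
        System.out.println(isSubscribed(producerOnly, "org.apache.kafka.consumer.poll.latency"));
    }
}
```

Since every string starts with the empty string, a single "" entry subscribes
to all metrics. In this sketch an empty list matches nothing.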



>
> 41. We include CompressionType in PushTelemetryRequestV0, but not in
> ClientTelemetryPayload. How would the implementer know the compression type
> for the telemetry payload?
>
>
The PushTelemetryRequest handler decompresses the payload before passing it
to the metrics plugin.
This was done to avoid having to expose a public decompression interface to
metrics plugin developers.
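
Roughly, the flow could look like the following Java sketch. It assumes gzip
as the codec and uses made-up class and method names; the actual handler and
plugin interfaces are whatever the KIP ends up defining.

```java
// Hypothetical sketch; names here are illustrative, not the KIP's interfaces.
final class TelemetryDecompressSketch {

    // Stand-in for the client side: gzip-compress a serialized metrics payload.
    static byte[] gzip(byte[] raw) {
        java.io.ByteArrayOutputStream out = new java.io.ByteArrayOutputStream();
        try (java.util.zip.GZIPOutputStream gz = new java.util.zip.GZIPOutputStream(out)) {
            gz.write(raw);
        } catch (java.io.IOException e) {
            throw new java.io.UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    // Broker side: decompress the payload inside the request handler.
    static byte[] gunzip(byte[] compressed) {
        try (java.util.zip.GZIPInputStream gz = new java.util.zip.GZIPInputStream(
                new java.io.ByteArrayInputStream(compressed))) {
            return gz.readAllBytes();
        } catch (java.io.IOException e) {
            throw new java.io.UncheckedIOException(e);
        }
    }

    // The handler decompresses first, so the plugin only sees uncompressed bytes.
    static void handlePush(byte[] compressedPayload,
                           java.util.function.Consumer<byte[]> metricsPlugin) {
        metricsPlugin.accept(gunzip(compressedPayload));
    }

    public static void main(String[] args) {
        byte[] payload = "example-metrics".getBytes(java.nio.charset.StandardCharsets.UTF_8);
        handlePush(gzip(payload), bytes -> System.out.println(
                new String(bytes, java.nio.charset.StandardCharsets.UTF_8)));
    }
}
```

With this shape the plugin never needs to know which codec the client chose.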



> 42. For blocking the metrics for certain clients in the following example,
> could you describe the corresponding config value used through the
> kafka-config command?
> # A descriptive name makes it easier to clean up old subscriptions;
> # --match selects this specific client instance.
> kafka-client-metrics.sh --bootstrap-server $BROKERS \
>    --add \
>    --name 'Disable_b69cc35a' \
>    --match client_instance_id=b69cc35a-7a54-4790-aa69-cc2bd4ee4538 \
>    --block
>

--block will set the "interval" ConfigEntry to "0", which overrides and
disables all accumulated subscriptions for the matching client instance.
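
As a rough Java sketch of that override: an interval of 0 on any matching
subscription disables pushing for the client. The class and method names are
hypothetical, and picking the smallest interval when nothing is blocked is an
assumption made for this example, not something stated above.

```java
// Hypothetical sketch of how a "0" interval could override other subscriptions.
final class BlockIntervalSketch {

    // Returns 0 (push disabled) if any matching subscription has interval 0.
    // Otherwise returns the smallest interval; that choice is an assumption.
    // An empty list is also treated as disabled here, again as an assumption.
    static int effectiveIntervalMs(java.util.List<Integer> matchingIntervalsMs) {
        if (matchingIntervalsMs.isEmpty()) {
            return 0;
        }
        int effective = Integer.MAX_VALUE;
        for (int intervalMs : matchingIntervalsMs) {
            if (intervalMs == 0) {
                return 0; // a blocking subscription wins over everything else
            }
            effective = Math.min(effective, intervalMs);
        }
        return effective;
    }

    public static void main(String[] args) {
        System.out.println(effectiveIntervalMs(java.util.List.of(30000, 60000)));
        System.out.println(effectiveIntervalMs(java.util.List.of(30000, 0, 60000)));
    }
}
```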


Thanks,
Magnus



> On Thu, Mar 10, 2022 at 11:57 AM Jun Rao <ju...@confluent.io> wrote:
>
> > Hi, Kirk, Sarat,
> >
> > Thanks for the reply.
> >
> > 28. On the broker, we typically use Yammer metrics. Only for metrics that
> > depend on Kafka metric features (e.g., quota), we use the Kafka metric.
> > Yammer metrics have 4 types: gauge, meter, histogram and timer. meter
> > calculates a rate, but also exposes an accumulated value.
> >
> > 29. The Histogram class in org.apache.kafka.common.metrics.stats was
> never
> > used in the client metrics. The implementation of Histogram only
> provides a
> > fixed number of values in the domain and may not capture the quantiles
> very
> > accurately. So, we punted on using it.
> >
> > Thanks,
> >
> > Jun
> >
> >
> >
> > On Thu, Mar 10, 2022 at 10:59 AM Sarat Kakarla
> > <sk...@confluent.io.invalid> wrote:
> >
> >> Jun,
> >>
> >>   >>  28. For the broker metrics, could you spell out the full metric
> name
> >>   >>   including groups, tags, etc? We typically don't add the broker_id
> >> label for
> >>   >>   broker metrics. Also, brokers use Yammer metrics, which doesn't
> >> have type
> >>   >>   Sum.
> >>
> >> Sure,  I will update the KIP-714 with the above information, will remove
> >> the broker-id label from the metrics.
> >>
> >> Regarding the type is CumulativeSum the right type to use in the place
> of
> >> Sum?
> >>
> >> Thanks
> >> Sarat
> >>
> >>
> >> On 3/8/22, 5:48 PM, "Jun Rao" <ju...@confluent.io.INVALID> wrote:
> >>
> >>     Hi, Magnus, Sarat and Xavier,
> >>
> >>     Thanks for the reply. A few more comments below.
> >>
> >>     20. It seems that we are piggybacking the plugin on the
> >>     existing MetricsReporter. So, this seems fine.
> >>
> >>     21. That could work. Are we requiring any additional jar dependency
> >> on the
> >>     client? Or, are you suggesting that we check the runtime dependency
> >> to pick
> >>     the compression codec?
> >>
> >>     28. For the broker metrics, could you spell out the full metric name
> >>     including groups, tags, etc? We typically don't add the broker_id
> >> label for
> >>     broker metrics. Also, brokers use Yammer metrics, which doesn't have
> >> type
> >>     Sum.
> >>
> >>     29. There are several client metrics listed as histogram. However,
> >> the java
> >>     client currently doesn't support histogram type.
> >>
> >>     30. Could you show an example of the metric payload in
> >> PushTelemetryRequest
> >>     to help understand how we organize metrics at different levels (per
> >>     instance, per topic, per partition, per broker, etc)?
> >>
> >>     31. Could you add a bit more detail on which client thread sends the
> >>     PushTelemetryRequest?
> >>
> >>     Thanks,
> >>
> >>     Jun
> >>
> >>     On Mon, Mar 7, 2022 at 11:48 AM Magnus Edenhill <magnus@edenhill.se
> >
> >> wrote:
> >>
> >>     > Hi Jun,
> >>     >
> >>     > Thanks for your well-informed questions; see my answers below.
> >>     > There have been a number of clarifications to the KIP.
> >>     >
> >>     >
> >>     >
> >>     > Den tors 27 jan. 2022 kl 20:08 skrev Jun Rao
> >> <ju...@confluent.io.invalid>:
> >>     >
> >>     > > Hi, Magnus,
> >>     > >
> >>     > > Thanks for updating the KIP. The overall approach makes sense to
> >> me. A
> >>     > few
> >>     > > more detailed comments below.
> >>     > >
> >>     > > 20. ClientTelemetry: Should it be extending configurable and
> >> closable?
> >>     > >
> >>     >
> >>     > I'll pass this question to Sarat and/or Xavier.
> >>     >
> >>     >
> >>     >
> >>     > > 21. Compression of the metrics on the client: what's the
> default?
> >>     > >
> >>     >
> >>     > How about we specify a prioritized list: zstd, lz4, snappy, gzip?
> >>     > But ultimately it is up to what the client supports.
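[Editor's sketch] The prioritized-list idea above could look roughly like the following: walk the preferred order and take the first codec the client actually supports. The function name, the `client_supported` parameter and the "no compression" fallback are assumptions for illustration, not something specified in this thread.

```python
# Preference order as suggested in the reply above.
PREFERRED_CODECS = ["zstd", "lz4", "snappy", "gzip"]


def pick_codec(client_supported, preferred=PREFERRED_CODECS):
    """Return the first preferred codec the client supports, else None.

    None stands for "send uncompressed" (an assumed fallback).
    """
    for codec in preferred:
        if codec in client_supported:
            return codec
    return None
```

For example, a client built with only lz4 and gzip would end up using lz4.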
> >>     >
> >>     >
> >>     > 23. A client instance is considered a metric resource and the
> >>     > > resource-level (thus client instance level) labels could
> include:
> >>     > >     client_software_name=confluent-kafka-python
> >>     > >     client_software_version=v2.1.3
> >>     > >     client_instance_id=B64CD139-3975-440A-91D4
> >>     > >     transactional_id=someTxnApp
> >>     > > Are those labels added in PushTelemetryRequest? If so, are they
> >> per
> >>     > metric
> >>     > > or per request?
> >>     > >
> >>     >
> >>     >
> >>     > client_software* and client_instance_id are not added by the
> >>     > client, but are available to the broker-side metrics plugin to
> >>     > add as it sees fit; removing them from the KIP.
> >>     >
> >>     > transactional_id, group_id, etc., which I believe will be useful
> >>     > in troubleshooting, are included only once (per push) as
> >>     > resource-level attributes (the client instance is a singular
> >>     > resource).
> >>     >
> >>     >
> >>     > >
> >>     > > 24.  "the broker will only send
> >>     > > GetTelemetrySubscriptionsResponse.DeltaTemporality=True" :
> >>     > > 24.1 If it's always true, does it need to be part of the
> protocol?
> >>     > >
> >>     >
> >>     > We're anticipating that it will take a lot longer to upgrade the
> >> majority
> >>     > of clients than the
> >>     > broker/plugin side, which is why we want the client to support
> both
> >>     > temporalities out-of-the-box
> >>     > so that cumulative reporting can be turned on seamlessly in the
> >> future.
> >>     >
> >>     >
> >>     >
> >>     > > 24.2 Does delta only apply to Counter type?
> >>     > >
> >>     >
> >>     >
> >>     > And Histograms. More details in Xavier's OTLP link.
> >>     >
> >>     >
> >>     >
> >>     > > 24.3 In the delta representation, the first request needs to
> send
> >> the
> >>     > full
> >>     > > value, how does the broker plugin know whether a value is full
> or
> >> delta?
> >>     > >
> >>     >
> >>     > The client may (should) send the start time for each metric
> sample,
> >>     > indicating when
> >>     > the metric began to be collected.
> >>     > We've discussed whether this should be the client instance start
> >> time or
> >>     > the time when a matching
> >>     > metric subscription for that metric is received.
> >>     > For completeness we recommend using the former, the client
> instance
> >> start
> >>     > time.
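[Editor's sketch] One way to picture the delta versus cumulative discussion: each collected sample carries a start time, so a receiver can tell what interval a value covers. The sketch below assumes OTLP-like semantics (a delta sample covers the time since the previous collection, a cumulative sample covers the time since the client instance started). The class and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    start_time: float   # beginning of the interval the value covers
    time: float         # when the sample was collected
    value: int


class Counter:
    """Hypothetical client-side counter supporting both temporalities."""

    def __init__(self, instance_start_time):
        self._instance_start = instance_start_time
        self._last_collect = instance_start_time
        self._total = 0
        self._last_reported_total = 0

    def add(self, n):
        self._total += n

    def collect(self, now, delta=True):
        if delta:
            # Delta: only what changed since the previous collection.
            sample = Sample(self._last_collect, now,
                            self._total - self._last_reported_total)
        else:
            # Cumulative: everything since the client instance started.
            sample = Sample(self._instance_start, now, self._total)
        self._last_collect = now
        self._last_reported_total = self._total
        return sample
```

Under these assumptions the first delta sample has a start time equal to the client instance start time, so it is effectively the "full" value.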
> >>     >
> >>     >
> >>     >
> >>     > > 25. quota:
> >>     > > 25.1 Since we are fitting PushTelemetryRequest into the existing
> >> request
> >>     > > quota, it would be useful to document the impact, i.e. client
> >> metric
> >>     > > throttling causes the data from the same client to be delayed.
> >>     > > 25.2 Is PushTelemetryRequest subject to the write bandwidth
> quota
> >> like
> >>     > the
> >>     > > producer?
> >>     > >
> >>     >
> >>     >
> >>     > Yes, it should be, so as to protect the cluster from rogue clients.
> >>     > But, in practice the size of metrics will be quite low (e.g.,
> >> 1-10kb per
> >>     > 60s interval), so I don't think this will pose a problem.
> >>     > The KIP has been updated with more details on quota/throttling
> >> behaviour,
> >>     > see the
> >>     > "Throttling and rate-limiting" section.
> >>     >
> >>     >
> >>     > 25.3 THROTTLING_QUOTA_EXCEEDED: Currently, we don't send this
> error
> >> when
> >>     > > the request/bandwidth quota is exceeded since those requests are
> >> not
> >>     > > rejected. We only set this error when the request is rejected
> >> (e.g.,
> >>     > topic
> >>     > > creation). It would be useful to clarify when this error is
> used.
> >>     > >
> >>     >
> >>     > Right, I was trying to reuse an existing error code. We can
> >>     > introduce a new one for the case where a client pushes metrics at
> >>     > a higher frequency than the configured push interval (e.g.,
> >>     > out-of-profile sends). This causes the broker to drop those
> >>     > metrics and send this error code back to the client. There will
> >>     > be no connection throttling / channel-muting in this case (unless
> >>     > the standard quotas are exceeded).
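[Editor's sketch] The out-of-profile check described above might be sketched as follows: the broker remembers when it last accepted a push from each client instance and drops anything that arrives sooner than the configured push interval. The class name, method name and the in-memory dict are assumptions for illustration; the actual error code and broker bookkeeping are defined by the KIP, not here.

```python
class PushIntervalGuard:
    """Hypothetical broker-side check for pushes sent too frequently."""

    def __init__(self, push_interval_s):
        self._interval = push_interval_s
        self._last_accepted = {}   # client_instance_id -> time of last accepted push

    def accept(self, client_instance_id, now):
        """Return True to accept the push, False to drop it.

        A False result is where the broker would return the new error code
        to the client; no connection throttling happens in this sketch.
        """
        last = self._last_accepted.get(client_instance_id)
        if last is not None and now - last < self._interval:
            return False
        self._last_accepted[client_instance_id] = now
        return True
```

Note that a dropped push does not reset the timer here, so a client that keeps pushing early still gets one push accepted per interval; that is a design choice of this sketch only.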
> >>     >
> >>     >
> >>     > > 27. kafka-client-metrics.sh: Could we add an example on how to
> >> disable a
> >>     > > bad client?
> >>     > >
> >>     >
> >>     > There's now a --block option to kafka-client-metrics.sh which
> >> overrides all
> >>     > subscriptions
> >>     > for the matched client(s). This allows silencing metrics for one
> or
> >> more
> >>     > clients without having
> >>     > to remove existing subscriptions. From the client's perspective it
> >> will
> >>     > look like it no longer has
> >>     > any subscriptions.
> >>     >
> >>     > # Block metrics collection for a specific client instance.
> >>     > # A descriptive --name makes it easier to clean up old
> >>     > # subscriptions; --match selects this specific client instance.
> >>     > $ kafka-client-metrics.sh --bootstrap-server $BROKERS \
> >>     >    --add \
> >>     >    --name 'Disable_b69cc35a' \
> >>     >    --match client_instance_id=b69cc35a-7a54-4790-aa69-cc2bd4ee4538 \
> >>     >    --block
> >>     >
> >>     >
> >>     >
> >>     >
> >>     > > 28. New broker side metrics: Could we spell out the details of
> the
> >>     > metrics
> >>     > > (e.g., group, tags, etc)?
> >>     > >
> >>     >
> >>     > KIP has been updated accordingly (thanks Sarat).
> >>     >
> >>     >
> >>     >
> >>     > >
> >>     > > 29. Client instance-level metrics: client.io.wait.time is a
> gauge
> >> not a
> >>     > > histogram.
> >>     > >
> >>     >
> >>     > I believe a population/distribution should preferably be
> >> represented as a
> >>     > histogram, space permitting,
> >>     > and only secondarily as a Gauge average.
> >>     > While we might not want to maintain a bunch of histograms for each
> >>     > partition, since that could be
> >>     > quite space consuming, this client.io.wait.time is a single metric
> >> per
> >>     > client instance and can
> >>     > thus afford a Histogram representation.
> >>     >
> >>     >
> >>     >
> >>     > Thanks,
> >>     > Magnus
> >>     >
> >>     >
> >>     >
> >>     > > Thanks,
> >>     > >
> >>     > > Jun
> >>     > >
> >>     > > On Wed, Jan 26, 2022 at 6:32 AM Magnus Edenhill <
> >> magnus@edenhill.se>
> >>     > > wrote:
> >>     > >
> >>     > > > Hi all,
> >>     > > >
> >>     > > > I've updated the KIP with responses to the latest comments:
> >> Java client
> >>     > > > dependencies (Thanks Kirk!), alternate designs (separate
> >> cluster,
> >>     > > separate
> >>     > > > producer, etc), etc.
> >>     > > >
> >>     > > > I will revive the vote thread.
> >>     > > >
> >>     > > > Thanks,
> >>     > > > Magnus
> >>     > > >
> >>     > > >
> >>     > > > Den mån 13 dec. 2021 kl 22:32 skrev Ryanne Dolan <
> >>     > ryannedolan@gmail.com
> >>     > > >:
> >>     > > >
> >>     > > > > I think we should be very careful about introducing new
> >> runtime
> >>     > > > > dependencies into the clients. Historically this has been
> >> rare and
> >>     > > > > essentially necessary (e.g. compression libs).
> >>     > > > >
> >>     > > > > Ryanne
> >>     > > > >
> >>     > > > > On Mon, Dec 13, 2021, 1:06 PM Kirk True <
> >> kirk@mustardgrain.com>
> >>     > wrote:
> >>     > > > >
> >>     > > > > > Hi Jun,
> >>     > > > > >
> >>     > > > > > On Thu, Dec 9, 2021, at 2:28 PM, Jun Rao wrote:
> >>     > > > > > > 13. Using OpenTelemetry. Does that require runtime
> >> dependency
> >>     > > > > > > on OpenTelemetry library? How good is the compatibility
> >> story
> >>     > > > > > > of OpenTelemetry? This is important since an application
> >> could
> >>     > have
> >>     > > > > other
> >>     > > > > > > OpenTelemetry dependencies than the Kafka client.
> >>     > > > > >
> >>     > > > > > The current design is that the OpenTelemetry JARs would
> >> ship with
> >>     > the
> >>     > > > > > client. Perhaps we can design the client such that the
> JARs
> >> aren't
> >>     > > even
> >>     > > > > > loaded if the user has opted out. The user could even
> >> exclude the
> >>     > > JARs
> >>     > > > > from
> >>     > > > > > their dependencies if they so wished.
> >>     > > > > >
> >>     > > > > > I can't speak to the compatibility of the libraries. Is it
> >> possible
> >>     > > > that
> >>     > > > > > we include a shaded version?
> >>     > > > > >
> >>     > > > > > Thanks,
> >>     > > > > > Kirk
> >>     > > > > >
> >>     > > > > > >
> >>     > > > > > > 14. The proposal listed idempotence=true. This is more
> of
> >> a
> >>     > > > > configuration
> >>     > > > > > > than a metric. Are we including that as a metric? What
> >> other
> >>     > > > > > configurations
> >>     > > > > > > are we including? Should we separate the configurations
> >> from the
> >>     > > > > metrics?
> >>     > > > > > >
> >>     > > > > > > Thanks,
> >>     > > > > > >
> >>     > > > > > > Jun
> >>     > > > > > >
> >>     > > > > > > On Mon, Nov 29, 2021 at 7:34 AM Magnus Edenhill <
> >>     > > magnus@edenhill.se>
> >>     > > > > > wrote:
> >>     > > > > > >
> >>     > > > > > > > Hey Bob,
> >>     > > > > > > >
> >>     > > > > > > > That's a good point.
> >>     > > > > > > >
> >>     > > > > > > > Request type labels were considered, but since they're
> >>     > > > > > > > already tracked by broker-side metrics they were left
> >>     > > > > > > > out so as to avoid metric duplication. However, those
> >>     > > > > > > > metrics are not per connection, so they won't be that
> >>     > > > > > > > useful in practice for troubleshooting specific client
> >>     > > > > > > > instances.
> >>     > > > > > > >
> >>     > > > > > > > I'll add the request_type label to the relevant
> metrics.
> >>     > > > > > > >
> >>     > > > > > > > Thanks,
> >>     > > > > > > > Magnus
> >>     > > > > > > >
> >>     > > > > > > >
> >>     > > > > > > > Den tis 23 nov. 2021 kl 19:20 skrev Bob Barrett
> >>     > > > > > > > <bo...@confluent.io.invalid>:
> >>     > > > > > > >
> >>     > > > > > > > > Hi Magnus,
> >>     > > > > > > > >
> >>     > > > > > > > > Thanks for the thorough KIP, this seems very useful.
> >>     > > > > > > > >
> >>     > > > > > > > > Would it make sense to include the request type as a
> >> label
> >>     > for
> >>     > > > the
> >>     > > > > > > > > `client.request.success`, `client.request.errors`
> and
> >>     > > > > > > > `client.request.rtt`
> >>     > > > > > > > > metrics? I think it would be very useful to see
> which
> >>     > specific
> >>     > > > > > requests
> >>     > > > > > > > are
> >>     > > > > > > > > succeeding and failing for a client. One specific
> >> case I can
> >>     > > > think
> >>     > > > > of
> >>     > > > > > > > where
> >>     > > > > > > > > this could be useful is producer batch timeouts. If
> a
> >> Java
> >>     > > > > > application
> >>     > > > > > > > does
> >>     > > > > > > > > not enable producer client logs (unfortunately, in
> my
> >>     > > experience
> >>     > > > > this
> >>     > > > > > > > > happens more often than it should), the application
> >> logs will
> >>     > > > only
> >>     > > > > > > > contain
> >>     > > > > > > > > the expiration error message, but no information
> >> about what
> >>     > is
> >>     > > > > > causing
> >>     > > > > > > > the
> >>     > > > > > > > > timeout. The requests might all be succeeding but
> >> taking too
> >>     > > long
> >>     > > > > to
> >>     > > > > > > > > process batches, or metadata requests might be
> >> failing, or
> >>     > some
> >>     > > > or
> >>     > > > > > all
> >>     > > > > > > > > produce requests might be failing (if the bootstrap
> >> servers
> >>     > are
> >>     > > > > > reachable
> >>     > > > > > > > > from the client but one or more other brokers are
> >> not, for
> >>     > > > > example).
> >>     > > > > > If
> >>     > > > > > > > the
> >>     > > > > > > > > cluster operator is able to identify the specific
> >> requests
> >>     > that
> >>     > > > are
> >>     > > > > > slow
> >>     > > > > > > > or
> >>     > > > > > > > > failing for a client, they will be better able to
> >> diagnose
> >>     > the
> >>     > > > > issue
> >>     > > > > > > > > causing batch timeouts.
> >>     > > > > > > > >
> >>     > > > > > > > > One drawback I can think of is that this will
> >> increase the
> >>     > > > > > cardinality of
> >>     > > > > > > > > the request metrics. But any given client is only
> >> going to
> >>     > use
> >>     > > a
> >>     > > > > > small
> >>     > > > > > > > > subset of the request types, and since we already
> have
> >>     > > partition
> >>     > > > > > labels
> >>     > > > > > > > for
> >>     > > > > > > > > the topic-level metrics, I think request labels will
> >> still
> >>     > make
> >>     > > > up
> >>     > > > > a
> >>     > > > > > > > > relatively small percentage of the set of metrics.
> >>     > > > > > > > >
> >>     > > > > > > > > Thanks,
> >>     > > > > > > > > Bob
> >>     > > > > > > > >
> >>     > > > > > > > > On Mon, Nov 22, 2021 at 2:08 AM Viktor Somogyi-Vass
> <
> >>     > > > > > > > > viktorsomogyi@gmail.com>
> >>     > > > > > > > > wrote:
> >>     > > > > > > > >
> >>     > > > > > > > > > Hi Magnus,
> >>     > > > > > > > > >
> >>     > > > > > > > > > I think this is a very useful addition. We also
> >> have a
> >>     > > similar
> >>     > > > > (but
> >>     > > > > > > > much
> >>     > > > > > > > > > more simplistic) implementation of this. Maybe I
> >> missed it
> >>     > in
> >>     > > > the
> >>     > > > > > KIP
> >>     > > > > > > > but
> >>     > > > > > > > > > what about adding metrics about the subscription
> >> cache
> >>     > > itself?
> >>     > > > > > That I
> >>     > > > > > > > > think
> >>     > > > > > > > > > would improve its usability and debuggability as
> >> we'd be
> >>     > able
> >>     > > > to
> >>     > > > > > see
> >>     > > > > > > > its
> >>     > > > > > > > > > performance, hit/miss rates, eviction counts and
> >> others.
> >>     > > > > > > > > >
> >>     > > > > > > > > > Best,
> >>     > > > > > > > > > Viktor
> >>     > > > > > > > > >
> >>     > > > > > > > > > On Thu, Nov 18, 2021 at 5:12 PM Magnus Edenhill <
> >>     > > > > > magnus@edenhill.se>
> >>     > > > > > > > > > wrote:
> >>     > > > > > > > > >
> >>     > > > > > > > > > > Hi Mickael,
> >>     > > > > > > > > > >
> >>     > > > > > > > > > > see inline.
> >>     > > > > > > > > > >
> >>     > > > > > > > > > > Den ons 10 nov. 2021 kl 15:21 skrev Mickael
> >> Maison <
> >>     > > > > > > > > > > mickael.maison@gmail.com
> >>     > > > > > > > > > > >:
> >>     > > > > > > > > > >
> >>     > > > > > > > > > > > Hi Magnus,
> >>     > > > > > > > > > > >
> >>     > > > > > > > > > > > I see you've addressed some of the points I
> >> raised
> >>     > above
> >>     > > > but
> >>     > > > > > some
> >>     > > > > > > > (4,
> >>     > > > > > > > > > > > 5) have not been addressed yet.
> >>     > > > > > > > > > > >
> >>     > > > > > > > > > >
> >>     > > > > > > > > > > Re 4) How will the user/app know metrics are being sent?
> >>     > > > > > > > > > >
> >>     > > > > > > > > > > One possibility is to add a JMX metric (thus for
> >> user
> >>     > > > > > consumption)
> >>     > > > > > > > for
> >>     > > > > > > > > > the
> >>     > > > > > > > > > > number of metric pushes the
> >>     > > > > > > > > > > client has performed, or perhaps the number of
> >> metrics
> >>     > > > > > subscriptions
> >>     > > > > > > > > > > currently being collected.
> >>     > > > > > > > > > > Would that be sufficient?
> >>     > > > > > > > > > >
> >>     > > > > > > > > > > Re 5) Metric sizes and rates
> >>     > > > > > > > > > >
> >>     > > > > > > > > > > A worst case scenario for a producer that is
> >> producing to
> >>     > > 50
> >>     > > > > > unique
> >>     > > > > > > > > > topics
> >>     > > > > > > > > > > and emitting all standard metrics yields
> >>     > > > > > > > > > > a serialized size of around 100KB prior to
> >> compression,
> >>     > > which
> >>     > > > > > > > > compresses
> >>     > > > > > > > > > > down to about 20-30% of that depending
> >>     > > > > > > > > > > on compression type and topic name uniqueness.
> >>     > > > > > > > > > > The numbers for a consumer would be similar.
> >>     > > > > > > > > > >
> >>     > > > > > > > > > > In practice the number of unique topics would be
> >> far
> >>     > less,
> >>     > > > and
> >>     > > > > > the
> >>     > > > > > > > > > > subscription set would typically be for a subset
> >> of
> >>     > > metrics.
> >>     > > > > > > > > > > So we're probably closer to 1kb, or less,
> >> compressed size
> >>     > > per
> >>     > > > > > client
> >>     > > > > > > > > per
> >>     > > > > > > > > > > push interval.
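[Editor's sketch] The size estimates above can be turned into a quick back-of-the-envelope calculation. The 100 KB worst case and the 20-30% compression ratio come from the message; the client count and the 60 s push interval used in the usage note are assumptions picked only to make the arithmetic concrete.

```python
def compressed_size_kb(raw_kb, ratio):
    """Size after compression, where ratio is compressed/raw (e.g. 0.25)."""
    return raw_kb * ratio


def broker_ingress_kb_per_s(clients, kb_per_push, push_interval_s):
    """Aggregate telemetry ingress if every client pushes once per interval."""
    return clients * kb_per_push / push_interval_s
```

With a ratio of 0.25, the 100 KB worst case shrinks to 25 KB per push. For the more typical case of about 1 KB per push, 6000 clients (assumed) pushing every 60 seconds (assumed) would add up to roughly 100 KB/s across the cluster.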
> >>     > > > > > > > > > >
> >>     > > > > > > > > > > As both the subscription set and push intervals
> >> are
> >>     > > > controlled
> >>     > > > > > by the
> >>     > > > > > > > > > > cluster operator it shouldn't be too hard
> >>     > > > > > > > > > > to strike a good balance between metrics
> overhead
> >> and
> >>     > > > > > granularity.
> >>     > > > > > > > > > >
> >>     > > > > > > > > > >
> >>     > > > > > > > > > >
> >>     > > > > > > > > > > >
> >>     > > > > > > > > > > > I'm really uneasy with this being enabled by
> >> default on
> >>     > > the
> >>     > > > > > client
> >>     > > > > > > > > > > > side. When collecting data, I think the best
> >> practice
> >>     > is
> >>     > > to
> >>     > > > > > ensure
> >>     > > > > > > > > > > > users are explicitly enabling it.
> >>     > > > > > > > > > > >
> >>     > > > > > > > > > >
> >>     > > > > > > > > > > Requiring metrics to be explicitly enabled on
> >> clients
> >>     > > > severely
> >>     > > > > > > > cripples
> >>     > > > > > > > > > its
> >>     > > > > > > > > > > usability and value.
> >>     > > > > > > > > > >
> >>     > > > > > > > > > > One of the problems that this KIP aims to solve is for
> >>     > > > > > > > > > > useful metrics to be available on demand regardless of the
> >>     > > > > > > > > > > technical expertise of the user. As Ryanne points out, a
> >>     > > > > > > > > > > savvy user/organization will typically have metrics
> >>     > > > > > > > > > > collection and monitoring in place already, and the
> >>     > > > > > > > > > > benefits of this KIP are then more of a common set and
> >>     > > > > > > > > > > format of metrics across client implementations and
> >>     > > > > > > > > > > languages.
> >>     > > > > > > > > > > But that is not the typical Kafka user in my
> >> experience,
> >>     > > > > they're
> >>     > > > > > not
> >>     > > > > > > > > > Kafka
> >>     > > > > > > > > > > experts and they don't have the
> >>     > > > > > > > > > > knowledge of how to best instrument their
> clients.
> >>     > > > > > > > > > > Having metrics enabled by default for this user
> >> base
> >>     > allows
> >>     > > > the
> >>     > > > > > Kafka
> >>     > > > > > > > > > > operators to proactively and reactively
> >>     > > > > > > > > > > monitor and troubleshoot client issues, without
> >> the need
> >>     > > for
> >>     > > > > the
> >>     > > > > > less
> >>     > > > > > > > > > savvy
> >>     > > > > > > > > > > user to do anything.
> >>     > > > > > > > > > > It is often too late to tell a user to enable
> >> metrics
> >>     > when
> >>     > > > the
> >>     > > > > > > > problem
> >>     > > > > > > > > > has
> >>     > > > > > > > > > > already occurred.
> >>     > > > > > > > > > >
> >>     > > > > > > > > > > Now, to be clear, even though metrics are enabled by default
> >>     > > > > > > > > > > on clients, they are not enabled by default on the brokers;
> >>     > > > > > > > > > > the Kafka operator needs to build and set up a metrics plugin
> >>     > > > > > > > > > > and add metrics subscriptions before anything is sent from
> >>     > > > > > > > > > > the client.
> >>     > > > > > > > > > > It is opt-out on the clients and opt-in on the broker.
> >>     > > > > > > > > > >
> >>     > > > > > > > > > >
> >>     > > > > > > > > > >
> >>     > > > > > > > > > >
> >>     > > > > > > > > > > > You mentioned brokers already have
> >>     > > > > > > > > > > > some(most?) of the information contained in
> >> metrics, if
> >>     > > so
> >>     > > > > > then why
> >>     > > > > > > > > > > > are we collecting it again? Surely there must
> >> be some
> >>     > new
> >>     > > > > > > > information
> >>     > > > > > > > > > > > in the client metrics.
> >>     > > > > > > > > > > >
> >>     > > > > > > > > > >
> >>     > > > > > > > > > > From the user's perspective the Kafka
> >> infrastructure
> >>     > > extends
> >>     > > > > from
> >>     > > > > > > > > > > producer.send() to
> >>     > > > > > > > > > > messages being returned from consumer.poll(), a
> >> giant
> >>     > black
> >>     > > > box
> >>     > > > > > where
> >>     > > > > > > > > > > there's a lot going on between those
> >>     > > > > > > > > > > two points. The brokers currently only see what happens once
> >>     > > > > > > > > > > those requests and messages hit the broker,
> >>     > > > > > > > > > > but as Kafka clients are complex pieces of
> >> machinery
> >>     > > there's
> >>     > > > a
> >>     > > > > > myriad
> >>     > > > > > > > > of
> >>     > > > > > > > > > > queues, timers, and state
> >>     > > > > > > > > > > that's critical to the operation and
> >> infrastructure
> >>     > that's
> >>     > > > not
> >>     > > > > > > > > currently
> >>     > > > > > > > > > > visible to the operator.
> >>     > > > > > > > > > > Relying on the user to provide this missing information
> >>     > > > > > > > > > > accurately and in a timely manner is not generally feasible.
> >>     > > > > > > > > > >
> >>     > > > > > > > > > >
> >>     > > > > > > > > > > Most of the standard metrics listed in the KIP
> >> are data
> >>     > > > points
> >>     > > > > > that
> >>     > > > > > > > the
> >>     > > > > > > > > > > broker does not have.
> >>     > > > > > > > > > > Only a small number of metrics are duplicates
> >> (like the
> >>     > > > request
> >>     > > > > > > > counts
> >>     > > > > > > > > > and
> >>     > > > > > > > > > > sizes), but they are included
> >>     > > > > > > > > > > to ease correlation when inspecting these client
> >> metrics.
> >>     > > > > > > > > > >
> >>     > > > > > > > > > >
> >>     > > > > > > > > > >
> >>     > > > > > > > > > > > Moreover this is a brand new feature so it's
> >> even
> >>     > harder
> >>     > > to
> >>     > > > > > justify
> >>     > > > > > > > > > > > enabling it and forcing onto all our users. If
> >> disabled
> >>     > > by
> >>     > > > > > default,
> >>     > > > > > > > > > > > it's relatively easy to enable in a new
> release
> >> if we
> >>     > > > decide
> >>     > > > > > to,
> >>     > > > > > > > but
> >>     > > > > > > > > > > > once enabled by default it's much harder to
> >> disable.
> >>     > Also
> >>     > > > > this
> >>     > > > > > > > > feature
> >>     > > > > > > > > > > > will apply to all future metrics we will add.
> >>     > > > > > > > > > > >
> >>     > > > > > > > > > >
> >>     > > > > > > > > > > I think maturity of a feature implementation
> >> should be
> >>     > the
> >>     > > > > > deciding
> >>     > > > > > > > > > factor,
> >>     > > > > > > > > > > rather than
> >>     > > > > > > > > > > the design of it (which this KIP is). I.e., if
> the
> >>     > > > > > implementation is
> >>     > > > > > > > > not
> >>     > > > > > > > > > > deemed mature enough
> >>     > > > > > > > > > > for release X.Y it will be disabled.
> >>     > > > > > > > > > >
> >>     > > > > > > > > > >
> >>     > > > > > > > > > >
> >>     > > > > > > > > > > > Overall I think it's an interesting feature
> but
> >> I'd
> >>     > > prefer
> >>     > > > to
> >>     > > > > > be
> >>     > > > > > > > > > > > slightly defensive and see how it works in
> >> practice
> >>     > > before
> >>     > > > > > enabling
> >>     > > > > > > > > it
> >>     > > > > > > > > > > > everywhere.
> >>     > > > > > > > > > > >
> >>     > > > > > > > > > >
> >>     > > > > > > > > > > Right, and I agree on being defensive, but since
> >> this
> >>     > > feature
> >>     > > > > > still
> >>     > > > > > > > > > > requires manual
> >>     > > > > > > > > > > enabling on the brokers before actually being
> >> used, I
> >>     > think
> >>     > > > > that
> >>     > > > > > > > gives
> >>     > > > > > > > > > > enough control
> >>     > > > > > > > > > > to opt-in or out of this feature as needed.
> >>     > > > > > > > > > >
> >>     > > > > > > > > > > Thanks for your comments!
> >>     > > > > > > > > > >
> >>     > > > > > > > > > > Regards,
> >>     > > > > > > > > > > Magnus
> >>     > > > > > > > > > >
> >>     > > > > > > > > > >
> >>     > > > > > > > > > >
> >>     > > > > > > > > > > > Thanks,
> >>     > > > > > > > > > > > Mickael
> >>     > > > > > > > > > > >
> >>     > > > > > > > > > > > On Mon, Nov 8, 2021 at 7:55 AM Magnus
> Edenhill <
> >>     > > > > > magnus@edenhill.se
> >>     > > > > > > > >
> >>     > > > > > > > > > > wrote:
> >>     > > > > > > > > > > > >
> >>     > > > > > > > > > > > > Thanks David for pointing this out,
> >>     > > > > > > > > > > > > I've updated the KIP to include client_id
> as a
> >>     > matching
> >>     > > > > > selector.
> >>     > > > > > > > > > > > >
> >>     > > > > > > > > > > > > Regards,
> >>     > > > > > > > > > > > > Magnus
> >>     > > > > > > > > > > > >
> >>     > > > > > > > > > > > > Den tors 4 nov. 2021 kl 18:01 skrev David
> Mao
> >>     > > > > > > > > > > <dmao@confluent.io.invalid
> >>     > > > > > > > > > > > >:
> >>     > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > Hey Magnus,
> >>     > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > I noticed that the KIP outlines the
> initial
> >>     > selectors
> >>     > > > > > supported
> >>     > > > > > > > > as:
> >>     > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > >    - client_instance_id -
> >> CLIENT_INSTANCE_ID UUID
> >>     > > > string
> >>     > > > > > > > > > > > representation.
> >>     > > > > > > > > > > > > >    - client_software_name  - client
> software
> >>     > > > > implementation
> >>     > > > > > > > name.
> >>     > > > > > > > > > > > > >    - client_software_version  - client
> >> software
> >>     > > > > > implementation
> >>     > > > > > > > > > > version.
> >>     > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > In the given reactive monitoring workflow,
> >> we
> >>     > mention
> >>     > > > > that
> >>     > > > > > the
> >>     > > > > > > > > > > > application
> >>     > > > > > > > > > > > > > user does not know their client's client
> >> instance
> >>     > ID,
> >>     > > > but
> >>     > > > > > it's
> >>     > > > > > > > > > > outlined
> >>     > > > > > > > > > > > > > that the operator can add a metrics
> >> subscription
> >>     > > > > selecting
> >>     > > > > > for
> >>     > > > > > > > > > > > clientId. I
> >>     > > > > > > > > > > > > > don't see clientId as one of the supported
> >>     > selectors.
> >>     > > > > > > > > > > > > > I can see how this would have made sense
> in
> >> a
> >>     > > previous
> >>     > > > > > > > iteration
> >>     > > > > > > > > > > given
> >>     > > > > > > > > > > > that
> >>     > > > > > > > > > > > > > the previous client instance ID proposal
> >> was to
> >>     > > > construct
> >>     > > > > > the
> >>     > > > > > > > > > client
> >>     > > > > > > > > > > > > > instance ID using clientId as a prefix.
> Now
> >> that
> >>     > the
> >>     > > > > client
> >>     > > > > > > > > > instance
> >>     > > > > > > > > > > > ID is
> >>     > > > > > > > > > > > > > a UUID, would we want to add clientId as a
> >>     > supported
> >>     > > > > > selector?
> >>     > > > > > > > > > > > > > Let me know what you think.
> >>     > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > David
> >>     > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > On Tue, Oct 19, 2021 at 12:33 PM Magnus
> >> Edenhill <
> >>     > > > > > > > > > magnus@edenhill.se
> >>     > > > > > > > > > > >
> >>     > > > > > > > > > > > > > wrote:
> >>     > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > > Hi Mickael!
> >>     > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > > Den tis 19 okt. 2021 kl 19:30 skrev
> >> Mickael
> >>     > Maison
> >>     > > <
> >>     > > > > > > > > > > > > > > mickael.maison@gmail.com
> >>     > > > > > > > > > > > > > > >:
> >>     > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > > > Hi Magnus,
> >>     > > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > > > Thanks for the proposal.
> >>     > > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > > > 1. Looking at the protocol section,
> >> isn't
> >>     > > > > > > > "ClientInstanceId"
> >>     > > > > > > > > > > > expected
> >>     > > > > > > > > > > > > > > > to be a field in
> >>     > > > GetTelemetrySubscriptionsResponseV0?
> >>     > > > > > > > > > Otherwise,
> >>     > > > > > > > > > > > how
> >>     > > > > > > > > > > > > > > > does a client retrieve this value?
> >>     > > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > > Good catch, it got removed by mistake in
> >> one of
> >>     > the
> >>     > > > > > edits.
> >>     > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > > > 2. In the client API section, you
> >> mention a new
> >>     > > > > method
> >>     > > > > > > > > > > > > > > > "clientInstanceId()". Can you clarify
> >> which
> >>     > > > > interfaces
> >>     > > > > > are
> >>     > > > > > > > > > > > affected?
> >>     > > > > > > > > > > > > > > > Is it only Consumer and Producer?
> >>     > > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > > And Admin. Will update the KIP.
> >>     > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > > > 3. I'm a bit concerned this is enabled
> >> by
> >>     > > default.
> >>     > > > > > Even if
> >>     > > > > > > > > the
> >>     > > > > > > > > > > data
> >>     > > > > > > > > > > > > > > > collected is supposed to be not
> >> sensitive, I
> >>     > > think
> >>     > > > > > this can
> >>     > > > > > > > > be
> >>     > > > > > > > > > > > > > > > problematic in some environments. Also
> >> users
> >>     > > don't
> >>     > > > > > seem to
> >>     > > > > > > > > have
> >>     > > > > > > > > > > the
> >>     > > > > > > > > > > > > > > > choice to only expose some metrics.
> >> Knowing how
> >>     > > > much
> >>     > > > > > data
> >>     > > > > > > > > > transits
> >>     > > > > > > > > > > > > > > > through some applications can be
> >> considered
> >>     > > > critical.
> >>     > > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > > The broker already knows how much data
> >> transits
> >>     > > > through
> >>     > > > > > the
> >>     > > > > > > > > > client
> >>     > > > > > > > > > > > > > though,
> >>     > > > > > > > > > > > > > > right?
> >>     > > > > > > > > > > > > > > Care has been taken not to expose
> >> information in
> >>     > > the
> >>     > > > > > standard
> >>     > > > > > > > > > > metrics
> >>     > > > > > > > > > > > > > that
> >>     > > > > > > > > > > > > > > might
> >>     > > > > > > > > > > > > > > reveal sensitive information.
> >>     > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > > Do you have an example of how the
> proposed
> >>     > metrics
> >>     > > > > could
> >>     > > > > > leak
> >>     > > > > > > > > > > > sensitive
> >>     > > > > > > > > > > > > > > information?
> >>     > > > > > > > > > > > > > > As for limiting which metrics to
> >> export; I
> >>     > guess
> >>     > > > > that
> >>     > > > > > > > could
> >>     > > > > > > > > > make
> >>     > > > > > > > > > > > sense
> >>     > > > > > > > > > > > > > > in some
> >>     > > > > > > > > > > > > > > very sensitive use-cases, but those
> users
> >> might
> >>     > > > disable
> >>     > > > > > > > metrics
> >>     > > > > > > > > > > > > > altogether
> >>     > > > > > > > > > > > > > > for now.
> >>     > > > > > > > > > > > > > > Could these concerns be addressed by a
> >> later KIP?
> >>     > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > > > 4. As a user, how do you know if your
> >>     > application
> >>     > > > is
> >>     > > > > > > > actively
> >>     > > > > > > > > > > > sending
> >>     > > > > > > > > > > > > > > > metrics? Are there new metrics
> exposing
> >> what's
> >>     > > > going
> >>     > > > > > on,
> >>     > > > > > > > like
> >>     > > > > > > > > > how
> >>     > > > > > > > > > > > much
> >>     > > > > > > > > > > > > > > > data is being sent?
> >>     > > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > > That's a good question.
> >>     > > > > > > > > > > > > > > Since the proposed metrics interface is
> >> not aimed
> >>     > > at,
> >>     > > > > or
> >>     > > > > > > > > directly
> >>     > > > > > > > > > > > > > available
> >>     > > > > > > > > > > > > > > to, the application
> >>     > > > > > > > > > > > > > > I guess there's little point in adding
> it
> >> here,
> >>     > but
> >>     > > > > > instead
> >>     > > > > > > > > > adding
> >>     > > > > > > > > > > > > > > something to the
> >>     > > > > > > > > > > > > > > existing JMX metrics?
> >>     > > > > > > > > > > > > > > Do you have any suggestions?
> >>     > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > > > 5. If all metrics are enabled on a
> >> regular
> >>     > > Consumer
> >>     > > > > or
> >>     > > > > > > > > > Producer,
> >>     > > > > > > > > > > do
> >>     > > > > > > > > > > > > > > > you have an idea how much throughput
> >> this would
> >>     > > > use?
> >>     > > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > > It depends on the number of
> >> partition/topics/etc
> >>     > > the
> >>     > > > > > client
> >>     > > > > > > > is
> >>     > > > > > > > > > > > producing
> >>     > > > > > > > > > > > > > > to/consuming from.
> >>     > > > > > > > > > > > > > > I'll add some sizes to the KIP for some
> >> typical
> >>     > > > > > use-cases.
> >>     > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > > Thanks,
> >>     > > > > > > > > > > > > > > Magnus
> >>     > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > > > Thanks
> >>     > > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > > > On Tue, Oct 19, 2021 at 5:06 PM Magnus
> >>     > Edenhill <
> >>     > > > > > > > > > > > magnus@edenhill.se>
> >>     > > > > > > > > > > > > > > > wrote:
> >>     > > > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > > > > On Tue, 19 Oct 2021 at 13:22, Tom Bentley <tbentley@redhat.com> wrote:
> >>     > > > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > > > > > Hi Magnus,
> >>     > > > > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > > > > > I reviewed the KIP since you
> called
> >> the
> >>     > vote
> >>     > > > > > (sorry for
> >>     > > > > > > > > not
> >>     > > > > > > > > > > > > > reviewing
> >>     > > > > > > > > > > > > > > > when
> >>     > > > > > > > > > > > > > > > > > you announced your intention to
> >> call the
> >>     > > > vote). I
> >>     > > > > > have
> >>     > > > > > > > a
> >>     > > > > > > > > > few
> >>     > > > > > > > > > > > > > > questions
> >>     > > > > > > > > > > > > > > > on
> >>     > > > > > > > > > > > > > > > > > some of the details.
> >>     > > > > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > > > > > 1. There's no Javadoc on
> >>     > > > > > ClientTelemetryPayload.data(),
> >>     > > > > > > > > so
> >>     > > > > > > > > > I
> >>     > > > > > > > > > > > don't
> >>     > > > > > > > > > > > > > > know
> >>     > > > > > > > > > > > > > > > > > whether the payload is exposed
> >> through this
> >>     > > > > method
> >>     > > > > > as
> >>     > > > > > > > > > > > compressed or
> >>     > > > > > > > > > > > > > > > not.
> >>     > > > > > > > > > > > > > > > > > Later on you say "Decompression of
> >> the
> >>     > > payloads
> >>     > > > > > will be
> >>     > > > > > > > > > > > handled by
> >>     > > > > > > > > > > > > > > the
> >>     > > > > > > > > > > > > > > > > > broker metrics plugin, the broker
> >> should
> >>     > > > expose a
> >>     > > > > > > > > suitable
> >>     > > > > > > > > > > > > > > > decompression
> >>     > > > > > > > > > > > > > > > > > API to the metrics plugin for this
> >>     > purpose.",
> >>     > > > > which
> >>     > > > > > > > > > suggests
> >>     > > > > > > > > > > > it's
> >>     > > > > > > > > > > > > > the
> >>     > > > > > > > > > > > > > > > > > compressed data in the buffer, but
> >> then we
> >>     > > > don't
> >>     > > > > > know
> >>     > > > > > > > > which
> >>     > > > > > > > > > > > codec
> >>     > > > > > > > > > > > > > was
> >>     > > > > > > > > > > > > > > > used,
> >>     > > > > > > > > > > > > > > > > > nor the API via which the plugin
> >> should
> >>     > > > > decompress
> >>     > > > > > it
> >>     > > > > > > > if
> >>     > > > > > > > > > > > required
> >>     > > > > > > > > > > > > > for
> >>     > > > > > > > > > > > > > > > > > forwarding to the ultimate metrics
> >> store.
> >>     > > > Should
> >>     > > > > > the
> >>     > > > > > > > > > > > > > > > ClientTelemetryPayload
> >>     > > > > > > > > > > > > > > > > > expose a method to get the
> >> compression and
> >>     > a
> >>     > > > > > > > > decompressor?
> >>     > > > > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > > > > Good point, updated.
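
To make the compression question above concrete, here is a minimal sketch of a payload type that carries its codec alongside the bytes, so that a broker-side metrics plugin could decompress the data before forwarding it. The names `TelemetryPayloadSketch`, `Codec`, `decompressed()` and `gzip()` are hypothetical and are not the KIP's actual `ClientTelemetryPayload` interface; gzip from the JDK is used only because it is readily available, not because the thread names it as the codec.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical sketch only: not the KIP-714 ClientTelemetryPayload interface.
public class TelemetryPayloadSketch {
    // Assumed codec set for illustration; the real protocol may support others.
    public enum Codec { NONE, GZIP }

    private final Codec codec;
    private final byte[] data;

    public TelemetryPayloadSketch(Codec codec, byte[] data) {
        this.codec = codec;
        this.data = data.clone();
    }

    // Tells the plugin which codec was applied to data().
    public Codec codec() { return codec; }

    // Returns the bytes exactly as received (possibly compressed).
    public byte[] data() { return data.clone(); }

    // Returns the uncompressed bytes, whatever the codec.
    public byte[] decompressed() {
        if (codec == Codec.NONE) {
            return data.clone();
        }
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(data))) {
            return in.readAllBytes();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Helper used to build a compressed payload for the example.
    public static byte[] gzip(byte[] raw) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }
}
```

With a shape like this, a plugin that only forwards opaque bytes can call `data()` and pass `codec()` along, while a plugin that needs to inspect the metrics calls `decompressed()`.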
> >>     > > > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > > > > > 2. The client-side API is
> expressed
> >> as
> >>     > > > > > StringOrError
> >>     > > > > > > > > > > > > > > > > >
> ClientInstance::ClientInstanceId(int
> >>     > > > > timeout_ms). I
> >>     > > > > > > > > > > understand
> >>     > > > > > > > > > > > that
> >>     > > > > > > > > > > > > > > > you're
> >>     > > > > > > > > > > > > > > > > > thinking about the librdkafka
> >>     > implementation,
> >>     > > > but
> >>     > > > > > it
> >>     > > > > > > > > would
> >>     > > > > > > > > > be
> >>     > > > > > > > > > > > good
> >>     > > > > > > > > > > > > > to
> >>     > > > > > > > > > > > > > > > show
> >>     > > > > > > > > > > > > > > > > > the API as it would appear on the
> >> Apache
> >>     > > Kafka
> >>     > > > > > clients.
> >>     > > > > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > > > > This was meant as pseudo-code, but I
> >> changed
> >>     > it
> >>     > > > to
> >>     > > > > > Java.
> >>     > > > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > > > > > 3. "PushTelemetryRequest|Response
> -
> >>     > protocol
> >>     > > > > > request
> >>     > > > > > > > used
> >>     > > > > > > > > > by
> >>     > > > > > > > > > > > the
> >>     > > > > > > > > > > > > > > > client to
> >>     > > > > > > > > > > > > > > > > > send metrics to any broker it is
> >> connected
> >>     > > to."
> >>     > > > > To
> >>     > > > > > be
> >>     > > > > > > > > > clear,
> >>     > > > > > > > > > > > this
> >>     > > > > > > > > > > > > > > means
> >>     > > > > > > > > > > > > > > > > > that the client can choose any of
> >> the
> >>     > > connected
> >>     > > > > > brokers
> >>     > > > > > > > > and
> >>     > > > > > > > > > > > push to
> >>     > > > > > > > > > > > > > > > just
> >>     > > > > > > > > > > > > > > > > > one of them? What should a
> >> supporting
> >>     > client
> >>     > > do
> >>     > > > > if
> >>     > > > > > it
> >>     > > > > > > > > gets
> >>     > > > > > > > > > an
> >>     > > > > > > > > > > > error
> >>     > > > > > > > > > > > > > > > when
> >>     > > > > > > > > > > > > > > > > > pushing metrics to a broker, retry
> >> sending
> >>     > to
> >>     > > > the
> >>     > > > > > same
> >>     > > > > > > > > > broker
> >>     > > > > > > > > > > > or
> >>     > > > > > > > > > > > > > try
> >>     > > > > > > > > > > > > > > > > > pushing to another broker, or drop
> >> the
> >>     > > metrics?
> >>     > > > > > Should
> >>     > > > > > > > > > > > supporting
> >>     > > > > > > > > > > > > > > > clients
> >>     > > > > > > > > > > > > > > > > > send successive requests to a
> single
> >>     > broker,
> >>     > > or
> >>     > > > > > round
> >>     > > > > > > > > > robin,
> >>     > > > > > > > > > > > or is
> >>     > > > > > > > > > > > > > > > that up
> >>     > > > > > > > > > > > > > > > > > to the client author? I'm guessing
> >> the
> >>     > > > behaviour
> >>     > > > > > should
> >>     > > > > > > > > be
> >>     > > > > > > > > > > > sticky
> >>     > > > > > > > > > > > > > to
> >>     > > > > > > > > > > > > > > > > > support the rate limiting
> features,
> >> but I
> >>     > > think
> >>     > > > > it
> >>     > > > > > > > would
> >>     > > > > > > > > be
> >>     > > > > > > > > > > > good
> >>     > > > > > > > > > > > > > for
> >>     > > > > > > > > > > > > > > > client
> >>     > > > > > > > > > > > > > > > > > authors if this section were
> >> explicit on
> >>     > the
> >>     > > > > > > > recommended
> >>     > > > > > > > > > > > behaviour.
> >>     > > > > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > > > > You are right, I've updated the KIP
> >> to make
> >>     > > this
> >>     > > > > > clearer.
> >>     > > > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > > > > > 4. "Mapping the client instance id
> >> to an
> >>     > > actual
> >>     > > > > > > > > application
> >>     > > > > > > > > > > > > > instance
> >>     > > > > > > > > > > > > > > > > > running on a (virtual) machine can
> >> be done
> >>     > by
> >>     > > > > > > > inspecting
> >>     > > > > > > > > > the
> >>     > > > > > > > > > > > > > metrics
> >>     > > > > > > > > > > > > > > > > > resource labels, such as the
> client
> >> source
> >>     > > > > address
> >>     > > > > > and
> >>     > > > > > > > > > source
> >>     > > > > > > > > > > > port,
> >>     > > > > > > > > > > > > > > or
> >>     > > > > > > > > > > > > > > > > > security principal, all of which
> >> are added
> >>     > by
> >>     > > > the
> >>     > > > > > > > > receiving
> >>     > > > > > > > > > > > broker.
> >>     > > > > > > > > > > > > > > > This
> >>     > > > > > > > > > > > > > > > > > will allow the operator together
> >> with the
> >>     > > user
> >>     > > > to
> >>     > > > > > > > > identify
> >>     > > > > > > > > > > the
> >>     > > > > > > > > > > > > > actual
> >>     > > > > > > > > > > > > > > > > > application instance." Is this
> >> really
> >>     > always
> >>     > > > > true?
> >>     > > > > > The
> >>     > > > > > > > > > source
> >>     > > > > > > > > > > > IP
> >>     > > > > > > > > > > > > > and
> >>     > > > > > > > > > > > > > > > port
> >>     > > > > > > > > > > > > > > > > > might be a loadbalancer/proxy in
> >> some
> >>     > setups.
> >>     > > > The
> >>     > > > > > > > > > principal,
> >>     > > > > > > > > > > as
> >>     > > > > > > > > > > > > > > already
> >>     > > > > > > > > > > > > > > > > > mentioned in the KIP, might be
> >> shared
> >>     > between
> >>     > > > > > multiple
> >>     > > > > > > > > > > > > > applications.
> >>     > > > > > > > > > > > > > > > So at
> >>     > > > > > > > > > > > > > > > > > worst the organization running the
> >> clients
> >>     > > > might
> >>     > > > > > have
> >>     > > > > > > > to
> >>     > > > > > > > > > > > consult
> >>     > > > > > > > > > > > > > the
> >>     > > > > > > > > > > > > > > > logs
> >>     > > > > > > > > > > > > > > > > > of a set of client applications,
> >> right?
> >>     > > > > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > > > > Yes, that's correct. There's no
> >> guaranteed
> >>     > > > mapping
> >>     > > > > > from
> >>     > > > > > > > > > > > > > > > client_instance_id
> >>     > > > > > > > > > > > > > > > > to
> >>     > > > > > > > > > > > > > > > > an actual instance, that's why the
> KIP
> >>     > > recommends
> >>     > > > > > client
> >>     > > > > > > > > > > > > > > implementations
> >>     > > > > > > > > > > > > > > > to
> >>     > > > > > > > > > > > > > > > > log the client instance id
> >>     > > > > > > > > > > > > > > > > upon retrieval, and also provide an
> >> API for
> >>     > the
> >>     > > > > > > > application
> >>     > > > > > > > > > to
> >>     > > > > > > > > > > > > > retrieve
> >>     > > > > > > > > > > > > > > > the
> >>     > > > > > > > > > > > > > > > > instance id programmatically
> >>     > > > > > > > > > > > > > > > > if it has a better way of exposing
> it.
> >>     > > > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > > > > > 5. "Tests indicate that a
> compression
> >> ratio
> >>     > up
> >>     > > to
> >>     > > > > > 10x is
> >>     > > > > > > > > > > > possible for
> >>     > > > > > > > > > > > > > > the
> >>     > > > > > > > > > > > > > > > > > standard metrics." Client authors
> >> might
> >>     > > > > appreciate
> >>     > > > > > your
> >>     > > > > > > > > > > > mentioning
> >>     > > > > > > > > > > > > > > > which
> >>     > > > > > > > > > > > > > > > > > compression codec got these
> results.
> >>     > > > > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > > > > Good point. Updated.
> >>     > > > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > > > > > 6. "Should the client send a push
> >> request
> >>     > > prior
> >>     > > > > to
> >>     > > > > > > > expiry
> >>     > > > > > > > > > of
> >>     > > > > > > > > > > > the
> >>     > > > > > > > > > > > > > > > previously
> >>     > > > > > > > > > > > > > > > > > calculated PushIntervalMs the
> >> broker will
> >>     > > > discard
> >>     > > > > > the
> >>     > > > > > > > > > metrics
> >>     > > > > > > > > > > > and
> >>     > > > > > > > > > > > > > > > return a
> >>     > > > > > > > > > > > > > > > > > PushTelemetryResponse with the
> >> ErrorCode
> >>     > set
> >>     > > to
> >>     > > > > > > > > > RateLimited."
> >>     > > > > > > > > > > > Is
> >>     > > > > > > > > > > > > > this
> >>     > > > > > > > > > > > > > > > > > RATE_LIMITED a new error code?
> It's
> >> not
> >>     > > > mentioned
> >>     > > > > > in
> >>     > > > > > > > the
> >>     > > > > > > > > > "New
> >>     > > > > > > > > > > > Error
> >>     > > > > > > > > > > > > > > > Codes"
> >>     > > > > > > > > > > > > > > > > > section.
> >>     > > > > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > > > > That's a leftover, it should be
> using
> >> the
> >>     > > > standard
> >>     > > > > > > > > > ThrottleTime
> >>     > > > > > > > > > > > > > > > mechanism.
> >>     > > > > > > > > > > > > > > > > Fixed.
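
The exchange above says that a push arriving before the previously calculated PushIntervalMs has expired is discarded, and that the broker signals this through the standard throttle-time mechanism rather than a new error code. A minimal sketch of that bookkeeping follows. The class `PushThrottleSketch`, the method `throttleTimeMs()`, and the choice to report the remaining interval as the throttle time are assumptions for illustration, not the broker's actual implementation.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

// Hypothetical sketch only: not the Kafka broker's telemetry throttling code.
public class PushThrottleSketch {
    private final long pushIntervalMs;
    // Time of the last accepted push, keyed by client instance id.
    private final Map<UUID, Long> lastAcceptedMs = new HashMap<>();

    public PushThrottleSketch(long pushIntervalMs) {
        this.pushIntervalMs = pushIntervalMs;
    }

    // Returns 0 if the push is accepted, otherwise the number of
    // milliseconds the client should wait; an early push is discarded.
    public long throttleTimeMs(UUID clientInstanceId, long nowMs) {
        Long last = lastAcceptedMs.get(clientInstanceId);
        if (last != null && nowMs - last < pushIntervalMs) {
            // Assumption: report the time remaining in the current interval.
            return pushIntervalMs - (nowMs - last);
        }
        lastAcceptedMs.put(clientInstanceId, nowMs);
        return 0L;
    }
}
```

The clock value is passed in explicitly so that the behaviour can be checked without sleeping.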
> >>     > > > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > > > > > 7. In the section "Standard client
> >> resource
> >>     > > > > labels"
> >>     > > > > > > > > > > > application_id
> >>     > > > > > > > > > > > > > is
> >>     > > > > > > > > > > > > > > > > > described as Kafka Streams only,
> >> but the
> >>     > > > section
> >>     > > > > of
> >>     > > > > > > > > "Client
> >>     > > > > > > > > > > > > > > > Identification"
> >>     > > > > > > > > > > > > > > > > > talks about "application instance
> >> id as an
> >>     > > > > optional
> >>     > > > > > > > > future
> >>     > > > > > > > > > > > > > > nice-to-have
> >>     > > > > > > > > > > > > > > > > > that may be included as a metrics
> >> label if
> >>     > it
> >>     > > > has
> >>     > > > > > been
> >>     > > > > > > > > set
> >>     > > > > > > > > > by
> >>     > > > > > > > > > > > the
> >>     > > > > > > > > > > > > > > > user", so
> >>     > > > > > > > > > > > > > > > > > I'm confused whether non-Kafka
> >> Streams
> >>     > > clients
> >>     > > > > > should
> >>     > > > > > > > set
> >>     > > > > > > > > > an
> >>     > > > > > > > > > > > > > > > application_id
> >>     > > > > > > > > > > > > > > > > > or not.
> >>     > > > > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > > > > I'll clarify this in the KIP, but
> >> basically
> >>     > we
> >>     > > > > would
> >>     > > > > > need
> >>     > > > > > > > > to
> >>     > > > > > > > > > > add
> >>     > > > > > > > > > > > an
> >>     > > > > > > > > > > > > > > > > `application.id` config
> >>     > > > > > > > > > > > > > > > > property for non-streams clients for
> >> this
> >>     > > > purpose,
> >>     > > > > > and
> >>     > > > > > > > > that's
> >>     > > > > > > > > > > > outside
> >>     > > > > > > > > > > > > > > the
> >>     > > > > > > > > > > > > > > > > scope of this KIP since we want to
> >> make it
> >>     > > > > > zero-conf:ish
> >>     > > > > > > > on
> >>     > > > > > > > > > the
> >>     > > > > > > > > > > > > > client
> >>     > > > > > > > > > > > > > > > side.
> >>     > > > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > > > > > Kind regards,
> >>     > > > > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > > > > > Tom
> >>     > > > > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > > > > Thanks for the review,
> >>     > > > > > > > > > > > > > > > > Magnus
> >>     > > > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > > > > > On Thu, Oct 7, 2021 at 5:26 PM
> >> Magnus
> >>     > > Edenhill
> >>     > > > <
> >>     > > > > > > > > > > > magnus@edenhill.se
> >>     > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > > > wrote:
> >>     > > > > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > > > > > > Hi all,
> >>     > > > > > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > > > > > > I've updated the KIP following
> >> our recent
> >>     > > > > > discussions
> >>     > > > > > > > > on
> >>     > > > > > > > > > > the
> >>     > > > > > > > > > > > > > > mailing
> >>     > > > > > > > > > > > > > > > > > list:
> >>     > > > > > > > > > > > > > > > > > >  - split the protocol in two,
> one
> >> for
> >>     > > getting
> >>     > > > > the
> >>     > > > > > > > > metrics
> >>     > > > > > > > > > > > > > > > subscriptions,
> >>     > > > > > > > > > > > > > > > > > > and one for pushing the metrics.
> >>     > > > > > > > > > > > > > > > > > >  - simplifications: initially
> >> only one
> >>     > > > > supported
> >>     > > > > > > > > metrics
> >>     > > > > > > > > > > > format,
> >>     > > > > > > > > > > > > > no
> >>     > > > > > > > > > > > > > > > > > > client.id in the instance id,
> >> etc.
> >>     > > > > > > > > > > > > > > > > > >  - made CLIENT_METRICS
> >> subscription
> >>     > > > > configuration
> >>     > > > > > > > > entries
> >>     > > > > > > > > > > > more
> >>     > > > > > > > > > > > > > > > structured
> >>     > > > > > > > > > > > > > > > > > >    and allowing better client
> >> matching
> >>     > > > > selectors
> >>     > > > > > (not
> >>     > > > > > > > > > only
> >>     > > > > > > > > > > > on the
> >>     > > > > > > > > > > > > > > > > > instance
> >>     > > > > > > > > > > > > > > > > > > id, but also the other
> >>     > > > > > > > > > > > > > > > > > >    client resource labels, such
> as
> >>     > > > > > > > > client_software_name,
> >>     > > > > > > > > > > > etc.).
> >>     > > > > > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > > > > > > Unless there are further
> comments
> >> I'll
> >>     > call
> >>     > > > the
> >>     > > > > > vote
> >>     > > > > > > > > in a
> >>     > > > > > > > > > > > day or
> >>     > > > > > > > > > > > > > > two.
> >>     > > > > > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > > > > > > Regards,
> >>     > > > > > > > > > > > > > > > > > > Magnus
> >>     > > > > > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > > > > > > On Mon, 4 Oct 2021 at 20:57, Magnus Edenhill <magnus@edenhill.se> wrote:
> >>     > > > > > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > > > > > > > Hi Gwen,
> >>     > > > > > > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > > > > > > > I'm finishing up the KIP based
> >> on the
> >>     > > last
> >>     > > > > > couple
> >>     > > > > > > > of
> >>     > > > > > > > > > > > discussion
> >>     > > > > > > > > > > > > > > > points
> >>     > > > > > > > > > > > > > > > > > in
> >>     > > > > > > > > > > > > > > > > > > > this thread
> >>     > > > > > > > > > > > > > > > > > > > and will call the Vote later
> >> this week.
> >>     > > > > > > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > > > > > > > Best,
> >>     > > > > > > > > > > > > > > > > > > > Magnus
> >>     > > > > > > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > > > > > > > On Sat, 2 Oct 2021 at 02:01, Gwen Shapira <gwen@confluent.io.invalid> wrote:
> >>     > > > > > > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > > > > > > >> Hey,
> >>     > > > > > > > > > > > > > > > > > > >>
> >>     > > > > > > > > > > > > > > > > > > >> I noticed that there was no
> >> discussion
> >>     > > for
> >>     > > > > the
> >>     > > > > > > > last
> >>     > > > > > > > > 10
> >>     > > > > > > > > > > > days,
> >>     > > > > > > > > > > > > > > but I
> >>     > > > > > > > > > > > > > > > > > > >> couldn't
> >>     > > > > > > > > > > > > > > > > > > >> find the vote thread. Is
> there
> >> one
> >>     > that
> >>     > > > I'm
> >>     > > > > > > > missing?
> >>     > > > > > > > > > > > > > > > > > > >>
> >>     > > > > > > > > > > > > > > > > > > >> Gwen
> >>     > > > > > > > > > > > > > > > > > > >>
> >>     > > > > > > > > > > > > > > > > > > >> On Wed, Sep 22, 2021 at 4:58
> >> AM Magnus
> >>     > > > > > Edenhill <
> >>     > > > > > > > > > > > > > > > magnus@edenhill.se>
> >>     > > > > > > > > > > > > > > > > > > >> wrote:
> >>     > > > > > > > > > > > > > > > > > > >>
> >>     > > > > > > > > > > > > > > > > > > >> > On Tue, 21 Sep 2021 at 06:58, Colin McCabe <cmccabe@apache.org> wrote:
> >>     > > > > > > > > > > > > > > > > > > >> >
> >>     > > > > > > > > > > > > > > > > > > >> > > On Mon, Sep 20, 2021, at
> >> 17:35,
> >>     > Feng
> >>     > > > Min
> >>     > > > > > > > wrote:
> >>     > > > > > > > > > > > > > > > > > > >> > > > Thanks Magnus & Colin
> >> for the
> >>     > > > > > discussion.
> >>     > > > > > > > > > > > > > > > > > > >> > > >
> >>     > > > > > > > > > > > > > > > > > > >> > > > Based on KIP-714's
> >> stateless
> >>     > > design,
> >>     > > > > > Client
> >>     > > > > > > > > can
> >>     > > > > > > > > > > > pretty
> >>     > > > > > > > > > > > > > > much
> >>     > > > > > > > > > > > > > > > use
> >>     > > > > > > > > > > > > > > > > > > any
> >>     > > > > > > > > > > > > > > > > > > >> > > > connection to any
> broker
> >> to send
> >>     > > > > > metrics. We
> >>     > > > > > > > > are
> >>     > > > > > > > > > > not
> >>     > > > > > > > > > > > > > > > associating
> >>     > > > > > > > > > > > > > > > > > > >> > > connection
> >>     > > > > > > > > > > > > > > > > > > >> > > > with client metric
> >> state. Is my
> >>     > > > > > > > understanding
> >>     > > > > > > > > > > > correct?
> >>     > > > > > > > > > > > > > If
> >>     > > > > > > > > > > > > > > > yes,
> >>     > > > > > > > > > > > > > > > > > > how
> >>     > > > > > > > > > > > > > > > > > > >> > about
> >>     > > > > > > > > > > > > > > > > > > >> > > > the following two
> >> scenarios
> >>     > > > > > > > > > > > > > > > > > > >> > > >
> >>     > > > > > > > > > > > > > > > > > > >> > > > 1) One Client
> (Client-ID)
> >>     > > registers
> >>     > > > > two
> >>     > > > > > > > > > different
> >>     > > > > > > > > > > > client
> >>     > > > > > > > > > > > > > > > > > instance
> >>     > > > > > > > > > > > > > > > > > > ids
> >>     > > > > > > > > > > > > > > > > > > >> > via
> >>     > > > > > > > > > > > > > > > > > > >> > > > separate registration.
> >> Is it
> >>     > > > > permitted?
> >>     > > > > > If
> >>     > > > > > > > OK,
> >>     > > > > > > > > > how
> >>     > > > > > > > > > > > to
> >>     > > > > > > > > > > > > > > > > > distinguish
> >>     > > > > > > > > > > > > > > > > > > >> them
> >>     > > > > > > > > > > > > > > > > > > >> > > from
> >>     > > > > > > > > > > > > > > > > > > >> > > > the case 2 below.
> >>     > > > > > > > > > > > > > > > > > > >> > > >
> >>     > > > > > > > > > > > > > > > > > > >> > >
> >>     > > > > > > > > > > > > > > > > > > >> > > Hi Feng,
> >>     > > > > > > > > > > > > > > > > > > >> > >
> >>     > > > > > > > > > > > > > > > > > > >> > > My understanding, which
> >> Magnus can
> >>     > > > > > clarify I
> >>     > > > > > > > > > guess,
> >>     > > > > > > > > > > is
> >>     > > > > > > > > > > > > > that
> >>     > > > > > > > > > > > > > > > you
> >>     > > > > > > > > > > > > > > > > > > could
> >>     > > > > > > > > > > > > > > > > > > >> > have
> >>     > > > > > > > > > > > > > > > > > > >> > > something like two
> Producer
> >>     > > instances
> >>     > > > > > running
> >>     > > > > > > > > with
> >>     > > > > > > > > > > the
> >>     > > > > > > > > > > > > > same
> >>     > > > > > > > > > > > > > > > > > > client.id
> >>     > > > > > > > > > > > > > > > > > > >> > > (perhaps because they're
> >> using the
> >>     > > > same
> >>     > > > > > config
> >>     > > > > > > > > > file,
> >>     > > > > > > > > > > > for
> >>     > > > > > > > > > > > > > > > example).
> >>     > > > > > > > > > > > > > > > > > > >> They
> >>     > > > > > > > > > > > > > > > > > > >> > > could even be in the same
> >> process.
> >>     > > But
> >>     > > > > > they
> >>     > > > > > > > > would
> >>     > > > > > > > > > > get
> >>     > > > > > > > > > > > > > > separate
> >>     > > > > > > > > > > > > > > > > > > UUIDs.
> >>     > > > > > > > > > > > > > > > > > > >> > >
> >>     > > > > > > > > > > > > > > > > > > >> > > I believe Magnus used the
> >> term
> >>     > > client
> >>     > > > to
> >>     > > > > > mean
> >>     > > > > > > > > > > > "Producer or
> >>     > > > > > > > > > > > > > > > > > > Consumer".
> >>     > > > > > > > > > > > > > > > > > > >> So
> >>     > > > > > > > > > > > > > > > > > > >> > > if you have both a
> >> Producer and a
> >>     > > > > > Consumer in
> >>     > > > > > > > > your
> >>     > > > > > > > > > > > > > > > application I
> >>     > > > > > > > > > > > > > > > > > > would
> >>     > > > > > > > > > > > > > > > > > > >> > > expect you'd get separate
> >> UUIDs
> >>     > for
> >>     > > > > both.
> >>     > > > > > > > Again
> >>     > > > > > > > > > > > Magnus can
> >>     > > > > > > > > > > > > > > > chime
> >>     > > > > > > > > > > > > > > > > > in
> >>     > > > > > > > > > > > > > > > > > > >> > here, I
> >>     > > > > > > > > > > > > > > > > > > >> > > guess.
> >>     > > > > > > > > > > > > > > > > > > >> > >
> >>     > > > > > > > > > > > > > > > > > > >> >
> >>     > > > > > > > > > > > > > > > > > > >> > That's correct.
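
The point confirmed above, that two client instances sharing a client.id still get separate UUIDs, can be sketched as follows. The class `ClientInstanceSketch` is hypothetical, and the UUID is generated locally purely for illustration; how the id is actually assigned and retrieved is defined by the KIP's protocol, not by this sketch.

```java
import java.util.UUID;

// Hypothetical sketch only: illustrates per-instance identity, not the
// KIP-714 mechanism by which a client obtains its instance id.
public class ClientInstanceSketch {
    // Configured by the user; may be shared by many instances.
    private final String clientId;
    // Unique per instance; a newly created instance gets a new value.
    private final UUID clientInstanceId = UUID.randomUUID();

    public ClientInstanceSketch(String clientId) {
        this.clientId = clientId;
    }

    public String clientId() { return clientId; }

    public UUID clientInstanceId() { return clientInstanceId; }
}
```

Creating two such objects with the same client.id, even in the same process, yields two different instance ids, and nothing is persisted across a restart, which matches the expectation stated in the thread.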
> >>     > > > > > > > > > > > > > > > > > > >> >
> >>     > > > > > > > > > > > > > > > > > > >> >
> >>     > > > > > > > > > > > > > > > > > > >> > >
> >>     > > > > > > > > > > > > > > > > > > >> > > > 2) How about the client
> >>     > > restarting?
> >>     > > > > > What's
> >>     > > > > > > > the
> >>     > > > > > > > > > > > > > > expectation?
> >>     > > > > > > > > > > > > > > > > > Should
> >>     > > > > > > > > > > > > > > > > > > >> the
> >>     > > > > > > > > > > > > > > > > > > >> > > > server expect the
> client
> >> to
> >>     > carry
> >>     > > a
> >>     > > > > > > > persisted
> >>     > > > > > > > > > > client
> >>     > > > > > > > > > > > > > > > instance id
> >>     > > > > > > > > > > > > > > > > > > or
> >>     > > > > > > > > > > > > > > > > > > >> > > should
> >>     > > > > > > > > > > > > > > > > > > >> > > > the client be treated
> as
> >> a new
> >>     > > > > instance?
> >>     > > > > > > > > > > > > > > > > > > >> > >
> >>     > > > > > > > > > > > > > > > > > > >> > > The KIP doesn't describe
> >> any
> >>     > > mechanism
> >>     > > > > for
> >>     > > > > > > > > > > > persistence,
> >>     > > > > > > > > > > > > > so I
> >>     > > > > > > > > > > > > > > > would
> >>     > > > > > > > > > > > > > > > > > > >> assume
> >>     > > > > > > > > > > > > > > > > > > >> > > that when you restart the
> >> client
> >>     > you
> >>     > > > get
> >>     > > > > > a new
> >>     > > > > > > > > > > UUID. I
> >>     > > > > > > > > > > > > > agree
> >>     > > > > > > > > > > > > > > > that
> >>     > > > > > > > > > > > > > > > > > it
> >>     > > > > > > > > > > > > > > > > > > >> > would
> >>     > > > > > > > > > > > > > > > > > > >> > > be good to spell this
> out.
> >>     > > > > > > > > > > > > > > > > > > >> > >
> >>     > > > > > > > > > > > > > > > > > > >> > >
> >>     > > > > > > > > > > > > > > > > > > >> > Right, it will not be
> >> persisted
> >>     > since
> >>     > > a
> >>     > > > > > client
> >>     > > > > > > > > > > instance
> >>     > > > > > > > > > > > > > can't
> >>     > > > > > > > > > > > > > > be
> >>     > > > > > > > > > > > > > > > > > > >> restarted.
> >>     > > > > > > > > > > > > > > > > > > >> >
> >>     > > > > > > > > > > > > > > > > > > >> > Will update the KIP to make
> >> this
> >>     > > > clearer.
> >>     > > > > > > > > > > > > > > > > > > >> >
> >>     > > > > > > > > > > > > > > > > > > >> > /Magnus
> >>     > > > > > > > > > > > > > > > > > > >> >
> >>     > > > > > > > > > > > > > > > > > > >>
> >>     > > > > > > > > > > > > > > > > > > >>
> >>     > > > > > > > > > > > > > > > > > > >> --
> >>     > > > > > > > > > > > > > > > > > > >> Gwen Shapira
> >>     > > > > > > > > > > > > > > > > > > >>

Re: [DISCUSS] KIP-714: Client metrics and observability

Posted by Jun Rao <ju...@confluent.io.INVALID>.

Hi, Kirk, Sarat,

A few more comments.

40. GetTelemetrySubscriptionsResponseV0: RequestedMetrics Array[string]
uses "Array[0] empty string" to represent a subscription to all metrics. We
had a similar issue with the topics field in MetadataRequest and adopted the
following convention:
In version 1 and higher, an empty array indicates "request metadata for no
topics," and a null array is used to indicate "request metadata for all
topics."
Should we use the same convention in GetTelemetrySubscriptionsResponseV0?

41. We include CompressionType in PushTelemetryRequestV0, but not in
ClientTelemetryPayload. How would the implementer know the compression type
for the telemetry payload?

42. For blocking the metrics for certain clients in the following example,
could you describe the corresponding config value used through the
kafka-configs command?
# A descriptive name makes it easier to clean up old subscriptions.
# --match selects this specific client instance.
kafka-client-metrics.sh --bootstrap-server $BROKERS \
   --add \
   --name 'Disabe_b69cc35a' \
   --match client_instance_id=b69cc35a-7a54-4790-aa69-cc2bd4ee4538 \
   --block
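
To make the convention asked about in 40 concrete, here is a minimal sketch
of how a client might interpret RequestedMetrics if the MetadataRequest
convention were adopted (null means all metrics, an empty array means none).
The function name is hypothetical, and prefix matching of metric names is an
assumption made for illustration, not something this thread confirms.

```python
from typing import List, Optional


def is_subscribed(metric_name: str, requested: Optional[List[str]]) -> bool:
    """Hypothetical helper, not part of any Kafka client API.

    Assumed convention (borrowed from MetadataRequest v1+):
      - requested is None  -> subscribe to all metrics
      - requested == []    -> subscribe to no metrics
      - otherwise          -> subscribe to metrics whose name starts with
                              one of the listed prefixes (an assumption)
    """
    if requested is None:
        return True
    # any() over an empty list is False, so [] naturally means "no metrics".
    return any(metric_name.startswith(prefix) for prefix in requested)


if __name__ == "__main__":
    print(is_subscribed("client.request.rtt", None))
    print(is_subscribed("client.request.rtt", []))
    print(is_subscribed("client.request.rtt", ["client.request."]))
```

With this convention there is no need for a sentinel such as a single empty
string to stand for "everything"; the null/empty distinction carries it.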

Thanks,

Jun


On Thu, Mar 10, 2022 at 11:57 AM Jun Rao <ju...@confluent.io> wrote:

> Hi, Kirk, Sarat,
>
> Thanks for the reply.
>
> 28. On the broker, we typically use Yammer metrics. Only for metrics that
> depend on Kafka metric features (e.g., quota), we use the Kafka metric.
> Yammer metrics have 4 types: gauge, meter, histogram and timer. meter
> calculates a rate, but also exposes an accumulated value.
>
> 29. The Histogram class in org.apache.kafka.common.metrics.stats was never
> used in the client metrics. The implementation of Histogram only provides a
> fixed number of values in the domain and may not capture the quantiles very
> accurately. So, we punted on using it.
>
> Thanks,
>
> Jun
>
>
>
> On Thu, Mar 10, 2022 at 10:59 AM Sarat Kakarla
> <sk...@confluent.io.invalid> wrote:
>
>> Jun,
>>
>>   >>  28. For the broker metrics, could you spell out the full metric name
>>   >>   including groups, tags, etc? We typically don't add the broker_id
>> label for
>>   >>   broker metrics. Also, brokers use Yammer metrics, which doesn't
>> have type
>>   >>   Sum.
>>
>> Sure, I will update KIP-714 with the above information and remove the
>> broker_id label from the metrics.
>>
>> Regarding the type: is CumulativeSum the right type to use in place of
>> Sum?
>>
>> Thanks
>> Sarat
>>
>>
>> On 3/8/22, 5:48 PM, "Jun Rao" <ju...@confluent.io.INVALID> wrote:
>>
>>     Hi, Magnus, Sarat and Xavier,
>>
>>     Thanks for the reply. A few more comments below.
>>
>>     20. It seems that we are piggybacking the plugin on the
>>     existing MetricsReporter. So, this seems fine.
>>
>>     21. That could work. Are we requiring any additional jar dependency
>> on the
>>     client? Or, are you suggesting that we check the runtime dependency
>> to pick
>>     the compression codec?
>>
>>     28. For the broker metrics, could you spell out the full metric name
>>     including groups, tags, etc? We typically don't add the broker_id
>> label for
>>     broker metrics. Also, brokers use Yammer metrics, which doesn't have
>> type
>>     Sum.
>>
>>     29. There are several client metrics listed as histogram. However,
>> the java
>>     client currently doesn't support histogram type.
>>
>>     30. Could you show an example of the metric payload in
>> PushTelemetryRequest
>>     to help understand how we organize metrics at different levels (per
>>     instance, per topic, per partition, per broker, etc)?
>>
>>     31. Could you add a bit more detail on which client thread sends the
>>     PushTelemetryRequest?
>>
>>     Thanks,
>>
>>     Jun
>>
>>     On Mon, Mar 7, 2022 at 11:48 AM Magnus Edenhill <ma...@edenhill.se>
>> wrote:
>>
>>     > Hi Jun,
>>     >
>>     > Thanks for your well-informed questions; see my answers below.
>>     > There have been a number of clarifications to the KIP.
>>     >
>>     >
>>     >
>>     > On Thu, 27 Jan 2022 at 20:08, Jun Rao
>> <ju...@confluent.io.invalid> wrote:
>>     >
>>     > > Hi, Magnus,
>>     > >
>>     > > Thanks for updating the KIP. The overall approach makes sense to
>> me. A
>>     > few
>>     > > more detailed comments below.
>>     > >
>>     > > 20. ClientTelemetry: Should it be extending configurable and
>> closable?
>>     > >
>>     >
>>     > I'll pass this question to Sarat and/or Xavier.
>>     >
>>     >
>>     >
>>     > > 21. Compression of the metrics on the client: what's the default?
>>     > >
>>     >
>>     > How about we specify a prioritized list: zstd, lz4, snappy, gzip?
>>     > But ultimately it is up to what the client supports.
>>     >
>>     >
>>     > 23. A client instance is considered a metric resource and the
>>     > > resource-level (thus client instance level) labels could include:
>>     > >     client_software_name=confluent-kafka-python
>>     > >     client_software_version=v2.1.3
>>     > >     client_instance_id=B64CD139-3975-440A-91D4
>>     > >     transactional_id=someTxnApp
>>     > > Are those labels added in PushTelemetryRequest? If so, are they
>> per
>>     > metric
>>     > > or per request?
>>     > >
>>     >
>>     >
>>     > client_software* and client_instance_id are not added by the client,
>>     > but are available to the broker-side metrics plugin to add as it sees
>>     > fit; removed them from the KIP.
>>     >
>>     > As for transactional_id, group_id, etc., which I believe will be
>>     > useful in troubleshooting, they are included only once (per push) as
>>     > resource-level attributes (the client instance is a singular
>>     > resource).
>>     >
>>     >
>>     > >
>>     > > 24.  "the broker will only send
>>     > > GetTelemetrySubscriptionsResponse.DeltaTemporality=True" :
>>     > > 24.1 If it's always true, does it need to be part of the protocol?
>>     > >
>>     >
>>     > We're anticipating that it will take a lot longer to upgrade the
>> majority
>>     > of clients than the
>>     > broker/plugin side, which is why we want the client to support both
>>     > temporalities out-of-the-box
>>     > so that cumulative reporting can be turned on seamlessly in the
>> future.
>>     >
>>     >
>>     >
>>     > > 24.2 Does delta only apply to Counter type?
>>     > >
>>     >
>>     >
>>     > And Histograms. More details in Xavier's OTLP link.
>>     >
>>     >
>>     >
>>     > > 24.3 In the delta representation, the first request needs to send
>> the
>>     > full
>>     > > value, how does the broker plugin know whether a value is full or
>> delta?
>>     > >
>>     >
>>     > The client may (should) send the start time for each metric sample,
>>     > indicating when
>>     > the metric began to be collected.
>>     > We've discussed whether this should be the client instance start
>> time or
>>     > the time when a matching
>>     > metric subscription for that metric is received.
>>     > For completeness we recommend using the former, the client instance
>> start
>>     > time.
>>     >
>>     >
>>     >
>>     > > 25. quota:
>>     > > 25.1 Since we are fitting PushTelemetryRequest into the existing
>> request
>>     > > quota, it would be useful to document the impact, i.e. client
>> metric
>>     > > throttling causes the data from the same client to be delayed.
>>     > > 25.2 Is PushTelemetryRequest subject to the write bandwidth quota
>> like
>>     > the
>>     > > producer?
>>     > >
>>     >
>>     >
>>     > Yes, it should be, so as to protect the cluster from rogue clients.
>>     > But in practice the size of metrics will be quite low (e.g., 1-10 kB
>>     > per 60s interval), so I don't think this will pose a problem.
>>     > The KIP has been updated with more details on quota/throttling
>> behaviour,
>>     > see the
>>     > "Throttling and rate-limiting" section.
>>     >
>>     >
>>     > 25.3 THROTTLING_QUOTA_EXCEEDED: Currently, we don't send this error
>> when
>>     > > the request/bandwidth quota is exceeded since those requests are
>> not
>>     > > rejected. We only set this error when the request is rejected
>> (e.g.,
>>     > topic
>>     > > creation). It would be useful to clarify when this error is used.
>>     > >
>>     >
>>     > Right, I was trying to reuse an existing error code. We can introduce
>>     > a new one for the case where a client pushes metrics at a higher
>>     > frequency than the configured push interval (e.g., out-of-profile
>>     > sends). This causes the broker to drop those metrics and send this
>>     > error code back to the client. There will be no connection throttling /
>>     > channel-muting in this case (unless the standard quotas are exceeded).
>>     >
>>     >
>>     > > 27. kafka-client-metrics.sh: Could we add an example on how to
>> disable a
>>     > > bad client?
>>     > >
>>     >
>>     > There's now a --block option to kafka-client-metrics.sh which
>> overrides all
>>     > subscriptions
>>     > for the matched client(s). This allows silencing metrics for one or
>> more
>>     > clients without having
>>     > to remove existing subscriptions. From the client's perspective it
>> will
>>     > look like it no longer has
>>     > any subscriptions.
>>     >
>>     > # Block metrics collection for a specific client instance.
>>     > # A descriptive name makes it easier to clean up old subscriptions.
>>     > # --match selects this specific client instance.
>>     > $ kafka-client-metrics.sh --bootstrap-server $BROKERS \
>>     >    --add \
>>     >    --name 'Disabe_b69cc35a' \
>>     >    --match client_instance_id=b69cc35a-7a54-4790-aa69-cc2bd4ee4538 \
>>     >    --block
>>     >
>>     >
>>     >
>>     >
>>     > > 28. New broker side metrics: Could we spell out the details of the
>>     > metrics
>>     > > (e.g., group, tags, etc)?
>>     > >
>>     >
>>     > KIP has been updated accordingly (thanks Sarat).
>>     >
>>     >
>>     >
>>     > >
>>     > > 29. Client instance-level metrics: client.io.wait.time is a gauge
>> not a
>>     > > histogram.
>>     > >
>>     >
>>     > I believe a population/distribution should preferably be
>> represented as a
>>     > histogram, space permitting,
>>     > and only secondarily as a Gauge average.
>>     > While we might not want to maintain a bunch of histograms for each
>>     > partition, since that could be
>>     > quite space consuming, this client.io.wait.time is a single metric
>> per
>>     > client instance and can
>>     > thus afford a Histogram representation.
>>     >
>>     >
>>     >
>>     > Thanks,
>>     > Magnus
>>     >
>>     >
>>     >
>>     > > Thanks,
>>     > >
>>     > > Jun
>>     > >
>>     > > On Wed, Jan 26, 2022 at 6:32 AM Magnus Edenhill <
>> magnus@edenhill.se>
>>     > > wrote:
>>     > >
>>     > > > Hi all,
>>     > > >
>>     > > > I've updated the KIP with responses to the latest comments:
>> Java client
>>     > > > dependencies (Thanks Kirk!), alternate designs (separate
>> cluster,
>>     > > separate
>>     > > > producer, etc), etc.
>>     > > >
>>     > > > I will revive the vote thread.
>>     > > >
>>     > > > Thanks,
>>     > > > Magnus
>>     > > >
>>     > > >
>>     > > > On Mon, 13 Dec 2021 at 22:32, Ryanne Dolan
>>     > > > <ryannedolan@gmail.com> wrote:
>>     > > >
>>     > > > > I think we should be very careful about introducing new
>> runtime
>>     > > > > dependencies into the clients. Historically this has been
>> rare and
>>     > > > > essentially necessary (e.g. compression libs).
>>     > > > >
>>     > > > > Ryanne
>>     > > > >
>>     > > > > On Mon, Dec 13, 2021, 1:06 PM Kirk True <
>> kirk@mustardgrain.com>
>>     > wrote:
>>     > > > >
>>     > > > > > Hi Jun,
>>     > > > > >
>>     > > > > > On Thu, Dec 9, 2021, at 2:28 PM, Jun Rao wrote:
>>     > > > > > > 13. Using OpenTelemetry. Does that require runtime
>> dependency
>>     > > > > > > on OpenTelemetry library? How good is the compatibility
>> story
>>     > > > > > > of OpenTelemetry? This is important since an application
>> could
>>     > have
>>     > > > > other
>>     > > > > > > OpenTelemetry dependencies than the Kafka client.
>>     > > > > >
>>     > > > > > The current design is that the OpenTelemetry JARs would
>> ship with
>>     > the
>>     > > > > > client. Perhaps we can design the client such that the JARs
>> aren't
>>     > > even
>>     > > > > > loaded if the user has opted out. The user could even
>> exclude the
>>     > > JARs
>>     > > > > from
>>     > > > > > their dependencies if they so wished.
>>     > > > > >
>>     > > > > > I can't speak to the compatibility of the libraries. Is it
>> possible
>>     > > > that
>>     > > > > > we include a shaded version?
>>     > > > > >
>>     > > > > > Thanks,
>>     > > > > > Kirk
>>     > > > > >
>>     > > > > > >
>>     > > > > > > 14. The proposal listed idempotence=true. This is more of
>> a
>>     > > > > configuration
>>     > > > > > > than a metric. Are we including that as a metric? What
>> other
>>     > > > > > configurations
>>     > > > > > > are we including? Should we separate the configurations
>> from the
>>     > > > > metrics?
>>     > > > > > >
>>     > > > > > > Thanks,
>>     > > > > > >
>>     > > > > > > Jun
>>     > > > > > >
>>     > > > > > > On Mon, Nov 29, 2021 at 7:34 AM Magnus Edenhill <
>>     > > magnus@edenhill.se>
>>     > > > > > wrote:
>>     > > > > > >
>>     > > > > > > > Hey Bob,
>>     > > > > > > >
>>     > > > > > > > That's a good point.
>>     > > > > > > >
>>     > > > > > > > Request type labels were considered, but since they're
>>     > > > > > > > already tracked by broker-side metrics they were left out
>>     > > > > > > > to avoid metric duplication. However, those metrics are
>>     > > > > > > > not per connection, so they won't be that useful in
>>     > > > > > > > practice for troubleshooting specific client instances.
>>     > > > > > > >
>>     > > > > > > > I'll add the request_type label to the relevant metrics.
>>     > > > > > > >
>>     > > > > > > > Thanks,
>>     > > > > > > > Magnus
>>     > > > > > > >
>>     > > > > > > >
>>     > > > > > > > On Tue, 23 Nov 2021 at 19:20, Bob Barrett
>>     > > > > > > > <bo...@confluent.io.invalid> wrote:
>>     > > > > > > >
>>     > > > > > > > > Hi Magnus,
>>     > > > > > > > >
>>     > > > > > > > > Thanks for the thorough KIP, this seems very useful.
>>     > > > > > > > >
>>     > > > > > > > > Would it make sense to include the request type as a
>> label
>>     > for
>>     > > > the
>>     > > > > > > > > `client.request.success`, `client.request.errors` and
>>     > > > > > > > `client.request.rtt`
>>     > > > > > > > > metrics? I think it would be very useful to see which
>>     > specific
>>     > > > > > requests
>>     > > > > > > > are
>>     > > > > > > > > succeeding and failing for a client. One specific
>> case I can
>>     > > > think
>>     > > > > of
>>     > > > > > > > where
>>     > > > > > > > > this could be useful is producer batch timeouts. If a
>> Java
>>     > > > > > application
>>     > > > > > > > does
>>     > > > > > > > > not enable producer client logs (unfortunately, in my
>>     > > experience
>>     > > > > this
>>     > > > > > > > > happens more often than it should), the application
>> logs will
>>     > > > only
>>     > > > > > > > contain
>>     > > > > > > > > the expiration error message, but no information
>> about what
>>     > is
>>     > > > > > causing
>>     > > > > > > > the
>>     > > > > > > > > timeout. The requests might all be succeeding but
>> taking too
>>     > > long
>>     > > > > to
>>     > > > > > > > > process batches, or metadata requests might be
>> failing, or
>>     > some
>>     > > > or
>>     > > > > > all
>>     > > > > > > > > produce requests might be failing (if the bootstrap
>> servers
>>     > are
>>     > > > > > reachable
>>     > > > > > > > > from the client but one or more other brokers are
>> not, for
>>     > > > > example).
>>     > > > > > If
>>     > > > > > > > the
>>     > > > > > > > > cluster operator is able to identify the specific
>> requests
>>     > that
>>     > > > are
>>     > > > > > slow
>>     > > > > > > > or
>>     > > > > > > > > failing for a client, they will be better able to
>> diagnose
>>     > the
>>     > > > > issue
>>     > > > > > > > > causing batch timeouts.
>>     > > > > > > > >
>>     > > > > > > > > One drawback I can think of is that this will
>> increase the
>>     > > > > > cardinality of
>>     > > > > > > > > the request metrics. But any given client is only
>> going to
>>     > use
>>     > > a
>>     > > > > > small
>>     > > > > > > > > subset of the request types, and since we already have
>>     > > partition
>>     > > > > > labels
>>     > > > > > > > for
>>     > > > > > > > > the topic-level metrics, I think request labels will
>> still
>>     > make
>>     > > > up
>>     > > > > a
>>     > > > > > > > > relatively small percentage of the set of metrics.
>>     > > > > > > > >
>>     > > > > > > > > Thanks,
>>     > > > > > > > > Bob
>>     > > > > > > > >
>>     > > > > > > > > On Mon, Nov 22, 2021 at 2:08 AM Viktor Somogyi-Vass <
>>     > > > > > > > > viktorsomogyi@gmail.com>
>>     > > > > > > > > wrote:
>>     > > > > > > > >
>>     > > > > > > > > > Hi Magnus,
>>     > > > > > > > > >
>>     > > > > > > > > > I think this is a very useful addition. We also
>> have a
>>     > > similar
>>     > > > > (but
>>     > > > > > > > much
>>     > > > > > > > > > more simplistic) implementation of this. Maybe I
>> missed it
>>     > in
>>     > > > the
>>     > > > > > KIP
>>     > > > > > > > but
>>     > > > > > > > > > what about adding metrics about the subscription
>> cache
>>     > > itself?
>>     > > > > > That I
>>     > > > > > > > > think
>>     > > > > > > > > > would improve its usability and debuggability as
>> we'd be
>>     > able
>>     > > > to
>>     > > > > > see
>>     > > > > > > > its
>>     > > > > > > > > > performance, hit/miss rates, eviction counts and
>> others.
>>     > > > > > > > > >
>>     > > > > > > > > > Best,
>>     > > > > > > > > > Viktor
>>     > > > > > > > > >
>>     > > > > > > > > > On Thu, Nov 18, 2021 at 5:12 PM Magnus Edenhill <
>>     > > > > > magnus@edenhill.se>
>>     > > > > > > > > > wrote:
>>     > > > > > > > > >
>>     > > > > > > > > > > Hi Mickael,
>>     > > > > > > > > > >
>>     > > > > > > > > > > see inline.
>>     > > > > > > > > > >
>>     > > > > > > > > > > On Wed, 10 Nov 2021 at 15:21, Mickael Maison
>>     > > > > > > > > > > <mickael.maison@gmail.com> wrote:
>>     > > > > > > > > > >
>>     > > > > > > > > > > > Hi Magnus,
>>     > > > > > > > > > > >
>>     > > > > > > > > > > > I see you've addressed some of the points I
>> raised
>>     > above
>>     > > > but
>>     > > > > > some
>>     > > > > > > > (4,
>>     > > > > > > > > > > > 5) have not been addressed yet.
>>     > > > > > > > > > > >
>>     > > > > > > > > > >
>>     > > > > > > > > > > Re 4) How will the user/app know metrics are being sent?
>>     > > > > > > > > > >
>>     > > > > > > > > > > One possibility is to add a JMX metric (thus for
>> user
>>     > > > > > consumption)
>>     > > > > > > > for
>>     > > > > > > > > > the
>>     > > > > > > > > > > number of metric pushes the
>>     > > > > > > > > > > client has performed, or perhaps the number of
>> metrics
>>     > > > > > subscriptions
>>     > > > > > > > > > > currently being collected.
>>     > > > > > > > > > > Would that be sufficient?
>>     > > > > > > > > > >
>>     > > > > > > > > > > Re 5) Metric sizes and rates
>>     > > > > > > > > > >
>>     > > > > > > > > > > A worst case scenario for a producer that is
>> producing to
>>     > > 50
>>     > > > > > unique
>>     > > > > > > > > > topics
>>     > > > > > > > > > > and emitting all standard metrics yields
>>     > > > > > > > > > > a serialized size of around 100KB prior to
>> compression,
>>     > > which
>>     > > > > > > > > compresses
>>     > > > > > > > > > > down to about 20-30% of that depending
>>     > > > > > > > > > > on compression type and topic name uniqueness.
>>     > > > > > > > > > > The numbers for a consumer would be similar.
>>     > > > > > > > > > >
>>     > > > > > > > > > > In practice the number of unique topics would be
>> far
>>     > less,
>>     > > > and
>>     > > > > > the
>>     > > > > > > > > > > subscription set would typically be for a subset
>> of
>>     > > metrics.
>>     > > > > > > > > > > So we're probably closer to 1kb, or less,
>> compressed size
>>     > > per
>>     > > > > > client
>>     > > > > > > > > per
>>     > > > > > > > > > > push interval.
>>     > > > > > > > > > >
>>     > > > > > > > > > > As both the subscription set and push intervals
>> are
>>     > > > controlled
>>     > > > > > by the
>>     > > > > > > > > > > cluster operator it shouldn't be too hard
>>     > > > > > > > > > > to strike a good balance between metrics overhead
>> and
>>     > > > > > granularity.
>>     > > > > > > > > > >
>>     > > > > > > > > > >
>>     > > > > > > > > > >
>>     > > > > > > > > > > >
>>     > > > > > > > > > > > I'm really uneasy with this being enabled by
>> default on
>>     > > the
>>     > > > > > client
>>     > > > > > > > > > > > side. When collecting data, I think the best
>> practice
>>     > is
>>     > > to
>>     > > > > > ensure
>>     > > > > > > > > > > > users are explicitly enabling it.
>>     > > > > > > > > > > >
>>     > > > > > > > > > >
>>     > > > > > > > > > > Requiring metrics to be explicitly enabled on
>> clients
>>     > > > severely
>>     > > > > > > > cripples
>>     > > > > > > > > > its
>>     > > > > > > > > > > usability and value.
>>     > > > > > > > > > >
>>     > > > > > > > > > > One of the problems that this KIP aims to solve is
>>     > > > > > > > > > > for useful metrics to be available on demand
>>     > > > > > > > > > > regardless of the technical expertise of the user.
>>     > > > > > > > > > > As Ryanne points out, a savvy user/organization
>>     > > > > > > > > > > will typically have metrics collection and
>>     > > > > > > > > > > monitoring in place already, and the benefit of
>>     > > > > > > > > > > this KIP is then more that of a common set and
>>     > > > > > > > > > > format of metrics across client implementations
>>     > > > > > > > > > > and languages.
>>     > > > > > > > > > > But that is not the typical Kafka user in my
>> experience,
>>     > > > > they're
>>     > > > > > not
>>     > > > > > > > > > Kafka
>>     > > > > > > > > > > experts and they don't have the
>>     > > > > > > > > > > knowledge of how to best instrument their clients.
>>     > > > > > > > > > > Having metrics enabled by default for this user
>> base
>>     > allows
>>     > > > the
>>     > > > > > Kafka
>>     > > > > > > > > > > operators to proactively and reactively
>>     > > > > > > > > > > monitor and troubleshoot client issues, without
>> the need
>>     > > for
>>     > > > > the
>>     > > > > > less
>>     > > > > > > > > > savvy
>>     > > > > > > > > > > user to do anything.
>>     > > > > > > > > > > It is often too late to tell a user to enable
>> metrics
>>     > when
>>     > > > the
>>     > > > > > > > problem
>>     > > > > > > > > > has
>>     > > > > > > > > > > already occurred.
>>     > > > > > > > > > >
>>     > > > > > > > > > > Now, to be clear, even though metrics are enabled
>>     > > > > > > > > > > by default on clients, they are not enabled by
>>     > > > > > > > > > > default on the brokers; the Kafka operator needs to
>>     > > > > > > > > > > build and set up a metrics plugin and add metrics
>>     > > > > > > > > > > subscriptions before anything is sent from the
>>     > > > > > > > > > > client.
>>     > > > > > > > > > > It is opt-out on the clients and opt-in on the broker.
>>     > > > > > > > > > >
>>     > > > > > > > > > >
>>     > > > > > > > > > >
>>     > > > > > > > > > >
>>     > > > > > > > > > > > You mentioned brokers already have
>>     > > > > > > > > > > > some(most?) of the information contained in
>> metrics, if
>>     > > so
>>     > > > > > then why
>>     > > > > > > > > > > > are we collecting it again? Surely there must
>> be some
>>     > new
>>     > > > > > > > information
>>     > > > > > > > > > > > in the client metrics.
>>     > > > > > > > > > > >
>>     > > > > > > > > > >
>>     > > > > > > > > > > From the user's perspective the Kafka
>>     > > > > > > > > > > infrastructure extends from producer.send() to
>>     > > > > > > > > > > messages being returned from consumer.poll(), a
>>     > > > > > > > > > > giant black box where there's a lot going on
>>     > > > > > > > > > > between those two points. The brokers currently
>>     > > > > > > > > > > only see what happens once those requests and
>>     > > > > > > > > > > messages hit the broker, but as Kafka clients are
>>     > > > > > > > > > > complex pieces of machinery there's a myriad of
>>     > > > > > > > > > > queues, timers, and state that's critical to the
>>     > > > > > > > > > > operation and infrastructure but not currently
>>     > > > > > > > > > > visible to the operator.
>>     > > > > > > > > > > Relying on the user to provide this missing
>>     > > > > > > > > > > information accurately and in a timely manner is
>>     > > > > > > > > > > not generally feasible.
>>     > > > > > > > > > >
>>     > > > > > > > > > >
>>     > > > > > > > > > > Most of the standard metrics listed in the KIP
>> are data
>>     > > > points
>>     > > > > > that
>>     > > > > > > > the
>>     > > > > > > > > > > broker does not have.
>>     > > > > > > > > > > Only a small number of metrics are duplicates
>> (like the
>>     > > > request
>>     > > > > > > > counts
>>     > > > > > > > > > and
>>     > > > > > > > > > > sizes), but they are included
>>     > > > > > > > > > > to ease correlation when inspecting these client
>> metrics.
>>     > > > > > > > > > >
>>     > > > > > > > > > >
>>     > > > > > > > > > >
>>     > > > > > > > > > > > Moreover this is a brand new feature so it's
>> even
>>     > harder
>>     > > to
>>     > > > > > justify
>>     > > > > > > > > > > > enabling it and forcing onto all our users. If
>> disabled
>>     > > by
>>     > > > > > default,
>>     > > > > > > > > > > > it's relatively easy to enable in a new release
>> if we
>>     > > > decide
>>     > > > > > to,
>>     > > > > > > > but
>>     > > > > > > > > > > > once enabled by default it's much harder to
>> disable.
>>     > Also
>>     > > > > this
>>     > > > > > > > > feature
>>     > > > > > > > > > > > will apply to all future metrics we will add.
>>     > > > > > > > > > > >
>>     > > > > > > > > > >
>>     > > > > > > > > > > I think maturity of a feature implementation
>> should be
>>     > the
>>     > > > > > deciding
>>     > > > > > > > > > factor,
>>     > > > > > > > > > > rather than
>>     > > > > > > > > > > the design of it (which this KIP is). I.e., if the
>>     > > > > > implementation is
>>     > > > > > > > > not
>>     > > > > > > > > > > deemed mature enough
>>     > > > > > > > > > > for release X.Y it will be disabled.
>>     > > > > > > > > > >
>>     > > > > > > > > > >
>>     > > > > > > > > > >
>>     > > > > > > > > > > > Overall I think it's an interesting feature but
>> I'd
>>     > > prefer
>>     > > > to
>>     > > > > > be
>>     > > > > > > > > > > > slightly defensive and see how it works in
>> practice
>>     > > before
>>     > > > > > enabling
>>     > > > > > > > > it
>>     > > > > > > > > > > > everywhere.
>>     > > > > > > > > > > >
>>     > > > > > > > > > >
>>     > > > > > > > > > > Right, and I agree on being defensive, but since
>> this
>>     > > feature
>>     > > > > > still
>>     > > > > > > > > > > requires manual
>>     > > > > > > > > > > enabling on the brokers before actually being
>> used, I
>>     > think
>>     > > > > that
>>     > > > > > > > gives
>>     > > > > > > > > > > enough control
>>     > > > > > > > > > > to opt-in or out of this feature as needed.
>>     > > > > > > > > > >
>>     > > > > > > > > > > Thanks for your comments!
>>     > > > > > > > > > >
>>     > > > > > > > > > > Regards,
>>     > > > > > > > > > > Magnus
>>     > > > > > > > > > >
>>     > > > > > > > > > >
>>     > > > > > > > > > >
>>     > > > > > > > > > > > Thanks,
>>     > > > > > > > > > > > Mickael
>>     > > > > > > > > > > >
>>     > > > > > > > > > > > On Mon, Nov 8, 2021 at 7:55 AM Magnus Edenhill <
>>     > > > > > magnus@edenhill.se
>>     > > > > > > > >
>>     > > > > > > > > > > wrote:
>>     > > > > > > > > > > > >
>>     > > > > > > > > > > > > Thanks David for pointing this out,
>>     > > > > > > > > > > > > I've updated the KIP to include client_id as a
>>     > matching
>>     > > > > > selector.
>>     > > > > > > > > > > > >
>>     > > > > > > > > > > > > Regards,
>>     > > > > > > > > > > > > Magnus
>>     > > > > > > > > > > > >
>>     > > > > > > > > > > > > On Thu, Nov 4, 2021 at 18:01, David Mao <dmao@confluent.io.invalid> wrote:
>>     > > > > > > > > > > > >
>>     > > > > > > > > > > > > > Hey Magnus,
>>     > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > I noticed that the KIP outlines the initial
>>     > selectors
>>     > > > > > supported
>>     > > > > > > > > as:
>>     > > > > > > > > > > > > >
>>     > > > > > > > > > > > > >    - client_instance_id -
>> CLIENT_INSTANCE_ID UUID
>>     > > > string
>>     > > > > > > > > > > > representation.
>>     > > > > > > > > > > > > >    - client_software_name  - client software
>>     > > > > implementation
>>     > > > > > > > name.
>>     > > > > > > > > > > > > >    - client_software_version  - client
>> software
>>     > > > > > implementation
>>     > > > > > > > > > > version.
>>     > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > In the given reactive monitoring workflow,
>> we
>>     > mention
>>     > > > > that
>>     > > > > > the
>>     > > > > > > > > > > > application
>>     > > > > > > > > > > > > > user does not know their client's client
>> instance
>>     > ID,
>>     > > > but
>>     > > > > > it's
>>     > > > > > > > > > > outlined
>>     > > > > > > > > > > > > > that the operator can add a metrics
>> subscription
>>     > > > > selecting
>>     > > > > > for
>>     > > > > > > > > > > > clientId. I
>>     > > > > > > > > > > > > > don't see clientId as one of the supported
>>     > selectors.
>>     > > > > > > > > > > > > > I can see how this would have made sense in
>> a
>>     > > previous
>>     > > > > > > > iteration
>>     > > > > > > > > > > given
>>     > > > > > > > > > > > that
>>     > > > > > > > > > > > > > the previous client instance ID proposal
>> was to
>>     > > > construct
>>     > > > > > the
>>     > > > > > > > > > client
>>     > > > > > > > > > > > > > instance ID using clientId as a prefix. Now
>> that
>>     > the
>>     > > > > client
>>     > > > > > > > > > instance
>>     > > > > > > > > > > > ID is
>>     > > > > > > > > > > > > > a UUID, would we want to add clientId as a
>>     > supported
>>     > > > > > selector?
>>     > > > > > > > > > > > > > Let me know what you think.
>>     > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > David
>>     > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > On Tue, Oct 19, 2021 at 12:33 PM Magnus
>> Edenhill <
>>     > > > > > > > > > magnus@edenhill.se
>>     > > > > > > > > > > >
>>     > > > > > > > > > > > > > wrote:
>>     > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > > Hi Mickael!
>>     > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > > On Tue, Oct 19, 2021 at 19:30, Mickael Maison <mickael.maison@gmail.com> wrote:
>>     > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > > > Hi Magnus,
>>     > > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > > > Thanks for the proposal.
>>     > > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > > > 1. Looking at the protocol section,
>> isn't
>>     > > > > > > > "ClientInstanceId"
>>     > > > > > > > > > > > expected
>>     > > > > > > > > > > > > > > > to be a field in
>>     > > > GetTelemetrySubscriptionsResponseV0?
>>     > > > > > > > > > Otherwise,
>>     > > > > > > > > > > > how
>>     > > > > > > > > > > > > > > > does a client retrieve this value?
>>     > > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > > Good catch, it got removed by mistake in
>> one of
>>     > the
>>     > > > > > edits.
>>     > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > > > 2. In the client API section, you
>> mention a new
>>     > > > > method
>>     > > > > > > > > > > > > > > > "clientInstanceId()". Can you clarify
>> which
>>     > > > > interfaces
>>     > > > > > are
>>     > > > > > > > > > > > affected?
>>     > > > > > > > > > > > > > > > Is it only Consumer and Producer?
>>     > > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > > And Admin. Will update the KIP.
>>     > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > > > 3. I'm a bit concerned this is enabled
>> by
>>     > > default.
>>     > > > > > Even if
>>     > > > > > > > > the
>>     > > > > > > > > > > data
>>     > > > > > > > > > > > > > > > collected is supposed to be not
>> sensitive, I
>>     > > think
>>     > > > > > this can
>>     > > > > > > > > be
>>     > > > > > > > > > > > > > > > problematic in some environments. Also
>> users
>>     > > don't
>>     > > > > > seem to
>>     > > > > > > > > have
>>     > > > > > > > > > > the
>>     > > > > > > > > > > > > > > > choice to only expose some metrics.
>> Knowing how
>>     > > > much
>>     > > > > > data
>>     > > > > > > > > > transits
>>     > > > > > > > > > > > > > > > through some applications can be
>> considered
>>     > > > critical.
>>     > > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > > The broker already knows how much data
>> transits
>>     > > > through
>>     > > > > > the
>>     > > > > > > > > > client
>>     > > > > > > > > > > > > > though,
>>     > > > > > > > > > > > > > > right?
>>     > > > > > > > > > > > > > > Care has been taken not to expose
>> information in
>>     > > the
>>     > > > > > standard
>>     > > > > > > > > > > metrics
>>     > > > > > > > > > > > > > that
>>     > > > > > > > > > > > > > > might
>>     > > > > > > > > > > > > > > reveal sensitive information.
>>     > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > > Do you have an example of how the proposed
>>     > metrics
>>     > > > > could
>>     > > > > > leak
>>     > > > > > > > > > > > sensitive
>>     > > > > > > > > > > > > > > information?
>>     > > > > > > > > > > > > > > As for limiting what metrics to
>> export; I
>>     > guess
>>     > > > > that
>>     > > > > > > > could
>>     > > > > > > > > > make
>>     > > > > > > > > > > > sense
>>     > > > > > > > > > > > > > > in some
>>     > > > > > > > > > > > > > > very sensitive use-cases, but those users
>> might
>>     > > > disable
>>     > > > > > > > metrics
>>     > > > > > > > > > > > > > altogether
>>     > > > > > > > > > > > > > > for now.
>>     > > > > > > > > > > > > > > Could these concerns be addressed by a
>> later KIP?
>>     > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > > > 4. As a user, how do you know if your
>>     > application
>>     > > > is
>>     > > > > > > > actively
>>     > > > > > > > > > > > sending
>>     > > > > > > > > > > > > > > > metrics? Are there new metrics exposing
>> what's
>>     > > > going
>>     > > > > > on,
>>     > > > > > > > like
>>     > > > > > > > > > how
>>     > > > > > > > > > > > much
>>     > > > > > > > > > > > > > > > data is being sent?
>>     > > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > > That's a good question.
>>     > > > > > > > > > > > > > > Since the proposed metrics interface is
>> not aimed
>>     > > at,
>>     > > > > or
>>     > > > > > > > > directly
>>     > > > > > > > > > > > > > available
>>     > > > > > > > > > > > > > > to, the application
>>     > > > > > > > > > > > > > > I guess there's little point in adding it
>> here,
>>     > but
>>     > > > > > instead
>>     > > > > > > > > > adding
>>     > > > > > > > > > > > > > > something to the
>>     > > > > > > > > > > > > > > existing JMX metrics?
>>     > > > > > > > > > > > > > > Do you have any suggestions?
>>     > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > > > 5. If all metrics are enabled on a
>> regular
>>     > > Consumer
>>     > > > > or
>>     > > > > > > > > > Producer,
>>     > > > > > > > > > > do
>>     > > > > > > > > > > > > > > > you have an idea how much throughput
>> this would
>>     > > > use?
>>     > > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > > It depends on the number of
>> partitions/topics/etc.
>>     > > the
>>     > > > > > client
>>     > > > > > > > is
>>     > > > > > > > > > > > producing
>>     > > > > > > > > > > > > > > to/consuming from.
>>     > > > > > > > > > > > > > > I'll add some sizes to the KIP for some
>> typical
>>     > > > > > use-cases.
>>     > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > > Thanks,
>>     > > > > > > > > > > > > > > Magnus
>>     > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > > > Thanks
>>     > > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > > > On Tue, Oct 19, 2021 at 5:06 PM Magnus
>>     > Edenhill <
>>     > > > > > > > > > > > magnus@edenhill.se>
>>     > > > > > > > > > > > > > > > wrote:
>>     > > > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > > > > On Tue, Oct 19, 2021 at 13:22, Tom Bentley <tbentley@redhat.com> wrote:
>>     > > > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > > > > > Hi Magnus,
>>     > > > > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > > > > > I reviewed the KIP since you called
>> the
>>     > vote
>>     > > > > > (sorry for
>>     > > > > > > > > not
>>     > > > > > > > > > > > > > reviewing
>>     > > > > > > > > > > > > > > > when
>>     > > > > > > > > > > > > > > > > > you announced your intention to
>> call the
>>     > > > vote). I
>>     > > > > > have
>>     > > > > > > > a
>>     > > > > > > > > > few
>>     > > > > > > > > > > > > > > questions
>>     > > > > > > > > > > > > > > > on
>>     > > > > > > > > > > > > > > > > > some of the details.
>>     > > > > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > > > > > 1. There's no Javadoc on
>>     > > > > > ClientTelemetryPayload.data(),
>>     > > > > > > > > so
>>     > > > > > > > > > I
>>     > > > > > > > > > > > don't
>>     > > > > > > > > > > > > > > know
>>     > > > > > > > > > > > > > > > > > whether the payload is exposed
>> through this
>>     > > > > method
>>     > > > > > as
>>     > > > > > > > > > > > compressed or
>>     > > > > > > > > > > > > > > > not.
>>     > > > > > > > > > > > > > > > > > Later on you say "Decompression of
>> the
>>     > > payloads
>>     > > > > > will be
>>     > > > > > > > > > > > handled by
>>     > > > > > > > > > > > > > > the
>>     > > > > > > > > > > > > > > > > > broker metrics plugin, the broker
>> should
>>     > > > expose a
>>     > > > > > > > > suitable
>>     > > > > > > > > > > > > > > > decompression
>>     > > > > > > > > > > > > > > > > > API to the metrics plugin for this
>>     > purpose.",
>>     > > > > which
>>     > > > > > > > > > suggests
>>     > > > > > > > > > > > it's
>>     > > > > > > > > > > > > > the
>>     > > > > > > > > > > > > > > > > > compressed data in the buffer, but
>> then we
>>     > > > don't
>>     > > > > > know
>>     > > > > > > > > which
>>     > > > > > > > > > > > codec
>>     > > > > > > > > > > > > > was
>>     > > > > > > > > > > > > > > > used,
>>     > > > > > > > > > > > > > > > > > nor the API via which the plugin
>> should
>>     > > > > decompress
>>     > > > > > it
>>     > > > > > > > if
>>     > > > > > > > > > > > required
>>     > > > > > > > > > > > > > for
>>     > > > > > > > > > > > > > > > > > forwarding to the ultimate metrics
>> store.
>>     > > > Should
>>     > > > > > the
>>     > > > > > > > > > > > > > > > ClientTelemetryPayload
>>     > > > > > > > > > > > > > > > > > expose a method to get the
>> compression and
>>     > a
>>     > > > > > > > > decompressor?
>>     > > > > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > > > > Good point, updated.
>>     > > > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > > > > > 2. The client-side API is expressed
>> as
>>     > > > > > StringOrError
>>     > > > > > > > > > > > > > > > > > ClientInstance::ClientInstanceId(int
>>     > > > > timeout_ms). I
>>     > > > > > > > > > > understand
>>     > > > > > > > > > > > that
>>     > > > > > > > > > > > > > > > you're
>>     > > > > > > > > > > > > > > > > > thinking about the librdkafka
>>     > implementation,
>>     > > > but
>>     > > > > > it
>>     > > > > > > > > would
>>     > > > > > > > > > be
>>     > > > > > > > > > > > good
>>     > > > > > > > > > > > > > to
>>     > > > > > > > > > > > > > > > show
>>     > > > > > > > > > > > > > > > > > the API as it would appear on the
>> Apache
>>     > > Kafka
>>     > > > > > clients.
>>     > > > > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > > > > This was meant as pseudo-code, but I
>> changed
>>     > it
>>     > > > to
>>     > > > > > Java.
>>     > > > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > > > > > 3. "PushTelemetryRequest|Response -
>>     > protocol
>>     > > > > > request
>>     > > > > > > > used
>>     > > > > > > > > > by
>>     > > > > > > > > > > > the
>>     > > > > > > > > > > > > > > > client to
>>     > > > > > > > > > > > > > > > > > send metrics to any broker it is
>> connected
>>     > > to."
>>     > > > > To
>>     > > > > > be
>>     > > > > > > > > > clear,
>>     > > > > > > > > > > > this
>>     > > > > > > > > > > > > > > means
>>     > > > > > > > > > > > > > > > > > that the client can choose any of
>> the
>>     > > connected
>>     > > > > > brokers
>>     > > > > > > > > and
>>     > > > > > > > > > > > push to
>>     > > > > > > > > > > > > > > > just
>>     > > > > > > > > > > > > > > > > > one of them? What should a
>> supporting
>>     > client
>>     > > do
>>     > > > > if
>>     > > > > > it
>>     > > > > > > > > gets
>>     > > > > > > > > > an
>>     > > > > > > > > > > > error
>>     > > > > > > > > > > > > > > > when
>>     > > > > > > > > > > > > > > > > > pushing metrics to a broker, retry
>> sending
>>     > to
>>     > > > the
>>     > > > > > same
>>     > > > > > > > > > broker
>>     > > > > > > > > > > > or
>>     > > > > > > > > > > > > > try
>>     > > > > > > > > > > > > > > > > > pushing to another broker, or drop
>> the
>>     > > metrics?
>>     > > > > > Should
>>     > > > > > > > > > > > supporting
>>     > > > > > > > > > > > > > > > clients
>>     > > > > > > > > > > > > > > > > > send successive requests to a single
>>     > broker,
>>     > > or
>>     > > > > > round
>>     > > > > > > > > > robin,
>>     > > > > > > > > > > > or is
>>     > > > > > > > > > > > > > > > that up
>>     > > > > > > > > > > > > > > > > > to the client author? I'm guessing
>> the
>>     > > > behaviour
>>     > > > > > should
>>     > > > > > > > > be
>>     > > > > > > > > > > > sticky
>>     > > > > > > > > > > > > > to
>>     > > > > > > > > > > > > > > > > > support the rate limiting features,
>> but I
>>     > > think
>>     > > > > it
>>     > > > > > > > would
>>     > > > > > > > > be
>>     > > > > > > > > > > > good
>>     > > > > > > > > > > > > > for
>>     > > > > > > > > > > > > > > > client
>>     > > > > > > > > > > > > > > > > > authors if this section were
>> explicit on
>>     > the
>>     > > > > > > > recommended
>>     > > > > > > > > > > > behaviour.
>>     > > > > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > > > > You are right, I've updated the KIP
>> to make
>>     > > this
>>     > > > > > clearer.
>>     > > > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > > > > > 4. "Mapping the client instance id
>> to an
>>     > > actual
>>     > > > > > > > > application
>>     > > > > > > > > > > > > > instance
>>     > > > > > > > > > > > > > > > > > running on a (virtual) machine can
>> be done
>>     > by
>>     > > > > > > > inspecting
>>     > > > > > > > > > the
>>     > > > > > > > > > > > > > metrics
>>     > > > > > > > > > > > > > > > > > resource labels, such as the client
>> source
>>     > > > > address
>>     > > > > > and
>>     > > > > > > > > > source
>>     > > > > > > > > > > > port,
>>     > > > > > > > > > > > > > > or
>>     > > > > > > > > > > > > > > > > > security principal, all of which
>> are added
>>     > by
>>     > > > the
>>     > > > > > > > > receiving
>>     > > > > > > > > > > > broker.
>>     > > > > > > > > > > > > > > > This
>>     > > > > > > > > > > > > > > > > > will allow the operator together
>> with the
>>     > > user
>>     > > > to
>>     > > > > > > > > identify
>>     > > > > > > > > > > the
>>     > > > > > > > > > > > > > actual
>>     > > > > > > > > > > > > > > > > > application instance." Is this
>> really
>>     > always
>>     > > > > true?
>>     > > > > > The
>>     > > > > > > > > > source
>>     > > > > > > > > > > > IP
>>     > > > > > > > > > > > > > and
>>     > > > > > > > > > > > > > > > port
>>     > > > > > > > > > > > > > > > > > might be a loadbalancer/proxy in
>> some
>>     > setups.
>>     > > > The
>>     > > > > > > > > > principal,
>>     > > > > > > > > > > as
>>     > > > > > > > > > > > > > > already
>>     > > > > > > > > > > > > > > > > > mentioned in the KIP, might be
>> shared
>>     > between
>>     > > > > > multiple
>>     > > > > > > > > > > > > > applications.
>>     > > > > > > > > > > > > > > > So at
>>     > > > > > > > > > > > > > > > > > worst the organization running the
>> clients
>>     > > > might
>>     > > > > > have
>>     > > > > > > > to
>>     > > > > > > > > > > > consult
>>     > > > > > > > > > > > > > the
>>     > > > > > > > > > > > > > > > logs
>>     > > > > > > > > > > > > > > > > > of a set of client applications,
>> right?
>>     > > > > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > > > > Yes, that's correct. There's no
>> guaranteed
>>     > > > mapping
>>     > > > > > from
>>     > > > > > > > > > > > > > > > client_instance_id
>>     > > > > > > > > > > > > > > > > to
>>     > > > > > > > > > > > > > > > > an actual instance, that's why the KIP
>>     > > recommends
>>     > > > > > client
>>     > > > > > > > > > > > > > > implementations
>>     > > > > > > > > > > > > > > > to
>>     > > > > > > > > > > > > > > > > log the client instance id
>>     > > > > > > > > > > > > > > > > upon retrieval, and also provide an
>> API for
>>     > the
>>     > > > > > > > application
>>     > > > > > > > > > to
>>     > > > > > > > > > > > > > retrieve
>>     > > > > > > > > > > > > > > > the
>>     > > > > > > > > > > > > > > > > instance id programmatically
>>     > > > > > > > > > > > > > > > > if it has a better way of exposing it.
>>     > > > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > > > > > 5. "Tests indicate that a compression
>> ratio
>>     > up
>>     > > to
>>     > > > > > 10x is
>>     > > > > > > > > > > > possible for
>>     > > > > > > > > > > > > > > the
>>     > > > > > > > > > > > > > > > > > standard metrics." Client authors
>> might
>>     > > > > appreciate
>>     > > > > > your
>>     > > > > > > > > > > > mentioning
>>     > > > > > > > > > > > > > > > which
>>     > > > > > > > > > > > > > > > > > compression codec got these results.
>>     > > > > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > > > > Good point. Updated.
>>     > > > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > > > > > 6. "Should the client send a push
>> request
>>     > > prior
>>     > > > > to
>>     > > > > > > > expiry
>>     > > > > > > > > > of
>>     > > > > > > > > > > > the
>>     > > > > > > > > > > > > > > > previously
>>     > > > > > > > > > > > > > > > > > calculated PushIntervalMs the
>> broker will
>>     > > > discard
>>     > > > > > the
>>     > > > > > > > > > metrics
>>     > > > > > > > > > > > and
>>     > > > > > > > > > > > > > > > return a
>>     > > > > > > > > > > > > > > > > > PushTelemetryResponse with the
>> ErrorCode
>>     > set
>>     > > to
>>     > > > > > > > > > RateLimited."
>>     > > > > > > > > > > > Is
>>     > > > > > > > > > > > > > this
>>     > > > > > > > > > > > > > > > > > RATE_LIMITED a new error code? It's
>> not
>>     > > > mentioned
>>     > > > > > in
>>     > > > > > > > the
>>     > > > > > > > > > "New
>>     > > > > > > > > > > > Error
>>     > > > > > > > > > > > > > > > Codes"
>>     > > > > > > > > > > > > > > > > > section.
>>     > > > > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > > > > That's a leftover, it should be using
>> the
>>     > > > standard
>>     > > > > > > > > > ThrottleTime
>>     > > > > > > > > > > > > > > > mechanism.
>>     > > > > > > > > > > > > > > > > Fixed.
>>     > > > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > > > > > 7. In the section "Standard client
>> resource
>>     > > > > labels"
>>     > > > > > > > > > > > application_id
>>     > > > > > > > > > > > > > is
>>     > > > > > > > > > > > > > > > > > described as Kafka Streams only,
>> but the
>>     > > > section
>>     > > > > of
>>     > > > > > > > > "Client
>>     > > > > > > > > > > > > > > > Identification"
>>     > > > > > > > > > > > > > > > > > talks about "application instance
>> id as an
>>     > > > > optional
>>     > > > > > > > > future
>>     > > > > > > > > > > > > > > nice-to-have
>>     > > > > > > > > > > > > > > > > > that may be included as a metrics
>> label if
>>     > it
>>     > > > has
>>     > > > > > been
>>     > > > > > > > > set
>>     > > > > > > > > > by
>>     > > > > > > > > > > > the
>>     > > > > > > > > > > > > > > > user", so
>>     > > > > > > > > > > > > > > > > > I'm confused whether non-Kafka
>> Streams
>>     > > clients
>>     > > > > > should
>>     > > > > > > > set
>>     > > > > > > > > > an
>>     > > > > > > > > > > > > > > > application_id
>>     > > > > > > > > > > > > > > > > > or not.
>>     > > > > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > > > > I'll clarify this in the KIP, but
>> basically
>>     > we
>>     > > > > would
>>     > > > > > need
>>     > > > > > > > > to
>>     > > > > > > > > > > add
>>     > > > > > > > > > > > an `
>>     > > > > > > > > > > > > > > > > application.id` config
>>     > > > > > > > > > > > > > > > > property for non-streams clients for
>> this
>>     > > > purpose,
>>     > > > > > and
>>     > > > > > > > > that's
>>     > > > > > > > > > > > outside
>>     > > > > > > > > > > > > > > the
>>     > > > > > > > > > > > > > > > > scope of this KIP since we want to
>> make it
>>     > > > > > zero-conf:ish
>>     > > > > > > > on
>>     > > > > > > > > > the
>>     > > > > > > > > > > > > > client
>>     > > > > > > > > > > > > > > > side.
>>     > > > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > > > > > Kind regards,
>>     > > > > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > > > > > Tom
>>     > > > > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > > > > Thanks for the review,
>>     > > > > > > > > > > > > > > > > Magnus
>>     > > > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > > > > > On Thu, Oct 7, 2021 at 5:26 PM
>> Magnus
>>     > > Edenhill
>>     > > > <
>>     > > > > > > > > > > > magnus@edenhill.se
>>     > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > > > wrote:
>>     > > > > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > > > > > > Hi all,
>>     > > > > > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > > > > > > I've updated the KIP following
>> our recent
>>     > > > > > discussions
>>     > > > > > > > > on
>>     > > > > > > > > > > the
>>     > > > > > > > > > > > > > > mailing
>>     > > > > > > > > > > > > > > > > > list:
>>     > > > > > > > > > > > > > > > > > >  - split the protocol in two, one
>> for
>>     > > getting
>>     > > > > the
>>     > > > > > > > > metrics
>>     > > > > > > > > > > > > > > > subscriptions,
>>     > > > > > > > > > > > > > > > > > > and one for pushing the metrics.
>>     > > > > > > > > > > > > > > > > > >  - simplifications: initially
>> only one
>>     > > > > supported
>>     > > > > > > > > metrics
>>     > > > > > > > > > > > format,
>>     > > > > > > > > > > > > > no
>>     > > > > > > > > > > > > > > > > > > client.id in the instance id,
>> etc.
>>     > > > > > > > > > > > > > > > > > >  - made CLIENT_METRICS
>> subscription
>>     > > > > configuration
>>     > > > > > > > > entries
>>     > > > > > > > > > > > more
>>     > > > > > > > > > > > > > > > structured
>>     > > > > > > > > > > > > > > > > > >    and allowing better client
>> matching
>>     > > > > selectors
>>     > > > > > (not
>>     > > > > > > > > > only
>>     > > > > > > > > > > > on the
>>     > > > > > > > > > > > > > > > > > instance
>>     > > > > > > > > > > > > > > > > > > id, but also the other
>>     > > > > > > > > > > > > > > > > > >    client resource labels, such as
>>     > > > > > > > > client_software_name,
>>     > > > > > > > > > > > etc.).
>>     > > > > > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > > > > > > Unless there are further comments
>> I'll
>>     > call
>>     > > > the
>>     > > > > > vote
>>     > > > > > > > > in a
>>     > > > > > > > > > > > day or
>>     > > > > > > > > > > > > > > two.
>>     > > > > > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > > > > > > Regards,
>>     > > > > > > > > > > > > > > > > > > Magnus
>>     > > > > > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > > > > > > On Mon, Oct 4, 2021 at 20:57, Magnus Edenhill <magnus@edenhill.se> wrote:
>>     > > > > > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > > > > > > > Hi Gwen,
>>     > > > > > > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > > > > > > > I'm finishing up the KIP based
>> on the
>>     > > last
>>     > > > > > couple
>>     > > > > > > > of
>>     > > > > > > > > > > > discussion
>>     > > > > > > > > > > > > > > > points
>>     > > > > > > > > > > > > > > > > > in
>>     > > > > > > > > > > > > > > > > > > > this thread
>>     > > > > > > > > > > > > > > > > > > > and will call the Vote later
>> this week.
>>     > > > > > > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > > > > > > > Best,
>>     > > > > > > > > > > > > > > > > > > > Magnus
>>     > > > > > > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > > > > > > > On Sat, Oct 2, 2021 at 02:01, Gwen Shapira <gwen@confluent.io.invalid> wrote:
>>     > > > > > > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > > > > > > >> Hey,
>>     > > > > > > > > > > > > > > > > > > >>
>>     > > > > > > > > > > > > > > > > > > >> I noticed that there was no
>> discussion
>>     > > for
>>     > > > > the
>>     > > > > > > > last
>>     > > > > > > > > 10
>>     > > > > > > > > > > > days,
>>     > > > > > > > > > > > > > > but I
>>     > > > > > > > > > > > > > > > > > > >> couldn't
>>     > > > > > > > > > > > > > > > > > > >> find the vote thread. Is there
>> one
>>     > that
>>     > > > I'm
>>     > > > > > > > missing?
>>     > > > > > > > > > > > > > > > > > > >>
>>     > > > > > > > > > > > > > > > > > > >> Gwen
>>     > > > > > > > > > > > > > > > > > > >>
>>     > > > > > > > > > > > > > > > > > > >> On Wed, Sep 22, 2021 at 4:58
>> AM Magnus
>>     > > > > > Edenhill <
>>     > > > > > > > > > > > > > > > magnus@edenhill.se>
>>     > > > > > > > > > > > > > > > > > > >> wrote:
>>     > > > > > > > > > > > > > > > > > > >>
>>     > > > > > > > > > > > > > > > > > > >> > On Tue, Sep 21, 2021 at 06:58, Colin McCabe <cmccabe@apache.org> wrote:
>>     > > > > > > > > > > > > > > > > > > >> >
>>     > > > > > > > > > > > > > > > > > > >> > > On Mon, Sep 20, 2021, at
>> 17:35,
>>     > Feng
>>     > > > Min
>>     > > > > > > > wrote:
>>     > > > > > > > > > > > > > > > > > > >> > > > Thanks Magnus & Colin
>> for the
>>     > > > > > discussion.
>>     > > > > > > > > > > > > > > > > > > >> > > >
>>     > > > > > > > > > > > > > > > > > > >> > > > Based on KIP-714's
>> stateless
>>     > > design,
>>     > > > > > Client
>>     > > > > > > > > can
>>     > > > > > > > > > > > pretty
>>     > > > > > > > > > > > > > > much
>>     > > > > > > > > > > > > > > > use
>>     > > > > > > > > > > > > > > > > > > any
>>     > > > > > > > > > > > > > > > > > > >> > > > connection to any broker
>> to send
>>     > > > > > metrics. We
>>     > > > > > > > > are
>>     > > > > > > > > > > not
>>     > > > > > > > > > > > > > > > associating
>>     > > > > > > > > > > > > > > > > > > >> > > connection
>>     > > > > > > > > > > > > > > > > > > >> > > > with client metric
>> state. Is my
>>     > > > > > > > understanding
>>     > > > > > > > > > > > correct?
>>     > > > > > > > > > > > > > If
>>     > > > > > > > > > > > > > > > yes,
>>     > > > > > > > > > > > > > > > > > > how
>>     > > > > > > > > > > > > > > > > > > >> > about
>>     > > > > > > > > > > > > > > > > > > >> > > > the following two
>> scenarios
>>     > > > > > > > > > > > > > > > > > > >> > > >
>>     > > > > > > > > > > > > > > > > > > >> > > > 1) One Client (Client-ID)
>>     > > registers
>>     > > > > two
>>     > > > > > > > > > different
>>     > > > > > > > > > > > client
>>     > > > > > > > > > > > > > > > > > instance
>>     > > > > > > > > > > > > > > > > > > ids
>>     > > > > > > > > > > > > > > > > > > >> > via
>>     > > > > > > > > > > > > > > > > > > >> > > > separate registration.
>> Is it
>>     > > > > permitted?
>>     > > > > > If
>>     > > > > > > > OK,
>>     > > > > > > > > > how
>>     > > > > > > > > > > > to
>>     > > > > > > > > > > > > > > > > > distinguish
>>     > > > > > > > > > > > > > > > > > > >> them
>>     > > > > > > > > > > > > > > > > > > >> > > from
>>     > > > > > > > > > > > > > > > > > > >> > > > the case 2 below.
>>     > > > > > > > > > > > > > > > > > > >> > > >
>>     > > > > > > > > > > > > > > > > > > >> > >
>>     > > > > > > > > > > > > > > > > > > >> > > Hi Feng,
>>     > > > > > > > > > > > > > > > > > > >> > >
>>     > > > > > > > > > > > > > > > > > > >> > > My understanding, which
>> Magnus can
>>     > > > > > clarify I
>>     > > > > > > > > > guess,
>>     > > > > > > > > > > is
>>     > > > > > > > > > > > > > that
>>     > > > > > > > > > > > > > > > you
>>     > > > > > > > > > > > > > > > > > > could
>>     > > > > > > > > > > > > > > > > > > >> > have
>>     > > > > > > > > > > > > > > > > > > >> > > something like two Producer
>>     > > instances
>>     > > > > > running
>>     > > > > > > > > with
>>     > > > > > > > > > > the
>>     > > > > > > > > > > > > > same
>>     > > > > > > > > > > > > > > > > > > client.id
>>     > > > > > > > > > > > > > > > > > > >> > > (perhaps because they're
>> using the
>>     > > > same
>>     > > > > > config
>>     > > > > > > > > > file,
>>     > > > > > > > > > > > for
>>     > > > > > > > > > > > > > > > example).
>>     > > > > > > > > > > > > > > > > > > >> They
>>     > > > > > > > > > > > > > > > > > > >> > > could even be in the same
>> process.
>>     > > But
>>     > > > > > they
>>     > > > > > > > > would
>>     > > > > > > > > > > get
>>     > > > > > > > > > > > > > > separate
>>     > > > > > > > > > > > > > > > > > > UUIDs.
>>     > > > > > > > > > > > > > > > > > > >> > >
>>     > > > > > > > > > > > > > > > > > > >> > > I believe Magnus used the
>> term
>>     > > client
>>     > > > to
>>     > > > > > mean
>>     > > > > > > > > > > > "Producer or
>>     > > > > > > > > > > > > > > > > > > Consumer".
>>     > > > > > > > > > > > > > > > > > > >> So
>>     > > > > > > > > > > > > > > > > > > >> > > if you have both a
>> Producer and a
>>     > > > > > Consumer in
>>     > > > > > > > > your
>>     > > > > > > > > > > > > > > > application I
>>     > > > > > > > > > > > > > > > > > > would
>>     > > > > > > > > > > > > > > > > > > >> > > expect you'd get separate
>> UUIDs
>>     > for
>>     > > > > both.
>>     > > > > > > > Again
>>     > > > > > > > > > > > Magnus can
>>     > > > > > > > > > > > > > > > chime
>>     > > > > > > > > > > > > > > > > > in
>>     > > > > > > > > > > > > > > > > > > >> > here, I
>>     > > > > > > > > > > > > > > > > > > >> > > guess.
>>     > > > > > > > > > > > > > > > > > > >> > >
>>     > > > > > > > > > > > > > > > > > > >> >
>>     > > > > > > > > > > > > > > > > > > >> > That's correct.
>>     > > > > > > > > > > > > > > > > > > >> >
>>     > > > > > > > > > > > > > > > > > > >> >
>>     > > > > > > > > > > > > > > > > > > >> > >
>>     > > > > > > > > > > > > > > > > > > >> > > > 2) How about the client
>>     > > restarting?
>>     > > > > > What's
>>     > > > > > > > the
>>     > > > > > > > > > > > > > > expectation?
>>     > > > > > > > > > > > > > > > > > Should
>>     > > > > > > > > > > > > > > > > > > >> the
>>     > > > > > > > > > > > > > > > > > > >> > > > server expect the client
>> to
>>     > carry
>>     > > a
>>     > > > > > > > persisted
>>     > > > > > > > > > > client
>>     > > > > > > > > > > > > > > > instance id
>>     > > > > > > > > > > > > > > > > > > or
>>     > > > > > > > > > > > > > > > > > > >> > > should
>>     > > > > > > > > > > > > > > > > > > >> > > > the client be treated as
>> a new
>>     > > > > instance?
>>     > > > > > > > > > > > > > > > > > > >> > >
>>     > > > > > > > > > > > > > > > > > > >> > > The KIP doesn't describe
>> any
>>     > > mechanism
>>     > > > > for
>>     > > > > > > > > > > > persistence,
>>     > > > > > > > > > > > > > so I
>>     > > > > > > > > > > > > > > > would
>>     > > > > > > > > > > > > > > > > > > >> assume
>>     > > > > > > > > > > > > > > > > > > >> > > that when you restart the
>> client
>>     > you
>>     > > > get
>>     > > > > > a new
>>     > > > > > > > > > > UUID. I
>>     > > > > > > > > > > > > > agree
>>     > > > > > > > > > > > > > > > that
>>     > > > > > > > > > > > > > > > > > it
>>     > > > > > > > > > > > > > > > > > > >> > would
>>     > > > > > > > > > > > > > > > > > > >> > > be good to spell this out.
>>     > > > > > > > > > > > > > > > > > > >> > >
>>     > > > > > > > > > > > > > > > > > > >> > >
>>     > > > > > > > > > > > > > > > > > > >> > Right, it will not be
>> persisted
>>     > since
>>     > > a
>>     > > > > > client
>>     > > > > > > > > > > instance
>>     > > > > > > > > > > > > > can't
>>     > > > > > > > > > > > > > > be
>>     > > > > > > > > > > > > > > > > > > >> restarted.
>>     > > > > > > > > > > > > > > > > > > >> >
>>     > > > > > > > > > > > > > > > > > > >> > Will update the KIP to make
>> this
>>     > > > clearer.
>>     > > > > > > > > > > > > > > > > > > >> >
>>     > > > > > > > > > > > > > > > > > > >> > /Magnus
>>     > > > > > > > > > > > > > > > > > > >> >
>>     > > > > > > > > > > > > > > > > > > >>
>>     > > > > > > > > > > > > > > > > > > >>
>>     > > > > > > > > > > > > > > > > > > >> --
>>     > > > > > > > > > > > > > > > > > > >> Gwen Shapira
>>     > > > > > > > > > > > > > > > > > > >> Engineering Manager | Confluent
>>     > > > > > > > > > > > > > > > > > > >> 650.450.2760 | @gwenshap
>>     > > > > > > > > > > > > > > > > > > >> Follow us: Twitter | blog
>>     > > > > > > > > > > > > > > > > > > >>
>>     > > > > > > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > >
>>     > > > > > > > > > > >
>>     > > > > > > > > > >
>>     > > > > > > > > >
>>     > > > > > > > >
>>     > > > > > > >
>>     > > > > > >
>>     > > > > >
>>     > > > >
>>     > > >
>>     > >
>>     >
>>
>

Re: [DISCUSS] KIP-714: Client metrics and observability

Posted by Kirk True <ki...@kirktrue.pro>.

Hi Jun,

On Tue, Jun 21, 2022, at 5:24 PM, Jun Rao wrote:
> Hi, Magnus, Kirk,
> 
> Thanks for the reply. A few more comments on your reply.
> 
> 100. I agree there are some benefits of having a set of standard metrics
> across all clients, but I am just wondering how practical it is, given that
> the proposal doesn't require this set like the Kafka protocol.
> 100.1 A client may not implement all or some of the standard metrics. Then,
> we won't have complete standardized names across clients.

True, a client need not implement all the metrics from the KIP. However, those that it does implement will use the names specified in the KIP. The rest of the metrics that a client doesn't implement should be considered as "reserved for future use."

> 100.2 The set of standard metrics needs to be common across all clients.
> For example, client.consumer.poll.latency implies that all clients
> implement a poll() interface. Is that true for all clients?
> client.producer.record.queue.bytes. Do all producers have queues? We
> probably need to make a pass of those metrics to see if they are indeed
> common across all clients.

There are certainly metrics that are not applicable for all client implementations. For example, some of the host-specific CPU timing metrics are "hard" to get on a JVM using standard Java APIs. Ultimately the client author must make a judgement call whether or not to implement a metric. If a given metric from the KIP is truly non-applicable for a client, the author would likely omit it from the client.

Regarding the request to "make a pass" of the clients, are there any client implementations in particular that I should consider reviewing?

I will make an effort to look at some of the more common clients to determine which metrics they expose. I'm a little concerned that this could take an outsized amount of effort, depending on the clients' documentation. Researching the code base of each client to ascertain the exposed metrics sounds very time-consuming.

> Also, a bunch of standard metrics have type
> Histogram. Java client doesn't have good Histogram support yet. I am also
> not sure if all clients support Histogram. Should we avoid Histogram type
> in standardized metrics?

That's a good question. I can try to get a feel for the existing histogram support in the ecosystem clients and report back.

The KIP does specify an alternate means to report histogram data using time-based averages:

"For [simplicity] a client implementation may choose to provide an average value as [a] Gauge instead of a Histogram. These averages should be using the original Histogram metric name + '.avg', e.g., 'client.request.rtt.avg'."

This approach offers lower fidelity, of course, but it's hopefully more useful in general to have _some_ data than _no_ data?

Perhaps we should replace histograms with this simplified implementation in the KIP, deferring proper histogram support to a future revision?
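
To make the ".avg" fallback quoted above concrete, here is a minimal sketch of how a client might track such an average. The AverageGauge class and its method names are my own invention for illustration, not something the KIP defines; only the "original name + '.avg'" naming rule comes from the KIP text.

```java
/** Hypothetical helper (not from the KIP): averages recorded values for export as a Gauge. */
final class AverageGauge {
    private final String name;
    private long count;
    private double sum;

    AverageGauge(String histogramName) {
        // Naming rule quoted from the KIP: original Histogram metric name + ".avg".
        this.name = histogramName + ".avg";
    }

    String name() {
        return name;
    }

    /** Records one observation, e.g. a request round-trip time in milliseconds. */
    synchronized void record(double value) {
        sum += value;
        count++;
    }

    /** Returns the average since the previous call (0.0 if nothing was recorded), then resets. */
    synchronized double readAndReset() {
        double average = (count == 0) ? 0.0 : sum / count;
        sum = 0.0;
        count = 0;
        return average;
    }
}
```

A client would call record() on each observation and readAndReset() once per push interval. Resetting on read is just one possible choice; a sliding time window would work too.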

> 100.3 For a subset of metrics that are truly common across clients, it
> would be confusing for each client to maintain two sets of metrics for the
> same thing. We could document them, but that means every user of every
> client needs to remember this mapping. This is a much bigger
> inconvenience than standardizing the metric names on the server side. If we
> want to go this route, my preference is to deprecate the existing metric
> names that are covered by the standard metric names.

Ah, good point. I admit my focus is too Java-centric.

I want to make sure I understand more specifically what "the server" is in your point regarding 'standardizing the metric names on the server.' At some point there needs to be code that executes on the server that has knowledge of all the clients' metric names as well as a given organization's preferred metric names. Would this code live in the main Apache Kafka repo? Or is it in the organization's ClientTelemetryReceiver implementation? Or somewhere else?

How about introducing a new pluggable mechanism/interface that the broker invokes to determine the metric name mapping? We could provide two out-of-the-box implementations: 1) a default no-op mapper, and 2) a configuration file-based mapper that operates off something akin to a set of Java properties files (one mapping file for each known client). The implementation of the mapper is configured by the cluster administrator and, of course, each organization can provide their own implementation.
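
A rough sketch of what I have in mind follows. To be clear, MetricNameMapper and both implementations are hypothetical names made up for this email; nothing like them exists in the KIP or in the Kafka code base today. The properties-based variant assumes the caller has already loaded one java.util.Properties object per known client from its mapping file.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

/** Hypothetical interface; nothing like this exists in KIP-714 or the Kafka code base today. */
interface MetricNameMapper {
    String map(String clientName, String metricName);
}

/** Default: pass every metric name through unchanged. */
final class NoOpMetricNameMapper implements MetricNameMapper {
    @Override
    public String map(String clientName, String metricName) {
        return metricName;
    }
}

/** Looks names up in one Properties object per known client; unknown names pass through. */
final class PropertiesMetricNameMapper implements MetricNameMapper {
    private final Map<String, Properties> mappingsByClient;

    PropertiesMetricNameMapper(Map<String, Properties> mappingsByClient) {
        // Defensive copy so later changes to the caller's map do not affect lookups.
        this.mappingsByClient = new HashMap<>(mappingsByClient);
    }

    @Override
    public String map(String clientName, String metricName) {
        Properties mappings = mappingsByClient.get(clientName);
        return (mappings == null) ? metricName : mappings.getProperty(metricName, metricName);
    }
}
```

Falling back to the original name for unmapped metrics keeps client-specific metrics flowing even when the mapping files are incomplete.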

> 101. "or if the client-specific metrics are not converted to some common
> form, name, semantic, etc, it'll make creating meaningful aggregations and
> monitoring more complex in the upstream telemetry system with a scattered
> plethora of custom metrics." There will always be client specific metrics.
> So, it seems that we have to deal with scattered custom metrics even with a
> set of standard metrics.

Yes, this is true.

I do believe the KIP should establish a clear means to communicate about the different metrics and their meaning.

When a team is troubleshooting a high-severity incident, these client metrics provide a powerful tool to understand, remediate, and resolve those incidents. The goal of standardizing the metric names is to minimize communication roadblocks in that effort.

> 102. "However, in, let's say the Apache Kafka 3.7 release, the metric name
> is changed to "connections.open.count." At this point, there are two names
> and machine-to-machine communication will likely be affected. With that
> change, all client telemetry plugin(s) used in an organization must be
> updated to reflect that change, else data loss or bugs could be
> introduced." The standard metric names could change too in the future,
> right? So, we need to deal with a similar problem if that happens.

Also true :)

But the metric names, when standardized via a KIP, would undergo a well-known process when being changed in the future. Any metric name changes would be required to be included in a KIP and would require the old and new metric names to co-exist for a period of X releases. This would give teams that are upgrading to newer Kafka versions clear and consistent advance notice to make the needed changes on their end.
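
As an illustration of the co-existence period, a client could report a renamed metric under both names until the old one is removed. This is only a sketch under my own assumptions: DeprecatedNameEmitter is a made-up class, and the metric names are the hypothetical ones from the example quoted above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch: emit a renamed metric under both names while the old name is deprecated. */
final class DeprecatedNameEmitter {
    // Maps each new name to the deprecated name that must still be emitted for now.
    private final Map<String, String> deprecatedByNewName = new LinkedHashMap<>();

    void rename(String oldName, String newName) {
        deprecatedByNewName.put(newName, oldName);
    }

    /** Returns the name/value pairs to report: the current name plus any deprecated alias. */
    Map<String, Double> emit(String name, double value) {
        Map<String, Double> out = new LinkedHashMap<>();
        out.put(name, value);
        String deprecated = deprecatedByNewName.get(name);
        if (deprecated != null) {
            out.put(deprecated, value);
        }
        return out;
    }
}
```

Once the deprecation window has passed, dropping the rename() registration stops the old name from being emitted.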

Granted, custom, client-specific metrics don't go through the KIP process. We don't "own" that code or their processes, so any usage of client-specific metrics runs the risk of a caveat emptor situation.

> 103. "Are there any inobvious security/privacy-related edge cases where
> shipping certain metrics to the broker would be "bad?"" I am not sure. But
> if a metric can be shipped to the server, it would be useful for the same
> metric to be visible on the client side.

Agreed. The question is, does the reverse hold true?

Thanks Jun!!!!

Kirk

> 
> Thanks,
> 
> Jun
> 
> 
> On Tue, Jun 21, 2022 at 8:19 AM Kirk True <ki...@kirktrue.pro> wrote:
> 
> > Hi Jun,
> >
> > Thank you for all your continued interest in shaping the KIP :)
> >
> > On Thu, Jun 16, 2022, at 2:38 PM, Jun Rao wrote:
> > > Hi, Kirk,
> > >
> > > Thanks for the reply. A couple of more comments.
> > >
> > > (1) "Another perspective is that these two sets of metrics serve
> > different
> > > purposes and/or have different audiences, which argues that they should
> > > maintain their individuality and purpose. " Hmm, I am wondering if those
> > > metrics are really for different audiences and purposes? For example, if
> > > the operator detected an issue through a client metric collected through
> > > the server, the operator may need to communicate that back to the client.
> > > It would be weird if that same metric is not visible on the client side.
> >
> > I agree in principle that all client metrics visible on the client can
> > also be available to be sent to the broker.
> >
> > Are there any inobvious security/privacy-related edge cases where shipping
> > certain metrics to the broker would be "bad?"
> >
> > > (2) If we could standardize the names on the server side, do we need to
> > > enforce a naming convention for all clients?
> >
> > "Enforce" is such an ugly word :P
> >
> > But yes, I do feel that a consistent naming convention across all clients
> > provides communication benefits between two entities:
> >
> >  1. Human-to-human communication. Ecosystem-wide agreement and
> > understanding of metrics helps all to communicate more efficiently.
> >  2. Machine-to-machine communication. Defining the names via the KIP
> > mechanism helps to ensure stability across releases of a given client.
> >
> > Point 1: Human-to-human Communication
> >
> > There are quite a handful of parties that must communicate effectively
> > across the Kafka ecosystem. Here are the ones I can think of off the top of
> > my head:
> >
> >  1. Kafka client authors
> >  2. Kafka client users
> >  3. Kafka client telemetry plugin authors
> >  4. Support teams (within an organization or vendor-supplied across
> > organizations)
> >  5. Kafka cluster operators
> >
> > There should be a standard so that these parties can understand the
> > metrics' meaning and be able to correlate that across all clients.
> >
> > As a concrete example, KIP-714 includes a metric for tracking the number
> > of active client connections to a cluster, named
> > "org.apache.kafka.client.connection.active." Given this name, all client
> > implementations can communicate this name and its value to all parties
> > consistently. Without a standard naming convention, the metric might be
> > named "connections.open" in the Java client and "Connections/Alive" in
> > librdkafka. This inconsistency of naming would impact the discussions
> > between one or more of the parties involved.
> >
> > To your point, it's absolutely a design choice to keep the naming
> > convention the same between each client. We can change that if it makes
> > sense.
> >
> > Point 2: Machine-to-machine Communication
> >
> > Standardization at the client level provides stability through an implied
> > contract that a client should not introduce a breaking name change between
> > releases. Otherwise, the ability for the metrics to be "understood" in a
> > machine-to-machine context would be forfeit.
> >
> > For example, let's say that we give the clients the latitude to name
> > metrics as they wish. In this example, let's say that the Apache Kafka 3.4
> > release decides to name this metric "connections.open." It's a good name!
> > It says what it is. However, in, let's say the Apache Kafka 3.7 release,
> > the metric name is changed to "connections.open.count." At this point,
> > there are two names and machine-to-machine communication will likely be
> > affected. With that change, all client telemetry plugin(s) used in an
> > organization must be updated to reflect that change, else data loss or bugs
> > could be introduced.
> >
> > That the KIP defines the names of the metrics does, admittedly, constrain
> > the options of authors of the different clients. The metric named
> > "org.apache.kafka.client.connection.active" may be confusing in some client
> > implementations. For whatever reason, a client author may even find it
> > "undesirable" to include a reference that includes "Apache" in their code.
> >
> > There's also the precedent set by the existing (JMX-based) client metrics.
> > Though these are applicable only to the Java client, we can see that having
> > a standardized naming convention there has helped with communication.
> >
> > So, IMO, it makes sense to define the metric names via the KIP mechanism
> > and--let's say, "ask"--that client implementations abide by those.
> >
> > Thanks,
> > Kirk
> >
> > >
> > > Thanks,
> > >
> > > Jun
> > >
> > > On Thu, Jun 16, 2022 at 12:00 PM Kirk True <ki...@kirktrue.pro> wrote:
> > >
> > > > Hi Jun,
> > > >
> > > > I'll try to answer the questions posed...
> > > >
> > > > On Tue, Jun 7, 2022, at 4:32 PM, Jun Rao wrote:
> > > > > Hi, Magnus,
> > > > >
> > > > > Thanks for the reply.
> > > > >
> > > > > So, the standard set of generic metrics is just a recommendation and
> > not
> > > > a
> > > > > requirement? This sounds good to me since it makes the adoption of
> > the
> > > > KIP
> > > > > easier.
> > > >
> > > > I believe that was the intent, yes.
> > > >
> > > > > Regarding the metric names, I have two concerns.
> > > >
> > > > (I'm splitting these two up for readability...)
> > > >
> > > > > (1) If a client already
> > > > > has an existing metric similar to the standard one, duplicating the
> > > > metric
> > > > > seems to be confusing.
> > > >
> > > > Agreed. I'm dealing with that situation as I write the Java client
> > > > implementation.
> > > >
> > > > The existing Java client exposes a set of metrics via JMX. The updated
> > > > Java client will introduce a second set of metrics, which instead are
> > > > exposed via sending them to the broker. There is substantial overlap
> > with
> > > > the two sets of metrics and in a few places in the code under
> > development,
> > > > there are essentially two separate calls to update metrics: one for the
> > > > JMX-bound metrics and one for the broker-bound metrics.
> > > >
> > > > To be candid, I have gone back-and-forth on that design. From one
> > > > perspective, it could be argued that the set of client metrics should
> > be
> > > > standardized across a given client, regardless of how those metrics are
> > > > exposed for consumption. Another perspective is that these two sets of
> > > > metrics serve different purposes and/or have different audiences, which
> > > > argues that they should maintain their individuality and purpose. Your
> > > > inputs/suggestions are certainly welcome!
> > > >
> > > > > (2) If a client needs to implement a standard metric
> > > > > that doesn't exist yet, using a naming convention (e.g., using dash
> > vs
> > > > dot)
> > > > > different from other existing metrics also seems a bit confusing. It
> > > > seems
> > > > > that the main benefit of having standard metric names across clients
> > is
> > > > for
> > > > > better server side monitoring. Could we do the standardization in the
> > > > > plugin on the server?
> > > >
> > > > I think the expectation is that the plugin implementation will perform
> > > > transformation of metric names, if needed, to fit in with an
> > organization's
> > > > monitoring naming standards. Perhaps we need to call that out in the
> > KIP
> > > > itself.
> > > >
> > > > Thanks,
> > > > Kirk
> > > >
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Jun
> > > > >
> > > > >
> > > > >
> > > > > On Tue, Jun 7, 2022 at 6:53 AM Magnus Edenhill <ma...@edenhill.se>
> > > > wrote:
> > > > >
> > > > > > Hey Jun,
> > > > > >
> > > > > > I've clarified the scope of the standard metrics in the KIP, but
> > > > basically:
> > > > > >
> > > > > >  * We define a standard set of generic metrics that should be
> > relevant
> > > > to
> > > > > > most client implementations, e.g., each producer implementation
> > > > probably
> > > > > > has some sort of per-partition message queue.
> > > > > >  * A client implementation should strive to implement as many of
> > the
> > > > > > standard metrics as possible, but only the ones that make sense.
> > > > > >  * For metrics that are not in the standard set, a client
> > maintainer
> > > > can
> > > > > > choose to either submit a KIP to add additional standard metrics -
> > if
> > > > > > they're relevant, or go ahead and add custom metrics that are
> > specific
> > > > to
> > > > > > that client implementation. These custom metrics will have a prefix
> > > > > > specific to that client implementation, as opposed to the standard
> > > > metric
> > > > > > set that resides under "org.apache.kafka...". E.g.,
> > > > > > "se.edenhill.librdkafka" or whatever.
> > > > > >  * Existing non-KIP-714 metrics should remain untouched. In some
> > cases
> > > > we
> > > > > > might be able to use the same meter given it is compatible with the
> > > > > > standard metric set definition, in other cases a semi-duplicate
> > meter
> > > > may
> > > > > > be needed. Thus this will not affect the metrics exposed through
> > JMX,
> > > > or
> > > > > > vice versa.
> > > > > >
> > > > > > Thanks,
> > > > > > Magnus
> > > > > >
> > > > > >
> > > > > >
> > > > > > Den ons 1 juni 2022 kl 18:55 skrev Jun Rao
> > <ju...@confluent.io.invalid>:
> > > > > >
> > > > > > > Hi, Magnus,
> > > > > > >
> > > > > > > 51. Just to clarify my question.  (1) Are standard metrics
> > required
> > > > for
> > > > > > > every client for this KIP to function?  (2) Are we converting
> > > > existing
> > > > > > java
> > > > > > > metrics to the standard metrics and deprecating the old ones? If
> > so,
> > > > > > could
> > > > > > > we list all existing java metrics that need to be renamed and the
> > > > > > > corresponding new name?
> > > > > > >
> > > > > > > Thanks,
> > > > > > >
> > > > > > > Jun
> > > > > > >
> > > > > > > On Tue, May 31, 2022 at 3:29 PM Jun Rao <ju...@confluent.io>
> > wrote:
> > > > > > >
> > > > > > > > Hi, Magnus,
> > > > > > > >
> > > > > > > > Thanks for the reply.
> > > > > > > >
> > > > > > > > 51. I think it's fine to have a list of recommended metrics for
> > > > every
> > > > > > > > client to implement. I am just not sure that standardizing on
> > the
> > > > > > metric
> > > > > > > > names across all clients is practical. The list of common
> > metrics
> > > > in
> > > > > > the
> > > > > > > > KIP have completely different names from the java metric names.
> > > > Some of
> > > > > > > > them have different types. For example, some of the common
> > metrics
> > > > > > have a
> > > > > > > > type of histogram, but the java client metrics don't use
> > histogram
> > > > in
> > > > > > > > general. Requiring the operator to translate those names and
> > > > understand
> > > > > > > the
> > > > > > > > subtle differences across clients seem to cause more confusion
> > > > during
> > > > > > > > troubleshooting.
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > >
> > > > > > > > Jun
> > > > > > > >
> > > > > > > > On Tue, May 31, 2022 at 5:02 AM Magnus Edenhill <
> > > > magnus@edenhill.se>
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > >> Den fre 20 maj 2022 kl 01:23 skrev Jun Rao
> > > > <jun@confluent.io.invalid
> > > > > > >:
> > > > > > > >>
> > > > > > > > > >> > Hi, Magnus,
> > > > > > > >> >
> > > > > > > >> > Thanks for the reply.
> > > > > > > >> >
> > > > > > > >> > 50. Sounds good.
> > > > > > > >> >
> > > > > > > > > >> > 51. I misunderstood the proposal in the KIP then. The
> > > > proposal is
> > > > > > to
> > > > > > > >> > define a set of common metric names that every client should
> > > > > > > implement.
> > > > > > > >> The
> > > > > > > >> > problem is that every client already has its own set of
> > metrics
> > > > with
> > > > > > > its
> > > > > > > >> > own names. I am not sure that we could easily agree upon a
> > > > common
> > > > > > set
> > > > > > > of
> > > > > > > >> > metrics that work with all clients. There are likely to be
> > some
> > > > > > > metrics
> > > > > > > >> > that are client specific. Translating between the common
> > name
> > > > and
> > > > > > > client
> > > > > > > >> > specific name is probably going to add more confusion. As
> > > > mentioned
> > > > > > in
> > > > > > > >> the
> > > > > > > >> > KIP, similar metrics from different clients could have
> > subtle
> > > > > > > >> > semantic differences. Could we just let each client use its
> > own
> > > > set
> > > > > > of
> > > > > > > >> > metric names?
> > > > > > > >> >
> > > > > > > >>
> > > > > > > >> We identified a common set of metrics that should be relevant
> > for
> > > > most
> > > > > > > >> client implementations,
> > > > > > > >> they're the ones listed in the KIP.
> > > > > > > >> A supporting client does not have to implement all those
> > metrics,
> > > > only
> > > > > > > the
> > > > > > > >> ones that makes sense
> > > > > > > >> based on that client implementation, and a client may
> > implement
> > > > other
> > > > > > > >> metrics that are not listed
> > > > > > > >> in the KIP under its own namespace.
> > > > > > > >> This approach has two benefits:
> > > > > > > >>  - there will be a common set of metrics that most/all clients
> > > > > > > implement,
> > > > > > > >> which makes monitoring
> > > > > > > >>   and troubleshooting easier across fleets with multiple Kafka
> > > > client
> > > > > > > >> languages/implementations.
> > > > > > > >>  - client-specific metrics are still possible, so if there is
> > no
> > > > > > > suitable
> > > > > > > >> standard metric a client can still
> > > > > > > >>    provide what special metrics it has.
> > > > > > > >>
> > > > > > > >>
> > > > > > > >> Thanks,
> > > > > > > >> Magnus
> > > > > > > >>
> > > > > > > >>
> > > > > > > >> On Thu, May 19, 2022 at 10:39 AM Magnus Edenhill <
> > > > magnus@edenhill.se>
> > > > > > > >> wrote:
> > > > > > > >> >
> > > > > > > >> > > Den ons 18 maj 2022 kl 19:57 skrev Jun Rao
> > > > > > <jun@confluent.io.invalid
> > > > > > > >> >:
> > > > > > > >> > >
> > > > > > > >> > > > Hi, Magnus,
> > > > > > > >> > > >
> > > > > > > >> > >
> > > > > > > >> > > Hi Jun
> > > > > > > >> > >
> > > > > > > >> > >
> > > > > > > >> > > >
> > > > > > > >> > > > Thanks for the updated KIP. Just a couple of more
> > comments.
> > > > > > > >> > > >
> > > > > > > >> > > > 50. To troubleshoot a particular client issue, I imagine
> > > > that
> > > > > > the
> > > > > > > >> > client
> > > > > > > >> > > > needs to identify its client_instance_id. How does the
> > > > client
> > > > > > find
> > > > > > > >> this
> > > > > > > >> > > > out? Do we plan to include client_instance_id in the
> > client
> > > > log,
> > > > > > > >> expose
> > > > > > > >> > > it
> > > > > > > >> > > > as a metric or something else?
> > > > > > > >> > > >
> > > > > > > >> > >
> > > > > > > >> > > The KIP suggests that client implementations emit an
> > > > informative
> > > > > > log
> > > > > > > >> > > message
> > > > > > > >> > > with the assigned client-instance-id once it is retrieved
> > > > (once
> > > > > > per
> > > > > > > >> > client
> > > > > > > >> > > instance lifetime).
> > > > > > > >> > > There's also a clientInstanceId() method that an
> > application
> > > > can
> > > > > > use
> > > > > > > >> to
> > > > > > > >> > > retrieve
> > > > > > > >> > > the client instance id and emit through whatever side
> > channels
> > > > > > makes
> > > > > > > >> > sense.
> > > > > > > >> > >
> > > > > > > >> > >
> > > > > > > >> > >
> > > > > > > >> > > > 51. The KIP lists a bunch of metrics that need to be
> > > > collected
> > > > > > at
> > > > > > > >> the
> > > > > > > >> > > > client side. However, it seems quite a few useful java
> > > > client
> > > > > > > >> metrics
> > > > > > > >> > > like
> > > > > > > >> > > > the following are missing.
> > > > > > > >> > > >     buffer-total-bytes
> > > > > > > >> > > >     buffer-available-bytes
> > > > > > > >> > > >
> > > > > > > >> > >
> > > > > > > >> > > These are covered by client.producer.record.queue.bytes
> > and
> > > > > > > >> > > client.producer.record.queue.max.bytes.
> > > > > > > >> > >
> > > > > > > >> > >
> > > > > > > >> > > >     bufferpool-wait-time
> > > > > > > >> > > >
> > > > > > > >> > >
> > > > > > > >> > > Missing, but somewhat implementation specific.
> > > > > > > >> > > If it was up to me we would add this later if there's a
> > need.
> > > > > > > >> > >
> > > > > > > >> > >
> > > > > > > >> > >
> > > > > > > >> > > >     batch-size-avg
> > > > > > > >> > > >     batch-size-max
> > > > > > > >> > > >
> > > > > > > >> > >
> > > > > > > >> > > These are missing and would be suitably represented as a
> > > > > > histogram.
> > > > > > > >> I'll
> > > > > > > >> > > add them.
> > > > > > > >> > >
> > > > > > > >> > >
> > > > > > > >> > >
> > > > > > > >> > > >     io-wait-ratio
> > > > > > > >> > > >     io-ratio
> > > > > > > >> > > >
> > > > > > > >> > >
> > > > > > > >> > > There's client.io.wait.time which should cover
> > io-wait-ratio.
> > > > > > > >> > > We could add a client.io.time as well, now or in a later
> > KIP.
> > > > > > > >> > >
> > > > > > > >> > > Thanks,
> > > > > > > >> > > Magnus
> > > > > > > >> > >
> > > > > > > >> > >
> > > > > > > >> > >
> > > > > > > >> > >
> > > > > > > >> > > >
> > > > > > > >> > > > Thanks,
> > > > > > > >> > > >
> > > > > > > >> > > > Jun
> > > > > > > >> > > >
> > > > > > > >> > > > On Mon, Apr 4, 2022 at 10:01 AM Jun Rao <
> > jun@confluent.io>
> > > > > > wrote:
> > > > > > > >> > > >
> > > > > > > >> > > > > Hi, Xavier,
> > > > > > > >> > > > >
> > > > > > > >> > > > > Thanks for the reply.
> > > > > > > >> > > > >
> > > > > > > >> > > > > 28. It does seem that we have started using
> > KafkaMetrics
> > > > on
> > > > > > the
> > > > > > > >> > broker
> > > > > > > >> > > > > side. Then, my only concern is on the usage of
> > Histogram
> > > > in
> > > > > > > >> > > KafkaMetrics.
> > > > > > > >> > > > > Histogram in KafkaMetrics statically divides the value
> > > > space
> > > > > > > into
> > > > > > > >> a
> > > > > > > >> > > fixed
> > > > > > > >> > > > > number of buckets and only returns values on the
> > bucket
> > > > > > > boundary.
> > > > > > > >> So,
> > > > > > > >> > > the
> > > > > > > >> > > > > returned histogram value may never show up in a
> > recorded
> > > > > > value.
> > > > > > > >> > Yammer
> > > > > > > >> > > > > Histogram, on the other hand, uses reservoir
> > sampling. The
> > > > > > > >> reported
> > > > > > > >> > > value
> > > > > > > >> > > > > is always one of the recorded values. So, I am not
> > sure
> > > > that
> > > > > > > >> > Histogram
> > > > > > > >> > > in
> > > > > > > >> > > > > KafkaMetrics is as good as Yammer Histogram.
> > > > > > > >> > > > ClientMetricsPluginExportTime
> > > > > > > >> > > > > uses Histogram.
> > > > > > > >> > > > >
> > > > > > > >> > > > > Thanks,
> > > > > > > >> > > > >
> > > > > > > >> > > > > Jun
> > > > > > > >> > > > >
> > > > > > > >> > > > > On Thu, Mar 31, 2022 at 5:21 PM Xavier Léauté
> > > > > > > >> > > > <xa...@confluent.io.invalid>
> > > > > > > >> > > > > wrote:
> > > > > > > >> > > > >
> > > > > > > >> > > > >> >
> > > > > > > >> > > > >> > 28. On the broker, we typically use Yammer metrics.
> > > > Only
> > > > > > for
> > > > > > > >> > metrics
> > > > > > > >> > > > >> that
> > > > > > > >> > > > >> > depend on Kafka metric features (e.g., quota), we
> > use
> > > > the
> > > > > > > Kafka
> > > > > > > >> > > > metric.
> > > > > > > >> > > > >> > Yammer metrics have 4 types: gauge, meter,
> > histogram
> > > > and
> > > > > > > timer.
> > > > > > > >> > > meter
> > > > > > > >> > > > >> > calculates a rate, but also exposes an accumulated
> > > > value.
> > > > > > > >> > > > >> >
> > > > > > > >> > > > >>
> > > > > > > >> > > > >> I don't see a good reason we should limit ourselves
> > to
> > > > Yammer
> > > > > > > >> > metrics
> > > > > > > >> > > on
> > > > > > > >> > > > >> the broker. KafkaMetrics was written
> > > > > > > >> > > > >> to replace Yammer metrics and is used for all new
> > > > components
> > > > > > > >> > (clients,
> > > > > > > >> > > > >> streams, connect, etc.)
> > > > > > > >> > > > >> My understanding is that the original goal was to
> > retire
> > > > > > Yammer
> > > > > > > >> > > metrics
> > > > > > > >> > > > in
> > > > > > > >> > > > >> the broker in favor of KafkaMetrics.
> > > > > > > >> > > > >> We just haven't done so out of backwards
> > compatibility
> > > > > > > concerns.
> > > > > > > >> > > > >> There are other broker metrics such as group
> > coordinator,
> > > > > > > >> > transaction
> > > > > > > >> > > > >> state
> > > > > > > >> > > > >> manager, and various socket server metrics
> > > > > > > >> > > > >> already using KafkaMetrics that don't need specific
> > Kafka
> > > > > > > metric
> > > > > > > >> > > > features,
> > > > > > > >> > > > >> so I don't see why we should refrain from using
> > > > > > > >> > > > >> Kafka metrics on the broker unless there are real
> > > > > > compatibility
> > > > > > > >> > > concerns
> > > > > > > >> > > > >> or
> > > > > > > >> > > > >> where implementation specifics could lead to
> > confusion
> > > > when
> > > > > > > >> > comparing
> > > > > > > >> > > > >> metrics using different implementations.
> > > > > > > >> > > > >>
> > > > > > > >> > > > >> In my opinion we should encourage people to use
> > > > KafkaMetrics
> > > > > > > >> going
> > > > > > > >> > > > forward
> > > > > > > >> > > > >> on the broker as well, for three reasons:
> > > > > > > >> > > > >> a) yammer metrics is long deprecated and no longer
> > > > maintained
> > > > > > > >> > > > >> b) yammer metrics are much less expressive
> > > > > > > >> > > > >> c) we don't have a proper API to expose yammer
> > metrics
> > > > > > outside
> > > > > > > of
> > > > > > > >> > JMX
> > > > > > > >> > > > >> (MetricsReporter only exposes KafkaMetrics)
> > > > > > > >> > > > >>
> > > > > > > >> > > > >
> > > > > > > >> > > >
> > > > > > > >> > >
> > > > > > > >> >
> > > > > > > >>
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] KIP-714: Client metrics and observability

Posted by Jun Rao <ju...@confluent.io.INVALID>.

Hi, Magnus, Kirk,

Thanks for the reply. A few more comments on your reply.

100. I agree there are some benefits to having a set of standard metrics
across all clients, but I am wondering how practical it is, given that the
proposal doesn't make this set mandatory in the way the Kafka protocol is.
100.1 A client may implement only some, or none, of the standard metrics.
Then, we won't have completely standardized names across clients.
100.2 The set of standard metrics needs to be common across all clients.
For example, client.consumer.poll.latency implies that all clients
implement a poll() interface. Is that true for all clients? Similarly,
client.producer.record.queue.bytes implies a queue. Do all producers have
queues? We probably need to make a pass over those metrics to see if they
are indeed common across all clients. Also, a bunch of standard metrics have
type Histogram. The Java client doesn't have good Histogram support yet, and
I am not sure that all clients support Histogram. Should we avoid the
Histogram type in standardized metrics?
100.3 For a subset of metrics that are truly common across clients, it
would be confusing for each client to maintain two sets of metrics for the
same thing. We could document them, but that means every user of every
client needs to remember this mapping. This is a much bigger
inconvenience than standardizing the metric names on the server side. If we
want to go this route, my preference is to deprecate the existing metric
names that are covered by the standard metric names.
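
To illustrate the Histogram concern in 100.2, here is a minimal sketch of
the behavior described earlier in this thread: a histogram with a fixed
number of buckets can only report bucket boundaries, so the reported value
may never have been recorded, while a sample-based estimate picks one of the
retained samples. The class and method names below are hypothetical and are
not the KafkaMetrics or Yammer APIs. The sample-based side is simplified to a
nearest-rank pick over all recorded values rather than a real reservoir.

```java
import java.util.Arrays;

/** Hypothetical fixed-bucket histogram; not the KafkaMetrics implementation. */
final class FixedBucketHistogram {
    private final double min;
    private final double width;
    private final long[] counts;
    private long total;

    FixedBucketHistogram(double min, double max, int buckets) {
        this.min = min;
        this.width = (max - min) / buckets;
        this.counts = new long[buckets];
    }

    void record(double value) {
        int idx = (int) ((value - min) / width);
        // Clamp out-of-range values into the first or last bucket.
        idx = Math.max(0, Math.min(counts.length - 1, idx));
        counts[idx]++;
        total++;
    }

    /** Returns the upper boundary of the bucket that holds the q-quantile. */
    double quantile(double q) {
        long target = (long) Math.ceil(q * total);
        long seen = 0;
        for (int i = 0; i < counts.length; i++) {
            seen += counts[i];
            if (seen >= target) {
                return min + (i + 1) * width;
            }
        }
        return min + counts.length * width;
    }
}

/** Simplified stand-in for a sample-based histogram (assumption: keeps every value). */
final class SampleQuantile {
    /** Nearest-rank quantile: always returns one of the recorded samples. */
    static double quantile(double[] samples, double q) {
        double[] sorted = samples.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(q * sorted.length);
        return sorted[Math.max(0, rank - 1)];
    }
}
```

With ten buckets over [0, 100) and recorded values 3, 7 and 12, the bucketed
median comes out as 10.0, which was never recorded, while the nearest-rank
median is 7.0.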

101. "or if the client-specific metrics are not converted to some common
form, name, semantic, etc, it'll make creating meaningful aggregations and
monitoring more complex in the upstream telemetry system with a scattered
plethora of custom metrics." There will always be client-specific metrics.
So, it seems that we have to deal with scattered custom metrics even with a
set of standard metrics.
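
As a rough sketch of what dealing with both kinds looks like upstream: per
the earlier description in this thread, standard metrics live under
"org.apache.kafka..." and custom metrics carry a client-specific prefix such
as "se.edenhill.librdkafka". A receiver could split the two by prefix. The
class, its methods, and the custom metric name used in the usage note are
hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper; not part of KIP-714 or any Kafka API. */
final class MetricNamespaces {
    // Prefix of the standard metric set, as described earlier in the thread.
    static final String STANDARD_PREFIX = "org.apache.kafka.";

    static boolean isStandard(String metricName) {
        return metricName.startsWith(STANDARD_PREFIX);
    }

    /** Returns the names that fall outside the standard namespace, in input order. */
    static List<String> clientSpecific(List<String> metricNames) {
        List<String> custom = new ArrayList<>();
        for (String name : metricNames) {
            if (!isStandard(name)) {
                custom.add(name);
            }
        }
        return custom;
    }
}
```

Given "org.apache.kafka.client.connection.active" and a made-up
"se.edenhill.librdkafka.example.metric", only the second is returned as
client-specific. Those metrics still need per-client handling, which is the
point above.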

102. "However, in, let's say the Apache Kafka 3.7 release, the metric name
is changed to "connections.open.count." At this point, there are two names
and machine-to-machine communication will likely be effected. With that
change, all client telemetry plugin(s) used in an organization must be
updated to reflect that change, else data loss or bugs could be
introduced." The standard metric names could change too in the future,
right? So, we need to deal with a similar problem if that happens.

103. "Are there any inobvious security/privacy-related edge cases where
shipping certain metrics to the broker would be "bad?"" I am not sure. But
if a metric can be shipped to the server, it would be useful for the same
metric to be visible on the client side.

Thanks,

Jun


On Tue, Jun 21, 2022 at 8:19 AM Kirk True <ki...@kirktrue.pro> wrote:

> Hi Jun,
>
> Thank you for all your continued interest in shaping the KIP :)
>
> On Thu, Jun 16, 2022, at 2:38 PM, Jun Rao wrote:
> > Hi, Kirk,
> >
> > Thanks for the reply. A couple of more comments.
> >
> > (1) "Another perspective is that these two sets of metrics serve
> different
> > purposes and/or have different audiences, which argues that they should
> > maintain their individuality and purpose. " Hmm, I am wondering if those
> > metrics are really for different audiences and purposes? For example, if
> > the operator detected an issue through a client metric collected through
> > the server, the operator may need to communicate that back to the client.
> > It would be weird if that same metric is not visible on the client side.
>
> I agree in the principal that all client metrics visible on the client can
> also be available to be sent to the broker.
>
> Are there any inobvious security/privacy-related edge cases where shipping
> certain metrics to the broker would be "bad?"
>
> > (2) If we could standardize the names on the server side, do we need to
> > enforce a naming convention for all clients?
>
> "Enforce" is such an ugly word :P
>
> But yes, I do feel that a consistent naming convention across all clients
> provides communication benefits between two entities:
>
>  1. Human-to-human communication. Ecosystem-wide agreement and
> understanding of metrics helps all to communicate more efficiently.
>  2. Machine-to-machine communication. Defining the names via the KIP
> mechanism help to ensure stabilization across releases of a given client.
>
> Point 1: Human-to-human Communication
>
> There are quite a handful of parties that must communicate effectively
> across the Kafka ecosystem. Here are the ones I can think of off the top of
> my head:
>
>  1. Kafka client authors
>  2. Kafka client users
>  3. Kafka client telemetry plugin authors
>  4. Support teams (within an organization or vendor-supplied across
> organizations)
>  5. Kafka cluster operators
>
> There should be a standard so that these parties can understand the
> metrics' meaning and be able to correlate that across all clients.
>
> As a concrete example, KIP-714 includes a metric for tracking the number
> of active client connections to a cluster, named
> "org.apache.kafka.client.connection.active." Given this name, all client
> implementations can communicate this name and its value to all parties
> consistently. Without a standard naming convention, the metric might be
> named "connections.open" in the Java client and "Connections/Alive" in
> librdkafka. This inconsistency of naming would impact the discussions
> between one or more of the parties involved.
>
> To your point, it's absolutely a design choice to keep the naming
> convention the same between each client. We can change that if it makes
> sense.
>
> Point 2: Machine-to-machine Communication
>
> Standardization at the client level provides stability through an implied
> contract that a client should not introduce a breaking name change between
> releases. Otherwise, the ability for the metrics to be "understood" in a
> machine-to-machine context would be forfeit.
>
> For example, let's say that we give the clients the latitude to name
> metrics as they wish. In this example, let's say that the Apache Kafka 3.4
> release decides to name this metric "connections.open." It's a good name!
> It says what it is. However, in, let's say the Apache Kafka 3.7 release,
> the metric name is changed to "connections.open.count." At this point,
> there are two names and machine-to-machine communication will likely be
> effected. With that change, all client telemetry plugin(s) used in an
> organization must be updated to reflect that change, else data loss or bugs
> could be introduced.
>
> That the KIP defines the names of the metrics does, admittedly, constrain
> the options of authors of the different clients. The metric named
> "org.apache.kafka.client.connection.active" may be confusing in some client
> implementations. For whatever reason, a client author may even find it
> "undesirable" to include a reference that includes "Apache" in their code.
>
> There's also the precedent set by the existing (JMX-based) client metrics.
> Though these are applicable only to the Java client, we can see that having
> a standardized naming convention there has helped with communication.
>
> So, IMO, it makes sense to define the metric names via the KIP mechanism
> and--let's say, "ask"--that client implementations abide by those.
>
> Thanks,
> Kirk
>
> >
> > Thanks,
> >
> > Jun
> >
> > On Thu, Jun 16, 2022 at 12:00 PM Kirk True <ki...@kirktrue.pro> wrote:
> >
> > > Hi Jun,
> > >
> > > I'll try to answer the questions posed...
> > >
> > > On Tue, Jun 7, 2022, at 4:32 PM, Jun Rao wrote:
> > > > Hi, Magnus,
> > > >
> > > > Thanks for the reply.
> > > >
> > > > So, the standard set of generic metrics is just a recommendation and
> not
> > > a
> > > > requirement? This sounds good to me since it makes the adoption of
> the
> > > KIP
> > > > easier.
> > >
> > > I believe that was the intent, yes.
> > >
> > > > Regarding the metric names, I have two concerns.
> > >
> > > (I'm splitting these two up for readability...)
> > >
> > > > (1) If a client already
> > > > has an existing metric similar to the standard one, duplicating the
> > > metric
> > > > seems to be confusing.
> > >
> > > Agreed. I'm dealing with that situation as I write the Java client
> > > implementation.
> > >
> > > The existing Java client exposes a set of metrics via JMX. The updated
> > > Java client will introduce a second set of metrics, which instead are
> > > exposed via sending them to the broker. There is substantial overlap
> with
> > > the two set of metrics and in a few places in the code under
> development,
> > > there are essentially two separate calls to update metrics: one for the
> > > JMX-bound metrics and one for the broker-bound metrics.
> > >
> > > To be candid, I have gone back-and-forth on that design. From one
> > > perspective, it could be argued that the set of client metrics should
> be
> > > standardized across a given client, regardless of how those metrics are
> > > exposed for consumption. Another perspective is that these two sets of
> > > metrics serve different purposes and/or have different audiences, which
> > > argues that they should maintain their individuality and purpose. Your
> > > inputs/suggestions are certainly welcome!
> > >
> > > > (2) If a client needs to implement a standard metric
> > > > that doesn't exist yet, using a naming convention (e.g., using dash
> vs
> > > dot)
> > > > different from other existing metrics also seems a bit confusing. It
> > > seems
> > > > that the main benefit of having standard metric names across clients
> is
> > > for
> > > > better server side monitoring. Could we do the standardization in the
> > > > plugin on the server?
> > >
> > > I think the expectation is that the plugin implementation will perform
> > > transformation of metric names, if needed, to fit in with an
> organization's
> > > monitoring naming standards. Perhaps we need to call that out in the
> KIP
> > > itself.
> > >
> > > Thanks,
> > > Kirk
> > >
> > > >
> > > > Thanks,
> > > >
> > > > Jun
> > > >
> > > >
> > > >
> > > > On Tue, Jun 7, 2022 at 6:53 AM Magnus Edenhill <ma...@edenhill.se>
> > > wrote:
> > > >
> > > > > Hey Jun,
> > > > >
> > > > > I've clarified the scope of the standard metrics in the KIP, but
> > > basically:
> > > > >
> > > > >  * We define a standard set of generic metrics that should be
> relevant
> > > to
> > > > > most client implementations, e.g., each producer implementation
> > > probably
> > > > > has some sort of per-partition message queue.
> > > > >  * A client implementation should strive to implement as many of
> the
> > > > > standard metrics as possible, but only the ones that make sense.
> > > > >  * For metrics that are not in the standard set, a client
> maintainer
> > > can
> > > > > choose to either submit a KIP to add additional standard metrics -
> if
> > > > > they're relevant, or go ahead and add custom metrics that are
> specific
> > > to
> > > > > that client implementation. These custom metrics will have a prefix
> > > > > specific to that client implementation, as opposed to the standard
> > > metric
> > > > > set that resides under "org.apache.kafka...". E.g.,
> > > > > "se.edenhill.librdkafka" or whatever.
> > > > >  * Existing non-KIP-714 metrics should remain untouched. In some
> cases
> > > we
> > > > > might be able to use the same meter given it is compatible with the
> > > > > standard metric set definition, in other cases a semi-duplicate
> meter
> > > may
> > > > > be needed. Thus this will not affect the metrics exposed through
> JMX,
> > > or
> > > > > vice versa.
> > > > >
> > > > > Thanks,
> > > > > Magnus
> > > > >
> > > > >
> > > > >
> > > > > Den ons 1 juni 2022 kl 18:55 skrev Jun Rao
> <ju...@confluent.io.invalid>:
> > > > >
> > > > > > Hi, Magnus,
> > > > > >
> > > > > > 51. Just to clarify my question.  (1) Are standard metrics
> required
> > > for
> > > > > > every client for this KIP to function?  (2) Are we converting
> > > existing
> > > > > java
> > > > > > metrics to the standard metrics and deprecating the old ones? If
> so,
> > > > > could
> > > > > > we list all existing java metrics that need to be renamed and the
> > > > > > corresponding new name?
> > > > > >
> > > > > > Thanks,
> > > > > >
> > > > > > Jun
> > > > > >
> > > > > > On Tue, May 31, 2022 at 3:29 PM Jun Rao <ju...@confluent.io>
> wrote:
> > > > > >
> > > > > > > Hi, Magnus,
> > > > > > >
> > > > > > > Thanks for the reply.
> > > > > > >
> > > > > > > 51. I think it's fine to have a list of recommended metrics for
> > > every
> > > > > > > client to implement. I am just not sure that standardizing on
> the
> > > > > metric
> > > > > > > names across all clients is practical. The list of common
> metrics
> > > in
> > > > > the
> > > > > > > KIP have completely different names from the java metric names.
> > > Some of
> > > > > > > them have different types. For example, some of the common
> metrics
> > > > > have a
> > > > > > > type of histogram, but the java client metrics don't use
> histogram
> > > in
> > > > > > > general. Requiring the operator to translate those names and
> > > understand
> > > > > > the
> > > > > > > subtle differences across clients seem to cause more confusion
> > > during
> > > > > > > troubleshooting.
> > > > > > >
> > > > > > > Thanks,
> > > > > > >
> > > > > > > Jun
> > > > > > >
> > > > > > > On Tue, May 31, 2022 at 5:02 AM Magnus Edenhill <
> > > magnus@edenhill.se>
> > > > > > > wrote:
> > > > > > >
> > > > > > >> Den fre 20 maj 2022 kl 01:23 skrev Jun Rao
> > > <jun@confluent.io.invalid
> > > > > >:
> > > > > > >>
> > > > > > >> > Hi, Magus,
> > > > > > >> >
> > > > > > >> > Thanks for the reply.
> > > > > > >> >
> > > > > > >> > 50. Sounds good.
> > > > > > >> >
> > > > > > >> > 51. I miss-understood the proposal in the KIP then. The
> > > proposal is
> > > > > to
> > > > > > >> > define a set of common metric names that every client should
> > > > > > implement.
> > > > > > >> The
> > > > > > >> > problem is that every client already has its own set of
> metrics
> > > with
> > > > > > its
> > > > > > >> > own names. I am not sure that we could easily agree upon a
> > > common
> > > > > set
> > > > > > of
> > > > > > >> > metrics that work with all clients. There are likely to be
> some
> > > > > > metrics
> > > > > > >> > that are client specific. Translating between the common
> name
> > > and
> > > > > > client
> > > > > > >> > specific name is probably going to add more confusion. As
> > > mentioned
> > > > > in
> > > > > > >> the
> > > > > > >> > KIP, similar metrics from different clients could have
> subtle
> > > > > > >> > semantic differences. Could we just let each client use its
> own
> > > set
> > > > > of
> > > > > > >> > metric names?
> > > > > > >> >
> > > > > > >>
> > > > > > >> We identified a common set of metrics that should be relevant
> for
> > > most
> > > > > > >> client implementations,
> > > > > > >> they're the ones listed in the KIP.
> > > > > > >> A supporting client does not have to implement all those
> metrics,
> > > only
> > > > > > the
> > > > > > >> ones that makes sense
> > > > > > >> based on that client implementation, and a client may
> implement
> > > other
> > > > > > >> metrics that are not listed
> > > > > > >> in the KIP under its own namespace.
> > > > > > >> This approach has two benefits:
> > > > > > >>  - there will be a common set of metrics that most/all clients
> > > > > > implement,
> > > > > > >> which makes monitoring
> > > > > > >>   and troubleshooting easier across fleets with multiple Kafka
> > > client
> > > > > > >> languages/implementations.
> > > > > > >>  - client-specific metrics are still possible, so if there is
> no
> > > > > > suitable
> > > > > > >> standard metric a client can still
> > > > > > >>    provide what special metrics it has.
> > > > > > >>
> > > > > > >>
> > > > > > >> Thanks,
> > > > > > >> Magnus
> > > > > > >>
> > > > > > >>
> > > > > > >> On Thu, May 19, 2022 at 10:39 AM Magnus Edenhill <
> > > magnus@edenhill.se>
> > > > > > >> wrote:
> > > > > > >> >
> > > > > > >> > > Den ons 18 maj 2022 kl 19:57 skrev Jun Rao
> > > > > <jun@confluent.io.invalid
> > > > > > >> >:
> > > > > > >> > >
> > > > > > >> > > > Hi, Magnus,
> > > > > > >> > > >
> > > > > > >> > >
> > > > > > >> > > Hi Jun
> > > > > > >> > >
> > > > > > >> > >
> > > > > > >> > > >
> > > > > > >> > > > Thanks for the updated KIP. Just a couple of more
> comments.
> > > > > > >> > > >
> > > > > > >> > > > 50. To troubleshoot a particular client issue, I imagine
> > > that
> > > > > the
> > > > > > >> > client
> > > > > > >> > > > needs to identify its client_instance_id. How does the
> > > client
> > > > > find
> > > > > > >> this
> > > > > > >> > > > out? Do we plan to include client_instance_id in the
> client
> > > log,
> > > > > > >> expose
> > > > > > >> > > it
> > > > > > >> > > > as a metric or something else?
> > > > > > >> > > >
> > > > > > >> > >
> > > > > > >> > > The KIP suggests that client implementations emit an
> > > informative
> > > > > log
> > > > > > >> > > message
> > > > > > >> > > with the assigned client-instance-id once it is retrieved
> > > (once
> > > > > per
> > > > > > >> > client
> > > > > > >> > > instance lifetime).
> > > > > > >> > > There's also a clientInstanceId() method that an
> application
> > > can
> > > > > use
> > > > > > >> to
> > > > > > >> > > retrieve
> > > > > > >> > > the client instance id and emit through whatever side
> channels
> > > > > makes
> > > > > > >> > sense.
> > > > > > >> > >
> > > > > > >> > >
> > > > > > >> > >
> > > > > > >> > > > 51. The KIP lists a bunch of metrics that need to be
> > > collected
> > > > > at
> > > > > > >> the
> > > > > > >> > > > client side. However, it seems quite a few useful java
> > > client
> > > > > > >> metrics
> > > > > > >> > > like
> > > > > > >> > > > the following are missing.
> > > > > > >> > > >     buffer-total-bytes
> > > > > > >> > > >     buffer-available-bytes
> > > > > > >> > > >
> > > > > > >> > >
> > > > > > >> > > These are covered by client.producer.record.queue.bytes
> and
> > > > > > >> > > client.producer.record.queue.max.bytes.
> > > > > > >> > >
> > > > > > >> > >
> > > > > > >> > > >     bufferpool-wait-time
> > > > > > >> > > >
> > > > > > >> > >
> > > > > > >> > > Missing, but somewhat implementation specific.
> > > > > > >> > > If it was up to me we would add this later if there's a
> need.
> > > > > > >> > >
> > > > > > >> > >
> > > > > > >> > >
> > > > > > >> > > >     batch-size-avg
> > > > > > >> > > >     batch-size-max
> > > > > > >> > > >
> > > > > > >> > >
> > > > > > >> > > These are missing and would be suitably represented as a
> > > > > histogram.
> > > > > > >> I'll
> > > > > > >> > > add them.
> > > > > > >> > >
> > > > > > >> > >
> > > > > > >> > >
> > > > > > >> > > >     io-wait-ratio
> > > > > > >> > > >     io-ratio
> > > > > > >> > > >
> > > > > > >> > >
> > > > > > >> > > There's client.io.wait.time which should cover
> io-wait-ratio.
> > > > > > >> > > We could add a client.io.time as well, now or in a later
> KIP.
> > > > > > >> > >
> > > > > > >> > > Thanks,
> > > > > > >> > > Magnus
> > > > > > >> > >
> > > > > > >> > >
> > > > > > >> > >
> > > > > > >> > >
> > > > > > >> > > >
> > > > > > >> > > > Thanks,
> > > > > > >> > > >
> > > > > > >> > > > Jun
> > > > > > >> > > >
> > > > > > >> > > > On Mon, Apr 4, 2022 at 10:01 AM Jun Rao <
> jun@confluent.io>
> > > > > wrote:
> > > > > > >> > > >
> > > > > > >> > > > > Hi, Xavier,
> > > > > > >> > > > >
> > > > > > >> > > > > Thanks for the reply.
> > > > > > >> > > > >
> > > > > > >> > > > > 28. It does seem that we have started using
> KafkaMetrics
> > > on
> > > > > the
> > > > > > >> > broker
> > > > > > >> > > > > side. Then, my only concern is on the usage of
> Histogram
> > > in
> > > > > > >> > > KafkaMetrics.
> > > > > > >> > > > > Histogram in KafkaMetrics statically divides the value
> > > space
> > > > > > into
> > > > > > >> a
> > > > > > >> > > fixed
> > > > > > >> > > > > number of buckets and only returns values on the
> bucket
> > > > > > boundary.
> > > > > > >> So,
> > > > > > >> > > the
> > > > > > >> > > > > returned histogram value may never show up in a
> recorded
> > > > > value.
> > > > > > >> > Yammer
> > > > > > >> > > > > Histogram, on the other hand, uses reservoir
> sampling. The
> > > > > > >> reported
> > > > > > >> > > value
> > > > > > >> > > > > is always one of the recorded values. So, I am not
> sure
> > > that
> > > > > > >> > Histogram
> > > > > > >> > > in
> > > > > > >> > > > > KafkaMetrics is as good as Yammer Histogram.
> > > > > > >> > > > ClientMetricsPluginExportTime
> > > > > > >> > > > > uses Histogram.
> > > > > > >> > > > >
> > > > > > >> > > > > Thanks,
> > > > > > >> > > > >
> > > > > > >> > > > > Jun
> > > > > > >> > > > >
> > > > > > >> > > > > On Thu, Mar 31, 2022 at 5:21 PM Xavier Léauté
> > > > > > >> > > > <xa...@confluent.io.invalid>
> > > > > > >> > > > > wrote:
> > > > > > >> > > > >
> > > > > > >> > > > >> >
> > > > > > >> > > > >> > 28. On the broker, we typically use Yammer metrics.
> > > Only
> > > > > for
> > > > > > >> > metrics
> > > > > > >> > > > >> that
> > > > > > >> > > > >> > depend on Kafka metric features (e.g., quota), we
> use
> > > the
> > > > > > Kafka
> > > > > > >> > > > metric.
> > > > > > >> > > > >> > Yammer metrics have 4 types: gauge, meter,
> histogram
> > > and
> > > > > > timer.
> > > > > > >> > > meter
> > > > > > >> > > > >> > calculates a rate, but also exposes an accumulated
> > > value.
> > > > > > >> > > > >> >
> > > > > > >> > > > >>
> > > > > > >> > > > >> I don't see a good reason we should limit ourselves
> to
> > > Yammer
> > > > > > >> > metrics
> > > > > > >> > > on
> > > > > > >> > > > >> the broker. KafkaMetrics was written
> > > > > > >> > > > >> to replace Yammer metrics and is used for all new
> > > components
> > > > > > >> > (clients,
> > > > > > >> > > > >> streams, connect, etc.)
> > > > > > >> > > > >> My understanding is that the original goal was to
> retire
> > > > > Yammer
> > > > > > >> > > metrics
> > > > > > >> > > > in
> > > > > > >> > > > >> the broker in favor of KafkaMetrics.
> > > > > > >> > > > >> We just haven't done so out of backwards
> compatibility
> > > > > > concerns.
> > > > > > >> > > > >> There are other broker metrics such as group
> coordinator,
> > > > > > >> > transaction
> > > > > > >> > > > >> state
> > > > > > >> > > > >> manager, and various socket server metrics
> > > > > > >> > > > >> already using KafkaMetrics that don't need specific
> Kafka
> > > > > > metric
> > > > > > >> > > > features,
> > > > > > >> > > > >> so I don't see why we should refrain from using
> > > > > > >> > > > >> Kafka metrics on the broker unless there are real
> > > > > compatibility
> > > > > > >> > > concerns
> > > > > > >> > > > >> or
> > > > > > >> > > > >> where implementation specifics could lead to
> confusion
> > > when
> > > > > > >> > comparing
> > > > > > >> > > > >> metrics using different implementations.
> > > > > > >> > > > >>
> > > > > > >> > > > >> In my opinion we should encourage people to use
> > > KafkaMetrics
> > > > > > >> going
> > > > > > >> > > > forward
> > > > > > > >> > > > >> on the broker as well, for three reasons:
> > > > > > >> > > > >> a) yammer metrics is long deprecated and no longer
> > > maintained
> > > > > > >> > > > >> b) yammer metrics are much less expressive
> > > > > > >> > > > >> c) we don't have a proper API to expose yammer
> metrics
> > > > > outside
> > > > > > of
> > > > > > >> > JMX
> > > > > > >> > > > >> (MetricsReporter only exposes KafkaMetrics)
> > > > > > >> > > > >>
> > > > > > >> > > > >
> > > > > > >> > > >
> > > > > > >> > >
> > > > > > >> >
> > > > > > >>
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] KIP-714: Client metrics and observability

Posted by Kirk True <ki...@kirktrue.pro>.

Hi Jun,

Thank you for your continued interest in shaping the KIP :)

On Thu, Jun 16, 2022, at 2:38 PM, Jun Rao wrote:
> Hi, Kirk,
> 
> Thanks for the reply. A couple of more comments.
> 
> (1) "Another perspective is that these two sets of metrics serve different
> purposes and/or have different audiences, which argues that they should
> maintain their individuality and purpose. " Hmm, I am wondering if those
> metrics are really for different audiences and purposes? For example, if
> the operator detected an issue through a client metric collected through
> the server, the operator may need to communicate that back to the client.
> It would be weird if that same metric is not visible on the client side.

I agree in principle that all client metrics visible on the client can also be made available to send to the broker.

Are there any non-obvious security/privacy-related edge cases where shipping certain metrics to the broker would be "bad"?

> (2) If we could standardize the names on the server side, do we need to
> enforce a naming convention for all clients?

"Enforce" is such an ugly word :P

But yes, I do feel that a consistent naming convention across all clients benefits two kinds of communication:

 1. Human-to-human communication. Ecosystem-wide agreement and understanding of metrics helps all to communicate more efficiently.
 2. Machine-to-machine communication. Defining the names via the KIP mechanism helps to ensure stability across releases of a given client.

Point 1: Human-to-human Communication

There are quite a few parties that must communicate effectively across the Kafka ecosystem. Here are the ones I can think of off the top of my head:

 1. Kafka client authors
 2. Kafka client users
 3. Kafka client telemetry plugin authors
 4. Support teams (within an organization or vendor-supplied across organizations)
 5. Kafka cluster operators

There should be a standard so that these parties can understand what the metrics mean and correlate them across all clients.

As a concrete example, KIP-714 includes a metric for tracking the number of active client connections to a cluster, named "org.apache.kafka.client.connection.active". Given this name, all client implementations can communicate this name and its value to all parties consistently. Without a standard naming convention, the metric might be named "connections.open" in the Java client and "Connections/Alive" in librdkafka. This inconsistency in naming would hamper discussions between the parties involved.

To your point, it's absolutely a design choice to keep the naming convention the same between each client. We can change that if it makes sense.

Point 2: Machine-to-machine Communication

Standardization at the client level provides stability through an implied contract that a client should not introduce a breaking name change between releases. Otherwise, the ability for the metrics to be "understood" in a machine-to-machine context would be forfeit.

For example, let's say that we give the clients the latitude to name metrics as they wish. In this example, let's say that the Apache Kafka 3.4 release decides to name this metric "connections.open". It's a good name! It says what it is. However, in, let's say, the Apache Kafka 3.7 release, the metric name is changed to "connections.open.count". At this point, there are two names and machine-to-machine communication will likely be affected. With that change, all client telemetry plugin(s) used in an organization must be updated to reflect that change, else data loss or bugs could be introduced.
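
To make that maintenance burden concrete, here is a minimal sketch of the alias table a telemetry plugin would need to carry if names were left to each client and release. The class and its methods are hypothetical and not part of KIP-714 or any Kafka API. The metric names are the illustrative ones from the examples above.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical plugin-side helper that maps reported names to one canonical name. */
final class MetricNameNormalizer {
    private final Map<String, String> aliases = new HashMap<>();

    /** Registers a name some client or release reports, and the name it should map to. */
    MetricNameNormalizer alias(String reported, String canonical) {
        aliases.put(reported, canonical);
        return this;
    }

    /** Returns the canonical name, or the reported name unchanged if no alias is known. */
    String normalize(String reported) {
        return aliases.getOrDefault(reported, reported);
    }
}
```

A plugin would have to register "connections.open", "connections.open.count" and "Connections/Alive" as aliases of "org.apache.kafka.client.connection.active". Any rename that is not registered passes through unchanged and silently falls out of whatever aggregation expects the canonical name.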

That the KIP defines the names of the metrics does, admittedly, constrain the options of authors of the different clients. The metric named "org.apache.kafka.client.connection.active" may be confusing in some client implementations. For whatever reason, a client author may even find it "undesirable" to include a reference that includes "Apache" in their code.

There's also the precedent set by the existing (JMX-based) client metrics. Though these are applicable only to the Java client, we can see that having a standardized naming convention there has helped with communication.

So, IMO, it makes sense to define the metric names via the KIP mechanism and--let's say, "ask"--that client implementations abide by those.

Thanks,
Kirk

> 
> Thanks,
> 
> Jun
> 
> On Thu, Jun 16, 2022 at 12:00 PM Kirk True <ki...@kirktrue.pro> wrote:
> 
> > Hi Jun,
> >
> > I'll try to answer the questions posed...
> >
> > On Tue, Jun 7, 2022, at 4:32 PM, Jun Rao wrote:
> > > Hi, Magnus,
> > >
> > > Thanks for the reply.
> > >
> > > So, the standard set of generic metrics is just a recommendation and not
> > a
> > > requirement? This sounds good to me since it makes the adoption of the
> > KIP
> > > easier.
> >
> > I believe that was the intent, yes.
> >
> > > Regarding the metric names, I have two concerns.
> >
> > (I'm splitting these two up for readability...)
> >
> > > (1) If a client already
> > > has an existing metric similar to the standard one, duplicating the
> > metric
> > > seems to be confusing.
> >
> > Agreed. I'm dealing with that situation as I write the Java client
> > implementation.
> >
> > The existing Java client exposes a set of metrics via JMX. The updated
> > Java client will introduce a second set of metrics, which instead are
> > exposed via sending them to the broker. There is substantial overlap with
> > the two set of metrics and in a few places in the code under development,
> > there are essentially two separate calls to update metrics: one for the
> > JMX-bound metrics and one for the broker-bound metrics.
> >
> > To be candid, I have gone back-and-forth on that design. From one
> > perspective, it could be argued that the set of client metrics should be
> > standardized across a given client, regardless of how those metrics are
> > exposed for consumption. Another perspective is that these two sets of
> > metrics serve different purposes and/or have different audiences, which
> > argues that they should maintain their individuality and purpose. Your
> > inputs/suggestions are certainly welcome!
> >
> > > (2) If a client needs to implement a standard metric
> > > that doesn't exist yet, using a naming convention (e.g., using dash vs
> > dot)
> > > different from other existing metrics also seems a bit confusing. It
> > seems
> > > that the main benefit of having standard metric names across clients is
> > for
> > > better server side monitoring. Could we do the standardization in the
> > > plugin on the server?
> >
> > I think the expectation is that the plugin implementation will perform
> > transformation of metric names, if needed, to fit in with an organization's
> > monitoring naming standards. Perhaps we need to call that out in the KIP
> > itself.
> >
> > Thanks,
> > Kirk
> >
> > >
> > > Thanks,
> > >
> > > Jun
> > >
> > >
> > >
> > > On Tue, Jun 7, 2022 at 6:53 AM Magnus Edenhill <ma...@edenhill.se>
> > wrote:
> > >
> > > > Hey Jun,
> > > >
> > > > I've clarified the scope of the standard metrics in the KIP, but
> > basically:
> > > >
> > > >  * We define a standard set of generic metrics that should be relevant
> > to
> > > > most client implementations, e.g., each producer implementation
> > probably
> > > > has some sort of per-partition message queue.
> > > >  * A client implementation should strive to implement as many of the
> > > > standard metrics as possible, but only the ones that make sense.
> > > >  * For metrics that are not in the standard set, a client maintainer
> > can
> > > > choose to either submit a KIP to add additional standard metrics - if
> > > > they're relevant, or go ahead and add custom metrics that are specific
> > to
> > > > that client implementation. These custom metrics will have a prefix
> > > > specific to that client implementation, as opposed to the standard
> > metric
> > > > set that resides under "org.apache.kafka...". E.g.,
> > > > "se.edenhill.librdkafka" or whatever.
> > > >  * Existing non-KIP-714 metrics should remain untouched. In some cases
> > we
> > > > might be able to use the same meter given it is compatible with the
> > > > standard metric set definition, in other cases a semi-duplicate meter
> > may
> > > > be needed. Thus this will not affect the metrics exposed through JMX,
> > or
> > > > vice versa.
> > > >
> > > > Thanks,
> > > > Magnus
> > > >
> > > >
> > > >
> > > > Den ons 1 juni 2022 kl 18:55 skrev Jun Rao <ju...@confluent.io.invalid>:
> > > >
> > > > > Hi, Magnus,
> > > > >
> > > > > 51. Just to clarify my question.  (1) Are standard metrics required
> > for
> > > > > every client for this KIP to function?  (2) Are we converting
> > existing
> > > > java
> > > > > metrics to the standard metrics and deprecating the old ones? If so,
> > > > could
> > > > > we list all existing java metrics that need to be renamed and the
> > > > > corresponding new name?
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Jun
> > > > >
> > > > > On Tue, May 31, 2022 at 3:29 PM Jun Rao <ju...@confluent.io> wrote:
> > > > >
> > > > > > Hi, Magnus,
> > > > > >
> > > > > > Thanks for the reply.
> > > > > >
> > > > > > 51. I think it's fine to have a list of recommended metrics for
> > every
> > > > > > client to implement. I am just not sure that standardizing on the
> > > > metric
> > > > > > names across all clients is practical. The list of common metrics
> > in
> > > > the
> > > > > > KIP have completely different names from the java metric names.
> > Some of
> > > > > > them have different types. For example, some of the common metrics
> > > > have a
> > > > > > type of histogram, but the java client metrics don't use histogram
> > in
> > > > > > general. Requiring the operator to translate those names and
> > understand
> > > > > the
> > > > > > subtle differences across clients seem to cause more confusion
> > during
> > > > > > troubleshooting.
> > > > > >
> > > > > > Thanks,
> > > > > >
> > > > > > Jun
> > > > > >
> > > > > > On Tue, May 31, 2022 at 5:02 AM Magnus Edenhill <
> > magnus@edenhill.se>
> > > > > > wrote:
> > > > > >
> > > > > >> Den fre 20 maj 2022 kl 01:23 skrev Jun Rao
> > <jun@confluent.io.invalid
> > > > >:
> > > > > >>
> > > > > >> > Hi, Magus,
> > > > > >> >
> > > > > >> > Thanks for the reply.
> > > > > >> >
> > > > > >> > 50. Sounds good.
> > > > > >> >
> > > > > >> > 51. I miss-understood the proposal in the KIP then. The
> > proposal is
> > > > to
> > > > > >> > define a set of common metric names that every client should
> > > > > implement.
> > > > > >> The
> > > > > >> > problem is that every client already has its own set of metrics
> > with
> > > > > its
> > > > > >> > own names. I am not sure that we could easily agree upon a
> > common
> > > > set
> > > > > of
> > > > > >> > metrics that work with all clients. There are likely to be some
> > > > > metrics
> > > > > >> > that are client specific. Translating between the common name
> > and
> > > > > client
> > > > > >> > specific name is probably going to add more confusion. As
> > mentioned
> > > > in
> > > > > >> the
> > > > > >> > KIP, similar metrics from different clients could have subtle
> > > > > >> > semantic differences. Could we just let each client use its own
> > set
> > > > of
> > > > > >> > metric names?
> > > > > >> >
> > > > > >>
> > > > > >> We identified a common set of metrics that should be relevant for
> > most
> > > > > >> client implementations,
> > > > > >> they're the ones listed in the KIP.
> > > > > >> A supporting client does not have to implement all those metrics,
> > only
> > > > > the
> > > > > >> ones that makes sense
> > > > > >> based on that client implementation, and a client may implement
> > other
> > > > > >> metrics that are not listed
> > > > > >> in the KIP under its own namespace.
> > > > > >> This approach has two benefits:
> > > > > >>  - there will be a common set of metrics that most/all clients
> > > > > implement,
> > > > > >> which makes monitoring
> > > > > >>   and troubleshooting easier across fleets with multiple Kafka
> > client
> > > > > >> languages/implementations.
> > > > > >>  - client-specific metrics are still possible, so if there is no
> > > > > suitable
> > > > > >> standard metric a client can still
> > > > > >>    provide what special metrics it has.
> > > > > >>
> > > > > >>
> > > > > >> Thanks,
> > > > > >> Magnus
> > > > > >>
> > > > > >>
> > > > > >> On Thu, May 19, 2022 at 10:39 AM Magnus Edenhill <
> > magnus@edenhill.se>
> > > > > >> wrote:
> > > > > >> >
> > > > > >> > > Den ons 18 maj 2022 kl 19:57 skrev Jun Rao
> > > > <jun@confluent.io.invalid
> > > > > >> >:
> > > > > >> > >
> > > > > >> > > > Hi, Magnus,
> > > > > >> > > >
> > > > > >> > >
> > > > > >> > > Hi Jun
> > > > > >> > >
> > > > > >> > >
> > > > > >> > > >
> > > > > >> > > > Thanks for the updated KIP. Just a couple of more comments.
> > > > > >> > > >
> > > > > >> > > > 50. To troubleshoot a particular client issue, I imagine
> > that
> > > > the
> > > > > >> > client
> > > > > >> > > > needs to identify its client_instance_id. How does the
> > client
> > > > find
> > > > > >> this
> > > > > >> > > > out? Do we plan to include client_instance_id in the client
> > log,
> > > > > >> expose
> > > > > >> > > it
> > > > > >> > > > as a metric or something else?
> > > > > >> > > >
> > > > > >> > >
> > > > > >> > > The KIP suggests that client implementations emit an
> > informative
> > > > log
> > > > > >> > > message
> > > > > >> > > with the assigned client-instance-id once it is retrieved
> > (once
> > > > per
> > > > > >> > client
> > > > > >> > > instance lifetime).
> > > > > >> > > There's also a clientInstanceId() method that an application
> > can
> > > > use
> > > > > >> to
> > > > > >> > > retrieve
> > > > > >> > > the client instance id and emit through whatever side channels
> > > > makes
> > > > > >> > sense.
> > > > > >> > >
> > > > > >> > >
> > > > > >> > >
> > > > > >> > > > 51. The KIP lists a bunch of metrics that need to be
> > collected
> > > > at
> > > > > >> the
> > > > > >> > > > client side. However, it seems quite a few useful java
> > client
> > > > > >> metrics
> > > > > >> > > like
> > > > > >> > > > the following are missing.
> > > > > >> > > >     buffer-total-bytes
> > > > > >> > > >     buffer-available-bytes
> > > > > >> > > >
> > > > > >> > >
> > > > > >> > > These are covered by client.producer.record.queue.bytes and
> > > > > >> > > client.producer.record.queue.max.bytes.
> > > > > >> > >
> > > > > >> > >
> > > > > >> > > >     bufferpool-wait-time
> > > > > >> > > >
> > > > > >> > >
> > > > > >> > > Missing, but somewhat implementation specific.
> > > > > >> > > If it was up to me we would add this later if there's a need.
> > > > > >> > >
> > > > > >> > >
> > > > > >> > >
> > > > > >> > > >     batch-size-avg
> > > > > >> > > >     batch-size-max
> > > > > >> > > >
> > > > > >> > >
> > > > > >> > > These are missing and would be suitably represented as a
> > > > histogram.
> > > > > >> I'll
> > > > > >> > > add them.
> > > > > >> > >
> > > > > >> > >
> > > > > >> > >
> > > > > >> > > >     io-wait-ratio
> > > > > >> > > >     io-ratio
> > > > > >> > > >
> > > > > >> > >
> > > > > >> > > There's client.io.wait.time which should cover io-wait-ratio.
> > > > > >> > > We could add a client.io.time as well, now or in a later KIP.
> > > > > >> > >
> > > > > >> > > Thanks,
> > > > > >> > > Magnus
> > > > > >> > >
> > > > > >> > >
> > > > > >> > >
> > > > > >> > >
> > > > > >> > > >
> > > > > >> > > > Thanks,
> > > > > >> > > >
> > > > > >> > > > Jun
> > > > > >> > > >
> > > > > >> > > > On Mon, Apr 4, 2022 at 10:01 AM Jun Rao <ju...@confluent.io>
> > > > wrote:
> > > > > >> > > >
> > > > > >> > > > > Hi, Xavier,
> > > > > >> > > > >
> > > > > >> > > > > Thanks for the reply.
> > > > > >> > > > >
> > > > > >> > > > > 28. It does seem that we have started using KafkaMetrics
> > on
> > > > the
> > > > > >> > broker
> > > > > >> > > > > side. Then, my only concern is on the usage of Histogram
> > in
> > > > > >> > > KafkaMetrics.
> > > > > >> > > > > Histogram in KafkaMetrics statically divides the value
> > space
> > > > > into
> > > > > >> a
> > > > > >> > > fixed
> > > > > >> > > > > number of buckets and only returns values on the bucket
> > > > > boundary.
> > > > > >> So,
> > > > > >> > > the
> > > > > >> > > > > returned histogram value may never show up in a recorded
> > > > value.
> > > > > >> > Yammer
> > > > > >> > > > > Histogram, on the other hand, uses reservoir sampling. The
> > > > > >> reported
> > > > > >> > > value
> > > > > >> > > > > is always one of the recorded values. So, I am not sure
> > that
> > > > > >> > Histogram
> > > > > >> > > in
> > > > > >> > > > > KafkaMetrics is as good as Yammer Histogram.
> > > > > >> > > > ClientMetricsPluginExportTime
> > > > > >> > > > > uses Histogram.
> > > > > >> > > > >
> > > > > >> > > > > Thanks,
> > > > > >> > > > >
> > > > > >> > > > > Jun
> > > > > >> > > > >
> > > > > >> > > > > On Thu, Mar 31, 2022 at 5:21 PM Xavier Léauté
> > > > > >> > > > <xa...@confluent.io.invalid>
> > > > > >> > > > > wrote:
> > > > > >> > > > >
> > > > > >> > > > >> >
> > > > > >> > > > >> > 28. On the broker, we typically use Yammer metrics.
> > Only
> > > > for
> > > > > >> > metrics
> > > > > >> > > > >> that
> > > > > >> > > > >> > depend on Kafka metric features (e.g., quota), we use
> > the
> > > > > Kafka
> > > > > >> > > > metric.
> > > > > >> > > > >> > Yammer metrics have 4 types: gauge, meter, histogram
> > and
> > > > > timer.
> > > > > >> > > meter
> > > > > >> > > > >> > calculates a rate, but also exposes an accumulated
> > value.
> > > > > >> > > > >> >
> > > > > >> > > > >>
> > > > > >> > > > >> I don't see a good reason we should limit ourselves to
> > Yammer
> > > > > >> > metrics
> > > > > >> > > on
> > > > > >> > > > >> the broker. KafkaMetrics was written
> > > > > >> > > > >> to replace Yammer metrics and is used for all new
> > components
> > > > > >> > (clients,
> > > > > >> > > > >> streams, connect, etc.)
> > > > > >> > > > >> My understanding is that the original goal was to retire
> > > > Yammer
> > > > > >> > > metrics
> > > > > >> > > > in
> > > > > >> > > > >> the broker in favor of KafkaMetrics.
> > > > > >> > > > >> We just haven't done so out of backwards compatibility
> > > > > concerns.
> > > > > >> > > > >> There are other broker metrics such as group coordinator,
> > > > > >> > transaction
> > > > > >> > > > >> state
> > > > > >> > > > >> manager, and various socket server metrics
> > > > > >> > > > >> already using KafkaMetrics that don't need specific Kafka
> > > > > metric
> > > > > >> > > > features,
> > > > > >> > > > >> so I don't see why we should refrain from using
> > > > > >> > > > >> Kafka metrics on the broker unless there are real
> > > > compatibility
> > > > > >> > > concerns
> > > > > >> > > > >> or
> > > > > >> > > > >> where implementation specifics could lead to confusion
> > when
> > > > > >> > comparing
> > > > > >> > > > >> metrics using different implementations.
> > > > > >> > > > >>
> > > > > >> > > > >> In my opinion we should encourage people to use
> > KafkaMetrics
> > > > > >> going
> > > > > >> > > > forward
> > > > > >> > > > >> on the broker as well, for two reasons:
> > > > > >> > > > >> a) yammer metrics is long deprecated and no longer
> > maintained
> > > > > >> > > > >> b) yammer metrics are much less expressive
> > > > > >> > > > >> c) we don't have a proper API to expose yammer metrics
> > > > outside
> > > > > of
> > > > > >> > JMX
> > > > > >> > > > >> (MetricsReporter only exposes KafkaMetrics)
> > > > > >> > > > >>
> > > > > >> > > > >
> > > > > >> > > >
> > > > > >> > >
> > > > > >> >
> > > > > >>
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] KIP-714: Client metrics and observability

Posted by Magnus Edenhill <ma...@edenhill.se>.

Hey Jun and Kirk,


I see that there's a lot of focus on the existing metrics in the Java
clients, which makes sense, but the KIP aims to approach the problem space
from a higher, more generic level by defining:
1) a standard protocol for subscribing to and pushing metrics,
2) an existing industry-standard encoding and semantics for those metrics
(OTLP),
3) a standard set of metrics that we believe are relevant to most/all
client implementations.


The alternatives to these points, which have come up before in various
forms during the KIP discussions (see the rejected alternatives in the
KIP), are:
1) use an existing out-of-band protocol,
2) use Kafka protocol encoding for the metrics,
3) let each client implementation provide its own set of metrics.

So why is the KIP not suggesting this approach? Well, in short:
 1) defies the zero-conf/always-available requirement: clients, networks,
firewalls, etc., must be specifically configured, which will not be
feasible.
 2) we would need to duplicate the work of the industry-leading telemetry
people (OpenTelemetry), reaping no benefits from their existing and future
work and making integration with upstream telemetry systems harder.
 3a) these client-specific metrics would either need to be converted to
some common form, which is not only CPU/memory costly but also hard from
an operational standpoint:
     someone (is it the Kafka operator?) would need to understand what
client-specific metrics are available and what their semantics are, and
then for each such client implementation write translation code in the
broker-side plugin to try to mangle the custom metrics into a standard set
of metrics that can be monitored with a single upstream metric. With seven
or eight different client implementations in the wild, all with new
releases coming out every now and then, some perhaps without per-metric
documentation, that just seems like a daunting task that will be hard
to win.
 3b) or, if the client-specific metrics are not converted to some common
form, name, semantic, etc., it'll make creating meaningful aggregations and
monitoring more complex in the upstream telemetry system, with a scattered
plethora of custom metrics.
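
To illustrate the translation burden in 3a, here is a rough Java sketch of what the broker-side plugin would have to carry. Everything in it is an assumption made for illustration: the class MetricNameTranslator, the client identifiers "client-a" and "client-b", and their metric names are invented. Only the target name "org.apache.kafka.client.connection.active" appears earlier in this thread.

```java
import java.util.Map;

// Hypothetical broker-side translation layer; illustration only.
final class MetricNameTranslator {
    static final String STANDARD = "org.apache.kafka.client.connection.active";

    // One hand-maintained table per client implementation (entries invented).
    private static final Map<String, Map<String, String>> TABLES = Map.of(
        "client-a", Map.of("conns.open", STANDARD),
        "client-b", Map.of("connections-active", STANDARD));

    // Maps a client-specific metric name to the common name.
    static String toStandard(String clientImpl, String clientMetricName) {
        Map<String, String> table = TABLES.getOrDefault(clientImpl, Map.of());
        // Unknown clients and unknown names pass through unchanged, so a gap
        // in a table shows up only as a missing series upstream.
        return table.getOrDefault(clientMetricName, clientMetricName);
    }
}
```

Each new client implementation, and each release that renames a metric, means another table entry that someone has to discover and maintain.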

Additionally, the proposed standard set of metrics is derived from what is
available in existing clients, and while the fit to existing metrics may
not be perfect, it won't be too far off.
Moreover, having a standard set of metrics to implement makes it easier for
client maintainers to know which metrics they should expose and which are
considered relevant to monitoring and troubleshooting.

As for manually mapping KIP-714 metric names to JMX during troubleshooting:
I agree that is not perfect, but it could be solved quite easily through
documentation, e.g., "MetricA is also known as metric.foo.a in OTLP".
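
As a small sketch of that documentation idea, the cross-reference lines could be generated from a single name table rather than written by hand. The class MetricCrossReference is hypothetical, and the names MetricA and metric.foo.a are just the placeholders from the sentence above.

```java
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical documentation helper; illustration only.
final class MetricCrossReference {
    // Renders one line per pair: existing client metric name -> OTLP name.
    static String render(Map<String, String> existingToOtlp) {
        return existingToOtlp.entrySet().stream()
            .map(e -> e.getKey() + " is also known as " + e.getValue() + " in OTLP")
            .collect(Collectors.joining("\n"));
    }
}
```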

Another point worth mentioning is that, while the KIP does not cover it, a
future enhancement to the clients is to also expose the OTLP metrics
directly to the application as an alternative to JMX (or whatever the
client currently exposes, e.g. JSON), which makes integration with upstream
metrics systems easier.


Thanks,
Magnus







On Thu, Jun 16, 2022 at 23:38, Jun Rao <ju...@confluent.io.invalid> wrote:

> Hi, Kirk,
>
> Thanks for the reply. A couple of more comments.
>
> (1) "Another perspective is that these two sets of metrics serve different
> purposes and/or have different audiences, which argues that they should
> maintain their individuality and purpose. " Hmm, I am wondering if those
> metrics are really for different audiences and purposes? For example, if
> the operator detected an issue through a client metric collected through
> the server, the operator may need to communicate that back to the client.
> It would be weird if that same metric is not visible on the client side.
>
> (2) If we could standardize the names on the server side, do we need to
> enforce a naming convention for all clients?


> Thanks,
>
> Jun
>
> On Thu, Jun 16, 2022 at 12:00 PM Kirk True <ki...@kirktrue.pro> wrote:
>
> > Hi Jun,
> >
> > I'll try to answer the questions posed...
> >
> > On Tue, Jun 7, 2022, at 4:32 PM, Jun Rao wrote:
> > > Hi, Magnus,
> > >
> > > Thanks for the reply.
> > >
> > > So, the standard set of generic metrics is just a recommendation and
> not
> > a
> > > requirement? This sounds good to me since it makes the adoption of the
> > KIP
> > > easier.
> >
> > I believe that was the intent, yes.
> >
> > > Regarding the metric names, I have two concerns.
> >
> > (I'm splitting these two up for readability...)
> >
> > > (1) If a client already
> > > has an existing metric similar to the standard one, duplicating the
> > metric
> > > seems to be confusing.
> >
> > Agreed. I'm dealing with that situation as I write the Java client
> > implementation.
> >
> > The existing Java client exposes a set of metrics via JMX. The updated
> > Java client will introduce a second set of metrics, which instead are
> > exposed via sending them to the broker. There is substantial overlap with
> > the two set of metrics and in a few places in the code under development,
> > there are essentially two separate calls to update metrics: one for the
> > JMX-bound metrics and one for the broker-bound metrics.
> >
> > To be candid, I have gone back-and-forth on that design. From one
> > perspective, it could be argued that the set of client metrics should be
> > standardized across a given client, regardless of how those metrics are
> > exposed for consumption. Another perspective is that these two sets of
> > metrics serve different purposes and/or have different audiences, which
> > argues that they should maintain their individuality and purpose. Your
> > inputs/suggestions are certainly welcome!
> >
> > > (2) If a client needs to implement a standard metric
> > > that doesn't exist yet, using a naming convention (e.g., using dash vs
> > dot)
> > > different from other existing metrics also seems a bit confusing. It
> > seems
> > > that the main benefit of having standard metric names across clients is
> > for
> > > better server side monitoring. Could we do the standardization in the
> > > plugin on the server?
> >
> > I think the expectation is that the plugin implementation will perform
> > transformation of metric names, if needed, to fit in with an
> organization's
> > monitoring naming standards. Perhaps we need to call that out in the KIP
> > itself.
> >
> > Thanks,
> > Kirk
> >
> > >
> > > Thanks,
> > >
> > > Jun
> > >
> > >
> > >
> > > On Tue, Jun 7, 2022 at 6:53 AM Magnus Edenhill <ma...@edenhill.se>
> > wrote:
> > >
> > > > Hey Jun,
> > > >
> > > > I've clarified the scope of the standard metrics in the KIP, but
> > basically:
> > > >
> > > >  * We define a standard set of generic metrics that should be
> relevant
> > to
> > > > most client implementations, e.g., each producer implementation
> > probably
> > > > has some sort of per-partition message queue.
> > > >  * A client implementation should strive to implement as many of the
> > > > standard metrics as possible, but only the ones that make sense.
> > > >  * For metrics that are not in the standard set, a client maintainer
> > can
> > > > choose to either submit a KIP to add additional standard metrics - if
> > > > they're relevant, or go ahead and add custom metrics that are
> specific
> > to
> > > > that client implementation. These custom metrics will have a prefix
> > > > specific to that client implementation, as opposed to the standard
> > metric
> > > > set that resides under "org.apache.kafka...". E.g.,
> > > > "se.edenhill.librdkafka" or whatever.
> > > >  * Existing non-KIP-714 metrics should remain untouched. In some
> cases
> > we
> > > > might be able to use the same meter given it is compatible with the
> > > > standard metric set definition, in other cases a semi-duplicate meter
> > may
> > > > be needed. Thus this will not affect the metrics exposed through JMX,
> > or
> > > > vice versa.
> > > >
> > > > Thanks,
> > > > Magnus
> > > >
> > > >
> > > >
> > > > Den ons 1 juni 2022 kl 18:55 skrev Jun Rao <jun@confluent.io.invalid
> >:
> > > >
> > > > > Hi, Magnus,
> > > > >
> > > > > 51. Just to clarify my question.  (1) Are standard metrics required
> > for
> > > > > every client for this KIP to function?  (2) Are we converting
> > existing
> > > > java
> > > > > metrics to the standard metrics and deprecating the old ones? If
> so,
> > > > could
> > > > > we list all existing java metrics that need to be renamed and the
> > > > > corresponding new name?
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Jun
> > > > >
> > > > > On Tue, May 31, 2022 at 3:29 PM Jun Rao <ju...@confluent.io> wrote:
> > > > >
> > > > > > Hi, Magnus,
> > > > > >
> > > > > > Thanks for the reply.
> > > > > >
> > > > > > 51. I think it's fine to have a list of recommended metrics for
> > every
> > > > > > client to implement. I am just not sure that standardizing on the
> > > > metric
> > > > > > names across all clients is practical. The list of common metrics
> > in
> > > > the
> > > > > > KIP have completely different names from the java metric names.
> > Some of
> > > > > > them have different types. For example, some of the common
> metrics
> > > > have a
> > > > > > type of histogram, but the java client metrics don't use
> histogram
> > in
> > > > > > general. Requiring the operator to translate those names and
> > understand
> > > > > the
> > > > > > subtle differences across clients seem to cause more confusion
> > during
> > > > > > troubleshooting.
> > > > > >
> > > > > > Thanks,
> > > > > >
> > > > > > Jun
> > > > > >
> > > > > > On Tue, May 31, 2022 at 5:02 AM Magnus Edenhill <
> > magnus@edenhill.se>
> > > > > > wrote:
> > > > > >
> > > > > >> Den fre 20 maj 2022 kl 01:23 skrev Jun Rao
> > <jun@confluent.io.invalid
> > > > >:
> > > > > >>
> > > > > >> > Hi, Magus,
> > > > > >> >
> > > > > >> > Thanks for the reply.
> > > > > >> >
> > > > > >> > 50. Sounds good.
> > > > > >> >
> > > > > >> > 51. I miss-understood the proposal in the KIP then. The
> > proposal is
> > > > to
> > > > > >> > define a set of common metric names that every client should
> > > > > implement.
> > > > > >> The
> > > > > >> > problem is that every client already has its own set of
> metrics
> > with
> > > > > its
> > > > > >> > own names. I am not sure that we could easily agree upon a
> > common
> > > > set
> > > > > of
> > > > > >> > metrics that work with all clients. There are likely to be
> some
> > > > > metrics
> > > > > >> > that are client specific. Translating between the common name
> > and
> > > > > client
> > > > > >> > specific name is probably going to add more confusion. As
> > mentioned
> > > > in
> > > > > >> the
> > > > > >> > KIP, similar metrics from different clients could have subtle
> > > > > >> > semantic differences. Could we just let each client use its
> own
> > set
> > > > of
> > > > > >> > metric names?
> > > > > >> >
> > > > > >>
> > > > > >> We identified a common set of metrics that should be relevant
> for
> > most
> > > > > >> client implementations,
> > > > > >> they're the ones listed in the KIP.
> > > > > >> A supporting client does not have to implement all those
> metrics,
> > only
> > > > > the
> > > > > >> ones that makes sense
> > > > > >> based on that client implementation, and a client may implement
> > other
> > > > > >> metrics that are not listed
> > > > > >> in the KIP under its own namespace.
> > > > > >> This approach has two benefits:
> > > > > >>  - there will be a common set of metrics that most/all clients
> > > > > implement,
> > > > > >> which makes monitoring
> > > > > >>   and troubleshooting easier across fleets with multiple Kafka
> > client
> > > > > >> languages/implementations.
> > > > > >>  - client-specific metrics are still possible, so if there is no
> > > > > suitable
> > > > > >> standard metric a client can still
> > > > > >>    provide what special metrics it has.
> > > > > >>
> > > > > >>
> > > > > >> Thanks,
> > > > > >> Magnus
> > > > > >>
> > > > > >>
> > > > > >> On Thu, May 19, 2022 at 10:39 AM Magnus Edenhill <
> > magnus@edenhill.se>
> > > > > >> wrote:
> > > > > >> >
> > > > > >> > > Den ons 18 maj 2022 kl 19:57 skrev Jun Rao
> > > > <jun@confluent.io.invalid
> > > > > >> >:
> > > > > >> > >
> > > > > >> > > > Hi, Magnus,
> > > > > >> > > >
> > > > > >> > >
> > > > > >> > > Hi Jun
> > > > > >> > >
> > > > > >> > >
> > > > > >> > > >
> > > > > >> > > > Thanks for the updated KIP. Just a couple of more
> comments.
> > > > > >> > > >
> > > > > >> > > > 50. To troubleshoot a particular client issue, I imagine
> > that
> > > > the
> > > > > >> > client
> > > > > >> > > > needs to identify its client_instance_id. How does the
> > client
> > > > find
> > > > > >> this
> > > > > >> > > > out? Do we plan to include client_instance_id in the
> client
> > log,
> > > > > >> expose
> > > > > >> > > it
> > > > > >> > > > as a metric or something else?
> > > > > >> > > >
> > > > > >> > >
> > > > > >> > > The KIP suggests that client implementations emit an
> > informative
> > > > log
> > > > > >> > > message
> > > > > >> > > with the assigned client-instance-id once it is retrieved
> > (once
> > > > per
> > > > > >> > client
> > > > > >> > > instance lifetime).
> > > > > >> > > There's also a clientInstanceId() method that an application
> > can
> > > > use
> > > > > >> to
> > > > > >> > > retrieve
> > > > > >> > > the client instance id and emit through whatever side
> channels
> > > > makes
> > > > > >> > sense.
> > > > > >> > >
> > > > > >> > >
> > > > > >> > >
> > > > > >> > > > 51. The KIP lists a bunch of metrics that need to be
> > collected
> > > > at
> > > > > >> the
> > > > > >> > > > client side. However, it seems quite a few useful java
> > client
> > > > > >> metrics
> > > > > >> > > like
> > > > > >> > > > the following are missing.
> > > > > >> > > >     buffer-total-bytes
> > > > > >> > > >     buffer-available-bytes
> > > > > >> > > >
> > > > > >> > >
> > > > > >> > > These are covered by client.producer.record.queue.bytes and
> > > > > >> > > client.producer.record.queue.max.bytes.
> > > > > >> > >
> > > > > >> > >
> > > > > >> > > >     bufferpool-wait-time
> > > > > >> > > >
> > > > > >> > >
> > > > > >> > > Missing, but somewhat implementation specific.
> > > > > >> > > If it was up to me we would add this later if there's a
> need.
> > > > > >> > >
> > > > > >> > >
> > > > > >> > >
> > > > > >> > > >     batch-size-avg
> > > > > >> > > >     batch-size-max
> > > > > >> > > >
> > > > > >> > >
> > > > > >> > > These are missing and would be suitably represented as a
> > > > histogram.
> > > > > >> I'll
> > > > > >> > > add them.
> > > > > >> > >
> > > > > >> > >
> > > > > >> > >
> > > > > >> > > >     io-wait-ratio
> > > > > >> > > >     io-ratio
> > > > > >> > > >
> > > > > >> > >

Re: [DISCUSS] KIP-714: Client metrics and observability

Posted by Jun Rao <ju...@confluent.io.INVALID>.

Hi, Kirk,

Thanks for the reply. A couple more comments.

(1) "Another perspective is that these two sets of metrics serve different
purposes and/or have different audiences, which argues that they should
maintain their individuality and purpose." Hmm, I am wondering whether those
metrics are really for different audiences and purposes. For example, if
the operator detects an issue through a client metric collected through
the server, the operator may need to communicate that back to the client.
It would be weird if that same metric were not visible on the client side.

(2) If we could standardize the names on the server side, do we need to
enforce a naming convention for all clients?

Thanks,

Jun

On Thu, Jun 16, 2022 at 12:00 PM Kirk True <ki...@kirktrue.pro> wrote:

> Hi Jun,
>
> I'll try to answer the questions posed...
>
> On Tue, Jun 7, 2022, at 4:32 PM, Jun Rao wrote:
> > Hi, Magnus,
> >
> > Thanks for the reply.
> >
> > So, the standard set of generic metrics is just a recommendation and not a
> > requirement? This sounds good to me since it makes the adoption of the KIP
> > easier.
>
> I believe that was the intent, yes.
>
> > Regarding the metric names, I have two concerns.
>
> (I'm splitting these two up for readability...)
>
> > (1) If a client already
> > has an existing metric similar to the standard one, duplicating the metric
> > seems to be confusing.
>
> Agreed. I'm dealing with that situation as I write the Java client
> implementation.
>
> The existing Java client exposes a set of metrics via JMX. The updated
> Java client will introduce a second set of metrics, which instead are
> exposed via sending them to the broker. There is substantial overlap between
> the two sets of metrics, and in a few places in the code under development,
> there are essentially two separate calls to update metrics: one for the
> JMX-bound metrics and one for the broker-bound metrics.
>
> To be candid, I have gone back-and-forth on that design. From one
> perspective, it could be argued that the set of client metrics should be
> standardized across a given client, regardless of how those metrics are
> exposed for consumption. Another perspective is that these two sets of
> metrics serve different purposes and/or have different audiences, which
> argues that they should maintain their individuality and purpose. Your
> inputs/suggestions are certainly welcome!
>
> > (2) If a client needs to implement a standard metric
> > that doesn't exist yet, using a naming convention (e.g., using dash vs dot)
> > different from other existing metrics also seems a bit confusing. It seems
> > that the main benefit of having standard metric names across clients is for
> > better server side monitoring. Could we do the standardization in the
> > plugin on the server?
>
> I think the expectation is that the plugin implementation will perform
> transformation of metric names, if needed, to fit in with an organization's
> monitoring naming standards. Perhaps we need to call that out in the KIP
> itself.
>
> Thanks,
> Kirk

Re: [DISCUSS] KIP-714: Client metrics and observability

Posted by Kirk True <ki...@kirktrue.pro>.

Hi Jun,

I'll try to answer the questions posed...

On Tue, Jun 7, 2022, at 4:32 PM, Jun Rao wrote:
> Hi, Magnus,
> 
> Thanks for the reply.
> 
> So, the standard set of generic metrics is just a recommendation and not a
> requirement? This sounds good to me since it makes the adoption of the KIP
> easier.

I believe that was the intent, yes.

> Regarding the metric names, I have two concerns.

(I'm splitting these two up for readability...)

> (1) If a client already
> has an existing metric similar to the standard one, duplicating the metric
> seems to be confusing.

Agreed. I'm dealing with that situation as I write the Java client implementation.

The existing Java client exposes a set of metrics via JMX. The updated Java client will introduce a second set of metrics, which instead are exposed via sending them to the broker. There is substantial overlap between the two sets of metrics, and in a few places in the code under development, there are essentially two separate calls to update metrics: one for the JMX-bound metrics and one for the broker-bound metrics.

To be candid, I have gone back-and-forth on that design. From one perspective, it could be argued that the set of client metrics should be standardized across a given client, regardless of how those metrics are exposed for consumption. Another perspective is that these two sets of metrics serve different purposes and/or have different audiences, which argues that they should maintain their individuality and purpose. Your inputs/suggestions are certainly welcome! 
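
One way to picture the "two separate calls" situation is a single call site that fans one measurement out to both destinations. This is only a minimal sketch under stated assumptions: `MetricSink`, `InMemorySink`, and `FanOutRecorder` are hypothetical names and not part of the Kafka client, and the pairing of the two metric names (both mentioned elsewhere in this thread) is purely illustrative.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: none of these types exist in the Kafka client.
final class DualMetricsSketch {

    /** A destination for gauge-style metrics (e.g. JMX or broker-bound telemetry). */
    interface MetricSink {
        void set(String name, double value);
    }

    /** Stand-in sink that remembers the latest value per metric name. */
    static final class InMemorySink implements MetricSink {
        private final Map<String, Double> values = new ConcurrentHashMap<>();

        @Override
        public void set(String name, double value) {
            values.put(name, value);
        }

        Double get(String name) {
            return values.get(name);
        }
    }

    /** One call site that updates both the JMX-bound and the broker-bound metric. */
    static final class FanOutRecorder {
        private final MetricSink jmxSink;
        private final String jmxName;
        private final MetricSink telemetrySink;
        private final String telemetryName;

        FanOutRecorder(MetricSink jmxSink, String jmxName,
                       MetricSink telemetrySink, String telemetryName) {
            this.jmxSink = jmxSink;
            this.jmxName = jmxName;
            this.telemetrySink = telemetrySink;
            this.telemetryName = telemetryName;
        }

        /** Records the same value under each destination's own name. */
        void record(double value) {
            jmxSink.set(jmxName, value);
            telemetrySink.set(telemetryName, value);
        }
    }

    public static void main(String[] args) {
        InMemorySink jmx = new InMemorySink();
        InMemorySink telemetry = new InMemorySink();
        // Illustrative name pairing only; the real mapping is up to the client.
        FanOutRecorder recorder = new FanOutRecorder(
                jmx, "buffer-total-bytes",
                telemetry, "client.producer.record.queue.max.bytes");
        recorder.record(33554432.0);
        System.out.println(jmx.get("buffer-total-bytes"));
        System.out.println(telemetry.get("client.producer.record.queue.max.bytes"));
    }
}
```

A fan-out like this keeps the two metric sets individually named while removing the duplicated update calls; whether that is preferable to fully unifying the sets is exactly the open design question.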

> (2) If a client needs to implement a standard metric
> that doesn't exist yet, using a naming convention (e.g., using dash vs dot)
> different from other existing metrics also seems a bit confusing. It seems
> that the main benefit of having standard metric names across clients is for
> better server side monitoring. Could we do the standardization in the
> plugin on the server?

I think the expectation is that the plugin implementation will perform transformation of metric names, if needed, to fit in with an organization's monitoring naming standards. Perhaps we need to call that out in the KIP itself.
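
As a rough illustration of that kind of plugin-side transformation, the sketch below rewrites a dot-separated metric name into an underscore-separated one. Everything here is an assumption made for the example: `MetricNameTranslator` is a hypothetical helper, the target convention is invented, and the full metric name is assembled from the "org.apache.kafka..." prefix and the metric names mentioned in this thread rather than taken from the KIP.

```java
import java.util.Locale;

// Hypothetical helper a server-side plugin might use; not part of KIP-714.
final class MetricNameTranslator {
    // Namespace of the standard metric set, per the discussion in this thread.
    private static final String STANDARD_PREFIX = "org.apache.kafka.";

    private MetricNameTranslator() {
    }

    /**
     * Maps a client-reported metric name onto an assumed organization-wide
     * convention: lower case, '_' as the separator, and the standard
     * namespace shortened to "kafka".
     */
    static String toOrgConvention(String name) {
        String shortened = name.startsWith(STANDARD_PREFIX)
                ? "kafka." + name.substring(STANDARD_PREFIX.length())
                : name;
        return shortened.toLowerCase(Locale.ROOT)
                .replace('.', '_')
                .replace('-', '_');
    }

    public static void main(String[] args) {
        System.out.println(toOrgConvention("org.apache.kafka.client.producer.record.queue.bytes"));
        System.out.println(toOrgConvention("se.edenhill.librdkafka.some-metric"));
    }
}
```

Because the translation lives in the plugin, clients can keep whatever separator they already use (dash or dot) and the monitoring system still sees one consistent style.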

Thanks,
Kirk

> 
> Thanks,
> 
> Jun
> 
> 
> 
> On Tue, Jun 7, 2022 at 6:53 AM Magnus Edenhill <ma...@edenhill.se> wrote:
> 
> > Hey Jun,
> >
> > I've clarified the scope of the standard metrics in the KIP, but basically:
> >
> >  * We define a standard set of generic metrics that should be relevant to
> > most client implementations, e.g., each producer implementation probably
> > has some sort of per-partition message queue.
> >  * A client implementation should strive to implement as many of the
> > standard metrics as possible, but only the ones that make sense.
> >  * For metrics that are not in the standard set, a client maintainer can
> > choose to either submit a KIP to add additional standard metrics - if
> > they're relevant, or go ahead and add custom metrics that are specific to
> > that client implementation. These custom metrics will have a prefix
> > specific to that client implementation, as opposed to the standard metric
> > set that resides under "org.apache.kafka...". E.g.,
> > "se.edenhill.librdkafka" or whatever.
> >  * Existing non-KIP-714 metrics should remain untouched. In some cases we
> > might be able to use the same meter given it is compatible with the
> > standard metric set definition, in other cases a semi-duplicate meter may
> > be needed. Thus this will not affect the metrics exposed through JMX, or
> > vice versa.
> >
> > Thanks,
> > Magnus
> >
> >
> >
> > On Wed, Jun 1, 2022 at 18:55, Jun Rao <ju...@confluent.io.invalid> wrote:
> >
> > > Hi, Magnus,
> > >
> > > 51. Just to clarify my question. (1) Are standard metrics required for
> > > every client for this KIP to function? (2) Are we converting existing java
> > > metrics to the standard metrics and deprecating the old ones? If so, could
> > > we list all existing java metrics that need to be renamed and the
> > > corresponding new name?
> > >
> > > Thanks,
> > >
> > > Jun
> > >
> > > On Tue, May 31, 2022 at 3:29 PM Jun Rao <ju...@confluent.io> wrote:
> > >
> > > > Hi, Magnus,
> > > >
> > > > Thanks for the reply.
> > > >
> > > > 51. I think it's fine to have a list of recommended metrics for every
> > > > client to implement. I am just not sure that standardizing on the metric
> > > > names across all clients is practical. The common metrics listed in the
> > > > KIP have completely different names from the java metric names. Some of
> > > > them have different types. For example, some of the common metrics have a
> > > > type of histogram, but the java client metrics don't use histogram in
> > > > general. Requiring the operator to translate those names and understand the
> > > > subtle differences across clients seems to cause more confusion during
> > > > troubleshooting.
> > > >
> > > > Thanks,
> > > >
> > > > Jun
> > > >
> > > > On Tue, May 31, 2022 at 5:02 AM Magnus Edenhill <ma...@edenhill.se>
> > > > wrote:
> > > >
> > > >> On Fri, May 20, 2022 at 01:23, Jun Rao <jun@confluent.io.invalid> wrote:
> > > >>
> > > >> > Hi, Magnus,
> > > >> >
> > > >> > Thanks for the reply.
> > > >> >
> > > >> > 50. Sounds good.
> > > >> >
> > > >> > 51. I misunderstood the proposal in the KIP then. The proposal is to
> > > >> > define a set of common metric names that every client should implement. The
> > > >> > problem is that every client already has its own set of metrics with its
> > > >> > own names. I am not sure that we could easily agree upon a common set of
> > > >> > metrics that work with all clients. There are likely to be some metrics
> > > >> > that are client specific. Translating between the common name and client
> > > >> > specific name is probably going to add more confusion. As mentioned in the
> > > >> > KIP, similar metrics from different clients could have subtle
> > > >> > semantic differences. Could we just let each client use its own set of
> > > >> > metric names?
> > > >> >
> > > >>
> > > >> We identified a common set of metrics that should be relevant for most
> > > >> client implementations; they're the ones listed in the KIP.
> > > >> A supporting client does not have to implement all those metrics, only the
> > > >> ones that make sense based on that client implementation, and a client may
> > > >> implement other metrics that are not listed in the KIP under its own
> > > >> namespace.
> > > >> This approach has two benefits:
> > > >>  - there will be a common set of metrics that most/all clients implement,
> > > >>    which makes monitoring and troubleshooting easier across fleets with
> > > >>    multiple Kafka client languages/implementations.
> > > >>  - client-specific metrics are still possible, so if there is no suitable
> > > >>    standard metric a client can still provide what special metrics it has.
> > > >>
> > > >>
> > > >> Thanks,
> > > >> Magnus
> > > >>
> > > >>
> > > >> On Thu, May 19, 2022 at 10:39 AM Magnus Edenhill <ma...@edenhill.se>
> > > >> wrote:
> > > >> >
> > > >> > > On Wed, May 18, 2022 at 19:57, Jun Rao <jun@confluent.io.invalid> wrote:
> > > >> > >
> > > >> > > > Hi, Magnus,
> > > >> > > >
> > > >> > >
> > > >> > > Hi Jun
> > > >> > >
> > > >> > >
> > > >> > > >
> > > >> > > > Thanks for the updated KIP. Just a couple of more comments.
> > > >> > > >
> > > >> > > > 50. To troubleshoot a particular client issue, I imagine that the client
> > > >> > > > needs to identify its client_instance_id. How does the client find this
> > > >> > > > out? Do we plan to include client_instance_id in the client log, expose it
> > > >> > > > as a metric or something else?
> > > >> > > >
> > > >> > >
> > > >> > > The KIP suggests that client implementations emit an informative log message
> > > >> > > with the assigned client-instance-id once it is retrieved (once per client
> > > >> > > instance lifetime).
> > > >> > > There's also a clientInstanceId() method that an application can use to retrieve
> > > >> > > the client instance id and emit through whatever side channels make sense.
> > > >> > >
> > > >> > >
> > > >> > >
> > > >> > > > 51. The KIP lists a bunch of metrics that need to be collected at the
> > > >> > > > client side. However, it seems quite a few useful java client metrics like
> > > >> > > > the following are missing.
> > > >> > > >     buffer-total-bytes
> > > >> > > >     buffer-available-bytes
> > > >> > > >
> > > >> > >
> > > >> > > These are covered by client.producer.record.queue.bytes and
> > > >> > > client.producer.record.queue.max.bytes.
> > > >> > >
> > > >> > >
> > > >> > > >     bufferpool-wait-time
> > > >> > > >
> > > >> > >
> > > >> > > Missing, but somewhat implementation specific.
> > > >> > > If it was up to me we would add this later if there's a need.
> > > >> > >
> > > >> > >
> > > >> > >
> > > >> > > >     batch-size-avg
> > > >> > > >     batch-size-max
> > > >> > > >
> > > >> > >
> > > >> > > These are missing and would be suitably represented as a histogram.
> > > >> > > I'll add them.
> > > >> > >
> > > >> > >
> > > >> > >
> > > >> > > >     io-wait-ratio
> > > >> > > >     io-ratio
> > > >> > > >
> > > >> > >
> > > >> > > There's client.io.wait.time which should cover io-wait-ratio.
> > > >> > > We could add a client.io.time as well, now or in a later KIP.
> > > >> > >
> > > >> > > Thanks,
> > > >> > > Magnus
> > > >> > >
> > > >> > >
> > > >> > >
> > > >> > >
> > > >> > > >
> > > >> > > > Thanks,
> > > >> > > >
> > > >> > > > Jun
> > > >> > > >
> > > >> > > > On Mon, Apr 4, 2022 at 10:01 AM Jun Rao <ju...@confluent.io> wrote:
> > > >> > > >
> > > >> > > > > Hi, Xavier,
> > > >> > > > >
> > > >> > > > > Thanks for the reply.
> > > >> > > > >
> > > >> > > > > 28. It does seem that we have started using KafkaMetrics on the broker
> > > >> > > > > side. Then, my only concern is on the usage of Histogram in KafkaMetrics.
> > > >> > > > > Histogram in KafkaMetrics statically divides the value space into a fixed
> > > >> > > > > number of buckets and only returns values on the bucket boundary. So, the
> > > >> > > > > returned histogram value may never show up in a recorded value. Yammer
> > > >> > > > > Histogram, on the other hand, uses reservoir sampling. The reported value
> > > >> > > > > is always one of the recorded values. So, I am not sure that Histogram in
> > > >> > > > > KafkaMetrics is as good as Yammer Histogram. ClientMetricsPluginExportTime
> > > >> > > > > uses Histogram.
> > > >> > > > >
> > > >> > > > > Thanks,
> > > >> > > > >
> > > >> > > > > Jun
> > > >> > > > >
> > > >> > > > > On Thu, Mar 31, 2022 at 5:21 PM Xavier Léauté <xa...@confluent.io.invalid> wrote:
> > > >> > > > >
> > > >> > > > >> >
> > > >> > > > >> > 28. On the broker, we typically use Yammer metrics. Only for metrics that
> > > >> > > > >> > depend on Kafka metric features (e.g., quota), we use the Kafka metric.
> > > >> > > > >> > Yammer metrics have 4 types: gauge, meter, histogram and timer. meter
> > > >> > > > >> > calculates a rate, but also exposes an accumulated value.
> > > >> > > > >> >
> > > >> > > > >>
> > > >> > > > >> I don't see a good reason we should limit ourselves to Yammer metrics on
> > > >> > > > >> the broker. KafkaMetrics was written to replace Yammer metrics and is used
> > > >> > > > >> for all new components (clients, streams, connect, etc.).
> > > >> > > > >> My understanding is that the original goal was to retire Yammer metrics in
> > > >> > > > >> the broker in favor of KafkaMetrics.
> > > >> > > > >> We just haven't done so out of backwards compatibility concerns.
> > > >> > > > >> There are other broker metrics such as group coordinator, transaction state
> > > >> > > > >> manager, and various socket server metrics already using KafkaMetrics that
> > > >> > > > >> don't need specific Kafka metric features, so I don't see why we should
> > > >> > > > >> refrain from using Kafka metrics on the broker unless there are real
> > > >> > > > >> compatibility concerns or where implementation specifics could lead to
> > > >> > > > >> confusion when comparing metrics using different implementations.
> > > >> > > > >>
> > > >> > > > >> In my opinion we should encourage people to use KafkaMetrics going forward
> > > >> > > > >> on the broker as well, for three reasons:
> > > >> > > > >> a) yammer metrics is long deprecated and no longer maintained
> > > >> > > > >> b) yammer metrics are much less expressive
> > > >> > > > >> c) we don't have a proper API to expose yammer metrics outside of JMX
> > > >> > > > >> (MetricsReporter only exposes KafkaMetrics)
> > > >> > > > >>
> > > >> > > > >
> > > >> > > >
> > > >> > >
> > > >> >
> > > >>
> > > >
> > >
> >
>
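
The histogram concern in point 28 of the quoted thread (fixed buckets versus reservoir sampling) can be made concrete with a small sketch. This is a deliberately simplified illustration, not the actual KafkaMetrics or Yammer code: `BucketedHistogram` and `ReservoirHistogram` are hypothetical classes, the bucketed variant reports the lower bucket boundary, and the reservoir variant uses plain nearest-rank quantiles over a uniform reservoir.

```java
import java.util.Arrays;
import java.util.Random;

// Simplified, hypothetical models of the two histogram styles discussed above.
final class HistogramSketch {

    /** Fixed buckets: a quantile can only be reported as a bucket boundary. */
    static final class BucketedHistogram {
        private final double min;
        private final double bucketWidth;
        private final long[] counts;
        private long total;

        BucketedHistogram(double min, double max, int buckets) {
            this.min = min;
            this.bucketWidth = (max - min) / buckets;
            this.counts = new long[buckets];
        }

        void record(double value) {
            int idx = (int) ((value - min) / bucketWidth);
            idx = Math.max(0, Math.min(counts.length - 1, idx)); // clamp outliers
            counts[idx]++;
            total++;
        }

        double quantile(double q) {
            long threshold = (long) Math.ceil(q * total);
            long seen = 0;
            for (int i = 0; i < counts.length; i++) {
                seen += counts[i];
                if (seen >= threshold) {
                    return min + i * bucketWidth; // lower boundary of the bucket
                }
            }
            return min + counts.length * bucketWidth;
        }
    }

    /** Uniform reservoir sampling: a quantile is always a value that was recorded. */
    static final class ReservoirHistogram {
        private final double[] samples;
        private final Random random;
        private long count;

        ReservoirHistogram(int size, long seed) {
            this.samples = new double[size];
            this.random = new Random(seed);
        }

        void record(double value) {
            count++;
            if (count <= samples.length) {
                samples[(int) (count - 1)] = value; // still filling the reservoir
            } else {
                // Replace a random slot with probability size / count.
                long j = (long) (random.nextDouble() * count);
                if (j < samples.length) {
                    samples[(int) j] = value;
                }
            }
        }

        double quantile(double q) {
            int n = (int) Math.min(count, samples.length);
            if (n == 0) {
                return Double.NaN;
            }
            double[] sorted = Arrays.copyOf(samples, n);
            Arrays.sort(sorted);
            int rank = (int) Math.ceil(q * n); // nearest-rank method
            return sorted[Math.max(0, rank - 1)];
        }
    }

    public static void main(String[] args) {
        BucketedHistogram bucketed = new BucketedHistogram(0.0, 100.0, 10);
        ReservoirHistogram reservoir = new ReservoirHistogram(8, 42L);
        for (double v : new double[] {13.0, 17.0, 42.0}) {
            bucketed.record(v);
            reservoir.record(v);
        }
        System.out.println(bucketed.quantile(0.5));  // 10.0, a bucket boundary
        System.out.println(reservoir.quantile(0.5)); // 17.0, a recorded value
    }
}
```

With the three samples above, the bucketed median is 10.0, which was never recorded, while the reservoir median is 17.0, which was. Real implementations differ in details such as bucket schemes and quantile interpolation.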

Re: [DISCUSS] KIP-714: Client metrics and observability

Posted by Jun Rao <ju...@confluent.io.INVALID>.

Hi, Magnus,

Thanks for the reply.

So, the standard set of generic metrics is just a recommendation and not a
requirement? This sounds good to me since it makes the adoption of the KIP
easier.

Regarding the metric names, I have two concerns. (1) If a client already
has an existing metric similar to the standard one, duplicating the metric
seems confusing. (2) If a client needs to implement a standard metric
that doesn't exist yet, using a naming convention (e.g., dash vs. dot)
different from its other existing metrics also seems a bit confusing. It
seems that the main benefit of having standard metric names across clients
is better server-side monitoring. Could we do the standardization in the
plugin on the server?
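
To make the question concrete, here is a rough Java sketch of what
server-side standardization could look like: the plugin keeps a translation
table from client-specific names to standard names and passes unknown names
through unchanged. The class, method, and table are hypothetical and not part
of KIP-714 or any Kafka API. The single table entry pairs two names mentioned
in this thread purely for illustration; it is not a verified equivalence.

```java
import java.util.Map;

// Hypothetical sketch: nothing here is part of KIP-714 or a Kafka API.
final class MetricNameStandardizer {
    // Illustrative translation table (client-specific name -> standard name).
    // The pairing is an assumption made for this example only.
    private static final Map<String, String> TRANSLATIONS = Map.of(
        "io-wait-ratio", "client.io.wait.time"
    );

    // Returns the standard name if one is known, otherwise the original name.
    static String standardize(String clientMetricName) {
        return TRANSLATIONS.getOrDefault(clientMetricName, clientMetricName);
    }

    public static void main(String[] args) {
        // A name with a known translation is rewritten.
        System.out.println(standardize("io-wait-ratio"));
        // A name without a translation is passed through as-is.
        System.out.println(standardize("bufferpool-wait-time"));
    }
}
```

With something like this, clients would keep their existing names and only
the server-side view would be uniform.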

Thanks,

Jun



On Tue, Jun 7, 2022 at 6:53 AM Magnus Edenhill <ma...@edenhill.se> wrote:

> Hey Jun,
>
> I've clarified the scope of the standard metrics in the KIP, but basically:
>
>  * We define a standard set of generic metrics that should be relevant to
> most client implementations, e.g., each producer implementation probably
> has some sort of per-partition message queue.
>  * A client implementation should strive to implement as many of the
> standard metrics as possible, but only the ones that make sense.
>  * For metrics that are not in the standard set, a client maintainer can
> choose to either submit a KIP to add additional standard metrics - if
> they're relevant, or go ahead and add custom metrics that are specific to
> that client implementation. These custom metrics will have a prefix
> specific to that client implementation, as opposed to the standard metric
> set that resides under "org.apache.kafka...". E.g.,
> "se.edenhill.librdkafka" or whatever.
>  * Existing non-KIP-714 metrics should remain untouched. In some cases we
> might be able to use the same meter given it is compatible with the
> standard metric set definition, in other cases a semi-duplicate meter may
> be needed. Thus this will not affect the metrics exposed through JMX, or
> vice versa.
>
> Thanks,
> Magnus

Re: [DISCUSS] KIP-714: Client metrics and observability

Posted by Magnus Edenhill <ma...@edenhill.se>.

Hey Jun,

I've clarified the scope of the standard metrics in the KIP, but basically:

 * We define a standard set of generic metrics that should be relevant to
most client implementations, e.g., each producer implementation probably
has some sort of per-partition message queue.
 * A client implementation should strive to implement as many of the
standard metrics as possible, but only the ones that make sense.
 * For metrics that are not in the standard set, a client maintainer can
choose to either submit a KIP to add additional standard metrics, if
they're relevant, or go ahead and add custom metrics that are specific to
that client implementation. These custom metrics will have a prefix
specific to that client implementation (e.g., "se.edenhill.librdkafka" or
whatever), as opposed to the standard metric set that resides under
"org.apache.kafka...".
 * Existing non-KIP-714 metrics should remain untouched. In some cases we
might be able to use the same meter given it is compatible with the
standard metric set definition, in other cases a semi-duplicate meter may
be needed. Thus this will not affect the metrics exposed through JMX, or
vice versa.
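
As a minimal sketch of the namespace split described above, the Java snippet
below tells standard metrics from client-specific ones by prefix. The class
and method names are hypothetical, and the full metric names used in main()
are assumptions for illustration; only the "org.apache.kafka" and
"se.edenhill.librdkafka" prefixes come from this thread.

```java
// Hypothetical sketch: class and method names are illustrative only.
final class MetricNamespace {
    // Prefix of the standard metric set, per the discussion in this thread.
    static final String STANDARD_PREFIX = "org.apache.kafka.";

    // True if the metric name lives under the standard namespace.
    static boolean isStandard(String metricName) {
        return metricName.startsWith(STANDARD_PREFIX);
    }

    // Builds a client-specific metric name under the client's own prefix.
    static String custom(String clientPrefix, String name) {
        return clientPrefix + "." + name;
    }

    public static void main(String[] args) {
        // Assumed full name of a standard metric, for illustration.
        String standard = "org.apache.kafka.client.producer.record.queue.bytes";
        // "example.metric" is a made-up custom metric name.
        String custom = custom("se.edenhill.librdkafka", "example.metric");

        System.out.println(standard + " standard? " + isStandard(standard));
        System.out.println(custom + " standard? " + isStandard(custom));
    }
}
```

A receiving plugin could use a check like this to treat the standard set
uniformly while still passing client-specific metrics through.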

Thanks,
Magnus



Den ons 1 juni 2022 kl 18:55 skrev Jun Rao <ju...@confluent.io.invalid>:

> Hi, Magnus,
>
> 51. Just to clarify my question.  (1) Are standard metrics required for
> every client for this KIP to function?  (2) Are we converting existing java
> metrics to the standard metrics and deprecating the old ones? If so, could
> we list all existing java metrics that need to be renamed and the
> corresponding new name?
>
> Thanks,
>
> Jun

Re: [DISCUSS] KIP-714: Client metrics and observability

Posted by Jun Rao <ju...@confluent.io.INVALID>.

Hi, Magnus,

51. Just to clarify my question.  (1) Are standard metrics required for
every client for this KIP to function?  (2) Are we converting existing java
metrics to the standard metrics and deprecating the old ones? If so, could
we list all existing java metrics that need to be renamed and the
corresponding new name?

Thanks,

Jun

On Tue, May 31, 2022 at 3:29 PM Jun Rao <ju...@confluent.io> wrote:

> Hi, Magnus,
>
> Thanks for the reply.
>
> 51. I think it's fine to have a list of recommended metrics for every
> client to implement. I am just not sure that standardizing on the metric
> names across all clients is practical. The list of common metrics in the
> KIP has completely different names from the java metric names. Some of
> them have different types. For example, some of the common metrics have a
> type of histogram, but the java client metrics don't use histogram in
> general. Requiring the operator to translate those names and understand the
> subtle differences across clients seems likely to cause more confusion
> during troubleshooting.
>
> Thanks,
>
> Jun

Re: [DISCUSS] KIP-714: Client metrics and observability

Posted by Jun Rao <ju...@confluent.io.INVALID>.

Hi, Magnus,

Thanks for the reply.

51. I think it's fine to have a list of recommended metrics for every
client to implement. I am just not sure that standardizing on the metric
names across all clients is practical. The list of common metrics in the
KIP has completely different names from the java metric names. Some of
them have different types. For example, some of the common metrics have a
type of histogram, but the java client metrics don't use histogram in
general. Requiring the operator to translate those names and understand the
subtle differences across clients seems likely to cause more confusion
during troubleshooting.

Thanks,

Jun

On Tue, May 31, 2022 at 5:02 AM Magnus Edenhill <ma...@edenhill.se> wrote:

> Den fre 20 maj 2022 kl 01:23 skrev Jun Rao <ju...@confluent.io.invalid>:
>
> > Hi, Magnus,
> >
> > Thanks for the reply.
> >
> > 50. Sounds good.
> >
> > 51. I misunderstood the proposal in the KIP then. The proposal is to
> > define a set of common metric names that every client should implement.
> > The problem is that every client already has its own set of metrics with
> > its own names. I am not sure that we could easily agree upon a common set
> > of metrics that work with all clients. There are likely to be some
> > metrics that are client specific. Translating between the common name and
> > client specific name is probably going to add more confusion. As
> > mentioned in the KIP, similar metrics from different clients could have
> > subtle semantic differences. Could we just let each client use its own
> > set of metric names?
> >
>
> We identified a common set of metrics that should be relevant for most
> client implementations; they're the ones listed in the KIP.
> A supporting client does not have to implement all those metrics, only the
> ones that make sense based on that client implementation, and a client may
> implement other metrics that are not listed in the KIP under its own
> namespace.
> This approach has two benefits:
>  - there will be a common set of metrics that most/all clients implement,
>    which makes monitoring and troubleshooting easier across fleets with
>    multiple Kafka client languages/implementations.
>  - client-specific metrics are still possible, so if there is no suitable
>    standard metric a client can still provide what special metrics it has.
>
>
> Thanks,
> Magnus
>
>
> On Thu, May 19, 2022 at 10:39 AM Magnus Edenhill <ma...@edenhill.se>
> wrote:
> >
> > > Den ons 18 maj 2022 kl 19:57 skrev Jun Rao <ju...@confluent.io.invalid>:
> > >
> > > > Hi, Magnus,
> > > >
> > >
> > > Hi Jun
> > >
> > >
> > > >
> > > > Thanks for the updated KIP. Just a couple more comments.
> > > >
> > > > 50. To troubleshoot a particular client issue, I imagine that the
> > > > client needs to identify its client_instance_id. How does the client
> > > > find this out? Do we plan to include client_instance_id in the client
> > > > log, expose it as a metric or something else?
> > > >
> > >
> > > The KIP suggests that client implementations emit an informative log
> > > message
> > > with the assigned client-instance-id once it is retrieved (once per
> > client
> > > instance lifetime).
> > > There's also a clientInstanceId() method that an application can use to
> > > retrieve
> > > the client instance id and emit through whatever side channels makes
> > sense.
> > >
> > >
> > >
> > > > 51. The KIP lists a bunch of metrics that need to be collected at the
> > > > client side. However, it seems quite a few useful java client metrics
> > > like
> > > > the following are missing.
> > > >     buffer-total-bytes
> > > >     buffer-available-bytes
> > > >
> > >
> > > These are covered by client.producer.record.queue.bytes and
> > > client.producer.record.queue.max.bytes.
> > >
> > >
> > > >     bufferpool-wait-time
> > > >
> > >
> > > Missing, but somewhat implementation specific.
> > > If it was up to me we would add this later if there's a need.
> > >
> > >
> > >
> > > >     batch-size-avg
> > > >     batch-size-max
> > > >
> > >
> > > These are missing and would be suitably represented as a histogram.
> I'll
> > > add them.
> > >
> > >
> > >
> > > >     io-wait-ratio
> > > >     io-ratio
> > > >
> > >
> > > There's client.io.wait.time which should cover io-wait-ratio.
> > > We could add a client.io.time as well, now or in a later KIP.
> > >
> > > Thanks,
> > > Magnus
> > >
> > >
> > >
> > >
> > > >
> > > > Thanks,
> > > >
> > > > Jun
> > > >
> > > > On Mon, Apr 4, 2022 at 10:01 AM Jun Rao <ju...@confluent.io> wrote:
> > > >
> > > > > Hi, Xavier,
> > > > >
> > > > > Thanks for the reply.
> > > > >
> > > > > 28. It does seem that we have started using KafkaMetrics on the
> > broker
> > > > > side. Then, my only concern is on the usage of Histogram in
> > > KafkaMetrics.
> > > > > Histogram in KafkaMetrics statically divides the value space into a
> > > fixed
> > > > > number of buckets and only returns values on the bucket boundary.
> So,
> > > the
> > > > > returned histogram value may never show up in a recorded value.
> > Yammer
> > > > > Histogram, on the other hand, uses reservoir sampling. The reported
> > > value
> > > > > is always one of the recorded values. So, I am not sure that
> > Histogram
> > > in
> > > > > KafkaMetrics is as good as Yammer Histogram.
> > > > ClientMetricsPluginExportTime
> > > > > uses Histogram.
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Jun
> > > > >
> > > > > On Thu, Mar 31, 2022 at 5:21 PM Xavier Léauté
> > > > <xa...@confluent.io.invalid>
> > > > > wrote:
> > > > >
> > > > >> >
> > > > >> > 28. On the broker, we typically use Yammer metrics. Only for
> > metrics
> > > > >> that
> > > > >> > depend on Kafka metric features (e.g., quota), we use the Kafka
> > > > metric.
> > > > >> > Yammer metrics have 4 types: gauge, meter, histogram and timer.
> > > meter
> > > > >> > calculates a rate, but also exposes an accumulated value.
> > > > >> >
> > > > >>
> > > > >> I don't see a good reason we should limit ourselves to Yammer
> > metrics
> > > on
> > > > >> the broker. KafkaMetrics was written
> > > > >> to replace Yammer metrics and is used for all new components
> > (clients,
> > > > >> streams, connect, etc.)
> > > > >> My understanding is that the original goal was to retire Yammer
> > > metrics
> > > > in
> > > > >> the broker in favor of KafkaMetrics.
> > > > >> We just haven't done so out of backwards compatibility concerns.
> > > > >> There are other broker metrics such as group coordinator,
> > transaction
> > > > >> state
> > > > >> manager, and various socket server metrics
> > > > >> already using KafkaMetrics that don't need specific Kafka metric
> > > > features,
> > > > >> so I don't see why we should refrain from using
> > > > >> Kafka metrics on the broker unless there are real compatibility
> > > concerns
> > > > >> or
> > > > >> where implementation specifics could lead to confusion when
> > comparing
> > > > >> metrics using different implementations.
> > > > >>
> > > > >> In my opinion we should encourage people to use KafkaMetrics going
> > > > forward
> > > > >> on the broker as well, for two reasons:
> > > > >> a) yammer metrics is long deprecated and no longer maintained
> > > > >> b) yammer metrics are much less expressive
> > > > >> c) we don't have a proper API to expose yammer metrics outside of
> > JMX
> > > > >> (MetricsReporter only exposes KafkaMetrics)
> > > > >>
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] KIP-714: Client metrics and observability

Posted by Magnus Edenhill <ma...@edenhill.se>.

On Fri, May 20, 2022 at 01:23, Jun Rao <ju...@confluent.io.invalid> wrote:

> Hi, Magnus,
>
> Thanks for the reply.
>
> 50. Sounds good.
>
> 51. I misunderstood the proposal in the KIP then. The proposal is to
> define a set of common metric names that every client should implement. The
> problem is that every client already has its own set of metrics with its
> own names. I am not sure that we could easily agree upon a common set of
> metrics that work with all clients. There are likely to be some metrics
> that are client specific. Translating between the common name and client
> specific name is probably going to add more confusion. As mentioned in the
> KIP, similar metrics from different clients could have subtle
> semantic differences. Could we just let each client use its own set of
> metric names?
>

We identified a common set of metrics that should be relevant for most
client implementations; they're the ones listed in the KIP.
A supporting client does not have to implement all of those metrics, only the
ones that make sense for that client implementation, and a client may
implement other metrics that are not listed in the KIP under its own namespace.
This approach has two benefits:
 - there will be a common set of metrics that most/all clients implement,
   which makes monitoring and troubleshooting easier across fleets with
   multiple Kafka client languages/implementations.
 - client-specific metrics are still possible, so if there is no suitable
   standard metric a client can still provide whatever special metrics it has.
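
To make the naming scheme concrete, here is a rough sketch in Python. The
standard names are ones mentioned in this thread; the `org.example.myclient.`
prefix and the `telemetry_name` helper are hypothetical and only illustrate
the idea, they are not taken from the KIP.

```python
# Standard metric names mentioned in this thread (a subset, for illustration).
STANDARD_METRICS = {
    "client.producer.record.queue.bytes",
    "client.producer.record.queue.max.bytes",
    "client.io.wait.time",
}

# Hypothetical namespace for metrics that only this client implementation has.
CLIENT_NAMESPACE = "org.example.myclient."


def telemetry_name(name: str) -> str:
    """Return the name to report: standard names pass through unchanged,
    anything else is placed under the client's own namespace."""
    if name in STANDARD_METRICS:
        return name
    return CLIENT_NAMESPACE + name


print(telemetry_name("client.io.wait.time"))
print(telemetry_name("bufferpool.wait.time"))
```

A client that lacks one of the standard metrics would simply not report it;
nothing in this scheme forces every name to be present.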


Thanks,
Magnus


On Thu, May 19, 2022 at 10:39 AM Magnus Edenhill <ma...@edenhill.se> wrote:
>
> > On Wed, May 18, 2022 at 19:57, Jun Rao <ju...@confluent.io.invalid> wrote:
> >
> > > Hi, Magnus,
> > >
> >
> > Hi Jun
> >
> >
> > >
> > > Thanks for the updated KIP. Just a couple of more comments.
> > >
> > > 50. To troubleshoot a particular client issue, I imagine that the
> client
> > > needs to identify its client_instance_id. How does the client find this
> > > out? Do we plan to include client_instance_id in the client log, expose
> > it
> > > as a metric or something else?
> > >
> >
> > The KIP suggests that client implementations emit an informative log
> > message
> > with the assigned client-instance-id once it is retrieved (once per
> client
> > instance lifetime).
> > There's also a clientInstanceId() method that an application can use to
> > retrieve
> > the client instance id and emit through whatever side channels makes
> sense.
> >
> >
> >
> > > 51. The KIP lists a bunch of metrics that need to be collected at the
> > > client side. However, it seems quite a few useful java client metrics
> > like
> > > the following are missing.
> > >     buffer-total-bytes
> > >     buffer-available-bytes
> > >
> >
> > These are covered by client.producer.record.queue.bytes and
> > client.producer.record.queue.max.bytes.
> >
> >
> > >     bufferpool-wait-time
> > >
> >
> > Missing, but somewhat implementation specific.
> > If it was up to me we would add this later if there's a need.
> >
> >
> >
> > >     batch-size-avg
> > >     batch-size-max
> > >
> >
> > These are missing and would be suitably represented as a histogram. I'll
> > add them.
> >
> >
> >
> > >     io-wait-ratio
> > >     io-ratio
> > >
> >
> > There's client.io.wait.time which should cover io-wait-ratio.
> > We could add a client.io.time as well, now or in a later KIP.
> >
> > Thanks,
> > Magnus
> >
> >
> >
> >
> > >
> > > Thanks,
> > >
> > > Jun
> > >
> > > On Mon, Apr 4, 2022 at 10:01 AM Jun Rao <ju...@confluent.io> wrote:
> > >
> > > > Hi, Xavier,
> > > >
> > > > Thanks for the reply.
> > > >
> > > > 28. It does seem that we have started using KafkaMetrics on the
> broker
> > > > side. Then, my only concern is on the usage of Histogram in
> > KafkaMetrics.
> > > > Histogram in KafkaMetrics statically divides the value space into a
> > fixed
> > > > number of buckets and only returns values on the bucket boundary. So,
> > the
> > > > returned histogram value may never show up in a recorded value.
> Yammer
> > > > Histogram, on the other hand, uses reservoir sampling. The reported
> > value
> > > > is always one of the recorded values. So, I am not sure that
> Histogram
> > in
> > > > KafkaMetrics is as good as Yammer Histogram.
> > > ClientMetricsPluginExportTime
> > > > uses Histogram.
> > > >
> > > > Thanks,
> > > >
> > > > Jun
> > > >
> > > > On Thu, Mar 31, 2022 at 5:21 PM Xavier Léauté
> > > <xa...@confluent.io.invalid>
> > > > wrote:
> > > >
> > > >> >
> > > >> > 28. On the broker, we typically use Yammer metrics. Only for
> metrics
> > > >> that
> > > >> > depend on Kafka metric features (e.g., quota), we use the Kafka
> > > metric.
> > > >> > Yammer metrics have 4 types: gauge, meter, histogram and timer.
> > meter
> > > >> > calculates a rate, but also exposes an accumulated value.
> > > >> >
> > > >>
> > > >> I don't see a good reason we should limit ourselves to Yammer
> metrics
> > on
> > > >> the broker. KafkaMetrics was written
> > > >> to replace Yammer metrics and is used for all new components
> (clients,
> > > >> streams, connect, etc.)
> > > >> My understanding is that the original goal was to retire Yammer
> > metrics
> > > in
> > > >> the broker in favor of KafkaMetrics.
> > > >> We just haven't done so out of backwards compatibility concerns.
> > > >> There are other broker metrics such as group coordinator,
> transaction
> > > >> state
> > > >> manager, and various socket server metrics
> > > >> already using KafkaMetrics that don't need specific Kafka metric
> > > features,
> > > >> so I don't see why we should refrain from using
> > > >> Kafka metrics on the broker unless there are real compatibility
> > concerns
> > > >> or
> > > >> where implementation specifics could lead to confusion when
> comparing
> > > >> metrics using different implementations.
> > > >>
> > > >> In my opinion we should encourage people to use KafkaMetrics going
> > > forward
> > > >> on the broker as well, for two reasons:
> > > >> a) yammer metrics is long deprecated and no longer maintained
> > > >> b) yammer metrics are much less expressive
> > > >> c) we don't have a proper API to expose yammer metrics outside of
> JMX
> > > >> (MetricsReporter only exposes KafkaMetrics)
> > > >>
> > > >
> > >
> >
>

Re: [DISCUSS] KIP-714: Client metrics and observability

Posted by Jun Rao <ju...@confluent.io.INVALID>.

Hi, Magnus,

Thanks for the reply.

50. Sounds good.

51. I misunderstood the proposal in the KIP then. The proposal is to
define a set of common metric names that every client should implement. The
problem is that every client already has its own set of metrics with its
own names. I am not sure that we could easily agree upon a common set of
metrics that work with all clients. There are likely to be some metrics
that are client specific. Translating between the common name and client
specific name is probably going to add more confusion. As mentioned in the
KIP, similar metrics from different clients could have subtle
semantic differences. Could we just let each client use its own set of
metric names?

Thanks,

Jun

On Thu, May 19, 2022 at 10:39 AM Magnus Edenhill <ma...@edenhill.se> wrote:

> On Wed, May 18, 2022 at 19:57, Jun Rao <ju...@confluent.io.invalid> wrote:
>
> > Hi, Magnus,
> >
>
> Hi Jun
>
>
> >
> > Thanks for the updated KIP. Just a couple of more comments.
> >
> > 50. To troubleshoot a particular client issue, I imagine that the client
> > needs to identify its client_instance_id. How does the client find this
> > out? Do we plan to include client_instance_id in the client log, expose
> it
> > as a metric or something else?
> >
>
> The KIP suggests that client implementations emit an informative log
> message
> with the assigned client-instance-id once it is retrieved (once per client
> instance lifetime).
> There's also a clientInstanceId() method that an application can use to
> retrieve
> the client instance id and emit through whatever side channels makes sense.
>
>
>
> > 51. The KIP lists a bunch of metrics that need to be collected at the
> > client side. However, it seems quite a few useful java client metrics
> like
> > the following are missing.
> >     buffer-total-bytes
> >     buffer-available-bytes
> >
>
> These are covered by client.producer.record.queue.bytes and
> client.producer.record.queue.max.bytes.
>
>
> >     bufferpool-wait-time
> >
>
> Missing, but somewhat implementation specific.
> If it was up to me we would add this later if there's a need.
>
>
>
> >     batch-size-avg
> >     batch-size-max
> >
>
> These are missing and would be suitably represented as a histogram. I'll
> add them.
>
>
>
> >     io-wait-ratio
> >     io-ratio
> >
>
> There's client.io.wait.time which should cover io-wait-ratio.
> We could add a client.io.time as well, now or in a later KIP.
>
> Thanks,
> Magnus
>
>
>
>
> >
> > Thanks,
> >
> > Jun
> >
> > On Mon, Apr 4, 2022 at 10:01 AM Jun Rao <ju...@confluent.io> wrote:
> >
> > > Hi, Xavier,
> > >
> > > Thanks for the reply.
> > >
> > > 28. It does seem that we have started using KafkaMetrics on the broker
> > > side. Then, my only concern is on the usage of Histogram in
> KafkaMetrics.
> > > Histogram in KafkaMetrics statically divides the value space into a
> fixed
> > > number of buckets and only returns values on the bucket boundary. So,
> the
> > > returned histogram value may never show up in a recorded value. Yammer
> > > Histogram, on the other hand, uses reservoir sampling. The reported
> value
> > > is always one of the recorded values. So, I am not sure that Histogram
> in
> > > KafkaMetrics is as good as Yammer Histogram.
> > ClientMetricsPluginExportTime
> > > uses Histogram.
> > >
> > > Thanks,
> > >
> > > Jun
> > >
> > > On Thu, Mar 31, 2022 at 5:21 PM Xavier Léauté
> > <xa...@confluent.io.invalid>
> > > wrote:
> > >
> > >> >
> > >> > 28. On the broker, we typically use Yammer metrics. Only for metrics
> > >> that
> > >> > depend on Kafka metric features (e.g., quota), we use the Kafka
> > metric.
> > >> > Yammer metrics have 4 types: gauge, meter, histogram and timer.
> meter
> > >> > calculates a rate, but also exposes an accumulated value.
> > >> >
> > >>
> > >> I don't see a good reason we should limit ourselves to Yammer metrics
> on
> > >> the broker. KafkaMetrics was written
> > >> to replace Yammer metrics and is used for all new components (clients,
> > >> streams, connect, etc.)
> > >> My understanding is that the original goal was to retire Yammer
> metrics
> > in
> > >> the broker in favor of KafkaMetrics.
> > >> We just haven't done so out of backwards compatibility concerns.
> > >> There are other broker metrics such as group coordinator, transaction
> > >> state
> > >> manager, and various socket server metrics
> > >> already using KafkaMetrics that don't need specific Kafka metric
> > features,
> > >> so I don't see why we should refrain from using
> > >> Kafka metrics on the broker unless there are real compatibility
> concerns
> > >> or
> > >> where implementation specifics could lead to confusion when comparing
> > >> metrics using different implementations.
> > >>
> > >> In my opinion we should encourage people to use KafkaMetrics going
> > forward
> > >> on the broker as well, for two reasons:
> > >> a) yammer metrics is long deprecated and no longer maintained
> > >> b) yammer metrics are much less expressive
> > >> c) we don't have a proper API to expose yammer metrics outside of JMX
> > >> (MetricsReporter only exposes KafkaMetrics)
> > >>
> > >
> >
>

Re: [DISCUSS] KIP-714: Client metrics and observability

Posted by Magnus Edenhill <ma...@edenhill.se>.

On Wed, May 18, 2022 at 19:57, Jun Rao <ju...@confluent.io.invalid> wrote:

> Hi, Magnus,
>

Hi Jun


>
> Thanks for the updated KIP. Just a couple of more comments.
>
> 50. To troubleshoot a particular client issue, I imagine that the client
> needs to identify its client_instance_id. How does the client find this
> out? Do we plan to include client_instance_id in the client log, expose it
> as a metric or something else?
>

The KIP suggests that client implementations emit an informative log message
with the assigned client-instance-id once it is retrieved (once per client
instance lifetime).
There's also a clientInstanceId() method that an application can use to
retrieve the client instance id and emit it through whatever side channels
make sense.



> 51. The KIP lists a bunch of metrics that need to be collected at the
> client side. However, it seems quite a few useful java client metrics like
> the following are missing.
>     buffer-total-bytes
>     buffer-available-bytes
>

These are covered by client.producer.record.queue.bytes and
client.producer.record.queue.max.bytes.
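
A small sketch of how the Java client's existing gauges could map onto these
two names. The mapping is an assumption on my part, not something spelled out
here: it takes queue.max.bytes to be the total buffer size and queue.bytes to
be the part of the buffer currently in use. The `queue_metrics` helper is
hypothetical.

```python
def queue_metrics(buffer_total_bytes: int, buffer_available_bytes: int) -> dict:
    """Map the Java producer's buffer gauges onto the standard names.

    Assumption: queue.max.bytes is the total buffer size and queue.bytes
    is the portion currently in use (total minus available).
    """
    if not 0 <= buffer_available_bytes <= buffer_total_bytes:
        raise ValueError("available bytes must be between 0 and total bytes")
    return {
        "client.producer.record.queue.max.bytes": buffer_total_bytes,
        "client.producer.record.queue.bytes": buffer_total_bytes - buffer_available_bytes,
    }


print(queue_metrics(33554432, 33000000))
```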


>     bufferpool-wait-time
>

Missing, but somewhat implementation specific.
If it was up to me we would add this later if there's a need.



>     batch-size-avg
>     batch-size-max
>

These are missing and would be suitably represented as a histogram. I'll
add them.



>     io-wait-ratio
>     io-ratio
>

There's client.io.wait.time which should cover io-wait-ratio.
We could add a client.io.time as well, now or in a later KIP.
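
For illustration, a ratio can be derived on the receiving side from two
consecutive samples of a wait-time total. This sketch assumes that
client.io.wait.time is reported as a cumulative amount of time and uses
seconds for simplicity; the actual type and unit are whatever the KIP
defines, and the `io_wait_ratio` helper is hypothetical.

```python
def io_wait_ratio(prev_wait_s: float, curr_wait_s: float, interval_s: float) -> float:
    """Fraction of the interval spent waiting for I/O.

    prev_wait_s and curr_wait_s are two consecutive samples of a cumulative
    wait-time counter, taken interval_s seconds apart.
    """
    if interval_s <= 0:
        raise ValueError("interval must be positive")
    delta = curr_wait_s - prev_wait_s
    if delta < 0:
        raise ValueError("cumulative counter went backwards")
    return delta / interval_s


# 45 seconds of waiting during a 60 second push interval.
print(io_wait_ratio(120.0, 165.0, 60.0))
```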

Thanks,
Magnus




>
> Thanks,
>
> Jun
>
> On Mon, Apr 4, 2022 at 10:01 AM Jun Rao <ju...@confluent.io> wrote:
>
> > Hi, Xavier,
> >
> > Thanks for the reply.
> >
> > 28. It does seem that we have started using KafkaMetrics on the broker
> > side. Then, my only concern is on the usage of Histogram in KafkaMetrics.
> > Histogram in KafkaMetrics statically divides the value space into a fixed
> > number of buckets and only returns values on the bucket boundary. So, the
> > returned histogram value may never show up in a recorded value. Yammer
> > Histogram, on the other hand, uses reservoir sampling. The reported value
> > is always one of the recorded values. So, I am not sure that Histogram in
> > KafkaMetrics is as good as Yammer Histogram.
> ClientMetricsPluginExportTime
> > uses Histogram.
> >
> > Thanks,
> >
> > Jun
> >
> > On Thu, Mar 31, 2022 at 5:21 PM Xavier Léauté
> <xa...@confluent.io.invalid>
> > wrote:
> >
> >> >
> >> > 28. On the broker, we typically use Yammer metrics. Only for metrics
> >> that
> >> > depend on Kafka metric features (e.g., quota), we use the Kafka
> metric.
> >> > Yammer metrics have 4 types: gauge, meter, histogram and timer. meter
> >> > calculates a rate, but also exposes an accumulated value.
> >> >
> >>
> >> I don't see a good reason we should limit ourselves to Yammer metrics on
> >> the broker. KafkaMetrics was written
> >> to replace Yammer metrics and is used for all new components (clients,
> >> streams, connect, etc.)
> >> My understanding is that the original goal was to retire Yammer metrics
> in
> >> the broker in favor of KafkaMetrics.
> >> We just haven't done so out of backwards compatibility concerns.
> >> There are other broker metrics such as group coordinator, transaction
> >> state
> >> manager, and various socket server metrics
> >> already using KafkaMetrics that don't need specific Kafka metric
> features,
> >> so I don't see why we should refrain from using
> >> Kafka metrics on the broker unless there are real compatibility concerns
> >> or
> >> where implementation specifics could lead to confusion when comparing
> >> metrics using different implementations.
> >>
> >> In my opinion we should encourage people to use KafkaMetrics going
> forward
> >> on the broker as well, for two reasons:
> >> a) yammer metrics is long deprecated and no longer maintained
> >> b) yammer metrics are much less expressive
> >> c) we don't have a proper API to expose yammer metrics outside of JMX
> >> (MetricsReporter only exposes KafkaMetrics)
> >>
> >
>

Re: [DISCUSS] KIP-714: Client metrics and observability

Posted by Jun Rao <ju...@confluent.io.INVALID>.

Hi, Magnus,

Thanks for the updated KIP. Just a couple more comments.

50. To troubleshoot a particular client issue, I imagine that the client
needs to identify its client_instance_id. How does the client find this
out? Do we plan to include client_instance_id in the client log, expose it
as a metric or something else?

51. The KIP lists a bunch of metrics that need to be collected at the
client side. However, it seems quite a few useful java client metrics like
the following are missing.
    buffer-total-bytes
    buffer-available-bytes
    bufferpool-wait-time
    batch-size-avg
    batch-size-max
    io-wait-ratio
    io-ratio

Thanks,

Jun

On Mon, Apr 4, 2022 at 10:01 AM Jun Rao <ju...@confluent.io> wrote:

> Hi, Xavier,
>
> Thanks for the reply.
>
> 28. It does seem that we have started using KafkaMetrics on the broker
> side. Then, my only concern is on the usage of Histogram in KafkaMetrics.
> Histogram in KafkaMetrics statically divides the value space into a fixed
> number of buckets and only returns values on the bucket boundary. So, the
> returned histogram value may never show up in a recorded value. Yammer
> Histogram, on the other hand, uses reservoir sampling. The reported value
> is always one of the recorded values. So, I am not sure that Histogram in
> KafkaMetrics is as good as Yammer Histogram. ClientMetricsPluginExportTime
> uses Histogram.
>
> Thanks,
>
> Jun
>
> On Thu, Mar 31, 2022 at 5:21 PM Xavier Léauté <xa...@confluent.io.invalid>
> wrote:
>
>> >
>> > 28. On the broker, we typically use Yammer metrics. Only for metrics
>> that
>> > depend on Kafka metric features (e.g., quota), we use the Kafka metric.
>> > Yammer metrics have 4 types: gauge, meter, histogram and timer. meter
>> > calculates a rate, but also exposes an accumulated value.
>> >
>>
>> I don't see a good reason we should limit ourselves to Yammer metrics on
>> the broker. KafkaMetrics was written
>> to replace Yammer metrics and is used for all new components (clients,
>> streams, connect, etc.)
>> My understanding is that the original goal was to retire Yammer metrics in
>> the broker in favor of KafkaMetrics.
>> We just haven't done so out of backwards compatibility concerns.
>> There are other broker metrics such as group coordinator, transaction
>> state
>> manager, and various socket server metrics
>> already using KafkaMetrics that don't need specific Kafka metric features,
>> so I don't see why we should refrain from using
>> Kafka metrics on the broker unless there are real compatibility concerns
>> or
>> where implementation specifics could lead to confusion when comparing
>> metrics using different implementations.
>>
>> In my opinion we should encourage people to use KafkaMetrics going forward
>> on the broker as well, for two reasons:
>> a) yammer metrics is long deprecated and no longer maintained
>> b) yammer metrics are much less expressive
>> c) we don't have a proper API to expose yammer metrics outside of JMX
>> (MetricsReporter only exposes KafkaMetrics)
>>
>

Re: [DISCUSS] KIP-714: Client metrics and observability

Posted by Jun Rao <ju...@confluent.io.INVALID>.

Hi, Xavier,

Thanks for the reply.

28. It does seem that we have started using KafkaMetrics on the broker
side. Then, my only concern is on the usage of Histogram in KafkaMetrics.
Histogram in KafkaMetrics statically divides the value space into a fixed
number of buckets and only returns values on the bucket boundary. So, the
returned histogram value may never show up in a recorded value. Yammer
Histogram, on the other hand, uses reservoir sampling. The reported value
is always one of the recorded values. So, I am not sure that Histogram in
KafkaMetrics is as good as Yammer Histogram. ClientMetricsPluginExportTime
uses Histogram.
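
To illustrate the difference, here is a simplified sketch in Python. It is
not the actual code of KafkaMetrics or Yammer; both classes and their
parameters are made up for the example. The first uses equal-width buckets
and can only answer with a bucket boundary, the second uses uniform
reservoir sampling (Algorithm R) and always answers with a recorded value.

```python
import random


class FixedBucketHistogram:
    """Linear buckets of equal width over [0, max_value)."""

    def __init__(self, num_buckets: int, max_value: float):
        self.width = max_value / num_buckets
        self.counts = [0] * num_buckets

    def record(self, value: float) -> None:
        index = min(int(value / self.width), len(self.counts) - 1)
        self.counts[index] += 1

    def quantile(self, q: float) -> float:
        """Return the upper boundary of the bucket containing quantile q."""
        target = q * sum(self.counts)
        seen = 0
        for index, count in enumerate(self.counts):
            seen += count
            if seen >= target:
                return (index + 1) * self.width
        return len(self.counts) * self.width


class ReservoirHistogram:
    """Uniform reservoir sampling (Algorithm R) over the recorded values."""

    def __init__(self, size: int, seed: int = 0):
        self.size = size
        self.samples = []
        self.seen = 0
        self.rng = random.Random(seed)

    def record(self, value: float) -> None:
        self.seen += 1
        if len(self.samples) < self.size:
            self.samples.append(value)
        else:
            # Keep the new value with probability size / seen.
            slot = self.rng.randrange(self.seen)
            if slot < self.size:
                self.samples[slot] = value

    def quantile(self, q: float) -> float:
        """Return one of the sampled values at rank q."""
        ordered = sorted(self.samples)
        return ordered[min(int(q * len(ordered)), len(ordered) - 1)]


values = [3.7, 12.2, 18.9, 41.3, 77.7]
fixed = FixedBucketHistogram(num_buckets=10, max_value=100.0)
reservoir = ReservoirHistogram(size=100)
for v in values:
    fixed.record(v)
    reservoir.record(v)

print(fixed.quantile(0.5))      # a bucket boundary, never recorded
print(reservoir.quantile(0.5))  # one of the recorded values
```

With these five values the bucketed median is a boundary that was never
recorded, while the reservoir median is one of the inputs.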

Thanks,

Jun

On Thu, Mar 31, 2022 at 5:21 PM Xavier Léauté <xa...@confluent.io.invalid>
wrote:

> >
> > 28. On the broker, we typically use Yammer metrics. Only for metrics that
> > depend on Kafka metric features (e.g., quota), we use the Kafka metric.
> > Yammer metrics have 4 types: gauge, meter, histogram and timer. meter
> > calculates a rate, but also exposes an accumulated value.
> >
>
> I don't see a good reason we should limit ourselves to Yammer metrics on
> the broker. KafkaMetrics was written
> to replace Yammer metrics and is used for all new components (clients,
> streams, connect, etc.)
> My understanding is that the original goal was to retire Yammer metrics in
> the broker in favor of KafkaMetrics.
> We just haven't done so out of backwards compatibility concerns.
> There are other broker metrics such as group coordinator, transaction state
> manager, and various socket server metrics
> already using KafkaMetrics that don't need specific Kafka metric features,
> so I don't see why we should refrain from using
> Kafka metrics on the broker unless there are real compatibility concerns or
> where implementation specifics could lead to confusion when comparing
> metrics using different implementations.
>
> In my opinion we should encourage people to use KafkaMetrics going forward
> on the broker as well, for two reasons:
> a) yammer metrics is long deprecated and no longer maintained
> b) yammer metrics are much less expressive
> c) we don't have a proper API to expose yammer metrics outside of JMX
> (MetricsReporter only exposes KafkaMetrics)
>

Re: [DISCUSS] KIP-714: Client metrics and observability

Posted by Xavier Léauté <xa...@confluent.io.INVALID>.

>
> 28. On the broker, we typically use Yammer metrics. Only for metrics that
> depend on Kafka metric features (e.g., quota), we use the Kafka metric.
> Yammer metrics have 4 types: gauge, meter, histogram and timer. meter
> calculates a rate, but also exposes an accumulated value.
>

I don't see a good reason we should limit ourselves to Yammer metrics on
the broker. KafkaMetrics was written
to replace Yammer metrics and is used for all new components (clients,
streams, connect, etc.)
My understanding is that the original goal was to retire Yammer metrics in
the broker in favor of KafkaMetrics.
We just haven't done so out of backwards compatibility concerns.
There are other broker metrics such as group coordinator, transaction state
manager, and various socket server metrics
already using KafkaMetrics that don't need specific Kafka metric features,
so I don't see why we should refrain from using
Kafka metrics on the broker unless there are real compatibility concerns or
where implementation specifics could lead to confusion when comparing
metrics using different implementations.

In my opinion we should encourage people to use KafkaMetrics going forward
on the broker as well, for three reasons:
a) Yammer metrics has long been deprecated and is no longer maintained
b) Yammer metrics are much less expressive
c) we don't have a proper API to expose Yammer metrics outside of JMX
(MetricsReporter only exposes KafkaMetrics)

Re: [DISCUSS] KIP-714: Client metrics and observability

Posted by Jun Rao <ju...@confluent.io.INVALID>.

Hi, Kirk, Sarat,

Thanks for the reply.

28. On the broker, we typically use Yammer metrics. Only for metrics that
depend on Kafka metric features (e.g., quota), we use the Kafka metric.
Yammer metrics have 4 types: gauge, meter, histogram and timer. meter
calculates a rate, but also exposes an accumulated value.
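
As a rough illustration of what "a rate plus an accumulated value" means,
here is a minimal sketch in Python. It is not the Yammer implementation (a
real meter also keeps moving-average rates); the `Meter` class below is
hypothetical and only tracks a count and the mean rate since creation.

```python
class Meter:
    """Tracks an accumulated count and the mean rate since creation."""

    def __init__(self, now_s: float):
        self.start_s = now_s
        self.count = 0

    def mark(self, n: int = 1) -> None:
        """Record n events."""
        self.count += n

    def mean_rate(self, now_s: float) -> float:
        """Events per second since the meter was created."""
        elapsed = now_s - self.start_s
        return self.count / elapsed if elapsed > 0 else 0.0


meter = Meter(now_s=0.0)
meter.mark(30)
meter.mark(30)
print(meter.count, meter.mean_rate(now_s=20.0))
```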

29. The Histogram class in org.apache.kafka.common.metrics.stats was never
used in the client metrics. The implementation of Histogram only provides a
fixed number of values in the domain and may not capture the quantiles very
accurately. So, we punted on using it.

Thanks,

Jun



On Thu, Mar 10, 2022 at 10:59 AM Sarat Kakarla
<sk...@confluent.io.invalid> wrote:

> Jun,
>
>   >>  28. For the broker metrics, could you spell out the full metric name
>   >>   including groups, tags, etc? We typically don't add the broker_id
> label for
>   >>   broker metrics. Also, brokers use Yammer metrics, which doesn't
> have type
>   >>   Sum.
>
> Sure, I will update KIP-714 with the above information and remove
> the broker-id label from the metrics.
>
> Regarding the type, is CumulativeSum the right type to use in place of
> Sum?
>
> Thanks
> Sarat
>
>
> On 3/8/22, 5:48 PM, "Jun Rao" <ju...@confluent.io.INVALID> wrote:
>
>     Hi, Magnus, Sarat and Xavier,
>
>     Thanks for the reply. A few more comments below.
>
>     20. It seems that we are piggybacking the plugin on the
>     existing MetricsReporter. So, this seems fine.
>
>     21. That could work. Are we requiring any additional jar dependency on
> the
>     client? Or, are you suggesting that we check the runtime dependency to
> pick
>     the compression codec?
>
>     28. For the broker metrics, could you spell out the full metric name
>     including groups, tags, etc? We typically don't add the broker_id
> label for
>     broker metrics. Also, brokers use Yammer metrics, which doesn't have
> type
>     Sum.
>
>     29. There are several client metrics listed as histogram. However, the
> java
>     client currently doesn't support histogram type.
>
>     30. Could you show an example of the metric payload in
> PushTelemetryRequest
>     to help understand how we organize metrics at different levels (per
>     instance, per topic, per partition, per broker, etc)?
>
>     31. Could you add a bit more detail on which client thread sends the
>     PushTelemetryRequest?
>
>     Thanks,
>
>     Jun
>
>     On Mon, Mar 7, 2022 at 11:48 AM Magnus Edenhill <ma...@edenhill.se>
> wrote:
>
>     > Hi Jun,
>     >
>     > thanks for your well-informed questions, see my answers below.
>     > There have been a number of clarifications to the KIP.
>     >
>     >
>     >
>     > On Thu, Jan 27, 2022 at 20:08, Jun Rao
> <ju...@confluent.io.invalid> wrote:
>     >
>     > > Hi, Magnus,
>     > >
>     > > Thanks for updating the KIP. The overall approach makes sense to
> me. A
>     > few
>     > > more detailed comments below.
>     > >
>     > > 20. ClientTelemetry: Should it be extending configurable and
> closable?
>     > >
>     >
>     > I'll pass this question to Sarat and/or Xavier.
>     >
>     >
>     >
>     > > 21. Compression of the metrics on the client: what's the default?
>     > >
>     >
>     > How about we specify a prioritized list: zstd, lz4, snappy, gzip?
>     > But ultimately it is up to what the client supports.
>     >
>     >
>     > 23. A client instance is considered a metric resource and the
>     > > resource-level (thus client instance level) labels could include:
>     > >     client_software_name=confluent-kafka-python
>     > >     client_software_version=v2.1.3
>     > >     client_instance_id=B64CD139-3975-440A-91D4
>     > >     transactional_id=someTxnApp
>     > > Are those labels added in PushTelemetryRequest? If so, are they per
>     > metric
>     > > or per request?
>     > >
>     >
>     >
>     > client_software* and client_instance_id are not added by the client,
> but
>     > available to
>     > the broker-side metrics plugin for adding as it sees fit, remove
> them from
>     > the KIP.
>     >
>     > transactional_id, group_id, etc., which I believe will be useful in
>     > troubleshooting, are included only once (per push) as resource-level
>     > attributes (the client instance is a singular resource).
>     >
>     >
>     > >
>     > > 24.  "the broker will only send
>     > > GetTelemetrySubscriptionsResponse.DeltaTemporality=True" :
>     > > 24.1 If it's always true, does it need to be part of the protocol?
>     > >
>     >
>     > We're anticipating that it will take a lot longer to upgrade the
> majority
>     > of clients than the
>     > broker/plugin side, which is why we want the client to support both
>     > temporalities out-of-the-box
>     > so that cumulative reporting can be turned on seamlessly in the
> future.
>     >
>     >
>     >
>     > > 24.2 Does delta only apply to Counter type?
>     > >
>     >
>     >
>     > And Histograms. More details in Xavier's OTLP link.
>     >
>     >
>     >
>     > > 24.3 In the delta representation, the first request needs to send
> the
>     > full
>     > > value, how does the broker plugin know whether a value is full or
> delta?
>     > >
>     >
>     > The client may (should) send the start time for each metric sample,
>     > indicating when
>     > the metric began to be collected.
>     > We've discussed whether this should be the client instance start
> time or
>     > the time when a matching
>     > metric subscription for that metric is received.
>     > For completeness we recommend using the former, the client instance
> start
>     > time.
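One way a receiving plugin could use the start time to tell a fresh series from a continuing one, sketched under the assumption that each delta sample carries the client instance start time as recommended above. The class and method names are hypothetical and do not come from the KIP or OTLP.

```python
class DeltaAccumulator:
    """Rebuilds running totals from delta samples, keyed by metric name."""

    def __init__(self):
        # metric name -> (start_time, running_total)
        self._state = {}

    def add(self, name, start_time, delta):
        prev = self._state.get(name)
        if prev is None or prev[0] != start_time:
            # Unknown series or a changed start time (e.g. client restart):
            # treat this sample as the first value of a new series.
            total = delta
        else:
            total = prev[1] + delta
        self._state[name] = (start_time, total)
        return total


acc = DeltaAccumulator()
acc.add("client.request.success", start_time=100, delta=5)
print(acc.add("client.request.success", start_time=100, delta=3))
```

A changed start time resets the running total, so a restarted client does not get its new deltas added to the old instance's count.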
>     >
>     >
>     >
>     > > 25. quota:
>     > > 25.1 Since we are fitting PushTelemetryRequest into the existing
> request
>     > > quota, it would be useful to document the impact, i.e. client
> metric
>     > > throttling causes the data from the same client to be delayed.
>     > > 25.2 Is PushTelemetryRequest subject to the write bandwidth quota
> like
>     > the
>     > > producer?
>     > >
>     >
>     >
>     > Yes, it should be, to protect the cluster from rogue clients.
>     > But, in practice the size of metrics will be quite low (e.g., 1-10kb
> per
>     > 60s interval), so I don't think this will pose a problem.
>     > The KIP has been updated with more details on quota/throttling
> behaviour,
>     > see the
>     > "Throttling and rate-limiting" section.
>     >
>     >
>     > 25.3 THROTTLING_QUOTA_EXCEEDED: Currently, we don't send this error
> when
>     > > the request/bandwidth quota is exceeded since those requests are
> not
>     > > rejected. We only set this error when the request is rejected
> (e.g.,
>     > topic
>     > > creation). It would be useful to clarify when this error is used.
>     > >
>     >
>     > Right, I was trying to reuse an existing error code. We can introduce
>     > a new one for the case where a client pushes metrics at a higher frequency
>     > than the configured push interval (e.g., out-of-profile sends).
>     > This causes the broker to drop those metrics and send this error code back
>     > to the client. There will be no connection throttling / channel-muting in
>     > this case (unless the standard quotas are exceeded).
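The out-of-profile check described above could look roughly like this. It is a sketch only: the class name, the per-client-instance bookkeeping, and the boolean return value (standing in for the new error code) are all assumptions.

```python
class PushRateLimiter:
    """Rejects pushes that arrive sooner than the configured push interval."""

    def __init__(self, push_interval_s):
        self.push_interval_s = push_interval_s
        # client instance id -> time of the last accepted push, in seconds
        self._last_accepted = {}

    def accept(self, client_instance_id, now_s):
        last = self._last_accepted.get(client_instance_id)
        if last is not None and now_s - last < self.push_interval_s:
            # Out-of-profile send: drop the metrics; the caller would reply
            # with the error code. No throttling or channel muting here.
            return False
        self._last_accepted[client_instance_id] = now_s
        return True


limiter = PushRateLimiter(push_interval_s=60)
print(limiter.accept("b69cc35a", now_s=0))
print(limiter.accept("b69cc35a", now_s=30))
```

Only accepted pushes advance the timestamp, so a client that keeps pushing too early is judged against its last in-profile push.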
>     >
>     >
>     > > 27. kafka-client-metrics.sh: Could we add an example on how to
> disable a
>     > > bad client?
>     > >
>     >
>     > There's now a --block option to kafka-client-metrics.sh which
> overrides all
>     > subscriptions
>     > for the matched client(s). This allows silencing metrics for one or
> more
>     > clients without having
>     > to remove existing subscriptions. From the client's perspective it
> will
>     > look like it no longer has
>     > any subscriptions.
>     >
>     > # Block metrics collection for a specific client instance.
>     > # A descriptive --name makes it easier to clean up old subscriptions;
>     > # --match selects this specific client instance.
>     > $ kafka-client-metrics.sh --bootstrap-server $BROKERS \
>     >    --add \
>     >    --name 'Disable_b69cc35a' \
>     >    --match client_instance_id=b69cc35a-7a54-4790-aa69-cc2bd4ee4538 \
>     >    --block
>     >
>     >
>     >
>     >
>     > > 28. New broker side metrics: Could we spell out the details of the
>     > metrics
>     > > (e.g., group, tags, etc)?
>     > >
>     >
>     > KIP has been updated accordingly (thanks Sarat).
>     >
>     >
>     >
>     > >
>     > > 29. Client instance-level metrics: client.io.wait.time is a gauge
> not a
>     > > histogram.
>     > >
>     >
>     > I believe a population/distribution should preferably be represented
> as a
>     > histogram, space permitting,
>     > and only secondarily as a Gauge average.
>     > While we might not want to maintain a bunch of histograms for each
>     > partition, since that could be
>     > quite space consuming, this client.io.wait.time is a single metric
> per
>     > client instance and can
>     > thus afford a Histogram representation.
>     >
>     >
>     >
>     > Thanks,
>     > Magnus
>     >
>     >
>     >
>     > > Thanks,
>     > >
>     > > Jun
>     > >
>     > > On Wed, Jan 26, 2022 at 6:32 AM Magnus Edenhill <
> magnus@edenhill.se>
>     > > wrote:
>     > >
>     > > > Hi all,
>     > > >
>     > > > I've updated the KIP with responses to the latest comments: Java
> client
>     > > > dependencies (Thanks Kirk!), alternate designs (separate cluster,
>     > > separate
>     > > > producer, etc), etc.
>     > > >
>     > > > I will revive the vote thread.
>     > > >
>     > > > Thanks,
>     > > > Magnus
>     > > >
>     > > >
>     > > > Den mån 13 dec. 2021 kl 22:32 skrev Ryanne Dolan <
>     > ryannedolan@gmail.com
>     > > >:
>     > > >
>     > > > > I think we should be very careful about introducing new runtime
>     > > > > dependencies into the clients. Historically this has been rare
> and
>     > > > > essentially necessary (e.g. compression libs).
>     > > > >
>     > > > > Ryanne
>     > > > >
>     > > > > On Mon, Dec 13, 2021, 1:06 PM Kirk True <kirk@mustardgrain.com
> >
>     > wrote:
>     > > > >
>     > > > > > Hi Jun,
>     > > > > >
>     > > > > > On Thu, Dec 9, 2021, at 2:28 PM, Jun Rao wrote:
>     > > > > > > 13. Using OpenTelemetry. Does that require runtime
> dependency
>     > > > > > > on OpenTelemetry library? How good is the compatibility
> story
>     > > > > > > of OpenTelemetry? This is important since an application
> could
>     > have
>     > > > > other
>     > > > > > > OpenTelemetry dependencies than the Kafka client.
>     > > > > >
>     > > > > > The current design is that the OpenTelemetry JARs would ship
> with
>     > the
>     > > > > > client. Perhaps we can design the client such that the JARs
> aren't
>     > > even
>     > > > > > loaded if the user has opted out. The user could even
> exclude the
>     > > JARs
>     > > > > from
>     > > > > > their dependencies if they so wished.
>     > > > > >
>     > > > > > I can't speak to the compatibility of the libraries. Is it
> possible
>     > > > that
>     > > > > > we include a shaded version?
>     > > > > >
>     > > > > > Thanks,
>     > > > > > Kirk
>     > > > > >
>     > > > > > >
>     > > > > > > 14. The proposal listed idempotence=true. This is more of a
>     > > > > configuration
>     > > > > > > than a metric. Are we including that as a metric? What
> other
>     > > > > > configurations
>     > > > > > > are we including? Should we separate the configurations
> from the
>     > > > > metrics?
>     > > > > > >
>     > > > > > > Thanks,
>     > > > > > >
>     > > > > > > Jun
>     > > > > > >
>     > > > > > > On Mon, Nov 29, 2021 at 7:34 AM Magnus Edenhill <
>     > > magnus@edenhill.se>
>     > > > > > wrote:
>     > > > > > >
>     > > > > > > > Hey Bob,
>     > > > > > > >
>     > > > > > > > That's a good point.
>     > > > > > > >
>     > > > > > > > Request type labels were considered, but since they're already
>     > > > > > > > tracked by broker-side metrics they were left out to avoid
>     > > > > > > > metric duplication. However, those metrics are not per
>     > > > > > > > connection, so they won't be that useful in practice for
>     > > > > > > > troubleshooting specific client instances.
>     > > > > > > >
>     > > > > > > > I'll add the request_type label to the relevant metrics.
>     > > > > > > >
>     > > > > > > > Thanks,
>     > > > > > > > Magnus
>     > > > > > > >
>     > > > > > > >
>     > > > > > > > Den tis 23 nov. 2021 kl 19:20 skrev Bob Barrett
>     > > > > > > > <bo...@confluent.io.invalid>:
>     > > > > > > >
>     > > > > > > > > Hi Magnus,
>     > > > > > > > >
>     > > > > > > > > Thanks for the thorough KIP, this seems very useful.
>     > > > > > > > >
>     > > > > > > > > Would it make sense to include the request type as a
> label
>     > for
>     > > > the
>     > > > > > > > > `client.request.success`, `client.request.errors` and
>     > > > > > > > `client.request.rtt`
>     > > > > > > > > metrics? I think it would be very useful to see which
>     > specific
>     > > > > > requests
>     > > > > > > > are
>     > > > > > > > > succeeding and failing for a client. One specific case
> I can
>     > > > think
>     > > > > of
>     > > > > > > > where
>     > > > > > > > > this could be useful is producer batch timeouts. If a
> Java
>     > > > > > application
>     > > > > > > > does
>     > > > > > > > > not enable producer client logs (unfortunately, in my
>     > > experience
>     > > > > this
>     > > > > > > > > happens more often than it should), the application
> logs will
>     > > > only
>     > > > > > > > contain
>     > > > > > > > > the expiration error message, but no information about
> what
>     > is
>     > > > > > causing
>     > > > > > > > the
>     > > > > > > > > timeout. The requests might all be succeeding but
> taking too
>     > > long
>     > > > > to
>     > > > > > > > > process batches, or metadata requests might be
> failing, or
>     > some
>     > > > or
>     > > > > > all
>     > > > > > > > > produce requests might be failing (if the bootstrap
> servers
>     > are
>     > > > > > reachable
>     > > > > > > > > from the client but one or more other brokers are not,
> for
>     > > > > example).
>     > > > > > If
>     > > > > > > > the
>     > > > > > > > > cluster operator is able to identify the specific
> requests
>     > that
>     > > > are
>     > > > > > slow
>     > > > > > > > or
>     > > > > > > > > failing for a client, they will be better able to
> diagnose
>     > the
>     > > > > issue
>     > > > > > > > > causing batch timeouts.
>     > > > > > > > >
>     > > > > > > > > One drawback I can think of is that this will increase
> the
>     > > > > > cardinality of
>     > > > > > > > > the request metrics. But any given client is only
> going to
>     > use
>     > > a
>     > > > > > small
>     > > > > > > > > subset of the request types, and since we already have
>     > > partition
>     > > > > > labels
>     > > > > > > > for
>     > > > > > > > > the topic-level metrics, I think request labels will
> still
>     > make
>     > > > up
>     > > > > a
>     > > > > > > > > relatively small percentage of the set of metrics.
>     > > > > > > > >
>     > > > > > > > > Thanks,
>     > > > > > > > > Bob
>     > > > > > > > >
>     > > > > > > > > On Mon, Nov 22, 2021 at 2:08 AM Viktor Somogyi-Vass <
>     > > > > > > > > viktorsomogyi@gmail.com>
>     > > > > > > > > wrote:
>     > > > > > > > >
>     > > > > > > > > > Hi Magnus,
>     > > > > > > > > >
>     > > > > > > > > > I think this is a very useful addition. We also have
> a
>     > > similar
>     > > > > (but
>     > > > > > > > much
>     > > > > > > > > > more simplistic) implementation of this. Maybe I
> missed it
>     > in
>     > > > the
>     > > > > > KIP
>     > > > > > > > but
>     > > > > > > > > > what about adding metrics about the subscription
> cache
>     > > itself?
>     > > > > > That I
>     > > > > > > > > think
>     > > > > > > > > > would improve its usability and debuggability as
> we'd be
>     > able
>     > > > to
>     > > > > > see
>     > > > > > > > its
>     > > > > > > > > > performance, hit/miss rates, eviction counts and
> others.
>     > > > > > > > > >
>     > > > > > > > > > Best,
>     > > > > > > > > > Viktor
>     > > > > > > > > >
>     > > > > > > > > > On Thu, Nov 18, 2021 at 5:12 PM Magnus Edenhill <
>     > > > > > magnus@edenhill.se>
>     > > > > > > > > > wrote:
>     > > > > > > > > >
>     > > > > > > > > > > Hi Mickael,
>     > > > > > > > > > >
>     > > > > > > > > > > see inline.
>     > > > > > > > > > >
>     > > > > > > > > > > Den ons 10 nov. 2021 kl 15:21 skrev Mickael Maison
> <
>     > > > > > > > > > > mickael.maison@gmail.com
>     > > > > > > > > > > >:
>     > > > > > > > > > >
>     > > > > > > > > > > > Hi Magnus,
>     > > > > > > > > > > >
>     > > > > > > > > > > > I see you've addressed some of the points I
> raised
>     > above
>     > > > but
>     > > > > > some
>     > > > > > > > (4,
>     > > > > > > > > > > > 5) have not been addressed yet.
>     > > > > > > > > > > >
>     > > > > > > > > > >
>     > > > > > > > > > > Re 4) How will the user/app know metrics are being sent?
>     > > > > > > > > > > One possibility is to add a JMX metric (thus for
> user
>     > > > > > consumption)
>     > > > > > > > for
>     > > > > > > > > > the
>     > > > > > > > > > > number of metric pushes the
>     > > > > > > > > > > client has performed, or perhaps the number of
> metrics
>     > > > > > subscriptions
>     > > > > > > > > > > currently being collected.
>     > > > > > > > > > > Would that be sufficient?
>     > > > > > > > > > >
>     > > > > > > > > > > Re 5) Metric sizes and rates
>     > > > > > > > > > >
>     > > > > > > > > > > A worst case scenario for a producer that is
> producing to
>     > > 50
>     > > > > > unique
>     > > > > > > > > > topics
>     > > > > > > > > > > and emitting all standard metrics yields
>     > > > > > > > > > > a serialized size of around 100KB prior to
> compression,
>     > > which
>     > > > > > > > > compresses
>     > > > > > > > > > > down to about 20-30% of that depending
>     > > > > > > > > > > on compression type and topic name uniqueness.
>     > > > > > > > > > > The numbers for a consumer would be similar.
>     > > > > > > > > > >
>     > > > > > > > > > > In practice the number of unique topics would be
> far
>     > less,
>     > > > and
>     > > > > > the
>     > > > > > > > > > > subscription set would typically be for a subset of
>     > > metrics.
>     > > > > > > > > > > So we're probably closer to 1kb, or less,
> compressed size
>     > > per
>     > > > > > client
>     > > > > > > > > per
>     > > > > > > > > > > push interval.
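The size estimates above can be turned into a rough capacity calculation. This is a sketch only: the 60-second push interval and the count of 600 clients are illustrative assumptions, and the helper names are hypothetical.

```python
def compressed_size_bytes(uncompressed_bytes, ratio_percent):
    """Size after compression, with the ratio given as a whole percentage."""
    return uncompressed_bytes * ratio_percent // 100


def cluster_ingest_bytes_per_s(bytes_per_push, push_interval_s, num_clients):
    """Aggregate telemetry ingest rate across all clients."""
    return bytes_per_push * num_clients / push_interval_s


# Worst case from the thread: ~100KB uncompressed, compressing to 20-30%.
worst_low = compressed_size_bytes(100_000, 20)
worst_high = compressed_size_bytes(100_000, 30)
print(worst_low, worst_high)

# Typical case from the thread: ~1KB compressed per client per push.
# The interval and client count below are assumed for illustration.
print(cluster_ingest_bytes_per_s(1_000, push_interval_s=60, num_clients=600))
```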
>     > > > > > > > > > >
>     > > > > > > > > > > As both the subscription set and push intervals are
>     > > > controlled
>     > > > > > by the
>     > > > > > > > > > > cluster operator it shouldn't be too hard
>     > > > > > > > > > > to strike a good balance between metrics overhead
> and
>     > > > > > granularity.
>     > > > > > > > > > >
>     > > > > > > > > > >
>     > > > > > > > > > >
>     > > > > > > > > > > >
>     > > > > > > > > > > > I'm really uneasy with this being enabled by
> default on
>     > > the
>     > > > > > client
>     > > > > > > > > > > > side. When collecting data, I think the best
> practice
>     > is
>     > > to
>     > > > > > ensure
>     > > > > > > > > > > > users are explicitly enabling it.
>     > > > > > > > > > > >
>     > > > > > > > > > >
>     > > > > > > > > > > Requiring metrics to be explicitly enabled on
> clients
>     > > > severely
>     > > > > > > > cripples
>     > > > > > > > > > its
>     > > > > > > > > > > usability and value.
>     > > > > > > > > > >
>     > > > > > > > > > > One of the problems that this KIP aims to solve is
> for
>     > > useful
>     > > > > > metrics
>     > > > > > > > > to
>     > > > > > > > > > be
>     > > > > > > > > > > available on demand
>     > > > > > > > > > > regardless of the technical expertise of the user. As
>     > > > > > > > > > > Ryanne points out, a savvy user/organization will
>     > > > > > > > > > > typically have metrics collection and monitoring in
>     > > > > > > > > > > place already, and the benefits of this KIP are then
>     > > > > > > > > > > more of a common set and format of metrics across client
>     > > > > > > > > > > implementations and languages.
>     > > > > > > > > > > But that is not the typical Kafka user in my
> experience,
>     > > > > they're
>     > > > > > not
>     > > > > > > > > > Kafka
>     > > > > > > > > > > experts and they don't have the
>     > > > > > > > > > > knowledge of how to best instrument their clients.
>     > > > > > > > > > > Having metrics enabled by default for this user
> base
>     > allows
>     > > > the
>     > > > > > Kafka
>     > > > > > > > > > > operators to proactively and reactively
>     > > > > > > > > > > monitor and troubleshoot client issues, without
> the need
>     > > for
>     > > > > the
>     > > > > > less
>     > > > > > > > > > savvy
>     > > > > > > > > > > user to do anything.
>     > > > > > > > > > > It is often too late to tell a user to enable
> metrics
>     > when
>     > > > the
>     > > > > > > > problem
>     > > > > > > > > > has
>     > > > > > > > > > > already occurred.
>     > > > > > > > > > >
>     > > > > > > > > > > Now, to be clear, even though metrics are enabled by
>     > > > > > > > > > > default on clients, they are not enabled by default
>     > > > > > > > > > > on the brokers; the Kafka operator needs to build
> and set
>     > > up
>     > > > a
>     > > > > > > > metrics
>     > > > > > > > > > > plugin and add metrics subscriptions
>     > > > > > > > > > > before anything is sent from the client.
>     > > > > > > > > > > It is opt-out on the clients and opt-in on the
> broker.
>     > > > > > > > > > >
>     > > > > > > > > > >
>     > > > > > > > > > >
>     > > > > > > > > > >
>     > > > > > > > > > > > You mentioned brokers already have
>     > > > > > > > > > > > some(most?) of the information contained in
> metrics, if
>     > > so
>     > > > > > then why
>     > > > > > > > > > > > are we collecting it again? Surely there must be
> some
>     > new
>     > > > > > > > information
>     > > > > > > > > > > > in the client metrics.
>     > > > > > > > > > > >
>     > > > > > > > > > >
>     > > > > > > > > > > From the user's perspective the Kafka
> infrastructure
>     > > extends
>     > > > > from
>     > > > > > > > > > > producer.send() to
>     > > > > > > > > > > messages being returned from consumer.poll(), a
> giant
>     > black
>     > > > box
>     > > > > > where
>     > > > > > > > > > > there's a lot going on between those
>     > > > > > > > > > > two points. The brokers currently only see what happens
>     > > > > > > > > > > once those requests and messages hit the broker,
>     > > > > > > > > > > but as Kafka clients are complex pieces of
> machinery
>     > > there's
>     > > > a
>     > > > > > myriad
>     > > > > > > > > of
>     > > > > > > > > > > queues, timers, and state
>     > > > > > > > > > > that's critical to the operation and infrastructure
>     > that's
>     > > > not
>     > > > > > > > > currently
>     > > > > > > > > > > visible to the operator.
>     > > > > > > > > > > Relying on the user to accurately and timely
> provide this
>     > > > > missing
>     > > > > > > > > > > information is not generally feasible.
>     > > > > > > > > > >
>     > > > > > > > > > >
>     > > > > > > > > > > Most of the standard metrics listed in the KIP are
> data
>     > > > points
>     > > > > > that
>     > > > > > > > the
>     > > > > > > > > > > broker does not have.
>     > > > > > > > > > > Only a small number of metrics are duplicates
> (like the
>     > > > request
>     > > > > > > > counts
>     > > > > > > > > > and
>     > > > > > > > > > > sizes), but they are included
>     > > > > > > > > > > to ease correlation when inspecting these client
> metrics.
>     > > > > > > > > > >
>     > > > > > > > > > >
>     > > > > > > > > > >
>     > > > > > > > > > > > Moreover this is a brand new feature so it's even
>     > harder
>     > > to
>     > > > > > justify
>     > > > > > > > > > > > enabling it and forcing onto all our users. If
> disabled
>     > > by
>     > > > > > default,
>     > > > > > > > > > > > it's relatively easy to enable in a new release
> if we
>     > > > decide
>     > > > > > to,
>     > > > > > > > but
>     > > > > > > > > > > > once enabled by default it's much harder to
> disable.
>     > Also
>     > > > > this
>     > > > > > > > > feature
>     > > > > > > > > > > > will apply to all future metrics we will add.
>     > > > > > > > > > > >
>     > > > > > > > > > >
>     > > > > > > > > > > I think maturity of a feature implementation
> should be
>     > the
>     > > > > > deciding
>     > > > > > > > > > factor,
>     > > > > > > > > > > rather than
>     > > > > > > > > > > the design of it (which this KIP is). I.e., if the
>     > > > > > implementation is
>     > > > > > > > > not
>     > > > > > > > > > > deemed mature enough
>     > > > > > > > > > > for release X.Y it will be disabled.
>     > > > > > > > > > >
>     > > > > > > > > > >
>     > > > > > > > > > >
>     > > > > > > > > > > > Overall I think it's an interesting feature but
> I'd
>     > > prefer
>     > > > to
>     > > > > > be
>     > > > > > > > > > > > slightly defensive and see how it works in
> practice
>     > > before
>     > > > > > enabling
>     > > > > > > > > it
>     > > > > > > > > > > > everywhere.
>     > > > > > > > > > > >
>     > > > > > > > > > >
>     > > > > > > > > > > Right, and I agree on being defensive, but since
> this
>     > > feature
>     > > > > > still
>     > > > > > > > > > > requires manual
>     > > > > > > > > > > enabling on the brokers before actually being
> used, I
>     > think
>     > > > > that
>     > > > > > > > gives
>     > > > > > > > > > > enough control
>     > > > > > > > > > > to opt-in or out of this feature as needed.
>     > > > > > > > > > >
>     > > > > > > > > > > Thanks for your comments!
>     > > > > > > > > > >
>     > > > > > > > > > > Regards,
>     > > > > > > > > > > Magnus
>     > > > > > > > > > >
>     > > > > > > > > > >
>     > > > > > > > > > >
>     > > > > > > > > > > > Thanks,
>     > > > > > > > > > > > Mickael
>     > > > > > > > > > > >
>     > > > > > > > > > > > On Mon, Nov 8, 2021 at 7:55 AM Magnus Edenhill <
>     > > > > > magnus@edenhill.se
>     > > > > > > > >
>     > > > > > > > > > > wrote:
>     > > > > > > > > > > > >
>     > > > > > > > > > > > > Thanks David for pointing this out,
>     > > > > > > > > > > > > I've updated the KIP to include client_id as a
>     > matching
>     > > > > > selector.
>     > > > > > > > > > > > >
>     > > > > > > > > > > > > Regards,
>     > > > > > > > > > > > > Magnus
>     > > > > > > > > > > > >
>     > > > > > > > > > > > > Den tors 4 nov. 2021 kl 18:01 skrev David Mao
>     > > > > > > > > > > <dmao@confluent.io.invalid
>     > > > > > > > > > > > >:
>     > > > > > > > > > > > >
>     > > > > > > > > > > > > > Hey Magnus,
>     > > > > > > > > > > > > >
>     > > > > > > > > > > > > > I noticed that the KIP outlines the initial
>     > selectors
>     > > > > > supported
>     > > > > > > > > as:
>     > > > > > > > > > > > > >
>     > > > > > > > > > > > > >    - client_instance_id - CLIENT_INSTANCE_ID
> UUID
>     > > > string
>     > > > > > > > > > > > representation.
>     > > > > > > > > > > > > >    - client_software_name  - client software
>     > > > > implementation
>     > > > > > > > name.
>     > > > > > > > > > > > > >    - client_software_version  - client
> software
>     > > > > > implementation
>     > > > > > > > > > > version.
>     > > > > > > > > > > > > >
>     > > > > > > > > > > > > > In the given reactive monitoring workflow, we
>     > mention
>     > > > > that
>     > > > > > the
>     > > > > > > > > > > > application
>     > > > > > > > > > > > > > user does not know their client's client
> instance
>     > ID,
>     > > > but
>     > > > > > it's
>     > > > > > > > > > > outlined
>     > > > > > > > > > > > > > that the operator can add a metrics
> subscription
>     > > > > selecting
>     > > > > > for
>     > > > > > > > > > > > clientId. I
>     > > > > > > > > > > > > > don't see clientId as one of the supported
>     > selectors.
>     > > > > > > > > > > > > > I can see how this would have made sense in a
>     > > previous
>     > > > > > > > iteration
>     > > > > > > > > > > given
>     > > > > > > > > > > > that
>     > > > > > > > > > > > > > the previous client instance ID proposal was
> to
>     > > > construct
>     > > > > > the
>     > > > > > > > > > client
>     > > > > > > > > > > > > > instance ID using clientId as a prefix. Now
> that
>     > the
>     > > > > client
>     > > > > > > > > > instance
>     > > > > > > > > > > > ID is
>     > > > > > > > > > > > > > a UUID, would we want to add clientId as a
>     > supported
>     > > > > > selector?
>     > > > > > > > > > > > > > Let me know what you think.
>     > > > > > > > > > > > > >
>     > > > > > > > > > > > > > David
>     > > > > > > > > > > > > >
>     > > > > > > > > > > > > > On Tue, Oct 19, 2021 at 12:33 PM Magnus
> Edenhill <
>     > > > > > > > > > magnus@edenhill.se
>     > > > > > > > > > > >
>     > > > > > > > > > > > > > wrote:
>     > > > > > > > > > > > > >
>     > > > > > > > > > > > > > > Hi Mickael!
>     > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > > Den tis 19 okt. 2021 kl 19:30 skrev Mickael
>     > Maison
>     > > <
>     > > > > > > > > > > > > > > mickael.maison@gmail.com
>     > > > > > > > > > > > > > > >:
>     > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > > > Hi Magnus,
>     > > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > > > Thanks for the proposal.
>     > > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > > > 1. Looking at the protocol section, isn't
>     > > > > > > > "ClientInstanceId"
>     > > > > > > > > > > > expected
>     > > > > > > > > > > > > > > > to be a field in
>     > > > GetTelemetrySubscriptionsResponseV0?
>     > > > > > > > > > Otherwise,
>     > > > > > > > > > > > how
>     > > > > > > > > > > > > > > > does a client retrieve this value?
>     > > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > > Good catch, it got removed by mistake in
> one of
>     > the
>     > > > > > edits.
>     > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > > > 2. In the client API section, you
> mention a new
>     > > > > method
>     > > > > > > > > > > > > > > > "clientInstanceId()". Can you clarify
> which
>     > > > > interfaces
>     > > > > > are
>     > > > > > > > > > > > affected?
>     > > > > > > > > > > > > > > > Is it only Consumer and Producer?
>     > > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > > And Admin. Will update the KIP.
>     > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > > > 3. I'm a bit concerned this is enabled by
>     > > default.
>     > > > > > Even if
>     > > > > > > > > the
>     > > > > > > > > > > data
>     > > > > > > > > > > > > > > > collected is supposed to be not
> sensitive, I
>     > > think
>     > > > > > this can
>     > > > > > > > > be
>     > > > > > > > > > > > > > > > problematic in some environments. Also
> users
>     > > don't
>     > > > > > seem to
>     > > > > > > > > have
>     > > > > > > > > > > the
>     > > > > > > > > > > > > > > > choice to only expose some metrics.
> Knowing how
>     > > > much
>     > > > > > data
>     > > > > > > > > > transit
>     > > > > > > > > > > > > > > > through some applications can be
> considered
>     > > > critical.
>     > > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > > The broker already knows how much data
> transits
>     > > > through
>     > > > > > the
>     > > > > > > > > > client
>     > > > > > > > > > > > > > though,
>     > > > > > > > > > > > > > > right?
>     > > > > > > > > > > > > > > Care has been taken not to expose
> information in
>     > > the
>     > > > > > standard
>     > > > > > > > > > > metrics
>     > > > > > > > > > > > > > that
>     > > > > > > > > > > > > > > might
>     > > > > > > > > > > > > > > reveal sensitive information.
>     > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > > Do you have an example of how the proposed
>     > metrics
>     > > > > could
>     > > > > > leak
>     > > > > > > > > > > > sensitive
>     > > > > > > > > > > > > > > information?
>     > > > > > > > > > > > > > > As for limiting what metrics to export: I guess
>     > > > > > > > > > > > > > > that could make sense in some very sensitive
>     > > > > > > > > > > > > > > use-cases, but those users might disable metrics
>     > > > > > > > > > > > > > > altogether for now.
>     > > > > > > > > > > > > > > Could these concerns be addressed by a
> later KIP?
>     > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > > > 4. As a user, how do you know if your
>     > application
>     > > > is
>     > > > > > > > actively
>     > > > > > > > > > > > sending
>     > > > > > > > > > > > > > > > metrics? Are there new metrics exposing
> what's
>     > > > going
>     > > > > > on,
>     > > > > > > > like
>     > > > > > > > > > how
>     > > > > > > > > > > > much
>     > > > > > > > > > > > > > > > data is being sent?
>     > > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > > That's a good question.
>     > > > > > > > > > > > > > > Since the proposed metrics interface is not aimed
>     > > > > > > > > > > > > > > at, or directly available to, the application, I
>     > > > > > > > > > > > > > > guess there's little point in adding it here;
>     > > > > > > > > > > > > > > perhaps add something to the existing JMX metrics
>     > > > > > > > > > > > > > > instead?
>     > > > > > > > > > > > > > > Do you have any suggestions?
>     > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > > > 5. If all metrics are enabled on a
> regular
>     > > Consumer
>     > > > > or
>     > > > > > > > > > Producer,
>     > > > > > > > > > > do
>     > > > > > > > > > > > > > > > you have an idea how much throughput
> this would
>     > > > use?
>     > > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > > It depends on the number of
> partition/topics/etc
>     > > the
>     > > > > > client
>     > > > > > > > is
>     > > > > > > > > > > > producing
>     > > > > > > > > > > > > > > to/consuming from.
>     > > > > > > > > > > > > > > I'll add some sizes to the KIP for some
> typical
>     > > > > > use-cases.
>     > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > > Thanks,
>     > > > > > > > > > > > > > > Magnus
>     > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > > > Thanks
>     > > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > > > On Tue, Oct 19, 2021 at 5:06 PM Magnus
>     > Edenhill <
>     > > > > > > > > > > > magnus@edenhill.se>
>     > > > > > > > > > > > > > > > wrote:
>     > > > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > > > > On Tue, 19 Oct 2021 at 13:22, Tom
>     > > Bentley <
>     > > > > > > > > > > > tbentley@redhat.com
>     > > > > > > > > > > > > > >:
>     > > > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > > > > > Hi Magnus,
>     > > > > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > > > > > I reviewed the KIP since you called
> the
>     > vote
>     > > > > > (sorry for
>     > > > > > > > > not
>     > > > > > > > > > > > > > reviewing
>     > > > > > > > > > > > > > > > when
>     > > > > > > > > > > > > > > > > > you announced your intention to call
> the
>     > > > vote). I
>     > > > > > have
>     > > > > > > > a
>     > > > > > > > > > few
>     > > > > > > > > > > > > > > questions
>     > > > > > > > > > > > > > > > on
>     > > > > > > > > > > > > > > > > > some of the details.
>     > > > > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > > > > > 1. There's no Javadoc on
>     > > > > > ClientTelemetryPayload.data(),
>     > > > > > > > > so
>     > > > > > > > > > I
>     > > > > > > > > > > > don't
>     > > > > > > > > > > > > > > know
>     > > > > > > > > > > > > > > > > > whether the payload is exposed
> through this
>     > > > > method
>     > > > > > as
>     > > > > > > > > > > > compressed or
>     > > > > > > > > > > > > > > > not.
>     > > > > > > > > > > > > > > > > > Later on you say "Decompression of
> the
>     > > payloads
>     > > > > > will be
>     > > > > > > > > > > > handled by
>     > > > > > > > > > > > > > > the
>     > > > > > > > > > > > > > > > > > broker metrics plugin, the broker
> should
>     > > > expose a
>     > > > > > > > > suitable
>     > > > > > > > > > > > > > > > decompression
>     > > > > > > > > > > > > > > > > > API to the metrics plugin for this
>     > purpose.",
>     > > > > which
>     > > > > > > > > > suggests
>     > > > > > > > > > > > it's
>     > > > > > > > > > > > > > the
>     > > > > > > > > > > > > > > > > > compressed data in the buffer, but
> then we
>     > > > don't
>     > > > > > know
>     > > > > > > > > which
>     > > > > > > > > > > > codec
>     > > > > > > > > > > > > > was
>     > > > > > > > > > > > > > > > used,
>     > > > > > > > > > > > > > > > > > nor the API via which the plugin
> should
>     > > > > decompress
>     > > > > > it
>     > > > > > > > if
>     > > > > > > > > > > > required
>     > > > > > > > > > > > > > for
>     > > > > > > > > > > > > > > > > > forwarding to the ultimate metrics
> store.
>     > > > Should
>     > > > > > the
>     > > > > > > > > > > > > > > > ClientTelemetryPayload
>     > > > > > > > > > > > > > > > > > expose a method to get the
> compression and
>     > a
>     > > > > > > > > decompressor?
>     > > > > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > > > > Good point, updated.
>     > > > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > > > > > 2. The client-side API is expressed
> as
>     > > > > > StringOrError
>     > > > > > > > > > > > > > > > > > ClientInstance::ClientInstanceId(int
>     > > > > timeout_ms). I
>     > > > > > > > > > > understand
>     > > > > > > > > > > > that
>     > > > > > > > > > > > > > > > you're
>     > > > > > > > > > > > > > > > > > thinking about the librdkafka
>     > implementation,
>     > > > but
>     > > > > > it
>     > > > > > > > > would
>     > > > > > > > > > be
>     > > > > > > > > > > > good
>     > > > > > > > > > > > > > to
>     > > > > > > > > > > > > > > > show
>     > > > > > > > > > > > > > > > > > the API as it would appear on the
> Apache
>     > > Kafka
>     > > > > > clients.
>     > > > > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > > > > This was meant as pseudo-code, but I
> changed
>     > it
>     > > > to
>     > > > > > Java.
>     > > > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > > > > > 3. "PushTelemetryRequest|Response -
>     > protocol
>     > > > > > request
>     > > > > > > > used
>     > > > > > > > > > by
>     > > > > > > > > > > > the
>     > > > > > > > > > > > > > > > client to
>     > > > > > > > > > > > > > > > > > send metrics to any broker it is
> connected
>     > > to."
>     > > > > To
>     > > > > > be
>     > > > > > > > > > clear,
>     > > > > > > > > > > > this
>     > > > > > > > > > > > > > > means
>     > > > > > > > > > > > > > > > > > that the client can choose any of the
>     > > connected
>     > > > > > brokers
>     > > > > > > > > and
>     > > > > > > > > > > > push to
>     > > > > > > > > > > > > > > > just
>     > > > > > > > > > > > > > > > > > one of them? What should a supporting
>     > client
>     > > do
>     > > > > if
>     > > > > > it
>     > > > > > > > > gets
>     > > > > > > > > > an
>     > > > > > > > > > > > error
>     > > > > > > > > > > > > > > > when
>     > > > > > > > > > > > > > > > > > pushing metrics to a broker, retry
> sending
>     > to
>     > > > the
>     > > > > > same
>     > > > > > > > > > broker
>     > > > > > > > > > > > or
>     > > > > > > > > > > > > > try
>     > > > > > > > > > > > > > > > > > pushing to another broker, or drop
> the
>     > > metrics?
>     > > > > > Should
>     > > > > > > > > > > > supporting
>     > > > > > > > > > > > > > > > clients
>     > > > > > > > > > > > > > > > > > send successive requests to a single
>     > broker,
>     > > or
>     > > > > > round
>     > > > > > > > > > robin,
>     > > > > > > > > > > > or is
>     > > > > > > > > > > > > > > > that up
>     > > > > > > > > > > > > > > > > > to the client author? I'm guessing
> the
>     > > > behaviour
>     > > > > > should
>     > > > > > > > > be
>     > > > > > > > > > > > sticky
>     > > > > > > > > > > > > > to
>     > > > > > > > > > > > > > > > > > support the rate limiting features,
> but I
>     > > think
>     > > > > it
>     > > > > > > > would
>     > > > > > > > > be
>     > > > > > > > > > > > good
>     > > > > > > > > > > > > > for
>     > > > > > > > > > > > > > > > client
>     > > > > > > > > > > > > > > > > > authors if this section were
> explicit on
>     > the
>     > > > > > > > recommended
>     > > > > > > > > > > > behaviour.
>     > > > > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > > > > You are right, I've updated the KIP to
> make
>     > > this
>     > > > > > clearer.
>     > > > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > > > > > 4. "Mapping the client instance id
> to an
>     > > actual
>     > > > > > > > > application
>     > > > > > > > > > > > > > instance
>     > > > > > > > > > > > > > > > > > running on a (virtual) machine can
> be done
>     > by
>     > > > > > > > inspecting
>     > > > > > > > > > the
>     > > > > > > > > > > > > > metrics
>     > > > > > > > > > > > > > > > > > resource labels, such as the client
> source
>     > > > > address
>     > > > > > and
>     > > > > > > > > > source
>     > > > > > > > > > > > port,
>     > > > > > > > > > > > > > > or
>     > > > > > > > > > > > > > > > > > security principal, all of which are
> added
>     > by
>     > > > the
>     > > > > > > > > receiving
>     > > > > > > > > > > > broker.
>     > > > > > > > > > > > > > > > This
>     > > > > > > > > > > > > > > > > > will allow the operator together
> with the
>     > > user
>     > > > to
>     > > > > > > > > identify
>     > > > > > > > > > > the
>     > > > > > > > > > > > > > actual
>     > > > > > > > > > > > > > > > > > application instance." Is this really
>     > always
>     > > > > true?
>     > > > > > The
>     > > > > > > > > > source
>     > > > > > > > > > > > IP
>     > > > > > > > > > > > > > and
>     > > > > > > > > > > > > > > > port
>     > > > > > > > > > > > > > > > > > might be a loadbalancer/proxy in some
>     > setups.
>     > > > The
>     > > > > > > > > > principal,
>     > > > > > > > > > > as
>     > > > > > > > > > > > > > > already
>     > > > > > > > > > > > > > > > > > mentioned in the KIP, might be shared
>     > between
>     > > > > > multiple
>     > > > > > > > > > > > > > applications.
>     > > > > > > > > > > > > > > > So at
>     > > > > > > > > > > > > > > > > > worst the organization running the
> clients
>     > > > might
>     > > > > > have
>     > > > > > > > to
>     > > > > > > > > > > > consult
>     > > > > > > > > > > > > > the
>     > > > > > > > > > > > > > > > logs
>     > > > > > > > > > > > > > > > > > of a set of client applications,
> right?
>     > > > > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > > > > Yes, that's correct. There's no
> guaranteed
>     > > > mapping
>     > > > > > from
>     > > > > > > > > > > > > > > > client_instance_id
>     > > > > > > > > > > > > > > > > to
>     > > > > > > > > > > > > > > > > an actual instance, that's why the KIP
>     > > recommends
>     > > > > > client
>     > > > > > > > > > > > > > > implementations
>     > > > > > > > > > > > > > > > to
>     > > > > > > > > > > > > > > > > log the client instance id
>     > > > > > > > > > > > > > > > > upon retrieval, and also provide an
> API for
>     > the
>     > > > > > > > application
>     > > > > > > > > > to
>     > > > > > > > > > > > > > retrieve
>     > > > > > > > > > > > > > > > the
>     > > > > > > > > > > > > > > > > instance id programmatically
>     > > > > > > > > > > > > > > > > if it has a better way of exposing it.
>     > > > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > > > > 5. "Tests indicate that a compression
> ratio
>     > up
>     > > to
>     > > > > > 10x is
>     > > > > > > > > > > > possible for
>     > > > > > > > > > > > > > > the
>     > > > > > > > > > > > > > > > > > standard metrics." Client authors
> might
>     > > > > appreciate
>     > > > > > your
>     > > > > > > > > > > > mentioning
>     > > > > > > > > > > > > > > > which
>     > > > > > > > > > > > > > > > > > compression codec got these results.
>     > > > > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > > > > Good point. Updated.
>     > > > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > > > > > 6. "Should the client send a push
> request
>     > > prior
>     > > > > to
>     > > > > > > > expiry
>     > > > > > > > > > of
>     > > > > > > > > > > > the
>     > > > > > > > > > > > > > > > previously
>     > > > > > > > > > > > > > > > > > calculated PushIntervalMs the broker
> will
>     > > > discard
>     > > > > > the
>     > > > > > > > > > metrics
>     > > > > > > > > > > > and
>     > > > > > > > > > > > > > > > return a
>     > > > > > > > > > > > > > > > > > PushTelemetryResponse with the
> ErrorCode
>     > set
>     > > to
>     > > > > > > > > > RateLimited."
>     > > > > > > > > > > > Is
>     > > > > > > > > > > > > > this
>     > > > > > > > > > > > > > > > > > RATE_LIMITED a new error code? It's
> not
>     > > > mentioned
>     > > > > > in
>     > > > > > > > the
>     > > > > > > > > > "New
>     > > > > > > > > > > > Error
>     > > > > > > > > > > > > > > > Codes"
>     > > > > > > > > > > > > > > > > > section.
>     > > > > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > > > > That's a leftover, it should be using
> the
>     > > > standard
>     > > > > > > > > > ThrottleTime
>     > > > > > > > > > > > > > > > mechanism.
>     > > > > > > > > > > > > > > > > Fixed.
>     > > > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > > > > > 7. In the section "Standard client
> resource
>     > > > > labels"
>     > > > > > > > > > > > application_id
>     > > > > > > > > > > > > > is
>     > > > > > > > > > > > > > > > > > described as Kafka Streams only, but
> the
>     > > > section
>     > > > > of
>     > > > > > > > > "Client
>     > > > > > > > > > > > > > > > Identification"
>     > > > > > > > > > > > > > > > > > talks about "application instance id
> as an
>     > > > > optional
>     > > > > > > > > future
>     > > > > > > > > > > > > > > nice-to-have
>     > > > > > > > > > > > > > > > > > that may be included as a metrics
> label if
>     > it
>     > > > has
>     > > > > > been
>     > > > > > > > > set
>     > > > > > > > > > by
>     > > > > > > > > > > > the
>     > > > > > > > > > > > > > > > user", so
>     > > > > > > > > > > > > > > > > > I'm confused whether non-Kafka
> Streams
>     > > clients
>     > > > > > should
>     > > > > > > > set
>     > > > > > > > > > an
>     > > > > > > > > > > > > > > > application_id
>     > > > > > > > > > > > > > > > > > or not.
>     > > > > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > > > > I'll clarify this in the KIP, but
> basically
>     > we
>     > > > > would
>     > > > > > need
>     > > > > > > > > to
>     > > > > > > > > > > add
>     > > > > > > > > > > > an `
>     > > > > > > > > > > > > > > > > application.id` config
>     > > > > > > > > > > > > > > > > property for non-streams clients for
> this
>     > > > purpose,
>     > > > > > and
>     > > > > > > > > that's
>     > > > > > > > > > > > outside
>     > > > > > > > > > > > > > > the
>     > > > > > > > > > > > > > > > > scope of this KIP since we want to
> make it
>     > > > > > zero-conf:ish
>     > > > > > > > on
>     > > > > > > > > > the
>     > > > > > > > > > > > > > client
>     > > > > > > > > > > > > > > > side.
>     > > > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > > > > > Kind regards,
>     > > > > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > > > > > Tom
>     > > > > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > > > > Thanks for the review,
>     > > > > > > > > > > > > > > > > Magnus
>     > > > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > > > > > On Thu, Oct 7, 2021 at 5:26 PM Magnus
>     > > Edenhill
>     > > > <
>     > > > > > > > > > > > magnus@edenhill.se
>     > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > > > wrote:
>     > > > > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > > > > > > Hi all,
>     > > > > > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > > > > > > I've updated the KIP following our
> recent
>     > > > > > discussions
>     > > > > > > > > on
>     > > > > > > > > > > the
>     > > > > > > > > > > > > > > mailing
>     > > > > > > > > > > > > > > > > > list:
>     > > > > > > > > > > > > > > > > > >  - split the protocol in two, one
> for
>     > > getting
>     > > > > the
>     > > > > > > > > metrics
>     > > > > > > > > > > > > > > > subscriptions,
>     > > > > > > > > > > > > > > > > > > and one for pushing the metrics.
>     > > > > > > > > > > > > > > > > > >  - simplifications: initially only
> one
>     > > > > supported
>     > > > > > > > > metrics
>     > > > > > > > > > > > format,
>     > > > > > > > > > > > > > no
>     > > > > > > > > > > > > > > > > > > client.id in the instance id, etc.
>     > > > > > > > > > > > > > > > > > >  - made CLIENT_METRICS subscription
>     > > > > configuration
>     > > > > > > > > entries
>     > > > > > > > > > > > more
>     > > > > > > > > > > > > > > > structured
>     > > > > > > > > > > > > > > > > > >    and allowing better client
> matching
>     > > > > selectors
>     > > > > > (not
>     > > > > > > > > > only
>     > > > > > > > > > > > on the
>     > > > > > > > > > > > > > > > > > instance
>     > > > > > > > > > > > > > > > > > > id, but also the other
>     > > > > > > > > > > > > > > > > > >    client resource labels, such as
>     > > > > > > > > client_software_name,
>     > > > > > > > > > > > etc.).
>     > > > > > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > > > > > > Unless there are further comments
> I'll
>     > call
>     > > > the
>     > > > > > vote
>     > > > > > > > > in a
>     > > > > > > > > > > > day or
>     > > > > > > > > > > > > > > two.
>     > > > > > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > > > > > > Regards,
>     > > > > > > > > > > > > > > > > > > Magnus
>     > > > > > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > > > > > > On Mon, 4 Oct 2021 at 20:57,
> Magnus
>     > > > > > Edenhill <
>     > > > > > > > > > > > > > > > magnus@edenhill.se>:
>     > > > > > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > > > > > > > Hi Gwen,
>     > > > > > > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > > > > > > > I'm finishing up the KIP based
> on the
>     > > last
>     > > > > > couple
>     > > > > > > > of
>     > > > > > > > > > > > discussion
>     > > > > > > > > > > > > > > > points
>     > > > > > > > > > > > > > > > > > in
>     > > > > > > > > > > > > > > > > > > > this thread
>     > > > > > > > > > > > > > > > > > > > and will call the Vote later
> this week.
>     > > > > > > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > > > > > > > Best,
>     > > > > > > > > > > > > > > > > > > > Magnus
>     > > > > > > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > > > > > > > On Sat, 2 Oct 2021 at 02:01,
> Gwen
>     > > > > Shapira
>     > > > > > > > > > > > > > > > > > > <gwen@confluent.io.invalid
>     > > > > > > > > > > > > > > > > > > > >:
>     > > > > > > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > > > > > > >> Hey,
>     > > > > > > > > > > > > > > > > > > >>
>     > > > > > > > > > > > > > > > > > > >> I noticed that there was no
> discussion
>     > > for
>     > > > > the
>     > > > > > > > last
>     > > > > > > > > 10
>     > > > > > > > > > > > days,
>     > > > > > > > > > > > > > > but I
>     > > > > > > > > > > > > > > > > > > >> couldn't
>     > > > > > > > > > > > > > > > > > > >> find the vote thread. Is there
> one
>     > that
>     > > > I'm
>     > > > > > > > missing?
>     > > > > > > > > > > > > > > > > > > >>
>     > > > > > > > > > > > > > > > > > > >> Gwen
>     > > > > > > > > > > > > > > > > > > >>
>     > > > > > > > > > > > > > > > > > > >> On Wed, Sep 22, 2021 at 4:58 AM
> Magnus
>     > > > > > Edenhill <
>     > > > > > > > > > > > > > > > magnus@edenhill.se>
>     > > > > > > > > > > > > > > > > > > >> wrote:
>     > > > > > > > > > > > > > > > > > > >>
>     > > > > > > > > > > > > > > > > > > >> > On Tue, 21 Sep 2021 at 06:58,
>     > > > Colin
>     > > > > > > > McCabe <
>     > > > > > > > > > > > > > > > > > cmccabe@apache.org
>     > > > > > > > > > > > > > > > > > > >:
>     > > > > > > > > > > > > > > > > > > >> >
>     > > > > > > > > > > > > > > > > > > >> > > On Mon, Sep 20, 2021, at
> 17:35,
>     > Feng
>     > > > Min
>     > > > > > > > wrote:
>     > > > > > > > > > > > > > > > > > > >> > > > Thanks Magnus & Colin for
> the
>     > > > > > discussion.
>     > > > > > > > > > > > > > > > > > > >> > > >
>     > > > > > > > > > > > > > > > > > > >> > > > Based on KIP-714's
> stateless
>     > > design,
>     > > > > > Client
>     > > > > > > > > can
>     > > > > > > > > > > > pretty
>     > > > > > > > > > > > > > > much
>     > > > > > > > > > > > > > > > use
>     > > > > > > > > > > > > > > > > > > any
>     > > > > > > > > > > > > > > > > > > >> > > > connection to any broker
> to send
>     > > > > > metrics. We
>     > > > > > > > > are
>     > > > > > > > > > > not
>     > > > > > > > > > > > > > > > associating
>     > > > > > > > > > > > > > > > > > > >> > > connection
>     > > > > > > > > > > > > > > > > > > >> > > > with client metric state.
> Is my
>     > > > > > > > understanding
>     > > > > > > > > > > > correct?
>     > > > > > > > > > > > > > If
>     > > > > > > > > > > > > > > > yes,
>     > > > > > > > > > > > > > > > > > > how
>     > > > > > > > > > > > > > > > > > > >> > about
>     > > > > > > > > > > > > > > > > > > >> > > > the following two
> scenarios
>     > > > > > > > > > > > > > > > > > > >> > > >
>     > > > > > > > > > > > > > > > > > > >> > > > 1) One Client (Client-ID)
>     > > registers
>     > > > > two
>     > > > > > > > > > different
>     > > > > > > > > > > > client
>     > > > > > > > > > > > > > > > > > instance
>     > > > > > > > > > > > > > > > > > > id
>     > > > > > > > > > > > > > > > > > > >> > via
>     > > > > > > > > > > > > > > > > > > >> > > > separate registration. Is
> it
>     > > > > permitted?
>     > > > > > If
>     > > > > > > > OK,
>     > > > > > > > > > how
>     > > > > > > > > > > > to
>     > > > > > > > > > > > > > > > > > distinguish
>     > > > > > > > > > > > > > > > > > > >> them
>     > > > > > > > > > > > > > > > > > > >> > > from
>     > > > > > > > > > > > > > > > > > > >> > > > the case 2 below.
>     > > > > > > > > > > > > > > > > > > >> > > >
>     > > > > > > > > > > > > > > > > > > >> > >
>     > > > > > > > > > > > > > > > > > > >> > > Hi Feng,
>     > > > > > > > > > > > > > > > > > > >> > >
>     > > > > > > > > > > > > > > > > > > >> > > My understanding, which
> Magnus can
>     > > > > > clarify I
>     > > > > > > > > > guess,
>     > > > > > > > > > > is
>     > > > > > > > > > > > > > that
>     > > > > > > > > > > > > > > > you
>     > > > > > > > > > > > > > > > > > > could
>     > > > > > > > > > > > > > > > > > > >> > have
>     > > > > > > > > > > > > > > > > > > >> > > something like two Producer
>     > > instances
>     > > > > > running
>     > > > > > > > > with
>     > > > > > > > > > > the
>     > > > > > > > > > > > > > same
>     > > > > > > > > > > > > > > > > > > client.id
>     > > > > > > > > > > > > > > > > > > >> > > (perhaps because they're
> using the
>     > > > same
>     > > > > > config
>     > > > > > > > > > file,
>     > > > > > > > > > > > for
>     > > > > > > > > > > > > > > > example).
>     > > > > > > > > > > > > > > > > > > >> They
>     > > > > > > > > > > > > > > > > > > >> > > could even be in the same
> process.
>     > > But
>     > > > > > they
>     > > > > > > > > would
>     > > > > > > > > > > get
>     > > > > > > > > > > > > > > separate
>     > > > > > > > > > > > > > > > > > > UUIDs.
>     > > > > > > > > > > > > > > > > > > >> > >
>     > > > > > > > > > > > > > > > > > > >> > > I believe Magnus used the
> term
>     > > client
>     > > > to
>     > > > > > mean
>     > > > > > > > > > > > "Producer or
>     > > > > > > > > > > > > > > > > > > Consumer".
>     > > > > > > > > > > > > > > > > > > >> So
>     > > > > > > > > > > > > > > > > > > >> > > if you have both a Producer
> and a
>     > > > > > Consumer in
>     > > > > > > > > your
>     > > > > > > > > > > > > > > > application I
>     > > > > > > > > > > > > > > > > > > would
>     > > > > > > > > > > > > > > > > > > >> > > expect you'd get separate
> UUIDs
>     > for
>     > > > > both.
>     > > > > > > > Again
>     > > > > > > > > > > > Magnus can
>     > > > > > > > > > > > > > > > chime
>     > > > > > > > > > > > > > > > > > in
>     > > > > > > > > > > > > > > > > > > >> > here, I
>     > > > > > > > > > > > > > > > > > > >> > > guess.
>     > > > > > > > > > > > > > > > > > > >> > >
>     > > > > > > > > > > > > > > > > > > >> >
>     > > > > > > > > > > > > > > > > > > >> > That's correct.
>     > > > > > > > > > > > > > > > > > > >> >
>     > > > > > > > > > > > > > > > > > > >> >
>     > > > > > > > > > > > > > > > > > > >> > >
>     > > > > > > > > > > > > > > > > > > >> > > > 2) How about the client
>     > > restarting?
>     > > > > > What's
>     > > > > > > > the
>     > > > > > > > > > > > > > > expectation?
>     > > > > > > > > > > > > > > > > > Should
>     > > > > > > > > > > > > > > > > > > >> the
>     > > > > > > > > > > > > > > > > > > >> > > > server expect the client
> to
>     > carry
>     > > a
>     > > > > > > > persisted
>     > > > > > > > > > > client
>     > > > > > > > > > > > > > > > instance id
>     > > > > > > > > > > > > > > > > > > or
>     > > > > > > > > > > > > > > > > > > >> > > should
>     > > > > > > > > > > > > > > > > > > >> > > > the client be treated as
> a new
>     > > > > instance?
>     > > > > > > > > > > > > > > > > > > >> > >
>     > > > > > > > > > > > > > > > > > > >> > > The KIP doesn't describe any
>     > > mechanism
>     > > > > for
>     > > > > > > > > > > > persistence,
>     > > > > > > > > > > > > > so I
>     > > > > > > > > > > > > > > > would
>     > > > > > > > > > > > > > > > > > > >> assume
>     > > > > > > > > > > > > > > > > > > >> > > that when you restart the
> client
>     > you
>     > > > get
>     > > > > > a new
>     > > > > > > > > > > UUID. I
>     > > > > > > > > > > > > > agree
>     > > > > > > > > > > > > > > > that
>     > > > > > > > > > > > > > > > > > it
>     > > > > > > > > > > > > > > > > > > >> > would
>     > > > > > > > > > > > > > > > > > > >> > > be good to spell this out.
>     > > > > > > > > > > > > > > > > > > >> > >
>     > > > > > > > > > > > > > > > > > > >> > >
>     > > > > > > > > > > > > > > > > > > >> > Right, it will not be
> persisted
>     > since
>     > > a
>     > > > > > client
>     > > > > > > > > > > instance
>     > > > > > > > > > > > > > can't
>     > > > > > > > > > > > > > > be
>     > > > > > > > > > > > > > > > > > > >> restarted.
>     > > > > > > > > > > > > > > > > > > >> >
>     > > > > > > > > > > > > > > > > > > >> > Will update the KIP to make
> this
>     > > > clearer.
>     > > > > > > > > > > > > > > > > > > >> >
>     > > > > > > > > > > > > > > > > > > >> > /Magnus
>     > > > > > > > > > > > > > > > > > > >> >
>     > > > > > > > > > > > > > > > > > > >>
>     > > > > > > > > > > > > > > > > > > >>
>     > > > > > > > > > > > > > > > > > > >> --
>     > > > > > > > > > > > > > > > > > > >> Gwen Shapira
>     > > > > > > > > > > > > > > > > > > >> Engineering Manager | Confluent
>     > > > > > > > > > > > > > > > > > > >> 650.450.2760 | @gwenshap
>     > > > > > > > > > > > > > > > > > > >> Follow us: Twitter | blog
>     > > > > > > > > > > > > > > > > > > >>
>     > > > > > > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > >
>     > > > > > > > > > > > > >
>     > > > > > > > > > > >
>     > > > > > > > > > >
>     > > > > > > > > >
>     > > > > > > > >
>     > > > > > > >
>     > > > > > >
>     > > > > >
>     > > > >
>     > > >
>     > >
>     >
>

Re: [DISCUSS] KIP-714: Client metrics and observability

Posted by Sarat Kakarla <sk...@confluent.io.INVALID>.

Jun,
 
  >>  28. For the broker metrics, could you spell out the full metric name
  >>   including groups, tags, etc? We typically don't add the broker_id label for
  >>   broker metrics. Also, brokers use Yammer metrics, which doesn't have type
  >>   Sum.

Sure, I will update KIP-714 with the above information and remove the broker_id label from the metrics.

Regarding the type, is CumulativeSum the right type to use in place of Sum?

Thanks
Sarat


On 3/8/22, 5:48 PM, "Jun Rao" <ju...@confluent.io.INVALID> wrote:

    Hi, Magnus, Sarat and Xavier,

    Thanks for the reply. A few more comments below.

    20. It seems that we are piggybacking the plugin on the
    existing MetricsReporter. So, this seems fine.

    21. That could work. Are we requiring any additional jar dependency on the
    client? Or, are you suggesting that we check the runtime dependency to pick
    the compression codec?

    28. For the broker metrics, could you spell out the full metric name
    including groups, tags, etc? We typically don't add the broker_id label for
    broker metrics. Also, brokers use Yammer metrics, which doesn't have type
    Sum.

    29. There are several client metrics listed as histogram. However, the java
    client currently doesn't support histogram type.

    30. Could you show an example of the metric payload in PushTelemetryRequest
    to help understand how we organize metrics at different levels (per
    instance, per topic, per partition, per broker, etc)?

    31. Could you add a bit more detail on which client thread sends the
    PushTelemetryRequest?

    Thanks,

    Jun

    On Mon, Mar 7, 2022 at 11:48 AM Magnus Edenhill <ma...@edenhill.se> wrote:

    > Hi Jun,
    >
    > Thanks for your well-informed questions; see my answers below.
    > There have been a number of clarifications to the KIP.
    >
    >
    >
    > On Thu, 27 Jan 2022 at 20:08, Jun Rao <ju...@confluent.io.invalid> wrote:
    >
    > > Hi, Magnus,
    > >
    > > Thanks for updating the KIP. The overall approach makes sense to me. A
    > few
    > > more detailed comments below.
    > >
    > > 20. ClientTelemetry: Should it be extending configurable and closable?
    > >
    >
    > I'll pass this question to Sarat and/or Xavier.
    >
    >
    >
    > > 21. Compression of the metrics on the client: what's the default?
    > >
    >
    > How about we specify a prioritized list: zstd, lz4, snappy, gzip?
    > But ultimately it is up to what the client supports.
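The prioritized list above could be applied roughly as follows. This is a minimal sketch, not Kafka client code: the class name TelemetryCodecChooser and the codec strings are assumptions made for illustration, and only the preference order (zstd, lz4, snappy, gzip) comes from the reply.

```java
import java.util.List;
import java.util.Optional;
import java.util.Set;

// Hypothetical helper, not part of any Kafka API.
final class TelemetryCodecChooser {
    // Preference order suggested in the thread.
    static final List<String> PREFERRED = List.of("zstd", "lz4", "snappy", "gzip");

    // Returns the first preferred codec that the client supports,
    // or an empty Optional, in which case the payload would go uncompressed.
    static Optional<String> choose(Set<String> supportedByClient) {
        return PREFERRED.stream()
                .filter(supportedByClient::contains)
                .findFirst();
    }

    public static void main(String[] args) {
        // A client built with only gzip and lz4 support ends up with lz4.
        System.out.println(choose(Set.of("gzip", "lz4")));
    }
}
```

Keeping the preference list separate from the set of codecs a client was built with matches the point that the final choice depends on what the client supports.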
    >
    >
    > > 23. A client instance is considered a metric resource and the
    > > resource-level (thus client instance level) labels could include:
    > >     client_software_name=confluent-kafka-python
    > >     client_software_version=v2.1.3
    > >     client_instance_id=B64CD139-3975-440A-91D4
    > >     transactional_id=someTxnApp
    > > Are those labels added in PushTelemetryRequest? If so, are they per
    > metric
    > > or per request?
    > >
    >
    >
    > client_software* and client_instance_id are not added by the client, but are
    > available to the broker-side metrics plugin to add as it sees fit. I've
    > removed them from the KIP.
    >
    > Labels such as transactional_id, group_id, etc., which I believe will be
    > useful in troubleshooting, are included only once (per push) as
    > resource-level attributes (the client instance is a singular resource).
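To picture "once per push" versus "per metric", here is a rough sketch of how a push could be laid out in memory. All type and field names (PushPayloadSketch, Push, Point) and the metric name are hypothetical; the real wire format is whatever the KIP's OTLP-based payload defines. The sketch only shows that resource-level attributes sit beside the list of data points rather than inside each one.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory shape of one push; not the actual wire format.
final class PushPayloadSketch {
    // One data point. Per-metric labels (topic, partition, ...) travel with the point.
    static final class Point {
        final String name;
        final Map<String, String> labels;
        final double value;

        Point(String name, Map<String, String> labels, double value) {
            this.name = name;
            this.labels = labels;
            this.value = value;
        }
    }

    // One push: resource-level attributes appear exactly once, next to all points.
    static final class Push {
        final Map<String, String> resourceAttributes;
        final List<Point> points;

        Push(Map<String, String> resourceAttributes, List<Point> points) {
            this.resourceAttributes = resourceAttributes;
            this.points = points;
        }
    }

    // Only attributes that are actually set are included, e.g. a plain
    // producer has no group_id.
    static Push build(String transactionalId, String groupId, List<Point> points) {
        Map<String, String> resource = new LinkedHashMap<>();
        if (transactionalId != null) resource.put("transactional_id", transactionalId);
        if (groupId != null) resource.put("group_id", groupId);
        return new Push(resource, points);
    }

    public static void main(String[] args) {
        Point p = new Point("example.metric", Map.of("topic", "t1"), 1.0);
        Push push = build("someTxnApp", null, List.of(p));
        System.out.println(push.resourceAttributes + " with " + push.points.size() + " point(s)");
    }
}
```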
    >
    >
    > >
    > > 24.  "the broker will only send
    > > GetTelemetrySubscriptionsResponse.DeltaTemporality=True" :
    > > 24.1 If it's always true, does it need to be part of the protocol?
    > >
    >
    > We're anticipating that it will take a lot longer to upgrade the majority
    > of clients than the
    > broker/plugin side, which is why we want the client to support both
    > temporalities out-of-the-box
    > so that cumulative reporting can be turned on seamlessly in the future.
    >
    >
    >
    > > 24.2 Does delta only apply to Counter type?
    > >
    >
    >
    > And Histograms. More details in Xavier's OTLP link.
    >
    >
    >
    > > 24.3 In the delta representation, the first request needs to send the
    > full
    > > value, how does the broker plugin know whether a value is full or delta?
    > >
    >
    > The client may (should) send the start time for each metric sample,
    > indicating when
    > the metric began to be collected.
    > We've discussed whether this should be the client instance start time or
    > the time when a matching
    > metric subscription for that metric is received.
    > For completeness we recommend using the former, the client instance start
    > time.
    >
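    > As a rough illustration of how a receiver could tell a first ("full") value
    > from later deltas, here is a minimal Python sketch. It assumes each sample
    > carries a metric name, a start time and a delta value, and that a changed
    > start time means a new client instance. The class and parameter names are
    > hypothetical and not part of the KIP.
    >
```python
class DeltaAccumulator:
    """Turns delta samples into running totals, keyed by metric name."""

    def __init__(self):
        # metric name -> (start_time_ms, running_total)
        self._state = {}

    def add(self, name, start_time_ms, delta):
        prev = self._state.get(name)
        if prev is None or prev[0] != start_time_ms:
            # First sample for this metric, or the start time changed
            # (assumed here to mean the client instance restarted).
            total = delta
        else:
            total = prev[1] + delta
        self._state[name] = (start_time_ms, total)
        return total


acc = DeltaAccumulator()
first = acc.add("client.request.success", 1000, 5)
second = acc.add("client.request.success", 1000, 3)
after_restart = acc.add("client.request.success", 2000, 2)
print(first, second, after_restart)
```
    >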
    >
    >
    > > 25. quota:
    > > 25.1 Since we are fitting PushTelemetryRequest into the existing request
    > > quota, it would be useful to document the impact, i.e. client metric
    > > throttling causes the data from the same client to be delayed.
    > > 25.2 Is PushTelemetryRequest subject to the write bandwidth quota like
    > the
    > > producer?
    > >
    >
    >
    > Yes, it should be, to protect the cluster from rogue clients.
    > But, in practice the size of metrics will be quite low (e.g., 1-10kb per
    > 60s interval), so I don't think this will pose a problem.
    > The KIP has been updated with more details on quota/throttling behaviour,
    > see the
    > "Throttling and rate-limiting" section.
    >
    >
    > > 25.3 THROTTLING_QUOTA_EXCEEDED: Currently, we don't send this error when
    > > the request/bandwidth quota is exceeded since those requests are not
    > > rejected. We only set this error when the request is rejected (e.g.,
    > topic
    > > creation). It would be useful to clarify when this error is used.
    > >
    >
    > Right, I was trying to reuse an existing error code. We can introduce
    > a new one for the case where a client pushes metrics at a higher frequency
    > than the configured push interval (e.g., out-of-profile sends).
    > This causes the broker to drop those metrics and send this error code back
    > to the client. There will be no connection throttling / channel-muting in
    > this case (unless the standard quotas are exceeded).
    >
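    > The out-of-profile check described above could look roughly like the
    > following Python sketch. It only illustrates the idea of dropping pushes
    > that arrive sooner than the configured push interval; the class name,
    > method name and millisecond timestamps are assumptions, not broker code.
    >
```python
class PushIntervalGuard:
    """Accepts at most one push per client instance per push interval."""

    def __init__(self, push_interval_ms):
        self.push_interval_ms = push_interval_ms
        # client instance id -> time of the last accepted push
        self._last_accepted = {}

    def accept(self, client_instance_id, now_ms):
        last = self._last_accepted.get(client_instance_id)
        if last is not None and now_ms - last < self.push_interval_ms:
            # Out-of-profile send: the metrics would be dropped and an
            # error code returned, without muting the connection.
            return False
        self._last_accepted[client_instance_id] = now_ms
        return True


guard = PushIntervalGuard(push_interval_ms=60_000)
results = [
    guard.accept("client-a", 0),        # first push
    guard.accept("client-a", 10_000),   # too early
    guard.accept("client-a", 60_000),   # a full interval later
    guard.accept("client-b", 10_000),   # different client instance
]
print(results)
```
    >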
    >
    > > 27. kafka-client-metrics.sh: Could we add an example on how to disable a
    > > bad client?
    > >
    >
    > There's now a --block option to kafka-client-metrics.sh which overrides all
    > subscriptions
    > for the matched client(s). This allows silencing metrics for one or more
    > clients without having
    > to remove existing subscriptions. From the client's perspective it will
    > look like it no longer has
    > any subscriptions.
    >
    > # Block metrics collection for a specific client instance.
    > # A descriptive name makes it easier to clean up old subscriptions;
    > # --match selects this specific client instance.
    > $ kafka-client-metrics.sh --bootstrap-server $BROKERS \
    >    --add \
    >    --name 'Disable_b69cc35a' \
    >    --match client_instance_id=b69cc35a-7a54-4790-aa69-cc2bd4ee4538 \
    >    --block
    >
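    > The override semantics of --block can be modelled roughly as follows. This
    > is a minimal Python sketch; the dictionary layout of a subscription
    > ("match", "metrics", "block") is hypothetical and only illustrates that a
    > matching block entry hides all other subscriptions from the client.
    >
```python
def effective_metrics(client_labels, subscriptions):
    """Return the set of metrics a client would be asked to push."""
    matched = [
        s for s in subscriptions
        if all(client_labels.get(k) == v for k, v in s["match"].items())
    ]
    # A matching block entry overrides every other matching subscription.
    if any(s.get("block", False) for s in matched):
        return set()
    result = set()
    for s in matched:
        result.update(s["metrics"])
    return result


subscriptions = [
    {"match": {"client_software_name": "confluent-kafka-python"},
     "metrics": ["client.request.rtt", "client.io.wait.time"]},
    {"match": {"client_instance_id": "b69cc35a-7a54-4790-aa69-cc2bd4ee4538"},
     "metrics": [], "block": True},
]

blocked = effective_metrics(
    {"client_software_name": "confluent-kafka-python",
     "client_instance_id": "b69cc35a-7a54-4790-aa69-cc2bd4ee4538"},
    subscriptions,
)
other = effective_metrics(
    {"client_software_name": "confluent-kafka-python",
     "client_instance_id": "some-other-instance"},
    subscriptions,
)
print(blocked, sorted(other))
```
    >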
    >
    >
    >
    > > 28. New broker side metrics: Could we spell out the details of the
    > metrics
    > > (e.g., group, tags, etc)?
    > >
    >
    > KIP has been updated accordingly (thanks Sarat).
    >
    >
    >
    > >
    > > 29. Client instance-level metrics: client.io.wait.time is a gauge not a
    > > histogram.
    > >
    >
    > I believe a population/distribution should preferably be represented as a
    > histogram, space permitting,
    > and only secondarily as a Gauge average.
    > While we might not want to maintain a bunch of histograms for each
    > partition, since that could be
    > quite space consuming, this client.io.wait.time is a single metric per
    > client instance and can
    > thus afford a Histogram representation.
    >
    >
    >
    > Thanks,
    > Magnus
    >
    >
    >
    > > Thanks,
    > >
    > > Jun
    > >
    > > On Wed, Jan 26, 2022 at 6:32 AM Magnus Edenhill <ma...@edenhill.se>
    > > wrote:
    > >
    > > > Hi all,
    > > >
    > > > I've updated the KIP with responses to the latest comments: Java client
    > > > dependencies (Thanks Kirk!), alternate designs (separate cluster,
    > > separate
    > > > producer, etc), etc.
    > > >
    > > > I will revive the vote thread.
    > > >
    > > > Thanks,
    > > > Magnus
    > > >
    > > >
    > > > On Mon, Dec 13, 2021 at 22:32, Ryanne Dolan <ryannedolan@gmail.com> wrote:
    > > >
    > > > > I think we should be very careful about introducing new runtime
    > > > > dependencies into the clients. Historically this has been rare and
    > > > > essentially necessary (e.g. compression libs).
    > > > >
    > > > > Ryanne
    > > > >
    > > > > On Mon, Dec 13, 2021, 1:06 PM Kirk True <ki...@mustardgrain.com>
    > wrote:
    > > > >
    > > > > > Hi Jun,
    > > > > >
    > > > > > On Thu, Dec 9, 2021, at 2:28 PM, Jun Rao wrote:
    > > > > > > 13. Using OpenTelemetry. Does that require runtime dependency
    > > > > > > on OpenTelemetry library? How good is the compatibility story
    > > > > > > of OpenTelemetry? This is important since an application could
    > have
    > > > > other
    > > > > > > OpenTelemetry dependencies than the Kafka client.
    > > > > >
    > > > > > The current design is that the OpenTelemetry JARs would ship with
    > the
    > > > > > client. Perhaps we can design the client such that the JARs aren't
    > > even
    > > > > > loaded if the user has opted out. The user could even exclude the
    > > JARs
    > > > > from
    > > > > > their dependencies if they so wished.
    > > > > >
    > > > > > I can't speak to the compatibility of the libraries. Is it possible
    > > > that
    > > > > > we include a shaded version?
    > > > > >
    > > > > > Thanks,
    > > > > > Kirk
    > > > > >
    > > > > > >
    > > > > > > 14. The proposal listed idempotence=true. This is more of a
    > > > > configuration
    > > > > > > than a metric. Are we including that as a metric? What other
    > > > > > configurations
    > > > > > > are we including? Should we separate the configurations from the
    > > > > metrics?
    > > > > > >
    > > > > > > Thanks,
    > > > > > >
    > > > > > > Jun
    > > > > > >
    > > > > > > On Mon, Nov 29, 2021 at 7:34 AM Magnus Edenhill <
    > > magnus@edenhill.se>
    > > > > > wrote:
    > > > > > >
    > > > > > > > Hey Bob,
    > > > > > > >
    > > > > > > > That's a good point.
    > > > > > > >
    > > > > > > > Request type labels were considered, but since they're already tracked
    > > > > > > > by broker-side metrics they were left out to avoid metric duplication.
    > > > > > > > However, those broker-side metrics are not per connection, so they won't
    > > > > > > > be that useful in practice for troubleshooting specific client instances.
    > > > > > > >
    > > > > > > > I'll add the request_type label to the relevant metrics.
    > > > > > > >
    > > > > > > > Thanks,
    > > > > > > > Magnus
    > > > > > > >
    > > > > > > >
    > > > > > > > On Tue, Nov 23, 2021 at 19:20, Bob Barrett <bo...@confluent.io.invalid> wrote:
    > > > > > > >
    > > > > > > > > Hi Magnus,
    > > > > > > > >
    > > > > > > > > Thanks for the thorough KIP, this seems very useful.
    > > > > > > > >
    > > > > > > > > Would it make sense to include the request type as a label
    > for
    > > > the
    > > > > > > > > `client.request.success`, `client.request.errors` and
    > > > > > > > `client.request.rtt`
    > > > > > > > > metrics? I think it would be very useful to see which
    > specific
    > > > > > requests
    > > > > > > > are
    > > > > > > > > succeeding and failing for a client. One specific case I can
    > > > think
    > > > > of
    > > > > > > > where
    > > > > > > > > this could be useful is producer batch timeouts. If a Java
    > > > > > application
    > > > > > > > does
    > > > > > > > > not enable producer client logs (unfortunately, in my
    > > experience
    > > > > this
    > > > > > > > > happens more often than it should), the application logs will
    > > > only
    > > > > > > > contain
    > > > > > > > > the expiration error message, but no information about what
    > is
    > > > > > causing
    > > > > > > > the
    > > > > > > > > timeout. The requests might all be succeeding but taking too
    > > long
    > > > > to
    > > > > > > > > process batches, or metadata requests might be failing, or
    > some
    > > > or
    > > > > > all
    > > > > > > > > produce requests might be failing (if the bootstrap servers
    > are
    > > > > > reachable
    > > > > > > > > from the client but one or more other brokers are not, for
    > > > > example).
    > > > > > If
    > > > > > > > the
    > > > > > > > > cluster operator is able to identify the specific requests
    > that
    > > > are
    > > > > > slow
    > > > > > > > or
    > > > > > > > > failing for a client, they will be better able to diagnose
    > the
    > > > > issue
    > > > > > > > > causing batch timeouts.
    > > > > > > > >
    > > > > > > > > One drawback I can think of is that this will increase the
    > > > > > cardinality of
    > > > > > > > > the request metrics. But any given client is only going to
    > use
    > > a
    > > > > > small
    > > > > > > > > subset of the request types, and since we already have
    > > partition
    > > > > > labels
    > > > > > > > for
    > > > > > > > > the topic-level metrics, I think request labels will still
    > make
    > > > up
    > > > > a
    > > > > > > > > relatively small percentage of the set of metrics.
    > > > > > > > >
    > > > > > > > > Thanks,
    > > > > > > > > Bob
    > > > > > > > >
    > > > > > > > > On Mon, Nov 22, 2021 at 2:08 AM Viktor Somogyi-Vass <
    > > > > > > > > viktorsomogyi@gmail.com>
    > > > > > > > > wrote:
    > > > > > > > >
    > > > > > > > > > Hi Magnus,
    > > > > > > > > >
    > > > > > > > > > I think this is a very useful addition. We also have a
    > > similar
    > > > > (but
    > > > > > > > much
    > > > > > > > > > more simplistic) implementation of this. Maybe I missed it
    > in
    > > > the
    > > > > > KIP
    > > > > > > > but
    > > > > > > > > > what about adding metrics about the subscription cache
    > > itself?
    > > > > > That I
    > > > > > > > > think
    > > > > > > > > > would improve its usability and debuggability as we'd be
    > able
    > > > to
    > > > > > see
    > > > > > > > its
    > > > > > > > > > performance, hit/miss rates, eviction counts and others.
    > > > > > > > > >
    > > > > > > > > > Best,
    > > > > > > > > > Viktor
    > > > > > > > > >
    > > > > > > > > > On Thu, Nov 18, 2021 at 5:12 PM Magnus Edenhill <
    > > > > > magnus@edenhill.se>
    > > > > > > > > > wrote:
    > > > > > > > > >
    > > > > > > > > > > Hi Mickael,
    > > > > > > > > > >
    > > > > > > > > > > see inline.
    > > > > > > > > > >
    > > > > > > > > > > On Wed, Nov 10, 2021 at 15:21, Mickael Maison <mickael.maison@gmail.com> wrote:
    > > > > > > > > > >
    > > > > > > > > > > > Hi Magnus,
    > > > > > > > > > > >
    > > > > > > > > > > > I see you've addressed some of the points I raised
    > above
    > > > but
    > > > > > some
    > > > > > > > (4,
    > > > > > > > > > > > 5) have not been addressed yet.
    > > > > > > > > > > >
    > > > > > > > > > >
    > > > > > > > > > > Re 4) How will the user/app know metrics are being sent.
    > > > > > > > > > >
    > > > > > > > > > > One possibility is to add a JMX metric (thus for user
    > > > > > consumption)
    > > > > > > > for
    > > > > > > > > > the
    > > > > > > > > > > number of metric pushes the
    > > > > > > > > > > client has performed, or perhaps the number of metrics
    > > > > > subscriptions
    > > > > > > > > > > currently being collected.
    > > > > > > > > > > Would that be sufficient?
    > > > > > > > > > >
    > > > > > > > > > > Re 5) Metric sizes and rates
    > > > > > > > > > >
    > > > > > > > > > > A worst case scenario for a producer that is producing to
    > > 50
    > > > > > unique
    > > > > > > > > > topics
    > > > > > > > > > > and emitting all standard metrics yields
    > > > > > > > > > > a serialized size of around 100KB prior to compression,
    > > which
    > > > > > > > > compresses
    > > > > > > > > > > down to about 20-30% of that depending
    > > > > > > > > > > on compression type and topic name uniqueness.
    > > > > > > > > > > The numbers for a consumer would be similar.
    > > > > > > > > > >
    > > > > > > > > > > In practice the number of unique topics would be far
    > less,
    > > > and
    > > > > > the
    > > > > > > > > > > subscription set would typically be for a subset of
    > > metrics.
    > > > > > > > > > > So we're probably closer to 1kb, or less, compressed size
    > > per
    > > > > > client
    > > > > > > > > per
    > > > > > > > > > > push interval.
    > > > > > > > > > >
    > > > > > > > > > > As both the subscription set and push intervals are
    > > > controlled
    > > > > > by the
    > > > > > > > > > > cluster operator it shouldn't be too hard
    > > > > > > > > > > to strike a good balance between metrics overhead and
    > > > > > granularity.
    > > > > > > > > > >
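    > > > > > > > > > > To make the arithmetic above concrete, here is a small Python sketch. The
    > > > > > > > > > > 100 KB raw size and the 20-30% compression ratio come from the estimate
    > > > > > > > > > > above; the fleet of 10,000 clients and the 60-second push interval are
    > > > > > > > > > > made-up inputs, and the function names are hypothetical.
    > > > > > > > > > >
```python
def compressed_size(raw_bytes: int, ratio_percent: int) -> int:
    """Size after compression, given the compressed/raw ratio in percent."""
    return raw_bytes * ratio_percent // 100


def fleet_bytes_per_second(bytes_per_push: int, clients: int, interval_s: int) -> float:
    """Aggregate telemetry ingress if every client pushes once per interval."""
    return bytes_per_push * clients / interval_s


# Worst case from the text: about 100 KB raw, compressing to 20-30% of that.
worst_low = compressed_size(100_000, 20)
worst_high = compressed_size(100_000, 30)

# Hypothetical fleet: 10,000 clients, 1 KB per push, one push per 60 s.
typical_rate = fleet_bytes_per_second(1_000, 10_000, 60)

print(worst_low, worst_high, round(typical_rate))
```
    > > > > > > > > > >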
    > > > > > > > > > >
    > > > > > > > > > >
    > > > > > > > > > > >
    > > > > > > > > > > > I'm really uneasy with this being enabled by default on
    > > the
    > > > > > client
    > > > > > > > > > > > side. When collecting data, I think the best practice
    > is
    > > to
    > > > > > ensure
    > > > > > > > > > > > users are explicitly enabling it.
    > > > > > > > > > > >
    > > > > > > > > > >
    > > > > > > > > > > Requiring metrics to be explicitly enabled on clients
    > > > severely
    > > > > > > > cripples
    > > > > > > > > > its
    > > > > > > > > > > usability and value.
    > > > > > > > > > >
    > > > > > > > > > > One of the problems that this KIP aims to solve is for useful metrics
    > > > > > > > > > > to be available on demand, regardless of the technical expertise of the
    > > > > > > > > > > user. As Ryanne points out, a savvy user/organization will typically have
    > > > > > > > > > > metrics collection and monitoring in place already, and the benefits of
    > > > > > > > > > > this KIP are then more of a common set and format of metrics across
    > > > > > > > > > > client implementations and languages.
    > > > > > > > > > > But that is not the typical Kafka user in my experience; they're not
    > > > > > > > > > > Kafka experts and they don't have the knowledge of how to best
    > > > > > > > > > > instrument their clients.
    > > > > > > > > > > Having metrics enabled by default for this user base
    > allows
    > > > the
    > > > > > Kafka
    > > > > > > > > > > operators to proactively and reactively
    > > > > > > > > > > monitor and troubleshoot client issues, without the need
    > > for
    > > > > the
    > > > > > less
    > > > > > > > > > savvy
    > > > > > > > > > > user to do anything.
    > > > > > > > > > > It is often too late to tell a user to enable metrics
    > when
    > > > the
    > > > > > > > problem
    > > > > > > > > > has
    > > > > > > > > > > already occurred.
    > > > > > > > > > >
    > > > > > > > > > > Now, to be clear, even though metrics are enabled by default on clients,
    > > > > > > > > > > they are not enabled by default on the brokers; the Kafka operator needs
    > > > > > > > > > > to build and set up a metrics plugin and add metrics subscriptions
    > > > > > > > > > > before anything is sent from the client.
    > > > > > > > > > > It is opt-out on the clients and opt-in on the broker.
    > > > > > > > > > >
    > > > > > > > > > >
    > > > > > > > > > >
    > > > > > > > > > >
    > > > > > > > > > > > You mentioned brokers already have
    > > > > > > > > > > > some(most?) of the information contained in metrics, if
    > > so
    > > > > > then why
    > > > > > > > > > > > are we collecting it again? Surely there must be some
    > new
    > > > > > > > information
    > > > > > > > > > > > in the client metrics.
    > > > > > > > > > > >
    > > > > > > > > > >
    > > > > > > > > > > From the user's perspective the Kafka infrastructure extends from
    > > > > > > > > > > producer.send() to messages being returned from consumer.poll(), a giant
    > > > > > > > > > > black box where there's a lot going on between those two points. The
    > > > > > > > > > > brokers currently only see what happens once those requests and messages
    > > > > > > > > > > hit the broker, but as Kafka clients are complex pieces of machinery
    > > > > > > > > > > there's a myriad of queues, timers, and state that's critical to the
    > > > > > > > > > > operation and infrastructure that's not currently visible to the operator.
    > > > > > > > > > > Relying on the user to provide this missing information accurately and
    > > > > > > > > > > in a timely manner is not generally feasible.
    > > > > > > > > > >
    > > > > > > > > > >
    > > > > > > > > > > Most of the standard metrics listed in the KIP are data
    > > > points
    > > > > > that
    > > > > > > > the
    > > > > > > > > > > broker does not have.
    > > > > > > > > > > Only a small number of metrics are duplicates (like the
    > > > request
    > > > > > > > counts
    > > > > > > > > > and
    > > > > > > > > > > sizes), but they are included
    > > > > > > > > > > to ease correlation when inspecting these client metrics.
    > > > > > > > > > >
    > > > > > > > > > >
    > > > > > > > > > >
    > > > > > > > > > > > Moreover this is a brand new feature so it's even
    > harder
    > > to
    > > > > > justify
    > > > > > > > > > > > enabling it and forcing onto all our users. If disabled
    > > by
    > > > > > default,
    > > > > > > > > > > > it's relatively easy to enable in a new release if we
    > > > decide
    > > > > > to,
    > > > > > > > but
    > > > > > > > > > > > once enabled by default it's much harder to disable.
    > Also
    > > > > this
    > > > > > > > > feature
    > > > > > > > > > > > will apply to all future metrics we will add.
    > > > > > > > > > > >
    > > > > > > > > > >
    > > > > > > > > > > I think maturity of a feature implementation should be
    > the
    > > > > > deciding
    > > > > > > > > > factor,
    > > > > > > > > > > rather than
    > > > > > > > > > > the design of it (which this KIP is). I.e., if the
    > > > > > implementation is
    > > > > > > > > not
    > > > > > > > > > > deemed mature enough
    > > > > > > > > > > for release X.Y it will be disabled.
    > > > > > > > > > >
    > > > > > > > > > >
    > > > > > > > > > >
    > > > > > > > > > > > Overall I think it's an interesting feature but I'd
    > > prefer
    > > > to
    > > > > > be
    > > > > > > > > > > > slightly defensive and see how it works in practice
    > > before
    > > > > > enabling
    > > > > > > > > it
    > > > > > > > > > > > everywhere.
    > > > > > > > > > > >
    > > > > > > > > > >
    > > > > > > > > > > Right, and I agree on being defensive, but since this
    > > feature
    > > > > > still
    > > > > > > > > > > requires manual
    > > > > > > > > > > enabling on the brokers before actually being used, I
    > think
    > > > > that
    > > > > > > > gives
    > > > > > > > > > > enough control
    > > > > > > > > > > to opt-in or out of this feature as needed.
    > > > > > > > > > >
    > > > > > > > > > > Thanks for your comments!
    > > > > > > > > > >
    > > > > > > > > > > Regards,
    > > > > > > > > > > Magnus
    > > > > > > > > > >
    > > > > > > > > > >
    > > > > > > > > > >
    > > > > > > > > > > > Thanks,
    > > > > > > > > > > > Mickael
    > > > > > > > > > > >
    > > > > > > > > > > > On Mon, Nov 8, 2021 at 7:55 AM Magnus Edenhill <
    > > > > > magnus@edenhill.se
    > > > > > > > >
    > > > > > > > > > > wrote:
    > > > > > > > > > > > >
    > > > > > > > > > > > > Thanks David for pointing this out,
    > > > > > > > > > > > > I've updated the KIP to include client_id as a
    > matching
    > > > > > selector.
    > > > > > > > > > > > >
    > > > > > > > > > > > > Regards,
    > > > > > > > > > > > > Magnus
    > > > > > > > > > > > >
    > > > > > > > > > > > > On Thu, Nov 4, 2021 at 18:01, David Mao <dmao@confluent.io.invalid> wrote:
    > > > > > > > > > > > >
    > > > > > > > > > > > > > Hey Magnus,
    > > > > > > > > > > > > >
    > > > > > > > > > > > > > I noticed that the KIP outlines the initial
    > selectors
    > > > > > supported
    > > > > > > > > as:
    > > > > > > > > > > > > >
    > > > > > > > > > > > > >    - client_instance_id - CLIENT_INSTANCE_ID UUID
    > > > string
    > > > > > > > > > > > representation.
    > > > > > > > > > > > > >    - client_software_name  - client software
    > > > > implementation
    > > > > > > > name.
    > > > > > > > > > > > > >    - client_software_version  - client software
    > > > > > implementation
    > > > > > > > > > > version.
    > > > > > > > > > > > > >
    > > > > > > > > > > > > > In the given reactive monitoring workflow, we
    > mention
    > > > > that
    > > > > > the
    > > > > > > > > > > > application
    > > > > > > > > > > > > > user does not know their client's client instance
    > ID,
    > > > but
    > > > > > it's
    > > > > > > > > > > outlined
    > > > > > > > > > > > > > that the operator can add a metrics subscription
    > > > > selecting
    > > > > > for
    > > > > > > > > > > > clientId. I
    > > > > > > > > > > > > > don't see clientId as one of the supported
    > selectors.
    > > > > > > > > > > > > > I can see how this would have made sense in a
    > > previous
    > > > > > > > iteration
    > > > > > > > > > > given
    > > > > > > > > > > > that
    > > > > > > > > > > > > > the previous client instance ID proposal was to
    > > > construct
    > > > > > the
    > > > > > > > > > client
    > > > > > > > > > > > > > instance ID using clientId as a prefix. Now that
    > the
    > > > > client
    > > > > > > > > > instance
    > > > > > > > > > > > ID is
    > > > > > > > > > > > > > a UUID, would we want to add clientId as a
    > supported
    > > > > > selector?
    > > > > > > > > > > > > > Let me know what you think.
    > > > > > > > > > > > > >
    > > > > > > > > > > > > > David
    > > > > > > > > > > > > >
    > > > > > > > > > > > > > On Tue, Oct 19, 2021 at 12:33 PM Magnus Edenhill <
    > > > > > > > > > magnus@edenhill.se
    > > > > > > > > > > >
    > > > > > > > > > > > > > wrote:
    > > > > > > > > > > > > >
    > > > > > > > > > > > > > > Hi Mickael!
    > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > On Tue, Oct 19, 2021 at 19:30, Mickael Maison <mickael.maison@gmail.com> wrote:
    > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > Hi Magnus,
    > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > Thanks for the proposal.
    > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > 1. Looking at the protocol section, isn't
    > > > > > > > "ClientInstanceId"
    > > > > > > > > > > > expected
    > > > > > > > > > > > > > > > to be a field in
    > > > GetTelemetrySubscriptionsResponseV0?
    > > > > > > > > > Otherwise,
    > > > > > > > > > > > how
    > > > > > > > > > > > > > > > does a client retrieve this value?
    > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > Good catch, it got removed by mistake in one of
    > the
    > > > > > edits.
    > > > > > > > > > > > > > >
    > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > 2. In the client API section, you mention a new
    > > > > method
    > > > > > > > > > > > > > > > "clientInstanceId()". Can you clarify which
    > > > > interfaces
    > > > > > are
    > > > > > > > > > > > affected?
    > > > > > > > > > > > > > > > Is it only Consumer and Producer?
    > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > And Admin. Will update the KIP.
    > > > > > > > > > > > > > >
    > > > > > > > > > > > > > >
    > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > 3. I'm a bit concerned this is enabled by default. Even if the data
    > > > > > > > > > > > > > > > collected is not supposed to be sensitive, I think this can be
    > > > > > > > > > > > > > > > problematic in some environments. Also, users don't seem to have the
    > > > > > > > > > > > > > > > choice to only expose some metrics. Knowing how much data transits
    > > > > > > > > > > > > > > > through some applications can be considered critical.
    > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > The broker already knows how much data transits
    > > > through
    > > > > > the
    > > > > > > > > > client
    > > > > > > > > > > > > > though,
    > > > > > > > > > > > > > > right?
    > > > > > > > > > > > > > > Care has been taken not to expose information in
    > > the
    > > > > > standard
    > > > > > > > > > > metrics
    > > > > > > > > > > > > > that
    > > > > > > > > > > > > > > might
    > > > > > > > > > > > > > > reveal sensitive information.
    > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > Do you have an example of how the proposed metrics could leak
    > > > > > > > > > > > > > > sensitive information?
    > > > > > > > > > > > > > > As for limiting which metrics to export: I guess that could make
    > > > > > > > > > > > > > > sense in some very sensitive use-cases, but those users might disable
    > > > > > > > > > > > > > > metrics altogether for now.
    > > > > > > > > > > > > > > Could these concerns be addressed by a later KIP?
    > > > > > > > > > > > > > >
    > > > > > > > > > > > > > >
    > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > 4. As a user, how do you know if your
    > application
    > > > is
    > > > > > > > actively
    > > > > > > > > > > > sending
    > > > > > > > > > > > > > > > metrics? Are there new metrics exposing what's
    > > > going
    > > > > > on,
    > > > > > > > like
    > > > > > > > > > how
    > > > > > > > > > > > much
    > > > > > > > > > > > > > > > data is being sent?
    > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > That's a good question.
    > > > > > > > > > > > > > > Since the proposed metrics interface is not aimed at, or directly
    > > > > > > > > > > > > > > available to, the application, I guess there's little point in adding
    > > > > > > > > > > > > > > it here; should we instead add something to the existing JMX metrics?
    > > > > > > > > > > > > > > Do you have any suggestions?
    > > > > > > > > > > > > > >
    > > > > > > > > > > > > > >
    > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > 5. If all metrics are enabled on a regular
    > > Consumer
    > > > > or
    > > > > > > > > > Producer,
    > > > > > > > > > > do
    > > > > > > > > > > > > > > > you have an idea how much throughput this would
    > > > use?
    > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > It depends on the number of partitions/topics/etc. the client is
    > > > > > > > > > > > > > > producing to/consuming from.
    > > > > > > > > > > > > > > I'll add some sizes to the KIP for some typical use-cases.
    > > > > > > > > > > > > > >
    > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > Thanks,
    > > > > > > > > > > > > > > Magnus
    > > > > > > > > > > > > > >
    > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > Thanks
    > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > On Tue, Oct 19, 2021 at 5:06 PM Magnus
    > Edenhill <
    > > > > > > > > > > > magnus@edenhill.se>
    > > > > > > > > > > > > > > > wrote:
    > > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > > On Tue, Oct 19, 2021 at 13:22, Tom Bentley <tbentley@redhat.com> wrote:
    > > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > > > Hi Magnus,
    > > > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > > > I reviewed the KIP since you called the
    > vote
    > > > > > (sorry for
    > > > > > > > > not
    > > > > > > > > > > > > > reviewing
    > > > > > > > > > > > > > > > when
    > > > > > > > > > > > > > > > > > you announced your intention to call the
    > > > vote). I
    > > > > > have
    > > > > > > > a
    > > > > > > > > > few
    > > > > > > > > > > > > > > questions
    > > > > > > > > > > > > > > > on
    > > > > > > > > > > > > > > > > > some of the details.
    > > > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > > > 1. There's no Javadoc on
    > > > > > ClientTelemetryPayload.data(),
    > > > > > > > > so
    > > > > > > > > > I
    > > > > > > > > > > > don't
    > > > > > > > > > > > > > > know
    > > > > > > > > > > > > > > > > > whether the payload is exposed through this
    > > > > method
    > > > > > as
    > > > > > > > > > > > compressed or
    > > > > > > > > > > > > > > > not.
    > > > > > > > > > > > > > > > > > Later on you say "Decompression of the
    > > payloads
    > > > > > will be
    > > > > > > > > > > > handled by
    > > > > > > > > > > > > > > the
    > > > > > > > > > > > > > > > > > broker metrics plugin, the broker should
    > > > expose a
    > > > > > > > > suitable
    > > > > > > > > > > > > > > > decompression
    > > > > > > > > > > > > > > > > > API to the metrics plugin for this
    > purpose.",
    > > > > which
    > > > > > > > > > suggests
    > > > > > > > > > > > it's
    > > > > > > > > > > > > > the
    > > > > > > > > > > > > > > > > > compressed data in the buffer, but then we
    > > > don't
    > > > > > know
    > > > > > > > > which
    > > > > > > > > > > > codec
    > > > > > > > > > > > > > was
    > > > > > > > > > > > > > > > used,
    > > > > > > > > > > > > > > > > > nor the API via which the plugin should
    > > > > decompress
    > > > > > it
    > > > > > > > if
    > > > > > > > > > > > required
    > > > > > > > > > > > > > for
    > > > > > > > > > > > > > > > > > forwarding to the ultimate metrics store.
    > > > Should
    > > > > > the
    > > > > > > > > > > > > > > > ClientTelemetryPayload
    > > > > > > > > > > > > > > > > > expose a method to get the compression and
    > a
    > > > > > > > > decompressor?
    > > > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > > Good point, updated.
    > > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > > > 2. The client-side API is expressed as
    > > > > > StringOrError
    > > > > > > > > > > > > > > > > > ClientInstance::ClientInstanceId(int
    > > > > timeout_ms). I
    > > > > > > > > > > understand
    > > > > > > > > > > > that
    > > > > > > > > > > > > > > > you're
    > > > > > > > > > > > > > > > > > thinking about the librdkafka
    > implementation,
    > > > but
    > > > > > it
    > > > > > > > > would
    > > > > > > > > > be
    > > > > > > > > > > > good
    > > > > > > > > > > > > > to
    > > > > > > > > > > > > > > > show
    > > > > > > > > > > > > > > > > > the API as it would appear on the Apache
    > > Kafka
    > > > > > clients.
    > > > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > > This was meant as pseudo-code, but I changed
    > it
    > > > to
    > > > > > Java.
    > > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > > > 3. "PushTelemetryRequest|Response -
    > protocol
    > > > > > request
    > > > > > > > used
    > > > > > > > > > by
    > > > > > > > > > > > the
    > > > > > > > > > > > > > > > client to
    > > > > > > > > > > > > > > > > > send metrics to any broker it is connected
    > > to."
    > > > > To
    > > > > > be
    > > > > > > > > > clear,
    > > > > > > > > > > > this
    > > > > > > > > > > > > > > means
    > > > > > > > > > > > > > > > > > that the client can choose any of the
    > > connected
    > > > > > brokers
    > > > > > > > > and
    > > > > > > > > > > > push to
    > > > > > > > > > > > > > > > just
    > > > > > > > > > > > > > > > > > one of them? What should a supporting
    > client
    > > do
    > > > > if
    > > > > > it
    > > > > > > > > gets
    > > > > > > > > > an
    > > > > > > > > > > > error
    > > > > > > > > > > > > > > > when
    > > > > > > > > > > > > > > > > > pushing metrics to a broker, retry sending
    > to
    > > > the
    > > > > > same
    > > > > > > > > > broker
    > > > > > > > > > > > or
    > > > > > > > > > > > > > try
    > > > > > > > > > > > > > > > > > pushing to another broker, or drop the
    > > metrics?
    > > > > > Should
    > > > > > > > > > > > supporting
    > > > > > > > > > > > > > > > clients
    > > > > > > > > > > > > > > > > > send successive requests to a single
    > broker,
    > > or
    > > > > > round
    > > > > > > > > > robin,
    > > > > > > > > > > > or is
    > > > > > > > > > > > > > > > that up
    > > > > > > > > > > > > > > > > > to the client author? I'm guessing the
    > > > behaviour
    > > > > > should
    > > > > > > > > be
    > > > > > > > > > > > sticky
    > > > > > > > > > > > > > to
    > > > > > > > > > > > > > > > > > support the rate limiting features, but I
    > > think
    > > > > it
    > > > > > > > would
    > > > > > > > > be
    > > > > > > > > > > > good
    > > > > > > > > > > > > > for
    > > > > > > > > > > > > > > > client
    > > > > > > > > > > > > > > > > > authors if this section were explicit on
    > the
    > > > > > > > recommended
    > > > > > > > > > > > behaviour.
    > > > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > > You are right, I've updated the KIP to make
    > > this
    > > > > > clearer.
    > > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > > > 4. "Mapping the client instance id to an
    > > actual
    > > > > > > > > application
    > > > > > > > > > > > > > instance
    > > > > > > > > > > > > > > > > > running on a (virtual) machine can be done
    > by
    > > > > > > > inspecting
    > > > > > > > > > the
    > > > > > > > > > > > > > metrics
    > > > > > > > > > > > > > > > > > resource labels, such as the client source
    > > > > address
    > > > > > and
    > > > > > > > > > source
    > > > > > > > > > > > port,
    > > > > > > > > > > > > > > or
    > > > > > > > > > > > > > > > > > security principal, all of which are added
    > by
    > > > the
    > > > > > > > > receiving
    > > > > > > > > > > > broker.
    > > > > > > > > > > > > > > > This
    > > > > > > > > > > > > > > > > > will allow the operator together with the
    > > user
    > > > to
    > > > > > > > > identify
    > > > > > > > > > > the
    > > > > > > > > > > > > > actual
    > > > > > > > > > > > > > > > > > application instance." Is this really
    > always
    > > > > true?
    > > > > > The
    > > > > > > > > > source
    > > > > > > > > > > > IP
    > > > > > > > > > > > > > and
    > > > > > > > > > > > > > > > port
    > > > > > > > > > > > > > > > > > might be a loadbalancer/proxy in some
    > setups.
    > > > The
    > > > > > > > > > principal,
    > > > > > > > > > > as
    > > > > > > > > > > > > > > already
    > > > > > > > > > > > > > > > > > mentioned in the KIP, might be shared
    > between
    > > > > > multiple
    > > > > > > > > > > > > > applications.
    > > > > > > > > > > > > > > > So at
    > > > > > > > > > > > > > > > > > worst the organization running the clients
    > > > might
    > > > > > have
    > > > > > > > to
    > > > > > > > > > > > consult
    > > > > > > > > > > > > > the
    > > > > > > > > > > > > > > > logs
    > > > > > > > > > > > > > > > > > of a set of client applications, right?
    > > > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > > Yes, that's correct. There's no guaranteed
    > > > mapping
    > > > > > from
    > > > > > > > > > > > > > > > client_instance_id
    > > > > > > > > > > > > > > > > to
    > > > > > > > > > > > > > > > > an actual instance, that's why the KIP
    > > recommends
    > > > > > client
    > > > > > > > > > > > > > > implementations
    > > > > > > > > > > > > > > > to
    > > > > > > > > > > > > > > > > log the client instance id
    > > > > > > > > > > > > > > > > upon retrieval, and also provide an API for
    > the
    > > > > > > > application
    > > > > > > > > > to
    > > > > > > > > > > > > > retrieve
    > > > > > > > > > > > > > > > the
    > > > > > > > > > > > > > > > > instance id programmatically
    > > > > > > > > > > > > > > > > if it has a better way of exposing it.
    > > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > > 5. "Tests indicate that a compression ratio
    > up
    > > to
    > > > > > 10x is
    > > > > > > > > > > > possible for
    > > > > > > > > > > > > > > the
    > > > > > > > > > > > > > > > > > standard metrics." Client authors might
    > > > > appreciate
    > > > > > your
    > > > > > > > > > > > mentioning
    > > > > > > > > > > > > > > > which
    > > > > > > > > > > > > > > > > > compression codec got these results.
    > > > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > > Good point. Updated.
    > > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > > > 6. "Should the client send a push request
    > > prior
    > > > > to
    > > > > > > > expiry
    > > > > > > > > > of
    > > > > > > > > > > > the
    > > > > > > > > > > > > > > > previously
    > > > > > > > > > > > > > > > > > calculated PushIntervalMs the broker will
    > > > discard
    > > > > > the
    > > > > > > > > > metrics
    > > > > > > > > > > > and
    > > > > > > > > > > > > > > > return a
    > > > > > > > > > > > > > > > > > PushTelemetryResponse with the ErrorCode
    > set
    > > to
    > > > > > > > > > RateLimited."
    > > > > > > > > > > > Is
    > > > > > > > > > > > > > this
    > > > > > > > > > > > > > > > > > RATE_LIMITED a new error code? It's not
    > > > mentioned
    > > > > > in
    > > > > > > > the
    > > > > > > > > > "New
    > > > > > > > > > > > Error
    > > > > > > > > > > > > > > > Codes"
    > > > > > > > > > > > > > > > > > section.
    > > > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > > That's a leftover, it should be using the
    > > > standard
    > > > > > > > > > ThrottleTime
    > > > > > > > > > > > > > > > mechanism.
    > > > > > > > > > > > > > > > > Fixed.
    > > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > > > 7. In the section "Standard client resource
    > > > > labels"
    > > > > > > > > > > > application_id
    > > > > > > > > > > > > > is
    > > > > > > > > > > > > > > > > > described as Kafka Streams only, but the
    > > > section
    > > > > of
    > > > > > > > > "Client
    > > > > > > > > > > > > > > > Identification"
    > > > > > > > > > > > > > > > > > talks about "application instance id as an
    > > > > optional
    > > > > > > > > future
    > > > > > > > > > > > > > > nice-to-have
    > > > > > > > > > > > > > > > > > that may be included as a metrics label if
    > it
    > > > has
    > > > > > been
    > > > > > > > > set
    > > > > > > > > > by
    > > > > > > > > > > > the
    > > > > > > > > > > > > > > > user", so
    > > > > > > > > > > > > > > > > > I'm confused whether non-Kafka Streams
    > > clients
    > > > > > should
    > > > > > > > set
    > > > > > > > > > an
    > > > > > > > > > > > > > > > application_id
    > > > > > > > > > > > > > > > > > or not.
    > > > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > > I'll clarify this in the KIP, but basically
    > we
    > > > > would
    > > > > > need
    > > > > > > > > to
    > > > > > > > > > > add
    > > > > > > > > > > > an `
    > > > > > > > > > > > > > > > > application.id` config
    > > > > > > > > > > > > > > > > property for non-streams clients for this
    > > > purpose,
    > > > > > and
    > > > > > > > > that's
    > > > > > > > > > > > outside
    > > > > > > > > > > > > > > the
    > > > > > > > > > > > > > > > > scope of this KIP since we want to make it
    > > > > > zero-conf:ish
    > > > > > > > on
    > > > > > > > > > the
    > > > > > > > > > > > > > client
    > > > > > > > > > > > > > > > side.
    > > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > > > Kind regards,
    > > > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > > > Tom
    > > > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > > Thanks for the review,
    > > > > > > > > > > > > > > > > Magnus
    > > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > > > On Thu, Oct 7, 2021 at 5:26 PM Magnus
    > > Edenhill
    > > > <
    > > > > > > > > > > > magnus@edenhill.se
    > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > wrote:
    > > > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > > > > Hi all,
    > > > > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > > > > I've updated the KIP following our recent
    > > > > > discussions
    > > > > > > > > on
    > > > > > > > > > > the
    > > > > > > > > > > > > > > mailing
    > > > > > > > > > > > > > > > > > list:
    > > > > > > > > > > > > > > > > > >  - split the protocol in two, one for
    > > getting
    > > > > the
    > > > > > > > > metrics
    > > > > > > > > > > > > > > > subscriptions,
    > > > > > > > > > > > > > > > > > > and one for pushing the metrics.
    > > > > > > > > > > > > > > > > > >  - simplifications: initially only one
    > > > > supported
    > > > > > > > > metrics
    > > > > > > > > > > > format,
    > > > > > > > > > > > > > no
    > > > > > > > > > > > > > > > > > > client.id in the instance id, etc.
    > > > > > > > > > > > > > > > > > >  - made CLIENT_METRICS subscription
    > > > > configuration
    > > > > > > > > entries
    > > > > > > > > > > > more
    > > > > > > > > > > > > > > > structured
    > > > > > > > > > > > > > > > > > >    and allowing better client matching
    > > > > selectors
    > > > > > (not
    > > > > > > > > > only
    > > > > > > > > > > > on the
    > > > > > > > > > > > > > > > > > instance
    > > > > > > > > > > > > > > > > > > id, but also the other
    > > > > > > > > > > > > > > > > > >    client resource labels, such as
    > > > > > > > > client_software_name,
    > > > > > > > > > > > etc.).
    > > > > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > > > > Unless there are further comments I'll
    > call
    > > > the
    > > > > > vote
    > > > > > > > > in a
    > > > > > > > > > > > day or
    > > > > > > > > > > > > > > two.
    > > > > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > > > > Regards,
    > > > > > > > > > > > > > > > > > > Magnus
    > > > > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > > > > On Mon, 4 Oct 2021 at 20:57, Magnus Edenhill <magnus@edenhill.se> wrote:
    > > > > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > > > > > Hi Gwen,
    > > > > > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > > > > > I'm finishing up the KIP based on the
    > > last
    > > > > > couple
    > > > > > > > of
    > > > > > > > > > > > discussion
    > > > > > > > > > > > > > > > points
    > > > > > > > > > > > > > > > > > in
    > > > > > > > > > > > > > > > > > > > this thread
    > > > > > > > > > > > > > > > > > > > and will call the Vote later this week.
    > > > > > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > > > > > Best,
    > > > > > > > > > > > > > > > > > > > Magnus
    > > > > > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > > > > > On Sat, 2 Oct 2021 at 02:01, Gwen Shapira <gwen@confluent.io.invalid> wrote:
    > > > > > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > > > > >> Hey,
    > > > > > > > > > > > > > > > > > > >>
    > > > > > > > > > > > > > > > > > > >> I noticed that there was no discussion
    > > for
    > > > > the
    > > > > > > > last
    > > > > > > > > 10
    > > > > > > > > > > > days,
    > > > > > > > > > > > > > > but I
    > > > > > > > > > > > > > > > > > > >> couldn't
    > > > > > > > > > > > > > > > > > > >> find the vote thread. Is there one
    > that
    > > > I'm
    > > > > > > > missing?
    > > > > > > > > > > > > > > > > > > >>
    > > > > > > > > > > > > > > > > > > >> Gwen
    > > > > > > > > > > > > > > > > > > >>
    > > > > > > > > > > > > > > > > > > >> On Wed, Sep 22, 2021 at 4:58 AM Magnus
    > > > > > Edenhill <
    > > > > > > > > > > > > > > > magnus@edenhill.se>
    > > > > > > > > > > > > > > > > > > >> wrote:
    > > > > > > > > > > > > > > > > > > >>
    > > > > > > > > > > > > > > > > > > >> > On Tue, 21 Sep 2021 at 06:58, Colin McCabe <cmccabe@apache.org> wrote:
    > > > > > > > > > > > > > > > > > > >> >
    > > > > > > > > > > > > > > > > > > >> > > On Mon, Sep 20, 2021, at 17:35,
    > Feng
    > > > Min
    > > > > > > > wrote:
    > > > > > > > > > > > > > > > > > > >> > > > Thanks Magnus & Colin for the
    > > > > > discussion.
    > > > > > > > > > > > > > > > > > > >> > > >
    > > > > > > > > > > > > > > > > > > >> > > > Based on KIP-714's stateless
    > > design,
    > > > > > Client
    > > > > > > > > can
    > > > > > > > > > > > pretty
    > > > > > > > > > > > > > > much
    > > > > > > > > > > > > > > > use
    > > > > > > > > > > > > > > > > > > any
    > > > > > > > > > > > > > > > > > > >> > > > connection to any broker to send
    > > > > > metrics. We
    > > > > > > > > are
    > > > > > > > > > > not
    > > > > > > > > > > > > > > > associating
    > > > > > > > > > > > > > > > > > > >> > > connection
    > > > > > > > > > > > > > > > > > > >> > > > with client metric state. Is my
    > > > > > > > understanding
    > > > > > > > > > > > correct?
    > > > > > > > > > > > > > If
    > > > > > > > > > > > > > > > yes,
    > > > > > > > > > > > > > > > > > > how
    > > > > > > > > > > > > > > > > > > >> > about
    > > > > > > > > > > > > > > > > > > >> > > > the following two scenarios
    > > > > > > > > > > > > > > > > > > >> > > >
    > > > > > > > > > > > > > > > > > > >> > > > 1) One Client (Client-ID)
    > > registers
    > > > > two
    > > > > > > > > > different
    > > > > > > > > > > > client
    > > > > > > > > > > > > > > > > > instance
    > > > > > > > > > > > > > > > > > > id
    > > > > > > > > > > > > > > > > > > >> > via
    > > > > > > > > > > > > > > > > > > >> > > > separate registration. Is it
    > > > > permitted?
    > > > > > If
    > > > > > > > OK,
    > > > > > > > > > how
    > > > > > > > > > > > to
    > > > > > > > > > > > > > > > > > distinguish
    > > > > > > > > > > > > > > > > > > >> them
    > > > > > > > > > > > > > > > > > > >> > > from
    > > > > > > > > > > > > > > > > > > >> > > > the case 2 below.
    > > > > > > > > > > > > > > > > > > >> > > >
    > > > > > > > > > > > > > > > > > > >> > >
    > > > > > > > > > > > > > > > > > > >> > > Hi Feng,
    > > > > > > > > > > > > > > > > > > >> > >
    > > > > > > > > > > > > > > > > > > >> > > My understanding, which Magnus can
    > > > > > clarify I
    > > > > > > > > > guess,
    > > > > > > > > > > is
    > > > > > > > > > > > > > that
    > > > > > > > > > > > > > > > you
    > > > > > > > > > > > > > > > > > > could
    > > > > > > > > > > > > > > > > > > >> > have
    > > > > > > > > > > > > > > > > > > >> > > something like two Producer
    > > instances
    > > > > > running
    > > > > > > > > with
    > > > > > > > > > > the
    > > > > > > > > > > > > > same
    > > > > > > > > > > > > > > > > > > client.id
    > > > > > > > > > > > > > > > > > > >> > > (perhaps because they're using the
    > > > same
    > > > > > config
    > > > > > > > > > file,
    > > > > > > > > > > > for
    > > > > > > > > > > > > > > > example).
    > > > > > > > > > > > > > > > > > > >> They
    > > > > > > > > > > > > > > > > > > >> > > could even be in the same process.
    > > But
    > > > > > they
    > > > > > > > > would
    > > > > > > > > > > get
    > > > > > > > > > > > > > > separate
    > > > > > > > > > > > > > > > > > > UUIDs.
    > > > > > > > > > > > > > > > > > > >> > >
    > > > > > > > > > > > > > > > > > > >> > > I believe Magnus used the term
    > > client
    > > > to
    > > > > > mean
    > > > > > > > > > > > "Producer or
    > > > > > > > > > > > > > > > > > > Consumer".
    > > > > > > > > > > > > > > > > > > >> So
    > > > > > > > > > > > > > > > > > > >> > > if you have both a Producer and a
    > > > > > Consumer in
    > > > > > > > > your
    > > > > > > > > > > > > > > > application I
    > > > > > > > > > > > > > > > > > > would
    > > > > > > > > > > > > > > > > > > >> > > expect you'd get separate UUIDs
    > for
    > > > > both.
    > > > > > > > Again
    > > > > > > > > > > > Magnus can
    > > > > > > > > > > > > > > > chime
    > > > > > > > > > > > > > > > > > in
    > > > > > > > > > > > > > > > > > > >> > here, I
    > > > > > > > > > > > > > > > > > > >> > > guess.
    > > > > > > > > > > > > > > > > > > >> > >
    > > > > > > > > > > > > > > > > > > >> >
    > > > > > > > > > > > > > > > > > > >> > That's correct.
    > > > > > > > > > > > > > > > > > > >> >
    > > > > > > > > > > > > > > > > > > >> >
    > > > > > > > > > > > > > > > > > > >> > >
    > > > > > > > > > > > > > > > > > > >> > > > 2) How about the client
    > > restarting?
    > > > > > What's
    > > > > > > > the
    > > > > > > > > > > > > > > expectation?
    > > > > > > > > > > > > > > > > > Should
    > > > > > > > > > > > > > > > > > > >> the
    > > > > > > > > > > > > > > > > > > >> > > > server expect the client to
    > carry
    > > a
    > > > > > > > persisted
    > > > > > > > > > > client
    > > > > > > > > > > > > > > > instance id
    > > > > > > > > > > > > > > > > > > or
    > > > > > > > > > > > > > > > > > > >> > > should
    > > > > > > > > > > > > > > > > > > >> > > > the client be treated as a new
    > > > > instance?
    > > > > > > > > > > > > > > > > > > >> > >
    > > > > > > > > > > > > > > > > > > >> > > The KIP doesn't describe any
    > > mechanism
    > > > > for
    > > > > > > > > > > > persistence,
    > > > > > > > > > > > > > so I
    > > > > > > > > > > > > > > > would
    > > > > > > > > > > > > > > > > > > >> assume
    > > > > > > > > > > > > > > > > > > >> > > that when you restart the client
    > you
    > > > get
    > > > > > a new
    > > > > > > > > > > UUID. I
    > > > > > > > > > > > > > agree
    > > > > > > > > > > > > > > > that
    > > > > > > > > > > > > > > > > > it
    > > > > > > > > > > > > > > > > > > >> > would
    > > > > > > > > > > > > > > > > > > >> > > be good to spell this out.
    > > > > > > > > > > > > > > > > > > >> > >
    > > > > > > > > > > > > > > > > > > >> > >
    > > > > > > > > > > > > > > > > > > >> > Right, it will not be persisted
    > since
    > > a
    > > > > > client
    > > > > > > > > > > instance
    > > > > > > > > > > > > > can't
    > > > > > > > > > > > > > > be
    > > > > > > > > > > > > > > > > > > >> restarted.
    > > > > > > > > > > > > > > > > > > >> >
    > > > > > > > > > > > > > > > > > > >> > Will update the KIP to make this
    > > > clearer.
    > > > > > > > > > > > > > > > > > > >> >
    > > > > > > > > > > > > > > > > > > >> > /Magnus
    > > > > > > > > > > > > > > > > > > >> >
    > > > > > > > > > > > > > > > > > > >>
    > > > > > > > > > > > > > > > > > > >>
    > > > > > > > > > > > > > > > > > > >> --
    > > > > > > > > > > > > > > > > > > >> Gwen Shapira
    > > > > > > > > > > > > > > > > > > >> Engineering Manager | Confluent
    > > > > > > > > > > > > > > > > > > >> 650.450.2760 | @gwenshap
    > > > > > > > > > > > > > > > > > > >> Follow us: Twitter | blog
    > > > > > > > > > > > > > > > > > > >>
    > > > > > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > >
    > > > > > > > > > > > > >
    > > > > > > > > > > >
    > > > > > > > > > >
    > > > > > > > > >
    > > > > > > > >
    > > > > > > >
    > > > > > >
    > > > > >
    > > > >
    > > >
    > >
    >

Re: [DISCUSS] KIP-714: Client metrics and observability

Posted by Kirk True <ki...@mustardgrain.com>.

Hi Jun,

On Tue, Mar 8, 2022, at 5:47 PM, Jun Rao wrote:
> Hi, Magnus, Sarat and Xavier,
> 
> Thanks for the reply. A few more comments below.
> 
> 20. It seems that we are piggybacking the plugin on the
> existing MetricsReporter. So, this seems fine.
> 
> 21. That could work. Are we requiring any additional jar dependency on the
> client? Or, are you suggesting that we check the runtime dependency to pick
> the compression codec?

The Java client doesn't require any additional libraries for compression, no.
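For readers following point 21, here is a minimal sketch of how a client might apply the prioritized codec list suggested further down in the quoted thread (zstd, lz4, snappy, gzip) while still leaving the final choice to whatever the client supports. The class and method names are hypothetical and are not part of the Kafka client API.

```java
import java.util.List;
import java.util.Optional;
import java.util.Set;

// Hypothetical helper for illustration only; not a Kafka client class.
final class TelemetryCodecChooser {
    // Preference order as suggested in the thread: zstd, lz4, snappy, gzip.
    static final List<String> PREFERRED = List.of("zstd", "lz4", "snappy", "gzip");

    // Returns the first preferred codec that this client supports.
    // An empty result means the client would send metrics uncompressed.
    static Optional<String> choose(Set<String> clientSupported) {
        return PREFERRED.stream()
                .filter(clientSupported::contains)
                .findFirst();
    }
}
```

A client built with only gzip and lz4 available would pick lz4 under this scheme, since lz4 ranks higher in the list.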

> 28. For the broker metrics, could you spell out the full metric name
> including groups, tags, etc? We typically don't add the broker_id label for
> broker metrics. Also, brokers use Yammer metrics, which doesn't have type
> Sum.
> 
> 29. There are several client metrics listed as histogram. However, the java
> client currently doesn't support histogram type.

There does appear to be some code related to histograms in the org.apache.kafka.common.metrics.stats package. But we're still looking into the implementation to see if there's anything needed for KIP-714.
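To make the histogram-versus-gauge question in point 29 concrete, here is a small self-contained sketch of a fixed-bucket histogram next to the plain average a gauge would report. It does not use the classes in org.apache.kafka.common.metrics.stats; the class name and bucket bounds are assumptions chosen purely for illustration.

```java
// Hypothetical illustration; not the Kafka metrics implementation.
final class WaitTimeHistogramSketch {
    private final long[] upperBoundsMs; // inclusive upper bound of each bucket
    private final long[] counts;        // one slot per bucket plus a final overflow slot
    private long sumMs;
    private long samples;

    WaitTimeHistogramSketch(long... upperBoundsMs) {
        this.upperBoundsMs = upperBoundsMs.clone();
        this.counts = new long[upperBoundsMs.length + 1];
    }

    // Records one observation in the first bucket whose bound is >= valueMs.
    void record(long valueMs) {
        int i = 0;
        while (i < upperBoundsMs.length && valueMs > upperBoundsMs[i]) {
            i++;
        }
        counts[i]++;
        sumMs += valueMs;
        samples++;
    }

    // Per-bucket counts: this is what a histogram representation preserves.
    long[] bucketCounts() {
        return counts.clone();
    }

    // The single number a gauge average would report for the same data.
    double average() {
        return samples == 0 ? 0.0 : (double) sumMs / samples;
    }
}
```

With three 1 ms waits and one 500 ms wait, the average alone hides the outlier, while the bucket counts show that one sample landed in the overflow bucket.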

> 30. Could you show an example of the metric payload in PushTelemetryRequest
> to help understand how we organize metrics at different levels (per
> instance, per topic, per partition, per broker, etc)?
> 
> 31. Could you add a bit more detail on which client thread sends the
> PushTelemetryRequest?

Yes, I will add that to the KIP.

Thanks,
Kirk

> Thanks,
> 
> Jun
> 
> On Mon, Mar 7, 2022 at 11:48 AM Magnus Edenhill <ma...@edenhill.se> wrote:
> 
> > Hi Jun,
> >
> > Thanks for your well-informed questions; see my answers below.
> > There have been a number of clarifications to the KIP.
> >
> >
> >
> > On Thu, 27 Jan 2022 at 20:08, Jun Rao <ju...@confluent.io.invalid> wrote:
> >
> > > Hi, Magnus,
> > >
> > > Thanks for updating the KIP. The overall approach makes sense to me. A
> > few
> > > more detailed comments below.
> > >
> > > 20. ClientTelemetry: Should it be extending configurable and closable?
> > >
> >
> > I'll pass this question to Sarat and/or Xavier.
> >
> >
> >
> > > 21. Compression of the metrics on the client: what's the default?
> > >
> >
> > How about we specify a prioritized list: zstd, lz4, snappy, gzip?
> > But ultimately it is up to what the client supports.
> >
> >
> > > 23. A client instance is considered a metric resource and the
> > > resource-level (thus client instance level) labels could include:
> > >     client_software_name=confluent-kafka-python
> > >     client_software_version=v2.1.3
> > >     client_instance_id=B64CD139-3975-440A-91D4
> > >     transactional_id=someTxnApp
> > > Are those labels added in PushTelemetryRequest? If so, are they per
> > metric
> > > or per request?
> > >
> >
> >
> > client_software* and client_instance_id are not added by the client, but are
> > available to the broker-side metrics plugin to add as it sees fit; I've
> > removed them from the KIP.
> >
> > As for transactional_id, group_id, etc., which I believe will be useful in
> > troubleshooting, they are included only once (per push) as resource-level
> > attributes (the client instance is a singular resource).
> >
> >
> > >
> > > 24.  "the broker will only send
> > > GetTelemetrySubscriptionsResponse.DeltaTemporality=True" :
> > > 24.1 If it's always true, does it need to be part of the protocol?
> > >
> >
> > We're anticipating that it will take a lot longer to upgrade the majority
> > of clients than the
> > broker/plugin side, which is why we want the client to support both
> > temporalities out-of-the-box
> > so that cumulative reporting can be turned on seamlessly in the future.
> >
> >
> >
> > > 24.2 Does delta only apply to Counter type?
> > >
> >
> >
> > And Histograms. More details in Xavier's OTLP link.
> >
> >
> >
> > > 24.3 In the delta representation, the first request needs to send the
> > full
> > > value, how does the broker plugin know whether a value is full or delta?
> > >
> >
> > The client may (should) send the start time for each metric sample,
> > indicating when
> > the metric began to be collected.
> > We've discussed whether this should be the client instance start time or
> > the time when a matching
> > metric subscription for that metric is received.
> > For completeness we recommend using the former, the client instance start
> > time.
> >
> >
> >
> > > 25. quota:
> > > 25.1 Since we are fitting PushTelemetryRequest into the existing request
> > > quota, it would be useful to document the impact, i.e. client metric
> > > throttling causes the data from the same client to be delayed.
> > > 25.2 Is PushTelemetryRequest subject to the write bandwidth quota like
> > the
> > > producer?
> > >
> >
> >
> > Yes, it should be, so as to protect the cluster from rogue clients.
> > But, in practice the size of metrics will be quite low (e.g., 1-10kb per
> > 60s interval), so I don't think this will pose a problem.
> > The KIP has been updated with more details on quota/throttling behaviour,
> > see the
> > "Throttling and rate-limiting" section.
> >
> >
> > > 25.3 THROTTLING_QUOTA_EXCEEDED: Currently, we don't send this error when
> > > the request/bandwidth quota is exceeded since those requests are not
> > > rejected. We only set this error when the request is rejected (e.g.,
> > topic
> > > creation). It would be useful to clarify when this error is used.
> > >
> >
> > Right, I was trying to reuse an existing error code. We can introduce
> > a new one for the case where a client pushes metrics more frequently
> > than the configured push interval allows (e.g., out-of-profile sends).
> > This causes the broker to drop those metrics and send this error code back
> > to the client. There will be no connection throttling / channel-muting in
> > this case (unless the standard quotas are exceeded).
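The broker-side check described above could be sketched roughly as follows. This is an assumption-laden illustration, not the broker implementation: the class name, the per-instance map, and the boolean return value are all hypothetical, and the real error code had not been settled at this point in the thread.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of dropping pushes that arrive before the push interval has elapsed.
final class PushIntervalGuardSketch {
    private final long pushIntervalMs;
    // Last accepted push time per client instance id.
    private final Map<String, Long> lastAcceptedMs = new HashMap<>();

    PushIntervalGuardSketch(long pushIntervalMs) {
        this.pushIntervalMs = pushIntervalMs;
    }

    // Returns true if the push is accepted. Returns false if it arrived too early,
    // in which case the metrics would be dropped and an error returned to the client.
    boolean tryAccept(String clientInstanceId, long nowMs) {
        Long last = lastAcceptedMs.get(clientInstanceId);
        if (last != null && nowMs - last < pushIntervalMs) {
            return false;
        }
        lastAcceptedMs.put(clientInstanceId, nowMs);
        return true;
    }
}
```

Note that a rejected push does not update the stored timestamp, so a client that keeps retrying early is not penalized beyond the original interval.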
> >
> >
> > > 27. kafka-client-metrics.sh: Could we add an example on how to disable a
> > > bad client?
> > >
> >
> > There's now a --block option to kafka-client-metrics.sh which overrides all
> > subscriptions
> > for the matched client(s). This allows silencing metrics for one or more
> > clients without having
> > to remove existing subscriptions. From the client's perspective it will
> > look like it no longer has
> > any subscriptions.
> >
> > # Block metrics collection for a specific client instance.
> > # A descriptive --name makes it easier to clean up old subscriptions;
> > # --match selects this specific client instance.
> > $ kafka-client-metrics.sh --bootstrap-server $BROKERS \
> >    --add \
> >    --name 'Disable_b69cc35a' \
> >    --match client_instance_id=b69cc35a-7a54-4790-aa69-cc2bd4ee4538 \
> >    --block
> >
> >
> >
> >
> > > 28. New broker side metrics: Could we spell out the details of the
> > metrics
> > > (e.g., group, tags, etc)?
> > >
> >
> > KIP has been updated accordingly (thanks Sarat).
> >
> >
> >
> > >
> > > 29. Client instance-level metrics: client.io.wait.time is a gauge not a
> > > histogram.
> > >
> >
> > I believe a population/distribution should preferably be represented as a
> > histogram, space permitting,
> > and only secondarily as a Gauge average.
> > While we might not want to maintain a bunch of histograms for each
> > partition, since that could be
> > quite space consuming, this client.io.wait.time is a single metric per
> > client instance and can
> > thus afford a Histogram representation.
> >
> >
> >
> > Thanks,
> > Magnus
> >
> >
> >
> > > Thanks,
> > >
> > > Jun
> > >
> > > On Wed, Jan 26, 2022 at 6:32 AM Magnus Edenhill <ma...@edenhill.se>
> > > wrote:
> > >
> > > > Hi all,
> > > >
> > > > I've updated the KIP with responses to the latest comments: Java client
> > > > dependencies (Thanks Kirk!), alternate designs (separate cluster,
> > > separate
> > > > producer, etc), etc.
> > > >
> > > > I will revive the vote thread.
> > > >
> > > > Thanks,
> > > > Magnus
> > > >
> > > >
> > > > On Mon, 13 Dec 2021 at 22:32, Ryanne Dolan <ryannedolan@gmail.com> wrote:
> > > >
> > > > > I think we should be very careful about introducing new runtime
> > > > > dependencies into the clients. Historically this has been rare and
> > > > > essentially necessary (e.g. compression libs).
> > > > >
> > > > > Ryanne
> > > > >
> > > > > On Mon, Dec 13, 2021, 1:06 PM Kirk True <ki...@mustardgrain.com>
> > wrote:
> > > > >
> > > > > > Hi Jun,
> > > > > >
> > > > > > On Thu, Dec 9, 2021, at 2:28 PM, Jun Rao wrote:
> > > > > > > 13. Using OpenTelemetry. Does that require runtime dependency
> > > > > > > on OpenTelemetry library? How good is the compatibility story
> > > > > > > of OpenTelemetry? This is important since an application could
> > have
> > > > > other
> > > > > > > OpenTelemetry dependencies than the Kafka client.
> > > > > >
> > > > > > The current design is that the OpenTelemetry JARs would ship with
> > the
> > > > > > client. Perhaps we can design the client such that the JARs aren't
> > > even
> > > > > > loaded if the user has opted out. The user could even exclude the
> > > JARs
> > > > > from
> > > > > > their dependencies if they so wished.
> > > > > >
> > > > > > I can't speak to the compatibility of the libraries. Is it possible
> > > > that
> > > > > > we include a shaded version?
> > > > > >
> > > > > > Thanks,
> > > > > > Kirk
> > > > > >
> > > > > > >
> > > > > > > 14. The proposal listed idempotence=true. This is more of a
> > > > > configuration
> > > > > > > than a metric. Are we including that as a metric? What other
> > > > > > configurations
> > > > > > > are we including? Should we separate the configurations from the
> > > > > metrics?
> > > > > > >
> > > > > > > Thanks,
> > > > > > >
> > > > > > > Jun
> > > > > > >
> > > > > > > On Mon, Nov 29, 2021 at 7:34 AM Magnus Edenhill <
> > > magnus@edenhill.se>
> > > > > > wrote:
> > > > > > >
> > > > > > > > Hey Bob,
> > > > > > > >
> > > > > > > > That's a good point.
> > > > > > > >
> > > > > > > > > Request type labels were considered, but since they're already tracked by
> > > > > > > > > broker-side metrics
> > > > > > > > > they were left out so as to avoid metric duplication. However, those metrics
> > > > > > > > > are not per connection,
> > > > > > > > > so they won't be that useful in practice for troubleshooting specific
> > > > > > > > > client instances.
> > > > > > > >
> > > > > > > > I'll add the request_type label to the relevant metrics.
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Magnus
> > > > > > > >
> > > > > > > >
> > > > > > > > > On Tue, 23 Nov 2021 at 19:20, Bob Barrett
> > > > > > > > > <bo...@confluent.io.invalid> wrote:
> > > > > > > >
> > > > > > > > > Hi Magnus,
> > > > > > > > >
> > > > > > > > > Thanks for the thorough KIP, this seems very useful.
> > > > > > > > >
> > > > > > > > > Would it make sense to include the request type as a label
> > for
> > > > the
> > > > > > > > > `client.request.success`, `client.request.errors` and
> > > > > > > > `client.request.rtt`
> > > > > > > > > metrics? I think it would be very useful to see which
> > specific
> > > > > > requests
> > > > > > > > are
> > > > > > > > > succeeding and failing for a client. One specific case I can
> > > > think
> > > > > of
> > > > > > > > where
> > > > > > > > > this could be useful is producer batch timeouts. If a Java
> > > > > > application
> > > > > > > > does
> > > > > > > > > not enable producer client logs (unfortunately, in my
> > > experience
> > > > > this
> > > > > > > > > happens more often than it should), the application logs will
> > > > only
> > > > > > > > contain
> > > > > > > > > the expiration error message, but no information about what
> > is
> > > > > > causing
> > > > > > > > the
> > > > > > > > > timeout. The requests might all be succeeding but taking too
> > > long
> > > > > to
> > > > > > > > > process batches, or metadata requests might be failing, or
> > some
> > > > or
> > > > > > all
> > > > > > > > > produce requests might be failing (if the bootstrap servers
> > are
> > > > > > reachable
> > > > > > > > > from the client but one or more other brokers are not, for
> > > > > example).
> > > > > > If
> > > > > > > > the
> > > > > > > > > cluster operator is able to identify the specific requests
> > that
> > > > are
> > > > > > slow
> > > > > > > > or
> > > > > > > > > failing for a client, they will be better able to diagnose
> > the
> > > > > issue
> > > > > > > > > causing batch timeouts.
> > > > > > > > >
> > > > > > > > > One drawback I can think of is that this will increase the
> > > > > > cardinality of
> > > > > > > > > the request metrics. But any given client is only going to
> > use
> > > a
> > > > > > small
> > > > > > > > > subset of the request types, and since we already have
> > > partition
> > > > > > labels
> > > > > > > > for
> > > > > > > > > the topic-level metrics, I think request labels will still
> > make
> > > > up
> > > > > a
> > > > > > > > > relatively small percentage of the set of metrics.
> > > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > > Bob
> > > > > > > > >
> > > > > > > > > On Mon, Nov 22, 2021 at 2:08 AM Viktor Somogyi-Vass <
> > > > > > > > > viktorsomogyi@gmail.com>
> > > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Hi Magnus,
> > > > > > > > > >
> > > > > > > > > > I think this is a very useful addition. We also have a
> > > similar
> > > > > (but
> > > > > > > > much
> > > > > > > > > > more simplistic) implementation of this. Maybe I missed it
> > in
> > > > the
> > > > > > KIP
> > > > > > > > but
> > > > > > > > > > what about adding metrics about the subscription cache
> > > itself?
> > > > > > That I
> > > > > > > > > think
> > > > > > > > > > would improve its usability and debuggability as we'd be
> > able
> > > > to
> > > > > > see
> > > > > > > > its
> > > > > > > > > > performance, hit/miss rates, eviction counts and others.
> > > > > > > > > >
> > > > > > > > > > Best,
> > > > > > > > > > Viktor
> > > > > > > > > >
> > > > > > > > > > On Thu, Nov 18, 2021 at 5:12 PM Magnus Edenhill <
> > > > > > magnus@edenhill.se>
> > > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > Hi Mickael,
> > > > > > > > > > >
> > > > > > > > > > > see inline.
> > > > > > > > > > >
> > > > > > > > > > > > On Wed, 10 Nov 2021 at 15:21, Mickael Maison
> > > > > > > > > > > > <mickael.maison@gmail.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Hi Magnus,
> > > > > > > > > > > >
> > > > > > > > > > > > I see you've addressed some of the points I raised
> > above
> > > > but
> > > > > > some
> > > > > > > > (4,
> > > > > > > > > > > > 5) have not been addressed yet.
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > Re 4) How will the user/app know metrics are being sent.
> > > > > > > > > > >
> > > > > > > > > > > One possibility is to add a JMX metric (thus for user
> > > > > > consumption)
> > > > > > > > for
> > > > > > > > > > the
> > > > > > > > > > > number of metric pushes the
> > > > > > > > > > > client has performed, or perhaps the number of metrics
> > > > > > subscriptions
> > > > > > > > > > > currently being collected.
> > > > > > > > > > > Would that be sufficient?
> > > > > > > > > > >
> > > > > > > > > > > Re 5) Metric sizes and rates
> > > > > > > > > > >
> > > > > > > > > > > A worst case scenario for a producer that is producing to
> > > 50
> > > > > > unique
> > > > > > > > > > topics
> > > > > > > > > > > and emitting all standard metrics yields
> > > > > > > > > > > a serialized size of around 100KB prior to compression,
> > > which
> > > > > > > > > compresses
> > > > > > > > > > > down to about 20-30% of that depending
> > > > > > > > > > > on compression type and topic name uniqueness.
> > > > > > > > > > > The numbers for a consumer would be similar.
> > > > > > > > > > >
> > > > > > > > > > > In practice the number of unique topics would be far
> > less,
> > > > and
> > > > > > the
> > > > > > > > > > > subscription set would typically be for a subset of
> > > metrics.
> > > > > > > > > > > So we're probably closer to 1kb, or less, compressed size
> > > per
> > > > > > client
> > > > > > > > > per
> > > > > > > > > > > push interval.
> > > > > > > > > > >
> > > > > > > > > > > As both the subscription set and push intervals are
> > > > controlled
> > > > > > by the
> > > > > > > > > > > cluster operator it shouldn't be too hard
> > > > > > > > > > > to strike a good balance between metrics overhead and
> > > > > > granularity.
> > > > > > > > > > >
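As a rough sanity check on the sizes quoted above, the arithmetic can be sketched as follows. The fleet size (10,000 clients) and the 60-second push interval are hypothetical values chosen for illustration; neither appears in the thread. Only the ~100 KB worst case, the 20-30% compression figure and the ~1 KB typical size come from the message.

```python
def compressed_size_bytes(uncompressed_bytes: int, compression_ratio: float) -> float:
    """Size after compression; compression_ratio is compressed/uncompressed."""
    return uncompressed_bytes * compression_ratio


def cluster_ingress_bytes_per_sec(push_bytes: float, clients: int,
                                  push_interval_s: float) -> float:
    """Aggregate telemetry ingress if every client pushes once per interval."""
    return push_bytes * clients / push_interval_s


# Worst case from the message: ~100 KB uncompressed, compressing to 20-30%.
worst_low = compressed_size_bytes(100_000, 0.20)
worst_high = compressed_size_bytes(100_000, 0.30)

# Hypothetical fleet (assumption, not from the thread): 10,000 clients, 60 s interval.
typical_rate = cluster_ingress_bytes_per_sec(1_000, clients=10_000, push_interval_s=60.0)
worst_rate = cluster_ingress_bytes_per_sec(worst_high, clients=10_000, push_interval_s=60.0)

print(f"worst-case push: {worst_low:.0f}-{worst_high:.0f} bytes compressed")
print(f"typical cluster ingress: {typical_rate / 1000:.0f} KB/s")
print(f"worst-case cluster ingress: {worst_rate / 1_000_000:.1f} MB/s")
```

Under these assumed numbers the typical case works out to roughly 167 KB/s cluster-wide and the worst case to about 5 MB/s, which is the kind of trade-off the operator would tune through the subscription set and push interval.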
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > I'm really uneasy with this being enabled by default on
> > > the
> > > > > > client
> > > > > > > > > > > > side. When collecting data, I think the best practice
> > is
> > > to
> > > > > > ensure
> > > > > > > > > > > > users are explicitly enabling it.
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > Requiring metrics to be explicitly enabled on clients
> > > > severely
> > > > > > > > cripples
> > > > > > > > > > its
> > > > > > > > > > > usability and value.
> > > > > > > > > > >
> > > > > > > > > > > One of the problems that this KIP aims to solve is for
> > > useful
> > > > > > metrics
> > > > > > > > > to
> > > > > > > > > > be
> > > > > > > > > > > available on demand
> > > > > > > > > > > > regardless of the technical expertise of the user. As Ryanne
> > > > > > > > > > > > points out, a savvy user/organization
> > > > > > > > > > > > will typically have metrics collection and monitoring in place already, and
> > > > > > > > > > > > the benefits of this KIP
> > > > > > > > > > > > are then more of a common set and format of metrics across client
> > > > > > > > > > > > implementations and languages.
> > > > > > > > > > > But that is not the typical Kafka user in my experience,
> > > > > they're
> > > > > > not
> > > > > > > > > > Kafka
> > > > > > > > > > > experts and they don't have the
> > > > > > > > > > > knowledge of how to best instrument their clients.
> > > > > > > > > > > Having metrics enabled by default for this user base
> > allows
> > > > the
> > > > > > Kafka
> > > > > > > > > > > operators to proactively and reactively
> > > > > > > > > > > monitor and troubleshoot client issues, without the need
> > > for
> > > > > the
> > > > > > less
> > > > > > > > > > savvy
> > > > > > > > > > > user to do anything.
> > > > > > > > > > > It is often too late to tell a user to enable metrics
> > when
> > > > the
> > > > > > > > problem
> > > > > > > > > > has
> > > > > > > > > > > already occurred.
> > > > > > > > > > >
> > > > > > > > > > > > Now, to be clear, even though metrics are enabled by default on clients,
> > > > > > > > > > > > they are not enabled by default
> > > > > > > > > > > on the brokers; the Kafka operator needs to build and set
> > > up
> > > > a
> > > > > > > > metrics
> > > > > > > > > > > plugin and add metrics subscriptions
> > > > > > > > > > > before anything is sent from the client.
> > > > > > > > > > > It is opt-out on the clients and opt-in on the broker.
> > > > > > > > > > >
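The "opt-out on the clients, opt-in on the broker" rule described above can be read as a simple conjunction. The sketch below is only an illustration of that reading; the function and parameter names are hypothetical and are not taken from the KIP.

```python
def client_pushes_metrics(client_push_enabled: bool = True,
                          broker_plugin_configured: bool = False,
                          matching_subscription: bool = False) -> bool:
    """Return True only when every party has agreed to metrics being sent.

    Defaults mirror the message: the client side is enabled unless the user
    opts out, while the broker side stays off until the operator sets up a
    metrics plugin and adds a subscription. All names here are hypothetical.
    """
    return client_push_enabled and broker_plugin_configured and matching_subscription


# Out of the box nothing is sent, because the broker has not opted in.
print(client_pushes_metrics())
# Operator installs a plugin and adds a matching subscription.
print(client_pushes_metrics(broker_plugin_configured=True, matching_subscription=True))
# User opts out on the client; the broker configuration no longer matters.
print(client_pushes_metrics(client_push_enabled=False,
                            broker_plugin_configured=True,
                            matching_subscription=True))
```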
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > > You mentioned brokers already have
> > > > > > > > > > > > some(most?) of the information contained in metrics, if
> > > so
> > > > > > then why
> > > > > > > > > > > > are we collecting it again? Surely there must be some
> > new
> > > > > > > > information
> > > > > > > > > > > > in the client metrics.
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > From the user's perspective the Kafka infrastructure
> > > extends
> > > > > from
> > > > > > > > > > > producer.send() to
> > > > > > > > > > > messages being returned from consumer.poll(), a giant
> > black
> > > > box
> > > > > > where
> > > > > > > > > > > there's a lot going on between those
> > > > > > > > > > > > two points. The brokers currently only see what happens once those requests
> > > > > > > > > > > > and messages hit the broker,
> > > > > > > > > > > but as Kafka clients are complex pieces of machinery
> > > there's
> > > > a
> > > > > > myriad
> > > > > > > > > of
> > > > > > > > > > > queues, timers, and state
> > > > > > > > > > > that's critical to the operation and infrastructure
> > that's
> > > > not
> > > > > > > > > currently
> > > > > > > > > > > visible to the operator.
> > > > > > > > > > > Relying on the user to accurately and timely provide this
> > > > > missing
> > > > > > > > > > > information is not generally feasible.
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > Most of the standard metrics listed in the KIP are data
> > > > points
> > > > > > that
> > > > > > > > the
> > > > > > > > > > > broker does not have.
> > > > > > > > > > > Only a small number of metrics are duplicates (like the
> > > > request
> > > > > > > > counts
> > > > > > > > > > and
> > > > > > > > > > > sizes), but they are included
> > > > > > > > > > > to ease correlation when inspecting these client metrics.
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > > Moreover this is a brand new feature so it's even
> > harder
> > > to
> > > > > > justify
> > > > > > > > > > > > enabling it and forcing onto all our users. If disabled
> > > by
> > > > > > default,
> > > > > > > > > > > > it's relatively easy to enable in a new release if we
> > > > decide
> > > > > > to,
> > > > > > > > but
> > > > > > > > > > > > once enabled by default it's much harder to disable.
> > Also
> > > > > this
> > > > > > > > > feature
> > > > > > > > > > > > will apply to all future metrics we will add.
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > I think maturity of a feature implementation should be
> > the
> > > > > > deciding
> > > > > > > > > > factor,
> > > > > > > > > > > rather than
> > > > > > > > > > > the design of it (which this KIP is). I.e., if the
> > > > > > implementation is
> > > > > > > > > not
> > > > > > > > > > > deemed mature enough
> > > > > > > > > > > for release X.Y it will be disabled.
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > > Overall I think it's an interesting feature but I'd
> > > prefer
> > > > to
> > > > > > be
> > > > > > > > > > > > slightly defensive and see how it works in practice
> > > before
> > > > > > enabling
> > > > > > > > > it
> > > > > > > > > > > > everywhere.
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > Right, and I agree on being defensive, but since this
> > > feature
> > > > > > still
> > > > > > > > > > > requires manual
> > > > > > > > > > > enabling on the brokers before actually being used, I
> > think
> > > > > that
> > > > > > > > gives
> > > > > > > > > > > enough control
> > > > > > > > > > > to opt-in or out of this feature as needed.
> > > > > > > > > > >
> > > > > > > > > > > Thanks for your comments!
> > > > > > > > > > >
> > > > > > > > > > > Regards,
> > > > > > > > > > > Magnus
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > > Thanks,
> > > > > > > > > > > > Mickael
> > > > > > > > > > > >
> > > > > > > > > > > > On Mon, Nov 8, 2021 at 7:55 AM Magnus Edenhill <
> > > > > > magnus@edenhill.se
> > > > > > > > >
> > > > > > > > > > > wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > Thanks David for pointing this out,
> > > > > > > > > > > > > I've updated the KIP to include client_id as a
> > matching
> > > > > > selector.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Regards,
> > > > > > > > > > > > > Magnus
> > > > > > > > > > > > >
> > > > > > > > > > > > > > On Thu, 4 Nov 2021 at 18:01, David Mao
> > > > > > > > > > > > > > <dmao@confluent.io.invalid> wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > Hey Magnus,
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > I noticed that the KIP outlines the initial
> > selectors
> > > > > > supported
> > > > > > > > > as:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >    - client_instance_id - CLIENT_INSTANCE_ID UUID
> > > > string
> > > > > > > > > > > > representation.
> > > > > > > > > > > > > >    - client_software_name  - client software
> > > > > implementation
> > > > > > > > name.
> > > > > > > > > > > > > >    - client_software_version  - client software
> > > > > > implementation
> > > > > > > > > > > version.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > In the given reactive monitoring workflow, we
> > mention
> > > > > that
> > > > > > the
> > > > > > > > > > > > application
> > > > > > > > > > > > > > user does not know their client's client instance
> > ID,
> > > > but
> > > > > > it's
> > > > > > > > > > > outlined
> > > > > > > > > > > > > > that the operator can add a metrics subscription
> > > > > selecting
> > > > > > for
> > > > > > > > > > > > clientId. I
> > > > > > > > > > > > > > don't see clientId as one of the supported
> > selectors.
> > > > > > > > > > > > > > I can see how this would have made sense in a
> > > previous
> > > > > > > > iteration
> > > > > > > > > > > given
> > > > > > > > > > > > that
> > > > > > > > > > > > > > the previous client instance ID proposal was to
> > > > construct
> > > > > > the
> > > > > > > > > > client
> > > > > > > > > > > > > > instance ID using clientId as a prefix. Now that
> > the
> > > > > client
> > > > > > > > > > instance
> > > > > > > > > > > > ID is
> > > > > > > > > > > > > > a UUID, would we want to add clientId as a
> > supported
> > > > > > selector?
> > > > > > > > > > > > > > Let me know what you think.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > David
> > > > > > > > > > > > > >
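To make the selector discussion concrete, here is a minimal sketch of how a subscription's selectors might be checked against a client, including the client_id selector added in response to this message. It assumes plain string equality and that an empty selector set matches every client; the KIP may define matching differently, and the function and label names are hypothetical.

```python
from typing import Mapping

# Selector names from the thread, plus client_id as added in the reply.
SUPPORTED_SELECTORS = frozenset({
    "client_instance_id",
    "client_software_name",
    "client_software_version",
    "client_id",
})


def subscription_matches(selectors: Mapping[str, str],
                         client_labels: Mapping[str, str]) -> bool:
    """Return True if every selector equals the client's corresponding label.

    Assumption: exact string equality; an empty selector set matches all clients.
    """
    unknown = set(selectors) - SUPPORTED_SELECTORS
    if unknown:
        raise ValueError(f"unsupported selectors: {sorted(unknown)}")
    return all(client_labels.get(name) == value for name, value in selectors.items())


# Hypothetical client: the user knows the client_id but not the instance UUID.
client = {
    "client_instance_id": "b69cc35a-7a54-4790-aa69-cc2bd4ee4538",
    "client_software_name": "librdkafka",
    "client_software_version": "1.9.0",
    "client_id": "payments-producer",
}

print(subscription_matches({"client_id": "payments-producer"}, client))
print(subscription_matches({"client_id": "other-app"}, client))
```

This shows why client_id is a useful selector in the reactive workflow: the operator can target a client by a name the application owner already knows.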
> > > > > > > > > > > > > > On Tue, Oct 19, 2021 at 12:33 PM Magnus Edenhill <
> > > > > > > > > > magnus@edenhill.se
> > > > > > > > > > > >
> > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Hi Mickael!
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > On Tue, 19 Oct 2021 at 19:30, Mickael Maison
> > > > > > > > > > > > > > > > <mickael.maison@gmail.com> wrote:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Hi Magnus,
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Thanks for the proposal.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > 1. Looking at the protocol section, isn't
> > > > > > > > "ClientInstanceId"
> > > > > > > > > > > > expected
> > > > > > > > > > > > > > > > to be a field in
> > > > GetTelemetrySubscriptionsResponseV0?
> > > > > > > > > > Otherwise,
> > > > > > > > > > > > how
> > > > > > > > > > > > > > > > does a client retrieve this value?
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Good catch, it got removed by mistake in one of
> > the
> > > > > > edits.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > 2. In the client API section, you mention a new
> > > > > method
> > > > > > > > > > > > > > > > "clientInstanceId()". Can you clarify which
> > > > > interfaces
> > > > > > are
> > > > > > > > > > > > affected?
> > > > > > > > > > > > > > > > Is it only Consumer and Producer?
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > And Admin. Will update the KIP.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > 3. I'm a bit concerned this is enabled by
> > > default.
> > > > > > Even if
> > > > > > > > > the
> > > > > > > > > > > data
> > > > > > > > > > > > > > > > collected is supposed to be not sensitive, I
> > > think
> > > > > > this can
> > > > > > > > > be
> > > > > > > > > > > > > > > > problematic in some environments. Also users
> > > don't
> > > > > > seem to
> > > > > > > > > have
> > > > > > > > > > > the
> > > > > > > > > > > > > > > > > choice to only expose some metrics. Knowing how much data transits
> > > > > > > > > > > > > > > > > through some applications can be considered critical.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > The broker already knows how much data transits
> > > > through
> > > > > > the
> > > > > > > > > > client
> > > > > > > > > > > > > > though,
> > > > > > > > > > > > > > > right?
> > > > > > > > > > > > > > > Care has been taken not to expose information in
> > > the
> > > > > > standard
> > > > > > > > > > > metrics
> > > > > > > > > > > > > > that
> > > > > > > > > > > > > > > might
> > > > > > > > > > > > > > > reveal sensitive information.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Do you have an example of how the proposed
> > metrics
> > > > > could
> > > > > > leak
> > > > > > > > > > > > sensitive
> > > > > > > > > > > > > > > information?
> > > > > > > > > > > > > > > > As for limiting what metrics to export: I guess that could make sense
> > > > > > > > > > > > > > > > in some
> > > > > > > > > > > > > > > > very sensitive use-cases, but those users might disable metrics altogether
> > > > > > > > > > > > > > > > for now.
> > > > > > > > > > > > > > > Could these concerns be addressed by a later KIP?
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > 4. As a user, how do you know if your
> > application
> > > > is
> > > > > > > > actively
> > > > > > > > > > > > sending
> > > > > > > > > > > > > > > > metrics? Are there new metrics exposing what's
> > > > going
> > > > > > on,
> > > > > > > > like
> > > > > > > > > > how
> > > > > > > > > > > > much
> > > > > > > > > > > > > > > > data is being sent?
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > That's a good question.
> > > > > > > > > > > > > > > Since the proposed metrics interface is not aimed
> > > at,
> > > > > or
> > > > > > > > > directly
> > > > > > > > > > > > > > available
> > > > > > > > > > > > > > > to, the application
> > > > > > > > > > > > > > > I guess there's little point of adding it here,
> > but
> > > > > > instead
> > > > > > > > > > adding
> > > > > > > > > > > > > > > something to the
> > > > > > > > > > > > > > > existing JMX metrics?
> > > > > > > > > > > > > > > Do you have any suggestions?
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > 5. If all metrics are enabled on a regular
> > > Consumer
> > > > > or
> > > > > > > > > > Producer,
> > > > > > > > > > > do
> > > > > > > > > > > > > > > > you have an idea how much throughput this would
> > > > use?
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > It depends on the number of partitions/topics/etc. the client is producing
> > > > > > > > > > > > > > > > to/consuming from.
> > > > > > > > > > > > > > > I'll add some sizes to the KIP for some typical
> > > > > > use-cases.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > > > Magnus
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Thanks
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > On Tue, Oct 19, 2021 at 5:06 PM Magnus
> > Edenhill <
> > > > > > > > > > > > magnus@edenhill.se>
> > > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > On Tue, 19 Oct 2021 at 13:22, Tom Bentley
> > > > > > > > > > > > > > > > > > <tbentley@redhat.com> wrote:
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Hi Magnus,
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > I reviewed the KIP since you called the
> > vote
> > > > > > (sorry for
> > > > > > > > > not
> > > > > > > > > > > > > > reviewing
> > > > > > > > > > > > > > > > when
> > > > > > > > > > > > > > > > > > you announced your intention to call the
> > > > vote). I
> > > > > > have
> > > > > > > > a
> > > > > > > > > > few
> > > > > > > > > > > > > > > questions
> > > > > > > > > > > > > > > > on
> > > > > > > > > > > > > > > > > > some of the details.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > 1. There's no Javadoc on
> > > > > > ClientTelemetryPayload.data(),
> > > > > > > > > so
> > > > > > > > > > I
> > > > > > > > > > > > don't
> > > > > > > > > > > > > > > know
> > > > > > > > > > > > > > > > > > whether the payload is exposed through this
> > > > > method
> > > > > > as
> > > > > > > > > > > > compressed or
> > > > > > > > > > > > > > > > not.
> > > > > > > > > > > > > > > > > > Later on you say "Decompression of the
> > > payloads
> > > > > > will be
> > > > > > > > > > > > handled by
> > > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > > broker metrics plugin, the broker should
> > > > expose a
> > > > > > > > > suitable
> > > > > > > > > > > > > > > > decompression
> > > > > > > > > > > > > > > > > > API to the metrics plugin for this
> > purpose.",
> > > > > which
> > > > > > > > > > suggests
> > > > > > > > > > > > it's
> > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > > compressed data in the buffer, but then we
> > > > don't
> > > > > > know
> > > > > > > > > which
> > > > > > > > > > > > codec
> > > > > > > > > > > > > > was
> > > > > > > > > > > > > > > > used,
> > > > > > > > > > > > > > > > > > nor the API via which the plugin should
> > > > > decompress
> > > > > > it
> > > > > > > > if
> > > > > > > > > > > > required
> > > > > > > > > > > > > > for
> > > > > > > > > > > > > > > > > > forwarding to the ultimate metrics store.
> > > > Should
> > > > > > the
> > > > > > > > > > > > > > > > ClientTelemetryPayload
> > > > > > > > > > > > > > > > > > expose a method to get the compression and
> > a
> > > > > > > > > decompressor?
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Good point, updated.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > 2. The client-side API is expressed as
> > > > > > StringOrError
> > > > > > > > > > > > > > > > > > ClientInstance::ClientInstanceId(int
> > > > > timeout_ms). I
> > > > > > > > > > > understand
> > > > > > > > > > > > that
> > > > > > > > > > > > > > > > you're
> > > > > > > > > > > > > > > > > > thinking about the librdkafka
> > implementation,
> > > > but
> > > > > > it
> > > > > > > > > would
> > > > > > > > > > be
> > > > > > > > > > > > good
> > > > > > > > > > > > > > to
> > > > > > > > > > > > > > > > show
> > > > > > > > > > > > > > > > > > the API as it would appear on the Apache
> > > Kafka
> > > > > > clients.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > This was meant as pseudo-code, but I changed
> > it
> > > > to
> > > > > > Java.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > 3. "PushTelemetryRequest|Response -
> > protocol
> > > > > > request
> > > > > > > > used
> > > > > > > > > > by
> > > > > > > > > > > > the
> > > > > > > > > > > > > > > > client to
> > > > > > > > > > > > > > > > > > send metrics to any broker it is connected
> > > to."
> > > > > To
> > > > > > be
> > > > > > > > > > clear,
> > > > > > > > > > > > this
> > > > > > > > > > > > > > > means
> > > > > > > > > > > > > > > > > > that the client can choose any of the
> > > connected
> > > > > > brokers
> > > > > > > > > and
> > > > > > > > > > > > push to
> > > > > > > > > > > > > > > > just
> > > > > > > > > > > > > > > > > > one of them? What should a supporting
> > client
> > > do
> > > > > if
> > > > > > it
> > > > > > > > > gets
> > > > > > > > > > an
> > > > > > > > > > > > error
> > > > > > > > > > > > > > > > when
> > > > > > > > > > > > > > > > > > pushing metrics to a broker, retry sending
> > to
> > > > the
> > > > > > same
> > > > > > > > > > broker
> > > > > > > > > > > > or
> > > > > > > > > > > > > > try
> > > > > > > > > > > > > > > > > > pushing to another broker, or drop the
> > > metrics?
> > > > > > Should
> > > > > > > > > > > > supporting
> > > > > > > > > > > > > > > > clients
> > > > > > > > > > > > > > > > > > send successive requests to a single
> > broker,
> > > or
> > > > > > round
> > > > > > > > > > robin,
> > > > > > > > > > > > or is
> > > > > > > > > > > > > > > > that up
> > > > > > > > > > > > > > > > > > to the client author? I'm guessing the
> > > > behaviour
> > > > > > should
> > > > > > > > > be
> > > > > > > > > > > > sticky
> > > > > > > > > > > > > > to
> > > > > > > > > > > > > > > > > > support the rate limiting features, but I
> > > think
> > > > > it
> > > > > > > > would
> > > > > > > > > be
> > > > > > > > > > > > good
> > > > > > > > > > > > > > for
> > > > > > > > > > > > > > > > client
> > > > > > > > > > > > > > > > > > authors if this section were explicit on
> > the
> > > > > > > > recommended
> > > > > > > > > > > > behaviour.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > You are right, I've updated the KIP to make
> > > this
> > > > > > clearer.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > 4. "Mapping the client instance id to an
> > > actual
> > > > > > > > > application
> > > > > > > > > > > > > > instance
> > > > > > > > > > > > > > > > > > running on a (virtual) machine can be done
> > by
> > > > > > > > inspecting
> > > > > > > > > > the
> > > > > > > > > > > > > > metrics
> > > > > > > > > > > > > > > > > > resource labels, such as the client source
> > > > > address
> > > > > > and
> > > > > > > > > > source
> > > > > > > > > > > > port,
> > > > > > > > > > > > > > > or
> > > > > > > > > > > > > > > > > > security principal, all of which are added
> > by
> > > > the
> > > > > > > > > receiving
> > > > > > > > > > > > broker.
> > > > > > > > > > > > > > > > This
> > > > > > > > > > > > > > > > > > will allow the operator together with the
> > > user
> > > > to
> > > > > > > > > identify
> > > > > > > > > > > the
> > > > > > > > > > > > > > actual
> > > > > > > > > > > > > > > > > > application instance." Is this really
> > always
> > > > > true?
> > > > > > The
> > > > > > > > > > source
> > > > > > > > > > > > IP
> > > > > > > > > > > > > > and
> > > > > > > > > > > > > > > > port
> > > > > > > > > > > > > > > > > > might be a loadbalancer/proxy in some
> > setups.
> > > > The
> > > > > > > > > > principal,
> > > > > > > > > > > as
> > > > > > > > > > > > > > > already
> > > > > > > > > > > > > > > > > > mentioned in the KIP, might be shared
> > between
> > > > > > multiple
> > > > > > > > > > > > > > applications.
> > > > > > > > > > > > > > > > So at
> > > > > > > > > > > > > > > > > > worst the organization running the clients
> > > > might
> > > > > > have
> > > > > > > > to
> > > > > > > > > > > > consult
> > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > logs
> > > > > > > > > > > > > > > > > > of a set of client applications, right?
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Yes, that's correct. There's no guaranteed
> > > > mapping
> > > > > > from
> > > > > > > > > > > > > > > > client_instance_id
> > > > > > > > > > > > > > > > > to
> > > > > > > > > > > > > > > > > an actual instance, that's why the KIP
> > > recommends
> > > > > > client
> > > > > > > > > > > > > > > implementations
> > > > > > > > > > > > > > > > to
> > > > > > > > > > > > > > > > > log the client instance id
> > > > > > > > > > > > > > > > > upon retrieval, and also provide an API for
> > the
> > > > > > > > application
> > > > > > > > > > to
> > > > > > > > > > > > > > retrieve
> > > > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > instance id programmatically
> > > > > > > > > > > > > > > > > if it has a better way of exposing it.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > 5. "Tests indicate that a compression ratio
> > up
> > > to
> > > > > > 10x is
> > > > > > > > > > > > possible for
> > > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > > standard metrics." Client authors might
> > > > > appreciate
> > > > > > your
> > > > > > > > > > > > mentioning
> > > > > > > > > > > > > > > > which
> > > > > > > > > > > > > > > > > > compression codec got these results.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Good point. Updated.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > 6. "Should the client send a push request
> > > prior
> > > > > to
> > > > > > > > expiry
> > > > > > > > > > of
> > > > > > > > > > > > the
> > > > > > > > > > > > > > > > previously
> > > > > > > > > > > > > > > > > > calculated PushIntervalMs the broker will
> > > > discard
> > > > > > the
> > > > > > > > > > metrics
> > > > > > > > > > > > and
> > > > > > > > > > > > > > > > return a
> > > > > > > > > > > > > > > > > > PushTelemetryResponse with the ErrorCode
> > set
> > > to
> > > > > > > > > > RateLimited."
> > > > > > > > > > > > Is
> > > > > > > > > > > > > > this
> > > > > > > > > > > > > > > > > > RATE_LIMITED a new error code? It's not
> > > > mentioned
> > > > > > in
> > > > > > > > the
> > > > > > > > > > "New
> > > > > > > > > > > > Error
> > > > > > > > > > > > > > > > Codes"
> > > > > > > > > > > > > > > > > > section.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > That's a leftover, it should be using the
> > > > standard
> > > > > > > > > > ThrottleTime
> > > > > > > > > > > > > > > > mechanism.
> > > > > > > > > > > > > > > > > Fixed.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > 7. In the section "Standard client resource
> > > > > labels"
> > > > > > > > > > > > application_id
> > > > > > > > > > > > > > is
> > > > > > > > > > > > > > > > > > described as Kafka Streams only, but the
> > > > section
> > > > > of
> > > > > > > > > "Client
> > > > > > > > > > > > > > > > Identification"
> > > > > > > > > > > > > > > > > > talks about "application instance id as an
> > > > > optional
> > > > > > > > > future
> > > > > > > > > > > > > > > nice-to-have
> > > > > > > > > > > > > > > > > > that may be included as a metrics label if
> > it
> > > > has
> > > > > > been
> > > > > > > > > set
> > > > > > > > > > by
> > > > > > > > > > > > the
> > > > > > > > > > > > > > > > user", so
> > > > > > > > > > > > > > > > > > I'm confused whether non-Kafka Streams
> > > clients
> > > > > > should
> > > > > > > > set
> > > > > > > > > > an
> > > > > > > > > > > > > > > > application_id
> > > > > > > > > > > > > > > > > > or not.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > I'll clarify this in the KIP, but basically
> > we
> > > > > would
> > > > > > need
> > > > > > > > > to
> > > > > > > > > > > add
> > > > > > > > > > > > an `
> > > > > > > > > > > > > > > > > application.id` config
> > > > > > > > > > > > > > > > > property for non-streams clients for this
> > > > purpose,
> > > > > > and
> > > > > > > > > that's
> > > > > > > > > > > > outside
> > > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > scope of this KIP since we want to make it
> > > > > > zero-conf:ish
> > > > > > > > on
> > > > > > > > > > the
> > > > > > > > > > > > > > client
> > > > > > > > > > > > > > > > side.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Kind regards,
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Tom
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Thanks for the review,
> > > > > > > > > > > > > > > > > Magnus
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > On Thu, Oct 7, 2021 at 5:26 PM Magnus
> > > Edenhill
> > > > <
> > > > > > > > > > > > magnus@edenhill.se
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > Hi all,
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > I've updated the KIP following our recent
> > > > > > discussions
> > > > > > > > > on
> > > > > > > > > > > the
> > > > > > > > > > > > > > > mailing
> > > > > > > > > > > > > > > > > > list:
> > > > > > > > > > > > > > > > > > >  - split the protocol in two, one for
> > > getting
> > > > > the
> > > > > > > > > metrics
> > > > > > > > > > > > > > > > subscriptions,
> > > > > > > > > > > > > > > > > > > and one for pushing the metrics.
> > > > > > > > > > > > > > > > > > >  - simplifications: initially only one
> > > > > supported
> > > > > > > > > metrics
> > > > > > > > > > > > format,
> > > > > > > > > > > > > > no
> > > > > > > > > > > > > > > > > > > client.id in the instance id, etc.
> > > > > > > > > > > > > > > > > > >  - made CLIENT_METRICS subscription
> > > > > configuration
> > > > > > > > > entries
> > > > > > > > > > > > more
> > > > > > > > > > > > > > > > structured
> > > > > > > > > > > > > > > > > > >    and allowing better client matching
> > > > > selectors
> > > > > > (not
> > > > > > > > > > only
> > > > > > > > > > > > on the
> > > > > > > > > > > > > > > > > > instance
> > > > > > > > > > > > > > > > > > > id, but also the other
> > > > > > > > > > > > > > > > > > >    client resource labels, such as
> > > > > > > > > client_software_name,
> > > > > > > > > > > > etc.).
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > Unless there are further comments I'll
> > call
> > > > the
> > > > > > vote
> > > > > > > > > in a
> > > > > > > > > > > > day or
> > > > > > > > > > > > > > > two.
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > Regards,
> > > > > > > > > > > > > > > > > > > Magnus
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > Den mån 4 okt. 2021 kl 20:57 skrev Magnus
> > > > > > Edenhill <
> > > > > > > > > > > > > > > > magnus@edenhill.se>:
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > Hi Gwen,
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > I'm finishing up the KIP based on the
> > > last
> > > > > > couple
> > > > > > > > of
> > > > > > > > > > > > discussion
> > > > > > > > > > > > > > > > points
> > > > > > > > > > > > > > > > > > in
> > > > > > > > > > > > > > > > > > > > this thread
> > > > > > > > > > > > > > > > > > > > and will call the Vote later this week.
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > Best,
> > > > > > > > > > > > > > > > > > > > Magnus
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > Den lör 2 okt. 2021 kl 02:01 skrev Gwen
> > > > > Shapira
> > > > > > > > > > > > > > > > > > > <gwen@confluent.io.invalid
> > > > > > > > > > > > > > > > > > > > >:
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > >> Hey,
> > > > > > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > > > > > >> I noticed that there was no discussion
> > > for
> > > > > the
> > > > > > > > last
> > > > > > > > > 10
> > > > > > > > > > > > days,
> > > > > > > > > > > > > > > but I
> > > > > > > > > > > > > > > > > > > >> couldn't
> > > > > > > > > > > > > > > > > > > >> find the vote thread. Is there one
> > that
> > > > I'm
> > > > > > > > missing?
> > > > > > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > > > > > >> Gwen
> > > > > > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > > > > > >> On Wed, Sep 22, 2021 at 4:58 AM Magnus
> > > > > > Edenhill <
> > > > > > > > > > > > > > > > magnus@edenhill.se>
> > > > > > > > > > > > > > > > > > > >> wrote:
> > > > > > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > > > > > >> > Den tis 21 sep. 2021 kl 06:58 skrev
> > > > Colin
> > > > > > > > McCabe <
> > > > > > > > > > > > > > > > > > cmccabe@apache.org
> > > > > > > > > > > > > > > > > > > >:
> > > > > > > > > > > > > > > > > > > >> >
> > > > > > > > > > > > > > > > > > > >> > > On Mon, Sep 20, 2021, at 17:35,
> > Feng
> > > > Min
> > > > > > > > wrote:
> > > > > > > > > > > > > > > > > > > >> > > > Thanks Magnus & Colin for the
> > > > > > discussion.
> > > > > > > > > > > > > > > > > > > >> > > >
> > > > > > > > > > > > > > > > > > > >> > > > Based on KIP-714's stateless
> > > design,
> > > > > > Client
> > > > > > > > > can
> > > > > > > > > > > > pretty
> > > > > > > > > > > > > > > much
> > > > > > > > > > > > > > > > use
> > > > > > > > > > > > > > > > > > > any
> > > > > > > > > > > > > > > > > > > >> > > > connection to any broker to send
> > > > > > metrics. We
> > > > > > > > > are
> > > > > > > > > > > not
> > > > > > > > > > > > > > > > associating
> > > > > > > > > > > > > > > > > > > >> > > connection
> > > > > > > > > > > > > > > > > > > >> > > > with client metric state. Is my
> > > > > > > > understanding
> > > > > > > > > > > > correct?
> > > > > > > > > > > > > > If
> > > > > > > > > > > > > > > > yes,
> > > > > > > > > > > > > > > > > > > how
> > > > > > > > > > > > > > > > > > > >> > about
> > > > > > > > > > > > > > > > > > > >> > > > the following two scenarios
> > > > > > > > > > > > > > > > > > > >> > > >
> > > > > > > > > > > > > > > > > > > >> > > > 1) One Client (Client-ID)
> > > registers
> > > > > two
> > > > > > > > > > different
> > > > > > > > > > > > client
> > > > > > > > > > > > > > > > > > instance
> > > > > > > > > > > > > > > > > > > id
> > > > > > > > > > > > > > > > > > > >> > via
> > > > > > > > > > > > > > > > > > > >> > > > separate registration. Is it
> > > > > permitted?
> > > > > > If
> > > > > > > > OK,
> > > > > > > > > > how
> > > > > > > > > > > > to
> > > > > > > > > > > > > > > > > > distinguish
> > > > > > > > > > > > > > > > > > > >> them
> > > > > > > > > > > > > > > > > > > >> > > from
> > > > > > > > > > > > > > > > > > > >> > > > the case 2 below.
> > > > > > > > > > > > > > > > > > > >> > > >
> > > > > > > > > > > > > > > > > > > >> > >
> > > > > > > > > > > > > > > > > > > >> > > Hi Feng,
> > > > > > > > > > > > > > > > > > > >> > >
> > > > > > > > > > > > > > > > > > > >> > > My understanding, which Magnus can
> > > > > > clarify I
> > > > > > > > > > guess,
> > > > > > > > > > > is
> > > > > > > > > > > > > > that
> > > > > > > > > > > > > > > > you
> > > > > > > > > > > > > > > > > > > could
> > > > > > > > > > > > > > > > > > > >> > have
> > > > > > > > > > > > > > > > > > > >> > > something like two Producer
> > > instances
> > > > > > running
> > > > > > > > > with
> > > > > > > > > > > the
> > > > > > > > > > > > > > same
> > > > > > > > > > > > > > > > > > > client.id
> > > > > > > > > > > > > > > > > > > >> > > (perhaps because they're using the
> > > > same
> > > > > > config
> > > > > > > > > > file,
> > > > > > > > > > > > for
> > > > > > > > > > > > > > > > example).
> > > > > > > > > > > > > > > > > > > >> They
> > > > > > > > > > > > > > > > > > > >> > > could even be in the same process.
> > > But
> > > > > > they
> > > > > > > > > would
> > > > > > > > > > > get
> > > > > > > > > > > > > > > separate
> > > > > > > > > > > > > > > > > > > UUIDs.
> > > > > > > > > > > > > > > > > > > >> > >
> > > > > > > > > > > > > > > > > > > >> > > I believe Magnus used the term
> > > client
> > > > to
> > > > > > mean
> > > > > > > > > > > > "Producer or
> > > > > > > > > > > > > > > > > > > Consumer".
> > > > > > > > > > > > > > > > > > > >> So
> > > > > > > > > > > > > > > > > > > >> > > if you have both a Producer and a
> > > > > > Consumer in
> > > > > > > > > your
> > > > > > > > > > > > > > > > application I
> > > > > > > > > > > > > > > > > > > would
> > > > > > > > > > > > > > > > > > > >> > > expect you'd get separate UUIDs
> > for
> > > > > both.
> > > > > > > > Again
> > > > > > > > > > > > Magnus can
> > > > > > > > > > > > > > > > chime
> > > > > > > > > > > > > > > > > > in
> > > > > > > > > > > > > > > > > > > >> > here, I
> > > > > > > > > > > > > > > > > > > >> > > guess.
> > > > > > > > > > > > > > > > > > > >> > >
> > > > > > > > > > > > > > > > > > > >> >
> > > > > > > > > > > > > > > > > > > >> > That's correct.
> > > > > > > > > > > > > > > > > > > >> >
> > > > > > > > > > > > > > > > > > > >> >
> > > > > > > > > > > > > > > > > > > >> > >
> > > > > > > > > > > > > > > > > > > >> > > > 2) How about the client
> > > restarting?
> > > > > > What's
> > > > > > > > the
> > > > > > > > > > > > > > > expectation?
> > > > > > > > > > > > > > > > > > Should
> > > > > > > > > > > > > > > > > > > >> the
> > > > > > > > > > > > > > > > > > > >> > > > server expect the client to
> > carry
> > > a
> > > > > > > > persisted
> > > > > > > > > > > client
> > > > > > > > > > > > > > > > instance id
> > > > > > > > > > > > > > > > > > > or
> > > > > > > > > > > > > > > > > > > >> > > should
> > > > > > > > > > > > > > > > > > > >> > > > the client be treated as a new
> > > > > instance?
> > > > > > > > > > > > > > > > > > > >> > >
> > > > > > > > > > > > > > > > > > > >> > > The KIP doesn't describe any
> > > mechanism
> > > > > for
> > > > > > > > > > > > persistence,
> > > > > > > > > > > > > > so I
> > > > > > > > > > > > > > > > would
> > > > > > > > > > > > > > > > > > > >> assume
> > > > > > > > > > > > > > > > > > > >> > > that when you restart the client
> > you
> > > > get
> > > > > > a new
> > > > > > > > > > > UUID. I
> > > > > > > > > > > > > > agree
> > > > > > > > > > > > > > > > that
> > > > > > > > > > > > > > > > > > it
> > > > > > > > > > > > > > > > > > > >> > would
> > > > > > > > > > > > > > > > > > > >> > > be good to spell this out.
> > > > > > > > > > > > > > > > > > > >> > >
> > > > > > > > > > > > > > > > > > > >> > >
> > > > > > > > > > > > > > > > > > > >> > Right, it will not be persisted
> > since
> > > a
> > > > > > client
> > > > > > > > > > > instance
> > > > > > > > > > > > > > can't
> > > > > > > > > > > > > > > be
> > > > > > > > > > > > > > > > > > > >> restarted.
> > > > > > > > > > > > > > > > > > > >> >
> > > > > > > > > > > > > > > > > > > >> > Will update the KIP to make this
> > > > clearer.
> > > > > > > > > > > > > > > > > > > >> >
> > > > > > > > > > > > > > > > > > > >> > /Magnus
> > > > > > > > > > > > > > > > > > > >> >
> > > > > > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > > > > > >> --
> > > > > > > > > > > > > > > > > > > >> Gwen Shapira
> > > > > > > > > > > > > > > > > > > >> Engineering Manager | Confluent
> > > > > > > > > > > > > > > > > > > >> 650.450.2760 | @gwenshap
> > > > > > > > > > > > > > > > > > > >> Follow us: Twitter | blog
> > > > > > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] KIP-714: Client metrics and observability

Posted by Jun Rao <ju...@confluent.io.INVALID>.

Hi, Magnus, Sarat and Xavier,

Thanks for the reply. A few more comments below.

20. It seems that we are piggybacking the plugin on the
existing MetricsReporter. So, this seems fine.

21. That could work. Are we requiring any additional jar dependency on the
client? Or, are you suggesting that we check the runtime dependency to pick
the compression codec?
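
To make the second option concrete, here is a minimal sketch in Python of picking a codec by checking what is importable at runtime. The priority order (zstd, lz4, snappy, gzip) is the one suggested earlier in this thread; the module names and the `pick_codec` helper are illustrative assumptions only and are not part of the KIP.

```python
import importlib.util

# Priority order suggested in this thread. The module names are assumptions
# about which Python packages would provide each codec.
CODEC_MODULES = [
    ("zstd", "zstandard"),
    ("lz4", "lz4"),
    ("snappy", "snappy"),
    ("gzip", "gzip"),  # standard library, so normally always present
]


def module_available(module_name):
    """Return True if the named top-level module can be found at runtime."""
    return importlib.util.find_spec(module_name) is not None


def pick_codec(is_available=module_available):
    """Return the first codec in priority order whose module is available."""
    for codec, module_name in CODEC_MODULES:
        if is_available(module_name):
            return codec
    return None  # nothing available: the caller would send uncompressed
```

With this shape no extra dependency is required; a client simply uses the best codec that happens to be installed.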

28. For the broker metrics, could you spell out the full metric name
including groups, tags, etc? We typically don't add the broker_id label for
broker metrics. Also, brokers use Yammer metrics, which doesn't have type
Sum.

29. There are several client metrics listed as histograms. However, the Java
client currently doesn't support the histogram type.

30. Could you show an example of the metric payload in PushTelemetryRequest
to help understand how we organize metrics at different levels (per
instance, per topic, per partition, per broker, etc)?

31. Could you add a bit more detail on which client thread sends the
PushTelemetryRequest?

Thanks,

Jun

On Mon, Mar 7, 2022 at 11:48 AM Magnus Edenhill <ma...@edenhill.se> wrote:

> Hi Jun,
>
> Thanks for your well-informed questions; see my answers below.
> There have been a number of clarifications to the KIP.
>
>
>
> On Thu, 27 Jan 2022 at 20:08, Jun Rao <ju...@confluent.io.invalid> wrote:
>
> > Hi, Magnus,
> >
> > Thanks for updating the KIP. The overall approach makes sense to me. A
> > few more detailed comments below.
> >
> > 20. ClientTelemetry: Should it be extending configurable and closable?
> >
>
> I'll pass this question to Sarat and/or Xavier.
>
>
>
> > 21. Compression of the metrics on the client: what's the default?
> >
>
> How about we specify a prioritized list: zstd, lz4, snappy, gzip?
> But ultimately it is up to what the client supports.
>
>
> > 23. A client instance is considered a metric resource and the
> > resource-level (thus client instance level) labels could include:
> >     client_software_name=confluent-kafka-python
> >     client_software_version=v2.1.3
> >     client_instance_id=B64CD139-3975-440A-91D4
> >     transactional_id=someTxnApp
> > Are those labels added in PushTelemetryRequest? If so, are they per
> > metric or per request?
> >
>
>
> client_software* and client_instance_id are not added by the client, but
> are available to the broker-side metrics plugin to add as it sees fit.
> Removing them from the KIP.
>
> transactional_id, group_id, etc., which I believe will be useful in
> troubleshooting, are included only once (per push) as resource-level
> attributes (the client instance is a singular resource).
>
>
> >
> > 24.  "the broker will only send
> > GetTelemetrySubscriptionsResponse.DeltaTemporality=True" :
> > 24.1 If it's always true, does it need to be part of the protocol?
> >
>
> We're anticipating that it will take a lot longer to upgrade the majority
> of clients than the
> broker/plugin side, which is why we want the client to support both
> temporalities out-of-the-box
> so that cumulative reporting can be turned on seamlessly in the future.
>
>
>
> > 24.2 Does delta only apply to Counter type?
> >
>
>
> And Histograms. More details in Xavier's OTLP link.
>
>
>
> > 24.3 In the delta representation, the first request needs to send the
> > full value. How does the broker plugin know whether a value is full or
> > delta?
> >
>
> The client may (should) send the start time for each metric sample,
> indicating when
> the metric began to be collected.
> We've discussed whether this should be the client instance start time or
> the time when a matching
> metric subscription for that metric is received.
> For completeness we recommend using the former, the client instance start
> time.
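
As a rough illustration of the two temporalities, the Python sketch below reports a counter either as a delta since the previous push or as a cumulative total since the client instance started. It assumes the OTLP convention that each data point carries a start time and an end time; the `CounterMetric` class and its field names are hypothetical and do not come from the KIP.

```python
class CounterMetric:
    """Hypothetical client-side counter that can report either temporality."""

    def __init__(self, instance_start):
        self.instance_start = instance_start  # client instance start time
        self.total = 0                        # cumulative value since start
        self.last_total = 0                   # value at the previous push
        self.last_push = instance_start       # end time of the previous push

    def add(self, n=1):
        self.total += n

    def collect(self, now, delta):
        """Return a (start_time, end_time, value) data point."""
        if delta:
            # Delta: the change since the previous push. For the very first
            # push the start time equals the client instance start time.
            point = (self.last_push, now, self.total - self.last_total)
        else:
            # Cumulative: the running total since the instance started.
            point = (self.instance_start, now, self.total)
        self.last_total = self.total
        self.last_push = now
        return point
```

A receiver can then tell how much history a sample covers from its start time: in this sketch, a point whose start time equals the instance start covers everything collected so far.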
>
>
>
> > 25. quota:
> > 25.1 Since we are fitting PushTelemetryRequest into the existing request
> > quota, it would be useful to document the impact, i.e. client metric
> > throttling causes the data from the same client to be delayed.
> > 25.2 Is PushTelemetryRequest subject to the write bandwidth quota like
> > the producer?
> >
>
>
> Yes, it should be, so as to protect the cluster from rogue clients.
> But, in practice the size of metrics will be quite low (e.g., 1-10kb per
> 60s interval), so I don't think this will pose a problem.
> The KIP has been updated with more details on quota/throttling behaviour,
> see the
> "Throttling and rate-limiting" section.
>
>
> > 25.3 THROTTLING_QUOTA_EXCEEDED: Currently, we don't send this error when
> > the request/bandwidth quota is exceeded since those requests are not
> > rejected. We only set this error when the request is rejected (e.g.,
> > topic creation). It would be useful to clarify when this error is used.
> >
>
> Right, I was trying to reuse an existing error code. We can introduce
> a new one for the case where a client pushes metrics at a higher frequency
> than the configured push interval (e.g., out-of-profile sends).
> This causes the broker to drop those metrics and send this error code back
> to the client. There will be no connection throttling / channel-muting in
> this case (unless the standard quotas are exceeded).
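
The out-of-profile check described above could look roughly like the following Python sketch. It is only an assumption about how a broker might track push times per client instance; `PushIntervalGuard` and its method names are hypothetical, and the thread does not specify any implementation.

```python
class PushIntervalGuard:
    """Hypothetical broker-side check for pushes that arrive too early."""

    def __init__(self, push_interval_ms):
        self.push_interval_ms = push_interval_ms
        self.last_accepted_ms = {}  # client_instance_id -> last accepted push

    def accept(self, client_instance_id, now_ms):
        """Return True to keep the metrics, False to drop them.

        A False result corresponds to the case where the broker would drop
        the metrics and return an error code to the client.
        """
        last = self.last_accepted_ms.get(client_instance_id)
        if last is not None and now_ms - last < self.push_interval_ms:
            return False  # pushed before the configured interval elapsed
        self.last_accepted_ms[client_instance_id] = now_ms
        return True
```

Note that a rejected push does not reset the timer in this sketch, so a client that retries too eagerly is still accepted once the original interval has elapsed.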
>
>
> > 27. kafka-client-metrics.sh: Could we add an example on how to disable a
> > bad client?
> >
>
> There's now a --block option to kafka-client-metrics.sh which overrides all
> subscriptions
> for the matched client(s). This allows silencing metrics for one or more
> clients without having
> to remove existing subscriptions. From the client's perspective it will
> look like it no longer has
> any subscriptions.
>
> # Block metrics collection for a specific client instance.
> # A descriptive --name makes it easier to clean up old subscriptions;
> # --match selects this specific client instance.
> $ kafka-client-metrics.sh --bootstrap-server $BROKERS \
>    --add \
>    --name 'Disable_b69cc35a' \
>    --match client_instance_id=b69cc35a-7a54-4790-aa69-cc2bd4ee4538 \
>    --block
>
>
>
>
> > 28. New broker side metrics: Could we spell out the details of the
> > metrics (e.g., group, tags, etc)?
> >
>
> KIP has been updated accordingly (thanks Sarat).
>
>
>
> >
> > 29. Client instance-level metrics: client.io.wait.time is a gauge not a
> > histogram.
> >
>
> I believe a population/distribution should preferably be represented as a
> histogram, space permitting,
> and only secondarily as a Gauge average.
> While we might not want to maintain a bunch of histograms for each
> partition, since that could be
> quite space consuming, this client.io.wait.time is a single metric per
> client instance and can
> thus afford a Histogram representation.
>
>
>
> Thanks,
> Magnus
>
>
>
> > Thanks,
> >
> > Jun
> >
> > On Wed, Jan 26, 2022 at 6:32 AM Magnus Edenhill <ma...@edenhill.se>
> > wrote:
> >
> > > Hi all,
> > >
> > > I've updated the KIP with responses to the latest comments: Java client
> > > dependencies (Thanks Kirk!), alternate designs (separate cluster,
> > > separate producer, etc), etc.
> > >
> > > I will revive the vote thread.
> > >
> > > Thanks,
> > > Magnus
> > >
> > >
> > > On Mon, 13 Dec 2021 at 22:32, Ryanne Dolan <ryannedolan@gmail.com> wrote:
> > >
> > > > I think we should be very careful about introducing new runtime
> > > > dependencies into the clients. Historically this has been rare and
> > > > essentially necessary (e.g. compression libs).
> > > >
> > > > Ryanne
> > > >
> > > > On Mon, Dec 13, 2021, 1:06 PM Kirk True <ki...@mustardgrain.com>
> wrote:
> > > >
> > > > > Hi Jun,
> > > > >
> > > > > On Thu, Dec 9, 2021, at 2:28 PM, Jun Rao wrote:
> > > > > > 13. Using OpenTelemetry. Does that require runtime dependency
> > > > > > on OpenTelemetry library? How good is the compatibility story
> > > > > > of OpenTelemetry? This is important since an application could
> have
> > > > other
> > > > > > OpenTelemetry dependencies than the Kafka client.
> > > > >
> > > > > The current design is that the OpenTelemetry JARs would ship with
> the
> > > > > client. Perhaps we can design the client such that the JARs aren't
> > even
> > > > > loaded if the user has opted out. The user could even exclude the
> > JARs
> > > > from
> > > > > their dependencies if they so wished.
> > > > >
> > > > > I can't speak to the compatibility of the libraries. Is it possible
> > > that
> > > > > we include a shaded version?
> > > > >
> > > > > Thanks,
> > > > > Kirk
> > > > >
> > > > > >
> > > > > > 14. The proposal listed idempotence=true. This is more of a
> > > > configuration
> > > > > > than a metric. Are we including that as a metric? What other
> > > > > configurations
> > > > > > are we including? Should we separate the configurations from the
> > > > metrics?
> > > > > >
> > > > > > Thanks,
> > > > > >
> > > > > > Jun
> > > > > >
> > > > > > On Mon, Nov 29, 2021 at 7:34 AM Magnus Edenhill <
> > magnus@edenhill.se>
> > > > > wrote:
> > > > > >
> > > > > > > Hey Bob,
> > > > > > >
> > > > > > > That's a good point.
> > > > > > >
> > > > > > > Request type labels were considered, but since they're already
> > > > > > > tracked by broker-side metrics they were left out to avoid
> > > > > > > metric duplication. However, those metrics are not per
> > > > > > > connection, so they won't be that useful in practice for
> > > > > > > troubleshooting specific client instances.
> > > > > > >
> > > > > > > I'll add the request_type label to the relevant metrics.
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Magnus
> > > > > > >
> > > > > > >
> > > > > > > On Tue, 23 Nov 2021 at 19:20, Bob Barrett
> > > > > > > <bo...@confluent.io.invalid> wrote:
> > > > > > >
> > > > > > > > Hi Magnus,
> > > > > > > >
> > > > > > > > Thanks for the thorough KIP, this seems very useful.
> > > > > > > >
> > > > > > > > Would it make sense to include the request type as a label
> for
> > > the
> > > > > > > > `client.request.success`, `client.request.errors` and
> > > > > > > `client.request.rtt`
> > > > > > > > metrics? I think it would be very useful to see which
> specific
> > > > > requests
> > > > > > > are
> > > > > > > > succeeding and failing for a client. One specific case I can
> > > think
> > > > of
> > > > > > > where
> > > > > > > > this could be useful is producer batch timeouts. If a Java
> > > > > application
> > > > > > > does
> > > > > > > > not enable producer client logs (unfortunately, in my
> > experience
> > > > this
> > > > > > > > happens more often than it should), the application logs will
> > > only
> > > > > > > contain
> > > > > > > > the expiration error message, but no information about what
> is
> > > > > causing
> > > > > > > the
> > > > > > > > timeout. The requests might all be succeeding but taking too
> > long
> > > > to
> > > > > > > > process batches, or metadata requests might be failing, or
> some
> > > or
> > > > > all
> > > > > > > > produce requests might be failing (if the bootstrap servers
> are
> > > > > reachable
> > > > > > > > from the client but one or more other brokers are not, for
> > > > example).
> > > > > If
> > > > > > > the
> > > > > > > > cluster operator is able to identify the specific requests
> that
> > > are
> > > > > slow
> > > > > > > or
> > > > > > > > failing for a client, they will be better able to diagnose
> the
> > > > issue
> > > > > > > > causing batch timeouts.
> > > > > > > >
> > > > > > > > One drawback I can think of is that this will increase the
> > > > > cardinality of
> > > > > > > > the request metrics. But any given client is only going to
> use
> > a
> > > > > small
> > > > > > > > subset of the request types, and since we already have
> > partition
> > > > > labels
> > > > > > > for
> > > > > > > > the topic-level metrics, I think request labels will still
> make
> > > up
> > > > a
> > > > > > > > relatively small percentage of the set of metrics.
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Bob
> > > > > > > >
> > > > > > > > On Mon, Nov 22, 2021 at 2:08 AM Viktor Somogyi-Vass <
> > > > > > > > viktorsomogyi@gmail.com>
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Hi Magnus,
> > > > > > > > >
> > > > > > > > > I think this is a very useful addition. We also have a similar
> > > > > > > > > (but much more simplistic) implementation of this. Maybe I
> > > > > > > > > missed it in the KIP but what about adding metrics about the
> > > > > > > > > subscription cache itself? That I think would improve its
> > > > > > > > > usability and debuggability as we'd be able to see its
> > > > > > > > > performance, hit/miss rates, eviction counts and others.
> > > > > > > > >
> > > > > > > > > Best,
> > > > > > > > > Viktor
> > > > > > > > >
> > > > > > > > > On Thu, Nov 18, 2021 at 5:12 PM Magnus Edenhill <
> > > > > magnus@edenhill.se>
> > > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Hi Mickael,
> > > > > > > > > >
> > > > > > > > > > see inline.
> > > > > > > > > >
> > > > > > > > > > On Wed, 10 Nov 2021 at 15:21, Mickael Maison
> > > > > > > > > > <mickael.maison@gmail.com> wrote:
> > > > > > > > > >
> > > > > > > > > > > Hi Magnus,
> > > > > > > > > > >
> > > > > > > > > > > I see you've addressed some of the points I raised
> above
> > > but
> > > > > some
> > > > > > > (4,
> > > > > > > > > > > 5) have not been addressed yet.
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Re 4) How will the user/app know metrics are being sent.
> > > > > > > > > >
> > > > > > > > > > One possibility is to add a JMX metric (thus for user
> > > > > > > > > > consumption) for the number of metric pushes the client has
> > > > > > > > > > performed, or perhaps the number of metrics subscriptions
> > > > > > > > > > currently being collected.
> > > > > > > > > > Would that be sufficient?
> > > > > > > > > >
> > > > > > > > > > Re 5) Metric sizes and rates
> > > > > > > > > >
> > > > > > > > > > A worst case scenario for a producer that is producing to 50
> > > > > > > > > > unique topics and emitting all standard metrics yields a
> > > > > > > > > > serialized size of around 100KB prior to compression, which
> > > > > > > > > > compresses down to about 20-30% of that depending on
> > > > > > > > > > compression type and topic name uniqueness.
> > > > > > > > > > The numbers for a consumer would be similar.
> > > > > > > > > >
> > > > > > > > > > In practice the number of unique topics would be far less, and
> > > > > > > > > > the subscription set would typically be for a subset of
> > > > > > > > > > metrics.
> > > > > > > > > > So we're probably closer to 1kb, or less, compressed size per
> > > > > > > > > > client per push interval.
> > > > > > > > > >
> > > > > > > > > > As both the subscription set and push intervals are controlled
> > > > > > > > > > by the cluster operator it shouldn't be too hard to strike a
> > > > > > > > > > good balance between metrics overhead and granularity.
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > I'm really uneasy with this being enabled by default on
> > the
> > > > > client
> > > > > > > > > > > side. When collecting data, I think the best practice
> is
> > to
> > > > > ensure
> > > > > > > > > > > users are explicitly enabling it.
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Requiring metrics to be explicitly enabled on clients severely
> > > > > > > > > > cripples the usability and value of this feature.
> > > > > > > > > >
> > > > > > > > > > One of the problems that this KIP aims to solve is for useful
> > > > > > > > > > metrics to be available on demand regardless of the technical
> > > > > > > > > > expertise of the user. As Ryanne points out, a savvy
> > > > > > > > > > user/organization will typically have metrics collection and
> > > > > > > > > > monitoring in place already, and the benefits of this KIP are
> > > > > > > > > > then more of a common set and format of metrics across client
> > > > > > > > > > implementations and languages.
> > > > > > > > > > But that is not the typical Kafka user in my experience;
> > > > > > > > > > they're not Kafka experts and they don't have the knowledge of
> > > > > > > > > > how to best instrument their clients.
> > > > > > > > > > Having metrics enabled by default for this user base allows
> > > > > > > > > > the Kafka operators to proactively and reactively monitor and
> > > > > > > > > > troubleshoot client issues, without the need for the less
> > > > > > > > > > savvy user to do anything.
> > > > > > > > > > It is often too late to tell a user to enable metrics when the
> > > > > > > > > > problem has already occurred.
> > > > > > > > > >
> > > > > > > > > > Now, to be clear, even though metrics are enabled by
> > default
> > > on
> > > > > > > clients
> > > > > > > > > they
> > > > > > > > > > are not enabled by default
> > > > > > > > > > on the brokers; the Kafka operator needs to build and set
> > up
> > > a
> > > > > > > metrics
> > > > > > > > > > plugin and add metrics subscriptions
> > > > > > > > > > before anything is sent from the client.
> > > > > > > > > > It is opt-out on the clients and opt-in on the broker.
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > You mentioned brokers already have
> > > > > > > > > > > some(most?) of the information contained in metrics, if
> > so
> > > > > then why
> > > > > > > > > > > are we collecting it again? Surely there must be some
> new
> > > > > > > information
> > > > > > > > > > > in the client metrics.
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > From the user's perspective the Kafka infrastructure
> > extends
> > > > from
> > > > > > > > > > producer.send() to
> > > > > > > > > > messages being returned from consumer.poll(), a giant
> black
> > > box
> > > > > where
> > > > > > > > > > there's a lot going on between those
> > > > > > > > > > two points. The brokers currently only see what happens
> > once
> > > > > those
> > > > > > > > > requests
> > > > > > > > > > and messages hit the broker,
> > > > > > > > > > but as Kafka clients are complex pieces of machinery
> > there's
> > > a
> > > > > myriad
> > > > > > > > of
> > > > > > > > > > queues, timers, and state
> > > > > > > > > > that's critical to the operation and infrastructure
> that's
> > > not
> > > > > > > > currently
> > > > > > > > > > visible to the operator.
> > > > > > > > > > Relying on the user to accurately and promptly provide this
> > > > missing
> > > > > > > > > > information is not generally feasible.
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Most of the standard metrics listed in the KIP are data
> > > points
> > > > > that
> > > > > > > the
> > > > > > > > > > broker does not have.
> > > > > > > > > > Only a small number of metrics are duplicates (like the
> > > request
> > > > > > > counts
> > > > > > > > > and
> > > > > > > > > > sizes), but they are included
> > > > > > > > > > to ease correlation when inspecting these client metrics.
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > Moreover this is a brand new feature so it's even
> harder
> > to
> > > > > justify
> > > > > > > > > > > enabling it and forcing it onto all our users. If disabled
> > by
> > > > > default,
> > > > > > > > > > > it's relatively easy to enable in a new release if we
> > > decide
> > > > > to,
> > > > > > > but
> > > > > > > > > > > once enabled by default it's much harder to disable.
> Also
> > > > this
> > > > > > > > feature
> > > > > > > > > > > will apply to all future metrics we will add.
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > I think maturity of a feature implementation should be
> the
> > > > > deciding
> > > > > > > > > factor,
> > > > > > > > > > rather than
> > > > > > > > > > the design of it (which this KIP is). I.e., if the
> > > > > implementation is
> > > > > > > > not
> > > > > > > > > > deemed mature enough
> > > > > > > > > > for release X.Y it will be disabled.
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > Overall I think it's an interesting feature but I'd
> > prefer
> > > to
> > > > > be
> > > > > > > > > > > slightly defensive and see how it works in practice
> > before
> > > > > enabling
> > > > > > > > it
> > > > > > > > > > > everywhere.
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Right, and I agree on being defensive, but since this
> > feature
> > > > > still
> > > > > > > > > > requires manual
> > > > > > > > > > enabling on the brokers before actually being used, I
> think
> > > > that
> > > > > > > gives
> > > > > > > > > > enough control
> > > > > > > > > > to opt-in or out of this feature as needed.
> > > > > > > > > >
> > > > > > > > > > Thanks for your comments!
> > > > > > > > > >
> > > > > > > > > > Regards,
> > > > > > > > > > Magnus
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > Thanks,
> > > > > > > > > > > Mickael
> > > > > > > > > > >
> > > > > > > > > > > On Mon, Nov 8, 2021 at 7:55 AM Magnus Edenhill <
> > > > > magnus@edenhill.se
> > > > > > > >
> > > > > > > > > > wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > Thanks David for pointing this out,
> > > > > > > > > > > > I've updated the KIP to include client_id as a
> matching
> > > > > selector.
> > > > > > > > > > > >
> > > > > > > > > > > > Regards,
> > > > > > > > > > > > Magnus
> > > > > > > > > > > >
> > > > > > > > > > > > > On Thu, Nov 4, 2021 at 18:01, David Mao
> > > > > > > > > > > <dmao@confluent.io.invalid
> > > > > > > > > > > > > wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > Hey Magnus,
> > > > > > > > > > > > >
> > > > > > > > > > > > > I noticed that the KIP outlines the initial
> selectors
> > > > > supported
> > > > > > > > as:
> > > > > > > > > > > > >
> > > > > > > > > > > > >    - client_instance_id - CLIENT_INSTANCE_ID UUID
> > > string
> > > > > > > > > > > representation.
> > > > > > > > > > > > >    - client_software_name  - client software
> > > > implementation
> > > > > > > name.
> > > > > > > > > > > > >    - client_software_version  - client software
> > > > > implementation
> > > > > > > > > > version.
> > > > > > > > > > > > >
> > > > > > > > > > > > > In the given reactive monitoring workflow, we
> mention
> > > > that
> > > > > the
> > > > > > > > > > > application
> > > > > > > > > > > > > user does not know their client's client instance
> ID,
> > > but
> > > > > it's
> > > > > > > > > > outlined
> > > > > > > > > > > > > that the operator can add a metrics subscription
> > > > selecting
> > > > > for
> > > > > > > > > > > clientId. I
> > > > > > > > > > > > > don't see clientId as one of the supported
> selectors.
> > > > > > > > > > > > > I can see how this would have made sense in a
> > previous
> > > > > > > iteration
> > > > > > > > > > given
> > > > > > > > > > > that
> > > > > > > > > > > > > the previous client instance ID proposal was to
> > > construct
> > > > > the
> > > > > > > > > client
> > > > > > > > > > > > > instance ID using clientId as a prefix. Now that
> the
> > > > client
> > > > > > > > > instance
> > > > > > > > > > > ID is
> > > > > > > > > > > > > a UUID, would we want to add clientId as a
> supported
> > > > > selector?
> > > > > > > > > > > > > Let me know what you think.
> > > > > > > > > > > > >
> > > > > > > > > > > > > David
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Tue, Oct 19, 2021 at 12:33 PM Magnus Edenhill <
> > > > > > > > > magnus@edenhill.se
> > > > > > > > > > >
> > > > > > > > > > > > > wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > Hi Mickael!
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > On Tue, Oct 19, 2021 at 19:30, Mickael
> Maison
> > <
> > > > > > > > > > > > > > mickael.maison@gmail.com
> > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Hi Magnus,
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Thanks for the proposal.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > 1. Looking at the protocol section, isn't
> > > > > > > "ClientInstanceId"
> > > > > > > > > > > expected
> > > > > > > > > > > > > > > to be a field in
> > > GetTelemetrySubscriptionsResponseV0?
> > > > > > > > > Otherwise,
> > > > > > > > > > > how
> > > > > > > > > > > > > > > does a client retrieve this value?
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Good catch, it got removed by mistake in one of
> the
> > > > > edits.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > 2. In the client API section, you mention a new
> > > > method
> > > > > > > > > > > > > > > "clientInstanceId()". Can you clarify which
> > > > interfaces
> > > > > are
> > > > > > > > > > > affected?
> > > > > > > > > > > > > > > Is it only Consumer and Producer?
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > And Admin. Will update the KIP.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > 3. I'm a bit concerned this is enabled by
> > default.
> > > > > Even if
> > > > > > > > the
> > > > > > > > > > data
> > > > > > > > > > > > > > > collected is supposed to be not sensitive, I
> > think
> > > > > this can
> > > > > > > > be
> > > > > > > > > > > > > > > problematic in some environments. Also users
> > don't
> > > > > seem to
> > > > > > > > have
> > > > > > > > > > the
> > > > > > > > > > > > > > > choice to only expose some metrics. Knowing how
> > > much
> > > > > data
> > > > > > > > > transits
> > > > > > > > > > > > > > > through some applications can be considered
> > > critical.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > The broker already knows how much data transits
> > > through
> > > > > the
> > > > > > > > > client
> > > > > > > > > > > > > though,
> > > > > > > > > > > > > > right?
> > > > > > > > > > > > > > Care has been taken not to expose information in
> > the
> > > > > standard
> > > > > > > > > > metrics
> > > > > > > > > > > > > that
> > > > > > > > > > > > > > might
> > > > > > > > > > > > > > reveal sensitive information.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Do you have an example of how the proposed
> metrics
> > > > could
> > > > > leak
> > > > > > > > > > > sensitive
> > > > > > > > > > > > > > information?
> > > > > > > > > > > > > > As for limiting what metrics to export, I
> guess
> > > > that
> > > > > > > could
> > > > > > > > > make
> > > > > > > > > > > sense
> > > > > > > > > > > > > > in some
> > > > > > > > > > > > > > very sensitive use-cases, but those users might
> > > disable
> > > > > > > metrics
> > > > > > > > > > > > > altogether
> > > > > > > > > > > > > > for now.
> > > > > > > > > > > > > > Could these concerns be addressed by a later KIP?
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > 4. As a user, how do you know if your
> application
> > > is
> > > > > > > actively
> > > > > > > > > > > sending
> > > > > > > > > > > > > > > metrics? Are there new metrics exposing what's
> > > going
> > > > > on,
> > > > > > > like
> > > > > > > > > how
> > > > > > > > > > > much
> > > > > > > > > > > > > > > data is being sent?
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > That's a good question.
> > > > > > > > > > > > > > Since the proposed metrics interface is not aimed
> > at,
> > > > or
> > > > > > > > directly
> > > > > > > > > > > > > available
> > > > > > > > > > > > > > to, the application,
> > > > > > > > > > > > > > I guess there's little point in adding it here,
> but
> > > > > instead
> > > > > > > > > adding
> > > > > > > > > > > > > > something to the
> > > > > > > > > > > > > > existing JMX metrics?
> > > > > > > > > > > > > > Do you have any suggestions?
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > 5. If all metrics are enabled on a regular
> > Consumer
> > > > or
> > > > > > > > > Producer,
> > > > > > > > > > do
> > > > > > > > > > > > > > > you have an idea how much throughput this would
> > > use?
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > It depends on the number of partitions/topics/etc.
> > the
> > > > > client
> > > > > > > is
> > > > > > > > > > > producing
> > > > > > > > > > > > > > to/consuming from.
> > > > > > > > > > > > > > I'll add some sizes to the KIP for some typical
> > > > > use-cases.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > > Magnus
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Thanks
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > On Tue, Oct 19, 2021 at 5:06 PM Magnus
> Edenhill <
> > > > > > > > > > > magnus@edenhill.se>
> > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > On Tue, Oct 19, 2021 at 13:22, Tom
> > Bentley <
> > > > > > > > > > > > tbentley@redhat.com
> > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Hi Magnus,
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > I reviewed the KIP since you called the
> vote
> > > > > (sorry for
> > > > > > > > not
> > > > > > > > > > > > > reviewing
> > > > > > > > > > > > > > > when
> > > > > > > > > > > > > > > > > you announced your intention to call the
> > > vote). I
> > > > > have
> > > > > > > a
> > > > > > > > > few
> > > > > > > > > > > > > > questions
> > > > > > > > > > > > > > > on
> > > > > > > > > > > > > > > > > some of the details.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > 1. There's no Javadoc on
> > > > > ClientTelemetryPayload.data(),
> > > > > > > > so
> > > > > > > > > I
> > > > > > > > > > > don't
> > > > > > > > > > > > > > know
> > > > > > > > > > > > > > > > > whether the payload is exposed through this
> > > > method
> > > > > as
> > > > > > > > > > > compressed or
> > > > > > > > > > > > > > > not.
> > > > > > > > > > > > > > > > > Later on you say "Decompression of the
> > payloads
> > > > > will be
> > > > > > > > > > > handled by
> > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > broker metrics plugin, the broker should
> > > expose a
> > > > > > > > suitable
> > > > > > > > > > > > > > > decompression
> > > > > > > > > > > > > > > > > API to the metrics plugin for this
> purpose.",
> > > > which
> > > > > > > > > suggests
> > > > > > > > > > > it's
> > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > compressed data in the buffer, but then we
> > > don't
> > > > > know
> > > > > > > > which
> > > > > > > > > > > codec
> > > > > > > > > > > > > was
> > > > > > > > > > > > > > > used,
> > > > > > > > > > > > > > > > > nor the API via which the plugin should
> > > > decompress
> > > > > it
> > > > > > > if
> > > > > > > > > > > required
> > > > > > > > > > > > > for
> > > > > > > > > > > > > > > > > forwarding to the ultimate metrics store.
> > > Should
> > > > > the
> > > > > > > > > > > > > > > ClientTelemetryPayload
> > > > > > > > > > > > > > > > > expose a method to get the compression and
> a
> > > > > > > > decompressor?
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Good point, updated.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > 2. The client-side API is expressed as
> > > > > StringOrError
> > > > > > > > > > > > > > > > > ClientInstance::ClientInstanceId(int
> > > > timeout_ms). I
> > > > > > > > > > understand
> > > > > > > > > > > that
> > > > > > > > > > > > > > > you're
> > > > > > > > > > > > > > > > > thinking about the librdkafka
> implementation,
> > > but
> > > > > it
> > > > > > > > would
> > > > > > > > > be
> > > > > > > > > > > good
> > > > > > > > > > > > > to
> > > > > > > > > > > > > > > show
> > > > > > > > > > > > > > > > > the API as it would appear on the Apache
> > Kafka
> > > > > clients.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > This was meant as pseudo-code, but I changed
> it
> > > to
> > > > > Java.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > 3. "PushTelemetryRequest|Response -
> protocol
> > > > > request
> > > > > > > used
> > > > > > > > > by
> > > > > > > > > > > the
> > > > > > > > > > > > > > > client to
> > > > > > > > > > > > > > > > > send metrics to any broker it is connected
> > to."
> > > > To
> > > > > be
> > > > > > > > > clear,
> > > > > > > > > > > this
> > > > > > > > > > > > > > means
> > > > > > > > > > > > > > > > > that the client can choose any of the
> > connected
> > > > > brokers
> > > > > > > > and
> > > > > > > > > > > push to
> > > > > > > > > > > > > > > just
> > > > > > > > > > > > > > > > > one of them? What should a supporting
> client
> > do
> > > > if
> > > > > it
> > > > > > > > gets
> > > > > > > > > an
> > > > > > > > > > > error
> > > > > > > > > > > > > > > when
> > > > > > > > > > > > > > > > > pushing metrics to a broker, retry sending
> to
> > > the
> > > > > same
> > > > > > > > > broker
> > > > > > > > > > > or
> > > > > > > > > > > > > try
> > > > > > > > > > > > > > > > > pushing to another broker, or drop the
> > metrics?
> > > > > Should
> > > > > > > > > > > supporting
> > > > > > > > > > > > > > > clients
> > > > > > > > > > > > > > > > > send successive requests to a single
> broker,
> > or
> > > > > round
> > > > > > > > > robin,
> > > > > > > > > > > or is
> > > > > > > > > > > > > > > that up
> > > > > > > > > > > > > > > > > to the client author? I'm guessing the
> > > behaviour
> > > > > should
> > > > > > > > be
> > > > > > > > > > > sticky
> > > > > > > > > > > > > to
> > > > > > > > > > > > > > > > > support the rate limiting features, but I
> > think
> > > > it
> > > > > > > would
> > > > > > > > be
> > > > > > > > > > > good
> > > > > > > > > > > > > for
> > > > > > > > > > > > > > > client
> > > > > > > > > > > > > > > > > authors if this section were explicit on
> the
> > > > > > > recommended
> > > > > > > > > > > behaviour.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > You are right, I've updated the KIP to make
> > this
> > > > > clearer.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > 4. "Mapping the client instance id to an
> > actual
> > > > > > > > application
> > > > > > > > > > > > > instance
> > > > > > > > > > > > > > > > > running on a (virtual) machine can be done
> by
> > > > > > > inspecting
> > > > > > > > > the
> > > > > > > > > > > > > metrics
> > > > > > > > > > > > > > > > > resource labels, such as the client source
> > > > address
> > > > > and
> > > > > > > > > source
> > > > > > > > > > > port,
> > > > > > > > > > > > > > or
> > > > > > > > > > > > > > > > > security principal, all of which are added
> by
> > > the
> > > > > > > > receiving
> > > > > > > > > > > broker.
> > > > > > > > > > > > > > > This
> > > > > > > > > > > > > > > > > will allow the operator together with the
> > user
> > > to
> > > > > > > > identify
> > > > > > > > > > the
> > > > > > > > > > > > > actual
> > > > > > > > > > > > > > > > > application instance." Is this really
> always
> > > > true?
> > > > > The
> > > > > > > > > source
> > > > > > > > > > > IP
> > > > > > > > > > > > > and
> > > > > > > > > > > > > > > port
> > > > > > > > > > > > > > > > > might be a loadbalancer/proxy in some
> setups.
> > > The
> > > > > > > > > principal,
> > > > > > > > > > as
> > > > > > > > > > > > > > already
> > > > > > > > > > > > > > > > > mentioned in the KIP, might be shared
> between
> > > > > multiple
> > > > > > > > > > > > > applications.
> > > > > > > > > > > > > > > So at
> > > > > > > > > > > > > > > > > worst the organization running the clients
> > > might
> > > > > have
> > > > > > > to
> > > > > > > > > > > consult
> > > > > > > > > > > > > the
> > > > > > > > > > > > > > > logs
> > > > > > > > > > > > > > > > > of a set of client applications, right?
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Yes, that's correct. There's no guaranteed
> > > mapping
> > > > > from
> > > > > > > > > > > > > > > client_instance_id
> > > > > > > > > > > > > > > > to
> > > > > > > > > > > > > > > > an actual instance, that's why the KIP
> > recommends
> > > > > client
> > > > > > > > > > > > > > implementations
> > > > > > > > > > > > > > > to
> > > > > > > > > > > > > > > > log the client instance id
> > > > > > > > > > > > > > > > upon retrieval, and also provide an API for
> the
> > > > > > > application
> > > > > > > > > to
> > > > > > > > > > > > > retrieve
> > > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > instance id programmatically
> > > > > > > > > > > > > > > > if it has a better way of exposing it.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > 5. "Tests indicate that a compression ratio
> up
> > to
> > > > > 10x is
> > > > > > > > > > > possible for
> > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > standard metrics." Client authors might
> > > > appreciate
> > > > > your
> > > > > > > > > > > mentioning
> > > > > > > > > > > > > > > which
> > > > > > > > > > > > > > > > > compression codec got these results.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Good point. Updated.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > 6. "Should the client send a push request
> > prior
> > > > to
> > > > > > > expiry
> > > > > > > > > of
> > > > > > > > > > > the
> > > > > > > > > > > > > > > previously
> > > > > > > > > > > > > > > > > calculated PushIntervalMs the broker will
> > > discard
> > > > > the
> > > > > > > > > metrics
> > > > > > > > > > > and
> > > > > > > > > > > > > > > return a
> > > > > > > > > > > > > > > > > PushTelemetryResponse with the ErrorCode
> set
> > to
> > > > > > > > > RateLimited."
> > > > > > > > > > > Is
> > > > > > > > > > > > > this
> > > > > > > > > > > > > > > > > RATE_LIMITED a new error code? It's not
> > > mentioned
> > > > > in
> > > > > > > the
> > > > > > > > > "New
> > > > > > > > > > > Error
> > > > > > > > > > > > > > > Codes"
> > > > > > > > > > > > > > > > > section.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > That's a leftover, it should be using the
> > > standard
> > > > > > > > > ThrottleTime
> > > > > > > > > > > > > > > mechanism.
> > > > > > > > > > > > > > > > Fixed.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > 7. In the section "Standard client resource
> > > > labels"
> > > > > > > > > > > application_id
> > > > > > > > > > > > > is
> > > > > > > > > > > > > > > > > described as Kafka Streams only, but the
> > > section
> > > > of
> > > > > > > > "Client
> > > > > > > > > > > > > > > Identification"
> > > > > > > > > > > > > > > > > talks about "application instance id as an
> > > > optional
> > > > > > > > future
> > > > > > > > > > > > > > nice-to-have
> > > > > > > > > > > > > > > > > that may be included as a metrics label if
> it
> > > has
> > > > > been
> > > > > > > > set
> > > > > > > > > by
> > > > > > > > > > > the
> > > > > > > > > > > > > > > user", so
> > > > > > > > > > > > > > > > > I'm confused whether non-Kafka Streams
> > clients
> > > > > should
> > > > > > > set
> > > > > > > > > an
> > > > > > > > > > > > > > > application_id
> > > > > > > > > > > > > > > > > or not.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > I'll clarify this in the KIP, but basically
> we
> > > > would
> > > > > need
> > > > > > > > to
> > > > > > > > > > add
> > > > > > > > > > > an `
> > > > > > > > > > > > > > > > application.id` config
> > > > > > > > > > > > > > > > property for non-streams clients for this
> > > purpose,
> > > > > and
> > > > > > > > that's
> > > > > > > > > > > outside
> > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > scope of this KIP since we want to make it
> > > > > zero-conf:ish
> > > > > > > on
> > > > > > > > > the
> > > > > > > > > > > > > client
> > > > > > > > > > > > > > > side.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Kind regards,
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Tom
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Thanks for the review,
> > > > > > > > > > > > > > > > Magnus
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > On Thu, Oct 7, 2021 at 5:26 PM Magnus
> > Edenhill
> > > <
> > > > > > > > > > > magnus@edenhill.se
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Hi all,
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > I've updated the KIP following our recent
> > > > > discussions
> > > > > > > > on
> > > > > > > > > > the
> > > > > > > > > > > > > > mailing
> > > > > > > > > > > > > > > > > list:
> > > > > > > > > > > > > > > > > >  - split the protocol in two, one for
> > getting
> > > > the
> > > > > > > > metrics
> > > > > > > > > > > > > > > subscriptions,
> > > > > > > > > > > > > > > > > > and one for pushing the metrics.
> > > > > > > > > > > > > > > > > >  - simplifications: initially only one
> > > > supported
> > > > > > > > metrics
> > > > > > > > > > > format,
> > > > > > > > > > > > > no
> > > > > > > > > > > > > > > > > > client.id in the instance id, etc.
> > > > > > > > > > > > > > > > > >  - made CLIENT_METRICS subscription
> > > > configuration
> > > > > > > > entries
> > > > > > > > > > > more
> > > > > > > > > > > > > > > structured
> > > > > > > > > > > > > > > > > >    and allowing better client matching
> > > > selectors
> > > > > (not
> > > > > > > > > only
> > > > > > > > > > > on the
> > > > > > > > > > > > > > > > > instance
> > > > > > > > > > > > > > > > > > id, but also the other
> > > > > > > > > > > > > > > > > >    client resource labels, such as
> > > > > > > > client_software_name,
> > > > > > > > > > > etc.).
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Unless there are further comments I'll
> call
> > > the
> > > > > vote
> > > > > > > > in a
> > > > > > > > > > > day or
> > > > > > > > > > > > > > two.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Regards,
> > > > > > > > > > > > > > > > > > Magnus
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > On Mon, Oct 4, 2021 at 20:57, Magnus
> > > > > Edenhill <
> > > > > > > > > > > > > > > magnus@edenhill.se> wrote:
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > Hi Gwen,
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > I'm finishing up the KIP based on the
> > last
> > > > > couple
> > > > > > > of
> > > > > > > > > > > discussion
> > > > > > > > > > > > > > > points
> > > > > > > > > > > > > > > > > in
> > > > > > > > > > > > > > > > > > > this thread
> > > > > > > > > > > > > > > > > > > and will call the Vote later this week.
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > Best,
> > > > > > > > > > > > > > > > > > > Magnus
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > On Sat, Oct 2, 2021 at 02:01, Gwen
> > > > Shapira
> > > > > > > > > > > > > > > > > > > <gwen@confluent.io.invalid
> > > > > > > > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > >> Hey,
> > > > > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > > > > >> I noticed that there was no discussion
> > for
> > > > the
> > > > > > > last
> > > > > > > > 10
> > > > > > > > > > > days,
> > > > > > > > > > > > > > but I
> > > > > > > > > > > > > > > > > > >> couldn't
> > > > > > > > > > > > > > > > > > >> find the vote thread. Is there one
> that
> > > I'm
> > > > > > > missing?
> > > > > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > > > > >> Gwen
> > > > > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > > > > >> On Wed, Sep 22, 2021 at 4:58 AM Magnus
> > > > > Edenhill <
> > > > > > > > > > > > > > > magnus@edenhill.se>
> > > > > > > > > > > > > > > > > > >> wrote:
> > > > > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > > > > > >> > On Tue, Sep 21, 2021 at 06:58,
> > > Colin
> > > > > McCabe <
> > > > > > > > > > > > > > > > > > cmccabe@apache.org
> > > > > > > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > > > >> >
> > > > > > > > > > > > > > > > > > >> > > On Mon, Sep 20, 2021, at 17:35,
> Feng
> > > Min
> > > > > > > wrote:
> > > > > > > > > > > > > > > > > > >> > > > Thanks Magnus & Colin for the
> > > > > discussion.
> > > > > > > > > > > > > > > > > > >> > > >
> > > > > > > > > > > > > > > > > > >> > > > Based on KIP-714's stateless
> > design,
> > > > > Client
> > > > > > > > can
> > > > > > > > > > > pretty
> > > > > > > > > > > > > > much
> > > > > > > > > > > > > > > use
> > > > > > > > > > > > > > > > > > any
> > > > > > > > > > > > > > > > > > >> > > > connection to any broker to send
> > > > > metrics. We
> > > > > > > > are
> > > > > > > > > > not
> > > > > > > > > > > > > > > associating
> > > > > > > > > > > > > > > > > > >> > > connection
> > > > > > > > > > > > > > > > > > >> > > > with client metric state. Is my
> > > > > > > understanding
> > > > > > > > > > > correct?
> > > > > > > > > > > > > If
> > > > > > > > > > > > > > > yes,
> > > > > > > > > > > > > > > > > > how
> > > > > > > > > > > > > > > > > > >> > about
> > > > > > > > > > > > > > > > > > >> > > > the following two scenarios
> > > > > > > > > > > > > > > > > > >> > > >
> > > > > > > > > > > > > > > > > > >> > > > 1) One Client (Client-ID)
> > registers
> > > > two
> > > > > > > > > different
> > > > > > > > > > > client
> > > > > > > > > > > > > > > > > instance
> > > > > > > > > > > > > > > > > > > ids
> > > > > > > > > > > > > > > > > > >> > via
> > > > > > > > > > > > > > > > > > >> > > > separate registration. Is it
> > > > permitted?
> > > > > If
> > > > > > > OK,
> > > > > > > > > how
> > > > > > > > > > > to
> > > > > > > > > > > > > > > > > distinguish
> > > > > > > > > > > > > > > > > > >> them
> > > > > > > > > > > > > > > > > > >> > > from
> > > > > > > > > > > > > > > > > > >> > > > the case 2 below.
> > > > > > > > > > > > > > > > > > >> > > >
> > > > > > > > > > > > > > > > > > >> > >
> > > > > > > > > > > > > > > > > > >> > > Hi Feng,
> > > > > > > > > > > > > > > > > > >> > >
> > > > > > > > > > > > > > > > > > >> > > My understanding, which Magnus can
> > > > > clarify I
> > > > > > > > > guess,
> > > > > > > > > > is
> > > > > > > > > > > > > that
> > > > > > > > > > > > > > > you
> > > > > > > > > > > > > > > > > > could
> > > > > > > > > > > > > > > > > > >> > have
> > > > > > > > > > > > > > > > > > >> > > something like two Producer
> > instances
> > > > > running
> > > > > > > > with
> > > > > > > > > > the
> > > > > > > > > > > > > same
> > > > > > > > > > > > > > > > > > client.id
> > > > > > > > > > > > > > > > > > >> > > (perhaps because they're using the
> > > same
> > > > > config
> > > > > > > > > file,
> > > > > > > > > > > for
> > > > > > > > > > > > > > > example).
> > > > > > > > > > > > > > > > > > >> They
> > > > > > > > > > > > > > > > > > >> > > could even be in the same process.
> > But
> > > > > they
> > > > > > > > would
> > > > > > > > > > get
> > > > > > > > > > > > > > separate
> > > > > > > > > > > > > > > > > > UUIDs.
> > > > > > > > > > > > > > > > > > >> > >
> > > > > > > > > > > > > > > > > > >> > > I believe Magnus used the term
> > client
> > > to
> > > > > mean
> > > > > > > > > > > "Producer or
> > > > > > > > > > > > > > > > > > Consumer".
> > > > > > > > > > > > > > > > > > >> So
> > > > > > > > > > > > > > > > > > >> > > if you have both a Producer and a
> > > > > Consumer in
> > > > > > > > your
> > > > > > > > > > > > > > > application I
> > > > > > > > > > > > > > > > > > would
> > > > > > > > > > > > > > > > > > >> > > expect you'd get separate UUIDs
> for
> > > > both.
> > > > > > > Again
> > > > > > > > > > > Magnus can
> > > > > > > > > > > > > > > chime
> > > > > > > > > > > > > > > > > in
> > > > > > > > > > > > > > > > > > >> > here, I
> > > > > > > > > > > > > > > > > > >> > > guess.
> > > > > > > > > > > > > > > > > > >> > >
> > > > > > > > > > > > > > > > > > >> >
> > > > > > > > > > > > > > > > > > >> > That's correct.
> > > > > > > > > > > > > > > > > > >> >
> > > > > > > > > > > > > > > > > > >> >
> > > > > > > > > > > > > > > > > > >> > >
> > > > > > > > > > > > > > > > > > >> > > > 2) How about the client
> > restarting?
> > > > > What's
> > > > > > > the
> > > > > > > > > > > > > > expectation?
> > > > > > > > > > > > > > > > > Should
> > > > > > > > > > > > > > > > > > >> the
> > > > > > > > > > > > > > > > > > >> > > > server expect the client to
> carry
> > a
> > > > > > > persisted
> > > > > > > > > > client
> > > > > > > > > > > > > > > instance id
> > > > > > > > > > > > > > > > > > or
> > > > > > > > > > > > > > > > > > >> > > should
> > > > > > > > > > > > > > > > > > >> > > > the client be treated as a new
> > > > instance?
> > > > > > > > > > > > > > > > > > >> > >
> > > > > > > > > > > > > > > > > > >> > > The KIP doesn't describe any
> > mechanism
> > > > for
> > > > > > > > > > > persistence,
> > > > > > > > > > > > > so I
> > > > > > > > > > > > > > > would
> > > > > > > > > > > > > > > > > > >> assume
> > > > > > > > > > > > > > > > > > >> > > that when you restart the client
> you
> > > get
> > > > > a new
> > > > > > > > > > UUID. I
> > > > > > > > > > > > > agree
> > > > > > > > > > > > > > > that
> > > > > > > > > > > > > > > > > it
> > > > > > > > > > > > > > > > > > >> > would
> > > > > > > > > > > > > > > > > > >> > > be good to spell this out.
> > > > > > > > > > > > > > > > > > >> > >
> > > > > > > > > > > > > > > > > > >> > >
> > > > > > > > > > > > > > > > > > >> > Right, it will not be persisted
> since
> > a
> > > > > client
> > > > > > > > > > instance
> > > > > > > > > > > > > can't
> > > > > > > > > > > > > > be
> > > > > > > > > > > > > > > > > > >> restarted.
> > > > > > > > > > > > > > > > > > >> >
> > > > > > > > > > > > > > > > > > >> > Will update the KIP to make this
> > > clearer.
> > > > > > > > > > > > > > > > > > >> >
> > > > > > > > > > > > > > > > > > >> > /Magnus
> > > > > > > > > > > > > > > > > > >> >
> > > > > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > > > > >> --
> > > > > > > > > > > > > > > > > > >> Gwen Shapira
> > > > > > > > > > > > > > > > > > >> Engineering Manager | Confluent
> > > > > > > > > > > > > > > > > > >> 650.450.2760 | @gwenshap
> > > > > > > > > > > > > > > > > > >> Follow us: Twitter | blog
> > > > > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] KIP-714: Client metrics and observability

Posted by Magnus Edenhill <ma...@edenhill.se>.

Hi Jun,

thanks for your insightful questions; see my answers below.
There have been a number of clarifications to the KIP.



Den tors 27 jan. 2022 kl 20:08 skrev Jun Rao <ju...@confluent.io.invalid>:

> Hi, Magnus,
>
> Thanks for updating the KIP. The overall approach makes sense to me. A few
> more detailed comments below.
>
> 20. ClientTelemetry: Should it be extending configurable and closable?
>

I'll pass this question to Sarat and/or Xavier.



> 21. Compression of the metrics on the client: what's the default?
>

How about we specify a prioritized list: zstd, lz4, snappy, gzip?
But ultimately it is up to what the client supports.
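
To illustrate the prioritized-list idea, here is a minimal Python sketch. It
assumes the client simply picks the first codec in the proposed order that it
supports; the function name and the "none" fallback are illustrative
assumptions, not something the KIP specifies.

```python
# Proposed priority order from this thread: zstd, lz4, snappy, gzip.
PREFERRED_CODECS = ["zstd", "lz4", "snappy", "gzip"]


def pick_compression(client_supported, preferred=PREFERRED_CODECS):
    """Return the highest-priority codec that the client supports.

    Falls back to "none" (uncompressed) when nothing matches; that
    fallback is an assumption made for this sketch.
    """
    for codec in preferred:
        if codec in client_supported:
            return codec
    return "none"


# A hypothetical client built with only lz4 and gzip support.
chosen = pick_compression({"gzip", "lz4"})
```

Here the client would end up with lz4, since it ranks above gzip in the list.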


> 23. A client instance is considered a metric resource and the
> resource-level (thus client instance level) labels could include:
>     client_software_name=confluent-kafka-python
>     client_software_version=v2.1.3
>     client_instance_id=B64CD139-3975-440A-91D4
>     transactional_id=someTxnApp
> Are those labels added in PushTelemetryRequest? If so, are they per metric
> or per request?
>


client_software* and client_instance_id are not added by the client but are
available to the broker-side metrics plugin to add as it sees fit, so I'm
removing them from the KIP.

As for transactional_id, group_id, etc., which I believe will be useful in
troubleshooting, they are included only once (per push) as resource-level
attributes (the client instance is a singular resource).
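
As a rough illustration of "once per push", the Python sketch below builds a
simplified payload in which the resource-level attributes appear a single time
and the individual metrics carry none of them. The dict layout, the helper
name, the group id and the metric values are all made up for the example; the
real payload is OTLP-encoded and looks different on the wire.

```python
def build_push(resource_attributes, metrics):
    """Build a simplified, illustrative push payload.

    resource_attributes: labels describing the client instance as a whole.
    metrics: iterable of (name, value) pairs collected for this push.
    """
    return {
        # Stated once for the whole push, not repeated per metric.
        "resource": {"attributes": dict(resource_attributes)},
        "metrics": [{"name": name, "value": value} for name, value in metrics],
    }


push = build_push(
    {"transactional_id": "someTxnApp", "group_id": "someGroup"},
    [("client.request.success", 42), ("client.request.errors", 1)],
)
```

The broker-side plugin can then attach the resource attributes to every metric
in the push if its backend needs them per metric.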


>
> 24.  "the broker will only send
> GetTelemetrySubscriptionsResponse.DeltaTemporality=True" :
> 24.1 If it's always true, does it need to be part of the protocol?
>

We're anticipating that it will take a lot longer to upgrade the majority
of clients than the
broker/plugin side, which is why we want the client to support both
temporalities out-of-the-box
so that cumulative reporting can be turned on seamlessly in the future.



> 24.2 Does delta only apply to Counter type?
>


And Histograms. More details in Xavier's OTLP link.



> 24.3 In the delta representation, the first request needs to send the full
> value, how does the broker plugin know whether a value is full or delta?
>

The client may (should) send the start time for each metric sample,
indicating when
the metric began to be collected.
We've discussed whether this should be the client instance start time or
the time when a matching
metric subscription for that metric is received.
For completeness we recommend using the former, the client instance start
time.
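
One way a plugin could make use of the start time is sketched below in Python.
It assumes delta samples whose start time stays fixed at the client instance
start time, as recommended above, and it treats a changed start time as the
beginning of a new series. The class and field names are hypothetical and this
is not the plugin API.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    name: str
    start_time_ms: int  # when collection of this metric began
    time_ms: int        # when this sample was taken
    value: float        # delta since the previous sample


class DeltaAccumulator:
    """Turns delta samples into running totals, per client and metric."""

    def __init__(self):
        # (client_instance_id, metric name) -> (start_time_ms, total)
        self._state = {}

    def add(self, client_instance_id, sample):
        key = (client_instance_id, sample.name)
        previous = self._state.get(key)
        if previous is None or previous[0] != sample.start_time_ms:
            # First sample of a series: it covers everything since start.
            total = sample.value
        else:
            total = previous[1] + sample.value
        self._state[key] = (sample.start_time_ms, total)
        return total


acc = DeltaAccumulator()
first = acc.add("c1", Sample("client.request.success", 1000, 61000, 5))
second = acc.add("c1", Sample("client.request.success", 1000, 121000, 3))
```

With the first sample's start time telling the plugin where the series begins,
the two pushes above add up to a running total of 8.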



> 25. quota:
> 25.1 Since we are fitting PushTelemetryRequest into the existing request
> quota, it would be useful to document the impact, i.e. client metric
> throttling causes the data from the same client to be delayed.
> 25.2 Is PushTelemetryRequest subject to the write bandwidth quota like the
> producer?
>


Yes, it should be, so as to protect the cluster from rogue clients.
In practice, though, the size of metrics will be quite low (e.g., 1-10kb per
60s interval), so I don't think this will pose a problem.
The KIP has been updated with more details on quota/throttling behaviour; see
the "Throttling and rate-limiting" section.


> 25.3 THROTTLING_QUOTA_EXCEEDED: Currently, we don't send this error when
> the request/bandwidth quota is exceeded since those requests are not
> rejected. We only set this error when the request is rejected (e.g., topic
> creation). It would be useful to clarify when this error is used.
>

Right, I was trying to reuse an existing error code. We can introduce
a new one for the case where a client pushes metrics at a higher frequency
than the configured push interval (e.g., out-of-profile sends).
This causes the broker to drop those metrics and send this error code back
to the client. There will be no connection throttling / channel-muting in
this case (unless the standard quotas are exceeded).
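
The out-of-profile check could look roughly like the Python sketch below. It
assumes the broker remembers the time of the last accepted push per client
instance and compares the gap against the configured push interval; the class
name and the boolean return value (standing in for the error code) are
assumptions for illustration only.

```python
class PushIntervalGuard:
    """Accepts at most one push per push interval per client instance."""

    def __init__(self, push_interval_ms):
        self.push_interval_ms = push_interval_ms
        self._last_accepted_ms = {}  # client_instance_id -> timestamp

    def on_push(self, client_instance_id, now_ms):
        """Return True if the push is accepted, False if it is dropped.

        A False result is where the broker would drop the metrics and
        return the error code to the client, without muting the channel.
        """
        last = self._last_accepted_ms.get(client_instance_id)
        if last is not None and now_ms - last < self.push_interval_ms:
            return False
        self._last_accepted_ms[client_instance_id] = now_ms
        return True


guard = PushIntervalGuard(push_interval_ms=60_000)
accepted_first = guard.on_push("b69cc35a", now_ms=0)
accepted_early = guard.on_push("b69cc35a", now_ms=10_000)
accepted_later = guard.on_push("b69cc35a", now_ms=60_000)
```

Only the push at 10 seconds is dropped here; rejected pushes do not move the
"last accepted" timestamp, so a client that retries too eagerly is not
penalised beyond the interval itself.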


> 27. kafka-client-metrics.sh: Could we add an example on how to disable a
> bad client?
>

There's now a --block option to kafka-client-metrics.sh which overrides all
subscriptions
for the matched client(s). This allows silencing metrics for one or more
clients without having
to remove existing subscriptions. From the client's perspective it will
look like it no longer has
any subscriptions.

# Block metrics collection for a specific client instance.
# --name: a descriptive name makes it easier to clean up old subscriptions.
# --match: match this specific client instance.
$ kafka-client-metrics.sh --bootstrap-server $BROKERS \
   --add \
   --name 'Disable_b69cc35a' \
   --match client_instance_id=b69cc35a-7a54-4790-aa69-cc2bd4ee4538 \
   --block




> 28. New broker side metrics: Could we spell out the details of the metrics
> (e.g., group, tags, etc)?
>

KIP has been updated accordingly (thanks Sarat).



>
> 29. Client instance-level metrics: client.io.wait.time is a gauge not a
> histogram.
>

I believe a population/distribution should preferably be represented as a
histogram, space permitting,
and only secondarily as a Gauge average.
While we might not want to maintain a bunch of histograms for each
partition, since that could be
quite space consuming, this client.io.wait.time is a single metric per
client instance and can
thus afford a Histogram representation.



Thanks,
Magnus



> Thanks,
>
> Jun
>
> On Wed, Jan 26, 2022 at 6:32 AM Magnus Edenhill <ma...@edenhill.se>
> wrote:
>
> > Hi all,
> >
> > I've updated the KIP with responses to the latest comments: Java client
> > dependencies (Thanks Kirk!), alternate designs (separate cluster,
> separate
> > producer, etc), etc.
> >
> > I will revive the vote thread.
> >
> > Thanks,
> > Magnus
> >
> >
> > Den mån 13 dec. 2021 kl 22:32 skrev Ryanne Dolan <ryannedolan@gmail.com
> >:
> >
> > > I think we should be very careful about introducing new runtime
> > > dependencies into the clients. Historically this has been rare and
> > > essentially necessary (e.g. compression libs).
> > >
> > > Ryanne
> > >
> > > On Mon, Dec 13, 2021, 1:06 PM Kirk True <ki...@mustardgrain.com> wrote:
> > >
> > > > Hi Jun,
> > > >
> > > > On Thu, Dec 9, 2021, at 2:28 PM, Jun Rao wrote:
> > > > > 13. Using OpenTelemetry. Does that require runtime dependency
> > > > > on OpenTelemetry library? How good is the compatibility story
> > > > > of OpenTelemetry? This is important since an application could have
> > > other
> > > > > OpenTelemetry dependencies than the Kafka client.
> > > >
> > > > The current design is that the OpenTelemetry JARs would ship with the
> > > > client. Perhaps we can design the client such that the JARs aren't
> even
> > > > loaded if the user has opted out. The user could even exclude the
> JARs
> > > from
> > > > their dependencies if they so wished.
> > > >
> > > > I can't speak to the compatibility of the libraries. Is it possible
> > that
> > > > we include a shaded version?
> > > >
> > > > Thanks,
> > > > Kirk
> > > >
> > > > >
> > > > > 14. The proposal listed idempotence=true. This is more of a
> > > configuration
> > > > > than a metric. Are we including that as a metric? What other
> > > > configurations
> > > > > are we including? Should we separate the configurations from the
> > > metrics?
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Jun
> > > > >
> > > > > On Mon, Nov 29, 2021 at 7:34 AM Magnus Edenhill <
> magnus@edenhill.se>
> > > > wrote:
> > > > >
> > > > > > Hey Bob,
> > > > > >
> > > > > > That's a good point.
> > > > > >
> > > > > > Request type labels were considered but since they're already
> > tracked
> > > > by
> > > > > > broker-side metrics
> > > > > > they were left out as to avoid metric duplication, however those
> > > > metrics
> > > > > > are not per connection,
> > > > > > so they won't be that useful in practice for troubleshooting
> > specific
> > > > > > client instances.
> > > > > >
> > > > > > I'll add the request_type label to the relevant metrics.
> > > > > >
> > > > > > Thanks,
> > > > > > Magnus
> > > > > >
> > > > > >
> > > > > > Den tis 23 nov. 2021 kl 19:20 skrev Bob Barrett
> > > > > > <bo...@confluent.io.invalid>:
> > > > > >
> > > > > > > Hi Magnus,
> > > > > > >
> > > > > > > Thanks for the thorough KIP, this seems very useful.
> > > > > > >
> > > > > > > Would it make sense to include the request type as a label for
> > the
> > > > > > > `client.request.success`, `client.request.errors` and
> > > > > > `client.request.rtt`
> > > > > > > metrics? I think it would be very useful to see which specific
> > > > requests
> > > > > > are
> > > > > > > succeeding and failing for a client. One specific case I can
> > think
> > > of
> > > > > > where
> > > > > > > this could be useful is producer batch timeouts. If a Java
> > > > application
> > > > > > does
> > > > > > > not enable producer client logs (unfortunately, in my
> experience
> > > this
> > > > > > > happens more often than it should), the application logs will
> > only
> > > > > > contain
> > > > > > > the expiration error message, but no information about what is
> > > > causing
> > > > > > the
> > > > > > > timeout. The requests might all be succeeding but taking too
> long
> > > to
> > > > > > > process batches, or metadata requests might be failing, or some
> > or
> > > > all
> > > > > > > produce requests might be failing (if the bootstrap servers are
> > > > reachable
> > > > > > > from the client but one or more other brokers are not, for
> > > example).
> > > > If
> > > > > > the
> > > > > > > cluster operator is able to identify the specific requests that
> > are
> > > > slow
> > > > > > or
> > > > > > > failing for a client, they will be better able to diagnose the
> > > issue
> > > > > > > causing batch timeouts.
> > > > > > >
> > > > > > > One drawback I can think of is that this will increase the
> > > > cardinality of
> > > > > > > the request metrics. But any given client is only going to use
> a
> > > > small
> > > > > > > subset of the request types, and since we already have
> partition
> > > > labels
> > > > > > for
> > > > > > > the topic-level metrics, I think request labels will still make
> > up
> > > a
> > > > > > > relatively small percentage of the set of metrics.
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Bob
> > > > > > >
> > > > > > > On Mon, Nov 22, 2021 at 2:08 AM Viktor Somogyi-Vass <
> > > > > > > viktorsomogyi@gmail.com>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Hi Magnus,
> > > > > > > >
> > > > > > > > I think this is a very useful addition. We also have a
> similar
> > > (but
> > > > > > much
> > > > > > > > more simplistic) implementation of this. Maybe I missed it in
> > the
> > > > KIP
> > > > > > but
> > > > > > > > what about adding metrics about the subscription cache
> itself?
> > > > That I
> > > > > > > think
> > > > > > > > would improve its usability and debuggability as we'd be able
> > to
> > > > see
> > > > > > its
> > > > > > > > performance, hit/miss rates, eviction counts and others.
> > > > > > > >
> > > > > > > > Best,
> > > > > > > > Viktor
> > > > > > > >
> > > > > > > > On Thu, Nov 18, 2021 at 5:12 PM Magnus Edenhill <
> > > > magnus@edenhill.se>
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Hi Mickael,
> > > > > > > > >
> > > > > > > > > see inline.
> > > > > > > > >
> > > > > > > > > Den ons 10 nov. 2021 kl 15:21 skrev Mickael Maison <
> > > > > > > > > mickael.maison@gmail.com
> > > > > > > > > >:
> > > > > > > > >
> > > > > > > > > > Hi Magnus,
> > > > > > > > > >
> > > > > > > > > > I see you've addressed some of the points I raised above
> > but
> > > > some
> > > > > > (4,
> > > > > > > > > > 5) have not been addressed yet.
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > Re 4) How will the user/app know metrics are being sent.
> > > > > > > > >
> > > > > > > > > One possibility is to add a JMX metric (thus for user
> > > > consumption)
> > > > > > for
> > > > > > > > the
> > > > > > > > > number of metric pushes the
> > > > > > > > > client has performed, or perhaps the number of metrics
> > > > subscriptions
> > > > > > > > > currently being collected.
> > > > > > > > > Would that be sufficient?
> > > > > > > > >
> > > > > > > > > Re 5) Metric sizes and rates
> > > > > > > > >
> > > > > > > > > A worst case scenario for a producer that is producing to
> 50
> > > > unique
> > > > > > > > topics
> > > > > > > > > and emitting all standard metrics yields
> > > > > > > > > a serialized size of around 100KB prior to compression,
> which
> > > > > > > compresses
> > > > > > > > > down to about 20-30% of that depending
> > > > > > > > > on compression type and topic name uniqueness.
> > > > > > > > > The numbers for a consumer would be similar.
> > > > > > > > >
> > > > > > > > > In practice the number of unique topics would be far less,
> > and
> > > > the
> > > > > > > > > subscription set would typically be for a subset of
> metrics.
> > > > > > > > > So we're probably closer to 1kb, or less, compressed size
> per
> > > > client
> > > > > > > per
> > > > > > > > > push interval.
> > > > > > > > >
> > > > > > > > > As both the subscription set and push intervals are
> > controlled
> > > > by the
> > > > > > > > > cluster operator it shouldn't be too hard
> > > > > > > > > to strike a good balance between metrics overhead and
> > > > granularity.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > I'm really uneasy with this being enabled by default on
> the
> > > > client
> > > > > > > > > > side. When collecting data, I think the best practice is
> to
> > > > ensure
> > > > > > > > > > users are explicitly enabling it.
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > Requiring metrics to be explicitly enabled on clients
> > severely
> > > > > > cripples
> > > > > > > > its
> > > > > > > > > usability and value.
> > > > > > > > >
> > > > > > > > > One of the problems that this KIP aims to solve is for
> useful
> > > > metrics
> > > > > > > to
> > > > > > > > be
> > > > > > > > > available on demand
> > > > > > > > > regardless of the technical expertise of the user. As
> Ryanne
> > > > points
> > > > > > > out,
> > > > > > > > a
> > > > > > > > > savvy user/organization
> > > > > > > > > will typically have metrics collection and monitoring in
> > place
> > > > > > already,
> > > > > > > > and
> > > > > > > > > the benefits of this KIP
> > > > > > > > > > are then more of a common set and format of metrics across
> > client
> > > > > > > > > implementations and languages.
> > > > > > > > > But that is not the typical Kafka user in my experience,
> > > they're
> > > > not
> > > > > > > > Kafka
> > > > > > > > > experts and they don't have the
> > > > > > > > > knowledge of how to best instrument their clients.
> > > > > > > > > Having metrics enabled by default for this user base allows
> > the
> > > > Kafka
> > > > > > > > > operators to proactively and reactively
> > > > > > > > > monitor and troubleshoot client issues, without the need
> for
> > > the
> > > > less
> > > > > > > > savvy
> > > > > > > > > user to do anything.
> > > > > > > > > It is often too late to tell a user to enable metrics when
> > the
> > > > > > problem
> > > > > > > > has
> > > > > > > > > already occurred.
> > > > > > > > >
> > > > > > > > > Now, to be clear, even though metrics are enabled by
> default
> > on
> > > > > > clients
> > > > > > > > > they
> > > > > > > > > > are not enabled by default
> > > > > > > > > on the brokers; the Kafka operator needs to build and set
> up
> > a
> > > > > > metrics
> > > > > > > > > plugin and add metrics subscriptions
> > > > > > > > > before anything is sent from the client.
> > > > > > > > > It is opt-out on the clients and opt-in on the broker.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > > You mentioned brokers already have
> > > > > > > > > > some(most?) of the information contained in metrics, if
> so
> > > > then why
> > > > > > > > > > are we collecting it again? Surely there must be some new
> > > > > > information
> > > > > > > > > > in the client metrics.
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > From the user's perspective the Kafka infrastructure
> extends
> > > from
> > > > > > > > > producer.send() to
> > > > > > > > > messages being returned from consumer.poll(), a giant black
> > box
> > > > where
> > > > > > > > > there's a lot going on between those
> > > > > > > > > two points. The brokers currently only see what happens
> once
> > > > those
> > > > > > > > requests
> > > > > > > > > > and messages hit the broker,
> > > > > > > > > but as Kafka clients are complex pieces of machinery
> there's
> > a
> > > > myriad
> > > > > > > of
> > > > > > > > > queues, timers, and state
> > > > > > > > > that's critical to the operation and infrastructure that's
> > not
> > > > > > > currently
> > > > > > > > > visible to the operator.
> > > > > > > > > Relying on the user to accurately and timely provide this
> > > missing
> > > > > > > > > information is not generally feasible.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > Most of the standard metrics listed in the KIP are data
> > points
> > > > that
> > > > > > the
> > > > > > > > > broker does not have.
> > > > > > > > > Only a small number of metrics are duplicates (like the
> > request
> > > > > > counts
> > > > > > > > and
> > > > > > > > > sizes), but they are included
> > > > > > > > > to ease correlation when inspecting these client metrics.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > > Moreover this is a brand new feature so it's even harder
> to
> > > > justify
> > > > > > > > > > enabling it and forcing onto all our users. If disabled
> by
> > > > default,
> > > > > > > > > > it's relatively easy to enable in a new release if we
> > decide
> > > > to,
> > > > > > but
> > > > > > > > > > once enabled by default it's much harder to disable. Also
> > > this
> > > > > > > feature
> > > > > > > > > > will apply to all future metrics we will add.
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > I think maturity of a feature implementation should be the
> > > > deciding
> > > > > > > > factor,
> > > > > > > > > rather than
> > > > > > > > > the design of it (which this KIP is). I.e., if the
> > > > implementation is
> > > > > > > not
> > > > > > > > > deemed mature enough
> > > > > > > > > for release X.Y it will be disabled.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > > Overall I think it's an interesting feature but I'd
> prefer
> > to
> > > > be
> > > > > > > > > > slightly defensive and see how it works in practice
> before
> > > > enabling
> > > > > > > it
> > > > > > > > > > everywhere.
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > Right, and I agree on being defensive, but since this
> feature
> > > > still
> > > > > > > > > requires manual
> > > > > > > > > enabling on the brokers before actually being used, I think
> > > that
> > > > > > gives
> > > > > > > > > enough control
> > > > > > > > > to opt-in or out of this feature as needed.
> > > > > > > > >
> > > > > > > > > Thanks for your comments!
> > > > > > > > >
> > > > > > > > > Regards,
> > > > > > > > > Magnus
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > > Thanks,
> > > > > > > > > > Mickael
> > > > > > > > > >
> > > > > > > > > > On Mon, Nov 8, 2021 at 7:55 AM Magnus Edenhill <
> > > > magnus@edenhill.se
> > > > > > >
> > > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > Thanks David for pointing this out,
> > > > > > > > > > > I've updated the KIP to include client_id as a matching
> > > > selector.
> > > > > > > > > > >
> > > > > > > > > > > Regards,
> > > > > > > > > > > Magnus
> > > > > > > > > > >
> > > > > > > > > > > Den tors 4 nov. 2021 kl 18:01 skrev David Mao
> > > > > > > > > <dmao@confluent.io.invalid
> > > > > > > > > > >:
> > > > > > > > > > >
> > > > > > > > > > > > Hey Magnus,
> > > > > > > > > > > >
> > > > > > > > > > > > I noticed that the KIP outlines the initial selectors
> > > > supported
> > > > > > > as:
> > > > > > > > > > > >
> > > > > > > > > > > >    - client_instance_id - CLIENT_INSTANCE_ID UUID
> > string
> > > > > > > > > > representation.
> > > > > > > > > > > >    - client_software_name  - client software
> > > implementation
> > > > > > name.
> > > > > > > > > > > >    - client_software_version  - client software
> > > > implementation
> > > > > > > > > version.
> > > > > > > > > > > >
> > > > > > > > > > > > In the given reactive monitoring workflow, we mention
> > > that
> > > > the
> > > > > > > > > > application
> > > > > > > > > > > > user does not know their client's client instance ID,
> > but
> > > > it's
> > > > > > > > > outlined
> > > > > > > > > > > > that the operator can add a metrics subscription
> > > selecting
> > > > for
> > > > > > > > > > clientId. I
> > > > > > > > > > > > don't see clientId as one of the supported selectors.
> > > > > > > > > > > > I can see how this would have made sense in a
> previous
> > > > > > iteration
> > > > > > > > > given
> > > > > > > > > > that
> > > > > > > > > > > > the previous client instance ID proposal was to
> > construct
> > > > the
> > > > > > > > client
> > > > > > > > > > > > instance ID using clientId as a prefix. Now that the
> > > client
> > > > > > > > instance
> > > > > > > > > > ID is
> > > > > > > > > > > > a UUID, would we want to add clientId as a supported
> > > > selector?
> > > > > > > > > > > > Let me know what you think.
> > > > > > > > > > > >
> > > > > > > > > > > > David
> > > > > > > > > > > >
> > > > > > > > > > > > On Tue, Oct 19, 2021 at 12:33 PM Magnus Edenhill <
> > > > > > > > magnus@edenhill.se
> > > > > > > > > >
> > > > > > > > > > > > wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > Hi Mickael!
> > > > > > > > > > > > >
> > > > > > > > > > > > > Den tis 19 okt. 2021 kl 19:30 skrev Mickael Maison
> <
> > > > > > > > > > > > > mickael.maison@gmail.com
> > > > > > > > > > > > > >:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > Hi Magnus,
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Thanks for the proposal.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > 1. Looking at the protocol section, isn't
> > > > > > "ClientInstanceId"
> > > > > > > > > > expected
> > > > > > > > > > > > > > to be a field in
> > GetTelemetrySubscriptionsResponseV0?
> > > > > > > > Otherwise,
> > > > > > > > > > how
> > > > > > > > > > > > > > does a client retrieve this value?
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > Good catch, it got removed by mistake in one of the
> > > > edits.
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > 2. In the client API section, you mention a new
> > > method
> > > > > > > > > > > > > > "clientInstanceId()". Can you clarify which
> > > interfaces
> > > > are
> > > > > > > > > > affected?
> > > > > > > > > > > > > > Is it only Consumer and Producer?
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > And Admin. Will update the KIP.
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > > 3. I'm a bit concerned this is enabled by
> default.
> > > > Even if
> > > > > > > the
> > > > > > > > > data
> > > > > > > > > > > > > > collected is supposed to be not sensitive, I
> think
> > > > this can
> > > > > > > be
> > > > > > > > > > > > > > problematic in some environments. Also users
> don't
> > > > seem to
> > > > > > > have
> > > > > > > > > the
> > > > > > > > > > > > > > choice to only expose some metrics. Knowing how
> > much
> > > > data
> > > > > > > > transit
> > > > > > > > > > > > > > through some applications can be considered
> > critical.
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > The broker already knows how much data transits
> > through
> > > > the
> > > > > > > > client
> > > > > > > > > > > > though,
> > > > > > > > > > > > > right?
> > > > > > > > > > > > > Care has been taken not to expose information in
> the
> > > > standard
> > > > > > > > > metrics
> > > > > > > > > > > > that
> > > > > > > > > > > > > might
> > > > > > > > > > > > > reveal sensitive information.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Do you have an example of how the proposed metrics
> > > could
> > > > leak
> > > > > > > > > > sensitive
> > > > > > > > > > > > > information?
> > > > > > > > > > > > > > As for limiting what metrics to export; I guess
> > > that
> > > > > > could
> > > > > > > > make
> > > > > > > > > > sense
> > > > > > > > > > > > > in some
> > > > > > > > > > > > > very sensitive use-cases, but those users might
> > disable
> > > > > > metrics
> > > > > > > > > > > > altogether
> > > > > > > > > > > > > for now.
> > > > > > > > > > > > > Could these concerns be addressed by a later KIP?
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > 4. As a user, how do you know if your application
> > is
> > > > > > actively
> > > > > > > > > > sending
> > > > > > > > > > > > > > metrics? Are there new metrics exposing what's
> > going
> > > > on,
> > > > > > like
> > > > > > > > how
> > > > > > > > > > much
> > > > > > > > > > > > > > data is being sent?
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > That's a good question.
> > > > > > > > > > > > > Since the proposed metrics interface is not aimed
> at,
> > > or
> > > > > > > directly
> > > > > > > > > > > > available
> > > > > > > > > > > > > to, the application
> > > > > > > > > > > > > I guess there's little point of adding it here, but
> > > > instead
> > > > > > > > adding
> > > > > > > > > > > > > something to the
> > > > > > > > > > > > > existing JMX metrics?
> > > > > > > > > > > > > Do you have any suggestions?
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > > 5. If all metrics are enabled on a regular
> Consumer
> > > or
> > > > > > > > Producer,
> > > > > > > > > do
> > > > > > > > > > > > > > you have an idea how much throughput this would
> > use?
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > It depends on the number of partition/topics/etc
> the
> > > > client
> > > > > > is
> > > > > > > > > > producing
> > > > > > > > > > > > > to/consuming from.
> > > > > > > > > > > > > I'll add some sizes to the KIP for some typical
> > > > use-cases.
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > Magnus
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > > Thanks
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Tue, Oct 19, 2021 at 5:06 PM Magnus Edenhill <
> > > > > > > > > > magnus@edenhill.se>
> > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Den tis 19 okt. 2021 kl 13:22 skrev Tom
> Bentley <
> > > > > > > > > > tbentley@redhat.com
> > > > > > > > > > > > >:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Hi Magnus,
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > I reviewed the KIP since you called the vote
> > > > (sorry for
> > > > > > > not
> > > > > > > > > > > > reviewing
> > > > > > > > > > > > > > when
> > > > > > > > > > > > > > > > you announced your intention to call the
> > vote). I
> > > > have
> > > > > > a
> > > > > > > > few
> > > > > > > > > > > > > questions
> > > > > > > > > > > > > > on
> > > > > > > > > > > > > > > > some of the details.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > 1. There's no Javadoc on
> > > > ClientTelemetryPayload.data(),
> > > > > > > so
> > > > > > > > I
> > > > > > > > > > don't
> > > > > > > > > > > > > know
> > > > > > > > > > > > > > > > whether the payload is exposed through this
> > > method
> > > > as
> > > > > > > > > > compressed or
> > > > > > > > > > > > > > not.
> > > > > > > > > > > > > > > > Later on you say "Decompression of the
> payloads
> > > > will be
> > > > > > > > > > handled by
> > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > broker metrics plugin, the broker should
> > expose a
> > > > > > > suitable
> > > > > > > > > > > > > > decompression
> > > > > > > > > > > > > > > > API to the metrics plugin for this purpose.",
> > > which
> > > > > > > > suggests
> > > > > > > > > > it's
> > > > > > > > > > > > the
> > > > > > > > > > > > > > > > compressed data in the buffer, but then we
> > don't
> > > > know
> > > > > > > which
> > > > > > > > > > codec
> > > > > > > > > > > > was
> > > > > > > > > > > > > > used,
> > > > > > > > > > > > > > > > nor the API via which the plugin should
> > > decompress
> > > > it
> > > > > > if
> > > > > > > > > > required
> > > > > > > > > > > > for
> > > > > > > > > > > > > > > > forwarding to the ultimate metrics store.
> > Should
> > > > the
> > > > > > > > > > > > > > ClientTelemetryPayload
> > > > > > > > > > > > > > > > expose a method to get the compression and a
> > > > > > > decompressor?
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Good point, updated.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > 2. The client-side API is expressed as
> > > > StringOrError
> > > > > > > > > > > > > > > > ClientInstance::ClientInstanceId(int
> > > timeout_ms). I
> > > > > > > > > understand
> > > > > > > > > > that
> > > > > > > > > > > > > > you're
> > > > > > > > > > > > > > > > thinking about the librdkafka implementation,
> > but
> > > > it
> > > > > > > would
> > > > > > > > be
> > > > > > > > > > good
> > > > > > > > > > > > to
> > > > > > > > > > > > > > show
> > > > > > > > > > > > > > > > the API as it would appear on the Apache
> Kafka
> > > > clients.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > This was meant as pseudo-code, but I changed it
> > to
> > > > Java.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > 3. "PushTelemetryRequest|Response - protocol
> > > > request
> > > > > > used
> > > > > > > > by
> > > > > > > > > > the
> > > > > > > > > > > > > > client to
> > > > > > > > > > > > > > > > send metrics to any broker it is connected
> to."
> > > To
> > > > be
> > > > > > > > clear,
> > > > > > > > > > this
> > > > > > > > > > > > > means
> > > > > > > > > > > > > > > > that the client can choose any of the
> connected
> > > > brokers
> > > > > > > and
> > > > > > > > > > push to
> > > > > > > > > > > > > > just
> > > > > > > > > > > > > > > > one of them? What should a supporting client
> do
> > > if
> > > > it
> > > > > > > gets
> > > > > > > > an
> > > > > > > > > > error
> > > > > > > > > > > > > > when
> > > > > > > > > > > > > > > > pushing metrics to a broker, retry sending to
> > the
> > > > same
> > > > > > > > broker
> > > > > > > > > > or
> > > > > > > > > > > > try
> > > > > > > > > > > > > > > > pushing to another broker, or drop the
> metrics?
> > > > Should
> > > > > > > > > > supporting
> > > > > > > > > > > > > > clients
> > > > > > > > > > > > > > > > send successive requests to a single broker,
> or
> > > > round
> > > > > > > > robin,
> > > > > > > > > > or is
> > > > > > > > > > > > > > that up
> > > > > > > > > > > > > > > > to the client author? I'm guessing the
> > behaviour
> > > > should
> > > > > > > be
> > > > > > > > > > sticky
> > > > > > > > > > > > to
> > > > > > > > > > > > > > > > support the rate limiting features, but I
> think
> > > it
> > > > > > would
> > > > > > > be
> > > > > > > > > > good
> > > > > > > > > > > > for
> > > > > > > > > > > > > > client
> > > > > > > > > > > > > > > > authors if this section were explicit on the
> > > > > > recommended
> > > > > > > > > > behaviour.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > You are right, I've updated the KIP to make
> this
> > > > clearer.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > 4. "Mapping the client instance id to an
> actual
> > > > > > > application
> > > > > > > > > > > > instance
> > > > > > > > > > > > > > > > running on a (virtual) machine can be done by
> > > > > > inspecting
> > > > > > > > the
> > > > > > > > > > > > metrics
> > > > > > > > > > > > > > > > resource labels, such as the client source
> > > address
> > > > and
> > > > > > > > source
> > > > > > > > > > port,
> > > > > > > > > > > > > or
> > > > > > > > > > > > > > > > security principal, all of which are added by
> > the
> > > > > > > receiving
> > > > > > > > > > broker.
> > > > > > > > > > > > > > This
> > > > > > > > > > > > > > > > will allow the operator together with the
> user
> > to
> > > > > > > identify
> > > > > > > > > the
> > > > > > > > > > > > actual
> > > > > > > > > > > > > > > > application instance." Is this really always
> > > true?
> > > > The
> > > > > > > > source
> > > > > > > > > > IP
> > > > > > > > > > > > and
> > > > > > > > > > > > > > port
> > > > > > > > > > > > > > > > might be a loadbalancer/proxy in some setups.
> > The
> > > > > > > > principal,
> > > > > > > > > as
> > > > > > > > > > > > > already
> > > > > > > > > > > > > > > > mentioned in the KIP, might be shared between
> > > > multiple
> > > > > > > > > > > > applications.
> > > > > > > > > > > > > > So at
> > > > > > > > > > > > > > > > worst the organization running the clients
> > might
> > > > have
> > > > > > to
> > > > > > > > > > consult
> > > > > > > > > > > > the
> > > > > > > > > > > > > > logs
> > > > > > > > > > > > > > > > of a set of client applications, right?
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Yes, that's correct. There's no guaranteed
> > mapping
> > > > from
> > > > > > > > > > > > > > client_instance_id
> > > > > > > > > > > > > > > to
> > > > > > > > > > > > > > > an actual instance, that's why the KIP
> recommends
> > > > client
> > > > > > > > > > > > > implementations
> > > > > > > > > > > > > > to
> > > > > > > > > > > > > > > log the client instance id
> > > > > > > > > > > > > > > upon retrieval, and also provide an API for the
> > > > > > application
> > > > > > > > to
> > > > > > > > > > > > retrieve
> > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > instance id programmatically
> > > > > > > > > > > > > > > if it has a better way of exposing it.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > 5. "Tests indicate that a compression ratio up to 10x is possible for the
> > > > > > > > > > > > > > > > standard metrics." Client authors might
> > > appreciate
> > > > your
> > > > > > > > > > mentioning
> > > > > > > > > > > > > > which
> > > > > > > > > > > > > > > > compression codec got these results.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Good point. Updated.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > 6. "Should the client send a push request
> prior
> > > to
> > > > > > expiry
> > > > > > > > of
> > > > > > > > > > the
> > > > > > > > > > > > > > previously
> > > > > > > > > > > > > > > > calculated PushIntervalMs the broker will
> > discard
> > > > the
> > > > > > > > metrics
> > > > > > > > > > and
> > > > > > > > > > > > > > return a
> > > > > > > > > > > > > > > > PushTelemetryResponse with the ErrorCode set
> to
> > > > > > > > RateLimited."
> > > > > > > > > > Is
> > > > > > > > > > > > this
> > > > > > > > > > > > > > > > RATE_LIMITED a new error code? It's not
> > mentioned
> > > > in
> > > > > > the
> > > > > > > > "New
> > > > > > > > > > Error
> > > > > > > > > > > > > > Codes"
> > > > > > > > > > > > > > > > section.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > That's a leftover, it should be using the
> > standard
> > > > > > > > ThrottleTime
> > > > > > > > > > > > > > mechanism.
> > > > > > > > > > > > > > > Fixed.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > 7. In the section "Standard client resource
> > > labels"
> > > > > > > > > > application_id
> > > > > > > > > > > > is
> > > > > > > > > > > > > > > > described as Kafka Streams only, but the
> > section
> > > of
> > > > > > > "Client
> > > > > > > > > > > > > > Identification"
> > > > > > > > > > > > > > > > talks about "application instance id as an
> > > optional
> > > > > > > future
> > > > > > > > > > > > > nice-to-have
> > > > > > > > > > > > > > > > that may be included as a metrics label if it
> > has
> > > > been
> > > > > > > set
> > > > > > > > by
> > > > > > > > > > the
> > > > > > > > > > > > > > user", so
> > > > > > > > > > > > > > > > I'm confused whether non-Kafka Streams
> clients
> > > > should
> > > > > > set
> > > > > > > > an
> > > > > > > > > > > > > > application_id
> > > > > > > > > > > > > > > > or not.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > I'll clarify this in the KIP, but basically we
> > > would
> > > > need
> > > > > > > to
> > > > > > > > > add
> > > > > > > > > > an `
> > > > > > > > > > > > > > > application.id` config
> > > > > > > > > > > > > > > property for non-streams clients for this
> > purpose,
> > > > and
> > > > > > > that's
> > > > > > > > > > outside
> > > > > > > > > > > > > the
> > > > > > > > > > > > > > > scope of this KIP since we want to make it
> > > > zero-conf:ish
> > > > > > on
> > > > > > > > the
> > > > > > > > > > > > client
> > > > > > > > > > > > > > side.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Kind regards,
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Tom
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Thanks for the review,
> > > > > > > > > > > > > > > Magnus
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > On Thu, Oct 7, 2021 at 5:26 PM Magnus
> Edenhill
> > <
> > > > > > > > > > magnus@edenhill.se
> > > > > > > > > > > > >
> > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Hi all,
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > I've updated the KIP following our recent
> > > > discussions
> > > > > > > on
> > > > > > > > > the
> > > > > > > > > > > > > mailing
> > > > > > > > > > > > > > > > list:
> > > > > > > > > > > > > > > > >  - split the protocol in two, one for
> getting
> > > the
> > > > > > > metrics
> > > > > > > > > > > > > > subscriptions,
> > > > > > > > > > > > > > > > > and one for pushing the metrics.
> > > > > > > > > > > > > > > > >  - simplifications: initially only one
> > > supported
> > > > > > > metrics
> > > > > > > > > > format,
> > > > > > > > > > > > no
> > > > > > > > > > > > > > > > > client.id in the instance id, etc.
> > > > > > > > > > > > > > > > >  - made CLIENT_METRICS subscription
> > > configuration
> > > > > > > entries
> > > > > > > > > > more
> > > > > > > > > > > > > > structured
> > > > > > > > > > > > > > > > >    and allowing better client matching
> > > selectors
> > > > (not
> > > > > > > > only
> > > > > > > > > > on the
> > > > > > > > > > > > > > > > instance
> > > > > > > > > > > > > > > > > id, but also the other
> > > > > > > > > > > > > > > > >    client resource labels, such as
> > > > > > > client_software_name,
> > > > > > > > > > etc.).
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Unless there are further comments I'll call
> > the
> > > > vote
> > > > > > > in a
> > > > > > > > > > day or
> > > > > > > > > > > > > two.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Regards,
> > > > > > > > > > > > > > > > > Magnus
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > On Mon, 4 Oct 2021 at 20:57, Magnus Edenhill <magnus@edenhill.se> wrote:
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Hi Gwen,
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > I'm finishing up the KIP based on the
> last
> > > > couple
> > > > > > of
> > > > > > > > > > discussion
> > > > > > > > > > > > > > points
> > > > > > > > > > > > > > > > in
> > > > > > > > > > > > > > > > > > this thread
> > > > > > > > > > > > > > > > > > and will call the Vote later this week.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Best,
> > > > > > > > > > > > > > > > > > Magnus
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > On Sat, 2 Oct 2021 at 02:01, Gwen Shapira <gwen@confluent.io.invalid> wrote:
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > >> Hey,
> > > > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > > > >> I noticed that there was no discussion
> for
> > > the
> > > > > > last
> > > > > > > 10
> > > > > > > > > > days,
> > > > > > > > > > > > > but I
> > > > > > > > > > > > > > > > > >> couldn't
> > > > > > > > > > > > > > > > > >> find the vote thread. Is there one that
> > I'm
> > > > > > missing?
> > > > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > > > >> Gwen
> > > > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > > > >> On Wed, Sep 22, 2021 at 4:58 AM Magnus
> > > > Edenhill <
> > > > > > > > > > > > > > magnus@edenhill.se>
> > > > > > > > > > > > > > > > > >> wrote:
> > > > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > > > > >> > On Tue, 21 Sep 2021 at 06:58, Colin McCabe <cmccabe@apache.org> wrote:
> > > > > > > > > > > > > > > > > >> >
> > > > > > > > > > > > > > > > > >> > > On Mon, Sep 20, 2021, at 17:35, Feng
> > Min
> > > > > > wrote:
> > > > > > > > > > > > > > > > > >> > > > Thanks Magnus & Colin for the
> > > > discussion.
> > > > > > > > > > > > > > > > > >> > > >
> > > > > > > > > > > > > > > > > >> > > > Based on KIP-714's stateless
> design,
> > > > Client
> > > > > > > can
> > > > > > > > > > pretty
> > > > > > > > > > > > > much
> > > > > > > > > > > > > > use
> > > > > > > > > > > > > > > > > any
> > > > > > > > > > > > > > > > > >> > > > connection to any broker to send
> > > > metrics. We
> > > > > > > are
> > > > > > > > > not
> > > > > > > > > > > > > > associating
> > > > > > > > > > > > > > > > > >> > > connection
> > > > > > > > > > > > > > > > > >> > > > with client metric state. Is my
> > > > > > understanding
> > > > > > > > > > correct?
> > > > > > > > > > > > If
> > > > > > > > > > > > > > yes,
> > > > > > > > > > > > > > > > > how
> > > > > > > > > > > > > > > > > >> > about
> > > > > > > > > > > > > > > > > >> > > > the following two scenarios
> > > > > > > > > > > > > > > > > >> > > >
> > > > > > > > > > > > > > > > > >> > > > 1) One Client (Client-ID)
> registers
> > > two
> > > > > > > > different
> > > > > > > > > > client
> > > > > > > > > > > > > > > > instance
> > > > > > > > > > > > > > > > > id
> > > > > > > > > > > > > > > > > >> > via
> > > > > > > > > > > > > > > > > >> > > > separate registration. Is it
> > > permitted?
> > > > If
> > > > > > OK,
> > > > > > > > how
> > > > > > > > > > to
> > > > > > > > > > > > > > > > distinguish
> > > > > > > > > > > > > > > > > >> them
> > > > > > > > > > > > > > > > > >> > > from
> > > > > > > > > > > > > > > > > >> > > > the case 2 below.
> > > > > > > > > > > > > > > > > >> > > >
> > > > > > > > > > > > > > > > > >> > >
> > > > > > > > > > > > > > > > > >> > > Hi Feng,
> > > > > > > > > > > > > > > > > >> > >
> > > > > > > > > > > > > > > > > >> > > My understanding, which Magnus can
> > > > clarify I
> > > > > > > > guess,
> > > > > > > > > is
> > > > > > > > > > > > that
> > > > > > > > > > > > > > you
> > > > > > > > > > > > > > > > > could
> > > > > > > > > > > > > > > > > >> > have
> > > > > > > > > > > > > > > > > >> > > something like two Producer
> instances
> > > > running
> > > > > > > with
> > > > > > > > > the
> > > > > > > > > > > > same
> > > > > > > > > > > > > > > > > client.id
> > > > > > > > > > > > > > > > > >> > > (perhaps because they're using the
> > same
> > > > config
> > > > > > > > file,
> > > > > > > > > > for
> > > > > > > > > > > > > > example).
> > > > > > > > > > > > > > > > > >> They
> > > > > > > > > > > > > > > > > >> > > could even be in the same process.
> But
> > > > they
> > > > > > > would
> > > > > > > > > get
> > > > > > > > > > > > > separate
> > > > > > > > > > > > > > > > > UUIDs.
> > > > > > > > > > > > > > > > > >> > >
> > > > > > > > > > > > > > > > > >> > > I believe Magnus used the term
> client
> > to
> > > > mean
> > > > > > > > > > "Producer or
> > > > > > > > > > > > > > > > > Consumer".
> > > > > > > > > > > > > > > > > >> So
> > > > > > > > > > > > > > > > > >> > > if you have both a Producer and a
> > > > Consumer in
> > > > > > > your
> > > > > > > > > > > > > > application I
> > > > > > > > > > > > > > > > > would
> > > > > > > > > > > > > > > > > >> > > expect you'd get separate UUIDs for
> > > both.
> > > > > > Again
> > > > > > > > > > Magnus can
> > > > > > > > > > > > > > chime
> > > > > > > > > > > > > > > > in
> > > > > > > > > > > > > > > > > >> > here, I
> > > > > > > > > > > > > > > > > >> > > guess.
> > > > > > > > > > > > > > > > > >> > >
> > > > > > > > > > > > > > > > > >> >
> > > > > > > > > > > > > > > > > >> > That's correct.
> > > > > > > > > > > > > > > > > >> >
> > > > > > > > > > > > > > > > > >> >
> > > > > > > > > > > > > > > > > >> > >
> > > > > > > > > > > > > > > > > >> > > > 2) How about the client
> restarting?
> > > > What's
> > > > > > the
> > > > > > > > > > > > > expectation?
> > > > > > > > > > > > > > > > Should
> > > > > > > > > > > > > > > > > >> the
> > > > > > > > > > > > > > > > > >> > > > server expect the client to carry
> a
> > > > > > persisted
> > > > > > > > > client
> > > > > > > > > > > > > > instance id
> > > > > > > > > > > > > > > > > or
> > > > > > > > > > > > > > > > > >> > > should
> > > > > > > > > > > > > > > > > >> > > > the client be treated as a new
> > > instance?
> > > > > > > > > > > > > > > > > >> > >
> > > > > > > > > > > > > > > > > >> > > The KIP doesn't describe any
> mechanism
> > > for
> > > > > > > > > > persistence,
> > > > > > > > > > > > so I
> > > > > > > > > > > > > > would
> > > > > > > > > > > > > > > > > >> assume
> > > > > > > > > > > > > > > > > >> > > that when you restart the client you
> > get
> > > > a new
> > > > > > > > > UUID. I
> > > > > > > > > > > > agree
> > > > > > > > > > > > > > that
> > > > > > > > > > > > > > > > it
> > > > > > > > > > > > > > > > > >> > would
> > > > > > > > > > > > > > > > > >> > > be good to spell this out.
> > > > > > > > > > > > > > > > > >> > >
> > > > > > > > > > > > > > > > > >> > >
> > > > > > > > > > > > > > > > > >> > Right, it will not be persisted since
> a
> > > > client
> > > > > > > > > instance
> > > > > > > > > > > > can't
> > > > > > > > > > > > > be
> > > > > > > > > > > > > > > > > >> restarted.
> > > > > > > > > > > > > > > > > >> >
> > > > > > > > > > > > > > > > > >> > Will update the KIP to make this
> > clearer.
> > > > > > > > > > > > > > > > > >> >
> > > > > > > > > > > > > > > > > >> > /Magnus
> > > > > > > > > > > > > > > > > >> >
> > > > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > > > >> --
> > > > > > > > > > > > > > > > > >> Gwen Shapira
> > > > > > > > > > > > > > > > > >> Engineering Manager | Confluent
> > > > > > > > > > > > > > > > > >> 650.450.2760 | @gwenshap
> > > > > > > > > > > > > > > > > >> Follow us: Twitter | blog
> > > > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] KIP-714: Client metrics and observability

Posted by Jun Rao <ju...@confluent.io.INVALID>.

Hi, Magnus,

Thanks for updating the KIP. The overall approach makes sense to me. A few
more detailed comments below.

20. ClientTelemetry: Should it extend Configurable and Closeable?

21. Compression of the metrics on the client: what's the default?

22. "Client metrics plugin / extending the MetricsReporter interface":
ClientTelemetry doesn't seem to extend MetricsReporter.

23. A client instance is considered a metric resource and the
resource-level (thus client instance level) labels could include:
    client_software_name=confluent-kafka-python
    client_software_version=v2.1.3
    client_instance_id=B64CD139-3975-440A-91D4
    transactional_id=someTxnApp
Are those labels added in PushTelemetryRequest? If so, are they per metric
or per request?

24. "the broker will only send
GetTelemetrySubscriptionsResponse.DeltaTemporality=True":
24.1 If it's always true, does it need to be part of the protocol?
24.2 Does delta only apply to Counter type?
24.3 In the delta representation, the first request needs to send the full
value, how does the broker plugin know whether a value is full or delta?

25. quota:
25.1 Since we are fitting PushTelemetryRequest into the existing request
quota, it would be useful to document the impact, i.e. client metric
throttling causes the data from the same client to be delayed.
25.2 Is PushTelemetryRequest subject to the write bandwidth quota like the
producer?
25.3 THROTTLING_QUOTA_EXCEEDED: Currently, we don't send this error when
the request/bandwidth quota is exceeded since those requests are not
rejected. We only set this error when the request is rejected (e.g., topic
creation). It would be useful to clarify when this error is used.

26. client-metrics entity:
26.1 It seems that we could add multiple entities that match the same
client. Which one takes precedence?
26.2 How do we persist the new client metrics entities? Do we need to add
new ZK paths and new records in KRaft?

27. kafka-client-metrics.sh: Could we add an example on how to disable a
bad client?

28. New broker side metrics: Could we spell out the details of the metrics
(e.g., group, tags, etc)?

29. Client instance-level metrics: client.io.wait.time is a gauge not a
histogram.
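
To make question 23 concrete, here is a minimal sketch of the "per request"
reading, assuming the resource labels are carried once per push and combined
with each metric's own labels on the receiving side. The class and method
names are hypothetical and are not part of KIP-714 or any Kafka API; the
request_type label is the one discussed elsewhere in this thread.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical model for illustration only; these are not KIP-714 or Kafka types.
final class ResourceLabelSketch {

    // Merges request-level (resource) labels with one metric's own labels.
    // In this sketch a metric-level label wins if both maps use the same key.
    static Map<String, String> effectiveLabels(Map<String, String> resourceLabels,
                                               Map<String, String> metricLabels) {
        Map<String, String> merged = new LinkedHashMap<>(resourceLabels);
        merged.putAll(metricLabels);
        return merged;
    }

    public static void main(String[] args) {
        // Resource labels: stated once for the whole push.
        Map<String, String> resource = new LinkedHashMap<>();
        resource.put("client_software_name", "confluent-kafka-python");
        resource.put("client_software_version", "v2.1.3");

        // Labels that belong to a single metric data point.
        Map<String, String> metric = new LinkedHashMap<>();
        metric.put("request_type", "Produce");

        System.out.println(effectiveLabels(resource, metric));
    }
}
```

Under the "per metric" reading, every data point would instead repeat the
resource labels, which costs more bytes before compression.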

Thanks,

Jun
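
As an illustration of question 24.3, the following is a minimal sketch of how
a receiver could rebuild running totals from delta values. It assumes, and
this is only an assumption, that every counter starts at zero when the client
instance starts, so the first pushed value is at once the full value and the
first delta, and no extra flag is needed. DeltaCounterAccumulator is a
hypothetical helper, not something defined by the KIP.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper for illustration; not part of KIP-714 or any Kafka API.
final class DeltaCounterAccumulator {

    // Running totals keyed by "clientInstanceId/metricName".
    private final Map<String, Long> totals = new HashMap<>();

    // Adds one delta data point and returns the cumulative total so far.
    long apply(String clientInstanceId, String metricName, long delta) {
        return totals.merge(clientInstanceId + "/" + metricName, delta, Long::sum);
    }

    public static void main(String[] args) {
        DeltaCounterAccumulator acc = new DeltaCounterAccumulator();
        // Assumed zero baseline: the first push needs no special handling.
        System.out.println(acc.apply("client-a", "client.request.success", 5));
        System.out.println(acc.apply("client-a", "client.request.success", 3));
    }
}
```

If counters did not start from a known baseline, the plugin would need some
other signal, which is what the question is asking about.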

On Wed, Jan 26, 2022 at 6:32 AM Magnus Edenhill <ma...@edenhill.se> wrote:

> Hi all,
>
> I've updated the KIP with responses to the latest comments: Java client
> dependencies (Thanks Kirk!), alternate designs (separate cluster, separate
> producer, etc), etc.
>
> I will revive the vote thread.
>
> Thanks,
> Magnus
>
>
> On Mon, 13 Dec 2021 at 22:32, Ryanne Dolan <ry...@gmail.com> wrote:
>
> > I think we should be very careful about introducing new runtime
> > dependencies into the clients. Historically this has been rare and
> > essentially necessary (e.g. compression libs).
> >
> > Ryanne
> >
> > On Mon, Dec 13, 2021, 1:06 PM Kirk True <ki...@mustardgrain.com> wrote:
> >
> > > Hi Jun,
> > >
> > > On Thu, Dec 9, 2021, at 2:28 PM, Jun Rao wrote:
> > > > 13. Using OpenTelemetry. Does that require runtime dependency
> > > > on OpenTelemetry library? How good is the compatibility story
> > > > of OpenTelemetry? This is important since an application could have
> > other
> > > > OpenTelemetry dependencies than the Kafka client.
> > >
> > > The current design is that the OpenTelemetry JARs would ship with the
> > > client. Perhaps we can design the client such that the JARs aren't even
> > > loaded if the user has opted out. The user could even exclude the JARs
> > from
> > > their dependencies if they so wished.
> > >
> > > I can't speak to the compatibility of the libraries. Is it possible
> that
> > > we include a shaded version?
> > >
> > > Thanks,
> > > Kirk
> > >
> > > >
> > > > 14. The proposal listed idempotence=true. This is more of a
> > configuration
> > > > than a metric. Are we including that as a metric? What other
> > > configurations
> > > > are we including? Should we separate the configurations from the
> > metrics?
> > > >
> > > > Thanks,
> > > >
> > > > Jun
> > > >
> > > > On Mon, Nov 29, 2021 at 7:34 AM Magnus Edenhill <ma...@edenhill.se>
> > > wrote:
> > > >
> > > > > Hey Bob,
> > > > >
> > > > > That's a good point.
> > > > >
> > > > > Request type labels were considered but since they're already
> tracked
> > > by
> > > > > broker-side metrics
> > > > > > they were left out so as to avoid metric duplication; however, those
> > > metrics
> > > > > are not per connection,
> > > > > so they won't be that useful in practice for troubleshooting
> specific
> > > > > client instances.
> > > > >
> > > > > I'll add the request_type label to the relevant metrics.
> > > > >
> > > > > Thanks,
> > > > > Magnus
> > > > >
> > > > >
> > > > > > On Tue, 23 Nov 2021 at 19:20, Bob Barrett
> > > > > > <bo...@confluent.io.invalid> wrote:
> > > > >
> > > > > > Hi Magnus,
> > > > > >
> > > > > > Thanks for the thorough KIP, this seems very useful.
> > > > > >
> > > > > > Would it make sense to include the request type as a label for
> the
> > > > > > `client.request.success`, `client.request.errors` and
> > > > > `client.request.rtt`
> > > > > > metrics? I think it would be very useful to see which specific
> > > requests
> > > > > are
> > > > > > succeeding and failing for a client. One specific case I can
> think
> > of
> > > > > where
> > > > > > this could be useful is producer batch timeouts. If a Java
> > > application
> > > > > does
> > > > > > not enable producer client logs (unfortunately, in my experience
> > this
> > > > > > happens more often than it should), the application logs will
> only
> > > > > contain
> > > > > > the expiration error message, but no information about what is
> > > causing
> > > > > the
> > > > > > timeout. The requests might all be succeeding but taking too long
> > to
> > > > > > process batches, or metadata requests might be failing, or some
> or
> > > all
> > > > > > produce requests might be failing (if the bootstrap servers are
> > > reachable
> > > > > > from the client but one or more other brokers are not, for
> > example).
> > > If
> > > > > the
> > > > > > cluster operator is able to identify the specific requests that
> are
> > > slow
> > > > > or
> > > > > > failing for a client, they will be better able to diagnose the
> > issue
> > > > > > causing batch timeouts.
> > > > > >
> > > > > > One drawback I can think of is that this will increase the
> > > cardinality of
> > > > > > the request metrics. But any given client is only going to use a
> > > small
> > > > > > subset of the request types, and since we already have partition
> > > labels
> > > > > for
> > > > > > the topic-level metrics, I think request labels will still make
> up
> > a
> > > > > > relatively small percentage of the set of metrics.
> > > > > >
> > > > > > Thanks,
> > > > > > Bob
> > > > > >
> > > > > > On Mon, Nov 22, 2021 at 2:08 AM Viktor Somogyi-Vass <
> > > > > > viktorsomogyi@gmail.com>
> > > > > > wrote:
> > > > > >
> > > > > > > Hi Magnus,
> > > > > > >
> > > > > > > I think this is a very useful addition. We also have a similar
> > (but
> > > > > much
> > > > > > > more simplistic) implementation of this. Maybe I missed it in
> the
> > > KIP
> > > > > but
> > > > > > > what about adding metrics about the subscription cache itself?
> > > That I
> > > > > > think
> > > > > > > would improve its usability and debuggability as we'd be able
> to
> > > see
> > > > > its
> > > > > > > performance, hit/miss rates, eviction counts and others.
> > > > > > >
> > > > > > > Best,
> > > > > > > Viktor
> > > > > > >
> > > > > > > On Thu, Nov 18, 2021 at 5:12 PM Magnus Edenhill <
> > > magnus@edenhill.se>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Hi Mickael,
> > > > > > > >
> > > > > > > > see inline.
> > > > > > > >
> > > > > > > > > On Wed, 10 Nov 2021 at 15:21, Mickael Maison <mickael.maison@gmail.com> wrote:
> > > > > > > >
> > > > > > > > > Hi Magnus,
> > > > > > > > >
> > > > > > > > > I see you've addressed some of the points I raised above
> but
> > > some
> > > > > (4,
> > > > > > > > > 5) have not been addressed yet.
> > > > > > > > >
> > > > > > > >
> > > > > > > > Re 4) How will the user/app know metrics are being sent.
> > > > > > > >
> > > > > > > > One possibility is to add a JMX metric (thus for user
> > > consumption)
> > > > > for
> > > > > > > the
> > > > > > > > number of metric pushes the
> > > > > > > > client has performed, or perhaps the number of metrics
> > > subscriptions
> > > > > > > > currently being collected.
> > > > > > > > Would that be sufficient?
> > > > > > > >
> > > > > > > > Re 5) Metric sizes and rates
> > > > > > > >
> > > > > > > > A worst case scenario for a producer that is producing to 50
> > > unique
> > > > > > > topics
> > > > > > > > and emitting all standard metrics yields
> > > > > > > > a serialized size of around 100KB prior to compression, which
> > > > > > compresses
> > > > > > > > down to about 20-30% of that depending
> > > > > > > > on compression type and topic name uniqueness.
> > > > > > > > The numbers for a consumer would be similar.
> > > > > > > >
> > > > > > > > In practice the number of unique topics would be far less,
> and
> > > the
> > > > > > > > subscription set would typically be for a subset of metrics.
> > > > > > > > So we're probably closer to 1kb, or less, compressed size per
> > > client
> > > > > > per
> > > > > > > > push interval.
> > > > > > > >
> > > > > > > > As both the subscription set and push intervals are
> controlled
> > > by the
> > > > > > > > cluster operator it shouldn't be too hard
> > > > > > > > to strike a good balance between metrics overhead and
> > > granularity.
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > > I'm really uneasy with this being enabled by default on the
> > > client
> > > > > > > > > side. When collecting data, I think the best practice is to
> > > ensure
> > > > > > > > > users are explicitly enabling it.
> > > > > > > > >
> > > > > > > >
> > > > > > > > Requiring metrics to be explicitly enabled on clients
> severely
> > > > > cripples
> > > > > > > its
> > > > > > > > usability and value.
> > > > > > > >
> > > > > > > > One of the problems that this KIP aims to solve is for useful
> > > metrics
> > > > > > to
> > > > > > > be
> > > > > > > > available on demand
> > > > > > > > > regardless of the technical expertise of the user. As Ryanne points out, a
> > > > > > > > savvy user/organization
> > > > > > > > will typically have metrics collection and monitoring in
> place
> > > > > already,
> > > > > > > and
> > > > > > > > the benefits of this KIP
> > > > > > > > > are then more about a common set and format of metrics across client
> > > > > > > > > implementations and languages.
> > > > > > > > But that is not the typical Kafka user in my experience,
> > they're
> > > not
> > > > > > > Kafka
> > > > > > > > experts and they don't have the
> > > > > > > > knowledge of how to best instrument their clients.
> > > > > > > > Having metrics enabled by default for this user base allows
> the
> > > Kafka
> > > > > > > > operators to proactively and reactively
> > > > > > > > monitor and troubleshoot client issues, without the need for
> > the
> > > less
> > > > > > > savvy
> > > > > > > > user to do anything.
> > > > > > > > It is often too late to tell a user to enable metrics when
> the
> > > > > problem
> > > > > > > has
> > > > > > > > already occurred.
> > > > > > > >
> > > > > > > > > Now, to be clear, even though metrics are enabled by default on clients,
> > > > > > > > > they are not enabled by default
> > > > > > > > on the brokers; the Kafka operator needs to build and set up
> a
> > > > > metrics
> > > > > > > > plugin and add metrics subscriptions
> > > > > > > > before anything is sent from the client.
> > > > > > > > It is opt-out on the clients and opt-in on the broker.
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > > You mentioned brokers already have
> > > > > > > > > some(most?) of the information contained in metrics, if so
> > > then why
> > > > > > > > > are we collecting it again? Surely there must be some new
> > > > > information
> > > > > > > > > in the client metrics.
> > > > > > > > >
> > > > > > > >
> > > > > > > > > From the user's perspective the Kafka infrastructure extends from
> > > > > > > > > producer.send() to messages being returned from consumer.poll(), a
> > > > > > > > > giant black box where there's a lot going on between those two
> > > > > > > > > points. The brokers currently only see what happens once those
> > > > > > > > > requests and messages hit the broker, but as Kafka clients are
> > > > > > > > > complex pieces of machinery there's a myriad of queues, timers, and
> > > > > > > > > state that's critical to the operation and infrastructure and that's
> > > > > > > > > not currently visible to the operator.
> > > > > > > > > Relying on the user to provide this missing information accurately
> > > > > > > > > and in a timely manner is not generally feasible.
> > > > > > > >
> > > > > > > >
> > > > > > > > > Most of the standard metrics listed in the KIP are data points that
> > > > > > > > > the broker does not have.
> > > > > > > > > Only a small number of metrics are duplicates (like the request
> > > > > > > > > counts and sizes), but they are included to ease correlation when
> > > > > > > > > inspecting these client metrics.
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > > > Moreover, this is a brand new feature so it's even harder to justify
> > > > > > > > > > enabling it and forcing it onto all our users. If disabled by
> > > > > > > > > > default, it's relatively easy to enable in a new release if we
> > > > > > > > > > decide to, but once enabled by default it's much harder to disable.
> > > > > > > > > > Also, this feature will apply to all future metrics we will add.
> > > > > > > > >
> > > > > > > >
> > > > > > > > > I think maturity of a feature implementation should be the deciding
> > > > > > > > > factor, rather than the design of it (which this KIP is). I.e., if
> > > > > > > > > the implementation is not deemed mature enough for release X.Y it
> > > > > > > > > will be disabled.
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > > > Overall I think it's an interesting feature but I'd prefer to be
> > > > > > > > > > slightly defensive and see how it works in practice before enabling
> > > > > > > > > > it everywhere.
> > > > > > > > >
> > > > > > > >
> > > > > > > > > Right, and I agree on being defensive, but since this feature still
> > > > > > > > > requires manual enabling on the brokers before actually being used,
> > > > > > > > > I think that gives enough control to opt in to or out of this
> > > > > > > > > feature as needed.
> > > > > > > >
> > > > > > > > Thanks for your comments!
> > > > > > > >
> > > > > > > > Regards,
> > > > > > > > Magnus
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > > Mickael
> > > > > > > > >
> > > > > > > > > > On Mon, Nov 8, 2021 at 7:55 AM Magnus Edenhill <magnus@edenhill.se> wrote:
> > > > > > > > > >
> > > > > > > > > > > Thanks David for pointing this out,
> > > > > > > > > > > I've updated the KIP to include client_id as a matching selector.
> > > > > > > > > >
> > > > > > > > > > Regards,
> > > > > > > > > > Magnus
> > > > > > > > > >
> > > > > > > > > > > On Thu, 4 Nov 2021 at 18:01, David Mao <dmao@confluent.io.invalid> wrote:
> > > > > > > > > >
> > > > > > > > > > > Hey Magnus,
> > > > > > > > > > >
> > > > > > > > > > > > I noticed that the KIP outlines the initial selectors supported as:
> > > > > > > > > > > >
> > > > > > > > > > > >    - client_instance_id - CLIENT_INSTANCE_ID UUID string representation.
> > > > > > > > > > > >    - client_software_name - client software implementation name.
> > > > > > > > > > > >    - client_software_version - client software implementation version.
> > > > > > > > > > > >
> > > > > > > > > > > > In the given reactive monitoring workflow, we mention that the
> > > > > > > > > > > > application user does not know their client's client instance ID,
> > > > > > > > > > > > but it's outlined that the operator can add a metrics subscription
> > > > > > > > > > > > selecting for clientId. I don't see clientId as one of the
> > > > > > > > > > > > supported selectors.
> > > > > > > > > > > > I can see how this would have made sense in a previous iteration
> > > > > > > > > > > > given that the previous client instance ID proposal was to
> > > > > > > > > > > > construct the client instance ID using clientId as a prefix. Now
> > > > > > > > > > > > that the client instance ID is a UUID, would we want to add
> > > > > > > > > > > > clientId as a supported selector?
> > > > > > > > > > > > Let me know what you think.
> > > > > > > > > > >
> > > > > > > > > > > David
> > > > > > > > > > >
> > > > > > > > > > > > On Tue, Oct 19, 2021 at 12:33 PM Magnus Edenhill <magnus@edenhill.se> wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Hi Mickael!
> > > > > > > > > > > >
> > > > > > > > > > > > > On Tue, 19 Oct 2021 at 19:30, Mickael Maison <mickael.maison@gmail.com> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > Hi Magnus,
> > > > > > > > > > > > >
> > > > > > > > > > > > > Thanks for the proposal.
> > > > > > > > > > > > >
> > > > > > > > > > > > > > 1. Looking at the protocol section, isn't "ClientInstanceId"
> > > > > > > > > > > > > > expected to be a field in GetTelemetrySubscriptionsResponseV0?
> > > > > > > > > > > > > > Otherwise, how does a client retrieve this value?
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > > Good catch, it got removed by mistake in one of the edits.
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > > 2. In the client API section, you mention a new method
> > > > > > > > > > > > > > "clientInstanceId()". Can you clarify which interfaces are
> > > > > > > > > > > > > > affected? Is it only Consumer and Producer?
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > And Admin. Will update the KIP.
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > > > 3. I'm a bit concerned this is enabled by default. Even if the
> > > > > > > > > > > > > > data collected is supposed to be not sensitive, I think this can
> > > > > > > > > > > > > > be problematic in some environments. Also, users don't seem to
> > > > > > > > > > > > > > have the choice to only expose some metrics. Knowing how much
> > > > > > > > > > > > > > data transits through some applications can be considered
> > > > > > > > > > > > > > critical.
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > > The broker already knows how much data transits through the client
> > > > > > > > > > > > > though, right?
> > > > > > > > > > > > > Care has been taken not to expose information in the standard
> > > > > > > > > > > > > metrics that might reveal sensitive information.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Do you have an example of how the proposed metrics could leak
> > > > > > > > > > > > > sensitive information?
> > > > > > > > > > > > > As for limiting which metrics to export, I guess that could make
> > > > > > > > > > > > > sense in some very sensitive use-cases, but those users might
> > > > > > > > > > > > > disable metrics altogether for now.
> > > > > > > > > > > > > Could these concerns be addressed by a later KIP?
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > > 4. As a user, how do you know if your application is actively
> > > > > > > > > > > > > > sending metrics? Are there new metrics exposing what's going on,
> > > > > > > > > > > > > > like how much data is being sent?
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > That's a good question.
> > > > > > > > > > > > > Since the proposed metrics interface is not aimed at, or directly
> > > > > > > > > > > > > available to, the application, I guess there's little point in
> > > > > > > > > > > > > adding it here; should we instead add something to the existing
> > > > > > > > > > > > > JMX metrics?
> > > > > > > > > > > > > Do you have any suggestions?
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > > > 5. If all metrics are enabled on a regular Consumer or Producer,
> > > > > > > > > > > > > > do you have an idea how much throughput this would use?
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > > It depends on the number of partitions/topics/etc. the client is
> > > > > > > > > > > > > producing to/consuming from.
> > > > > > > > > > > > > I'll add some sizes to the KIP for some typical use-cases.
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > Thanks,
> > > > > > > > > > > > Magnus
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > > Thanks
> > > > > > > > > > > > >
> > > > > > > > > > > > > > On Tue, Oct 19, 2021 at 5:06 PM Magnus Edenhill <magnus@edenhill.se> wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > On Tue, 19 Oct 2021 at 13:22, Tom Bentley <tbentley@redhat.com> wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Hi Magnus,
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > I reviewed the KIP since you called the vote (sorry for not
> > > > > > > > > > > > > > > > reviewing when you announced your intention to call the vote).
> > > > > > > > > > > > > > > > I have a few questions on some of the details.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > 1. There's no Javadoc on ClientTelemetryPayload.data(), so I
> > > > > > > > > > > > > > > > don't know whether the payload is exposed through this method as
> > > > > > > > > > > > > > > > compressed or not. Later on you say "Decompression of the
> > > > > > > > > > > > > > > > payloads will be handled by the broker metrics plugin, the
> > > > > > > > > > > > > > > > broker should expose a suitable decompression API to the metrics
> > > > > > > > > > > > > > > > plugin for this purpose.", which suggests it's the compressed
> > > > > > > > > > > > > > > > data in the buffer, but then we don't know which codec was used,
> > > > > > > > > > > > > > > > nor the API via which the plugin should decompress it if
> > > > > > > > > > > > > > > > required for forwarding to the ultimate metrics store. Should
> > > > > > > > > > > > > > > > the ClientTelemetryPayload expose a method to get the
> > > > > > > > > > > > > > > > compression and a decompressor?
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Good point, updated.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > 2. The client-side API is expressed as StringOrError
> > > > > > > > > > > > > > > > ClientInstance::ClientInstanceId(int timeout_ms). I understand
> > > > > > > > > > > > > > > > that you're thinking about the librdkafka implementation, but it
> > > > > > > > > > > > > > > > would be good to show the API as it would appear on the Apache
> > > > > > > > > > > > > > > > Kafka clients.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > This was meant as pseudo-code, but I changed it to Java.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > 3. "PushTelemetryRequest|Response - protocol request used by the
> > > > > > > > > > > > > > > > client to send metrics to any broker it is connected to." To be
> > > > > > > > > > > > > > > > clear, this means that the client can choose any of the
> > > > > > > > > > > > > > > > connected brokers and push to just one of them? What should a
> > > > > > > > > > > > > > > > supporting client do if it gets an error when pushing metrics to
> > > > > > > > > > > > > > > > a broker: retry sending to the same broker, try pushing to
> > > > > > > > > > > > > > > > another broker, or drop the metrics? Should supporting clients
> > > > > > > > > > > > > > > > send successive requests to a single broker, or round robin, or
> > > > > > > > > > > > > > > > is that up to the client author? I'm guessing the behaviour
> > > > > > > > > > > > > > > > should be sticky to support the rate limiting features, but I
> > > > > > > > > > > > > > > > think it would be good for client authors if this section were
> > > > > > > > > > > > > > > > explicit on the recommended behaviour.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > You are right, I've updated the KIP to make this clearer.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > 4. "Mapping the client instance id to an actual application
> > > > > > > > > > > > > > > > instance running on a (virtual) machine can be done by
> > > > > > > > > > > > > > > > inspecting the metrics resource labels, such as the client
> > > > > > > > > > > > > > > > source address and source port, or security principal, all of
> > > > > > > > > > > > > > > > which are added by the receiving broker. This will allow the
> > > > > > > > > > > > > > > > operator together with the user to identify the actual
> > > > > > > > > > > > > > > > application instance." Is this really always true? The source IP
> > > > > > > > > > > > > > > > and port might be a loadbalancer/proxy in some setups. The
> > > > > > > > > > > > > > > > principal, as already mentioned in the KIP, might be shared
> > > > > > > > > > > > > > > > between multiple applications. So at worst the organization
> > > > > > > > > > > > > > > > running the clients might have to consult the logs of a set of
> > > > > > > > > > > > > > > > client applications, right?
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Yes, that's correct. There's no guaranteed mapping from
> > > > > > > > > > > > > > > client_instance_id to an actual instance; that's why the KIP
> > > > > > > > > > > > > > > recommends that client implementations log the client instance id
> > > > > > > > > > > > > > > upon retrieval, and also provide an API for the application to
> > > > > > > > > > > > > > > retrieve the instance id programmatically if it has a better way
> > > > > > > > > > > > > > > of exposing it.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > 5. "Tests indicate that a compression ratio up to 10x is
> > > > > > > > > > > > > > > > possible for the standard metrics." Client authors might
> > > > > > > > > > > > > > > > appreciate your mentioning which compression codec got these
> > > > > > > > > > > > > > > > results.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Good point. Updated.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > 6. "Should the client send a push request prior to expiry of the
> > > > > > > > > > > > > > > > previously calculated PushIntervalMs the broker will discard the
> > > > > > > > > > > > > > > > metrics and return a PushTelemetryResponse with the ErrorCode
> > > > > > > > > > > > > > > > set to RateLimited." Is this RATE_LIMITED a new error code? It's
> > > > > > > > > > > > > > > > not mentioned in the "New Error Codes" section.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > That's a leftover, it should be using the standard ThrottleTime
> > > > > > > > > > > > > > > mechanism.
> > > > > > > > > > > > > > > Fixed.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > 7. In the section "Standard client resource labels"
> > > > > > > > > > > > > > > > application_id is described as Kafka Streams only, but the
> > > > > > > > > > > > > > > > section of "Client Identification" talks about "application
> > > > > > > > > > > > > > > > instance id as an optional future nice-to-have that may be
> > > > > > > > > > > > > > > > included as a metrics label if it has been set by the user", so
> > > > > > > > > > > > > > > > I'm confused whether non-Kafka Streams clients should set an
> > > > > > > > > > > > > > > > application_id or not.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > I'll clarify this in the KIP, but basically we would need to add
> > > > > > > > > > > > > > > an `application.id` config property for non-streams clients for
> > > > > > > > > > > > > > > this purpose, and that's outside the scope of this KIP since we
> > > > > > > > > > > > > > > want to make it zero-conf:ish on the client side.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Kind regards,
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Tom
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Thanks for the review,
> > > > > > > > > > > > > > Magnus
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > On Thu, Oct 7, 2021 at 5:26 PM Magnus Edenhill <magnus@edenhill.se> wrote:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Hi all,
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > I've updated the KIP following our recent discussions on the
> > > > > > > > > > > > > > > > > mailing list:
> > > > > > > > > > > > > > > > >  - split the protocol in two, one for getting the metrics
> > > > > > > > > > > > > > > > >    subscriptions, and one for pushing the metrics.
> > > > > > > > > > > > > > > > >  - simplifications: initially only one supported metrics format,
> > > > > > > > > > > > > > > > >    no client.id in the instance id, etc.
> > > > > > > > > > > > > > > > >  - made CLIENT_METRICS subscription configuration entries more
> > > > > > > > > > > > > > > > >    structured and allowing better client matching selectors (not
> > > > > > > > > > > > > > > > >    only on the instance id, but also the other client resource
> > > > > > > > > > > > > > > > >    labels, such as client_software_name, etc.).
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Unless there are further comments I'll call the vote in a day or
> > > > > > > > > > > > > > > > > two.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Regards,
> > > > > > > > > > > > > > > > Magnus
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > On Mon, 4 Oct 2021 at 20:57, Magnus Edenhill <magnus@edenhill.se> wrote:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Hi Gwen,
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > I'm finishing up the KIP based on the last couple of discussion
> > > > > > > > > > > > > > > > > > points in this thread and will call the Vote later this week.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Best,
> > > > > > > > > > > > > > > > > Magnus
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > On Sat, 2 Oct 2021 at 02:01, Gwen Shapira <gwen@confluent.io.invalid> wrote:
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >> Hey,
> > > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > > > >> I noticed that there was no discussion for the last 10 days, but
> > > > > > > > > > > > > > > > > >> I couldn't find the vote thread. Is there one that I'm missing?
> > > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > > >> Gwen
> > > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > > > >> On Wed, Sep 22, 2021 at 4:58 AM Magnus Edenhill <magnus@edenhill.se> wrote:
> > > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > > > >> > On Tue, 21 Sep 2021 at 06:58, Colin McCabe <cmccabe@apache.org> wrote:
> > > > > > > > > > > > > > > > >> >
> > > > > > > > > > > > > > > > > >> > > On Mon, Sep 20, 2021, at 17:35, Feng Min wrote:
> > > > > > > > > > > > > > > > > >> > > > Thanks Magnus & Colin for the discussion.
> > > > > > > > > > > > > > > > > >> > > >
> > > > > > > > > > > > > > > > > >> > > > Based on KIP-714's stateless design, a client can pretty much use
> > > > > > > > > > > > > > > > > >> > > > any connection to any broker to send metrics. We are not
> > > > > > > > > > > > > > > > > >> > > > associating a connection with client metric state. Is my
> > > > > > > > > > > > > > > > > >> > > > understanding correct? If yes, how about the following two
> > > > > > > > > > > > > > > > > >> > > > scenarios:
> > > > > > > > > > > > > > > > > >> > > >
> > > > > > > > > > > > > > > > > >> > > > 1) One client (Client-ID) registers two different client instance
> > > > > > > > > > > > > > > > > >> > > > IDs via separate registrations. Is it permitted? If OK, how to
> > > > > > > > > > > > > > > > > >> > > > distinguish them from case 2 below?
> > > > > > > > > > > > > > > > >> > > >
> > > > > > > > > > > > > > > > >> > >
> > > > > > > > > > > > > > > > >> > > Hi Feng,
> > > > > > > > > > > > > > > > >> > >
> > > > > > > > > > > > > > > > > >> > > My understanding, which Magnus can clarify I guess, is that you
> > > > > > > > > > > > > > > > > >> > > could have something like two Producer instances running with the
> > > > > > > > > > > > > > > > > >> > > same client.id (perhaps because they're using the same config
> > > > > > > > > > > > > > > > > >> > > file, for example). They could even be in the same process. But
> > > > > > > > > > > > > > > > > >> > > they would get separate UUIDs.
> > > > > > > > > > > > > > > > > >> > >
> > > > > > > > > > > > > > > > > >> > > I believe Magnus used the term client to mean "Producer or
> > > > > > > > > > > > > > > > > >> > > Consumer". So if you have both a Producer and a Consumer in your
> > > > > > > > > > > > > > > > > >> > > application I would expect you'd get separate UUIDs for both.
> > > > > > > > > > > > > > > > > >> > > Again Magnus can chime in here, I guess.
> > > > > > > > > > > > > > > > >> > >
> > > > > > > > > > > > > > > > >> >
> > > > > > > > > > > > > > > > >> > That's correct.
> > > > > > > > > > > > > > > > >> >
> > > > > > > > > > > > > > > > >> >
> > > > > > > > > > > > > > > > >> > >
> > > > > > > > > > > > > > > > > >> > > > 2) How about the client restarting? What's the expectation?
> > > > > > > > > > > > > > > > > >> > > > Should the server expect the client to carry a persisted client
> > > > > > > > > > > > > > > > > >> > > > instance id or should the client be treated as a new instance?
> > > > > > > > > > > > > > > > >> > >
> > > > > > > > > > > > > > > > > >> > > The KIP doesn't describe any mechanism for persistence, so I
> > > > > > > > > > > > > > > > > >> > > would assume that when you restart the client you get a new
> > > > > > > > > > > > > > > > > >> > > UUID. I agree that it would be good to spell this out.
> > > > > > > > > > > > > > > > >> > >
> > > > > > > > > > > > > > > > >> > >
> > > > > > > > > > > > > > > > > >> > Right, it will not be persisted since a client instance can't be
> > > > > > > > > > > > > > > > > >> > restarted.
> > > > > > > > > > > > > > > > > >> >
> > > > > > > > > > > > > > > > > >> > Will update the KIP to make this clearer.
> > > > > > > > > > > > > > > > >> >
> > > > > > > > > > > > > > > > >> > /Magnus
> > > > > > > > > > > > > > > > >> >
> > > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > > >> --
> > > > > > > > > > > > > > > > >> Gwen Shapira
> > > > > > > > > > > > > > > > >> Engineering Manager | Confluent
> > > > > > > > > > > > > > > > >> 650.450.2760 | @gwenshap
> > > > > > > > > > > > > > > > >> Follow us: Twitter | blog
> > > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
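
The instance-ID behaviour settled at the end of the quoted thread above (two clients sharing a client.id still get separate UUIDs, and a restarted client is simply a new instance with a new UUID) can be illustrated with a minimal Python sketch. The `ClientInstance` class and `restart` helper are hypothetical names used only for illustration, and the UUID is generated locally as a simplifying assumption; per the thread, a real client obtains it through the GetTelemetrySubscriptions exchange.

```python
import uuid


class ClientInstance:
    """Hypothetical stand-in for one Producer, Consumer or Admin instance."""

    def __init__(self, client_id: str):
        # User-configured and possibly shared (e.g. same config file).
        self.client_id = client_id
        # Assumption: generated locally here; nothing is persisted anywhere.
        self.client_instance_id = uuid.uuid4()


def restart(instance: ClientInstance) -> ClientInstance:
    """A 'restart' is just a brand-new instance built from the same config."""
    return ClientInstance(instance.client_id)


producer_a = ClientInstance("orders-app")
producer_b = ClientInstance("orders-app")   # same client.id, same process
restarted_a = restart(producer_a)
```

Because the ID identifies the instance rather than the configuration, `producer_a`, `producer_b` and `restarted_a` all share a client.id but carry three different instance IDs.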

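The "opt-out on the clients, opt-in on the broker" model and the subscription selectors discussed in the quoted thread above (client_instance_id, client_software_name, client_software_version, plus the client_id selector that was added later) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Subscription` class, the `matches` and `subscribed_metrics` helpers, the exact-equality matching rule and the sample label values are all hypothetical and are not taken from the KIP.

```python
from dataclasses import dataclass, field


@dataclass
class Subscription:
    """Hypothetical broker-side subscription: selectors plus metric names."""
    selectors: dict                      # e.g. {"client_id": "orders-app"}
    metrics: list = field(default_factory=list)


def matches(selectors: dict, labels: dict) -> bool:
    # Assumption: every selector must equal the client's label exactly.
    return all(labels.get(key) == value for key, value in selectors.items())


def subscribed_metrics(subscriptions, labels, client_opted_out=False):
    """Return the metric names this client should push (empty = push nothing)."""
    if client_opted_out:                 # opt-out on the client
        return []
    wanted = []
    for sub in subscriptions:            # opt-in on the broker
        if matches(sub.selectors, labels):
            wanted.extend(m for m in sub.metrics if m not in wanted)
    return wanted


# Illustrative label values only.
labels = {
    "client_instance_id": "b69cc35a-7a54-4790-aa69-cc2bd4ee4538",
    "client_id": "orders-app",
    "client_software_name": "librdkafka",
    "client_software_version": "1.9.0",
}
subs = [Subscription({"client_id": "orders-app"}, ["client.request.rtt"])]
```

With no subscriptions configured the client pushes nothing even though metrics are enabled on it, which is the property Magnus relies on when arguing that enabling by default on the client is safe.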
Re: [DISCUSS] KIP-714: Client metrics and observability

Posted by Magnus Edenhill <ma...@edenhill.se>.

Hi all,

I've updated the KIP with responses to the latest comments: Java client
dependencies (thanks Kirk!), alternate designs (separate cluster, separate
producer, etc.), and so on.

I will revive the vote thread.

Thanks,
Magnus


On Mon, 13 Dec 2021 at 22:32, Ryanne Dolan <ry...@gmail.com> wrote:

> I think we should be very careful about introducing new runtime
> dependencies into the clients. Historically this has been rare and
> essentially necessary (e.g. compression libs).
>
> Ryanne
>
> On Mon, Dec 13, 2021, 1:06 PM Kirk True <ki...@mustardgrain.com> wrote:
>
> > Hi Jun,
> >
> > On Thu, Dec 9, 2021, at 2:28 PM, Jun Rao wrote:
> > > 13. Using OpenTelemetry. Does that require a runtime dependency
> > > on the OpenTelemetry library? How good is the compatibility story
> > > of OpenTelemetry? This is important since an application could have
> > > other OpenTelemetry dependencies than the Kafka client.
> >
> > The current design is that the OpenTelemetry JARs would ship with the
> > client. Perhaps we can design the client such that the JARs aren't even
> > loaded if the user has opted out. The user could even exclude the JARs
> > from their dependencies if they so wished.
> >
> > I can't speak to the compatibility of the libraries. Is it possible that
> > we include a shaded version?
> >
> > Thanks,
> > Kirk
> >
> > >
> > > 14. The proposal listed idempotence=true. This is more of a
> > > configuration than a metric. Are we including that as a metric? What
> > > other configurations are we including? Should we separate the
> > > configurations from the metrics?
> > >
> > > Thanks,
> > >
> > > Jun
> > >
> > > On Mon, Nov 29, 2021 at 7:34 AM Magnus Edenhill <ma...@edenhill.se> wrote:
> > >
> > > > Hey Bob,
> > > >
> > > > That's a good point.
> > > >
> > > > Request type labels were considered, but since they're already tracked
> > > > by broker-side metrics they were left out to avoid metric duplication.
> > > > However, those broker-side metrics are not per connection, so they
> > > > won't be that useful in practice for troubleshooting specific client
> > > > instances.
> > > >
> > > > I'll add the request_type label to the relevant metrics.
> > > >
> > > > Thanks,
> > > > Magnus
> > > >
> > > >
> > > > On Tue, 23 Nov 2021 at 19:20, Bob Barrett
> > > > <bo...@confluent.io.invalid> wrote:
> > > >
> > > > > Hi Magnus,
> > > > >
> > > > > Thanks for the thorough KIP, this seems very useful.
> > > > >
> > > > > Would it make sense to include the request type as a label for the
> > > > > `client.request.success`, `client.request.errors` and
> > > > > `client.request.rtt` metrics? I think it would be very useful to see
> > > > > which specific requests are succeeding and failing for a client. One
> > > > > specific case I can think of where this could be useful is producer
> > > > > batch timeouts. If a Java application does not enable producer client
> > > > > logs (unfortunately, in my experience this happens more often than it
> > > > > should), the application logs will only contain the expiration error
> > > > > message, but no information about what is causing the timeout. The
> > > > > requests might all be succeeding but taking too long to process
> > > > > batches, or metadata requests might be failing, or some or all
> > > > > produce requests might be failing (if the bootstrap servers are
> > > > > reachable from the client but one or more other brokers are not, for
> > > > > example). If the cluster operator is able to identify the specific
> > > > > requests that are slow or failing for a client, they will be better
> > > > > able to diagnose the issue causing batch timeouts.
> > > > >
> > > > > One drawback I can think of is that this will increase the
> > > > > cardinality of the request metrics. But any given client is only
> > > > > going to use a small subset of the request types, and since we
> > > > > already have partition labels for the topic-level metrics, I think
> > > > > request labels will still make up a relatively small percentage of
> > > > > the set of metrics.
> > > > >
> > > > > Thanks,
> > > > > Bob
> > > > >
> > > > > On Mon, Nov 22, 2021 at 2:08 AM Viktor Somogyi-Vass
> > > > > <viktorsomogyi@gmail.com> wrote:
> > > > >
> > > > > > Hi Magnus,
> > > > > >
> > > > > > I think this is a very useful addition. We also have a similar (but
> > > > > > much more simplistic) implementation of this. Maybe I missed it in
> > > > > > the KIP, but what about adding metrics about the subscription cache
> > > > > > itself? I think that would improve its usability and debuggability,
> > > > > > as we'd be able to see its performance, hit/miss rates, eviction
> > > > > > counts and others.
> > > > > >
> > > > > > Best,
> > > > > > Viktor
> > > > > >
> > > > > > On Thu, Nov 18, 2021 at 5:12 PM Magnus Edenhill <magnus@edenhill.se>
> > > > > > wrote:
> > > > > >
> > > > > > > Hi Mickael,
> > > > > > >
> > > > > > > see inline.
> > > > > > >
> > > > > > > On Wed, 10 Nov 2021 at 15:21, Mickael Maison
> > > > > > > <mickael.maison@gmail.com> wrote:
> > > > > > > > Hi Magnus,
> > > > > > > >
> > > > > > > > I see you've addressed some of the points I raised above but
> > > > > > > > some (4, 5) have not been addressed yet.
> > > > > > > >
> > > > > > >
> > > > > > > Re 4) How will the user/app know metrics are being sent.
> > > > > > >
> > > > > > > One possibility is to add a JMX metric (thus for user consumption)
> > > > > > > for the number of metric pushes the client has performed, or
> > > > > > > perhaps the number of metrics subscriptions currently being
> > > > > > > collected.
> > > > > > > Would that be sufficient?
> > > > > > >
> > > > > > > Re 5) Metric sizes and rates
> > > > > > >
> > > > > > > A worst case scenario for a producer that is producing to 50
> > > > > > > unique topics and emitting all standard metrics yields a
> > > > > > > serialized size of around 100KB prior to compression, which
> > > > > > > compresses down to about 20-30% of that depending on compression
> > > > > > > type and topic name uniqueness.
> > > > > > > The numbers for a consumer would be similar.
> > > > > > >
> > > > > > > In practice the number of unique topics would be far less, and the
> > > > > > > subscription set would typically be for a subset of metrics.
> > > > > > > So we're probably closer to 1KB, or less, compressed size per
> > > > > > > client per push interval.
> > > > > > >
> > > > > > > As both the subscription set and push intervals are controlled by
> > > > > > > the cluster operator it shouldn't be too hard to strike a good
> > > > > > > balance between metrics overhead and granularity.
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > I'm really uneasy with this being enabled by default on the
> > > > > > > > client side. When collecting data, I think the best practice is
> > > > > > > > to ensure users are explicitly enabling it.
> > > > > > > >
> > > > > > >
> > > > > > > Requiring metrics to be explicitly enabled on clients severely
> > > > > > > cripples the feature's usability and value.
> > > > > > >
> > > > > > > One of the problems that this KIP aims to solve is for useful
> > > > > > > metrics to be available on demand regardless of the technical
> > > > > > > expertise of the user. As Ryanne points out, a savvy
> > > > > > > user/organization will typically have metrics collection and
> > > > > > > monitoring in place already, and the benefits of this KIP are then
> > > > > > > more of a common set and format of metrics across client
> > > > > > > implementations and languages.
> > > > > > > But that is not the typical Kafka user in my experience: they're
> > > > > > > not Kafka experts and they don't have the knowledge of how to best
> > > > > > > instrument their clients.
> > > > > > > Having metrics enabled by default for this user base allows the
> > > > > > > Kafka operators to proactively and reactively monitor and
> > > > > > > troubleshoot client issues, without the need for the less savvy
> > > > > > > user to do anything.
> > > > > > > It is often too late to tell a user to enable metrics when the
> > > > > > > problem has already occurred.
> > > > > > >
> > > > > > > Now, to be clear, even though metrics are enabled by default on
> > > > > > > clients, they are not enabled by default on the brokers; the Kafka
> > > > > > > operator needs to build and set up a metrics plugin and add
> > > > > > > metrics subscriptions before anything is sent from the client.
> > > > > > > It is opt-out on the clients and opt-in on the broker.
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > > You mentioned brokers already have some (most?) of the
> > > > > > > > information contained in metrics, if so then why are we
> > > > > > > > collecting it again? Surely there must be some new information
> > > > > > > > in the client metrics.
> > > > > > > >
> > > > > > >
> > > > > > > From the user's perspective the Kafka infrastructure extends from
> > > > > > > producer.send() to messages being returned from consumer.poll(), a
> > > > > > > giant black box where there's a lot going on between those two
> > > > > > > points. The brokers currently only see what happens once those
> > > > > > > requests and messages hit the broker, but as Kafka clients are
> > > > > > > complex pieces of machinery there's a myriad of queues, timers,
> > > > > > > and state that's critical to the operation and infrastructure
> > > > > > > that's not currently visible to the operator.
> > > > > > > Relying on the user to provide this missing information accurately
> > > > > > > and in a timely manner is not generally feasible.
> > > > > > >
> > > > > > >
> > > > > > > Most of the standard metrics listed in the KIP are data points
> > > > > > > that the broker does not have.
> > > > > > > Only a small number of metrics are duplicates (like the request
> > > > > > > counts and sizes), but they are included to ease correlation when
> > > > > > > inspecting these client metrics.
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > > Moreover this is a brand new feature so it's even harder to
> > > > > > > > justify enabling it and forcing it onto all our users. If
> > > > > > > > disabled by default, it's relatively easy to enable in a new
> > > > > > > > release if we decide to, but once enabled by default it's much
> > > > > > > > harder to disable. Also this feature will apply to all future
> > > > > > > > metrics we will add.
> > > > > > > >
> > > > > > >
> > > > > > > I think maturity of a feature implementation should be the
> > > > > > > deciding factor, rather than the design of it (which this KIP is).
> > > > > > > I.e., if the implementation is not deemed mature enough for
> > > > > > > release X.Y it will be disabled.
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > > Overall I think it's an interesting feature but I'd prefer to be
> > > > > > > > slightly defensive and see how it works in practice before
> > > > > > > > enabling it everywhere.
> > > > > > > >
> > > > > > >
> > > > > > > Right, and I agree on being defensive, but since this feature
> > > > > > > still requires manual enabling on the brokers before actually
> > > > > > > being used, I think that gives enough control to opt in or out of
> > > > > > > this feature as needed.
> > > > > > >
> > > > > > > Thanks for your comments!
> > > > > > >
> > > > > > > Regards,
> > > > > > > Magnus
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Mickael
> > > > > > > >
> > > > > > > > On Mon, Nov 8, 2021 at 7:55 AM Magnus Edenhill <magnus@edenhill.se>
> > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > Thanks David for pointing this out,
> > > > > > > > > I've updated the KIP to include client_id as a matching
> > > > > > > > > selector.
> > > > > > > > >
> > > > > > > > > Regards,
> > > > > > > > > Magnus
> > > > > > > > >
> > > > > > > > > On Thu, 4 Nov 2021 at 18:01, David Mao
> > > > > > > > > <dmao@confluent.io.invalid> wrote:
> > > > > > > > >
> > > > > > > > > > Hey Magnus,
> > > > > > > > > >
> > > > > > > > > > I noticed that the KIP outlines the initial selectors
> > > > > > > > > > supported as:
> > > > > > > > > >
> > > > > > > > > >    - client_instance_id - CLIENT_INSTANCE_ID UUID string
> > > > > > > > > >      representation.
> > > > > > > > > >    - client_software_name  - client software implementation
> > > > > > > > > >      name.
> > > > > > > > > >    - client_software_version  - client software
> > > > > > > > > >      implementation version.
> > > > > > > > > >
> > > > > > > > > > In the given reactive monitoring workflow, we mention that
> > > > > > > > > > the application user does not know their client's client
> > > > > > > > > > instance ID, but it's outlined that the operator can add a
> > > > > > > > > > metrics subscription selecting for clientId. I don't see
> > > > > > > > > > clientId as one of the supported selectors.
> > > > > > > > > > I can see how this would have made sense in a previous
> > > > > > > > > > iteration given that the previous client instance ID
> > > > > > > > > > proposal was to construct the client instance ID using
> > > > > > > > > > clientId as a prefix. Now that the client instance ID is a
> > > > > > > > > > UUID, would we want to add clientId as a supported selector?
> > > > > > > > > > Let me know what you think.
> > > > > > > > > >
> > > > > > > > > > David
> > > > > > > > > >
> > > > > > > > > > On Tue, Oct 19, 2021 at 12:33 PM Magnus Edenhill
> > > > > > > > > > <magnus@edenhill.se> wrote:
> > > > > > > > > >
> > > > > > > > > > > Hi Mickael!
> > > > > > > > > > >
> > > > > > > > > > > On Tue, 19 Oct 2021 at 19:30, Mickael Maison
> > > > > > > > > > > <mickael.maison@gmail.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Hi Magnus,
> > > > > > > > > > > >
> > > > > > > > > > > > Thanks for the proposal.
> > > > > > > > > > > >
> > > > > > > > > > > > 1. Looking at the protocol section, isn't
> > > > > > > > > > > > "ClientInstanceId" expected to be a field in
> > > > > > > > > > > > GetTelemetrySubscriptionsResponseV0? Otherwise, how
> > > > > > > > > > > > does a client retrieve this value?
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > Good catch, it got removed by mistake in one of the edits.
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > 2. In the client API section, you mention a new method
> > > > > > > > > > > > "clientInstanceId()". Can you clarify which interfaces
> > > > > > > > > > > > are affected? Is it only Consumer and Producer?
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > And Admin. Will update the KIP.
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > > 3. I'm a bit concerned this is enabled by default. Even if
> > > > > > > > > > > > the data collected is supposed to be not sensitive, I think
> > > > > > > > > > > > this can be problematic in some environments. Also users
> > > > > > > > > > > > don't seem to have the choice to only expose some metrics.
> > > > > > > > > > > > Knowing how much data transits through some applications
> > > > > > > > > > > > can be considered critical.
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > The broker already knows how much data transits through the
> > > > > > > > > > > client though, right?
> > > > > > > > > > > Care has been taken not to expose information in the
> > > > > > > > > > > standard metrics that might reveal sensitive information.
> > > > > > > > > > >
> > > > > > > > > > > Do you have an example of how the proposed metrics could
> > > > > > > > > > > leak sensitive information?
> > > > > > > > > > > As for limiting what metrics to export: I guess that could
> > > > > > > > > > > make sense in some very sensitive use-cases, but those users
> > > > > > > > > > > might disable metrics altogether for now.
> > > > > > > > > > > Could these concerns be addressed by a later KIP?
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > 4. As a user, how do you know if your application is
> > > > > > > > > > > > actively sending metrics? Are there new metrics exposing
> > > > > > > > > > > > what's going on, like how much data is being sent?
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > That's a good question.
> > > > > > > > > > > Since the proposed metrics interface is not aimed at, or
> > > > > > > > > > > directly available to, the application, I guess there's
> > > > > > > > > > > little point in adding it here; perhaps we should instead
> > > > > > > > > > > add something to the existing JMX metrics?
> > > > > > > > > > > Do you have any suggestions?
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > > 5. If all metrics are enabled on a regular Consumer or
> > > > > > > > > > > > Producer, do you have an idea how much throughput this
> > > > > > > > > > > > would use?
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > It depends on the number of partitions/topics/etc. the
> > > > > > > > > > > client is producing to/consuming from.
> > > > > > > > > > > I'll add some sizes to the KIP for some typical use-cases.
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > Thanks,
> > > > > > > > > > > Magnus
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > > Thanks
> > > > > > > > > > > >
> > > > > > > > > > > > On Tue, Oct 19, 2021 at 5:06 PM Magnus Edenhill
> > > > > > > > > > > > <magnus@edenhill.se> wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Tue, 19 Oct 2021 at 13:22, Tom Bentley
> > > > > > > > > > > > > <tbentley@redhat.com> wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > Hi Magnus,
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > I reviewed the KIP since you called the vote (sorry for not
> > > > > > > > > > > > > > reviewing when you announced your intention to call the
> > > > > > > > > > > > > > vote). I have a few questions on some of the details.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > 1. There's no Javadoc on ClientTelemetryPayload.data(), so
> > > > > > > > > > > > > > I don't know whether the payload is exposed through this
> > > > > > > > > > > > > > method as compressed or not. Later on you say
> > > > > > > > > > > > > > "Decompression of the payloads will be handled by the
> > > > > > > > > > > > > > broker metrics plugin, the broker should expose a suitable
> > > > > > > > > > > > > > decompression API to the metrics plugin for this purpose.",
> > > > > > > > > > > > > > which suggests it's the compressed data in the buffer, but
> > > > > > > > > > > > > > then we don't know which codec was used, nor the API via
> > > > > > > > > > > > > > which the plugin should decompress it if required for
> > > > > > > > > > > > > > forwarding to the ultimate metrics store. Should the
> > > > > > > > > > > > > > ClientTelemetryPayload expose a method to get the
> > > > > > > > > > > > > > compression and a decompressor?
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > Good point, updated.
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > > 2. The client-side API is expressed as StringOrError
> > > > > > > > > > > > > > ClientInstance::ClientInstanceId(int timeout_ms). I
> > > > > > > > > > > > > > understand that you're thinking about the librdkafka
> > > > > > > > > > > > > > implementation, but it would be good to show the API as it
> > > > > > > > > > > > > > would appear on the Apache Kafka clients.
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > This was meant as pseudo-code, but I changed it to Java.
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > > 3. "PushTelemetryRequest|Response - protocol request used
> > > > > > > > > > > > > > by the client to send metrics to any broker it is connected
> > > > > > > > > > > > > > to." To be clear, this means that the client can choose any
> > > > > > > > > > > > > > of the connected brokers and push to just one of them? What
> > > > > > > > > > > > > > should a supporting client do if it gets an error when
> > > > > > > > > > > > > > pushing metrics to a broker, retry sending to the same
> > > > > > > > > > > > > > broker or try pushing to another broker, or drop the
> > > > > > > > > > > > > > metrics? Should supporting clients send successive requests
> > > > > > > > > > > > > > to a single broker, or round robin, or is that up to the
> > > > > > > > > > > > > > client author? I'm guessing the behaviour should be sticky
> > > > > > > > > > > > > > to support the rate limiting features, but I think it would
> > > > > > > > > > > > > > be good for client authors if this section were explicit on
> > > > > > > > > > > > > > the recommended behaviour.
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > You are right, I've updated the KIP to make this clearer.
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > > 4. "Mapping the client instance id to an actual application
> > > > > > > > > > > > > > instance running on a (virtual) machine can be done by
> > > > > > > > > > > > > > inspecting the metrics resource labels, such as the client
> > > > > > > > > > > > > > source address and source port, or security principal, all
> > > > > > > > > > > > > > of which are added by the receiving broker. This will allow
> > > > > > > > > > > > > > the operator together with the user to identify the actual
> > > > > > > > > > > > > > application instance." Is this really always true? The
> > > > > > > > > > > > > > source IP and port might be a loadbalancer/proxy in some
> > > > > > > > > > > > > > setups. The principal, as already mentioned in the KIP,
> > > > > > > > > > > > > > might be shared between multiple applications. So at worst
> > > > > > > > > > > > > > the organization running the clients might have to consult
> > > > > > > > > > > > > > the logs of a set of client applications, right?
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > Yes, that's correct. There's no guaranteed mapping from
> > > > > > > > > > > > > client_instance_id to an actual instance, which is why the
> > > > > > > > > > > > > KIP recommends that client implementations log the client
> > > > > > > > > > > > > instance id upon retrieval, and also provide an API for the
> > > > > > > > > > > > > application to retrieve the instance id programmatically if
> > > > > > > > > > > > > it has a better way of exposing it.
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > > 5. "Tests indicate that a compression ratio up to 10x is
> > > > > > > > > > > > > > possible for the standard metrics." Client authors might
> > > > > > > > > > > > > > appreciate your mentioning which compression codec got
> > > > > > > > > > > > > > these results.
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > Good point. Updated.
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > > 6. "Should the client send a push request prior to expiry
> > > > > > > > > > > > > > of the previously calculated PushIntervalMs the broker will
> > > > > > > > > > > > > > discard the metrics and return a PushTelemetryResponse with
> > > > > > > > > > > > > > the ErrorCode set to RateLimited." Is this RATE_LIMITED a
> > > > > > > > > > > > > > new error code? It's not mentioned in the "New Error Codes"
> > > > > > > > > > > > > > section.
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > That's a leftover, it should be using the standard
> > > > > > > > > > > > > ThrottleTime mechanism.
> > > > > > > > > > > > > Fixed.
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > > 7. In the section "Standard client resource labels"
> > > > > > > > > > > > > > application_id is described as Kafka Streams only, but the
> > > > > > > > > > > > > > section of "Client Identification" talks about "application
> > > > > > > > > > > > > > instance id as an optional future nice-to-have that may be
> > > > > > > > > > > > > > included as a metrics label if it has been set by the
> > > > > > > > > > > > > > user", so I'm confused whether non-Kafka Streams clients
> > > > > > > > > > > > > > should set an application_id or not.
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > I'll clarify this in the KIP, but basically we would need to
> > > > > > > > > > > > > add an `application.id` config property for non-streams
> > > > > > > > > > > > > clients for this purpose, and that's outside the scope of
> > > > > > > > > > > > > this KIP since we want to make it zero-conf:ish on the
> > > > > > > > > > > > > client side.
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Kind regards,
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Tom
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > Thanks for the review,
> > > > > > > > > > > > > Magnus
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Thu, Oct 7, 2021 at 5:26 PM Magnus Edenhill
> > > > > > > > > > > > > > <magnus@edenhill.se> wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Hi all,
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > I've updated the KIP following our recent discussions on
> > > > > > > > > > > > > > > the mailing list:
> > > > > > > > > > > > > > >  - split the protocol in two, one for getting the metrics
> > > > > > > > > > > > > > >    subscriptions, and one for pushing the metrics.
> > > > > > > > > > > > > > >  - simplifications: initially only one supported metrics
> > > > > > > > > > > > > > >    format, no client.id in the instance id, etc.
> > > > > > > > > > > > > > >  - made CLIENT_METRICS subscription configuration entries
> > > > > > > > > > > > > > >    more structured and allowing better client matching
> > > > > > > > > > > > > > >    selectors (not only on the instance id, but also the
> > > > > > > > > > > > > > >    other client resource labels, such as
> > > > > > > > > > > > > > >    client_software_name, etc.).
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Unless there are further comments I'll call the vote in a
> > > > > > > > > > > > > > > day or two.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Regards,
> > > > > > > > > > > > > > > Magnus
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > On Mon, 4 Oct 2021 at 20:57, Magnus Edenhill
> > > > > > > > > > > > > > > <magnus@edenhill.se> wrote:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Hi Gwen,
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > I'm finishing up the KIP based on the last couple of
> > > > > > > > > > > > > > > > discussion points in this thread
> > > > > > > > > > > > > > > > and will call the Vote later this week.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Best,
> > > > > > > > > > > > > > > > Magnus
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > On Sat, 2 Oct 2021 at 02:01, Gwen Shapira
> > > > > > > > > > > > > > > > <gwen@confluent.io.invalid> wrote:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >> Hey,
> > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > > >> I noticed that there was no discussion for the last 10
> > > > > > > > > > > > > > > > >> days, but I couldn't find the vote thread. Is there one
> > > > > > > > > > > > > > > > >> that I'm missing?
> > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > >> Gwen
> > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > > >> On Wed, Sep 22, 2021 at 4:58 AM Magnus Edenhill
> > > > > > > > > > > > > > > > >> <magnus@edenhill.se> wrote:
> > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > > >> > On Tue, 21 Sep 2021 at 06:58, Colin McCabe
> > > > > > > > > > > > > > > > >> > <cmccabe@apache.org> wrote:
> > > > > > > > > > > > > > > >> >
> > > > > > > > > > > > > > > > >> > > On Mon, Sep 20, 2021, at 17:35, Feng Min wrote:
> > > > > > > > > > > > > > > > >> > > > Thanks Magnus & Colin for the discussion.
> > > > > > > > > > > > > > > >> > > >
> > > > > > > > > > > > > > > > >> > > > Based on KIP-714's stateless design, Client can pretty
> > > > > > > > > > > > > > > > >> > > > much use any connection to any broker to send metrics.
> > > > > > > > > > > > > > > > >> > > > We are not associating connection with client metric
> > > > > > > > > > > > > > > > >> > > > state. Is my understanding correct? If yes, how about
> > > > > > > > > > > > > > > > >> > > > the following two scenarios
> > > > > > > > > > > > > > > >> > > >
> > > > > > > > > > > > > > > >> > > > 1) One Client (Client-ID) registers
> two
> > > > > > different
> > > > > > > > client
> > > > > > > > > > > > > > instance
> > > > > > > > > > > > > > > id
> > > > > > > > > > > > > > > >> > via
> > > > > > > > > > > > > > > >> > > > separate registration. Is it
> permitted?
> > If
> > > > OK,
> > > > > > how
> > > > > > > > to
> > > > > > > > > > > > > > distinguish
> > > > > > > > > > > > > > > >> them
> > > > > > > > > > > > > > > >> > > from
> > > > > > > > > > > > > > > >> > > > the case 2 below.
> > > > > > > > > > > > > > > >> > > >
> > > > > > > > > > > > > > > >> > >
> > > > > > > > > > > > > > > >> > > Hi Feng,
> > > > > > > > > > > > > > > >> > >
> > > > > > > > > > > > > > > > >> > > My understanding, which Magnus can clarify I guess, is that you
> > > > > > > > > > > > > > > > >> > > could have something like two Producer instances running with
> > > > > > > > > > > > > > > > >> > > the same client.id (perhaps because they're using the same
> > > > > > > > > > > > > > > > >> > > config file, for example). They could even be in the same
> > > > > > > > > > > > > > > > >> > > process. But they would get separate UUIDs.
> > > > > > > > > > > > > > > > >> > >
> > > > > > > > > > > > > > > > >> > > I believe Magnus used the term client to mean "Producer or
> > > > > > > > > > > > > > > > >> > > Consumer". So if you have both a Producer and a Consumer in
> > > > > > > > > > > > > > > > >> > > your application I would expect you'd get separate UUIDs for
> > > > > > > > > > > > > > > > >> > > both. Again Magnus can chime in here, I guess.
> > > > > > > > > > > > > > > >> > >
> > > > > > > > > > > > > > > >> >
> > > > > > > > > > > > > > > >> > That's correct.
> > > > > > > > > > > > > > > >> >
> > > > > > > > > > > > > > > >> >
> > > > > > > > > > > > > > > >> > >
> > > > > > > > > > > > > > > > >> > > > 2) How about the client restarting? What's the expectation?
> > > > > > > > > > > > > > > > >> > > > Should the server expect the client to carry a persisted
> > > > > > > > > > > > > > > > >> > > > client instance id or should the client be treated as a new
> > > > > > > > > > > > > > > > >> > > > instance?
> > > > > > > > > > > > > > > > >> > >
> > > > > > > > > > > > > > > > >> > > The KIP doesn't describe any mechanism for persistence, so I
> > > > > > > > > > > > > > > > >> > > would assume that when you restart the client you get a new
> > > > > > > > > > > > > > > > >> > > UUID. I agree that it would be good to spell this out.
> > > > > > > > > > > > > > > > >> > >
> > > > > > > > > > > > > > > > >> > >
> > > > > > > > > > > > > > > > >> > Right, it will not be persisted since a client instance can't be
> > > > > > > > > > > > > > > > >> > restarted.
> > > > > > > > > > > > > > > >> >
> > > > > > > > > > > > > > > >> > Will update the KIP to make this clearer.
> > > > > > > > > > > > > > > >> >
> > > > > > > > > > > > > > > >> > /Magnus
> > > > > > > > > > > > > > > >> >
> > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > >> --
> > > > > > > > > > > > > > > >> Gwen Shapira
> > > > > > > > > > > > > > > >> Engineering Manager | Confluent
> > > > > > > > > > > > > > > >> 650.450.2760 | @gwenshap
> > > > > > > > > > > > > > > >> Follow us: Twitter | blog
> > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
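
The client instance id semantics discussed in the message above (two clients sharing a client.id still get separate UUIDs, and a restarted client gets a new one because nothing is persisted) can be illustrated with a minimal Python sketch. `SketchClient` is a hypothetical stand-in, not a Kafka class, and it generates the UUID locally only for brevity; according to the thread, a real client retrieves its id from the broker in the GetTelemetrySubscriptions response.

```python
import uuid


class SketchClient:
    """Hypothetical stand-in for a Producer, Consumer or Admin instance."""

    def __init__(self, client_id: str):
        # client.id comes from configuration and may be shared by many instances.
        self.client_id = client_id
        # Assumption for this sketch: the id is generated locally and kept only
        # in memory, so it is never persisted across restarts.
        self.client_instance_id = uuid.uuid4()


# Two instances created from the same config share a client.id
# but are distinguishable by their instance ids.
producer = SketchClient("orders-app")
consumer = SketchClient("orders-app")

# "Restarting" a client means constructing a new instance, which gets a new id.
restarted_producer = SketchClient("orders-app")
```

Because the id lives only in the instance, an operator correlating metrics across a restart would need another label, such as client.id, to tie the old and new instance together.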

Re: [DISCUSS] KIP-714: Client metrics and observability

Posted by Ryanne Dolan <ry...@gmail.com>.

I think we should be very careful about introducing new runtime
dependencies into the clients. Historically this has been rare and
essentially necessary (e.g. compression libs).

Ryanne

On Mon, Dec 13, 2021, 1:06 PM Kirk True <ki...@mustardgrain.com> wrote:

> Hi Jun,
>
> On Thu, Dec 9, 2021, at 2:28 PM, Jun Rao wrote:
> > 13. Using OpenTelemetry. Does that require runtime dependency
> > on OpenTelemetry library? How good is the compatibility story
> > of OpenTelemetry? This is important since an application could have other
> > OpenTelemetry dependencies than the Kafka client.
>
> The current design is that the OpenTelemetry JARs would ship with the
> client. Perhaps we can design the client such that the JARs aren't even
> loaded if the user has opted out. The user could even exclude the JARs from
> their dependencies if they so wished.
>
> I can't speak to the compatibility of the libraries. Is it possible that
> we include a shaded version?
>
> Thanks,
> Kirk
>
> >
> > 14. The proposal listed idempotence=true. This is more of a configuration
> > than a metric. Are we including that as a metric? What other configurations
> > are we including? Should we separate the configurations from the metrics?
> >
> > Thanks,
> >
> > Jun
> >
> > On Mon, Nov 29, 2021 at 7:34 AM Magnus Edenhill <ma...@edenhill.se> wrote:
> >
> > > Hey Bob,
> > >
> > > That's a good point.
> > >
> > > Request type labels were considered, but since they're already tracked by
> > > broker-side metrics they were left out to avoid metric duplication.
> > > However, those broker-side metrics are not per connection, so they won't
> > > be that useful in practice for troubleshooting specific client instances.
> > >
> > > I'll add the request_type label to the relevant metrics.
> > >
> > > Thanks,
> > > Magnus
> > >
> > >
> > > On Tue, Nov 23, 2021 at 19:20, Bob Barrett
> > > <bo...@confluent.io.invalid> wrote:
> > >
> > > > Hi Magnus,
> > > >
> > > > Thanks for the thorough KIP, this seems very useful.
> > > >
> > > > Would it make sense to include the request type as a label for the
> > > > `client.request.success`, `client.request.errors` and `client.request.rtt`
> > > > metrics? I think it would be very useful to see which specific requests
> > > > are succeeding and failing for a client. One specific case I can think of
> > > > where this could be useful is producer batch timeouts. If a Java
> > > > application does not enable producer client logs (unfortunately, in my
> > > > experience this happens more often than it should), the application logs
> > > > will only contain the expiration error message, but no information about
> > > > what is causing the timeout. The requests might all be succeeding but
> > > > taking too long to process batches, or metadata requests might be failing,
> > > > or some or all produce requests might be failing (if the bootstrap servers
> > > > are reachable from the client but one or more other brokers are not, for
> > > > example). If the cluster operator is able to identify the specific
> > > > requests that are slow or failing for a client, they will be better able
> > > > to diagnose the issue causing batch timeouts.
> > > >
> > > > One drawback I can think of is that this will increase the cardinality of
> > > > the request metrics. But any given client is only going to use a small
> > > > subset of the request types, and since we already have partition labels
> > > > for the topic-level metrics, I think request labels will still make up a
> > > > relatively small percentage of the set of metrics.
> > > >
> > > > Thanks,
> > > > Bob
> > > >
> > > > On Mon, Nov 22, 2021 at 2:08 AM Viktor Somogyi-Vass
> > > > <viktorsomogyi@gmail.com> wrote:
> > > >
> > > > > Hi Magnus,
> > > > >
> > > > > I think this is a very useful addition. We also have a similar (but
> > > > > much more simplistic) implementation of this. Maybe I missed it in the
> > > > > KIP but what about adding metrics about the subscription cache itself?
> > > > > That I think would improve its usability and debuggability as we'd be
> > > > > able to see its performance, hit/miss rates, eviction counts and others.
> > > > >
> > > > > Best,
> > > > > Viktor
> > > > >
> > > > > On Thu, Nov 18, 2021 at 5:12 PM Magnus Edenhill <magnus@edenhill.se>
> > > > > wrote:
> > > > >
> > > > > > Hi Mickael,
> > > > > >
> > > > > > see inline.
> > > > > >
> > > > > > On Wed, Nov 10, 2021 at 15:21, Mickael Maison
> > > > > > <mickael.maison@gmail.com> wrote:
> > > > > >
> > > > > > > Hi Magnus,
> > > > > > >
> > > > > > > I see you've addressed some of the points I raised above but some
> > > > > > > (4, 5) have not been addressed yet.
> > > > > > >
> > > > > >
> > > > > > Re 4) How will the user/app know metrics are being sent?
> > > > > >
> > > > > > One possibility is to add a JMX metric (thus for user consumption)
> > > > > > for the number of metric pushes the client has performed, or perhaps
> > > > > > the number of metrics subscriptions currently being collected.
> > > > > > Would that be sufficient?
> > > > > >
> > > > > > Re 5) Metric sizes and rates
> > > > > >
> > > > > > A worst-case scenario for a producer that is producing to 50 unique
> > > > > > topics and emitting all standard metrics yields a serialized size of
> > > > > > around 100 KB prior to compression, which compresses down to about
> > > > > > 20-30% of that depending on compression type and topic name
> > > > > > uniqueness.
> > > > > > The numbers for a consumer would be similar.
> > > > > >
> > > > > > In practice the number of unique topics would be far less, and the
> > > > > > subscription set would typically be for a subset of metrics.
> > > > > > So we're probably closer to 1 KB, or less, compressed size per client
> > > > > > per push interval.
> > > > > >
> > > > > > As both the subscription set and push intervals are controlled by the
> > > > > > cluster operator, it shouldn't be too hard to strike a good balance
> > > > > > between metrics overhead and granularity.
> > > > > >
> > > > > >
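
The size estimate quoted above can be worked through with a short Python sketch. The 100 KB worst case, the 20-30% compressed fraction and the 1 KB typical push size are the figures from the message; the fleet size of 1000 clients and the 60-second push interval are hypothetical values chosen only to make the arithmetic concrete.

```python
def compressed_size(uncompressed_bytes: int, percent_of_original: int) -> int:
    """Compressed size, given the compressed size as a percentage of the original."""
    return uncompressed_bytes * percent_of_original // 100


def aggregate_bytes_per_second(push_bytes: int, clients: int, push_interval_s: int) -> float:
    """Average inbound metrics traffic across the cluster for a fleet of clients."""
    return push_bytes * clients / push_interval_s


WORST_CASE_BYTES = 100 * 1024  # about 100 KB uncompressed, figure from the thread

best = compressed_size(WORST_CASE_BYTES, 20)   # 20480 bytes (20 KB)
worst = compressed_size(WORST_CASE_BYTES, 30)  # 30720 bytes (30 KB)

# Hypothetical fleet: 1000 clients, each pushing about 1 KB every 60 seconds.
typical = aggregate_bytes_per_second(1024, clients=1000, push_interval_s=60)
```

Under these assumed fleet numbers the typical case averages roughly 17 KB/s across the whole cluster, which is why the push interval and subscription set are the main levers an operator has over overhead.
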
> > > > > >
> > > > > > >
> > > > > > > I'm really uneasy with this being enabled by default on the client
> > > > > > > side. When collecting data, I think the best practice is to ensure
> > > > > > > users are explicitly enabling it.
> > > > > > >
> > > > > >
> > > > > > Requiring metrics to be explicitly enabled on clients severely
> > > > > > cripples the feature's usability and value.
> > > > > >
> > > > > > One of the problems that this KIP aims to solve is for useful metrics
> > > > > > to be available on demand regardless of the technical expertise of the
> > > > > > user. As Ryanne points out, a savvy user/organization will typically
> > > > > > have metrics collection and monitoring in place already, and the
> > > > > > benefits of this KIP are then more of a common set and format of
> > > > > > metrics across client implementations and languages.
> > > > > > But that is not the typical Kafka user in my experience; they're not
> > > > > > Kafka experts and they don't have the knowledge of how to best
> > > > > > instrument their clients.
> > > > > > Having metrics enabled by default for this user base allows the Kafka
> > > > > > operators to proactively and reactively monitor and troubleshoot
> > > > > > client issues, without the need for the less savvy user to do
> > > > > > anything.
> > > > > > It is often too late to tell a user to enable metrics when the problem
> > > > > > has already occurred.
> > > > > >
> > > > > > Now, to be clear, even though metrics are enabled by default on
> > > > > > clients, they are not enabled by default on the brokers; the Kafka
> > > > > > operator needs to build and set up a metrics plugin and add metrics
> > > > > > subscriptions before anything is sent from the client.
> > > > > > It is opt-out on the clients and opt-in on the broker.
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > > You mentioned brokers already have some (most?) of the information
> > > > > > > contained in metrics; if so, then why are we collecting it again?
> > > > > > > Surely there must be some new information in the client metrics.
> > > > > > >
> > > > > >
> > > > > > From the user's perspective the Kafka infrastructure extends from
> > > > > > producer.send() to messages being returned from consumer.poll(), a
> > > > > > giant black box where there's a lot going on between those two points.
> > > > > > The brokers currently only see what happens once those requests and
> > > > > > messages hit the broker, but as Kafka clients are complex pieces of
> > > > > > machinery there's a myriad of queues, timers, and state that's
> > > > > > critical to the operation and infrastructure that's not currently
> > > > > > visible to the operator.
> > > > > > Relying on the user to provide this missing information accurately
> > > > > > and in a timely manner is not generally feasible.
> > > > > >
> > > > > >
> > > > > > Most of the standard metrics listed in the KIP are data points that
> > > > > > the broker does not have.
> > > > > > Only a small number of metrics are duplicates (like the request counts
> > > > > > and sizes), but they are included to ease correlation when inspecting
> > > > > > these client metrics.
> > > > > >
> > > > > >
> > > > > >
> > > > > > > Moreover this is a brand new feature so it's even harder to justify
> > > > > > > enabling it and forcing it onto all our users. If disabled by
> > > > > > > default, it's relatively easy to enable in a new release if we
> > > > > > > decide to, but once enabled by default it's much harder to disable.
> > > > > > > Also this feature will apply to all future metrics we will add.
> > > > > > >
> > > > > >
> > > > > > I think maturity of a feature implementation should be the deciding
> > > > > > factor, rather than the design of it (which this KIP is). I.e., if the
> > > > > > implementation is not deemed mature enough for release X.Y it will be
> > > > > > disabled.
> > > > > >
> > > > > >
> > > > > >
> > > > > > > Overall I think it's an interesting feature but I'd prefer to be
> > > > > > > slightly defensive and see how it works in practice before enabling
> > > > > > > it everywhere.
> > > > > > >
> > > > > >
> > > > > > Right, and I agree on being defensive, but since this feature still
> > > > > > requires manual enabling on the brokers before actually being used, I
> > > > > > think that gives enough control to opt in or out of this feature as
> > > > > > needed.
> > > > > >
> > > > > > Thanks for your comments!
> > > > > >
> > > > > > Regards,
> > > > > > Magnus
> > > > > >
> > > > > >
> > > > > >
> > > > > > > Thanks,
> > > > > > > Mickael
> > > > > > >
> > > > > > > On Mon, Nov 8, 2021 at 7:55 AM Magnus Edenhill <magnus@edenhill.se>
> > > > > > > wrote:
> > > > > > > >
> > > > > > > > Thanks David for pointing this out,
> > > > > > > > I've updated the KIP to include client_id as a matching selector.
> > > > > > > >
> > > > > > > > Regards,
> > > > > > > > Magnus
> > > > > > > >
> > > > > > > > On Thu, Nov 4, 2021 at 18:01, David Mao <dmao@confluent.io.invalid>
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Hey Magnus,
> > > > > > > > >
> > > > > > > > > I noticed that the KIP outlines the initial selectors supported as:
> > > > > > > > >
> > > > > > > > >    - client_instance_id - CLIENT_INSTANCE_ID UUID string
> > > > > > > > >      representation.
> > > > > > > > >    - client_software_name - client software implementation name.
> > > > > > > > >    - client_software_version - client software implementation
> > > > > > > > >      version.
> > > > > > > > >
> > > > > > > > > In the given reactive monitoring workflow, we mention that the
> > > > > > > > > application user does not know their client's client instance ID,
> > > > > > > > > but it's outlined that the operator can add a metrics subscription
> > > > > > > > > selecting for clientId. I don't see clientId as one of the
> > > > > > > > > supported selectors.
> > > > > > > > > I can see how this would have made sense in a previous iteration
> > > > > > > > > given that the previous client instance ID proposal was to
> > > > > > > > > construct the client instance ID using clientId as a prefix. Now
> > > > > > > > > that the client instance ID is a UUID, would we want to add
> > > > > > > > > clientId as a supported selector?
> > > > > > > > > Let me know what you think.
> > > > > > > > >
> > > > > > > > > David
> > > > > > > > >
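
The selector discussion above (matching a metrics subscription against client labels, with client_id added alongside client_instance_id, client_software_name and client_software_version) can be sketched in Python as follows. This is only an illustration under stated assumptions: `subscription_matches` is a hypothetical helper, exact string equality is assumed where the KIP may define a different matching rule, and all label values are made up.

```python
def subscription_matches(selectors: dict, client_labels: dict) -> bool:
    """Return True when every selector equals the client's value for that label.

    Assumption: exact string equality; the KIP may specify pattern matching.
    """
    return all(client_labels.get(name) == wanted for name, wanted in selectors.items())


# Hypothetical labels for one client instance (all values are invented).
client = {
    "client_instance_id": "b69cc35a-7a54-4790-aa69-cc2bd4ee4538",
    "client_id": "orders-app",
    "client_software_name": "example-client",
    "client_software_version": "1.0.0",
}

# An operator who only knows the application's client.id can still target it.
by_client_id = subscription_matches({"client_id": "orders-app"}, client)

# A subscription with several selectors requires all of them to match.
by_software = subscription_matches(
    {"client_software_name": "example-client", "client_software_version": "2.0.0"},
    client,
)
```

In this sketch an empty selector set matches every client, which is a property of `all()` over an empty sequence rather than something the thread specifies.
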
> > > > > > > > > On Tue, Oct 19, 2021 at 12:33 PM Magnus Edenhill <magnus@edenhill.se>
> > > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Hi Mickael!
> > > > > > > > > >
> > > > > > > > > > On Tue, Oct 19, 2021 at 19:30, Mickael Maison
> > > > > > > > > > <mickael.maison@gmail.com> wrote:
> > > > > > > > > >
> > > > > > > > > > > Hi Magnus,
> > > > > > > > > > >
> > > > > > > > > > > Thanks for the proposal.
> > > > > > > > > > >
> > > > > > > > > > > 1. Looking at the protocol section, isn't "ClientInstanceId"
> > > > > > > > > > > expected to be a field in GetTelemetrySubscriptionsResponseV0?
> > > > > > > > > > > Otherwise, how does a client retrieve this value?
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Good catch, it got removed by mistake in one of the edits.
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > 2. In the client API section, you mention a new method
> > > > > > > > > > > "clientInstanceId()". Can you clarify which interfaces are
> > > > > > > > > > > affected? Is it only Consumer and Producer?
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > And Admin. Will update the KIP.
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > 3. I'm a bit concerned this is enabled by default. Even if the
> > > > > > > > > > > data collected is supposed to be not sensitive, I think this can
> > > > > > > > > > > be problematic in some environments. Also users don't seem to
> > > > > > > > > > > have the choice to only expose some metrics. Knowing how much
> > > > > > > > > > > data transits through some applications can be considered
> > > > > > > > > > > critical.
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > The broker already knows how much data transits through the
> > > > > > > > > > client though, right?
> > > > > > > > > > Care has been taken not to expose information in the standard
> > > > > > > > > > metrics that might reveal sensitive information.
> > > > > > > > > >
> > > > > > > > > > Do you have an example of how the proposed metrics could leak
> > > > > > > > > > sensitive information?
> > > > > > > > > > As for limiting which metrics to export, I guess that could make
> > > > > > > > > > sense in some very sensitive use-cases, but those users might
> > > > > > > > > > disable metrics altogether for now.
> > > > > > > > > > Could these concerns be addressed by a later KIP?
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > 4. As a user, how do you know if your application is actively
> > > > > > > > > > > sending metrics? Are there new metrics exposing what's going on,
> > > > > > > > > > > like how much data is being sent?
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > That's a good question.
> > > > > > > > > > Since the proposed metrics interface is not aimed at, or directly
> > > > > > > > > > available to, the application, I guess there's little point in
> > > > > > > > > > adding it here, but instead adding something to the existing JMX
> > > > > > > > > > metrics?
> > > > > > > > > > Do you have any suggestions?
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > 5. If all metrics are enabled on a regular Consumer or Producer,
> > > > > > > > > > > do you have an idea how much throughput this would use?
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > It depends on the number of partitions/topics/etc. the client is
> > > > > > > > > > producing to/consuming from.
> > > > > > > > > > I'll add some sizes to the KIP for some typical use-cases.
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Thanks,
> > > > > > > > > > Magnus
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > Thanks
> > > > > > > > > > >
> > > > > > > > > > > On Tue, Oct 19, 2021 at 5:06 PM Magnus Edenhill
> > > > > > > > > > > <magnus@edenhill.se> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > On Tue, Oct 19, 2021 at 13:22, Tom Bentley <tbentley@redhat.com>
> > > > > > > > > > > > wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > Hi Magnus,
> > > > > > > > > > > > >
> > > > > > > > > > > > > I reviewed the KIP since you called the vote (sorry for not
> > > > > > > > > > > > > reviewing when you announced your intention to call the vote).
> > > > > > > > > > > > > I have a few questions on some of the details.
> > > > > > > > > > > > >
> > > > > > > > > > > > > 1. There's no Javadoc on ClientTelemetryPayload.data(), so I
> > > > > > > > > > > > > don't know whether the payload is exposed through this method
> > > > > > > > > > > > > as compressed or not. Later on you say "Decompression of the
> > > > > > > > > > > > > payloads will be handled by the broker metrics plugin, the
> > > > > > > > > > > > > broker should expose a suitable decompression API to the
> > > > > > > > > > > > > metrics plugin for this purpose.", which suggests it's the
> > > > > > > > > > > > > compressed data in the buffer, but then we don't know which
> > > > > > > > > > > > > codec was used, nor the API via which the plugin should
> > > > > > > > > > > > > decompress it if required for forwarding to the ultimate
> > > > > > > > > > > > > metrics store. Should the ClientTelemetryPayload expose a
> > > > > > > > > > > > > method to get the compression and a decompressor?
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > Good point, updated.
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > > 2. The client-side API is expressed as StringOrError
> > > > > > > > > > > > > ClientInstance::ClientInstanceId(int timeout_ms). I understand
> > > > > > > > > > > > > that you're thinking about the librdkafka implementation, but
> > > > > > > > > > > > > it would be good to show the API as it would appear on the
> > > > > > > > > > > > > Apache Kafka clients.
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > This was meant as pseudo-code, but I changed it to Java.
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > > 3. "PushTelemetryRequest|Response - protocol request used by
> > > > > > > > > > > > > the client to send metrics to any broker it is connected to."
> > > > > > > > > > > > > To be clear, this means that the client can choose any of the
> > > > > > > > > > > > > connected brokers and push to just one of them? What should a
> > > > > > > > > > > > > supporting client do if it gets an error when pushing metrics
> > > > > > > > > > > > > to a broker: retry sending to the same broker, try pushing to
> > > > > > > > > > > > > another broker, or drop the metrics? Should supporting clients
> > > > > > > > > > > > > send successive requests to a single broker, or round robin,
> > > > > > > > > > > > > or is that up to the client author? I'm guessing the behaviour
> > > > > > > > > > > > > should be sticky to support the rate limiting features, but I
> > > > > > > > > > > > > think it would be good for client authors if this section were
> > > > > > > > > > > > > explicit on the recommended behaviour.
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > You are right, I've updated the KIP to make this clearer.
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > > 4. "Mapping the client instance id to an actual application
> > > > > > > > > > > > > instance running on a (virtual) machine can be done by
> > > > > > > > > > > > > inspecting the metrics resource labels, such as the client
> > > > > > > > > > > > > source address and source port, or security principal, all of
> > > > > > > > > > > > > which are added by the receiving broker. This will allow the
> > > > > > > > > > > > > operator together with the user to identify the actual
> > > > > > > > > > > > > application instance." Is this really always true? The source
> > > > > > > > > > > > > IP and port might be a loadbalancer/proxy in some setups. The
> > > > > > > > > > > > > principal, as already mentioned in the KIP, might be shared
> > > > > > > > > > > > > between multiple applications. So at worst the organization
> > > > > > > > > > > > > running the clients might have to consult the logs of a set of
> > > > > > > > > > > > > client applications, right?
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > Yes, that's correct. There's no guaranteed mapping from
> > > > > > > > > > > > client_instance_id to an actual instance; that's why the KIP
> > > > > > > > > > > > recommends that client implementations log the client instance
> > > > > > > > > > > > id upon retrieval, and also provide an API for the application
> > > > > > > > > > > > to retrieve the instance id programmatically if it has a better
> > > > > > > > > > > > way of exposing it.
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > > 5. "Tests indicate that a compression ratio up to 10x is
> > > > > > > > > > > > > possible for the standard metrics." Client authors might
> > > > > > > > > > > > > appreciate your mentioning which compression codec got these
> > > > > > > > > > > > > results.
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > Good point. Updated.
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > > 6. "Should the client send a push request prior to expiry of
> > > > > > > > > > > > > the previously calculated PushIntervalMs the broker will
> > > > > > > > > > > > > discard the metrics and return a PushTelemetryResponse with
> > > > > > > > > > > > > the ErrorCode set to RateLimited." Is this RATE_LIMITED a new
> > > > > > > > > > > > > error code? It's not mentioned in the "New Error Codes"
> > > > > > > > > > > > > section.
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > That's a leftover; it should be using the standard ThrottleTime
> > > > > > > > > > > > mechanism.
> > > > > > > > > > > > Fixed.
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > > 7. In the section "Standard client resource labels"
> > > > > > > > > > > > > application_id is described as Kafka Streams only, but the
> > > > > > > > > > > > > section of "Client Identification" talks about "application
> > > > > > > > > > > > > instance id as an optional future nice-to-have that may be
> > > > > > > > > > > > > included as a metrics label if it has been set by the user",
> > > > > > > > > > > > > so I'm confused whether non-Kafka Streams clients should set
> > > > > > > > > > > > > an application_id or not.
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > I'll clarify this in the KIP, but basically we would need to add
> > > > > > > > > > > > an `application.id` config property for non-streams clients for
> > > > > > > > > > > > this purpose, and that's outside the scope of this KIP since we
> > > > > > > > > > > > want to make it zero-conf:ish on the client side.
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > Kind regards,
> > > > > > > > > > > > >
> > > > > > > > > > > > > Tom
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > Thanks for the review,
> > > > > > > > > > > > Magnus
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Thu, Oct 7, 2021 at 5:26 PM Magnus Edenhill
> > > > > > > > > > > > > <magnus@edenhill.se> wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > Hi all,
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > I've updated the KIP following our recent discussions on the
> > > > > > > > > > > > > > mailing list:
> > > > > > > > > > > > > >  - split the protocol in two, one for getting the metrics
> > > > > > > > > > > > > >    subscriptions, and one for pushing the metrics.
> > > > > > > > > > > > > >  - simplifications: initially only one supported metrics
> > > > > > > > > > > > > >    format, no client.id in the instance id, etc.
> > > > > > > > > > > > > >  - made CLIENT_METRICS subscription configuration entries
> > > > > > > > > > > > > >    more structured and allowing better client matching
> > > > > > > > > > > > > >    selectors (not only on the instance id, but also the other
> > > > > > > > > > > > > >    client resource labels, such as client_software_name, etc.).
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Unless there are further comments I'll call the vote in a day
> > > > > > > > > > > > > > or two.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Regards,
> > > > > > > > > > > > > > Magnus
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > On Mon, 4 Oct 2021 at 20:57, Magnus Edenhill <magnus@edenhill.se> wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Hi Gwen,
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > I'm finishing up the KIP based on the last couple of
> > > > > > > > > > > > > > > > discussion points in this thread
> > > > > > > > > > > > > > > > and will call the Vote later this week.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Best,
> > > > > > > > > > > > > > > Magnus
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > On Sat, 2 Oct 2021 at 02:01, Gwen Shapira <gwen@confluent.io.invalid> wrote:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >> Hey,
> > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > >> I noticed that there was no discussion for the last 10 days,
> > > > > > > > > > > > > > > >> but I couldn't find the vote thread. Is there one that I'm
> > > > > > > > > > > > > > > >> missing?
> > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > >> Gwen
> > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > >> On Wed, Sep 22, 2021 at 4:58 AM Magnus Edenhill <magnus@edenhill.se> wrote:
> > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > >> > On Tue, 21 Sep 2021 at 06:58, Colin McCabe <cmccabe@apache.org> wrote:
> > > > > > > > > > > > > > >> >
> > > > > > > > > > > > > > > >> > > On Mon, Sep 20, 2021, at 17:35, Feng Min wrote:
> > > > > > > > > > > > > > > >> > > > Thanks Magnus & Colin for the discussion.
> > > > > > > > > > > > > > > >> > > >
> > > > > > > > > > > > > > > >> > > > Based on KIP-714's stateless design, Client can pretty much
> > > > > > > > > > > > > > > >> > > > use any connection to any broker to send metrics. We are not
> > > > > > > > > > > > > > > >> > > > associating connection with client metric state. Is my
> > > > > > > > > > > > > > > >> > > > understanding correct? If yes, how about the following two
> > > > > > > > > > > > > > > >> > > > scenarios
> > > > > > > > > > > > > > > >> > > >
> > > > > > > > > > > > > > > >> > > > 1) One Client (Client-ID) registers two different client
> > > > > > > > > > > > > > > >> > > > instance id via separate registration. Is it permitted? If
> > > > > > > > > > > > > > > >> > > > OK, how to distinguish them from the case 2 below.
> > > > > > > > > > > > > > >> > > >
> > > > > > > > > > > > > > >> > >
> > > > > > > > > > > > > > >> > > Hi Feng,
> > > > > > > > > > > > > > >> > >
> > > > > > > > > > > > > > > >> > > My understanding, which Magnus can clarify I guess, is that
> > > > > > > > > > > > > > > >> > > you could have something like two Producer instances running
> > > > > > > > > > > > > > > >> > > with the same client.id (perhaps because they're using the
> > > > > > > > > > > > > > > >> > > same config file, for example). They could even be in the same
> > > > > > > > > > > > > > > >> > > process. But they would get separate UUIDs.
> > > > > > > > > > > > > > > >> > >
> > > > > > > > > > > > > > > >> > > I believe Magnus used the term client to mean "Producer or
> > > > > > > > > > > > > > > >> > > Consumer". So if you have both a Producer and a Consumer in
> > > > > > > > > > > > > > > >> > > your application I would expect you'd get separate UUIDs for
> > > > > > > > > > > > > > > >> > > both. Again Magnus can chime in here, I guess.
> > > > > > > > > > > > > > >> > >
> > > > > > > > > > > > > > >> >
> > > > > > > > > > > > > > >> > That's correct.
> > > > > > > > > > > > > > >> >
> > > > > > > > > > > > > > >> >
> > > > > > > > > > > > > > >> > >
> > > > > > > > > > > > > > > >> > > > 2) How about the client restarting? What's the expectation?
> > > > > > > > > > > > > > > >> > > > Should the server expect the client to carry a persisted
> > > > > > > > > > > > > > > >> > > > client instance id or should the client be treated as a new
> > > > > > > > > > > > > > > >> > > > instance?
> > > > > > > > > > > > > > >> > >
> > > > > > > > > > > > > > > >> > > The KIP doesn't describe any mechanism for persistence, so I
> > > > > > > > > > > > > > > >> > > would assume that when you restart the client you get a new
> > > > > > > > > > > > > > > >> > > UUID. I agree that it would be good to spell this out.
> > > > > > > > > > > > > > >> > >
> > > > > > > > > > > > > > >> > >
> > > > > > > > > > > > > > > >> > Right, it will not be persisted since a client instance can't
> > > > > > > > > > > > > > > >> > be restarted.
> > > > > > > > > > > > > > > >> >
> > > > > > > > > > > > > > > >> > Will update the KIP to make this clearer.
> > > > > > > > > > > > > > >> >
> > > > > > > > > > > > > > >> > /Magnus
> > > > > > > > > > > > > > >> >
> > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > >>

Re: [DISCUSS] KIP-714: Client metrics and observability

Posted by Kirk True <ki...@mustardgrain.com>.

Hi Jun,

On Thu, Dec 9, 2021, at 2:28 PM, Jun Rao wrote:
> 13. Using OpenTelemetry. Does that require runtime dependency
> on OpenTelemetry library? How good is the compatibility story
> of OpenTelemetry? This is important since an application could have other
> OpenTelemetry dependencies than the Kafka client.

The current design is that the OpenTelemetry JARs would ship with the client. Perhaps we can design the client such that the JARs aren't even loaded if the user has opted out. The user could even exclude the JARs from their dependencies if they so wished.
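
As a rough illustration of the "JARs aren't even loaded" idea, here is a minimal sketch (not taken from the KIP or from the actual client code) of gating the telemetry path behind an opt-out flag plus a classpath check. The class name OptionalTelemetry, the createIfEnabled helper and the marker class name passed by the caller are all hypothetical; the only real API used is the JDK's Class.forName(String, boolean, ClassLoader). It assumes the telemetry code is only referenced from inside the supplied factory, so that the JVM has no reason to resolve those classes earlier.

```java
import java.util.Optional;
import java.util.function.Supplier;

// Hypothetical helper; not part of the Kafka client or of KIP-714.
final class OptionalTelemetry {
    private OptionalTelemetry() {}

    /** Returns true if the named class can be found, without running its static initializers. */
    static boolean isClassAvailable(String className) {
        try {
            // initialize=false: locate the class but do not initialize it.
            Class.forName(className, false, OptionalTelemetry.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // JAR excluded by the user, or present but unusable.
            return false;
        }
    }

    /**
     * Invokes the factory only if the user has not opted out and the marker
     * class (assumed to live in the optional telemetry JARs) is available.
     */
    static <T> Optional<T> createIfEnabled(boolean enabled, String markerClass, Supplier<T> factory) {
        if (!enabled || !isClassAvailable(markerClass)) {
            return Optional.empty();
        }
        return Optional.ofNullable(factory.get());
    }
}
```

A caller would pass the name of some class from the telemetry JARs as the marker, so that an excluded JAR or an opted-out user simply yields an empty Optional and the client carries on without metrics push. This says nothing about version conflicts with an application's own OpenTelemetry dependencies, which is where shading would come in.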

I can't speak to the compatibility of the libraries. Is it possible that we include a shaded version?

Thanks,
Kirk

> 
> 14. The proposal listed idempotence=true. This is more of a configuration
> than a metric. Are we including that as a metric? What other configurations
> are we including? Should we separate the configurations from the metrics?
> 
> Thanks,
> 
> Jun
> 
> On Mon, Nov 29, 2021 at 7:34 AM Magnus Edenhill <ma...@edenhill.se> wrote:
> 
> > Hey Bob,
> >
> > That's a good point.
> >
> > Request type labels were considered, but since they're already tracked by
> > broker-side metrics they were left out so as to avoid metric duplication.
> > However, those broker-side metrics are not per connection, so they won't be
> > that useful in practice for troubleshooting specific client instances.
> >
> > I'll add the request_type label to the relevant metrics.
> >
> > Thanks,
> > Magnus
> >
> >
> > On Tue, 23 Nov 2021 at 19:20, Bob Barrett <bo...@confluent.io.invalid> wrote:
> >
> > > Hi Magnus,
> > >
> > > Thanks for the thorough KIP, this seems very useful.
> > >
> > > Would it make sense to include the request type as a label for the
> > > `client.request.success`, `client.request.errors` and `client.request.rtt`
> > > metrics? I think it would be very useful to see which specific requests are
> > > succeeding and failing for a client. One specific case I can think of where
> > > this could be useful is producer batch timeouts. If a Java application does
> > > not enable producer client logs (unfortunately, in my experience this
> > > happens more often than it should), the application logs will only contain
> > > the expiration error message, but no information about what is causing the
> > > timeout. The requests might all be succeeding but taking too long to
> > > process batches, or metadata requests might be failing, or some or all
> > > produce requests might be failing (if the bootstrap servers are reachable
> > > from the client but one or more other brokers are not, for example). If the
> > > cluster operator is able to identify the specific requests that are slow or
> > > failing for a client, they will be better able to diagnose the issue
> > > causing batch timeouts.
> > >
> > > One drawback I can think of is that this will increase the cardinality of
> > > the request metrics. But any given client is only going to use a small
> > > subset of the request types, and since we already have partition labels for
> > > the topic-level metrics, I think request labels will still make up a
> > > relatively small percentage of the set of metrics.
> > >
> > > Thanks,
> > > Bob
> > >
> > > On Mon, Nov 22, 2021 at 2:08 AM Viktor Somogyi-Vass <
> > > viktorsomogyi@gmail.com>
> > > wrote:
> > >
> > > > Hi Magnus,
> > > >
> > > > I think this is a very useful addition. We also have a similar (but much
> > > > more simplistic) implementation of this. Maybe I missed it in the KIP but
> > > > what about adding metrics about the subscription cache itself? That I think
> > > > would improve its usability and debuggability as we'd be able to see its
> > > > performance, hit/miss rates, eviction counts and others.
> > > >
> > > > Best,
> > > > Viktor
> > > >
> > > > On Thu, Nov 18, 2021 at 5:12 PM Magnus Edenhill <ma...@edenhill.se>
> > > > wrote:
> > > >
> > > > > Hi Mickael,
> > > > >
> > > > > see inline.
> > > > >
> > > > > > On Wed, 10 Nov 2021 at 15:21, Mickael Maison <mickael.maison@gmail.com> wrote:
> > > > >
> > > > > > Hi Magnus,
> > > > > >
> > > > > > > I see you've addressed some of the points I raised above but some
> > > > > > > (4, 5) have not been addressed yet.
> > > > > >
> > > > >
> > > > > Re 4) How will the user/app know metrics are being sent.
> > > > >
> > > > > > One possibility is to add a JMX metric (thus for user consumption) for
> > > > > > the number of metric pushes the client has performed, or perhaps the
> > > > > > number of metrics subscriptions currently being collected.
> > > > > > Would that be sufficient?
> > > > >
> > > > > Re 5) Metric sizes and rates
> > > > >
> > > > > > A worst case scenario for a producer that is producing to 50 unique
> > > > > > topics and emitting all standard metrics yields a serialized size of
> > > > > > around 100KB prior to compression, which compresses down to about 20-30%
> > > > > > of that depending on compression type and topic name uniqueness.
> > > > > > The numbers for a consumer would be similar.
> > > > > >
> > > > > > In practice the number of unique topics would be far less, and the
> > > > > > subscription set would typically be for a subset of metrics.
> > > > > > So we're probably closer to 1KB, or less, compressed size per client per
> > > > > > push interval.
> > > > >
> > > > > As both the subscription set and push intervals are controlled by the
> > > > > cluster operator it shouldn't be too hard
> > > > > to strike a good balance between metrics overhead and granularity.
> > > > >
> > > > >
> > > > >
> > > > > >
> > > > > > I'm really uneasy with this being enabled by default on the client
> > > > > > side. When collecting data, I think the best practice is to ensure
> > > > > > users are explicitly enabling it.
> > > > > >
> > > > >
> > > > > > Requiring metrics to be explicitly enabled on clients severely cripples
> > > > > > the feature's usability and value.
> > > > > >
> > > > > > One of the problems that this KIP aims to solve is for useful metrics to
> > > > > > be available on demand regardless of the technical expertise of the
> > > > > > user. As Ryanne points out, a savvy user/organization will typically
> > > > > > have metrics collection and monitoring in place already, and the
> > > > > > benefits of this KIP are then more of a common set and format of metrics
> > > > > > across client implementations and languages.
> > > > > > But that is not the typical Kafka user in my experience; they're not
> > > > > > Kafka experts and they don't have the knowledge of how to best
> > > > > > instrument their clients.
> > > > > > Having metrics enabled by default for this user base allows the Kafka
> > > > > > operators to proactively and reactively monitor and troubleshoot client
> > > > > > issues, without the need for the less savvy user to do anything.
> > > > > > It is often too late to tell a user to enable metrics when the problem
> > > > > > has already occurred.
> > > > >
> > > > > > Now, to be clear, even though metrics are enabled by default on clients,
> > > > > > they are not enabled by default on the brokers; the Kafka operator needs
> > > > > > to build and set up a metrics plugin and add metrics subscriptions
> > > > > > before anything is sent from the client.
> > > > > > It is opt-out on the clients and opt-in on the broker.
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > > > You mentioned brokers already have some (most?) of the information
> > > > > > > contained in metrics, if so then why are we collecting it again?
> > > > > > > Surely there must be some new information in the client metrics.
> > > > > >
> > > > >
> > > > > > From the user's perspective the Kafka infrastructure extends from
> > > > > > producer.send() to messages being returned from consumer.poll(), a giant
> > > > > > black box where there's a lot going on between those two points. The
> > > > > > brokers currently only see what happens once those requests and messages
> > > > > > hit the broker, but as Kafka clients are complex pieces of machinery
> > > > > > there's a myriad of queues, timers, and state that's critical to the
> > > > > > operation and infrastructure that's not currently visible to the
> > > > > > operator.
> > > > > > Relying on the user to provide this missing information accurately and
> > > > > > in a timely manner is not generally feasible.
> > > > >
> > > > >
> > > > > > Most of the standard metrics listed in the KIP are data points that the
> > > > > > broker does not have.
> > > > > > Only a small number of metrics are duplicates (like the request counts
> > > > > > and sizes), but they are included to ease correlation when inspecting
> > > > > > these client metrics.
> > > > >
> > > > >
> > > > >
> > > > > > > Moreover this is a brand new feature so it's even harder to justify
> > > > > > > enabling it and forcing it onto all our users. If disabled by default,
> > > > > > > it's relatively easy to enable in a new release if we decide to, but
> > > > > > > once enabled by default it's much harder to disable. Also this feature
> > > > > > > will apply to all future metrics we will add.
> > > > > >
> > > > >
> > > > > > I think maturity of a feature implementation should be the deciding
> > > > > > factor, rather than the design of it (which this KIP is). I.e., if the
> > > > > > implementation is not deemed mature enough for release X.Y it will be
> > > > > > disabled.
> > > > >
> > > > >
> > > > >
> > > > > > > Overall I think it's an interesting feature but I'd prefer to be
> > > > > > > slightly defensive and see how it works in practice before enabling it
> > > > > > > everywhere.
> > > > > >
> > > > >
> > > > > > Right, and I agree on being defensive, but since this feature still
> > > > > > requires manual enabling on the brokers before actually being used, I
> > > > > > think that gives enough control to opt-in or out of this feature as
> > > > > > needed.
> > > > >
> > > > > Thanks for your comments!
> > > > >
> > > > > Regards,
> > > > > Magnus
> > > > >
> > > > >
> > > > >
> > > > > > Thanks,
> > > > > > Mickael
> > > > > >
> > > > > > > On Mon, Nov 8, 2021 at 7:55 AM Magnus Edenhill <magnus@edenhill.se> wrote:
> > > > > > >
> > > > > > > Thanks David for pointing this out,
> > > > > > > I've updated the KIP to include client_id as a matching selector.
> > > > > > >
> > > > > > > Regards,
> > > > > > > Magnus
> > > > > > >
> > > > > > > Den tors 4 nov. 2021 kl 18:01 skrev David Mao
> > > > > <dmao@confluent.io.invalid
> > > > > > >:
> > > > > > >
> > > > > > > > Hey Magnus,
> > > > > > > >
> > > > > > > > > I noticed that the KIP outlines the initial selectors supported as:
> > > > > > > > >
> > > > > > > > >    - client_instance_id - CLIENT_INSTANCE_ID UUID string representation.
> > > > > > > > >    - client_software_name  - client software implementation name.
> > > > > > > > >    - client_software_version  - client software implementation version.
> > > > > > > > >
> > > > > > > > > In the given reactive monitoring workflow, we mention that the
> > > > > > > > > application user does not know their client's client instance ID, but
> > > > > > > > > it's outlined that the operator can add a metrics subscription
> > > > > > > > > selecting for clientId. I don't see clientId as one of the supported
> > > > > > > > > selectors.
> > > > > > > > > I can see how this would have made sense in a previous iteration given
> > > > > > > > > that the previous client instance ID proposal was to construct the
> > > > > > > > > client instance ID using clientId as a prefix. Now that the client
> > > > > > > > > instance ID is a UUID, would we want to add clientId as a supported
> > > > > > > > > selector?
> > > > > > > > > Let me know what you think.
> > > > > > > >
> > > > > > > > David
> > > > > > > >
> > > > > > > > > On Tue, Oct 19, 2021 at 12:33 PM Magnus Edenhill <magnus@edenhill.se> wrote:
> > > > > > > >
> > > > > > > > > Hi Mickael!
> > > > > > > > >
> > > > > > > > > > On Tue, 19 Oct 2021 at 19:30, Mickael Maison <mickael.maison@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > > > Hi Magnus,
> > > > > > > > > >
> > > > > > > > > > Thanks for the proposal.
> > > > > > > > > >
> > > > > > > > > > > 1. Looking at the protocol section, isn't "ClientInstanceId" expected
> > > > > > > > > > > to be a field in GetTelemetrySubscriptionsResponseV0? Otherwise, how
> > > > > > > > > > > does a client retrieve this value?
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > Good catch, it got removed by mistake in one of the edits.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > 2. In the client API section, you mention a new method
> > > > > > > > > > > "clientInstanceId()". Can you clarify which interfaces are affected?
> > > > > > > > > > > Is it only Consumer and Producer?
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > And Admin. Will update the KIP.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > > > 3. I'm a bit concerned this is enabled by default. Even if the data
> > > > > > > > > > > collected is supposed to be not sensitive, I think this can be
> > > > > > > > > > > problematic in some environments. Also users don't seem to have the
> > > > > > > > > > > choice to only expose some metrics. Knowing how much data transits
> > > > > > > > > > > through some applications can be considered critical.
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > > The broker already knows how much data transits through the client
> > > > > > > > > > though, right?
> > > > > > > > > > Care has been taken not to expose information in the standard metrics
> > > > > > > > > > that might reveal sensitive information.
> > > > > > > > > >
> > > > > > > > > > Do you have an example of how the proposed metrics could leak
> > > > > > > > > > sensitive information?
> > > > > > > > > > As for limiting which metrics to export, I guess that could make sense
> > > > > > > > > > in some very sensitive use-cases, but those users might disable
> > > > > > > > > > metrics altogether for now.
> > > > > > > > > > Could these concerns be addressed by a later KIP?
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > 4. As a user, how do you know if your application is actively sending
> > > > > > > > > > > metrics? Are there new metrics exposing what's going on, like how
> > > > > > > > > > > much data is being sent?
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > > That's a good question.
> > > > > > > > > > Since the proposed metrics interface is not aimed at, or directly
> > > > > > > > > > available to, the application I guess there's little point of adding
> > > > > > > > > > it here, but instead adding something to the existing JMX metrics?
> > > > > > > > > > Do you have any suggestions?
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > > > 5. If all metrics are enabled on a regular Consumer or Producer, do
> > > > > > > > > > > you have an idea how much throughput this would use?
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > > It depends on the number of partitions/topics/etc. the client is
> > > > > > > > > > producing to/consuming from.
> > > > > > > > > > I'll add some sizes to the KIP for some typical use-cases.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > > Magnus
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > > Thanks
> > > > > > > > > >
> > > > > > > > > > > On Tue, Oct 19, 2021 at 5:06 PM Magnus Edenhill <magnus@edenhill.se> wrote:
> > > > > > > > > > >
> > > > > > > > > > > > On Tue, 19 Oct 2021 at 13:22, Tom Bentley <tbentley@redhat.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Hi Magnus,
> > > > > > > > > > > >
> > > > > > > > > > > > > I reviewed the KIP since you called the vote (sorry for not reviewing
> > > > > > > > > > > > > when you announced your intention to call the vote). I have a few
> > > > > > > > > > > > > questions on some of the details.
> > > > > > > > > > > > >
> > > > > > > > > > > > > 1. There's no Javadoc on ClientTelemetryPayload.data(), so I don't
> > > > > > > > > > > > > know whether the payload is exposed through this method as compressed
> > > > > > > > > > > > > or not. Later on you say "Decompression of the payloads will be
> > > > > > > > > > > > > handled by the broker metrics plugin, the broker should expose a
> > > > > > > > > > > > > suitable decompression API to the metrics plugin for this purpose.",
> > > > > > > > > > > > > which suggests it's the compressed data in the buffer, but then we
> > > > > > > > > > > > > don't know which codec was used, nor the API via which the plugin
> > > > > > > > > > > > > should decompress it if required for forwarding to the ultimate
> > > > > > > > > > > > > metrics store. Should the ClientTelemetryPayload expose a method to
> > > > > > > > > > > > > get the compression and a decompressor?
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > Good point, updated.
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > > > 2. The client-side API is expressed as StringOrError
> > > > > > > > > > > > > ClientInstance::ClientInstanceId(int timeout_ms). I understand that
> > > > > > > > > > > > > you're thinking about the librdkafka implementation, but it would be
> > > > > > > > > > > > > good to show the API as it would appear on the Apache Kafka clients.
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > This was meant as pseudo-code, but I changed it to Java.
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > > > 3. "PushTelemetryRequest|Response - protocol request used by the
> > > > > > > > > > > > > client to send metrics to any broker it is connected to." To be
> > > > > > > > > > > > > clear, this means that the client can choose any of the connected
> > > > > > > > > > > > > brokers and push to just one of them? What should a supporting client
> > > > > > > > > > > > > do if it gets an error when pushing metrics to a broker, retry
> > > > > > > > > > > > > sending to the same broker or try pushing to another broker, or drop
> > > > > > > > > > > > > the metrics? Should supporting clients send successive requests to a
> > > > > > > > > > > > > single broker, or round robin, or is that up to the client author?
> > > > > > > > > > > > > I'm guessing the behaviour should be sticky to support the rate
> > > > > > > > > > > > > limiting features, but I think it would be good for client authors if
> > > > > > > > > > > > > this section were explicit on the recommended behaviour.
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > You are right, I've updated the KIP to make this clearer.
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > > > 4. "Mapping the client instance id to an actual application instance
> > > > > > > > > > > > > running on a (virtual) machine can be done by inspecting the metrics
> > > > > > > > > > > > > resource labels, such as the client source address and source port,
> > > > > > > > > > > > > or security principal, all of which are added by the receiving
> > > > > > > > > > > > > broker. This will allow the operator together with the user to
> > > > > > > > > > > > > identify the actual application instance." Is this really always
> > > > > > > > > > > > > true? The source IP and port might be a loadbalancer/proxy in some
> > > > > > > > > > > > > setups. The principal, as already mentioned in the KIP, might be
> > > > > > > > > > > > > shared between multiple applications. So at worst the organization
> > > > > > > > > > > > > running the clients might have to consult the logs of a set of client
> > > > > > > > > > > > > applications, right?
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > > Yes, that's correct. There's no guaranteed mapping from
> > > > > > > > > > > > client_instance_id to an actual instance, that's why the KIP
> > > > > > > > > > > > recommends client implementations to log the client instance id upon
> > > > > > > > > > > > retrieval, and also provide an API for the application to retrieve
> > > > > > > > > > > > the instance id programmatically if it has a better way of exposing
> > > > > > > > > > > > it.
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > > > 5. "Tests indicate that a compression ratio up to 10x is possible for
> > > > > > > > > > > > > the standard metrics." Client authors might appreciate your
> > > > > > > > > > > > > mentioning which compression codec got these results.
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > Good point. Updated.
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > > > 6. "Should the client send a push request prior to expiry of the
> > > > > > > > > > > > > previously calculated PushIntervalMs the broker will discard the
> > > > > > > > > > > > > metrics and return a PushTelemetryResponse with the ErrorCode set to
> > > > > > > > > > > > > RateLimited." Is this RATE_LIMITED a new error code? It's not
> > > > > > > > > > > > > mentioned in the "New Error Codes" section.
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > > That's a leftover, it should be using the standard ThrottleTime
> > > > > > > > > > > > mechanism.
> > > > > > > > > > > > Fixed.
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > > > 7. In the section "Standard client resource labels" application_id is
> > > > > > > > > > > > > described as Kafka Streams only, but the section of "Client
> > > > > > > > > > > > > Identification" talks about "application instance id as an optional
> > > > > > > > > > > > > future nice-to-have that may be included as a metrics label if it has
> > > > > > > > > > > > > been set by the user", so I'm confused whether non-Kafka Streams
> > > > > > > > > > > > > clients should set an application_id or not.
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > I'll clarify this in the KIP, but basically we would need
> > > to
> > > > > add
> > > > > > an `
> > > > > > > > > > > application.id` config
> > > > > > > > > > > property for non-streams clients for this purpose, and
> > > that's
> > > > > > outside
> > > > > > > > > the
> > > > > > > > > > > scope of this KIP since we want to make it zero-conf:ish
> > on
> > > > the
> > > > > > > > client
> > > > > > > > > > side.
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > Kind regards,
> > > > > > > > > > > >
> > > > > > > > > > > > Tom
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > Thanks for the review,
> > > > > > > > > > > Magnus
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > On Thu, Oct 7, 2021 at 5:26 PM Magnus Edenhill <
> > > > > > magnus@edenhill.se
> > > > > > > > >
> > > > > > > > > > wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > Hi all,
> > > > > > > > > > > > >
> > > > > > > > > > > > > I've updated the KIP following our recent discussions
> > > on
> > > > > the
> > > > > > > > > mailing
> > > > > > > > > > > > list:
> > > > > > > > > > > > >  - split the protocol in two, one for getting the
> > > metrics
> > > > > > > > > > subscriptions,
> > > > > > > > > > > > > and one for pushing the metrics.
> > > > > > > > > > > > >  - simplifications: initially only one supported
> > > metrics
> > > > > > format,
> > > > > > > > no
> > > > > > > > > > > > > client.id in the instance id, etc.
> > > > > > > > > > > > >  - made CLIENT_METRICS subscription configuration
> > > entries
> > > > > > more
> > > > > > > > > > structured
> > > > > > > > > > > > >    and allowing better client matching selectors (not
> > > > only
> > > > > > on the
> > > > > > > > > > > > instance
> > > > > > > > > > > > > id, but also the other
> > > > > > > > > > > > >    client resource labels, such as
> > > client_software_name,
> > > > > > etc.).
> > > > > > > > > > > > >
> > > > > > > > > > > > > Unless there are further comments I'll call the vote
> > > in a
> > > > > > day or
> > > > > > > > > two.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Regards,
> > > > > > > > > > > > > Magnus
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > Den mån 4 okt. 2021 kl 20:57 skrev Magnus Edenhill <
> > > > > > > > > > magnus@edenhill.se>:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > Hi Gwen,
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > I'm finishing up the KIP based on the last couple
> > of
> > > > > > discussion
> > > > > > > > > > points
> > > > > > > > > > > > in
> > > > > > > > > > > > > > this thread
> > > > > > > > > > > > > > and will call the Vote later this week.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Best,
> > > > > > > > > > > > > > Magnus
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Den lör 2 okt. 2021 kl 02:01 skrev Gwen Shapira
> > > > > > > > > > > > > <gwen@confluent.io.invalid
> > > > > > > > > > > > > > >:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >> Hey,
> > > > > > > > > > > > > >>
> > > > > > > > > > > > > >> I noticed that there was no discussion for the
> > last
> > > 10
> > > > > > days,
> > > > > > > > > but I
> > > > > > > > > > > > > >> couldn't
> > > > > > > > > > > > > >> find the vote thread. Is there one that I'm
> > missing?
> > > > > > > > > > > > > >>
> > > > > > > > > > > > > >> Gwen
> > > > > > > > > > > > > >>
> > > > > > > > > > > > > >> On Wed, Sep 22, 2021 at 4:58 AM Magnus Edenhill <
> > > > > > > > > > magnus@edenhill.se>
> > > > > > > > > > > > > >> wrote:
> > > > > > > > > > > > > >>
> > > > > > > > > > > > > >> > Den tis 21 sep. 2021 kl 06:58 skrev Colin
> > McCabe <
> > > > > > > > > > > > cmccabe@apache.org
> > > > > > > > > > > > > >:
> > > > > > > > > > > > > >> >
> > > > > > > > > > > > > >> > > On Mon, Sep 20, 2021, at 17:35, Feng Min
> > wrote:
> > > > > > > > > > > > > >> > > > Thanks Magnus & Colin for the discussion.
> > > > > > > > > > > > > >> > > >
> > > > > > > > > > > > > >> > > > Based on KIP-714's stateless design, Client
> > > can
> > > > > > pretty
> > > > > > > > > much
> > > > > > > > > > use
> > > > > > > > > > > > > any
> > > > > > > > > > > > > >> > > > connection to any broker to send metrics. We
> > > are
> > > > > not
> > > > > > > > > > associating
> > > > > > > > > > > > > >> > > connection
> > > > > > > > > > > > > >> > > > with client metric state. Is my
> > understanding
> > > > > > correct?
> > > > > > > > If
> > > > > > > > > > yes,
> > > > > > > > > > > > > how
> > > > > > > > > > > > > >> > about
> > > > > > > > > > > > > >> > > > the following two scenarios
> > > > > > > > > > > > > >> > > >
> > > > > > > > > > > > > >> > > > 1) One Client (Client-ID) registers two
> > > > different
> > > > > > client
> > > > > > > > > > > > instance
> > > > > > > > > > > > > id
> > > > > > > > > > > > > >> > via
> > > > > > > > > > > > > >> > > > separate registration. Is it permitted? If
> > OK,
> > > > how
> > > > > > to
> > > > > > > > > > > > distinguish
> > > > > > > > > > > > > >> them
> > > > > > > > > > > > > >> > > from
> > > > > > > > > > > > > >> > > > the case 2 below.
> > > > > > > > > > > > > >> > > >
> > > > > > > > > > > > > >> > >
> > > > > > > > > > > > > >> > > Hi Feng,
> > > > > > > > > > > > > >> > >
> > > > > > > > > > > > > >> > > My understanding, which Magnus can clarify I
> > > > guess,
> > > > > is
> > > > > > > > that
> > > > > > > > > > you
> > > > > > > > > > > > > could
> > > > > > > > > > > > > >> > have
> > > > > > > > > > > > > >> > > something like two Producer instances running
> > > with
> > > > > the
> > > > > > > > same
> > > > > > > > > > > > > client.id
> > > > > > > > > > > > > >> > > (perhaps because they're using the same config
> > > > file,
> > > > > > for
> > > > > > > > > > example).
> > > > > > > > > > > > > >> They
> > > > > > > > > > > > > >> > > could even be in the same process. But they
> > > would
> > > > > get
> > > > > > > > > separate
> > > > > > > > > > > > > UUIDs.
> > > > > > > > > > > > > >> > >
> > > > > > > > > > > > > >> > > I believe Magnus used the term client to mean
> > > > > > "Producer or
> > > > > > > > > > > > > Consumer".
> > > > > > > > > > > > > >> So
> > > > > > > > > > > > > >> > > if you have both a Producer and a Consumer in
> > > your
> > > > > > > > > > application I
> > > > > > > > > > > > > would
> > > > > > > > > > > > > >> > > expect you'd get separate UUIDs for both.
> > Again
> > > > > > Magnus can
> > > > > > > > > > chime
> > > > > > > > > > > > in
> > > > > > > > > > > > > >> > here, I
> > > > > > > > > > > > > >> > > guess.
> > > > > > > > > > > > > >> > >
> > > > > > > > > > > > > >> >
> > > > > > > > > > > > > >> > That's correct.
> > > > > > > > > > > > > >> >
> > > > > > > > > > > > > >> >
> > > > > > > > > > > > > >> > >
> > > > > > > > > > > > > >> > > > 2) How about the client restarting? What's
> > the
> > > > > > > > > expectation?
> > > > > > > > > > > > Should
> > > > > > > > > > > > > >> the
> > > > > > > > > > > > > >> > > > server expect the client to carry a
> > persisted
> > > > > client
> > > > > > > > > > instance id
> > > > > > > > > > > > > or
> > > > > > > > > > > > > >> > > should
> > > > > > > > > > > > > >> > > > the client be treated as a new instance?
> > > > > > > > > > > > > >> > >
> > > > > > > > > > > > > >> > > The KIP doesn't describe any mechanism for
> > > > > > persistence,
> > > > > > > > so I
> > > > > > > > > > would
> > > > > > > > > > > > > >> assume
> > > > > > > > > > > > > >> > > that when you restart the client you get a new
> > > > > UUID. I
> > > > > > > > agree
> > > > > > > > > > that
> > > > > > > > > > > > it
> > > > > > > > > > > > > >> > would
> > > > > > > > > > > > > >> > > be good to spell this out.
> > > > > > > > > > > > > >> > >
> > > > > > > > > > > > > >> > >
> > > > > > > > > > > > > >> > Right, it will not be persisted since a client
> > > > > instance
> > > > > > > > can't
> > > > > > > > > be
> > > > > > > > > > > > > >> restarted.
> > > > > > > > > > > > > >> >
> > > > > > > > > > > > > >> > Will update the KIP to make this clearer.
> > > > > > > > > > > > > >> >
> > > > > > > > > > > > > >> > /Magnus
> > > > > > > > > > > > > >> >
> > > > > > > > > > > > > >>
> > > > > > > > > > > > > >>
> > > > > > > > > > > > > >> --
> > > > > > > > > > > > > >> Gwen Shapira
> > > > > > > > > > > > > >> Engineering Manager | Confluent
> > > > > > > > > > > > > >> 650.450.2760 | @gwenshap
> > > > > > > > > > > > > >> Follow us: Twitter | blog
> > > > > > > > > > > > > >>

Re: [DISCUSS] KIP-714: Client metrics and observability

Posted by Jun Rao <ju...@confluent.io.INVALID>.

Hi, Magnus.

Thanks for the KIP. A few comments below.

10. There seem to be some questions about the use cases of this KIP since we
already have a client-side metric reporter. It would be useful to provide a
bit more detail on that. To me, there are 3 potential use cases: (1) not
all organizations enforce client-side metric collection; (2) if the
data is shared among 3rd parties, there is less control over external
clients; (3) when Kafka is offered as a hosted service. It would also be
useful to outline the client problems this KIP can help identify. For
example, this KIP may not help with any client connectivity problems.

11. Have we considered sending the metrics with the existing produce
request to an internal topic instead of a new request, PushTelemetryRequest?
The potential benefits are (1) reusing the existing request's support for
compression, throughput throttling, etc. and (2) we could potentially get
rid of ClientTelemetryReceiver. Once the metrics land in a Kafka topic, the
operator can decide what to do with them by just consuming the topic.
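
To make benefit (1) in item 11 a bit more concrete, here is a minimal sketch of
the settings a metrics-only producer might use if metrics went through the
regular produce path. The config keys are existing producer settings; the class
name, the method name, the chosen values and the topic name `__client_metrics`
are assumptions made up for illustration and are not part of the KIP.

```java
import java.util.Properties;

public class MetricsViaProduceSketch {
    // Hypothetical internal topic name; the KIP does not define one.
    static final String METRICS_TOPIC = "__client_metrics";

    /** Builds settings for a metrics-only producer (illustrative values). */
    static Properties metricsProducerProps(String bootstrapServers) {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", bootstrapServers);
        // Compression is already handled by the regular produce path.
        props.setProperty("compression.type", "zstd");
        // Let samples accumulate for up to one second before a batch is sent.
        props.setProperty("linger.ms", "1000");
        props.setProperty("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.setProperty("value.serializer",
                "org.apache.kafka.common.serialization.ByteArraySerializer");
        return props;
    }

    public static void main(String[] args) {
        Properties props = metricsProducerProps("localhost:9092");
        System.out.println("topic=" + METRICS_TOPIC + " config=" + props);
    }
}
```

These properties could be handed to a regular KafkaProducer; the sketch stops
short of that so it runs without the Kafka client library on the classpath.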

12. It seems that we are defining a set of common metric names that every
client needs to support. Are most non-Java clients following the naming
convention of the Java client metrics? If not, forcing them all to change
their metric names could be disruptive.

13. Using OpenTelemetry. Does that require a runtime dependency
on the OpenTelemetry library? How good is the compatibility story
of OpenTelemetry? This is important since an application could have other
OpenTelemetry dependencies than the Kafka client.

14. The proposal listed idempotence=true. This is more of a configuration
than a metric. Are we including that as a metric? What other configurations
are we including? Should we separate the configurations from the metrics?
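
One way to picture the separation asked about in item 14 is sketched below:
static settings such as idempotence are split off as labels, and everything
else is treated as a numeric sample. This is only an illustration under
assumed names; the class, the methods and the CONFIG_KEYS set are hypothetical
and do not come from the KIP.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

public class ConfigVsMetricSketch {
    // Hypothetical set of keys that describe static client configuration.
    static final Set<String> CONFIG_KEYS = Set.of("idempotence", "compression.type");

    /** Static settings, reported as labels rather than as time series. */
    static Map<String, String> configLabels(Map<String, String> reported) {
        Map<String, String> labels = new LinkedHashMap<>();
        reported.forEach((key, value) -> {
            if (CONFIG_KEYS.contains(key)) {
                labels.put(key, value);
            }
        });
        return labels;
    }

    /** Every other entry is parsed as a numeric metric sample. */
    static Map<String, Double> metricSamples(Map<String, String> reported) {
        Map<String, Double> samples = new LinkedHashMap<>();
        reported.forEach((key, value) -> {
            if (!CONFIG_KEYS.contains(key)) {
                samples.put(key, Double.parseDouble(value));
            }
        });
        return samples;
    }

    public static void main(String[] args) {
        Map<String, String> reported = new LinkedHashMap<>();
        reported.put("idempotence", "true");
        reported.put("client.request.rtt", "12.5");
        reported.put("client.request.errors", "0");
        System.out.println("labels:  " + configLabels(reported));
        System.out.println("metrics: " + metricSamples(reported));
    }
}
```

With a split like this, a configuration value would be sent once as a label
instead of being repeated as a pseudo-metric in every push.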

Thanks,

Jun

On Mon, Nov 29, 2021 at 7:34 AM Magnus Edenhill <ma...@edenhill.se> wrote:

> Hey Bob,
>
> That's a good point.
>
> Request type labels were considered, but since they're already tracked by
> broker-side metrics they were left out to avoid metric duplication.
> However, those broker-side metrics are not per connection, so they won't
> be that useful in practice for troubleshooting specific client instances.
>
> I'll add the request_type label to the relevant metrics.
>
> Thanks,
> Magnus
>
>
> On Tue, Nov 23, 2021 at 19:20, Bob Barrett
> <bo...@confluent.io.invalid> wrote:
>
> > Hi Magnus,
> >
> > Thanks for the thorough KIP, this seems very useful.
> >
> > Would it make sense to include the request type as a label for the
> > `client.request.success`, `client.request.errors` and
> > `client.request.rtt` metrics? I think it would be very useful to see
> > which specific requests are succeeding and failing for a client. One
> > specific case I can think of where this could be useful is producer
> > batch timeouts. If a Java application does not enable producer client
> > logs (unfortunately, in my experience this happens more often than it
> > should), the application logs will only contain the expiration error
> > message, but no information about what is causing the timeout. The
> > requests might all be succeeding but taking too long to process
> > batches, or metadata requests might be failing, or some or all produce
> > requests might be failing (if the bootstrap servers are reachable from
> > the client but one or more other brokers are not, for example). If the
> > cluster operator is able to identify the specific requests that are
> > slow or failing for a client, they will be better able to diagnose the
> > issue causing batch timeouts.
> >
> > One drawback I can think of is that this will increase the cardinality
> > of the request metrics. But any given client is only going to use a
> > small subset of the request types, and since we already have partition
> > labels for the topic-level metrics, I think request labels will still
> > make up a relatively small percentage of the set of metrics.
> >
> > Thanks,
> > Bob
> >
> > On Mon, Nov 22, 2021 at 2:08 AM Viktor Somogyi-Vass <
> > viktorsomogyi@gmail.com>
> > wrote:
> >
> > > Hi Magnus,
> > >
> > > I think this is a very useful addition. We also have a similar (but
> > > much more simplistic) implementation of this. Maybe I missed it in
> > > the KIP, but what about adding metrics about the subscription cache
> > > itself? I think that would improve its usability and debuggability,
> > > as we'd be able to see its performance, hit/miss rates, eviction
> > > counts and so on.
> > >
> > > Best,
> > > Viktor
> > >
> > > On Thu, Nov 18, 2021 at 5:12 PM Magnus Edenhill <ma...@edenhill.se>
> > > wrote:
> > >
> > > > Hi Mickael,
> > > >
> > > > see inline.
> > > >
> > > > On Wed, Nov 10, 2021 at 15:21, Mickael Maison
> > > > <mickael.maison@gmail.com> wrote:
> > > >
> > > > > Hi Magnus,
> > > > >
> > > > > I see you've addressed some of the points I raised above but some
> > > > > (4, 5) have not been addressed yet.
> > > > >
> > > >
> > > > Re 4) How will the user/app know metrics are being sent?
> > > >
> > > > One possibility is to add a JMX metric (thus for user consumption)
> > > > for the number of metric pushes the client has performed, or perhaps
> > > > the number of metrics subscriptions currently being collected.
> > > > Would that be sufficient?
> > > >
> > > > Re 5) Metric sizes and rates
> > > >
> > > > A worst-case scenario for a producer that is producing to 50 unique
> > > > topics and emitting all standard metrics yields a serialized size of
> > > > around 100 KB prior to compression, which compresses down to about
> > > > 20-30% of that (roughly 20-30 KB) depending on compression type and
> > > > topic name uniqueness.
> > > > The numbers for a consumer would be similar.
> > > >
> > > > In practice the number of unique topics would be far smaller, and the
> > > > subscription set would typically be for a subset of metrics.
> > > > So we're probably closer to 1 KB, or less, compressed size per client
> > > > per push interval.
> > > >
> > > > As both the subscription set and push intervals are controlled by the
> > > > cluster operator it shouldn't be too hard
> > > > to strike a good balance between metrics overhead and granularity.
> > > >
> > > >
> > > >
> > > > >
> > > > > I'm really uneasy with this being enabled by default on the client
> > > > > side. When collecting data, I think the best practice is to ensure
> > > > > users are explicitly enabling it.
> > > > >
> > > >
> > > > Requiring metrics to be explicitly enabled on clients severely
> > > > cripples the feature's usability and value.
> > > >
> > > > One of the problems that this KIP aims to solve is for useful metrics
> > > > to be available on demand regardless of the technical expertise of
> > > > the user. As Ryanne points out, a savvy user/organization will
> > > > typically have metrics collection and monitoring in place already,
> > > > and the benefits of this KIP are then more about a common set and
> > > > format of metrics across client implementations and languages.
> > > > But that is not the typical Kafka user in my experience; they're not
> > > > Kafka experts and they don't have the knowledge of how to best
> > > > instrument their clients.
> > > > Having metrics enabled by default for this user base allows the Kafka
> > > > operators to proactively and reactively monitor and troubleshoot
> > > > client issues, without the need for the less savvy user to do
> > > > anything.
> > > > It is often too late to tell a user to enable metrics when the
> > > > problem has already occurred.
> > > >
> > > > Now, to be clear, even though metrics are enabled by default on
> > > > clients, they are not enabled by default on the brokers; the Kafka
> > > > operator needs to build and set up a metrics plugin and add metrics
> > > > subscriptions before anything is sent from the client.
> > > > It is opt-out on the clients and opt-in on the broker.
> > > >
> > > >
> > > >
> > > >
> > > > > You mentioned brokers already have some (most?) of the information
> > > > > contained in metrics; if so, then why are we collecting it again?
> > > > > Surely there must be some new information in the client metrics.
> > > > >
> > > >
> > > > From the user's perspective the Kafka infrastructure extends from
> > > > producer.send() to messages being returned from consumer.poll(), a
> > > > giant black box where there's a lot going on between those two
> > > > points. The brokers currently only see what happens once those
> > > > requests and messages hit the broker, but as Kafka clients are
> > > > complex pieces of machinery there's a myriad of queues, timers, and
> > > > state that's critical to the operation and infrastructure but not
> > > > currently visible to the operator.
> > > > Relying on the user to provide this missing information accurately
> > > > and in a timely manner is not generally feasible.
> > > >
> > > >
> > > > Most of the standard metrics listed in the KIP are data points that
> > > > the broker does not have.
> > > > Only a small number of metrics are duplicates (like the request
> > > > counts and sizes), but they are included to ease correlation when
> > > > inspecting these client metrics.
> > > >
> > > >
> > > >
> > > > > Moreover this is a brand new feature so it's even harder to justify
> > > > > enabling it and forcing it onto all our users. If disabled by
> > > > > default, it's relatively easy to enable in a new release if we
> > > > > decide to, but once enabled by default it's much harder to disable.
> > > > > Also this feature will apply to all future metrics we will add.
> > > > >
> > > >
> > > > I think the maturity of a feature implementation should be the
> > > > deciding factor, rather than its design (which is what this KIP is).
> > > > I.e., if the implementation is not deemed mature enough for release
> > > > X.Y it will be disabled.
> > > >
> > > >
> > > >
> > > > > Overall I think it's an interesting feature but I'd prefer to be
> > > > > slightly defensive and see how it works in practice before enabling
> > > > > it everywhere.
> > > > >
> > > >
> > > > Right, and I agree on being defensive, but since this feature still
> > > > requires manual enabling on the brokers before actually being used,
> > > > I think that gives enough control to opt in or out of this feature
> > > > as needed.
> > > >
> > > > Thanks for your comments!
> > > >
> > > > Regards,
> > > > Magnus
> > > >
> > > >
> > > >
> > > > > Thanks,
> > > > > Mickael
> > > > >
> > > > > On Mon, Nov 8, 2021 at 7:55 AM Magnus Edenhill <magnus@edenhill.se
> >
> > > > wrote:
> > > > > >
> > > > > > Thanks David for pointing this out,
> > > > > > I've updated the KIP to include client_id as a matching selector.
> > > > > >
> > > > > > Regards,
> > > > > > Magnus
> > > > > >
> > > > > > On Thu, Nov 4, 2021 at 18:01, David Mao
> > > > > > <dmao@confluent.io.invalid> wrote:
> > > > > >
> > > > > > > Hey Magnus,
> > > > > > >
> > > > > > > I noticed that the KIP outlines the initial selectors supported
> > as:
> > > > > > >
> > > > > > >    - client_instance_id - CLIENT_INSTANCE_ID UUID string
> > > > > representation.
> > > > > > >    - client_software_name  - client software implementation
> name.
> > > > > > >    - client_software_version  - client software implementation
> > > > version.
> > > > > > >
> > > > > > > In the given reactive monitoring workflow, we mention that the
> > > > > application
> > > > > > > user does not know their client's client instance ID, but it's
> > > > outlined
> > > > > > > that the operator can add a metrics subscription selecting for
> > > > > clientId. I
> > > > > > > don't see clientId as one of the supported selectors.
> > > > > > > I can see how this would have made sense in a previous
> iteration
> > > > given
> > > > > that
> > > > > > > the previous client instance ID proposal was to construct the
> > > client
> > > > > > > instance ID using clientId as a prefix. Now that the client
> > > instance
> > > > > ID is
> > > > > > > a UUID, would we want to add clientId as a supported selector?
> > > > > > > Let me know what you think.
> > > > > > >
> > > > > > > David
> > > > > > >
> > > > > > > On Tue, Oct 19, 2021 at 12:33 PM Magnus Edenhill <
> > > magnus@edenhill.se
> > > > >
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Hi Mickael!
> > > > > > > >
> > > > > > > > On Tue, Oct 19, 2021 at 19:30, Mickael Maison
> > > > > > > > <mickael.maison@gmail.com> wrote:
> > > > > > > >
> > > > > > > > > Hi Magnus,
> > > > > > > > >
> > > > > > > > > Thanks for the proposal.
> > > > > > > > >
> > > > > > > > > 1. Looking at the protocol section, isn't
> "ClientInstanceId"
> > > > > expected
> > > > > > > > > to be a field in GetTelemetrySubscriptionsResponseV0?
> > > Otherwise,
> > > > > how
> > > > > > > > > does a client retrieve this value?
> > > > > > > > >
> > > > > > > >
> > > > > > > > Good catch, it got removed by mistake in one of the edits.
> > > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > > 2. In the client API section, you mention a new method
> > > > > > > > > "clientInstanceId()". Can you clarify which interfaces are
> > > > > affected?
> > > > > > > > > Is it only Consumer and Producer?
> > > > > > > > >
> > > > > > > >
> > > > > > > > And Admin. Will update the KIP.
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > > 3. I'm a bit concerned this is enabled by default. Even if
> > the
> > > > data
> > > > > > > > > collected is supposed to be not sensitive, I think this can
> > be
> > > > > > > > > problematic in some environments. Also users don't seem to
> > have
> > > > the
> > > > > > > > > choice to only expose some metrics. Knowing how much data
> > > transit
> > > > > > > > > through some applications can be considered critical.
> > > > > > > > >
> > > > > > > >
> > > > > > > > The broker already knows how much data transits through the
> > > client
> > > > > > > though,
> > > > > > > > right?
> > > > > > > > Care has been taken not to expose information in the standard
> > > > metrics
> > > > > > > that
> > > > > > > > might
> > > > > > > > reveal sensitive information.
> > > > > > > >
> > > > > > > > Do you have an example of how the proposed metrics could leak
> > > > > sensitive
> > > > > > > > information?
> > > > > > > > As for limiting the what metrics to export; I guess that
> could
> > > make
> > > > > sense
> > > > > > > > in some
> > > > > > > > very sensitive use-cases, but those users might disable
> metrics
> > > > > > > altogether
> > > > > > > > for now.
> > > > > > > > Could these concerns be addressed by a later KIP?
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > > 4. As a user, how do you know if your application is
> actively
> > > > > sending
> > > > > > > > > metrics? Are there new metrics exposing what's going on,
> like
> > > how
> > > > > much
> > > > > > > > > data is being sent?
> > > > > > > > >
> > > > > > > >
> > > > > > > > That's a good question.
> > > > > > > > Since the proposed metrics interface is not aimed at, or
> > directly
> > > > > > > available
> > > > > > > > to, the application
> > > > > > > > I guess there's little point of adding it here, but instead
> > > adding
> > > > > > > > something to the
> > > > > > > > existing JMX metrics?
> > > > > > > > Do you have any suggestions?
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > > 5. If all metrics are enabled on a regular Consumer or
> > > Producer,
> > > > do
> > > > > > > > > you have an idea how much throughput this would use?
> > > > > > > > >
> > > > > > > >
> > > > > > > > It depends on the number of partition/topics/etc the client
> is
> > > > > producing
> > > > > > > > to/consuming from.
> > > > > > > > I'll add some sizes to the KIP for some typical use-cases.
> > > > > > > >
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Magnus
> > > > > > > >
> > > > > > > >
> > > > > > > > > Thanks
> > > > > > > > >
> > > > > > > > > On Tue, Oct 19, 2021 at 5:06 PM Magnus Edenhill <
> > > > > magnus@edenhill.se>
> > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > On Tue, Oct 19, 2021 at 13:22, Tom Bentley
> > > > > > > > > > <tbentley@redhat.com> wrote:
> > > > > > > > > >
> > > > > > > > > > > Hi Magnus,
> > > > > > > > > > >
> > > > > > > > > > > I reviewed the KIP since you called the vote (sorry for
> > not
> > > > > > > reviewing
> > > > > > > > > when
> > > > > > > > > > > you announced your intention to call the vote). I have
> a
> > > few
> > > > > > > > questions
> > > > > > > > > on
> > > > > > > > > > > some of the details.
> > > > > > > > > > >
> > > > > > > > > > > 1. There's no Javadoc on ClientTelemetryPayload.data(),
> > so
> > > I
> > > > > don't
> > > > > > > > know
> > > > > > > > > > > whether the payload is exposed through this method as
> > > > > compressed or
> > > > > > > > > not.
> > > > > > > > > > > Later on you say "Decompression of the payloads will be
> > > > > handled by
> > > > > > > > the
> > > > > > > > > > > broker metrics plugin, the broker should expose a
> > suitable
> > > > > > > > > decompression
> > > > > > > > > > > API to the metrics plugin for this purpose.", which
> > > suggests
> > > > > it's
> > > > > > > the
> > > > > > > > > > > compressed data in the buffer, but then we don't know
> > which
> > > > > codec
> > > > > > > was
> > > > > > > > > used,
> > > > > > > > > > > nor the API via which the plugin should decompress it
> if
> > > > > required
> > > > > > > for
> > > > > > > > > > > forwarding to the ultimate metrics store. Should the
> > > > > > > > > ClientTelemetryPayload
> > > > > > > > > > > expose a method to get the compression and a
> > decompressor?
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Good point, updated.
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > 2. The client-side API is expressed as StringOrError
> > > > > > > > > > > ClientInstance::ClientInstanceId(int timeout_ms). I
> > > > understand
> > > > > that
> > > > > > > > > you're
> > > > > > > > > > > thinking about the librdkafka implementation, but it
> > would
> > > be
> > > > > good
> > > > > > > to
> > > > > > > > > show
> > > > > > > > > > > the API as it would appear on the Apache Kafka clients.
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > This was meant as pseudo-code, but I changed it to Java.
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > 3. "PushTelemetryRequest|Response - protocol request
> used
> > > by
> > > > > the
> > > > > > > > > client to
> > > > > > > > > > > send metrics to any broker it is connected to." To be
> > > clear,
> > > > > this
> > > > > > > > means
> > > > > > > > > > > that the client can choose any of the connected brokers
> > and
> > > > > push to
> > > > > > > > > just
> > > > > > > > > > > one of them? What should a supporting client do if it
> > gets
> > > an
> > > > > error
> > > > > > > > > when
> > > > > > > > > > > pushing metrics to a broker, retry sending to the same
> > > broker
> > > > > or
> > > > > > > try
> > > > > > > > > > > pushing to another broker, or drop the metrics? Should
> > > > > supporting
> > > > > > > > > clients
> > > > > > > > > > > send successive requests to a single broker, or round
> > > robin,
> > > > > or is
> > > > > > > > > that up
> > > > > > > > > > > to the client author? I'm guessing the behaviour should
> > be
> > > > > sticky
> > > > > > > to
> > > > > > > > > > > support the rate limiting features, but I think it
> would
> > be
> > > > > good
> > > > > > > for
> > > > > > > > > client
> > > > > > > > > > > authors if this section were explicit on the
> recommended
> > > > > behaviour.
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > You are right, I've updated the KIP to make this clearer.
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > 4. "Mapping the client instance id to an actual
> > application
> > > > > > > instance
> > > > > > > > > > > running on a (virtual) machine can be done by
> inspecting
> > > the
> > > > > > > metrics
> > > > > > > > > > > resource labels, such as the client source address and
> > > source
> > > > > port,
> > > > > > > > or
> > > > > > > > > > > security principal, all of which are added by the
> > receiving
> > > > > broker.
> > > > > > > > > This
> > > > > > > > > > > will allow the operator together with the user to
> > identify
> > > > the
> > > > > > > actual
> > > > > > > > > > > application instance." Is this really always true? The
> > > source
> > > > > IP
> > > > > > > and
> > > > > > > > > port
> > > > > > > > > > > might be a loadbalancer/proxy in some setups. The
> > > principal,
> > > > as
> > > > > > > > already
> > > > > > > > > > > mentioned in the KIP, might be shared between multiple
> > > > > > > applications.
> > > > > > > > > So at
> > > > > > > > > > > worst the organization running the clients might have
> to
> > > > > consult
> > > > > > > the
> > > > > > > > > logs
> > > > > > > > > > > of a set of client applications, right?
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Yes, that's correct. There's no guaranteed mapping from
> > > > > > > > > client_instance_id
> > > > > > > > > > to
> > > > > > > > > > an actual instance, that's why the KIP recommends client
> > > > > > > > implementations
> > > > > > > > > to
> > > > > > > > > > log the client instance id
> > > > > > > > > > upon retrieval, and also provide an API for the
> application
> > > to
> > > > > > > retrieve
> > > > > > > > > the
> > > > > > > > > > instance id programmatically
> > > > > > > > > > if it has a better way of exposing it.
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > 5. "Tests indicate that a compression ratio up to 10x is
> > > > > possible for
> > > > > > > > the
> > > > > > > > > > > standard metrics." Client authors might appreciate your
> > > > > mentioning
> > > > > > > > > which
> > > > > > > > > > > compression codec got these results.
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Good point. Updated.
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > 6. "Should the client send a push request prior to
> expiry
> > > of
> > > > > the
> > > > > > > > > previously
> > > > > > > > > > > calculated PushIntervalMs the broker will discard the
> > > metrics
> > > > > and
> > > > > > > > > return a
> > > > > > > > > > > PushTelemetryResponse with the ErrorCode set to
> > > RateLimited."
> > > > > Is
> > > > > > > this
> > > > > > > > > > > RATE_LIMITED a new error code? It's not mentioned in
> the
> > > "New
> > > > > Error
> > > > > > > > > Codes"
> > > > > > > > > > > section.
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > That's a leftover, it should be using the standard
> > > ThrottleTime
> > > > > > > > > mechanism.
> > > > > > > > > > Fixed.
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > 7. In the section "Standard client resource labels"
> > > > > application_id
> > > > > > > is
> > > > > > > > > > > described as Kafka Streams only, but the section of
> > "Client
> > > > > > > > > Identification"
> > > > > > > > > > > talks about "application instance id as an optional
> > future
> > > > > > > > nice-to-have
> > > > > > > > > > > that may be included as a metrics label if it has been
> > set
> > > by
> > > > > the
> > > > > > > > > user", so
> > > > > > > > > > > I'm confused whether non-Kafka Streams clients should
> set
> > > an
> > > > > > > > > application_id
> > > > > > > > > > > or not.
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > I'll clarify this in the KIP, but basically we would need
> > to
> > > > add
> > > > > an `
> > > > > > > > > > application.id` config
> > > > > > > > > > property for non-streams clients for this purpose, and
> > that's
> > > > > outside
> > > > > > > > the
> > > > > > > > > > scope of this KIP since we want to make it zero-conf:ish
> on
> > > the
> > > > > > > client
> > > > > > > > > side.
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > Kind regards,
> > > > > > > > > > >
> > > > > > > > > > > Tom
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Thanks for the review,
> > > > > > > > > > Magnus
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > On Thu, Oct 7, 2021 at 5:26 PM Magnus Edenhill <
> > > > > magnus@edenhill.se
> > > > > > > >
> > > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Hi all,
> > > > > > > > > > > >
> > > > > > > > > > > > I've updated the KIP following our recent discussions
> > on
> > > > the
> > > > > > > > mailing
> > > > > > > > > > > list:
> > > > > > > > > > > >  - split the protocol in two, one for getting the
> > metrics
> > > > > > > > > subscriptions,
> > > > > > > > > > > > and one for pushing the metrics.
> > > > > > > > > > > >  - simplifications: initially only one supported
> > metrics
> > > > > format,
> > > > > > > no
> > > > > > > > > > > > client.id in the instance id, etc.
> > > > > > > > > > > >  - made CLIENT_METRICS subscription configuration
> > entries
> > > > > more
> > > > > > > > > structured
> > > > > > > > > > > >    and allowing better client matching selectors (not
> > > only
> > > > > on the
> > > > > > > > > > > instance
> > > > > > > > > > > > id, but also the other
> > > > > > > > > > > >    client resource labels, such as
> > client_software_name,
> > > > > etc.).
> > > > > > > > > > > >
> > > > > > > > > > > > Unless there are further comments I'll call the vote
> > in a
> > > > > day or
> > > > > > > > two.
> > > > > > > > > > > >
> > > > > > > > > > > > Regards,
> > > > > > > > > > > > Magnus
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > > > On Mon, Oct 4, 2021 at 20:57, Magnus Edenhill <magnus@edenhill.se> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > Hi Gwen,
> > > > > > > > > > > > >
> > > > > > > > > > > > > I'm finishing up the KIP based on the last couple
> of
> > > > > discussion
> > > > > > > > > points
> > > > > > > > > > > in
> > > > > > > > > > > > > this thread
> > > > > > > > > > > > > and will call the Vote later this week.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Best,
> > > > > > > > > > > > > Magnus
> > > > > > > > > > > > >
> > > > > > > > > > > > > > > On Sat, Oct 2, 2021 at 02:01, Gwen Shapira <gwen@confluent.io.invalid> wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > >> Hey,
> > > > > > > > > > > > >>
> > > > > > > > > > > > >> I noticed that there was no discussion for the
> last
> > 10
> > > > > days,
> > > > > > > > but I
> > > > > > > > > > > > >> couldn't
> > > > > > > > > > > > >> find the vote thread. Is there one that I'm
> missing?
> > > > > > > > > > > > >>
> > > > > > > > > > > > >> Gwen
> > > > > > > > > > > > >>
> > > > > > > > > > > > >> On Wed, Sep 22, 2021 at 4:58 AM Magnus Edenhill <
> > > > > > > > > magnus@edenhill.se>
> > > > > > > > > > > > >> wrote:
> > > > > > > > > > > > >>
> > > > > > > > > > > > > > >> > On Tue, Sep 21, 2021 at 06:58, Colin McCabe <cmccabe@apache.org> wrote:
> > > > > > > > > > > > >> >
> > > > > > > > > > > > >> > > On Mon, Sep 20, 2021, at 17:35, Feng Min
> wrote:
> > > > > > > > > > > > >> > > > Thanks Magnus & Colin for the discussion.
> > > > > > > > > > > > >> > > >
> > > > > > > > > > > > >> > > > Based on KIP-714's stateless design, Client
> > can
> > > > > pretty
> > > > > > > > much
> > > > > > > > > use
> > > > > > > > > > > > any
> > > > > > > > > > > > >> > > > connection to any broker to send metrics. We
> > are
> > > > not
> > > > > > > > > associating
> > > > > > > > > > > > >> > > connection
> > > > > > > > > > > > >> > > > with client metric state. Is my
> understanding
> > > > > correct?
> > > > > > > If
> > > > > > > > > yes,
> > > > > > > > > > > > how
> > > > > > > > > > > > >> > about
> > > > > > > > > > > > >> > > > the following two scenarios
> > > > > > > > > > > > >> > > >
> > > > > > > > > > > > >> > > > 1) One Client (Client-ID) registers two
> > > different
> > > > > client
> > > > > > > > > > > instance
> > > > > > > > > > > > id
> > > > > > > > > > > > >> > via
> > > > > > > > > > > > >> > > > separate registration. Is it permitted? If
> OK,
> > > how
> > > > > to
> > > > > > > > > > > distinguish
> > > > > > > > > > > > >> them
> > > > > > > > > > > > >> > > from
> > > > > > > > > > > > >> > > > the case 2 below.
> > > > > > > > > > > > >> > > >
> > > > > > > > > > > > >> > >
> > > > > > > > > > > > >> > > Hi Feng,
> > > > > > > > > > > > >> > >
> > > > > > > > > > > > >> > > My understanding, which Magnus can clarify I
> > > guess,
> > > > is
> > > > > > > that
> > > > > > > > > you
> > > > > > > > > > > > could
> > > > > > > > > > > > >> > have
> > > > > > > > > > > > >> > > something like two Producer instances running
> > with
> > > > the
> > > > > > > same
> > > > > > > > > > > > client.id
> > > > > > > > > > > > >> > > (perhaps because they're using the same config
> > > file,
> > > > > for
> > > > > > > > > example).
> > > > > > > > > > > > >> They
> > > > > > > > > > > > >> > > could even be in the same process. But they
> > would
> > > > get
> > > > > > > > separate
> > > > > > > > > > > > UUIDs.
> > > > > > > > > > > > >> > >
> > > > > > > > > > > > >> > > I believe Magnus used the term client to mean
> > > > > "Producer or
> > > > > > > > > > > > Consumer".
> > > > > > > > > > > > >> So
> > > > > > > > > > > > >> > > if you have both a Producer and a Consumer in
> > your
> > > > > > > > > application I
> > > > > > > > > > > > would
> > > > > > > > > > > > >> > > expect you'd get separate UUIDs for both.
> Again
> > > > > Magnus can
> > > > > > > > > chime
> > > > > > > > > > > in
> > > > > > > > > > > > >> > here, I
> > > > > > > > > > > > >> > > guess.
> > > > > > > > > > > > >> > >
> > > > > > > > > > > > >> >
> > > > > > > > > > > > >> > That's correct.
> > > > > > > > > > > > >> >
> > > > > > > > > > > > >> >
> > > > > > > > > > > > >> > >
> > > > > > > > > > > > >> > > > 2) How about the client restarting? What's
> the
> > > > > > > > expectation?
> > > > > > > > > > > Should
> > > > > > > > > > > > >> the
> > > > > > > > > > > > >> > > > server expect the client to carry a
> persisted
> > > > client
> > > > > > > > > instance id
> > > > > > > > > > > > or
> > > > > > > > > > > > >> > > should
> > > > > > > > > > > > >> > > > the client be treated as a new instance?
> > > > > > > > > > > > >> > >
> > > > > > > > > > > > >> > > The KIP doesn't describe any mechanism for
> > > > > persistence,
> > > > > > > so I
> > > > > > > > > would
> > > > > > > > > > > > >> assume
> > > > > > > > > > > > >> > > that when you restart the client you get a new
> > > > UUID. I
> > > > > > > agree
> > > > > > > > > that
> > > > > > > > > > > it
> > > > > > > > > > > > >> > would
> > > > > > > > > > > > >> > > be good to spell this out.
> > > > > > > > > > > > >> > >
> > > > > > > > > > > > >> > >
> > > > > > > > > > > > >> > Right, it will not be persisted since a client
> > > > instance
> > > > > > > can't
> > > > > > > > be
> > > > > > > > > > > > >> restarted.
> > > > > > > > > > > > >> >
> > > > > > > > > > > > >> > Will update the KIP to make this clearer.
> > > > > > > > > > > > >> >
> > > > > > > > > > > > >> > /Magnus
> > > > > > > > > > > > >> >
> > > > > > > > > > > > >>
> > > > > > > > > > > > >>
> > > > > > > > > > > > >> --
> > > > > > > > > > > > >> Gwen Shapira
> > > > > > > > > > > > >> Engineering Manager | Confluent
> > > > > > > > > > > > >> 650.450.2760 | @gwenshap
> > > > > > > > > > > > >> Follow us: Twitter | blog
> > > > > > > > > > > > >>
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] KIP-714: Client metrics and observability

Posted by Magnus Edenhill <ma...@edenhill.se>.

Hey Bob,

That's a good point.

Request type labels were considered, but since request types are already
tracked by broker-side metrics they were left out to avoid metric duplication.
However, those broker-side metrics are not per connection, so they won't be
that useful in practice for troubleshooting specific client instances.

I'll add the request_type label to the relevant metrics.
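
As a rough illustration of what a per-request-type label buys us, here is a
minimal Java sketch. It is not the KIP's or any client's actual
implementation: the class RequestTypeCounters and its methods are hypothetical
and exist only for this example. Only the metric names
(client.request.success, client.request.errors) and the request_type label
come from this thread.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical helper for illustration only; not part of the Kafka client API.
class RequestTypeCounters {
    // One counter per (metric name, request_type label value) pair.
    private final Map<String, LongAdder> counters = new ConcurrentHashMap<>();

    private static String key(String metric, String requestType) {
        return metric + "{request_type=" + requestType + "}";
    }

    // Record the outcome of one request of the given type.
    void record(String requestType, boolean success) {
        String metric = success ? "client.request.success" : "client.request.errors";
        counters.computeIfAbsent(key(metric, requestType), k -> new LongAdder()).increment();
    }

    // Current count for a metric/label pair; 0 if nothing was recorded.
    long get(String metric, String requestType) {
        LongAdder adder = counters.get(key(metric, requestType));
        return adder == null ? 0L : adder.sum();
    }

    public static void main(String[] args) {
        RequestTypeCounters c = new RequestTypeCounters();
        c.record("Produce", true);
        c.record("Produce", true);
        c.record("Metadata", false);
        System.out.println("Produce successes: " + c.get("client.request.success", "Produce"));
        System.out.println("Metadata errors: " + c.get("client.request.errors", "Metadata"));
    }
}
```

With counts split this way an operator could tell failing Metadata requests
apart from slow or failing Produce requests for one specific client instance,
which is the batch-timeout case Bob describes.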

Thanks,
Magnus


On Tue, Nov 23, 2021 at 19:20, Bob Barrett
<bo...@confluent.io.invalid> wrote:

> Hi Magnus,
>
> Thanks for the thorough KIP, this seems very useful.
>
> Would it make sense to include the request type as a label for the
> `client.request.success`, `client.request.errors` and `client.request.rtt`
> metrics? I think it would be very useful to see which specific requests are
> succeeding and failing for a client. One specific case I can think of where
> this could be useful is producer batch timeouts. If a Java application does
> not enable producer client logs (unfortunately, in my experience this
> happens more often than it should), the application logs will only contain
> the expiration error message, but no information about what is causing the
> timeout. The requests might all be succeeding but taking too long to
> process batches, or metadata requests might be failing, or some or all
> produce requests might be failing (if the bootstrap servers are reachable
> from the client but one or more other brokers are not, for example). If the
> cluster operator is able to identify the specific requests that are slow or
> failing for a client, they will be better able to diagnose the issue
> causing batch timeouts.
>
> One drawback I can think of is that this will increase the cardinality of
> the request metrics. But any given client is only going to use a small
> subset of the request types, and since we already have partition labels for
> the topic-level metrics, I think request labels will still make up a
> relatively small percentage of the set of metrics.
>
> Thanks,
> Bob
>
> On Mon, Nov 22, 2021 at 2:08 AM Viktor Somogyi-Vass <
> viktorsomogyi@gmail.com>
> wrote:
>
> > Hi Magnus,
> >
> > I think this is a very useful addition. We also have a similar (but much
> > more simplistic) implementation of this. Maybe I missed it in the KIP but
> > what about adding metrics about the subscription cache itself? That I
> > think would improve its usability and debuggability as we'd be able to
> > see its performance, hit/miss rates, eviction counts and others.
> >
> > Best,
> > Viktor
> >
> > On Thu, Nov 18, 2021 at 5:12 PM Magnus Edenhill <ma...@edenhill.se>
> > wrote:
> >
> > > Hi Mickael,
> > >
> > > see inline.
> > >
> > > On Wed, Nov 10, 2021 at 15:21, Mickael Maison <mickael.maison@gmail.com> wrote:
> > >
> > > > Hi Magnus,
> > > >
> > > > I see you've addressed some of the points I raised above but some (4,
> > > > 5) have not been addressed yet.
> > > >
> > >
> > > Re 4) How will the user/app know metrics are being sent.
> > >
> > > One possibility is to add a JMX metric (thus for user consumption) for
> > > the number of metric pushes the client has performed, or perhaps the
> > > number of metrics subscriptions currently being collected.
> > > Would that be sufficient?
> > >
> > > Re 5) Metric sizes and rates
> > >
> > > A worst-case scenario for a producer that is producing to 50 unique
> > > topics and emitting all standard metrics yields a serialized size of
> > > around 100 KB prior to compression, which compresses down to about
> > > 20-30% of that depending on compression type and topic name uniqueness.
> > > The numbers for a consumer would be similar.
> > >
> > > In practice the number of unique topics would be far less, and the
> > > subscription set would typically be for a subset of metrics.
> > > So we're probably closer to 1 KB, or less, compressed size per client
> > > per push interval.
> > >
> > > As both the subscription set and push intervals are controlled by the
> > > cluster operator it shouldn't be too hard
> > > to strike a good balance between metrics overhead and granularity.
> > >
> > >
> > >
> > > >
> > > > I'm really uneasy with this being enabled by default on the client
> > > > side. When collecting data, I think the best practice is to ensure
> > > > users are explicitly enabling it.
> > > >
> > >
> > > Requiring metrics to be explicitly enabled on clients severely cripples
> > > the feature's usability and value.
> > >
> > > One of the problems that this KIP aims to solve is for useful metrics to
> > > be available on demand regardless of the technical expertise of the user.
> > > As Ryanne points out, a savvy user/organization will typically have
> > > metrics collection and monitoring in place already, and the benefits of
> > > this KIP are then more of a common set and format of metrics across
> > > client implementations and languages.
> > > But that is not the typical Kafka user in my experience; they're not
> > > Kafka experts and they don't have the knowledge of how to best
> > > instrument their clients.
> > > Having metrics enabled by default for this user base allows the Kafka
> > > operators to proactively and reactively monitor and troubleshoot client
> > > issues, without the need for the less savvy user to do anything.
> > > It is often too late to tell a user to enable metrics when the problem
> > > has already occurred.
> > >
> > > Now, to be clear, even though metrics are enabled by default on clients,
> > > they are not enabled by default on the brokers; the Kafka operator needs
> > > to build and set up a metrics plugin and add metrics subscriptions
> > > before anything is sent from the client.
> > > It is opt-out on the clients and opt-in on the broker.
> > >
> > >
> > >
> > >
> > > > You mentioned brokers already have
> > > > some(most?) of the information contained in metrics, if so then why
> > > > are we collecting it again? Surely there must be some new information
> > > > in the client metrics.
> > > >
> > >
> > > From the user's perspective the Kafka infrastructure extends from
> > > producer.send() to messages being returned from consumer.poll(), a giant
> > > black box where there's a lot going on between those two points. The
> > > brokers currently only see what happens once those requests and messages
> > > hit the broker, but as Kafka clients are complex pieces of machinery
> > > there's a myriad of queues, timers, and state that's critical to the
> > > operation and infrastructure and that's not currently visible to the
> > > operator.
> > > Relying on the user to provide this missing information accurately and
> > > in a timely manner is not generally feasible.
> > >
> > >
> > > Most of the standard metrics listed in the KIP are data points that the
> > > broker does not have.
> > > Only a small number of metrics are duplicates (like the request counts
> > > and sizes), but they are included to ease correlation when inspecting
> > > these client metrics.
> > >
> > >
> > >
> > > > Moreover this is a brand new feature so it's even harder to justify
> > > > enabling it and forcing it onto all our users. If disabled by default,
> > > > it's relatively easy to enable in a new release if we decide to, but
> > > > once enabled by default it's much harder to disable. Also this
> > > > feature will apply to all future metrics we will add.
> > > >
> > >
> > > I think maturity of a feature implementation should be the deciding
> > > factor, rather than the design of it (which this KIP is). I.e., if the
> > > implementation is not deemed mature enough for release X.Y it will be
> > > disabled.
> > >
> > >
> > >
> > > > Overall I think it's an interesting feature but I'd prefer to be
> > > > slightly defensive and see how it works in practice before enabling it
> > > > everywhere.
> > > >
> > >
> > > Right, and I agree on being defensive, but since this feature still
> > > requires manual
> > > enabling on the brokers before actually being used, I think that gives
> > > enough control
> > > to opt-in or out of this feature as needed.
> > >
> > > Thanks for your comments!
> > >
> > > Regards,
> > > Magnus
> > >
> > >
> > >
> > > > Thanks,
> > > > Mickael
> > > >
> > > > On Mon, Nov 8, 2021 at 7:55 AM Magnus Edenhill <ma...@edenhill.se>
> > > wrote:
> > > > >
> > > > > Thanks David for pointing this out,
> > > > > I've updated the KIP to include client_id as a matching selector.
> > > > >
> > > > > Regards,
> > > > > Magnus
> > > > >
> > > > > > On Thu, Nov 4, 2021 at 18:01, David Mao <dmao@confluent.io.invalid> wrote:
> > > > >
> > > > > > Hey Magnus,
> > > > > >
> > > > > > > I noticed that the KIP outlines the initial selectors supported as:
> > > > > > >
> > > > > > >    - client_instance_id - CLIENT_INSTANCE_ID UUID string representation.
> > > > > > >    - client_software_name  - client software implementation name.
> > > > > > >    - client_software_version  - client software implementation version.
> > > > > > >
> > > > > > > In the given reactive monitoring workflow, we mention that the
> > > > > > > application user does not know their client's client instance ID, but
> > > > > > > it's outlined that the operator can add a metrics subscription
> > > > > > > selecting for clientId. I don't see clientId as one of the supported
> > > > > > > selectors.
> > > > > > > I can see how this would have made sense in a previous iteration given
> > > > > > > that the previous client instance ID proposal was to construct the
> > > > > > > client instance ID using clientId as a prefix. Now that the client
> > > > > > > instance ID is a UUID, would we want to add clientId as a supported
> > > > > > > selector?
> > > > > > > Let me know what you think.
> > > > > >
> > > > > > David
> > > > > >
> > > > > > On Tue, Oct 19, 2021 at 12:33 PM Magnus Edenhill <
> > magnus@edenhill.se
> > > >
> > > > > > wrote:
> > > > > >
> > > > > > > Hi Mickael!
> > > > > > >
> > > > > > > > On Tue, Oct 19, 2021 at 19:30, Mickael Maison <mickael.maison@gmail.com> wrote:
> > > > > > >
> > > > > > > > Hi Magnus,
> > > > > > > >
> > > > > > > > Thanks for the proposal.
> > > > > > > >
> > > > > > > > > 1. Looking at the protocol section, isn't "ClientInstanceId"
> > > > > > > > > expected to be a field in GetTelemetrySubscriptionsResponseV0?
> > > > > > > > > Otherwise, how does a client retrieve this value?
> > > > > > > >
> > > > > > >
> > > > > > > Good catch, it got removed by mistake in one of the edits.
> > > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > > 2. In the client API section, you mention a new method
> > > > > > > > > "clientInstanceId()". Can you clarify which interfaces are
> > > > > > > > > affected? Is it only Consumer and Producer?
> > > > > > > >
> > > > > > >
> > > > > > > And Admin. Will update the KIP.
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > > > 3. I'm a bit concerned this is enabled by default. Even if the
> > > > > > > > > data collected is supposed to be not sensitive, I think this can
> > > > > > > > > be problematic in some environments. Also users don't seem to
> > > > > > > > > have the choice to only expose some metrics. Knowing how much
> > > > > > > > > data transits through some applications can be considered
> > > > > > > > > critical.
> > > > > > > >
> > > > > > >
> > > > > > > > The broker already knows how much data transits through the
> > > > > > > > client though, right?
> > > > > > > > Care has been taken not to expose information in the standard
> > > > > > > > metrics that might reveal sensitive information.
> > > > > > > >
> > > > > > > > Do you have an example of how the proposed metrics could leak
> > > > > > > > sensitive information?
> > > > > > > > As for limiting which metrics to export: I guess that could make
> > > > > > > > sense in some very sensitive use-cases, but those users might
> > > > > > > > disable metrics altogether for now.
> > > > > > > > Could these concerns be addressed by a later KIP?
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > > 4. As a user, how do you know if your application is actively
> > > > > > > > > sending metrics? Are there new metrics exposing what's going on,
> > > > > > > > > like how much data is being sent?
> > > > > > > >
> > > > > > >
> > > > > > > That's a good question.
> > > > > > > > Since the proposed metrics interface is not aimed at, or directly
> > > > > > > > available to, the application I guess there's little point in
> > > > > > > > adding it here, but instead adding something to the existing JMX
> > > > > > > > metrics?
> > > > > > > > Do you have any suggestions?
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > > > 5. If all metrics are enabled on a regular Consumer or Producer,
> > > > > > > > > do you have an idea how much throughput this would use?
> > > > > > > >
> > > > > > >
> > > > > > > > It depends on the number of partitions/topics/etc. the client is
> > > > > > > > producing to/consuming from.
> > > > > > > > I'll add some sizes to the KIP for some typical use-cases.
> > > > > > >
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Magnus
> > > > > > >
> > > > > > >
> > > > > > > > Thanks
> > > > > > > >
> > > > > > > > On Tue, Oct 19, 2021 at 5:06 PM Magnus Edenhill <
> > > > magnus@edenhill.se>
> > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > On Tue, Oct 19, 2021 at 13:22, Tom Bentley <tbentley@redhat.com> wrote:
> > > > > > > > >
> > > > > > > > > > Hi Magnus,
> > > > > > > > > >
> > > > > > > > > > > I reviewed the KIP since you called the vote (sorry for not
> > > > > > > > > > > reviewing when you announced your intention to call the vote). I
> > > > > > > > > > > have a few questions on some of the details.
> > > > > > > > > >
> > > > > > > > > > > 1. There's no Javadoc on ClientTelemetryPayload.data(), so I don't
> > > > > > > > > > > know whether the payload is exposed through this method as
> > > > > > > > > > > compressed or not. Later on you say "Decompression of the payloads
> > > > > > > > > > > will be handled by the broker metrics plugin, the broker should
> > > > > > > > > > > expose a suitable decompression API to the metrics plugin for this
> > > > > > > > > > > purpose.", which suggests it's the compressed data in the buffer,
> > > > > > > > > > > but then we don't know which codec was used, nor the API via which
> > > > > > > > > > > the plugin should decompress it if required for forwarding to the
> > > > > > > > > > > ultimate metrics store. Should the ClientTelemetryPayload expose a
> > > > > > > > > > > method to get the compression and a decompressor?
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > Good point, updated.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > > > 2. The client-side API is expressed as StringOrError
> > > > > > > > > > > ClientInstance::ClientInstanceId(int timeout_ms). I understand
> > > > > > > > > > > that you're thinking about the librdkafka implementation, but it
> > > > > > > > > > > would be good to show the API as it would appear on the Apache
> > > > > > > > > > > Kafka clients.
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > This was meant as pseudo-code, but I changed it to Java.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > > > 3. "PushTelemetryRequest|Response - protocol request used by the
> > > > > > > > > > > client to send metrics to any broker it is connected to." To be
> > > > > > > > > > > clear, this means that the client can choose any of the connected
> > > > > > > > > > > brokers and push to just one of them? What should a supporting
> > > > > > > > > > > client do if it gets an error when pushing metrics to a broker,
> > > > > > > > > > > retry sending to the same broker or try pushing to another broker,
> > > > > > > > > > > or drop the metrics? Should supporting clients send successive
> > > > > > > > > > > requests to a single broker, or round robin, or is that up to the
> > > > > > > > > > > client author? I'm guessing the behaviour should be sticky to
> > > > > > > > > > > support the rate limiting features, but I think it would be good
> > > > > > > > > > > for client authors if this section were explicit on the
> > > > > > > > > > > recommended behaviour.
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > You are right, I've updated the KIP to make this clearer.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > > > 4. "Mapping the client instance id to an actual application
> > > > > > > > > > > instance running on a (virtual) machine can be done by inspecting
> > > > > > > > > > > the metrics resource labels, such as the client source address and
> > > > > > > > > > > source port, or security principal, all of which are added by the
> > > > > > > > > > > receiving broker. This will allow the operator together with the
> > > > > > > > > > > user to identify the actual application instance." Is this really
> > > > > > > > > > > always true? The source IP and port might be a loadbalancer/proxy
> > > > > > > > > > > in some setups. The principal, as already mentioned in the KIP,
> > > > > > > > > > > might be shared between multiple applications. So at worst the
> > > > > > > > > > > organization running the clients might have to consult the logs of
> > > > > > > > > > > a set of client applications, right?
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > > Yes, that's correct. There's no guaranteed mapping from
> > > > > > > > > > client_instance_id to an actual instance. That's why the KIP
> > > > > > > > > > recommends that client implementations log the client instance id
> > > > > > > > > > upon retrieval, and also provide an API for the application to
> > > > > > > > > > retrieve the instance id programmatically if it has a better way of
> > > > > > > > > > exposing it.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > > > 5. "Tests indicate that a compression ratio up to 10x is possible
> > > > > > > > > > > for the standard metrics." Client authors might appreciate your
> > > > > > > > > > > mentioning which compression codec got these results.
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > Good point. Updated.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > > > 6. "Should the client send a push request prior to expiry of the
> > > > > > > > > > > previously calculated PushIntervalMs the broker will discard the
> > > > > > > > > > > metrics and return a PushTelemetryResponse with the ErrorCode set
> > > > > > > > > > > to RateLimited." Is this RATE_LIMITED a new error code? It's not
> > > > > > > > > > > mentioned in the "New Error Codes" section.
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > > That's a leftover; it should be using the standard ThrottleTime
> > > > > > > > > > mechanism.
> > > > > > > > > > Fixed.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > > > 7. In the section "Standard client resource labels" application_id
> > > > > > > > > > > is described as Kafka Streams only, but the section of "Client
> > > > > > > > > > > Identification" talks about "application instance id as an
> > > > > > > > > > > optional future nice-to-have that may be included as a metrics
> > > > > > > > > > > label if it has been set by the user", so I'm confused whether
> > > > > > > > > > > non-Kafka Streams clients should set an application_id or not.
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > > I'll clarify this in the KIP, but basically we would need to add an
> > > > > > > > > > `application.id` config property for non-streams clients for this
> > > > > > > > > > purpose, and that's outside the scope of this KIP since we want to
> > > > > > > > > > make it zero-conf:ish on the client side.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Kind regards,
> > > > > > > > > >
> > > > > > > > > > Tom
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > Thanks for the review,
> > > > > > > > > Magnus
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > On Thu, Oct 7, 2021 at 5:26 PM Magnus Edenhill <
> > > > magnus@edenhill.se
> > > > > > >
> > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > Hi all,
> > > > > > > > > > >
> > > > > > > > > > > > I've updated the KIP following our recent discussions on the
> > > > > > > > > > > > mailing list:
> > > > > > > > > > > >  - split the protocol in two, one for getting the metrics
> > > > > > > > > > > >    subscriptions, and one for pushing the metrics.
> > > > > > > > > > > >  - simplifications: initially only one supported metrics format,
> > > > > > > > > > > >    no client.id in the instance id, etc.
> > > > > > > > > > > >  - made CLIENT_METRICS subscription configuration entries more
> > > > > > > > > > > >    structured and allowing better client matching selectors (not
> > > > > > > > > > > >    only on the instance id, but also the other client resource
> > > > > > > > > > > >    labels, such as client_software_name, etc.).
> > > > > > > > > > > >
> > > > > > > > > > > > Unless there are further comments I'll call the vote in a day or
> > > > > > > > > > > > two.
> > > > > > > > > > >
> > > > > > > > > > > Regards,
> > > > > > > > > > > Magnus
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > > On Mon, Oct 4, 2021 at 20:57, Magnus Edenhill <magnus@edenhill.se> wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Hi Gwen,
> > > > > > > > > > > >
> > > > > > > > > > > > I'm finishing up the KIP based on the last couple of
> > > > discussion
> > > > > > > > points
> > > > > > > > > > in
> > > > > > > > > > > > this thread
> > > > > > > > > > > > and will call the Vote later this week.
> > > > > > > > > > > >
> > > > > > > > > > > > Best,
> > > > > > > > > > > > Magnus
> > > > > > > > > > > >
> > > > > > > > > > > > > On Sat, Oct 2, 2021 at 02:01, Gwen Shapira <gwen@confluent.io.invalid> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > >> Hey,
> > > > > > > > > > > >>
> > > > > > > > > > > >> I noticed that there was no discussion for the last
> 10
> > > > days,
> > > > > > > but I
> > > > > > > > > > > >> couldn't
> > > > > > > > > > > >> find the vote thread. Is there one that I'm missing?
> > > > > > > > > > > >>
> > > > > > > > > > > >> Gwen
> > > > > > > > > > > >>
> > > > > > > > > > > >> On Wed, Sep 22, 2021 at 4:58 AM Magnus Edenhill <
> > > > > > > > magnus@edenhill.se>
> > > > > > > > > > > >> wrote:
> > > > > > > > > > > >>
> > > > > > > > > > > > >> > On Tue, Sep 21, 2021 at 06:58, Colin McCabe <cmccabe@apache.org> wrote:
> > > > > > > > > > > >> >
> > > > > > > > > > > >> > > On Mon, Sep 20, 2021, at 17:35, Feng Min wrote:
> > > > > > > > > > > >> > > > Thanks Magnus & Colin for the discussion.
> > > > > > > > > > > >> > > >
> > > > > > > > > > > >> > > > Based on KIP-714's stateless design, Client
> can
> > > > pretty
> > > > > > > much
> > > > > > > > use
> > > > > > > > > > > any
> > > > > > > > > > > >> > > > connection to any broker to send metrics. We
> are
> > > not
> > > > > > > > associating
> > > > > > > > > > > >> > > connection
> > > > > > > > > > > >> > > > with client metric state. Is my understanding
> > > > correct?
> > > > > > If
> > > > > > > > yes,
> > > > > > > > > > > how
> > > > > > > > > > > >> > about
> > > > > > > > > > > >> > > > the following two scenarios
> > > > > > > > > > > >> > > >
> > > > > > > > > > > >> > > > 1) One Client (Client-ID) registers two
> > different
> > > > client
> > > > > > > > > > instance
> > > > > > > > > > > id
> > > > > > > > > > > >> > via
> > > > > > > > > > > >> > > > separate registration. Is it permitted? If OK,
> > how
> > > > to
> > > > > > > > > > distinguish
> > > > > > > > > > > >> them
> > > > > > > > > > > >> > > from
> > > > > > > > > > > >> > > > the case 2 below.
> > > > > > > > > > > >> > > >
> > > > > > > > > > > >> > >
> > > > > > > > > > > >> > > Hi Feng,
> > > > > > > > > > > >> > >
> > > > > > > > > > > >> > > My understanding, which Magnus can clarify I
> > guess,
> > > is
> > > > > > that
> > > > > > > > you
> > > > > > > > > > > could
> > > > > > > > > > > >> > have
> > > > > > > > > > > >> > > something like two Producer instances running
> with
> > > the
> > > > > > same
> > > > > > > > > > > client.id
> > > > > > > > > > > >> > > (perhaps because they're using the same config
> > file,
> > > > for
> > > > > > > > example).
> > > > > > > > > > > >> They
> > > > > > > > > > > >> > > could even be in the same process. But they
> would
> > > get
> > > > > > > separate
> > > > > > > > > > > UUIDs.
> > > > > > > > > > > >> > >
> > > > > > > > > > > >> > > I believe Magnus used the term client to mean
> > > > "Producer or
> > > > > > > > > > > Consumer".
> > > > > > > > > > > >> So
> > > > > > > > > > > >> > > if you have both a Producer and a Consumer in
> your
> > > > > > > > application I
> > > > > > > > > > > would
> > > > > > > > > > > >> > > expect you'd get separate UUIDs for both. Again
> > > > Magnus can
> > > > > > > > chime
> > > > > > > > > > in
> > > > > > > > > > > >> > here, I
> > > > > > > > > > > >> > > guess.
> > > > > > > > > > > >> > >
> > > > > > > > > > > >> >
> > > > > > > > > > > >> > That's correct.
> > > > > > > > > > > >> >
> > > > > > > > > > > >> >
> > > > > > > > > > > >> > >
> > > > > > > > > > > >> > > > 2) How about the client restarting? What's the
> > > > > > > expectation?
> > > > > > > > > > Should
> > > > > > > > > > > >> the
> > > > > > > > > > > >> > > > server expect the client to carry a persisted
> > > client
> > > > > > > > instance id
> > > > > > > > > > > or
> > > > > > > > > > > >> > > should
> > > > > > > > > > > >> > > > the client be treated as a new instance?
> > > > > > > > > > > >> > >
> > > > > > > > > > > >> > > The KIP doesn't describe any mechanism for
> > > > persistence,
> > > > > > so I
> > > > > > > > would
> > > > > > > > > > > >> assume
> > > > > > > > > > > >> > > that when you restart the client you get a new
> > > UUID. I
> > > > > > agree
> > > > > > > > that
> > > > > > > > > > it
> > > > > > > > > > > >> > would
> > > > > > > > > > > >> > > be good to spell this out.
> > > > > > > > > > > >> > >
> > > > > > > > > > > >> > >
> > > > > > > > > > > >> > Right, it will not be persisted since a client
> > > instance
> > > > > > can't
> > > > > > > be
> > > > > > > > > > > >> restarted.
> > > > > > > > > > > >> >
> > > > > > > > > > > >> > Will update the KIP to make this clearer.
> > > > > > > > > > > >> >
> > > > > > > > > > > >> > /Magnus
> > > > > > > > > > > >> >
> > > > > > > > > > > >>
> > > > > > > > > > > >>
> > > > > > > > > > > >> --
> > > > > > > > > > > >> Gwen Shapira
> > > > > > > > > > > >> Engineering Manager | Confluent
> > > > > > > > > > > >> 650.450.2760 | @gwenshap
> > > > > > > > > > > >>
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] KIP-714: Client metrics and observability

Posted by Bob Barrett <bo...@confluent.io.INVALID>.

Hi Magnus,

Thanks for the thorough KIP, this seems very useful.

Would it make sense to include the request type as a label for the
`client.request.success`, `client.request.errors` and `client.request.rtt`
metrics? I think it would be very useful to see which specific requests are
succeeding and failing for a client. One specific case I can think of where
this could be useful is producer batch timeouts. If a Java application does
not enable producer client logs (unfortunately, in my experience this
happens more often than it should), the application logs will only contain
the expiration error message, but no information about what is causing the
timeout. The requests might all be succeeding but taking too long to
process batches, or metadata requests might be failing, or some or all
produce requests might be failing (if the bootstrap servers are reachable
from the client but one or more other brokers are not, for example). If the
cluster operator is able to identify the specific requests that are slow or
failing for a client, they will be better able to diagnose the issue
causing batch timeouts.
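
As a rough illustration of the idea above, the sketch below keeps separate counts per request type for the `client.request.success` and `client.request.errors` metrics. The `request_type` label name, the `record()` helper and the sample request names are assumptions made for this example; none of them are taken from the KIP.

```python
from collections import Counter

# Hypothetical in-memory store: (metric name, request_type label) -> count.
counters = Counter()

def record(request_type, ok):
    """Count one finished request under an (assumed) request_type label."""
    metric = "client.request.success" if ok else "client.request.errors"
    counters[(metric, request_type)] += 1

# Example scenario: produce requests succeed while metadata requests fail.
for _ in range(3):
    record("Produce", ok=True)
for _ in range(2):
    record("Metadata", ok=False)

def errors_by_type():
    """Return only the error counts, keyed by request type."""
    return {rt: n for (metric, rt), n in counters.items()
            if metric == "client.request.errors"}
```

With the label in place, an operator looking at `errors_by_type()` sees that only metadata requests are failing, which a single unlabelled error counter would not reveal.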

One drawback I can think of is that this will increase the cardinality of
the request metrics. But any given client is only going to use a small
subset of the request types, and since we already have partition labels for
the topic-level metrics, I think request labels will still make up a
relatively small percentage of the set of metrics.
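
To make the cardinality argument concrete, here is a back-of-the-envelope sketch. Only the three request metrics come from this thread; the number of request types in use, the number of topic-level metrics and the partition count are made-up inputs chosen purely for illustration.

```python
def series_count(metric_count, label_value_count):
    """Number of distinct time series: one per metric per label value."""
    return metric_count * label_value_count

# The three request metrics named above, with an assumed 5 request types in use.
request_series = series_count(3, 5)

# Hypothetical topic-level metrics: 4 metrics across 50 topics x 6 partitions.
topic_series = series_count(4, 50 * 6)

# Share of all series that the request-type label would account for.
request_share = request_series / (request_series + topic_series)
```

Under these assumed inputs the request-labelled series are a small fraction of the total; real numbers will depend on the client's workload.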

Thanks,
Bob

On Mon, Nov 22, 2021 at 2:08 AM Viktor Somogyi-Vass <vi...@gmail.com>
wrote:

> Hi Magnus,
>
> I think this is a very useful addition. We also have a similar (but much
> more simplistic) implementation of this. Maybe I missed it in the KIP but
> what about adding metrics about the subscription cache itself? That I think
> would improve its usability and debuggability as we'd be able to see its
> performance, hit/miss rates, eviction counts and others.
>
> Best,
> Viktor
>
> On Thu, Nov 18, 2021 at 5:12 PM Magnus Edenhill <ma...@edenhill.se>
> wrote:
>
> > Hi Mickael,
> >
> > see inline.
> >
> > On Wed, Nov 10, 2021 at 15:21, Mickael Maison <mickael.maison@gmail.com> wrote:
> >
> > > Hi Magnus,
> > >
> > > I see you've addressed some of the points I raised above but some (4,
> > > 5) have not been addressed yet.
> > >
> >
> > Re 4) How will the user/app know metrics are being sent.
> >
> > One possibility is to add a JMX metric (thus for user consumption) for
> the
> > number of metric pushes the
> > client has performed, or perhaps the number of metrics subscriptions
> > currently being collected.
> > Would that be sufficient?
> >
> > Re 5) Metric sizes and rates
> >
> > A worst case scenario for a producer that is producing to 50 unique
> topics
> > and emitting all standard metrics yields
> > a serialized size of around 100KB prior to compression, which compresses
> > down to about 20-30% of that depending
> > on compression type and topic name uniqueness.
> > The numbers for a consumer would be similar.
> >
> > In practice the number of unique topics would be far less, and the
> > subscription set would typically be for a subset of metrics.
> > So we're probably closer to 1KB, or less, compressed size per client per
> > push interval.
> >
> > As both the subscription set and push intervals are controlled by the
> > cluster operator it shouldn't be too hard
> > to strike a good balance between metrics overhead and granularity.
> >
> >
> >
> > >
> > > I'm really uneasy with this being enabled by default on the client
> > > side. When collecting data, I think the best practice is to ensure
> > > users are explicitly enabling it.
> > >
> >
> > Requiring metrics to be explicitly enabled on clients severely cripples
> > the feature's usability and value.
> >
> > One of the problems that this KIP aims to solve is for useful metrics to
> be
> > available on demand
> > regardless of the technical expertise of the user. As Ryanne points out,
> > a savvy user/organization
> > will typically have metrics collection and monitoring in place already,
> > and the benefits of this KIP
> > are then more of a common set and format of metrics across client
> > implementations and languages.
> > But that is not the typical Kafka user in my experience: they're not
> > Kafka experts and they don't have the
> > knowledge of how to best instrument their clients.
> > Having metrics enabled by default for this user base allows the Kafka
> > operators to proactively and reactively
> > monitor and troubleshoot client issues, without the need for the less
> savvy
> > user to do anything.
> > It is often too late to tell a user to enable metrics when the problem
> has
> > already occurred.
> >
> > Now, to be clear, even though metrics are enabled by default on clients,
> > they are not enabled by default
> > on the brokers; the Kafka operator needs to build and set up a metrics
> > plugin and add metrics subscriptions
> > before anything is sent from the client.
> > It is opt-out on the clients and opt-in on the broker.
> >
> >
> >
> >
> > > You mentioned brokers already have
> > > some(most?) of the information contained in metrics, if so then why
> > > are we collecting it again? Surely there must be some new information
> > > in the client metrics.
> > >
> >
> > From the user's perspective the Kafka infrastructure extends from
> > producer.send() to
> > messages being returned from consumer.poll(), a giant black box where
> > there's a lot going on between those
> > two points. The brokers currently only see what happens once those requests
> > and messages hit the broker,
> > but as Kafka clients are complex pieces of machinery there's a myriad of
> > queues, timers, and state
> > that's critical to the operation and infrastructure that's not currently
> > visible to the operator.
> > Relying on the user to accurately and timely provide this missing
> > information is not generally feasible.
> >
> >
> > Most of the standard metrics listed in the KIP are data points that the
> > broker does not have.
> > Only a small number of metrics are duplicates (like the request counts
> and
> > sizes), but they are included
> > to ease correlation when inspecting these client metrics.
> >
> >
> >
> > > Moreover this is a brand new feature so it's even harder to justify
> > > enabling it and forcing onto all our users. If disabled by default,
> > > it's relatively easy to enable in a new release if we decide to, but
> > > once enabled by default it's much harder to disable. Also this feature
> > > will apply to all future metrics we will add.
> > >
> >
> > I think maturity of a feature implementation should be the deciding
> factor,
> > rather than
> > the design of it (which this KIP is). I.e., if the implementation is not
> > deemed mature enough
> > for release X.Y it will be disabled.
> >
> >
> >
> > > Overall I think it's an interesting feature but I'd prefer to be
> > > slightly defensive and see how it works in practice before enabling it
> > > everywhere.
> > >
> >
> > Right, and I agree on being defensive, but since this feature still
> > requires manual
> > enabling on the brokers before actually being used, I think that gives
> > enough control
> > to opt-in or out of this feature as needed.
> >
> > Thanks for your comments!
> >
> > Regards,
> > Magnus
> >
> >
> >
> > > Thanks,
> > > Mickael
> > >
> > > On Mon, Nov 8, 2021 at 7:55 AM Magnus Edenhill <ma...@edenhill.se>
> > wrote:
> > > >
> > > > Thanks David for pointing this out,
> > > > I've updated the KIP to include client_id as a matching selector.
> > > >
> > > > Regards,
> > > > Magnus
> > > >
> > > > On Thu, Nov 4, 2021 at 18:01, David Mao <dmao@confluent.io.invalid> wrote:
> > > >
> > > > > Hey Magnus,
> > > > >
> > > > > I noticed that the KIP outlines the initial selectors supported as:
> > > > >
> > > > >    - client_instance_id - CLIENT_INSTANCE_ID UUID string
> > > representation.
> > > > >    - client_software_name  - client software implementation name.
> > > > >    - client_software_version  - client software implementation
> > version.
> > > > >
> > > > > In the given reactive monitoring workflow, we mention that the
> > > application
> > > > > user does not know their client's client instance ID, but it's
> > outlined
> > > > > that the operator can add a metrics subscription selecting for
> > > clientId. I
> > > > > don't see clientId as one of the supported selectors.
> > > > > I can see how this would have made sense in a previous iteration
> > given
> > > that
> > > > > the previous client instance ID proposal was to construct the
> client
> > > > > instance ID using clientId as a prefix. Now that the client
> instance
> > > ID is
> > > > > a UUID, would we want to add clientId as a supported selector?
> > > > > Let me know what you think.
> > > > >
> > > > > David
> > > > >
> > > > > On Tue, Oct 19, 2021 at 12:33 PM Magnus Edenhill <
> magnus@edenhill.se
> > >
> > > > > wrote:
> > > > >
> > > > > > Hi Mickael!
> > > > > >
> > > > > > On Tue, Oct 19, 2021 at 19:30, Mickael Maison <mickael.maison@gmail.com> wrote:
> > > > > >
> > > > > > > Hi Magnus,
> > > > > > >
> > > > > > > Thanks for the proposal.
> > > > > > >
> > > > > > > 1. Looking at the protocol section, isn't "ClientInstanceId"
> > > expected
> > > > > > > to be a field in GetTelemetrySubscriptionsResponseV0?
> Otherwise,
> > > how
> > > > > > > does a client retrieve this value?
> > > > > > >
> > > > > >
> > > > > > Good catch, it got removed by mistake in one of the edits.
> > > > > >
> > > > > >
> > > > > > >
> > > > > > > 2. In the client API section, you mention a new method
> > > > > > > "clientInstanceId()". Can you clarify which interfaces are
> > > affected?
> > > > > > > Is it only Consumer and Producer?
> > > > > > >
> > > > > >
> > > > > > And Admin. Will update the KIP.
> > > > > >
> > > > > >
> > > > > >
> > > > > > > 3. I'm a bit concerned this is enabled by default. Even if the
> > data
> > > > > > > collected is supposed to be not sensitive, I think this can be
> > > > > > > problematic in some environments. Also users don't seem to have
> > the
> > > > > > > choice to only expose some metrics. Knowing how much data transits
> > > > > > > through some applications can be considered critical.
> > > > > > >
> > > > > >
> > > > > > The broker already knows how much data transits through the
> client
> > > > > though,
> > > > > > right?
> > > > > > Care has been taken not to expose information in the standard
> > metrics
> > > > > that
> > > > > > might
> > > > > > reveal sensitive information.
> > > > > >
> > > > > > Do you have an example of how the proposed metrics could leak
> > > sensitive
> > > > > > information?
> > > > > > As for limiting what metrics to export, I guess that could make sense
> > > > > > in some
> > > > > > very sensitive use-cases, but those users might disable metrics
> > > > > altogether
> > > > > > for now.
> > > > > > Could these concerns be addressed by a later KIP?
> > > > > >
> > > > > >
> > > > > >
> > > > > > >
> > > > > > > 4. As a user, how do you know if your application is actively
> > > sending
> > > > > > > metrics? Are there new metrics exposing what's going on, like
> how
> > > much
> > > > > > > data is being sent?
> > > > > > >
> > > > > >
> > > > > > That's a good question.
> > > > > > Since the proposed metrics interface is not aimed at, or directly
> > > > > > available to, the application,
> > > > > > I guess there's little point in adding it here; how about instead
> > > > > > adding something to the
> > > > > > existing JMX metrics?
> > > > > > Do you have any suggestions?
> > > > > >
> > > > > >
> > > > > >
> > > > > > > 5. If all metrics are enabled on a regular Consumer or
> Producer,
> > do
> > > > > > > you have an idea how much throughput this would use?
> > > > > > >
> > > > > >
> > > > > > It depends on the number of partitions/topics/etc. the client is
> > > > > > producing to/consuming from.
> > > > > > I'll add some sizes to the KIP for some typical use-cases.
> > > > > >
> > > > > >
> > > > > > Thanks,
> > > > > > Magnus
> > > > > >
> > > > > >
> > > > > > > Thanks
> > > > > > >
> > > > > > > On Tue, Oct 19, 2021 at 5:06 PM Magnus Edenhill <
> > > magnus@edenhill.se>
> > > > > > > wrote:
> > > > > > > >
> > > > > > > > On Tue, Oct 19, 2021 at 13:22, Tom Bentley <tbentley@redhat.com> wrote:
> > > > > > > >
> > > > > > > > > Hi Magnus,
> > > > > > > > >
> > > > > > > > > I reviewed the KIP since you called the vote (sorry for not
> > > > > reviewing
> > > > > > > when
> > > > > > > > > you announced your intention to call the vote). I have a
> few
> > > > > > questions
> > > > > > > on
> > > > > > > > > some of the details.
> > > > > > > > >
> > > > > > > > > 1. There's no Javadoc on ClientTelemetryPayload.data(), so
> I
> > > don't
> > > > > > know
> > > > > > > > > whether the payload is exposed through this method as
> > > compressed or
> > > > > > > not.
> > > > > > > > > Later on you say "Decompression of the payloads will be
> > > handled by
> > > > > > the
> > > > > > > > > broker metrics plugin, the broker should expose a suitable
> > > > > > > decompression
> > > > > > > > > API to the metrics plugin for this purpose.", which
> suggests
> > > it's
> > > > > the
> > > > > > > > > compressed data in the buffer, but then we don't know which
> > > codec
> > > > > was
> > > > > > > used,
> > > > > > > > > nor the API via which the plugin should decompress it if
> > > required
> > > > > for
> > > > > > > > > forwarding to the ultimate metrics store. Should the
> > > > > > > ClientTelemetryPayload
> > > > > > > > > expose a method to get the compression and a decompressor?
> > > > > > > > >
> > > > > > > >
> > > > > > > > Good point, updated.
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > > 2. The client-side API is expressed as StringOrError
> > > > > > > > > ClientInstance::ClientInstanceId(int timeout_ms). I
> > understand
> > > that
> > > > > > > you're
> > > > > > > > > thinking about the librdkafka implementation, but it would
> be
> > > good
> > > > > to
> > > > > > > show
> > > > > > > > > the API as it would appear on the Apache Kafka clients.
> > > > > > > > >
> > > > > > > >
> > > > > > > > This was meant as pseudo-code, but I changed it to Java.
> > > > > > > >
> > > > > > > >
> > > > > > > > > 3. "PushTelemetryRequest|Response - protocol request used
> by
> > > the
> > > > > > > client to
> > > > > > > > > send metrics to any broker it is connected to." To be
> clear,
> > > this
> > > > > > means
> > > > > > > > > that the client can choose any of the connected brokers and
> > > push to
> > > > > > > just
> > > > > > > > > one of them? What should a supporting client do if it gets
> an
> > > error
> > > > > > > when
> > > > > > > > > pushing metrics to a broker, retry sending to the same
> broker
> > > or
> > > > > try
> > > > > > > > > pushing to another broker, or drop the metrics? Should
> > > supporting
> > > > > > > clients
> > > > > > > > > send successive requests to a single broker, or round
> robin,
> > > or is
> > > > > > > that up
> > > > > > > > > to the client author? I'm guessing the behaviour should be
> > > sticky
> > > > > to
> > > > > > > > > support the rate limiting features, but I think it would be
> > > good
> > > > > for
> > > > > > > client
> > > > > > > > > authors if this section were explicit on the recommended
> > > behaviour.
> > > > > > > > >
> > > > > > > >
> > > > > > > > You are right, I've updated the KIP to make this clearer.
> > > > > > > >
> > > > > > > >
> > > > > > > > > 4. "Mapping the client instance id to an actual application
> > > > > instance
> > > > > > > > > running on a (virtual) machine can be done by inspecting
> the
> > > > > metrics
> > > > > > > > > resource labels, such as the client source address and
> source
> > > port,
> > > > > > or
> > > > > > > > > security principal, all of which are added by the receiving
> > > broker.
> > > > > > > This
> > > > > > > > > will allow the operator together with the user to identify
> > the
> > > > > actual
> > > > > > > > > application instance." Is this really always true? The
> source
> > > IP
> > > > > and
> > > > > > > port
> > > > > > > > > might be a loadbalancer/proxy in some setups. The
> principal,
> > as
> > > > > > already
> > > > > > > > > mentioned in the KIP, might be shared between multiple
> > > > > applications.
> > > > > > > So at
> > > > > > > > > worst the organization running the clients might have to
> > > consult
> > > > > the
> > > > > > > logs
> > > > > > > > > of a set of client applications, right?
> > > > > > > > >
> > > > > > > >
> > > > > > > > Yes, that's correct. There's no guaranteed mapping from
> > > > > > > client_instance_id
> > > > > > > > to
> > > > > > > > an actual instance, that's why the KIP recommends client
> > > > > > implementations
> > > > > > > to
> > > > > > > > log the client instance id
> > > > > > > > upon retrieval, and also provide an API for the application
> to
> > > > > retrieve
> > > > > > > the
> > > > > > > > instance id programmatically
> > > > > > > > if it has a better way of exposing it.
> > > > > > > >
> > > > > > > >
> > > > > > > > 5. "Tests indicate that a compression ratio up to 10x is
> > > possible for
> > > > > > the
> > > > > > > > > standard metrics." Client authors might appreciate your
> > > mentioning
> > > > > > > which
> > > > > > > > > compression codec got these results.
> > > > > > > > >
> > > > > > > >
> > > > > > > > Good point. Updated.
> > > > > > > >
> > > > > > > >
> > > > > > > > > 6. "Should the client send a push request prior to expiry
> of
> > > the
> > > > > > > previously
> > > > > > > > > calculated PushIntervalMs the broker will discard the
> metrics
> > > and
> > > > > > > return a
> > > > > > > > > PushTelemetryResponse with the ErrorCode set to
> RateLimited."
> > > Is
> > > > > this
> > > > > > > > > RATE_LIMITED a new error code? It's not mentioned in the
> "New
> > > Error
> > > > > > > Codes"
> > > > > > > > > section.
> > > > > > > > >
> > > > > > > >
> > > > > > > > That's a leftover, it should be using the standard
> ThrottleTime
> > > > > > > mechanism.
> > > > > > > > Fixed.
> > > > > > > >
> > > > > > > >
> > > > > > > > > 7. In the section "Standard client resource labels"
> > > application_id
> > > > > is
> > > > > > > > > described as Kafka Streams only, but the section of "Client
> > > > > > > Identification"
> > > > > > > > > talks about "application instance id as an optional future
> > > > > > nice-to-have
> > > > > > > > > that may be included as a metrics label if it has been set
> by
> > > the
> > > > > > > user", so
> > > > > > > > > I'm confused whether non-Kafka Streams clients should set
> an
> > > > > > > application_id
> > > > > > > > > or not.
> > > > > > > > >
> > > > > > > >
> > > > > > > > I'll clarify this in the KIP, but basically we would need to
> > add
> > > an `
> > > > > > > > application.id` config
> > > > > > > > property for non-streams clients for this purpose, and that's
> > > outside
> > > > > > the
> > > > > > > > scope of this KIP since we want to make it zero-conf:ish on
> the
> > > > > client
> > > > > > > side.
> > > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > > Kind regards,
> > > > > > > > >
> > > > > > > > > Tom
> > > > > > > > >
> > > > > > > >
> > > > > > > > Thanks for the review,
> > > > > > > > Magnus
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > > On Thu, Oct 7, 2021 at 5:26 PM Magnus Edenhill <
> > > magnus@edenhill.se
> > > > > >
> > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Hi all,
> > > > > > > > > >
> > > > > > > > > > I've updated the KIP following our recent discussions on
> > the
> > > > > > mailing
> > > > > > > > > list:
> > > > > > > > > >  - split the protocol in two, one for getting the metrics
> > > > > > > subscriptions,
> > > > > > > > > > and one for pushing the metrics.
> > > > > > > > > >  - simplifications: initially only one supported metrics
> > > format,
> > > > > no
> > > > > > > > > > client.id in the instance id, etc.
> > > > > > > > > >  - made CLIENT_METRICS subscription configuration entries
> > > more
> > > > > > > structured
> > > > > > > > > >    and allowing better client matching selectors (not
> only
> > > on the
> > > > > > > > > instance
> > > > > > > > > > id, but also the other
> > > > > > > > > >    client resource labels, such as client_software_name,
> > > etc.).
> > > > > > > > > >
> > > > > > > > > > Unless there are further comments I'll call the vote in a
> > > day or
> > > > > > two.
> > > > > > > > > >
> > > > > > > > > > Regards,
> > > > > > > > > > Magnus
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > On Mon, Oct 4, 2021 at 20:57, Magnus Edenhill <magnus@edenhill.se> wrote:
> > > > > > > > > >
> > > > > > > > > > > Hi Gwen,
> > > > > > > > > > >
> > > > > > > > > > > I'm finishing up the KIP based on the last couple of
> > > discussion
> > > > > > > points
> > > > > > > > > in
> > > > > > > > > > > this thread
> > > > > > > > > > > and will call the Vote later this week.
> > > > > > > > > > >
> > > > > > > > > > > Best,
> > > > > > > > > > > Magnus
> > > > > > > > > > >
> > > > > > > > > > > On Sat, Oct 2, 2021 at 02:01, Gwen Shapira <gwen@confluent.io.invalid> wrote:
> > > > > > > > > > >
> > > > > > > > > > >> Hey,
> > > > > > > > > > >>
> > > > > > > > > > >> I noticed that there was no discussion for the last 10
> > > days,
> > > > > > but I
> > > > > > > > > > >> couldn't
> > > > > > > > > > >> find the vote thread. Is there one that I'm missing?
> > > > > > > > > > >>
> > > > > > > > > > >> Gwen
> > > > > > > > > > >>
> > > > > > > > > > >> On Wed, Sep 22, 2021 at 4:58 AM Magnus Edenhill <
> > > > > > > magnus@edenhill.se>
> > > > > > > > > > >> wrote:
> > > > > > > > > > >>
> > > > > > > > > > >> > On Tue, Sep 21, 2021 at 06:58, Colin McCabe <cmccabe@apache.org> wrote:
> > > > > > > > > > >> >
> > > > > > > > > > >> > > On Mon, Sep 20, 2021, at 17:35, Feng Min wrote:
> > > > > > > > > > >> > > > Thanks Magnus & Colin for the discussion.
> > > > > > > > > > >> > > >
> > > > > > > > > > >> > > > Based on KIP-714's stateless design, Client can
> > > pretty
> > > > > > much
> > > > > > > use
> > > > > > > > > > any
> > > > > > > > > > >> > > > connection to any broker to send metrics. We are
> > not
> > > > > > > associating
> > > > > > > > > > >> > > connection
> > > > > > > > > > >> > > > with client metric state. Is my understanding
> > > correct?
> > > > > If
> > > > > > > yes,
> > > > > > > > > > how
> > > > > > > > > > >> > about
> > > > > > > > > > >> > > > the following two scenarios
> > > > > > > > > > >> > > >
> > > > > > > > > > >> > > > 1) One Client (Client-ID) registers two
> different
> > > client
> > > > > > > > > instance
> > > > > > > > > > id
> > > > > > > > > > >> > via
> > > > > > > > > > >> > > > separate registration. Is it permitted? If OK,
> how
> > > to
> > > > > > > > > distinguish
> > > > > > > > > > >> them
> > > > > > > > > > >> > > from
> > > > > > > > > > >> > > > the case 2 below.
> > > > > > > > > > >> > > >
> > > > > > > > > > >> > >
> > > > > > > > > > >> > > Hi Feng,
> > > > > > > > > > >> > >
> > > > > > > > > > >> > > My understanding, which Magnus can clarify I
> guess,
> > is
> > > > > that
> > > > > > > you
> > > > > > > > > > could
> > > > > > > > > > >> > have
> > > > > > > > > > >> > > something like two Producer instances running with
> > the
> > > > > same
> > > > > > > > > > client.id
> > > > > > > > > > >> > > (perhaps because they're using the same config
> file,
> > > for
> > > > > > > example).
> > > > > > > > > > >> They
> > > > > > > > > > >> > > could even be in the same process. But they would
> > get
> > > > > > separate
> > > > > > > > > > UUIDs.
> > > > > > > > > > >> > >
> > > > > > > > > > >> > > I believe Magnus used the term client to mean
> > > "Producer or
> > > > > > > > > > Consumer".
> > > > > > > > > > >> So
> > > > > > > > > > >> > > if you have both a Producer and a Consumer in your
> > > > > > > application I
> > > > > > > > > > would
> > > > > > > > > > >> > > expect you'd get separate UUIDs for both. Again
> > > Magnus can
> > > > > > > chime
> > > > > > > > > in
> > > > > > > > > > >> > here, I
> > > > > > > > > > >> > > guess.
> > > > > > > > > > >> > >
> > > > > > > > > > >> >
> > > > > > > > > > >> > That's correct.
> > > > > > > > > > >> >
> > > > > > > > > > >> >
> > > > > > > > > > >> > >
> > > > > > > > > > >> > > > 2) How about the client restarting? What's the
> > > > > > expectation?
> > > > > > > > > Should
> > > > > > > > > > >> the
> > > > > > > > > > >> > > > server expect the client to carry a persisted
> > client
> > > > > > > instance id
> > > > > > > > > > or
> > > > > > > > > > >> > > should
> > > > > > > > > > >> > > > the client be treated as a new instance?
> > > > > > > > > > >> > >
> > > > > > > > > > >> > > The KIP doesn't describe any mechanism for
> > > persistence,
> > > > > so I
> > > > > > > would
> > > > > > > > > > >> assume
> > > > > > > > > > >> > > that when you restart the client you get a new
> > UUID. I
> > > > > agree
> > > > > > > that
> > > > > > > > > it
> > > > > > > > > > >> > would
> > > > > > > > > > >> > > be good to spell this out.
> > > > > > > > > > >> > >
> > > > > > > > > > >> > >
> > > > > > > > > > >> > Right, it will not be persisted since a client
> > instance
> > > > > can't
> > > > > > be
> > > > > > > > > > >> restarted.
> > > > > > > > > > >> >
> > > > > > > > > > >> > Will update the KIP to make this clearer.
> > > > > > > > > > >> >
> > > > > > > > > > >> > /Magnus
> > > > > > > > > > >> >
> > > > > > > > > > >>
> > > > > > > > > > >>
> > > > > > > > > > >> --
> > > > > > > > > > >> Gwen Shapira
> > > > > > > > > > >> Engineering Manager | Confluent
> > > > > > > > > > >> 650.450.2760 | @gwenshap

Re: [DISCUSS] KIP-714: Client metrics and observability

Posted by Magnus Edenhill <ma...@edenhill.se>.

Hi Viktor,

that's a good idea, I've added a bunch of broker-side metrics for the
client metrics handling.
More may be added during development as the need arises.

https://cwiki.apache.org/confluence/display/KAFKA/KIP-714%3A+Client+metrics+and+observability#KIP714:Clientmetricsandobservability-Newbrokermetrics

Thanks,
Magnus

Den mån 22 nov. 2021 kl 11:08 skrev Viktor Somogyi-Vass <
viktorsomogyi@gmail.com>:

> Hi Magnus,
>
> I think this is a very useful addition. We also have a similar (but much
> more simplistic) implementation of this. Maybe I missed it in the KIP but
> what about adding metrics about the subscription cache itself? That I think
> would improve its usability and debuggability as we'd be able to see its
> performance, hit/miss rates, eviction counts and others.
>
> Best,
> Viktor
>
> On Thu, Nov 18, 2021 at 5:12 PM Magnus Edenhill <ma...@edenhill.se>
> wrote:
>
> > Hi Mickael,
> >
> > see inline.
> >
> > Den ons 10 nov. 2021 kl 15:21 skrev Mickael Maison <
> > mickael.maison@gmail.com
> > >:
> >
> > > Hi Magnus,
> > >
> > > I see you've addressed some of the points I raised above but some (4,
> > > 5) have not been addressed yet.
> > >
> >
> > Re 4) How will the user/app know metrics are being sent?
> >
> > One possibility is to add a JMX metric (thus for user consumption) for
> the
> > number of metric pushes the
> > client has performed, or perhaps the number of metrics subscriptions
> > currently being collected.
> > Would that be sufficient?
> >
> > Re 5) Metric sizes and rates
> >
> > A worst-case scenario for a producer that is producing to 50 unique
> > topics and emitting all standard metrics yields a serialized size of
> > around 100 KB prior to compression, which compresses down to about 20-30%
> > of that, depending on compression type and topic name uniqueness.
> > The numbers for a consumer would be similar.
> >
> > In practice the number of unique topics would be far smaller, and the
> > subscription set would typically cover only a subset of metrics.
> > So we're probably closer to 1 KB, or less, compressed size per client per
> > push interval.
> >
> > As both the subscription set and push intervals are controlled by the
> > cluster operator it shouldn't be too hard
> > to strike a good balance between metrics overhead and granularity.
> >
> >
> >
> > >
> > > I'm really uneasy with this being enabled by default on the client
> > > side. When collecting data, I think the best practice is to ensure
> > > users are explicitly enabling it.
> > >
> >
> > Requiring metrics to be explicitly enabled on clients severely cripples
> > the feature's usability and value.
> >
> > One of the problems that this KIP aims to solve is for useful metrics to
> > be available on demand regardless of the technical expertise of the user.
> > As Ryanne points out, a savvy user/organization will typically have
> > metrics collection and monitoring in place already, and the benefits of
> > this KIP are then more of a common set and format of metrics across
> > client implementations and languages.
> > But that is not the typical Kafka user in my experience; they're not
> > Kafka experts and they don't have the knowledge of how to best instrument
> > their clients.
> > Having metrics enabled by default for this user base allows the Kafka
> > operators to proactively and reactively monitor and troubleshoot client
> > issues, without the need for the less savvy user to do anything.
> > It is often too late to tell a user to enable metrics when the problem
> > has already occurred.
> >
> > Now, to be clear, even though metrics are enabled by default on clients,
> > they are not enabled by default on the brokers; the Kafka operator needs
> > to build and set up a metrics plugin and add metrics subscriptions before
> > anything is sent from the client.
> > It is opt-out on the clients and opt-in on the broker.
> >
> >
> >
> >
> > > You mentioned brokers already have
> > > some(most?) of the information contained in metrics, if so then why
> > > are we collecting it again? Surely there must be some new information
> > > in the client metrics.
> > >
> >
> > From the user's perspective the Kafka infrastructure extends from
> > producer.send() to messages being returned from consumer.poll(), a giant
> > black box where there's a lot going on between those two points.
> > The brokers currently only see what happens once those requests and
> > messages hit the broker, but as Kafka clients are complex pieces of
> > machinery there's a myriad of queues, timers, and state that's critical
> > to the operation and infrastructure but not currently visible to the
> > operator.
> > Relying on the user to provide this missing information accurately and in
> > a timely manner is not generally feasible.
> >
> >
> > Most of the standard metrics listed in the KIP are data points that the
> > broker does not have.
> > Only a small number of metrics are duplicates (like the request counts
> and
> > sizes), but they are included
> > to ease correlation when inspecting these client metrics.
> >
> >
> >
> > > Moreover this is a brand new feature so it's even harder to justify
> > > enabling it and forcing it onto all our users. If disabled by default,
> > > it's relatively easy to enable in a new release if we decide to, but
> > > once enabled by default it's much harder to disable. Also this feature
> > > will apply to all future metrics we will add.
> > >
> >
> > I think maturity of a feature implementation should be the deciding
> factor,
> > rather than
> > the design of it (which this KIP is). I.e., if the implementation is not
> > deemed mature enough
> > for release X.Y it will be disabled.
> >
> >
> >
> > > Overall I think it's an interesting feature but I'd prefer to be
> > > slightly defensive and see how it works in practice before enabling it
> > > everywhere.
> > >
> >
> > Right, and I agree on being defensive, but since this feature still
> > requires manual enabling on the brokers before actually being used, I
> > think that gives enough control to opt in or out of this feature as
> > needed.
> >
> > Thanks for your comments!
> >
> > Regards,
> > Magnus
> >
> >
> >
> > > Thanks,
> > > Mickael
> > >
> > > On Mon, Nov 8, 2021 at 7:55 AM Magnus Edenhill <ma...@edenhill.se>
> > wrote:
> > > >
> > > > Thanks David for pointing this out,
> > > > I've updated the KIP to include client_id as a matching selector.
> > > >
> > > > Regards,
> > > > Magnus
> > > >
> > > > Den tors 4 nov. 2021 kl 18:01 skrev David Mao
> > <dmao@confluent.io.invalid
> > > >:
> > > >
> > > > > Hey Magnus,
> > > > >
> > > > > I noticed that the KIP outlines the initial selectors supported as:
> > > > >
> > > > > >    - client_instance_id - CLIENT_INSTANCE_ID UUID string representation.
> > > > > >    - client_software_name - client software implementation name.
> > > > > >    - client_software_version - client software implementation version.
> > > > >
> > > > > In the given reactive monitoring workflow, we mention that the
> > > application
> > > > > user does not know their client's client instance ID, but it's
> > outlined
> > > > > that the operator can add a metrics subscription selecting for
> > > clientId. I
> > > > > don't see clientId as one of the supported selectors.
> > > > > I can see how this would have made sense in a previous iteration
> > given
> > > that
> > > > > the previous client instance ID proposal was to construct the
> client
> > > > > instance ID using clientId as a prefix. Now that the client
> instance
> > > ID is
> > > > > a UUID, would we want to add clientId as a supported selector?
> > > > > Let me know what you think.
> > > > >
> > > > > David
> > > > >
> > > > > On Tue, Oct 19, 2021 at 12:33 PM Magnus Edenhill <
> magnus@edenhill.se
> > >
> > > > > wrote:
> > > > >
> > > > > > Hi Mickael!
> > > > > >
> > > > > > Den tis 19 okt. 2021 kl 19:30 skrev Mickael Maison <
> > > > > > mickael.maison@gmail.com
> > > > > > >:
> > > > > >
> > > > > > > Hi Magnus,
> > > > > > >
> > > > > > > Thanks for the proposal.
> > > > > > >
> > > > > > > 1. Looking at the protocol section, isn't "ClientInstanceId"
> > > expected
> > > > > > > to be a field in GetTelemetrySubscriptionsResponseV0?
> Otherwise,
> > > how
> > > > > > > does a client retrieve this value?
> > > > > > >
> > > > > >
> > > > > > Good catch, it got removed by mistake in one of the edits.
> > > > > >
> > > > > >
> > > > > > >
> > > > > > > 2. In the client API section, you mention a new method
> > > > > > > "clientInstanceId()". Can you clarify which interfaces are
> > > affected?
> > > > > > > Is it only Consumer and Producer?
> > > > > > >
> > > > > >
> > > > > > And Admin. Will update the KIP.
> > > > > >
> > > > > >
> > > > > >
> > > > > > > 3. I'm a bit concerned this is enabled by default. Even if the
> > data
> > > > > > > collected is supposed to be not sensitive, I think this can be
> > > > > > > problematic in some environments. Also users don't seem to have
> > the
> > > > > > > > choice to only expose some metrics. Knowing how much data
> > > > > > > > transits through some applications can be considered critical.
> > > > > > >
> > > > > >
> > > > > > The broker already knows how much data transits through the
> client
> > > > > though,
> > > > > > right?
> > > > > > Care has been taken not to expose information in the standard
> > metrics
> > > > > that
> > > > > > might
> > > > > > reveal sensitive information.
> > > > > >
> > > > > > Do you have an example of how the proposed metrics could leak
> > > sensitive
> > > > > > information?
> > > > > > > As for limiting what metrics to export; I guess that could make
> > > > > > > sense in some very sensitive use-cases, but those users might
> > > > > > > disable metrics altogether for now.
> > > > > > > Could these concerns be addressed by a later KIP?
> > > > > >
> > > > > >
> > > > > >
> > > > > > >
> > > > > > > 4. As a user, how do you know if your application is actively
> > > sending
> > > > > > > metrics? Are there new metrics exposing what's going on, like
> how
> > > much
> > > > > > > data is being sent?
> > > > > > >
> > > > > >
> > > > > > That's a good question.
> > > > > > > Since the proposed metrics interface is not aimed at, or directly
> > > > > > > available to, the application, I guess there's little point in
> > > > > > > adding it here; perhaps add something to the existing JMX metrics
> > > > > > > instead?
> > > > > > > Do you have any suggestions?
> > > > > >
> > > > > >
> > > > > >
> > > > > > > 5. If all metrics are enabled on a regular Consumer or
> Producer,
> > do
> > > > > > > you have an idea how much throughput this would use?
> > > > > > >
> > > > > >
> > > > > > It depends on the number of partition/topics/etc the client is
> > > producing
> > > > > > to/consuming from.
> > > > > > I'll add some sizes to the KIP for some typical use-cases.
> > > > > >
> > > > > >
> > > > > > Thanks,
> > > > > > Magnus
> > > > > >
> > > > > >
> > > > > > > Thanks
> > > > > > >
> > > > > > > On Tue, Oct 19, 2021 at 5:06 PM Magnus Edenhill <
> > > magnus@edenhill.se>
> > > > > > > wrote:
> > > > > > > >
> > > > > > > > Den tis 19 okt. 2021 kl 13:22 skrev Tom Bentley <
> > > tbentley@redhat.com
> > > > > >:
> > > > > > > >
> > > > > > > > > Hi Magnus,
> > > > > > > > >
> > > > > > > > > I reviewed the KIP since you called the vote (sorry for not
> > > > > reviewing
> > > > > > > when
> > > > > > > > > you announced your intention to call the vote). I have a
> few
> > > > > > questions
> > > > > > > on
> > > > > > > > > some of the details.
> > > > > > > > >
> > > > > > > > > 1. There's no Javadoc on ClientTelemetryPayload.data(), so
> I
> > > don't
> > > > > > know
> > > > > > > > > whether the payload is exposed through this method as
> > > compressed or
> > > > > > > not.
> > > > > > > > > Later on you say "Decompression of the payloads will be
> > > handled by
> > > > > > the
> > > > > > > > > broker metrics plugin, the broker should expose a suitable
> > > > > > > decompression
> > > > > > > > > API to the metrics plugin for this purpose.", which
> suggests
> > > it's
> > > > > the
> > > > > > > > > compressed data in the buffer, but then we don't know which
> > > codec
> > > > > was
> > > > > > > used,
> > > > > > > > > nor the API via which the plugin should decompress it if
> > > required
> > > > > for
> > > > > > > > > forwarding to the ultimate metrics store. Should the
> > > > > > > ClientTelemetryPayload
> > > > > > > > > expose a method to get the compression and a decompressor?
> > > > > > > > >
> > > > > > > >
> > > > > > > > Good point, updated.
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > > 2. The client-side API is expressed as StringOrError
> > > > > > > > > ClientInstance::ClientInstanceId(int timeout_ms). I
> > understand
> > > that
> > > > > > > you're
> > > > > > > > > thinking about the librdkafka implementation, but it would
> be
> > > good
> > > > > to
> > > > > > > show
> > > > > > > > > the API as it would appear on the Apache Kafka clients.
> > > > > > > > >
> > > > > > > >
> > > > > > > > This was meant as pseudo-code, but I changed it to Java.
> > > > > > > >
> > > > > > > >
> > > > > > > > > 3. "PushTelemetryRequest|Response - protocol request used
> by
> > > the
> > > > > > > client to
> > > > > > > > > send metrics to any broker it is connected to." To be
> clear,
> > > this
> > > > > > means
> > > > > > > > > that the client can choose any of the connected brokers and
> > > push to
> > > > > > > just
> > > > > > > > > one of them? What should a supporting client do if it gets
> an
> > > error
> > > > > > > when
> > > > > > > > > pushing metrics to a broker, retry sending to the same
> broker
> > > or
> > > > > try
> > > > > > > > > pushing to another broker, or drop the metrics? Should
> > > supporting
> > > > > > > clients
> > > > > > > > > send successive requests to a single broker, or round
> robin,
> > > or is
> > > > > > > that up
> > > > > > > > > to the client author? I'm guessing the behaviour should be
> > > sticky
> > > > > to
> > > > > > > > > support the rate limiting features, but I think it would be
> > > good
> > > > > for
> > > > > > > client
> > > > > > > > > authors if this section were explicit on the recommended
> > > behaviour.
> > > > > > > > >
> > > > > > > >
> > > > > > > > You are right, I've updated the KIP to make this clearer.
> > > > > > > >
> > > > > > > >
> > > > > > > > > 4. "Mapping the client instance id to an actual application
> > > > > instance
> > > > > > > > > running on a (virtual) machine can be done by inspecting
> the
> > > > > metrics
> > > > > > > > > resource labels, such as the client source address and
> source
> > > port,
> > > > > > or
> > > > > > > > > security principal, all of which are added by the receiving
> > > broker.
> > > > > > > This
> > > > > > > > > will allow the operator together with the user to identify
> > the
> > > > > actual
> > > > > > > > > application instance." Is this really always true? The
> source
> > > IP
> > > > > and
> > > > > > > port
> > > > > > > > > might be a loadbalancer/proxy in some setups. The
> principal,
> > as
> > > > > > already
> > > > > > > > > mentioned in the KIP, might be shared between multiple
> > > > > applications.
> > > > > > > So at
> > > > > > > > > worst the organization running the clients might have to
> > > consult
> > > > > the
> > > > > > > logs
> > > > > > > > > of a set of client applications, right?
> > > > > > > > >
> > > > > > > >
> > > > > > > > Yes, that's correct. There's no guaranteed mapping from
> > > > > > > client_instance_id
> > > > > > > > to
> > > > > > > > an actual instance, that's why the KIP recommends client
> > > > > > implementations
> > > > > > > to
> > > > > > > > log the client instance id
> > > > > > > > upon retrieval, and also provide an API for the application
> to
> > > > > retrieve
> > > > > > > the
> > > > > > > > instance id programmatically
> > > > > > > > if it has a better way of exposing it.
> > > > > > > >
> > > > > > > >
> > > > > > > > 5. "Tests indicate that a compression ratio up to 10x is
> > > possible for
> > > > > > the
> > > > > > > > > standard metrics." Client authors might appreciate your
> > > mentioning
> > > > > > > which
> > > > > > > > > compression codec got these results.
> > > > > > > > >
> > > > > > > >
> > > > > > > > Good point. Updated.
> > > > > > > >
> > > > > > > >
> > > > > > > > > 6. "Should the client send a push request prior to expiry
> of
> > > the
> > > > > > > previously
> > > > > > > > > calculated PushIntervalMs the broker will discard the
> metrics
> > > and
> > > > > > > return a
> > > > > > > > > PushTelemetryResponse with the ErrorCode set to
> RateLimited."
> > > Is
> > > > > this
> > > > > > > > > RATE_LIMITED a new error code? It's not mentioned in the
> "New
> > > Error
> > > > > > > Codes"
> > > > > > > > > section.
> > > > > > > > >
> > > > > > > >
> > > > > > > > That's a leftover, it should be using the standard
> ThrottleTime
> > > > > > > mechanism.
> > > > > > > > Fixed.
> > > > > > > >
> > > > > > > >
> > > > > > > > > 7. In the section "Standard client resource labels"
> > > application_id
> > > > > is
> > > > > > > > > described as Kafka Streams only, but the section of "Client
> > > > > > > Identification"
> > > > > > > > > talks about "application instance id as an optional future
> > > > > > nice-to-have
> > > > > > > > > that may be included as a metrics label if it has been set
> by
> > > the
> > > > > > > user", so
> > > > > > > > > I'm confused whether non-Kafka Streams clients should set
> an
> > > > > > > application_id
> > > > > > > > > or not.
> > > > > > > > >
> > > > > > > >
> > > > > > > > I'll clarify this in the KIP, but basically we would need to
> > add
> > > an `
> > > > > > > > application.id` config
> > > > > > > > property for non-streams clients for this purpose, and that's
> > > outside
> > > > > > the
> > > > > > > > scope of this KIP since we want to make it zero-conf:ish on
> the
> > > > > client
> > > > > > > side.
> > > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > > Kind regards,
> > > > > > > > >
> > > > > > > > > Tom
> > > > > > > > >
> > > > > > > >
> > > > > > > > Thanks for the review,
> > > > > > > > Magnus
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > > On Thu, Oct 7, 2021 at 5:26 PM Magnus Edenhill <
> > > magnus@edenhill.se
> > > > > >
> > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Hi all,
> > > > > > > > > >
> > > > > > > > > > I've updated the KIP following our recent discussions on
> > the
> > > > > > mailing
> > > > > > > > > list:
> > > > > > > > > >  - split the protocol in two, one for getting the metrics
> > > > > > > subscriptions,
> > > > > > > > > > and one for pushing the metrics.
> > > > > > > > > >  - simplifications: initially only one supported metrics
> > > format,
> > > > > no
> > > > > > > > > > client.id in the instance id, etc.
> > > > > > > > > >  - made CLIENT_METRICS subscription configuration entries
> > > more
> > > > > > > structured
> > > > > > > > > >    and allowing better client matching selectors (not
> only
> > > on the
> > > > > > > > > instance
> > > > > > > > > > id, but also the other
> > > > > > > > > >    client resource labels, such as client_software_name,
> > > etc.).
> > > > > > > > > >
> > > > > > > > > > Unless there are further comments I'll call the vote in a
> > > day or
> > > > > > two.
> > > > > > > > > >
> > > > > > > > > > Regards,
> > > > > > > > > > Magnus
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Den mån 4 okt. 2021 kl 20:57 skrev Magnus Edenhill <
> > > > > > > magnus@edenhill.se>:
> > > > > > > > > >
> > > > > > > > > > > Hi Gwen,
> > > > > > > > > > >
> > > > > > > > > > > I'm finishing up the KIP based on the last couple of
> > > discussion
> > > > > > > points
> > > > > > > > > in
> > > > > > > > > > > this thread
> > > > > > > > > > > and will call the Vote later this week.
> > > > > > > > > > >
> > > > > > > > > > > Best,
> > > > > > > > > > > Magnus
> > > > > > > > > > >
> > > > > > > > > > > Den lör 2 okt. 2021 kl 02:01 skrev Gwen Shapira
> > > > > > > > > > <gwen@confluent.io.invalid
> > > > > > > > > > > >:
> > > > > > > > > > >
> > > > > > > > > > >> Hey,
> > > > > > > > > > >>
> > > > > > > > > > >> I noticed that there was no discussion for the last 10
> > > days,
> > > > > > but I
> > > > > > > > > > >> couldn't
> > > > > > > > > > >> find the vote thread. Is there one that I'm missing?
> > > > > > > > > > >>
> > > > > > > > > > >> Gwen
> > > > > > > > > > >>
> > > > > > > > > > >> On Wed, Sep 22, 2021 at 4:58 AM Magnus Edenhill <
> > > > > > > magnus@edenhill.se>
> > > > > > > > > > >> wrote:
> > > > > > > > > > >>
> > > > > > > > > > >> > Den tis 21 sep. 2021 kl 06:58 skrev Colin McCabe <
> > > > > > > > > cmccabe@apache.org
> > > > > > > > > > >:
> > > > > > > > > > >> >
> > > > > > > > > > >> > > On Mon, Sep 20, 2021, at 17:35, Feng Min wrote:
> > > > > > > > > > >> > > > Thanks Magnus & Colin for the discussion.
> > > > > > > > > > >> > > >
> > > > > > > > > > >> > > > Based on KIP-714's stateless design, Client can
> > > pretty
> > > > > > much
> > > > > > > use
> > > > > > > > > > any
> > > > > > > > > > >> > > > connection to any broker to send metrics. We are
> > not
> > > > > > > associating
> > > > > > > > > > >> > > connection
> > > > > > > > > > >> > > > with client metric state. Is my understanding
> > > correct?
> > > > > If
> > > > > > > yes,
> > > > > > > > > > how
> > > > > > > > > > >> > about
> > > > > > > > > > >> > > > the following two scenarios
> > > > > > > > > > >> > > >
> > > > > > > > > > >> > > > 1) One Client (Client-ID) registers two
> different
> > > client
> > > > > > > > > instance
> > > > > > > > > > id
> > > > > > > > > > >> > via
> > > > > > > > > > >> > > > separate registration. Is it permitted? If OK,
> how
> > > to
> > > > > > > > > distinguish
> > > > > > > > > > >> them
> > > > > > > > > > >> > > from
> > > > > > > > > > >> > > > the case 2 below.
> > > > > > > > > > >> > > >
> > > > > > > > > > >> > >
> > > > > > > > > > >> > > Hi Feng,
> > > > > > > > > > >> > >
> > > > > > > > > > >> > > My understanding, which Magnus can clarify I
> guess,
> > is
> > > > > that
> > > > > > > you
> > > > > > > > > > could
> > > > > > > > > > >> > have
> > > > > > > > > > >> > > something like two Producer instances running with
> > the
> > > > > same
> > > > > > > > > > client.id
> > > > > > > > > > >> > > (perhaps because they're using the same config
> file,
> > > for
> > > > > > > example).
> > > > > > > > > > >> They
> > > > > > > > > > >> > > could even be in the same process. But they would
> > get
> > > > > > separate
> > > > > > > > > > UUIDs.
> > > > > > > > > > >> > >
> > > > > > > > > > >> > > I believe Magnus used the term client to mean
> > > "Producer or
> > > > > > > > > > Consumer".
> > > > > > > > > > >> So
> > > > > > > > > > >> > > if you have both a Producer and a Consumer in your
> > > > > > > application I
> > > > > > > > > > would
> > > > > > > > > > >> > > expect you'd get separate UUIDs for both. Again
> > > Magnus can
> > > > > > > chime
> > > > > > > > > in
> > > > > > > > > > >> > here, I
> > > > > > > > > > >> > > guess.
> > > > > > > > > > >> > >
> > > > > > > > > > >> >
> > > > > > > > > > >> > That's correct.
> > > > > > > > > > >> >
> > > > > > > > > > >> >
> > > > > > > > > > >> > >
> > > > > > > > > > >> > > > 2) How about the client restarting? What's the
> > > > > > expectation?
> > > > > > > > > Should
> > > > > > > > > > >> the
> > > > > > > > > > >> > > > server expect the client to carry a persisted
> > client
> > > > > > > instance id
> > > > > > > > > > or
> > > > > > > > > > >> > > should
> > > > > > > > > > >> > > > the client be treated as a new instance?
> > > > > > > > > > >> > >
> > > > > > > > > > >> > > The KIP doesn't describe any mechanism for
> > > persistence,
> > > > > so I
> > > > > > > would
> > > > > > > > > > >> assume
> > > > > > > > > > >> > > that when you restart the client you get a new
> > UUID. I
> > > > > agree
> > > > > > > that
> > > > > > > > > it
> > > > > > > > > > >> > would
> > > > > > > > > > >> > > be good to spell this out.
> > > > > > > > > > >> > >
> > > > > > > > > > >> > >
> > > > > > > > > > >> > Right, it will not be persisted since a client
> > instance
> > > > > can't
> > > > > > be
> > > > > > > > > > >> restarted.
> > > > > > > > > > >> >
> > > > > > > > > > >> > Will update the KIP to make this clearer.
> > > > > > > > > > >> >
> > > > > > > > > > >> > /Magnus
> > > > > > > > > > >> >
> > > > > > > > > > >>
> > > > > > > > > > >>
> > > > > > > > > > >> --
> > > > > > > > > > >> Gwen Shapira
> > > > > > > > > > >> Engineering Manager | Confluent
> > > > > > > > > > >> 650.450.2760 | @gwenshap

Re: [DISCUSS] KIP-714: Client metrics and observability

Posted by Viktor Somogyi-Vass <vi...@gmail.com>.

Hi Magnus,

I think this is a very useful addition. We also have a similar (but much
more simplistic) implementation of this. Maybe I missed it in the KIP but
what about adding metrics about the subscription cache itself? That I think
would improve its usability and debuggability as we'd be able to see its
performance, hit/miss rates, eviction counts and others.
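
To make the suggestion concrete, here is a rough sketch of the kind of counters
I have in mind, assuming the subscription cache behaves like a size-bounded LRU
map. The class name, method names and the LRU policy are all hypothetical and
are not taken from the KIP or from the Kafka code base:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.atomic.LongAdder;

/**
 * Hypothetical sketch: a size-bounded LRU cache that counts hits, misses and
 * evictions. None of these names come from KIP-714 or the Kafka code base.
 */
class InstrumentedSubscriptionCache<K, V> {
    private final LongAdder hits = new LongAdder();
    private final LongAdder misses = new LongAdder();
    private final LongAdder evictions = new LongAdder();
    private final Map<K, V> entries;

    InstrumentedSubscriptionCache(final int maxEntries) {
        // accessOrder=true keeps the least recently used entry as the eldest.
        this.entries = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                boolean evict = size() > maxEntries;
                if (evict) {
                    evictions.increment();
                }
                return evict;
            }
        };
    }

    /** Looks up a key and records the lookup as a hit or a miss. */
    synchronized V get(K key) {
        V value = entries.get(key);
        if (value == null) {
            misses.increment();
        } else {
            hits.increment();
        }
        return value;
    }

    /** Inserts or replaces an entry; may evict the least recently used one. */
    synchronized void put(K key, V value) {
        entries.put(key, value);
    }

    long hitCount() { return hits.sum(); }
    long missCount() { return misses.sum(); }
    long evictionCount() { return evictions.sum(); }

    /** Fraction of lookups served from the cache; 0.0 before any lookup. */
    double hitRate() {
        long h = hits.sum();
        long total = h + misses.sum();
        return total == 0 ? 0.0 : (double) h / total;
    }
}
```

On the broker these values would presumably be registered with the existing
metrics registry rather than read through getters; that wiring is left out of
the sketch.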

Best,
Viktor

On Thu, Nov 18, 2021 at 5:12 PM Magnus Edenhill <ma...@edenhill.se> wrote:

> Hi Mickael,
>
> see inline.
>
> Den ons 10 nov. 2021 kl 15:21 skrev Mickael Maison <
> mickael.maison@gmail.com
> >:
>
> > Hi Magnus,
> >
> > I see you've addressed some of the points I raised above but some (4,
> > 5) have not been addressed yet.
> >
>
> Re 4) How will the user/app know metrics are being sent?
>
> One possibility is to add a JMX metric (thus for user consumption) for the
> number of metric pushes the
> client has performed, or perhaps the number of metrics subscriptions
> currently being collected.
> Would that be sufficient?
>
> Re 5) Metric sizes and rates
>
> A worst-case scenario for a producer that is producing to 50 unique topics
> and emitting all standard metrics yields a serialized size of around 100 KB
> prior to compression, which compresses down to about 20-30% of that,
> depending on compression type and topic name uniqueness.
> The numbers for a consumer would be similar.
>
> In practice the number of unique topics would be far smaller, and the
> subscription set would typically cover only a subset of metrics.
> So we're probably closer to 1 KB, or less, compressed size per client per
> push interval.
>
> As both the subscription set and push intervals are controlled by the
> cluster operator it shouldn't be too hard
> to strike a good balance between metrics overhead and granularity.
>
>
>
> >
> > I'm really uneasy with this being enabled by default on the client
> > side. When collecting data, I think the best practice is to ensure
> > users are explicitly enabling it.
> >
>
> Requiring metrics to be explicitly enabled on clients severely cripples the
> feature's usability and value.
>
> One of the problems that this KIP aims to solve is for useful metrics to be
> available on demand regardless of the technical expertise of the user.
> As Ryanne points out, a savvy user/organization will typically have metrics
> collection and monitoring in place already, and the benefits of this KIP
> are then more of a common set and format of metrics across client
> implementations and languages.
> But that is not the typical Kafka user in my experience; they're not Kafka
> experts and they don't have the knowledge of how to best instrument their
> clients.
> Having metrics enabled by default for this user base allows the Kafka
> operators to proactively and reactively monitor and troubleshoot client
> issues, without the need for the less savvy user to do anything.
> It is often too late to tell a user to enable metrics when the problem has
> already occurred.
>
> Now, to be clear, even though metrics are enabled by default on clients,
> they are not enabled by default on the brokers; the Kafka operator needs to
> build and set up a metrics plugin and add metrics subscriptions before
> anything is sent from the client.
> It is opt-out on the clients and opt-in on the broker.
>
>
>
>
> > You mentioned brokers already have
> > some(most?) of the information contained in metrics, if so then why
> > are we collecting it again? Surely there must be some new information
> > in the client metrics.
> >
>
> From the user's perspective the Kafka infrastructure extends from
> producer.send() to messages being returned from consumer.poll(), a giant
> black box where there's a lot going on between those two points.
> The brokers currently only see what happens once those requests and
> messages hit the broker, but as Kafka clients are complex pieces of
> machinery there's a myriad of queues, timers, and state that's critical to
> the operation and infrastructure but not currently visible to the operator.
> Relying on the user to provide this missing information accurately and in a
> timely manner is not generally feasible.
>
>
> Most of the standard metrics listed in the KIP are data points that the
> broker does not have.
> Only a small number of metrics are duplicates (like the request counts and
> sizes), but they are included
> to ease correlation when inspecting these client metrics.
>
>
>
> > Moreover this is a brand new feature so it's even harder to justify
> > enabling it and forcing it onto all our users. If disabled by default,
> > it's relatively easy to enable in a new release if we decide to, but
> > once enabled by default it's much harder to disable. Also this feature
> > will apply to all future metrics we will add.
> >
>
> I think maturity of a feature implementation should be the deciding factor,
> rather than
> the design of it (which this KIP is). I.e., if the implementation is not
> deemed mature enough
> for release X.Y it will be disabled.
>
>
>
> > Overall I think it's an interesting feature but I'd prefer to be
> > slightly defensive and see how it works in practice before enabling it
> > everywhere.
> >
>
> Right, and I agree on being defensive, but since this feature still
> requires manual enabling on the brokers before actually being used, I think
> that gives enough control to opt in or out of this feature as needed.
>
> Thanks for your comments!
>
> Regards,
> Magnus
>
>
>
> > Thanks,
> > Mickael
> >
> > On Mon, Nov 8, 2021 at 7:55 AM Magnus Edenhill <ma...@edenhill.se>
> wrote:
> > >
> > > Thanks David for pointing this out,
> > > I've updated the KIP to include client_id as a matching selector.
> > >
> > > Regards,
> > > Magnus
> > >
> > > Den tors 4 nov. 2021 kl 18:01 skrev David Mao
> <dmao@confluent.io.invalid
> > >:
> > >
> > > > Hey Magnus,
> > > >
> > > > I noticed that the KIP outlines the initial selectors supported as:
> > > >
> > > >    - client_instance_id - CLIENT_INSTANCE_ID UUID string representation.
> > > >    - client_software_name - client software implementation name.
> > > >    - client_software_version - client software implementation version.
> > > >
> > > > In the given reactive monitoring workflow, we mention that the
> > application
> > > > user does not know their client's client instance ID, but it's
> outlined
> > > > that the operator can add a metrics subscription selecting for
> > clientId. I
> > > > don't see clientId as one of the supported selectors.
> > > > I can see how this would have made sense in a previous iteration
> given
> > that
> > > > the previous client instance ID proposal was to construct the client
> > > > instance ID using clientId as a prefix. Now that the client instance
> > ID is
> > > > a UUID, would we want to add clientId as a supported selector?
> > > > Let me know what you think.
> > > >
> > > > David
> > > >
> > > > On Tue, Oct 19, 2021 at 12:33 PM Magnus Edenhill <magnus@edenhill.se
> >
> > > > wrote:
> > > >
> > > > > Hi Mickael!
> > > > >
> > > > > Den tis 19 okt. 2021 kl 19:30 skrev Mickael Maison <
> > > > > mickael.maison@gmail.com
> > > > > >:
> > > > >
> > > > > > Hi Magnus,
> > > > > >
> > > > > > Thanks for the proposal.
> > > > > >
> > > > > > 1. Looking at the protocol section, isn't "ClientInstanceId"
> > expected
> > > > > > to be a field in GetTelemetrySubscriptionsResponseV0? Otherwise,
> > how
> > > > > > does a client retrieve this value?
> > > > > >
> > > > >
> > > > > Good catch, it got removed by mistake in one of the edits.
> > > > >
> > > > >
> > > > > >
> > > > > > 2. In the client API section, you mention a new method
> > > > > > "clientInstanceId()". Can you clarify which interfaces are
> > affected?
> > > > > > Is it only Consumer and Producer?
> > > > > >
> > > > >
> > > > > And Admin. Will update the KIP.
> > > > >
> > > > >
> > > > >
> > > > > > 3. I'm a bit concerned this is enabled by default. Even if the
> data
> > > > > > collected is supposed to be not sensitive, I think this can be
> > > > > > problematic in some environments. Also users don't seem to have
> the
> > > > > > choice to only expose some metrics. Knowing how much data transit
> > > > > > through some applications can be considered critical.
> > > > > >
> > > > >
> > > > > The broker already knows how much data transits through the client
> > > > though,
> > > > > right?
> > > > > Care has been taken not to expose information in the standard
> metrics
> > > > that
> > > > > might
> > > > > reveal sensitive information.
> > > > >
> > > > > Do you have an example of how the proposed metrics could leak
> > sensitive
> > > > > information?
> > > > > As for limiting the what metrics to export; I guess that could make
> > sense
> > > > > in some
> > > > > very sensitive use-cases, but those users might disable metrics
> > > > altogether
> > > > > for now.
> > > > > Could these concerns be addressed by a later KIP?
> > > > >
> > > > >
> > > > >
> > > > > >
> > > > > > 4. As a user, how do you know if your application is actively
> > sending
> > > > > > metrics? Are there new metrics exposing what's going on, like how
> > much
> > > > > > data is being sent?
> > > > > >
> > > > >
> > > > > That's a good question.
> > > > > Since the proposed metrics interface is not aimed at, or directly
> > > > available
> > > > > to, the application
> > > > > I guess there's little point of adding it here, but instead adding
> > > > > something to the
> > > > > existing JMX metrics?
> > > > > Do you have any suggestions?
> > > > >
> > > > >
> > > > >
> > > > > > 5. If all metrics are enabled on a regular Consumer or Producer,
> do
> > > > > > you have an idea how much throughput this would use?
> > > > > >
> > > > >
> > > > > It depends on the number of partition/topics/etc the client is
> > producing
> > > > > to/consuming from.
> > > > > I'll add some sizes to the KIP for some typical use-cases.
> > > > >
> > > > >
> > > > > Thanks,
> > > > > Magnus
> > > > >
> > > > >
> > > > > > Thanks
> > > > > >
> > > > > > On Tue, Oct 19, 2021 at 5:06 PM Magnus Edenhill <
> > magnus@edenhill.se>
> > > > > > wrote:
> > > > > > >
> > > > > > > Den tis 19 okt. 2021 kl 13:22 skrev Tom Bentley <
> > tbentley@redhat.com
> > > > >:
> > > > > > >
> > > > > > > > Hi Magnus,
> > > > > > > >
> > > > > > > > I reviewed the KIP since you called the vote (sorry for not
> > > > reviewing
> > > > > > when
> > > > > > > > you announced your intention to call the vote). I have a few
> > > > > questions
> > > > > > on
> > > > > > > > some of the details.
> > > > > > > >
> > > > > > > > 1. There's no Javadoc on ClientTelemetryPayload.data(), so I
> > don't
> > > > > know
> > > > > > > > whether the payload is exposed through this method as
> > compressed or
> > > > > > not.
> > > > > > > > Later on you say "Decompression of the payloads will be
> > handled by
> > > > > the
> > > > > > > > broker metrics plugin, the broker should expose a suitable
> > > > > > decompression
> > > > > > > > API to the metrics plugin for this purpose.", which suggests
> > it's
> > > > the
> > > > > > > > compressed data in the buffer, but then we don't know which
> > codec
> > > > was
> > > > > > used,
> > > > > > > > nor the API via which the plugin should decompress it if
> > required
> > > > for
> > > > > > > > forwarding to the ultimate metrics store. Should the
> > > > > > ClientTelemetryPayload
> > > > > > > > expose a method to get the compression and a decompressor?
> > > > > > > >
> > > > > > >
> > > > > > > Good point, updated.
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > > 2. The client-side API is expressed as StringOrError
> > > > > > > > ClientInstance::ClientInstanceId(int timeout_ms). I
> understand
> > that
> > > > > > you're
> > > > > > > > thinking about the librdkafka implementation, but it would be
> > good
> > > > to
> > > > > > show
> > > > > > > > the API as it would appear on the Apache Kafka clients.
> > > > > > > >
> > > > > > >
> > > > > > > This was meant as pseudo-code, but I changed it to Java.
> > > > > > >
> > > > > > >
> > > > > > > > 3. "PushTelemetryRequest|Response - protocol request used by
> > the
> > > > > > client to
> > > > > > > > send metrics to any broker it is connected to." To be clear,
> > this
> > > > > means
> > > > > > > > that the client can choose any of the connected brokers and
> > push to
> > > > > > just
> > > > > > > > one of them? What should a supporting client do if it gets an
> > error
> > > > > > when
> > > > > > > > pushing metrics to a broker, retry sending to the same broker
> > or
> > > > try
> > > > > > > > pushing to another broker, or drop the metrics? Should
> > supporting
> > > > > > clients
> > > > > > > > send successive requests to a single broker, or round robin,
> > or is
> > > > > > that up
> > > > > > > > to the client author? I'm guessing the behaviour should be
> > sticky
> > > > to
> > > > > > > > support the rate limiting features, but I think it would be
> > good
> > > > for
> > > > > > client
> > > > > > > > authors if this section were explicit on the recommended
> > behaviour.
> > > > > > > >
> > > > > > >
> > > > > > > You are right, I've updated the KIP to make this clearer.
> > > > > > >
> > > > > > >
> > > > > > > > 4. "Mapping the client instance id to an actual application
> > > > instance
> > > > > > > > running on a (virtual) machine can be done by inspecting the
> > > > metrics
> > > > > > > > resource labels, such as the client source address and source
> > port,
> > > > > or
> > > > > > > > security principal, all of which are added by the receiving
> > broker.
> > > > > > This
> > > > > > > > will allow the operator together with the user to identify
> the
> > > > actual
> > > > > > > > application instance." Is this really always true? The source
> > IP
> > > > and
> > > > > > port
> > > > > > > > might be a loadbalancer/proxy in some setups. The principal,
> as
> > > > > already
> > > > > > > > mentioned in the KIP, might be shared between multiple
> > > > applications.
> > > > > > So at
> > > > > > > > worst the organization running the clients might have to
> > consult
> > > > the
> > > > > > logs
> > > > > > > > of a set of client applications, right?
> > > > > > > >
> > > > > > >
> > > > > > > Yes, that's correct. There's no guaranteed mapping from
> > > > > > client_instance_id
> > > > > > > to
> > > > > > > an actual instance, that's why the KIP recommends client
> > > > > implementations
> > > > > > to
> > > > > > > log the client instance id
> > > > > > > upon retrieval, and also provide an API for the application to
> > > > retrieve
> > > > > > the
> > > > > > > instance id programmatically
> > > > > > > if it has a better way of exposing it.
> > > > > > >
> > > > > > >
> > > > > > > 5. "Tests indicate that a compression ratio up to 10x is
> > possible for
> > > > > the
> > > > > > > > standard metrics." Client authors might appreciate your
> > mentioning
> > > > > > which
> > > > > > > > compression codec got these results.
> > > > > > > >
> > > > > > >
> > > > > > > Good point. Updated.
> > > > > > >
> > > > > > >
> > > > > > > > 6. "Should the client send a push request prior to expiry of
> > the
> > > > > > previously
> > > > > > > > calculated PushIntervalMs the broker will discard the metrics
> > and
> > > > > > return a
> > > > > > > > PushTelemetryResponse with the ErrorCode set to RateLimited."
> > Is
> > > > this
> > > > > > > > RATE_LIMITED a new error code? It's not mentioned in the "New
> > Error
> > > > > > Codes"
> > > > > > > > section.
> > > > > > > >
> > > > > > >
> > > > > > > That's a leftover, it should be using the standard ThrottleTime
> > > > > > mechanism.
> > > > > > > Fixed.
> > > > > > >
> > > > > > >
> > > > > > > > 7. In the section "Standard client resource labels"
> > application_id
> > > > is
> > > > > > > > described as Kafka Streams only, but the section of "Client
> > > > > > Identification"
> > > > > > > > talks about "application instance id as an optional future
> > > > > nice-to-have
> > > > > > > > that may be included as a metrics label if it has been set by
> > the
> > > > > > user", so
> > > > > > > > I'm confused whether non-Kafka Streams clients should set an
> > > > > > application_id
> > > > > > > > or not.
> > > > > > > >
> > > > > > >
> > > > > > > I'll clarify this in the KIP, but basically we would need to
> add
> > an `
> > > > > > > application.id` config
> > > > > > > property for non-streams clients for this purpose, and that's
> > outside
> > > > > the
> > > > > > > scope of this KIP since we want to make it zero-conf:ish on the
> > > > client
> > > > > > side.
> > > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > Kind regards,
> > > > > > > >
> > > > > > > > Tom
> > > > > > > >
> > > > > > >
> > > > > > > Thanks for the review,
> > > > > > > Magnus
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > On Thu, Oct 7, 2021 at 5:26 PM Magnus Edenhill <
> > magnus@edenhill.se
> > > > >
> > > > > > wrote:
> > > > > > > >
> > > > > > > > > Hi all,
> > > > > > > > >
> > > > > > > > > I've updated the KIP following our recent discussions on
> the
> > > > > mailing
> > > > > > > > list:
> > > > > > > > >  - split the protocol in two, one for getting the metrics
> > > > > > subscriptions,
> > > > > > > > > and one for pushing the metrics.
> > > > > > > > >  - simplifications: initially only one supported metrics
> > format,
> > > > no
> > > > > > > > > client.id in the instance id, etc.
> > > > > > > > >  - made CLIENT_METRICS subscription configuration entries
> > more
> > > > > > structured
> > > > > > > > >    and allowing better client matching selectors (not only
> > on the
> > > > > > > > instance
> > > > > > > > > id, but also the other
> > > > > > > > >    client resource labels, such as client_software_name,
> > etc.).
> > > > > > > > >
> > > > > > > > > Unless there are further comments I'll call the vote in a
> > day or
> > > > > two.
> > > > > > > > >
> > > > > > > > > Regards,
> > > > > > > > > Magnus
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > Den mån 4 okt. 2021 kl 20:57 skrev Magnus Edenhill <
> > > > > > magnus@edenhill.se>:
> > > > > > > > >
> > > > > > > > > > Hi Gwen,
> > > > > > > > > >
> > > > > > > > > > I'm finishing up the KIP based on the last couple of
> > discussion
> > > > > > points
> > > > > > > > in
> > > > > > > > > > this thread
> > > > > > > > > > and will call the Vote later this week.
> > > > > > > > > >
> > > > > > > > > > Best,
> > > > > > > > > > Magnus
> > > > > > > > > >
> > > > > > > > > > Den lör 2 okt. 2021 kl 02:01 skrev Gwen Shapira
> > > > > > > > > <gwen@confluent.io.invalid
> > > > > > > > > > >:
> > > > > > > > > >
> > > > > > > > > >> Hey,
> > > > > > > > > >>
> > > > > > > > > >> I noticed that there was no discussion for the last 10
> > days,
> > > > > but I
> > > > > > > > > >> couldn't
> > > > > > > > > >> find the vote thread. Is there one that I'm missing?
> > > > > > > > > >>
> > > > > > > > > >> Gwen
> > > > > > > > > >>
> > > > > > > > > >> On Wed, Sep 22, 2021 at 4:58 AM Magnus Edenhill <
> > > > > > magnus@edenhill.se>
> > > > > > > > > >> wrote:
> > > > > > > > > >>
> > > > > > > > > >> > Den tis 21 sep. 2021 kl 06:58 skrev Colin McCabe <
> > > > > > > > cmccabe@apache.org
> > > > > > > > > >:
> > > > > > > > > >> >
> > > > > > > > > >> > > On Mon, Sep 20, 2021, at 17:35, Feng Min wrote:
> > > > > > > > > >> > > > Thanks Magnus & Colin for the discussion.
> > > > > > > > > >> > > >
> > > > > > > > > >> > > > Based on KIP-714's stateless design, Client can
> > pretty
> > > > > much
> > > > > > use
> > > > > > > > > any
> > > > > > > > > >> > > > connection to any broker to send metrics. We are
> not
> > > > > > associating
> > > > > > > > > >> > > connection
> > > > > > > > > >> > > > with client metric state. Is my understanding
> > correct?
> > > > If
> > > > > > yes,
> > > > > > > > > how
> > > > > > > > > >> > about
> > > > > > > > > >> > > > the following two scenarios
> > > > > > > > > >> > > >
> > > > > > > > > >> > > > 1) One Client (Client-ID) registers two different
> > client
> > > > > > > > instance
> > > > > > > > > id
> > > > > > > > > >> > via
> > > > > > > > > >> > > > separate registration. Is it permitted? If OK, how
> > to
> > > > > > > > distinguish
> > > > > > > > > >> them
> > > > > > > > > >> > > from
> > > > > > > > > >> > > > the case 2 below.
> > > > > > > > > >> > > >
> > > > > > > > > >> > >
> > > > > > > > > >> > > Hi Feng,
> > > > > > > > > >> > >
> > > > > > > > > >> > > My understanding, which Magnus can clarify I guess,
> is
> > > > that
> > > > > > you
> > > > > > > > > could
> > > > > > > > > >> > have
> > > > > > > > > >> > > something like two Producer instances running with
> the
> > > > same
> > > > > > > > > client.id
> > > > > > > > > >> > > (perhaps because they're using the same config file,
> > for
> > > > > > example).
> > > > > > > > > >> They
> > > > > > > > > >> > > could even be in the same process. But they would
> get
> > > > > separate
> > > > > > > > > UUIDs.
> > > > > > > > > >> > >
> > > > > > > > > >> > > I believe Magnus used the term client to mean
> > "Producer or
> > > > > > > > > Consumer".
> > > > > > > > > >> So
> > > > > > > > > >> > > if you have both a Producer and a Consumer in your
> > > > > > application I
> > > > > > > > > would
> > > > > > > > > >> > > expect you'd get separate UUIDs for both. Again
> > Magnus can
> > > > > > chime
> > > > > > > > in
> > > > > > > > > >> > here, I
> > > > > > > > > >> > > guess.
> > > > > > > > > >> > >
> > > > > > > > > >> >
> > > > > > > > > >> > That's correct.
> > > > > > > > > >> >
> > > > > > > > > >> >
> > > > > > > > > >> > >
> > > > > > > > > >> > > > 2) How about the client restarting? What's the
> > > > > expectation?
> > > > > > > > Should
> > > > > > > > > >> the
> > > > > > > > > >> > > > server expect the client to carry a persisted
> client
> > > > > > instance id
> > > > > > > > > or
> > > > > > > > > >> > > should
> > > > > > > > > >> > > > the client be treated as a new instance?
> > > > > > > > > >> > >
> > > > > > > > > >> > > The KIP doesn't describe any mechanism for
> > persistence,
> > > > so I
> > > > > > would
> > > > > > > > > >> assume
> > > > > > > > > >> > > that when you restart the client you get a new
> UUID. I
> > > > agree
> > > > > > that
> > > > > > > > it
> > > > > > > > > >> > would
> > > > > > > > > >> > > be good to spell this out.
> > > > > > > > > >> > >
> > > > > > > > > >> > >
> > > > > > > > > >> > Right, it will not be persisted since a client
> instance
> > > > can't
> > > > > be
> > > > > > > > > >> restarted.
> > > > > > > > > >> >
> > > > > > > > > >> > Will update the KIP to make this clearer.
> > > > > > > > > >> >
> > > > > > > > > >> > /Magnus
> > > > > > > > > >> >
> > > > > > > > > >>
> > > > > > > > > >>
> > > > > > > > > >> --
> > > > > > > > > >> Gwen Shapira
> > > > > > > > > >> Engineering Manager | Confluent
> > > > > > > > > >> 650.450.2760 | @gwenshap
> > > > > > > > > >> Follow us: Twitter | blog
> > > > > > > > > >>
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > >
> > > > >
> > > >
> >
>

Re: [DISCUSS] KIP-714: Client metrics and observability

Posted by Magnus Edenhill <ma...@edenhill.se>.

Hi Mickael,

see inline.

On Wed, 10 Nov 2021 at 15:21, Mickael Maison <mickael.maison@gmail.com> wrote:

> Hi Magnus,
>
> I see you've addressed some of the points I raised above but some (4,
> 5) have not been addressed yet.
>

Re 4) How will the user/app know metrics are being sent.

One possibility is to add a JMX metric (thus for user consumption) for the
number of metric pushes the
client has performed, or perhaps the number of metrics subscriptions
currently being collected.
Would that be sufficient?

Re 5) Metric sizes and rates

A worst case scenario for a producer that is producing to 50 unique topics
and emitting all standard metrics yields
a serialized size of around 100KB prior to compression, which compresses
down to about 20-30% of that depending
on compression type and topic name uniqueness.
The numbers for a consumer would be similar.

In practice the number of unique topics would be far smaller, and the
subscription set would typically cover only a subset of the metrics.
So we're probably closer to 1 kB, or less, of compressed data per client per
push interval.

As both the subscription set and push intervals are controlled by the
cluster operator it shouldn't be too hard
to strike a good balance between metrics overhead and granularity.
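
As a rough back-of-the-envelope sketch of the worst-case numbers above (the
class and method names are hypothetical and not part of the KIP, and the
60 second push interval is an assumed value that the operator would choose):

```java
// Hypothetical helper, not part of the KIP or of any Kafka client API.
final class TelemetryOverheadEstimate {

    /** Compressed payload size in bytes, given the fraction of the original size that remains. */
    static long compressedBytes(long uncompressedBytes, double remainingFraction) {
        return Math.round(uncompressedBytes * remainingFraction);
    }

    /** Average bytes per second for one client sending one payload per push interval. */
    static double bytesPerSecond(long payloadBytes, long pushIntervalMs) {
        return payloadBytes * 1000.0 / pushIntervalMs;
    }

    public static void main(String[] args) {
        long worstCaseBytes = 100_000;  // ~100KB serialized: 50 topics, all standard metrics
        long pushIntervalMs = 60_000;   // assumed push interval, set by the operator

        // "Compresses down to about 20-30%" of the serialized size.
        long low = compressedBytes(worstCaseBytes, 0.20);
        long high = compressedBytes(worstCaseBytes, 0.30);

        System.out.println("compressed worst case: " + low + " to " + high + " bytes per push");
        System.out.println("worst-case rate: " + bytesPerSecond(low, pushIntervalMs)
                + " to " + bytesPerSecond(high, pushIntervalMs) + " bytes/s per client");
    }
}
```

The same two functions can be fed the more typical ~1 kB payload to compare
different push intervals or subscription sets.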



>
> I'm really uneasy with this being enabled by default on the client
> side. When collecting data, I think the best practice is to ensure
> users are explicitly enabling it.
>

Requiring metrics to be explicitly enabled on clients severely cripples the
feature's usability and value.

One of the problems that this KIP aims to solve is making useful metrics
available on demand, regardless of the technical expertise of the user.
As Ryanne points out, a savvy user/organization will typically have metrics
collection and monitoring in place already, and the benefit of this KIP for
them is mainly a common set and format of metrics across client
implementations and languages.
But in my experience that is not the typical Kafka user: they're not Kafka
experts and they don't know how to best instrument their clients.
Having metrics enabled by default for this user base allows the Kafka
operators to proactively and reactively monitor and troubleshoot client
issues, without the less savvy user having to do anything.
It is often too late to tell a user to enable metrics when the problem has
already occurred.

Now, to be clear, even though metrics are enabled by default on clients,
they are not enabled by default on the brokers; the Kafka operator needs to
build and set up a metrics plugin and add metrics subscriptions before
anything is sent from the client.
It is opt-out on the clients and opt-in on the broker.
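
To spell that gating out, here is a minimal sketch of the conditions as I
read them from the paragraph above. The class, method, and parameter names
are made up for illustration and do not correspond to any API in the KIP:

```java
// Hypothetical model of the gating described above; names are illustrative only.
final class TelemetryGate {

    /** Returns true only when every precondition for pushing metrics holds. */
    static boolean metricsAreSent(boolean clientOptedOut,
                                  boolean brokerPluginConfigured,
                                  int matchingSubscriptions) {
        return !clientOptedOut && brokerPluginConfigured && matchingSubscriptions > 0;
    }

    public static void main(String[] args) {
        // Default client against a default broker: nothing is sent.
        System.out.println(metricsAreSent(false, false, 0));
        // Operator installs a plugin and adds one matching subscription: metrics flow.
        System.out.println(metricsAreSent(false, true, 1));
        // The user opts out on the client: nothing is sent, whatever the broker does.
        System.out.println(metricsAreSent(true, true, 1));
    }
}
```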




> You mentioned brokers already have
> some(most?) of the information contained in metrics, if so then why
> are we collecting it again? Surely there must be some new information
> in the client metrics.
>

From the user's perspective the Kafka infrastructure extends from
producer.send() to messages being returned from consumer.poll(), a giant
black box with a lot going on between those two points. The brokers
currently only see what happens once those requests and messages hit the
broker, but as Kafka clients are complex pieces of machinery there's a
myriad of queues, timers, and state that's critical to the operation and
infrastructure but is not currently visible to the operator.
Relying on the user to provide this missing information accurately and in
a timely manner is not generally feasible.


Most of the standard metrics listed in the KIP are data points that the
broker does not have.
Only a small number of metrics are duplicates (like the request counts and
sizes), but they are included
to ease correlation when inspecting these client metrics.



> Moreover this is a brand new feature so it's even harder to justify
> enabling it and forcing onto all our users. If disabled by default,
> it's relatively easy to enable in a new release if we decide to, but
> once enabled by default it's much harder to disable. Also this feature
> will apply to all future metrics we will add.
>

I think maturity of a feature implementation should be the deciding factor,
rather than the design of it (which this KIP is). I.e., if the
implementation is not deemed mature enough for release X.Y it will be
disabled.



> Overall I think it's an interesting feature but I'd prefer to be
> slightly defensive and see how it works in practice before enabling it
> everywhere.
>

Right, and I agree on being defensive, but since this feature still
requires manual enabling on the brokers before actually being used, I think
that gives enough control to opt in to or out of this feature as needed.

Thanks for your comments!

Regards,
Magnus



> Thanks,
> Mickael
>
> On Mon, Nov 8, 2021 at 7:55 AM Magnus Edenhill <ma...@edenhill.se> wrote:
> >
> > Thanks David for pointing this out,
> > I've updated the KIP to include client_id as a matching selector.
> >
> > Regards,
> > Magnus
> >
> > Den tors 4 nov. 2021 kl 18:01 skrev David Mao <dmao@confluent.io.invalid
> >:
> >
> > > Hey Magnus,
> > >
> > > I noticed that the KIP outlines the initial selectors supported as:
> > >
> > >    - client_instance_id - CLIENT_INSTANCE_ID UUID string
> representation.
> > >    - client_software_name  - client software implementation name.
> > >    - client_software_version  - client software implementation version.
> > >
> > > In the given reactive monitoring workflow, we mention that the
> application
> > > user does not know their client's client instance ID, but it's outlined
> > > that the operator can add a metrics subscription selecting for
> clientId. I
> > > don't see clientId as one of the supported selectors.
> > > I can see how this would have made sense in a previous iteration given
> that
> > > the previous client instance ID proposal was to construct the client
> > > instance ID using clientId as a prefix. Now that the client instance
> ID is
> > > a UUID, would we want to add clientId as a supported selector?
> > > Let me know what you think.
> > >
> > > David
> > >
> > > On Tue, Oct 19, 2021 at 12:33 PM Magnus Edenhill <ma...@edenhill.se>
> > > wrote:
> > >
> > > > Hi Mickael!
> > > >
> > > > Den tis 19 okt. 2021 kl 19:30 skrev Mickael Maison <
> > > > mickael.maison@gmail.com
> > > > >:
> > > >
> > > > > Hi Magnus,
> > > > >
> > > > > Thanks for the proposal.
> > > > >
> > > > > 1. Looking at the protocol section, isn't "ClientInstanceId"
> expected
> > > > > to be a field in GetTelemetrySubscriptionsResponseV0? Otherwise,
> how
> > > > > does a client retrieve this value?
> > > > >
> > > >
> > > > Good catch, it got removed by mistake in one of the edits.
> > > >
> > > >
> > > > >
> > > > > 2. In the client API section, you mention a new method
> > > > > "clientInstanceId()". Can you clarify which interfaces are
> affected?
> > > > > Is it only Consumer and Producer?
> > > > >
> > > >
> > > > And Admin. Will update the KIP.
> > > >
> > > >
> > > >
> > > > > 3. I'm a bit concerned this is enabled by default. Even if the data
> > > > > collected is supposed to be not sensitive, I think this can be
> > > > > problematic in some environments. Also users don't seem to have the
> > > > > choice to only expose some metrics. Knowing how much data transit
> > > > > through some applications can be considered critical.
> > > > >
> > > >
> > > > The broker already knows how much data transits through the client
> > > though,
> > > > right?
> > > > Care has been taken not to expose information in the standard metrics
> > > that
> > > > might
> > > > reveal sensitive information.
> > > >
> > > > Do you have an example of how the proposed metrics could leak
> sensitive
> > > > information?
> > > > As for limiting the what metrics to export; I guess that could make
> sense
> > > > in some
> > > > very sensitive use-cases, but those users might disable metrics
> > > altogether
> > > > for now.
> > > > Could these concerns be addressed by a later KIP?
> > > >
> > > >
> > > >
> > > > >
> > > > > 4. As a user, how do you know if your application is actively
> sending
> > > > > metrics? Are there new metrics exposing what's going on, like how
> much
> > > > > data is being sent?
> > > > >
> > > >
> > > > That's a good question.
> > > > Since the proposed metrics interface is not aimed at, or directly
> > > available
> > > > to, the application
> > > > I guess there's little point of adding it here, but instead adding
> > > > something to the
> > > > existing JMX metrics?
> > > > Do you have any suggestions?
> > > >
> > > >
> > > >
> > > > > 5. If all metrics are enabled on a regular Consumer or Producer, do
> > > > > you have an idea how much throughput this would use?
> > > > >
> > > >
> > > > It depends on the number of partition/topics/etc the client is
> producing
> > > > to/consuming from.
> > > > I'll add some sizes to the KIP for some typical use-cases.
> > > >
> > > >
> > > > Thanks,
> > > > Magnus
> > > >
> > > >
> > > > > Thanks
> > > > >
> > > > > On Tue, Oct 19, 2021 at 5:06 PM Magnus Edenhill <
> magnus@edenhill.se>
> > > > > wrote:
> > > > > >
> > > > > > Den tis 19 okt. 2021 kl 13:22 skrev Tom Bentley <
> tbentley@redhat.com
> > > >:
> > > > > >
> > > > > > > Hi Magnus,
> > > > > > >
> > > > > > > I reviewed the KIP since you called the vote (sorry for not
> > > reviewing
> > > > > when
> > > > > > > you announced your intention to call the vote). I have a few
> > > > questions
> > > > > on
> > > > > > > some of the details.
> > > > > > >
> > > > > > > 1. There's no Javadoc on ClientTelemetryPayload.data(), so I
> don't
> > > > know
> > > > > > > whether the payload is exposed through this method as
> compressed or
> > > > > not.
> > > > > > > Later on you say "Decompression of the payloads will be
> handled by
> > > > the
> > > > > > > broker metrics plugin, the broker should expose a suitable
> > > > > decompression
> > > > > > > API to the metrics plugin for this purpose.", which suggests
> it's
> > > the
> > > > > > > compressed data in the buffer, but then we don't know which
> codec
> > > was
> > > > > used,
> > > > > > > nor the API via which the plugin should decompress it if
> required
> > > for
> > > > > > > forwarding to the ultimate metrics store. Should the
> > > > > ClientTelemetryPayload
> > > > > > > expose a method to get the compression and a decompressor?
> > > > > > >
> > > > > >
> > > > > > Good point, updated.
> > > > > >
> > > > > >
> > > > > >
> > > > > > > 2. The client-side API is expressed as StringOrError
> > > > > > > ClientInstance::ClientInstanceId(int timeout_ms). I understand
> that
> > > > > you're
> > > > > > > thinking about the librdkafka implementation, but it would be
> good
> > > to
> > > > > show
> > > > > > > the API as it would appear on the Apache Kafka clients.
> > > > > > >
> > > > > >
> > > > > > This was meant as pseudo-code, but I changed it to Java.
> > > > > >
> > > > > >
> > > > > > > 3. "PushTelemetryRequest|Response - protocol request used by
> the
> > > > > client to
> > > > > > > send metrics to any broker it is connected to." To be clear,
> this
> > > > means
> > > > > > > that the client can choose any of the connected brokers and
> push to
> > > > > just
> > > > > > > one of them? What should a supporting client do if it gets an
> error
> > > > > when
> > > > > > > pushing metrics to a broker, retry sending to the same broker
> or
> > > try
> > > > > > > pushing to another broker, or drop the metrics? Should
> supporting
> > > > > clients
> > > > > > > send successive requests to a single broker, or round robin,
> or is
> > > > > that up
> > > > > > > to the client author? I'm guessing the behaviour should be
> sticky
> > > to
> > > > > > > support the rate limiting features, but I think it would be
> good
> > > for
> > > > > client
> > > > > > > authors if this section were explicit on the recommended
> behaviour.
> > > > > > >
> > > > > >
> > > > > > You are right, I've updated the KIP to make this clearer.
> > > > > >
> > > > > >
> > > > > > > 4. "Mapping the client instance id to an actual application
> > > instance
> > > > > > > running on a (virtual) machine can be done by inspecting the
> > > metrics
> > > > > > > resource labels, such as the client source address and source
> port,
> > > > or
> > > > > > > security principal, all of which are added by the receiving
> broker.
> > > > > This
> > > > > > > will allow the operator together with the user to identify the
> > > actual
> > > > > > > application instance." Is this really always true? The source
> IP
> > > and
> > > > > port
> > > > > > > might be a loadbalancer/proxy in some setups. The principal, as
> > > > already
> > > > > > > mentioned in the KIP, might be shared between multiple
> > > applications.
> > > > > So at
> > > > > > > worst the organization running the clients might have to
> consult
> > > the
> > > > > logs
> > > > > > > of a set of client applications, right?
> > > > > > >
> > > > > >
> > > > > > Yes, that's correct. There's no guaranteed mapping from
> > > > > client_instance_id
> > > > > > to
> > > > > > an actual instance, that's why the KIP recommends client
> > > > implementations
> > > > > to
> > > > > > log the client instance id
> > > > > > upon retrieval, and also provide an API for the application to
> > > retrieve
> > > > > the
> > > > > > instance id programmatically
> > > > > > if it has a better way of exposing it.
> > > > > >
> > > > > >
> > > > > > 5. "Tests indicate that a compression ratio up to 10x is
> possible for
> > > > the
> > > > > > > standard metrics." Client authors might appreciate your
> mentioning
> > > > > which
> > > > > > > compression codec got these results.
> > > > > > >
> > > > > >
> > > > > > Good point. Updated.
> > > > > >
> > > > > >
> > > > > > > 6. "Should the client send a push request prior to expiry of
> the
> > > > > previously
> > > > > > > calculated PushIntervalMs the broker will discard the metrics
> and
> > > > > return a
> > > > > > > PushTelemetryResponse with the ErrorCode set to RateLimited."
> Is
> > > this
> > > > > > > RATE_LIMITED a new error code? It's not mentioned in the "New
> Error
> > > > > Codes"
> > > > > > > section.
> > > > > > >
> > > > > >
> > > > > > That's a leftover, it should be using the standard ThrottleTime
> > > > > mechanism.
> > > > > > Fixed.
> > > > > >
> > > > > >
> > > > > > > 7. In the section "Standard client resource labels"
> application_id
> > > is
> > > > > > > described as Kafka Streams only, but the section of "Client
> > > > > Identification"
> > > > > > > talks about "application instance id as an optional future
> > > > nice-to-have
> > > > > > > that may be included as a metrics label if it has been set by
> the
> > > > > user", so
> > > > > > > I'm confused whether non-Kafka Streams clients should set an
> > > > > application_id
> > > > > > > or not.
> > > > > > >
> > > > > >
> > > > > > I'll clarify this in the KIP, but basically we would need to add
> an `
> > > > > > application.id` config
> > > > > > property for non-streams clients for this purpose, and that's
> outside
> > > > the
> > > > > > scope of this KIP since we want to make it zero-conf:ish on the
> > > client
> > > > > side.
> > > > > >
> > > > > >
> > > > > > >
> > > > > > > Kind regards,
> > > > > > >
> > > > > > > Tom
> > > > > > >
> > > > > >
> > > > > > Thanks for the review,
> > > > > > Magnus
> > > > > >
> > > > > >
> > > > > >
> > > > > > >
> > > > > > > On Thu, Oct 7, 2021 at 5:26 PM Magnus Edenhill <
> magnus@edenhill.se
> > > >
> > > > > wrote:
> > > > > > >
> > > > > > > > Hi all,
> > > > > > > >
> > > > > > > > I've updated the KIP following our recent discussions on the
> > > > mailing
> > > > > > > list:
> > > > > > > >  - split the protocol in two, one for getting the metrics
> > > > > subscriptions,
> > > > > > > > and one for pushing the metrics.
> > > > > > > >  - simplifications: initially only one supported metrics
> format,
> > > no
> > > > > > > > client.id in the instance id, etc.
> > > > > > > >  - made CLIENT_METRICS subscription configuration entries
> more
> > > > > structured
> > > > > > > >    and allowing better client matching selectors (not only
> on the
> > > > > > > instance
> > > > > > > > id, but also the other
> > > > > > > >    client resource labels, such as client_software_name,
> etc.).
> > > > > > > >
> > > > > > > > Unless there are further comments I'll call the vote in a
> day or
> > > > two.
> > > > > > > >
> > > > > > > > Regards,
> > > > > > > > Magnus
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > > On Mon, Oct 4, 2021 at 20:57 Magnus Edenhill <magnus@edenhill.se> wrote:
> > > > > > > >
> > > > > > > > > Hi Gwen,
> > > > > > > > >
> > > > > > > > > I'm finishing up the KIP based on the last couple of
> discussion
> > > > > points
> > > > > > > in
> > > > > > > > > this thread
> > > > > > > > > and will call the Vote later this week.
> > > > > > > > >
> > > > > > > > > Best,
> > > > > > > > > Magnus
> > > > > > > > >
> > > > > > > > > > On Sat, Oct 2, 2021 at 02:01 Gwen Shapira <gwen@confluent.io.invalid> wrote:
> > > > > > > > >
> > > > > > > > >> Hey,
> > > > > > > > >>
> > > > > > > > >> I noticed that there was no discussion for the last 10
> days,
> > > > but I
> > > > > > > > >> couldn't
> > > > > > > > >> find the vote thread. Is there one that I'm missing?
> > > > > > > > >>
> > > > > > > > >> Gwen
> > > > > > > > >>
> > > > > > > > >> On Wed, Sep 22, 2021 at 4:58 AM Magnus Edenhill <
> > > > > magnus@edenhill.se>
> > > > > > > > >> wrote:
> > > > > > > > >>
> > > > > > > > > >> > On Tue, Sep 21, 2021 at 06:58 Colin McCabe <cmccabe@apache.org> wrote:
> > > > > > > > >> >
> > > > > > > > >> > > On Mon, Sep 20, 2021, at 17:35, Feng Min wrote:
> > > > > > > > >> > > > Thanks Magnus & Colin for the discussion.
> > > > > > > > >> > > >
> > > > > > > > >> > > > Based on KIP-714's stateless design, Client can
> pretty
> > > > much
> > > > > use
> > > > > > > > any
> > > > > > > > >> > > > connection to any broker to send metrics. We are not
> > > > > associating
> > > > > > > > >> > > connection
> > > > > > > > >> > > > with client metric state. Is my understanding
> correct?
> > > If
> > > > > yes,
> > > > > > > > how
> > > > > > > > >> > about
> > > > > > > > >> > > > the following two scenarios
> > > > > > > > >> > > >
> > > > > > > > >> > > > 1) One Client (Client-ID) registers two different
> client
> > > > > > > instance
> > > > > > > > id
> > > > > > > > >> > via
> > > > > > > > >> > > > separate registration. Is it permitted? If OK, how
> to
> > > > > > > distinguish
> > > > > > > > >> them
> > > > > > > > >> > > from
> > > > > > > > >> > > > the case 2 below.
> > > > > > > > >> > > >
> > > > > > > > >> > >
> > > > > > > > >> > > Hi Feng,
> > > > > > > > >> > >
> > > > > > > > >> > > My understanding, which Magnus can clarify I guess, is
> > > that
> > > > > you
> > > > > > > > could
> > > > > > > > >> > have
> > > > > > > > >> > > something like two Producer instances running with the
> > > same
> > > > > > > > client.id
> > > > > > > > >> > > (perhaps because they're using the same config file,
> for
> > > > > example).
> > > > > > > > >> They
> > > > > > > > >> > > could even be in the same process. But they would get
> > > > separate
> > > > > > > > UUIDs.
> > > > > > > > >> > >
> > > > > > > > >> > > I believe Magnus used the term client to mean
> "Producer or
> > > > > > > > Consumer".
> > > > > > > > >> So
> > > > > > > > >> > > if you have both a Producer and a Consumer in your
> > > > > application I
> > > > > > > > would
> > > > > > > > >> > > expect you'd get separate UUIDs for both. Again
> Magnus can
> > > > > chime
> > > > > > > in
> > > > > > > > >> > here, I
> > > > > > > > >> > > guess.
> > > > > > > > >> > >
> > > > > > > > >> >
> > > > > > > > >> > That's correct.
> > > > > > > > >> >
> > > > > > > > >> >
> > > > > > > > >> > >
> > > > > > > > >> > > > 2) How about the client restarting? What's the
> > > > expectation?
> > > > > > > Should
> > > > > > > > >> the
> > > > > > > > >> > > > server expect the client to carry a persisted client
> > > > > instance id
> > > > > > > > or
> > > > > > > > >> > > should
> > > > > > > > >> > > > the client be treated as a new instance?
> > > > > > > > >> > >
> > > > > > > > >> > > The KIP doesn't describe any mechanism for
> persistence,
> > > so I
> > > > > would
> > > > > > > > >> assume
> > > > > > > > >> > > that when you restart the client you get a new UUID. I
> > > agree
> > > > > that
> > > > > > > it
> > > > > > > > >> > would
> > > > > > > > >> > > be good to spell this out.
> > > > > > > > >> > >
> > > > > > > > >> > >
> > > > > > > > >> > Right, it will not be persisted since a client instance
> > > can't
> > > > be
> > > > > > > > >> restarted.
> > > > > > > > >> >
> > > > > > > > >> > Will update the KIP to make this clearer.
> > > > > > > > >> >
> > > > > > > > >> > /Magnus
> > > > > > > > >> >
> > > > > > > > >>
> > > > > > > > >>
> > > > > > > > >> --
> > > > > > > > >> Gwen Shapira
> > > > > > > > >> Engineering Manager | Confluent
> > > > > > > > >> 650.450.2760 | @gwenshap
> > > > > > > > >> Follow us: Twitter | blog
> > > > > > > > >>
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > >
> > > >
> > >
>

Re: [DISCUSS] KIP-714: Client metrics and observability

Posted by Mickael Maison <mi...@gmail.com>.

Hi Magnus,

I see you've addressed some of the points I raised above but some (4,
5) have not been addressed yet.

I'm really uneasy with this being enabled by default on the client
side. When collecting data, I think the best practice is to ensure
users are explicitly enabling it. You mentioned brokers already have
some (most?) of the information contained in metrics; if so, then why
are we collecting it again? Surely there must be some new information
in the client metrics.

Moreover this is a brand new feature so it's even harder to justify
enabling it and forcing it onto all our users. If disabled by default,
it's relatively easy to enable in a new release if we decide to, but
once enabled by default it's much harder to disable. Also this feature
will apply to all future metrics we will add.

Overall I think it's an interesting feature but I'd prefer to be
slightly defensive and see how it works in practice before enabling it
everywhere.
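
To make the trade-off concrete, here is a minimal sketch of how the chosen
default decides the behaviour of clients that never set the option. The
property name `enable.metrics.push` and the helper class are hypothetical,
used only for illustration; neither is taken from the KIP text discussed in
this thread.

```java
import java.util.Properties;

final class MetricsPushDefaultSketch {
    // Hypothetical property name, assumed for this sketch only.
    static final String ENABLE_PUSH = "enable.metrics.push";

    // Resolve whether the client pushes metrics: an explicit setting wins,
    // otherwise the library-wide default applies.
    static boolean pushEnabled(Properties clientConfig, boolean libraryDefault) {
        String raw = clientConfig.getProperty(ENABLE_PUSH);
        return raw == null ? libraryDefault : Boolean.parseBoolean(raw.trim());
    }

    public static void main(String[] args) {
        Properties untouched = new Properties();    // user never set the option
        Properties optedIn = new Properties();
        optedIn.setProperty(ENABLE_PUSH, "true");   // user explicitly enabled it

        // Opt-in default: an unconfigured client does not push.
        System.out.println(pushEnabled(untouched, false));
        // Opt-out default: the same unconfigured client does push.
        System.out.println(pushEnabled(untouched, true));
        // An explicit setting overrides either default.
        System.out.println(pushEnabled(optedIn, false));
    }
}
```

With an opt-in default, an unconfigured client stays silent until the user
asks for it; with an opt-out default, every upgraded client starts pushing.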

Thanks,
Mickael

On Mon, Nov 8, 2021 at 7:55 AM Magnus Edenhill <ma...@edenhill.se> wrote:
>
> Thanks David for pointing this out,
> I've updated the KIP to include client_id as a matching selector.
>
> Regards,
> Magnus
>
> On Thu, Nov 4, 2021 at 18:01 David Mao <dm...@confluent.io.invalid> wrote:
>
> > Hey Magnus,
> >
> > I noticed that the KIP outlines the initial selectors supported as:
> >
> >    - client_instance_id - CLIENT_INSTANCE_ID UUID string representation.
> >    - client_software_name  - client software implementation name.
> >    - client_software_version  - client software implementation version.
> >
> > In the given reactive monitoring workflow, we mention that the application
> > user does not know their client's client instance ID, but it's outlined
> > that the operator can add a metrics subscription selecting for clientId. I
> > don't see clientId as one of the supported selectors.
> > I can see how this would have made sense in a previous iteration given that
> > the previous client instance ID proposal was to construct the client
> > instance ID using clientId as a prefix. Now that the client instance ID is
> > a UUID, would we want to add clientId as a supported selector?
> > Let me know what you think.
> >
> > David
> >
> > On Tue, Oct 19, 2021 at 12:33 PM Magnus Edenhill <ma...@edenhill.se>
> > wrote:
> >
> > > Hi Mickael!
> > >
> > > On Tue, Oct 19, 2021 at 19:30 Mickael Maison <mickael.maison@gmail.com> wrote:
> > >
> > > > Hi Magnus,
> > > >
> > > > Thanks for the proposal.
> > > >
> > > > 1. Looking at the protocol section, isn't "ClientInstanceId" expected
> > > > to be a field in GetTelemetrySubscriptionsResponseV0? Otherwise, how
> > > > does a client retrieve this value?
> > > >
> > >
> > > Good catch, it got removed by mistake in one of the edits.
> > >
> > >
> > > >
> > > > 2. In the client API section, you mention a new method
> > > > "clientInstanceId()". Can you clarify which interfaces are affected?
> > > > Is it only Consumer and Producer?
> > > >
> > >
> > > And Admin. Will update the KIP.
> > >
> > >
> > >
> > > > 3. I'm a bit concerned this is enabled by default. Even if the data
> > > > collected is supposed to be not sensitive, I think this can be
> > > > problematic in some environments. Also users don't seem to have the
> > > > choice to only expose some metrics. Knowing how much data transits
> > > > through some applications can be considered critical.
> > > >
> > >
> > > The broker already knows how much data transits through the client
> > though,
> > > right?
> > > Care has been taken not to expose information in the standard metrics
> > that
> > > might
> > > reveal sensitive information.
> > >
> > > Do you have an example of how the proposed metrics could leak sensitive
> > > information?
> > > As for limiting which metrics to export, I guess that could make sense
> > > in some
> > > very sensitive use-cases, but those users might disable metrics
> > altogether
> > > for now.
> > > Could these concerns be addressed by a later KIP?
> > >
> > >
> > >
> > > >
> > > > 4. As a user, how do you know if your application is actively sending
> > > > metrics? Are there new metrics exposing what's going on, like how much
> > > > data is being sent?
> > > >
> > >
> > > That's a good question.
> > > Since the proposed metrics interface is not aimed at, or directly
> > available
> > > to, the application
> > > I guess there's little point of adding it here, but instead adding
> > > something to the
> > > existing JMX metrics?
> > > Do you have any suggestions?
> > >
> > >
> > >
> > > > 5. If all metrics are enabled on a regular Consumer or Producer, do
> > > > you have an idea how much throughput this would use?
> > > >
> > >
> > > It depends on the number of partitions/topics/etc. the client is producing
> > > to/consuming from.
> > > I'll add some sizes to the KIP for some typical use-cases.
> > >
> > >
> > > Thanks,
> > > Magnus
> > >
> > >
> > > > Thanks
> > > >
> > > > On Tue, Oct 19, 2021 at 5:06 PM Magnus Edenhill <ma...@edenhill.se>
> > > > wrote:
> > > > >
> > > > > On Tue, Oct 19, 2021 at 13:22 Tom Bentley <tbentley@redhat.com> wrote:
> > > > >
> > > > > > Hi Magnus,
> > > > > >
> > > > > > I reviewed the KIP since you called the vote (sorry for not
> > reviewing
> > > > when
> > > > > > you announced your intention to call the vote). I have a few
> > > questions
> > > > on
> > > > > > some of the details.
> > > > > >
> > > > > > 1. There's no Javadoc on ClientTelemetryPayload.data(), so I don't
> > > know
> > > > > > whether the payload is exposed through this method as compressed or
> > > > not.
> > > > > > Later on you say "Decompression of the payloads will be handled by
> > > the
> > > > > > broker metrics plugin, the broker should expose a suitable
> > > > decompression
> > > > > > API to the metrics plugin for this purpose.", which suggests it's
> > the
> > > > > > compressed data in the buffer, but then we don't know which codec
> > was
> > > > used,
> > > > > > nor the API via which the plugin should decompress it if required
> > for
> > > > > > forwarding to the ultimate metrics store. Should the
> > > > ClientTelemetryPayload
> > > > > > expose a method to get the compression and a decompressor?
> > > > > >
> > > > >
> > > > > Good point, updated.
> > > > >
> > > > >
> > > > >
> > > > > > 2. The client-side API is expressed as StringOrError
> > > > > > ClientInstance::ClientInstanceId(int timeout_ms). I understand that
> > > > you're
> > > > > > thinking about the librdkafka implementation, but it would be good
> > to
> > > > show
> > > > > > the API as it would appear on the Apache Kafka clients.
> > > > > >
> > > > >
> > > > > This was meant as pseudo-code, but I changed it to Java.
> > > > >
> > > > >
> > > > > > 3. "PushTelemetryRequest|Response - protocol request used by the
> > > > client to
> > > > > > send metrics to any broker it is connected to." To be clear, this
> > > means
> > > > > > that the client can choose any of the connected brokers and push to
> > > > just
> > > > > > one of them? What should a supporting client do if it gets an error
> > > > when
> > > > > > pushing metrics to a broker, retry sending to the same broker or
> > try
> > > > > > pushing to another broker, or drop the metrics? Should supporting
> > > > clients
> > > > > > send successive requests to a single broker, or round robin, or is
> > > > that up
> > > > > > to the client author? I'm guessing the behaviour should be sticky
> > to
> > > > > > support the rate limiting features, but I think it would be good
> > for
> > > > client
> > > > > > authors if this section were explicit on the recommended behaviour.
> > > > > >
> > > > >
> > > > > You are right, I've updated the KIP to make this clearer.
> > > > >
> > > > >
> > > > > > 4. "Mapping the client instance id to an actual application
> > instance
> > > > > > running on a (virtual) machine can be done by inspecting the
> > metrics
> > > > > > resource labels, such as the client source address and source port,
> > > or
> > > > > > security principal, all of which are added by the receiving broker.
> > > > This
> > > > > > will allow the operator together with the user to identify the
> > actual
> > > > > > application instance." Is this really always true? The source IP
> > and
> > > > port
> > > > > > might be a loadbalancer/proxy in some setups. The principal, as
> > > already
> > > > > > mentioned in the KIP, might be shared between multiple
> > applications.
> > > > So at
> > > > > > worst the organization running the clients might have to consult
> > the
> > > > logs
> > > > > > of a set of client applications, right?
> > > > > >
> > > > >
> > > > > Yes, that's correct. There's no guaranteed mapping from
> > > > client_instance_id
> > > > > to
> > > > > an actual instance, that's why the KIP recommends client
> > > implementations
> > > > to
> > > > > log the client instance id
> > > > > upon retrieval, and also provide an API for the application to
> > retrieve
> > > > the
> > > > > instance id programmatically
> > > > > if it has a better way of exposing it.
> > > > >
> > > > >
> > > > > 5. "Tests indicate that a compression ratio up to 10x is possible for
> > > the
> > > > > > standard metrics." Client authors might appreciate your mentioning
> > > > which
> > > > > > compression codec got these results.
> > > > > >
> > > > >
> > > > > Good point. Updated.
> > > > >
> > > > >
> > > > > > 6. "Should the client send a push request prior to expiry of the
> > > > previously
> > > > > > calculated PushIntervalMs the broker will discard the metrics and
> > > > return a
> > > > > > PushTelemetryResponse with the ErrorCode set to RateLimited." Is
> > this
> > > > > > RATE_LIMITED a new error code? It's not mentioned in the "New Error
> > > > Codes"
> > > > > > section.
> > > > > >
> > > > >
> > > > > That's a leftover, it should be using the standard ThrottleTime
> > > > mechanism.
> > > > > Fixed.
> > > > >
> > > > >
> > > > > > 7. In the section "Standard client resource labels" application_id
> > is
> > > > > > described as Kafka Streams only, but the section of "Client
> > > > Identification"
> > > > > > talks about "application instance id as an optional future
> > > nice-to-have
> > > > > > that may be included as a metrics label if it has been set by the
> > > > user", so
> > > > > > I'm confused whether non-Kafka Streams clients should set an
> > > > application_id
> > > > > > or not.
> > > > > >
> > > > >
> > > > > > I'll clarify this in the KIP, but basically we would need to add an
> > > > > > `application.id` config
> > > > > property for non-streams clients for this purpose, and that's outside
> > > the
> > > > > scope of this KIP since we want to make it zero-conf:ish on the
> > client
> > > > side.
> > > > >
> > > > >
> > > > > >
> > > > > > Kind regards,
> > > > > >
> > > > > > Tom
> > > > > >
> > > > >
> > > > > Thanks for the review,
> > > > > Magnus
> > > > >
> > > > >
> > > > >
> > > > > >
> > > > > > On Thu, Oct 7, 2021 at 5:26 PM Magnus Edenhill <magnus@edenhill.se
> > >
> > > > wrote:
> > > > > >
> > > > > > > Hi all,
> > > > > > >
> > > > > > > I've updated the KIP following our recent discussions on the
> > > mailing
> > > > > > list:
> > > > > > >  - split the protocol in two, one for getting the metrics
> > > > subscriptions,
> > > > > > > and one for pushing the metrics.
> > > > > > >  - simplifications: initially only one supported metrics format,
> > no
> > > > > > > client.id in the instance id, etc.
> > > > > > >  - made CLIENT_METRICS subscription configuration entries more
> > > > structured
> > > > > > >    and allowing better client matching selectors (not only on the
> > > > > > instance
> > > > > > > id, but also the other
> > > > > > >    client resource labels, such as client_software_name, etc.).
> > > > > > >
> > > > > > > Unless there are further comments I'll call the vote in a day or
> > > two.
> > > > > > >
> > > > > > > Regards,
> > > > > > > Magnus
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > On Mon, Oct 4, 2021 at 20:57 Magnus Edenhill <magnus@edenhill.se> wrote:
> > > > > > >
> > > > > > > > Hi Gwen,
> > > > > > > >
> > > > > > > > I'm finishing up the KIP based on the last couple of discussion
> > > > points
> > > > > > in
> > > > > > > > this thread
> > > > > > > > and will call the Vote later this week.
> > > > > > > >
> > > > > > > > Best,
> > > > > > > > Magnus
> > > > > > > >
> > > > > > > > On Sat, Oct 2, 2021 at 02:01 Gwen Shapira <gwen@confluent.io.invalid> wrote:
> > > > > > > >
> > > > > > > >> Hey,
> > > > > > > >>
> > > > > > > >> I noticed that there was no discussion for the last 10 days,
> > > but I
> > > > > > > >> couldn't
> > > > > > > >> find the vote thread. Is there one that I'm missing?
> > > > > > > >>
> > > > > > > >> Gwen
> > > > > > > >>
> > > > > > > >> On Wed, Sep 22, 2021 at 4:58 AM Magnus Edenhill <
> > > > magnus@edenhill.se>
> > > > > > > >> wrote:
> > > > > > > >>
> > > > > > > >> > On Tue, Sep 21, 2021 at 06:58 Colin McCabe <cmccabe@apache.org> wrote:
> > > > > > > >> >
> > > > > > > >> > > On Mon, Sep 20, 2021, at 17:35, Feng Min wrote:
> > > > > > > >> > > > Thanks Magnus & Colin for the discussion.
> > > > > > > >> > > >
> > > > > > > >> > > > Based on KIP-714's stateless design, Client can pretty
> > > much
> > > > use
> > > > > > > any
> > > > > > > >> > > > connection to any broker to send metrics. We are not
> > > > associating
> > > > > > > >> > > connection
> > > > > > > >> > > > with client metric state. Is my understanding correct?
> > If
> > > > yes,
> > > > > > > how
> > > > > > > >> > about
> > > > > > > >> > > > the following two scenarios
> > > > > > > >> > > >
> > > > > > > >> > > > 1) One Client (Client-ID) registers two different client
> > > > > > instance
> > > > > > > id
> > > > > > > >> > via
> > > > > > > >> > > > separate registration. Is it permitted? If OK, how to
> > > > > > distinguish
> > > > > > > >> them
> > > > > > > >> > > from
> > > > > > > >> > > > the case 2 below.
> > > > > > > >> > > >
> > > > > > > >> > >
> > > > > > > >> > > Hi Feng,
> > > > > > > >> > >
> > > > > > > >> > > My understanding, which Magnus can clarify I guess, is
> > that
> > > > you
> > > > > > > could
> > > > > > > >> > have
> > > > > > > >> > > something like two Producer instances running with the
> > same
> > > > > > > client.id
> > > > > > > >> > > (perhaps because they're using the same config file, for
> > > > example).
> > > > > > > >> They
> > > > > > > >> > > could even be in the same process. But they would get
> > > separate
> > > > > > > UUIDs.
> > > > > > > >> > >
> > > > > > > >> > > I believe Magnus used the term client to mean "Producer or
> > > > > > > Consumer".
> > > > > > > >> So
> > > > > > > >> > > if you have both a Producer and a Consumer in your
> > > > application I
> > > > > > > would
> > > > > > > >> > > expect you'd get separate UUIDs for both. Again Magnus can
> > > > chime
> > > > > > in
> > > > > > > >> > here, I
> > > > > > > >> > > guess.
> > > > > > > >> > >
> > > > > > > >> >
> > > > > > > >> > That's correct.
> > > > > > > >> >
> > > > > > > >> >
> > > > > > > >> > >
> > > > > > > >> > > > 2) How about the client restarting? What's the
> > > expectation?
> > > > > > Should
> > > > > > > >> the
> > > > > > > >> > > > server expect the client to carry a persisted client
> > > > instance id
> > > > > > > or
> > > > > > > >> > > should
> > > > > > > >> > > > the client be treated as a new instance?
> > > > > > > >> > >
> > > > > > > >> > > The KIP doesn't describe any mechanism for persistence,
> > so I
> > > > would
> > > > > > > >> assume
> > > > > > > >> > > that when you restart the client you get a new UUID. I
> > agree
> > > > that
> > > > > > it
> > > > > > > >> > would
> > > > > > > >> > > be good to spell this out.
> > > > > > > >> > >
> > > > > > > >> > >
> > > > > > > >> > Right, it will not be persisted since a client instance
> > can't
> > > be
> > > > > > > >> restarted.
> > > > > > > >> >
> > > > > > > >> > Will update the KIP to make this clearer.
> > > > > > > >> >
> > > > > > > >> > /Magnus
> > > > > > > >> >
> > > > > > > >>
> > > > > > > >>
> > > > > > > >> --
> > > > > > > >> Gwen Shapira
> > > > > > > >> Engineering Manager | Confluent
> > > > > > > >> 650.450.2760 | @gwenshap
> > > > > > > >> Follow us: Twitter | blog
> > > > > > > >>
> > > > > > > >
> > > > > > >
> > > > > >
> > > >
> > >
> >

Re: [DISCUSS] KIP-714: Client metrics and observability

Posted by Magnus Edenhill <ma...@edenhill.se>.

Thanks David for pointing this out,
I've updated the KIP to include client_id as a matching selector.
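
For illustration, here is a minimal sketch of how a broker could evaluate
such selectors against a client's resource labels. It assumes that each
selector value is a regular expression that must match the whole label value
and that all selectors of a subscription must match; the class and method
names are hypothetical, and the label values are made up.

```java
import java.util.Map;
import java.util.regex.Pattern;

final class SelectorMatchSketch {
    // Returns true only if every selector names a label that is present and
    // whose value fully matches the selector's pattern (assumed semantics).
    static boolean matches(Map<String, String> selectors, Map<String, String> labels) {
        for (Map.Entry<String, String> selector : selectors.entrySet()) {
            String value = labels.get(selector.getKey());
            if (value == null || !Pattern.matches(selector.getValue(), value)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Made-up labels for one client instance.
        Map<String, String> labels = Map.of(
            "client_id", "orders-producer",
            "client_instance_id", "b69cc35a-7a54-4790-aa69-cc2bd4ee4538",
            "client_software_name", "apache-kafka-java",
            "client_software_version", "3.1.0");

        // Select by client_id when the instance id is not known to the user.
        System.out.println(matches(Map.of("client_id", "orders-.*"), labels));
        System.out.println(matches(Map.of("client_id", "billing-.*"), labels));
    }
}
```

Selecting on client_id covers the reactive workflow David describes, where
the user knows the configured client.id but not the generated instance UUID.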

Regards,
Magnus

On Thu, Nov 4, 2021 at 18:01 David Mao <dm...@confluent.io.invalid> wrote:

> Hey Magnus,
>
> I noticed that the KIP outlines the initial selectors supported as:
>
>    - client_instance_id - CLIENT_INSTANCE_ID UUID string representation.
>    - client_software_name  - client software implementation name.
>    - client_software_version  - client software implementation version.
>
> In the given reactive monitoring workflow, we mention that the application
> user does not know their client's client instance ID, but it's outlined
> that the operator can add a metrics subscription selecting for clientId. I
> don't see clientId as one of the supported selectors.
> I can see how this would have made sense in a previous iteration given that
> the previous client instance ID proposal was to construct the client
> instance ID using clientId as a prefix. Now that the client instance ID is
> a UUID, would we want to add clientId as a supported selector?
> Let me know what you think.
>
> David
>
> On Tue, Oct 19, 2021 at 12:33 PM Magnus Edenhill <ma...@edenhill.se>
> wrote:
>
> > Hi Mickael!
> >
> > On Tue, Oct 19, 2021 at 19:30 Mickael Maison <mickael.maison@gmail.com> wrote:
> >
> > > Hi Magnus,
> > >
> > > Thanks for the proposal.
> > >
> > > 1. Looking at the protocol section, isn't "ClientInstanceId" expected
> > > to be a field in GetTelemetrySubscriptionsResponseV0? Otherwise, how
> > > does a client retrieve this value?
> > >
> >
> > Good catch, it got removed by mistake in one of the edits.
> >
> >
> > >
> > > 2. In the client API section, you mention a new method
> > > "clientInstanceId()". Can you clarify which interfaces are affected?
> > > Is it only Consumer and Producer?
> > >
> >
> > And Admin. Will update the KIP.
> >
> >
> >
> > > 3. I'm a bit concerned this is enabled by default. Even if the data
> > > collected is supposed to be not sensitive, I think this can be
> > > problematic in some environments. Also users don't seem to have the
> > > choice to only expose some metrics. Knowing how much data transits
> > > through some applications can be considered critical.
> > >
> >
> > The broker already knows how much data transits through the client
> though,
> > right?
> > Care has been taken not to expose information in the standard metrics
> that
> > might
> > reveal sensitive information.
> >
> > Do you have an example of how the proposed metrics could leak sensitive
> > information?
> > As for limiting which metrics to export, I guess that could make sense
> > in some
> > very sensitive use-cases, but those users might disable metrics
> altogether
> > for now.
> > Could these concerns be addressed by a later KIP?
> >
> >
> >
> > >
> > > 4. As a user, how do you know if your application is actively sending
> > > metrics? Are there new metrics exposing what's going on, like how much
> > > data is being sent?
> > >
> >
> > That's a good question.
> > Since the proposed metrics interface is not aimed at, or directly
> available
> > to, the application
> > I guess there's little point of adding it here, but instead adding
> > something to the
> > existing JMX metrics?
> > Do you have any suggestions?
> >
> >
> >
> > > 5. If all metrics are enabled on a regular Consumer or Producer, do
> > > you have an idea how much throughput this would use?
> > >
> >
> > It depends on the number of partitions/topics/etc. the client is producing
> > to/consuming from.
> > I'll add some sizes to the KIP for some typical use-cases.
> >
> >
> > Thanks,
> > Magnus
> >
> >
> > > Thanks
> > >
> > > On Tue, Oct 19, 2021 at 5:06 PM Magnus Edenhill <ma...@edenhill.se>
> > > wrote:
> > > >
> > > > On Tue, Oct 19, 2021 at 13:22 Tom Bentley <tbentley@redhat.com> wrote:
> > > >
> > > > > Hi Magnus,
> > > > >
> > > > > I reviewed the KIP since you called the vote (sorry for not
> reviewing
> > > when
> > > > > you announced your intention to call the vote). I have a few
> > questions
> > > on
> > > > > some of the details.
> > > > >
> > > > > 1. There's no Javadoc on ClientTelemetryPayload.data(), so I don't
> > know
> > > > > whether the payload is exposed through this method as compressed or
> > > not.
> > > > > Later on you say "Decompression of the payloads will be handled by
> > the
> > > > > broker metrics plugin, the broker should expose a suitable
> > > decompression
> > > > > API to the metrics plugin for this purpose.", which suggests it's
> the
> > > > > compressed data in the buffer, but then we don't know which codec
> was
> > > used,
> > > > > nor the API via which the plugin should decompress it if required
> for
> > > > > forwarding to the ultimate metrics store. Should the
> > > ClientTelemetryPayload
> > > > > expose a method to get the compression and a decompressor?
> > > > >
> > > >
> > > > Good point, updated.
> > > >
> > > >
> > > >
> > > > > 2. The client-side API is expressed as StringOrError
> > > > > ClientInstance::ClientInstanceId(int timeout_ms). I understand that
> > > you're
> > > > > thinking about the librdkafka implementation, but it would be good
> to
> > > show
> > > > > the API as it would appear on the Apache Kafka clients.
> > > > >
> > > >
> > > > This was meant as pseudo-code, but I changed it to Java.
> > > >
> > > >
> > > > > 3. "PushTelemetryRequest|Response - protocol request used by the
> > > client to
> > > > > send metrics to any broker it is connected to." To be clear, this
> > means
> > > > > that the client can choose any of the connected brokers and push to
> > > just
> > > > > one of them? What should a supporting client do if it gets an error
> > > when
> > > > > pushing metrics to a broker, retry sending to the same broker or
> try
> > > > > pushing to another broker, or drop the metrics? Should supporting
> > > clients
> > > > > send successive requests to a single broker, or round robin, or is
> > > that up
> > > > > to the client author? I'm guessing the behaviour should be sticky
> to
> > > > > support the rate limiting features, but I think it would be good
> for
> > > client
> > > > > authors if this section were explicit on the recommended behaviour.
> > > > >
> > > >
> > > > You are right, I've updated the KIP to make this clearer.
> > > >
> > > >
> > > > > 4. "Mapping the client instance id to an actual application
> instance
> > > > > running on a (virtual) machine can be done by inspecting the
> metrics
> > > > > resource labels, such as the client source address and source port,
> > or
> > > > > security principal, all of which are added by the receiving broker.
> > > This
> > > > > will allow the operator together with the user to identify the
> actual
> > > > > application instance." Is this really always true? The source IP
> and
> > > port
> > > > > might be a loadbalancer/proxy in some setups. The principal, as
> > already
> > > > > mentioned in the KIP, might be shared between multiple
> applications.
> > > So at
> > > > > worst the organization running the clients might have to consult
> the
> > > logs
> > > > > of a set of client applications, right?
> > > > >
> > > >
> > > > Yes, that's correct. There's no guaranteed mapping from
> > > client_instance_id
> > > > to
> > > > an actual instance, that's why the KIP recommends client
> > implementations
> > > to
> > > > log the client instance id
> > > > upon retrieval, and also provide an API for the application to
> retrieve
> > > the
> > > > instance id programmatically
> > > > if it has a better way of exposing it.
> > > >
> > > >
> > > > > 5. "Tests indicate that a compression ratio up to 10x is possible for the
> > > > > standard metrics." Client authors might appreciate your mentioning which
> > > > > compression codec got these results.
> > > > >
> > > >
> > > > Good point. Updated.
> > > >
> > > >
> > > > > 6. "Should the client send a push request prior to expiry of the
> > > previously
> > > > > calculated PushIntervalMs the broker will discard the metrics and
> > > return a
> > > > > PushTelemetryResponse with the ErrorCode set to RateLimited." Is
> this
> > > > > RATE_LIMITED a new error code? It's not mentioned in the "New Error
> > > Codes"
> > > > > section.
> > > > >
> > > >
> > > > That's a leftover, it should be using the standard ThrottleTime
> > > mechanism.
> > > > Fixed.
> > > >
> > > >
> > > > > 7. In the section "Standard client resource labels" application_id
> is
> > > > > described as Kafka Streams only, but the section of "Client
> > > Identification"
> > > > > talks about "application instance id as an optional future
> > nice-to-have
> > > > > that may be included as a metrics label if it has been set by the
> > > user", so
> > > > > I'm confused whether non-Kafka Streams clients should set an
> > > application_id
> > > > > or not.
> > > > >
> > > >
> > > > > I'll clarify this in the KIP, but basically we would need to add an
> > > > > `application.id` config
> > > > property for non-streams clients for this purpose, and that's outside
> > the
> > > > scope of this KIP since we want to make it zero-conf:ish on the
> client
> > > side.
> > > >
> > > >
> > > > >
> > > > > Kind regards,
> > > > >
> > > > > Tom
> > > > >
> > > >
> > > > Thanks for the review,
> > > > Magnus
> > > >
> > > >
> > > >
> > > > >
> > > > > > On Thu, Oct 7, 2021 at 5:26 PM Magnus Edenhill <magnus@edenhill.se> wrote:
> > > > >
> > > > > > Hi all,
> > > > > >
> > > > > > I've updated the KIP following our recent discussions on the
> > mailing
> > > > > list:
> > > > > >  - split the protocol in two, one for getting the metrics
> > > subscriptions,
> > > > > > and one for pushing the metrics.
> > > > > >  - simplifications: initially only one supported metrics format,
> no
> > > > > > client.id in the instance id, etc.
> > > > > >  - made CLIENT_METRICS subscription configuration entries more
> > > structured
> > > > > >    and allowing better client matching selectors (not only on the
> > > > > instance
> > > > > > id, but also the other
> > > > > >    client resource labels, such as client_software_name, etc.).
> > > > > >
> > > > > > Unless there are further comments I'll call the vote in a day or
> > two.
> > > > > >
> > > > > > Regards,
> > > > > > Magnus
> > > > > >
> > > > > >
> > > > > >
> > > > > > > On Mon, Oct 4, 2021 at 20:57, Magnus Edenhill <magnus@edenhill.se> wrote:
> > > > > >
> > > > > > > Hi Gwen,
> > > > > > >
> > > > > > > I'm finishing up the KIP based on the last couple of discussion
> > > points
> > > > > in
> > > > > > > this thread
> > > > > > > and will call the Vote later this week.
> > > > > > >
> > > > > > > Best,
> > > > > > > Magnus
> > > > > > >
> > > > > > > > On Sat, Oct 2, 2021 at 02:01, Gwen Shapira <gwen@confluent.io.invalid> wrote:
> > > > > > >
> > > > > > >> Hey,
> > > > > > >>
> > > > > > >> I noticed that there was no discussion for the last 10 days,
> > but I
> > > > > > >> couldn't
> > > > > > >> find the vote thread. Is there one that I'm missing?
> > > > > > >>
> > > > > > >> Gwen
> > > > > > >>
> > > > > > > >> On Wed, Sep 22, 2021 at 4:58 AM Magnus Edenhill <magnus@edenhill.se> wrote:
> > > > > > >>
> > > > > > > >> > On Tue, Sep 21, 2021 at 06:58, Colin McCabe <cmccabe@apache.org> wrote:
> > > > > > >> >
> > > > > > >> > > On Mon, Sep 20, 2021, at 17:35, Feng Min wrote:
> > > > > > >> > > > Thanks Magnus & Colin for the discussion.
> > > > > > >> > > >
> > > > > > >> > > > Based on KIP-714's stateless design, Client can pretty
> > much
> > > use
> > > > > > any
> > > > > > >> > > > connection to any broker to send metrics. We are not
> > > associating
> > > > > > >> > > connection
> > > > > > >> > > > with client metric state. Is my understanding correct?
> If
> > > yes,
> > > > > > how
> > > > > > >> > about
> > > > > > >> > > > the following two scenarios
> > > > > > >> > > >
> > > > > > >> > > > 1) One Client (Client-ID) registers two different client
> > > > > instance
> > > > > > id
> > > > > > >> > via
> > > > > > >> > > > separate registration. Is it permitted? If OK, how to
> > > > > distinguish
> > > > > > >> them
> > > > > > >> > > from
> > > > > > >> > > > the case 2 below.
> > > > > > >> > > >
> > > > > > >> > >
> > > > > > >> > > Hi Feng,
> > > > > > >> > >
> > > > > > >> > > My understanding, which Magnus can clarify I guess, is
> that
> > > you
> > > > > > could
> > > > > > >> > have
> > > > > > >> > > something like two Producer instances running with the
> same
> > > > > > client.id
> > > > > > >> > > (perhaps because they're using the same config file, for
> > > example).
> > > > > > >> They
> > > > > > >> > > could even be in the same process. But they would get
> > separate
> > > > > > UUIDs.
> > > > > > >> > >
> > > > > > >> > > I believe Magnus used the term client to mean "Producer or
> > > > > > Consumer".
> > > > > > >> So
> > > > > > >> > > if you have both a Producer and a Consumer in your
> > > application I
> > > > > > would
> > > > > > >> > > expect you'd get separate UUIDs for both. Again Magnus can
> > > chime
> > > > > in
> > > > > > >> > here, I
> > > > > > >> > > guess.
> > > > > > >> > >
> > > > > > >> >
> > > > > > >> > That's correct.
> > > > > > >> >
> > > > > > >> >
> > > > > > >> > >
> > > > > > >> > > > 2) How about the client restarting? What's the
> > expectation?
> > > > > Should
> > > > > > >> the
> > > > > > >> > > > server expect the client to carry a persisted client
> > > instance id
> > > > > > or
> > > > > > >> > > should
> > > > > > >> > > > the client be treated as a new instance?
> > > > > > >> > >
> > > > > > >> > > The KIP doesn't describe any mechanism for persistence,
> so I
> > > would
> > > > > > >> assume
> > > > > > >> > > that when you restart the client you get a new UUID. I
> agree
> > > that
> > > > > it
> > > > > > >> > would
> > > > > > >> > > be good to spell this out.
> > > > > > >> > >
> > > > > > >> > >
> > > > > > >> > Right, it will not be persisted since a client instance
> can't
> > be
> > > > > > >> restarted.
> > > > > > >> >
> > > > > > >> > Will update the KIP to make this clearer.
> > > > > > >> >
> > > > > > >> > /Magnus
> > > > > > >> >
> > > > > > >>
> > > > > > >>
> > > > > > >> --
> > > > > > >> Gwen Shapira
> > > > > > >> Engineering Manager | Confluent
> > > > > > >> 650.450.2760 | @gwenshap
> > > > > > >> Follow us: Twitter | blog
> > > > > > >>
> > > > > > >
> > > > > >
> > > > >
> > >
> >
>

Re: [DISCUSS] KIP-714: Client metrics and observability

Posted by David Mao <dm...@confluent.io.INVALID>.

Hey Magnus,

I noticed that the KIP outlines the initial selectors supported as:

   - client_instance_id - CLIENT_INSTANCE_ID UUID string representation.
   - client_software_name - client software implementation name.
   - client_software_version - client software implementation version.

In the given reactive monitoring workflow, we mention that the application
user does not know their client's client instance ID, but it's outlined
that the operator can add a metrics subscription selecting for clientId. I
don't see clientId as one of the supported selectors.
I can see how this would have made sense in a previous iteration given that
the previous client instance ID proposal was to construct the client
instance ID using clientId as a prefix. Now that the client instance ID is
a UUID, would we want to add clientId as a supported selector?
Let me know what you think.
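
To make the question concrete, here is a minimal sketch of how a
subscription's selectors might be matched against a client's resource
labels. The three selector names are the ones listed above; client_id as a
selector is the hypothetical addition being asked about. The function name
`matches`, the exact-equality matching rule, and all label values are
assumptions made for illustration, not something the KIP specifies.

```python
# Selector names listed in the KIP text above.
KIP_SELECTORS = {
    "client_instance_id",
    "client_software_name",
    "client_software_version",
}
# Hypothetical: the same set extended with a client_id selector.
PROPOSED_SELECTORS = KIP_SELECTORS | {"client_id"}


def matches(selectors, client_labels, supported=KIP_SELECTORS):
    """Return True if every selector equals the client's label of that name.

    Exact string equality is an assumption; the KIP may define other rules.
    """
    unknown = set(selectors) - set(supported)
    if unknown:
        raise ValueError(f"unsupported selectors: {sorted(unknown)}")
    return all(client_labels.get(name) == value
               for name, value in selectors.items())


# Labels for one illustrative client; every value here is made up.
client = {
    "client_instance_id": "b69cc35a-7a54-4790-aa69-cc2bd4ee4538",
    "client_software_name": "apache-kafka-java",
    "client_software_version": "3.1.0",
    "client_id": "payments-producer",
}
```

With only the listed selectors, a subscription on client_id is rejected;
passing `supported=PROPOSED_SELECTORS` lets an operator select the client
by the one identifier the application user is likely to know.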

David


Re: [DISCUSS] KIP-714: Client metrics and observability

Posted by Magnus Edenhill <ma...@edenhill.se>.

Hi Mickael!

On Tue, Oct 19, 2021 at 19:30, Mickael Maison <mickael.maison@gmail.com> wrote:

> Hi Magnus,
>
> Thanks for the proposal.
>
> 1. Looking at the protocol section, isn't "ClientInstanceId" expected
> to be a field in GetTelemetrySubscriptionsResponseV0? Otherwise, how
> does a client retrieve this value?
>

Good catch, it got removed by mistake in one of the edits.
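
As a rough illustration of why the field matters, the sketch below models
a client learning its instance id from the subscriptions response. It
assumes the broker generates the id when the client has none, and the
ToyBroker and ToyClient classes are hypothetical stand-ins, not Kafka
classes. It also reflects two points from earlier in the thread: two
clients sharing a client.id get separate UUIDs, and the id is not
persisted across restarts.

```python
import uuid


class ToyBroker:
    """Stand-in for a broker; models only client instance id assignment."""

    def get_telemetry_subscriptions(self, client_instance_id=None):
        # Assumption: a client without an id sends None and the broker
        # assigns a fresh UUID, returned in the ClientInstanceId field.
        assigned = client_instance_id or str(uuid.uuid4())
        return {"ClientInstanceId": assigned}


class ToyClient:
    """Stand-in for a Producer, Consumer or Admin client instance."""

    def __init__(self, broker, client_id):
        self.broker = broker
        self.client_id = client_id   # configured client.id, may be shared
        self._instance_id = None     # kept in memory only, never persisted

    def client_instance_id(self):
        # Fetch the id once, then reuse it for the life of this instance.
        if self._instance_id is None:
            response = self.broker.get_telemetry_subscriptions(None)
            self._instance_id = response["ClientInstanceId"]
        return self._instance_id
```

Creating a new ToyClient after a "restart" yields a new id, which is why
logging the id on retrieval helps map it back to an application instance.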


>
> 2. In the client API section, you mention a new method
> "clientInstanceId()". Can you clarify which interfaces are affected?
> Is it only Consumer and Producer?
>

And Admin. Will update the KIP.



> 3. I'm a bit concerned this is enabled by default. Even if the data
> collected is not supposed to be sensitive, I think this can be
> problematic in some environments. Also, users don't seem to have the
> choice to expose only some metrics. Knowing how much data transits
> through some applications can be considered critical.
>

The broker already knows how much data transits through the client though,
right?
Care has been taken not to expose information in the standard metrics that
might reveal sensitive information.

Do you have an example of how the proposed metrics could leak sensitive
information?
As for limiting which metrics to export: I guess that could make sense in
some very sensitive use cases, but those users might disable metrics
altogether for now.
Could these concerns be addressed by a later KIP?



>
> 4. As a user, how do you know if your application is actively sending
> metrics? Are there new metrics exposing what's going on, like how much
> data is being sent?
>

That's a good question.
Since the proposed metrics interface is not aimed at, or directly available
to, the application, I guess there's little point in adding it here; perhaps
we should instead add something to the existing JMX metrics?
Do you have any suggestions?



> 5. If all metrics are enabled on a regular Consumer or Producer, do
> you have an idea how much throughput this would use?
>

It depends on the number of partitions, topics, etc. that the client is
producing to or consuming from.
I'll add some sizes to the KIP for some typical use cases.
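
Until those sizes are in the KIP, a back-of-the-envelope sketch like the
one below shows how payload size could scale with topics and partitions.
Everything in it is an assumption: JSON and zlib stand in for whatever
metrics format and compression codec the KIP specifies, the metric names
and values are placeholders, and the resulting numbers say nothing about
real clients.

```python
import json
import zlib


def synthetic_payload(num_topics, partitions_per_topic, metrics_per_partition=5):
    """Build a made-up metrics payload; names and values are placeholders."""
    points = []
    for t in range(num_topics):
        for p in range(partitions_per_topic):
            for m in range(metrics_per_partition):
                points.append({
                    "name": f"example.metric.{m}",
                    "labels": {"topic": f"topic-{t}", "partition": p},
                    "value": (t * 31 + p * 7 + m) % 1000,
                })
    return json.dumps(points).encode("utf-8")


def payload_sizes(num_topics, partitions_per_topic):
    """Return (raw_bytes, compressed_bytes) for one synthetic push."""
    raw = synthetic_payload(num_topics, partitions_per_topic)
    compressed = zlib.compress(raw)
    return len(raw), len(compressed)
```

Dividing the compressed size by the push interval in seconds gives a rough
per-client bandwidth figure for a given topic and partition count.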


Thanks,
Magnus


> Thanks

Re: [DISCUSS] KIP-714: Client metrics and observability

Posted by Mickael Maison <mi...@gmail.com>.

Hi Magnus,

Thanks for the proposal.

1. Looking at the protocol section, isn't "ClientInstanceId" expected
to be a field in GetTelemetrySubscriptionsResponseV0? Otherwise, how
does a client retrieve this value?

2. In the client API section, you mention a new method
"clientInstanceId()". Can you clarify which interfaces are affected?
Is it only Consumer and Producer?

3. I'm a bit concerned this is enabled by default. Even if the data
collected is not supposed to be sensitive, I think this can be
problematic in some environments. Also, users don't seem to have the
choice to expose only some metrics. Knowing how much data transits
through some applications can be considered critical.

4. As a user, how do you know if your application is actively sending
metrics? Are there new metrics exposing what's going on, like how much
data is being sent?

5. If all metrics are enabled on a regular Consumer or Producer, do
you have an idea how much throughput this would use?

Thanks

On Tue, Oct 19, 2021 at 5:06 PM Magnus Edenhill <ma...@edenhill.se> wrote:
>
> On Tue, Oct 19, 2021 at 13:22, Tom Bentley <tb...@redhat.com> wrote:
>
> > Hi Magnus,
> >
> > I reviewed the KIP since you called the vote (sorry for not reviewing when
> > you announced your intention to call the vote). I have a few questions on
> > some of the details.
> >
> > 1. There's no Javadoc on ClientTelemetryPayload.data(), so I don't know
> > whether the payload is exposed through this method as compressed or not.
> > Later on you say "Decompression of the payloads will be handled by the
> > broker metrics plugin, the broker should expose a suitable decompression
> > API to the metrics plugin for this purpose.", which suggests it's the
> > compressed data in the buffer, but then we don't know which codec was used,
> > nor the API via which the plugin should decompress it if required for
> > forwarding to the ultimate metrics store. Should the ClientTelemetryPayload
> > expose a method to get the compression and a decompressor?
> >
>
> Good point, updated.
>
>
>
> > 2. The client-side API is expressed as StringOrError
> > ClientInstance::ClientInstanceId(int timeout_ms). I understand that you're
> > thinking about the librdkafka implementation, but it would be good to show
> > the API as it would appear on the Apache Kafka clients.
> >
>
> This was meant as pseudo-code, but I changed it to Java.
>
>
> > 3. "PushTelemetryRequest|Response - protocol request used by the client to
> > send metrics to any broker it is connected to." To be clear, this means
> > that the client can choose any of the connected brokers and push to just
> > one of them? What should a supporting client do if it gets an error when
> > pushing metrics to a broker, retry sending to the same broker or try
> > pushing to another broker, or drop the metrics? Should supporting clients
> > send successive requests to a single broker, or round robin, or is that up
> > to the client author? I'm guessing the behaviour should be sticky to
> > support the rate limiting features, but I think it would be good for client
> > authors if this section were explicit on the recommended behaviour.
> >
>
> You are right, I've updated the KIP to make this clearer.
>
>
> > 4. "Mapping the client instance id to an actual application instance
> > running on a (virtual) machine can be done by inspecting the metrics
> > resource labels, such as the client source address and source port, or
> > security principal, all of which are added by the receiving broker. This
> > will allow the operator together with the user to identify the actual
> > application instance." Is this really always true? The source IP and port
> > might be a loadbalancer/proxy in some setups. The principal, as already
> > mentioned in the KIP, might be shared between multiple applications. So at
> > worst the organization running the clients might have to consult the logs
> > of a set of client applications, right?
> >
>
> Yes, that's correct. There's no guaranteed mapping from client_instance_id
> to an actual instance, that's why the KIP recommends client implementations
> to log the client instance id upon retrieval, and also provide an API for
> the application to retrieve the instance id programmatically if it has a
> better way of exposing it.
>
>
> > 5. "Tests indicate that a compression ratio up to 10x is possible for the
> > standard metrics." Client authors might appreciate your mentioning which
> > compression codec got these results.
> >
>
> Good point. Updated.
>
>
> > 6. "Should the client send a push request prior to expiry of the previously
> > calculated PushIntervalMs the broker will discard the metrics and return a
> > PushTelemetryResponse with the ErrorCode set to RateLimited." Is this
> > RATE_LIMITED a new error code? It's not mentioned in the "New Error Codes"
> > section.
> >
>
> That's a leftover, it should be using the standard ThrottleTime mechanism.
> Fixed.
>
>
> > 7. In the section "Standard client resource labels" application_id is
> > described as Kafka Streams only, but the section of "Client Identification"
> > talks about "application instance id as an optional future nice-to-have
> > that may be included as a metrics label if it has been set by the user", so
> > I'm confused whether non-Kafka Streams clients should set an application_id
> > or not.
> >
>
> I'll clarify this in the KIP, but basically we would need to add an
> `application.id` config property for non-streams clients for this purpose,
> and that's outside the scope of this KIP since we want to make it
> zero-conf:ish on the client side.
>
>
> >
> > Kind regards,
> >
> > Tom
> >
>
> Thanks for the review,
> Magnus
>
>
>
> >
> > On Thu, Oct 7, 2021 at 5:26 PM Magnus Edenhill <ma...@edenhill.se> wrote:
> >
> > > Hi all,
> > >
> > > I've updated the KIP following our recent discussions on the mailing
> > list:
> > >  - split the protocol in two, one for getting the metrics subscriptions,
> > > and one for pushing the metrics.
> > >  - simplifications: initially only one supported metrics format, no
> > > client.id in the instance id, etc.
> > >  - made CLIENT_METRICS subscription configuration entries more structured
> > >    and allowing better client matching selectors (not only on the
> > instance
> > > id, but also the other
> > >    client resource labels, such as client_software_name, etc.).
> > >
> > > Unless there are further comments I'll call the vote in a day or two.
> > >
> > > Regards,
> > > Magnus
> > >
> > >
> > >
> > > Den mån 4 okt. 2021 kl 20:57 skrev Magnus Edenhill <ma...@edenhill.se>:
> > >
> > > > Hi Gwen,
> > > >
> > > > I'm finishing up the KIP based on the last couple of discussion points
> > in
> > > > this thread
> > > > and will call the Vote later this week.
> > > >
> > > > Best,
> > > > Magnus
> > > >
> > > > Den lör 2 okt. 2021 kl 02:01 skrev Gwen Shapira
> > > <gwen@confluent.io.invalid
> > > > >:
> > > >
> > > >> Hey,
> > > >>
> > > >> I noticed that there was no discussion for the last 10 days, but I
> > > >> couldn't
> > > >> find the vote thread. Is there one that I'm missing?
> > > >>
> > > >> Gwen
> > > >>
> > > >> On Wed, Sep 22, 2021 at 4:58 AM Magnus Edenhill <ma...@edenhill.se>
> > > >> wrote:
> > > >>
> > > >> > Den tis 21 sep. 2021 kl 06:58 skrev Colin McCabe <
> > cmccabe@apache.org
> > > >:
> > > >> >
> > > >> > > On Mon, Sep 20, 2021, at 17:35, Feng Min wrote:
> > > >> > > > Thanks Magnus & Colin for the discussion.
> > > >> > > >
> > > >> > > > Based on KIP-714's stateless design, Client can pretty much use
> > > any
> > > >> > > > connection to any broker to send metrics. We are not associating
> > > >> > > connection
> > > >> > > > with client metric state. Is my understanding correct? If yes,
> > > how
> > > >> > about
> > > >> > > > the following two scenarios
> > > >> > > >
> > > >> > > > 1) One Client (Client-ID) registers two different client
> > instance
> > > id
> > > >> > via
> > > >> > > > separate registration. Is it permitted? If OK, how to
> > distinguish
> > > >> them
> > > >> > > from
> > > >> > > > the case 2 below.
> > > >> > > >
> > > >> > >
> > > >> > > Hi Feng,
> > > >> > >
> > > >> > > My understanding, which Magnus can clarify I guess, is that you
> > > could
> > > >> > have
> > > >> > > something like two Producer instances running with the same
> > > client.id
> > > >> > > (perhaps because they're using the same config file, for example).
> > > >> They
> > > >> > > could even be in the same process. But they would get separate
> > > UUIDs.
> > > >> > >
> > > >> > > I believe Magnus used the term client to mean "Producer or
> > > Consumer".
> > > >> So
> > > >> > > if you have both a Producer and a Consumer in your application I
> > > would
> > > >> > > expect you'd get separate UUIDs for both. Again Magnus can chime
> > in
> > > >> > here, I
> > > >> > > guess.
> > > >> > >
> > > >> >
> > > >> > That's correct.
> > > >> >
> > > >> >
> > > >> > >
> > > >> > > > 2) How about the client restarting? What's the expectation?
> > Should
> > > >> the
> > > >> > > > server expect the client to carry a persisted client instance id
> > > or
> > > >> > > should
> > > >> > > > the client be treated as a new instance?
> > > >> > >
> > > >> > > The KIP doesn't describe any mechanism for persistence, so I would
> > > >> assume
> > > >> > > that when you restart the client you get a new UUID. I agree that
> > it
> > > >> > would
> > > >> > > be good to spell this out.
> > > >> > >
> > > >> > >
> > > >> > Right, it will not be persisted since a client instance can't be
> > > >> restarted.
> > > >> >
> > > >> > Will update the KIP to make this clearer.
> > > >> >
> > > >> > /Magnus
> > > >> >
> > > >>
> > > >>
> > > >> --
> > > >> Gwen Shapira
> > > >> Engineering Manager | Confluent
> > > >> 650.450.2760 | @gwenshap
> > > >> Follow us: Twitter | blog
> > > >>
> > > >
> > >
> >

Re: [DISCUSS] KIP-714: Client metrics and observability

Posted by Magnus Edenhill <ma...@edenhill.se>.

Den tis 19 okt. 2021 kl 13:22 skrev Tom Bentley <tb...@redhat.com>:

> Hi Magnus,
>
> I reviewed the KIP since you called the vote (sorry for not reviewing when
> you announced your intention to call the vote). I have a few questions on
> some of the details.
>
> 1. There's no Javadoc on ClientTelemetryPayload.data(), so I don't know
> whether the payload is exposed through this method as compressed or not.
> Later on you say "Decompression of the payloads will be handled by the
> broker metrics plugin, the broker should expose a suitable decompression
> API to the metrics plugin for this purpose.", which suggests it's the
> compressed data in the buffer, but then we don't know which codec was used,
> nor the API via which the plugin should decompress it if required for
> forwarding to the ultimate metrics store. Should the ClientTelemetryPayload
> expose a method to get the compression and a decompressor?
>

Good point, updated.



> 2. The client-side API is expressed as StringOrError
> ClientInstance::ClientInstanceId(int timeout_ms). I understand that you're
> thinking about the librdkafka implementation, but it would be good to show
> the API as it would appear on the Apache Kafka clients.
>

This was meant as pseudo-code, but I changed it to Java.


> 3. "PushTelemetryRequest|Response - protocol request used by the client to
> send metrics to any broker it is connected to." To be clear, this means
> that the client can choose any of the connected brokers and push to just
> one of them? What should a supporting client do if it gets an error when
> pushing metrics to a broker, retry sending to the same broker or try
> pushing to another broker, or drop the metrics? Should supporting clients
> send successive requests to a single broker, or round robin, or is that up
> to the client author? I'm guessing the behaviour should be sticky to
> support the rate limiting features, but I think it would be good for client
> authors if this section were explicit on the recommended behaviour.
>

You are right, I've updated the KIP to make this clearer.


> 4. "Mapping the client instance id to an actual application instance
> running on a (virtual) machine can be done by inspecting the metrics
> resource labels, such as the client source address and source port, or
> security principal, all of which are added by the receiving broker. This
> will allow the operator together with the user to identify the actual
> application instance." Is this really always true? The source IP and port
> might be a loadbalancer/proxy in some setups. The principal, as already
> mentioned in the KIP, might be shared between multiple applications. So at
> worst the organization running the clients might have to consult the logs
> of a set of client applications, right?
>

Yes, that's correct. There's no guaranteed mapping from client_instance_id
to an actual instance, that's why the KIP recommends client implementations
to log the client instance id upon retrieval, and also provide an API for
the application to retrieve the instance id programmatically if it has a
better way of exposing it.


> 5. "Tests indicate that a compression ratio up to 10x is possible for the
> standard metrics." Client authors might appreciate your mentioning which
> compression codec got these results.
>

Good point. Updated.


> 6. "Should the client send a push request prior to expiry of the previously
> calculated PushIntervalMs the broker will discard the metrics and return a
> PushTelemetryResponse with the ErrorCode set to RateLimited." Is this
> RATE_LIMITED a new error code? It's not mentioned in the "New Error Codes"
> section.
>

That's a leftover, it should be using the standard ThrottleTime mechanism.
Fixed.
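
As a rough illustration of what this could mean for a client author (a sketch
only; the class, method and parameter names below are hypothetical and not
taken from the KIP), the client would wait for whichever is later: the end of
the push interval, or the end of the throttle time returned with the last
response.

```java
public class PushScheduleSketch {
    /**
     * Earliest time (ms) at which the next push may be sent: no sooner than
     * the push interval after the last push, and no sooner than the throttle
     * time after the last response was received. All names are hypothetical.
     */
    static long nextPushAtMs(long lastPushMs, long pushIntervalMs,
                             long lastResponseMs, long throttleTimeMs) {
        long byInterval = lastPushMs + pushIntervalMs;
        long byThrottle = lastResponseMs + throttleTimeMs;
        return Math.max(byInterval, byThrottle);
    }

    public static void main(String[] args) {
        // Not throttled: the push interval decides.
        System.out.println(nextPushAtMs(1_000, 5_000, 1_020, 0));
        // Throttled for longer than the interval: the throttle time decides.
        System.out.println(nextPushAtMs(1_000, 5_000, 1_020, 10_000));
    }
}
```

With this shape a client never needs a dedicated rate-limit error code; it
only has to honor the throttle time it already receives in responses.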


> 7. In the section "Standard client resource labels" application_id is
> described as Kafka Streams only, but the section of "Client Identification"
> talks about "application instance id as an optional future nice-to-have
> that may be included as a metrics label if it has been set by the user", so
> I'm confused whether non-Kafka Streams clients should set an application_id
> or not.
>

I'll clarify this in the KIP, but basically we would need to add an
`application.id` config property for non-streams clients for this purpose,
and that's outside the scope of this KIP since we want to make it
zero-conf:ish on the client side.


>
> Kind regards,
>
> Tom
>

Thanks for the review,
Magnus




Re: [DISCUSS] KIP-714: Client metrics and observability

Posted by Tom Bentley <tb...@redhat.com>.

Hi Magnus,

I reviewed the KIP since you called the vote (sorry for not reviewing when
you announced your intention to call the vote). I have a few questions on
some of the details.

1. There's no Javadoc on ClientTelemetryPayload.data(), so I don't know
whether the payload is exposed through this method as compressed or not.
Later on you say "Decompression of the payloads will be handled by the
broker metrics plugin, the broker should expose a suitable decompression
API to the metrics plugin for this purpose.", which suggests it's the
compressed data in the buffer, but then we don't know which codec was used,
nor the API via which the plugin should decompress it if required for
forwarding to the ultimate metrics store. Should the ClientTelemetryPayload
expose a method to get the compression and a decompressor?
2. The client-side API is expressed as StringOrError
ClientInstance::ClientInstanceId(int timeout_ms). I understand that you're
thinking about the librdkafka implementation, but it would be good to show
the API as it would appear on the Apache Kafka clients.
3. "PushTelemetryRequest|Response - protocol request used by the client to
send metrics to any broker it is connected to." To be clear, this means
that the client can choose any of the connected brokers and push to just
one of them? What should a supporting client do if it gets an error when
pushing metrics to a broker, retry sending to the same broker or try
pushing to another broker, or drop the metrics? Should supporting clients
send successive requests to a single broker, or round robin, or is that up
to the client author? I'm guessing the behaviour should be sticky to
support the rate limiting features, but I think it would be good for client
authors if this section were explicit on the recommended behaviour.
4. "Mapping the client instance id to an actual application instance
running on a (virtual) machine can be done by inspecting the metrics
resource labels, such as the client source address and source port, or
security principal, all of which are added by the receiving broker. This
will allow the operator together with the user to identify the actual
application instance." Is this really always true? The source IP and port
might be a loadbalancer/proxy in some setups. The principal, as already
mentioned in the KIP, might be shared between multiple applications. So at
worst the organization running the clients might have to consult the logs
of a set of client applications, right?
5. "Tests indicate that a compression ratio up to 10x is possible for the
standard metrics." Client authors might appreciate your mentioning which
compression codec got these results.
6. "Should the client send a push request prior to expiry of the previously
calculated PushIntervalMs the broker will discard the metrics and return a
PushTelemetryResponse with the ErrorCode set to RateLimited." Is this
RATE_LIMITED a new error code? It's not mentioned in the "New Error Codes"
section.
7. In the section "Standard client resource labels" application_id is
described as Kafka Streams only, but the section of "Client Identification"
talks about "application instance id as an optional future nice-to-have
that may be included as a metrics label if it has been set by the user", so
I'm confused whether non-Kafka Streams clients should set an application_id
or not.

Kind regards,

Tom


Re: [DISCUSS] KIP-714: Client metrics and observability

Posted by Magnus Edenhill <ma...@edenhill.se>.

Hi all,

I've updated the KIP following our recent discussions on the mailing list:
 - split the protocol in two, one for getting the metrics subscriptions,
and one for pushing the metrics.
 - simplifications: initially only one supported metrics format, no
client.id in the instance id, etc.
 - made CLIENT_METRICS subscription configuration entries more structured
   and allowing better client matching selectors (not only on the instance
id, but also the other
   client resource labels, such as client_software_name, etc.).

Unless there are further comments I'll call the vote in a day or two.

Regards,
Magnus



Den mån 4 okt. 2021 kl 20:57 skrev Magnus Edenhill <ma...@edenhill.se>:

> Hi Gwen,
>
> I'm finishing up the KIP based on the last couple of discussion points in
> this thread
> and will call the Vote later this week.
>
> Best,
> Magnus
>
> Den lör 2 okt. 2021 kl 02:01 skrev Gwen Shapira <gwen@confluent.io.invalid
> >:
>
>> Hey,
>>
>> I noticed that there was no discussion for the last 10 days, but I
>> couldn't
>> find the vote thread. Is there one that I'm missing?
>>
>> Gwen
>>
>> On Wed, Sep 22, 2021 at 4:58 AM Magnus Edenhill <ma...@edenhill.se>
>> wrote:
>>
>> > Den tis 21 sep. 2021 kl 06:58 skrev Colin McCabe <cm...@apache.org>:
>> >
>> > > On Mon, Sep 20, 2021, at 17:35, Feng Min wrote:
>> > > > Thanks Magnus & Colin for the discussion.
>> > > >
>> > > > Based on KIP-714's stateless design, Client can pretty much use any
>> > > > connection to any broker to send metrics. We are not associating
>> > > connection
>> > > > with client metric state. Is my understanding correct? If yes,  how
>> > about
>> > > > the following two scenarios
>> > > >
>> > > > 1) One Client (Client-ID) registers two different client instance id
>> > via
>> > > > separate registration. Is it permitted? If OK, how to distinguish
>> them
>> > > from
>> > > > the case 2 below.
>> > > >
>> > >
>> > > Hi Feng,
>> > >
>> > > My understanding, which Magnus can clarify I guess, is that you could
>> > have
>> > > something like two Producer instances running with the same client.id
>> > > (perhaps because they're using the same config file, for example).
>> They
>> > > could even be in the same process. But they would get separate UUIDs.
>> > >
>> > > I believe Magnus used the term client to mean "Producer or Consumer".
>> So
>> > > if you have both a Producer and a Consumer in your application I would
>> > > expect you'd get separate UUIDs for both. Again Magnus can chime in
>> > here, I
>> > > guess.
>> > >
>> >
>> > That's correct.
>> >
>> >
>> > >
>> > > > 2) How about the client restarting? What's the expectation? Should
>> the
>> > > > server expect the client to carry a persisted client instance id or
>> > > should
>> > > > the client be treated as a new instance?
>> > >
>> > > The KIP doesn't describe any mechanism for persistence, so I would
>> assume
>> > > that when you restart the client you get a new UUID. I agree that it
>> > would
>> > > be good to spell this out.
>> > >
>> > >
>> > Right, it will not be persisted since a client instance can't be
>> restarted.
>> >
>> > Will update the KIP to make this clearer.
>> >
>> > /Magnus
>> >
>>
>>
>> --
>> Gwen Shapira
>> Engineering Manager | Confluent
>> 650.450.2760 | @gwenshap
>> Follow us: Twitter | blog
>>
>

Re: [DISCUSS] KIP-714: Client metrics and observability

Posted by Magnus Edenhill <ma...@edenhill.se>.

Hi Gwen,

I'm finishing up the KIP based on the last couple of discussion points in
this thread
and will call the Vote later this week.

Best,
Magnus

Den lör 2 okt. 2021 kl 02:01 skrev Gwen Shapira <gw...@confluent.io.invalid>:

> Hey,
>
> I noticed that there was no discussion for the last 10 days, but I couldn't
> find the vote thread. Is there one that I'm missing?
>
> Gwen
>
> On Wed, Sep 22, 2021 at 4:58 AM Magnus Edenhill <ma...@edenhill.se>
> wrote:
>
> > Den tis 21 sep. 2021 kl 06:58 skrev Colin McCabe <cm...@apache.org>:
> >
> > > On Mon, Sep 20, 2021, at 17:35, Feng Min wrote:
> > > > Thanks Magnus & Colin for the discussion.
> > > >
> > > > Based on KIP-714's stateless design, Client can pretty much use any
> > > > connection to any broker to send metrics. We are not associating
> > > connection
> > > > with client metric state. Is my understanding correct? If yes,  how
> > about
> > > > the following two scenarios
> > > >
> > > > 1) One Client (Client-ID) registers two different client instance id
> > via
> > > > separate registration. Is it permitted? If OK, how to distinguish
> them
> > > from
> > > > the case 2 below.
> > > >
> > >
> > > Hi Feng,
> > >
> > > My understanding, which Magnus can clarify I guess, is that you could
> > have
> > > something like two Producer instances running with the same client.id
> > > (perhaps because they're using the same config file, for example). They
> > > could even be in the same process. But they would get separate UUIDs.
> > >
> > > I believe Magnus used the term client to mean "Producer or Consumer".
> So
> > > if you have both a Producer and a Consumer in your application I would
> > > expect you'd get separate UUIDs for both. Again Magnus can chime in
> > here, I
> > > guess.
> > >
> >
> > That's correct.
> >
> >
> > >
> > > > 2) How about the client restarting? What's the expectation? Should
> the
> > > > server expect the client to carry a persisted client instance id or
> > > should
> > > > the client be treated as a new instance?
> > >
> > > The KIP doesn't describe any mechanism for persistence, so I would
> assume
> > > that when you restart the client you get a new UUID. I agree that it
> > would
> > > be good to spell this out.
> > >
> > >
> > Right, it will not be persisted since a client instance can't be
> restarted.
> >
> > Will update the KIP to make this clearer.
> >
> > /Magnus
> >
>
>
> --
> Gwen Shapira
> Engineering Manager | Confluent
> 650.450.2760 | @gwenshap
> Follow us: Twitter | blog
>

Re: [DISCUSS] KIP-714: Client metrics and observability

Posted by Gwen Shapira <gw...@confluent.io.INVALID>.

Hey,

I noticed that there was no discussion for the last 10 days, but I couldn't
find the vote thread. Is there one that I'm missing?

Gwen

On Wed, Sep 22, 2021 at 4:58 AM Magnus Edenhill <ma...@edenhill.se> wrote:

> Den tis 21 sep. 2021 kl 06:58 skrev Colin McCabe <cm...@apache.org>:
>
> > On Mon, Sep 20, 2021, at 17:35, Feng Min wrote:
> > > Thanks Magnus & Colin for the discussion.
> > >
> > > Based on KIP-714's stateless design, Client can pretty much use any
> > > connection to any broker to send metrics. We are not associating
> > connection
> > > with client metric state. Is my understanding correct? If yes,  how
> about
> > > the following two scenarios
> > >
> > > 1) One Client (Client-ID) registers two different client instance id
> via
> > > separate registration. Is it permitted? If OK, how to distinguish them
> > from
> > > the case 2 below.
> > >
> >
> > Hi Feng,
> >
> > My understanding, which Magnus can clarify I guess, is that you could
> have
> > something like two Producer instances running with the same client.id
> > (perhaps because they're using the same config file, for example). They
> > could even be in the same process. But they would get separate UUIDs.
> >
> > I believe Magnus used the term client to mean "Producer or Consumer". So
> > if you have both a Producer and a Consumer in your application I would
> > expect you'd get separate UUIDs for both. Again Magnus can chime in
> here, I
> > guess.
> >
>
> That's correct.
>
>
> >
> > > 2) How about the client restarting? What's the expectation? Should the
> > > server expect the client to carry a persisted client instance id or
> > should
> > > the client be treated as a new instance?
> >
> > The KIP doesn't describe any mechanism for persistence, so I would assume
> > that when you restart the client you get a new UUID. I agree that it
> would
> > be good to spell this out.
> >
> >
> Right, it will not be persisted since a client instance can't be restarted.
>
> Will update the KIP to make this clearer.
>
> /Magnus
>


-- 
Gwen Shapira
Engineering Manager | Confluent
650.450.2760 | @gwenshap
Follow us: Twitter | blog

Re: [DISCUSS] KIP-714: Client metrics and observability

Posted by Magnus Edenhill <ma...@edenhill.se>.

On Tue, Sep 21, 2021 at 06:58, Colin McCabe <cm...@apache.org> wrote:

> On Mon, Sep 20, 2021, at 17:35, Feng Min wrote:
> > Thanks Magnus & Colin for the discussion.
> >
> > Based on KIP-714's stateless design, Client can pretty much use any
> > connection to any broker to send metrics. We are not associating
> connection
> > with client metric state. Is my understanding correct? If yes,  how about
> > the following two scenarios
> >
> > 1) One Client (Client-ID) registers two different client instance id via
> > separate registration. Is it permitted? If OK, how to distinguish them
> from
> > the case 2 below.
> >
>
> Hi Feng,
>
> My understanding, which Magnus can clarify I guess, is that you could have
> something like two Producer instances running with the same client.id
> (perhaps because they're using the same config file, for example). They
> could even be in the same process. But they would get separate UUIDs.
>
> I believe Magnus used the term client to mean "Producer or Consumer". So
> if you have both a Producer and a Consumer in your application I would
> expect you'd get separate UUIDs for both. Again Magnus can chime in here, I
> guess.
>

That's correct.


>
> > 2) How about the client restarting? What's the expectation? Should the
> > server expect the client to carry a persisted client instance id or
> should
> > the client be treated as a new instance?
>
> The KIP doesn't describe any mechanism for persistence, so I would assume
> that when you restart the client you get a new UUID. I agree that it would
> be good to spell this out.
>
>
Right, it will not be persisted since a client instance can't be restarted.

Will update the KIP to make this clearer.

/Magnus

Re: [DISCUSS] KIP-714: Client metrics and observability

Posted by Colin McCabe <cm...@apache.org>.

On Mon, Sep 20, 2021, at 17:35, Feng Min wrote:
> Thanks Magnus & Colin for the discussion.
>
> Based on KIP-714's stateless design, Client can pretty much use any
> connection to any broker to send metrics. We are not associating connection
> with client metric state. Is my understanding correct? If yes,  how about
> the following two scenarios
>
> 1) One Client (Client-ID) registers two different client instance id via
> separate registration. Is it permitted? If OK, how to distinguish them from
> the case 2 below.
>

Hi Feng,

My understanding, which Magnus can clarify I guess, is that you could have something like two Producer instances running with the same client.id (perhaps because they're using the same config file, for example). They could even be in the same process. But they would get separate UUIDs.

I believe Magnus used the term client to mean "Producer or Consumer". So if you have both a Producer and a Consumer in your application I would expect you'd get separate UUIDs for both. Again Magnus can chime in here, I guess.

> 2) How about the client restarting? What's the expectation? Should the
> server expect the client to carry a persisted client instance id or should
> the client be treated as a new instance?

The KIP doesn't describe any mechanism for persistence, so I would assume that when you restart the client you get a new UUID. I agree that it would be good to spell this out.
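
A minimal sketch of the behaviour described above, assuming the broker generates the instance id when the client sends none (as in the GetTelemetrySubscriptionsRequest proposal quoted later in this thread). The `Client` class and both function names are hypothetical and exist only for illustration; they are not part of any Kafka client API.

```python
import uuid


class Client:
    """Hypothetical stand-in for a single Producer or Consumer instance."""

    def __init__(self, client_id):
        self.client_id = client_id        # configured; may be shared by many instances
        self.client_instance_id = None    # assigned once per instance, never persisted


def get_telemetry_subscriptions(client_instance_id):
    """Hypothetical broker handler: generate an id only when none is supplied."""
    if client_instance_id is None:
        return uuid.uuid4()
    return client_instance_id


def ensure_instance_id(client):
    """Acquire the instance id on first use and keep it afterwards."""
    client.client_instance_id = get_telemetry_subscriptions(client.client_instance_id)
    return client.client_instance_id


# Two producers sharing one client.id, possibly in the same process.
p1 = Client("orders-app")
p2 = Client("orders-app")
id1 = ensure_instance_id(p1)
id2 = ensure_instance_id(p2)

# A "restart" is simply a new instance, so it starts without an id.
restarted = Client("orders-app")
id3 = ensure_instance_id(restarted)
```

Under these assumptions the two producers share a client.id but get distinct UUIDs, and the restarted client is indistinguishable from a brand-new instance.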

> also some comments inline.
>
> On Mon, Sep 20, 2021 at 11:41 AM Colin McCabe <cm...@apache.org> wrote:
>
> ...
>
>> It seems like the goal here is to have the client register itself, so that
>> we can tell if this is an old client reconnecting. If that is the case,
>> then I suggest to rename the RPC to RegisterClient.
>>
>> I think we need a better name than "clientInstanceId" since that name is
>> very similar to "clientId." Perhaps something like originId? Or clientUuid?
>> Let's also use UUID here rather than a string.
>>
>> > 6. > PushTelemetryRequest{
>> >        ClientInstanceId = f00d-feed-deff-ceff-ffff-….,
>> >        SubscriptionId = 0x234adf34,
>> >        ContentType = OTLPv8|ZSTD,
>> >        Terminating = False,
>> >        Metrics = …// zstd-compressed OTLPv08-protobuf-serialized metrics
>> >   }
>>
>>
> If we assume connection is not bound to ClientInstanceId, and the RPC can
> be sent to any broker (not necessarily the broker doing the registration).
> The client-instance-id is required for every metric reporting. It's just
> part of the labelling.
>

Hmm, I don't quite follow. I suggested using a name that was less confusingly similar to "client ID". Your response states that "the client-instance-id is required for every metric reporting... it's just part of the labelling". I don't see how the UUID being required for metric reporting is related to what its name should be. Did you mean to reply to a different point here?

best,
Colin


>
>> It's not necessary for the client to re-send its client instance ID here,
>> since it already registered with RegisterClient. If the TCP connection
>> dropped, it will have to re-send RegisterClient anyway. SubscriptionID we
>> should get rid of, as I said above.
>>
>> I don't see the need for protobufs. Why not just use Kafka's own
>> serialization mechanism? As much as possible, we should try to avoid
>> creating "turduckens" of protocol X containing a buffer serialized with
>> protocol Y, containing a protocol serialized with protocol Z. These aren't
>> conducive to a good implementation, and make it harder for people to write
>> clients. Just use Kafka's RPC protocol (with optional fields if you wish).
>>
>> If we do compression on Kafka RPC, I would prefer that we do it a more
>> generic way that applies to all control messages, not just this one. I also
>> doubt we need to support lots and lots of different compression codecs, at
>> first at least.
>>
>> Another thing I'd like to understand is whether we truly need
>> "terminating" (or anything like it). I'm still confused about how the
>> backend could use this. Keep in mind that we may receive it on multiple
>> brokers (or not receive it at all). We may receive more stuff about client
>> XYZ from broker 1 after we have already received a "terminated" for client
>> XYZ from broker 2.
>>
>> > If the broker connection goes down or the connection is to be used for
>> > other purposes (e.g., blocking FetchRequests), the client will send
>> > PushTelemetryRequests to any other broker in the cluster, using the same
>> > ClientInstanceId and SubscriptionId as received in the latest
>> > GetTelemetrySubscriptionsResponse.
>> >
>> > While the subscriptionId may change during the lifetime of the client
>> > instance (when metric subscriptions are updated), the ClientInstanceId is
>> > only acquired once and must not change (as it is used to identify the
>> > unique client instance).
>> > ...
>> > What we do want though is ability to single out a specific client
>> instance
>> > to give it a more fine-grained subscription for troubleshooting, and
>> > we can do that with the current proposal with matching solely on the
>> > CLIENT_INSTANCE_ID.
>> > In other words; all clients will have the same standard metrics
>> > subscription, but specific client instances can have alternate
>> > subscriptions.
>>
>> That makes sense, and gives a good reason why we might want to couple
>> finding the metrics info to passing the client UUID.
>>
>> > The metrics collector/tsdb/whatever will need to identify a single client
>> > instance, regardless of which broker received the metrics.
>> > The chapter on CLIENT_INSTANCE_ID motivates why we need a unique
>> > identifier, basically because neither clientID, principal or remote
>> > address:port, etc, can be
>> > used to identify a single client instance.
>> >
>>
>> Thanks for the background. I agree that a UUID is useful here.
>>
>> This also gives additional reasons why using a UUID rather than a
>> free-form string is desirable (many databases can handle numbers more
>> efficiently than strings).
>>
>> > > Unfortunately, we have to strictly control the metrics format, because
>> > > otherwise clients can't implement it. I agree that we don't need to
>> specify
>> > > how the broker-side code works, since that is pluggable. It's also
>> > > reasonable for the clients to have pluggable extensions as well, but
>> this
>> > > KIP won't be of much use if we don't at least define a basic set of
>> metrics
>> > > that most clients can understand how to send. The open source clients
>> will
>> > > not implement anything more than what is specified in the KIP (or at
>> least
>> > > the AK one won't...)
>> > >
>> >
>> > Makes sense, in the updated proposal above I changed ContentType to a
>> > bitmask.
>> >
>>
>> Hmm. You might have forgotten to save your changes. It's still listed as a
>> string in the KIP.
>>
>> > > I'm not sure if OpenCensus adds any value to this KIP, to be honest.
>> Their
>> > > primary focus was never on the format of the data being sent (in fact,
>> the
>> > > last time they checked, they left the format up to each OpenCensus
>> > > implementation). That may have changed, but I think it still has
>> limited
>> > > usefulness to us, since we have our own format which we have to use
>> anyway.
>> > >
>> >
>> > Oh, I meant concensus as in kafka-dev agreement :)
>> >
>> > Feng is looking into the implementation details of the Java client and
>> will
>> > update the KIP with regards to dependencies.
>>
>> I still don't understand what value using the OpenTelemetry format is
>> giving us here, as opposed to using Kafka's own format. I guess we will
>> need to talk more about this.
>>
>> > > Hmm, that data is about 10 fields, most of which are strings. It
>> certainly
>> > > adds a lot of overhead to resend it each time.
>> > >
>> > > I don't follow the comment about unpacking and repacking -- since the
>> > > client registered with the broker it already knows all this
>> information, so
>> > > there's nothing to unpack or repack, except from memory. If it's more
>> > > convenient to serialize it once rather than multiple times, that is an
>> > > implementation detail of the broker side plugin, which we are not
>> > > specifying here anyway.
>> > >
>> >
>> > The current proposal is pretty much stateless on the broker, it does not
>> > need to hold any state for a client (instance), and no state
>> > synchronization is needed
>> > between brokers in the cluster, which allows a client to seamlessly send
>> > metrics to any broker it wants and keeps the API overhead down (no need
>> to
>> > re-register when
>> > switching brokers for instance).
>> >
>> > We could remove the labels that are already available to the broker on a
>> > per-request basis or that it already maintains state for:
>> >  - client_id
>> >  - client_instance_id
>> >  - client_software_*
>> >
>> > Leaving the following to still be included:
>> >  - group_id
>> >  - group_instance_id
>> >  - transactional_id
>> >   etc..
>> >
>> > What do you think of that?
>>
>> Yes, that might be a reasonable balance. I don't think we currently
>> associate group_id with a connection, so it would be reasonable to re-send
>> it in the telemetry request.
>>
>> What I would suggest next is to set up a protocol definition for the
>> metrics you are sending over the wire. For example, you could specify
>> something like this for producer metrics:
>>
>> > { "name": "clientProducerRecordBytes", "type": "int64",
>> > "versions": "0+", "taggedVersions": "0+", "tag": 0,
>> >  "about": "client.producer.record.bytes" },
>> > { "name": "clientProducerQueueMaxMessages", "type": "int32",
>> > "versions": "0+", "taggedVersions": "0+", "tag": 1,
>> > "about": "client.producer.queue.max.messages" },
>>
>> This specifies what it is (int32, int64, etc.), how it's being sent, etc.
>>
>> best,
>> Colin
>>
>> >
>> >
>> > Thanks,
>> > Magnus
>> >
>> >
>> >
>> > >
>> > > best,
>> > > Colin
>> > >
>> > > > Thanks,
>> > > > Magnus
>> > > >
>> > > > >
>> > > > >
>> > > >
>> > >
>> >
>>
>
>
> -- 
> Best,
> Feng

Re: [DISCUSS] KIP-714: Client metrics and observability

Posted by Magnus Edenhill <ma...@edenhill.se>.

On Mon, Sep 20, 2021 at 20:41, Colin McCabe <cm...@apache.org> wrote:

> On Tue, Sep 14, 2021, at 00:47, Magnus Edenhill wrote:
> > Thanks for your feedback Colin, see my updated proposal below.
> > ...
>
> Hi Magnus,
>
> Thanks for the update.
>
> >
> > Splitting up the API into separate data and control requests makes sense.
> > With a split we would have one API for querying the broker for configured
> > metrics subscriptions,
> > and one API for pushing the collected metrics to the broker.
> >
> > A mechanism is still needed to notify the client when the subscription is
> > changed;
> > I’ve added a SubscriptionId for this purpose (which could be a checksum
> of
> > the configured metrics subscription), this id is sent to the client along
> > with the metrics subscription, and the client sends it back to the broker
> > when pushing metrics. If the broker finds the pushed subscription id to
> > differ from what is expected it will return an error to the client, which
> > triggers the client to retrieve the new subscribed metrics and an updated
> > subscription id. The generation of the subscriptionId is opaque to the
> > client.
> >
>
> Hmm, SubscriptionId seems rather complex. We don't have this kind of
> complicated machinery for changing ApiVersions, and that is something that
> can also change over time, and which affects the clients.
>

I'm not sure how it relates to ApiVersion?
The SubscriptionId is a rather simple and stateless way to make sure the
client is using the most recently configured metrics subscriptions.

>
> Changing the configured metrics should be extremely rare. In this case,
> why don't we just close all connections on the broker side? Then the
> clients can re-connect and re-fetch the information about the metrics
> they're supposed to send.

While the standard metrics subscription is rarely updated, the second
use-case of troubleshooting a specific client will require the metrics
subscription to be updated and propagated in a timely manner.
Closing all client connections on the broker side is quite an intrusive
thing to do, will create a thundering herd of reconnects, and doesn't
really serve much of a purpose, since the metrics in this proposal are
explicitly not bound to a single broker connection but to a client
instance, allowing any broker connection to be used. This is a feature of
the proposal.

> >
> > Something like this:
> >
> > // Get the configured metrics subscription.
> > GetTelemetrySubscriptionsRequest {
> >    StrNull  ClientInstanceId  // Null on first invocation to retrieve a
> > newly generated instance id from the broker.
> > }
>
> It seems like the goal here is to have the client register itself, so that
> we can tell if this is an old client reconnecting. If that is the case,
> then I suggest to rename the RPC to RegisterClient.
>

Registering a client is perhaps a good idea, but it is a bigger undertaking
that also involves other parts of the protocol (we wouldn't need to send
the ClientId in the protocol header, for instance), and is thus outside the
scope of this proposal. What we are aiming to provide here is as stateless
a transmission as possible of client metrics through any broker.

>
> I think we need a better name than "clientInstanceId" since that name is
> very similar to "clientId." Perhaps something like originId? Or clientUuid?
> Let's also use UUID here rather than a string.
>

I believe clientInstanceId is descriptive of what it is: the identification
of a specific client instance or incarnation.
There's some prior art here, e.g., group.id vs. group.instance.id (in that
case it made sense to have the instance id configurable, but we want
metrics to be zero-conf).

+1 on using the UUID type as opposed to string.

>
> > 6. > PushTelemetryRequest{
> >        ClientInstanceId = f00d-feed-deff-ceff-ffff-….,
> >        SubscriptionId = 0x234adf34,
> >        ContentType = OTLPv8|ZSTD,
> >        Terminating = False,
> >        Metrics = …// zstd-compressed OTLPv08-protobuf-serialized metrics
> >   }
>
> It's not necessary for the client to re-send its client instance ID here,
> since it already registered with RegisterClient. If the TCP connection
> dropped, it will have to re-send RegisterClient anyway. SubscriptionID we
> should get rid of, as I said above.
>

The overhead of resending the client instance id (16 bytes) is minimal in
relation to the metrics data itself, and it is typically only sent every
60s or so.

As for caching it on the connection; the client only requests the
clientInstanceId once per client instance lifetime, not per broker
connection, so the client does not need to re-register itself on each
connection.

The SubscriptionId is used to make sure the client's metrics subscription
is up to date. It is pretty much a configuration version, but without the
need for sequentiality checks; a simple inequality check by the broker is
sufficient.
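
A minimal sketch of that inequality check, assuming the id is a CRC32 checksum over the configured metric names (the thread only says it "could be a checksum"). The `Broker` class, its methods, and the `UNKNOWN_SUBSCRIPTION_ID` error name are hypothetical placeholders, not definitions from the KIP.

```python
import zlib

UNKNOWN_SUBSCRIPTION_ID = "UNKNOWN_SUBSCRIPTION_ID"  # hypothetical error name


def subscription_id(metrics):
    """Derive an opaque id from the configured subscription (assumed: CRC32)."""
    canonical = ",".join(sorted(metrics)).encode("utf-8")
    return zlib.crc32(canonical)


class Broker:
    """Hypothetical broker holding only the configured subscription."""

    def __init__(self, metrics):
        self.metrics = list(metrics)

    def get_subscription(self):
        # Returned to the client together with the subscribed metrics.
        return subscription_id(self.metrics), list(self.metrics)

    def push_telemetry(self, pushed_id):
        # Stateless check: no ordering, just "is it the current id?".
        if pushed_id != subscription_id(self.metrics):
            return UNKNOWN_SUBSCRIPTION_ID
        return None


broker = Broker(["client.producer.record.bytes"])
sub_id, _ = broker.get_subscription()
first = broker.push_telemetry(sub_id)

# The operator changes the subscription; the client's cached id is now stale.
broker.metrics.append("client.producer.queue.max.messages")
stale = broker.push_telemetry(sub_id)

# On error the client re-fetches the subscription and retries.
sub_id, _ = broker.get_subscription()
refreshed = broker.push_telemetry(sub_id)
```

Because the id is recomputed from the configuration, any broker with the same configuration can perform the check without per-client state.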

> I don't see the need for protobufs. Why not just use Kafka's own
> serialization mechanism? As much as possible, we should try to avoid
> creating "turduckens" of protocol X containing a buffer serialized with
> protocol Y, containing a protocol serialized with protocol Z. These aren't
> conducive to a good implementation, and make it harder for people to write
> clients. Just use Kafka's RPC protocol (with optional fields if you wish).
>

This is covered in the Rejected Alternatives section: we do not want to
duplicate the efforts of the OpenTelemetry project, but rather reap the
benefits of their work.
There's also future functionality in this space that OpenTelemetry
provides: events and tracing.

> If we do compression on Kafka RPC, I would prefer that we do it a more
> generic way that applies to all control messages, not just this one. I also
> doubt we need to support lots and lots of different compression codecs, at
> first at least.
>

Generic compression as part of the Kafka framing makes sense, but is
outside the scope of this proposal.
However, as with ProduceRequest, there's a benefit to having just the data
parts of the request compressed: it avoids the need for decompression and
recompression.
A metrics plugin on the broker may for example simply forward the received
compressed metrics as is to an upstream system.
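
A rough sketch of such a pass-through plugin. Everything here is an assumption made for illustration: `ForwardingPlugin` and `on_push` are hypothetical names rather than a Kafka plugin interface, the content-type string is a placeholder, and `zlib` stands in for zstd so the example needs nothing beyond the standard library.

```python
import zlib


def client_encode(serialized_metrics):
    """Client side: compress only the metrics payload (zlib stands in for zstd)."""
    return zlib.compress(serialized_metrics)


class ForwardingPlugin:
    """Hypothetical broker-side plugin that never inspects the payload."""

    def __init__(self):
        self.upstream = []  # stand-in for a connection to a metrics backend

    def on_push(self, client_instance_id, content_type, payload):
        # The payload is forwarded exactly as received: no decompression,
        # no re-serialization, no recompression on the broker.
        self.upstream.append((client_instance_id, content_type, payload))


raw = b"serialized-metrics-placeholder" * 10
payload = client_encode(raw)

plugin = ForwardingPlugin()
plugin.on_push("instance-1", "placeholder-content-type", payload)

# Only the upstream system needs to understand the content type.
forwarded_id, forwarded_type, forwarded_payload = plugin.upstream[0]
decoded = zlib.decompress(forwarded_payload)
```

The broker only needs the envelope fields to route the data; decoding cost is paid once, at the final destination.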

>
> Another thing I'd like to understand is whether we truly need
> "terminating" (or anything like it). I'm still confused about how the
> backend could use this. Keep in mind that we may receive it on multiple
> brokers (or not receive it at all). We may receive more stuff about client
> XYZ from broker 1 after we have already received a "terminated" for client
> XYZ from broker 2.
>

The Terminating flag is useful for the receiving metrics system. E.g., it
can be used to disable alarms for missing data points.
From the broker's perspective it can be used to delete its metrics
subscription cache entry for the client, but those should have timeouts
anyway.

Maybe we should move this field to the metrics data itself as a label.
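
As an illustration of the alarm use case, here is a small sketch of how a receiving metrics system might use the flag. The function name, the 180-second silence threshold, and the data layout are all assumptions for the example, not part of the proposal.

```python
def missing_data_alarms(last_seen, terminated, now, max_silence_s=180.0):
    """Return instance ids that went silent without announcing termination.

    last_seen:  dict mapping client instance id -> timestamp of last push
    terminated: set of instance ids whose last push had Terminating set
    """
    return sorted(
        instance
        for instance, ts in last_seen.items()
        if now - ts > max_silence_s and instance not in terminated
    )


last_seen = {"a": 0.0, "b": 0.0, "c": 950.0}
terminated = {"b"}  # "b" sent a final push with Terminating = True

alarms = missing_data_alarms(last_seen, terminated, now=1000.0)
```

Here "a" and "b" have both been silent for too long, but only "a" raises an alarm; "c" is still reporting. Since the flag may arrive via any broker or not at all, a backend would presumably still want a timeout as a fallback.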

> > Makes sense, in the updated proposal above I changed ContentType to a
> > bitmask.
> >
>
> Hmm. You might have forgotten to save your changes. It's still listed as a
> string in the KIP.
>

I only updated the proposal in this thread; I will update the KIP as we get
closer to agreement. :)

> > > Hmm, that data is about 10 fields, most of which are strings. It
> certainly
> > > adds a lot of overhead to resend it each time.
> > >
> > > I don't follow the comment about unpacking and repacking -- since the
> > > client registered with the broker it already knows all this
> information, so
> > > there's nothing to unpack or repack, except from memory. If it's more
> > > convenient to serialize it once rather than multiple times, that is an
> > > implementation detail of the broker side plugin, which we are not
> > > specifying here anyway.
> > >
> >
> > The current proposal is pretty much stateless on the broker, it does not
> > need to hold any state for a client (instance), and no state
> > synchronization is needed
> > between brokers in the cluster, which allows a client to seamlessly send
> > metrics to any broker it wants and keeps the API overhead down (no need
> to
> > re-register when
> > switching brokers for instance).
> >
> > We could remove the labels that are already available to the broker on a
> > per-request basis or that it already maintains state for:
> >  - client_id
> >  - client_instance_id
> >  - client_software_*
> >
> > Leaving the following to still be included:
> >  - group_id
> >  - group_instance_id
> >  - transactional_id
> >   etc..
> >
> > What do you think of that?
>
> Yes, that might be a reasonable balance. I don't think we currently
> associate group_id with a connection, so it would be reasonable to re-send
> it in the telemetry request.
>
> What I would suggest next is to set up a protocol definition for the
> metrics you are sending over the wire. For example, you could specify
> something like this for producer metrics:
>
> > { "name": "clientProducerRecordBytes", "type": "int64",
> > "versions": "0+", "taggedVersions": "0+", "tag": 0,
> >  "about": "client.producer.record.bytes" },
> > { "name": "clientProducerQueueMaxMessages", "type": "int32",
> > "versions": "0+", "taggedVersions": "0+", "tag": 1,
> > "about": "client.producer.queue.max.messages" },
>
> This specifies what it is (int32, int64, etc.), how it's being sent, etc.
>

There's an extensive list of standard metrics+types in the KIP, but these
will be using the OpenTelemetry format
so there is no point in having them described as Kafka protocol fields.

Thanks for all your valuable input, Colin!

Regards,
Magnus

>
>

Re: [DISCUSS] KIP-714: Client metrics and observability

Posted by Feng Min <fm...@confluent.io.INVALID>.

Thanks Magnus & Colin for the discussion.

Based on KIP-714's stateless design, a client can pretty much use any
connection to any broker to send metrics. We are not associating a
connection with client metric state. Is my understanding correct? If yes,
how about the following two scenarios:

1) One client (Client-ID) registers two different client instance ids via
separate registrations. Is that permitted? If so, how do we distinguish it
from case 2 below?

2) How about a client restarting? What's the expectation? Should the
server expect the client to carry a persisted client instance id, or should
the client be treated as a new instance?

Also, some comments inline.

On Mon, Sep 20, 2021 at 11:41 AM Colin McCabe <cm...@apache.org> wrote:

> On Tue, Sep 14, 2021, at 00:47, Magnus Edenhill wrote:
> > Thanks for your feedback Colin, see my updated proposal below.
> > ...
>
> Hi Magnus,
>
> Thanks for the update.
>
> >
> > Splitting up the API into separate data and control requests makes sense.
> > With a split we would have one API for querying the broker for configured
> > metrics subscriptions,
> > and one API for pushing the collected metrics to the broker.
> >
> > A mechanism is still needed to notify the client when the subscription is
> > changed;
> > I’ve added a SubscriptionId for this purpose (which could be a checksum
> of
> > the configured metrics subscription), this id is sent to the client along
> > with the metrics subscription, and the client sends it back to the broker
> > when pushing metrics. If the broker finds the pushed subscription id to
> > differ from what is expected it will return an error to the client, which
> > triggers the client to retrieve the new subscribed metrics and an updated
> > subscription id. The generation of the subscriptionId is opaque to the
> > client.
> >
>
> Hmm, SubscriptionId seems rather complex. We don't have this kind of
> complicated machinery for changing ApiVersions, and that is something that
> can also change over time, and which affects the clients.
>
> Changing the configured metrics should be extremely rare. In this case,
> why don't we just close all connections on the broker side? Then the
> clients can re-connect and re-fetch the information about the metrics
> they're supposed to send.
>
> >
> > Something like this:
> >
> > // Get the configured metrics subscription.
> > GetTelemetrySubscriptionsRequest {
> >    StrNull  ClientInstanceId  // Null on first invocation to retrieve a
> > newly generated instance id from the broker.
> > }
>
+1 on RegisterClient or RegisterMetricClient


> It seems like the goal here is to have the client register itself, so that
> we can tell if this is an old client reconnecting. If that is the case,
> then I suggest to rename the RPC to RegisterClient.
>
> I think we need a better name than "clientInstanceId" since that name is
> very similar to "clientId." Perhaps something like originId? Or clientUuid?
> Let's also use UUID here rather than a string.
>
> > 6. > PushTelemetryRequest{
> >        ClientInstanceId = f00d-feed-deff-ceff-ffff-….,
> >        SubscriptionId = 0x234adf34,
> >        ContentType = OTLPv8|ZSTD,
> >        Terminating = False,
> >        Metrics = …// zstd-compressed OTLPv08-protobuf-serialized metrics
> >   }
>
>
If we assume a connection is not bound to a ClientInstanceId, and the RPC
can be sent to any broker (not necessarily the broker that handled the
registration), then the client-instance-id is required in every metrics
report. It's just part of the labelling.


> It's not necessary for the client to re-send its client instance ID here,
> since it already registered with RegisterClient. If the TCP connection
> dropped, it will have to re-send RegisterClient anyway. SubscriptionID we
> should get rid of, as I said above.
>
> I don't see the need for protobufs. Why not just use Kafka's own
> serialization mechanism? As much as possible, we should try to avoid
> creating "turduckens" of protocol X containing a buffer serialized with
> protocol Y, containing a protocol serialized with protocol Z. These aren't
> conducive to a good implementation, and make it harder for people to write
> clients. Just use Kafka's RPC protocol (with optional fields if you wish).
>
> If we do compression on Kafka RPC, I would prefer that we do it a more
> generic way that applies to all control messages, not just this one. I also
> doubt we need to support lots and lots of different compression codecs, at
> first at least.
>
> Another thing I'd like to understand is whether we truly need
> "terminating" (or anything like it). I'm still confused about how the
> backend could use this. Keep in mind that we may receive it on multiple
> brokers (or not receive it at all). We may receive more stuff about client
> XYZ from broker 1 after we have already received a "terminated" for client
> XYZ from broker 2.
>
> > If the broker connection goes down or the connection is to be used for
> > other purposes (e.g., blocking FetchRequests), the client will send
> > PushTelemetryRequests to any other broker in the cluster, using the same
> > ClientInstanceId and SubscriptionId as received in the latest
> > GetTelemetrySubscriptionsResponse.
> >
> > While the subscriptionId may change during the lifetime of the client
> > instance (when metric subscriptions are updated), the ClientInstanceId is
> > only acquired once and must not change (as it is used to identify the
> > unique client instance).
> > ...
> > What we do want though is ability to single out a specific client
> instance
> > to give it a more fine-grained subscription for troubleshooting, and
> > we can do that with the current proposal with matching solely on the
> > CLIENT_INSTANCE_ID.
> > In other words; all clients will have the same standard metrics
> > subscription, but specific client instances can have alternate
> > subscriptions.
>
> That makes sense, and gives a good reason why we might want to couple
> finding the metrics info to passing the client UUID.
>
> > The metrics collector/tsdb/whatever will need to identify a single client
> > instance, regardless of which broker received the metrics.
> > The chapter on CLIENT_INSTANCE_ID motivates why we need a unique
> > identifier, basically because neither clientID, principal or remote
> > address:port, etc, can be
> > used to identify a single client instance.
> >
>
> Thanks for the background. I agree that a UUID is useful here.
>
> This also gives additional reasons why using a UUID rather than a
> free-form string is desirable (many databases can handle numbers more
> efficiently than strings).
>
> > > Unfortunately, we have to strictly control the metrics format, because
> > > otherwise clients can't implement it. I agree that we don't need to
> specify
> > > how the broker-side code works, since that is pluggable. It's also
> > > reasonable for the clients to have pluggable extensions as well, but
> this
> > > KIP won't be of much use if we don't at least define a basic set of
> metrics
> > > that most clients can understand how to send. The open source clients
> will
> > > not implement anything more than what is specified in the KIP (or at
> least
> > > the AK one won't...)
> > >
> >
> > Makes sense, in the updated proposal above I changed ContentType to a
> > bitmask.
> >
>
> Hmm. You might have forgotten to save your changes. It's still listed as a
> string in the KIP.
>
> > > I'm not sure if OpenCensus adds any value to this KIP, to be honest.
> Their
> > > primary focus was never on the format of the data being sent (in fact,
> the
> > > last time they checked, they left the format up to each OpenCensus
> > > implementation). That may have changed, but I think it still has
> limited
> > > usefulness to us, since we have our own format which we have to use
> anyway.
> > >
> >
> > Oh, I meant concensus as in kafka-dev agreement :)
> >
> > Feng is looking into the implementation details of the Java client and
> will
> > update the KIP with regards to dependencies.
>
> I still don't understand what value using the OpenTelemetry format is
> giving us here, as opposed to using Kafka's own format. I guess we will
> need to talk more about this.
>
> > > Hmm, that data is about 10 fields, most of which are strings. It
> certainly
> > > adds a lot of overhead to resend it each time.
> > >
> > > I don't follow the comment about unpacking and repacking -- since the
> > > client registered with the broker it already knows all this
> information, so
> > > there's nothing to unpack or repack, except from memory. If it's more
> > > convenient to serialize it once rather than multiple times, that is an
> > > implementation detail of the broker side plugin, which we are not
> > > specifying here anyway.
> > >
> >
> > The current proposal is pretty much stateless on the broker, it does not
> > need to hold any state for a client (instance), and no state
> > synchronization is needed
> > between brokers in the cluster, which allows a client to seamlessly send
> > metrics to any broker it wants and keeps the API overhead down (no need
> to
> > re-register when
> > switching brokers for instance).
> >
> > We could remove the labels that are already available to the broker on a
> > per-request basis or that it already maintains state for:
> >  - client_id
> >  - client_instance_id
> >  - client_software_*
> >
> > Leaving the following to still be included:
> >  - group_id
> >  - group_instance_id
> >  - transactional_id
> >   etc..
> >
> > What do you think of that?
>
> Yes, that might be a reasonable balance. I don't think we currently
> associate group_id with a connection, so it would be reasonable to re-send
> it in the telemetry request.
>
> What I would suggest next is to set up a protocol definition for the
> metrics you are sending over the wire. For example, you could specify
> something like this for producer metrics:
>
> > { "name": "clientProducerRecordBytes", "type": "int64",
> > "versions": "0+", "taggedVersions": "0+", "tag": 0,
> >  "about": "client.producer.record.bytes" },
> > { "name": "clientProducerQueueMaxMessages", "type": "int32",
> > "versions": "0+", "taggedVersions": "0+", "tag": 1,
> > "about": "client.producer.queue.max.messages" },
>
> This specifies what it is (int32, int64, etc.), how it's being sent, etc.
>
> best,
> Colin
>
> >
> >
> > Thanks,
> > Magnus
> >
> >
> >
> > >
> > > best,
> > > Colin
> > >
> > > > Thanks,
> > > > Magnus
> > > >
> > > > >
> > > > >
> > > >
> > >
> >
>


-- 
Best,
Feng

Re: [DISCUSS] KIP-714: Client metrics and observability

Posted by Feng Min <fm...@confluent.io.INVALID>.

Hi Colin,

It was just an analogy: the API version plays a role similar to the
subscription ID. Every request carries its API version, and the broker can
return an error if the supported API version has changed. The subscription
ID would play the same role here.

Thanks,
Feng



On Mon, Sep 20, 2021 at 9:51 PM Colin McCabe <cm...@apache.org> wrote:

> On Mon, Sep 20, 2021, at 12:30, Feng Min wrote:
> > Some comments about subscriptionId.
> >
> > ApiVersion is not a good example. API Version here is actually acting
> like
> > an identifier as the client will carry this information. Forcing to
> > disconnect a connection from the server side is quite heavy. IMHO, the
> > behavior is kind of part of the protocol. Adding subscriptionId is
> > relatively simple and straightforward.
> >
>
> Hi Feng,
>
> Sorry, I'm not sure what you mean by "API Version here is actually acting
> like an identifier." APIVersions is not an identifier. Each client gets the
> same ApiVersionsResponse from the broker. In most clusters, each broker
> will return the same set of ApiVersionsResponse as well. So you can not use
> ApiVersionsResponse as an identifier of anything, as far as I can see.
>
> Dropping a connection is not that "heavy" considering that it only has to
> happen when we change the metrics subscription, which should be a very rare
> event, if I understand the proposal correctly.
>
> best,
> Colin
>
>
> >
> >
> >> Hmm, SubscriptionId seems rather complex. We don't have this kind of
> >> complicated machinery for changing ApiVersions, and that is something
> that
> >> can also change over time, and which affects the clients.
> >>
> >> Changing the configured metrics should be extremely rare. In this case,
> >> why don't we just close all connections on the broker side? Then the
> >> clients can re-connect and re-fetch the information about the metrics
> >> they're supposed to send.
> >>
> >> >
> >> > Something like this:
> >> >
> >> > // Get the configured metrics subscription.
> >> > GetTelemetrySubscriptionsRequest {
> >> >    StrNull  ClientInstanceId  // Null on first invocation to retrieve
> a
> >> > newly generated instance id from the broker.
> >> > }
> >>
> >> It seems like the goal here is to have the client register itself, so
> that
> >> we can tell if this is an old client reconnecting. If that is the case,
> >> then I suggest to rename the RPC to RegisterClient.
> >>
> >> I think we need a better name than "clientInstanceId" since that name is
> >> very similar to "clientId." Perhaps something like originId? Or
> clientUuid?
> >> Let's also use UUID here rather than a string.
> >>
> >> > 6. > PushTelemetryRequest{
> >> >        ClientInstanceId = f00d-feed-deff-ceff-ffff-….,
> >> >        SubscriptionId = 0x234adf34,
> >> >        ContentType = OTLPv8|ZSTD,
> >> >        Terminating = False,
> >> >        Metrics = …// zstd-compressed OTLPv08-protobuf-serialized
> metrics
> >> >   }
> >>
> >> It's not necessary for the client to re-send its client instance ID
> here,
> >> since it already registered with RegisterClient. If the TCP connection
> >> dropped, it will have to re-send RegisterClient anyway. SubscriptionID
> we
> >> should get rid of, as I said above.
> >>
> >> I don't see the need for protobufs. Why not just use Kafka's own
> >> serialization mechanism? As much as possible, we should try to avoid
> >> creating "turduckens" of protocol X containing a buffer serialized with
> >> protocol Y, containing a protocol serialized with protocol Z. These
> aren't
> >> conducive to a good implementation, and make it harder for people to
> write
> >> clients. Just use Kafka's RPC protocol (with optional fields if you
> wish).
> >>
> >> If we do compression on Kafka RPC, I would prefer that we do it a more
> >> generic way that applies to all control messages, not just this one. I
> also
> >> doubt we need to support lots and lots of different compression codecs,
> at
> >> first at least.
> >>
> >> Another thing I'd like to understand is whether we truly need
> >> "terminating" (or anything like it). I'm still confused about how the
> >> backend could use this. Keep in mind that we may receive it on multiple
> >> brokers (or not receive it at all). We may receive more stuff about
> client
> >> XYZ from broker 1 after we have already received a "terminated" for
> client
> >> XYZ from broker 2.
> >>
> >> > If the broker connection goes down or the connection is to be used for
> >> > other purposes (e.g., blocking FetchRequests), the client will send
> >> > PushTelemetryRequests to any other broker in the cluster, using the
> same
> >> > ClientInstanceId and SubscriptionId as received in the latest
> >> > GetTelemetrySubscriptionsResponse.
> >> >
> >> > While the subscriptionId may change during the lifetime of the client
> >> > instance (when metric subscriptions are updated), the
> ClientInstanceId is
> >> > only acquired once and must not change (as it is used to identify the
> >> > unique client instance).
> >> > ...
> >> > What we do want though is ability to single out a specific client
> >> instance
> >> > to give it a more fine-grained subscription for troubleshooting, and
> >> > we can do that with the current proposal with matching solely on the
> >> > CLIENT_INSTANCE_ID.
> >> > In other words; all clients will have the same standard metrics
> >> > subscription, but specific client instances can have alternate
> >> > subscriptions.
> >>
> >> That makes sense, and gives a good reason why we might want to couple
> >> finding the metrics info to passing the client UUID.
> >>
> >> > The metrics collector/tsdb/whatever will need to identify a single
> client
> >> > instance, regardless of which broker received the metrics.
> >> > The chapter on CLIENT_INSTANCE_ID motivates why we need a unique
> >> > identifier, basically because neither clientID, principal or remote
> >> > address:port, etc, can be
> >> > used to identify a single client instance.
> >> >
> >>
> >> Thanks for the background. I agree that a UUID is useful here.
> >>
> >> This also gives additional reasons why using a UUID rather than a
> >> free-form string is desirable (many databases can handle numbers more
> >> efficiently than strings).
> >>
> >> > > Unfortunately, we have to strictly control the metrics format,
> because
> >> > > otherwise clients can't implement it. I agree that we don't need to
> >> specify
> >> > > how the broker-side code works, since that is pluggable. It's also
> >> > > reasonable for the clients to have pluggable extensions as well, but
> >> this
> >> > > KIP won't be of much use if we don't at least define a basic set of
> >> metrics
> >> > > that most clients can understand how to send. The open source
> clients
> >> will
> >> > > not implement anything more than what is specified in the KIP (or at
> >> least
> >> > > the AK one won't...)
> >> > >
> >> >
> >> > Makes sense, in the updated proposal above I changed ContentType to a
> >> > bitmask.
> >> >
> >>
> >> Hmm. You might have forgotten to save your changes. It's still listed
> as a
> >> string in the KIP.
> >>
> >> > > I'm not sure if OpenCensus adds any value to this KIP, to be honest.
> >> Their
> >> > > primary focus was never on the format of the data being sent (in
> fact,
> >> the
> >> > > last time they checked, they left the format up to each OpenCensus
> >> > > implementation). That may have changed, but I think it still has
> >> limited
> >> > > usefulness to us, since we have our own format which we have to use
> >> anyway.
> >> > >
> >> >
> >> > Oh, I meant concensus as in kafka-dev agreement :)
> >> >
> >> > Feng is looking into the implementation details of the Java client and
> >> will
> >> > update the KIP with regards to dependencies.
> >>
> >> I still don't understand what value using the OpenTelemetry format is
> >> giving us here, as opposed to using Kafka's own format. I guess we will
> >> need to talk more about this.
> >>
> >> > > Hmm, that data is about 10 fields, most of which are strings. It
> >> certainly
> >> > > adds a lot of overhead to resend it each time.
> >> > >
> >> > > I don't follow the comment about unpacking and repacking -- since
> the
> >> > > client registered with the broker it already knows all this
> >> information, so
> >> > > there's nothing to unpack or repack, except from memory. If it's
> more
> >> > > convenient to serialize it once rather than multiple times, that is
> an
> >> > > implementation detail of the broker side plugin, which we are not
> >> > > specifying here anyway.
> >> > >
> >> >
> >> > The current proposal is pretty much stateless on the broker, it does
> not
> >> > need to hold any state for a client (instance), and no state
> >> > synchronization is needed
> >> > between brokers in the cluster, which allows a client to seamlessly
> send
> >> > metrics to any broker it wants and keeps the API overhead down (no
> need
> >> to
> >> > re-register when
> >> > switching brokers for instance).
> >> >
> >> > We could remove the labels that are already available to the broker
> on a
> >> > per-request basis or that it already maintains state for:
> >> >  - client_id
> >> >  - client_instance_id
> >> >  - client_software_*
> >> >
> >> > Leaving the following to still be included:
> >> >  - group_id
> >> >  - group_instance_id
> >> >  - transactional_id
> >> >   etc..
> >> >
> >> > What do you think of that?
> >>
> >> Yes, that might be a reasonable balance. I don't think we currently
> >> associate group_id with a connection, so it would be reasonable to
> re-send
> >> it in the telemetry request.
> >>
> >> What I would suggest next is to set up a protocol definition for the
> >> metrics you are sending over the wire. For example, you could specify
> >> something like this for producer metrics:
> >>
> >> > { "name": "clientProducerRecordBytes", "type": "int64",
> >> > "versions": "0+", "taggedVersions": "0+", "tag": 0,
> >> >  "about": "client.producer.record.bytes" },
> >> > { "name": "clientProducerQueueMaxMessages", "type": "int32",
> >> > "versions": "0+", "taggedVersions": "0+", "tag": 1,
> >> > "about": "client.producer.queue.max.messages" },
> >>
> >> This specifies what it is (int32, int64, etc.), how it's being sent,
> etc.
> >>
> >> best,
> >> Colin
> >>
> >> >
> >> >
> >> > Thanks,
> >> > Magnus
> >> >
> >> >
> >> >
> >> > >
> >> > > best,
> >> > > Colin
> >> > >
> >> > > > Thanks,
> >> > > > Magnus
> >> > > >
> >> > > > >
> >> > > > >
> >> > > >
> >> > >
> >> >
> >>
> >
> >
> > --
> > Best,
> > Feng
>
-- 
Best,
Feng

Re: [DISCUSS] KIP-714: Client metrics and observability

Posted by Colin McCabe <cm...@apache.org>.

On Mon, Sep 20, 2021, at 12:30, Feng Min wrote:
> Some comments about subscriptionId.
>
> ApiVersion is not a good example. API Version here is actually acting like
> an identifier as the client will carry this information. Forcing to
> disconnect a connection from the server side is quite heavy. IMHO, the
> behavior is kind of part of the protocol. Adding subscriptionId is
> relatively simple and straightforward.
>

Hi Feng,

Sorry, I'm not sure what you mean by "API Version here is actually acting like an identifier." ApiVersions is not an identifier. Each client gets the same ApiVersionsResponse from the broker. In most clusters, each broker will return the same ApiVersionsResponse as well. So you cannot use ApiVersionsResponse as an identifier of anything, as far as I can see.

Dropping a connection is not that "heavy" considering that it only has to happen when we change the metrics subscription, which should be a very rare event, if I understand the proposal correctly.

best,
Colin


>
>
>> Hmm, SubscriptionId seems rather complex. We don't have this kind of
>> complicated machinery for changing ApiVersions, and that is something that
>> can also change over time, and which affects the clients.
>>
>> Changing the configured metrics should be extremely rare. In this case,
>> why don't we just close all connections on the broker side? Then the
>> clients can re-connect and re-fetch the information about the metrics
>> they're supposed to send.
>>
>> >
>> > Something like this:
>> >
>> > // Get the configured metrics subscription.
>> > GetTelemetrySubscriptionsRequest {
>> >    StrNull  ClientInstanceId  // Null on first invocation to retrieve a
>> > newly generated instance id from the broker.
>> > }
>>
>> It seems like the goal here is to have the client register itself, so that
>> we can tell if this is an old client reconnecting. If that is the case,
>> then I suggest to rename the RPC to RegisterClient.
>>
>> I think we need a better name than "clientInstanceId" since that name is
>> very similar to "clientId." Perhaps something like originId? Or clientUuid?
>> Let's also use UUID here rather than a string.
>>
>> > 6. > PushTelemetryRequest{
>> >        ClientInstanceId = f00d-feed-deff-ceff-ffff-….,
>> >        SubscriptionId = 0x234adf34,
>> >        ContentType = OTLPv8|ZSTD,
>> >        Terminating = False,
>> >        Metrics = …// zstd-compressed OTLPv08-protobuf-serialized metrics
>> >   }
>>
>> It's not necessary for the client to re-send its client instance ID here,
>> since it already registered with RegisterClient. If the TCP connection
>> dropped, it will have to re-send RegisterClient anyway. SubscriptionID we
>> should get rid of, as I said above.
>>
>> I don't see the need for protobufs. Why not just use Kafka's own
>> serialization mechanism? As much as possible, we should try to avoid
>> creating "turduckens" of protocol X containing a buffer serialized with
>> protocol Y, containing a protocol serialized with protocol Z. These aren't
>> conducive to a good implementation, and make it harder for people to write
>> clients. Just use Kafka's RPC protocol (with optional fields if you wish).
>>
>> If we do compression on Kafka RPC, I would prefer that we do it a more
>> generic way that applies to all control messages, not just this one. I also
>> doubt we need to support lots and lots of different compression codecs, at
>> first at least.
>>
>> Another thing I'd like to understand is whether we truly need
>> "terminating" (or anything like it). I'm still confused about how the
>> backend could use this. Keep in mind that we may receive it on multiple
>> brokers (or not receive it at all). We may receive more stuff about client
>> XYZ from broker 1 after we have already received a "terminated" for client
>> XYZ from broker 2.
>>
>> > If the broker connection goes down or the connection is to be used for
>> > other purposes (e.g., blocking FetchRequests), the client will send
>> > PushTelemetryRequests to any other broker in the cluster, using the same
>> > ClientInstanceId and SubscriptionId as received in the latest
>> > GetTelemetrySubscriptionsResponse.
>> >
>> > While the subscriptionId may change during the lifetime of the client
>> > instance (when metric subscriptions are updated), the ClientInstanceId is
>> > only acquired once and must not change (as it is used to identify the
>> > unique client instance).
>> > ...
>> > What we do want though is ability to single out a specific client
>> instance
>> > to give it a more fine-grained subscription for troubleshooting, and
>> > we can do that with the current proposal with matching solely on the
>> > CLIENT_INSTANCE_ID.
>> > In other words; all clients will have the same standard metrics
>> > subscription, but specific client instances can have alternate
>> > subscriptions.
>>
>> That makes sense, and gives a good reason why we might want to couple
>> finding the metrics info to passing the client UUID.
>>
>> > The metrics collector/tsdb/whatever will need to identify a single client
>> > instance, regardless of which broker received the metrics.
>> > The chapter on CLIENT_INSTANCE_ID motivates why we need a unique
>> > identifier, basically because neither clientID, principal or remote
>> > address:port, etc, can be
>> > used to identify a single client instance.
>> >
>>
>> Thanks for the background. I agree that a UUID is useful here.
>>
>> This also gives additional reasons why using a UUID rather than a
>> free-form string is desirable (many databases can handle numbers more
>> efficiently than strings).
>>
>> > > Unfortunately, we have to strictly control the metrics format, because
>> > > otherwise clients can't implement it. I agree that we don't need to
>> specify
>> > > how the broker-side code works, since that is pluggable. It's also
>> > > reasonable for the clients to have pluggable extensions as well, but
>> this
>> > > KIP won't be of much use if we don't at least define a basic set of
>> metrics
>> > > that most clients can understand how to send. The open source clients
>> will
>> > > not implement anything more than what is specified in the KIP (or at
>> least
>> > > the AK one won't...)
>> > >
>> >
>> > Makes sense, in the updated proposal above I changed ContentType to a
>> > bitmask.
>> >
>>
>> Hmm. You might have forgotten to save your changes. It's still listed as a
>> string in the KIP.
>>
>> > > I'm not sure if OpenCensus adds any value to this KIP, to be honest.
>> Their
>> > > primary focus was never on the format of the data being sent (in fact,
>> the
>> > > last time they checked, they left the format up to each OpenCensus
>> > > implementation). That may have changed, but I think it still has
>> limited
>> > > usefulness to us, since we have our own format which we have to use
>> anyway.
>> > >
>> >
>> > Oh, I meant concensus as in kafka-dev agreement :)
>> >
>> > Feng is looking into the implementation details of the Java client and
>> will
>> > update the KIP with regards to dependencies.
>>
>> I still don't understand what value using the OpenTelemetry format is
>> giving us here, as opposed to using Kafka's own format. I guess we will
>> need to talk more about this.
>>
>> > > Hmm, that data is about 10 fields, most of which are strings. It
>> certainly
>> > > adds a lot of overhead to resend it each time.
>> > >
>> > > I don't follow the comment about unpacking and repacking -- since the
>> > > client registered with the broker it already knows all this
>> information, so
>> > > there's nothing to unpack or repack, except from memory. If it's more
>> > > convenient to serialize it once rather than multiple times, that is an
>> > > implementation detail of the broker side plugin, which we are not
>> > > specifying here anyway.
>> > >
>> >
>> > The current proposal is pretty much stateless on the broker, it does not
>> > need to hold any state for a client (instance), and no state
>> > synchronization is needed
>> > between brokers in the cluster, which allows a client to seamlessly send
>> > metrics to any broker it wants and keeps the API overhead down (no need
>> to
>> > re-register when
>> > switching brokers for instance).
>> >
>> > We could remove the labels that are already available to the broker on a
>> > per-request basis or that it already maintains state for:
>> >  - client_id
>> >  - client_instance_id
>> >  - client_software_*
>> >
>> > Leaving the following to still be included:
>> >  - group_id
>> >  - group_instance_id
>> >  - transactional_id
>> >   etc..
>> >
>> > What do you think of that?
>>
>> Yes, that might be a reasonable balance. I don't think we currently
>> associate group_id with a connection, so it would be reasonable to re-send
>> it in the telemetry request.
>>
>> What I would suggest next is to set up a protocol definition for the
>> metrics you are sending over the wire. For example, you could specify
>> something like this for producer metrics:
>>
>> > { "name": "clientProducerRecordBytes", "type": "int64",
>> > "versions": "0+", "taggedVersions": "0+", "tag": 0,
>> >  "about": "client.producer.record.bytes" },
>> > { "name": "clientProducerQueueMaxMessages", "type": "int32",
>> > "versions": "0+", "taggedVersions": "0+", "tag": 1,
>> > "about": "client.producer.queue.max.messages" },
>>
>> This specifies what it is (int32, int64, etc.), how it's being sent, etc.
>>
>> best,
>> Colin
>>
>> >
>> >
>> > Thanks,
>> > Magnus
>> >
>> >
>> >
>> > >
>> > > best,
>> > > Colin
>> > >
>> > > > Thanks,
>> > > > Magnus
>> > > >
>> > > > >
>> > > > >
>> > > >
>> > >
>> >
>>
>
>
> -- 
> Best,
> Feng

Re: [DISCUSS] KIP-714: Client metrics and observability

Posted by Feng Min <fm...@confluent.io.INVALID>.

Some comments about subscriptionId.

On Mon, Sep 20, 2021 at 11:41 AM Colin McCabe <cm...@apache.org> wrote:

> On Tue, Sep 14, 2021, at 00:47, Magnus Edenhill wrote:
> > Thanks for your feedback Colin, see my updated proposal below.
> > ...
>
> Hi Magnus,
>
> Thanks for the update.
>
> >
> > Splitting up the API into separate data and control requests makes sense.
> > With a split we would have one API for querying the broker for configured
> > metrics subscriptions,
> > and one API for pushing the collected metrics to the broker.
> >
> > A mechanism is still needed to notify the client when the subscription is
> > changed;
> > I’ve added a SubscriptionId for this purpose (which could be a checksum
> of
> > the configured metrics subscription), this id is sent to the client along
> > with the metrics subscription, and the client sends it back to the broker
> > when pushing metrics. If the broker finds the pushed subscription id to
> > differ from what is expected it will return an error to the client, which
> > triggers the client to retrieve the new subscribed metrics and an updated
> > subscription id. The generation of the subscriptionId is opaque to the
> > client.
> >
>
>
ApiVersion is not a good example. The API version here actually acts like
an identifier, since the client carries this information. Forcing a
disconnect from the server side is quite heavy. IMHO, the behavior is kind
of part of the protocol. Adding a subscriptionId is relatively simple and
straightforward.



> Hmm, SubscriptionId seems rather complex. We don't have this kind of
> complicated machinery for changing ApiVersions, and that is something that
> can also change over time, and which affects the clients.
>
> Changing the configured metrics should be extremely rare. In this case,
> why don't we just close all connections on the broker side? Then the
> clients can re-connect and re-fetch the information about the metrics
> they're supposed to send.
>
> >
> > Something like this:
> >
> > // Get the configured metrics subscription.
> > GetTelemetrySubscriptionsRequest {
> >    StrNull  ClientInstanceId  // Null on first invocation to retrieve a
> > newly generated instance id from the broker.
> > }
>
> It seems like the goal here is to have the client register itself, so that
> we can tell if this is an old client reconnecting. If that is the case,
> then I suggest to rename the RPC to RegisterClient.
>
> I think we need a better name than "clientInstanceId" since that name is
> very similar to "clientId." Perhaps something like originId? Or clientUuid?
> Let's also use UUID here rather than a string.
>
> > 6. > PushTelemetryRequest{
> >        ClientInstanceId = f00d-feed-deff-ceff-ffff-….,
> >        SubscriptionId = 0x234adf34,
> >        ContentType = OTLPv8|ZSTD,
> >        Terminating = False,
> >        Metrics = …// zstd-compressed OTLPv08-protobuf-serialized metrics
> >   }
>
> It's not necessary for the client to re-send its client instance ID here,
> since it already registered with RegisterClient. If the TCP connection
> dropped, it will have to re-send RegisterClient anyway. SubscriptionID we
> should get rid of, as I said above.
>
> I don't see the need for protobufs. Why not just use Kafka's own
> serialization mechanism? As much as possible, we should try to avoid
> creating "turduckens" of protocol X containing a buffer serialized with
> protocol Y, containing a protocol serialized with protocol Z. These aren't
> conducive to a good implementation, and make it harder for people to write
> clients. Just use Kafka's RPC protocol (with optional fields if you wish).
>
> If we do compression on Kafka RPC, I would prefer that we do it a more
> generic way that applies to all control messages, not just this one. I also
> doubt we need to support lots and lots of different compression codecs, at
> first at least.
>
> Another thing I'd like to understand is whether we truly need
> "terminating" (or anything like it). I'm still confused about how the
> backend could use this. Keep in mind that we may receive it on multiple
> brokers (or not receive it at all). We may receive more stuff about client
> XYZ from broker 1 after we have already received a "terminated" for client
> XYZ from broker 2.
>
> > If the broker connection goes down or the connection is to be used for
> > other purposes (e.g., blocking FetchRequests), the client will send
> > PushTelemetryRequests to any other broker in the cluster, using the same
> > ClientInstanceId and SubscriptionId as received in the latest
> > GetTelemetrySubscriptionsResponse.
> >
> > While the subscriptionId may change during the lifetime of the client
> > instance (when metric subscriptions are updated), the ClientInstanceId is
> > only acquired once and must not change (as it is used to identify the
> > unique client instance).
> > ...
> > What we do want though is ability to single out a specific client
> instance
> > to give it a more fine-grained subscription for troubleshooting, and
> > we can do that with the current proposal with matching solely on the
> > CLIENT_INSTANCE_ID.
> > In other words; all clients will have the same standard metrics
> > subscription, but specific client instances can have alternate
> > subscriptions.
>
> That makes sense, and gives a good reason why we might want to couple
> finding the metrics info to passing the client UUID.
>
> > The metrics collector/tsdb/whatever will need to identify a single client
> > instance, regardless of which broker received the metrics.
> > The chapter on CLIENT_INSTANCE_ID motivates why we need a unique
> > identifier, basically because neither clientID, principal or remote
> > address:port, etc, can be
> > used to identify a single client instance.
> >
>
> Thanks for the background. I agree that a UUID is useful here.
>
> This also gives additional reasons why using a UUID rather than a
> free-form string is desirable (many databases can handle numbers more
> efficiently than strings).
>
> > > Unfortunately, we have to strictly control the metrics format, because
> > > otherwise clients can't implement it. I agree that we don't need to
> specify
> > > how the broker-side code works, since that is pluggable. It's also
> > > reasonable for the clients to have pluggable extensions as well, but
> this
> > > KIP won't be of much use if we don't at least define a basic set of
> metrics
> > > that most clients can understand how to send. The open source clients
> will
> > > not implement anything more than what is specified in the KIP (or at
> least
> > > the AK one won't...)
> > >
> >
> > Makes sense, in the updated proposal above I changed ContentType to a
> > bitmask.
> >
>
> Hmm. You might have forgotten to save your changes. It's still listed as a
> string in the KIP.
>
> > > I'm not sure if OpenCensus adds any value to this KIP, to be honest.
> Their
> > > primary focus was never on the format of the data being sent (in fact,
> the
> > > last time they checked, they left the format up to each OpenCensus
> > > implementation). That may have changed, but I think it still has
> limited
> > > usefulness to us, since we have our own format which we have to use
> anyway.
> > >
> >
> > Oh, I meant concensus as in kafka-dev agreement :)
> >
> > Feng is looking into the implementation details of the Java client and
> will
> > update the KIP with regards to dependencies.
>
> I still don't understand what value using the OpenTelemetry format is
> giving us here, as opposed to using Kafka's own format. I guess we will
> need to talk more about this.
>
> > > Hmm, that data is about 10 fields, most of which are strings. It
> certainly
> > > adds a lot of overhead to resend it each time.
> > >
> > > I don't follow the comment about unpacking and repacking -- since the
> > > client registered with the broker it already knows all this
> information, so
> > > there's nothing to unpack or repack, except from memory. If it's more
> > > convenient to serialize it once rather than multiple times, that is an
> > > implementation detail of the broker side plugin, which we are not
> > > specifying here anyway.
> > >
> >
> > The current proposal is pretty much stateless on the broker, it does not
> > need to hold any state for a client (instance), and no state
> > synchronization is needed
> > between brokers in the cluster, which allows a client to seamlessly send
> > metrics to any broker it wants and keeps the API overhead down (no need
> to
> > re-register when
> > switching brokers for instance).
> >
> > We could remove the labels that are already available to the broker on a
> > per-request basis or that it already maintains state for:
> >  - client_id
> >  - client_instance_id
> >  - client_software_*
> >
> > Leaving the following to still be included:
> >  - group_id
> >  - group_instance_id
> >  - transactional_id
> >   etc..
> >
> > What do you think of that?
>
> Yes, that might be a reasonable balance. I don't think we currently
> associate group_id with a connection, so it would be reasonable to re-send
> it in the telemetry request.
>
> What I would suggest next is to set up a protocol definition for the
> metrics you are sending over the wire. For example, you could specify
> something like this for producer metrics:
>
> > { "name": "clientProducerRecordBytes", "type": "int64",
> > "versions": "0+", "taggedVersions": "0+", "tag": 0,
> >  "about": "client.producer.record.bytes" },
> > { "name": "clientProducerQueueMaxMessages", "type": "int32",
> > "versions": "0+", "taggedVersions": "0+", "tag": 1,
> > "about": "client.producer.queue.max.messages" },
>
> This specifies what it is (int32, int64, etc.), how it's being sent, etc.
>
> best,
> Colin
>
> >
> >
> > Thanks,
> > Magnus
> >
> >
> >
> > >
> > > best,
> > > Colin
> > >
> > > > Thanks,
> > > > Magnus
> > > >
> > > > >
> > > > >
> > > >
> > >
> >
>


-- 
Best,
Feng

Re: [DISCUSS] KIP-714: Client metrics and observability

Posted by Colin McCabe <cm...@apache.org>.

On Tue, Sep 14, 2021, at 00:47, Magnus Edenhill wrote:
> Thanks for your feedback Colin, see my updated proposal below.
> ...

Hi Magnus,

Thanks for the update.

> 
> Splitting up the API into separate data and control requests makes sense.
> With a split we would have one API for querying the broker for configured
> metrics subscriptions,
> and one API for pushing the collected metrics to the broker.
> 
> A mechanism is still needed to notify the client when the subscription is
> changed;
> I’ve added a SubscriptionId for this purpose (which could be a checksum of
> the configured metrics subscription), this id is sent to the client along
> with the metrics subscription, and the client sends it back to the broker
> when pushing metrics. If the broker finds the pushed subscription id to
> differ from what is expected it will return an error to the client, which
> triggers the client to retrieve the new subscribed metrics and an updated
> subscription id. The generation of the subscriptionId is opaque to the
> client.
>

Hmm, SubscriptionId seems rather complex. We don't have this kind of complicated machinery for changing ApiVersions, and that is something that can also change over time, and which affects the clients.

Changing the configured metrics should be extremely rare. In this case, why don't we just close all connections on the broker side? Then the clients can re-connect and re-fetch the information about the metrics they're supposed to send.

> 
> Something like this:
> 
> // Get the configured metrics subscription.
> GetTelemetrySubscriptionsRequest {
>    StrNull  ClientInstanceId  // Null on first invocation to retrieve a
> newly generated instance id from the broker.
> }

It seems like the goal here is to have the client register itself, so that we can tell if this is an old client reconnecting. If that is the case, then I suggest renaming the RPC to RegisterClient.

I think we need a better name than "clientInstanceId" since that name is very similar to "clientId." Perhaps something like originId? Or clientUuid? Let's also use UUID here rather than a string.

> 6. > PushTelemetryRequest{
>        ClientInstanceId = f00d-feed-deff-ceff-ffff-….,
>        SubscriptionId = 0x234adf34,
>        ContentType = OTLPv08|ZSTD,
>        Terminating = False,
>        Metrics = …// zstd-compressed OTLPv08-protobuf-serialized metrics
>   }

It's not necessary for the client to re-send its client instance ID here, since it already registered with RegisterClient. If the TCP connection dropped, it will have to re-send RegisterClient anyway. SubscriptionID we should get rid of, as I said above.

I don't see the need for protobufs. Why not just use Kafka's own serialization mechanism? As much as possible, we should try to avoid creating "turduckens" of protocol X containing a buffer serialized with protocol Y, containing a protocol serialized with protocol Z. These aren't conducive to a good implementation, and make it harder for people to write clients. Just use Kafka's RPC protocol (with optional fields if you wish).

If we do compression on Kafka RPC, I would prefer that we do it a more generic way that applies to all control messages, not just this one. I also doubt we need to support lots and lots of different compression codecs, at first at least.

Another thing I'd like to understand is whether we truly need "terminating" (or anything like it). I'm still confused about how the backend could use this. Keep in mind that we may receive it on multiple brokers (or not receive it at all). We may receive more stuff about client XYZ from broker 1 after we have already received a "terminated" for client XYZ from broker 2.

> If the broker connection goes down or the connection is to be used for
> other purposes (e.g., blocking FetchRequests), the client will send
> PushTelemetryRequests to any other broker in the cluster, using the same
> ClientInstanceId and SubscriptionId as received in the latest
> GetTelemetrySubscriptionsResponse.
> 
> While the subscriptionId may change during the lifetime of the client
> instance (when metric subscriptions are updated), the ClientInstanceId is
> only acquired once and must not change (as it is used to identify the
> unique client instance).
> ...
> What we do want though is ability to single out a specific client instance
> to give it a more fine-grained subscription for troubleshooting, and
> we can do that with the current proposal with matching solely on the
> CLIENT_INSTANCE_ID.
> In other words; all clients will have the same standard metrics
> subscription, but specific client instances can have alternate
> subscriptions.

That makes sense, and gives a good reason why we might want to couple finding the metrics info to passing the client UUID.

> The metrics collector/tsdb/whatever will need to identify a single client
> instance, regardless of which broker received the metrics.
> The chapter on CLIENT_INSTANCE_ID motivates why we need a unique
> identifier, basically because neither clientID, principal or remote
> address:port, etc, can be
> used to identify a single client instance.
> 

Thanks for the background. I agree that a UUID is useful here.

This also gives additional reasons why using a UUID rather than a free-form string is desirable (many databases can handle numbers more efficiently than strings).
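
To put rough numbers on the size difference, here is a small sketch using Python's standard uuid module. The instance id is generated locally purely for illustration; in the proposal it would come from the broker.

```python
import uuid

# Hypothetical client instance id, generated locally only for illustration.
instance_id = uuid.uuid4()

as_binary = instance_id.bytes                 # fixed-size binary form
as_text = str(instance_id).encode("ascii")    # canonical hex-and-dash text form

print(len(as_binary))  # 16
print(len(as_text))    # 36
```

The text form is more than twice as large before any length prefix is added on the wire.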

> > Unfortunately, we have to strictly control the metrics format, because
> > otherwise clients can't implement it. I agree that we don't need to specify
> > how the broker-side code works, since that is pluggable. It's also
> > reasonable for the clients to have pluggable extensions as well, but this
> > KIP won't be of much use if we don't at least define a basic set of metrics
> > that most clients can understand how to send. The open source clients will
> > not implement anything more than what is specified in the KIP (or at least
> > the AK one won't...)
> >
> 
> Makes sense, in the updated proposal above I changed ContentType to a
> bitmask.
> 

Hmm. You might have forgotten to save your changes. It's still listed as a string in the KIP.

> > I'm not sure if OpenCensus adds any value to this KIP, to be honest. Their
> > primary focus was never on the format of the data being sent (in fact, the
> > last time they checked, they left the format up to each OpenCensus
> > implementation). That may have changed, but I think it still has limited
> > usefulness to us, since we have our own format which we have to use anyway.
> >
> 
> Oh, I meant concensus as in kafka-dev agreement :)
> 
> Feng is looking into the implementation details of the Java client and will
> update the KIP with regards to dependencies.

I still don't understand what value using the OpenTelemetry format is giving us here, as opposed to using Kafka's own format. I guess we will need to talk more about this.

> > Hmm, that data is about 10 fields, most of which are strings. It certainly
> > adds a lot of overhead to resend it each time.
> >
> > I don't follow the comment about unpacking and repacking -- since the
> > client registered with the broker it already knows all this information, so
> > there's nothing to unpack or repack, except from memory. If it's more
> > convenient to serialize it once rather than multiple times, that is an
> > implementation detail of the broker side plugin, which we are not
> > specifying here anyway.
> >
> 
> The current proposal is pretty much stateless on the broker, it does not
> need to hold any state for a client (instance), and no state
> synchronization is needed
> between brokers in the cluster, which allows a client to seamlessly send
> metrics to any broker it wants and keeps the API overhead down (no need to
> re-register when
> switching brokers for instance).
> 
> We could remove the labels that are already available to the broker on a
> per-request basis or that it already maintains state for:
>  - client_id
>  - client_instance_id
>  - client_software_*
> 
> Leaving the following to still be included:
>  - group_id
>  - group_instance_id
>  - transactional_id
>   etc..
> 
> What do you think of that?

Yes, that might be a reasonable balance. I don't think we currently associate group_id with a connection, so it would be reasonable to re-send it in the telemetry request.

What I would suggest next is to set up a protocol definition for the metrics you are sending over the wire. For example, you could specify something like this for producer metrics:

> { "name": "clientProducerRecordBytes", "type": "int64",
> "versions": "0+", "taggedVersions": "0+", "tag": 0,
>  "about": "client.producer.record.bytes" },
> { "name": "clientProducerQueueMaxMessages", "type": "int32",
> "versions": "0+", "taggedVersions": "0+", "tag": 1,
> "about": "client.producer.queue.max.messages" },

This specifies what it is (int32, int64, etc.), how it's being sent, etc.
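
As a rough illustration of what such tagged fields could look like on the wire, here is a minimal sketch. It assumes the tagged-field layout as I understand it from KIP-482 (field count, then tag, size, and data per field, with counts, tags, and sizes as unsigned varints). The function names are made up for this example and are not Kafka APIs.

```python
import struct

def unsigned_varint(value: int) -> bytes:
    """Encode a non-negative int in 7-bit groups, least significant first."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)   # more bytes follow
        else:
            out.append(byte)
            return bytes(out)

def tagged_field(tag: int, payload: bytes) -> bytes:
    """One tagged field: tag, payload size, payload."""
    return unsigned_varint(tag) + unsigned_varint(len(payload)) + payload

def encode_producer_metrics(record_bytes=None, queue_max_messages=None) -> bytes:
    """Hypothetical encoder for the two example fields; unset fields are omitted."""
    fields = []
    if record_bytes is not None:
        fields.append(tagged_field(0, struct.pack(">q", record_bytes)))        # int64, tag 0
    if queue_max_messages is not None:
        fields.append(tagged_field(1, struct.pack(">i", queue_max_messages)))  # int32, tag 1
    return unsigned_varint(len(fields)) + b"".join(fields)

encoded = encode_producer_metrics(record_bytes=1024)
print(encoded.hex())
```

A metric that is not set costs nothing on the wire, which is the main attraction of tagged fields for sparse metric sets.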

best,
Colin

> 
> 
> Thanks,
> Magnus
> 
> 
> 
> >
> > best,
> > Colin
> >
> > > Thanks,
> > > Magnus
> > >
> > > >
> > > >
> > >
> >
>

Re: [DISCUSS] KIP-714: Client metrics and observability

Posted by Magnus Edenhill <ma...@edenhill.se>.

Thanks for your feedback Colin, see my updated proposal below.


Den tors 22 juli 2021 kl 03:17 skrev Colin McCabe <cm...@apache.org>:

> On Tue, Jun 29, 2021, at 07:22, Magnus Edenhill wrote:
> > Den tors 17 juni 2021 kl 00:52 skrev Colin McCabe <cm...@apache.org>:
> > > A few critiques:
> > >
> > > - As I wrote above, I think this could benefit a lot by being split into
> > > several RPCs. A registration RPC, a report RPC, and an unregister RPC seem
> > > like logical choices.
> > >
> >
> > Responded to this in your previous mail, but in short I think a single
> > request is sufficient and keeps the implementation complexity / state down.
> >
>
> Hi Magnus,
>
> I still suspect that trying to do everything with a single RPC is more
> complex than using multiple RPCs.
>
> Can you go into more detail about how the client learns what metrics it
> should send? This was the purpose of the "registration" step in my scheme
> above.
>
> It seems quite awkward to combine an RPC for reporting metrics with an
> RPC for finding out what metrics are configured to be reported. For
> example, how would you build a tool to check what metrics are configured to
> be reported? Does the tool have to report fake metrics, just because
> there's no other way to get back that information? Seems wrong. (It would
> be a bit like combining createTopics and listTopics for "simplicity")
>



Splitting up the API into separate data and control requests makes sense.
With a split we would have one API for querying the broker for configured
metrics subscriptions,
and one API for pushing the collected metrics to the broker.

A mechanism is still needed to notify the client when the subscription is
changed;
I’ve added a SubscriptionId for this purpose (which could be a checksum of
the configured metrics subscription), this id is sent to the client along
with the metrics subscription, and the client sends it back to the broker
when pushing metrics. If the broker finds the pushed subscription id to
differ from what is expected it will return an error to the client, which
triggers the client to retrieve the new subscribed metrics and an updated
subscription id. The generation of the subscriptionId is opaque to the
client.
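
A minimal sketch of how a broker might derive such an id, assuming a CRC32 over a canonical text form of the subscription. The function name and the canonical form are made up for this example. Note that Python's zlib.crc32 returns an unsigned value, so a signed Int32 field would need a conversion.

```python
import zlib

def subscription_id(subscribed_metrics):
    """Checksum a subscription given as (metrics_prefix, interval_ms) pairs.

    Entries are sorted first so that the same subscription always yields
    the same id regardless of configuration order.
    """
    canonical = "\n".join(
        f"{prefix}:{interval_ms}" for prefix, interval_ms in sorted(subscribed_metrics)
    )
    return zlib.crc32(canonical.encode("utf-8"))

current = subscription_id([("client.producer.tx.", 60000), ("client.memory.rss", 900000)])
updated = subscription_id([("client.producer.tx.", 30000), ("client.memory.rss", 900000)])
print(current != updated)
```

Since the id is opaque to the client, the broker is free to swap the checksum for a counter or anything else later on.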


Something like this:

// Get the configured metrics subscription.
GetTelemetrySubscriptionsRequest {
   StrNull  ClientInstanceId  // Null on first invocation to retrieve a
newly generated instance id from the broker.
}

GetTelemetrySubscriptionsResponse {
  Int16  ErrorCode
  Int32  SubscriptionId   // This is used for comparison in
PushTelemetryRequest. Could be a crc32 of the subscription.
  Str    ClientInstanceId
  Int8   AcceptedContentTypes
  Array  SubscribedMetrics[] {
      String MetricsPrefix
      Int32  IntervalMs
  }
}


The ContentType is a bitmask in this new proposal, high bits indicate
compression:
  0x01   OTLPv08
  0x10   GZIP
  0x40   ZSTD
  0x80   LZ4
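
A sketch of how a client could pick a ContentType from the broker's AcceptedContentTypes, using the bit values listed above. The class and function names are invented for this example, and the fallback to uncompressed OTLPv08 is an assumption rather than part of the proposal.

```python
from enum import IntFlag

class ContentType(IntFlag):
    OTLP_V08 = 0x01
    GZIP = 0x10
    ZSTD = 0x40
    LZ4 = 0x80

def choose_content_type(accepted, preferred_codecs):
    """Return OTLPv08 combined with the first preferred codec the broker accepts."""
    accepted = ContentType(accepted)
    if not accepted & ContentType.OTLP_V08:
        raise ValueError("broker does not accept OTLPv08")
    for codec in preferred_codecs:
        if accepted & codec:
            return ContentType.OTLP_V08 | codec
    return ContentType.OTLP_V08  # assumed fallback: send uncompressed

# Broker accepts OTLPv08|ZSTD|LZ4; the client would rather use GZIP, then ZSTD.
chosen = choose_content_type(0x01 | 0x40 | 0x80, [ContentType.GZIP, ContentType.ZSTD])
print(hex(int(chosen)))  # 0x41
```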


// Push metrics
PushTelemetryRequest {
   Str    ClientInstanceId
   Int32  SubscriptionId    // The collected metrics in this request are
based on the subscription with this Id.
   Int8   ContentType       // E.g., OTLPv08|ZSTD
   Bool   Terminating
   Binary Metrics
}


PushTelemetryResponse {
   Int32 ThrottleTime
   Int16 ErrorCode
}


An example run:

1. Client instance starts, connects to broker.
2. > GetTelemetrySubscriptionsRequest{ ClientInstanceId=Null } // Requests
an instance id and the subscribed metrics.
3. < GetTelemetrySubscriptionsResponse{
      ErrorCode = 0,
      SubscriptionId = 0x234adf34,
      ClientInstanceId = f00d-feed-deff-ceff-ffff-…,
      AcceptedContentTypes = OTLPv08|ZSTD|LZ4,
      SubscribedMetrics[] = {
         { “client.producer.tx.”, 60000 },
         { “client.memory.rss”, 900000 },
      }
   }
4. Client updates its metrics subscription, next push to fire in 60 seconds.
5. 60 seconds pass
6. > PushTelemetryRequest{
       ClientInstanceId = f00d-feed-deff-ceff-ffff-….,
       SubscriptionId = 0x234adf34,
       ContentType = OTLPv08|ZSTD,
       Terminating = False,
       Metrics = …// zstd-compressed OTLPv08-protobuf-serialized metrics
  }
7. < PushTelemetryResponse{ 0, NO_ERROR }
8. 60 seconds pass
9. > PushTelemetryRequest…
…
56. The operator changes the configured metrics subscriptions (through
Admin API).
57. > PushTelemetryRequest{ .. SubscriptionId = 0x234adf34 .. }
58. The subscriptionId no longer matches since the subscription has been
updated, broker responds with an error:
59. < PushTelemetryResponse{ 0,   ERR_INVALID_SUBSCRIPTION_ID }
60. The error triggers the client to request the subscriptions again.
61. > GetTelemetrySubscriptionsRequest{..}
62. < GetTelemetrySubscriptionsResponse { .. SubscriptionId = 0x777772211,
SubscribedMetrics[] = .. }
63. Client updates its subscription and continues to push metrics
accordingly.
…
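
The run above can be simulated with a small in-process sketch. Everything here is hypothetical: FakeBroker and Client stand in for the real broker and client, the error code value is a placeholder, and timers, networking, and serialization are left out.

```python
import uuid
import zlib

NO_ERROR = 0
ERR_INVALID_SUBSCRIPTION_ID = 1  # placeholder value, not an assigned Kafka error code

class FakeBroker:
    def __init__(self, subscribed_metrics):
        self.set_subscription(subscribed_metrics)

    def set_subscription(self, subscribed_metrics):
        """Operator changes the configured subscription (step 56)."""
        self.subscribed_metrics = sorted(subscribed_metrics)
        self.subscription_id = zlib.crc32(repr(self.subscribed_metrics).encode("utf-8"))

    def get_telemetry_subscriptions(self, client_instance_id=None):
        if client_instance_id is None:  # first contact: generate an instance id
            client_instance_id = str(uuid.uuid4())
        return {
            "ErrorCode": NO_ERROR,
            "SubscriptionId": self.subscription_id,
            "ClientInstanceId": client_instance_id,
            "SubscribedMetrics": list(self.subscribed_metrics),
        }

    def push_telemetry(self, client_instance_id, subscription_id, metrics):
        if subscription_id != self.subscription_id:
            return ERR_INVALID_SUBSCRIPTION_ID
        return NO_ERROR

class Client:
    def __init__(self, broker):
        self.broker = broker
        self.instance_id = None
        self.refresh()

    def refresh(self):
        """Steps 2-4 and 61-63: fetch the subscription, keep the instance id."""
        resp = self.broker.get_telemetry_subscriptions(self.instance_id)
        self.instance_id = resp["ClientInstanceId"]
        self.subscription_id = resp["SubscriptionId"]
        self.subscribed = resp["SubscribedMetrics"]

    def push(self, metrics=b""):
        err = self.broker.push_telemetry(self.instance_id, self.subscription_id, metrics)
        if err == ERR_INVALID_SUBSCRIPTION_ID:
            self.refresh()  # step 60: the error triggers a re-fetch
        return err

broker = FakeBroker([("client.producer.tx.", 60000), ("client.memory.rss", 900000)])
client = Client(broker)
first_instance_id = client.instance_id
results = [client.push()]                            # steps 6-7
broker.set_subscription([("client.consumer.", 30000)])
results.append(client.push())                        # steps 57-59, then re-fetch
results.append(client.push())                        # pushes with the new id succeed
print(results)
```

The instance id survives the re-fetch because the client passes it back in the second GetTelemetrySubscriptionsRequest.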


If the broker connection goes down or the connection is to be used for
other purposes (e.g., blocking FetchRequests), the client will send
PushTelemetryRequests to any other broker in the cluster, using the same
ClientInstanceId and SubscriptionId as received in the latest
GetTelemetrySubscriptionsResponse.

While the subscriptionId may change during the lifetime of the client
instance (when metric subscriptions are updated), the ClientInstanceId is
only acquired once and must not change (as it is used to identify the
unique client instance).


>
> > > - I don't think the client should be able to choose its own UUID. This
> > > adds complexity and introduces a chance that clients will choose an ID that
> > > is not unique. We already have an ID that the client itself supplies
> > > (clientID) so there is no need to introduce another such ID.
> > >
> >
> > The CLIENT_INSTANCE_ID (which is a combination of the client.id and a UUID)
> > is actually generated by the receiving broker on first contact.
> > The need for a new unique semi-random id is outlined in the KIP, but in
> > short; the client.id is not unique, and we need something unique that still
> > is prefix-matchable to the client.id so that we can add subscriptions
> > either using prefix-matching of just the client.id (which may match one or
> > more client instances), or exact matching which will match one specific
> > client instance.
>
> Hmm... the client id is already sent in every RPC as part of the header.
> It's not necessary to send it again as part of one of the other RPC fields,
> right?
>
> More generally, why does the client instance ID need to be
> prefix-matchable? That seems like an implementation detail of the metrics
> collection system used on the broker side. Maybe someone wants to group by
> things other than client IDs -- perhaps client versions, for instance. By
> the same argument, we should put the client version string in the client
> instance ID, since someone might want to group by that. Or maybe we should
> include the hostname, and the IP, and, and, and.... You see the issue here.
> I think we shouldn't get involved in this kind of decision -- if we just
> pass a UUID, the broker-side software can group it or prefix it however it
> wants internally.
>

Yes, I agree, other selectors will indeed be needed eventually.
I'll remove the client.id from the CLIENT_INSTANCE_ID and only keep the
UUID part.
My assumption is that the set of subscribed metrics prefixes throughout a
cluster will be quite small initially, so maybe we could leave fine-grained
selectors out of this proposal
and address it later when an actual need arises (maybe ACLs can be used for
selector matching).
And there is no harm for a client in having a metrics subscription with
metrics it does not provide, e.g.,  including the consumer metrics for a
producer, and vice versa, it will just be ignored by the client
if it doesn't match a metrics prefix it can provide.
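
A sketch of that client-side filtering, assuming simple string prefix matching. The function name and the "client.producer.tx.bytes" metric name are made up for this example.

```python
def select_metrics(available_metrics, subscribed_prefixes):
    """Return the metric names this client can provide that match a subscribed prefix."""
    return sorted(
        name
        for name in available_metrics
        if any(name.startswith(prefix) for prefix in subscribed_prefixes)
    )

# A producer holding a subscription that also lists consumer metrics.
producer_metrics = {
    "client.producer.tx.bytes",       # hypothetical metric name
    "client.producer.record.bytes",
    "client.memory.rss",
}
selected = select_metrics(producer_metrics, ["client.producer.tx.", "client.consumer."])
print(selected)  # ['client.producer.tx.bytes']
```

The "client.consumer." prefix matches nothing on a producer and is silently skipped.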

What we do want though is the ability to single out a specific client instance
to give it a more fine-grained subscription for troubleshooting, and
we can do that with the current proposal with matching solely on the
CLIENT_INSTANCE_ID.
In other words; all clients will have the same standard metrics
subscription, but specific client instances can have alternate
subscriptions.
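
On the broker side, that selection could be as simple as the following sketch. The function name, the override table, and the instance id value are all invented for this example; how overrides are actually stored and configured is not specified here.

```python
STANDARD_SUBSCRIPTION = [("client.producer.tx.", 60000), ("client.memory.rss", 900000)]

# Per-instance overrides, keyed by exact CLIENT_INSTANCE_ID (hypothetical value).
OVERRIDES = {
    "3f0c1d2e-0000-4000-8000-000000000001": [("client.", 10000)],
}

def subscription_for(client_instance_id, standard=STANDARD_SUBSCRIPTION, overrides=OVERRIDES):
    """Exact match on the instance id wins; everyone else gets the standard subscription."""
    return overrides.get(client_instance_id, standard)

print(subscription_for("3f0c1d2e-0000-4000-8000-000000000001"))
print(subscription_for("some-other-instance"))
```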


> > > - In general the schema seems to have a bad case of string-itis. UUID,
> > > content type, and requested metrics are all strings. Since these messages
> > > will be sent very frequently, it's quite costly to use strings for all
> > > these things. We have a type for UUID, which uses 16 bytes -- let's use
> > > that type for client instance ID, rather than a string which will be much
> > > larger. Also, since we already send clientID in the message header, there
> > > is no need to include it again in the instance ID.
> > >
> >
> > As explained above we need the client.id in the CLIENT_INSTANCE_ID. And I
> > don't think the overhead of this one string per request is going to be much
> > of an issue,
> > typical metric push intervals are probably in the >60s range.
> > If this becomes a problem we could use a per-connection identifier that the
> > broker translates to the client instance id before pushing metrics upwards
> > in the system.
> >
>
> This is actually an interesting design question -- why not use a
> per-TCP-connection identifier, rather than a per-client-instance
> identifier? If we are grouping by other things anyway (clientID, principal,
> etc.) on the server side, do we need to maintain a per-process identifier
> rather than a per-connection one?
>


The metrics collector/tsdb/whatever will need to identify a single client
instance, regardless of which broker received the metrics.
The chapter on CLIENT_INSTANCE_ID motivates why we need a unique
identifier, basically because neither clientID, principal, nor remote
address:port, etc., can be used to identify a single client instance.




> >
> > > - I think it would also be nice to have an enum or something for
> > > AcceptedContentTypes, RequestedMetrics, etc. We know that new additions to
> > > these categories will require KIPs, so it should be straightforward for the
> > > project to just have an enum that allows us to communicate these as ints.
> > >
> >
> > I'm thinking this might be overly constraining. The broker doesn't parse or
> > handle the received metrics data itself but just pushes it to the metrics
> > plugin, using an enum would require a KIP and broker upgrade if the metrics plugin
> > supports a newer version of OTLP.
> > It is probably better if we don't strictly control the metric format itself.
> >
>
> Unfortunately, we have to strictly control the metrics format, because
> otherwise clients can't implement it. I agree that we don't need to specify
> how the broker-side code works, since that is pluggable. It's also
> reasonable for the clients to have pluggable extensions as well, but this
> KIP won't be of much use if we don't at least define a basic set of metrics
> that most clients can understand how to send. The open source clients will
> not implement anything more than what is specified in the KIP (or at least
> the AK one won't...)
>

Makes sense, in the updated proposal above I changed ContentType to a
bitmask.


>
> >
> >
> > > - Can you talk about whether you are adding any new library dependencies
> > > to the Kafka client? It seems like you'd want to add opencensus /
> > > opentelemetry, if we are using that format here.
> > >
> >
> > Yeah, as we get closer to concensus more implementation specific details
> > will be added to the KIP.
> >
>
> I'm not sure if OpenCensus adds any value to this KIP, to be honest. Their
> primary focus was never on the format of the data being sent (in fact, the
> last time they checked, they left the format up to each OpenCensus
> implementation). That may have changed, but I think it still has limited
> usefulness to us, since we have our own format which we have to use anyway.
>

Oh, I meant concensus as in kafka-dev agreement :)

Feng is looking into the implementation details of the Java client and will
update the KIP with regards to dependencies.



>
> >
> > >
> > > - Standard client resource labels: can we send these only in the
> > > registration RPC?
> > >
> >
> > These labels are part of the serialized OTLP data, which means it would
> > need to be unpacked and repacked (including compression) by the broker (or
> > metrics plugin), which I believe is more costly than sending them for each request.
> >
>
> Hmm, that data is about 10 fields, most of which are strings. It certainly
> adds a lot of overhead to resend it each time.
>
> I don't follow the comment about unpacking and repacking -- since the
> client registered with the broker it already knows all this information, so
> there's nothing to unpack or repack, except from memory. If it's more
> convenient to serialize it once rather than multiple times, that is an
> implementation detail of the broker side plugin, which we are not
> specifying here anyway.
>

The current proposal is pretty much stateless on the broker, it does not
need to hold any state for a client (instance), and no state
synchronization is needed
between brokers in the cluster, which allows a client to seamlessly send
metrics to any broker it wants and keeps the API overhead down (no need to
re-register when
switching brokers for instance).

We could remove the labels that are already available to the broker on a
per-request basis or that it already maintains state for:
 - client_id
 - client_instance_id
 - client_software_*

Leaving the following to still be included:
 - group_id
 - group_instance_id
 - transactional_id
  etc..

What do you think of that?


Thanks,
Magnus



>
> best,
> Colin
>
> > Thanks,
> > Magnus
> >
> > >
> > >
> >
>

Re: [DISCUSS] KIP-714: Client metrics and observability

Posted by Colin McCabe <cm...@apache.org>.

On Tue, Jun 29, 2021, at 07:22, Magnus Edenhill wrote:
> Den tors 17 juni 2021 kl 00:52 skrev Colin McCabe <cm...@apache.org>:
> > A few critiques:
> >
> > - As I wrote above, I think this could benefit a lot by being split into
> > several RPCs. A registration RPC, a report RPC, and an unregister RPC seem
> > like logical choices.
> >
> 
> Responded to this in your previous mail, but in short I think a single
> request is sufficient and keeps the implementation complexity / state down.
> 

Hi Magnus,

I still suspect that trying to do everything with a single RPC is more complex than using multiple RPCs.

Can you go into more detail about how the client learns what metrics it should send? This was the purpose of the "registration" step in my scheme above.

It seems quite awkward to combine an RPC for reporting metrics with an RPC for finding out what metrics are configured to be reported. For example, how would you build a tool to check what metrics are configured to be reported? Does the tool have to report fake metrics, just because there's no other way to get back that information? Seems wrong. (It would be a bit like combining createTopics and listTopics for "simplicity")

> > - I don't think the client should be able to choose its own UUID. This
> > adds complexity and introduces a chance that clients will choose an ID that
> > is not unique. We already have an ID that the client itself supplies
> > (clientID) so there is no need to introduce another such ID.
> >
> 
> The CLIENT_INSTANCE_ID (which is a combination of the client.id and a UUID)
> is actually generated by the receiving broker on first contact.
> The need for a new unique semi-random id is outlined in the KIP, but in
> short; the client.id is not unique, and we need something unique that still
> is prefix-matchable to the client.id so that we can add subscriptions
> either using prefix-matching of just the client.id (which may match one or
> more client instances), or exact matching which will match one specific
> client instance.

Hmm... the client id is already sent in every RPC as part of the header. It's not necessary to send it again as part of one of the other RPC fields, right?

More generally, why does the client instance ID need to be prefix-matchable? That seems like an implementation detail of the metrics collection system used on the broker side. Maybe someone wants to group by things other than client IDs -- perhaps client versions, for instance. By the same argument, we should put the client version string in the client instance ID, since someone might want to group by that. Or maybe we should include the hostname, and the IP, and, and, and.... You see the issue here. I think we shouldn't get involved in this kind of decision -- if we just pass a UUID, the broker-side software can group it or prefix it however it wants internally.

> > - In general the schema seems to have a bad case of string-itis. UUID,
> > content type, and requested metrics are all strings. Since these messages
> > will be sent very frequently, it's quite costly to use strings for all
> > these things. We have a type for UUID, which uses 16 bytes -- let's use
> > that type for client instance ID, rather than a string which will be much
> > larger. Also, since we already send clientID in the message header, there
> > is no need to include it again in the instance ID.
> >
> 
> As explained above we need the client.id in the CLIENT_INSTANCE_ID. And I
> don't think the overhead of this one string per request is going to be much
> of an issue,
> typical metric push intervals are probably in the >60s range.
> If this becomes a problem we could use a per-connection identifier that the
> broker translates to the client instance id before pushing metrics upwards
> in the system.
> 

This is actually an interesting design question -- why not use a per-TCP-connection identifier, rather than a per-client-instance identifier? If we are grouping by other things anyway (clientID, principal, etc.) on the server side, do we need to maintain a per-process identifier rather than a per-connection one?

> 
> > - I think it would also be nice to have an enum or something for
> > AcceptedContentTypes, RequestedMetrics, etc. We know that new additions to
> > these categories will require KIPs, so it should be straightforward for the
> > project to just have an enum that allows us to communicate these as ints.
> >
> 
> I'm thinking this might be overly constraining. The broker doesn't parse or
> handle the received metrics data itself but just pushes it to the metrics
> plugin, using an enum would require a KIP and broker upgrade if the metrics plugin
> supports a newer version of OTLP.
> It is probably better if we don't strictly control the metric format itself.
> 

Unfortunately, we have to strictly control the metrics format, because otherwise clients can't implement it. I agree that we don't need to specify how the broker-side code works, since that is pluggable. It's also reasonable for the clients to have pluggable extensions as well, but this KIP won't be of much use if we don't at least define a basic set of metrics that most clients can understand how to send. The open source clients will not implement anything more than what is specified in the KIP (or at least the AK one won't...)

> 
> 
> > - Can you talk about whether you are adding any new library dependencies
> > to the Kafka client? It seems like you'd want to add opencensus /
> > opentelemetry, if we are using that format here.
> >
> 
> Yeah, as we get closer to concensus more implementation specific details
> will be added to the KIP.
> 

I'm not sure if OpenCensus adds any value to this KIP, to be honest. Their primary focus was never on the format of the data being sent (in fact, the last time they checked, they left the format up to each OpenCensus implementation). That may have changed, but I think it still has limited usefulness to us, since we have our own format which we have to use anyway.

> 
> >
> > - Standard client resource labels: can we send these only in the
> > registration RPC?
> >
> 
> These labels are part of the serialized OTLP data, which means it would
> need to be unpacked and repacked (including compression) by the broker (or
> metrics plugin), which I believe is more costly than sending them for each request.
> 

Hmm, that data is about 10 fields, most of which are strings. It certainly adds a lot of overhead to resend it each time.

I don't follow the comment about unpacking and repacking -- since the client registered with the broker it already knows all this information, so there's nothing to unpack or repack, except from memory. If it's more convenient to serialize it once rather than multiple times, that is an implementation detail of the broker side plugin, which we are not specifying here anyway.

best,
Colin

> Thanks,
> Magnus
> 
> >
> >
>

Re: [DISCUSS] KIP-714: Client metrics and observability

Posted by Magnus Edenhill <ma...@edenhill.se>.

Den tors 17 juni 2021 kl 00:52 skrev Colin McCabe <cm...@apache.org>:

> Hi Magnus,
>
> Thanks for the KIP. This is certainly something I've been wishing for for
> a while.
>
> Maybe we should emphasize more that the metrics that are being gathered
> here are Kafka metrics, not general application business logic metrics.
> That seems like a point of confusion in some of the replies here. The
> analogy with a telecom gathering metrics about a DSL modem is a good one.
> These are really metrics about the Kafka cluster itself, very similar to
> the metrics we expose about the broker, controller, and so forth.
>

Good point, will make this more clear in the KIP.


>
> In my experience, most users want their Kafka clients to be "plug and
> play" -- they want to start up a Kafka client, and do some things. Their
> focus is on their application, not on the details of the infrastructure. If
> something goes wrong, they want the Kafka team to diagnose the problem
> and fix it, or at least tell them what the issue is. When the Kafka team
> tells them they need to install and maintain a third-party metrics system
> to diagnose the problem, this can be a very big disappointment. Many users
> don't have this level of expertise.
>
> A few critiques:
>
> - As I wrote above, I think this could benefit a lot by being split into
> several RPCs. A registration RPC, a report RPC, and an unregister RPC seem
> like logical choices.
>

Responded to this in your previous mail, but in short I think a single
request is sufficient and keeps the implementation complexity / state down.


>
> - I don't think the client should be able to choose its own UUID. This
> adds complexity and introduces a chance that clients will choose an ID that
> is not unique. We already have an ID that the client itself supplies
> (clientID) so there is no need to introduce another such ID.
>

The CLIENT_INSTANCE_ID (which is a combination of the client.id and a UUID)
is actually generated by the receiving broker on first contact.
The need for a new unique semi-random id is outlined in the KIP, but in
short; the client.id is not unique, and we need something unique that still
is prefix-matchable to the client.id so that we can add subscriptions
either using prefix-matching of just the client.id (which may match one or
more client instances), or exact matching which will match one specific
client instance.



> - I might be misunderstanding something here, but my reading of this is
> that the client chooses what metrics to send and the broker filters that on
> the broker-side. I think this is backwards -- the broker should inform the
> client about what it wants, and the client should send only that data. (Of
> course, the client may also not know what the broker is asking for, in
> which case it can choose to not send the data). We shouldn't have clients
> pumping out data that nobody wants to read. (sorry if I misinterpreted and
> this is already the case...)
>

This is indeed completely controlled from the cluster side:
The cluster operator (et al.) configures client metric subscriptions, which
are basically: what metrics to collect, at what interval, from what client
instance(s).
These subscriptions are then propagated to matching clients, which in turn
start pushing the requested metrics (but nothing else) to the broker.



> - In general the schema seems to have a bad case of string-itis. UUID,
> content type, and requested metrics are all strings. Since these messages
> will be sent very frequently, it's quite costly to use strings for all
> these things. We have a type for UUID, which uses 16 bytes -- let's use
> that type for client instance ID, rather than a string which will be much
> larger. Also, since we already send clientID in the message header, there
> is no need to include it again in the instance ID.
>

As explained above we need the client.id in the CLIENT_INSTANCE_ID. And I
don't think the overhead of this one string per request is going to be much
of an issue; typical metric push intervals are probably in the >60s range.
If this becomes a problem we could use a per-connection identifier that the
broker translates to the client instance id before pushing metrics upwards
in the system.


> - I think it would also be nice to have an enum or something for
> AcceptedContentTypes, RequestedMetrics, etc. We know that new additions to
> these categories will require KIPs, so it should be straightforward for the
> project to just have an enum that allows us to communicate these as ints.
>

I'm thinking this might be overly constraining. The broker doesn't parse or
handle the received metrics data itself but just pushes it to the metrics
plugin; using an enum would require a KIP and a broker upgrade if the metrics
plugin supports a newer version of OTLP.
It is probably better if we don't strictly control the metric format itself.



> - Can you talk about whether you are adding any new library dependencies
> to the Kafka client? It seems like you'd want to add opencensus /
> opentelemetry, if we are using that format here.
>

Yeah, as we get closer to consensus more implementation specific details
will be added to the KIP.



>
> - Standard client resource labels: can we send these only in the
> registration RPC?
>

These labels are part of the serialized OTLP data, which means it would
need to be unpacked and repacked (including compression) by the broker (or
metrics plugin), which
I believe is more costly than sending them for each request.

Thanks,
Magnus

>
>

Re: [DISCUSS] KIP-714: Client metrics and observability

Posted by Colin McCabe <cm...@apache.org>.

Hi Magnus,

Thanks for the KIP. This is certainly something I've been wishing for for a while.

Maybe we should emphasize more that the metrics that are being gathered here are Kafka metrics, not general application business logic metrics. That seems like a point of confusion in some of the replies here. The analogy with a telecom gathering metrics about a DSL modem is a good one. These are really metrics about the Kafka cluster itself, very similar to the metrics we expose about the broker, controller, and so forth.

In my experience, most users want their Kafka clients to be "plug and play" -- they want to start up a Kafka client, and do some things. Their focus is on their application, not on the details of the infrastructure. If something goes wrong, they want the Kafka team to diagnose the problem and fix it, or at least tell them what the issue is. When the Kafka team tells them they need to install and maintain a third-party metrics system to diagnose the problem, this can be a very big disappointment. Many users don't have this level of expertise.

A few critiques:

- As I wrote above, I think this could benefit a lot by being split into several RPCs. A registration RPC, a report RPC, and an unregister RPC seem like logical choices.

- I don't think the client should be able to choose its own UUID. This adds complexity and introduces a chance that clients will choose an ID that is not unique. We already have an ID that the client itself supplies (clientID) so there is no need to introduce another such ID.

- I might be misunderstanding something here, but my reading of this is that the client chooses what metrics to send and the broker filters that on the broker-side. I think this is backwards -- the broker should inform the client about what it wants, and the client should send only that data. (Of course, the client may also not know what the broker is asking for, in which case it can choose to not send the data). We shouldn't have clients pumping out data that nobody wants to read. (sorry if I misinterpreted and this is already the case...)

- In general the schema seems to have a bad case of string-itis. UUID, content type, and requested metrics are all strings. Since these messages will be sent very frequently, it's quite costly to use strings for all these things. We have a type for UUID, which uses 16 bytes -- let's use that type for client instance ID, rather than a string which will be much larger. Also, since we already send clientID in the message header, there is no need to include it again in the instance ID.

- I think it would also be nice to have an enum or something for AcceptedContentTypes, RequestedMetrics, etc. We know that new additions to these categories will require KIPs, so it should be straightforward for the project to just have an enum that allows us to communicate these as ints.

- Can you talk about whether you are adding any new library dependencies to the Kafka client? It seems like you'd want to add opencensus / opentelemetry, if we are using that format here.

- Standard client resource labels: can we send these only in the registration RPC?

best,
Colin

On Wed, Jun 16, 2021, at 08:27, Magnus Edenhill wrote:
> Hi Ryanne,
> 
> this proposal stems from a need to improve troubleshooting Kafka issues.
> 
> As it currently stands, when an application team is experiencing Kafka
> service degradation, or the Kafka operator is seeing misbehaving clients,
> there are plenty of steps that need to be taken before any client-side
> metrics can be observed, if at all:
>  - Is the application even collecting client metrics? If not it needs to be
> reconfigured or implemented, and restarted;
>    a restart may have business impact, and may also temporarily? remedy the
> problem without giving any further insight
>    into what was wrong.
>  - Are the desired metrics collected? Where are they stored? For how long?
> Is there enough correlating information
>    to map it to cluster-side metrics and events? Does the application
> on-call know how to find the collected metrics?
>  - Export and send these metrics to whoever knows how to interpret them. In
> what format? Are all relevant metadata fields
>    provided?
> 
> The KIP aims to solve all these obstacles by giving the Kafka operator the
> tools to collect this information.
> 
> Regards,
> Magnus
> 
> 
> On Tue, 15 Jun 2021 at 02:37, Ryanne Dolan <ry...@gmail.com> wrote:
> 
> > Magnus, I think such a substantial change requires more motivation than is
> > currently provided. As I read it, the motivation boils down to this: you
> > want your clients to phone-home unless they opt-out. As stated in the KIP,
> > "there are plenty of existing solutions [...] to send metrics [...] to a
> > collector", so the opt-out appears to be the only motivation. Am I missing
> > something?
> >
> > Ryanne
> >
> > On Wed, Jun 2, 2021 at 7:46 AM Magnus Edenhill <ma...@edenhill.se> wrote:
> >
> > > Hey all,
> > >
> > > I'm proposing KIP-714 to add remote Client metrics and observability.
> > > This functionality will allow centralized monitoring and troubleshooting
> > of
> > > clients and their internals.
> > >
> > > Please see
> > >
> > >
> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-714%3A+Client+metrics+and+observability
> > >
> > > Looking forward to your feedback!
> > >
> > > Regards,
> > > Magnus
> > >
> >
>

Re: [DISCUSS] KIP-714: Client metrics and observability

Posted by Magnus Edenhill <ma...@edenhill.se>.

Hi Ryanne,

this proposal stems from a need to improve troubleshooting Kafka issues.

As it currently stands, when an application team is experiencing Kafka
service degradation, or the Kafka operator is seeing misbehaving clients,
there are plenty of steps that need to be taken before any client-side
metrics can be observed, if at all:
 - Is the application even collecting client metrics? If not it needs to be
reconfigured or implemented, and restarted;
   a restart may have business impact, and may also temporarily? remedy the
problem without giving any further insight
   into what was wrong.
 - Are the desired metrics collected? Where are they stored? For how long?
Is there enough correlating information
   to map it to cluster-side metrics and events? Does the application
on-call know how to find the collected metrics?
 - Export and send these metrics to whoever knows how to interpret them. In
what format? Are all relevant metadata fields
   provided?

The KIP aims to solve all these obstacles by giving the Kafka operator the
tools to collect this information.

Regards,
Magnus


On Tue, 15 Jun 2021 at 02:37, Ryanne Dolan <ry...@gmail.com> wrote:

> Magnus, I think such a substantial change requires more motivation than is
> currently provided. As I read it, the motivation boils down to this: you
> want your clients to phone-home unless they opt-out. As stated in the KIP,
> "there are plenty of existing solutions [...] to send metrics [...] to a
> collector", so the opt-out appears to be the only motivation. Am I missing
> something?
>
> Ryanne
>
> On Wed, Jun 2, 2021 at 7:46 AM Magnus Edenhill <ma...@edenhill.se> wrote:
>
> > Hey all,
> >
> > I'm proposing KIP-714 to add remote Client metrics and observability.
> > This functionality will allow centralized monitoring and troubleshooting
> of
> > clients and their internals.
> >
> > Please see
> >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-714%3A+Client+metrics+and+observability
> >
> > Looking forward to your feedback!
> >
> > Regards,
> > Magnus
> >
>

Re: [DISCUSS] KIP-714: Client metrics and observability

Posted by Ryanne Dolan <ry...@gmail.com>.

Magnus, I think such a substantial change requires more motivation than is
currently provided. As I read it, the motivation boils down to this: you
want your clients to phone-home unless they opt-out. As stated in the KIP,
"there are plenty of existing solutions [...] to send metrics [...] to a
collector", so the opt-out appears to be the only motivation. Am I missing
something?

Ryanne

On Wed, Jun 2, 2021 at 7:46 AM Magnus Edenhill <ma...@edenhill.se> wrote:

> Hey all,
>
> I'm proposing KIP-714 to add remote Client metrics and observability.
> This functionality will allow centralized monitoring and troubleshooting of
> clients and their internals.
>
> Please see
>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-714%3A+Client+metrics+and+observability
>
> Looking forward to your feedback!
>
> Regards,
> Magnus
>

Re: [DISCUSS] KIP-714: Client metrics and observability

Posted by Magnus Edenhill <ma...@edenhill.se>.

Thanks for your feedback, Travis!

I believe there are different audiences and uses for application (business
logic) and client (infrastructure) metrics. Kafka clients are part of the
infrastructure, not the business logic, and should be monitored as such by
the organization, sub-organization, or team that knows Kafka best and
already does Kafka monitoring - the Kafka operators.


So to be clear, this KIP does not cover application metrics, but Kafka
client metrics.
It in no way replaces or changes the way application metrics are
collected; they are not relevant to the intended use.

An analogy from the telco space is CPE (customer premises equipment),
e.g. an ADSL router in the customer's home. The network owner - the
infrastructure operator - monitors the ADSL router metrics for queue
pressure, latencies, error rates, etc, which allows the operator to
effectively troubleshoot customer issues, scale the network, and foresee
issues, completely without any intervention needed by the end user.
This is what we want to achieve with this KIP: extending the infrastructure
operator's (aka the Kafka cluster operator's) monitoring abilities to allow
for end-to-end troubleshooting and observability.


The collection model in the KIP is subscription-based, no metrics will be
collected by default.
Two things need to happen before anything is collected:
 - a metrics plugin needs to be configured on the brokers. This is a custom
   plugin to serve whatever needs the operator might have for the metrics.
 - client metric subscriptions need to be configured through the Kafka
   Admin API to select which metrics to collect. The subscription defines
   what metrics to collect and at what interval; this effectively puts
   filtering at the edge (client) to spare central resources.

This functionality is thus opt-in on the cluster side, and opt-out on the
client side, and
great care is taken not to expose any sensitive information in the metrics.
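
As a rough illustration of filtering at the edge, the sketch below shows a
client keeping only the metrics a subscription asks for before anything is
pushed. The MetricSubscription type and its fields are hypothetical, and it
assumes metrics are selected by name prefix (so that an empty prefix selects
everything); neither is taken from the KIP's wire format. The metric name
"client.example.latency" is made up for the example.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical type for illustration; not the KIP's wire format.
final class MetricSubscription {
    final List<String> metricPrefixes; // assumed: selection by metric name prefix
    final long pushIntervalMs;         // how often the client should push

    MetricSubscription(List<String> metricPrefixes, long pushIntervalMs) {
        this.metricPrefixes = metricPrefixes;
        this.pushIntervalMs = pushIntervalMs;
    }

    // True if any subscribed prefix matches the metric name.
    boolean wants(String metricName) {
        for (String prefix : metricPrefixes) {
            if (metricName.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    // Client-side filter: with no prefixes, nothing is sent at all.
    Map<String, Double> filter(Map<String, Double> allMetrics) {
        Map<String, Double> selected = new LinkedHashMap<>();
        for (Map.Entry<String, Double> e : allMetrics.entrySet()) {
            if (wants(e.getKey())) {
                selected.put(e.getKey(), e.getValue());
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        Map<String, Double> all = new LinkedHashMap<>();
        all.put("client.process.memory.bytes", 1024.0);
        all.put("client.example.latency", 12.5); // made-up metric name
        MetricSubscription sub =
            new MetricSubscription(Arrays.asList("client.process."), 60_000L);
        // Only the subscribed metric survives the filter.
        System.out.println(sub.filter(all).keySet());
    }
}
```

With this shape, a client that has no subscription pushes nothing, which
matches the opt-in behaviour on the cluster side.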


As for what needs to be implemented by a supporting client;
a supporting client does not need to implement all the defined metrics,
each client maintainer may choose
her own subset that makes sense for that given client implementation, and
it is fine to add metrics not
listed in the KIP as long as they're in the client's namespace.
But there's obviously value in having a shared set of common metrics that
all clients provide.
The goal is for all client implementations to support this.


Regards,
Magnus

On Mon, 14 Jun 2021 at 16:24, Travis Bischel <travis.bischel@gmail.com> wrote:

> Hi! I have a few thoughts on this KIP. First, I'd like to thank you for
> the writeup,
> clearly a lot of thought has gone into it and it is very thorough.
> However, I'm not
> convinced it's the right approach from a fundamental level.
>
> Fundamentally, this KIP seems like somewhat of a solution to an
> organizational
> problem. Metrics are organizational concerns, not Kafka operator concerns.
> Clients should make it easy to plug in metrics (this is the approach I
> take in
> my own client), and organizations should have processes such that all
> clients
> gather and ship metrics how that organization desires. If an organization
> is
> set up correctly, there is no reason for metrics to be forwarded through
> Kafka.
> This feels like a solution to an organization not properly setting up how
> processes ship metrics, and in some ways, it's an overbroad solution, and
> in
> other ways, it doesn't cover the entire problem.
>
> From the perspective of Kafka operators, it is easy to see that this KIP is
> nice in that it just dictates what clients should support for metrics and
> that
> the metrics should ship through Kafka. But, from the perspective of an
> observability team, this workflow is basically hijacking the standard flow
> that
> organizations may have. I would rather have applications collect metrics
> and
> ship them the same way every other application does. I'd rather not have to
> configure additional plugins within Kafka to take metrics and forward them.
>
> More importantly, this KIP prescribes cardinality problems, requires that to
> officially support the KIP a client must support all relevant metrics
> within
> the KIP, and requires that a client cannot support other metrics unless
> those
> other metrics also go through a KIP process. It is difficult to imagine
> all of
> these metrics being relevant to every organization, and there is no way
> for an
> organization to filter what is relevant within the client. Instead, the
> filtering is pushed downwards, meaning more network IO and more CPU costs
> to
> filter what is irrelevant and aggregate what needs to be aggregated, and
> more
> time for an organization to setup whatever it is that will be doing this
> filtering and aggregating. Contrast this with a client that enables
> hooking in
> to capture numbers that are relevant within an org itself: the org can
> gather
> what they want, ship only what they want, and ship directly to the
> observability system they have already set up. As an aside, it may also be
> wise to avoid shipping metrics through Kafka about client interaction with
> Kafka, because if Kafka is having problems, then orgs lose insight into
> those
> problems. This would be like statuspage using itself for status on its own
> systems.
>
> Another downside is that by dictating the important metrics, this KIP
> has two choices: either try to choose what is important to every org, and
> inevitably
> inevitably
> leave out something important to somebody else, or just add everything and
> let
> the orgs filter. This KIP mostly looks to go with the latter approach,
> meaning
> orgs will be shipping & filtering. With hooks, an org would be able to
> gather
> exactly what they want.
>
> As well, I expect that org applications have metrics on the state of the
> applications outside of the Kafka client. Applications are already sending
> non-Kafka-client related metrics outbound to observability systems. If a
> Kafka
> client provided hooks, then users could just gather the additional relevant
> Kafka client metrics and ship those metrics the same way they do all of
> their
> other metrics. It feels a bit odd for a Kafka client to have its own
> separate
> way of forwarding metrics. Another benefit of hooks in clients is that
> organizations do not _have_ to set up additional plugins to forward metrics
> from Kafka. Hooks avoid extra organizational work.
>
> The option that the KIP provides for users of clients to opt out of
> metrics may
> avoid some of the above issues (by just disabling things at the user
> level),
> but that's not really great from the perspective of client authors,
> because the
> existence of this KIP forces authors to either just not implement the KIP,
> or
> increase complexity within the KIP. Further, from an operator perspective,
> if I
> would prefer clients to ship metrics through the systems they already have
> in
> place, now I have to expect that anything that uses librdkafka or the
> official
> Java client will be shipping me metrics that I have to deal with (since
> the KIP
> is default enabled).
>
> Lastly, I'm a little wary that this KIP may stem from a product goal of
> Confluent: since most everything uses librdkafka or the Java client, then
> by
> defaulting clients sending metrics, Confluent gets an easy way to provide
> metric panels for a nice cloud UI. If any client does not want to support
> these
> metrics, and then a user wonders why these hypothetical panels have no
> metrics,
> then Confluent can just reply "use a supported client".  Even if this
> (potentially unlikely) scenario is true, then hooks would still be a great
> alternative, because then Confluent could provide drop-in hooks for any
> client
> and the end result of easy-panels would be the same.
>
> In summary,
>
> - Metrics are more of an organizational concern, not specifically a broker
>   operator concern.
>
> - The proposal seems to hijack how metrics are gathered within
> organizations
>
> - I don't think KIPs should dictate which metrics should be gathered and
> which
>   should not. Clients instead should make it easy for users to gather
> anything
>   they could be interested in, and ignore anything they are not.
>
> - I think hooks are more extensible, more exact, and fit better into
>   organizational workflows.
>
> On 2021/06/02 12:45:45, Magnus Edenhill <ma...@edenhill.se> wrote:
> > Hey all,
> >
> > I'm proposing KIP-714 to add remote Client metrics and observability.
> > This functionality will allow centralized monitoring and troubleshooting
> of
> > clients and their internals.
> >
> > Please see
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-714%3A+Client+metrics+and+observability
> >
> > Looking forward to your feedback!
> >
> > Regards,
> > Magnus
> >
>

Re: [DISCUSS] KIP-714: Client metrics and observability

Posted by Xavier Léauté <xa...@confluent.io.INVALID>.

>
> 1. Did you consider using a `default ClientTelemetryReceiver
> clientReceiver() { return null; }` method on the existing MetricsReporter
> interface, avoiding the need for the ClientTelemetry trait?


I did. Part of the motivation was to separate more clearly the
MetricsReporter methods which are more directly tied to the KafkaMetrics
framework from the metrics collected from clients by the broker.
It would also make it more explicit that this trait only makes sense in the
context of a broker, unlike more general MetricsReporters which can be run
inside client or connect plugins.
That being said, ClientTelemetry would typically still rely on the
configuration and context provided via the metrics reporter, so I agree
that there might not be much value in a separate interface yet.

Maybe we'd be better served if we did a clean break like we did in KIP-504
with the Authorizer interface and revamp the interfaces altogether.
Currently the initialization of a metrics reporter is somewhat difficult,
due to the mix of context information being provided via Reconfigurable,
ClusterResourceListener, and MetricsContext.
There is a lack of a clear initialization sequence, and detecting whether
the reporter runs inside of a client, connect, or a broker is somewhat
brittle.
I felt that fixing those aspects would be outside of the scope of this KIP,
which is already quite large, and would instead keep changes to existing
interfaces minimal.

I don't have a strong feeling though, so if we decide that having a default
method is in line with our current conventions I'd be happy to change that.
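
For readers comparing the two options, here is a minimal sketch of the
default-method variant. The interfaces are simplified stand-ins declared
locally, not the real Kafka types, and the exportMetrics method name is an
assumption made for the example.

```java
// Simplified stand-ins declared for this example; the real Kafka interfaces
// have more methods, and exportMetrics is a hypothetical name.
interface ClientTelemetryReceiver {
    void exportMetrics(byte[] payload);
}

interface MetricsReporter {
    // Existing reporters inherit this and need no source changes.
    default ClientTelemetryReceiver clientReceiver() {
        return null;
    }
}

// A reporter that does not care about client telemetry.
final class PlainReporter implements MetricsReporter {
}

// A reporter that opts in by overriding the default method.
final class TelemetryReporter implements MetricsReporter {
    int received = 0;

    @Override
    public ClientTelemetryReceiver clientReceiver() {
        return payload -> received += payload.length;
    }
}

final class DefaultMethodDemo {
    // Broker-side view: hand the payload over only if a receiver exists.
    static boolean deliver(MetricsReporter reporter, byte[] payload) {
        ClientTelemetryReceiver receiver = reporter.clientReceiver();
        if (receiver == null) {
            return false;
        }
        receiver.exportMetrics(payload);
        return true;
    }

    public static void main(String[] args) {
        System.out.println(deliver(new PlainReporter(), new byte[3]));
        System.out.println(deliver(new TelemetryReporter(), new byte[3]));
    }
}
```

The trade-off is the one discussed above: a default method avoids a second
interface, while a separate trait makes the broker-only nature of the
receiver explicit in the type.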

Re: [DISCUSS] KIP-714: Client metrics and observability

Posted by Magnus Edenhill <ma...@edenhill.se>.

Hey Tom,

On Mon, 21 Jun 2021 at 21:08, Tom Bentley <tb...@redhat.com> wrote:

>
> 1. Did you consider using a `default ClientTelemetryReceiver
> clientReceiver() { return null; }` method on the existing MetricsReporter
> interface, avoiding the need for the ClientTelemetry trait?
>

I'll let Xavier answer this one since he designed the new interface.



> 2. On the metrics naming and format, I wasn't really clear about what's
> being proposed. I assume we're taking a subset of the existing client
> metrics and representing them as OpenTelemetry metrics, but it didn't
> really explain how the existing metric names would be mapped to meter and
> instrument names. Or did I misunderstand?
>

The KIP is approaching the set of standard metrics from a general viewpoint
rather
than what exactly is provided by the Java clients today, and this is
because we
want these standard metrics to make sense across all languages and all
client implementations.
They're loosely based on existing metrics across the dominant client
implementations.
It is up to each client maintainer to map its existing metrics to the
metrics defined here.
Also, not all metrics may make sense for all clients since the
implementations differ.


> 3. In the client behaviour section it doesn't explicitly say whether the
> client uses a dedicated thread for this work (I assume it does).
>

Client implementation details are currently left out of the KIP; the focus
is currently on more general protocol-level and high-level client and
broker semantics.

I'm not sure if it's best to add Java client specifics to KIP-714, or make
a new KIP
with the Java client implementation details once KIP-714 is accepted.


> 4. The description of the FunctionalityNotEnabled error code suggests that
> PushTelemetryRequest would only be included in an ApiVersions response if
> the broker was configured with a plugin. I think the ApiVersionsResponse is
> normally a constant response (not dependent on broker config), so I wonder
> whether this is really a precedent we want to set here? Surely in a broker
> without a plugin configured it could just return an empty set of
> RequestedMetrics and a maxint NextPushMs in the PushTelemetryResponse?
>

Yes, that's a good idea. That would also solve the (future) issue with
enabling a metrics plugin while the broker is running.
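
A small sketch of the suggested behaviour, for illustration only: the
response type below is a hypothetical stand-in for PushTelemetryResponse,
and it assumes NextPushMs is a 32-bit integer, which the thread does not
state.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for PushTelemetryResponse; field types are assumed.
final class PushTelemetryResponseSketch {
    final List<String> requestedMetrics;
    final int nextPushMs; // assumed to be an int32 field

    PushTelemetryResponseSketch(List<String> requestedMetrics, int nextPushMs) {
        this.requestedMetrics = requestedMetrics;
        this.nextPushMs = nextPushMs;
    }

    // Without a plugin the broker still answers, but asks for nothing and
    // tells the client to wait as long as possible before the next push.
    static PushTelemetryResponseSketch forBroker(boolean pluginConfigured,
                                                 List<String> subscribed,
                                                 int intervalMs) {
        if (!pluginConfigured) {
            return new PushTelemetryResponseSketch(
                Collections.<String>emptyList(), Integer.MAX_VALUE);
        }
        return new PushTelemetryResponseSketch(subscribed, intervalMs);
    }

    public static void main(String[] args) {
        PushTelemetryResponseSketch r =
            forBroker(false, Arrays.asList("client.process."), 60_000);
        System.out.println(r.requestedMetrics.isEmpty() + " " + r.nextPushMs);
    }
}
```

This keeps the ApiVersions response independent of broker configuration,
since the request is always advertised and only its answer changes.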


> 5. Maybe the AcceptedContentTypes should be documented to be in priority
> order. That would simplify the action for UnsupportedCompressionType.
>

Good idea!


> 6. """As the client will not know the broker id of its bootstrap servers
> the broker_id label should be set to “bootstrap”.""" Maybe using the same
> convention as is used in the NetworkClient, where bootstrap servers are the
> id of the negative of their index in the list?
>

This too!
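
The convention Tom refers to could look roughly like the sketch below. It
assumes the first bootstrap server is numbered -1, the second -2, and so on,
which is an assumption about the Java client's numbering rather than
something stated in this thread; the helper names are hypothetical.

```java
// Hypothetical helper; assumes bootstrap servers are numbered -1, -2, ...
// in the order they appear in the bootstrap list.
final class BootstrapIds {
    // Map a zero-based list index to a negative pseudo broker id.
    static int idFor(int index) {
        return -(index + 1);
    }

    // Value to use for the broker_id label instead of the string "bootstrap".
    static String brokerIdLabel(int index) {
        return Integer.toString(idFor(index));
    }

    public static void main(String[] args) {
        for (int i = 0; i < 3; i++) {
            System.out.println(brokerIdLabel(i));
        }
    }
}
```

Negative ids cannot collide with real broker ids, so a metrics backend can
still tell bootstrap connections apart from regular ones.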


> 7. Maybe call it "client.process.rss.bytes" rather than
> "client.process.memory.bytes",
> to be explicit?
>

Yeah I started out with rss but then went with something more generic.
Don't really have a strong opinion.


> 8. It's a little confusing that --id option to kafka-client-metrics.sh can
> be a prefix or an exact match. Perhaps --id and --id-prefix would be
> clearer.
>

Makes sense.


> 9. Maybe I missed it, but does the client continue to push metrics to the
> same broker as it randomly picked initially? If it gets disconnected from
> that broker what happens, does it just randomly pick another?
>

Yep, and the new broker must accept the already assigned CLIENT_INSTANCE_ID
that the client is using.


> 10. To subscribe to all metrics I assume I can just do
> `kafka-client-metrics.sh ... --metric ''`? It might be worth saying this
> explicitly. AFAICS this is the only way to find out all the metrics
> supported by a client if you don't already know from the client's software
> version.
>

Will make a note of that.


Thanks for the valuable input, will update the KIP accordingly.

/Magnus


>

Re: [DISCUSS] KIP-714: Client metrics and observability

Posted by Tom Bentley <tb...@redhat.com>.

Hi Magnus,

Thanks for the KIP.

1. Did you consider using a `default ClientTelemetryReceiver
clientReceiver() { return null; }` method on the existing MetricsReporter
interface, avoiding the need for the ClientTelemetry trait?
2. On the metrics naming and format, I wasn't really clear about what's
being proposed. I assume we're taking a subset of the existing client
metrics and representing them as OpenTelemetry metrics, but it didn't
really explain how the existing metric names would be mapped to meter and
instrument names. Or did I misunderstand?
3. In the client behaviour section it doesn't explicitly say whether the
client uses a dedicated thread for this work (I assume it does).
4. The description of the FunctionalityNotEnabled error code suggests that
PushTelemetryRequest would only be included in an ApiVersions response if
the broker was configured with a plugin. I think the ApiVersionsResponse is
normally a constant response (not dependent on broker config), so I wonder
whether this is really a precedent we want to set here? Surely in a broker
without a plugin configured it could just return an empty set of
RequestedMetrics and a maxint NextPushMs in the PushTelemetryResponse?
5. Maybe the AcceptedContentTypes should be documented to be in priority
order. That would simplify the action for UnsupportedCompressionType.
6. """As the client will not know the broker id of its bootstrap servers
the broker_id label should be set to “bootstrap”.""" Maybe using the same
convention as is used in the NetworkClient, where bootstrap servers are the
id of the negative of their index in the list?
7. Maybe call it "client.process.rss.bytes" rather than
"client.process.memory.bytes",
to be explicit?
8. It's a little confusing that --id option to kafka-client-metrics.sh can
be a prefix or an exact match. Perhaps --id and --id-prefix would be
clearer.
9. Maybe I missed it, but does the client continue to push metrics to the
same broker as it randomly picked initially? If it gets disconnected from
that broker what happens, does it just randomly pick another?
10. To subscribe to all metrics I assume I can just do
`kafka-client-metrics.sh ... --metric ''`? It might be worth saying this
explicitly. AFAICS this is the only way to find out all the metrics
supported by a client if you don't already know from the client's software
version.

Kind regards,

Tom

On Fri, Jun 18, 2021 at 9:39 PM Travis Bischel <tr...@gmail.com>
wrote:

> Hi Colin (and Magnus),
>
> Thanks for the replies!
>
> I think the biggest concern I have is the cardinality bits. I'm
> sympathetic to the aspect of this making it easier for Kafka brokers to
> understand *every* aspect of the Kafka ecosystem. I am not sure this will
> 100% solve the need there, though: if a client is unable to connect to a
> broker, visibility disappears immediately, no?
>
> I do still think that the problem of difficulty of monitoring within an
> organization results from issues within organizations themselves: orgs
> should have proper processes in place such that anything talking to Kafka
> has the org's plug-in monitoring libraries. Kafka operators can define
> those libraries, such that all clients in the org have the libraries the
> operators require. This satisfies the same goals this KIP aims to provide,
> albeit with the increased org cost of not just having something defined to
> be plugged in.
>
> If Kafka operators themselves can choose which metrics they want, so that the
> broker can tell the client "only send these metrics", then my biggest
> concern is removed.
>
> I do still think that hooks can be a cleaner abstraction to this same
> goal, and then pre-provided libraries (say, "this library provides X,Y,Z
> and sends to prometheus from your client") could exist that more exactly
> satisfy what this KIP aims to provide. This would also avoid the
> kitchen-sink vs. not-comprehensive-enough issue I brought up previously.
> This would also avoid requiring KIPs for any supported metrics.
>
> On 2021/06/16 22:27:55, "Colin McCabe" <cm...@apache.org> wrote:
> > On Sun, Jun 13, 2021, at 21:51, Travis Bischel wrote:
> > > Hi! I have a few thoughts on this KIP. First, I'd like to thank you
> for
> > > the writeup,
> > > clearly a lot of thought has gone into it and it is very thorough.
> > > However, I'm not
> > > convinced it's the right approach from a fundamental level.
> > >
> > > Fundamentally, this KIP seems like somewhat of a solution to an
> organizational
> > > problem. Metrics are organizational concerns, not Kafka operator
> concerns.
> >
> > Hi Travis,
> >
> > Metrics are certainly Kafka operator concerns. It is very important for
> cluster operators to know things like how many clients there are, what the
> clients are doing, and so forth. This information is needed to administer
> Kafka. Therefore it certainly falls in the domain of the Kafka operations
> team (and the Kafka development team.)
> >
> > We have added many metrics in the past to make it easier to monitor
> clients. I think this is just another step in that direction.
> >
> > > Clients should make it easy to plug in metrics (this is the approach I
> take in
> > > my own client), and organizations should have processes such that all
> clients
> > > gather and ship metrics how that organization desires.
> > >
> > > If an organization is set up correctly, there is no reason for metrics
> to be
> > > forwarded through Kafka. This feels like a solution to an organization
> not
> > > properly setting up how processes ship metrics, and in some ways, it's
> an
> > > overbroad solution, and in other ways, it doesn't cover the entire
> problem.
> >
> > I think the reason was explained pretty clearly: many admins find it
> difficult to set up monitoring for every client in the organization. In
> general the team which maintains a Kafka cluster is often separate from the
> teams that use the cluster. Therefore rolling out monitoring for clients
> can be very difficult to coordinate.
> >
> > No metrics will ever cover every possible use-case, but the set proposed
> here does seem useful.
> >
> > >
> > > From the perspective of Kafka operators, it is easy to see that this
> KIP is
> > > nice in that it just dictates what clients should support for metrics
> and that
> > > the metrics should ship through Kafka. But, from the perspective of an
> > > observability team, this workflow is basically hijacking the standard
> flow that
> > > organizations may have. I would rather have applications collect
> metrics and
> > > ship them the same way every other application does. I'd rather not
> have to
> > > configure additional plugins within Kafka to take metrics and forward
> them.
> >
> > This change doesn't remove any functionality. If you don't want to use
> KIP-714 metrics collection, you can simply turn it off and continue
> collecting metrics the way you always have.
> >
> > >
> > > More importantly, this KIP prescibes cardinality problems, requires
> that to
> > > officially support the KIP a client must support all relevant metrics
> within
> > > the KIP, and requires that a client cannot support other metrics
> unless those
> > > other metrics also go through a KIP process. It is difficult to
> imagine all of
> > > these metrics being relevant to every organization, and there is no
> way for an
> > > organization to filter what is relevant within the client. Instead, the
> > > filtering is pushed downwards, meaning more network IO and more CPU
> costs to
> > > filter what is irrelevant and aggregate what needs to be aggregated,
> and more
> > > time for an organization to setup whatever it is that will be doing
> this
> > > filtering and aggregating. Contrast this with a client that enables
> hooking in
> > > to capture numbers that are relevant within an org itself: the org can
> gather
> > > what they want, ship only want they want, and ship directly to the
> > > observability system they have already set up. As an aside, it may
> also be
> > > wise to avoid shipping metrics through Kafka about client interaction
> with
> > > Kafka, because if Kafka is having problems, then orgs lose insight
> into those
> > > problems. This would be like statuspage using itself for status on its
> own
> > > systems.
> > >
> > > Another downside is that by dictating the important metrics, this KIP
> either
> > > has two choices: try to choose what is important to every org, and
> inevitably
> > > leave out something important to somebody else, or just add everything
> and let
> > > the orgs filter. This KIP mostly looks to go with the latter approach,
> meaning
> > > orgs will be shipping & filtering. With hooks, an org would be able to
> gather
> > > exactly what they want.
> >
> > I actually do agree with this criticism to some extent. It would be good
> if the broker could specify what metrics it wants, and the clients would
> send only those metrics.
> >
> > More generally, I'd like to see this split up into several RPCs rather
> than one mega-RPC.
> >
> > Maybe something like
> > 1. RegisterClient{Request,Response}
> > 2. ClientMetricsReport{Request,Response}
> > 3. UnregisterClient{Request,Response}
> >
> > Then the broker can communicate which metrics it wants in
> RegisterClientResponse. It can also assign a client instance ID (which I
> think should be a UUID, not another string).
> >
> > >
> > > As well, I expect that org applications have metrics on the state of
> the
> > > applications outside of the Kafka client. Applications are already
> sending
> > > non-Kafka-client related metrics outbound to observability systems. If
> a Kafka
> > > client provided hooks, then users could just gather the additional
> relevant
> > > Kafka client metrics and ship those metrics the same way they do all
> of their
> > > other metrics. It feels a bit odd for a Kafka client to have its own
> separate
> > > way of forwarding metrics. Another benefit hooks in clients is that
> > > organizations do not _have_ to set up additional plugins to forward
> metrics
> > > from Kafka. Hooks avoid extra organizational work.
> >
> > Again, if you want to continue collecting metrics directly from clients,
> you can simply do that. Nothing has to change for you as a result of this
> KIP.
> >
> > >
> > > The option that the KIP provides for users of clients to opt out of
> metrics may
> > > avoid some of the above issues (by just disabling things at the user
> level),
> > > but that's not really great from the perspective of client authors,
> because the
> > > existence of this KIP forces authors to either just not implement the
> KIP, or
> > > increase complexity within the KIP. Further, from an operator
> perspective, if I
> > > would prefer clients to ship metrics through the systems they already
> have in
> > > place, now I have to expect that anything that uses librdkafka or the
> official
> > > Java client will be shipping me metrics that I have to deal with
> (since the KIP
> > > is default enabled).
> >
> > It is clear that we want to avoid unnecessary complexity. However, the
> ability to easily gather metrics from clients without doing a lot of
> client-side configuration is not an ability that we currently have, and
> this KIP changes that. So I think it's very well worth it.
> >
> > >
> > > Lastly, I'm a little wary that this KIP may stem from a product goal of
> > > Confluent: since most everything uses librdkafka or the Java client,
> then by
> > > defaulting clients sending metrics, Confluent gets an easy way to
> provide
> > > metric panels for a nice cloud UI. If any client does not want to
> support these
> > > metrics, and then a user wonders why these hypothetical panels have no
> metrics,
> > > then Confluent can just reply "use a supported client".  Even if this
> > > (potentially unlikely) scenario is true, then hooks would still be a
> great
> > > alternative, because then Confluent could provide drop-in hooks for
> any client
> > > and the end result of easy-panels would be the same.
> > >
> >
> > In general, if a feature provides a benefit to users and operators,
> that's a reason to put it in, not a reason to leave it out. We want to make
> Kafka better, which includes making it better for vendors. There is nothing
> Confluent-specific in the proposal.
> >
> > > In summary,
> > >
> > > - Metrics are more of an organizational concern, not specifically a
> broker
> > >   operator concern.
> > >
> > > - The proposal seems to hijack how metrics are gathered within
> organizations
> > >
> > > - I don't think KIPs should dictate which metrics should be gathered
> and which
> > >   should not. Clients instead should make it easy for users to gather
> anything
> > >   they could be interested in, and ignore anything they are not.
> >
> > KIPs have always dictated which metrics are gathered and which should
> not. This is very intentionally part of the KIP process (since metrics are
> public API).
> >
> > best,
> > Colin
> >
> > >
> > > - I think hooks are more extensible, more exact, and fit better into
> > >   organizational workflows.
> > >
> > > On 2021/06/02 12:45:45, Magnus Edenhill <ma...@edenhill.se> wrote:
> > > > Hey all,
> > > >
> > > > I'm proposing KIP-714 to add remote Client metrics and observability.
> > > > This functionality will allow centralized monitoring and
> troubleshooting of
> > > > clients and their internals.
> > > >
> > > > Please see
> > > >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-714%3A+Client+metrics+and+observability
> > > >
> > > > Looking forward to your feedback!
> > > >
> > > > Regards,
> > > > Magnus
> > > >
> > >
> >
>
>

Re: [DISCUSS] KIP-714: Client metrics and observability

Posted by Magnus Edenhill <ma...@edenhill.se>.

On Fri, 18 June 2021 at 22:32, Travis Bischel <travis.bischel@gmail.com> wrote:

> Hi Colin (and Magnus),
>
> Thanks for the replies!
>
> I think the biggest concern I have is the cardinality bits. I'm
> sympathetic to the aspect of this making it easier for Kafka brokers to
> understand *every* aspect of the Kafka ecosystem. I am not sure this will
> 100% solve the need there, though: if a client is unable to connect to a
> broker, visibility disappears immediately, no?
>

At the end of the day this is an unsolvable problem, but what the proposed
approach gives us is a channel that is operational whenever Kafka is
operational, regardless of external systems.
If a Kafka client can't connect to Kafka, the main interest is not its
internal Kafka metrics but rather the connectivity/networking side.



>
> I do still think that the problem of difficulty of monitoring within an
> organization results from issues within organizations themselves: orgs
> should have proper processes in place such that anything talking to Kafka
> has the org's plug-in monitoring libraries. Kafka operators can define
> those libraries, such that all clients in the org have the libraries the
> operators require. This satisfies the same goals this KIP aims to provide,
> albeit with the increased org cost of not just having something defined to
> be plugged in.
>

Yeah, that would be great, and some orgs do indeed come close to this. But
most don't, and then there's the multi-org case, where the client
developers and cluster operators reside in different organizations.



>
> If Kafka operators themselves can choose which metrics they want, so that the
> broker can tell the client "only send these metrics", then my biggest
> concern is removed.
>

That's indeed how it works: the metrics that a client pushes are set up by
the cluster operator (et al.) by configuring metrics subscriptions. The
client will not send any metrics that have not been centrally
requested/subscribed. It is all controlled from the cluster: which clients
send which metrics at what interval.
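
For illustration only, here is a minimal Python sketch of what
subscription-driven filtering on the client side could look like. The
Subscription class, the prefix-matching rule, and the metric names are all
assumptions made for this example; they are not taken from the KIP text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Subscription:
    """Hypothetical subscription, as a client might receive it from the broker."""
    metric_prefixes: tuple  # metric-name prefixes the operator asked for (assumed rule)
    push_interval_ms: int   # how often the client should push


def select_metrics(collected, subscription):
    """Return only the collected metrics whose name matches a subscribed prefix."""
    return {
        name: value
        for name, value in collected.items()
        if any(name.startswith(p) for p in subscription.metric_prefixes)
    }


# Made-up metric names, purely for the example.
collected = {
    "client.producer.record.queue.bytes": 1024,
    "client.producer.request.rtt.ms": 12,
    "client.consumer.poll.interval.ms": 250,
}
sub = Subscription(metric_prefixes=("client.producer.",), push_interval_ms=60000)
to_push = select_metrics(collected, sub)
```

With an empty subscription nothing is selected, which matches the statement
that a client sends nothing that has not been requested.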



>
> I do still think that hooks can be a cleaner abstraction to this same
> goal, and then pre-provided libraries (say, "this library provides X,Y,Z
> and sends to prometheus from your client") could exist that more exactly
> satisfy what this KIP aims to provide. This would also avoid the
> kitchen-sink vs. not-comprehensive-enough issue I brought up previously.
> This would also avoid requiring KIPs for any supported metrics.
>


This defeats the general availability always-on goal of the KIP though:
client metrics available on demand out of the box.


Thanks for your comments Travis.

/Magnus






Re: [DISCUSS] KIP-714: Client metrics and observability

Posted by Travis Bischel <tr...@gmail.com>.

Hi Colin (and Magnus),

Thanks for the replies!

I think the biggest concern I have is the cardinality bits. I'm sympathetic to the aspect of this making it easier for Kafka brokers to understand *every* aspect of the Kafka ecosystem. I am not sure this will 100% solve the need there, though: if a client is unable to connect to a broker, visibility disappears immediately, no?

I do still think that the problem of difficulty of monitoring within an organization results from issues within organizations themselves: orgs should have proper processes in place such that anything talking to Kafka has the org's plug-in monitoring libraries. Kafka operators can define those libraries, such that all clients in the org have the libraries the operators require. This satisfies the same goals this KIP aims to provide, albeit with the increased org cost of not just having something defined to be plugged in.

If Kafka operators themselves can choose which metrics they want, so that the broker can tell the client "only send these metrics", then my biggest concern is removed.

I do still think that hooks can be a cleaner abstraction to this same goal, and then pre-provided libraries (say, "this library provides X,Y,Z and sends to prometheus from your client") could exist that more exactly satisfy what this KIP aims to provide. This would also avoid the kitchen-sink vs. not-comprehensive-enough issue I brought up previously. This would also avoid requiring KIPs for any supported metrics.
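
As a rough illustration of the hook idea, here is a minimal Python sketch. HookedClient, add_hook, on_produce, and BytesPerTopic are hypothetical names invented for this example and do not correspond to any real client's API; a real client would send the payload to a broker where this one only fires the hook.

```python
class HookedClient:
    """Hypothetical client that exposes hook points instead of shipping metrics."""

    def __init__(self):
        self._hooks = []

    def add_hook(self, hook):
        """Register an object that implements any subset of the on_* callbacks."""
        self._hooks.append(hook)

    def _emit(self, event, **fields):
        # Call on_<event> on every hook that implements it; skip the others.
        for hook in self._hooks:
            handler = getattr(hook, "on_" + event, None)
            if handler is not None:
                handler(**fields)

    def produce(self, topic, payload):
        # Stand-in for a real send; only the hook notification is modelled.
        self._emit("produce", topic=topic, num_bytes=len(payload))


class BytesPerTopic:
    """Example hook: the org gathers only the one number it cares about."""

    def __init__(self):
        self.totals = {}

    def on_produce(self, topic, num_bytes):
        self.totals[topic] = self.totals.get(topic, 0) + num_bytes


client = HookedClient()
hook = BytesPerTopic()
client.add_hook(hook)
client.produce("orders", b"abc")
client.produce("orders", b"de")
```

The hook object decides what to record and where to ship it, so the client itself needs no knowledge of the organization's observability system.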

On 2021/06/16 22:27:55, "Colin McCabe" <cm...@apache.org> wrote: 
> On Sun, Jun 13, 2021, at 21:51, Travis Bischel wrote:
> > Hi! I have a few thoughts on this KIP. First, I'd like to thank you for 
> > the writeup,
> > clearly a lot of thought has gone into it and it is very thorough. 
> > However, I'm not
> > convinced it's the right approach from a fundamental level.
> > 
> > Fundamentally, this KIP seems like somewhat of a solution to an organizational
> > problem. Metrics are organizational concerns, not Kafka operator concerns.
> 
> Hi Travis,
> 
> Metrics are certainly Kafka operator concerns. It is very important for cluster operators to know things like how many clients there are, what the clients are doing, and so forth. This information is needed to administer Kafka. Therefore it certainly falls in the domain of the Kafka operations team (and the Kafka development team).
> 
> We have added many metrics in the past to make it easier to monitor clients. I think this is just another step in that direction.
> 
> > Clients should make it easy to plug in metrics (this is the approach I take in
> > my own client), and organizations should have processes such that all clients
> > gather and ship metrics how that organization desires.
> >
> > If an organization is set up correctly, there is no reason for metrics to be
> > forwarded through Kafka. This feels like a solution to an organization not
> > properly setting up how processes ship metrics, and in some ways, it's an
> > overbroad solution, and in other ways, it doesn't cover the entire problem.
> 
> I think the reason was explained pretty clearly: many admins find it difficult to set up monitoring for every client in the organization. In general the team which maintains a Kafka cluster is often separate from the teams that use the cluster. Therefore rolling out monitoring for clients can be very difficult to coordinate.
> 
> No metrics will ever cover every possible use-case, but the set proposed here does seem useful.
> 
> > 
> > From the perspective of Kafka operators, it is easy to see that this KIP is
> > nice in that it just dictates what clients should support for metrics and that
> > the metrics should ship through Kafka. But, from the perspective of an
> > observability team, this workflow is basically hijacking the standard flow that
> > organizations may have. I would rather have applications collect metrics and
> > ship them the same way every other application does. I'd rather not have to
> > configure additional plugins within Kafka to take metrics and forward them.
> 
> This change doesn't remove any functionality. If you don't want to use KIP-714 metrics collection, you can simply turn it off and continue collecting metrics the way you always have.
> 
> > 
> > More importantly, this KIP prescribes cardinality problems, requires that to
> > officially support the KIP a client must support all relevant metrics within
> > the KIP, and requires that a client cannot support other metrics unless those
> > other metrics also go through a KIP process. It is difficult to imagine all of
> > these metrics being relevant to every organization, and there is no way for an
> > organization to filter what is relevant within the client. Instead, the
> > filtering is pushed downwards, meaning more network IO and more CPU costs to
> > filter what is irrelevant and aggregate what needs to be aggregated, and more
> > time for an organization to setup whatever it is that will be doing this
> > filtering and aggregating. Contrast this with a client that enables hooking in
> > to capture numbers that are relevant within an org itself: the org can gather
> > what they want, ship only what they want, and ship directly to the
> > observability system they have already set up. As an aside, it may also be
> > wise to avoid shipping metrics through Kafka about client interaction with
> > Kafka, because if Kafka is having problems, then orgs lose insight into those
> > problems. This would be like statuspage using itself for status on its own
> > systems.
> > 
> > Another downside is that by dictating the important metrics, this KIP either
> > has two choices: try to choose what is important to every org, and inevitably
> > leave out something important to somebody else, or just add everything and let
> > the orgs filter. This KIP mostly looks to go with the latter approach, meaning
> > orgs will be shipping & filtering. With hooks, an org would be able to gather
> > exactly what they want.
> 
> I actually do agree with this criticism to some extent. It would be good if the broker could specify what metrics it wants, and the clients would send only those metrics.
> 
> More generally, I'd like to see this split up into several RPCs rather than one mega-RPC.
> 
> Maybe something like 
> 1. RegisterClient{Request,Response}
> 2. ClientMetricsReport{Request,Response}
> 3. UnregisterClient{Request,Response}
> 
> Then the broker can communicate which metrics it wants in RegisterClientResponse. It can also assign a client instance ID (which I think should be a UUID, not another string).
> 
> > 
> > As well, I expect that org applications have metrics on the state of the
> > applications outside of the Kafka client. Applications are already sending
> > non-Kafka-client related metrics outbound to observability systems. If a Kafka
> > client provided hooks, then users could just gather the additional relevant
> > Kafka client metrics and ship those metrics the same way they do all of their
> > other metrics. It feels a bit odd for a Kafka client to have its own separate
> > way of forwarding metrics. Another benefit of hooks in clients is that
> > organizations do not _have_ to set up additional plugins to forward metrics
> > from Kafka. Hooks avoid extra organizational work.
> 
> Again, if you want to continue collecting metrics directly from clients, you can simply do that. Nothing has to change for you as a result of this KIP.
> 
> > 
> > The option that the KIP provides for users of clients to opt out of metrics may
> > avoid some of the above issues (by just disabling things at the user level),
> > but that's not really great from the perspective of client authors, because the
> > existence of this KIP forces authors to either just not implement the KIP, or
> > increase complexity within the KIP. Further, from an operator perspective, if I
> > would prefer clients to ship metrics through the systems they already have in
> > place, now I have to expect that anything that uses librdkafka or the official
> > Java client will be shipping me metrics that I have to deal with (since the KIP
> > is default enabled).
> 
> It is clear that we want to avoid unnecessary complexity. However, the ability to easily gather metrics from clients without doing a lot of client-side configuration is not an ability that we currently have, and this KIP changes that. So I think it's very well worth it.
> 
> > 
> > Lastly, I'm a little wary that this KIP may stem from a product goal of
> > Confluent: since most everything uses librdkafka or the Java client, then by
> > defaulting clients sending metrics, Confluent gets an easy way to provide
> > metric panels for a nice cloud UI. If any client does not want to support these
> > metrics, and then a user wonders why these hypothetical panels have no metrics,
> > then Confluent can just reply "use a supported client".  Even if this
> > (potentially unlikely) scenario is true, then hooks would still be a great
> > alternative, because then Confluent could provide drop-in hooks for any client
> > and the end result of easy-panels would be the same.
> > 
> 
> In general, if a feature provides a benefit to users and operators, that's a reason to put it in, not a reason to leave it out. We want to make Kafka better, which includes making it better for vendors. There is nothing Confluent-specific in the proposal.
> 
> > In summary,
> > 
> > - Metrics are more of an organizational concern, not specifically a broker
> >   operator concern.
> > 
> > - The proposal seems to hijack how metrics are gathered within organizations
> > 
> > - I don't think KIPs should dictate which metrics should be gathered and which
> >   should not. Clients instead should make it easy for users to gather anything
> >   they could be interested in, and ignore anything they are not.
> 
> KIPs have always dictated which metrics are gathered and which should not. This is very intentionally part of the KIP process (since metrics are public API).
> 
> best,
> Colin
> 
> > 
> > - I think hooks are more extensible, more exact, and fit better into
> >   organizational workflows.
> > 
> > On 2021/06/02 12:45:45, Magnus Edenhill <ma...@edenhill.se> wrote: 
> > > Hey all,
> > > 
> > > I'm proposing KIP-714 to add remote Client metrics and observability.
> > > This functionality will allow centralized monitoring and troubleshooting of
> > > clients and their internals.
> > > 
> > > Please see
> > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-714%3A+Client+metrics+and+observability
> > > 
> > > Looking forward to your feedback!
> > > 
> > > Regards,
> > > Magnus
> > > 
> > 
>

Re: [DISCUSS] KIP-714: Client metrics and observability

Posted by Magnus Edenhill <ma...@edenhill.se>.

Thanks for your feedback, Colin, see response below.

On Thu, 17 June 2021 at 00:28, Colin McCabe <cm...@apache.org> wrote:

> On Sun, Jun 13, 2021, at 21:51, Travis Bischel wrote:
>

...

> > Another downside is that by dictating the important metrics, this KIP either
> > has two choices: try to choose what is important to every org, and inevitably
> > leave out something important to somebody else, or just add everything and let
> > the orgs filter. This KIP mostly looks to go with the latter approach, meaning
> > orgs will be shipping & filtering. With hooks, an org would be able to gather
> > exactly what they want.
>
> I actually do agree with this criticism to some extent. It would be good
> if the broker could specify what metrics it wants, and the clients would
> send only those metrics.
>

The metrics to collect are indeed controlled by the cluster operator (or
whoever has access). This is done by setting up metrics subscriptions (a new
Admin ConfigEntry) that are propagated to the client through the
PushTelemetryResponse, telling the client exactly which metrics to push and
at what interval.
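
As a rough illustration of the subscription idea, here is a minimal sketch in
Python. It is not taken from the KIP: the Subscription class, the prefix
matching rule, and the metric names are all assumptions made up for this
example. It only shows a client keeping the metrics that match what the
operator subscribed to.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Subscription:
    """Hypothetical stand-in for a metrics subscription sent to a client."""
    metric_prefixes: tuple  # metric name prefixes the operator wants (assumed rule)
    push_interval_ms: int   # how often the client should push


def select_metrics(collected: dict, sub: Subscription) -> dict:
    """Keep only the collected metrics whose names match the subscription."""
    return {
        name: value
        for name, value in collected.items()
        if any(name.startswith(prefix) for prefix in sub.metric_prefixes)
    }


# Metric names below are invented for illustration.
sub = Subscription(metric_prefixes=("client.producer.",), push_interval_ms=60_000)
collected = {
    "client.producer.record.bytes": 1024,
    "client.consumer.poll.latency": 12,
}
to_push = select_metrics(collected, sub)
```

With this subscription only the producer metric would be pushed; the consumer
metric is collected locally but never leaves the client.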

> More generally, I'd like to see this split up into several RPCs rather
> than one mega-RPC.
>
> Maybe something like
> 1. RegisterClient{Request,Response}
> 2. ClientMetricsReport{Request,Response}
> 3. UnregisterClient{Request,Response}
>
> Then the broker can communicate which metrics it wants in
> RegisterClientResponse. It can also assign a client instance ID (which I
> think should be a UUID, not another string).
>

All this functionality is covered by the single PushTelemetryRequest, which
is used both for pushing metrics to the broker (in the request) and for
propagating metrics subscriptions to the client (in the response). Using a
single request type for both operations allows piggy-backing either metrics
or subscriptions (depending on direction) on a request that is sent at
regular intervals, sort of like a recurring poll.
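
The recurring-poll behaviour can be sketched as follows. This is a toy
in-memory model, not the wire protocol: FakeBroker, FakeClient, the
dictionary-shaped response, and the metric names are hypothetical and exist
only to show metrics travelling in the request and the subscription coming
back in the response.

```python
class FakeBroker:
    """In-memory stand-in for a broker; not a real Kafka API."""

    def __init__(self, wanted):
        self.wanted = set(wanted)
        self.received = []

    def push_telemetry(self, metrics):
        # The request carries metrics; the response carries the subscription.
        self.received.append(dict(metrics))
        return {"subscribed": sorted(self.wanted)}


class FakeClient:
    """Toy client that pushes on every tick and learns what to push next."""

    def __init__(self, broker):
        self.broker = broker
        self.subscribed = []  # nothing is known until the first response

    def tick(self, collected):
        # Send only what the last response asked for, then update from the reply.
        payload = {k: v for k, v in collected.items() if k in self.subscribed}
        response = self.broker.push_telemetry(payload)
        self.subscribed = response["subscribed"]


broker = FakeBroker(wanted=["requests.sent"])
client = FakeClient(broker)
stats = {"requests.sent": 3, "bytes.out": 900}
client.tick(stats)  # first push is empty; the response carries the subscription
client.tick(stats)  # second push carries only the subscribed metric
```

In this model a subscription change on the broker takes effect one push
interval later, since the client only learns about it from the next response.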

I think something like RegisterClientRequest makes sense for deconfliction
and fencing, such as with InitProducerIdRequest, but we don't have any need
for that so I don't think the added complexity gives us much.

/Magnus

Re: [DISCUSS] KIP-714: Client metrics and observability

Posted by Colin McCabe <cm...@apache.org>.

On Sun, Jun 13, 2021, at 21:51, Travis Bischel wrote:
> Hi! I have a few thoughts on this KIP. First, I'd like to thank you for 
> the writeup,
> clearly a lot of thought has gone into it and it is very thorough. 
> However, I'm not
> convinced it's the right approach from a fundamental level.
> 
> Fundamentally, this KIP seems like somewhat of a solution to an organizational
> problem. Metrics are organizational concerns, not Kafka operator concerns.

Hi Travis,

Metrics are certainly Kafka operator concerns. It is very important for cluster operators to know things like how many clients there are, what those clients are doing, and so forth. This information is needed to administer Kafka. Therefore it certainly falls in the domain of the Kafka operations team (and the Kafka development team).

We have added many metrics in the past to make it easier to monitor clients. I think this is just another step in that direction.

> Clients should make it easy to plug in metrics (this is the approach I take in
> my own client), and organizations should have processes such that all clients
> gather and ship metrics how that organization desires.
>
> If an organization is set up correctly, there is no reason for metrics to be
> forwarded through Kafka. This feels like a solution to an organization not
> properly setting up how processes ship metrics, and in some ways, it's an
> overbroad solution, and in other ways, it doesn't cover the entire problem.

I think the reason was explained pretty clearly: many admins find it difficult to set up monitoring for every client in the organization. In general the team which maintains a Kafka cluster is often separate from the teams that use the cluster. Therefore rolling out monitoring for clients can be very difficult to coordinate.

No metrics will ever cover every possible use-case, but the set proposed here does seem useful.

> 
> From the perspective of Kafka operators, it is easy to see that this KIP is
> nice in that it just dictates what clients should support for metrics and that
> the metrics should ship through Kafka. But, from the perspective of an
> observability team, this workflow is basically hijacking the standard flow that
> organizations may have. I would rather have applications collect metrics and
> ship them the same way every other application does. I'd rather not have to
> configure additional plugins within Kafka to take metrics and forward them.

This change doesn't remove any functionality. If you don't want to use KIP-714 metrics collection, you can simply turn it off and continue collecting metrics the way you always have.

> 
> More importantly, this KIP prescribes cardinality problems, requires that to
> officially support the KIP a client must support all relevant metrics within
> the KIP, and requires that a client cannot support other metrics unless those
> other metrics also go through a KIP process. It is difficult to imagine all of
> these metrics being relevant to every organization, and there is no way for an
> organization to filter what is relevant within the client. Instead, the
> filtering is pushed downwards, meaning more network IO and more CPU costs to
> filter what is irrelevant and aggregate what needs to be aggregated, and more
> time for an organization to set up whatever it is that will be doing this
> filtering and aggregating. Contrast this with a client that enables hooking in
> to capture numbers that are relevant within an org itself: the org can gather
> what they want, ship only what they want, and ship directly to the
> observability system they have already set up. As an aside, it may also be
> wise to avoid shipping metrics through Kafka about client interaction with
> Kafka, because if Kafka is having problems, then orgs lose insight into those
> problems. This would be like statuspage using itself for status on its own
> systems.
> 
> Another downside is that by dictating the important metrics, this KIP either
> has two choices: try to choose what is important to every org, and inevitably
> leave out something important to somebody else, or just add everything and let
> the orgs filter. This KIP mostly looks to go with the latter approach, meaning
> orgs will be shipping & filtering. With hooks, an org would be able to gather
> exactly what they want.

I actually do agree with this criticism to some extent. It would be good if the broker could specify what metrics it wants, and the clients would send only those metrics.

More generally, I'd like to see this split up into several RPCs rather than one mega-RPC.

Maybe something like 
1. RegisterClient{Request,Response}
2. ClientMetricsReport{Request,Response}
3. UnregisterClient{Request,Response}

Then the broker can communicate which metrics it wants in RegisterClientResponse. It can also assign a client instance ID (which I think should be a UUID, not another string).
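
To make the shape of that split concrete, here is a minimal sketch in Python.
It is only an illustration under stated assumptions: the class, method, and
field names are hypothetical and loosely mirror the three RPC names above,
the broker is an in-memory object rather than anything in Kafka, and the
metric names are invented.

```python
import uuid
from dataclasses import dataclass


@dataclass
class RegisterClientResponse:
    """Hypothetical response: an assigned instance ID plus the wanted metrics."""
    client_instance_id: uuid.UUID
    requested_metrics: list


class Broker:
    """In-memory stand-in for a broker handling the three proposed RPCs."""

    def __init__(self, requested_metrics):
        self.requested_metrics = list(requested_metrics)
        self.reports = {}  # client instance ID -> list of metric reports

    def register_client(self):
        # The broker assigns the instance ID as a UUID and states what it wants.
        instance_id = uuid.uuid4()
        self.reports[instance_id] = []
        return RegisterClientResponse(instance_id, list(self.requested_metrics))

    def client_metrics_report(self, instance_id, metrics):
        if instance_id not in self.reports:
            raise KeyError("unknown client instance id")
        self.reports[instance_id].append(dict(metrics))

    def unregister_client(self, instance_id):
        self.reports.pop(instance_id, None)


broker = Broker(requested_metrics=["requests.sent"])
registration = broker.register_client()
all_stats = {"requests.sent": 7, "bytes.out": 2048}
# The client sends only the metrics named in the registration response.
wanted = {k: all_stats[k] for k in registration.requested_metrics}
broker.client_metrics_report(registration.client_instance_id, wanted)
reports_before_unregister = list(broker.reports[registration.client_instance_id])
broker.unregister_client(registration.client_instance_id)
```
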

> 
> As well, I expect that org applications have metrics on the state of the
> applications outside of the Kafka client. Applications are already sending
> non-Kafka-client related metrics outbound to observability systems. If a Kafka
> client provided hooks, then users could just gather the additional relevant
> Kafka client metrics and ship those metrics the same way they do all of their
> other metrics. It feels a bit odd for a Kafka client to have its own separate
> way of forwarding metrics. Another benefit of hooks in clients is that
> organizations do not _have_ to set up additional plugins to forward metrics
> from Kafka. Hooks avoid extra organizational work.

Again, if you want to continue collecting metrics directly from clients, you can simply do that. Nothing has to change for you as a result of this KIP.

> 
> The option that the KIP provides for users of clients to opt out of metrics may
> avoid some of the above issues (by just disabling things at the user level),
> but that's not really great from the perspective of client authors, because the
> existence of this KIP forces authors to either just not implement the KIP, or
> increase complexity within the KIP. Further, from an operator perspective, if I
> would prefer clients to ship metrics through the systems they already have in
> place, now I have to expect that anything that uses librdkafka or the official
> Java client will be shipping me metrics that I have to deal with (since the KIP
> is default enabled).

It is clear that we want to avoid unnecessary complexity. However, the ability to easily gather metrics from clients without doing a lot of client-side configuration is not an ability that we currently have, and this KIP changes that. So I think it's very well worth it.

> 
> Lastly, I'm a little wary that this KIP may stem from a product goal of
> Confluent: since most everything uses librdkafka or the Java client, then by
> defaulting clients sending metrics, Confluent gets an easy way to provide
> metric panels for a nice cloud UI. If any client does not want to support these
> metrics, and then a user wonders why these hypothetical panels have no metrics,
> then Confluent can just reply "use a supported client".  Even if this
> (potentially unlikely) scenario is true, then hooks would still be a great
> alternative, because then Confluent could provide drop-in hooks for any client
> and the end result of easy-panels would be the same.
> 

In general, if a feature provides a benefit to users and operators, that's a reason to put it in, not a reason to leave it out. We want to make Kafka better, which includes making it better for vendors. There is nothing Confluent-specific in the proposal.

> In summary,
> 
> - Metrics are more of an organizational concern, not specifically a broker
>   operator concern.
> 
> - The proposal seems to hijack how metrics are gathered within organizations
> 
> - I don't think KIPs should dictate which metrics should be gathered and which
>   should not. Clients instead should make it easy for users to gather anything
>   they could be interested in, and ignore anything they are not.

KIPs have always dictated which metrics are gathered and which are not. This is very intentionally part of the KIP process (since metrics are public API).

best,
Colin

> 
> - I think hooks are more extensible, more exact, and fit better into
>   organizational workflows.
> 
> On 2021/06/02 12:45:45, Magnus Edenhill <ma...@edenhill.se> wrote: 
> > Hey all,
> > 
> > I'm proposing KIP-714 to add remote Client metrics and observability.
> > This functionality will allow centralized monitoring and troubleshooting of
> > clients and their internals.
> > 
> > Please see
> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-714%3A+Client+metrics+and+observability
> > 
> > Looking forward to your feedback!
> > 
> > Regards,
> > Magnus
> > 
>

Re: [DISCUSS] KIP-714: Client metrics and observability

Posted by Travis Bischel <tr...@gmail.com>.

Hi! I have a few thoughts on this KIP. First, I'd like to thank you for the writeup,
clearly a lot of thought has gone into it and it is very thorough. However, I'm not
convinced it's the right approach from a fundamental level.

Fundamentally, this KIP seems like somewhat of a solution to an organizational
problem. Metrics are organizational concerns, not Kafka operator concerns.
Clients should make it easy to plug in metrics (this is the approach I take in
my own client), and organizations should have processes such that all clients
gather and ship metrics how that organization desires. If an organization is
set up correctly, there is no reason for metrics to be forwarded through Kafka.
This feels like a solution to an organization not properly setting up how
processes ship metrics, and in some ways, it's an overbroad solution, and in
other ways, it doesn't cover the entire problem.

From the perspective of Kafka operators, it is easy to see that this KIP is
nice in that it just dictates what clients should support for metrics and that
the metrics should ship through Kafka. But, from the perspective of an
observability team, this workflow is basically hijacking the standard flow that
organizations may have. I would rather have applications collect metrics and
ship them the same way every other application does. I'd rather not have to
configure additional plugins within Kafka to take metrics and forward them.

More importantly, this KIP prescribes cardinality problems, requires that to
officially support the KIP a client must support all relevant metrics within
the KIP, and requires that a client cannot support other metrics unless those
other metrics also go through a KIP process. It is difficult to imagine all of
these metrics being relevant to every organization, and there is no way for an
organization to filter what is relevant within the client. Instead, the
filtering is pushed downwards, meaning more network IO and more CPU costs to
filter what is irrelevant and aggregate what needs to be aggregated, and more
time for an organization to set up whatever it is that will be doing this
filtering and aggregating. Contrast this with a client that enables hooking in
to capture numbers that are relevant within an org itself: the org can gather
what they want, ship only what they want, and ship directly to the
observability system they have already set up. As an aside, it may also be
wise to avoid shipping metrics through Kafka about client interaction with
Kafka, because if Kafka is having problems, then orgs lose insight into those
problems. This would be like statuspage using itself for status on its own
systems.

Another downside is that by dictating the important metrics, this KIP either
has two choices: try to choose what is important to every org, and inevitably
leave out something important to somebody else, or just add everything and let
the orgs filter. This KIP mostly looks to go with the latter approach, meaning
orgs will be shipping & filtering. With hooks, an org would be able to gather
exactly what they want.

As well, I expect that org applications have metrics on the state of the
applications outside of the Kafka client. Applications are already sending
non-Kafka-client related metrics outbound to observability systems. If a Kafka
client provided hooks, then users could just gather the additional relevant
Kafka client metrics and ship those metrics the same way they do all of their
other metrics. It feels a bit odd for a Kafka client to have its own separate
way of forwarding metrics. Another benefit of hooks in clients is that
organizations do not _have_ to set up additional plugins to forward metrics
from Kafka. Hooks avoid extra organizational work.
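
To make the hook idea concrete, here is a minimal sketch in Python. It does
not reflect any real client's API: HookedClient, add_hook, and the event
names are hypothetical, and the client only notifies hooks rather than
talking to a broker. The point is that the application decides which events
to count and where the numbers go.

```python
from collections import Counter


class HookedClient:
    """Toy client that reports events to user-supplied hooks (hypothetical API)."""

    def __init__(self):
        self._hooks = []

    def add_hook(self, hook):
        self._hooks.append(hook)

    def produce(self, topic, payload):
        # A real client would send the record; here we only notify the hooks.
        for hook in self._hooks:
            hook("produce", {"topic": topic, "bytes": len(payload)})

    def fetch(self, topic):
        for hook in self._hooks:
            hook("fetch", {"topic": topic})


bytes_by_topic = Counter()


def org_hook(event, fields):
    # The organization keeps only what it cares about: produced bytes per topic.
    if event == "produce":
        bytes_by_topic[fields["topic"]] += fields["bytes"]


client = HookedClient()
client.add_hook(org_hook)
client.produce("orders", b"abc")
client.produce("orders", b"de")
client.fetch("orders")  # ignored by org_hook, so nothing is recorded for it
```

The counter could then be exported through whatever metrics pipeline the
application already uses, alongside its non-Kafka metrics.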

The option that the KIP provides for users of clients to opt out of metrics may
avoid some of the above issues (by just disabling things at the user level),
but that's not really great from the perspective of client authors, because the
existence of this KIP forces authors to either just not implement the KIP, or
increase complexity within the KIP. Further, from an operator perspective, if I
would prefer clients to ship metrics through the systems they already have in
place, now I have to expect that anything that uses librdkafka or the official
Java client will be shipping me metrics that I have to deal with (since the KIP
is default enabled).

Lastly, I'm a little wary that this KIP may stem from a product goal of
Confluent: since most everything uses librdkafka or the Java client, then by
defaulting clients sending metrics, Confluent gets an easy way to provide
metric panels for a nice cloud UI. If any client does not want to support these
metrics, and then a user wonders why these hypothetical panels have no metrics,
then Confluent can just reply "use a supported client".  Even if this
(potentially unlikely) scenario is true, then hooks would still be a great
alternative, because then Confluent could provide drop-in hooks for any client
and the end result of easy-panels would be the same.

In summary,

- Metrics are more of an organizational concern, not specifically a broker
  operator concern.

- The proposal seems to hijack how metrics are gathered within organizations

- I don't think KIPs should dictate which metrics should be gathered and which
  should not. Clients instead should make it easy for users to gather anything
  they could be interested in, and ignore anything they are not.

- I think hooks are more extensible, more exact, and fit better into
  organizational workflows.

On 2021/06/02 12:45:45, Magnus Edenhill <ma...@edenhill.se> wrote: 
> Hey all,
> 
> I'm proposing KIP-714 to add remote Client metrics and observability.
> This functionality will allow centralized monitoring and troubleshooting of
> clients and their internals.
> 
> Please see
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-714%3A+Client+metrics+and+observability
> 
> Looking forward to your feedback!
> 
> Regards,
> Magnus
>