Posted to dev@kafka.apache.org by Jun Rao <ju...@confluent.io.INVALID> on 2021/12/09 22:28:55 UTC

Re: [DISCUSS] KIP-714: Client metrics and observability

Hi, Magnus.

Thanks for the KIP. A few comments below.

10. There seem to be some questions about the use cases of this KIP since we
already have a client side metric reporter. It would be useful to provide a
bit more detail on that. To me, there are 3 potential use cases: (1) not
all organizations are enforcing client side metric collection; (2) if the
data is shared among 3rd parties, there is less control over external
clients; (3) when Kafka is offered as a hosted service. It would also be
useful to outline the client problems this KIP can help identify. For
example, this KIP may not help with any client connectivity problems.

11. Have we considered sending the metrics with the existing produce
request to an internal topic instead of a new request PushTelemetryRequest?
The potential benefits are (1) reusing the existing request's support for
compression, throughput throttling, etc. and (2) we could potentially get
rid of ClientTelemetryReceiver. Once the metrics land in a Kafka topic, the
operator can decide what to do with them by just consuming the topic.

12. It seems that we are defining a set of common metric names that every
client needs to support. Are most non-Java clients following the naming
convention of the Java client metrics? If not, forcing them to all change
their metric names could be disruptive.

13. Using OpenTelemetry. Does that require a runtime dependency
on the OpenTelemetry library? How good is the compatibility story
of OpenTelemetry? This is important since an application could have other
OpenTelemetry dependencies than the Kafka client.

14. The proposal listed idempotence=true. This is more of a configuration
than a metric. Are we including that as a metric? What other configurations
are we including? Should we separate the configurations from the metrics?

Thanks,

Jun

On Mon, Nov 29, 2021 at 7:34 AM Magnus Edenhill <ma...@edenhill.se> wrote:

> Hey Bob,
>
> That's a good point.
>
> Request type labels were considered, but since they're already tracked by
> broker-side metrics they were left out so as to avoid metric duplication.
> However, those metrics are not per connection, so they won't be that
> useful in practice for troubleshooting specific client instances.
>
> I'll add the request_type label to the relevant metrics.
>
> Thanks,
> Magnus
>
>
> On Tue, Nov 23, 2021 at 19:20, Bob Barrett
> <bo...@confluent.io.invalid> wrote:
>
> > Hi Magnus,
> >
> > Thanks for the thorough KIP, this seems very useful.
> >
> > Would it make sense to include the request type as a label for the
> > `client.request.success`, `client.request.errors` and
> `client.request.rtt`
> > metrics? I think it would be very useful to see which specific requests
> are
> > succeeding and failing for a client. One specific case I can think of
> where
> > this could be useful is producer batch timeouts. If a Java application
> does
> > not enable producer client logs (unfortunately, in my experience this
> > happens more often than it should), the application logs will only
> contain
> > the expiration error message, but no information about what is causing
> the
> > timeout. The requests might all be succeeding but taking too long to
> > process batches, or metadata requests might be failing, or some or all
> > produce requests might be failing (if the bootstrap servers are reachable
> > from the client but one or more other brokers are not, for example). If
> the
> > cluster operator is able to identify the specific requests that are slow
> or
> > failing for a client, they will be better able to diagnose the issue
> > causing batch timeouts.
> >
> > One drawback I can think of is that this will increase the cardinality of
> > the request metrics. But any given client is only going to use a small
> > subset of the request types, and since we already have partition labels
> for
> > the topic-level metrics, I think request labels will still make up a
> > relatively small percentage of the set of metrics.
> >
> > Thanks,
> > Bob
> >
> > On Mon, Nov 22, 2021 at 2:08 AM Viktor Somogyi-Vass <
> > viktorsomogyi@gmail.com>
> > wrote:
> >
> > > Hi Magnus,
> > >
> > > I think this is a very useful addition. We also have a similar (but
> much
> > > more simplistic) implementation of this. Maybe I missed it in the KIP
> but
> > > what about adding metrics about the subscription cache itself? That I
> > think
> > > would improve its usability and debuggability as we'd be able to see
> its
> > > performance, hit/miss rates, eviction counts and others.
> > >
> > > Best,
> > > Viktor
> > >
> > > On Thu, Nov 18, 2021 at 5:12 PM Magnus Edenhill <ma...@edenhill.se>
> > > wrote:
> > >
> > > > Hi Mickael,
> > > >
> > > > see inline.
> > > >
> > > > On Wed, Nov 10, 2021 at 15:21, Mickael Maison <mickael.maison@gmail.com> wrote:
> > > >
> > > > > Hi Magnus,
> > > > >
> > > > > I see you've addressed some of the points I raised above but some
> (4,
> > > > > 5) have not been addressed yet.
> > > > >
> > > >
> > > > Re 4) How will the user/app know metrics are being sent.
> > > >
> > > > One possibility is to add a JMX metric (thus for user consumption)
> for
> > > the
> > > > number of metric pushes the
> > > > client has performed, or perhaps the number of metrics subscriptions
> > > > currently being collected.
> > > > Would that be sufficient?
> > > >
> > > > Re 5) Metric sizes and rates
> > > >
> > > > A worst case scenario for a producer that is producing to 50 unique
> > > topics
> > > > and emitting all standard metrics yields
> > > > a serialized size of around 100KB prior to compression, which
> > compresses
> > > > down to about 20-30% of that depending
> > > > on compression type and topic name uniqueness.
> > > > The numbers for a consumer would be similar.
> > > >
> > > > In practice the number of unique topics would be far less, and the
> > > > subscription set would typically be for a subset of metrics.
> > > > So we're probably closer to 1kb, or less, compressed size per client
> > per
> > > > push interval.
> > > >
> > > > As both the subscription set and push intervals are controlled by the
> > > > cluster operator it shouldn't be too hard
> > > > to strike a good balance between metrics overhead and granularity.
> > > >
> > > >
> > > >
> > > > >
> > > > > I'm really uneasy with this being enabled by default on the client
> > > > > side. When collecting data, I think the best practice is to ensure
> > > > > users are explicitly enabling it.
> > > > >
> > > >
> > > > Requiring metrics to be explicitly enabled on clients severely
> cripples
> > > its
> > > > usability and value.
> > > >
> > > > One of the problems that this KIP aims to solve is for useful metrics
> > > > to be available on demand regardless of the technical expertise of
> > > > the user. As Ryanne points out, a savvy user/organization will
> > > > typically have metrics collection and monitoring in place already,
> > > > and the benefits of this KIP are then more of a common set and format
> > > > of metrics across client implementations and languages.
> > > > But that is not the typical Kafka user in my experience; they're not
> > > > Kafka experts and they don't have the knowledge of how to best
> > > > instrument their clients.
> > > > Having metrics enabled by default for this user base allows the Kafka
> > > > operators to proactively and reactively
> > > > monitor and troubleshoot client issues, without the need for the less
> > > savvy
> > > > user to do anything.
> > > > It is often too late to tell a user to enable metrics when the
> problem
> > > has
> > > > already occurred.
> > > >
> > > > Now, to be clear, even though metrics are enabled by default on
> > > > clients, they are not enabled by default on the brokers; the Kafka
> > > > operator needs to build and set up a metrics plugin and add metrics
> > > > subscriptions before anything is sent from the client.
> > > > It is opt-out on the clients and opt-in on the broker.
> > > >
> > > >
> > > >
> > > >
> > > > > You mentioned brokers already have
> > > > > some(most?) of the information contained in metrics, if so then why
> > > > > are we collecting it again? Surely there must be some new
> information
> > > > > in the client metrics.
> > > > >
> > > >
> > > > From the user's perspective the Kafka infrastructure extends from
> > > > producer.send() to
> > > > messages being returned from consumer.poll(), a giant black box where
> > > > there's a lot going on between those
> > > > two points. The brokers currently only see what happens once those
> > > > requests and messages hit the broker,
> > > > but as Kafka clients are complex pieces of machinery there's a myriad
> > of
> > > > queues, timers, and state
> > > > that's critical to the operation and infrastructure that's not
> > currently
> > > > visible to the operator.
> > > > Relying on the user to accurately and timely provide this missing
> > > > information is not generally feasible.
> > > >
> > > >
> > > > Most of the standard metrics listed in the KIP are data points that
> the
> > > > broker does not have.
> > > > Only a small number of metrics are duplicates (like the request
> counts
> > > and
> > > > sizes), but they are included
> > > > to ease correlation when inspecting these client metrics.
> > > >
> > > >
> > > >
> > > > > Moreover this is a brand new feature so it's even harder to justify
> > > > > enabling it and forcing onto all our users. If disabled by default,
> > > > > it's relatively easy to enable in a new release if we decide to,
> but
> > > > > once enabled by default it's much harder to disable. Also this
> > feature
> > > > > will apply to all future metrics we will add.
> > > > >
> > > >
> > > > I think maturity of a feature implementation should be the deciding
> > > factor,
> > > > rather than
> > > > the design of it (which this KIP is). I.e., if the implementation is
> > not
> > > > deemed mature enough
> > > > for release X.Y it will be disabled.
> > > >
> > > >
> > > >
> > > > > Overall I think it's an interesting feature but I'd prefer to be
> > > > > slightly defensive and see how it works in practice before enabling
> > it
> > > > > everywhere.
> > > > >
> > > >
> > > > Right, and I agree on being defensive, but since this feature still
> > > > requires manual
> > > > enabling on the brokers before actually being used, I think that
> gives
> > > > enough control
> > > > to opt-in or out of this feature as needed.
> > > >
> > > > Thanks for your comments!
> > > >
> > > > Regards,
> > > > Magnus
> > > >
> > > >
> > > >
> > > > > Thanks,
> > > > > Mickael
> > > > >
> > > > > On Mon, Nov 8, 2021 at 7:55 AM Magnus Edenhill <magnus@edenhill.se
> >
> > > > wrote:
> > > > > >
> > > > > > Thanks David for pointing this out,
> > > > > > I've updated the KIP to include client_id as a matching selector.
> > > > > >
> > > > > > Regards,
> > > > > > Magnus
> > > > > >
> > > > > > On Thu, Nov 4, 2021 at 18:01, David Mao <dmao@confluent.io.invalid> wrote:
> > > > > >
> > > > > > > Hey Magnus,
> > > > > > >
> > > > > > > I noticed that the KIP outlines the initial selectors supported
> > as:
> > > > > > >
> > > > > > >    - client_instance_id - CLIENT_INSTANCE_ID UUID string
> > > > > representation.
> > > > > > >    - client_software_name  - client software implementation
> name.
> > > > > > >    - client_software_version  - client software implementation
> > > > version.
> > > > > > >
> > > > > > > In the given reactive monitoring workflow, we mention that the
> > > > > application
> > > > > > > user does not know their client's client instance ID, but it's
> > > > outlined
> > > > > > > that the operator can add a metrics subscription selecting for
> > > > > clientId. I
> > > > > > > don't see clientId as one of the supported selectors.
> > > > > > > I can see how this would have made sense in a previous
> iteration
> > > > given
> > > > > that
> > > > > > > the previous client instance ID proposal was to construct the
> > > client
> > > > > > > instance ID using clientId as a prefix. Now that the client
> > > instance
> > > > > ID is
> > > > > > > a UUID, would we want to add clientId as a supported selector?
> > > > > > > Let me know what you think.
> > > > > > >
> > > > > > > David
> > > > > > >
> > > > > > > On Tue, Oct 19, 2021 at 12:33 PM Magnus Edenhill <
> > > magnus@edenhill.se
> > > > >
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Hi Mickael!
> > > > > > > >
> > > > > > > > On Tue, Oct 19, 2021 at 19:30, Mickael Maison <mickael.maison@gmail.com> wrote:
> > > > > > > >
> > > > > > > > > Hi Magnus,
> > > > > > > > >
> > > > > > > > > Thanks for the proposal.
> > > > > > > > >
> > > > > > > > > 1. Looking at the protocol section, isn't
> "ClientInstanceId"
> > > > > expected
> > > > > > > > > to be a field in GetTelemetrySubscriptionsResponseV0?
> > > Otherwise,
> > > > > how
> > > > > > > > > does a client retrieve this value?
> > > > > > > > >
> > > > > > > >
> > > > > > > > Good catch, it got removed by mistake in one of the edits.
> > > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > > 2. In the client API section, you mention a new method
> > > > > > > > > "clientInstanceId()". Can you clarify which interfaces are
> > > > > affected?
> > > > > > > > > Is it only Consumer and Producer?
> > > > > > > > >
> > > > > > > >
> > > > > > > > And Admin. Will update the KIP.
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > > 3. I'm a bit concerned this is enabled by default. Even if
> > the
> > > > data
> > > > > > > > > collected is supposed to be not sensitive, I think this can
> > be
> > > > > > > > > problematic in some environments. Also users don't seem to
> > have
> > > > the
> > > > > > > > > choice to only expose some metrics. Knowing how much data
> > > > > > > > > transits through some applications can be considered critical.
> > > > > > > > >
> > > > > > > >
> > > > > > > > The broker already knows how much data transits through the
> > > client
> > > > > > > though,
> > > > > > > > right?
> > > > > > > > Care has been taken not to expose information in the standard
> > > > metrics
> > > > > > > that
> > > > > > > > might
> > > > > > > > reveal sensitive information.
> > > > > > > >
> > > > > > > > Do you have an example of how the proposed metrics could leak
> > > > > sensitive
> > > > > > > > information?
> > > > > > > > As for limiting which metrics to export; I guess that could
> > > > > > > > make sense in some very sensitive use-cases, but those users
> > > > > > > > might disable metrics altogether for now.
> > > > > > > > Could these concerns be addressed by a later KIP?
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > > 4. As a user, how do you know if your application is
> actively
> > > > > sending
> > > > > > > > > metrics? Are there new metrics exposing what's going on,
> like
> > > how
> > > > > much
> > > > > > > > > data is being sent?
> > > > > > > > >
> > > > > > > >
> > > > > > > > That's a good question.
> > > > > > > > Since the proposed metrics interface is not aimed at, or
> > directly
> > > > > > > available
> > > > > > > > to, the application
> > > > > > > > I guess there's little point of adding it here, but instead
> > > adding
> > > > > > > > something to the
> > > > > > > > existing JMX metrics?
> > > > > > > > Do you have any suggestions?
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > > 5. If all metrics are enabled on a regular Consumer or
> > > Producer,
> > > > do
> > > > > > > > > you have an idea how much throughput this would use?
> > > > > > > > >
> > > > > > > >
> > > > > > > > It depends on the number of partition/topics/etc the client
> is
> > > > > producing
> > > > > > > > to/consuming from.
> > > > > > > > I'll add some sizes to the KIP for some typical use-cases.
> > > > > > > >
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Magnus
> > > > > > > >
> > > > > > > >
> > > > > > > > > Thanks
> > > > > > > > >
> > > > > > > > > On Tue, Oct 19, 2021 at 5:06 PM Magnus Edenhill <
> > > > > magnus@edenhill.se>
> > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > On Tue, Oct 19, 2021 at 13:22, Tom Bentley <tbentley@redhat.com> wrote:
> > > > > > > > > >
> > > > > > > > > > > Hi Magnus,
> > > > > > > > > > >
> > > > > > > > > > > I reviewed the KIP since you called the vote (sorry for
> > not
> > > > > > > reviewing
> > > > > > > > > when
> > > > > > > > > > > you announced your intention to call the vote). I have
> a
> > > few
> > > > > > > > questions
> > > > > > > > > on
> > > > > > > > > > > some of the details.
> > > > > > > > > > >
> > > > > > > > > > > 1. There's no Javadoc on ClientTelemetryPayload.data(),
> > so
> > > I
> > > > > don't
> > > > > > > > know
> > > > > > > > > > > whether the payload is exposed through this method as
> > > > > compressed or
> > > > > > > > > not.
> > > > > > > > > > > Later on you say "Decompression of the payloads will be
> > > > > handled by
> > > > > > > > the
> > > > > > > > > > > broker metrics plugin, the broker should expose a
> > suitable
> > > > > > > > > decompression
> > > > > > > > > > > API to the metrics plugin for this purpose.", which
> > > suggests
> > > > > it's
> > > > > > > the
> > > > > > > > > > > compressed data in the buffer, but then we don't know
> > which
> > > > > codec
> > > > > > > was
> > > > > > > > > used,
> > > > > > > > > > > nor the API via which the plugin should decompress it
> if
> > > > > required
> > > > > > > for
> > > > > > > > > > > forwarding to the ultimate metrics store. Should the
> > > > > > > > > ClientTelemetryPayload
> > > > > > > > > > > expose a method to get the compression and a
> > decompressor?
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Good point, updated.
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > 2. The client-side API is expressed as StringOrError
> > > > > > > > > > > ClientInstance::ClientInstanceId(int timeout_ms). I
> > > > understand
> > > > > that
> > > > > > > > > you're
> > > > > > > > > > > thinking about the librdkafka implementation, but it
> > would
> > > be
> > > > > good
> > > > > > > to
> > > > > > > > > show
> > > > > > > > > > > the API as it would appear on the Apache Kafka clients.
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > This was meant as pseudo-code, but I changed it to Java.
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > 3. "PushTelemetryRequest|Response - protocol request
> used
> > > by
> > > > > the
> > > > > > > > > client to
> > > > > > > > > > > send metrics to any broker it is connected to." To be
> > > clear,
> > > > > this
> > > > > > > > means
> > > > > > > > > > > that the client can choose any of the connected brokers
> > and
> > > > > push to
> > > > > > > > > just
> > > > > > > > > > > one of them? What should a supporting client do if it
> > gets
> > > an
> > > > > error
> > > > > > > > > when
> > > > > > > > > > > pushing metrics to a broker, retry sending to the same
> > > broker
> > > > > or
> > > > > > > try
> > > > > > > > > > > pushing to another broker, or drop the metrics? Should
> > > > > supporting
> > > > > > > > > clients
> > > > > > > > > > > send successive requests to a single broker, or round
> > > robin,
> > > > > or is
> > > > > > > > > that up
> > > > > > > > > > > to the client author? I'm guessing the behaviour should
> > be
> > > > > sticky
> > > > > > > to
> > > > > > > > > > > support the rate limiting features, but I think it
> would
> > be
> > > > > good
> > > > > > > for
> > > > > > > > > client
> > > > > > > > > > > authors if this section were explicit on the
> recommended
> > > > > behaviour.
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > You are right, I've updated the KIP to make this clearer.
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > 4. "Mapping the client instance id to an actual
> > application
> > > > > > > instance
> > > > > > > > > > > running on a (virtual) machine can be done by
> inspecting
> > > the
> > > > > > > metrics
> > > > > > > > > > > resource labels, such as the client source address and
> > > source
> > > > > port,
> > > > > > > > or
> > > > > > > > > > > security principal, all of which are added by the
> > receiving
> > > > > broker.
> > > > > > > > > This
> > > > > > > > > > > will allow the operator together with the user to
> > identify
> > > > the
> > > > > > > actual
> > > > > > > > > > > application instance." Is this really always true? The
> > > source
> > > > > IP
> > > > > > > and
> > > > > > > > > port
> > > > > > > > > > > might be a loadbalancer/proxy in some setups. The
> > > principal,
> > > > as
> > > > > > > > already
> > > > > > > > > > > mentioned in the KIP, might be shared between multiple
> > > > > > > applications.
> > > > > > > > > So at
> > > > > > > > > > > worst the organization running the clients might have
> to
> > > > > consult
> > > > > > > the
> > > > > > > > > logs
> > > > > > > > > > > of a set of client applications, right?
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Yes, that's correct. There's no guaranteed mapping from
> > > > > > > > > client_instance_id
> > > > > > > > > > to
> > > > > > > > > > an actual instance, that's why the KIP recommends client
> > > > > > > > implementations
> > > > > > > > > to
> > > > > > > > > > log the client instance id
> > > > > > > > > > upon retrieval, and also provide an API for the
> application
> > > to
> > > > > > > retrieve
> > > > > > > > > the
> > > > > > > > > > instance id programmatically
> > > > > > > > > > if it has a better way of exposing it.
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > 5. "Tests indicate that a compression ratio up to 10x is
> > > > > possible for
> > > > > > > > the
> > > > > > > > > > > standard metrics." Client authors might appreciate your
> > > > > mentioning
> > > > > > > > > which
> > > > > > > > > > > compression codec got these results.
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Good point. Updated.
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > 6. "Should the client send a push request prior to
> expiry
> > > of
> > > > > the
> > > > > > > > > previously
> > > > > > > > > > > calculated PushIntervalMs the broker will discard the
> > > metrics
> > > > > and
> > > > > > > > > return a
> > > > > > > > > > > PushTelemetryResponse with the ErrorCode set to
> > > RateLimited."
> > > > > Is
> > > > > > > this
> > > > > > > > > > > RATE_LIMITED a new error code? It's not mentioned in
> the
> > > "New
> > > > > Error
> > > > > > > > > Codes"
> > > > > > > > > > > section.
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > That's a leftover, it should be using the standard
> > > ThrottleTime
> > > > > > > > > mechanism.
> > > > > > > > > > Fixed.
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > 7. In the section "Standard client resource labels"
> > > > > application_id
> > > > > > > is
> > > > > > > > > > > described as Kafka Streams only, but the section of
> > "Client
> > > > > > > > > Identification"
> > > > > > > > > > > talks about "application instance id as an optional
> > future
> > > > > > > > nice-to-have
> > > > > > > > > > > that may be included as a metrics label if it has been
> > set
> > > by
> > > > > the
> > > > > > > > > user", so
> > > > > > > > > > > I'm confused whether non-Kafka Streams clients should
> set
> > > an
> > > > > > > > > application_id
> > > > > > > > > > > or not.
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > I'll clarify this in the KIP, but basically we would need
> > to
> > > > add
> > > > > an `
> > > > > > > > > > application.id` config
> > > > > > > > > > property for non-streams clients for this purpose, and
> > that's
> > > > > outside
> > > > > > > > the
> > > > > > > > > > scope of this KIP since we want to make it zero-conf:ish
> on
> > > the
> > > > > > > client
> > > > > > > > > side.
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > Kind regards,
> > > > > > > > > > >
> > > > > > > > > > > Tom
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Thanks for the review,
> > > > > > > > > > Magnus
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > On Thu, Oct 7, 2021 at 5:26 PM Magnus Edenhill <
> > > > > magnus@edenhill.se
> > > > > > > >
> > > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Hi all,
> > > > > > > > > > > >
> > > > > > > > > > > > I've updated the KIP following our recent discussions
> > on
> > > > the
> > > > > > > > mailing
> > > > > > > > > > > list:
> > > > > > > > > > > >  - split the protocol in two, one for getting the
> > metrics
> > > > > > > > > subscriptions,
> > > > > > > > > > > > and one for pushing the metrics.
> > > > > > > > > > > >  - simplifications: initially only one supported
> > metrics
> > > > > format,
> > > > > > > no
> > > > > > > > > > > > client.id in the instance id, etc.
> > > > > > > > > > > >  - made CLIENT_METRICS subscription configuration
> > entries
> > > > > more
> > > > > > > > > structured
> > > > > > > > > > > >    and allowing better client matching selectors (not
> > > only
> > > > > on the
> > > > > > > > > > > instance
> > > > > > > > > > > > id, but also the other
> > > > > > > > > > > >    client resource labels, such as
> > client_software_name,
> > > > > etc.).
> > > > > > > > > > > >
> > > > > > > > > > > > Unless there are further comments I'll call the vote
> > in a
> > > > > day or
> > > > > > > > two.
> > > > > > > > > > > >
> > > > > > > > > > > > Regards,
> > > > > > > > > > > > Magnus
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > On Mon, Oct 4, 2021 at 20:57, Magnus Edenhill <magnus@edenhill.se> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > Hi Gwen,
> > > > > > > > > > > > >
> > > > > > > > > > > > > I'm finishing up the KIP based on the last couple
> of
> > > > > discussion
> > > > > > > > > points
> > > > > > > > > > > in
> > > > > > > > > > > > > this thread
> > > > > > > > > > > > > and will call the Vote later this week.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Best,
> > > > > > > > > > > > > Magnus
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Sat, Oct 2, 2021 at 02:01, Gwen Shapira <gwen@confluent.io.invalid> wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > >> Hey,
> > > > > > > > > > > > >>
> > > > > > > > > > > > >> I noticed that there was no discussion for the
> last
> > 10
> > > > > days,
> > > > > > > > but I
> > > > > > > > > > > > >> couldn't
> > > > > > > > > > > > >> find the vote thread. Is there one that I'm
> missing?
> > > > > > > > > > > > >>
> > > > > > > > > > > > >> Gwen
> > > > > > > > > > > > >>
> > > > > > > > > > > > >> On Wed, Sep 22, 2021 at 4:58 AM Magnus Edenhill <
> > > > > > > > > magnus@edenhill.se>
> > > > > > > > > > > > >> wrote:
> > > > > > > > > > > > >>
> > > > > > > > > > > > >> > On Tue, Sep 21, 2021 at 06:58, Colin McCabe <cmccabe@apache.org> wrote:
> > > > > > > > > > > > >> >
> > > > > > > > > > > > >> > > On Mon, Sep 20, 2021, at 17:35, Feng Min
> wrote:
> > > > > > > > > > > > >> > > > Thanks Magnus & Colin for the discussion.
> > > > > > > > > > > > >> > > >
> > > > > > > > > > > > >> > > > Based on KIP-714's stateless design, Client
> > can
> > > > > pretty
> > > > > > > > much
> > > > > > > > > use
> > > > > > > > > > > > any
> > > > > > > > > > > > >> > > > connection to any broker to send metrics. We
> > are
> > > > not
> > > > > > > > > associating
> > > > > > > > > > > > >> > > connection
> > > > > > > > > > > > >> > > > with client metric state. Is my
> understanding
> > > > > correct?
> > > > > > > If
> > > > > > > > > yes,
> > > > > > > > > > > > how
> > > > > > > > > > > > >> > about
> > > > > > > > > > > > >> > > > the following two scenarios
> > > > > > > > > > > > >> > > >
> > > > > > > > > > > > >> > > > 1) One Client (Client-ID) registers two
> > > different
> > > > > client
> > > > > > > > > > > instance
> > > > > > > > > > > > id
> > > > > > > > > > > > >> > via
> > > > > > > > > > > > >> > > > separate registration. Is it permitted? If
> OK,
> > > how
> > > > > to
> > > > > > > > > > > distinguish
> > > > > > > > > > > > >> them
> > > > > > > > > > > > >> > > from
> > > > > > > > > > > > >> > > > the case 2 below.
> > > > > > > > > > > > >> > > >
> > > > > > > > > > > > >> > >
> > > > > > > > > > > > >> > > Hi Feng,
> > > > > > > > > > > > >> > >
> > > > > > > > > > > > >> > > My understanding, which Magnus can clarify I
> > > guess,
> > > > is
> > > > > > > that
> > > > > > > > > you
> > > > > > > > > > > > could
> > > > > > > > > > > > >> > have
> > > > > > > > > > > > >> > > something like two Producer instances running
> > with
> > > > the
> > > > > > > same
> > > > > > > > > > > > client.id
> > > > > > > > > > > > >> > > (perhaps because they're using the same config
> > > file,
> > > > > for
> > > > > > > > > example).
> > > > > > > > > > > > >> They
> > > > > > > > > > > > >> > > could even be in the same process. But they
> > would
> > > > get
> > > > > > > > separate
> > > > > > > > > > > > UUIDs.
> > > > > > > > > > > > >> > >
> > > > > > > > > > > > >> > > I believe Magnus used the term client to mean
> > > > > "Producer or
> > > > > > > > > > > > Consumer".
> > > > > > > > > > > > >> So
> > > > > > > > > > > > >> > > if you have both a Producer and a Consumer in
> > your
> > > > > > > > > application I
> > > > > > > > > > > > would
> > > > > > > > > > > > >> > > expect you'd get separate UUIDs for both.
> Again
> > > > > Magnus can
> > > > > > > > > chime
> > > > > > > > > > > in
> > > > > > > > > > > > >> > here, I
> > > > > > > > > > > > >> > > guess.
> > > > > > > > > > > > >> > >
> > > > > > > > > > > > >> >
> > > > > > > > > > > > >> > That's correct.
> > > > > > > > > > > > >> >
> > > > > > > > > > > > >> >
> > > > > > > > > > > > >> > >
> > > > > > > > > > > > >> > > > 2) How about the client restarting? What's
> the
> > > > > > > > expectation?
> > > > > > > > > > > Should
> > > > > > > > > > > > >> the
> > > > > > > > > > > > >> > > > server expect the client to carry a
> persisted
> > > > client
> > > > > > > > > instance id
> > > > > > > > > > > > or
> > > > > > > > > > > > >> > > should
> > > > > > > > > > > > >> > > > the client be treated as a new instance?
> > > > > > > > > > > > >> > >
> > > > > > > > > > > > >> > > The KIP doesn't describe any mechanism for
> > > > > persistence,
> > > > > > > so I
> > > > > > > > > would
> > > > > > > > > > > > >> assume
> > > > > > > > > > > > >> > > that when you restart the client you get a new
> > > > UUID. I
> > > > > > > agree
> > > > > > > > > that
> > > > > > > > > > > it
> > > > > > > > > > > > >> > would
> > > > > > > > > > > > >> > > be good to spell this out.
> > > > > > > > > > > > >> > >
> > > > > > > > > > > > >> > >
> > > > > > > > > > > > >> > Right, it will not be persisted since a client
> > > > instance
> > > > > > > can't
> > > > > > > > be
> > > > > > > > > > > > >> restarted.
> > > > > > > > > > > > >> >
> > > > > > > > > > > > >> > Will update the KIP to make this clearer.
> > > > > > > > > > > > >> >
> > > > > > > > > > > > >> > /Magnus
> > > > > > > > > > > > >> >
> > > > > > > > > > > > >>
> > > > > > > > > > > > >>
> > > > > > > > > > > > >> --
> > > > > > > > > > > > >> Gwen Shapira
> > > > > > > > > > > > >> Engineering Manager | Confluent
> > > > > > > > > > > > >> 650.450.2760 | @gwenshap
> > > > > > > > > > > > >> Follow us: Twitter | blog
> > > > > > > > > > > > >>
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] KIP-714: Client metrics and observability

Posted by Xavier Léauté <xa...@confluent.io.INVALID>.
 24.2 Does delta only apply to Counter type?

> 24.3 In the delta representation, the first request needs to send the full
> value, how does the broker plugin know whether a value is full or delta?
>

The temporality semantics are defined by the OpenTelemetry data model.
Deferring to OpenTelemetry avoids having to redefine all those semantics in
the Kafka protocol.
https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/metrics/datamodel.md
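
To make the 24.3 question concrete, here is a minimal sketch (not part of the KIP) of how a receiver could act on the aggregation temporality that the OpenTelemetry data model attaches to Sum streams. Because the temporality is declared on the metric itself, the receiver does not have to guess whether a point is a full value or a delta. The class, enum, and method names are illustrative assumptions, not OpenTelemetry or Kafka APIs.

```java
import java.util.ArrayList;
import java.util.List;

class TemporalitySketch {
    // Mirrors the two aggregation temporalities named in the OpenTelemetry data model.
    enum Temporality { CUMULATIVE, DELTA }

    /** Rebuilds running totals from a stream of points, given the declared temporality. */
    static List<Long> toCumulative(List<Long> points, Temporality temporality) {
        List<Long> totals = new ArrayList<>();
        long running = 0;
        for (long point : points) {
            // DELTA points are increments to add; CUMULATIVE points already are the total.
            running = (temporality == Temporality.DELTA) ? running + point : point;
            totals.add(running);
        }
        return totals;
    }

    public static void main(String[] args) {
        // Same counter reported both ways; both lines print [5, 8, 12].
        System.out.println(toCumulative(List.of(5L, 3L, 4L), Temporality.DELTA));
        System.out.println(toCumulative(List.of(5L, 8L, 12L), Temporality.CUMULATIVE));
    }
}
```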

Hopefully that clarifies things,
Xavier

Re: [DISCUSS] KIP-714: Client metrics and observability

Posted by Sarat Kakarla <sk...@confluent.io.INVALID>.
Jun

Here are answers to some of the questions you raised.

>> 26. client-metrics entity:
>> 26.1 It seems that we could add multiple entities that match the same client. Which one takes precedence?

All the matching client metrics would be compiled into a single list and sent to the client.
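
As a rough illustration of that merge (a sketch under assumptions, not the broker implementation): every subscription whose selectors match the client contributes its metrics, and the union is returned in first-seen order with duplicates dropped. The Subscription class, the exact-match selector logic, and the metric prefixes are hypothetical simplifications chosen for the example.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class SubscriptionMergeSketch {
    /** Hypothetical subscription: selectors that must all match, plus metric name prefixes. */
    static final class Subscription {
        final Map<String, String> selectors;
        final List<String> metricPrefixes;

        Subscription(Map<String, String> selectors, List<String> metricPrefixes) {
            this.selectors = selectors;
            this.metricPrefixes = metricPrefixes;
        }

        // Simplification: exact string equality per selector; an empty selector map matches all clients.
        boolean matches(Map<String, String> clientLabels) {
            return selectors.entrySet().stream()
                    .allMatch(e -> e.getValue().equals(clientLabels.get(e.getKey())));
        }
    }

    /** Unions the metric prefixes of every matching subscription, dropping duplicates. */
    static List<String> metricsFor(Map<String, String> clientLabels, List<Subscription> all) {
        Set<String> merged = new LinkedHashSet<>();
        for (Subscription s : all) {
            if (s.matches(clientLabels)) {
                merged.addAll(s.metricPrefixes);
            }
        }
        return new ArrayList<>(merged);
    }

    public static void main(String[] args) {
        List<Subscription> subs = List.of(
                new Subscription(Map.of("client_id", "billing"),
                        List.of("client.producer.", "client.request.")),
                new Subscription(Map.of("client_software_name", "librdkafka"),
                        List.of("client.consumer.")),
                new Subscription(Map.of(),
                        List.of("client.request.", "client.connection.")));
        Map<String, String> client = Map.of(
                "client_id", "billing",
                "client_software_name", "apache-kafka-java");
        // The first and third subscriptions match; the second does not.
        System.out.println(metricsFor(client, subs));
    }
}
```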

>> 26.2 How do we persist the new client metrics entities? Do we need to add new ZK paths and new records in KRaft?

The idea is to add a new ConfigResourceType:CLIENT_METRICS and follow the same code paths as the other config resources described in ConfigResource.Type, which means new ZK paths and new KRaft records would be added.

Thanks
Sarat




    On Wed, Jan 26, 2022 at 6:32 AM Magnus Edenhill <ma...@edenhill.se> wrote:

    > Hi all,
    >
    > I've updated the KIP with responses to the latest comments: Java client
    > dependencies (Thanks Kirk!), alternate designs (separate cluster, separate
    > producer, etc), etc.
    >
    > I will revive the vote thread.
    >
    > Thanks,
    > Magnus
    >
    >
    > Den mån 13 dec. 2021 kl 22:32 skrev Ryanne Dolan <ry...@gmail.com>:
    >
    > > I think we should be very careful about introducing new runtime
    > > dependencies into the clients. Historically this has been rare and
    > > essentially necessary (e.g. compression libs).
    > >
    > > Ryanne
    > >
    > > On Mon, Dec 13, 2021, 1:06 PM Kirk True <ki...@mustardgrain.com> wrote:
    > >
    > > > Hi Jun,
    > > >
    > > > On Thu, Dec 9, 2021, at 2:28 PM, Jun Rao wrote:
    > > > > 13. Using OpenTelemetry. Does that require runtime dependency
    > > > > on OpenTelemetry library? How good is the compatibility story
    > > > > of OpenTelemetry? This is important since an application could have
    > > other
    > > > > OpenTelemetry dependencies than the Kafka client.
    > > >
    > > > The current design is that the OpenTelemetry JARs would ship with the
    > > > client. Perhaps we can design the client such that the JARs aren't even
    > > > loaded if the user has opted out. The user could even exclude the JARs
    > > from
    > > > their dependencies if they so wished.
    > > >
    > > > I can't speak to the compatibility of the libraries. Is it possible
    > that
    > > > we include a shaded version?
    > > >
    > > > Thanks,
    > > > Kirk
    > > >
    > > > >
    > > > > 14. The proposal listed idempotence=true. This is more of a
    > > configuration
    > > > > than a metric. Are we including that as a metric? What other
    > > > configurations
    > > > > are we including? Should we separate the configurations from the
    > > metrics?
    > > > >
    > > > > Thanks,
    > > > >
    > > > > Jun
    > > > >
    > > > > On Mon, Nov 29, 2021 at 7:34 AM Magnus Edenhill <ma...@edenhill.se>
    > > > wrote:
    > > > >
    > > > > > Hey Bob,
    > > > > >
    > > > > > That's a good point.
    > > > > >
    > > > > > Request type labels were considered but since they're already
    > tracked
    > > > by
    > > > > > broker-side metrics
    > > > > > they were left out as to avoid metric duplication, however those
    > > > metrics
    > > > > > are not per connection,
    > > > > > so they won't be that useful in practice for troubleshooting
    > specific
    > > > > > client instances.
    > > > > >
    > > > > > I'll add the request_type label to the relevant metrics.
    > > > > >
    > > > > > Thanks,
    > > > > > Magnus
    > > > > >
    > > > > >
    > > > > > Den tis 23 nov. 2021 kl 19:20 skrev Bob Barrett
    > > > > > <bo...@confluent.io.invalid>:
    > > > > >
    > > > > > > Hi Magnus,
    > > > > > >
    > > > > > > Thanks for the thorough KIP, this seems very useful.
    > > > > > >
    > > > > > > Would it make sense to include the request type as a label for
    > the
    > > > > > > `client.request.success`, `client.request.errors` and
    > > > > > `client.request.rtt`
    > > > > > > metrics? I think it would be very useful to see which specific
    > > > requests
    > > > > > are
    > > > > > > succeeding and failing for a client. One specific case I can
    > think
    > > of
    > > > > > where
    > > > > > > this could be useful is producer batch timeouts. If a Java
    > > > application
    > > > > > does
    > > > > > > not enable producer client logs (unfortunately, in my experience
    > > this
    > > > > > > happens more often than it should), the application logs will
    > only
    > > > > > contain
    > > > > > > the expiration error message, but no information about what is
    > > > causing
    > > > > > the
    > > > > > > timeout. The requests might all be succeeding but taking too long
    > > to
    > > > > > > process batches, or metadata requests might be failing, or some
    > or
    > > > all
    > > > > > > produce requests might be failing (if the bootstrap servers are
    > > > reachable
    > > > > > > from the client but one or more other brokers are not, for
    > > example).
    > > > If
    > > > > > the
    > > > > > > cluster operator is able to identify the specific requests that
    > are
    > > > slow
    > > > > > or
    > > > > > > failing for a client, they will be better able to diagnose the
    > > issue
    > > > > > > causing batch timeouts.
    > > > > > >
    > > > > > > One drawback I can think of is that this will increase the
    > > > cardinality of
    > > > > > > the request metrics. But any given client is only going to use a
    > > > small
    > > > > > > subset of the request types, and since we already have partition
    > > > labels
    > > > > > for
    > > > > > > the topic-level metrics, I think request labels will still make
    > up
    > > a
    > > > > > > relatively small percentage of the set of metrics.
    > > > > > >
    > > > > > > Thanks,
    > > > > > > Bob
    > > > > > >
    > > > > > > On Mon, Nov 22, 2021 at 2:08 AM Viktor Somogyi-Vass <
    > > > > > > viktorsomogyi@gmail.com>
    > > > > > > wrote:
    > > > > > >
    > > > > > > > Hi Magnus,
    > > > > > > >
    > > > > > > > I think this is a very useful addition. We also have a similar
    > > (but
    > > > > > much
    > > > > > > > more simplistic) implementation of this. Maybe I missed it in
    > the
    > > > KIP
    > > > > > but
    > > > > > > > what about adding metrics about the subscription cache itself?
    > > > That I
    > > > > > > think
    > > > > > > > would improve its usability and debuggability as we'd be able
    > to
    > > > see
    > > > > > its
    > > > > > > > performance, hit/miss rates, eviction counts and others.
    > > > > > > >
    > > > > > > > Best,
    > > > > > > > Viktor
    > > > > > > >
    > > > > > > > On Thu, Nov 18, 2021 at 5:12 PM Magnus Edenhill <
    > > > magnus@edenhill.se>
    > > > > > > > wrote:
    > > > > > > >
    > > > > > > > > Hi Mickael,
    > > > > > > > >
    > > > > > > > > see inline.
    > > > > > > > >
    > > > > > > > > Den ons 10 nov. 2021 kl 15:21 skrev Mickael Maison <
    > > > > > > > > mickael.maison@gmail.com
    > > > > > > > > >:
    > > > > > > > >
    > > > > > > > > > Hi Magnus,
    > > > > > > > > >
    > > > > > > > > > I see you've addressed some of the points I raised above
    > but
    > > > some
    > > > > > (4,
    > > > > > > > > > 5) have not been addressed yet.
    > > > > > > > > >
    > > > > > > > >
    > > > > > > > > Re 4) How will the user/app know metrics are being sent.
    > > > > > > > >
    > > > > > > > > One possibility is to add a JMX metric (thus for user
    > > > consumption)
    > > > > > for
    > > > > > > > the
    > > > > > > > > number of metric pushes the
    > > > > > > > > client has performed, or perhaps the number of metrics
    > > > subscriptions
    > > > > > > > > currently being collected.
    > > > > > > > > Would that be sufficient?
    > > > > > > > >
    > > > > > > > > Re 5) Metric sizes and rates
    > > > > > > > >
    > > > > > > > > A worst case scenario for a producer that is producing to 50
    > > > unique
    > > > > > > > topics
    > > > > > > > > and emitting all standard metrics yields
    > > > > > > > > a serialized size of around 100KB prior to compression, which
    > > > > > > compresses
    > > > > > > > > down to about 20-30% of that depending
    > > > > > > > > on compression type and topic name uniqueness.
    > > > > > > > > The numbers for a consumer would be similar.
    > > > > > > > >
    > > > > > > > > In practice the number of unique topics would be far less,
    > and
    > > > the
    > > > > > > > > subscription set would typically be for a subset of metrics.
    > > > > > > > > So we're probably closer to 1kb, or less, compressed size per
    > > > client
    > > > > > > per
    > > > > > > > > push interval.
    > > > > > > > >
    > > > > > > > > As both the subscription set and push intervals are
    > controlled
    > > > by the
    > > > > > > > > cluster operator it shouldn't be too hard
    > > > > > > > > to strike a good balance between metrics overhead and
    > > > granularity.
    > > > > > > > >
    > > > > > > > >
    > > > > > > > >
    > > > > > > > > >
    > > > > > > > > > I'm really uneasy with this being enabled by default on the
    > > > client
    > > > > > > > > > side. When collecting data, I think the best practice is to
    > > > ensure
    > > > > > > > > > users are explicitly enabling it.
    > > > > > > > > >
    > > > > > > > >
    > > > > > > > > Requiring metrics to be explicitly enabled on clients
    > severely
    > > > > > cripples
    > > > > > > > its
    > > > > > > > > usability and value.
    > > > > > > > >
    > > > > > > > > One of the problems that this KIP aims to solve is for useful
    > > > metrics
    > > > > > > to
    > > > > > > > be
    > > > > > > > > available on demand
    > > > > > > > > regardless of the technical expertise of the user. As Ryanne
    > > > > > > > > points out, a savvy user/organization
    > > > > > > > > will typically have metrics collection and monitoring in
    > place
    > > > > > already,
    > > > > > > > and
    > > > > > > > > the benefits of this KIP
    > > > > > > > > are then more of a common set and format of metrics across client
    > > > > > > > > implementations and languages.
    > > > > > > > > But that is not the typical Kafka user in my experience,
    > > they're
    > > > not
    > > > > > > > Kafka
    > > > > > > > > experts and they don't have the
    > > > > > > > > knowledge of how to best instrument their clients.
    > > > > > > > > Having metrics enabled by default for this user base allows
    > the
    > > > Kafka
    > > > > > > > > operators to proactively and reactively
    > > > > > > > > monitor and troubleshoot client issues, without the need for
    > > the
    > > > less
    > > > > > > > savvy
    > > > > > > > > user to do anything.
    > > > > > > > > It is often too late to tell a user to enable metrics when
    > the
    > > > > > problem
    > > > > > > > has
    > > > > > > > > already occurred.
    > > > > > > > >
    > > > > > > > > Now, to be clear, even though metrics are enabled by default on
    > > > > > > > > clients, they are not enabled by default
    > > > > > > > > on the brokers; the Kafka operator needs to build and set up
    > a
    > > > > > metrics
    > > > > > > > > plugin and add metrics subscriptions
    > > > > > > > > before anything is sent from the client.
    > > > > > > > > It is opt-out on the clients and opt-in on the broker.
    > > > > > > > >
    > > > > > > > >
    > > > > > > > >
    > > > > > > > >
    > > > > > > > > > You mentioned brokers already have
    > > > > > > > > > some(most?) of the information contained in metrics, if so
    > > > then why
    > > > > > > > > > are we collecting it again? Surely there must be some new
    > > > > > information
    > > > > > > > > > in the client metrics.
    > > > > > > > > >
    > > > > > > > >
    > > > > > > > > From the user's perspective the Kafka infrastructure extends
    > > from
    > > > > > > > > producer.send() to
    > > > > > > > > messages being returned from consumer.poll(), a giant black
    > box
    > > > where
    > > > > > > > > there's a lot going on between those
    > > > > > > > > two points. The brokers currently only see what happens once those
    > > > > > > > > requests and messages hit the broker,
    > > > > > > > > but as Kafka clients are complex pieces of machinery there's
    > a
    > > > myriad
    > > > > > > of
    > > > > > > > > queues, timers, and state
    > > > > > > > > that's critical to the operation and infrastructure that's
    > not
    > > > > > > currently
    > > > > > > > > visible to the operator.
    > > > > > > > > Relying on the user to accurately and timely provide this
    > > missing
    > > > > > > > > information is not generally feasible.
    > > > > > > > >
    > > > > > > > >
    > > > > > > > > Most of the standard metrics listed in the KIP are data
    > points
    > > > that
    > > > > > the
    > > > > > > > > broker does not have.
    > > > > > > > > Only a small number of metrics are duplicates (like the
    > request
    > > > > > counts
    > > > > > > > and
    > > > > > > > > sizes), but they are included
    > > > > > > > > to ease correlation when inspecting these client metrics.
    > > > > > > > >
    > > > > > > > >
    > > > > > > > >
    > > > > > > > > > Moreover this is a brand new feature so it's even harder to
    > > > justify
    > > > > > > > > > enabling it and forcing onto all our users. If disabled by
    > > > default,
    > > > > > > > > > it's relatively easy to enable in a new release if we
    > decide
    > > > to,
    > > > > > but
    > > > > > > > > > once enabled by default it's much harder to disable. Also
    > > this
    > > > > > > feature
    > > > > > > > > > will apply to all future metrics we will add.
    > > > > > > > > >
    > > > > > > > >
    > > > > > > > > I think maturity of a feature implementation should be the
    > > > deciding
    > > > > > > > factor,
    > > > > > > > > rather than
    > > > > > > > > the design of it (which this KIP is). I.e., if the
    > > > implementation is
    > > > > > > not
    > > > > > > > > deemed mature enough
    > > > > > > > > for release X.Y it will be disabled.
    > > > > > > > >
    > > > > > > > >
    > > > > > > > >
    > > > > > > > > > Overall I think it's an interesting feature but I'd prefer
    > to
    > > > be
    > > > > > > > > > slightly defensive and see how it works in practice before
    > > > enabling
    > > > > > > it
    > > > > > > > > > everywhere.
    > > > > > > > > >
    > > > > > > > >
    > > > > > > > > Right, and I agree on being defensive, but since this feature
    > > > still
    > > > > > > > > requires manual
    > > > > > > > > enabling on the brokers before actually being used, I think
    > > that
    > > > > > gives
    > > > > > > > > enough control
    > > > > > > > > to opt-in or out of this feature as needed.
    > > > > > > > >
    > > > > > > > > Thanks for your comments!
    > > > > > > > >
    > > > > > > > > Regards,
    > > > > > > > > Magnus
    > > > > > > > >
    > > > > > > > >
    > > > > > > > >
    > > > > > > > > > Thanks,
    > > > > > > > > > Mickael
    > > > > > > > > >
    > > > > > > > > > On Mon, Nov 8, 2021 at 7:55 AM Magnus Edenhill <
    > > > magnus@edenhill.se
    > > > > > >
    > > > > > > > > wrote:
    > > > > > > > > > >
    > > > > > > > > > > Thanks David for pointing this out,
    > > > > > > > > > > I've updated the KIP to include client_id as a matching
    > > > selector.
    > > > > > > > > > >
    > > > > > > > > > > Regards,
    > > > > > > > > > > Magnus
    > > > > > > > > > >
    > > > > > > > > > > Den tors 4 nov. 2021 kl 18:01 skrev David Mao
    > > > > > > > > <dmao@confluent.io.invalid
    > > > > > > > > > >:
    > > > > > > > > > >
    > > > > > > > > > > > Hey Magnus,
    > > > > > > > > > > >
    > > > > > > > > > > > I noticed that the KIP outlines the initial selectors
    > > > supported
    > > > > > > as:
    > > > > > > > > > > >
    > > > > > > > > > > >    - client_instance_id - CLIENT_INSTANCE_ID UUID
    > string
    > > > > > > > > > representation.
    > > > > > > > > > > >    - client_software_name  - client software
    > > implementation
    > > > > > name.
    > > > > > > > > > > >    - client_software_version  - client software
    > > > implementation
    > > > > > > > > version.
    > > > > > > > > > > >
    > > > > > > > > > > > In the given reactive monitoring workflow, we mention
    > > that
    > > > the
    > > > > > > > > > application
    > > > > > > > > > > > user does not know their client's client instance ID,
    > but
    > > > it's
    > > > > > > > > outlined
    > > > > > > > > > > > that the operator can add a metrics subscription
    > > selecting
    > > > for
    > > > > > > > > > clientId. I
    > > > > > > > > > > > don't see clientId as one of the supported selectors.
    > > > > > > > > > > > I can see how this would have made sense in a previous
    > > > > > iteration
    > > > > > > > > given
    > > > > > > > > > that
    > > > > > > > > > > > the previous client instance ID proposal was to
    > construct
    > > > the
    > > > > > > > client
    > > > > > > > > > > > instance ID using clientId as a prefix. Now that the
    > > client
    > > > > > > > instance
    > > > > > > > > > ID is
    > > > > > > > > > > > a UUID, would we want to add clientId as a supported
    > > > selector?
    > > > > > > > > > > > Let me know what you think.
    > > > > > > > > > > >
    > > > > > > > > > > > David
    > > > > > > > > > > >
    > > > > > > > > > > > On Tue, Oct 19, 2021 at 12:33 PM Magnus Edenhill <
    > > > > > > > magnus@edenhill.se
    > > > > > > > > >
    > > > > > > > > > > > wrote:
    > > > > > > > > > > >
    > > > > > > > > > > > > Hi Mickael!
    > > > > > > > > > > > >
    > > > > > > > > > > > > Den tis 19 okt. 2021 kl 19:30 skrev Mickael Maison <
    > > > > > > > > > > > > mickael.maison@gmail.com
    > > > > > > > > > > > > >:
    > > > > > > > > > > > >
    > > > > > > > > > > > > > Hi Magnus,
    > > > > > > > > > > > > >
    > > > > > > > > > > > > > Thanks for the proposal.
    > > > > > > > > > > > > >
    > > > > > > > > > > > > > 1. Looking at the protocol section, isn't
    > > > > > "ClientInstanceId"
    > > > > > > > > > expected
    > > > > > > > > > > > > > to be a field in
    > GetTelemetrySubscriptionsResponseV0?
    > > > > > > > Otherwise,
    > > > > > > > > > how
    > > > > > > > > > > > > > does a client retrieve this value?
    > > > > > > > > > > > > >
    > > > > > > > > > > > >
    > > > > > > > > > > > > Good catch, it got removed by mistake in one of the
    > > > edits.
    > > > > > > > > > > > >
    > > > > > > > > > > > >
    > > > > > > > > > > > > >
    > > > > > > > > > > > > > 2. In the client API section, you mention a new
    > > method
    > > > > > > > > > > > > > "clientInstanceId()". Can you clarify which
    > > interfaces
    > > > are
    > > > > > > > > > affected?
    > > > > > > > > > > > > > Is it only Consumer and Producer?
    > > > > > > > > > > > > >
    > > > > > > > > > > > >
    > > > > > > > > > > > > And Admin. Will update the KIP.
    > > > > > > > > > > > >
    > > > > > > > > > > > >
    > > > > > > > > > > > >
    > > > > > > > > > > > > > 3. I'm a bit concerned this is enabled by default.
    > > > Even if
    > > > > > > the
    > > > > > > > > data
    > > > > > > > > > > > > > collected is supposed to be not sensitive, I think
    > > > this can
    > > > > > > be
    > > > > > > > > > > > > > problematic in some environments. Also users don't
    > > > seem to
    > > > > > > have
    > > > > > > > > the
    > > > > > > > > > > > > > choice to only expose some metrics. Knowing how much data
    > > > > > > > > > > > > > transits through some applications can be considered critical.
    > > > > > > > > > > > > >
    > > > > > > > > > > > >
    > > > > > > > > > > > > The broker already knows how much data transits
    > through
    > > > the
    > > > > > > > client
    > > > > > > > > > > > though,
    > > > > > > > > > > > > right?
    > > > > > > > > > > > > Care has been taken not to expose information in the
    > > > standard
    > > > > > > > > metrics
    > > > > > > > > > > > that
    > > > > > > > > > > > > might
    > > > > > > > > > > > > reveal sensitive information.
    > > > > > > > > > > > >
    > > > > > > > > > > > > Do you have an example of how the proposed metrics
    > > could
    > > > leak
    > > > > > > > > > sensitive
    > > > > > > > > > > > > information?
    > > > > > > > > > > > > As for limiting which metrics to export; I guess that could
    > > > > > > > > > > > > make sense in some
    > > > > > > > > > > > > very sensitive use-cases, but those users might
    > disable
    > > > > > metrics
    > > > > > > > > > > > altogether
    > > > > > > > > > > > > for now.
    > > > > > > > > > > > > Could these concerns be addressed by a later KIP?
    > > > > > > > > > > > >
    > > > > > > > > > > > >
    > > > > > > > > > > > >
    > > > > > > > > > > > > >
    > > > > > > > > > > > > > 4. As a user, how do you know if your application
    > is
    > > > > > actively
    > > > > > > > > > sending
    > > > > > > > > > > > > > metrics? Are there new metrics exposing what's
    > going
    > > > on,
    > > > > > like
    > > > > > > > how
    > > > > > > > > > much
    > > > > > > > > > > > > > data is being sent?
    > > > > > > > > > > > > >
    > > > > > > > > > > > >
    > > > > > > > > > > > > That's a good question.
    > > > > > > > > > > > > Since the proposed metrics interface is not aimed at,
    > > or
    > > > > > > directly
    > > > > > > > > > > > available
    > > > > > > > > > > > > to, the application
    > > > > > > > > > > > > I guess there's little point of adding it here, but
    > > > instead
    > > > > > > > adding
    > > > > > > > > > > > > something to the
    > > > > > > > > > > > > existing JMX metrics?
    > > > > > > > > > > > > Do you have any suggestions?
    > > > > > > > > > > > >
    > > > > > > > > > > > >
    > > > > > > > > > > > >
    > > > > > > > > > > > > > 5. If all metrics are enabled on a regular Consumer
    > > or
    > > > > > > > Producer,
    > > > > > > > > do
    > > > > > > > > > > > > > you have an idea how much throughput this would
    > use?
    > > > > > > > > > > > > >
    > > > > > > > > > > > >
    > > > > > > > > > > > > It depends on the number of partition/topics/etc the
    > > > client
    > > > > > is
    > > > > > > > > > producing
    > > > > > > > > > > > > to/consuming from.
    > > > > > > > > > > > > I'll add some sizes to the KIP for some typical
    > > > use-cases.
    > > > > > > > > > > > >
    > > > > > > > > > > > >
    > > > > > > > > > > > > Thanks,
    > > > > > > > > > > > > Magnus
    > > > > > > > > > > > >
    > > > > > > > > > > > >
    > > > > > > > > > > > > > Thanks
    > > > > > > > > > > > > >
    > > > > > > > > > > > > > On Tue, Oct 19, 2021 at 5:06 PM Magnus Edenhill <
    > > > > > > > > > magnus@edenhill.se>
    > > > > > > > > > > > > > wrote:
    > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > Den tis 19 okt. 2021 kl 13:22 skrev Tom Bentley <
    > > > > > > > > > tbentley@redhat.com
    > > > > > > > > > > > >:
    > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > Hi Magnus,
    > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > I reviewed the KIP since you called the vote
    > > > (sorry for
    > > > > > > not
    > > > > > > > > > > > reviewing
    > > > > > > > > > > > > > when
    > > > > > > > > > > > > > > > you announced your intention to call the
    > vote). I
    > > > have
    > > > > > a
    > > > > > > > few
    > > > > > > > > > > > > questions
    > > > > > > > > > > > > > on
    > > > > > > > > > > > > > > > some of the details.
    > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > 1. There's no Javadoc on
    > > > ClientTelemetryPayload.data(),
    > > > > > > so
    > > > > > > > I
    > > > > > > > > > don't
    > > > > > > > > > > > > know
    > > > > > > > > > > > > > > > whether the payload is exposed through this
    > > method
    > > > as
    > > > > > > > > > compressed or
    > > > > > > > > > > > > > not.
    > > > > > > > > > > > > > > > Later on you say "Decompression of the payloads
    > > > will be
    > > > > > > > > > handled by
    > > > > > > > > > > > > the
    > > > > > > > > > > > > > > > broker metrics plugin, the broker should
    > expose a
    > > > > > > suitable
    > > > > > > > > > > > > > decompression
    > > > > > > > > > > > > > > > API to the metrics plugin for this purpose.",
    > > which
    > > > > > > > suggests
    > > > > > > > > > it's
    > > > > > > > > > > > the
    > > > > > > > > > > > > > > > compressed data in the buffer, but then we
    > don't
    > > > know
    > > > > > > which
    > > > > > > > > > codec
    > > > > > > > > > > > was
    > > > > > > > > > > > > > used,
    > > > > > > > > > > > > > > > nor the API via which the plugin should
    > > decompress
    > > > it
    > > > > > if
    > > > > > > > > > required
    > > > > > > > > > > > for
    > > > > > > > > > > > > > > > forwarding to the ultimate metrics store.
    > Should
    > > > the
    > > > > > > > > > > > > > ClientTelemetryPayload
    > > > > > > > > > > > > > > > expose a method to get the compression and a
    > > > > > > decompressor?
    > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > Good point, updated.
    > > > > > > > > > > > > > >
    > > > > > > > > > > > > > >
    > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > 2. The client-side API is expressed as
    > > > StringOrError
    > > > > > > > > > > > > > > > ClientInstance::ClientInstanceId(int
    > > timeout_ms). I
    > > > > > > > > understand
    > > > > > > > > > that
    > > > > > > > > > > > > > you're
    > > > > > > > > > > > > > > > thinking about the librdkafka implementation,
    > but
    > > > it
    > > > > > > would
    > > > > > > > be
    > > > > > > > > > good
    > > > > > > > > > > > to
    > > > > > > > > > > > > > show
    > > > > > > > > > > > > > > > the API as it would appear on the Apache Kafka
    > > > clients.
    > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > This was meant as pseudo-code, but I changed it
    > to
    > > > Java.
    > > > > > > > > > > > > > >
    > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > 3. "PushTelemetryRequest|Response - protocol
    > > > request
    > > > > > used
    > > > > > > > by
    > > > > > > > > > the
    > > > > > > > > > > > > > client to
    > > > > > > > > > > > > > > > send metrics to any broker it is connected to."
    > > To
    > > > be
    > > > > > > > clear,
    > > > > > > > > > this
    > > > > > > > > > > > > means
    > > > > > > > > > > > > > > > that the client can choose any of the connected
    > > > brokers
    > > > > > > and
    > > > > > > > > > push to
    > > > > > > > > > > > > > just
    > > > > > > > > > > > > > > > one of them? What should a supporting client do
    > > if
    > > > it
    > > > > > > gets
    > > > > > > > an
    > > > > > > > > > error
    > > > > > > > > > > > > > when
    > > > > > > > > > > > > > > > pushing metrics to a broker, retry sending to
    > the
    > > > same
    > > > > > > > broker
    > > > > > > > > > or
    > > > > > > > > > > > try
    > > > > > > > > > > > > > > > pushing to another broker, or drop the metrics?
    > > > Should
    > > > > > > > > > supporting
    > > > > > > > > > > > > > clients
    > > > > > > > > > > > > > > > send successive requests to a single broker, or
    > > > round
    > > > > > > > robin,
    > > > > > > > > > or is
    > > > > > > > > > > > > > that up
    > > > > > > > > > > > > > > > to the client author? I'm guessing the
    > behaviour
    > > > should
    > > > > > > be
    > > > > > > > > > sticky
    > > > > > > > > > > > to
    > > > > > > > > > > > > > > > support the rate limiting features, but I think
    > > it
    > > > > > would
    > > > > > > be
    > > > > > > > > > good
    > > > > > > > > > > > for
    > > > > > > > > > > > > > client
    > > > > > > > > > > > > > > > authors if this section were explicit on the
    > > > > > recommended
    > > > > > > > > > behaviour.
    > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > You are right, I've updated the KIP to make this
    > > > clearer.
    > > > > > > > > > > > > > >
    > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > 4. "Mapping the client instance id to an actual
    > > > > > > application
    > > > > > > > > > > > instance
    > > > > > > > > > > > > > > > running on a (virtual) machine can be done by
    > > > > > inspecting
    > > > > > > > the
    > > > > > > > > > > > metrics
    > > > > > > > > > > > > > > > resource labels, such as the client source
    > > address
    > > > and
    > > > > > > > source
    > > > > > > > > > port,
    > > > > > > > > > > > > or
    > > > > > > > > > > > > > > > security principal, all of which are added by
    > the
    > > > > > > receiving
    > > > > > > > > > broker.
    > > > > > > > > > > > > > This
    > > > > > > > > > > > > > > > will allow the operator together with the user
    > to
    > > > > > > identify
    > > > > > > > > the
    > > > > > > > > > > > actual
    > > > > > > > > > > > > > > > application instance." Is this really always
    > > true?
    > > > The
    > > > > > > > source
    > > > > > > > > > IP
    > > > > > > > > > > > and
    > > > > > > > > > > > > > port
    > > > > > > > > > > > > > > > might be a loadbalancer/proxy in some setups.
    > The
    > > > > > > > principal,
    > > > > > > > > as
    > > > > > > > > > > > > already
    > > > > > > > > > > > > > > > mentioned in the KIP, might be shared between
    > > > multiple
    > > > > > > > > > > > applications.
    > > > > > > > > > > > > > So at
    > > > > > > > > > > > > > > > worst the organization running the clients
    > might
    > > > have
    > > > > > to
    > > > > > > > > > consult
    > > > > > > > > > > > the
    > > > > > > > > > > > > > logs
    > > > > > > > > > > > > > > > of a set of client applications, right?
    > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > Yes, that's correct. There's no guaranteed
    > mapping
    > > > from
    > > > > > > > > > > > > > client_instance_id
    > > > > > > > > > > > > > > to
    > > > > > > > > > > > > > > an actual instance, that's why the KIP recommends
    > > > client
    > > > > > > > > > > > > implementations
    > > > > > > > > > > > > > to
    > > > > > > > > > > > > > > log the client instance id
    > > > > > > > > > > > > > > upon retrieval, and also provide an API for the
    > > > > > application
    > > > > > > > to
    > > > > > > > > > > > retrieve
    > > > > > > > > > > > > > the
    > > > > > > > > > > > > > > instance id programmatically
    > > > > > > > > > > > > > > if it has a better way of exposing it.
    > > > > > > > > > > > > > >
    > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > 5. "Tests indicate that a compression ratio up to
    > > > 10x is
    > > > > > > > > > possible for
    > > > > > > > > > > > > the
    > > > > > > > > > > > > > > > standard metrics." Client authors might
    > > appreciate
    > > > your
    > > > > > > > > > mentioning
    > > > > > > > > > > > > > which
    > > > > > > > > > > > > > > > compression codec got these results.
    > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > Good point. Updated.
    > > > > > > > > > > > > > >
    > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > 6. "Should the client send a push request prior
    > > to
    > > > > > expiry
    > > > > > > > of
    > > > > > > > > > the
    > > > > > > > > > > > > > previously
    > > > > > > > > > > > > > > > calculated PushIntervalMs the broker will
    > discard
    > > > the
    > > > > > > > metrics
    > > > > > > > > > and
    > > > > > > > > > > > > > return a
    > > > > > > > > > > > > > > > PushTelemetryResponse with the ErrorCode set to
    > > > > > > > RateLimited."
    > > > > > > > > > Is
    > > > > > > > > > > > this
    > > > > > > > > > > > > > > > RATE_LIMITED a new error code? It's not
    > mentioned
    > > > in
    > > > > > the
    > > > > > > > "New
    > > > > > > > > > Error
    > > > > > > > > > > > > > Codes"
    > > > > > > > > > > > > > > > section.
    > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > That's a leftover, it should be using the
    > standard
    > > > > > > > ThrottleTime
    > > > > > > > > > > > > > mechanism.
    > > > > > > > > > > > > > > Fixed.
    > > > > > > > > > > > > > >
    > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > 7. In the section "Standard client resource
    > > labels"
    > > > > > > > > > application_id
    > > > > > > > > > > > is
    > > > > > > > > > > > > > > > described as Kafka Streams only, but the
    > section
    > > of
    > > > > > > "Client
    > > > > > > > > > > > > > Identification"
    > > > > > > > > > > > > > > > talks about "application instance id as an
    > > optional
    > > > > > > future
    > > > > > > > > > > > > nice-to-have
    > > > > > > > > > > > > > > > that may be included as a metrics label if it
    > has
    > > > been
    > > > > > > set
    > > > > > > > by
    > > > > > > > > > the
    > > > > > > > > > > > > > user", so
    > > > > > > > > > > > > > > > I'm confused whether non-Kafka Streams clients
    > > > should
    > > > > > set
    > > > > > > > an
    > > > > > > > > > > > > > application_id
    > > > > > > > > > > > > > > > or not.
    > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > I'll clarify this in the KIP, but basically we
    > > would
    > > > need
    > > > > > > to
    > > > > > > > > add
    > > > > > > > > > an `
    > > > > > > > > > > > > > > application.id` config
    > > > > > > > > > > > > > > property for non-streams clients for this
    > purpose,
    > > > and
    > > > > > > that's
    > > > > > > > > > outside
    > > > > > > > > > > > > the
    > > > > > > > > > > > > > > scope of this KIP since we want to make it
    > > > zero-conf:ish
    > > > > > on
    > > > > > > > the
    > > > > > > > > > > > client
    > > > > > > > > > > > > > side.
    > > > > > > > > > > > > > >
    > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > Kind regards,
    > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > Tom
    > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > Thanks for the review,
    > > > > > > > > > > > > > > Magnus
    > > > > > > > > > > > > > >
    > > > > > > > > > > > > > >
    > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > On Thu, Oct 7, 2021 at 5:26 PM Magnus Edenhill
    > <
    > > > > > > > > > magnus@edenhill.se
    > > > > > > > > > > > >
    > > > > > > > > > > > > > wrote:
    > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > > Hi all,
    > > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > > I've updated the KIP following our recent
    > > > discussions
    > > > > > > on
    > > > > > > > > the
    > > > > > > > > > > > > mailing
    > > > > > > > > > > > > > > > list:
    > > > > > > > > > > > > > > > >  - split the protocol in two, one for getting
    > > the
    > > > > > > metrics
    > > > > > > > > > > > > > subscriptions,
    > > > > > > > > > > > > > > > > and one for pushing the metrics.
    > > > > > > > > > > > > > > > >  - simplifications: initially only one
    > > supported
    > > > > > > metrics
    > > > > > > > > > format,
    > > > > > > > > > > > no
    > > > > > > > > > > > > > > > > client.id in the instance id, etc.
    > > > > > > > > > > > > > > > >  - made CLIENT_METRICS subscription
    > > configuration
    > > > > > > entries
    > > > > > > > > > more
    > > > > > > > > > > > > > structured
    > > > > > > > > > > > > > > > >    and allowing better client matching
    > > selectors
    > > > (not
    > > > > > > > only
    > > > > > > > > > on the
    > > > > > > > > > > > > > > > instance
    > > > > > > > > > > > > > > > > id, but also the other
    > > > > > > > > > > > > > > > >    client resource labels, such as
    > > > > > > client_software_name,
    > > > > > > > > > etc.).
    > > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > > Unless there are further comments I'll call
    > the
    > > > vote
    > > > > > > in a
    > > > > > > > > > day or
    > > > > > > > > > > > > two.
    > > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > > Regards,
    > > > > > > > > > > > > > > > > Magnus
    > > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > > Den mån 4 okt. 2021 kl 20:57 skrev Magnus
    > > > Edenhill <
    > > > > > > > > > > > > > magnus@edenhill.se>:
    > > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > > > Hi Gwen,
    > > > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > > > I'm finishing up the KIP based on the last
    > > > couple
    > > > > > of
    > > > > > > > > > discussion
    > > > > > > > > > > > > > points
    > > > > > > > > > > > > > > > in
    > > > > > > > > > > > > > > > > > this thread
    > > > > > > > > > > > > > > > > > and will call the Vote later this week.
    > > > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > > > Best,
    > > > > > > > > > > > > > > > > > Magnus
    > > > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > > > Den lör 2 okt. 2021 kl 02:01 skrev Gwen
    > > Shapira
    > > > > > > > > > > > > > > > > <gwen@confluent.io.invalid
    > > > > > > > > > > > > > > > > > >:
    > > > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > > >> Hey,
    > > > > > > > > > > > > > > > > >>
    > > > > > > > > > > > > > > > > >> I noticed that there was no discussion for
    > > the
    > > > > > last
    > > > > > > 10
    > > > > > > > > > days,
    > > > > > > > > > > > > but I
    > > > > > > > > > > > > > > > > >> couldn't
    > > > > > > > > > > > > > > > > >> find the vote thread. Is there one that
    > I'm
    > > > > > missing?
    > > > > > > > > > > > > > > > > >>
    > > > > > > > > > > > > > > > > >> Gwen
    > > > > > > > > > > > > > > > > >>
    > > > > > > > > > > > > > > > > >> On Wed, Sep 22, 2021 at 4:58 AM Magnus
    > > > Edenhill <
    > > > > > > > > > > > > > magnus@edenhill.se>
    > > > > > > > > > > > > > > > > >> wrote:
    > > > > > > > > > > > > > > > > >>
    > > > > > > > > > > > > > > > > >> > Den tis 21 sep. 2021 kl 06:58 skrev
    > Colin
    > > > > > McCabe <
    > > > > > > > > > > > > > > > cmccabe@apache.org
    > > > > > > > > > > > > > > > > >:
    > > > > > > > > > > > > > > > > >> >
    > > > > > > > > > > > > > > > > >> > > On Mon, Sep 20, 2021, at 17:35, Feng
    > Min
    > > > > > wrote:
    > > > > > > > > > > > > > > > > >> > > > Thanks Magnus & Colin for the
    > > > discussion.
    > > > > > > > > > > > > > > > > >> > > >
    > > > > > > > > > > > > > > > > >> > > > Based on KIP-714's stateless design,
    > > > Client
    > > > > > > can
    > > > > > > > > > pretty
    > > > > > > > > > > > > much
    > > > > > > > > > > > > > use
    > > > > > > > > > > > > > > > > any
    > > > > > > > > > > > > > > > > >> > > > connection to any broker to send
    > > > metrics. We
    > > > > > > are
    > > > > > > > > not
    > > > > > > > > > > > > > associating
    > > > > > > > > > > > > > > > > >> > > connection
    > > > > > > > > > > > > > > > > >> > > > with client metric state. Is my
    > > > > > understanding
    > > > > > > > > > correct?
    > > > > > > > > > > > If
    > > > > > > > > > > > > > yes,
    > > > > > > > > > > > > > > > > how
    > > > > > > > > > > > > > > > > >> > about
    > > > > > > > > > > > > > > > > >> > > > the following two scenarios
    > > > > > > > > > > > > > > > > >> > > >
    > > > > > > > > > > > > > > > > >> > > > 1) One Client (Client-ID) registers
    > > two
    > > > > > > > different
    > > > > > > > > > client
    > > > > > > > > > > > > > > > instance
    > > > > > > > > > > > > > > > > id
    > > > > > > > > > > > > > > > > >> > via
    > > > > > > > > > > > > > > > > >> > > > separate registration. Is it
    > > permitted?
    > > > If
    > > > > > OK,
    > > > > > > > how
    > > > > > > > > > to
    > > > > > > > > > > > > > > > distinguish
    > > > > > > > > > > > > > > > > >> them
    > > > > > > > > > > > > > > > > >> > > from
    > > > > > > > > > > > > > > > > >> > > > the case 2 below.
    > > > > > > > > > > > > > > > > >> > > >
    > > > > > > > > > > > > > > > > >> > >
    > > > > > > > > > > > > > > > > >> > > Hi Feng,
    > > > > > > > > > > > > > > > > >> > >
    > > > > > > > > > > > > > > > > >> > > My understanding, which Magnus can
    > > > clarify I
    > > > > > > > guess,
    > > > > > > > > is
    > > > > > > > > > > > that
    > > > > > > > > > > > > > you
    > > > > > > > > > > > > > > > > could
    > > > > > > > > > > > > > > > > >> > have
    > > > > > > > > > > > > > > > > >> > > something like two Producer instances
    > > > running
    > > > > > > with
    > > > > > > > > the
    > > > > > > > > > > > same
    > > > > > > > > > > > > > > > > client.id
    > > > > > > > > > > > > > > > > >> > > (perhaps because they're using the
    > same
    > > > config
    > > > > > > > file,
    > > > > > > > > > for
    > > > > > > > > > > > > > example).
    > > > > > > > > > > > > > > > > >> They
    > > > > > > > > > > > > > > > > >> > > could even be in the same process. But
    > > > they
    > > > > > > would
    > > > > > > > > get
    > > > > > > > > > > > > separate
    > > > > > > > > > > > > > > > > UUIDs.
    > > > > > > > > > > > > > > > > >> > >
    > > > > > > > > > > > > > > > > >> > > I believe Magnus used the term client
    > to
    > > > mean
    > > > > > > > > > "Producer or
    > > > > > > > > > > > > > > > > Consumer".
    > > > > > > > > > > > > > > > > >> So
    > > > > > > > > > > > > > > > > >> > > if you have both a Producer and a
    > > > Consumer in
    > > > > > > your
    > > > > > > > > > > > > > application I
    > > > > > > > > > > > > > > > > would
    > > > > > > > > > > > > > > > > >> > > expect you'd get separate UUIDs for
    > > both.
    > > > > > Again
    > > > > > > > > > Magnus can
    > > > > > > > > > > > > > chime
    > > > > > > > > > > > > > > > in
    > > > > > > > > > > > > > > > > >> > here, I
    > > > > > > > > > > > > > > > > >> > > guess.
    > > > > > > > > > > > > > > > > >> > >
    > > > > > > > > > > > > > > > > >> >
    > > > > > > > > > > > > > > > > >> > That's correct.
    > > > > > > > > > > > > > > > > >> >
    > > > > > > > > > > > > > > > > >> >
    > > > > > > > > > > > > > > > > >> > >
    > > > > > > > > > > > > > > > > >> > > > 2) How about the client restarting?
    > > > What's
    > > > > > the
    > > > > > > > > > > > > expectation?
    > > > > > > > > > > > > > > > Should
    > > > > > > > > > > > > > > > > >> the
    > > > > > > > > > > > > > > > > >> > > > server expect the client to carry a
    > > > > > persisted
    > > > > > > > > client
    > > > > > > > > > > > > > instance id
    > > > > > > > > > > > > > > > > or
    > > > > > > > > > > > > > > > > >> > > should
    > > > > > > > > > > > > > > > > >> > > > the client be treated as a new
    > > instance?
    > > > > > > > > > > > > > > > > >> > >
    > > > > > > > > > > > > > > > > >> > > The KIP doesn't describe any mechanism
    > > for
    > > > > > > > > > persistence,
    > > > > > > > > > > > so I
    > > > > > > > > > > > > > would
    > > > > > > > > > > > > > > > > >> assume
    > > > > > > > > > > > > > > > > >> > > that when you restart the client you
    > get
    > > > a new
    > > > > > > > > UUID. I
    > > > > > > > > > > > agree
    > > > > > > > > > > > > > that
    > > > > > > > > > > > > > > > it
    > > > > > > > > > > > > > > > > >> > would
    > > > > > > > > > > > > > > > > >> > > be good to spell this out.
    > > > > > > > > > > > > > > > > >> > >
    > > > > > > > > > > > > > > > > >> > >
    > > > > > > > > > > > > > > > > >> > Right, it will not be persisted since a
    > > > client
    > > > > > > > > instance
    > > > > > > > > > > > can't
    > > > > > > > > > > > > be
    > > > > > > > > > > > > > > > > >> restarted.
    > > > > > > > > > > > > > > > > >> >
    > > > > > > > > > > > > > > > > >> > Will update the KIP to make this
    > clearer.
    > > > > > > > > > > > > > > > > >> >
    > > > > > > > > > > > > > > > > >> > /Magnus
    > > > > > > > > > > > > > > > > >> >
    > > > > > > > > > > > > > > > > >>
    > > > > > > > > > > > > > > > > >>
    > > > > > > > > > > > > > > > > >> --
    > > > > > > > > > > > > > > > > >> Gwen Shapira
    > > > > > > > > > > > > > > > > >> Engineering Manager | Confluent
    > > > > > > > > > > > > > > > > >> 650.450.2760 | @gwenshap
    > > > > > > > > > > > > > > > > >> Follow us: Twitter | blog
    > > > > > > > > > > > > > > > > >>

Re: [DISCUSS] KIP-714: Client metrics and observability

Posted by Magnus Edenhill <ma...@edenhill.se>.
Hey Ismael,


> > The PushTelemetryRequest handler decompresses the payload before passing
> it
> > to the metrics plugin.
> > This was done to avoid having to expose a public decompression interface
> to
> > metrics plugin developers.
> >
>
> Are there cases where the metrics plugin developers would want to forward
> the compressed payload without decompressing?
>

Maybe, but most plugins probably want to either add some extra information
(e.g., from the auth context), or convert to another format, so the original
compressed blob is most likely not that interesting.
In any case the plugin will want to inspect the uncompressed metrics data to
verify it is not garbage before forwarding it upstream.

We could always add an option later to allow passing the metrics payload
verbatim if the need arises.
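
To make the flow described above concrete, here is a minimal Python sketch (not the broker's actual code) of a handler that decompresses the pushed payload and hands only the uncompressed bytes to the plugin. The names `handle_push_telemetry` and `plugin`, and the string compression identifiers, are hypothetical; only gzip and uncompressed payloads are handled so the sketch stays within the standard library.

```python
import gzip


def handle_push_telemetry(compression_type, payload, plugin):
    """Decompress a pushed metrics payload and pass the raw bytes to the plugin.

    All names here are illustrative assumptions, not Kafka APIs.
    """
    if compression_type == "none":
        raw = payload
    elif compression_type == "gzip":
        raw = gzip.decompress(payload)
    else:
        # zstd, lz4 and snappy would need third-party libraries.
        raise ValueError(f"unsupported compression type: {compression_type}")
    # The plugin never sees the compressed blob or the compression type.
    plugin(raw)
    return len(raw)


received = []
handle_push_telemetry("gzip", gzip.compress(b"otlp-metrics-bytes"), received.append)
print(received)
```

With this shape, a later "verbatim" option would amount to an extra code path that skips the decompression step.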

Thanks,
Magnus

Re: [DISCUSS] KIP-714: Client metrics and observability

Posted by Xavier Léauté <xa...@confluent.io.INVALID>.
>
> Are there cases where the metrics plugin developers would want to forward
> the compressed payload without decompressing?


The only interoperable use-case I can think of would be to forward the
payloads directly to an OpenTelemetry collector backend.
Today OTLP only mandates gzip/none compression support for gRPC and HTTP
protocols, so this might only work for a limited set
of compression formats (or no compression) out of the box.

see
https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/protocol/otlp.md#protocol-details

Maybe we could consider exposing the raw uncompressed bytes regardless of
client side compression, if someone wanted
to avoid the cost of de-serializing the payload, since there would always
be an option to forward that as-is, and let the OpenTelemetry collector add
tags relevant to the broker originating those client metrics.
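
As a rough illustration of the constraint above, the following Python sketch decides whether a payload could be forwarded as-is over OTLP/HTTP. The helper `otlp_http_headers` and the compression identifiers are hypothetical; the only facts assumed are that gzip and no compression are the mandated formats, and that OTLP/HTTP carries protobuf bodies with a `Content-Encoding: gzip` header when compressed.

```python
# Compression formats the text above says OTLP mandates support for.
OTLP_MANDATED = {"gzip", "none"}


def otlp_http_headers(compression_type):
    """Return HTTP headers for forwarding a payload as-is, or None if the
    payload would have to be decompressed (and possibly recompressed) first.

    Hypothetical helper; the compression identifiers are assumptions.
    """
    if compression_type not in OTLP_MANDATED:
        return None
    headers = {"Content-Type": "application/x-protobuf"}
    if compression_type == "gzip":
        headers["Content-Encoding"] = "gzip"
    return headers


for codec in ("gzip", "none", "zstd"):
    print(codec, otlp_http_headers(codec))
```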

Re: [DISCUSS] KIP-714: Client metrics and observability

Posted by Ismael Juma <is...@juma.me.uk>.
On Wed, Mar 30, 2022 at 4:08 AM Magnus Edenhill <ma...@edenhill.se> wrote:

> > 41. We include CompressionType in PushTelemetryRequestV0, but not in
> > ClientTelemetryPayload. How would the implementer know the compression
> type
> > for the telemetry payload?
> The PushTelemetryRequest handler decompresses the payload before passing it
> to the metrics plugin.
> This was done to avoid having to expose a public decompression interface to
> metrics plugin developers.
>

Are there cases where the metrics plugin developers would want to forward
the compressed payload without decompressing?

Ismael

Re: [DISCUSS] KIP-714: Client metrics and observability

Posted by Magnus Edenhill <ma...@edenhill.se>.
Hey Jun,

see response inline:

Den mån 21 mars 2022 kl 19:31 skrev Jun Rao <ju...@confluent.io.invalid>:

> Hi, Kirk, Sarat,
>
> A few more comments.
>
> 40. GetTelemetrySubscriptionsResponseV0 : RequestedMetrics Array[string]
> uses "Array[0] empty string" to represent all metrics subscribed. We had a
> similar issue with the topics field in MetadataRequest and used the
> following convention.
> In version 1 and higher, an empty array indicates "request metadata for no
> topics," and a null array is used to indicate "request metadata for all
> topics."
> Should we use the same convention in GetTelemetrySubscriptionsResponseV0?
>

Right, I considered this but chose the current design because the
subscriptions are prefix-matched,
so an empty string will automatically match everything.
It is not critical in any way, so if you feel it is better to follow the
way MetadataRequest does it, I can change it?
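
To illustrate why the empty string works as a match-all, here is a small Python sketch of prefix matching. The function name `matches_subscription` and the metric names other than client.io.wait.time are made up for the example; this is not the client's actual matching code.

```python
def matches_subscription(metric_name, requested_prefixes):
    """Return True if metric_name starts with any of the requested prefixes."""
    return any(metric_name.startswith(prefix) for prefix in requested_prefixes)


# Hypothetical metric names, apart from client.io.wait.time.
metrics = ["client.io.wait.time", "client.example.count", "producer.example.bytes"]

# An array holding a single empty string: every name starts with "".
print([m for m in metrics if matches_subscription(m, [""])])
# A narrower prefix selects a subset.
print([m for m in metrics if matches_subscription(m, ["client."])])
# An empty array matches nothing.
print([m for m in metrics if matches_subscription(m, [])])
```

Under the MetadataRequest convention, the match-all case would instead need an explicit null check before the prefix loop.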



>
> 41. We include CompressionType in PushTelemetryRequestV0, but not in
> ClientTelemetryPayload. How would the implementer know the compression type
> for the telemetry payload?
>
>
The PushTelemetryRequest handler decompresses the payload before passing it
to the metrics plugin.
This was done to avoid having to expose a public decompression interface to
metrics plugin developers.



> 42. For blocking the metrics for certain clients in the following example,
> could you describe the corresponding config value used through the
> kafka-config command?
> kafka-client-metrics.sh --bootstrap-server $BROKERS \
>    --add \
>    --name 'Disabe_b69cc35a' \  # A descriptive name makes it easier to
> clean up old subscriptions.
>    --match client_instance_id=b69cc35a-7a54-4790-aa69-cc2bd4ee4538 \  #
> Match this specific client instance
>    --block
>

--block will set the "interval" ConfigEntry to "0", which overrides and
disables all accumulated subscriptions for the matching client instance.
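
A minimal Python sketch of that override, under stated assumptions: each matching subscription is modelled as a dict with an `interval_ms` and a list of metric prefixes, a zero interval wins over everything else, and otherwise the prefixes are unioned and the smallest interval is used. The merge rule for the non-blocked case and all names are assumptions made for illustration, not taken from the KIP.

```python
def effective_subscription(matching):
    """Combine all subscriptions that match one client instance.

    A subscription with interval_ms == 0 (what --block sets) disables
    everything. The union/min merge below is an illustrative assumption.
    """
    if not matching or any(sub["interval_ms"] == 0 for sub in matching):
        return {"interval_ms": 0, "metrics": []}
    prefixes = sorted({p for sub in matching for p in sub["metrics"]})
    interval_ms = min(sub["interval_ms"] for sub in matching)
    return {"interval_ms": interval_ms, "metrics": prefixes}


subs = [
    {"interval_ms": 60000, "metrics": ["client.io."]},
    {"interval_ms": 30000, "metrics": ["producer."]},
]
print(effective_subscription(subs))

# Adding a blocking subscription silences the client without removing the others.
blocked = subs + [{"interval_ms": 0, "metrics": []}]
print(effective_subscription(blocked))
```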


Thanks,
Magnus



> On Thu, Mar 10, 2022 at 11:57 AM Jun Rao <ju...@confluent.io> wrote:
>
> > Hi, Kirk, Sarat,
> >
> > Thanks for the reply.
> >
> > 28. On the broker, we typically use Yammer metrics. Only for metrics that
> > depend on Kafka metric features (e.g., quota), we use the Kafka metric.
> > Yammer metrics have 4 types: gauge, meter, histogram and timer. meter
> > calculates a rate, but also exposes an accumulated value.
> >
> > 29. The Histogram class in org.apache.kafka.common.metrics.stats was
> never
> > used in the client metrics. The implementation of Histogram only
> provides a
> > fixed number of values in the domain and may not capture the quantiles
> very
> > accurately. So, we punted on using it.
> >
> > Thanks,
> >
> > Jun
> >
> >
> >
> > On Thu, Mar 10, 2022 at 10:59 AM Sarat Kakarla
> > <sk...@confluent.io.invalid> wrote:
> >
> >> Jun,
> >>
> >>   >>  28. For the broker metrics, could you spell out the full metric
> name
> >>   >>   including groups, tags, etc? We typically don't add the broker_id
> >> label for
> >>   >>   broker metrics. Also, brokers use Yammer metrics, which doesn't
> >> have type
> >>   >>   Sum.
> >>
> >> Sure,  I will update the KIP-714 with the above information, will remove
> >> the broker-id label from the metrics.
> >>
> >> Regarding the type, is CumulativeSum the right type to use in place of
> >> Sum?
> >>
> >> Thanks
> >> Sarat
> >>
> >>
> >> On 3/8/22, 5:48 PM, "Jun Rao" <ju...@confluent.io.INVALID> wrote:
> >>
> >>     Hi, Magnus, Sarat and Xavier,
> >>
> >>     Thanks for the reply. A few more comments below.
> >>
> >>     20. It seems that we are piggybacking the plugin on the
> >>     existing MetricsReporter. So, this seems fine.
> >>
> >>     21. That could work. Are we requiring any additional jar dependency
> >> on the
> >>     client? Or, are you suggesting that we check the runtime dependency
> >> to pick
> >>     the compression codec?
> >>
> >>     28. For the broker metrics, could you spell out the full metric name
> >>     including groups, tags, etc? We typically don't add the broker_id
> >> label for
> >>     broker metrics. Also, brokers use Yammer metrics, which doesn't have
> >> type
> >>     Sum.
> >>
> >>     29. There are several client metrics listed as histogram. However,
> >> the java
> >>     client currently doesn't support histogram type.
> >>
> >>     30. Could you show an example of the metric payload in
> >> PushTelemetryRequest
> >>     to help understand how we organize metrics at different levels (per
> >>     instance, per topic, per partition, per broker, etc)?
> >>
> >>     31. Could you add a bit more detail on which client thread sends the
> >>     PushTelemetryRequest?
> >>
> >>     Thanks,
> >>
> >>     Jun
> >>
> >>     On Mon, Mar 7, 2022 at 11:48 AM Magnus Edenhill <magnus@edenhill.se
> >
> >> wrote:
> >>
> >>     > Hi Jun,
> >>     >
> >>     > thanks for your insightful questions; see my answers below.
> >>     > There have been a number of clarifications to the KIP.
> >>     >
> >>     >
> >>     >
> >>     > Den tors 27 jan. 2022 kl 20:08 skrev Jun Rao
> >> <ju...@confluent.io.invalid>:
> >>     >
> >>     > > Hi, Magnus,
> >>     > >
> >>     > > Thanks for updating the KIP. The overall approach makes sense to
> >> me. A
> >>     > few
> >>     > > more detailed comments below.
> >>     > >
> >>     > > 20. ClientTelemetry: Should it be extending configurable and
> >> closable?
> >>     > >
> >>     >
> >>     > I'll pass this question to Sarat and/or Xavier.
> >>     >
> >>     >
> >>     >
> >>     > > 21. Compression of the metrics on the client: what's the
> default?
> >>     > >
> >>     >
> >>     > How about we specify a prioritized list: zstd, lz4, snappy, gzip?
> >>     > But ultimately it is up to what the client supports.
> >>     >
> >>     >
> >>     > 23. A client instance is considered a metric resource and the
> >>     > > resource-level (thus client instance level) labels could
> include:
> >>     > >     client_software_name=confluent-kafka-python
> >>     > >     client_software_version=v2.1.3
> >>     > >     client_instance_id=B64CD139-3975-440A-91D4
> >>     > >     transactional_id=someTxnApp
> >>     > > Are those labels added in PushTelemetryRequest? If so, are they
> >> per
> >>     > metric
> >>     > > or per request?
> >>     > >
> >>     >
> >>     >
> >>     > client_software* and client_instance_id are not added by the client,
> >>     > but are available to the broker-side metrics plugin to add as it
> >>     > sees fit, so I'm removing them from the KIP.
> >>     >
> >>     > transactional_id, group_id, etc., which I believe will be useful in
> >>     > troubleshooting, are included only once (per push) as resource-level
> >>     > attributes (the client instance is a singular resource).
> >>     >
> >>     >
> >>     > >
> >>     > > 24.  "the broker will only send
> >>     > > GetTelemetrySubscriptionsResponse.DeltaTemporality=True" :
> >>     > > 24.1 If it's always true, does it need to be part of the
> protocol?
> >>     > >
> >>     >
> >>     > We're anticipating that it will take a lot longer to upgrade the
> >> majority
> >>     > of clients than the
> >>     > broker/plugin side, which is why we want the client to support
> both
> >>     > temporalities out-of-the-box
> >>     > so that cumulative reporting can be turned on seamlessly in the
> >> future.
> >>     >
> >>     >
> >>     >
> >>     > > 24.2 Does delta only apply to Counter type?
> >>     > >
> >>     >
> >>     >
> >>     > And Histograms. More details in Xavier's OTLP link.
> >>     >
> >>     >
> >>     >
> >>     > > 24.3 In the delta representation, the first request needs to
> send
> >> the
> >>     > full
> >>     > > value, how does the broker plugin know whether a value is full
> or
> >> delta?
> >>     > >
> >>     >
> >>     > The client may (should) send the start time for each metric
> sample,
> >>     > indicating when
> >>     > the metric began to be collected.
> >>     > We've discussed whether this should be the client instance start
> >> time or
> >>     > the time when a matching
> >>     > metric subscription for that metric is received.
> >>     > For completeness we recommend using the former, the client
> instance
> >> start
> >>     > time.
> >>     >
> >>     >
> >>     >
> >>     > > 25. quota:
> >>     > > 25.1 Since we are fitting PushTelemetryRequest into the existing
> >> request
> >>     > > quota, it would be useful to document the impact, i.e. client
> >> metric
> >>     > > throttling causes the data from the same client to be delayed.
> >>     > > 25.2 Is PushTelemetryRequest subject to the write bandwidth
> quota
> >> like
> >>     > the
> >>     > > producer?
> >>     > >
> >>     >
> >>     >
> >>     > Yes, it should be, so as to protect the cluster from rogue clients.
> >>     > But, in practice the size of metrics will be quite low (e.g.,
> >> 1-10kb per
> >>     > 60s interval), so I don't think this will pose a problem.
> >>     > The KIP has been updated with more details on quota/throttling
> >> behaviour,
> >>     > see the
> >>     > "Throttling and rate-limiting" section.
> >>     >
> >>     >
> >>     > 25.3 THROTTLING_QUOTA_EXCEEDED: Currently, we don't send this
> error
> >> when
> >>     > > the request/bandwidth quota is exceeded since those requests are
> >> not
> >>     > > rejected. We only set this error when the request is rejected
> >> (e.g.,
> >>     > topic
> >>     > > creation). It would be useful to clarify when this error is
> used.
> >>     > >
> >>     >
> >>     > Right, I was trying to reuse an existing error-code. We can
> >> introduce
> >>     > a new one for the case where a client pushes metrics at a higher
> >>     > frequency than the configured push interval (e.g., out-of-profile sends).
> >>     > This causes the broker to drop those metrics and send this error
> >> code back
> >>     > to the client. There will be no connection throttling /
> >> channel-muting in
> >>     > this
> >>     > case (unless the standard quotas are exceeded).
> >>     >
> >>     >
> >>     > > 27. kafka-client-metrics.sh: Could we add an example on how to
> >> disable a
> >>     > > bad client?
> >>     > >
> >>     >
> >>     > There's now a --block option to kafka-client-metrics.sh which
> >> overrides all
> >>     > subscriptions
> >>     > for the matched client(s). This allows silencing metrics for one
> or
> >> more
> >>     > clients without having
> >>     > to remove existing subscriptions. From the client's perspective it
> >> will
> >>     > look like it no longer has
> >>     > any subscriptions.
> >>     >
> >>     > # Block metrics collection for a specific client instance
> >>     > $ kafka-client-metrics.sh --bootstrap-server $BROKERS \
> >>     >    --add \
> >>     >    --name 'Disabe_b69cc35a' \  # A descriptive name makes it
> easier
> >> to
> >>     > clean up old subscriptions.
> >>     >    --match client_instance_id=b69cc35a-7a54-4790-aa69-cc2bd4ee4538
> >> \  #
> >>     > Match this specific client instance
> >>     >    --block
> >>     >
> >>     >
> >>     >
> >>     >
> >>     > > 28. New broker side metrics: Could we spell out the details of
> the
> >>     > metrics
> >>     > > (e.g., group, tags, etc)?
> >>     > >
> >>     >
> >>     > KIP has been updated accordingly (thanks Sarat).
> >>     >
> >>     >
> >>     >
> >>     > >
> >>     > > 29. Client instance-level metrics: client.io.wait.time is a
> gauge
> >> not a
> >>     > > histogram.
> >>     > >
> >>     >
> >>     > I believe a population/distribution should preferably be
> >> represented as a
> >>     > histogram, space permitting,
> >>     > and only secondarily as a Gauge average.
> >>     > While we might not want to maintain a bunch of histograms for each
> >>     > partition, since that could be
> >>     > quite space consuming, this client.io.wait.time is a single metric
> >> per
> >>     > client instance and can
> >>     > thus afford a Histogram representation.
> >>     >
> >>     >
> >>     >
> >>     > Thanks,
> >>     > Magnus
> >>     >
> >>     >
> >>     >
> >>     > > Thanks,
> >>     > >
> >>     > > Jun
> >>     > >
> >>     > > On Wed, Jan 26, 2022 at 6:32 AM Magnus Edenhill <
> >> magnus@edenhill.se>
> >>     > > wrote:
> >>     > >
> >>     > > > Hi all,
> >>     > > >
> >>     > > > I've updated the KIP with responses to the latest comments:
> >> Java client
> >>     > > > dependencies (Thanks Kirk!), alternate designs (separate
> >> cluster,
> >>     > > separate
> >>     > > > producer, etc), etc.
> >>     > > >
> >>     > > > I will revive the vote thread.
> >>     > > >
> >>     > > > Thanks,
> >>     > > > Magnus
> >>     > > >
> >>     > > >
> >>     > > > Den mån 13 dec. 2021 kl 22:32 skrev Ryanne Dolan <
> >>     > ryannedolan@gmail.com
> >>     > > >:
> >>     > > >
> >>     > > > > I think we should be very careful about introducing new
> >> runtime
> >>     > > > > dependencies into the clients. Historically this has been
> >> rare and
> >>     > > > > essentially necessary (e.g. compression libs).
> >>     > > > >
> >>     > > > > Ryanne
> >>     > > > >
> >>     > > > > On Mon, Dec 13, 2021, 1:06 PM Kirk True <
> >> kirk@mustardgrain.com>
> >>     > wrote:
> >>     > > > >
> >>     > > > > > Hi Jun,
> >>     > > > > >
> >>     > > > > > On Thu, Dec 9, 2021, at 2:28 PM, Jun Rao wrote:
> >>     > > > > > > 13. Using OpenTelemetry. Does that require runtime
> >> dependency
> >>     > > > > > > on OpenTelemetry library? How good is the compatibility
> >> story
> >>     > > > > > > of OpenTelemetry? This is important since an application
> >> could
> >>     > have
> >>     > > > > other
> >>     > > > > > > OpenTelemetry dependencies than the Kafka client.
> >>     > > > > >
> >>     > > > > > The current design is that the OpenTelemetry JARs would
> >> ship with
> >>     > the
> >>     > > > > > client. Perhaps we can design the client such that the
> JARs
> >> aren't
> >>     > > even
> >>     > > > > > loaded if the user has opted out. The user could even
> >> exclude the
> >>     > > JARs
> >>     > > > > from
> >>     > > > > > their dependencies if they so wished.
> >>     > > > > >
> >>     > > > > > I can't speak to the compatibility of the libraries. Is it
> >> possible
> >>     > > > that
> >>     > > > > > we include a shaded version?
> >>     > > > > >
> >>     > > > > > Thanks,
> >>     > > > > > Kirk
> >>     > > > > >
> >>     > > > > > >
> >>     > > > > > > 14. The proposal listed idempotence=true. This is more
> of
> >> a
> >>     > > > > configuration
> >>     > > > > > > than a metric. Are we including that as a metric? What
> >> other
> >>     > > > > > configurations
> >>     > > > > > > are we including? Should we separate the configurations
> >> from the
> >>     > > > > metrics?
> >>     > > > > > >
> >>     > > > > > > Thanks,
> >>     > > > > > >
> >>     > > > > > > Jun
> >>     > > > > > >
> >>     > > > > > > On Mon, Nov 29, 2021 at 7:34 AM Magnus Edenhill <
> >>     > > magnus@edenhill.se>
> >>     > > > > > wrote:
> >>     > > > > > >
> >>     > > > > > > > Hey Bob,
> >>     > > > > > > >
> >>     > > > > > > > That's a good point.
> >>     > > > > > > >
> >>     > > > > > > > Request type labels were considered but since they're
> >> already
> >>     > > > tracked
> >>     > > > > > by
> >>     > > > > > > > broker-side metrics
> >>     > > > > > > > they were left out to avoid metric duplication,
> >> however
> >>     > those
> >>     > > > > > metrics
> >>     > > > > > > > are not per connection,
> >>     > > > > > > > so they won't be that useful in practice for
> >> troubleshooting
> >>     > > > specific
> >>     > > > > > > > client instances.
> >>     > > > > > > >
> >>     > > > > > > > I'll add the request_type label to the relevant
> metrics.
> >>     > > > > > > >
> >>     > > > > > > > Thanks,
> >>     > > > > > > > Magnus
> >>     > > > > > > >
> >>     > > > > > > >
> >>     > > > > > > > On Tue, Nov 23, 2021 at 19:20, Bob Barrett
> >>     > > > > > > > <bo...@confluent.io.invalid> wrote:
> >>     > > > > > > >
> >>     > > > > > > > > Hi Magnus,
> >>     > > > > > > > >
> >>     > > > > > > > > Thanks for the thorough KIP, this seems very useful.
> >>     > > > > > > > >
> >>     > > > > > > > > Would it make sense to include the request type as a
> >> label
> >>     > for
> >>     > > > the
> >>     > > > > > > > > `client.request.success`, `client.request.errors`
> and
> >>     > > > > > > > `client.request.rtt`
> >>     > > > > > > > > metrics? I think it would be very useful to see
> which
> >>     > specific
> >>     > > > > > requests
> >>     > > > > > > > are
> >>     > > > > > > > > succeeding and failing for a client. One specific
> >> case I can
> >>     > > > think
> >>     > > > > of
> >>     > > > > > > > where
> >>     > > > > > > > > this could be useful is producer batch timeouts. If
> a
> >> Java
> >>     > > > > > application
> >>     > > > > > > > does
> >>     > > > > > > > > not enable producer client logs (unfortunately, in
> my
> >>     > > experience
> >>     > > > > this
> >>     > > > > > > > > happens more often than it should), the application
> >> logs will
> >>     > > > only
> >>     > > > > > > > contain
> >>     > > > > > > > > the expiration error message, but no information
> >> about what
> >>     > is
> >>     > > > > > causing
> >>     > > > > > > > the
> >>     > > > > > > > > timeout. The requests might all be succeeding but
> >> taking too
> >>     > > long
> >>     > > > > to
> >>     > > > > > > > > process batches, or metadata requests might be
> >> failing, or
> >>     > some
> >>     > > > or
> >>     > > > > > all
> >>     > > > > > > > > produce requests might be failing (if the bootstrap
> >> servers
> >>     > are
> >>     > > > > > reachable
> >>     > > > > > > > > from the client but one or more other brokers are
> >> not, for
> >>     > > > > example).
> >>     > > > > > If
> >>     > > > > > > > the
> >>     > > > > > > > > cluster operator is able to identify the specific
> >> requests
> >>     > that
> >>     > > > are
> >>     > > > > > slow
> >>     > > > > > > > or
> >>     > > > > > > > > failing for a client, they will be better able to
> >> diagnose
> >>     > the
> >>     > > > > issue
> >>     > > > > > > > > causing batch timeouts.
> >>     > > > > > > > >
> >>     > > > > > > > > One drawback I can think of is that this will
> >> increase the
> >>     > > > > > cardinality of
> >>     > > > > > > > > the request metrics. But any given client is only
> >> going to
> >>     > use
> >>     > > a
> >>     > > > > > small
> >>     > > > > > > > > subset of the request types, and since we already
> have
> >>     > > partition
> >>     > > > > > labels
> >>     > > > > > > > for
> >>     > > > > > > > > the topic-level metrics, I think request labels will
> >> still
> >>     > make
> >>     > > > up
> >>     > > > > a
> >>     > > > > > > > > relatively small percentage of the set of metrics.
> >>     > > > > > > > >
> >>     > > > > > > > > Thanks,
> >>     > > > > > > > > Bob
> >>     > > > > > > > >
> >>     > > > > > > > > On Mon, Nov 22, 2021 at 2:08 AM Viktor Somogyi-Vass
> <
> >>     > > > > > > > > viktorsomogyi@gmail.com>
> >>     > > > > > > > > wrote:
> >>     > > > > > > > >
> >>     > > > > > > > > > Hi Magnus,
> >>     > > > > > > > > >
> >>     > > > > > > > > > I think this is a very useful addition. We also
> >> have a
> >>     > > similar
> >>     > > > > (but
> >>     > > > > > > > much
> >>     > > > > > > > > > more simplistic) implementation of this. Maybe I
> >> missed it
> >>     > in
> >>     > > > the
> >>     > > > > > KIP
> >>     > > > > > > > but
> >>     > > > > > > > > > what about adding metrics about the subscription
> >> cache
> >>     > > itself?
> >>     > > > > > That I
> >>     > > > > > > > > think
> >>     > > > > > > > > > would improve its usability and debuggability as
> >> we'd be
> >>     > able
> >>     > > > to
> >>     > > > > > see
> >>     > > > > > > > its
> >>     > > > > > > > > > performance, hit/miss rates, eviction counts and
> >> others.
> >>     > > > > > > > > >
> >>     > > > > > > > > > Best,
> >>     > > > > > > > > > Viktor
> >>     > > > > > > > > >
> >>     > > > > > > > > > On Thu, Nov 18, 2021 at 5:12 PM Magnus Edenhill <
> >>     > > > > > magnus@edenhill.se>
> >>     > > > > > > > > > wrote:
> >>     > > > > > > > > >
> >>     > > > > > > > > > > Hi Mickael,
> >>     > > > > > > > > > >
> >>     > > > > > > > > > > see inline.
> >>     > > > > > > > > > >
> >>     > > > > > > > > > > On Wed, Nov 10, 2021 at 15:21, Mickael
> >> Maison <
> >>     > > > > > > > > > > mickael.maison@gmail.com
> >>     > > > > > > > > > > > wrote:
> >>     > > > > > > > > > >
> >>     > > > > > > > > > > > Hi Magnus,
> >>     > > > > > > > > > > >
> >>     > > > > > > > > > > > I see you've addressed some of the points I
> >> raised
> >>     > above
> >>     > > > but
> >>     > > > > > some
> >>     > > > > > > > (4,
> >>     > > > > > > > > > > > 5) have not been addressed yet.
> >>     > > > > > > > > > > >
> >>     > > > > > > > > > >
> >>     > > > > > > > > > > Re 4) How will the user/app know metrics are
> >> being sent.
> >>     > > > > > > > > > >
> >>     > > > > > > > > > > One possibility is to add a JMX metric (thus for
> >> user
> >>     > > > > > consumption)
> >>     > > > > > > > for
> >>     > > > > > > > > > the
> >>     > > > > > > > > > > number of metric pushes the
> >>     > > > > > > > > > > client has performed, or perhaps the number of
> >> metrics
> >>     > > > > > subscriptions
> >>     > > > > > > > > > > currently being collected.
> >>     > > > > > > > > > > Would that be sufficient?
> >>     > > > > > > > > > >
> >>     > > > > > > > > > > Re 5) Metric sizes and rates
> >>     > > > > > > > > > >
> >>     > > > > > > > > > > A worst case scenario for a producer that is
> >> producing to
> >>     > > 50
> >>     > > > > > unique
> >>     > > > > > > > > > topics
> >>     > > > > > > > > > > and emitting all standard metrics yields
> >>     > > > > > > > > > > a serialized size of around 100KB prior to
> >> compression,
> >>     > > which
> >>     > > > > > > > > compresses
> >>     > > > > > > > > > > down to about 20-30% of that depending
> >>     > > > > > > > > > > on compression type and topic name uniqueness.
> >>     > > > > > > > > > > The numbers for a consumer would be similar.
> >>     > > > > > > > > > >
> >>     > > > > > > > > > > In practice the number of unique topics would be
> >> far
> >>     > less,
> >>     > > > and
> >>     > > > > > the
> >>     > > > > > > > > > > subscription set would typically be for a subset
> >> of
> >>     > > metrics.
> >>     > > > > > > > > > > So we're probably closer to 1kb, or less,
> >> compressed size
> >>     > > per
> >>     > > > > > client
> >>     > > > > > > > > per
> >>     > > > > > > > > > > push interval.
> >>     > > > > > > > > > >
> >>     > > > > > > > > > > As both the subscription set and push intervals
> >> are
> >>     > > > controlled
> >>     > > > > > by the
> >>     > > > > > > > > > > cluster operator it shouldn't be too hard
> >>     > > > > > > > > > > to strike a good balance between metrics
> overhead
> >> and
> >>     > > > > > granularity.
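The size figures quoted above can be turned into a rough estimate of the aggregate telemetry load. The following is only a back-of-the-envelope sketch: the function names are hypothetical, and the client count (10,000) and push interval (60 seconds) are illustrative assumptions that do not come from the KIP or from this thread. Only the 100 KB, 20-30%, and 1 KB figures are taken from the message.

```python
def compressed_push_bytes(uncompressed_bytes, compressed_percent):
    """Size of one push after compression.

    compressed_percent is the compressed size as a percentage of the
    uncompressed size (e.g. 20 means the payload shrinks to 20%).
    """
    return uncompressed_bytes * compressed_percent // 100


def cluster_ingest_bytes_per_sec(num_clients, push_bytes, push_interval_sec):
    """Average telemetry bytes per second arriving across the cluster."""
    return num_clients * push_bytes / push_interval_sec


# Worst case from the message: ~100 KB uncompressed, 20-30% after compression.
worst_low = compressed_push_bytes(100_000, 20)
worst_high = compressed_push_bytes(100_000, 30)

# Typical case from the message: ~1 KB compressed per client per push.
# ASSUMPTION: 10,000 clients and a 60 s push interval are made-up values.
typical_rate = cluster_ingest_bytes_per_sec(10_000, 1_000, 60)

print(f"worst case per push: {worst_low}-{worst_high} bytes")
print(f"typical aggregate rate: {typical_rate:.0f} bytes/s")
```

Since the operator controls both the subscription set and the push interval, the same two functions can be re-run with other values to see how the load scales.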
> >>     > > > > > > > > > >
> >>     > > > > > > > > > >
> >>     > > > > > > > > > >
> >>     > > > > > > > > > > >
> >>     > > > > > > > > > > > I'm really uneasy with this being enabled by
> >> default on
> >>     > > the
> >>     > > > > > client
> >>     > > > > > > > > > > > side. When collecting data, I think the best
> >> practice
> >>     > is
> >>     > > to
> >>     > > > > > ensure
> >>     > > > > > > > > > > > users are explicitly enabling it.
> >>     > > > > > > > > > > >
> >>     > > > > > > > > > >
> >>     > > > > > > > > > > Requiring metrics to be explicitly enabled on
> >> clients
> >>     > > > severely
> >>     > > > > > > > cripples
> >>     > > > > > > > > > its
> >>     > > > > > > > > > > usability and value.
> >>     > > > > > > > > > >
> >>     > > > > > > > > > > One of the problems that this KIP aims to solve
> >> is for
> >>     > > useful
> >>     > > > > > metrics
> >>     > > > > > > > > to
> >>     > > > > > > > > > be
> >>     > > > > > > > > > > available on demand
> >>     > > > > > > > > > > regardless of the technical expertise of the
> >> user. As
> >>     > > Ryanne
> >>     > > > > > points,
> >>     > > > > > > > > out
> >>     > > > > > > > > > a
> >>     > > > > > > > > > > savvy user/organization
> >>     > > > > > > > > > > will typically have metrics collection and
> >> monitoring in
> >>     > > > place
> >>     > > > > > > > already,
> >>     > > > > > > > > > and
> >>     > > > > > > > > > > the benefits of this KIP
> >>     > > > > > > > > > > are then more of a common set and format of metrics
> >> across
> >>     > > > client
> >>     > > > > > > > > > > implementations and languages.
> >>     > > > > > > > > > > But that is not the typical Kafka user in my
> >> experience,
> >>     > > > > they're
> >>     > > > > > not
> >>     > > > > > > > > > Kafka
> >>     > > > > > > > > > > experts and they don't have the
> >>     > > > > > > > > > > knowledge of how to best instrument their
> clients.
> >>     > > > > > > > > > > Having metrics enabled by default for this user
> >> base
> >>     > allows
> >>     > > > the
> >>     > > > > > Kafka
> >>     > > > > > > > > > > operators to proactively and reactively
> >>     > > > > > > > > > > monitor and troubleshoot client issues, without
> >> the need
> >>     > > for
> >>     > > > > the
> >>     > > > > > less
> >>     > > > > > > > > > savvy
> >>     > > > > > > > > > > user to do anything.
> >>     > > > > > > > > > > It is often too late to tell a user to enable
> >> metrics
> >>     > when
> >>     > > > the
> >>     > > > > > > > problem
> >>     > > > > > > > > > has
> >>     > > > > > > > > > > already occurred.
> >>     > > > > > > > > > >
> >>     > > > > > > > > > > Now, to be clear, even though metrics are
> enabled
> >> by
> >>     > > default
> >>     > > > on
> >>     > > > > > > > clients
> >>     > > > > > > > > > they
> >>     > > > > > > > > > > are not enabled by default
> >>     > > > > > > > > > > on the brokers; the Kafka operator needs to
> build
> >> and set
> >>     > > up
> >>     > > > a
> >>     > > > > > > > metrics
> >>     > > > > > > > > > > plugin and add metrics subscriptions
> >>     > > > > > > > > > > before anything is sent from the client.
> >>     > > > > > > > > > > It is opt-out on the clients and opt-in on the
> >> broker.
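The opt-out/opt-in split described above can be summarized as a single condition. This is a minimal sketch of that reasoning only, not the KIP's implementation; the function and parameter names are hypothetical.

```python
def metrics_are_pushed(client_opted_out, broker_plugin_configured,
                       matching_subscriptions):
    """Return True only when every precondition described above holds.

    client_opted_out:         the client has explicitly disabled telemetry
                              (hypothetical flag; enabled by default).
    broker_plugin_configured: the operator has installed a metrics plugin.
    matching_subscriptions:   subscriptions the operator added that match
                              this client (any sequence).
    """
    return (not client_opted_out
            and broker_plugin_configured
            and len(matching_subscriptions) > 0)


# Default client against a default broker: nothing is sent.
default_case = metrics_are_pushed(False, False, [])

# Operator has installed a plugin and added a matching subscription.
enabled_case = metrics_are_pushed(False, True, ["client.request.rtt"])

# The client opted out; broker-side configuration makes no difference.
opted_out_case = metrics_are_pushed(True, True, ["client.request.rtt"])
```

Under this reading, a client running with defaults sends nothing until the operator takes both broker-side steps.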
> >>     > > > > > > > > > >
> >>     > > > > > > > > > >
> >>     > > > > > > > > > >
> >>     > > > > > > > > > >
> >>     > > > > > > > > > > > You mentioned brokers already have
> >>     > > > > > > > > > > > some(most?) of the information contained in
> >> metrics, if
> >>     > > so
> >>     > > > > > then why
> >>     > > > > > > > > > > > are we collecting it again? Surely there must
> >> be some
> >>     > new
> >>     > > > > > > > information
> >>     > > > > > > > > > > > in the client metrics.
> >>     > > > > > > > > > > >
> >>     > > > > > > > > > >
> >>     > > > > > > > > > > From the user's perspective the Kafka
> >> infrastructure
> >>     > > extends
> >>     > > > > from
> >>     > > > > > > > > > > producer.send() to
> >>     > > > > > > > > > > messages being returned from consumer.poll(), a
> >> giant
> >>     > black
> >>     > > > box
> >>     > > > > > where
> >>     > > > > > > > > > > there's a lot going on between those
> >>     > > > > > > > > > > two points. The brokers currently only see what
> >> happens
> >>     > > once
> >>     > > > > > those
> >>     > > > > > > > > > requests
> >>     > > > > > > > > > > and messages hit the broker,
> >>     > > > > > > > > > > but as Kafka clients are complex pieces of
> >> machinery
> >>     > > there's
> >>     > > > a
> >>     > > > > > myriad
> >>     > > > > > > > > of
> >>     > > > > > > > > > > queues, timers, and state
> >>     > > > > > > > > > > that's critical to the operation and
> >> infrastructure
> >>     > that's
> >>     > > > not
> >>     > > > > > > > > currently
> >>     > > > > > > > > > > visible to the operator.
> >>     > > > > > > > > > > Relying on the user to accurately and timely
> >> provide this
> >>     > > > > missing
> >>     > > > > > > > > > > information is not generally feasible.
> >>     > > > > > > > > > >
> >>     > > > > > > > > > >
> >>     > > > > > > > > > > Most of the standard metrics listed in the KIP
> >> are data
> >>     > > > points
> >>     > > > > > that
> >>     > > > > > > > the
> >>     > > > > > > > > > > broker does not have.
> >>     > > > > > > > > > > Only a small number of metrics are duplicates
> >> (like the
> >>     > > > request
> >>     > > > > > > > counts
> >>     > > > > > > > > > and
> >>     > > > > > > > > > > sizes), but they are included
> >>     > > > > > > > > > > to ease correlation when inspecting these client
> >> metrics.
> >>     > > > > > > > > > >
> >>     > > > > > > > > > >
> >>     > > > > > > > > > >
> >>     > > > > > > > > > > > Moreover this is a brand new feature so it's
> >> even
> >>     > harder
> >>     > > to
> >>     > > > > > justify
> >>     > > > > > > > > > > > enabling it and forcing onto all our users. If
> >> disabled
> >>     > > by
> >>     > > > > > default,
> >>     > > > > > > > > > > > it's relatively easy to enable in a new
> release
> >> if we
> >>     > > > decide
> >>     > > > > > to,
> >>     > > > > > > > but
> >>     > > > > > > > > > > > once enabled by default it's much harder to
> >> disable.
> >>     > Also
> >>     > > > > this
> >>     > > > > > > > > feature
> >>     > > > > > > > > > > > will apply to all future metrics we will add.
> >>     > > > > > > > > > > >
> >>     > > > > > > > > > >
> >>     > > > > > > > > > > I think maturity of a feature implementation
> >> should be
> >>     > the
> >>     > > > > > deciding
> >>     > > > > > > > > > factor,
> >>     > > > > > > > > > > rather than
> >>     > > > > > > > > > > the design of it (which this KIP is). I.e., if
> the
> >>     > > > > > implementation is
> >>     > > > > > > > > not
> >>     > > > > > > > > > > deemed mature enough
> >>     > > > > > > > > > > for release X.Y it will be disabled.
> >>     > > > > > > > > > >
> >>     > > > > > > > > > >
> >>     > > > > > > > > > >
> >>     > > > > > > > > > > > Overall I think it's an interesting feature
> but
> >> I'd
> >>     > > prefer
> >>     > > > to
> >>     > > > > > be
> >>     > > > > > > > > > > > slightly defensive and see how it works in
> >> practice
> >>     > > before
> >>     > > > > > enabling
> >>     > > > > > > > > it
> >>     > > > > > > > > > > > everywhere.
> >>     > > > > > > > > > > >
> >>     > > > > > > > > > >
> >>     > > > > > > > > > > Right, and I agree on being defensive, but since
> >> this
> >>     > > feature
> >>     > > > > > still
> >>     > > > > > > > > > > requires manual
> >>     > > > > > > > > > > enabling on the brokers before actually being
> >> used, I
> >>     > think
> >>     > > > > that
> >>     > > > > > > > gives
> >>     > > > > > > > > > > enough control
> >>     > > > > > > > > > > to opt-in or out of this feature as needed.
> >>     > > > > > > > > > >
> >>     > > > > > > > > > > Thanks for your comments!
> >>     > > > > > > > > > >
> >>     > > > > > > > > > > Regards,
> >>     > > > > > > > > > > Magnus
> >>     > > > > > > > > > >
> >>     > > > > > > > > > >
> >>     > > > > > > > > > >
> >>     > > > > > > > > > > > Thanks,
> >>     > > > > > > > > > > > Mickael
> >>     > > > > > > > > > > >
> >>     > > > > > > > > > > > On Mon, Nov 8, 2021 at 7:55 AM Magnus
> Edenhill <
> >>     > > > > > magnus@edenhill.se
> >>     > > > > > > > >
> >>     > > > > > > > > > > wrote:
> >>     > > > > > > > > > > > >
> >>     > > > > > > > > > > > > Thanks David for pointing this out,
> >>     > > > > > > > > > > > > I've updated the KIP to include client_id
> as a
> >>     > matching
> >>     > > > > > selector.
> >>     > > > > > > > > > > > >
> >>     > > > > > > > > > > > > Regards,
> >>     > > > > > > > > > > > > Magnus
> >>     > > > > > > > > > > > >
> >>     > > > > > > > > > > > > On Thu, Nov 4, 2021 at 18:01, David
> Mao
> >>     > > > > > > > > > > <dmao@confluent.io.invalid
> >>     > > > > > > > > > > > > wrote:
> >>     > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > Hey Magnus,
> >>     > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > I noticed that the KIP outlines the
> initial
> >>     > selectors
> >>     > > > > > supported
> >>     > > > > > > > > as:
> >>     > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > >    - client_instance_id -
> >> CLIENT_INSTANCE_ID UUID
> >>     > > > string
> >>     > > > > > > > > > > > representation.
> >>     > > > > > > > > > > > > >    - client_software_name  - client
> software
> >>     > > > > implementation
> >>     > > > > > > > name.
> >>     > > > > > > > > > > > > >    - client_software_version  - client
> >> software
> >>     > > > > > implementation
> >>     > > > > > > > > > > version.
> >>     > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > In the given reactive monitoring workflow,
> >> we
> >>     > mention
> >>     > > > > that
> >>     > > > > > the
> >>     > > > > > > > > > > > application
> >>     > > > > > > > > > > > > > user does not know their client's client
> >> instance
> >>     > ID,
> >>     > > > but
> >>     > > > > > it's
> >>     > > > > > > > > > > outlined
> >>     > > > > > > > > > > > > > that the operator can add a metrics
> >> subscription
> >>     > > > > selecting
> >>     > > > > > for
> >>     > > > > > > > > > > > clientId. I
> >>     > > > > > > > > > > > > > don't see clientId as one of the supported
> >>     > selectors.
> >>     > > > > > > > > > > > > > I can see how this would have made sense
> in
> >> a
> >>     > > previous
> >>     > > > > > > > iteration
> >>     > > > > > > > > > > given
> >>     > > > > > > > > > > > that
> >>     > > > > > > > > > > > > > the previous client instance ID proposal
> >> was to
> >>     > > > construct
> >>     > > > > > the
> >>     > > > > > > > > > client
> >>     > > > > > > > > > > > > > instance ID using clientId as a prefix.
> Now
> >> that
> >>     > the
> >>     > > > > client
> >>     > > > > > > > > > instance
> >>     > > > > > > > > > > > ID is
> >>     > > > > > > > > > > > > > a UUID, would we want to add clientId as a
> >>     > supported
> >>     > > > > > selector?
> >>     > > > > > > > > > > > > > Let me know what you think.
> >>     > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > David
> >>     > > > > > > > > > > > > >
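To illustrate the selector question above, here is a minimal sketch of matching a subscription's selectors against a client's attributes. The selector names come from the message; the exact-equality matching rule, the function name, and all sample values are assumptions made for illustration, and the KIP may define matching differently.

```python
def selector_matches(selector, client_attributes):
    """True if every selector key equals the client's value for that key.

    ASSUMPTION: exact string equality; an empty selector matches any client.
    """
    return all(client_attributes.get(key) == value
               for key, value in selector.items())


# Hypothetical client; the UUID and names are made-up sample values.
client = {
    "client_instance_id": "b69cc35a-7a54-4790-aa69-cc2bd4ee4538",
    "client_id": "billing-service",
    "client_software_name": "apache-kafka-java",
    "client_software_version": "3.1.0",
}

# Without a client_id selector the operator must already know the UUID.
by_uuid = selector_matches(
    {"client_instance_id": "b69cc35a-7a54-4790-aa69-cc2bd4ee4538"}, client)

# With client_id as a selector the operator can use the name the user knows.
by_client_id = selector_matches({"client_id": "billing-service"}, client)

# A selector for a different client does not match.
other_client = selector_matches({"client_id": "orders-service"}, client)
```

The second case is the reactive workflow David describes: the user can report a client id, but is unlikely to know the generated instance UUID.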
> >>     > > > > > > > > > > > > > On Tue, Oct 19, 2021 at 12:33 PM Magnus
> >> Edenhill <
> >>     > > > > > > > > > magnus@edenhill.se
> >>     > > > > > > > > > > >
> >>     > > > > > > > > > > > > > wrote:
> >>     > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > > Hi Mickael!
> >>     > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > > On Tue, Oct 19, 2021 at 19:30,
> >> Mickael
> >>     > Maison
> >>     > > <
> >>     > > > > > > > > > > > > > > mickael.maison@gmail.com
> >>     > > > > > > > > > > > > > > > wrote:
> >>     > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > > > Hi Magnus,
> >>     > > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > > > Thanks for the proposal.
> >>     > > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > > > 1. Looking at the protocol section,
> >> isn't
> >>     > > > > > > > "ClientInstanceId"
> >>     > > > > > > > > > > > expected
> >>     > > > > > > > > > > > > > > > to be a field in
> >>     > > > GetTelemetrySubscriptionsResponseV0?
> >>     > > > > > > > > > Otherwise,
> >>     > > > > > > > > > > > how
> >>     > > > > > > > > > > > > > > > does a client retrieve this value?
> >>     > > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > > Good catch, it got removed by mistake in
> >> one of
> >>     > the
> >>     > > > > > edits.
> >>     > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > > > 2. In the client API section, you
> >> mention a new
> >>     > > > > method
> >>     > > > > > > > > > > > > > > > "clientInstanceId()". Can you clarify
> >> which
> >>     > > > > interfaces
> >>     > > > > > are
> >>     > > > > > > > > > > > affected?
> >>     > > > > > > > > > > > > > > > Is it only Consumer and Producer?
> >>     > > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > > And Admin. Will update the KIP.
> >>     > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > > > 3. I'm a bit concerned this is enabled
> >> by
> >>     > > default.
> >>     > > > > > Even if
> >>     > > > > > > > > the
> >>     > > > > > > > > > > data
> >>     > > > > > > > > > > > > > > > collected is supposed to be not
> >> sensitive, I
> >>     > > think
> >>     > > > > > this can
> >>     > > > > > > > > be
> >>     > > > > > > > > > > > > > > > problematic in some environments. Also
> >> users
> >>     > > don't
> >>     > > > > > seem to
> >>     > > > > > > > > have
> >>     > > > > > > > > > > the
> >>     > > > > > > > > > > > > > > > choice to only expose some metrics.
> >> Knowing how
> >>     > > > much
> >>     > > > > > data
> >>     > > > > > > > > > transits
> >>     > > > > > > > > > > > > > > > through some applications can be
> >> considered
> >>     > > > critical.
> >>     > > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > > The broker already knows how much data
> >> transits
> >>     > > > through
> >>     > > > > > the
> >>     > > > > > > > > > client
> >>     > > > > > > > > > > > > > though,
> >>     > > > > > > > > > > > > > > right?
> >>     > > > > > > > > > > > > > > Care has been taken not to expose
> >> information in
> >>     > > the
> >>     > > > > > standard
> >>     > > > > > > > > > > metrics
> >>     > > > > > > > > > > > > > that
> >>     > > > > > > > > > > > > > > might
> >>     > > > > > > > > > > > > > > reveal sensitive information.
> >>     > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > > Do you have an example of how the
> proposed
> >>     > metrics
> >>     > > > > could
> >>     > > > > > leak
> >>     > > > > > > > > > > > sensitive
> >>     > > > > > > > > > > > > > > information?
> >>     > > > > > > > > > > > > > > As for limiting what metrics to
> >> export; I
> >>     > guess
> >>     > > > > that
> >>     > > > > > > > could
> >>     > > > > > > > > > make
> >>     > > > > > > > > > > > sense
> >>     > > > > > > > > > > > > > > in some
> >>     > > > > > > > > > > > > > > very sensitive use-cases, but those
> users
> >> might
> >>     > > > disable
> >>     > > > > > > > metrics
> >>     > > > > > > > > > > > > > altogether
> >>     > > > > > > > > > > > > > > for now.
> >>     > > > > > > > > > > > > > > Could these concerns be addressed by a
> >> later KIP?
> >>     > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > > > 4. As a user, how do you know if your
> >>     > application
> >>     > > > is
> >>     > > > > > > > actively
> >>     > > > > > > > > > > > sending
> >>     > > > > > > > > > > > > > > > metrics? Are there new metrics
> exposing
> >> what's
> >>     > > > going
> >>     > > > > > on,
> >>     > > > > > > > like
> >>     > > > > > > > > > how
> >>     > > > > > > > > > > > much
> >>     > > > > > > > > > > > > > > > data is being sent?
> >>     > > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > > That's a good question.
> >>     > > > > > > > > > > > > > > Since the proposed metrics interface is
> >> not aimed
> >>     > > at,
> >>     > > > > or
> >>     > > > > > > > > directly
> >>     > > > > > > > > > > > > > available
> >>     > > > > > > > > > > > > > > to, the application
> >>     > > > > > > > > > > > > > > I guess there's little point in adding
> it
> >> here,
> >>     > but
> >>     > > > > > instead
> >>     > > > > > > > > > adding
> >>     > > > > > > > > > > > > > > something to the
> >>     > > > > > > > > > > > > > > existing JMX metrics?
> >>     > > > > > > > > > > > > > > Do you have any suggestions?
> >>     > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > > > 5. If all metrics are enabled on a
> >> regular
> >>     > > Consumer
> >>     > > > > or
> >>     > > > > > > > > > Producer,
> >>     > > > > > > > > > > do
> >>     > > > > > > > > > > > > > > > you have an idea how much throughput
> >> this would
> >>     > > > use?
> >>     > > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > > It depends on the number of
> >> partitions/topics/etc.
> >>     > > the
> >>     > > > > > client
> >>     > > > > > > > is
> >>     > > > > > > > > > > > producing
> >>     > > > > > > > > > > > > > > to/consuming from.
> >>     > > > > > > > > > > > > > > I'll add some sizes to the KIP for some
> >> typical
> >>     > > > > > use-cases.
> >>     > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > > Thanks,
> >>     > > > > > > > > > > > > > > Magnus
> >>     > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > > > Thanks
> >>     > > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > > > On Tue, Oct 19, 2021 at 5:06 PM Magnus
> >>     > Edenhill <
> >>     > > > > > > > > > > > magnus@edenhill.se>
> >>     > > > > > > > > > > > > > > > wrote:
> >>     > > > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > > > > On Tue, Oct 19, 2021 at 13:22,
> >> Tom
> >>     > > Bentley <
> >>     > > > > > > > > > > > tbentley@redhat.com
> >>     > > > > > > > > > > > > > > wrote:
> >>     > > > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > > > > > Hi Magnus,
> >>     > > > > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > > > > > I reviewed the KIP since you
> called
> >> the
> >>     > vote
> >>     > > > > > (sorry for
> >>     > > > > > > > > not
> >>     > > > > > > > > > > > > > reviewing
> >>     > > > > > > > > > > > > > > > when
> >>     > > > > > > > > > > > > > > > > > you announced your intention to
> >> call the
> >>     > > > vote). I
> >>     > > > > > have
> >>     > > > > > > > a
> >>     > > > > > > > > > few
> >>     > > > > > > > > > > > > > > questions
> >>     > > > > > > > > > > > > > > > on
> >>     > > > > > > > > > > > > > > > > > some of the details.
> >>     > > > > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > > > > > 1. There's no Javadoc on
> >>     > > > > > ClientTelemetryPayload.data(),
> >>     > > > > > > > > so
> >>     > > > > > > > > > I
> >>     > > > > > > > > > > > don't
> >>     > > > > > > > > > > > > > > know
> >>     > > > > > > > > > > > > > > > > > whether the payload is exposed
> >> through this
> >>     > > > > method
> >>     > > > > > as
> >>     > > > > > > > > > > > compressed or
> >>     > > > > > > > > > > > > > > > not.
> >>     > > > > > > > > > > > > > > > > > Later on you say "Decompression of
> >> the
> >>     > > payloads
> >>     > > > > > will be
> >>     > > > > > > > > > > > handled by
> >>     > > > > > > > > > > > > > > the
> >>     > > > > > > > > > > > > > > > > > broker metrics plugin, the broker
> >> should
> >>     > > > expose a
> >>     > > > > > > > > suitable
> >>     > > > > > > > > > > > > > > > decompression
> >>     > > > > > > > > > > > > > > > > > API to the metrics plugin for this
> >>     > purpose.",
> >>     > > > > which
> >>     > > > > > > > > > suggests
> >>     > > > > > > > > > > > it's
> >>     > > > > > > > > > > > > > the
> >>     > > > > > > > > > > > > > > > > > compressed data in the buffer, but
> >> then we
> >>     > > > don't
> >>     > > > > > know
> >>     > > > > > > > > which
> >>     > > > > > > > > > > > codec
> >>     > > > > > > > > > > > > > was
> >>     > > > > > > > > > > > > > > > used,
> >>     > > > > > > > > > > > > > > > > > nor the API via which the plugin
> >> should
> >>     > > > > decompress
> >>     > > > > > it
> >>     > > > > > > > if
> >>     > > > > > > > > > > > required
> >>     > > > > > > > > > > > > > for
> >>     > > > > > > > > > > > > > > > > > forwarding to the ultimate metrics
> >> store.
> >>     > > > Should
> >>     > > > > > the
> >>     > > > > > > > > > > > > > > > ClientTelemetryPayload
> >>     > > > > > > > > > > > > > > > > > expose a method to get the
> >> compression and
> >>     > a
> >>     > > > > > > > > decompressor?
> >>     > > > > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > > > > Good point, updated.
> >>     > > > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > > > > > 2. The client-side API is expressed as StringOrError
> >>     > > > > > > > > > > > > > > > > > ClientInstance::ClientInstanceId(int timeout_ms). I understand that
> >>     > > > > > > > > > > > > > > > > > you're thinking about the librdkafka implementation, but it would be
> >>     > > > > > > > > > > > > > > > > > good to show the API as it would appear on the Apache Kafka clients.
> >>     > > > > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > > > > This was meant as pseudo-code, but I changed it to Java.
> >>     > > > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > > > > > 3. "PushTelemetryRequest|Response - protocol request used by the
> >>     > > > > > > > > > > > > > > > > > client to send metrics to any broker it is connected to." To be clear,
> >>     > > > > > > > > > > > > > > > > > this means that the client can choose any of the connected brokers and
> >>     > > > > > > > > > > > > > > > > > push to just one of them? What should a supporting client do if it gets
> >>     > > > > > > > > > > > > > > > > > an error when pushing metrics to a broker, retry sending to the same
> >>     > > > > > > > > > > > > > > > > > broker or try pushing to another broker, or drop the metrics? Should
> >>     > > > > > > > > > > > > > > > > > supporting clients send successive requests to a single broker, or
> >>     > > > > > > > > > > > > > > > > > round robin, or is that up to the client author? I'm guessing the
> >>     > > > > > > > > > > > > > > > > > behaviour should be sticky to support the rate limiting features, but I
> >>     > > > > > > > > > > > > > > > > > think it would be good for client authors if this section were explicit
> >>     > > > > > > > > > > > > > > > > > on the recommended behaviour.
> >>     > > > > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > > > > You are right, I've updated the KIP to make this clearer.
> >>     > > > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > > > > > 4. "Mapping the client instance id to an actual application instance
> >>     > > > > > > > > > > > > > > > > > running on a (virtual) machine can be done by inspecting the metrics
> >>     > > > > > > > > > > > > > > > > > resource labels, such as the client source address and source port, or
> >>     > > > > > > > > > > > > > > > > > security principal, all of which are added by the receiving broker.
> >>     > > > > > > > > > > > > > > > > > This will allow the operator together with the user to identify the
> >>     > > > > > > > > > > > > > > > > > actual application instance." Is this really always true? The source IP
> >>     > > > > > > > > > > > > > > > > > and port might be a loadbalancer/proxy in some setups. The principal,
> >>     > > > > > > > > > > > > > > > > > as already mentioned in the KIP, might be shared between multiple
> >>     > > > > > > > > > > > > > > > > > applications. So at worst the organization running the clients might
> >>     > > > > > > > > > > > > > > > > > have to consult the logs of a set of client applications, right?
> >>     > > > > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > > > > Yes, that's correct. There's no guaranteed mapping from
> >>     > > > > > > > > > > > > > > > > client_instance_id to an actual instance, that's why the KIP recommends
> >>     > > > > > > > > > > > > > > > > client implementations to log the client instance id upon retrieval,
> >>     > > > > > > > > > > > > > > > > and also provide an API for the application to retrieve the instance id
> >>     > > > > > > > > > > > > > > > > programmatically if it has a better way of exposing it.
> >>     > > > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > > > > > 5. "Tests indicate that a compression ratio up to 10x is possible for
> >>     > > > > > > > > > > > > > > > > > the standard metrics." Client authors might appreciate your mentioning
> >>     > > > > > > > > > > > > > > > > > which compression codec got these results.
> >>     > > > > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > > > > Good point. Updated.
> >>     > > > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > > > > > 6. "Should the client send a push request prior to expiry of the
> >>     > > > > > > > > > > > > > > > > > previously calculated PushIntervalMs the broker will discard the metrics
> >>     > > > > > > > > > > > > > > > > > and return a PushTelemetryResponse with the ErrorCode set to
> >>     > > > > > > > > > > > > > > > > > RateLimited." Is this RATE_LIMITED a new error code? It's not mentioned
> >>     > > > > > > > > > > > > > > > > > in the "New Error Codes" section.
> >>     > > > > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > > > > That's a leftover, it should be using the standard ThrottleTime
> >>     > > > > > > > > > > > > > > > > mechanism. Fixed.
> >>     > > > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > > > > > 7. In the section "Standard client resource labels" application_id is
> >>     > > > > > > > > > > > > > > > > > described as Kafka Streams only, but the section of "Client
> >>     > > > > > > > > > > > > > > > > > Identification" talks about "application instance id as an optional
> >>     > > > > > > > > > > > > > > > > > future nice-to-have that may be included as a metrics label if it has
> >>     > > > > > > > > > > > > > > > > > been set by the user", so I'm confused whether non-Kafka Streams
> >>     > > > > > > > > > > > > > > > > > clients should set an application_id or not.
> >>     > > > > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > > > > I'll clarify this in the KIP, but basically we would need to add an
> >>     > > > > > > > > > > > > > > > > `application.id` config property for non-streams clients for this
> >>     > > > > > > > > > > > > > > > > purpose, and that's outside the scope of this KIP since we want to make
> >>     > > > > > > > > > > > > > > > > it zero-conf:ish on the client side.
> >>     > > > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > > > > > Kind regards,
> >>     > > > > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > > > > > Tom
> >>     > > > > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > > > > Thanks for the review,
> >>     > > > > > > > > > > > > > > > > Magnus
> >>     > > > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > > > > > On Thu, Oct 7, 2021 at 5:26 PM
> >> Magnus
> >>     > > Edenhill
> >>     > > > <
> >>     > > > > > > > > > > > magnus@edenhill.se
> >>     > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > > > wrote:
> >>     > > > > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > > > > > > Hi all,
> >>     > > > > > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > > > > > > I've updated the KIP following
> >> our recent
> >>     > > > > > discussions
> >>     > > > > > > > > on
> >>     > > > > > > > > > > the
> >>     > > > > > > > > > > > > > > mailing
> >>     > > > > > > > > > > > > > > > > > list:
> >>     > > > > > > > > > > > > > > > > > >  - split the protocol in two,
> one
> >> for
> >>     > > getting
> >>     > > > > the
> >>     > > > > > > > > metrics
> >>     > > > > > > > > > > > > > > > subscriptions,
> >>     > > > > > > > > > > > > > > > > > > and one for pushing the metrics.
> >>     > > > > > > > > > > > > > > > > > >  - simplifications: initially
> >> only one
> >>     > > > > supported
> >>     > > > > > > > > metrics
> >>     > > > > > > > > > > > format,
> >>     > > > > > > > > > > > > > no
> >>     > > > > > > > > > > > > > > > > > > client.id in the instance id,
> >> etc.
> >>     > > > > > > > > > > > > > > > > > >  - made CLIENT_METRICS
> >> subscription
> >>     > > > > configuration
> >>     > > > > > > > > entries
> >>     > > > > > > > > > > > more
> >>     > > > > > > > > > > > > > > > structured
> >>     > > > > > > > > > > > > > > > > > >    and allowing better client
> >> matching
> >>     > > > > selectors
> >>     > > > > > (not
> >>     > > > > > > > > > only
> >>     > > > > > > > > > > > on the
> >>     > > > > > > > > > > > > > > > > > instance
> >>     > > > > > > > > > > > > > > > > > > id, but also the other
> >>     > > > > > > > > > > > > > > > > > >    client resource labels, such
> as
> >>     > > > > > > > > client_software_name,
> >>     > > > > > > > > > > > etc.).
> >>     > > > > > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > > > > > > Unless there are further
> comments
> >> I'll
> >>     > call
> >>     > > > the
> >>     > > > > > vote
> >>     > > > > > > > > in a
> >>     > > > > > > > > > > > day or
> >>     > > > > > > > > > > > > > > two.
> >>     > > > > > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > > > > > > Regards,
> >>     > > > > > > > > > > > > > > > > > > Magnus
> >>     > > > > > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > > > > > > Den mån 4 okt. 2021 kl 20:57
> >> skrev Magnus
> >>     > > > > > Edenhill <
> >>     > > > > > > > > > > > > > > > magnus@edenhill.se>:
> >>     > > > > > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > > > > > > > Hi Gwen,
> >>     > > > > > > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > > > > > > > I'm finishing up the KIP based
> >> on the
> >>     > > last
> >>     > > > > > couple
> >>     > > > > > > > of
> >>     > > > > > > > > > > > discussion
> >>     > > > > > > > > > > > > > > > points
> >>     > > > > > > > > > > > > > > > > > in
> >>     > > > > > > > > > > > > > > > > > > > this thread
> >>     > > > > > > > > > > > > > > > > > > > and will call the Vote later
> >> this week.
> >>     > > > > > > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > > > > > > > Best,
> >>     > > > > > > > > > > > > > > > > > > > Magnus
> >>     > > > > > > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > > > > > > > Den lör 2 okt. 2021 kl 02:01
> >> skrev Gwen
> >>     > > > > Shapira
> >>     > > > > > > > > > > > > > > > > > > <gwen@confluent.io.invalid
> >>     > > > > > > > > > > > > > > > > > > > >:
> >>     > > > > > > > > > > > > > > > > > > >
> >>     > > > > > > > > > > > > > > > > > > >> Hey,
> >>     > > > > > > > > > > > > > > > > > > >>
> >>     > > > > > > > > > > > > > > > > > > >> I noticed that there was no
> >> discussion
> >>     > > for
> >>     > > > > the
> >>     > > > > > > > last
> >>     > > > > > > > > 10
> >>     > > > > > > > > > > > days,
> >>     > > > > > > > > > > > > > > but I
> >>     > > > > > > > > > > > > > > > > > > >> couldn't
> >>     > > > > > > > > > > > > > > > > > > >> find the vote thread. Is
> there
> >> one
> >>     > that
> >>     > > > I'm
> >>     > > > > > > > missing?
> >>     > > > > > > > > > > > > > > > > > > >>
> >>     > > > > > > > > > > > > > > > > > > >> Gwen
> >>     > > > > > > > > > > > > > > > > > > >>
> >>     > > > > > > > > > > > > > > > > > > >> On Wed, Sep 22, 2021 at 4:58
> >> AM Magnus
> >>     > > > > > Edenhill <
> >>     > > > > > > > > > > > > > > > magnus@edenhill.se>
> >>     > > > > > > > > > > > > > > > > > > >> wrote:
> >>     > > > > > > > > > > > > > > > > > > >>
> >>     > > > > > > > > > > > > > > > > > > >> > Den tis 21 sep. 2021 kl
> >> 06:58 skrev
> >>     > > > Colin
> >>     > > > > > > > McCabe <
> >>     > > > > > > > > > > > > > > > > > cmccabe@apache.org
> >>     > > > > > > > > > > > > > > > > > > >:
> >>     > > > > > > > > > > > > > > > > > > >> >
> >>     > > > > > > > > > > > > > > > > > > >> > > On Mon, Sep 20, 2021, at
> >> 17:35,
> >>     > Feng
> >>     > > > Min
> >>     > > > > > > > wrote:
> >>     > > > > > > > > > > > > > > > > > > >> > > > Thanks Magnus & Colin
> >> for the
> >>     > > > > > discussion.
> >>     > > > > > > > > > > > > > > > > > > >> > > >
> >>     > > > > > > > > > > > > > > > > > > >> > > > Based on KIP-714's
> >> stateless
> >>     > > design,
> >>     > > > > > Client
> >>     > > > > > > > > can
> >>     > > > > > > > > > > > pretty
> >>     > > > > > > > > > > > > > > much
> >>     > > > > > > > > > > > > > > > use
> >>     > > > > > > > > > > > > > > > > > > any
> >>     > > > > > > > > > > > > > > > > > > >> > > > connection to any
> broker
> >> to send
> >>     > > > > > metrics. We
> >>     > > > > > > > > are
> >>     > > > > > > > > > > not
> >>     > > > > > > > > > > > > > > > associating
> >>     > > > > > > > > > > > > > > > > > > >> > > connection
> >>     > > > > > > > > > > > > > > > > > > >> > > > with client metric
> >> state. Is my
> >>     > > > > > > > understanding
> >>     > > > > > > > > > > > correct?
> >>     > > > > > > > > > > > > > If
> >>     > > > > > > > > > > > > > > > yes,
> >>     > > > > > > > > > > > > > > > > > > how
> >>     > > > > > > > > > > > > > > > > > > >> > about
> >>     > > > > > > > > > > > > > > > > > > >> > > > the following two
> >> scenarios
> >>     > > > > > > > > > > > > > > > > > > >> > > >
> >>     > > > > > > > > > > > > > > > > > > >> > > > 1) One Client
> (Client-ID)
> >>     > > registers
> >>     > > > > two
> >>     > > > > > > > > > different
> >>     > > > > > > > > > > > client
> >>     > > > > > > > > > > > > > > > > > instance
> >>     > > > > > > > > > > > > > > > > > > id
> >>     > > > > > > > > > > > > > > > > > > >> > via
> >>     > > > > > > > > > > > > > > > > > > >> > > > separate registration.
> >> Is it
> >>     > > > > permitted?
> >>     > > > > > If
> >>     > > > > > > > OK,
> >>     > > > > > > > > > how
> >>     > > > > > > > > > > > to
> >>     > > > > > > > > > > > > > > > > > distinguish
> >>     > > > > > > > > > > > > > > > > > > >> them
> >>     > > > > > > > > > > > > > > > > > > >> > > from
> >>     > > > > > > > > > > > > > > > > > > >> > > > the case 2 below.
> >>     > > > > > > > > > > > > > > > > > > >> > > >
> >>     > > > > > > > > > > > > > > > > > > >> > >
> >>     > > > > > > > > > > > > > > > > > > >> > > Hi Feng,
> >>     > > > > > > > > > > > > > > > > > > >> > >
> >>     > > > > > > > > > > > > > > > > > > >> > > My understanding, which
> >> Magnus can
> >>     > > > > > clarify I
> >>     > > > > > > > > > guess,
> >>     > > > > > > > > > > is
> >>     > > > > > > > > > > > > > that
> >>     > > > > > > > > > > > > > > > you
> >>     > > > > > > > > > > > > > > > > > > could
> >>     > > > > > > > > > > > > > > > > > > >> > have
> >>     > > > > > > > > > > > > > > > > > > >> > > something like two
> Producer
> >>     > > instances
> >>     > > > > > running
> >>     > > > > > > > > with
> >>     > > > > > > > > > > the
> >>     > > > > > > > > > > > > > same
> >>     > > > > > > > > > > > > > > > > > > client.id
> >>     > > > > > > > > > > > > > > > > > > >> > > (perhaps because they're
> >> using the
> >>     > > > same
> >>     > > > > > config
> >>     > > > > > > > > > file,
> >>     > > > > > > > > > > > for
> >>     > > > > > > > > > > > > > > > example).
> >>     > > > > > > > > > > > > > > > > > > >> They
> >>     > > > > > > > > > > > > > > > > > > >> > > could even be in the same
> >> process.
> >>     > > But
> >>     > > > > > they
> >>     > > > > > > > > would
> >>     > > > > > > > > > > get
> >>     > > > > > > > > > > > > > > separate
> >>     > > > > > > > > > > > > > > > > > > UUIDs.
> >>     > > > > > > > > > > > > > > > > > > >> > >
> >>     > > > > > > > > > > > > > > > > > > >> > > I believe Magnus used the
> >> term
> >>     > > client
> >>     > > > to
> >>     > > > > > mean
> >>     > > > > > > > > > > > "Producer or
> >>     > > > > > > > > > > > > > > > > > > Consumer".
> >>     > > > > > > > > > > > > > > > > > > >> So
> >>     > > > > > > > > > > > > > > > > > > >> > > if you have both a
> >> Producer and a
> >>     > > > > > Consumer in
> >>     > > > > > > > > your
> >>     > > > > > > > > > > > > > > > application I
> >>     > > > > > > > > > > > > > > > > > > would
> >>     > > > > > > > > > > > > > > > > > > >> > > expect you'd get separate
> >> UUIDs
> >>     > for
> >>     > > > > both.
> >>     > > > > > > > Again
> >>     > > > > > > > > > > > Magnus can
> >>     > > > > > > > > > > > > > > > chime
> >>     > > > > > > > > > > > > > > > > > in
> >>     > > > > > > > > > > > > > > > > > > >> > here, I
> >>     > > > > > > > > > > > > > > > > > > >> > > guess.
> >>     > > > > > > > > > > > > > > > > > > >> > >
> >>     > > > > > > > > > > > > > > > > > > >> >
> >>     > > > > > > > > > > > > > > > > > > >> > That's correct.
> >>     > > > > > > > > > > > > > > > > > > >> >
> >>     > > > > > > > > > > > > > > > > > > >> >
> >>     > > > > > > > > > > > > > > > > > > >> > >
> >>     > > > > > > > > > > > > > > > > > > >> > > > 2) How about the client
> >>     > > restarting?
> >>     > > > > > What's
> >>     > > > > > > > the
> >>     > > > > > > > > > > > > > > expectation?
> >>     > > > > > > > > > > > > > > > > > Should
> >>     > > > > > > > > > > > > > > > > > > >> the
> >>     > > > > > > > > > > > > > > > > > > >> > > > server expect the
> client
> >> to
> >>     > carry
> >>     > > a
> >>     > > > > > > > persisted
> >>     > > > > > > > > > > client
> >>     > > > > > > > > > > > > > > > instance id
> >>     > > > > > > > > > > > > > > > > > > or
> >>     > > > > > > > > > > > > > > > > > > >> > > should
> >>     > > > > > > > > > > > > > > > > > > >> > > > the client be treated
> as
> >> a new
> >>     > > > > instance?
> >>     > > > > > > > > > > > > > > > > > > >> > >
> >>     > > > > > > > > > > > > > > > > > > >> > > The KIP doesn't describe
> >> any
> >>     > > mechanism
> >>     > > > > for
> >>     > > > > > > > > > > > persistence,
> >>     > > > > > > > > > > > > > so I
> >>     > > > > > > > > > > > > > > > would
> >>     > > > > > > > > > > > > > > > > > > >> assume
> >>     > > > > > > > > > > > > > > > > > > >> > > that when you restart the
> >> client
> >>     > you
> >>     > > > get
> >>     > > > > > a new
> >>     > > > > > > > > > > UUID. I
> >>     > > > > > > > > > > > > > agree
> >>     > > > > > > > > > > > > > > > that
> >>     > > > > > > > > > > > > > > > > > it
> >>     > > > > > > > > > > > > > > > > > > >> > would
> >>     > > > > > > > > > > > > > > > > > > >> > > be good to spell this
> out.
> >>     > > > > > > > > > > > > > > > > > > >> > >
> >>     > > > > > > > > > > > > > > > > > > >> > >
> >>     > > > > > > > > > > > > > > > > > > >> > Right, it will not be
> >> persisted
> >>     > since
> >>     > > a
> >>     > > > > > client
> >>     > > > > > > > > > > instance
> >>     > > > > > > > > > > > > > can't
> >>     > > > > > > > > > > > > > > be
> >>     > > > > > > > > > > > > > > > > > > >> restarted.
> >>     > > > > > > > > > > > > > > > > > > >> >
> >>     > > > > > > > > > > > > > > > > > > >> > Will update the KIP to make
> >> this
> >>     > > > clearer.
> >>     > > > > > > > > > > > > > > > > > > >> >
> >>     > > > > > > > > > > > > > > > > > > >> > /Magnus
> >>     > > > > > > > > > > > > > > > > > > >> >
> >>     > > > > > > > > > > > > > > > > > > >>
> >>     > > > > > > > > > > > > > > > > > > >>
> >>     > > > > > > > > > > > > > > > > > > >> --
> >>     > > > > > > > > > > > > > > > > > > >> Gwen Shapira
> >>     > > > > > > > > > > > > > > > > > > >> Engineering Manager |
> Confluent
> >>     > > > > > > > > > > > > > > > > > > >> 650.450.2760 | @gwenshap
> >>     > > > > > > > > > > > > > > > > > > >> Follow us: Twitter | blog

Re: [DISCUSS] KIP-714: Client metrics and observability

Posted by Jun Rao <ju...@confluent.io.INVALID>.
Hi, Kirk, Sarat,

A few more comments.

40. GetTelemetrySubscriptionsResponseV0 : RequestedMetrics Array[string]
uses "Array[0] empty string" to represent all metrics subscribed. We had a
similar issue with the topics field in MetadataRequest and used the
following convention.
In version 1 and higher, an empty array indicates "request metadata for no
topics," and a null array is used to indicate "request metadata for all
topics."
Should we use the same convention in GetTelemetrySubscriptionsResponseV0?
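
As a rough illustration of the convention in item 40, the sketch below shows how a receiver might interpret a nullable RequestedMetrics array: null means all metrics, an empty array means none. The class, enum, and metric names are hypothetical, and treating each entry as a name prefix is an assumption rather than something stated in this message.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class RequestedMetricsConvention {
    // Hypothetical classification of a RequestedMetrics array.
    enum Scope { ALL, NONE, SELECTED }

    // null -> all metrics, empty -> no metrics, otherwise only the listed entries.
    static Scope scopeOf(List<String> requestedMetrics) {
        if (requestedMetrics == null) return Scope.ALL;
        if (requestedMetrics.isEmpty()) return Scope.NONE;
        return Scope.SELECTED;
    }

    // Assumption: each entry is matched as a prefix of the metric name.
    static boolean isSubscribed(List<String> requestedMetrics, String metricName) {
        switch (scopeOf(requestedMetrics)) {
            case ALL:
                return true;
            case NONE:
                return false;
            default:
                for (String prefix : requestedMetrics) {
                    if (metricName.startsWith(prefix)) return true;
                }
                return false;
        }
    }

    public static void main(String[] args) {
        String metric = "example.producer.record.count"; // hypothetical metric name
        System.out.println(isSubscribed(null, metric));
        System.out.println(isSubscribed(Collections.emptyList(), metric));
        System.out.println(isSubscribed(Arrays.asList("example.producer."), metric));
        System.out.println(isSubscribed(Arrays.asList("example.consumer."), metric));
    }
}
```

With this shape, "subscribe to everything" and "subscribe to nothing" are distinguishable without reserving a magic empty-string entry.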

41. We include CompressionType in PushTelemetryRequestV0, but not in
ClientTelemetryPayload. How would the implementer know the compression type
for the telemetry payload?
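
To make the concern in item 41 concrete, here is a minimal sketch of a payload that carries its compression type next to the bytes, so a plugin can decide how to decompress before forwarding. The Payload and Compression types are hypothetical and are not the KIP's ClientTelemetryPayload interface; only gzip is handled because it is available in the JDK.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class PayloadCompressionSketch {
    // Hypothetical codec enum; a real one would also cover zstd, lz4 and snappy.
    enum Compression { NONE, GZIP }

    // Hypothetical payload that exposes its codec alongside the raw bytes.
    static final class Payload {
        final Compression compression;
        final byte[] data;

        Payload(Compression compression, byte[] data) {
            this.compression = compression;
            this.data = data;
        }
    }

    // Returns the uncompressed metrics bytes, using the codec the payload declares.
    static byte[] decompress(Payload payload) {
        if (payload.compression == Compression.NONE) return payload.data;
        try (InputStream in = new GZIPInputStream(new ByteArrayInputStream(payload.data))) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) != -1) out.write(buf, 0, n);
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Helper used only to build a compressed example payload.
    static byte[] gzip(byte[] raw) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] raw = "example-metrics-bytes".getBytes(StandardCharsets.UTF_8);
        Payload payload = new Payload(Compression.GZIP, gzip(raw));
        System.out.println(new String(decompress(payload), StandardCharsets.UTF_8));
    }
}
```

Without the codec on the payload, the plugin would have to guess or sniff the format, which is the gap the question points at.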

42. For blocking the metrics for certain clients in the following example,
could you describe the corresponding config value used through the
kafka-config command?
# A descriptive name makes it easier to clean up old subscriptions.
# The --match selector targets this specific client instance.
kafka-client-metrics.sh --bootstrap-server $BROKERS \
   --add \
   --name 'Disable_b69cc35a' \
   --match client_instance_id=b69cc35a-7a54-4790-aa69-cc2bd4ee4538 \
   --block

Thanks,

Jun


On Thu, Mar 10, 2022 at 11:57 AM Jun Rao <ju...@confluent.io> wrote:

> Hi, Kirk, Sarat,
>
> Thanks for the reply.
>
> 28. On the broker, we typically use Yammer metrics. We use Kafka metrics
> only for metrics that depend on Kafka metric features (e.g., quota).
> Yammer metrics have 4 types: gauge, meter, histogram and timer. A meter
> calculates a rate, but also exposes an accumulated value.
>
> 29. The Histogram class in org.apache.kafka.common.metrics.stats was never
> used in the client metrics. The implementation of Histogram only provides a
> fixed number of values in the domain and may not capture the quantiles very
> accurately. So, we punted on using it.
>
> Thanks,
>
> Jun
>
>
>
> On Thu, Mar 10, 2022 at 10:59 AM Sarat Kakarla
> <sk...@confluent.io.invalid> wrote:
>
>> Jun,
>>
>>   >>  28. For the broker metrics, could you spell out the full metric name
>>   >>   including groups, tags, etc? We typically don't add the broker_id
>>   >>   label for broker metrics. Also, brokers use Yammer metrics, which
>>   >>   doesn't have type Sum.
>>
>> Sure, I will update KIP-714 with the above information and remove the
>> broker-id label from the metrics.
>>
>> Regarding the type: is CumulativeSum the right type to use in place of
>> Sum?
>>
>> Thanks
>> Sarat
>>
>>
>> On 3/8/22, 5:48 PM, "Jun Rao" <ju...@confluent.io.INVALID> wrote:
>>
>>     Hi, Magnus, Sarat and Xavier,
>>
>>     Thanks for the reply. A few more comments below.
>>
>>     20. It seems that we are piggybacking the plugin on the
>>     existing MetricsReporter. So, this seems fine.
>>
>>     21. That could work. Are we requiring any additional jar dependency on
>>     the client? Or, are you suggesting that we check the runtime dependency
>>     to pick the compression codec?
>>
>>     28. For the broker metrics, could you spell out the full metric name
>>     including groups, tags, etc? We typically don't add the broker_id label
>>     for broker metrics. Also, brokers use Yammer metrics, which doesn't
>>     have type Sum.
>>
>>     29. There are several client metrics listed as histogram. However, the
>>     java client currently doesn't support histogram type.
>>
>>     30. Could you show an example of the metric payload in
>>     PushTelemetryRequest to help understand how we organize metrics at
>>     different levels (per instance, per topic, per partition, per broker,
>>     etc)?
>>
>>     31. Could you add a bit more detail on which client thread sends the
>>     PushTelemetryRequest?
>>
>>     Thanks,
>>
>>     Jun
>>
>>     On Mon, Mar 7, 2022 at 11:48 AM Magnus Edenhill <ma...@edenhill.se>
>> wrote:
>>
>>     > Hi Jun,
>>     >
>>     > thanks for your well-informed questions, see my answers below.
>>     > There have been a number of clarifications to the KIP.
>>     >
>>     >
>>     >
>>     > Den tors 27 jan. 2022 kl 20:08 skrev Jun Rao
>> <ju...@confluent.io.invalid>:
>>     >
>>     > > Hi, Magnus,
>>     > >
>>     > > Thanks for updating the KIP. The overall approach makes sense to me.
>>     > > A few more detailed comments below.
>>     > >
>>     > > 20. ClientTelemetry: Should it be extending configurable and closable?
>>     > >
>>     >
>>     > I'll pass this question to Sarat and/or Xavier.
>>     >
>>     >
>>     >
>>     > > 21. Compression of the metrics on the client: what's the default?
>>     > >
>>     >
>>     > How about we specify a prioritized list: zstd, lz4, snappy, gzip?
>>     > But ultimately it is up to what the client supports.
>>     >
>>     >
>>     > > 23. A client instance is considered a metric resource and the
>>     > > resource-level (thus client instance level) labels could include:
>>     > >     client_software_name=confluent-kafka-python
>>     > >     client_software_version=v2.1.3
>>     > >     client_instance_id=B64CD139-3975-440A-91D4
>>     > >     transactional_id=someTxnApp
>>     > > Are those labels added in PushTelemetryRequest? If so, are they per
>>     > > metric or per request?
>>     > >
>>     >
>>     >
>>     > client_software* and client_instance_id are not added by the
>>     > client, but are available to the broker-side metrics plugin to add
>>     > as it sees fit; I've removed them from the KIP.
>>     >
>>     > transactional_id, group_id, etc., which I believe will be useful in
>>     > troubleshooting, are included only once (per push) as resource-level
>>     > attributes (the client instance is a singular resource).
>>     >
>>     >
>>     > >
>>     > > 24.  "the broker will only send
>>     > > GetTelemetrySubscriptionsResponse.DeltaTemporality=True" :
>>     > > 24.1 If it's always true, does it need to be part of the protocol?
>>     > >
>>     >
>>     > We're anticipating that it will take a lot longer to upgrade the
>>     > majority of clients than the broker/plugin side, which is why we want
>>     > the client to support both temporalities out-of-the-box so that
>>     > cumulative reporting can be turned on seamlessly in the future.
>>     >
>>     >
>>     >
>>     > > 24.2 Does delta only apply to Counter type?
>>     > >
>>     >
>>     >
>>     > And Histograms. More details in Xavier's OTLP link.
>>     >
>>     >
>>     >
>>     > > 24.3 In the delta representation, the first request needs to send
>>     > > the full value, how does the broker plugin know whether a value is
>>     > > full or delta?
>>     > >
>>     >
>>     > The client may (should) send the start time for each metric sample,
>>     > indicating when
>>     > the metric began to be collected.
>>     > We've discussed whether this should be the client instance start
>> time or
>>     > the time when a matching
>>     > metric subscription for that metric is received.
>>     > For completeness we recommend using the former, the client instance
>> start
>>     > time.
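
Editorial sketch for 24.3 (not part of the KIP): one way a receiver could fold delta samples into a running total, assuming, as the reply above recommends, that every sample carries the time at which collection of the series began. The names `Sample` and `DeltaAccumulator` are hypothetical, and the real OTLP data model is considerably richer than this.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Sample:
    """One delta sample for a single metric series (hypothetical shape)."""
    start_ms: int   # assumed: when collection began, e.g. client instance start
    end_ms: int     # when this sample was taken
    value: float    # increase since the previous sample


class DeltaAccumulator:
    """Folds delta samples into a cumulative total for one series."""

    def __init__(self) -> None:
        self.series_start_ms: Optional[int] = None
        self.total = 0.0

    def add(self, sample: Sample) -> float:
        # A different start time means the series was restarted (for
        # example, the client instance restarted), so the running total
        # is reset instead of being carried over.
        if sample.start_ms != self.series_start_ms:
            self.series_start_ms = sample.start_ms
            self.total = 0.0
        self.total += sample.value
        return self.total
```

With a constant start time the receiver never needs to know whether a given value is "full" or "delta": the first sample of a series simply starts the total from zero.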
>>     >
>>     >
>>     >
>>     > > 25. quota:
>>     > > 25.1 Since we are fitting PushTelemetryRequest into the existing
>> request
>>     > > quota, it would be useful to document the impact, i.e. client
>> metric
>>     > > throttling causes the data from the same client to be delayed.
>>     > > 25.2 Is PushTelemetryRequest subject to the write bandwidth quota
>> like
>>     > the
>>     > > producer?
>>     > >
>>     >
>>     >
>>     > Yes, it should be, so as to protect the cluster from rogue clients.
>>     > But, in practice the size of metrics will be quite low (e.g.,
>> 1-10kb per
>>     > 60s interval), so I don't think this will pose a problem.
>>     > The KIP has been updated with more details on quota/throttling
>> behaviour,
>>     > see the
>>     > "Throttling and rate-limiting" section.
>>     >
>>     >
>>     > 25.3 THROTTLING_QUOTA_EXCEEDED: Currently, we don't send this error
>> when
>>     > > the request/bandwidth quota is exceeded since those requests are
>> not
>>     > > rejected. We only set this error when the request is rejected
>> (e.g.,
>>     > topic
>>     > > creation). It would be useful to clarify when this error is used.
>>     > >
>>     >
>>     > Right, I was trying to reuse an existing error-code. We can
>> introduce
>>     > a new one for the case where a client pushes metrics at a higher
>> frequency
>>     > than the configured push interval (e.g., out-of-profile sends).
>>     > This causes the broker to drop those metrics and send this error
>> code back
>>     > to the client. There will be no connection throttling /
>> channel-muting in
>>     > this
>>     > case (unless the standard quotas are exceeded).
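
Editorial sketch for 25.3 (not the broker's implementation): a minimal check that drops pushes arriving sooner than the configured push interval, which is the "out-of-profile" case described above. `PushIntervalGate` and its method names are hypothetical, and the error code the broker would return is left out because the thread had not settled on one.

```python
from typing import Dict


class PushIntervalGate:
    """Accepts at most one push per client instance per push interval."""

    def __init__(self, push_interval_ms: int) -> None:
        self.push_interval_ms = push_interval_ms
        # Last accepted push time per client instance id.
        self.last_accepted_ms: Dict[str, int] = {}

    def accept(self, client_instance_id: str, now_ms: int) -> bool:
        last = self.last_accepted_ms.get(client_instance_id)
        if last is not None and now_ms - last < self.push_interval_ms:
            # Out of profile: the caller would drop the metrics and
            # return an error, without muting the connection.
            return False
        self.last_accepted_ms[client_instance_id] = now_ms
        return True
```

A rejected push does not move the client's window, so a client that keeps pushing too often is still accepted once per interval.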
>>     >
>>     >
>>     > > 27. kafka-client-metrics.sh: Could we add an example on how to
>> disable a
>>     > > bad client?
>>     > >
>>     >
>>     > There's now a --block option to kafka-client-metrics.sh which
>> overrides all
>>     > subscriptions
>>     > for the matched client(s). This allows silencing metrics for one or
>> more
>>     > clients without having
>>     > to remove existing subscriptions. From the client's perspective it
>> will
>>     > look like it no longer has
>>     > any subscriptions.
>>     >
>>     > # Block metrics collection for a specific client instance.
>>     > # A descriptive name makes it easier to clean up old subscriptions;
>>     > # --match selects this specific client instance.
>>     > $ kafka-client-metrics.sh --bootstrap-server $BROKERS \
>>     >    --add \
>>     >    --name 'Disable_b69cc35a' \
>>     >    --match client_instance_id=b69cc35a-7a54-4790-aa69-cc2bd4ee4538 \
>>     >    --block
>>     >
>>     >
>>     >
>>     >
>>     > > 28. New broker side metrics: Could we spell out the details of the
>>     > metrics
>>     > > (e.g., group, tags, etc)?
>>     > >
>>     >
>>     > KIP has been updated accordingly (thanks Sarat).
>>     >
>>     >
>>     >
>>     > >
>>     > > 29. Client instance-level metrics: client.io.wait.time is a gauge
>> not a
>>     > > histogram.
>>     > >
>>     >
>>     > I believe a population/distribution should preferably be
>> represented as a
>>     > histogram, space permitting,
>>     > and only secondarily as a Gauge average.
>>     > While we might not want to maintain a bunch of histograms for each
>>     > partition, since that could be
>>     > quite space consuming, this client.io.wait.time is a single metric
>> per
>>     > client instance and can
>>     > thus afford a Histogram representation.
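
Editorial sketch for 29 (an illustration, not the client's implementation): a fixed-bucket histogram next to the plain average a gauge would report, to show what the average hides. `BucketHistogram` and the bucket bounds in the usage note are hypothetical.

```python
import bisect
from typing import List, Sequence


class BucketHistogram:
    """Counts observations per bucket; bucket i holds values <= upper_bounds[i]."""

    def __init__(self, upper_bounds: Sequence[float]) -> None:
        self.upper_bounds: List[float] = sorted(upper_bounds)
        # One extra bucket at the end for values above the largest bound.
        self.counts: List[int] = [0] * (len(self.upper_bounds) + 1)
        self.total = 0.0
        self.n = 0

    def record(self, value: float) -> None:
        # bisect_left returns the first bucket whose upper bound is >= value.
        self.counts[bisect.bisect_left(self.upper_bounds, value)] += 1
        self.total += value
        self.n += 1

    def mean(self) -> float:
        """The single number a gauge average would expose."""
        return self.total / self.n if self.n else 0.0
```

For example, with bounds of 1, 10 and 100 ms, three waits of 0.5 ms and one of 500 ms land in the first and the overflow bucket, while the mean alone suggests that every wait was slow.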
>>     >
>>     >
>>     >
>>     > Thanks,
>>     > Magnus
>>     >
>>     >
>>     >
>>     > > Thanks,
>>     > >
>>     > > Jun
>>     > >
>>     > > On Wed, Jan 26, 2022 at 6:32 AM Magnus Edenhill <
>> magnus@edenhill.se>
>>     > > wrote:
>>     > >
>>     > > > Hi all,
>>     > > >
>>     > > > I've updated the KIP with responses to the latest comments:
>> Java client
>>     > > > dependencies (Thanks Kirk!), alternate designs (separate
>> cluster,
>>     > > separate
>>     > > > producer, etc), etc.
>>     > > >
>>     > > > I will revive the vote thread.
>>     > > >
>>     > > > Thanks,
>>     > > > Magnus
>>     > > >
>>     > > >
>>     > > > On Mon, Dec 13, 2021 at 22:32, Ryanne Dolan
>>     > > > <ryannedolan@gmail.com> wrote:
>>     > > >
>>     > > > > I think we should be very careful about introducing new
>> runtime
>>     > > > > dependencies into the clients. Historically this has been
>> rare and
>>     > > > > essentially necessary (e.g. compression libs).
>>     > > > >
>>     > > > > Ryanne
>>     > > > >
>>     > > > > On Mon, Dec 13, 2021, 1:06 PM Kirk True <
>> kirk@mustardgrain.com>
>>     > wrote:
>>     > > > >
>>     > > > > > Hi Jun,
>>     > > > > >
>>     > > > > > On Thu, Dec 9, 2021, at 2:28 PM, Jun Rao wrote:
>>     > > > > > > 13. Using OpenTelemetry. Does that require runtime
>> dependency
>>     > > > > > > on OpenTelemetry library? How good is the compatibility
>> story
>>     > > > > > > of OpenTelemetry? This is important since an application
>> could
>>     > have
>>     > > > > other
>>     > > > > > > OpenTelemetry dependencies than the Kafka client.
>>     > > > > >
>>     > > > > > The current design is that the OpenTelemetry JARs would
>> ship with
>>     > the
>>     > > > > > client. Perhaps we can design the client such that the JARs
>> aren't
>>     > > even
>>     > > > > > loaded if the user has opted out. The user could even
>> exclude the
>>     > > JARs
>>     > > > > from
>>     > > > > > their dependencies if they so wished.
>>     > > > > >
>>     > > > > > I can't speak to the compatibility of the libraries. Is it
>> possible
>>     > > > that
>>     > > > > > we include a shaded version?
>>     > > > > >
>>     > > > > > Thanks,
>>     > > > > > Kirk
>>     > > > > >
>>     > > > > > >
>>     > > > > > > 14. The proposal listed idempotence=true. This is more of
>> a
>>     > > > > configuration
>>     > > > > > > than a metric. Are we including that as a metric? What
>> other
>>     > > > > > configurations
>>     > > > > > > are we including? Should we separate the configurations
>> from the
>>     > > > > metrics?
>>     > > > > > >
>>     > > > > > > Thanks,
>>     > > > > > >
>>     > > > > > > Jun
>>     > > > > > >
>>     > > > > > > On Mon, Nov 29, 2021 at 7:34 AM Magnus Edenhill <
>>     > > magnus@edenhill.se>
>>     > > > > > wrote:
>>     > > > > > >
>>     > > > > > > > Hey Bob,
>>     > > > > > > >
>>     > > > > > > > That's a good point.
>>     > > > > > > >
>>     > > > > > > > Request type labels were considered but since they're
>> already
>>     > > > tracked
>>     > > > > > by
>>     > > > > > > > broker-side metrics
>>     > > > > > > > they were left out so as to avoid metric duplication,
>> however
>>     > those
>>     > > > > > metrics
>>     > > > > > > > are not per connection,
>>     > > > > > > > so they won't be that useful in practice for
>> troubleshooting
>>     > > > specific
>>     > > > > > > > client instances.
>>     > > > > > > >
>>     > > > > > > > I'll add the request_type label to the relevant metrics.
>>     > > > > > > >
>>     > > > > > > > Thanks,
>>     > > > > > > > Magnus
>>     > > > > > > >
>>     > > > > > > >
>>     > > > > > > > On Tue, Nov 23, 2021 at 19:20, Bob Barrett
>>     > > > > > > > <bo...@confluent.io.invalid> wrote:
>>     > > > > > > >
>>     > > > > > > > > Hi Magnus,
>>     > > > > > > > >
>>     > > > > > > > > Thanks for the thorough KIP, this seems very useful.
>>     > > > > > > > >
>>     > > > > > > > > Would it make sense to include the request type as a
>> label
>>     > for
>>     > > > the
>>     > > > > > > > > `client.request.success`, `client.request.errors` and
>>     > > > > > > > `client.request.rtt`
>>     > > > > > > > > metrics? I think it would be very useful to see which
>>     > specific
>>     > > > > > requests
>>     > > > > > > > are
>>     > > > > > > > > succeeding and failing for a client. One specific
>> case I can
>>     > > > think
>>     > > > > of
>>     > > > > > > > where
>>     > > > > > > > > this could be useful is producer batch timeouts. If a
>> Java
>>     > > > > > application
>>     > > > > > > > does
>>     > > > > > > > > not enable producer client logs (unfortunately, in my
>>     > > experience
>>     > > > > this
>>     > > > > > > > > happens more often than it should), the application
>> logs will
>>     > > > only
>>     > > > > > > > contain
>>     > > > > > > > > the expiration error message, but no information
>> about what
>>     > is
>>     > > > > > causing
>>     > > > > > > > the
>>     > > > > > > > > timeout. The requests might all be succeeding but
>> taking too
>>     > > long
>>     > > > > to
>>     > > > > > > > > process batches, or metadata requests might be
>> failing, or
>>     > some
>>     > > > or
>>     > > > > > all
>>     > > > > > > > > produce requests might be failing (if the bootstrap
>> servers
>>     > are
>>     > > > > > reachable
>>     > > > > > > > > from the client but one or more other brokers are
>> not, for
>>     > > > > example).
>>     > > > > > If
>>     > > > > > > > the
>>     > > > > > > > > cluster operator is able to identify the specific
>> requests
>>     > that
>>     > > > are
>>     > > > > > slow
>>     > > > > > > > or
>>     > > > > > > > > failing for a client, they will be better able to
>> diagnose
>>     > the
>>     > > > > issue
>>     > > > > > > > > causing batch timeouts.
>>     > > > > > > > >
>>     > > > > > > > > One drawback I can think of is that this will
>> increase the
>>     > > > > > cardinality of
>>     > > > > > > > > the request metrics. But any given client is only
>> going to
>>     > use
>>     > > a
>>     > > > > > small
>>     > > > > > > > > subset of the request types, and since we already have
>>     > > partition
>>     > > > > > labels
>>     > > > > > > > for
>>     > > > > > > > > the topic-level metrics, I think request labels will
>> still
>>     > make
>>     > > > up
>>     > > > > a
>>     > > > > > > > > relatively small percentage of the set of metrics.
>>     > > > > > > > >
>>     > > > > > > > > Thanks,
>>     > > > > > > > > Bob
>>     > > > > > > > >
>>     > > > > > > > > On Mon, Nov 22, 2021 at 2:08 AM Viktor Somogyi-Vass <
>>     > > > > > > > > viktorsomogyi@gmail.com>
>>     > > > > > > > > wrote:
>>     > > > > > > > >
>>     > > > > > > > > > Hi Magnus,
>>     > > > > > > > > >
>>     > > > > > > > > > I think this is a very useful addition. We also
>> have a
>>     > > similar
>>     > > > > (but
>>     > > > > > > > much
>>     > > > > > > > > > more simplistic) implementation of this. Maybe I
>> missed it
>>     > in
>>     > > > the
>>     > > > > > KIP
>>     > > > > > > > but
>>     > > > > > > > > > what about adding metrics about the subscription
>> cache
>>     > > itself?
>>     > > > > > That I
>>     > > > > > > > > think
>>     > > > > > > > > > would improve its usability and debuggability as
>> we'd be
>>     > able
>>     > > > to
>>     > > > > > see
>>     > > > > > > > its
>>     > > > > > > > > > performance, hit/miss rates, eviction counts and
>> others.
>>     > > > > > > > > >
>>     > > > > > > > > > Best,
>>     > > > > > > > > > Viktor
>>     > > > > > > > > >
>>     > > > > > > > > > On Thu, Nov 18, 2021 at 5:12 PM Magnus Edenhill <
>>     > > > > > magnus@edenhill.se>
>>     > > > > > > > > > wrote:
>>     > > > > > > > > >
>>     > > > > > > > > > > Hi Mickael,
>>     > > > > > > > > > >
>>     > > > > > > > > > > see inline.
>>     > > > > > > > > > >
>>     > > > > > > > > > > On Wed, Nov 10, 2021 at 15:21, Mickael Maison
>>     > > > > > > > > > > <mickael.maison@gmail.com> wrote:
>>     > > > > > > > > > >
>>     > > > > > > > > > > > Hi Magnus,
>>     > > > > > > > > > > >
>>     > > > > > > > > > > > I see you've addressed some of the points I
>> raised
>>     > above
>>     > > > but
>>     > > > > > some
>>     > > > > > > > (4,
>>     > > > > > > > > > > > 5) have not been addressed yet.
>>     > > > > > > > > > > >
>>     > > > > > > > > > >
>>     > > > > > > > > > > Re 4) How will the user/app know metrics are
>> being sent.
>>     > > > > > > > > > >
>>     > > > > > > > > > > One possibility is to add a JMX metric (thus for
>> user
>>     > > > > > consumption)
>>     > > > > > > > for
>>     > > > > > > > > > the
>>     > > > > > > > > > > number of metric pushes the
>>     > > > > > > > > > > client has performed, or perhaps the number of
>> metrics
>>     > > > > > subscriptions
>>     > > > > > > > > > > currently being collected.
>>     > > > > > > > > > > Would that be sufficient?
>>     > > > > > > > > > >
>>     > > > > > > > > > > Re 5) Metric sizes and rates
>>     > > > > > > > > > >
>>     > > > > > > > > > > A worst case scenario for a producer that is
>> producing to
>>     > > 50
>>     > > > > > unique
>>     > > > > > > > > > topics
>>     > > > > > > > > > > and emitting all standard metrics yields
>>     > > > > > > > > > > a serialized size of around 100KB prior to
>> compression,
>>     > > which
>>     > > > > > > > > compresses
>>     > > > > > > > > > > down to about 20-30% of that depending
>>     > > > > > > > > > > on compression type and topic name uniqueness.
>>     > > > > > > > > > > The numbers for a consumer would be similar.
>>     > > > > > > > > > >
>>     > > > > > > > > > > In practice the number of unique topics would be
>> far
>>     > less,
>>     > > > and
>>     > > > > > the
>>     > > > > > > > > > > subscription set would typically be for a subset
>> of
>>     > > metrics.
>>     > > > > > > > > > > So we're probably closer to 1kb, or less,
>> compressed size
>>     > > per
>>     > > > > > client
>>     > > > > > > > > per
>>     > > > > > > > > > > push interval.
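
Editorial sketch of the arithmetic in "Re 5" (a back-of-the-envelope helper, not a measurement): it turns a serialized size, a compression ratio and a push interval into bytes per second arriving at the cluster. The function name and the 60-second interval in the usage note are assumptions; the 100 KB worst case and the 20-30% compression range come from the message above.

```python
def push_bandwidth_bytes_per_s(
    uncompressed_bytes: float,
    compression_ratio: float,
    push_interval_s: float,
    clients: int = 1,
) -> float:
    """Average telemetry bandwidth for `clients` identical clients.

    compression_ratio is compressed size divided by uncompressed size,
    e.g. 0.3 when the payload shrinks to 30% of its original size.
    """
    if push_interval_s <= 0:
        raise ValueError("push_interval_s must be positive")
    compressed_bytes = uncompressed_bytes * compression_ratio
    return compressed_bytes * clients / push_interval_s
```

Under the worst case quoted above (100 KB compressing to 30%) and an assumed 60-second push interval, `push_bandwidth_bytes_per_s(100_000, 0.3, 60)` comes to roughly 500 bytes per second per client; the more typical 1 KB compressed payload is about two orders of magnitude below that.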
>>     > > > > > > > > > >
>>     > > > > > > > > > > As both the subscription set and push intervals
>> are
>>     > > > controlled
>>     > > > > > by the
>>     > > > > > > > > > > cluster operator it shouldn't be too hard
>>     > > > > > > > > > > to strike a good balance between metrics overhead
>> and
>>     > > > > > granularity.
>>     > > > > > > > > > >
>>     > > > > > > > > > >
>>     > > > > > > > > > >
>>     > > > > > > > > > > >
>>     > > > > > > > > > > > I'm really uneasy with this being enabled by
>> default on
>>     > > the
>>     > > > > > client
>>     > > > > > > > > > > > side. When collecting data, I think the best
>> practice
>>     > is
>>     > > to
>>     > > > > > ensure
>>     > > > > > > > > > > > users are explicitly enabling it.
>>     > > > > > > > > > > >
>>     > > > > > > > > > >
>>     > > > > > > > > > > Requiring metrics to be explicitly enabled on
>> clients
>>     > > > severely
>>     > > > > > > > cripples
>>     > > > > > > > > > its
>>     > > > > > > > > > > usability and value.
>>     > > > > > > > > > >
>>     > > > > > > > > > > One of the problems that this KIP aims to solve
>> is for
>>     > > useful
>>     > > > > > metrics
>>     > > > > > > > > to
>>     > > > > > > > > > be
>>     > > > > > > > > > > available on demand
>>     > > > > > > > > > > regardless of the technical expertise of the
>> user. As
>>     > > Ryanne
>>     > > > > > points
>>     > > > > > > > > out,
>>     > > > > > > > > > a
>>     > > > > > > > > > > savvy user/organization
>>     > > > > > > > > > > will typically have metrics collection and
>> monitoring in
>>     > > > place
>>     > > > > > > > already,
>>     > > > > > > > > > and
>>     > > > > > > > > > > the benefits of this KIP
>>     > > > > > > > > > > are then more of a common set and format of metrics
>> across
>>     > > > client
>>     > > > > > > > > > > implementations and languages.
>>     > > > > > > > > > > But that is not the typical Kafka user in my
>> experience,
>>     > > > > they're
>>     > > > > > not
>>     > > > > > > > > > Kafka
>>     > > > > > > > > > > experts and they don't have the
>>     > > > > > > > > > > knowledge of how to best instrument their clients.
>>     > > > > > > > > > > Having metrics enabled by default for this user
>> base
>>     > allows
>>     > > > the
>>     > > > > > Kafka
>>     > > > > > > > > > > operators to proactively and reactively
>>     > > > > > > > > > > monitor and troubleshoot client issues, without
>> the need
>>     > > for
>>     > > > > the
>>     > > > > > less
>>     > > > > > > > > > savvy
>>     > > > > > > > > > > user to do anything.
>>     > > > > > > > > > > It is often too late to tell a user to enable
>> metrics
>>     > when
>>     > > > the
>>     > > > > > > > problem
>>     > > > > > > > > > has
>>     > > > > > > > > > > already occurred.
>>     > > > > > > > > > >
>>     > > > > > > > > > > Now, to be clear, even though metrics are enabled
>> by
>>     > > default
>>     > > > on
>>     > > > > > > > clients
>>     > > > > > > > > > it
>>     > > > > > > > > > > is not enabled by default
>>     > > > > > > > > > > on the brokers; the Kafka operator needs to build
>> and set
>>     > > up
>>     > > > a
>>     > > > > > > > metrics
>>     > > > > > > > > > > plugin and add metrics subscriptions
>>     > > > > > > > > > > before anything is sent from the client.
>>     > > > > > > > > > > It is opt-out on the clients and opt-in on the
>> broker.
>>     > > > > > > > > > >
>>     > > > > > > > > > >
>>     > > > > > > > > > >
>>     > > > > > > > > > >
>>     > > > > > > > > > > > You mentioned brokers already have
>>     > > > > > > > > > > > some(most?) of the information contained in
>> metrics, if
>>     > > so
>>     > > > > > then why
>>     > > > > > > > > > > > are we collecting it again? Surely there must
>> be some
>>     > new
>>     > > > > > > > information
>>     > > > > > > > > > > > in the client metrics.
>>     > > > > > > > > > > >
>>     > > > > > > > > > >
>>     > > > > > > > > > > From the user's perspective the Kafka
>> infrastructure
>>     > > extends
>>     > > > > from
>>     > > > > > > > > > > producer.send() to
>>     > > > > > > > > > > messages being returned from consumer.poll(), a
>> giant
>>     > black
>>     > > > box
>>     > > > > > where
>>     > > > > > > > > > > there's a lot going on between those
>>     > > > > > > > > > > two points. The brokers currently only see what
>> happens
>>     > > once
>>     > > > > > those
>>     > > > > > > > > > requests
>>     > > > > > > > > > > and messages hit the broker,
>>     > > > > > > > > > > but as Kafka clients are complex pieces of
>> machinery
>>     > > there's
>>     > > > a
>>     > > > > > myriad
>>     > > > > > > > > of
>>     > > > > > > > > > > queues, timers, and state
>>     > > > > > > > > > > that's critical to the operation and
>> infrastructure
>>     > that's
>>     > > > not
>>     > > > > > > > > currently
>>     > > > > > > > > > > visible to the operator.
>>     > > > > > > > > > > Relying on the user to accurately and timely
>> provide this
>>     > > > > missing
>>     > > > > > > > > > > information is not generally feasible.
>>     > > > > > > > > > >
>>     > > > > > > > > > >
>>     > > > > > > > > > > Most of the standard metrics listed in the KIP
>> are data
>>     > > > points
>>     > > > > > that
>>     > > > > > > > the
>>     > > > > > > > > > > broker does not have.
>>     > > > > > > > > > > Only a small number of metrics are duplicates
>> (like the
>>     > > > request
>>     > > > > > > > counts
>>     > > > > > > > > > and
>>     > > > > > > > > > > sizes), but they are included
>>     > > > > > > > > > > to ease correlation when inspecting these client
>> metrics.
>>     > > > > > > > > > >
>>     > > > > > > > > > >
>>     > > > > > > > > > >
>>     > > > > > > > > > > > Moreover this is a brand new feature so it's
>> even
>>     > harder
>>     > > to
>>     > > > > > justify
>>     > > > > > > > > > > > enabling it and forcing it onto all our users. If
>> disabled
>>     > > by
>>     > > > > > default,
>>     > > > > > > > > > > > it's relatively easy to enable in a new release
>> if we
>>     > > > decide
>>     > > > > > to,
>>     > > > > > > > but
>>     > > > > > > > > > > > once enabled by default it's much harder to
>> disable.
>>     > Also
>>     > > > > this
>>     > > > > > > > > feature
>>     > > > > > > > > > > > will apply to all future metrics we will add.
>>     > > > > > > > > > > >
>>     > > > > > > > > > >
>>     > > > > > > > > > > I think maturity of a feature implementation
>> should be
>>     > the
>>     > > > > > deciding
>>     > > > > > > > > > factor,
>>     > > > > > > > > > > rather than
>>     > > > > > > > > > > the design of it (which this KIP is). I.e., if the
>>     > > > > > implementation is
>>     > > > > > > > > not
>>     > > > > > > > > > > deemed mature enough
>>     > > > > > > > > > > for release X.Y it will be disabled.
>>     > > > > > > > > > >
>>     > > > > > > > > > >
>>     > > > > > > > > > >
>>     > > > > > > > > > > > Overall I think it's an interesting feature but
>> I'd
>>     > > prefer
>>     > > > to
>>     > > > > > be
>>     > > > > > > > > > > > slightly defensive and see how it works in
>> practice
>>     > > before
>>     > > > > > enabling
>>     > > > > > > > > it
>>     > > > > > > > > > > > everywhere.
>>     > > > > > > > > > > >
>>     > > > > > > > > > >
>>     > > > > > > > > > > Right, and I agree on being defensive, but since
>> this
>>     > > feature
>>     > > > > > still
>>     > > > > > > > > > > requires manual
>>     > > > > > > > > > > enabling on the brokers before actually being
>> used, I
>>     > think
>>     > > > > that
>>     > > > > > > > gives
>>     > > > > > > > > > > enough control
>>     > > > > > > > > > > to opt-in or out of this feature as needed.
>>     > > > > > > > > > >
>>     > > > > > > > > > > Thanks for your comments!
>>     > > > > > > > > > >
>>     > > > > > > > > > > Regards,
>>     > > > > > > > > > > Magnus
>>     > > > > > > > > > >
>>     > > > > > > > > > >
>>     > > > > > > > > > >
>>     > > > > > > > > > > > Thanks,
>>     > > > > > > > > > > > Mickael
>>     > > > > > > > > > > >
>>     > > > > > > > > > > > On Mon, Nov 8, 2021 at 7:55 AM Magnus Edenhill <
>>     > > > > > magnus@edenhill.se
>>     > > > > > > > >
>>     > > > > > > > > > > wrote:
>>     > > > > > > > > > > > >
>>     > > > > > > > > > > > > Thanks David for pointing this out,
>>     > > > > > > > > > > > > I've updated the KIP to include client_id as a
>>     > matching
>>     > > > > > selector.
>>     > > > > > > > > > > > >
>>     > > > > > > > > > > > > Regards,
>>     > > > > > > > > > > > > Magnus
>>     > > > > > > > > > > > >
>>     > > > > > > > > > > > > On Thu, Nov 4, 2021 at 18:01, David Mao
>>     > > > > > > > > > > > > <dmao@confluent.io.invalid> wrote:
>>     > > > > > > > > > > > >
>>     > > > > > > > > > > > > > Hey Magnus,
>>     > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > I noticed that the KIP outlines the initial
>>     > selectors
>>     > > > > > supported
>>     > > > > > > > > as:
>>     > > > > > > > > > > > > >
>>     > > > > > > > > > > > > >    - client_instance_id -
>> CLIENT_INSTANCE_ID UUID
>>     > > > string
>>     > > > > > > > > > > > representation.
>>     > > > > > > > > > > > > >    - client_software_name  - client software
>>     > > > > implementation
>>     > > > > > > > name.
>>     > > > > > > > > > > > > >    - client_software_version  - client
>> software
>>     > > > > > implementation
>>     > > > > > > > > > > version.
>>     > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > In the given reactive monitoring workflow,
>> we
>>     > mention
>>     > > > > that
>>     > > > > > the
>>     > > > > > > > > > > > application
>>     > > > > > > > > > > > > > user does not know their client's client
>> instance
>>     > ID,
>>     > > > but
>>     > > > > > it's
>>     > > > > > > > > > > outlined
>>     > > > > > > > > > > > > > that the operator can add a metrics
>> subscription
>>     > > > > selecting
>>     > > > > > for
>>     > > > > > > > > > > > clientId. I
>>     > > > > > > > > > > > > > don't see clientId as one of the supported
>>     > selectors.
>>     > > > > > > > > > > > > > I can see how this would have made sense in
>> a
>>     > > previous
>>     > > > > > > > iteration
>>     > > > > > > > > > > given
>>     > > > > > > > > > > > that
>>     > > > > > > > > > > > > > the previous client instance ID proposal
>> was to
>>     > > > construct
>>     > > > > > the
>>     > > > > > > > > > client
>>     > > > > > > > > > > > > > instance ID using clientId as a prefix. Now
>> that
>>     > the
>>     > > > > client
>>     > > > > > > > > > instance
>>     > > > > > > > > > > > ID is
>>     > > > > > > > > > > > > > a UUID, would we want to add clientId as a
>>     > supported
>>     > > > > > selector?
>>     > > > > > > > > > > > > > Let me know what you think.
>>     > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > David
>>     > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > On Tue, Oct 19, 2021 at 12:33 PM Magnus
>> Edenhill <
>>     > > > > > > > > > magnus@edenhill.se
>>     > > > > > > > > > > >
>>     > > > > > > > > > > > > > wrote:
>>     > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > > Hi Mickael!
>>     > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > > On Tue, Oct 19, 2021 at 19:30, Mickael Maison
>>     > > > > > > > > > > > > > > <mickael.maison@gmail.com> wrote:
>>     > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > > > Hi Magnus,
>>     > > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > > > Thanks for the proposal.
>>     > > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > > > 1. Looking at the protocol section,
>> isn't
>>     > > > > > > > "ClientInstanceId"
>>     > > > > > > > > > > > expected
>>     > > > > > > > > > > > > > > > to be a field in
>>     > > > GetTelemetrySubscriptionsResponseV0?
>>     > > > > > > > > > Otherwise,
>>     > > > > > > > > > > > how
>>     > > > > > > > > > > > > > > > does a client retrieve this value?
>>     > > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > > Good catch, it got removed by mistake in
>> one of
>>     > the
>>     > > > > > edits.
>>     > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > > > 2. In the client API section, you
>> mention a new
>>     > > > > method
>>     > > > > > > > > > > > > > > > "clientInstanceId()". Can you clarify
>> which
>>     > > > > interfaces
>>     > > > > > are
>>     > > > > > > > > > > > affected?
>>     > > > > > > > > > > > > > > > Is it only Consumer and Producer?
>>     > > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > > And Admin. Will update the KIP.
>>     > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > > > 3. I'm a bit concerned this is enabled
>> by
>>     > > default.
>>     > > > > > Even if
>>     > > > > > > > > the
>>     > > > > > > > > > > data
>>     > > > > > > > > > > > > > > > collected is supposed to be not
>> sensitive, I
>>     > > think
>>     > > > > > this can
>>     > > > > > > > > be
>>     > > > > > > > > > > > > > > > problematic in some environments. Also
>> users
>>     > > don't
>>     > > > > > seem to
>>     > > > > > > > > have
>>     > > > > > > > > > > the
>>     > > > > > > > > > > > > > > > choice to only expose some metrics.
>> Knowing how
>>     > > > much
>>     > > > > > data
>>     > > > > > > > > > transits
>>     > > > > > > > > > > > > > > > through some applications can be
>> considered
>>     > > > critical.
>>     > > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > > The broker already knows how much data
>> transits
>>     > > > through
>>     > > > > > the
>>     > > > > > > > > > client
>>     > > > > > > > > > > > > > though,
>>     > > > > > > > > > > > > > > right?
>>     > > > > > > > > > > > > > > Care has been taken not to expose
>> information in
>>     > > the
>>     > > > > > standard
>>     > > > > > > > > > > metrics
>>     > > > > > > > > > > > > > that
>>     > > > > > > > > > > > > > > might
>>     > > > > > > > > > > > > > > reveal sensitive information.
>>     > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > > Do you have an example of how the proposed
>>     > metrics
>>     > > > > could
>>     > > > > > leak
>>     > > > > > > > > > > > sensitive
>>     > > > > > > > > > > > > > > information?
>>     > > > > > > > > > > > > > > As for limiting what metrics to
>> export; I
>>     > guess
>>     > > > > that
>>     > > > > > > > could
>>     > > > > > > > > > make
>>     > > > > > > > > > > > sense
>>     > > > > > > > > > > > > > > in some
>>     > > > > > > > > > > > > > > very sensitive use-cases, but those users
>> might
>>     > > > disable
>>     > > > > > > > metrics
>>     > > > > > > > > > > > > > altogether
>>     > > > > > > > > > > > > > > for now.
>>     > > > > > > > > > > > > > > Could these concerns be addressed by a
>> later KIP?
>>     > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > > > 4. As a user, how do you know if your
>>     > application
>>     > > > is
>>     > > > > > > > actively
>>     > > > > > > > > > > > sending
>>     > > > > > > > > > > > > > > > metrics? Are there new metrics exposing
>> what's
>>     > > > going
>>     > > > > > on,
>>     > > > > > > > like
>>     > > > > > > > > > how
>>     > > > > > > > > > > > much
>>     > > > > > > > > > > > > > > > data is being sent?
>>     > > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > > That's a good question.
>>     > > > > > > > > > > > > > > Since the proposed metrics interface is
>> not aimed
>>     > > at,
>>     > > > > or
>>     > > > > > > > > directly
>>     > > > > > > > > > > > > > available
>>     > > > > > > > > > > > > > > to, the application
>>     > > > > > > > > > > > > > > I guess there's little point of adding it
>> here,
>>     > but
>>     > > > > > instead
>>     > > > > > > > > > adding
>>     > > > > > > > > > > > > > > something to the
>>     > > > > > > > > > > > > > > existing JMX metrics?
>>     > > > > > > > > > > > > > > Do you have any suggestions?
>>     > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > > > 5. If all metrics are enabled on a
>> regular
>>     > > Consumer
>>     > > > > or
>>     > > > > > > > > > Producer,
>>     > > > > > > > > > > do
>>     > > > > > > > > > > > > > > > you have an idea how much throughput
>> this would
>>     > > > use?
>>     > > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > > It depends on the number of
>> partition/topics/etc
>>     > > the
>>     > > > > > client
>>     > > > > > > > is
>>     > > > > > > > > > > > producing
>>     > > > > > > > > > > > > > > to/consuming from.
>>     > > > > > > > > > > > > > > I'll add some sizes to the KIP for some
>> typical
>>     > > > > > use-cases.
>>     > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > > Thanks,
>>     > > > > > > > > > > > > > > Magnus
>>     > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > > > Thanks
>>     > > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > > > On Tue, Oct 19, 2021 at 5:06 PM Magnus
>>     > Edenhill <
>>     > > > > > > > > > > > magnus@edenhill.se>
>>     > > > > > > > > > > > > > > > wrote:
>>     > > > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > > > > On Tue, Oct 19, 2021 at 13:22, Tom Bentley <tbentley@redhat.com> wrote:
>>     > > > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > > > > > Hi Magnus,
>>     > > > > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > > > > > I reviewed the KIP since you called
>> the
>>     > vote
>>     > > > > > (sorry for
>>     > > > > > > > > not
>>     > > > > > > > > > > > > > reviewing
>>     > > > > > > > > > > > > > > > when
>>     > > > > > > > > > > > > > > > > > you announced your intention to
>> call the
>>     > > > vote). I
>>     > > > > > have
>>     > > > > > > > a
>>     > > > > > > > > > few
>>     > > > > > > > > > > > > > > questions
>>     > > > > > > > > > > > > > > > on
>>     > > > > > > > > > > > > > > > > > some of the details.
>>     > > > > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > > > > > 1. There's no Javadoc on
>>     > > > > > ClientTelemetryPayload.data(),
>>     > > > > > > > > so
>>     > > > > > > > > > I
>>     > > > > > > > > > > > don't
>>     > > > > > > > > > > > > > > know
>>     > > > > > > > > > > > > > > > > > whether the payload is exposed
>> through this
>>     > > > > method
>>     > > > > > as
>>     > > > > > > > > > > > compressed or
>>     > > > > > > > > > > > > > > > not.
>>     > > > > > > > > > > > > > > > > > Later on you say "Decompression of
>> the
>>     > > payloads
>>     > > > > > will be
>>     > > > > > > > > > > > handled by
>>     > > > > > > > > > > > > > > the
>>     > > > > > > > > > > > > > > > > > broker metrics plugin, the broker
>> should
>>     > > > expose a
>>     > > > > > > > > suitable
>>     > > > > > > > > > > > > > > > decompression
>>     > > > > > > > > > > > > > > > > > API to the metrics plugin for this
>>     > purpose.",
>>     > > > > which
>>     > > > > > > > > > suggests
>>     > > > > > > > > > > > it's
>>     > > > > > > > > > > > > > the
>>     > > > > > > > > > > > > > > > > > compressed data in the buffer, but
>> then we
>>     > > > don't
>>     > > > > > know
>>     > > > > > > > > which
>>     > > > > > > > > > > > codec
>>     > > > > > > > > > > > > > was
>>     > > > > > > > > > > > > > > > used,
>>     > > > > > > > > > > > > > > > > > nor the API via which the plugin
>> should
>>     > > > > decompress
>>     > > > > > it
>>     > > > > > > > if
>>     > > > > > > > > > > > required
>>     > > > > > > > > > > > > > for
>>     > > > > > > > > > > > > > > > > > forwarding to the ultimate metrics
>> store.
>>     > > > Should
>>     > > > > > the
>>     > > > > > > > > > > > > > > > ClientTelemetryPayload
>>     > > > > > > > > > > > > > > > > > expose a method to get the
>> compression and
>>     > a
>>     > > > > > > > > decompressor?
>>     > > > > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > > > > Good point, updated.
>>     > > > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > > > > > 2. The client-side API is expressed
>> as
>>     > > > > > StringOrError
>>     > > > > > > > > > > > > > > > > > ClientInstance::ClientInstanceId(int
>>     > > > > timeout_ms). I
>>     > > > > > > > > > > understand
>>     > > > > > > > > > > > that
>>     > > > > > > > > > > > > > > > you're
>>     > > > > > > > > > > > > > > > > > thinking about the librdkafka
>>     > implementation,
>>     > > > but
>>     > > > > > it
>>     > > > > > > > > would
>>     > > > > > > > > > be
>>     > > > > > > > > > > > good
>>     > > > > > > > > > > > > > to
>>     > > > > > > > > > > > > > > > show
>>     > > > > > > > > > > > > > > > > > the API as it would appear on the
>> Apache
>>     > > Kafka
>>     > > > > > clients.
>>     > > > > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > > > > This was meant as pseudo-code, but I
>> changed
>>     > it
>>     > > > to
>>     > > > > > Java.
>>     > > > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > > > > > 3. "PushTelemetryRequest|Response -
>>     > protocol
>>     > > > > > request
>>     > > > > > > > used
>>     > > > > > > > > > by
>>     > > > > > > > > > > > the
>>     > > > > > > > > > > > > > > > client to
>>     > > > > > > > > > > > > > > > > > send metrics to any broker it is
>> connected
>>     > > to."
>>     > > > > To
>>     > > > > > be
>>     > > > > > > > > > clear,
>>     > > > > > > > > > > > this
>>     > > > > > > > > > > > > > > means
>>     > > > > > > > > > > > > > > > > > that the client can choose any of
>> the
>>     > > connected
>>     > > > > > brokers
>>     > > > > > > > > and
>>     > > > > > > > > > > > push to
>>     > > > > > > > > > > > > > > > just
>>     > > > > > > > > > > > > > > > > > one of them? What should a
>> supporting
>>     > client
>>     > > do
>>     > > > > if
>>     > > > > > it
>>     > > > > > > > > gets
>>     > > > > > > > > > an
>>     > > > > > > > > > > > error
>>     > > > > > > > > > > > > > > > when
>>     > > > > > > > > > > > > > > > > > pushing metrics to a broker, retry
>> sending
>>     > to
>>     > > > the
>>     > > > > > same
>>     > > > > > > > > > broker
>>     > > > > > > > > > > > or
>>     > > > > > > > > > > > > > try
>>     > > > > > > > > > > > > > > > > > pushing to another broker, or drop
>> the
>>     > > metrics?
>>     > > > > > Should
>>     > > > > > > > > > > > supporting
>>     > > > > > > > > > > > > > > > clients
>>     > > > > > > > > > > > > > > > > > send successive requests to a single
>>     > broker,
>>     > > or
>>     > > > > > round
>>     > > > > > > > > > robin,
>>     > > > > > > > > > > > or is
>>     > > > > > > > > > > > > > > > that up
>>     > > > > > > > > > > > > > > > > > to the client author? I'm guessing
>> the
>>     > > > behaviour
>>     > > > > > should
>>     > > > > > > > > be
>>     > > > > > > > > > > > sticky
>>     > > > > > > > > > > > > > to
>>     > > > > > > > > > > > > > > > > > support the rate limiting features,
>> but I
>>     > > think
>>     > > > > it
>>     > > > > > > > would
>>     > > > > > > > > be
>>     > > > > > > > > > > > good
>>     > > > > > > > > > > > > > for
>>     > > > > > > > > > > > > > > > client
>>     > > > > > > > > > > > > > > > > > authors if this section were
>> explicit on
>>     > the
>>     > > > > > > > recommended
>>     > > > > > > > > > > > behaviour.
>>     > > > > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > > > > You are right, I've updated the KIP
>> to make
>>     > > this
>>     > > > > > clearer.
>>     > > > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > > > > > 4. "Mapping the client instance id
>> to an
>>     > > actual
>>     > > > > > > > > application
>>     > > > > > > > > > > > > > instance
>>     > > > > > > > > > > > > > > > > > running on a (virtual) machine can
>> be done
>>     > by
>>     > > > > > > > inspecting
>>     > > > > > > > > > the
>>     > > > > > > > > > > > > > metrics
>>     > > > > > > > > > > > > > > > > > resource labels, such as the client
>> source
>>     > > > > address
>>     > > > > > and
>>     > > > > > > > > > source
>>     > > > > > > > > > > > port,
>>     > > > > > > > > > > > > > > or
>>     > > > > > > > > > > > > > > > > > security principal, all of which
>> are added
>>     > by
>>     > > > the
>>     > > > > > > > > receiving
>>     > > > > > > > > > > > broker.
>>     > > > > > > > > > > > > > > > This
>>     > > > > > > > > > > > > > > > > > will allow the operator together
>> with the
>>     > > user
>>     > > > to
>>     > > > > > > > > identify
>>     > > > > > > > > > > the
>>     > > > > > > > > > > > > > actual
>>     > > > > > > > > > > > > > > > > > application instance." Is this
>> really
>>     > always
>>     > > > > true?
>>     > > > > > The
>>     > > > > > > > > > source
>>     > > > > > > > > > > > IP
>>     > > > > > > > > > > > > > and
>>     > > > > > > > > > > > > > > > port
>>     > > > > > > > > > > > > > > > > > might be a loadbalancer/proxy in
>> some
>>     > setups.
>>     > > > The
>>     > > > > > > > > > principal,
>>     > > > > > > > > > > as
>>     > > > > > > > > > > > > > > already
>>     > > > > > > > > > > > > > > > > > mentioned in the KIP, might be
>> shared
>>     > between
>>     > > > > > multiple
>>     > > > > > > > > > > > > > applications.
>>     > > > > > > > > > > > > > > > So at
>>     > > > > > > > > > > > > > > > > > worst the organization running the
>> clients
>>     > > > might
>>     > > > > > have
>>     > > > > > > > to
>>     > > > > > > > > > > > consult
>>     > > > > > > > > > > > > > the
>>     > > > > > > > > > > > > > > > logs
>>     > > > > > > > > > > > > > > > > > of a set of client applications,
>> right?
>>     > > > > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > > > > Yes, that's correct. There's no
>> guaranteed
>>     > > > mapping
>>     > > > > > from
>>     > > > > > > > > > > > > > > > client_instance_id
>>     > > > > > > > > > > > > > > > > to
>>     > > > > > > > > > > > > > > > > an actual instance, that's why the KIP
>>     > > recommends
>>     > > > > > client
>>     > > > > > > > > > > > > > > implementations
>>     > > > > > > > > > > > > > > > to
>>     > > > > > > > > > > > > > > > > log the client instance id
>>     > > > > > > > > > > > > > > > > upon retrieval, and also provide an
>> API for
>>     > the
>>     > > > > > > > application
>>     > > > > > > > > > to
>>     > > > > > > > > > > > > > retrieve
>>     > > > > > > > > > > > > > > > the
>>     > > > > > > > > > > > > > > > > instance id programmatically
>>     > > > > > > > > > > > > > > > > if it has a better way of exposing it.
>>     > > > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > > > > 5. "Tests indicate that a compression
>> ratio
>>     > up
>>     > > to
>>     > > > > > 10x is
>>     > > > > > > > > > > > possible for
>>     > > > > > > > > > > > > > > the
>>     > > > > > > > > > > > > > > > > > standard metrics." Client authors
>> might
>>     > > > > appreciate
>>     > > > > > your
>>     > > > > > > > > > > > mentioning
>>     > > > > > > > > > > > > > > > which
>>     > > > > > > > > > > > > > > > > > compression codec got these results.
>>     > > > > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > > > > Good point. Updated.
>>     > > > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > > > > > 6. "Should the client send a push
>> request
>>     > > prior
>>     > > > > to
>>     > > > > > > > expiry
>>     > > > > > > > > > of
>>     > > > > > > > > > > > the
>>     > > > > > > > > > > > > > > > previously
>>     > > > > > > > > > > > > > > > > > calculated PushIntervalMs the
>> broker will
>>     > > > discard
>>     > > > > > the
>>     > > > > > > > > > metrics
>>     > > > > > > > > > > > and
>>     > > > > > > > > > > > > > > > return a
>>     > > > > > > > > > > > > > > > > > PushTelemetryResponse with the
>> ErrorCode
>>     > set
>>     > > to
>>     > > > > > > > > > RateLimited."
>>     > > > > > > > > > > > Is
>>     > > > > > > > > > > > > > this
>>     > > > > > > > > > > > > > > > > > RATE_LIMITED a new error code? It's
>> not
>>     > > > mentioned
>>     > > > > > in
>>     > > > > > > > the
>>     > > > > > > > > > "New
>>     > > > > > > > > > > > Error
>>     > > > > > > > > > > > > > > > Codes"
>>     > > > > > > > > > > > > > > > > > section.
>>     > > > > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > > > > That's a leftover, it should be using
>> the
>>     > > > standard
>>     > > > > > > > > > ThrottleTime
>>     > > > > > > > > > > > > > > > mechanism.
>>     > > > > > > > > > > > > > > > > Fixed.
>>     > > > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > > > > > 7. In the section "Standard client
>> resource
>>     > > > > labels"
>>     > > > > > > > > > > > application_id
>>     > > > > > > > > > > > > > is
>>     > > > > > > > > > > > > > > > > > described as Kafka Streams only,
>> but the
>>     > > > section
>>     > > > > of
>>     > > > > > > > > "Client
>>     > > > > > > > > > > > > > > > Identification"
>>     > > > > > > > > > > > > > > > > > talks about "application instance
>> id as an
>>     > > > > optional
>>     > > > > > > > > future
>>     > > > > > > > > > > > > > > nice-to-have
>>     > > > > > > > > > > > > > > > > > that may be included as a metrics
>> label if
>>     > it
>>     > > > has
>>     > > > > > been
>>     > > > > > > > > set
>>     > > > > > > > > > by
>>     > > > > > > > > > > > the
>>     > > > > > > > > > > > > > > > user", so
>>     > > > > > > > > > > > > > > > > > I'm confused whether non-Kafka
>> Streams
>>     > > clients
>>     > > > > > should
>>     > > > > > > > set
>>     > > > > > > > > > an
>>     > > > > > > > > > > > > > > > application_id
>>     > > > > > > > > > > > > > > > > > or not.
>>     > > > > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > > > > I'll clarify this in the KIP, but
>> basically
>>     > we
>>     > > > > would
>>     > > > > > need
>>     > > > > > > > > to
>>     > > > > > > > > > > add
>>     > > > > > > > > > > > an `
>>     > > > > > > > > > > > > > > > > application.id` config
>>     > > > > > > > > > > > > > > > > property for non-streams clients for
>> this
>>     > > > purpose,
>>     > > > > > and
>>     > > > > > > > > that's
>>     > > > > > > > > > > > outside
>>     > > > > > > > > > > > > > > the
>>     > > > > > > > > > > > > > > > > scope of this KIP since we want to
>> make it
>>     > > > > > zero-conf:ish
>>     > > > > > > > on
>>     > > > > > > > > > the
>>     > > > > > > > > > > > > > client
>>     > > > > > > > > > > > > > > > side.
>>     > > > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > > > > > Kind regards,
>>     > > > > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > > > > > Tom
>>     > > > > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > > > > Thanks for the review,
>>     > > > > > > > > > > > > > > > > Magnus
>>     > > > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > > > > > On Thu, Oct 7, 2021 at 5:26 PM
>> Magnus
>>     > > Edenhill
>>     > > > <
>>     > > > > > > > > > > > magnus@edenhill.se
>>     > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > > > wrote:
>>     > > > > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > > > > > > Hi all,
>>     > > > > > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > > > > > > I've updated the KIP following
>> our recent
>>     > > > > > discussions
>>     > > > > > > > > on
>>     > > > > > > > > > > the
>>     > > > > > > > > > > > > > > mailing
>>     > > > > > > > > > > > > > > > > > list:
>>     > > > > > > > > > > > > > > > > > >  - split the protocol in two, one
>> for
>>     > > getting
>>     > > > > the
>>     > > > > > > > > metrics
>>     > > > > > > > > > > > > > > > subscriptions,
>>     > > > > > > > > > > > > > > > > > > and one for pushing the metrics.
>>     > > > > > > > > > > > > > > > > > >  - simplifications: initially
>> only one
>>     > > > > supported
>>     > > > > > > > > metrics
>>     > > > > > > > > > > > format,
>>     > > > > > > > > > > > > > no
>>     > > > > > > > > > > > > > > > > > > client.id in the instance id,
>> etc.
>>     > > > > > > > > > > > > > > > > > >  - made CLIENT_METRICS
>> subscription
>>     > > > > configuration
>>     > > > > > > > > entries
>>     > > > > > > > > > > > more
>>     > > > > > > > > > > > > > > > structured
>>     > > > > > > > > > > > > > > > > > >    and allowing better client
>> matching
>>     > > > > selectors
>>     > > > > > (not
>>     > > > > > > > > > only
>>     > > > > > > > > > > > on the
>>     > > > > > > > > > > > > > > > > > instance
>>     > > > > > > > > > > > > > > > > > > id, but also the other
>>     > > > > > > > > > > > > > > > > > >    client resource labels, such as
>>     > > > > > > > > client_software_name,
>>     > > > > > > > > > > > etc.).
>>     > > > > > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > > > > > > Unless there are further comments
>> I'll
>>     > call
>>     > > > the
>>     > > > > > vote
>>     > > > > > > > > in a
>>     > > > > > > > > > > > day or
>>     > > > > > > > > > > > > > > two.
>>     > > > > > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > > > > > > Regards,
>>     > > > > > > > > > > > > > > > > > > Magnus
>>     > > > > > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > > > > > > On Mon, Oct 4, 2021 at 20:57, Magnus Edenhill <magnus@edenhill.se> wrote:
>>     > > > > > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > > > > > > > Hi Gwen,
>>     > > > > > > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > > > > > > > I'm finishing up the KIP based
>> on the
>>     > > last
>>     > > > > > couple
>>     > > > > > > > of
>>     > > > > > > > > > > > discussion
>>     > > > > > > > > > > > > > > > points
>>     > > > > > > > > > > > > > > > > > in
>>     > > > > > > > > > > > > > > > > > > > this thread
>>     > > > > > > > > > > > > > > > > > > > and will call the Vote later
>> this week.
>>     > > > > > > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > > > > > > > Best,
>>     > > > > > > > > > > > > > > > > > > > Magnus
>>     > > > > > > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > > > > > > > On Sat, Oct 2, 2021 at 02:01, Gwen Shapira <gwen@confluent.io.invalid> wrote:
>>     > > > > > > > > > > > > > > > > > > >
>>     > > > > > > > > > > > > > > > > > > >> Hey,
>>     > > > > > > > > > > > > > > > > > > >>
>>     > > > > > > > > > > > > > > > > > > >> I noticed that there was no
>> discussion
>>     > > for
>>     > > > > the
>>     > > > > > > > last
>>     > > > > > > > > 10
>>     > > > > > > > > > > > days,
>>     > > > > > > > > > > > > > > but I
>>     > > > > > > > > > > > > > > > > > > >> couldn't
>>     > > > > > > > > > > > > > > > > > > >> find the vote thread. Is there
>> one
>>     > that
>>     > > > I'm
>>     > > > > > > > missing?
>>     > > > > > > > > > > > > > > > > > > >>
>>     > > > > > > > > > > > > > > > > > > >> Gwen
>>     > > > > > > > > > > > > > > > > > > >>
>>     > > > > > > > > > > > > > > > > > > >> On Wed, Sep 22, 2021 at 4:58
>> AM Magnus
>>     > > > > > Edenhill <
>>     > > > > > > > > > > > > > > > magnus@edenhill.se>
>>     > > > > > > > > > > > > > > > > > > >> wrote:
>>     > > > > > > > > > > > > > > > > > > >>
>>     > > > > > > > > > > > > > > > > > > >> > On Tue, Sep 21, 2021 at 06:58, Colin McCabe <cmccabe@apache.org> wrote:
>>     > > > > > > > > > > > > > > > > > > >> >
>>     > > > > > > > > > > > > > > > > > > >> > > On Mon, Sep 20, 2021, at
>> 17:35,
>>     > Feng
>>     > > > Min
>>     > > > > > > > wrote:
>>     > > > > > > > > > > > > > > > > > > >> > > > Thanks Magnus & Colin
>> for the
>>     > > > > > discussion.
>>     > > > > > > > > > > > > > > > > > > >> > > >
>>     > > > > > > > > > > > > > > > > > > >> > > > Based on KIP-714's
>> stateless
>>     > > design,
>>     > > > > > Client
>>     > > > > > > > > can
>>     > > > > > > > > > > > pretty
>>     > > > > > > > > > > > > > > much
>>     > > > > > > > > > > > > > > > use
>>     > > > > > > > > > > > > > > > > > > any
>>     > > > > > > > > > > > > > > > > > > >> > > > connection to any broker
>> to send
>>     > > > > > metrics. We
>>     > > > > > > > > are
>>     > > > > > > > > > > not
>>     > > > > > > > > > > > > > > > associating
>>     > > > > > > > > > > > > > > > > > > >> > > connection
>>     > > > > > > > > > > > > > > > > > > >> > > > with client metric
>> state. Is my
>>     > > > > > > > understanding
>>     > > > > > > > > > > > correct?
>>     > > > > > > > > > > > > > If
>>     > > > > > > > > > > > > > > > yes,
>>     > > > > > > > > > > > > > > > > > > how
>>     > > > > > > > > > > > > > > > > > > >> > about
>>     > > > > > > > > > > > > > > > > > > >> > > > the following two
>> scenarios
>>     > > > > > > > > > > > > > > > > > > >> > > >
>>     > > > > > > > > > > > > > > > > > > >> > > > 1) One Client (Client-ID)
>>     > > registers
>>     > > > > two
>>     > > > > > > > > > different
>>     > > > > > > > > > > > client
>>     > > > > > > > > > > > > > > > > > instance
>>     > > > > > > > > > > > > > > > > > > id
>>     > > > > > > > > > > > > > > > > > > >> > via
>>     > > > > > > > > > > > > > > > > > > >> > > > separate registration.
>> Is it
>>     > > > > permitted?
>>     > > > > > If
>>     > > > > > > > OK,
>>     > > > > > > > > > how
>>     > > > > > > > > > > > to
>>     > > > > > > > > > > > > > > > > > distinguish
>>     > > > > > > > > > > > > > > > > > > >> them
>>     > > > > > > > > > > > > > > > > > > >> > > from
>>     > > > > > > > > > > > > > > > > > > >> > > > the case 2 below.
>>     > > > > > > > > > > > > > > > > > > >> > > >
>>     > > > > > > > > > > > > > > > > > > >> > >
>>     > > > > > > > > > > > > > > > > > > >> > > Hi Feng,
>>     > > > > > > > > > > > > > > > > > > >> > >
>>     > > > > > > > > > > > > > > > > > > >> > > My understanding, which
>> Magnus can
>>     > > > > > clarify I
>>     > > > > > > > > > guess,
>>     > > > > > > > > > > is
>>     > > > > > > > > > > > > > that
>>     > > > > > > > > > > > > > > > you
>>     > > > > > > > > > > > > > > > > > > could
>>     > > > > > > > > > > > > > > > > > > >> > have
>>     > > > > > > > > > > > > > > > > > > >> > > something like two Producer
>>     > > instances
>>     > > > > > running
>>     > > > > > > > > with
>>     > > > > > > > > > > the
>>     > > > > > > > > > > > > > same
>>     > > > > > > > > > > > > > > > > > > client.id
>>     > > > > > > > > > > > > > > > > > > >> > > (perhaps because they're
>> using the
>>     > > > same
>>     > > > > > config
>>     > > > > > > > > > file,
>>     > > > > > > > > > > > for
>>     > > > > > > > > > > > > > > > example).
>>     > > > > > > > > > > > > > > > > > > >> They
>>     > > > > > > > > > > > > > > > > > > >> > > could even be in the same
>> process.
>>     > > But
>>     > > > > > they
>>     > > > > > > > > would
>>     > > > > > > > > > > get
>>     > > > > > > > > > > > > > > separate
>>     > > > > > > > > > > > > > > > > > > UUIDs.
>>     > > > > > > > > > > > > > > > > > > >> > >
>>     > > > > > > > > > > > > > > > > > > >> > > I believe Magnus used the
>> term
>>     > > client
>>     > > > to
>>     > > > > > mean
>>     > > > > > > > > > > > "Producer or
>>     > > > > > > > > > > > > > > > > > > Consumer".
>>     > > > > > > > > > > > > > > > > > > >> So
>>     > > > > > > > > > > > > > > > > > > >> > > if you have both a
>> Producer and a
>>     > > > > > Consumer in
>>     > > > > > > > > your
>>     > > > > > > > > > > > > > > > application I
>>     > > > > > > > > > > > > > > > > > > would
>>     > > > > > > > > > > > > > > > > > > >> > > expect you'd get separate
>> UUIDs
>>     > for
>>     > > > > both.
>>     > > > > > > > Again
>>     > > > > > > > > > > > Magnus can
>>     > > > > > > > > > > > > > > > chime
>>     > > > > > > > > > > > > > > > > > in
>>     > > > > > > > > > > > > > > > > > > >> > here, I
>>     > > > > > > > > > > > > > > > > > > >> > > guess.
>>     > > > > > > > > > > > > > > > > > > >> > >
>>     > > > > > > > > > > > > > > > > > > >> >
>>     > > > > > > > > > > > > > > > > > > >> > That's correct.
>>     > > > > > > > > > > > > > > > > > > >> >
>>     > > > > > > > > > > > > > > > > > > >> >
>>     > > > > > > > > > > > > > > > > > > >> > >
>>     > > > > > > > > > > > > > > > > > > >> > > > 2) How about the client
>>     > > restarting?
>>     > > > > > What's
>>     > > > > > > > the
>>     > > > > > > > > > > > > > > expectation?
>>     > > > > > > > > > > > > > > > > > Should
>>     > > > > > > > > > > > > > > > > > > >> the
>>     > > > > > > > > > > > > > > > > > > >> > > > server expect the client
>> to
>>     > carry
>>     > > a
>>     > > > > > > > persisted
>>     > > > > > > > > > > client
>>     > > > > > > > > > > > > > > > instance id
>>     > > > > > > > > > > > > > > > > > > or
>>     > > > > > > > > > > > > > > > > > > >> > > should
>>     > > > > > > > > > > > > > > > > > > >> > > > the client be treated as
>> a new
>>     > > > > instance?
>>     > > > > > > > > > > > > > > > > > > >> > >
>>     > > > > > > > > > > > > > > > > > > >> > > The KIP doesn't describe
>> any
>>     > > mechanism
>>     > > > > for
>>     > > > > > > > > > > > persistence,
>>     > > > > > > > > > > > > > so I
>>     > > > > > > > > > > > > > > > would
>>     > > > > > > > > > > > > > > > > > > >> assume
>>     > > > > > > > > > > > > > > > > > > >> > > that when you restart the
>> client
>>     > you
>>     > > > get
>>     > > > > > a new
>>     > > > > > > > > > > UUID. I
>>     > > > > > > > > > > > > > agree
>>     > > > > > > > > > > > > > > > that
>>     > > > > > > > > > > > > > > > > > it
>>     > > > > > > > > > > > > > > > > > > >> > would
>>     > > > > > > > > > > > > > > > > > > >> > > be good to spell this out.
>>     > > > > > > > > > > > > > > > > > > >> > >
>>     > > > > > > > > > > > > > > > > > > >> > >
>>     > > > > > > > > > > > > > > > > > > >> > Right, it will not be
>> persisted
>>     > since
>>     > > a
>>     > > > > > client
>>     > > > > > > > > > > instance
>>     > > > > > > > > > > > > > can't
>>     > > > > > > > > > > > > > > be
>>     > > > > > > > > > > > > > > > > > > >> restarted.
>>     > > > > > > > > > > > > > > > > > > >> >
>>     > > > > > > > > > > > > > > > > > > >> > Will update the KIP to make
>> this
>>     > > > clearer.
>>     > > > > > > > > > > > > > > > > > > >> >
>>     > > > > > > > > > > > > > > > > > > >> > /Magnus
>>     > > > > > > > > > > > > > > > > > > >> >
>>     > > > > > > > > > > > > > > > > > > >>
>>     > > > > > > > > > > > > > > > > > > >>
>>     > > > > > > > > > > > > > > > > > > >> --
>>     > > > > > > > > > > > > > > > > > > >> Gwen Shapira
>>     > > > > > > > > > > > > > > > > > > >> Engineering Manager | Confluent
>>     > > > > > > > > > > > > > > > > > > >> 650.450.2760 | @gwenshap
>>     > > > > > > > > > > > > > > > > > > >> Follow us: Twitter | blog
>>     > > > > > > > > > > > > > > > > > > >>

Re: [DISCUSS] KIP-714: Client metrics and observability

Posted by Kirk True <ki...@kirktrue.pro>.
Hi Jun,

On Tue, Jun 21, 2022, at 5:24 PM, Jun Rao wrote:
> Hi, Magnus, Kirk,
> 
> Thanks for the reply. A few more comments on your reply.
> 
> 100. I agree there are some benefits of having a set of standard metrics
> across all clients, but I am just wondering how practical it is, given that
> the proposal doesn't require this set like the Kafka protocol.
> 100.1 A client may not implement all or some of the standard metrics. Then,
> we won't have complete standardized names across clients.

True, a client need not implement all the metrics from the KIP. However, those that it does implement will use the names specified in the KIP. The rest of the metrics that a client doesn't implement should be considered as "reserved for future use."

> 100.2 The set of standard metrics needs to be common across all clients.
> For example, client.consumer.poll.latency implies that all clients
> implement a poll() interface. Is that true for all clients?
> client.producer.record.queue.bytes. Do all producers have queues? We
> probably need to make a pass of those metrics to see if they are indeed
> common across all clients.

There are certainly metrics that are not applicable for all client implementations. For example, some of the host-specific CPU timing metrics are "hard" to get on a JVM using standard Java APIs. Ultimately the client author must make a judgement call whether or not to implement a metric. If a given metric from the KIP is truly non-applicable for a client, the author would likely omit it from the client.

Regarding the request to "make a pass" of the clients, are there any client implementations in particular that I should consider reviewing?

I will make an effort to look at some of the more common clients to determine which metrics they expose. I'm a little concerned that this could take an outsized amount of effort, depending on the clients' documentation. Researching the code base of each client to ascertain the exposed metrics sounds very time-consuming.

> Also, a bunch of standard metrics have type
> Histogram. Java client doesn't have good Histogram support yet. I am also
> not sure if all clients support Histogram. Should we avoid Histogram type
> in standardized metrics?

That's a good question. I can try to get a feel for the existing histogram support in the ecosystem clients and report back.

The KIP does specify an alternate means to report histogram data using time-based averages:

"For [simplicity] a client implementation may choose to provide an average value as [a] Gauge instead of a Histogram. These averages should be using the original Histogram metric name + '.avg', e.g., 'client.request.rtt.avg'."

This approach offers lower fidelity, of course, but it's hopefully more useful in general to have _some_ data than _no_ data?

Perhaps we should replace histograms with this simplified implementation in the KIP, deferring proper histogram support to a future revision?
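
To make the fallback concrete, here is a minimal sketch of how a client might report an average in place of a histogram. The class name AverageGauge and its methods are hypothetical and are not part of the KIP or of any existing client API; the only thing taken from the KIP is the '.avg' naming rule. I'm also assuming the average is reset on every collection, e.g., once per push interval.

```java
// Hypothetical helper, for illustration only; not part of the KIP or the Kafka clients.
final class AverageGauge {
    private final String name;
    private double sum;
    private long count;

    // The gauge name is the original Histogram metric name plus ".avg".
    AverageGauge(String histogramName) {
        this.name = histogramName + ".avg";
    }

    String name() {
        return name;
    }

    // Records one observation, e.g., one request round-trip time.
    synchronized void record(double value) {
        sum += value;
        count++;
    }

    // Returns the average of the observations recorded since the previous
    // call and resets the accumulator; NaN if nothing was recorded.
    synchronized double collect() {
        double average = count == 0 ? Double.NaN : sum / count;
        sum = 0;
        count = 0;
        return average;
    }
}
```

A client would call record() wherever it would otherwise update the histogram, and collect() when it assembles the next metrics push. The percentile information is lost, which is the lower fidelity mentioned above.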

> 100.3 For a subset of metrics that are truly common across clients, it
> would be confusing for each client to maintain two sets of metrics for the
> same thing. We could document them, but that means every user of every
> client needs to remember this mapping. This is a much bigger
> inconvenience than standardizing the metric names on the server side. If we
> want to go this route, my preference is to deprecate the existing metric
> names that are covered by the standard metric names.

Ah, good point. I admit my focus is too Java-centric.

I want to make sure I understand more specifically what "the server" is in your point regarding 'standardizing the metric names on the server.' At some point there needs to be code that executes on the server that has knowledge of all the clients' metric names as well as a given organization's preferred metric names. Would this code live in the main Apache Kafka repo? Or is it in the organization's ClientTelemetryReceiver implementation? Or somewhere else?

How about introducing a new pluggable mechanism/interface that the broker invokes to determine the metric name mapping? We could provide two out-of-the-box implementations: 1) a default no-op mapper, and 2) a configuration file-based mapper that operates off something akin to a set of Java properties files (one mapping file for each known client). The implementation of the mapper is configured by the cluster administrator and, of course, each organization can provide their own implementation.
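
As a rough sketch of what I have in mind (in Python for brevity; the real interface would be Java on the broker), something like the following. All of the names here (MetricNameMapper, NoOpMapper, PropertiesFileMapper, parse_properties, and the client keys "java" and "librdkafka") are hypothetical, and the parser handles only a tiny subset of the properties format. The metric names are the illustrative ones from earlier in this thread, not real client metric names.

```python
from typing import Dict, Protocol


class MetricNameMapper(Protocol):
    """Hypothetical plugin interface invoked by the broker."""

    def map_name(self, client: str, metric_name: str) -> str: ...


class NoOpMapper:
    """Default mapper: passes every metric name through unchanged."""

    def map_name(self, client: str, metric_name: str) -> str:
        return metric_name


def parse_properties(text: str) -> Dict[str, str]:
    """Parse 'key = value' lines; skips blank lines and '#' comments."""
    mapping: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            mapping[key.strip()] = value.strip()
    return mapping


class PropertiesFileMapper:
    """Mapper driven by one name mapping per known client."""

    def __init__(self, mappings_by_client: Dict[str, Dict[str, str]]):
        self._mappings = mappings_by_client

    def map_name(self, client: str, metric_name: str) -> str:
        # Unknown clients and unmapped names fall through unchanged.
        return self._mappings.get(client, {}).get(metric_name, metric_name)


# Illustrative mapping content, one entry per client.
JAVA_MAPPING = parse_properties(
    "# Java client\n"
    "connections.open = org.apache.kafka.client.connection.active\n"
)
LIBRDKAFKA_MAPPING = parse_properties(
    "Connections/Alive = org.apache.kafka.client.connection.active\n"
)
mapper = PropertiesFileMapper(
    {"java": JAVA_MAPPING, "librdkafka": LIBRDKAFKA_MAPPING}
)
```

With this shape, both client-specific names resolve to the same standard name on the broker, and anything without a mapping is left alone, so the no-op behavior is the natural fallback.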

> 101. "or if the client-specific metrics are not converted to some common
> form, name, semantic, etc, it'll make creating meaningful aggregations and
> monitoring more complex in the upstream telemetry system with a scattered
> plethora of custom metrics." There will always be client specific metrics.
> So, it seems that we have to deal with scattered custom metrics even with a
> set of standard metrics.

Yes, this is true.

I do believe the KIP should establish a clear means to communicate about the different metrics and their meaning.

When a team is troubleshooting a high-severity incident, these client metrics provide a powerful tool to understand, remediate, and resolve those incidents. The goal of standardizing the metric names is to minimize communication roadblocks in that effort.

> 102. "However, in, let's say the Apache Kafka 3.7 release, the metric name
> is changed to "connections.open.count." At this point, there are two names
> and machine-to-machine communication will likely be affected. With that
> change, all client telemetry plugin(s) used in an organization must be
> updated to reflect that change, else data loss or bugs could be
> introduced." The standard metric names could change too in the future,
> right? So, we need to deal with a similar problem if that happens.

Also true :)

But the metric names, when standardized via a KIP, would undergo a well-known process when being changed in the future. Any metric name changes would be required to be included in a KIP and would require the old and new metric names to co-exist for a period of X releases. This would give teams that are upgrading to newer Kafka versions clear and consistent advance notice to make the needed changes on their end.
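
For illustration only, the co-existence period could look something like the sketch below on the reporting side. The function name with_deprecated_aliases and the keep_old_names flag are made up for this example, and the rename shown is the hypothetical one from my earlier message, not an actual planned change.

```python
from typing import Dict


def with_deprecated_aliases(
    metrics: Dict[str, float],
    renames: Dict[str, str],
    keep_old_names: bool = True,
) -> Dict[str, float]:
    """Report renamed metrics under the new name and, while the
    deprecation window is open, under the old name as well.

    `renames` maps old metric name -> new metric name.
    """
    out: Dict[str, float] = {}
    for name, value in metrics.items():
        new_name = renames.get(name, name)
        out[new_name] = value
        if keep_old_names and new_name != name:
            out[name] = value  # old name still reported during the window
    return out


# Hypothetical rename taken from the example earlier in the thread.
RENAMES = {"connections.open": "connections.open.count"}
```

During the window, consumers of the metrics see both names and can migrate at their own pace; once the window closes, keep_old_names is switched off and only the new name is reported.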

Granted, custom, client-specific metrics don't go through the KIP process. We don't "own" that code or their processes, so any usage of client-specific metrics runs the risk of a caveat emptor situation.

> 103. "Are there any inobvious security/privacy-related edge cases where
> shipping certain metrics to the broker would be "bad?"" I am not sure. But
> if a metric can be shipped to the server, it would be useful for the same
> metric to be visible on the client side.

Agreed. The question is, does the reverse hold true?

Thanks Jun!!!!

Kirk

> 
> Thanks,
> 
> Jun
> 
> 
> On Tue, Jun 21, 2022 at 8:19 AM Kirk True <ki...@kirktrue.pro> wrote:
> 
> > Hi Jun,
> >
> > Thank you for all your continued interest in shaping the KIP :)
> >
> > On Thu, Jun 16, 2022, at 2:38 PM, Jun Rao wrote:
> > > Hi, Kirk,
> > >
> > > Thanks for the reply. A couple of more comments.
> > >
> > > (1) "Another perspective is that these two sets of metrics serve
> > different
> > > purposes and/or have different audiences, which argues that they should
> > > maintain their individuality and purpose. " Hmm, I am wondering if those
> > > metrics are really for different audiences and purposes? For example, if
> > > the operator detected an issue through a client metric collected through
> > > the server, the operator may need to communicate that back to the client.
> > > It would be weird if that same metric is not visible on the client side.
> >
> > I agree in principle that all client metrics visible on the client can
> > also be available to be sent to the broker.
> >
> > Are there any inobvious security/privacy-related edge cases where shipping
> > certain metrics to the broker would be "bad?"
> >
> > > (2) If we could standardize the names on the server side, do we need to
> > > enforce a naming convention for all clients?
> >
> > "Enforce" is such an ugly word :P
> >
> > But yes, I do feel that a consistent naming convention across all clients
> > provides communication benefits between two entities:
> >
> >  1. Human-to-human communication. Ecosystem-wide agreement and
> > understanding of metrics helps all to communicate more efficiently.
> >  2. Machine-to-machine communication. Defining the names via the KIP
> > mechanism helps to ensure stability across releases of a given client.
> >
> > Point 1: Human-to-human Communication
> >
> > There are quite a handful of parties that must communicate effectively
> > across the Kafka ecosystem. Here are the ones I can think of off the top of
> > my head:
> >
> >  1. Kafka client authors
> >  2. Kafka client users
> >  3. Kafka client telemetry plugin authors
> >  4. Support teams (within an organization or vendor-supplied across
> > organizations)
> >  5. Kafka cluster operators
> >
> > There should be a standard so that these parties can understand the
> > metrics' meaning and be able to correlate that across all clients.
> >
> > As a concrete example, KIP-714 includes a metric for tracking the number
> > of active client connections to a cluster, named
> > "org.apache.kafka.client.connection.active." Given this name, all client
> > implementations can communicate this name and its value to all parties
> > consistently. Without a standard naming convention, the metric might be
> > named "connections.open" in the Java client and "Connections/Alive" in
> > librdkafka. This inconsistency of naming would impact the discussions
> > between one or more of the parties involved.
> >
> > To your point, it's absolutely a design choice to keep the naming
> > convention the same between each client. We can change that if it makes
> > sense.
> >
> > Point 2: Machine-to-machine Communication
> >
> > Standardization at the client level provides stability through an implied
> > contract that a client should not introduce a breaking name change between
> > releases. Otherwise, the ability for the metrics to be "understood" in a
> > machine-to-machine context would be forfeit.
> >
> > For example, let's say that we give the clients the latitude to name
> > metrics as they wish. In this example, let's say that the Apache Kafka 3.4
> > release decides to name this metric "connections.open." It's a good name!
> > It says what it is. However, in, let's say the Apache Kafka 3.7 release,
> > the metric name is changed to "connections.open.count." At this point,
> > there are two names and machine-to-machine communication will likely be
> > affected. With that change, all client telemetry plugin(s) used in an
> > organization must be updated to reflect that change, else data loss or bugs
> > could be introduced.
> >
> > That the KIP defines the names of the metrics does, admittedly, constrain
> > the options of authors of the different clients. The metric named
> > "org.apache.kafka.client.connection.active" may be confusing in some client
> > implementations. For whatever reason, a client author may even find it
> > "undesirable" to include a reference that includes "Apache" in their code.
> >
> > There's also the precedent set by the existing (JMX-based) client metrics.
> > Though these are applicable only to the Java client, we can see that having
> > a standardized naming convention there has helped with communication.
> >
> > So, IMO, it makes sense to define the metric names via the KIP mechanism
> > and--let's say, "ask"--that client implementations abide by those.
> >
> > Thanks,
> > Kirk
> >
> > >
> > > Thanks,
> > >
> > > Jun
> > >
> > > On Thu, Jun 16, 2022 at 12:00 PM Kirk True <ki...@kirktrue.pro> wrote:
> > >
> > > > Hi Jun,
> > > >
> > > > I'll try to answer the questions posed...
> > > >
> > > > On Tue, Jun 7, 2022, at 4:32 PM, Jun Rao wrote:
> > > > > Hi, Magnus,
> > > > >
> > > > > Thanks for the reply.
> > > > >
> > > > > So, the standard set of generic metrics is just a recommendation and
> > not
> > > > a
> > > > > requirement? This sounds good to me since it makes the adoption of
> > the
> > > > KIP
> > > > > easier.
> > > >
> > > > I believe that was the intent, yes.
> > > >
> > > > > Regarding the metric names, I have two concerns.
> > > >
> > > > (I'm splitting these two up for readability...)
> > > >
> > > > > (1) If a client already
> > > > > has an existing metric similar to the standard one, duplicating the
> > > > metric
> > > > > seems to be confusing.
> > > >
> > > > Agreed. I'm dealing with that situation as I write the Java client
> > > > implementation.
> > > >
> > > > The existing Java client exposes a set of metrics via JMX. The updated
> > > > Java client will introduce a second set of metrics, which instead are
> > > > exposed via sending them to the broker. There is substantial overlap
> > with
> > > > the two sets of metrics, and in a few places in the code under
> > development,
> > > > there are essentially two separate calls to update metrics: one for the
> > > > JMX-bound metrics and one for the broker-bound metrics.
> > > >
> > > > To be candid, I have gone back-and-forth on that design. From one
> > > > perspective, it could be argued that the set of client metrics should
> > be
> > > > standardized across a given client, regardless of how those metrics are
> > > > exposed for consumption. Another perspective is that these two sets of
> > > > metrics serve different purposes and/or have different audiences, which
> > > > argues that they should maintain their individuality and purpose. Your
> > > > inputs/suggestions are certainly welcome!
> > > >
> > > > > (2) If a client needs to implement a standard metric
> > > > > that doesn't exist yet, using a naming convention (e.g., using dash
> > vs
> > > > dot)
> > > > > different from other existing metrics also seems a bit confusing. It
> > > > seems
> > > > > that the main benefit of having standard metric names across clients
> > is
> > > > for
> > > > > better server side monitoring. Could we do the standardization in the
> > > > > plugin on the server?
> > > >
> > > > I think the expectation is that the plugin implementation will perform
> > > > transformation of metric names, if needed, to fit in with an
> > organization's
> > > > monitoring naming standards. Perhaps we need to call that out in the
> > KIP
> > > > itself.
> > > >
> > > > Thanks,
> > > > Kirk
> > > >
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Jun
> > > > >
> > > > >
> > > > >
> > > > > On Tue, Jun 7, 2022 at 6:53 AM Magnus Edenhill <ma...@edenhill.se>
> > > > wrote:
> > > > >
> > > > > > Hey Jun,
> > > > > >
> > > > > > I've clarified the scope of the standard metrics in the KIP, but
> > > > basically:
> > > > > >
> > > > > >  * We define a standard set of generic metrics that should be
> > relevant
> > > > to
> > > > > > most client implementations, e.g., each producer implementation
> > > > probably
> > > > > > has some sort of per-partition message queue.
> > > > > >  * A client implementation should strive to implement as many of
> > the
> > > > > > standard metrics as possible, but only the ones that make sense.
> > > > > >  * For metrics that are not in the standard set, a client
> > maintainer
> > > > can
> > > > > > choose to either submit a KIP to add additional standard metrics -
> > if
> > > > > > they're relevant, or go ahead and add custom metrics that are
> > specific
> > > > to
> > > > > > that client implementation. These custom metrics will have a prefix
> > > > > > specific to that client implementation, as opposed to the standard
> > > > metric
> > > > > > set that resides under "org.apache.kafka...". E.g.,
> > > > > > "se.edenhill.librdkafka" or whatever.
> > > > > >  * Existing non-KIP-714 metrics should remain untouched. In some
> > cases
> > > > we
> > > > > > might be able to use the same meter given it is compatible with the
> > > > > > standard metric set definition, in other cases a semi-duplicate
> > meter
> > > > may
> > > > > > be needed. Thus this will not affect the metrics exposed through
> > JMX,
> > > > or
> > > > > > vice versa.
> > > > > >
> > > > > > Thanks,
> > > > > > Magnus
> > > > > >
> > > > > >
> > > > > >
> > > > > > Den ons 1 juni 2022 kl 18:55 skrev Jun Rao
> > <ju...@confluent.io.invalid>:
> > > > > >
> > > > > > > Hi, Magnus,
> > > > > > >
> > > > > > > 51. Just to clarify my question.  (1) Are standard metrics
> > required
> > > > for
> > > > > > > every client for this KIP to function?  (2) Are we converting
> > > > existing
> > > > > > java
> > > > > > > metrics to the standard metrics and deprecating the old ones? If
> > so,
> > > > > > could
> > > > > > > we list all existing java metrics that need to be renamed and the
> > > > > > > corresponding new name?
> > > > > > >
> > > > > > > Thanks,
> > > > > > >
> > > > > > > Jun
> > > > > > >
> > > > > > > On Tue, May 31, 2022 at 3:29 PM Jun Rao <ju...@confluent.io>
> > wrote:
> > > > > > >
> > > > > > > > Hi, Magnus,
> > > > > > > >
> > > > > > > > Thanks for the reply.
> > > > > > > >
> > > > > > > > 51. I think it's fine to have a list of recommended metrics for
> > > > every
> > > > > > > > client to implement. I am just not sure that standardizing on
> > the
> > > > > > metric
> > > > > > > > names across all clients is practical. The list of common
> > metrics
> > > > in
> > > > > > the
> > > > > > > > KIP have completely different names from the java metric names.
> > > > Some of
> > > > > > > > them have different types. For example, some of the common
> > metrics
> > > > > > have a
> > > > > > > > type of histogram, but the java client metrics don't use
> > histogram
> > > > in
> > > > > > > > general. Requiring the operator to translate those names and
> > > > understand
> > > > > > > the
> > > > > > > > > subtle differences across clients seems to cause more confusion
> > > > during
> > > > > > > > troubleshooting.
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > >
> > > > > > > > Jun
> > > > > > > >
> > > > > > > > On Tue, May 31, 2022 at 5:02 AM Magnus Edenhill <
> > > > magnus@edenhill.se>
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > >> Den fre 20 maj 2022 kl 01:23 skrev Jun Rao
> > > > <jun@confluent.io.invalid
> > > > > > >:
> > > > > > > >>
> > > > > > > > >> > Hi, Magnus,
> > > > > > > >> >
> > > > > > > >> > Thanks for the reply.
> > > > > > > >> >
> > > > > > > >> > 50. Sounds good.
> > > > > > > >> >
> > > > > > > > >> > 51. I misunderstood the proposal in the KIP then. The
> > > > proposal is
> > > > > > to
> > > > > > > >> > define a set of common metric names that every client should
> > > > > > > implement.
> > > > > > > >> The
> > > > > > > >> > problem is that every client already has its own set of
> > metrics
> > > > with
> > > > > > > its
> > > > > > > >> > own names. I am not sure that we could easily agree upon a
> > > > common
> > > > > > set
> > > > > > > of
> > > > > > > >> > metrics that work with all clients. There are likely to be
> > some
> > > > > > > metrics
> > > > > > > >> > that are client specific. Translating between the common
> > name
> > > > and
> > > > > > > client
> > > > > > > >> > specific name is probably going to add more confusion. As
> > > > mentioned
> > > > > > in
> > > > > > > >> the
> > > > > > > >> > KIP, similar metrics from different clients could have
> > subtle
> > > > > > > >> > semantic differences. Could we just let each client use its
> > own
> > > > set
> > > > > > of
> > > > > > > >> > metric names?
> > > > > > > >> >
> > > > > > > >>
> > > > > > > >> We identified a common set of metrics that should be relevant
> > for
> > > > most
> > > > > > > >> client implementations,
> > > > > > > >> they're the ones listed in the KIP.
> > > > > > > >> A supporting client does not have to implement all those
> > metrics,
> > > > only
> > > > > > > the
> > > > > > > > >> ones that make sense
> > > > > > > >> based on that client implementation, and a client may
> > implement
> > > > other
> > > > > > > >> metrics that are not listed
> > > > > > > >> in the KIP under its own namespace.
> > > > > > > >> This approach has two benefits:
> > > > > > > >>  - there will be a common set of metrics that most/all clients
> > > > > > > implement,
> > > > > > > >> which makes monitoring
> > > > > > > >>   and troubleshooting easier across fleets with multiple Kafka
> > > > client
> > > > > > > >> languages/implementations.
> > > > > > > >>  - client-specific metrics are still possible, so if there is
> > no
> > > > > > > suitable
> > > > > > > >> standard metric a client can still
> > > > > > > >>    provide what special metrics it has.
> > > > > > > >>
> > > > > > > >>
> > > > > > > >> Thanks,
> > > > > > > >> Magnus
> > > > > > > >>
> > > > > > > >>
> > > > > > > >> On Thu, May 19, 2022 at 10:39 AM Magnus Edenhill <
> > > > magnus@edenhill.se>
> > > > > > > >> wrote:
> > > > > > > >> >
> > > > > > > >> > > Den ons 18 maj 2022 kl 19:57 skrev Jun Rao
> > > > > > <jun@confluent.io.invalid
> > > > > > > >> >:
> > > > > > > >> > >
> > > > > > > >> > > > Hi, Magnus,
> > > > > > > >> > > >
> > > > > > > >> > >
> > > > > > > >> > > Hi Jun
> > > > > > > >> > >
> > > > > > > >> > >
> > > > > > > >> > > >
> > > > > > > >> > > > Thanks for the updated KIP. Just a couple of more
> > comments.
> > > > > > > >> > > >
> > > > > > > >> > > > 50. To troubleshoot a particular client issue, I imagine
> > > > that
> > > > > > the
> > > > > > > >> > client
> > > > > > > >> > > > needs to identify its client_instance_id. How does the
> > > > client
> > > > > > find
> > > > > > > >> this
> > > > > > > >> > > > out? Do we plan to include client_instance_id in the
> > client
> > > > log,
> > > > > > > >> expose
> > > > > > > >> > > it
> > > > > > > >> > > > as a metric or something else?
> > > > > > > >> > > >
> > > > > > > >> > >
> > > > > > > >> > > The KIP suggests that client implementations emit an
> > > > informative
> > > > > > log
> > > > > > > >> > > message
> > > > > > > >> > > with the assigned client-instance-id once it is retrieved
> > > > (once
> > > > > > per
> > > > > > > >> > client
> > > > > > > >> > > instance lifetime).
> > > > > > > >> > > There's also a clientInstanceId() method that an
> > application
> > > > can
> > > > > > use
> > > > > > > >> to
> > > > > > > >> > > retrieve
> > > > > > > >> > > the client instance id and emit through whatever side
> > channels
> > > > > > > > >> > > make
> > > > > > > >> > sense.
> > > > > > > >> > >
> > > > > > > >> > >
> > > > > > > >> > >
> > > > > > > >> > > > 51. The KIP lists a bunch of metrics that need to be
> > > > collected
> > > > > > at
> > > > > > > >> the
> > > > > > > >> > > > client side. However, it seems quite a few useful java
> > > > client
> > > > > > > >> metrics
> > > > > > > >> > > like
> > > > > > > >> > > > the following are missing.
> > > > > > > >> > > >     buffer-total-bytes
> > > > > > > >> > > >     buffer-available-bytes
> > > > > > > >> > > >
> > > > > > > >> > >
> > > > > > > >> > > These are covered by client.producer.record.queue.bytes
> > and
> > > > > > > >> > > client.producer.record.queue.max.bytes.
> > > > > > > >> > >
> > > > > > > >> > >
> > > > > > > >> > > >     bufferpool-wait-time
> > > > > > > >> > > >
> > > > > > > >> > >
> > > > > > > >> > > Missing, but somewhat implementation specific.
> > > > > > > >> > > If it was up to me we would add this later if there's a
> > need.
> > > > > > > >> > >
> > > > > > > >> > >
> > > > > > > >> > >
> > > > > > > >> > > >     batch-size-avg
> > > > > > > >> > > >     batch-size-max
> > > > > > > >> > > >
> > > > > > > >> > >
> > > > > > > >> > > These are missing and would be suitably represented as a
> > > > > > histogram.
> > > > > > > >> I'll
> > > > > > > >> > > add them.
> > > > > > > >> > >
> > > > > > > >> > >
> > > > > > > >> > >
> > > > > > > >> > > >     io-wait-ratio
> > > > > > > >> > > >     io-ratio
> > > > > > > >> > > >
> > > > > > > >> > >
> > > > > > > >> > > There's client.io.wait.time which should cover
> > io-wait-ratio.
> > > > > > > >> > > We could add a client.io.time as well, now or in a later
> > KIP.
> > > > > > > >> > >
> > > > > > > >> > > Thanks,
> > > > > > > >> > > Magnus
> > > > > > > >> > >
> > > > > > > >> > >
> > > > > > > >> > >
> > > > > > > >> > >
> > > > > > > >> > > >
> > > > > > > >> > > > Thanks,
> > > > > > > >> > > >
> > > > > > > >> > > > Jun
> > > > > > > >> > > >
> > > > > > > >> > > > On Mon, Apr 4, 2022 at 10:01 AM Jun Rao <
> > jun@confluent.io>
> > > > > > wrote:
> > > > > > > >> > > >
> > > > > > > >> > > > > Hi, Xavier,
> > > > > > > >> > > > >
> > > > > > > >> > > > > Thanks for the reply.
> > > > > > > >> > > > >
> > > > > > > >> > > > > 28. It does seem that we have started using
> > KafkaMetrics
> > > > on
> > > > > > the
> > > > > > > >> > broker
> > > > > > > >> > > > > side. Then, my only concern is on the usage of
> > Histogram
> > > > in
> > > > > > > >> > > KafkaMetrics.
> > > > > > > >> > > > > Histogram in KafkaMetrics statically divides the value
> > > > space
> > > > > > > into
> > > > > > > >> a
> > > > > > > >> > > fixed
> > > > > > > >> > > > > number of buckets and only returns values on the
> > bucket
> > > > > > > boundary.
> > > > > > > >> So,
> > > > > > > >> > > the
> > > > > > > >> > > > > returned histogram value may never show up in a
> > recorded
> > > > > > value.
> > > > > > > >> > Yammer
> > > > > > > >> > > > > Histogram, on the other hand, uses reservoir
> > sampling. The
> > > > > > > >> reported
> > > > > > > >> > > value
> > > > > > > >> > > > > is always one of the recorded values. So, I am not
> > sure
> > > > that
> > > > > > > >> > Histogram
> > > > > > > >> > > in
> > > > > > > >> > > > > KafkaMetrics is as good as Yammer Histogram.
> > > > > > > >> > > > ClientMetricsPluginExportTime
> > > > > > > >> > > > > uses Histogram.
> > > > > > > >> > > > >
> > > > > > > >> > > > > Thanks,
> > > > > > > >> > > > >
> > > > > > > >> > > > > Jun
> > > > > > > >> > > > >
> > > > > > > >> > > > > On Thu, Mar 31, 2022 at 5:21 PM Xavier Léauté
> > > > > > > >> > > > <xa...@confluent.io.invalid>
> > > > > > > >> > > > > wrote:
> > > > > > > >> > > > >
> > > > > > > >> > > > >> >
> > > > > > > >> > > > >> > 28. On the broker, we typically use Yammer metrics.
> > > > Only
> > > > > > for
> > > > > > > >> > metrics
> > > > > > > >> > > > >> that
> > > > > > > >> > > > >> > depend on Kafka metric features (e.g., quota), we
> > use
> > > > the
> > > > > > > Kafka
> > > > > > > >> > > > metric.
> > > > > > > >> > > > >> > Yammer metrics have 4 types: gauge, meter,
> > histogram
> > > > and
> > > > > > > timer.
> > > > > > > >> > > meter
> > > > > > > >> > > > >> > calculates a rate, but also exposes an accumulated
> > > > value.
> > > > > > > >> > > > >> >
> > > > > > > >> > > > >>
> > > > > > > >> > > > >> I don't see a good reason we should limit ourselves
> > to
> > > > Yammer
> > > > > > > >> > metrics
> > > > > > > >> > > on
> > > > > > > >> > > > >> the broker. KafkaMetrics was written
> > > > > > > >> > > > >> to replace Yammer metrics and is used for all new
> > > > components
> > > > > > > >> > (clients,
> > > > > > > >> > > > >> streams, connect, etc.)
> > > > > > > >> > > > >> My understanding is that the original goal was to
> > retire
> > > > > > Yammer
> > > > > > > >> > > metrics
> > > > > > > >> > > > in
> > > > > > > >> > > > >> the broker in favor of KafkaMetrics.
> > > > > > > >> > > > >> We just haven't done so out of backwards
> > compatibility
> > > > > > > concerns.
> > > > > > > >> > > > >> There are other broker metrics such as group
> > coordinator,
> > > > > > > >> > transaction
> > > > > > > >> > > > >> state
> > > > > > > >> > > > >> manager, and various socket server metrics
> > > > > > > >> > > > >> already using KafkaMetrics that don't need specific
> > Kafka
> > > > > > > metric
> > > > > > > >> > > > features,
> > > > > > > >> > > > >> so I don't see why we should refrain from using
> > > > > > > >> > > > >> Kafka metrics on the broker unless there are real
> > > > > > compatibility
> > > > > > > >> > > concerns
> > > > > > > >> > > > >> or
> > > > > > > >> > > > >> where implementation specifics could lead to
> > confusion
> > > > when
> > > > > > > >> > comparing
> > > > > > > >> > > > >> metrics using different implementations.
> > > > > > > >> > > > >>
> > > > > > > >> > > > >> In my opinion we should encourage people to use
> > > > KafkaMetrics
> > > > > > > >> going
> > > > > > > >> > > > forward
> > > > > > > > >> > > > >> on the broker as well, for three reasons:
> > > > > > > >> > > > >> a) yammer metrics is long deprecated and no longer
> > > > maintained
> > > > > > > >> > > > >> b) yammer metrics are much less expressive
> > > > > > > >> > > > >> c) we don't have a proper API to expose yammer
> > metrics
> > > > > > outside
> > > > > > > of
> > > > > > > >> > JMX
> > > > > > > >> > > > >> (MetricsReporter only exposes KafkaMetrics)
> > > > > > > >> > > > >>
> > > > > > > >> > > > >
> > > > > > > >> > > >
> > > > > > > >> > >
> > > > > > > >> >
> > > > > > > >>
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> 

Re: [DISCUSS] KIP-714: Client metrics and observability

Posted by Jun Rao <ju...@confluent.io.INVALID>.
Hi, Magnus, Kirk,

Thanks for the reply. A few more comments on your reply.

100. I agree there are some benefits to having a set of standard metrics
across all clients, but I am wondering how practical it is, given that the
proposal, unlike the Kafka protocol, doesn't require clients to implement
this set.
100.1 A client may not implement all or some of the standard metrics. Then,
we won't have complete standardized names across clients.
100.2 The set of standard metrics needs to be common across all clients.
For example, client.consumer.poll.latency implies that all clients
implement a poll() interface. Is that true for all clients?
Likewise for client.producer.record.queue.bytes: do all producers have
queues? We
probably need to make a pass of those metrics to see if they are indeed
common across all clients. Also, a bunch of standard metrics have type
Histogram. Java client doesn't have good Histogram support yet. I am also
not sure if all clients support Histogram. Should we avoid Histogram type
in standardized metrics?
100.3 For a subset of metrics that are truly common across clients, it
would be confusing for each client to maintain two sets of metrics for the
same thing. We could document them, but that means every user of every
client needs to remember this mapping. This is a much bigger
inconvenience than standardizing the metric names on the server side. If we
want to go this route, my preference is to deprecate the existing metric
names that are covered by the standard metric names.

101. "or if the client-specific metrics are not converted to some common
form, name, semantic, etc, it'll make creating meaningful aggregations and
monitoring more complex in the upstream telemetry system with a scattered
plethora of custom metrics." There will always be client specific metrics.
So, it seems that we have to deal with scattered custom metrics even with a
set of standard metrics.

102. "However, in, let's say the Apache Kafka 3.7 release, the metric name
is changed to "connections.open.count." At this point, there are two names
and machine-to-machine communication will likely be affected. With that
change, all client telemetry plugin(s) used in an organization must be
updated to reflect that change, else data loss or bugs could be
introduced." The standard metric names could change too in the future,
right? So, we need to deal with a similar problem if that happens.

103. "Are there any inobvious security/privacy-related edge cases where
shipping certain metrics to the broker would be "bad?"" I am not sure. But
if a metric can be shipped to the server, it would be useful for the same
metric to be visible on the client side.

Thanks,

Jun


On Tue, Jun 21, 2022 at 8:19 AM Kirk True <ki...@kirktrue.pro> wrote:

> Hi Jun,
>
> Thank you for all your continued interest in shaping the KIP :)
>
> On Thu, Jun 16, 2022, at 2:38 PM, Jun Rao wrote:
> > Hi, Kirk,
> >
> > Thanks for the reply. A couple of more comments.
> >
> > (1) "Another perspective is that these two sets of metrics serve
> different
> > purposes and/or have different audiences, which argues that they should
> > maintain their individuality and purpose. " Hmm, I am wondering if those
> > metrics are really for different audiences and purposes? For example, if
> > the operator detected an issue through a client metric collected through
> > the server, the operator may need to communicate that back to the client.
> > It would be weird if that same metric is not visible on the client side.
>
> I agree in principle that all client metrics visible on the client can
> also be available to be sent to the broker.
>
> Are there any inobvious security/privacy-related edge cases where shipping
> certain metrics to the broker would be "bad?"
>
> > (2) If we could standardize the names on the server side, do we need to
> > enforce a naming convention for all clients?
>
> "Enforce" is such an ugly word :P
>
> But yes, I do feel that a consistent naming convention across all clients
> provides communication benefits between two entities:
>
>  1. Human-to-human communication. Ecosystem-wide agreement and
> understanding of metrics helps all to communicate more efficiently.
>  2. Machine-to-machine communication. Defining the names via the KIP
> mechanism helps to ensure stability across releases of a given client.
>
> Point 1: Human-to-human Communication
>
> There are quite a handful of parties that must communicate effectively
> across the Kafka ecosystem. Here are the ones I can think of off the top of
> my head:
>
>  1. Kafka client authors
>  2. Kafka client users
>  3. Kafka client telemetry plugin authors
>  4. Support teams (within an organization or vendor-supplied across
> organizations)
>  5. Kafka cluster operators
>
> There should be a standard so that these parties can understand the
> metrics' meaning and be able to correlate that across all clients.
>
> As a concrete example, KIP-714 includes a metric for tracking the number
> of active client connections to a cluster, named
> "org.apache.kafka.client.connection.active." Given this name, all client
> implementations can communicate this name and its value to all parties
> consistently. Without a standard naming convention, the metric might be
> named "connections.open" in the Java client and "Connections/Alive" in
> librdkafka. This inconsistency of naming would impact the discussions
> between one or more of the parties involved.
>
> To your point, it's absolutely a design choice to keep the naming
> convention the same between each client. We can change that if it makes
> sense.
>
> Point 2: Machine-to-machine Communication
>
> Standardization at the client level provides stability through an implied
> contract that a client should not introduce a breaking name change between
> releases. Otherwise, the ability for the metrics to be "understood" in a
> machine-to-machine context would be forfeit.
>
> For example, let's say that we give the clients the latitude to name
> metrics as they wish. In this example, let's say that the Apache Kafka 3.4
> release decides to name this metric "connections.open." It's a good name!
> It says what it is. However, in, let's say the Apache Kafka 3.7 release,
> the metric name is changed to "connections.open.count." At this point,
> there are two names and machine-to-machine communication will likely be
> affected. With that change, all client telemetry plugin(s) used in an
> organization must be updated to reflect that change, else data loss or bugs
> could be introduced.
>
> That the KIP defines the names of the metrics does, admittedly, constrain
> the options of authors of the different clients. The metric named
> "org.apache.kafka.client.connection.active" may be confusing in some client
> implementations. For whatever reason, a client author may even find it
> "undesirable" to include a reference that includes "Apache" in their code.
>
> There's also the precedent set by the existing (JMX-based) client metrics.
> Though these are applicable only to the Java client, we can see that having
> a standardized naming convention there has helped with communication.
>
> So, IMO, it makes sense to define the metric names via the KIP mechanism
> and--let's say, "ask"--that client implementations abide by those.
>
> Thanks,
> Kirk
>
> >
> > Thanks,
> >
> > Jun
> >
> > On Thu, Jun 16, 2022 at 12:00 PM Kirk True <ki...@kirktrue.pro> wrote:
> >
> > > Hi Jun,
> > >
> > > I'll try to answer the questions posed...
> > >
> > > On Tue, Jun 7, 2022, at 4:32 PM, Jun Rao wrote:
> > > > Hi, Magnus,
> > > >
> > > > Thanks for the reply.
> > > >
> > > > So, the standard set of generic metrics is just a recommendation and
> not
> > > a
> > > > requirement? This sounds good to me since it makes the adoption of
> the
> > > KIP
> > > > easier.
> > >
> > > I believe that was the intent, yes.
> > >
> > > > Regarding the metric names, I have two concerns.
> > >
> > > (I'm splitting these two up for readability...)
> > >
> > > > (1) If a client already
> > > > has an existing metric similar to the standard one, duplicating the
> > > metric
> > > > seems to be confusing.
> > >
> > > Agreed. I'm dealing with that situation as I write the Java client
> > > implementation.
> > >
> > > The existing Java client exposes a set of metrics via JMX. The updated
> > > Java client will introduce a second set of metrics, which instead are
> > > exposed via sending them to the broker. There is substantial overlap
> with
> > > the two sets of metrics and in a few places in the code under
> development,
> > > there are essentially two separate calls to update metrics: one for the
> > > JMX-bound metrics and one for the broker-bound metrics.
> > >
> > > To be candid, I have gone back-and-forth on that design. From one
> > > perspective, it could be argued that the set of client metrics should
> be
> > > standardized across a given client, regardless of how those metrics are
> > > exposed for consumption. Another perspective is that these two sets of
> > > metrics serve different purposes and/or have different audiences, which
> > > argues that they should maintain their individuality and purpose. Your
> > > inputs/suggestions are certainly welcome!
> > >
> > > > (2) If a client needs to implement a standard metric
> > > > that doesn't exist yet, using a naming convention (e.g., using dash
> vs
> > > dot)
> > > > different from other existing metrics also seems a bit confusing. It
> > > seems
> > > > that the main benefit of having standard metric names across clients
> is
> > > for
> > > > better server side monitoring. Could we do the standardization in the
> > > > plugin on the server?
> > >
> > > I think the expectation is that the plugin implementation will perform
> > > transformation of metric names, if needed, to fit in with an
> organization's
> > > monitoring naming standards. Perhaps we need to call that out in the
> KIP
> > > itself.
> > >
> > > Thanks,
> > > Kirk
> > >
> > > >
> > > > Thanks,
> > > >
> > > > Jun
> > > >
> > > >
> > > >
> > > > On Tue, Jun 7, 2022 at 6:53 AM Magnus Edenhill <ma...@edenhill.se>
> > > wrote:
> > > >
> > > > > Hey Jun,
> > > > >
> > > > > I've clarified the scope of the standard metrics in the KIP, but
> > > basically:
> > > > >
> > > > >  * We define a standard set of generic metrics that should be
> relevant
> > > to
> > > > > most client implementations, e.g., each producer implementation
> > > probably
> > > > > has some sort of per-partition message queue.
> > > > >  * A client implementation should strive to implement as many of
> the
> > > > > standard metrics as possible, but only the ones that make sense.
> > > > >  * For metrics that are not in the standard set, a client
> maintainer
> > > can
> > > > > choose to either submit a KIP to add additional standard metrics -
> if
> > > > > they're relevant, or go ahead and add custom metrics that are
> specific
> > > to
> > > > > that client implementation. These custom metrics will have a prefix
> > > > > specific to that client implementation, as opposed to the standard
> > > metric
> > > > > set that resides under "org.apache.kafka...". E.g.,
> > > > > "se.edenhill.librdkafka" or whatever.
> > > > >  * Existing non-KIP-714 metrics should remain untouched. In some
> cases
> > > we
> > > > > might be able to use the same meter given it is compatible with the
> > > > > standard metric set definition, in other cases a semi-duplicate
> meter
> > > may
> > > > > be needed. Thus this will not affect the metrics exposed through
> JMX,
> > > or
> > > > > vice versa.
> > > > >
> > > > > Thanks,
> > > > > Magnus
> > > > >
> > > > >
> > > > >
> > > > > Den ons 1 juni 2022 kl 18:55 skrev Jun Rao
> <ju...@confluent.io.invalid>:
> > > > >
> > > > > > Hi, Magnus,
> > > > > >
> > > > > > 51. Just to clarify my question.  (1) Are standard metrics
> required
> > > for
> > > > > > every client for this KIP to function?  (2) Are we converting
> > > existing
> > > > > java
> > > > > > metrics to the standard metrics and deprecating the old ones? If
> so,
> > > > > could
> > > > > > we list all existing java metrics that need to be renamed and the
> > > > > > corresponding new name?
> > > > > >
> > > > > > Thanks,
> > > > > >
> > > > > > Jun
> > > > > >
> > > > > > On Tue, May 31, 2022 at 3:29 PM Jun Rao <ju...@confluent.io>
> wrote:
> > > > > >
> > > > > > > Hi, Magnus,
> > > > > > >
> > > > > > > Thanks for the reply.
> > > > > > >
> > > > > > > 51. I think it's fine to have a list of recommended metrics for
> > > every
> > > > > > > client to implement. I am just not sure that standardizing on
> the
> > > > > metric
> > > > > > > names across all clients is practical. The list of common
> metrics
> > > in
> > > > > the
> > > > > > > KIP have completely different names from the java metric names.
> > > Some of
> > > > > > > them have different types. For example, some of the common
> metrics
> > > > > have a
> > > > > > > type of histogram, but the java client metrics don't use
> histogram
> > > in
> > > > > > > general. Requiring the operator to translate those names and
> > > understand
> > > > > > the
> > > > > > > subtle differences across clients seem to cause more confusion
> > > during
> > > > > > > troubleshooting.
> > > > > > >
> > > > > > > Thanks,
> > > > > > >
> > > > > > > Jun
> > > > > > >
> > > > > > > On Tue, May 31, 2022 at 5:02 AM Magnus Edenhill <
> > > magnus@edenhill.se>
> > > > > > > wrote:
> > > > > > >
> > > > > > >> Den fre 20 maj 2022 kl 01:23 skrev Jun Rao
> > > <jun@confluent.io.invalid
> > > > > >:
> > > > > > >>
> > > > > > > >> > Hi, Magnus,
> > > > > > >> >
> > > > > > >> > Thanks for the reply.
> > > > > > >> >
> > > > > > >> > 50. Sounds good.
> > > > > > >> >
> > > > > > > >> > 51. I misunderstood the proposal in the KIP then. The
> > > proposal is
> > > > > to
> > > > > > >> > define a set of common metric names that every client should
> > > > > > implement.
> > > > > > >> The
> > > > > > >> > problem is that every client already has its own set of
> metrics
> > > with
> > > > > > its
> > > > > > >> > own names. I am not sure that we could easily agree upon a
> > > common
> > > > > set
> > > > > > of
> > > > > > >> > metrics that work with all clients. There are likely to be
> some
> > > > > > metrics
> > > > > > >> > that are client specific. Translating between the common
> name
> > > and
> > > > > > client
> > > > > > >> > specific name is probably going to add more confusion. As
> > > mentioned
> > > > > in
> > > > > > >> the
> > > > > > >> > KIP, similar metrics from different clients could have
> subtle
> > > > > > >> > semantic differences. Could we just let each client use its
> own
> > > set
> > > > > of
> > > > > > >> > metric names?
> > > > > > >> >
> > > > > > >>
> > > > > > >> We identified a common set of metrics that should be relevant
> for
> > > most
> > > > > > >> client implementations,
> > > > > > >> they're the ones listed in the KIP.
> > > > > > >> A supporting client does not have to implement all those
> metrics,
> > > only
> > > > > > the
> > > > > > > >> ones that make sense
> > > > > > >> based on that client implementation, and a client may
> implement
> > > other
> > > > > > >> metrics that are not listed
> > > > > > >> in the KIP under its own namespace.
> > > > > > >> This approach has two benefits:
> > > > > > >>  - there will be a common set of metrics that most/all clients
> > > > > > implement,
> > > > > > >> which makes monitoring
> > > > > > >>   and troubleshooting easier across fleets with multiple Kafka
> > > client
> > > > > > >> languages/implementations.
> > > > > > >>  - client-specific metrics are still possible, so if there is
> no
> > > > > > suitable
> > > > > > >> standard metric a client can still
> > > > > > >>    provide what special metrics it has.
> > > > > > >>
> > > > > > >>
> > > > > > >> Thanks,
> > > > > > >> Magnus
> > > > > > >>
> > > > > > >>
> > > > > > >> On Thu, May 19, 2022 at 10:39 AM Magnus Edenhill <
> > > magnus@edenhill.se>
> > > > > > >> wrote:
> > > > > > >> >
> > > > > > >> > > Den ons 18 maj 2022 kl 19:57 skrev Jun Rao
> > > > > <jun@confluent.io.invalid
> > > > > > >> >:
> > > > > > >> > >
> > > > > > >> > > > Hi, Magnus,
> > > > > > >> > > >
> > > > > > >> > >
> > > > > > >> > > Hi Jun
> > > > > > >> > >
> > > > > > >> > >
> > > > > > >> > > >
> > > > > > >> > > > Thanks for the updated KIP. Just a couple of more
> comments.
> > > > > > >> > > >
> > > > > > >> > > > 50. To troubleshoot a particular client issue, I imagine
> > > that
> > > > > the
> > > > > > >> > client
> > > > > > >> > > > needs to identify its client_instance_id. How does the
> > > client
> > > > > find
> > > > > > >> this
> > > > > > >> > > > out? Do we plan to include client_instance_id in the
> client
> > > log,
> > > > > > >> expose
> > > > > > >> > > it
> > > > > > >> > > > as a metric or something else?
> > > > > > >> > > >
> > > > > > >> > >
> > > > > > >> > > The KIP suggests that client implementations emit an
> > > informative
> > > > > log
> > > > > > >> > > message
> > > > > > >> > > with the assigned client-instance-id once it is retrieved
> > > (once
> > > > > per
> > > > > > >> > client
> > > > > > >> > > instance lifetime).
> > > > > > >> > > There's also a clientInstanceId() method that an
> application
> > > can
> > > > > use
> > > > > > >> to
> > > > > > >> > > retrieve
> > > > > > >> > > the client instance id and emit through whatever side
> channels
> > > > > makes
> > > > > > >> > sense.
> > > > > > >> > >
> > > > > > >> > >
> > > > > > >> > >
> > > > > > >> > > > 51. The KIP lists a bunch of metrics that need to be
> > > collected
> > > > > at
> > > > > > >> the
> > > > > > >> > > > client side. However, it seems quite a few useful java
> > > client
> > > > > > >> metrics
> > > > > > >> > > like
> > > > > > >> > > > the following are missing.
> > > > > > >> > > >     buffer-total-bytes
> > > > > > >> > > >     buffer-available-bytes
> > > > > > >> > > >
> > > > > > >> > >
> > > > > > >> > > These are covered by client.producer.record.queue.bytes
> and
> > > > > > >> > > client.producer.record.queue.max.bytes.
> > > > > > >> > >
> > > > > > >> > >
> > > > > > >> > > >     bufferpool-wait-time
> > > > > > >> > > >
> > > > > > >> > >
> > > > > > >> > > Missing, but somewhat implementation specific.
> > > > > > >> > > If it was up to me we would add this later if there's a
> need.
> > > > > > >> > >
> > > > > > >> > >
> > > > > > >> > >
> > > > > > >> > > >     batch-size-avg
> > > > > > >> > > >     batch-size-max
> > > > > > >> > > >
> > > > > > >> > >
> > > > > > >> > > These are missing and would be suitably represented as a
> > > > > histogram.
> > > > > > >> I'll
> > > > > > >> > > add them.
> > > > > > >> > >
> > > > > > >> > >
> > > > > > >> > >
> > > > > > >> > > >     io-wait-ratio
> > > > > > >> > > >     io-ratio
> > > > > > >> > > >
> > > > > > >> > >
> > > > > > >> > > There's client.io.wait.time which should cover
> io-wait-ratio.
> > > > > > >> > > We could add a client.io.time as well, now or in a later
> KIP.
> > > > > > >> > >
> > > > > > >> > > Thanks,
> > > > > > >> > > Magnus
> > > > > > >> > >
> > > > > > >> > >
> > > > > > >> > >
> > > > > > >> > >
> > > > > > >> > > >
> > > > > > >> > > > Thanks,
> > > > > > >> > > >
> > > > > > >> > > > Jun
> > > > > > >> > > >
> > > > > > >> > > > On Mon, Apr 4, 2022 at 10:01 AM Jun Rao <
> jun@confluent.io>
> > > > > wrote:
> > > > > > >> > > >
> > > > > > >> > > > > Hi, Xavier,
> > > > > > >> > > > >
> > > > > > >> > > > > Thanks for the reply.
> > > > > > >> > > > >
> > > > > > >> > > > > 28. It does seem that we have started using
> KafkaMetrics
> > > on
> > > > > the
> > > > > > >> > broker
> > > > > > >> > > > > side. Then, my only concern is on the usage of
> Histogram
> > > in
> > > > > > >> > > KafkaMetrics.
> > > > > > >> > > > > Histogram in KafkaMetrics statically divides the value
> > > space
> > > > > > into
> > > > > > >> a
> > > > > > >> > > fixed
> > > > > > >> > > > > number of buckets and only returns values on the
> bucket
> > > > > > boundary.
> > > > > > >> So,
> > > > > > >> > > the
> > > > > > >> > > > > returned histogram value may never show up in a
> recorded
> > > > > value.
> > > > > > >> > Yammer
> > > > > > >> > > > > Histogram, on the other hand, uses reservoir
> sampling. The
> > > > > > >> reported
> > > > > > >> > > value
> > > > > > >> > > > > is always one of the recorded values. So, I am not
> sure
> > > that
> > > > > > >> > Histogram
> > > > > > >> > > in
> > > > > > >> > > > > KafkaMetrics is as good as Yammer Histogram.
> > > > > > >> > > > ClientMetricsPluginExportTime
> > > > > > >> > > > > uses Histogram.
> > > > > > >> > > > >
> > > > > > >> > > > > Thanks,
> > > > > > >> > > > >
> > > > > > >> > > > > Jun
> > > > > > >> > > > >
> > > > > > >> > > > > On Thu, Mar 31, 2022 at 5:21 PM Xavier Léauté
> > > > > > >> > > > <xa...@confluent.io.invalid>
> > > > > > >> > > > > wrote:
> > > > > > >> > > > >
> > > > > > >> > > > >> >
> > > > > > >> > > > >> > 28. On the broker, we typically use Yammer metrics.
> > > Only
> > > > > for
> > > > > > >> > metrics
> > > > > > >> > > > >> that
> > > > > > >> > > > >> > depend on Kafka metric features (e.g., quota), we
> use
> > > the
> > > > > > Kafka
> > > > > > >> > > > metric.
> > > > > > >> > > > >> > Yammer metrics have 4 types: gauge, meter,
> histogram
> > > and
> > > > > > timer.
> > > > > > >> > > meter
> > > > > > >> > > > >> > calculates a rate, but also exposes an accumulated
> > > value.
> > > > > > >> > > > >> >
> > > > > > >> > > > >>
> > > > > > >> > > > >> I don't see a good reason we should limit ourselves
> to
> > > Yammer
> > > > > > >> > metrics
> > > > > > >> > > on
> > > > > > >> > > > >> the broker. KafkaMetrics was written
> > > > > > >> > > > >> to replace Yammer metrics and is used for all new
> > > components
> > > > > > >> > (clients,
> > > > > > >> > > > >> streams, connect, etc.)
> > > > > > >> > > > >> My understanding is that the original goal was to
> retire
> > > > > Yammer
> > > > > > >> > > metrics
> > > > > > >> > > > in
> > > > > > >> > > > >> the broker in favor of KafkaMetrics.
> > > > > > >> > > > >> We just haven't done so out of backwards
> compatibility
> > > > > > concerns.
> > > > > > >> > > > >> There are other broker metrics such as group
> coordinator,
> > > > > > >> > transaction
> > > > > > >> > > > >> state
> > > > > > >> > > > >> manager, and various socket server metrics
> > > > > > >> > > > >> already using KafkaMetrics that don't need specific
> Kafka
> > > > > > metric
> > > > > > >> > > > features,
> > > > > > >> > > > >> so I don't see why we should refrain from using
> > > > > > >> > > > >> Kafka metrics on the broker unless there are real
> > > > > compatibility
> > > > > > >> > > concerns
> > > > > > >> > > > >> or
> > > > > > >> > > > >> where implementation specifics could lead to
> confusion
> > > when
> > > > > > >> > comparing
> > > > > > >> > > > >> metrics using different implementations.
> > > > > > >> > > > >>
> > > > > > >> > > > >> In my opinion we should encourage people to use
> > > KafkaMetrics
> > > > > > >> going
> > > > > > >> > > > forward
> > > > > > > >> > > > >> on the broker as well, for three reasons:
> > > > > > >> > > > >> a) yammer metrics is long deprecated and no longer
> > > maintained
> > > > > > >> > > > >> b) yammer metrics are much less expressive
> > > > > > >> > > > >> c) we don't have a proper API to expose yammer
> metrics
> > > > > outside
> > > > > > of
> > > > > > >> > JMX
> > > > > > >> > > > >> (MetricsReporter only exposes KafkaMetrics)
> > > > > > >> > > > >>
> > > > > > >> > > > >
> > > > > > >> > > >
> > > > > > >> > >
> > > > > > >> >
> > > > > > >>
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] KIP-714: Client metrics and observability

Posted by Kirk True <ki...@kirktrue.pro>.
Hi Jun,

Thank you for all your continued interest in shaping the KIP :)

On Thu, Jun 16, 2022, at 2:38 PM, Jun Rao wrote:
> Hi, Kirk,
> 
> Thanks for the reply. A couple of more comments.
> 
> (1) "Another perspective is that these two sets of metrics serve different
> purposes and/or have different audiences, which argues that they should
> maintain their individuality and purpose. " Hmm, I am wondering if those
> metrics are really for different audiences and purposes? For example, if
> the operator detected an issue through a client metric collected through
> the server, the operator may need to communicate that back to the client.
> It would be weird if that same metric is not visible on the client side.

I agree in principle that all client metrics visible on the client can also be made available to send to the broker.

Are there any non-obvious security/privacy-related edge cases where shipping certain metrics to the broker would be "bad?"

> (2) If we could standardize the names on the server side, do we need to
> enforce a naming convention for all clients?

"Enforce" is such an ugly word :P

But yes, I do feel that a consistent naming convention across all clients benefits two kinds of communication:

 1. Human-to-human communication. Ecosystem-wide agreement and understanding of metrics helps everyone communicate more efficiently.
 2. Machine-to-machine communication. Defining the names via the KIP mechanism helps to ensure stability across releases of a given client.

Point 1: Human-to-human Communication

There are quite a few parties that must communicate effectively across the Kafka ecosystem. Here are the ones I can think of off the top of my head:

 1. Kafka client authors
 2. Kafka client users
 3. Kafka client telemetry plugin authors
 4. Support teams (within an organization or vendor-supplied across organizations)
 5. Kafka cluster operators

There should be a standard so that these parties can understand the metrics' meaning and be able to correlate that across all clients.

As a concrete example, KIP-714 includes a metric for tracking the number of active client connections to a cluster, named "org.apache.kafka.client.connection.active." Given this name, all client implementations can communicate this name and its value to all parties consistently. Without a standard naming convention, the metric might be named "connections.open" in the Java client and "Connections/Alive" in librdkafka. This inconsistency of naming would impact the discussions between one or more of the parties involved.
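
To make the cost of inconsistent naming concrete, here is a minimal sketch of the alias table a broker-side plugin might need if each client kept its own names. It is only an illustration: the client identifiers ("java", "librdkafka"), the ALIASES table, and the to_standard_name() helper are all hypothetical and not part of the KIP or of any client; the two client-native names are the made-up ones from the paragraph above.

```python
# Hypothetical alias table: (client implementation, client-native name) -> standard name.
# None of these identifiers come from the KIP; they exist only for this sketch.
STANDARD_CONNECTION_ACTIVE = "org.apache.kafka.client.connection.active"

ALIASES = {
    ("java", "connections.open"): STANDARD_CONNECTION_ACTIVE,
    ("librdkafka", "Connections/Alive"): STANDARD_CONNECTION_ACTIVE,
}


def to_standard_name(client_impl, metric_name):
    """Return the standard name for a client-native metric name.

    Names already in the standard namespace pass through unchanged.
    Unknown names return None so the caller can decide how to handle them.
    """
    if metric_name.startswith("org.apache.kafka."):
        return metric_name
    return ALIASES.get((client_impl, metric_name))
```

Every new client implementation, and every new metric, adds rows to a table like this in every plugin. With a shared name, the table is unnecessary.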

To your point, it's absolutely a design choice to keep the naming convention the same between each client. We can change that if it makes sense.

Point 2: Machine-to-machine Communication

Standardization at the client level provides stability through an implied contract that a client should not introduce a breaking name change between releases. Otherwise, the ability for the metrics to be "understood" in a machine-to-machine context would be forfeit.

For example, let's say that we give the clients the latitude to name metrics as they wish. In this example, let's say that the Apache Kafka 3.4 release decides to name this metric "connections.open." It's a good name! It says what it is. However, in, let's say, the Apache Kafka 3.7 release, the metric name is changed to "connections.open.count." At this point, there are two names and machine-to-machine communication will likely be affected. With that change, all client telemetry plugin(s) used in an organization must be updated to reflect that change, else data loss or bugs could be introduced.
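
A rough sketch of how that data loss could happen, assuming a plugin that only forwards metric names it was written against. The KNOWN_METRICS set, the split_known_and_unknown() helper, and the payload shape (a plain name-to-value mapping) are hypothetical and chosen for illustration; they are not the plugin API proposed in the KIP.

```python
# Metric names this hypothetical plugin was written against (the "3.4" naming).
KNOWN_METRICS = {"connections.open"}


def split_known_and_unknown(pushed_metrics):
    """Split pushed metrics into those the plugin understands and those it does not."""
    known = {name: value for name, value in pushed_metrics.items() if name in KNOWN_METRICS}
    unknown = sorted(name for name in pushed_metrics if name not in KNOWN_METRICS)
    return known, unknown


# Payload from a client that uses the original name.
old_client = {"connections.open": 3}
# Payload from a client released after the hypothetical rename.
new_client = {"connections.open.count": 3}
```

For old_client the value is forwarded as expected. For new_client nothing is forwarded and the renamed metric only shows up in the unknown list, so unless the plugin reports unknown names somewhere, the measurement is silently dropped.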

That the KIP defines the names of the metrics does, admittedly, constrain the options of authors of the different clients. The metric named "org.apache.kafka.client.connection.active" may be confusing in some client implementations. For whatever reason, a client author may even find it "undesirable" to include a reference that includes "Apache" in their code.

There's also the precedent set by the existing (JMX-based) client metrics. Though these are applicable only to the Java client, we can see that having a standardized naming convention there has helped with communication.

So, IMO, it makes sense to define the metric names via the KIP mechanism and--let's say, "ask"--that client implementations abide by those.

Thanks,
Kirk

> 
> Thanks,
> 
> Jun
> 
> On Thu, Jun 16, 2022 at 12:00 PM Kirk True <ki...@kirktrue.pro> wrote:
> 
> > Hi Jun,
> >
> > I'll try to answer the questions posed...
> >
> > On Tue, Jun 7, 2022, at 4:32 PM, Jun Rao wrote:
> > > Hi, Magnus,
> > >
> > > Thanks for the reply.
> > >
> > > So, the standard set of generic metrics is just a recommendation and not
> > a
> > > requirement? This sounds good to me since it makes the adoption of the
> > KIP
> > > easier.
> >
> > I believe that was the intent, yes.
> >
> > > Regarding the metric names, I have two concerns.
> >
> > (I'm splitting these two up for readability...)
> >
> > > (1) If a client already
> > > has an existing metric similar to the standard one, duplicating the
> > metric
> > > seems to be confusing.
> >
> > Agreed. I'm dealing with that situation as I write the Java client
> > implementation.
> >
> > The existing Java client exposes a set of metrics via JMX. The updated
> > Java client will introduce a second set of metrics, which instead are
> > exposed via sending them to the broker. There is substantial overlap with
> > the two sets of metrics and in a few places in the code under development,
> > there are essentially two separate calls to update metrics: one for the
> > JMX-bound metrics and one for the broker-bound metrics.
> >
> > To be candid, I have gone back-and-forth on that design. From one
> > perspective, it could be argued that the set of client metrics should be
> > standardized across a given client, regardless of how those metrics are
> > exposed for consumption. Another perspective is that these two sets of
> > metrics serve different purposes and/or have different audiences, which
> > argues that they should maintain their individuality and purpose. Your
> > inputs/suggestions are certainly welcome!
> >
> > > (2) If a client needs to implement a standard metric
> > > that doesn't exist yet, using a naming convention (e.g., using dash vs
> > dot)
> > > different from other existing metrics also seems a bit confusing. It
> > seems
> > > that the main benefit of having standard metric names across clients is
> > for
> > > better server side monitoring. Could we do the standardization in the
> > > plugin on the server?
> >
> > I think the expectation is that the plugin implementation will perform
> > transformation of metric names, if needed, to fit in with an organization's
> > monitoring naming standards. Perhaps we need to call that out in the KIP
> > itself.
> >
> > Thanks,
> > Kirk
> >
> > >
> > > Thanks,
> > >
> > > Jun
> > >
> > >
> > >
> > > On Tue, Jun 7, 2022 at 6:53 AM Magnus Edenhill <ma...@edenhill.se>
> > wrote:
> > >
> > > > Hey Jun,
> > > >
> > > > I've clarified the scope of the standard metrics in the KIP, but
> > basically:
> > > >
> > > >  * We define a standard set of generic metrics that should be relevant
> > to
> > > > most client implementations, e.g., each producer implementation
> > probably
> > > > has some sort of per-partition message queue.
> > > >  * A client implementation should strive to implement as many of the
> > > > standard metrics as possible, but only the ones that make sense.
> > > >  * For metrics that are not in the standard set, a client maintainer
> > can
> > > > choose to either submit a KIP to add additional standard metrics - if
> > > > they're relevant, or go ahead and add custom metrics that are specific
> > to
> > > > that client implementation. These custom metrics will have a prefix
> > > > specific to that client implementation, as opposed to the standard
> > metric
> > > > set that resides under "org.apache.kafka...". E.g.,
> > > > "se.edenhill.librdkafka" or whatever.
> > > >  * Existing non-KIP-714 metrics should remain untouched. In some cases
> > we
> > > > might be able to use the same meter given it is compatible with the
> > > > standard metric set definition, in other cases a semi-duplicate meter
> > may
> > > > be needed. Thus this will not affect the metrics exposed through JMX,
> > or
> > > > vice versa.
> > > >
> > > > Thanks,
> > > > Magnus
> > > >
> > > >
> > > >
> > > > Den ons 1 juni 2022 kl 18:55 skrev Jun Rao <ju...@confluent.io.invalid>:
> > > >
> > > > > Hi, Magnus,
> > > > >
> > > > > 51. Just to clarify my question.  (1) Are standard metrics required
> > for
> > > > > every client for this KIP to function?  (2) Are we converting
> > existing
> > > > java
> > > > > metrics to the standard metrics and deprecating the old ones? If so,
> > > > could
> > > > > we list all existing java metrics that need to be renamed and the
> > > > > corresponding new name?
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Jun
> > > > >
> > > > > On Tue, May 31, 2022 at 3:29 PM Jun Rao <ju...@confluent.io> wrote:
> > > > >
> > > > > > Hi, Magnus,
> > > > > >
> > > > > > Thanks for the reply.
> > > > > >
> > > > > > 51. I think it's fine to have a list of recommended metrics for
> > every
> > > > > > client to implement. I am just not sure that standardizing on the
> > > > metric
> > > > > > names across all clients is practical. The list of common metrics
> > in
> > > > the
> > > > > > KIP have completely different names from the java metric names.
> > Some of
> > > > > > them have different types. For example, some of the common metrics
> > > > have a
> > > > > > type of histogram, but the java client metrics don't use histogram
> > in
> > > > > > general. Requiring the operator to translate those names and
> > understand
> > > > > the
> > > > > > subtle differences across clients seem to cause more confusion
> > during
> > > > > > troubleshooting.
> > > > > >
> > > > > > Thanks,
> > > > > >
> > > > > > Jun
> > > > > >
> > > > > > On Tue, May 31, 2022 at 5:02 AM Magnus Edenhill <
> > magnus@edenhill.se>
> > > > > > wrote:
> > > > > >
> > > > > >> Den fre 20 maj 2022 kl 01:23 skrev Jun Rao
> > <jun@confluent.io.invalid
> > > > >:
> > > > > >>
> > > > > > >> > Hi, Magnus,
> > > > > >> >
> > > > > >> > Thanks for the reply.
> > > > > >> >
> > > > > >> > 50. Sounds good.
> > > > > >> >
> > > > > > >> > 51. I misunderstood the proposal in the KIP then. The
> > proposal is
> > > > to
> > > > > >> > define a set of common metric names that every client should
> > > > > implement.
> > > > > >> The
> > > > > >> > problem is that every client already has its own set of metrics
> > with
> > > > > its
> > > > > >> > own names. I am not sure that we could easily agree upon a
> > common
> > > > set
> > > > > of
> > > > > >> > metrics that work with all clients. There are likely to be some
> > > > > metrics
> > > > > >> > that are client specific. Translating between the common name
> > and
> > > > > client
> > > > > >> > specific name is probably going to add more confusion. As
> > mentioned
> > > > in
> > > > > >> the
> > > > > >> > KIP, similar metrics from different clients could have subtle
> > > > > >> > semantic differences. Could we just let each client use its own
> > set
> > > > of
> > > > > >> > metric names?
> > > > > >> >
> > > > > >>
> > > > > >> We identified a common set of metrics that should be relevant for
> > most
> > > > > >> client implementations,
> > > > > >> they're the ones listed in the KIP.
> > > > > >> A supporting client does not have to implement all those metrics,
> > only
> > > > > the
> > > > > > >> ones that make sense
> > > > > >> based on that client implementation, and a client may implement
> > other
> > > > > >> metrics that are not listed
> > > > > >> in the KIP under its own namespace.
> > > > > >> This approach has two benefits:
> > > > > >>  - there will be a common set of metrics that most/all clients
> > > > > implement,
> > > > > >> which makes monitoring
> > > > > >>   and troubleshooting easier across fleets with multiple Kafka
> > client
> > > > > >> languages/implementations.
> > > > > >>  - client-specific metrics are still possible, so if there is no
> > > > > suitable
> > > > > >> standard metric a client can still
> > > > > >>    provide what special metrics it has.
> > > > > >>
> > > > > >>
> > > > > >> Thanks,
> > > > > >> Magnus
> > > > > >>
> > > > > >>
> > > > > >> On Thu, May 19, 2022 at 10:39 AM Magnus Edenhill <
> > magnus@edenhill.se>
> > > > > >> wrote:
> > > > > >> >
> > > > > >> > > Den ons 18 maj 2022 kl 19:57 skrev Jun Rao
> > > > <jun@confluent.io.invalid
> > > > > >> >:
> > > > > >> > >
> > > > > >> > > > Hi, Magnus,
> > > > > >> > > >
> > > > > >> > >
> > > > > >> > > Hi Jun
> > > > > >> > >
> > > > > >> > >
> > > > > >> > > >
> > > > > >> > > > Thanks for the updated KIP. Just a couple of more comments.
> > > > > >> > > >
> > > > > >> > > > 50. To troubleshoot a particular client issue, I imagine
> > that
> > > > the
> > > > > >> > client
> > > > > >> > > > needs to identify its client_instance_id. How does the
> > client
> > > > find
> > > > > >> this
> > > > > >> > > > out? Do we plan to include client_instance_id in the client
> > log,
> > > > > >> expose
> > > > > >> > > it
> > > > > >> > > > as a metric or something else?
> > > > > >> > > >
> > > > > >> > >
> > > > > >> > > The KIP suggests that client implementations emit an
> > informative
> > > > log
> > > > > >> > > message
> > > > > >> > > with the assigned client-instance-id once it is retrieved
> > (once
> > > > per
> > > > > >> > client
> > > > > >> > > instance lifetime).
> > > > > >> > > There's also a clientInstanceId() method that an application
> > can
> > > > use
> > > > > >> to
> > > > > >> > > retrieve
> > > > > >> > > the client instance id and emit through whatever side channels
> > > > makes
> > > > > >> > sense.
> > > > > >> > >
> > > > > >> > >
> > > > > >> > >
> > > > > >> > > > 51. The KIP lists a bunch of metrics that need to be
> > collected
> > > > at
> > > > > >> the
> > > > > >> > > > client side. However, it seems quite a few useful java
> > client
> > > > > >> metrics
> > > > > >> > > like
> > > > > >> > > > the following are missing.
> > > > > >> > > >     buffer-total-bytes
> > > > > >> > > >     buffer-available-bytes
> > > > > >> > > >
> > > > > >> > >
> > > > > >> > > These are covered by client.producer.record.queue.bytes and
> > > > > >> > > client.producer.record.queue.max.bytes.
> > > > > >> > >
> > > > > >> > >
> > > > > >> > > >     bufferpool-wait-time
> > > > > >> > > >
> > > > > >> > >
> > > > > >> > > Missing, but somewhat implementation specific.
> > > > > >> > > If it was up to me we would add this later if there's a need.
> > > > > >> > >
> > > > > >> > >
> > > > > >> > >
> > > > > >> > > >     batch-size-avg
> > > > > >> > > >     batch-size-max
> > > > > >> > > >
> > > > > >> > >
> > > > > >> > > These are missing and would be suitably represented as a
> > > > histogram.
> > > > > >> I'll
> > > > > >> > > add them.
> > > > > >> > >
> > > > > >> > >
> > > > > >> > >
> > > > > >> > > >     io-wait-ratio
> > > > > >> > > >     io-ratio
> > > > > >> > > >
> > > > > >> > >
> > > > > >> > > There's client.io.wait.time which should cover io-wait-ratio.
> > > > > >> > > We could add a client.io.time as well, now or in a later KIP.
> > > > > >> > >
> > > > > >> > > Thanks,
> > > > > >> > > Magnus
> > > > > >> > >
> > > > > >> > >
> > > > > >> > >
> > > > > >> > >
> > > > > >> > > >
> > > > > >> > > > Thanks,
> > > > > >> > > >
> > > > > >> > > > Jun
> > > > > >> > > >
> > > > > >> > > > On Mon, Apr 4, 2022 at 10:01 AM Jun Rao <ju...@confluent.io>
> > > > wrote:
> > > > > >> > > >
> > > > > >> > > > > Hi, Xavier,
> > > > > >> > > > >
> > > > > >> > > > > Thanks for the reply.
> > > > > >> > > > >
> > > > > >> > > > > 28. It does seem that we have started using KafkaMetrics
> > on
> > > > the
> > > > > >> > broker
> > > > > >> > > > > side. Then, my only concern is on the usage of Histogram
> > in
> > > > > >> > > KafkaMetrics.
> > > > > >> > > > > Histogram in KafkaMetrics statically divides the value
> > space
> > > > > into
> > > > > >> a
> > > > > >> > > fixed
> > > > > >> > > > > number of buckets and only returns values on the bucket
> > > > > boundary.
> > > > > >> So,
> > > > > >> > > the
> > > > > >> > > > > returned histogram value may never show up in a recorded
> > > > value.
> > > > > >> > Yammer
> > > > > >> > > > > Histogram, on the other hand, uses reservoir sampling. The
> > > > > >> reported
> > > > > >> > > value
> > > > > >> > > > > is always one of the recorded values. So, I am not sure
> > that
> > > > > >> > Histogram
> > > > > >> > > in
> > > > > >> > > > > KafkaMetrics is as good as Yammer Histogram.
> > > > > >> > > > ClientMetricsPluginExportTime
> > > > > >> > > > > uses Histogram.
> > > > > >> > > > >
> > > > > >> > > > > Thanks,
> > > > > >> > > > >
> > > > > >> > > > > Jun
> > > > > >> > > > >
> > > > > >> > > > > On Thu, Mar 31, 2022 at 5:21 PM Xavier Léauté
> > > > > >> > > > <xa...@confluent.io.invalid>
> > > > > >> > > > > wrote:
> > > > > >> > > > >
> > > > > >> > > > >> >
> > > > > >> > > > >> > 28. On the broker, we typically use Yammer metrics.
> > Only
> > > > for
> > > > > >> > metrics
> > > > > >> > > > >> that
> > > > > >> > > > >> > depend on Kafka metric features (e.g., quota), we use
> > the
> > > > > Kafka
> > > > > >> > > > metric.
> > > > > >> > > > >> > Yammer metrics have 4 types: gauge, meter, histogram
> > and
> > > > > timer.
> > > > > >> > > meter
> > > > > >> > > > >> > calculates a rate, but also exposes an accumulated
> > value.
> > > > > >> > > > >> >
> > > > > >> > > > >>
> > > > > >> > > > >> I don't see a good reason we should limit ourselves to
> > Yammer
> > > > > >> > metrics
> > > > > >> > > on
> > > > > >> > > > >> the broker. KafkaMetrics was written
> > > > > >> > > > >> to replace Yammer metrics and is used for all new
> > components
> > > > > >> > (clients,
> > > > > >> > > > >> streams, connect, etc.)
> > > > > >> > > > >> My understanding is that the original goal was to retire
> > > > Yammer
> > > > > >> > > metrics
> > > > > >> > > > in
> > > > > >> > > > >> the broker in favor of KafkaMetrics.
> > > > > >> > > > >> We just haven't done so out of backwards compatibility
> > > > > concerns.
> > > > > >> > > > >> There are other broker metrics such as group coordinator,
> > > > > >> > transaction
> > > > > >> > > > >> state
> > > > > >> > > > >> manager, and various socket server metrics
> > > > > >> > > > >> already using KafkaMetrics that don't need specific Kafka
> > > > > metric
> > > > > >> > > > features,
> > > > > >> > > > >> so I don't see why we should refrain from using
> > > > > >> > > > >> Kafka metrics on the broker unless there are real
> > > > compatibility
> > > > > >> > > concerns
> > > > > >> > > > >> or
> > > > > >> > > > >> where implementation specifics could lead to confusion
> > when
> > > > > >> > comparing
> > > > > >> > > > >> metrics using different implementations.
> > > > > >> > > > >>
> > > > > >> > > > >> In my opinion we should encourage people to use
> > KafkaMetrics
> > > > > >> going
> > > > > >> > > > forward
> > > > > > >> > > > >> on the broker as well, for three reasons:
> > > > > >> > > > >> a) yammer metrics is long deprecated and no longer
> > maintained
> > > > > >> > > > >> b) yammer metrics are much less expressive
> > > > > >> > > > >> c) we don't have a proper API to expose yammer metrics
> > > > outside
> > > > > of
> > > > > >> > JMX
> > > > > >> > > > >> (MetricsReporter only exposes KafkaMetrics)
> > > > > >> > > > >>
> > > > > >> > > > >
> > > > > >> > > >
> > > > > >> > >
> > > > > >> >
> > > > > >>
> > > > > >
> > > > >
> > > >
> > >
> >
> 

Re: [DISCUSS] KIP-714: Client metrics and observability

Posted by Magnus Edenhill <ma...@edenhill.se>.
Hey Jun and Kirk,


I see that there's a lot of focus on the existing metrics in the Java
clients, which makes sense,
but the KIP aims to approach the problem space from a higher and more
generic level by
defining:
1) a standard protocol for subscribing to, and pushing metrics,
2) an existing industry standard encoding and semantics for those metrics
(OTLP),
3) as well as a standard set of metrics that we believe are relevant to
most/all client implementations


The counter-alternatives to these points, which have come up before in
various forms during the KIP discussions (see rejected alternatives in the
KIP), are:
1) use an existing out-of-band protocol,
2) use Kafka protocol encoding for the metrics,
3) let each client implementation provide their own set of metrics.

So why is the KIP not suggesting these approaches? Well, in short:
 1) defies the zero-conf/always-available requirement - clients, networks,
firewalls, etc., must be specifically configured - which will not be
feasible.
 2) we would need to duplicate the work of the industry-leading telemetry
people (OpenTelemetry) - reaping no benefits of their existing and future
work, and making integration with upstream telemetry systems harder,
 3a) these client-specific metrics would either need to be converted to
some common form - which is not only CPU/memory costly - but also hard from
an operational standpoint:
     someone (is it the Kafka operator?) would need to understand what
client-specific metrics are available and what their semantics are - and
then for each such client implementation write translation code in the
broker-side plugin to try to mangle the custom metrics into a standard set
of metrics that can be monitored with a single upstream metric. With seven
or eight different client implementations in the wild, all with new
releases coming out every now and then, some perhaps without per-metric
documentation, that seems like a daunting task and a battle that will be
hard to win.
 3b) or if the client-specific metrics are not converted to some common
form, name, semantic, etc., it'll make creating meaningful aggregations and
monitoring more complex in the upstream telemetry system with a scattered
plethora of custom metrics.
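
To make 3a concrete, here is a rough sketch of the kind of per-client
translation code a broker-side plugin would have to carry. It is only an
illustration: the client names ("client-a", "client-b"), their metric names
and the kilobyte unit are invented, and only the target name
client.producer.record.queue.bytes comes from this thread.

```python
# Hypothetical translation tables, one per client implementation.
# Each entry maps a client-specific metric name to
# (standard metric name, scale factor applied to the reported value).
TRANSLATIONS = {
    "client-a": {
        "queue.bytes": ("client.producer.record.queue.bytes", 1),
    },
    "client-b": {
        # Assumed to report kilobytes, so the value must be rescaled.
        "outq_kb": ("client.producer.record.queue.bytes", 1024),
    },
}


def to_standard(client_impl, name, value):
    """Translate one client-specific metric to the standard set.

    Returns (standard_name, scaled_value), or None when no rule is
    known for this client implementation and metric name.
    """
    rule = TRANSLATIONS.get(client_impl, {}).get(name)
    if rule is None:
        return None
    standard_name, scale = rule
    return standard_name, value * scale
```

Every new client implementation, and every release that renames a metric or
changes its unit, means another table entry that someone has to write and
keep correct, and anything without a rule is silently dropped.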

Additionally, the proposed standard set of metrics is derived from what is
available in existing clients, and while the fit with existing metrics may
not be perfect, it won't be too far off.
Moreover, having a standard set of metrics to implement makes it easier for
client maintainers to know which metrics they should expose and which are
considered relevant to monitoring and troubleshooting.

As for manually mapping KIP-714 metric names to JMX during troubleshooting:
I agree that is not perfect, but it could be solved quite easily through
documentation. E.g., "MetricA is also known as metric.foo.a in OTLP".
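
As a sketch of what such documentation could be generated from, the lookup
below uses the Java metric names and the KIP metric names mentioned earlier
in this thread. The exact pairings are my reading of that exchange and
should be treated as assumptions, not as the final mapping in the KIP.

```python
# JMX metric name -> KIP-714 metric name(s) said to cover it.
# Pairings are assumed from the earlier discussion in this thread.
JMX_TO_STANDARD = {
    "buffer-total-bytes": ["client.producer.record.queue.max.bytes"],
    "buffer-available-bytes": [
        "client.producer.record.queue.bytes",
        "client.producer.record.queue.max.bytes",
    ],
    "io-wait-ratio": ["client.io.wait.time"],
}


def describe(jmx_name):
    """Return a one-line documentation string for a JMX metric name."""
    names = JMX_TO_STANDARD.get(jmx_name)
    if not names:
        return f"{jmx_name} has no documented KIP-714 counterpart"
    return f"{jmx_name} is covered by {' and '.join(names)} in OTLP"
```

For example, describe("io-wait-ratio") yields
"io-wait-ratio is covered by client.io.wait.time in OTLP".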

Another point worth mentioning is that, while the KIP does not cover it, a
future enhancement to the clients is to also expose the OTLP metrics
directly to the application as an alternative to JMX (or whatever the
client currently exposes, e.g. JSON), which makes integration with upstream
metrics systems easier.


Thanks,
Magnus







Den tors 16 juni 2022 kl 23:38 skrev Jun Rao <ju...@confluent.io.invalid>:

> Hi, Kirk,
>
> Thanks for the reply. A couple of more comments.
>
> (1) "Another perspective is that these two sets of metrics serve different
> purposes and/or have different audiences, which argues that they should
> maintain their individuality and purpose. " Hmm, I am wondering if those
> metrics are really for different audiences and purposes? For example, if
> the operator detected an issue through a client metric collected through
> the server, the operator may need to communicate that back to the client.
> It would be weird if that same metric is not visible on the client side.
>
> (2) If we could standardize the names on the server side, do we need to
> enforce a naming convention for all clients?
>
> Thanks,
>
> Jun
>
> On Thu, Jun 16, 2022 at 12:00 PM Kirk True <ki...@kirktrue.pro> wrote:
>
> > Hi Jun,
> >
> > I'll try to answer the questions posed...
> >
> > On Tue, Jun 7, 2022, at 4:32 PM, Jun Rao wrote:
> > > Hi, Magnus,
> > >
> > > Thanks for the reply.
> > >
> > > So, the standard set of generic metrics is just a recommendation and
> not
> > a
> > > requirement? This sounds good to me since it makes the adoption of the
> > KIP
> > > easier.
> >
> > I believe that was the intent, yes.
> >
> > > Regarding the metric names, I have two concerns.
> >
> > (I'm splitting these two up for readability...)
> >
> > > (1) If a client already
> > > has an existing metric similar to the standard one, duplicating the
> > metric
> > > seems to be confusing.
> >
> > Agreed. I'm dealing with that situation as I write the Java client
> > implementation.
> >
> > The existing Java client exposes a set of metrics via JMX. The updated
> > Java client will introduce a second set of metrics, which instead are
> > exposed via sending them to the broker. There is substantial overlap
> > between the two sets of metrics and in a few places in the code under
> > development,
> > there are essentially two separate calls to update metrics: one for the
> > JMX-bound metrics and one for the broker-bound metrics.
> >
> > To be candid, I have gone back-and-forth on that design. From one
> > perspective, it could be argued that the set of client metrics should be
> > standardized across a given client, regardless of how those metrics are
> > exposed for consumption. Another perspective is that these two sets of
> > metrics serve different purposes and/or have different audiences, which
> > argues that they should maintain their individuality and purpose. Your
> > inputs/suggestions are certainly welcome!
> >
> > > (2) If a client needs to implement a standard metric
> > > that doesn't exist yet, using a naming convention (e.g., using dash vs
> > dot)
> > > different from other existing metrics also seems a bit confusing. It
> > seems
> > > that the main benefit of having standard metric names across clients is
> > for
> > > better server side monitoring. Could we do the standardization in the
> > > plugin on the server?
> >
> > I think the expectation is that the plugin implementation will perform
> > transformation of metric names, if needed, to fit in with an
> organization's
> > monitoring naming standards. Perhaps we need to call that out in the KIP
> > itself.
> >
> > Thanks,
> > Kirk
> >
> > >
> > > Thanks,
> > >
> > > Jun
> > >
> > >
> > >
> > > On Tue, Jun 7, 2022 at 6:53 AM Magnus Edenhill <ma...@edenhill.se>
> > wrote:
> > >
> > > > Hey Jun,
> > > >
> > > > I've clarified the scope of the standard metrics in the KIP, but
> > basically:
> > > >
> > > >  * We define a standard set of generic metrics that should be
> relevant
> > to
> > > > most client implementations, e.g., each producer implementation
> > probably
> > > > has some sort of per-partition message queue.
> > > >  * A client implementation should strive to implement as many of the
> > > > standard metrics as possible, but only the ones that make sense.
> > > >  * For metrics that are not in the standard set, a client maintainer
> > can
> > > > choose to either submit a KIP to add additional standard metrics - if
> > > > they're relevant, or go ahead and add custom metrics that are
> specific
> > to
> > > > that client implementation. These custom metrics will have a prefix
> > > > specific to that client implementation, as opposed to the standard
> > metric
> > > > set that resides under "org.apache.kafka...". E.g.,
> > > > "se.edenhill.librdkafka" or whatever.
> > > >  * Existing non-KIP-714 metrics should remain untouched. In some
> cases
> > we
> > > > might be able to use the same meter given it is compatible with the
> > > > standard metric set definition, in other cases a semi-duplicate meter
> > may
> > > > be needed. Thus this will not affect the metrics exposed through JMX,
> > or
> > > > vice versa.
> > > >
> > > > Thanks,
> > > > Magnus
> > > >
> > > >
> > > >
> > > > Den ons 1 juni 2022 kl 18:55 skrev Jun Rao <jun@confluent.io.invalid
> >:
> > > >
> > > > > Hi, Magnus,
> > > > >
> > > > > 51. Just to clarify my question.  (1) Are standard metrics required
> > for
> > > > > every client for this KIP to function?  (2) Are we converting
> > existing
> > > > java
> > > > > metrics to the standard metrics and deprecating the old ones? If
> so,
> > > > could
> > > > > we list all existing java metrics that need to be renamed and the
> > > > > corresponding new name?
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Jun
> > > > >
> > > > > On Tue, May 31, 2022 at 3:29 PM Jun Rao <ju...@confluent.io> wrote:
> > > > >
> > > > > > Hi, Magnus,
> > > > > >
> > > > > > Thanks for the reply.
> > > > > >
> > > > > > 51. I think it's fine to have a list of recommended metrics for
> > every
> > > > > > client to implement. I am just not sure that standardizing on the
> > > > metric
> > > > > > names across all clients is practical. The list of common metrics
> > in
> > > > the
> > > > > > KIP have completely different names from the java metric names.
> > Some of
> > > > > > them have different types. For example, some of the common
> metrics
> > > > have a
> > > > > > type of histogram, but the java client metrics don't use
> histogram
> > in
> > > > > > general. Requiring the operator to translate those names and
> > understand
> > > > > the
> > > > > > subtle differences across clients seem to cause more confusion
> > during
> > > > > > troubleshooting.
> > > > > >
> > > > > > Thanks,
> > > > > >
> > > > > > Jun
> > > > > >
> > > > > > On Tue, May 31, 2022 at 5:02 AM Magnus Edenhill <
> > magnus@edenhill.se>
> > > > > > wrote:
> > > > > >
> > > > > >> Den fre 20 maj 2022 kl 01:23 skrev Jun Rao
> > <jun@confluent.io.invalid
> > > > >:
> > > > > >>
> > > > > > >> > Hi, Magnus,
> > > > > >> >
> > > > > >> > Thanks for the reply.
> > > > > >> >
> > > > > >> > 50. Sounds good.
> > > > > >> >
> > > > > > >> > 51. I misunderstood the proposal in the KIP then. The
> > proposal is
> > > > to
> > > > > >> > define a set of common metric names that every client should
> > > > > implement.
> > > > > >> The
> > > > > >> > problem is that every client already has its own set of
> metrics
> > with
> > > > > its
> > > > > >> > own names. I am not sure that we could easily agree upon a
> > common
> > > > set
> > > > > of
> > > > > >> > metrics that work with all clients. There are likely to be
> some
> > > > > metrics
> > > > > >> > that are client specific. Translating between the common name
> > and
> > > > > client
> > > > > >> > specific name is probably going to add more confusion. As
> > mentioned
> > > > in
> > > > > >> the
> > > > > >> > KIP, similar metrics from different clients could have subtle
> > > > > >> > semantic differences. Could we just let each client use its
> own
> > set
> > > > of
> > > > > >> > metric names?
> > > > > >> >
> > > > > >>
> > > > > >> We identified a common set of metrics that should be relevant
> for
> > most
> > > > > >> client implementations,
> > > > > >> they're the ones listed in the KIP.
> > > > > >> A supporting client does not have to implement all those
> metrics,
> > only
> > > > > the
> > > > > > >> ones that make sense
> > > > > >> based on that client implementation, and a client may implement
> > other
> > > > > >> metrics that are not listed
> > > > > >> in the KIP under its own namespace.
> > > > > >> This approach has two benefits:
> > > > > >>  - there will be a common set of metrics that most/all clients
> > > > > implement,
> > > > > >> which makes monitoring
> > > > > >>   and troubleshooting easier across fleets with multiple Kafka
> > client
> > > > > >> languages/implementations.
> > > > > >>  - client-specific metrics are still possible, so if there is no
> > > > > suitable
> > > > > >> standard metric a client can still
> > > > > >>    provide what special metrics it has.
> > > > > >>
> > > > > >>
> > > > > >> Thanks,
> > > > > >> Magnus
> > > > > >>
> > > > > >>
> > > > > >> On Thu, May 19, 2022 at 10:39 AM Magnus Edenhill <
> > magnus@edenhill.se>
> > > > > >> wrote:
> > > > > >> >
> > > > > >> > > Den ons 18 maj 2022 kl 19:57 skrev Jun Rao
> > > > <jun@confluent.io.invalid
> > > > > >> >:
> > > > > >> > >
> > > > > >> > > > Hi, Magnus,
> > > > > >> > > >
> > > > > >> > >
> > > > > >> > > Hi Jun
> > > > > >> > >
> > > > > >> > >
> > > > > >> > > >
> > > > > >> > > > Thanks for the updated KIP. Just a couple of more
> comments.
> > > > > >> > > >
> > > > > >> > > > 50. To troubleshoot a particular client issue, I imagine
> > that
> > > > the
> > > > > >> > client
> > > > > >> > > > needs to identify its client_instance_id. How does the
> > client
> > > > find
> > > > > >> this
> > > > > >> > > > out? Do we plan to include client_instance_id in the
> client
> > log,
> > > > > >> expose
> > > > > >> > > it
> > > > > >> > > > as a metric or something else?
> > > > > >> > > >
> > > > > >> > >
> > > > > >> > > The KIP suggests that client implementations emit an
> > informative
> > > > log
> > > > > >> > > message
> > > > > >> > > with the assigned client-instance-id once it is retrieved
> > (once
> > > > per
> > > > > >> > client
> > > > > >> > > instance lifetime).
> > > > > >> > > There's also a clientInstanceId() method that an application
> > can
> > > > use
> > > > > >> to
> > > > > >> > > retrieve
> > > > > >> > > the client instance id and emit through whatever side
> channels
> > > > makes
> > > > > >> > sense.
> > > > > >> > >
> > > > > >> > >
> > > > > >> > >
> > > > > >> > > > 51. The KIP lists a bunch of metrics that need to be
> > collected
> > > > at
> > > > > >> the
> > > > > >> > > > client side. However, it seems quite a few useful java
> > client
> > > > > >> metrics
> > > > > >> > > like
> > > > > >> > > > the following are missing.
> > > > > >> > > >     buffer-total-bytes
> > > > > >> > > >     buffer-available-bytes
> > > > > >> > > >
> > > > > >> > >
> > > > > >> > > These are covered by client.producer.record.queue.bytes and
> > > > > >> > > client.producer.record.queue.max.bytes.
> > > > > >> > >
> > > > > >> > >
> > > > > >> > > >     bufferpool-wait-time
> > > > > >> > > >
> > > > > >> > >
> > > > > >> > > Missing, but somewhat implementation specific.
> > > > > >> > > If it was up to me we would add this later if there's a
> need.
> > > > > >> > >
> > > > > >> > >
> > > > > >> > >
> > > > > >> > > >     batch-size-avg
> > > > > >> > > >     batch-size-max
> > > > > >> > > >
> > > > > >> > >
> > > > > >> > > These are missing and would be suitably represented as a
> > > > histogram.
> > > > > >> I'll
> > > > > >> > > add them.
> > > > > >> > >
> > > > > >> > >
> > > > > >> > >
> > > > > >> > > >     io-wait-ratio
> > > > > >> > > >     io-ratio
> > > > > >> > > >
> > > > > >> > >
> > > > > >> > > There's client.io.wait.time which should cover
> io-wait-ratio.
> > > > > >> > > We could add a client.io.time as well, now or in a later
> KIP.
> > > > > >> > >
> > > > > >> > > Thanks,
> > > > > >> > > Magnus
> > > > > >> > >
> > > > > >> > >
> > > > > >> > >
> > > > > >> > >
> > > > > >> > > >
> > > > > >> > > > Thanks,
> > > > > >> > > >
> > > > > >> > > > Jun
> > > > > >> > > >
> > > > > >> > > > On Mon, Apr 4, 2022 at 10:01 AM Jun Rao <jun@confluent.io
> >
> > > > wrote:
> > > > > >> > > >
> > > > > >> > > > > Hi, Xavier,
> > > > > >> > > > >
> > > > > >> > > > > Thanks for the reply.
> > > > > >> > > > >
> > > > > >> > > > > 28. It does seem that we have started using KafkaMetrics
> > on
> > > > the
> > > > > >> > broker
> > > > > >> > > > > side. Then, my only concern is on the usage of Histogram
> > in
> > > > > >> > > KafkaMetrics.
> > > > > >> > > > > Histogram in KafkaMetrics statically divides the value
> > space
> > > > > into
> > > > > >> a
> > > > > >> > > fixed
> > > > > >> > > > > number of buckets and only returns values on the bucket
> > > > > boundary.
> > > > > >> So,
> > > > > >> > > the
> > > > > >> > > > > returned histogram value may never show up in a recorded
> > > > value.
> > > > > >> > Yammer
> > > > > >> > > > > Histogram, on the other hand, uses reservoir sampling.
> The
> > > > > >> reported
> > > > > >> > > value
> > > > > >> > > > > is always one of the recorded values. So, I am not sure
> > that
> > > > > >> > Histogram
> > > > > >> > > in
> > > > > >> > > > > KafkaMetrics is as good as Yammer Histogram.
> > > > > >> > > > ClientMetricsPluginExportTime
> > > > > >> > > > > uses Histogram.
> > > > > >> > > > >
> > > > > >> > > > > Thanks,
> > > > > >> > > > >
> > > > > >> > > > > Jun
> > > > > >> > > > >
> > > > > >> > > > > On Thu, Mar 31, 2022 at 5:21 PM Xavier Léauté
> > > > > >> > > > <xa...@confluent.io.invalid>
> > > > > >> > > > > wrote:
> > > > > >> > > > >
> > > > > >> > > > >> >
> > > > > >> > > > >> > 28. On the broker, we typically use Yammer metrics.
> > Only
> > > > for
> > > > > >> > metrics
> > > > > >> > > > >> that
> > > > > >> > > > >> > depend on Kafka metric features (e.g., quota), we use
> > the
> > > > > Kafka
> > > > > >> > > > metric.
> > > > > >> > > > >> > Yammer metrics have 4 types: gauge, meter, histogram
> > and
> > > > > timer.
> > > > > >> > > meter
> > > > > >> > > > >> > calculates a rate, but also exposes an accumulated
> > value.
> > > > > >> > > > >> >
> > > > > >> > > > >>
> > > > > >> > > > >> I don't see a good reason we should limit ourselves to
> > Yammer
> > > > > >> > metrics
> > > > > >> > > on
> > > > > >> > > > >> the broker. KafkaMetrics was written
> > > > > >> > > > >> to replace Yammer metrics and is used for all new
> > components
> > > > > >> > (clients,
> > > > > >> > > > >> streams, connect, etc.)
> > > > > >> > > > >> My understanding is that the original goal was to
> retire
> > > > Yammer
> > > > > >> > > metrics
> > > > > >> > > > in
> > > > > >> > > > >> the broker in favor of KafkaMetrics.
> > > > > >> > > > >> We just haven't done so out of backwards compatibility
> > > > > concerns.
> > > > > >> > > > >> There are other broker metrics such as group
> coordinator,
> > > > > >> > transaction
> > > > > >> > > > >> state
> > > > > >> > > > >> manager, and various socket server metrics
> > > > > >> > > > >> already using KafkaMetrics that don't need specific
> Kafka
> > > > > metric
> > > > > >> > > > features,
> > > > > >> > > > >> so I don't see why we should refrain from using
> > > > > >> > > > >> Kafka metrics on the broker unless there are real
> > > > compatibility
> > > > > >> > > concerns
> > > > > >> > > > >> or
> > > > > >> > > > >> where implementation specifics could lead to confusion
> > when
> > > > > >> > comparing
> > > > > >> > > > >> metrics using different implementations.
> > > > > >> > > > >>
> > > > > >> > > > >> In my opinion we should encourage people to use
> > KafkaMetrics
> > > > > >> going
> > > > > >> > > > forward
> > > > > >> > > > >> on the broker as well, for three reasons:
> > > > > >> > > > >> a) yammer metrics is long deprecated and no longer
> > maintained
> > > > > >> > > > >> b) yammer metrics are much less expressive
> > > > > >> > > > >> c) we don't have a proper API to expose yammer metrics
> > > > outside
> > > > > of
> > > > > >> > JMX
> > > > > >> > > > >> (MetricsReporter only exposes KafkaMetrics)
> > > > > >> > > > >>
> > > > > >> > > > >
> > > > > >> > > >
> > > > > >> > >
> > > > > >> >
> > > > > >>
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] KIP-714: Client metrics and observability

Posted by Jun Rao <ju...@confluent.io.INVALID>.
Hi, Kirk,

Thanks for the reply. A couple more comments.

(1) "Another perspective is that these two sets of metrics serve different
purposes and/or have different audiences, which argues that they should
maintain their individuality and purpose. " Hmm, I am wondering if those
metrics are really for different audiences and purposes? For example, if
the operator detected an issue through a client metric collected through
the server, the operator may need to communicate that back to the client.
It would be weird if that same metric is not visible on the client side.

(2) If we could standardize the names on the server side, do we need to
enforce a naming convention for all clients?
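
For concreteness, the server-side renaming discussed here could look roughly
like the sketch below. It assumes a hypothetical organization convention of
underscore-separated names with a "kafka" prefix; the function name and the
convention are illustrative only and are not part of the KIP.

```python
def to_org_name(metric_name, prefix="kafka"):
    """Rewrite a dotted or dashed metric name into one naming style.

    Both "client.io.wait.time" and "io-wait-ratio" style names are
    accepted, so the plugin hides the dash-vs-dot difference.
    """
    # Treat dashes and dots as the same separator, then drop empty parts.
    parts = [p for p in metric_name.replace("-", ".").split(".") if p]
    return "_".join([prefix] + parts)
```

With this, "client.producer.record.queue.bytes" becomes
"kafka_client_producer_record_queue_bytes" and "io-wait-ratio" becomes
"kafka_io_wait_ratio". Note that this only unifies the spelling of names; it
does not reconcile metrics whose semantics differ between clients.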

Thanks,

Jun

On Thu, Jun 16, 2022 at 12:00 PM Kirk True <ki...@kirktrue.pro> wrote:

> Hi Jun,
>
> I'll try to answer the questions posed...
>
> On Tue, Jun 7, 2022, at 4:32 PM, Jun Rao wrote:
> > Hi, Magnus,
> >
> > Thanks for the reply.
> >
> > So, the standard set of generic metrics is just a recommendation and not
> a
> > requirement? This sounds good to me since it makes the adoption of the
> KIP
> > easier.
>
> I believe that was the intent, yes.
>
> > Regarding the metric names, I have two concerns.
>
> (I'm splitting these two up for readability...)
>
> > (1) If a client already
> > has an existing metric similar to the standard one, duplicating the
> metric
> > seems to be confusing.
>
> Agreed. I'm dealing with that situation as I write the Java client
> implementation.
>
> The existing Java client exposes a set of metrics via JMX. The updated
> Java client will introduce a second set of metrics, which instead are
> exposed via sending them to the broker. There is substantial overlap
> between the two sets of metrics and in a few places in the code under
> development,
> there are essentially two separate calls to update metrics: one for the
> JMX-bound metrics and one for the broker-bound metrics.
>
> To be candid, I have gone back-and-forth on that design. From one
> perspective, it could be argued that the set of client metrics should be
> standardized across a given client, regardless of how those metrics are
> exposed for consumption. Another perspective is that these two sets of
> metrics serve different purposes and/or have different audiences, which
> argues that they should maintain their individuality and purpose. Your
> inputs/suggestions are certainly welcome!
>
> > (2) If a client needs to implement a standard metric
> > that doesn't exist yet, using a naming convention (e.g., using dash vs
> dot)
> > different from other existing metrics also seems a bit confusing. It
> seems
> > that the main benefit of having standard metric names across clients is
> for
> > better server side monitoring. Could we do the standardization in the
> > plugin on the server?
>
> I think the expectation is that the plugin implementation will perform
> transformation of metric names, if needed, to fit in with an organization's
> monitoring naming standards. Perhaps we need to call that out in the KIP
> itself.
>
> Thanks,
> Kirk
>
> >
> > Thanks,
> >
> > Jun
> >
> >
> >
> > On Tue, Jun 7, 2022 at 6:53 AM Magnus Edenhill <ma...@edenhill.se>
> wrote:
> >
> > > Hey Jun,
> > >
> > > I've clarified the scope of the standard metrics in the KIP, but
> basically:
> > >
> > >  * We define a standard set of generic metrics that should be relevant
> to
> > > most client implementations, e.g., each producer implementation
> probably
> > > has some sort of per-partition message queue.
> > >  * A client implementation should strive to implement as many of the
> > > standard metrics as possible, but only the ones that make sense.
> > >  * For metrics that are not in the standard set, a client maintainer
> can
> > > choose to either submit a KIP to add additional standard metrics - if
> > > they're relevant, or go ahead and add custom metrics that are specific
> to
> > > that client implementation. These custom metrics will have a prefix
> > > specific to that client implementation, as opposed to the standard
> metric
> > > set that resides under "org.apache.kafka...". E.g.,
> > > "se.edenhill.librdkafka" or whatever.
> > >  * Existing non-KIP-714 metrics should remain untouched. In some cases
> we
> > > might be able to use the same meter given it is compatible with the
> > > standard metric set definition, in other cases a semi-duplicate meter
> may
> > > be needed. Thus this will not affect the metrics exposed through JMX,
> or
> > > vice versa.
> > >
> > > Thanks,
> > > Magnus
> > >
> > >
> > >
> > > Den ons 1 juni 2022 kl 18:55 skrev Jun Rao <ju...@confluent.io.invalid>:
> > >
> > > > Hi, Magnus,
> > > >
> > > > 51. Just to clarify my question.  (1) Are standard metrics required
> for
> > > > every client for this KIP to function?  (2) Are we converting
> existing
> > > java
> > > > metrics to the standard metrics and deprecating the old ones? If so,
> > > could
> > > > we list all existing java metrics that need to be renamed and the
> > > > corresponding new name?
> > > >
> > > > Thanks,
> > > >
> > > > Jun
> > > >
> > > > On Tue, May 31, 2022 at 3:29 PM Jun Rao <ju...@confluent.io> wrote:
> > > >
> > > > > Hi, Magnus,
> > > > >
> > > > > Thanks for the reply.
> > > > >
> > > > > 51. I think it's fine to have a list of recommended metrics for
> every
> > > > > client to implement. I am just not sure that standardizing on the
> > > metric
> > > > > names across all clients is practical. The list of common metrics
> in
> > > the
> > > > > KIP have completely different names from the java metric names.
> Some of
> > > > > them have different types. For example, some of the common metrics
> > > have a
> > > > > type of histogram, but the java client metrics don't use histogram
> in
> > > > > general. Requiring the operator to translate those names and
> understand
> > > > the
> > > > > subtle differences across clients seem to cause more confusion
> during
> > > > > troubleshooting.
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Jun
> > > > >
> > > > > On Tue, May 31, 2022 at 5:02 AM Magnus Edenhill <
> magnus@edenhill.se>
> > > > > wrote:
> > > > >
> > > > >> Den fre 20 maj 2022 kl 01:23 skrev Jun Rao
> <jun@confluent.io.invalid
> > > >:
> > > > >>
> > > > > >> > Hi, Magnus,
> > > > >> >
> > > > >> > Thanks for the reply.
> > > > >> >
> > > > >> > 50. Sounds good.
> > > > >> >
> > > > > >> > 51. I misunderstood the proposal in the KIP then. The
> proposal is
> > > to
> > > > >> > define a set of common metric names that every client should
> > > > implement.
> > > > >> The
> > > > >> > problem is that every client already has its own set of metrics
> with
> > > > its
> > > > >> > own names. I am not sure that we could easily agree upon a
> common
> > > set
> > > > of
> > > > >> > metrics that work with all clients. There are likely to be some
> > > > metrics
> > > > >> > that are client specific. Translating between the common name
> and
> > > > client
> > > > >> > specific name is probably going to add more confusion. As
> mentioned
> > > in
> > > > >> the
> > > > >> > KIP, similar metrics from different clients could have subtle
> > > > >> > semantic differences. Could we just let each client use its own
> set
> > > of
> > > > >> > metric names?
> > > > >> >
> > > > >>
> > > > >> We identified a common set of metrics that should be relevant for
> most
> > > > >> client implementations,
> > > > >> they're the ones listed in the KIP.
> > > > >> A supporting client does not have to implement all those metrics,
> only
> > > > the
> > > > >> ones that makes sense
> > > > >> based on that client implementation, and a client may implement
> other
> > > > >> metrics that are not listed
> > > > >> in the KIP under its own namespace.
> > > > >> This approach has two benefits:
> > > > >>  - there will be a common set of metrics that most/all clients
> > > > implement,
> > > > >> which makes monitoring
> > > > >>   and troubleshooting easier across fleets with multiple Kafka
> client
> > > > >> languages/implementations.
> > > > >>  - client-specific metrics are still possible, so if there is no
> > > > suitable
> > > > >> standard metric a client can still
> > > > >>    provide what special metrics it has.
> > > > >>
> > > > >>
> > > > >> Thanks,
> > > > >> Magnus
> > > > >>
> > > > >>
> > > > >> On Thu, May 19, 2022 at 10:39 AM Magnus Edenhill <
> magnus@edenhill.se>
> > > > >> wrote:
> > > > >> >
> > > > >> > > Den ons 18 maj 2022 kl 19:57 skrev Jun Rao
> > > <jun@confluent.io.invalid
> > > > >> >:
> > > > >> > >
> > > > >> > > > Hi, Magnus,
> > > > >> > > >
> > > > >> > >
> > > > >> > > Hi Jun
> > > > >> > >
> > > > >> > >
> > > > >> > > >
> > > > >> > > > Thanks for the updated KIP. Just a couple of more comments.
> > > > >> > > >
> > > > >> > > > 50. To troubleshoot a particular client issue, I imagine
> that
> > > the
> > > > >> > client
> > > > >> > > > needs to identify its client_instance_id. How does the
> client
> > > find
> > > > >> this
> > > > >> > > > out? Do we plan to include client_instance_id in the client
> log,
> > > > >> expose
> > > > >> > > it
> > > > >> > > > as a metric or something else?
> > > > >> > > >
> > > > >> > >
> > > > >> > > The KIP suggests that client implementations emit an
> informative
> > > log
> > > > >> > > message
> > > > >> > > with the assigned client-instance-id once it is retrieved
> (once
> > > per
> > > > >> > client
> > > > >> > > instance lifetime).
> > > > >> > > There's also a clientInstanceId() method that an application
> can
> > > use
> > > > >> to
> > > > >> > > retrieve
> > > > >> > > the client instance id and emit through whatever side channels
> > > makes
> > > > >> > sense.
> > > > >> > >
> > > > >> > >
> > > > >> > >
> > > > >> > > > 51. The KIP lists a bunch of metrics that need to be
> collected
> > > at
> > > > >> the
> > > > >> > > > client side. However, it seems quite a few useful java
> client
> > > > >> metrics
> > > > >> > > like
> > > > >> > > > the following are missing.
> > > > >> > > >     buffer-total-bytes
> > > > >> > > >     buffer-available-bytes
> > > > >> > > >
> > > > >> > >
> > > > >> > > These are covered by client.producer.record.queue.bytes and
> > > > >> > > client.producer.record.queue.max.bytes.
> > > > >> > >
> > > > >> > >
> > > > >> > > >     bufferpool-wait-time
> > > > >> > > >
> > > > >> > >
> > > > >> > > Missing, but somewhat implementation specific.
> > > > >> > > If it was up to me we would add this later if there's a need.
> > > > >> > >
> > > > >> > >
> > > > >> > >
> > > > >> > > >     batch-size-avg
> > > > >> > > >     batch-size-max
> > > > >> > > >
> > > > >> > >
> > > > >> > > These are missing and would be suitably represented as a
> > > histogram.
> > > > >> I'll
> > > > >> > > add them.
> > > > >> > >
> > > > >> > >
> > > > >> > >
> > > > >> > > >     io-wait-ratio
> > > > >> > > >     io-ratio
> > > > >> > > >
> > > > >> > >
> > > > >> > > There's client.io.wait.time which should cover io-wait-ratio.
> > > > >> > > We could add a client.io.time as well, now or in a later KIP.
> > > > >> > >
> > > > >> > > Thanks,
> > > > >> > > Magnus
> > > > >> > >
> > > > >> > >
> > > > >> > >
> > > > >> > >
> > > > >> > > >
> > > > >> > > > Thanks,
> > > > >> > > >
> > > > >> > > > Jun
> > > > >> > > >
> > > > >> > > > On Mon, Apr 4, 2022 at 10:01 AM Jun Rao <ju...@confluent.io>
> > > wrote:
> > > > >> > > >
> > > > >> > > > > Hi, Xavier,
> > > > >> > > > >
> > > > >> > > > > Thanks for the reply.
> > > > >> > > > >
> > > > >> > > > > 28. It does seem that we have started using KafkaMetrics
> on
> > > the
> > > > >> > broker
> > > > >> > > > > side. Then, my only concern is on the usage of Histogram
> in
> > > > >> > > KafkaMetrics.
> > > > >> > > > > Histogram in KafkaMetrics statically divides the value
> space
> > > > into
> > > > >> a
> > > > >> > > fixed
> > > > >> > > > > number of buckets and only returns values on the bucket
> > > > boundary.
> > > > >> So,
> > > > >> > > the
> > > > >> > > > > returned histogram value may never show up in a recorded
> > > value.
> > > > >> > Yammer
> > > > >> > > > > Histogram, on the other hand, uses reservoir sampling. The
> > > > >> reported
> > > > >> > > value
> > > > >> > > > > is always one of the recorded values. So, I am not sure
> that
> > > > >> > Histogram
> > > > >> > > in
> > > > >> > > > > KafkaMetrics is as good as Yammer Histogram.
> > > > >> > > > ClientMetricsPluginExportTime
> > > > >> > > > > uses Histogram.
> > > > >> > > > >
> > > > >> > > > > Thanks,
> > > > >> > > > >
> > > > >> > > > > Jun
> > > > >> > > > >
> > > > >> > > > > On Thu, Mar 31, 2022 at 5:21 PM Xavier Léauté
> > > > >> > > > <xa...@confluent.io.invalid>
> > > > >> > > > > wrote:
> > > > >> > > > >
> > > > >> > > > >> >
> > > > >> > > > >> > 28. On the broker, we typically use Yammer metrics.
> Only
> > > for
> > > > >> > metrics
> > > > >> > > > >> that
> > > > >> > > > >> > depend on Kafka metric features (e.g., quota), we use
> the
> > > > Kafka
> > > > >> > > > metric.
> > > > >> > > > >> > Yammer metrics have 4 types: gauge, meter, histogram
> and
> > > > timer.
> > > > >> > > meter
> > > > >> > > > >> > calculates a rate, but also exposes an accumulated
> value.
> > > > >> > > > >> >
> > > > >> > > > >>
> > > > >> > > > >> I don't see a good reason we should limit ourselves to
> Yammer
> > > > >> > metrics
> > > > >> > > on
> > > > >> > > > >> the broker. KafkaMetrics was written
> > > > >> > > > >> to replace Yammer metrics and is used for all new
> components
> > > > >> > (clients,
> > > > >> > > > >> streams, connect, etc.)
> > > > >> > > > >> My understanding is that the original goal was to retire
> > > Yammer
> > > > >> > > metrics
> > > > >> > > > in
> > > > >> > > > >> the broker in favor of KafkaMetrics.
> > > > >> > > > >> We just haven't done so out of backwards compatibility
> > > > concerns.
> > > > >> > > > >> There are other broker metrics such as group coordinator,
> > > > >> > transaction
> > > > >> > > > >> state
> > > > >> > > > >> manager, and various socket server metrics
> > > > >> > > > >> already using KafkaMetrics that don't need specific Kafka
> > > > metric
> > > > >> > > > features,
> > > > >> > > > >> so I don't see why we should refrain from using
> > > > >> > > > >> Kafka metrics on the broker unless there are real
> > > compatibility
> > > > >> > > concerns
> > > > >> > > > >> or
> > > > >> > > > >> where implementation specifics could lead to confusion
> when
> > > > >> > comparing
> > > > >> > > > >> metrics using different implementations.
> > > > >> > > > >>
> > > > >> > > > >> In my opinion we should encourage people to use
> KafkaMetrics
> > > > >> going
> > > > >> > > > forward
> > > > >> > > > >> on the broker as well, for two reasons:
> > > > >> > > > >> a) yammer metrics is long deprecated and no longer
> maintained
> > > > >> > > > >> b) yammer metrics are much less expressive
> > > > >> > > > >> c) we don't have a proper API to expose yammer metrics
> > > outside
> > > > of
> > > > >> > JMX
> > > > >> > > > >> (MetricsReporter only exposes KafkaMetrics)
> > > > >> > > > >>
> > > > >> > > > >
> > > > >> > > >
> > > > >> > >
> > > > >> >
> > > > >>
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] KIP-714: Client metrics and observability

Posted by Kirk True <ki...@kirktrue.pro>.
Hi Jun,

I'll try to answer the questions posed...

On Tue, Jun 7, 2022, at 4:32 PM, Jun Rao wrote:
> Hi, Magnus,
> 
> Thanks for the reply.
> 
> So, the standard set of generic metrics is just a recommendation and not a
> requirement? This sounds good to me since it makes the adoption of the KIP
> easier.

I believe that was the intent, yes.

> Regarding the metric names, I have two concerns.

(I'm splitting these two up for readability...)

> (1) If a client already
> has an existing metric similar to the standard one, duplicating the metric
> seems to be confusing.

Agreed. I'm dealing with that situation as I write the Java client implementation.

The existing Java client exposes a set of metrics via JMX. The updated Java client will introduce a second set of metrics, which are instead exposed by sending them to the broker. There is substantial overlap between the two sets of metrics, and in a few places in the code under development there are essentially two separate calls to update metrics: one for the JMX-bound metrics and one for the broker-bound metrics.

To be candid, I have gone back and forth on that design. From one perspective, it could be argued that the set of client metrics should be standardized across a given client, regardless of how those metrics are exposed for consumption. Another perspective is that these two sets of metrics serve different purposes and/or have different audiences, which argues that they should maintain their individuality and purpose. Your input and suggestions are certainly welcome!
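
To make the trade-off concrete, here is a minimal Java sketch of the "two separate calls" pattern described above, folded into one recorder so that the call site records a measurement only once. Everything in it is an assumption made for illustration: Registry, Recorder, and both metric names are invented and are not the Kafka client's actual classes or metric names.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical sketch only: none of these types exist in the Kafka client.
public class DualMetricsSketch {

    // A tiny named-counter registry standing in for one "set" of metrics.
    static final class Registry {
        private final Map<String, LongAdder> counters = new ConcurrentHashMap<>();

        void add(String name, long delta) {
            counters.computeIfAbsent(name, k -> new LongAdder()).add(delta);
        }

        long get(String name) {
            LongAdder adder = counters.get(name);
            return adder == null ? 0L : adder.sum();
        }
    }

    // Fans one measurement out to both sets, each under its own naming style.
    static final class Recorder {
        private final Registry jmxBound;
        private final Registry brokerBound;

        Recorder(Registry jmxBound, Registry brokerBound) {
            this.jmxBound = jmxBound;
            this.brokerBound = brokerBound;
        }

        void recordBytes(long bytes) {
            // Made-up dash-style name for the JMX-bound set.
            jmxBound.add("example-bytes-total", bytes);
            // Made-up dot-style name for the broker-bound set.
            brokerBound.add("org.example.client.example.bytes", bytes);
        }
    }

    public static void main(String[] args) {
        Registry jmx = new Registry();
        Registry broker = new Registry();
        Recorder recorder = new Recorder(jmx, broker);

        recorder.recordBytes(100);
        recorder.recordBytes(50);

        System.out.println(jmx.get("example-bytes-total"));
        System.out.println(broker.get("org.example.client.example.bytes"));
    }
}
```

The point of the wrapper is only that the duplication lives in one place; it does not settle whether the two sets should share names or stay separate.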

> (2) If a client needs to implement a standard metric
> that doesn't exist yet, using a naming convention (e.g., using dash vs dot)
> different from other existing metrics also seems a bit confusing. It seems
> that the main benefit of having standard metric names across clients is for
> better server side monitoring. Could we do the standardization in the
> plugin on the server?

I think the expectation is that the plugin implementation will transform metric names, if needed, to fit an organization's monitoring naming standards. Perhaps we need to call that out in the KIP itself.
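
As a rough illustration of what such a server-side transformation might look like, here is a small Java sketch. It is not the KIP's plugin interface: the class, its constructor, and the override target name are all hypothetical, and the dot-to-dash default is just one possible convention. The two input names are the dotted metric names mentioned earlier in this thread.

```java
import java.util.Map;

// Hypothetical sketch only: this is not the KIP-714 plugin API.
public class MetricNameTranslatorSketch {

    // Explicit per-metric renames chosen by the operator (assumed config).
    private final Map<String, String> overrides;

    MetricNameTranslatorSketch(Map<String, String> overrides) {
        this.overrides = Map.copyOf(overrides);
    }

    // Returns the operator-facing name for a metric name pushed by a client.
    String translate(String pushedName) {
        String explicit = overrides.get(pushedName);
        if (explicit != null) {
            return explicit;
        }
        // Default rule assumed here: dotted names become dashed names.
        return pushedName.replace('.', '-');
    }

    public static void main(String[] args) {
        MetricNameTranslatorSketch translator = new MetricNameTranslatorSketch(
            Map.of("client.producer.record.queue.bytes", "producer-queue-bytes"));

        System.out.println(translator.translate("client.io.wait.time"));
        System.out.println(translator.translate("client.producer.record.queue.bytes"));
    }
}
```

Keeping the mapping on the server side means clients can keep emitting whatever names they already have, at the cost of each operator maintaining the override table.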

Thanks,
Kirk

> 
> Thanks,
> 
> Jun
> 
> 
> 

Re: [DISCUSS] KIP-714: Client metrics and observability

Posted by Jun Rao <ju...@confluent.io.INVALID>.
Hi, Magnus,

Thanks for the reply.

So, the standard set of generic metrics is just a recommendation and not a
requirement? This sounds good to me since it makes the adoption of the KIP
easier.

Regarding the metric names, I have two concerns. (1) If a client already
has an existing metric similar to the standard one, duplicating the metric
seems to be confusing. (2) If a client needs to implement a standard metric
that doesn't exist yet, using a naming convention (e.g., using dash vs dot)
different from other existing metrics also seems a bit confusing. It seems
that the main benefit of having standard metric names across clients is for
better server side monitoring. Could we do the standardization in the
plugin on the server?

Thanks,

Jun



On Tue, Jun 7, 2022 at 6:53 AM Magnus Edenhill <ma...@edenhill.se> wrote:

> Hey Jun,
>
> I've clarified the scope of the standard metrics in the KIP, but basically:
>
>  * We define a standard set of generic metrics that should be relevant to
> most client implementations, e.g., each producer implementation probably
> has some sort of per-partition message queue.
>  * A client implementation should strive to implement as many of the
> standard metrics as possible, but only the ones that make sense.
>  * For metrics that are not in the standard set, a client maintainer can
> choose to either submit a KIP to add additional standard metrics - if
> they're relevant, or go ahead and add custom metrics that are specific to
> that client implementation. These custom metrics will have a prefix
> specific to that client implementation, as opposed to the standard metric
> set that resides under "org.apache.kafka...". E.g.,
> "se.edenhill.librdkafka" or whatever.
>  * Existing non-KIP-714 metrics should remain untouched. In some cases we
> might be able to use the same meter given it is compatible with the
> standard metric set definition, in other cases a semi-duplicate meter may
> be needed. Thus this will not affect the metrics exposed through JMX, or
> vice versa.
>
> Thanks,
> Magnus
>
>
>
> On Wed, Jun 1, 2022 at 18:55, Jun Rao <ju...@confluent.io.invalid> wrote:
>
> > Hi, Magnus,
> >
> > 51. Just to clarify my question. (1) Are standard metrics required for
> > every client for this KIP to function? (2) Are we converting existing java
> > metrics to the standard metrics and deprecating the old ones? If so, could
> > we list all existing java metrics that need to be renamed and the
> > corresponding new name?
> >
> > Thanks,
> >
> > Jun
> >
> > On Tue, May 31, 2022 at 3:29 PM Jun Rao <ju...@confluent.io> wrote:
> >
> > > Hi, Magnus,
> > >
> > > Thanks for the reply.
> > >
> > > 51. I think it's fine to have a list of recommended metrics for every
> > > client to implement. I am just not sure that standardizing on the metric
> > > names across all clients is practical. The common metrics listed in the
> > > KIP have completely different names from the java metric names. Some of
> > > them have different types. For example, some of the common metrics have a
> > > type of histogram, but the java client metrics don't use histogram in
> > > general. Requiring the operator to translate those names and understand the
> > > subtle differences across clients seems to cause more confusion during
> > > troubleshooting.
> > >
> > > Thanks,
> > >
> > > Jun
> > >
> > > On Tue, May 31, 2022 at 5:02 AM Magnus Edenhill <ma...@edenhill.se>
> > > wrote:
> > >
> > >> On Fri, May 20, 2022 at 01:23, Jun Rao <jun@confluent.io.invalid> wrote:
> > >>
> > >> > Hi, Magnus,
> > >> >
> > >> > Thanks for the reply.
> > >> >
> > >> > 50. Sounds good.
> > >> >
> > >> > 51. I misunderstood the proposal in the KIP then. The proposal is to
> > >> > define a set of common metric names that every client should implement. The
> > >> > problem is that every client already has its own set of metrics with its
> > >> > own names. I am not sure that we could easily agree upon a common set of
> > >> > metrics that work with all clients. There are likely to be some metrics
> > >> > that are client specific. Translating between the common name and client
> > >> > specific name is probably going to add more confusion. As mentioned in the
> > >> > KIP, similar metrics from different clients could have subtle
> > >> > semantic differences. Could we just let each client use its own set of
> > >> > metric names?
> > >> >
> > >>
> > >> We identified a common set of metrics that should be relevant for most
> > >> client implementations; they're the ones listed in the KIP.
> > >> A supporting client does not have to implement all those metrics, only the
> > >> ones that make sense based on that client implementation, and a client may
> > >> implement other metrics that are not listed in the KIP under its own namespace.
> > >> This approach has two benefits:
> > >>  - there will be a common set of metrics that most/all clients implement,
> > >>    which makes monitoring and troubleshooting easier across fleets with
> > >>    multiple Kafka client languages/implementations.
> > >>  - client-specific metrics are still possible, so if there is no suitable
> > >>    standard metric a client can still provide what special metrics it has.
> > >>
> > >>
> > >> Thanks,
> > >> Magnus
> > >>
> > >>
> > >> On Thu, May 19, 2022 at 10:39 AM Magnus Edenhill <ma...@edenhill.se>
> > >> wrote:
> > >> >
> > >> > > On Wed, May 18, 2022 at 19:57, Jun Rao <jun@confluent.io.invalid> wrote:
> > >> > >
> > >> > > > Hi, Magnus,
> > >> > > >
> > >> > >
> > >> > > Hi Jun
> > >> > >
> > >> > >
> > >> > > >
> > >> > > > Thanks for the updated KIP. Just a couple of more comments.
> > >> > > >
> > >> > > > 50. To troubleshoot a particular client issue, I imagine that the client
> > >> > > > needs to identify its client_instance_id. How does the client find this
> > >> > > > out? Do we plan to include client_instance_id in the client log, expose it
> > >> > > > as a metric or something else?
> > >> > > >
> > >> > >
> > >> > > The KIP suggests that client implementations emit an informative log
> > >> > > message with the assigned client-instance-id once it is retrieved (once per
> > >> > > client instance lifetime).
> > >> > > There's also a clientInstanceId() method that an application can use to
> > >> > > retrieve the client instance id and emit it through whatever side channels
> > >> > > make sense.
> > >> > >
> > >> > >
> > >> > >
> > >> > > > 51. The KIP lists a bunch of metrics that need to be collected at the
> > >> > > > client side. However, it seems quite a few useful java client metrics like
> > >> > > > the following are missing.
> > >> > > >     buffer-total-bytes
> > >> > > >     buffer-available-bytes
> > >> > > >
> > >> > >
> > >> > > These are covered by client.producer.record.queue.bytes and
> > >> > > client.producer.record.queue.max.bytes.
> > >> > >
> > >> > >
> > >> > > >     bufferpool-wait-time
> > >> > > >
> > >> > >
> > >> > > Missing, but somewhat implementation specific.
> > >> > > If it was up to me we would add this later if there's a need.
> > >> > >
> > >> > >
> > >> > >
> > >> > > >     batch-size-avg
> > >> > > >     batch-size-max
> > >> > > >
> > >> > >
> > >> > > These are missing and would be suitably represented as a histogram. I'll
> > >> > > add them.
> > >> > >
> > >> > >
> > >> > >
> > >> > > >     io-wait-ratio
> > >> > > >     io-ratio
> > >> > > >
> > >> > >
> > >> > > There's client.io.wait.time which should cover io-wait-ratio.
> > >> > > We could add a client.io.time as well, now or in a later KIP.
> > >> > >
> > >> > > Thanks,
> > >> > > Magnus
> > >> > >
> > >> > >
> > >> > >
> > >> > >
> > >> > > >
> > >> > > > Thanks,
> > >> > > >
> > >> > > > Jun
> > >> > > >
> > >> > > > On Mon, Apr 4, 2022 at 10:01 AM Jun Rao <ju...@confluent.io> wrote:
> > >> > > >
> > >> > > > > Hi, Xavier,
> > >> > > > >
> > >> > > > > Thanks for the reply.
> > >> > > > >
> > >> > > > > 28. It does seem that we have started using KafkaMetrics on the broker
> > >> > > > > side. Then, my only concern is on the usage of Histogram in KafkaMetrics.
> > >> > > > > Histogram in KafkaMetrics statically divides the value space into a fixed
> > >> > > > > number of buckets and only returns values on the bucket boundary. So, the
> > >> > > > > returned histogram value may never match any recorded value. Yammer
> > >> > > > > Histogram, on the other hand, uses reservoir sampling. The reported value
> > >> > > > > is always one of the recorded values. So, I am not sure that Histogram in
> > >> > > > > KafkaMetrics is as good as Yammer Histogram. ClientMetricsPluginExportTime
> > >> > > > > uses Histogram.
> > >> > > > >
> > >> > > > > Thanks,
> > >> > > > >
> > >> > > > > Jun
> > >> > > > >
> > >> > > > > On Thu, Mar 31, 2022 at 5:21 PM Xavier Léauté <xa...@confluent.io.invalid>
> > >> > > > > wrote:
> > >> > > > >
> > >> > > > >> >
> > >> > > > >> > 28. On the broker, we typically use Yammer metrics. Only for metrics that
> > >> > > > >> > depend on Kafka metric features (e.g., quota), we use the Kafka metric.
> > >> > > > >> > Yammer metrics have 4 types: gauge, meter, histogram and timer. meter
> > >> > > > >> > calculates a rate, but also exposes an accumulated value.
> > >> > > > >> >
> > >> > > > >>
> > >> > > > >> I don't see a good reason we should limit ourselves to Yammer metrics on
> > >> > > > >> the broker. KafkaMetrics was written to replace Yammer metrics and is used
> > >> > > > >> for all new components (clients, streams, connect, etc.).
> > >> > > > >> My understanding is that the original goal was to retire Yammer metrics in
> > >> > > > >> the broker in favor of KafkaMetrics.
> > >> > > > >> We just haven't done so out of backwards compatibility concerns.
> > >> > > > >> There are other broker metrics such as group coordinator, transaction state
> > >> > > > >> manager, and various socket server metrics already using KafkaMetrics that
> > >> > > > >> don't need specific Kafka metric features, so I don't see why we should
> > >> > > > >> refrain from using Kafka metrics on the broker unless there are real
> > >> > > > >> compatibility concerns or where implementation specifics could lead to
> > >> > > > >> confusion when comparing metrics using different implementations.
> > >> > > > >>
> > >> > > > >> In my opinion we should encourage people to use KafkaMetrics going forward
> > >> > > > >> on the broker as well, for three reasons:
> > >> > > > >> a) yammer metrics is long deprecated and no longer maintained
> > >> > > > >> b) yammer metrics are much less expressive
> > >> > > > >> c) we don't have a proper API to expose yammer metrics outside of JMX
> > >> > > > >> (MetricsReporter only exposes KafkaMetrics)
> > >> > > > >>
> > >> > > > >
> > >> > > >
> > >> > >
> > >> >
> > >>
> > >
> >
>

Re: [DISCUSS] KIP-714: Client metrics and observability

Posted by Magnus Edenhill <ma...@edenhill.se>.
Hey Jun,

I've clarified the scope of the standard metrics in the KIP, but basically:

 * We define a standard set of generic metrics that should be relevant to
most client implementations, e.g., each producer implementation probably
has some sort of per-partition message queue.
 * A client implementation should strive to implement as many of the
standard metrics as possible, but only the ones that make sense.
 * For metrics that are not in the standard set, a client maintainer can
either submit a KIP to add further standard metrics, if they're relevant,
or go ahead and add custom metrics that are specific to that client
implementation. These custom metrics will have a prefix specific to that
client implementation, e.g., "se.edenhill.librdkafka", as opposed to the
standard metric set that resides under "org.apache.kafka...".
 * Existing non-KIP-714 metrics should remain untouched. In some cases we
might be able to use the same meter given it is compatible with the
standard metric set definition, in other cases a semi-duplicate meter may
be needed. Thus this will not affect the metrics exposed through JMX, or
vice versa.
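
As a rough illustration of the namespace rule in the list above, a
collector could tell standard metrics from client-specific ones by their
prefix. This is only a sketch, not something the KIP defines: the function
name classify_metric and the full example metric names are hypothetical,
and only the two prefixes are taken from this message.

```python
# Prefix for the standard metric set, as named in this message.
STANDARD_PREFIX = "org.apache.kafka."


def classify_metric(name: str) -> str:
    """Return 'standard' for names under the shared prefix, else 'custom'."""
    return "standard" if name.startswith(STANDARD_PREFIX) else "custom"


# Both full names below are made up for illustration.
examples = [
    "org.apache.kafka.client.producer.record.queue.bytes",
    "se.edenhill.librdkafka.some.custom.metric",
]
for metric in examples:
    print(metric, "->", classify_metric(metric))
```

A prefix check like this would let existing JMX-exposed metrics stay
untouched, since they never enter either namespace.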

Thanks,
Magnus



On Wed, Jun 1, 2022 at 18:55, Jun Rao <ju...@confluent.io.invalid> wrote:

> Hi, Magnus,
>
> 51. Just to clarify my question.  (1) Are standard metrics required for
> every client for this KIP to function?  (2) Are we converting existing java
> metrics to the standard metrics and deprecating the old ones? If so, could
> we list all existing java metrics that need to be renamed and the
> corresponding new name?
>
> Thanks,
>
> Jun
>
> On Tue, May 31, 2022 at 3:29 PM Jun Rao <ju...@confluent.io> wrote:
>
> > Hi, Magnus,
> >
> > Thanks for the reply.
> >
> > 51. I think it's fine to have a list of recommended metrics for every
> > client to implement. I am just not sure that standardizing on the metric
> > names across all clients is practical. The list of common metrics in the
> > KIP have completely different names from the java metric names. Some of
> > them have different types. For example, some of the common metrics have a
> > type of histogram, but the java client metrics don't use histogram in
> > general. Requiring the operator to translate those names and understand
> the
> > subtle differences across clients seem to cause more confusion during
> > troubleshooting.
> >
> > Thanks,
> >
> > Jun
> >
> > On Tue, May 31, 2022 at 5:02 AM Magnus Edenhill <ma...@edenhill.se>
> > wrote:
> >
> >> On Fri, May 20, 2022 at 01:23, Jun Rao <ju...@confluent.io.invalid> wrote:
> >>
> >> > Hi, Magus,
> >> >
> >> > Thanks for the reply.
> >> >
> >> > 50. Sounds good.
> >> >
> >> > 51. I miss-understood the proposal in the KIP then. The proposal is to
> >> > define a set of common metric names that every client should
> implement.
> >> The
> >> > problem is that every client already has its own set of metrics with
> its
> >> > own names. I am not sure that we could easily agree upon a common set
> of
> >> > metrics that work with all clients. There are likely to be some
> metrics
> >> > that are client specific. Translating between the common name and
> client
> >> > specific name is probably going to add more confusion. As mentioned in
> >> the
> >> > KIP, similar metrics from different clients could have subtle
> >> > semantic differences. Could we just let each client use its own set of
> >> > metric names?
> >> >
> >>
> >> We identified a common set of metrics that should be relevant for most
> >> client implementations,
> >> they're the ones listed in the KIP.
> >> A supporting client does not have to implement all those metrics, only
> the
> >> ones that makes sense
> >> based on that client implementation, and a client may implement other
> >> metrics that are not listed
> >> in the KIP under its own namespace.
> >> This approach has two benefits:
> >>  - there will be a common set of metrics that most/all clients
> implement,
> >> which makes monitoring
> >>   and troubleshooting easier across fleets with multiple Kafka client
> >> languages/implementations.
> >>  - client-specific metrics are still possible, so if there is no
> suitable
> >> standard metric a client can still
> >>    provide what special metrics it has.
> >>
> >>
> >> Thanks,
> >> Magnus
> >>
> >>
> >> On Thu, May 19, 2022 at 10:39 AM Magnus Edenhill <ma...@edenhill.se>
> >> wrote:
> >> >
> >> > > On Wed, May 18, 2022 at 19:57, Jun Rao <jun@confluent.io.invalid> wrote:
> >> > >
> >> > > > Hi, Magnus,
> >> > > >
> >> > >
> >> > > Hi Jun
> >> > >
> >> > >
> >> > > >
> >> > > > Thanks for the updated KIP. Just a couple of more comments.
> >> > > >
> >> > > > 50. To troubleshoot a particular client issue, I imagine that the
> >> > client
> >> > > > needs to identify its client_instance_id. How does the client find
> >> this
> >> > > > out? Do we plan to include client_instance_id in the client log,
> >> expose
> >> > > it
> >> > > > as a metric or something else?
> >> > > >
> >> > >
> >> > > The KIP suggests that client implementations emit an informative log
> >> > > message
> >> > > with the assigned client-instance-id once it is retrieved (once per
> >> > client
> >> > > instance lifetime).
> >> > > There's also a clientInstanceId() method that an application can use
> >> to
> >> > > retrieve
> >> > > the client instance id and emit through whatever side channels makes
> >> > sense.
> >> > >
> >> > >
> >> > >
> >> > > > 51. The KIP lists a bunch of metrics that need to be collected at
> >> the
> >> > > > client side. However, it seems quite a few useful java client
> >> metrics
> >> > > like
> >> > > > the following are missing.
> >> > > >     buffer-total-bytes
> >> > > >     buffer-available-bytes
> >> > > >
> >> > >
> >> > > These are covered by client.producer.record.queue.bytes and
> >> > > client.producer.record.queue.max.bytes.
> >> > >
> >> > >
> >> > > >     bufferpool-wait-time
> >> > > >
> >> > >
> >> > > Missing, but somewhat implementation specific.
> >> > > If it was up to me we would add this later if there's a need.
> >> > >
> >> > >
> >> > >
> >> > > >     batch-size-avg
> >> > > >     batch-size-max
> >> > > >
> >> > >
> >> > > These are missing and would be suitably represented as a histogram.
> >> I'll
> >> > > add them.
> >> > >
> >> > >
> >> > >
> >> > > >     io-wait-ratio
> >> > > >     io-ratio
> >> > > >
> >> > >
> >> > > There's client.io.wait.time which should cover io-wait-ratio.
> >> > > We could add a client.io.time as well, now or in a later KIP.
> >> > >
> >> > > Thanks,
> >> > > Magnus
> >> > >
> >> > >
> >> > >
> >> > >
> >> > > >
> >> > > > Thanks,
> >> > > >
> >> > > > Jun
> >> > > >
> >> > > > On Mon, Apr 4, 2022 at 10:01 AM Jun Rao <ju...@confluent.io> wrote:
> >> > > >
> >> > > > > Hi, Xavier,
> >> > > > >
> >> > > > > Thanks for the reply.
> >> > > > >
> >> > > > > 28. It does seem that we have started using KafkaMetrics on the
> >> > broker
> >> > > > > side. Then, my only concern is on the usage of Histogram in
> >> > > KafkaMetrics.
> >> > > > > Histogram in KafkaMetrics statically divides the value space
> into
> >> a
> >> > > fixed
> >> > > > > number of buckets and only returns values on the bucket
> boundary.
> >> So,
> >> > > the
> >> > > > > returned histogram value may never show up in a recorded value.
> >> > Yammer
> >> > > > > Histogram, on the other hand, uses reservoir sampling. The
> >> reported
> >> > > value
> >> > > > > is always one of the recorded values. So, I am not sure that
> >> > Histogram
> >> > > in
> >> > > > > KafkaMetrics is as good as Yammer Histogram.
> >> > > > ClientMetricsPluginExportTime
> >> > > > > uses Histogram.
> >> > > > >
> >> > > > > Thanks,
> >> > > > >
> >> > > > > Jun
> >> > > > >
> >> > > > > On Thu, Mar 31, 2022 at 5:21 PM Xavier Léauté
> >> > > > <xa...@confluent.io.invalid>
> >> > > > > wrote:
> >> > > > >
> >> > > > >> >
> >> > > > >> > 28. On the broker, we typically use Yammer metrics. Only for
> >> > metrics
> >> > > > >> that
> >> > > > >> > depend on Kafka metric features (e.g., quota), we use the
> Kafka
> >> > > > metric.
> >> > > > >> > Yammer metrics have 4 types: gauge, meter, histogram and
> timer.
> >> > > meter
> >> > > > >> > calculates a rate, but also exposes an accumulated value.
> >> > > > >> >
> >> > > > >>
> >> > > > >> I don't see a good reason we should limit ourselves to Yammer
> >> > metrics
> >> > > on
> >> > > > >> the broker. KafkaMetrics was written
> >> > > > >> to replace Yammer metrics and is used for all new components
> >> > (clients,
> >> > > > >> streams, connect, etc.)
> >> > > > >> My understanding is that the original goal was to retire Yammer
> >> > > metrics
> >> > > > in
> >> > > > >> the broker in favor of KafkaMetrics.
> >> > > > >> We just haven't done so out of backwards compatibility
> concerns.
> >> > > > >> There are other broker metrics such as group coordinator,
> >> > transaction
> >> > > > >> state
> >> > > > >> manager, and various socket server metrics
> >> > > > >> already using KafkaMetrics that don't need specific Kafka
> metric
> >> > > > features,
> >> > > > >> so I don't see why we should refrain from using
> >> > > > >> Kafka metrics on the broker unless there are real compatibility
> >> > > concerns
> >> > > > >> or
> >> > > > >> where implementation specifics could lead to confusion when
> >> > comparing
> >> > > > >> metrics using different implementations.
> >> > > > >>
> >> > > > >> In my opinion we should encourage people to use KafkaMetrics
> >> going
> >> > > > forward
> >> > > > >> on the broker as well, for two reasons:
> >> > > > >> a) yammer metrics is long deprecated and no longer maintained
> >> > > > >> b) yammer metrics are much less expressive
> >> > > > >> c) we don't have a proper API to expose yammer metrics outside
> of
> >> > JMX
> >> > > > >> (MetricsReporter only exposes KafkaMetrics)
> >> > > > >>
> >> > > > >
> >> > > >
> >> > >
> >> >
> >>
> >
>

Re: [DISCUSS] KIP-714: Client metrics and observability

Posted by Jun Rao <ju...@confluent.io.INVALID>.
Hi, Magnus,

51. Just to clarify my question: (1) Are standard metrics required for
every client for this KIP to function? (2) Are we converting existing Java
metrics to the standard metrics and deprecating the old ones? If so, could
we list all existing Java metrics that need to be renamed and the
corresponding new names?

Thanks,

Jun

On Tue, May 31, 2022 at 3:29 PM Jun Rao <ju...@confluent.io> wrote:

> Hi, Magnus,
>
> Thanks for the reply.
>
> 51. I think it's fine to have a list of recommended metrics for every
> client to implement. I am just not sure that standardizing on the metric
> names across all clients is practical. The list of common metrics in the
> KIP have completely different names from the java metric names. Some of
> them have different types. For example, some of the common metrics have a
> type of histogram, but the java client metrics don't use histogram in
> general. Requiring the operator to translate those names and understand the
> subtle differences across clients seem to cause more confusion during
> troubleshooting.
>
> Thanks,
>
> Jun
>
> On Tue, May 31, 2022 at 5:02 AM Magnus Edenhill <ma...@edenhill.se>
> wrote:
>
>> On Fri, May 20, 2022 at 01:23, Jun Rao <ju...@confluent.io.invalid> wrote:
>>
>> > Hi, Magus,
>> >
>> > Thanks for the reply.
>> >
>> > 50. Sounds good.
>> >
>> > 51. I miss-understood the proposal in the KIP then. The proposal is to
>> > define a set of common metric names that every client should implement.
>> The
>> > problem is that every client already has its own set of metrics with its
>> > own names. I am not sure that we could easily agree upon a common set of
>> > metrics that work with all clients. There are likely to be some metrics
>> > that are client specific. Translating between the common name and client
>> > specific name is probably going to add more confusion. As mentioned in
>> the
>> > KIP, similar metrics from different clients could have subtle
>> > semantic differences. Could we just let each client use its own set of
>> > metric names?
>> >
>>
>> We identified a common set of metrics that should be relevant for most
>> client implementations,
>> they're the ones listed in the KIP.
>> A supporting client does not have to implement all those metrics, only the
>> ones that makes sense
>> based on that client implementation, and a client may implement other
>> metrics that are not listed
>> in the KIP under its own namespace.
>> This approach has two benefits:
>>  - there will be a common set of metrics that most/all clients implement,
>> which makes monitoring
>>   and troubleshooting easier across fleets with multiple Kafka client
>> languages/implementations.
>>  - client-specific metrics are still possible, so if there is no suitable
>> standard metric a client can still
>>    provide what special metrics it has.
>>
>>
>> Thanks,
>> Magnus
>>
>>
>> On Thu, May 19, 2022 at 10:39 AM Magnus Edenhill <ma...@edenhill.se>
>> wrote:
>> >
>> > > On Wed, May 18, 2022 at 19:57, Jun Rao <jun@confluent.io.invalid> wrote:
>> > >
>> > > > Hi, Magnus,
>> > > >
>> > >
>> > > Hi Jun
>> > >
>> > >
>> > > >
>> > > > Thanks for the updated KIP. Just a couple of more comments.
>> > > >
>> > > > 50. To troubleshoot a particular client issue, I imagine that the
>> > client
>> > > > needs to identify its client_instance_id. How does the client find
>> this
>> > > > out? Do we plan to include client_instance_id in the client log,
>> expose
>> > > it
>> > > > as a metric or something else?
>> > > >
>> > >
>> > > The KIP suggests that client implementations emit an informative log
>> > > message
>> > > with the assigned client-instance-id once it is retrieved (once per
>> > client
>> > > instance lifetime).
>> > > There's also a clientInstanceId() method that an application can use
>> to
>> > > retrieve
>> > > the client instance id and emit through whatever side channels makes
>> > sense.
>> > >
>> > >
>> > >
>> > > > 51. The KIP lists a bunch of metrics that need to be collected at
>> the
>> > > > client side. However, it seems quite a few useful java client
>> metrics
>> > > like
>> > > > the following are missing.
>> > > >     buffer-total-bytes
>> > > >     buffer-available-bytes
>> > > >
>> > >
>> > > These are covered by client.producer.record.queue.bytes and
>> > > client.producer.record.queue.max.bytes.
>> > >
>> > >
>> > > >     bufferpool-wait-time
>> > > >
>> > >
>> > > Missing, but somewhat implementation specific.
>> > > If it was up to me we would add this later if there's a need.
>> > >
>> > >
>> > >
>> > > >     batch-size-avg
>> > > >     batch-size-max
>> > > >
>> > >
>> > > These are missing and would be suitably represented as a histogram.
>> I'll
>> > > add them.
>> > >
>> > >
>> > >
>> > > >     io-wait-ratio
>> > > >     io-ratio
>> > > >
>> > >
>> > > There's client.io.wait.time which should cover io-wait-ratio.
>> > > We could add a client.io.time as well, now or in a later KIP.
>> > >
>> > > Thanks,
>> > > Magnus
>> > >
>> > >
>> > >
>> > >
>> > > >
>> > > > Thanks,
>> > > >
>> > > > Jun
>> > > >
>> > > > On Mon, Apr 4, 2022 at 10:01 AM Jun Rao <ju...@confluent.io> wrote:
>> > > >
>> > > > > Hi, Xavier,
>> > > > >
>> > > > > Thanks for the reply.
>> > > > >
>> > > > > 28. It does seem that we have started using KafkaMetrics on the
>> > broker
>> > > > > side. Then, my only concern is on the usage of Histogram in
>> > > KafkaMetrics.
>> > > > > Histogram in KafkaMetrics statically divides the value space into
>> a
>> > > fixed
>> > > > > number of buckets and only returns values on the bucket boundary.
>> So,
>> > > the
>> > > > > returned histogram value may never show up in a recorded value.
>> > Yammer
>> > > > > Histogram, on the other hand, uses reservoir sampling. The
>> reported
>> > > value
>> > > > > is always one of the recorded values. So, I am not sure that
>> > Histogram
>> > > in
>> > > > > KafkaMetrics is as good as Yammer Histogram.
>> > > > ClientMetricsPluginExportTime
>> > > > > uses Histogram.
>> > > > >
>> > > > > Thanks,
>> > > > >
>> > > > > Jun
>> > > > >
>> > > > > On Thu, Mar 31, 2022 at 5:21 PM Xavier Léauté
>> > > > <xa...@confluent.io.invalid>
>> > > > > wrote:
>> > > > >
>> > > > >> >
>> > > > >> > 28. On the broker, we typically use Yammer metrics. Only for
>> > metrics
>> > > > >> that
>> > > > >> > depend on Kafka metric features (e.g., quota), we use the Kafka
>> > > > metric.
>> > > > >> > Yammer metrics have 4 types: gauge, meter, histogram and timer.
>> > > meter
>> > > > >> > calculates a rate, but also exposes an accumulated value.
>> > > > >> >
>> > > > >>
>> > > > >> I don't see a good reason we should limit ourselves to Yammer
>> > metrics
>> > > on
>> > > > >> the broker. KafkaMetrics was written
>> > > > >> to replace Yammer metrics and is used for all new components
>> > (clients,
>> > > > >> streams, connect, etc.)
>> > > > >> My understanding is that the original goal was to retire Yammer
>> > > metrics
>> > > > in
>> > > > >> the broker in favor of KafkaMetrics.
>> > > > >> We just haven't done so out of backwards compatibility concerns.
>> > > > >> There are other broker metrics such as group coordinator,
>> > transaction
>> > > > >> state
>> > > > >> manager, and various socket server metrics
>> > > > >> already using KafkaMetrics that don't need specific Kafka metric
>> > > > features,
>> > > > >> so I don't see why we should refrain from using
>> > > > >> Kafka metrics on the broker unless there are real compatibility
>> > > concerns
>> > > > >> or
>> > > > >> where implementation specifics could lead to confusion when
>> > comparing
>> > > > >> metrics using different implementations.
>> > > > >>
>> > > > >> In my opinion we should encourage people to use KafkaMetrics
>> going
>> > > > forward
>> > > > >> on the broker as well, for two reasons:
>> > > > >> a) yammer metrics is long deprecated and no longer maintained
>> > > > >> b) yammer metrics are much less expressive
>> > > > >> c) we don't have a proper API to expose yammer metrics outside of
>> > JMX
>> > > > >> (MetricsReporter only exposes KafkaMetrics)
>> > > > >>
>> > > > >
>> > > >
>> > >
>> >
>>
>

Re: [DISCUSS] KIP-714: Client metrics and observability

Posted by Jun Rao <ju...@confluent.io.INVALID>.
Hi, Magnus,

Thanks for the reply.

51. I think it's fine to have a list of recommended metrics for every
client to implement. I am just not sure that standardizing on the metric
names across all clients is practical. The common metrics listed in the
KIP have completely different names from the Java metric names. Some of
them have different types. For example, some of the common metrics have a
type of histogram, but the Java client metrics don't use histograms in
general. Requiring the operator to translate those names and understand the
subtle differences across clients seems likely to cause more confusion
during troubleshooting.
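
To make the translation burden concrete, here is a small sketch of the
lookup an operator would need. The correspondences are only the ones
stated elsewhere in this thread (the two buffer metrics are said to be
covered by the queue metrics together, and io-wait-ratio by
client.io.wait.time); the table name COVERED_BY and the helper
standard_equivalents are hypothetical, and the exact semantics of each
pairing are an assumption.

```python
# Java client metric name -> standard metric(s) said in this thread to cover it.
# An empty tuple means no standard equivalent was identified.
COVERED_BY = {
    "buffer-total-bytes": (
        "client.producer.record.queue.bytes",
        "client.producer.record.queue.max.bytes",
    ),
    "buffer-available-bytes": (
        "client.producer.record.queue.bytes",
        "client.producer.record.queue.max.bytes",
    ),
    "io-wait-ratio": ("client.io.wait.time",),
    "bufferpool-wait-time": (),
}


def standard_equivalents(java_name: str) -> tuple:
    """Look up which standard metrics, if any, cover a Java metric name."""
    return COVERED_BY.get(java_name, ())


for name in ("io-wait-ratio", "bufferpool-wait-time"):
    print(name, "->", standard_equivalents(name) or "no standard equivalent")
```

Note that a ratio on one side and a time on the other is exactly the kind
of type difference that a name lookup alone does not resolve.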

Thanks,

Jun

On Tue, May 31, 2022 at 5:02 AM Magnus Edenhill <ma...@edenhill.se> wrote:

> On Fri, May 20, 2022 at 01:23, Jun Rao <ju...@confluent.io.invalid> wrote:
>
> > Hi, Magus,
> >
> > Thanks for the reply.
> >
> > 50. Sounds good.
> >
> > 51. I miss-understood the proposal in the KIP then. The proposal is to
> > define a set of common metric names that every client should implement.
> The
> > problem is that every client already has its own set of metrics with its
> > own names. I am not sure that we could easily agree upon a common set of
> > metrics that work with all clients. There are likely to be some metrics
> > that are client specific. Translating between the common name and client
> > specific name is probably going to add more confusion. As mentioned in
> the
> > KIP, similar metrics from different clients could have subtle
> > semantic differences. Could we just let each client use its own set of
> > metric names?
> >
>
> We identified a common set of metrics that should be relevant for most
> client implementations,
> they're the ones listed in the KIP.
> A supporting client does not have to implement all those metrics, only the
> ones that makes sense
> based on that client implementation, and a client may implement other
> metrics that are not listed
> in the KIP under its own namespace.
> This approach has two benefits:
>  - there will be a common set of metrics that most/all clients implement,
> which makes monitoring
>   and troubleshooting easier across fleets with multiple Kafka client
> languages/implementations.
>  - client-specific metrics are still possible, so if there is no suitable
> standard metric a client can still
>    provide what special metrics it has.
>
>
> Thanks,
> Magnus
>
>
> On Thu, May 19, 2022 at 10:39 AM Magnus Edenhill <ma...@edenhill.se>
> wrote:
> >
> > > On Wed, May 18, 2022 at 19:57, Jun Rao <ju...@confluent.io.invalid> wrote:
> > >
> > > > Hi, Magnus,
> > > >
> > >
> > > Hi Jun
> > >
> > >
> > > >
> > > > Thanks for the updated KIP. Just a couple of more comments.
> > > >
> > > > 50. To troubleshoot a particular client issue, I imagine that the
> > client
> > > > needs to identify its client_instance_id. How does the client find
> this
> > > > out? Do we plan to include client_instance_id in the client log,
> expose
> > > it
> > > > as a metric or something else?
> > > >
> > >
> > > The KIP suggests that client implementations emit an informative log
> > > message
> > > with the assigned client-instance-id once it is retrieved (once per
> > client
> > > instance lifetime).
> > > There's also a clientInstanceId() method that an application can use to
> > > retrieve
> > > the client instance id and emit through whatever side channels makes
> > sense.
> > >
> > >
> > >
> > > > 51. The KIP lists a bunch of metrics that need to be collected at the
> > > > client side. However, it seems quite a few useful java client metrics
> > > like
> > > > the following are missing.
> > > >     buffer-total-bytes
> > > >     buffer-available-bytes
> > > >
> > >
> > > These are covered by client.producer.record.queue.bytes and
> > > client.producer.record.queue.max.bytes.
> > >
> > >
> > > >     bufferpool-wait-time
> > > >
> > >
> > > Missing, but somewhat implementation specific.
> > > If it was up to me we would add this later if there's a need.
> > >
> > >
> > >
> > > >     batch-size-avg
> > > >     batch-size-max
> > > >
> > >
> > > These are missing and would be suitably represented as a histogram.
> I'll
> > > add them.
> > >
> > >
> > >
> > > >     io-wait-ratio
> > > >     io-ratio
> > > >
> > >
> > > There's client.io.wait.time which should cover io-wait-ratio.
> > > We could add a client.io.time as well, now or in a later KIP.
> > >
> > > Thanks,
> > > Magnus
> > >
> > >
> > >
> > >
> > > >
> > > > Thanks,
> > > >
> > > > Jun
> > > >
> > > > On Mon, Apr 4, 2022 at 10:01 AM Jun Rao <ju...@confluent.io> wrote:
> > > >
> > > > > Hi, Xavier,
> > > > >
> > > > > Thanks for the reply.
> > > > >
> > > > > 28. It does seem that we have started using KafkaMetrics on the
> > broker
> > > > > side. Then, my only concern is on the usage of Histogram in
> > > KafkaMetrics.
> > > > > Histogram in KafkaMetrics statically divides the value space into a
> > > fixed
> > > > > number of buckets and only returns values on the bucket boundary.
> So,
> > > the
> > > > > returned histogram value may never show up in a recorded value.
> > Yammer
> > > > > Histogram, on the other hand, uses reservoir sampling. The reported
> > > value
> > > > > is always one of the recorded values. So, I am not sure that
> > Histogram
> > > in
> > > > > KafkaMetrics is as good as Yammer Histogram.
> > > > ClientMetricsPluginExportTime
> > > > > uses Histogram.
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Jun
> > > > >
> > > > > On Thu, Mar 31, 2022 at 5:21 PM Xavier Léauté
> > > > <xa...@confluent.io.invalid>
> > > > > wrote:
> > > > >
> > > > >> >
> > > > >> > 28. On the broker, we typically use Yammer metrics. Only for
> > metrics
> > > > >> that
> > > > >> > depend on Kafka metric features (e.g., quota), we use the Kafka
> > > > metric.
> > > > >> > Yammer metrics have 4 types: gauge, meter, histogram and timer.
> > > meter
> > > > >> > calculates a rate, but also exposes an accumulated value.
> > > > >> >
> > > > >>
> > > > >> I don't see a good reason we should limit ourselves to Yammer
> > metrics
> > > on
> > > > >> the broker. KafkaMetrics was written
> > > > >> to replace Yammer metrics and is used for all new components
> > (clients,
> > > > >> streams, connect, etc.)
> > > > >> My understanding is that the original goal was to retire Yammer
> > > metrics
> > > > in
> > > > >> the broker in favor of KafkaMetrics.
> > > > >> We just haven't done so out of backwards compatibility concerns.
> > > > >> There are other broker metrics such as group coordinator,
> > transaction
> > > > >> state
> > > > >> manager, and various socket server metrics
> > > > >> already using KafkaMetrics that don't need specific Kafka metric
> > > > features,
> > > > >> so I don't see why we should refrain from using
> > > > >> Kafka metrics on the broker unless there are real compatibility
> > > concerns
> > > > >> or
> > > > >> where implementation specifics could lead to confusion when
> > comparing
> > > > >> metrics using different implementations.
> > > > >>
> > > > >> In my opinion we should encourage people to use KafkaMetrics going
> > > > forward
> > > > >> on the broker as well, for two reasons:
> > > > >> a) yammer metrics is long deprecated and no longer maintained
> > > > >> b) yammer metrics are much less expressive
> > > > >> c) we don't have a proper API to expose yammer metrics outside of
> > JMX
> > > > >> (MetricsReporter only exposes KafkaMetrics)
> > > > >>
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] KIP-714: Client metrics and observability

Posted by Magnus Edenhill <ma...@edenhill.se>.
On Fri, May 20, 2022 at 01:23, Jun Rao <ju...@confluent.io.invalid> wrote:

> Hi, Magus,
>
> Thanks for the reply.
>
> 50. Sounds good.
>
> 51. I miss-understood the proposal in the KIP then. The proposal is to
> define a set of common metric names that every client should implement. The
> problem is that every client already has its own set of metrics with its
> own names. I am not sure that we could easily agree upon a common set of
> metrics that work with all clients. There are likely to be some metrics
> that are client specific. Translating between the common name and client
> specific name is probably going to add more confusion. As mentioned in the
> KIP, similar metrics from different clients could have subtle
> semantic differences. Could we just let each client use its own set of
> metric names?
>

We identified a common set of metrics that should be relevant for most
client implementations; they're the ones listed in the KIP.
A supporting client does not have to implement all of those metrics, only
the ones that make sense for that client implementation, and a client may
implement other metrics that are not listed in the KIP under its own
namespace.
This approach has two benefits:
 - there will be a common set of metrics that most/all clients implement,
   which makes monitoring and troubleshooting easier across fleets with
   multiple Kafka client languages/implementations.
 - client-specific metrics are still possible, so if there is no suitable
   standard metric a client can still provide whatever special metrics it
   has.


Thanks,
Magnus


On Thu, May 19, 2022 at 10:39 AM Magnus Edenhill <ma...@edenhill.se> wrote:
>
> > On Wed, May 18, 2022 at 19:57, Jun Rao <ju...@confluent.io.invalid> wrote:
> >
> > > Hi, Magnus,
> > >
> >
> > Hi Jun
> >
> >
> > >
> > > Thanks for the updated KIP. Just a couple of more comments.
> > >
> > > 50. To troubleshoot a particular client issue, I imagine that the
> client
> > > needs to identify its client_instance_id. How does the client find this
> > > out? Do we plan to include client_instance_id in the client log, expose
> > it
> > > as a metric or something else?
> > >
> >
> > The KIP suggests that client implementations emit an informative log
> > message
> > with the assigned client-instance-id once it is retrieved (once per
> client
> > instance lifetime).
> > There's also a clientInstanceId() method that an application can use to
> > retrieve
> > the client instance id and emit through whatever side channels makes
> sense.
> >
> >
> >
> > > 51. The KIP lists a bunch of metrics that need to be collected at the
> > > client side. However, it seems quite a few useful java client metrics
> > like
> > > the following are missing.
> > >     buffer-total-bytes
> > >     buffer-available-bytes
> > >
> >
> > These are covered by client.producer.record.queue.bytes and
> > client.producer.record.queue.max.bytes.
> >
> >
> > >     bufferpool-wait-time
> > >
> >
> > Missing, but somewhat implementation specific.
> > If it was up to me we would add this later if there's a need.
> >
> >
> >
> > >     batch-size-avg
> > >     batch-size-max
> > >
> >
> > These are missing and would be suitably represented as a histogram. I'll
> > add them.
> >
> >
> >
> > >     io-wait-ratio
> > >     io-ratio
> > >
> >
> > There's client.io.wait.time which should cover io-wait-ratio.
> > We could add a client.io.time as well, now or in a later KIP.
> >
> > Thanks,
> > Magnus
> >
> >
> >
> >
> > >
> > > Thanks,
> > >
> > > Jun
> > >
> > > On Mon, Apr 4, 2022 at 10:01 AM Jun Rao <ju...@confluent.io> wrote:
> > >
> > > > Hi, Xavier,
> > > >
> > > > Thanks for the reply.
> > > >
> > > > 28. It does seem that we have started using KafkaMetrics on the
> broker
> > > > side. Then, my only concern is on the usage of Histogram in
> > KafkaMetrics.
> > > > Histogram in KafkaMetrics statically divides the value space into a
> > fixed
> > > > number of buckets and only returns values on the bucket boundary. So,
> > the
> > > > returned histogram value may never show up in a recorded value.
> Yammer
> > > > Histogram, on the other hand, uses reservoir sampling. The reported
> > value
> > > > is always one of the recorded values. So, I am not sure that
> Histogram
> > in
> > > > KafkaMetrics is as good as Yammer Histogram.
> > > ClientMetricsPluginExportTime
> > > > uses Histogram.
> > > >
> > > > Thanks,
> > > >
> > > > Jun
> > > >
> > > > On Thu, Mar 31, 2022 at 5:21 PM Xavier Léauté
> > > <xa...@confluent.io.invalid>
> > > > wrote:
> > > >
> > > >> >
> > > >> > 28. On the broker, we typically use Yammer metrics. Only for
> metrics
> > > >> that
> > > >> > depend on Kafka metric features (e.g., quota), we use the Kafka
> > > metric.
> > > >> > Yammer metrics have 4 types: gauge, meter, histogram and timer.
> > meter
> > > >> > calculates a rate, but also exposes an accumulated value.
> > > >> >
> > > >>
> > > >> I don't see a good reason we should limit ourselves to Yammer
> metrics
> > on
> > > >> the broker. KafkaMetrics was written
> > > >> to replace Yammer metrics and is used for all new components
> (clients,
> > > >> streams, connect, etc.)
> > > >> My understanding is that the original goal was to retire Yammer
> > metrics
> > > in
> > > >> the broker in favor of KafkaMetrics.
> > > >> We just haven't done so out of backwards compatibility concerns.
> > > >> There are other broker metrics such as group coordinator,
> transaction
> > > >> state
> > > >> manager, and various socket server metrics
> > > >> already using KafkaMetrics that don't need specific Kafka metric
> > > features,
> > > >> so I don't see why we should refrain from using
> > > >> Kafka metrics on the broker unless there are real compatibility
> > concerns
> > > >> or
> > > >> where implementation specifics could lead to confusion when
> comparing
> > > >> metrics using different implementations.
> > > >>
> > > >> In my opinion we should encourage people to use KafkaMetrics going
> > > forward
> > > >> on the broker as well, for two reasons:
> > > >> a) yammer metrics is long deprecated and no longer maintained
> > > >> b) yammer metrics are much less expressive
> > > >> c) we don't have a proper API to expose yammer metrics outside of
> JMX
> > > >> (MetricsReporter only exposes KafkaMetrics)
> > > >>
> > > >
> > >
> >
>

Re: [DISCUSS] KIP-714: Client metrics and observability

Posted by Jun Rao <ju...@confluent.io.INVALID>.
Hi, Magnus,

Thanks for the reply.

50. Sounds good.

51. I misunderstood the proposal in the KIP then. The proposal is to
define a set of common metric names that every client should implement. The
problem is that every client already has its own set of metrics with its
own names. I am not sure that we could easily agree upon a common set of
metrics that work with all clients. There are likely to be some metrics
that are client specific. Translating between the common name and client
specific name is probably going to add more confusion. As mentioned in the
KIP, similar metrics from different clients could have subtle
semantic differences. Could we just let each client use its own set of
metric names?

Thanks,

Jun

On Thu, May 19, 2022 at 10:39 AM Magnus Edenhill <ma...@edenhill.se> wrote:

> Den ons 18 maj 2022 kl 19:57 skrev Jun Rao <ju...@confluent.io.invalid>:
>
> > Hi, Magnus,
> >
>
> Hi Jun
>
>
> >
> > Thanks for the updated KIP. Just a couple of more comments.
> >
> > 50. To troubleshoot a particular client issue, I imagine that the client
> > needs to identify its client_instance_id. How does the client find this
> > out? Do we plan to include client_instance_id in the client log, expose
> it
> > as a metric or something else?
> >
>
> The KIP suggests that client implementations emit an informative log
> message
> with the assigned client-instance-id once it is retrieved (once per client
> instance lifetime).
> There's also a clientInstanceId() method that an application can use to
> retrieve
> the client instance id and emit through whatever side channels makes sense.
>
>
>
> > 51. The KIP lists a bunch of metrics that need to be collected at the
> > client side. However, it seems quite a few useful java client metrics
> like
> > the following are missing.
> >     buffer-total-bytes
> >     buffer-available-bytes
> >
>
> These are covered by client.producer.record.queue.bytes and
> client.producer.record.queue.max.bytes.
>
>
> >     bufferpool-wait-time
> >
>
> Missing, but somewhat implementation specific.
> If it was up to me we would add this later if there's a need.
>
>
>
> >     batch-size-avg
> >     batch-size-max
> >
>
> These are missing and would be suitably represented as a histogram. I'll
> add them.
>
>
>
> >     io-wait-ratio
> >     io-ratio
> >
>
> There's client.io.wait.time which should cover io-wait-ratio.
> We could add a client.io.time as well, now or in a later KIP.
>
> Thanks,
> Magnus
>
>
>
>
> >
> > Thanks,
> >
> > Jun
> >
> > On Mon, Apr 4, 2022 at 10:01 AM Jun Rao <ju...@confluent.io> wrote:
> >
> > > Hi, Xavier,
> > >
> > > Thanks for the reply.
> > >
> > > 28. It does seem that we have started using KafkaMetrics on the broker
> > > side. Then, my only concern is on the usage of Histogram in
> KafkaMetrics.
> > > Histogram in KafkaMetrics statically divides the value space into a
> fixed
> > > number of buckets and only returns values on the bucket boundary. So,
> the
> > > returned histogram value may never show up in a recorded value. Yammer
> > > Histogram, on the other hand, uses reservoir sampling. The reported
> value
> > > is always one of the recorded values. So, I am not sure that Histogram
> in
> > > KafkaMetrics is as good as Yammer Histogram.
> > ClientMetricsPluginExportTime
> > > uses Histogram.
> > >
> > > Thanks,
> > >
> > > Jun
> > >
> > > On Thu, Mar 31, 2022 at 5:21 PM Xavier Léauté
> > <xa...@confluent.io.invalid>
> > > wrote:
> > >
> > >> >
> > >> > 28. On the broker, we typically use Yammer metrics. Only for metrics
> > >> that
> > >> > depend on Kafka metric features (e.g., quota), we use the Kafka
> > metric.
> > >> > Yammer metrics have 4 types: gauge, meter, histogram and timer.
> meter
> > >> > calculates a rate, but also exposes an accumulated value.
> > >> >
> > >>
> > >> I don't see a good reason we should limit ourselves to Yammer metrics
> on
> > >> the broker. KafkaMetrics was written
> > >> to replace Yammer metrics and is used for all new components (clients,
> > >> streams, connect, etc.)
> > >> My understanding is that the original goal was to retire Yammer
> metrics
> > in
> > >> the broker in favor of KafkaMetrics.
> > >> We just haven't done so out of backwards compatibility concerns.
> > >> There are other broker metrics such as group coordinator, transaction
> > >> state
> > >> manager, and various socket server metrics
> > >> already using KafkaMetrics that don't need specific Kafka metric
> > features,
> > >> so I don't see why we should refrain from using
> > >> Kafka metrics on the broker unless there are real compatibility
> concerns
> > >> or
> > >> where implementation specifics could lead to confusion when comparing
> > >> metrics using different implementations.
> > >>
> > >> In my opinion we should encourage people to use KafkaMetrics going
> > forward
> > >> on the broker as well, for two reasons:
> > >> a) yammer metrics is long deprecated and no longer maintained
> > >> b) yammer metrics are much less expressive
> > >> c) we don't have a proper API to expose yammer metrics outside of JMX
> > >> (MetricsReporter only exposes KafkaMetrics)
> > >>
> > >
> >
>

Re: [DISCUSS] KIP-714: Client metrics and observability

Posted by Magnus Edenhill <ma...@edenhill.se>.
Den ons 18 maj 2022 kl 19:57 skrev Jun Rao <ju...@confluent.io.invalid>:

> Hi, Magnus,
>

Hi Jun


>
> Thanks for the updated KIP. Just a couple of more comments.
>
> 50. To troubleshoot a particular client issue, I imagine that the client
> needs to identify its client_instance_id. How does the client find this
> out? Do we plan to include client_instance_id in the client log, expose it
> as a metric or something else?
>

The KIP suggests that client implementations emit an informative log message
with the assigned client-instance-id once it is retrieved (once per client
instance lifetime).
There's also a clientInstanceId() method that an application can use to
retrieve the client instance id and emit it through whatever side channel
makes sense.
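
As a rough illustration of the side-channel idea, an application could format
the id once and hand it to its own logging or support tooling. This is a
minimal sketch: the Supplier below is a hypothetical stand-in for the client's
clientInstanceId() call, and the class, client name and message wording are
assumptions, not part of the KIP.

```java
import java.util.function.Supplier;
import java.util.logging.Logger;

/** Illustrative only: shows one way an application might surface the id. */
final class ClientInstanceIdLogging {
    private static final Logger LOG = Logger.getLogger("ClientInstanceIdLogging");

    /**
     * Builds the line an application might log or attach to a support ticket.
     * instanceIdSource is a hypothetical stand-in for clientInstanceId().
     */
    static String describe(String clientId, Supplier<String> instanceIdSource) {
        return "Kafka client " + clientId
                + " has client instance id " + instanceIdSource.get();
    }

    public static void main(String[] args) {
        // The id value here is just an example string.
        LOG.info(describe("orders-producer",
                () -> "b69cc35a-7a54-4790-aa69-cc2bd4ee4538"));
    }
}
```

An operator could then use the logged id to match a specific client instance
on the broker side.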



> 51. The KIP lists a bunch of metrics that need to be collected at the
> client side. However, it seems quite a few useful java client metrics like
> the following are missing.
>     buffer-total-bytes
>     buffer-available-bytes
>

These are covered by client.producer.record.queue.bytes and
client.producer.record.queue.max.bytes.


>     bufferpool-wait-time
>

Missing, but somewhat implementation specific.
If it was up to me we would add this later if there's a need.



>     batch-size-avg
>     batch-size-max
>

These are missing and would be suitably represented as a histogram. I'll
add them.



>     io-wait-ratio
>     io-ratio
>

There's client.io.wait.time which should cover io-wait-ratio.
We could add a client.io.time as well, now or in a later KIP.
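
To make the relationship concrete, a ratio can be derived on the receiving
side from two successive readings of an accumulated wait time. This is only a
sketch, assuming the metric is available as a cumulative total in nanoseconds;
the class and parameter names are hypothetical.

```java
/** Illustrative only: derives a wait ratio from two cumulative readings. */
final class IoWaitRatio {

    /**
     * Fraction of the interval spent waiting for I/O.
     * All arguments are nanoseconds; the wait values are cumulative totals
     * and the clock values are the times at which they were sampled.
     */
    static double ratio(long waitNanosPrev, long waitNanosNow,
                        long clockNanosPrev, long clockNanosNow) {
        long elapsed = clockNanosNow - clockNanosPrev;
        if (elapsed <= 0) {
            return 0.0; // no time has passed, so no ratio can be computed
        }
        double raw = (double) (waitNanosNow - waitNanosPrev) / elapsed;
        // Clamp to [0, 1] to absorb small clock or sampling skew.
        return Math.max(0.0, Math.min(1.0, raw));
    }

    public static void main(String[] args) {
        System.out.println(ratio(0L, 250L, 0L, 1000L));
    }
}
```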

Thanks,
Magnus




>
> Thanks,
>
> Jun
>
> On Mon, Apr 4, 2022 at 10:01 AM Jun Rao <ju...@confluent.io> wrote:
>
> > Hi, Xavier,
> >
> > Thanks for the reply.
> >
> > 28. It does seem that we have started using KafkaMetrics on the broker
> > side. Then, my only concern is on the usage of Histogram in KafkaMetrics.
> > Histogram in KafkaMetrics statically divides the value space into a fixed
> > number of buckets and only returns values on the bucket boundary. So, the
> > returned histogram value may never show up in a recorded value. Yammer
> > Histogram, on the other hand, uses reservoir sampling. The reported value
> > is always one of the recorded values. So, I am not sure that Histogram in
> > KafkaMetrics is as good as Yammer Histogram.
> ClientMetricsPluginExportTime
> > uses Histogram.
> >
> > Thanks,
> >
> > Jun
> >
> > On Thu, Mar 31, 2022 at 5:21 PM Xavier Léauté
> <xa...@confluent.io.invalid>
> > wrote:
> >
> >> >
> >> > 28. On the broker, we typically use Yammer metrics. Only for metrics
> >> that
> >> > depend on Kafka metric features (e.g., quota), we use the Kafka
> metric.
> >> > Yammer metrics have 4 types: gauge, meter, histogram and timer. meter
> >> > calculates a rate, but also exposes an accumulated value.
> >> >
> >>
> >> I don't see a good reason we should limit ourselves to Yammer metrics on
> >> the broker. KafkaMetrics was written
> >> to replace Yammer metrics and is used for all new components (clients,
> >> streams, connect, etc.)
> >> My understanding is that the original goal was to retire Yammer metrics
> in
> >> the broker in favor of KafkaMetrics.
> >> We just haven't done so out of backwards compatibility concerns.
> >> There are other broker metrics such as group coordinator, transaction
> >> state
> >> manager, and various socket server metrics
> >> already using KafkaMetrics that don't need specific Kafka metric
> features,
> >> so I don't see why we should refrain from using
> >> Kafka metrics on the broker unless there are real compatibility concerns
> >> or
> >> where implementation specifics could lead to confusion when comparing
> >> metrics using different implementations.
> >>
> >> In my opinion we should encourage people to use KafkaMetrics going
> forward
> >> on the broker as well, for two reasons:
> >> a) yammer metrics is long deprecated and no longer maintained
> >> b) yammer metrics are much less expressive
> >> c) we don't have a proper API to expose yammer metrics outside of JMX
> >> (MetricsReporter only exposes KafkaMetrics)
> >>
> >
>

Re: [DISCUSS] KIP-714: Client metrics and observability

Posted by Jun Rao <ju...@confluent.io.INVALID>.
Hi, Magnus,

Thanks for the updated KIP. Just a couple of more comments.

50. To troubleshoot a particular client issue, I imagine that the client
needs to identify its client_instance_id. How does the client find this
out? Do we plan to include client_instance_id in the client log, expose it
as a metric or something else?

51. The KIP lists a bunch of metrics that need to be collected at the
client side. However, it seems quite a few useful java client metrics like
the following are missing.
    buffer-total-bytes
    buffer-available-bytes
    bufferpool-wait-time
    batch-size-avg
    batch-size-max
    io-wait-ratio
    io-ratio

Thanks,

Jun

On Mon, Apr 4, 2022 at 10:01 AM Jun Rao <ju...@confluent.io> wrote:

> Hi, Xavier,
>
> Thanks for the reply.
>
> 28. It does seem that we have started using KafkaMetrics on the broker
> side. Then, my only concern is on the usage of Histogram in KafkaMetrics.
> Histogram in KafkaMetrics statically divides the value space into a fixed
> number of buckets and only returns values on the bucket boundary. So, the
> returned histogram value may never show up in a recorded value. Yammer
> Histogram, on the other hand, uses reservoir sampling. The reported value
> is always one of the recorded values. So, I am not sure that Histogram in
> KafkaMetrics is as good as Yammer Histogram. ClientMetricsPluginExportTime
> uses Histogram.
>
> Thanks,
>
> Jun
>
> On Thu, Mar 31, 2022 at 5:21 PM Xavier Léauté <xa...@confluent.io.invalid>
> wrote:
>
>> >
>> > 28. On the broker, we typically use Yammer metrics. Only for metrics
>> that
>> > depend on Kafka metric features (e.g., quota), we use the Kafka metric.
>> > Yammer metrics have 4 types: gauge, meter, histogram and timer. meter
>> > calculates a rate, but also exposes an accumulated value.
>> >
>>
>> I don't see a good reason we should limit ourselves to Yammer metrics on
>> the broker. KafkaMetrics was written
>> to replace Yammer metrics and is used for all new components (clients,
>> streams, connect, etc.)
>> My understanding is that the original goal was to retire Yammer metrics in
>> the broker in favor of KafkaMetrics.
>> We just haven't done so out of backwards compatibility concerns.
>> There are other broker metrics such as group coordinator, transaction
>> state
>> manager, and various socket server metrics
>> already using KafkaMetrics that don't need specific Kafka metric features,
>> so I don't see why we should refrain from using
>> Kafka metrics on the broker unless there are real compatibility concerns
>> or
>> where implementation specifics could lead to confusion when comparing
>> metrics using different implementations.
>>
>> In my opinion we should encourage people to use KafkaMetrics going forward
>> on the broker as well, for two reasons:
>> a) yammer metrics is long deprecated and no longer maintained
>> b) yammer metrics are much less expressive
>> c) we don't have a proper API to expose yammer metrics outside of JMX
>> (MetricsReporter only exposes KafkaMetrics)
>>
>

Re: [DISCUSS] KIP-714: Client metrics and observability

Posted by Jun Rao <ju...@confluent.io.INVALID>.
Hi, Xavier,

Thanks for the reply.

28. It does seem that we have started using KafkaMetrics on the broker
side. Then, my only concern is about the use of Histogram in KafkaMetrics.
Histogram in KafkaMetrics statically divides the value space into a fixed
number of buckets and only returns values on the bucket boundary. So, the
returned histogram value may never show up in a recorded value. Yammer
Histogram, on the other hand, uses reservoir sampling. The reported value
is always one of the recorded values. So, I am not sure that Histogram in
KafkaMetrics is as good as Yammer Histogram. ClientMetricsPluginExportTime
uses Histogram.
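
The difference described above can be sketched as follows. This is a
simplified illustration, not the actual KafkaMetrics or Yammer code: the class
and method names are hypothetical, the buckets are assumed to be equal-width,
and the reservoir is a plain uniform sample.

```java
import java.util.Arrays;
import java.util.Random;

/** Illustrative only: neither method mirrors the real Kafka or Yammer code. */
final class HistogramSketch {

    /** Equal-width buckets over [0, max); the quantile is a bucket boundary. */
    static double bucketedQuantile(double[] values, double max, int buckets, double q) {
        if (values.length == 0) throw new IllegalArgumentException("no values");
        int[] counts = new int[buckets];
        double width = max / buckets;
        for (double v : values) {
            int idx = (int) Math.min(buckets - 1, Math.max(0, Math.floor(v / width)));
            counts[idx]++;
        }
        long target = (long) Math.ceil(q * values.length);
        long seen = 0;
        for (int i = 0; i < buckets; i++) {
            seen += counts[i];
            if (seen >= target) {
                return (i + 1) * width; // upper boundary, not a recorded value
            }
        }
        return max;
    }

    /** Uniform reservoir sample; the quantile is one of the sampled recordings. */
    static double reservoirQuantile(double[] values, int reservoirSize, double q, long seed) {
        if (values.length == 0) throw new IllegalArgumentException("no values");
        Random random = new Random(seed);
        double[] reservoir = new double[Math.min(reservoirSize, values.length)];
        for (int i = 0; i < values.length; i++) {
            if (i < reservoir.length) {
                reservoir[i] = values[i]; // fill the reservoir first
            } else {
                int j = random.nextInt(i + 1); // keep each value with equal probability
                if (j < reservoir.length) {
                    reservoir[j] = values[i];
                }
            }
        }
        Arrays.sort(reservoir);
        int pos = (int) Math.min(reservoir.length - 1,
                Math.max(0, Math.ceil(q * reservoir.length) - 1));
        return reservoir[pos];
    }

    public static void main(String[] args) {
        double[] values = {3, 7, 12, 18, 41};
        System.out.println(bucketedQuantile(values, 100.0, 10, 0.5));
        System.out.println(reservoirQuantile(values, 8, 0.5, 42L));
    }
}
```

With the five sample values in main, the bucketed median is reported as the
boundary 20.0, which was never recorded, while the reservoir median is the
recorded value 12.0.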

Thanks,

Jun

On Thu, Mar 31, 2022 at 5:21 PM Xavier Léauté <xa...@confluent.io.invalid>
wrote:

> >
> > 28. On the broker, we typically use Yammer metrics. Only for metrics that
> > depend on Kafka metric features (e.g., quota), we use the Kafka metric.
> > Yammer metrics have 4 types: gauge, meter, histogram and timer. meter
> > calculates a rate, but also exposes an accumulated value.
> >
>
> I don't see a good reason we should limit ourselves to Yammer metrics on
> the broker. KafkaMetrics was written
> to replace Yammer metrics and is used for all new components (clients,
> streams, connect, etc.)
> My understanding is that the original goal was to retire Yammer metrics in
> the broker in favor of KafkaMetrics.
> We just haven't done so out of backwards compatibility concerns.
> There are other broker metrics such as group coordinator, transaction state
> manager, and various socket server metrics
> already using KafkaMetrics that don't need specific Kafka metric features,
> so I don't see why we should refrain from using
> Kafka metrics on the broker unless there are real compatibility concerns or
> where implementation specifics could lead to confusion when comparing
> metrics using different implementations.
>
> In my opinion we should encourage people to use KafkaMetrics going forward
> on the broker as well, for two reasons:
> a) yammer metrics is long deprecated and no longer maintained
> b) yammer metrics are much less expressive
> c) we don't have a proper API to expose yammer metrics outside of JMX
> (MetricsReporter only exposes KafkaMetrics)
>

Re: [DISCUSS] KIP-714: Client metrics and observability

Posted by Xavier Léauté <xa...@confluent.io.INVALID>.
>
> 28. On the broker, we typically use Yammer metrics. Only for metrics that
> depend on Kafka metric features (e.g., quota), we use the Kafka metric.
> Yammer metrics have 4 types: gauge, meter, histogram and timer. meter
> calculates a rate, but also exposes an accumulated value.
>

I don't see a good reason we should limit ourselves to Yammer metrics on
the broker. KafkaMetrics was written
to replace Yammer metrics and is used for all new components (clients,
streams, connect, etc.)
My understanding is that the original goal was to retire Yammer metrics in
the broker in favor of KafkaMetrics.
We just haven't done so out of backwards compatibility concerns.
There are other broker metrics such as group coordinator, transaction state
manager, and various socket server metrics
already using KafkaMetrics that don't need specific Kafka metric features,
so I don't see why we should refrain from using
Kafka metrics on the broker unless there are real compatibility concerns or
where implementation specifics could lead to confusion when comparing
metrics using different implementations.

In my opinion we should encourage people to use KafkaMetrics going forward
on the broker as well, for three reasons:
a) Yammer metrics has long been deprecated and is no longer maintained
b) Yammer metrics are much less expressive
c) we don't have a proper API to expose Yammer metrics outside of JMX
(MetricsReporter only exposes KafkaMetrics)

Re: [DISCUSS] KIP-714: Client metrics and observability

Posted by Jun Rao <ju...@confluent.io.INVALID>.
Hi, Kirk, Sarat,

Thanks for the reply.

28. On the broker, we typically use Yammer metrics. Only for metrics that
depend on Kafka metric features (e.g., quota), we use the Kafka metric.
Yammer metrics have 4 types: gauge, meter, histogram and timer. meter
calculates a rate, but also exposes an accumulated value.

29. The Histogram class in org.apache.kafka.common.metrics.stats was never
used in the client metrics. The implementation of Histogram only provides a
fixed number of values in the domain and may not capture the quantiles very
accurately. So, we punted on using it.

Thanks,

Jun



On Thu, Mar 10, 2022 at 10:59 AM Sarat Kakarla
<sk...@confluent.io.invalid> wrote:

> Jun,
>
>   >>  28. For the broker metrics, could you spell out the full metric name
>   >>   including groups, tags, etc? We typically don't add the broker_id
> label for
>   >>   broker metrics. Also, brokers use Yammer metrics, which doesn't
> have type
>   >>   Sum.
>
> Sure,  I will update the KIP-714 with the above information, will remove
> the broker-id label from the metrics.
>
> Regarding the type is CumulativeSum the right type to use in the place of
> Sum?
>
> Thanks
> Sarat
>
>
> On 3/8/22, 5:48 PM, "Jun Rao" <ju...@confluent.io.INVALID> wrote:
>
>     Hi, Magnus, Sarat and Xavier,
>
>     Thanks for the reply. A few more comments below.
>
>     20. It seems that we are piggybacking the plugin on the
>     existing MetricsReporter. So, this seems fine.
>
>     21. That could work. Are we requiring any additional jar dependency on
> the
>     client? Or, are you suggesting that we check the runtime dependency to
> pick
>     the compression codec?
>
>     28. For the broker metrics, could you spell out the full metric name
>     including groups, tags, etc? We typically don't add the broker_id
> label for
>     broker metrics. Also, brokers use Yammer metrics, which doesn't have
> type
>     Sum.
>
>     29. There are several client metrics listed as histogram. However, the
> java
>     client currently doesn't support histogram type.
>
>     30. Could you show an example of the metric payload in
> PushTelemetryRequest
>     to help understand how we organize metrics at different levels (per
>     instance, per topic, per partition, per broker, etc)?
>
>     31. Could you add a bit more detail on which client thread sends the
>     PushTelemetryRequest?
>
>     Thanks,
>
>     Jun
>
>     On Mon, Mar 7, 2022 at 11:48 AM Magnus Edenhill <ma...@edenhill.se>
> wrote:
>
>     > Hi Jun,
>     >
>     > thanks for your initiated questions, see my answers below.
>     > There's been a number of clarifications to the KIP.
>     >
>     >
>     >
>     > Den tors 27 jan. 2022 kl 20:08 skrev Jun Rao
> <ju...@confluent.io.invalid>:
>     >
>     > > Hi, Magnus,
>     > >
>     > > Thanks for updating the KIP. The overall approach makes sense to
> me. A
>     > few
>     > > more detailed comments below.
>     > >
>     > > 20. ClientTelemetry: Should it be extending configurable and
> closable?
>     > >
>     >
>     > I'll pass this question to Sarat and/or Xavier.
>     >
>     >
>     >
>     > > 21. Compression of the metrics on the client: what's the default?
>     > >
>     >
>     > How about we specify a prioritized list: zstd, lz4, snappy, gzip?
>     > But ultimately it is up to what the client supports.
>     >
>     >
>     > 23. A client instance is considered a metric resource and the
>     > > resource-level (thus client instance level) labels could include:
>     > >     client_software_name=confluent-kafka-python
>     > >     client_software_version=v2.1.3
>     > >     client_instance_id=B64CD139-3975-440A-91D4
>     > >     transactional_id=someTxnApp
>     > > Are those labels added in PushTelemetryRequest? If so, are they per
>     > metric
>     > > or per request?
>     > >
>     >
>     >
>     > client_software* and client_instance_id are not added by the client,
>     > but are available to the broker-side metrics plugin to add as it
>     > sees fit; removed them from the KIP.
>     >
>     > As for transactional_id, group_id, etc, which I believe will be
> useful in
>     > troubleshooting,
>     > are included only once (per push) as resource-level attributes (the
> client
>     > instance is a singular resource).
>     >
>     >
>     > >
>     > > 24.  "the broker will only send
>     > > GetTelemetrySubscriptionsResponse.DeltaTemporality=True" :
>     > > 24.1 If it's always true, does it need to be part of the protocol?
>     > >
>     >
>     > We're anticipating that it will take a lot longer to upgrade the
> majority
>     > of clients than the
>     > broker/plugin side, which is why we want the client to support both
>     > temporalities out-of-the-box
>     > so that cumulative reporting can be turned on seamlessly in the
> future.
>     >
>     >
>     >
>     > > 24.2 Does delta only apply to Counter type?
>     > >
>     >
>     >
>     > And Histograms. More details in Xavier's OTLP link.
>     >
>     >
>     >
>     > > 24.3 In the delta representation, the first request needs to send
> the
>     > full
>     > > value, how does the broker plugin know whether a value is full or
> delta?
>     > >
>     >
>     > The client may (should) send the start time for each metric sample,
>     > indicating when
>     > the metric began to be collected.
>     > We've discussed whether this should be the client instance start
> time or
>     > the time when a matching
>     > metric subscription for that metric is received.
>     > For completeness we recommend using the former, the client instance
> start
>     > time.
>     >
>     >
>     >
>     > > 25. quota:
>     > > 25.1 Since we are fitting PushTelemetryRequest into the existing
> request
>     > > quota, it would be useful to document the impact, i.e. client
> metric
>     > > throttling causes the data from the same client to be delayed.
>     > > 25.2 Is PushTelemetryRequest subject to the write bandwidth quota
> like
>     > the
>     > > producer?
>     > >
>     >
>     >
>     > Yes, it should be, as to protect the cluster from rogue clients.
>     > But, in practice the size of metrics will be quite low (e.g., 1-10kb
> per
>     > 60s interval), so I don't think this will pose a problem.
>     > The KIP has been updated with more details on quota/throttling
> behaviour,
>     > see the
>     > "Throttling and rate-limiting" section.
>     >
>     >
>     > 25.3 THROTTLING_QUOTA_EXCEEDED: Currently, we don't send this error
> when
>     > > the request/bandwidth quota is exceeded since those requests are
> not
>     > > rejected. We only set this error when the request is rejected
> (e.g.,
>     > topic
>     > > creation). It would be useful to clarify when this error is used.
>     > >
>     >
>     > Right, I was trying to reuse an existing error-code. We can introduce
>     > a new one for the case where a client pushes metrics at a higher
> frequency
>     > than the configured push interval (e.g., out-of-profile sends).
>     > This causes the broker to drop those metrics and send this error
> code back
>     > to the client. There will be no connection throttling /
> channel-muting in
>     > this
>     > case (unless the standard quotas are exceeded).
>     >
>     >
>     > > 27. kafka-client-metrics.sh: Could we add an example on how to
> disable a
>     > > bad client?
>     > >
>     >
>     > There's now a --block option to kafka-client-metrics.sh which
> overrides all
>     > subscriptions
>     > for the matched client(s). This allows silencing metrics for one or
> more
>     > clients without having
>     > to remove existing subscriptions. From the client's perspective it
> will
>     > look like it no longer has
>     > any subscriptions.
>     >
>     > # Block metrics collection for a specific client instance.
>     > # A descriptive name makes it easier to clean up old subscriptions.
>     > # The --match option selects this specific client instance.
>     > $ kafka-client-metrics.sh --bootstrap-server $BROKERS \
>     >    --add \
>     >    --name 'Disable_b69cc35a' \
>     >    --match client_instance_id=b69cc35a-7a54-4790-aa69-cc2bd4ee4538 \
>     >    --block
>     >
>     >
>     >
>     >
>     > > 28. New broker side metrics: Could we spell out the details of the
>     > metrics
>     > > (e.g., group, tags, etc)?
>     > >
>     >
>     > KIP has been updated accordingly (thanks Sarat).
>     >
>     >
>     >
>     > >
>     > > 29. Client instance-level metrics: client.io.wait.time is a gauge
> not a
>     > > histogram.
>     > >
>     >
>     > I believe a population/distribution should preferably be represented
> as a
>     > histogram, space permitting,
>     > and only secondarily as a Gauge average.
>     > While we might not want to maintain a bunch of histograms for each
>     > partition, since that could be
>     > quite space consuming, this client.io.wait.time is a single metric
> per
>     > client instance and can
>     > thus afford a Histogram representation.
>     >
>     >
>     >
>     > Thanks,
>     > Magnus
>     >
>     >
>     >
>     > > Thanks,
>     > >
>     > > Jun
>     > >
>     > > On Wed, Jan 26, 2022 at 6:32 AM Magnus Edenhill <
> magnus@edenhill.se>
>     > > wrote:
>     > >
>     > > > Hi all,
>     > > >
>     > > > I've updated the KIP with responses to the latest comments: Java
> client
>     > > > dependencies (Thanks Kirk!), alternate designs (separate cluster,
>     > > separate
>     > > > producer, etc), etc.
>     > > >
>     > > > I will revive the vote thread.
>     > > >
>     > > > Thanks,
>     > > > Magnus
>     > > >
>     > > >
>     > > > Den mån 13 dec. 2021 kl 22:32 skrev Ryanne Dolan <
>     > ryannedolan@gmail.com
>     > > >:
>     > > >
>     > > > > I think we should be very careful about introducing new runtime
>     > > > > dependencies into the clients. Historically this has been rare
> and
>     > > > > essentially necessary (e.g. compression libs).
>     > > > >
>     > > > > Ryanne
>     > > > >
>     > > > > On Mon, Dec 13, 2021, 1:06 PM Kirk True <kirk@mustardgrain.com
> >
>     > wrote:
>     > > > >
>     > > > > > Hi Jun,
>     > > > > >
>     > > > > > On Thu, Dec 9, 2021, at 2:28 PM, Jun Rao wrote:
>     > > > > > > 13. Using OpenTelemetry. Does that require runtime
> dependency
>     > > > > > > on OpenTelemetry library? How good is the compatibility
> story
>     > > > > > > of OpenTelemetry? This is important since an application
> could
>     > have
>     > > > > other
>     > > > > > > OpenTelemetry dependencies than the Kafka client.
>     > > > > >
>     > > > > > The current design is that the OpenTelemetry JARs would ship
> with
>     > the
>     > > > > > client. Perhaps we can design the client such that the JARs
> aren't
>     > > even
>     > > > > > loaded if the user has opted out. The user could even
> exclude the
>     > > JARs
>     > > > > from
>     > > > > > their dependencies if they so wished.
>     > > > > >
>     > > > > > I can't speak to the compatibility of the libraries. Is it
> possible
>     > > > that
>     > > > > > we include a shaded version?
>     > > > > >
>     > > > > > Thanks,
>     > > > > > Kirk
>     > > > > >
>     > > > > > >
>     > > > > > > 14. The proposal listed idempotence=true. This is more of a
>     > > > > configuration
>     > > > > > > than a metric. Are we including that as a metric? What
> other
>     > > > > > configurations
>     > > > > > > are we including? Should we separate the configurations
> from the
>     > > > > metrics?
>     > > > > > >
>     > > > > > > Thanks,
>     > > > > > >
>     > > > > > > Jun
>     > > > > > >
>     > > > > > > On Mon, Nov 29, 2021 at 7:34 AM Magnus Edenhill <
>     > > magnus@edenhill.se>
>     > > > > > wrote:
>     > > > > > >
>     > > > > > > > Hey Bob,
>     > > > > > > >
>     > > > > > > > That's a good point.
>     > > > > > > >
>     > > > > > > > Request type labels were considered but since they're
> already
>     > > > tracked
>     > > > > > by
>     > > > > > > > broker-side metrics
>     > > > > > > > they were left out as to avoid metric duplication,
> however
>     > those
>     > > > > > metrics
>     > > > > > > > are not per connection,
>     > > > > > > > so they won't be that useful in practice for
> troubleshooting
>     > > > specific
>     > > > > > > > client instances.
>     > > > > > > >
>     > > > > > > > I'll add the request_type label to the relevant metrics.
>     > > > > > > >
>     > > > > > > > Thanks,
>     > > > > > > > Magnus
>     > > > > > > >
>     > > > > > > >
>     > > > > > > > Den tis 23 nov. 2021 kl 19:20 skrev Bob Barrett
>     > > > > > > > <bo...@confluent.io.invalid>:
>     > > > > > > >
>     > > > > > > > > Hi Magnus,
>     > > > > > > > >
>     > > > > > > > > Thanks for the thorough KIP, this seems very useful.
>     > > > > > > > >
>     > > > > > > > > Would it make sense to include the request type as a label for the
>     > > > > > > > > `client.request.success`, `client.request.errors` and `client.request.rtt`
>     > > > > > > > > metrics? I think it would be very useful to see which specific requests are
>     > > > > > > > > succeeding and failing for a client. One specific case I can think of where
>     > > > > > > > > this could be useful is producer batch timeouts. If a Java application does
>     > > > > > > > > not enable producer client logs (unfortunately, in my experience this
>     > > > > > > > > happens more often than it should), the application logs will only contain
>     > > > > > > > > the expiration error message, but no information about what is causing the
>     > > > > > > > > timeout. The requests might all be succeeding but taking too long to
>     > > > > > > > > process batches, or metadata requests might be failing, or some or all
>     > > > > > > > > produce requests might be failing (if the bootstrap servers are reachable
>     > > > > > > > > from the client but one or more other brokers are not, for example). If the
>     > > > > > > > > cluster operator is able to identify the specific requests that are slow or
>     > > > > > > > > failing for a client, they will be better able to diagnose the issue
>     > > > > > > > > causing batch timeouts.
>     > > > > > > > >
>     > > > > > > > > One drawback I can think of is that this will increase the cardinality of
>     > > > > > > > > the request metrics. But any given client is only going to use a small
>     > > > > > > > > subset of the request types, and since we already have partition labels for
>     > > > > > > > > the topic-level metrics, I think request labels will still make up a
>     > > > > > > > > relatively small percentage of the set of metrics.
>     > > > > > > > >
>     > > > > > > > > Thanks,
>     > > > > > > > > Bob
>     > > > > > > > >
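
To make the request-type label suggestion above concrete, here is a minimal sketch, assuming a hypothetical in-memory counter store (`counters`) and a hypothetical `record` helper; neither is part of the KIP or of any Kafka client. It only illustrates how a `request_type` label would let an operator see which request type is failing.

```python
from collections import Counter

# Hypothetical in-memory store: one counter per (metric name, request_type) pair.
counters = Counter()


def record(request_type: str, ok: bool) -> None:
    """Count one request outcome under its request_type label."""
    metric = "client.request.success" if ok else "client.request.errors"
    counters[(metric, request_type)] += 1


# Illustrative scenario: metadata requests succeed while produce requests fail.
record("Metadata", True)
record("Produce", False)
record("Produce", False)

# With the label in place, the failing request types can be listed directly.
failing = sorted(
    request_type
    for (metric, request_type), count in counters.items()
    if metric == "client.request.errors" and count > 0
)
```

The number of series grows only with the request types a client actually uses, which is the cardinality argument made above.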
>     > > > > > > > > On Mon, Nov 22, 2021 at 2:08 AM Viktor Somogyi-Vass <
>     > > > > > > > > viktorsomogyi@gmail.com>
>     > > > > > > > > wrote:
>     > > > > > > > >
>     > > > > > > > > > Hi Magnus,
>     > > > > > > > > >
>     > > > > > > > > > I think this is a very useful addition. We also have a similar (but much
>     > > > > > > > > > more simplistic) implementation of this. Maybe I missed it in the KIP but
>     > > > > > > > > > what about adding metrics about the subscription cache itself? That I think
>     > > > > > > > > > would improve its usability and debuggability as we'd be able to see its
>     > > > > > > > > > performance, hit/miss rates, eviction counts and others.
>     > > > > > > > > >
>     > > > > > > > > > Best,
>     > > > > > > > > > Viktor
>     > > > > > > > > >
>     > > > > > > > > > On Thu, Nov 18, 2021 at 5:12 PM Magnus Edenhill <
>     > > > > > magnus@edenhill.se>
>     > > > > > > > > > wrote:
>     > > > > > > > > >
>     > > > > > > > > > > Hi Mickael,
>     > > > > > > > > > >
>     > > > > > > > > > > see inline.
>     > > > > > > > > > >
>     > > > > > > > > > > On Wed, Nov 10, 2021 at 15:21, Mickael Maison
>     > > > > > > > > > > <mickael.maison@gmail.com> wrote:
>     > > > > > > > > > >
>     > > > > > > > > > > > Hi Magnus,
>     > > > > > > > > > > >
>     > > > > > > > > > > > I see you've addressed some of the points I raised above but some
>     > > > > > > > > > > > (4, 5) have not been addressed yet.
>     > > > > > > > > > > >
>     > > > > > > > > > >
>     > > > > > > > > > > Re 4) How will the user/app know metrics are being sent?
>     > > > > > > > > > >
>     > > > > > > > > > > One possibility is to add a JMX metric (thus for user consumption) for
>     > > > > > > > > > > the number of metric pushes the client has performed, or perhaps the
>     > > > > > > > > > > number of metrics subscriptions currently being collected.
>     > > > > > > > > > > Would that be sufficient?
>     > > > > > > > > > >
>     > > > > > > > > > > Re 5) Metric sizes and rates
>     > > > > > > > > > >
>     > > > > > > > > > > A worst case scenario for a producer that is producing to 50 unique
>     > > > > > > > > > > topics and emitting all standard metrics yields a serialized size of
>     > > > > > > > > > > around 100 KB prior to compression, which compresses down to about
>     > > > > > > > > > > 20-30% of that depending on compression type and topic name uniqueness.
>     > > > > > > > > > > The numbers for a consumer would be similar.
>     > > > > > > > > > >
>     > > > > > > > > > > In practice the number of unique topics would be far less, and the
>     > > > > > > > > > > subscription set would typically be for a subset of metrics.
>     > > > > > > > > > > So we're probably closer to 1 KB, or less, compressed size per client
>     > > > > > > > > > > per push interval.
>     > > > > > > > > > >
>     > > > > > > > > > > As both the subscription set and push intervals are controlled by the
>     > > > > > > > > > > cluster operator it shouldn't be too hard to strike a good balance
>     > > > > > > > > > > between metrics overhead and granularity.
>     > > > > > > > > > >
>     > > > > > > > > > >
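
The size figures quoted above can be turned into a rough back-of-the-envelope estimate. This is only a sketch: the 100 KB, 20-30% and 1 KB numbers come from the message above, while the fleet size of 1000 clients and the 60 second push interval are assumptions picked purely for illustration.

```python
def compressed_size_bytes(raw_bytes: int, percent_of_raw: int) -> int:
    """Size after compression, given the compressed size as a percentage of raw."""
    return raw_bytes * percent_of_raw // 100


# Worst case from the thread: ~100 KB serialized, compressing to 20-30% of that.
worst_case_raw = 100 * 1000
worst_case_low = compressed_size_bytes(worst_case_raw, 20)
worst_case_high = compressed_size_bytes(worst_case_raw, 30)


def cluster_ingest_bytes_per_second(per_push_bytes: int, clients: int,
                                    push_interval_s: int) -> int:
    """Aggregate telemetry bytes per second arriving at the cluster."""
    return per_push_bytes * clients // push_interval_s


# Typical case from the thread (~1 KB per push); clients and interval are assumed.
typical_rate = cluster_ingest_bytes_per_second(1000, clients=1000, push_interval_s=60)
```

Under these assumed values the worst case is 20 to 30 KB per push, and the typical case adds up to under 17 KB/s across the whole fleet.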
>     > > > > > > > > > >
>     > > > > > > > > > > >
>     > > > > > > > > > > > I'm really uneasy with this being enabled by default on the client
>     > > > > > > > > > > > side. When collecting data, I think the best practice is to ensure
>     > > > > > > > > > > > users are explicitly enabling it.
>     > > > > > > > > > > >
>     > > > > > > > > > >
>     > > > > > > > > > > Requiring metrics to be explicitly enabled on clients severely cripples
>     > > > > > > > > > > the feature's usability and value.
>     > > > > > > > > > >
>     > > > > > > > > > > One of the problems that this KIP aims to solve is for useful metrics to
>     > > > > > > > > > > be available on demand regardless of the technical expertise of the user.
>     > > > > > > > > > > As Ryanne points out, a savvy user/organization will typically have
>     > > > > > > > > > > metrics collection and monitoring in place already, and the benefits of
>     > > > > > > > > > > this KIP are then more of a common set and format of metrics across
>     > > > > > > > > > > client implementations and languages.
>     > > > > > > > > > > But that is not the typical Kafka user in my experience; they're not
>     > > > > > > > > > > Kafka experts and they don't have the knowledge of how to best
>     > > > > > > > > > > instrument their clients.
>     > > > > > > > > > > Having metrics enabled by default for this user base allows the Kafka
>     > > > > > > > > > > operators to proactively and reactively monitor and troubleshoot client
>     > > > > > > > > > > issues, without the need for the less savvy user to do anything.
>     > > > > > > > > > > It is often too late to tell a user to enable metrics when the problem
>     > > > > > > > > > > has already occurred.
>     > > > > > > > > > >
>     > > > > > > > > > > Now, to be clear, even though metrics are enabled by default on clients,
>     > > > > > > > > > > they are not enabled by default on the brokers; the Kafka operator needs
>     > > > > > > > > > > to build and set up a metrics plugin and add metrics subscriptions
>     > > > > > > > > > > before anything is sent from the client.
>     > > > > > > > > > > It is opt-out on the clients and opt-in on the broker.
>     > > > > > > > > > >
>     > > > > > > > > > >
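
The opt-out/opt-in arrangement described above amounts to a conjunction of three conditions. The function and parameter names below are hypothetical and are not Kafka configuration keys; the sketch only models the decision as the message states it.

```python
def client_pushes_metrics(client_enabled: bool,
                          broker_plugin_configured: bool,
                          matching_subscriptions: int) -> bool:
    """Return True only if every precondition described in the thread holds.

    client_enabled:           on by default on the client (opt-out).
    broker_plugin_configured: the operator has installed a metrics plugin (opt-in).
    matching_subscriptions:   number of operator-defined subscriptions matching
                              this client (opt-in).
    """
    return client_enabled and broker_plugin_configured and matching_subscriptions > 0


# Default client against a broker with nothing set up: nothing is sent.
default_cluster = client_pushes_metrics(True, False, 0)

# Operator has installed a plugin and added one matching subscription.
operator_opted_in = client_pushes_metrics(True, True, 1)

# User explicitly disabled metrics on the client.
client_opted_out = client_pushes_metrics(False, True, 1)
```

In this model a client running with defaults sends nothing until the operator acts, which is the point made above.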
>     > > > > > > > > > >
>     > > > > > > > > > >
>     > > > > > > > > > > > You mentioned brokers already have some (most?) of the information
>     > > > > > > > > > > > contained in metrics. If so, then why are we collecting it again?
>     > > > > > > > > > > > Surely there must be some new information in the client metrics.
>     > > > > > > > > > > >
>     > > > > > > > > > >
>     > > > > > > > > > > From the user's perspective the Kafka infrastructure extends from
>     > > > > > > > > > > producer.send() to messages being returned from consumer.poll(), a giant
>     > > > > > > > > > > black box where there's a lot going on between those two points.
>     > > > > > > > > > > The brokers currently only see what happens once those requests and
>     > > > > > > > > > > messages hit the broker, but as Kafka clients are complex pieces of
>     > > > > > > > > > > machinery there's a myriad of queues, timers, and state that's critical
>     > > > > > > > > > > to the operation and infrastructure that's not currently visible to the
>     > > > > > > > > > > operator.
>     > > > > > > > > > > Relying on the user to provide this missing information accurately and
>     > > > > > > > > > > in a timely manner is not generally feasible.
>     > > > > > > > > > >
>     > > > > > > > > > >
>     > > > > > > > > > > Most of the standard metrics listed in the KIP are data points that the
>     > > > > > > > > > > broker does not have.
>     > > > > > > > > > > Only a small number of metrics are duplicates (like the request counts
>     > > > > > > > > > > and sizes), but they are included to ease correlation when inspecting
>     > > > > > > > > > > these client metrics.
>     > > > > > > > > > >
>     > > > > > > > > > >
>     > > > > > > > > > >
>     > > > > > > > > > > > Moreover this is a brand new feature so it's even harder to justify
>     > > > > > > > > > > > enabling it and forcing it onto all our users. If disabled by
>     > > > > > > > > > > > default, it's relatively easy to enable in a new release if we
>     > > > > > > > > > > > decide to, but once enabled by default it's much harder to disable.
>     > > > > > > > > > > > Also this feature will apply to all future metrics we will add.
>     > > > > > > > > > > >
>     > > > > > > > > > >
>     > > > > > > > > > > I think maturity of a feature implementation should be the deciding
>     > > > > > > > > > > factor, rather than the design of it (which this KIP is). I.e., if the
>     > > > > > > > > > > implementation is not deemed mature enough for release X.Y it will be
>     > > > > > > > > > > disabled.
>     > > > > > > > > > >
>     > > > > > > > > > >
>     > > > > > > > > > >
>     > > > > > > > > > > > Overall I think it's an interesting feature but I'd prefer to be
>     > > > > > > > > > > > slightly defensive and see how it works in practice before enabling
>     > > > > > > > > > > > it everywhere.
>     > > > > > > > > > > >
>     > > > > > > > > > >
>     > > > > > > > > > > Right, and I agree on being defensive, but since this feature still
>     > > > > > > > > > > requires manual enabling on the brokers before actually being used, I
>     > > > > > > > > > > think that gives enough control to opt-in or out of this feature as
>     > > > > > > > > > > needed.
>     > > > > > > > > > >
>     > > > > > > > > > > Thanks for your comments!
>     > > > > > > > > > >
>     > > > > > > > > > > Regards,
>     > > > > > > > > > > Magnus
>     > > > > > > > > > >
>     > > > > > > > > > >
>     > > > > > > > > > >
>     > > > > > > > > > > > Thanks,
>     > > > > > > > > > > > Mickael
>     > > > > > > > > > > >
>     > > > > > > > > > > > On Mon, Nov 8, 2021 at 7:55 AM Magnus Edenhill <
>     > > > > > magnus@edenhill.se
>     > > > > > > > >
>     > > > > > > > > > > wrote:
>     > > > > > > > > > > > >
>     > > > > > > > > > > > > Thanks David for pointing this out,
>     > > > > > > > > > > > > I've updated the KIP to include client_id as a
>     > matching
>     > > > > > selector.
>     > > > > > > > > > > > >
>     > > > > > > > > > > > > Regards,
>     > > > > > > > > > > > > Magnus
>     > > > > > > > > > > > >
>     > > > > > > > > > > > > On Thu, Nov 4, 2021 at 18:01, David Mao
>     > > > > > > > > > > > > <dmao@confluent.io.invalid> wrote:
>     > > > > > > > > > > > >
>     > > > > > > > > > > > > > Hey Magnus,
>     > > > > > > > > > > > > >
>     > > > > > > > > > > > > > I noticed that the KIP outlines the initial
>     > selectors
>     > > > > > supported
>     > > > > > > > > as:
>     > > > > > > > > > > > > >
>     > > > > > > > > > > > > >    - client_instance_id - CLIENT_INSTANCE_ID
> UUID
>     > > > string
>     > > > > > > > > > > > representation.
>     > > > > > > > > > > > > >    - client_software_name  - client software
>     > > > > implementation
>     > > > > > > > name.
>     > > > > > > > > > > > > >    - client_software_version  - client
> software
>     > > > > > implementation
>     > > > > > > > > > > version.
>     > > > > > > > > > > > > >
>     > > > > > > > > > > > > > In the given reactive monitoring workflow, we
>     > mention
>     > > > > that
>     > > > > > the
>     > > > > > > > > > > > application
>     > > > > > > > > > > > > > user does not know their client's client
> instance
>     > ID,
>     > > > but
>     > > > > > it's
>     > > > > > > > > > > outlined
>     > > > > > > > > > > > > > that the operator can add a metrics
> subscription
>     > > > > selecting
>     > > > > > for
>     > > > > > > > > > > > clientId. I
>     > > > > > > > > > > > > > don't see clientId as one of the supported
>     > selectors.
>     > > > > > > > > > > > > > I can see how this would have made sense in a
>     > > previous
>     > > > > > > > iteration
>     > > > > > > > > > > given
>     > > > > > > > > > > > that
>     > > > > > > > > > > > > > the previous client instance ID proposal was
> to
>     > > > construct
>     > > > > > the
>     > > > > > > > > > client
>     > > > > > > > > > > > > > instance ID using clientId as a prefix. Now
> that
>     > the
>     > > > > client
>     > > > > > > > > > instance
>     > > > > > > > > > > > ID is
>     > > > > > > > > > > > > > a UUID, would we want to add clientId as a
>     > supported
>     > > > > > selector?
>     > > > > > > > > > > > > > Let me know what you think.
>     > > > > > > > > > > > > >
>     > > > > > > > > > > > > > David
>     > > > > > > > > > > > > >
>     > > > > > > > > > > > > > On Tue, Oct 19, 2021 at 12:33 PM Magnus
> Edenhill <
>     > > > > > > > > > magnus@edenhill.se
>     > > > > > > > > > > >
>     > > > > > > > > > > > > > wrote:
>     > > > > > > > > > > > > >
>     > > > > > > > > > > > > > > Hi Mickael!
>     > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > > On Tue, Oct 19, 2021 at 19:30, Mickael Maison
>     > > > > > > > > > > > > > > <mickael.maison@gmail.com> wrote:
>     > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > > > Hi Magnus,
>     > > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > > > Thanks for the proposal.
>     > > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > > > 1. Looking at the protocol section, isn't
>     > > > > > > > "ClientInstanceId"
>     > > > > > > > > > > > expected
>     > > > > > > > > > > > > > > > to be a field in
>     > > > GetTelemetrySubscriptionsResponseV0?
>     > > > > > > > > > Otherwise,
>     > > > > > > > > > > > how
>     > > > > > > > > > > > > > > > does a client retrieve this value?
>     > > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > > Good catch, it got removed by mistake in
> one of
>     > the
>     > > > > > edits.
>     > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > > > 2. In the client API section, you
> mention a new
>     > > > > method
>     > > > > > > > > > > > > > > > "clientInstanceId()". Can you clarify
> which
>     > > > > interfaces
>     > > > > > are
>     > > > > > > > > > > > affected?
>     > > > > > > > > > > > > > > > Is it only Consumer and Producer?
>     > > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > > And Admin. Will update the KIP.
>     > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > > > 3. I'm a bit concerned this is enabled by
>     > > default.
>     > > > > > Even if
>     > > > > > > > > the
>     > > > > > > > > > > data
>     > > > > > > > > > > > > > > > collected is supposed to be not
> sensitive, I
>     > > think
>     > > > > > this can
>     > > > > > > > > be
>     > > > > > > > > > > > > > > > problematic in some environments. Also
> users
>     > > don't
>     > > > > > seem to
>     > > > > > > > > have
>     > > > > > > > > > > the
>     > > > > > > > > > > > > > > > choice to only expose some metrics.
> Knowing how
>     > > > much
>     > > > > > data
>     > > > > > > > > > transits
>     > > > > > > > > > > > > > > > through some applications can be
> considered
>     > > > critical.
>     > > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > > The broker already knows how much data
> transits
>     > > > through
>     > > > > > the
>     > > > > > > > > > client
>     > > > > > > > > > > > > > though,
>     > > > > > > > > > > > > > > right?
>     > > > > > > > > > > > > > > Care has been taken not to expose
> information in
>     > > the
>     > > > > > standard
>     > > > > > > > > > > metrics
>     > > > > > > > > > > > > > that
>     > > > > > > > > > > > > > > might
>     > > > > > > > > > > > > > > reveal sensitive information.
>     > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > > Do you have an example of how the proposed
>     > metrics
>     > > > > could
>     > > > > > leak
>     > > > > > > > > > > > sensitive
>     > > > > > > > > > > > > > > information?
>     > > > > > > > > > > > > > > As for limiting what metrics to
> export; I
>     > guess
>     > > > > that
>     > > > > > > > could
>     > > > > > > > > > make
>     > > > > > > > > > > > sense
>     > > > > > > > > > > > > > > in some
>     > > > > > > > > > > > > > > very sensitive use-cases, but those users
> might
>     > > > disable
>     > > > > > > > metrics
>     > > > > > > > > > > > > > altogether
>     > > > > > > > > > > > > > > for now.
>     > > > > > > > > > > > > > > Could these concerns be addressed by a
> later KIP?
>     > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > > > 4. As a user, how do you know if your
>     > application
>     > > > is
>     > > > > > > > actively
>     > > > > > > > > > > > sending
>     > > > > > > > > > > > > > > > metrics? Are there new metrics exposing
> what's
>     > > > going
>     > > > > > on,
>     > > > > > > > like
>     > > > > > > > > > how
>     > > > > > > > > > > > much
>     > > > > > > > > > > > > > > > data is being sent?
>     > > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > > That's a good question.
>     > > > > > > > > > > > > > > Since the proposed metrics interface is
> not aimed
>     > > at,
>     > > > > or
>     > > > > > > > > directly
>     > > > > > > > > > > > > > available
>     > > > > > > > > > > > > > > to, the application
>     > > > > > > > > > > > > > > I guess there's little point in adding it
> here,
>     > but
>     > > > > > instead
>     > > > > > > > > > adding
>     > > > > > > > > > > > > > > something to the
>     > > > > > > > > > > > > > > existing JMX metrics?
>     > > > > > > > > > > > > > > Do you have any suggestions?
>     > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > > > 5. If all metrics are enabled on a
> regular
>     > > Consumer
>     > > > > or
>     > > > > > > > > > Producer,
>     > > > > > > > > > > do
>     > > > > > > > > > > > > > > > you have an idea how much throughput
> this would
>     > > > use?
>     > > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > > It depends on the number of
> partition/topics/etc
>     > > the
>     > > > > > client
>     > > > > > > > is
>     > > > > > > > > > > > producing
>     > > > > > > > > > > > > > > to/consuming from.
>     > > > > > > > > > > > > > > I'll add some sizes to the KIP for some
> typical
>     > > > > > use-cases.
>     > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > > Thanks,
>     > > > > > > > > > > > > > > Magnus
>     > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > > > Thanks
>     > > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > > > On Tue, Oct 19, 2021 at 5:06 PM Magnus
>     > Edenhill <
>     > > > > > > > > > > > magnus@edenhill.se>
>     > > > > > > > > > > > > > > > wrote:
>     > > > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > > > > On Tue, Oct 19, 2021 at 13:22, Tom Bentley
>     > > > > > > > > > > > > > > > > <tbentley@redhat.com> wrote:
>     > > > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > > > > > Hi Magnus,
>     > > > > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > > > > > I reviewed the KIP since you called
> the
>     > vote
>     > > > > > (sorry for
>     > > > > > > > > not
>     > > > > > > > > > > > > > reviewing
>     > > > > > > > > > > > > > > > when
>     > > > > > > > > > > > > > > > > > you announced your intention to call
> the
>     > > > vote). I
>     > > > > > have
>     > > > > > > > a
>     > > > > > > > > > few
>     > > > > > > > > > > > > > > questions
>     > > > > > > > > > > > > > > > on
>     > > > > > > > > > > > > > > > > > some of the details.
>     > > > > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > > > > > 1. There's no Javadoc on
>     > > > > > ClientTelemetryPayload.data(),
>     > > > > > > > > so
>     > > > > > > > > > I
>     > > > > > > > > > > > don't
>     > > > > > > > > > > > > > > know
>     > > > > > > > > > > > > > > > > > whether the payload is exposed
> through this
>     > > > > method
>     > > > > > as
>     > > > > > > > > > > > compressed or
>     > > > > > > > > > > > > > > > not.
>     > > > > > > > > > > > > > > > > > Later on you say "Decompression of
> the
>     > > payloads
>     > > > > > will be
>     > > > > > > > > > > > handled by
>     > > > > > > > > > > > > > > the
>     > > > > > > > > > > > > > > > > > broker metrics plugin, the broker
> should
>     > > > expose a
>     > > > > > > > > suitable
>     > > > > > > > > > > > > > > > decompression
>     > > > > > > > > > > > > > > > > > API to the metrics plugin for this
>     > purpose.",
>     > > > > which
>     > > > > > > > > > suggests
>     > > > > > > > > > > > it's
>     > > > > > > > > > > > > > the
>     > > > > > > > > > > > > > > > > > compressed data in the buffer, but
> then we
>     > > > don't
>     > > > > > know
>     > > > > > > > > which
>     > > > > > > > > > > > codec
>     > > > > > > > > > > > > > was
>     > > > > > > > > > > > > > > > used,
>     > > > > > > > > > > > > > > > > > nor the API via which the plugin
> should
>     > > > > decompress
>     > > > > > it
>     > > > > > > > if
>     > > > > > > > > > > > required
>     > > > > > > > > > > > > > for
>     > > > > > > > > > > > > > > > > > forwarding to the ultimate metrics
> store.
>     > > > Should
>     > > > > > the
>     > > > > > > > > > > > > > > > ClientTelemetryPayload
>     > > > > > > > > > > > > > > > > > expose a method to get the
> compression and
>     > a
>     > > > > > > > > decompressor?
>     > > > > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > > > > Good point, updated.
>     > > > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > > > > > 2. The client-side API is expressed
> as
>     > > > > > StringOrError
>     > > > > > > > > > > > > > > > > > ClientInstance::ClientInstanceId(int
>     > > > > timeout_ms). I
>     > > > > > > > > > > understand
>     > > > > > > > > > > > that
>     > > > > > > > > > > > > > > > you're
>     > > > > > > > > > > > > > > > > > thinking about the librdkafka
>     > implementation,
>     > > > but
>     > > > > > it
>     > > > > > > > > would
>     > > > > > > > > > be
>     > > > > > > > > > > > good
>     > > > > > > > > > > > > > to
>     > > > > > > > > > > > > > > > show
>     > > > > > > > > > > > > > > > > > the API as it would appear on the
> Apache
>     > > Kafka
>     > > > > > clients.
>     > > > > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > > > > This was meant as pseudo-code, but I
> changed
>     > it
>     > > > to
>     > > > > > Java.
>     > > > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > > > > > 3. "PushTelemetryRequest|Response -
>     > protocol
>     > > > > > request
>     > > > > > > > used
>     > > > > > > > > > by
>     > > > > > > > > > > > the
>     > > > > > > > > > > > > > > > client to
>     > > > > > > > > > > > > > > > > > send metrics to any broker it is
> connected
>     > > to."
>     > > > > To
>     > > > > > be
>     > > > > > > > > > clear,
>     > > > > > > > > > > > this
>     > > > > > > > > > > > > > > means
>     > > > > > > > > > > > > > > > > > that the client can choose any of the
>     > > connected
>     > > > > > brokers
>     > > > > > > > > and
>     > > > > > > > > > > > push to
>     > > > > > > > > > > > > > > > just
>     > > > > > > > > > > > > > > > > > one of them? What should a supporting
>     > client
>     > > do
>     > > > > if
>     > > > > > it
>     > > > > > > > > gets
>     > > > > > > > > > an
>     > > > > > > > > > > > error
>     > > > > > > > > > > > > > > > when
>     > > > > > > > > > > > > > > > > > pushing metrics to a broker, retry
> sending
>     > to
>     > > > the
>     > > > > > same
>     > > > > > > > > > broker
>     > > > > > > > > > > > or
>     > > > > > > > > > > > > > try
>     > > > > > > > > > > > > > > > > > pushing to another broker, or drop
> the
>     > > metrics?
>     > > > > > Should
>     > > > > > > > > > > > supporting
>     > > > > > > > > > > > > > > > clients
>     > > > > > > > > > > > > > > > > > send successive requests to a single
>     > broker,
>     > > or
>     > > > > > round
>     > > > > > > > > > robin,
>     > > > > > > > > > > > or is
>     > > > > > > > > > > > > > > > that up
>     > > > > > > > > > > > > > > > > > to the client author? I'm guessing
> the
>     > > > behaviour
>     > > > > > should
>     > > > > > > > > be
>     > > > > > > > > > > > sticky
>     > > > > > > > > > > > > > to
>     > > > > > > > > > > > > > > > > > support the rate limiting features,
> but I
>     > > think
>     > > > > it
>     > > > > > > > would
>     > > > > > > > > be
>     > > > > > > > > > > > good
>     > > > > > > > > > > > > > for
>     > > > > > > > > > > > > > > > client
>     > > > > > > > > > > > > > > > > > authors if this section were
> explicit on
>     > the
>     > > > > > > > recommended
>     > > > > > > > > > > > behaviour.
>     > > > > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > > > > You are right, I've updated the KIP to
> make
>     > > this
>     > > > > > clearer.
>     > > > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > > > > > 4. "Mapping the client instance id
> to an
>     > > actual
>     > > > > > > > > application
>     > > > > > > > > > > > > > instance
>     > > > > > > > > > > > > > > > > > running on a (virtual) machine can
> be done
>     > by
>     > > > > > > > inspecting
>     > > > > > > > > > the
>     > > > > > > > > > > > > > metrics
>     > > > > > > > > > > > > > > > > > resource labels, such as the client
> source
>     > > > > address
>     > > > > > and
>     > > > > > > > > > source
>     > > > > > > > > > > > port,
>     > > > > > > > > > > > > > > or
>     > > > > > > > > > > > > > > > > > security principal, all of which are
> added
>     > by
>     > > > the
>     > > > > > > > > receiving
>     > > > > > > > > > > > broker.
>     > > > > > > > > > > > > > > > This
>     > > > > > > > > > > > > > > > > > will allow the operator together
> with the
>     > > user
>     > > > to
>     > > > > > > > > identify
>     > > > > > > > > > > the
>     > > > > > > > > > > > > > actual
>     > > > > > > > > > > > > > > > > > application instance." Is this really
>     > always
>     > > > > true?
>     > > > > > The
>     > > > > > > > > > source
>     > > > > > > > > > > > IP
>     > > > > > > > > > > > > > and
>     > > > > > > > > > > > > > > > port
>     > > > > > > > > > > > > > > > > > might be a loadbalancer/proxy in some
>     > setups.
>     > > > The
>     > > > > > > > > > principal,
>     > > > > > > > > > > as
>     > > > > > > > > > > > > > > already
>     > > > > > > > > > > > > > > > > > mentioned in the KIP, might be shared
>     > between
>     > > > > > multiple
>     > > > > > > > > > > > > > applications.
>     > > > > > > > > > > > > > > > So at
>     > > > > > > > > > > > > > > > > > worst the organization running the
> clients
>     > > > might
>     > > > > > have
>     > > > > > > > to
>     > > > > > > > > > > > consult
>     > > > > > > > > > > > > > the
>     > > > > > > > > > > > > > > > logs
>     > > > > > > > > > > > > > > > > > of a set of client applications,
> right?
>     > > > > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > > > > Yes, that's correct. There's no
> guaranteed
>     > > > mapping
>     > > > > > from
>     > > > > > > > > > > > > > > > client_instance_id
>     > > > > > > > > > > > > > > > > to
>     > > > > > > > > > > > > > > > > an actual instance, that's why the KIP
>     > > recommends
>     > > > > > client
>     > > > > > > > > > > > > > > implementations
>     > > > > > > > > > > > > > > > to
>     > > > > > > > > > > > > > > > > log the client instance id
>     > > > > > > > > > > > > > > > > upon retrieval, and also provide an
> API for
>     > the
>     > > > > > > > application
>     > > > > > > > > > to
>     > > > > > > > > > > > > > retrieve
>     > > > > > > > > > > > > > > > the
>     > > > > > > > > > > > > > > > > instance id programmatically
>     > > > > > > > > > > > > > > > > if it has a better way of exposing it.
>     > > > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > > > > > 5. "Tests indicate that a compression
> ratio
>     > up
>     > > to
>     > > > > > 10x is
>     > > > > > > > > > > > possible for
>     > > > > > > > > > > > > > > the
>     > > > > > > > > > > > > > > > > > standard metrics." Client authors
> might
>     > > > > appreciate
>     > > > > > your
>     > > > > > > > > > > > mentioning
>     > > > > > > > > > > > > > > > which
>     > > > > > > > > > > > > > > > > > compression codec got these results.
>     > > > > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > > > > Good point. Updated.
>     > > > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > > > > > 6. "Should the client send a push
> request
>     > > prior
>     > > > > to
>     > > > > > > > expiry
>     > > > > > > > > > of
>     > > > > > > > > > > > the
>     > > > > > > > > > > > > > > > previously
>     > > > > > > > > > > > > > > > > > calculated PushIntervalMs the broker
> will
>     > > > discard
>     > > > > > the
>     > > > > > > > > > metrics
>     > > > > > > > > > > > and
>     > > > > > > > > > > > > > > > return a
>     > > > > > > > > > > > > > > > > > PushTelemetryResponse with the
> ErrorCode
>     > set
>     > > to
>     > > > > > > > > > RateLimited."
>     > > > > > > > > > > > Is
>     > > > > > > > > > > > > > this
>     > > > > > > > > > > > > > > > > > RATE_LIMITED a new error code? It's
> not
>     > > > mentioned
>     > > > > > in
>     > > > > > > > the
>     > > > > > > > > > "New
>     > > > > > > > > > > > Error
>     > > > > > > > > > > > > > > > Codes"
>     > > > > > > > > > > > > > > > > > section.
>     > > > > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > > > > That's a leftover, it should be using
> the
>     > > > standard
>     > > > > > > > > > ThrottleTime
>     > > > > > > > > > > > > > > > mechanism.
>     > > > > > > > > > > > > > > > > Fixed.
>     > > > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > > > > > 7. In the section "Standard client
> resource
>     > > > > labels"
>     > > > > > > > > > > > application_id
>     > > > > > > > > > > > > > is
>     > > > > > > > > > > > > > > > > > described as Kafka Streams only, but
> the
>     > > > section
>     > > > > of
>     > > > > > > > > "Client
>     > > > > > > > > > > > > > > > Identification"
>     > > > > > > > > > > > > > > > > > talks about "application instance id
> as an
>     > > > > optional
>     > > > > > > > > future
>     > > > > > > > > > > > > > > nice-to-have
>     > > > > > > > > > > > > > > > > > that may be included as a metrics
> label if
>     > it
>     > > > has
>     > > > > > been
>     > > > > > > > > set
>     > > > > > > > > > by
>     > > > > > > > > > > > the
>     > > > > > > > > > > > > > > > user", so
>     > > > > > > > > > > > > > > > > > I'm confused whether non-Kafka
> Streams
>     > > clients
>     > > > > > should
>     > > > > > > > set
>     > > > > > > > > > an
>     > > > > > > > > > > > > > > > application_id
>     > > > > > > > > > > > > > > > > > or not.
>     > > > > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > > > > I'll clarify this in the KIP, but
> basically
>     > we
>     > > > > would
>     > > > > > need
>     > > > > > > > > to
>     > > > > > > > > > > add
>     > > > > > > > > > > > an `
>     > > > > > > > > > > > > > > > > application.id` config
>     > > > > > > > > > > > > > > > > property for non-streams clients for
> this
>     > > > purpose,
>     > > > > > and
>     > > > > > > > > that's
>     > > > > > > > > > > > outside
>     > > > > > > > > > > > > > > the
>     > > > > > > > > > > > > > > > > scope of this KIP since we want to
> make it
>     > > > > > zero-conf:ish
>     > > > > > > > on
>     > > > > > > > > > the
>     > > > > > > > > > > > > > client
>     > > > > > > > > > > > > > > > side.
>     > > > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > > > > > Kind regards,
>     > > > > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > > > > > Tom
>     > > > > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > > > > Thanks for the review,
>     > > > > > > > > > > > > > > > > Magnus
>     > > > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > > > > > On Thu, Oct 7, 2021 at 5:26 PM Magnus
>     > > Edenhill
>     > > > <
>     > > > > > > > > > > > magnus@edenhill.se
>     > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > > > wrote:
>     > > > > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > > > > > > Hi all,
>     > > > > > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > > > > > > I've updated the KIP following our
> recent
>     > > > > > discussions
>     > > > > > > > > on
>     > > > > > > > > > > the
>     > > > > > > > > > > > > > > mailing
>     > > > > > > > > > > > > > > > > > list:
>     > > > > > > > > > > > > > > > > > >  - split the protocol in two, one
> for
>     > > getting
>     > > > > the
>     > > > > > > > > metrics
>     > > > > > > > > > > > > > > > subscriptions,
>     > > > > > > > > > > > > > > > > > > and one for pushing the metrics.
>     > > > > > > > > > > > > > > > > > >  - simplifications: initially only
> one
>     > > > > supported
>     > > > > > > > > metrics
>     > > > > > > > > > > > format,
>     > > > > > > > > > > > > > no
>     > > > > > > > > > > > > > > > > > > client.id in the instance id, etc.
>     > > > > > > > > > > > > > > > > > >  - made CLIENT_METRICS subscription
>     > > > > configuration
>     > > > > > > > > entries
>     > > > > > > > > > > > more
>     > > > > > > > > > > > > > > > structured
>     > > > > > > > > > > > > > > > > > >    and allowing better client
> matching
>     > > > > selectors
>     > > > > > (not
>     > > > > > > > > > only
>     > > > > > > > > > > > on the
>     > > > > > > > > > > > > > > > > > instance
>     > > > > > > > > > > > > > > > > > > id, but also the other
>     > > > > > > > > > > > > > > > > > >    client resource labels, such as
>     > > > > > > > > client_software_name,
>     > > > > > > > > > > > etc.).
>     > > > > > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > > > > > > Unless there are further comments
> I'll
>     > call
>     > > > the
>     > > > > > vote
>     > > > > > > > > in a
>     > > > > > > > > > > > day or
>     > > > > > > > > > > > > > > two.
>     > > > > > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > > > > > > Regards,
>     > > > > > > > > > > > > > > > > > > Magnus
>     > > > > > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > > > > > > On Mon, Oct 4, 2021 at 20:57,
> Magnus
>     > > > > > Edenhill <
>     > > > > > > > > > > > > > > > magnus@edenhill.se> wrote:
>     > > > > > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > > > > > > > Hi Gwen,
>     > > > > > > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > > > > > > > I'm finishing up the KIP based
> on the
>     > > last
>     > > > > > couple
>     > > > > > > > of
>     > > > > > > > > > > > discussion
>     > > > > > > > > > > > > > > > points
>     > > > > > > > > > > > > > > > > > in
>     > > > > > > > > > > > > > > > > > > > this thread
>     > > > > > > > > > > > > > > > > > > > and will call the Vote later
> this week.
>     > > > > > > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > > > > > > > Best,
>     > > > > > > > > > > > > > > > > > > > Magnus
>     > > > > > > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > > > > > > > On Sat, Oct 2, 2021 at 02:01,
> Gwen
>     > > > > Shapira
>     > > > > > > > > > > > > > > > > > > <gwen@confluent.io.invalid
>     > > > > > > > > > > > > > > > > > > > > wrote:
>     > > > > > > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > > > > > > >> Hey,
>     > > > > > > > > > > > > > > > > > > >>
>     > > > > > > > > > > > > > > > > > > >> I noticed that there was no
> discussion
>     > > for
>     > > > > the
>     > > > > > > > last
>     > > > > > > > > 10
>     > > > > > > > > > > > days,
>     > > > > > > > > > > > > > > but I
>     > > > > > > > > > > > > > > > > > > >> couldn't
>     > > > > > > > > > > > > > > > > > > >> find the vote thread. Is there
> one
>     > that
>     > > > I'm
>     > > > > > > > missing?
>     > > > > > > > > > > > > > > > > > > >>
>     > > > > > > > > > > > > > > > > > > >> Gwen
>     > > > > > > > > > > > > > > > > > > >>
>     > > > > > > > > > > > > > > > > > > >> On Wed, Sep 22, 2021 at 4:58 AM
> Magnus
>     > > > > > Edenhill <
>     > > > > > > > > > > > > > > > magnus@edenhill.se>
>     > > > > > > > > > > > > > > > > > > >> wrote:
>     > > > > > > > > > > > > > > > > > > >>
>     > > > > > > > > > > > > > > > > > > >> > On Tue, Sep 21, 2021 at 06:58,
>     > > > Colin
>     > > > > > > > McCabe <
>     > > > > > > > > > > > > > > > > > cmccabe@apache.org
>     > > > > > > > > > > > > > > > > > > > wrote:
>     > > > > > > > > > > > > > > > > > > >> >
>     > > > > > > > > > > > > > > > > > > >> > > On Mon, Sep 20, 2021, at
> 17:35,
>     > Feng
>     > > > Min
>     > > > > > > > wrote:
>     > > > > > > > > > > > > > > > > > > >> > > > Thanks Magnus & Colin for
> the
>     > > > > > discussion.
>     > > > > > > > > > > > > > > > > > > >> > > >
>     > > > > > > > > > > > > > > > > > > >> > > > Based on KIP-714's
> stateless
>     > > design,
>     > > > > > Client
>     > > > > > > > > can
>     > > > > > > > > > > > pretty
>     > > > > > > > > > > > > > > much
>     > > > > > > > > > > > > > > > use
>     > > > > > > > > > > > > > > > > > > any
>     > > > > > > > > > > > > > > > > > > >> > > > connection to any broker
> to send
>     > > > > > metrics. We
>     > > > > > > > > are
>     > > > > > > > > > > not
>     > > > > > > > > > > > > > > > associating
>     > > > > > > > > > > > > > > > > > > >> > > connection
>     > > > > > > > > > > > > > > > > > > >> > > > with client metric state.
> Is my
>     > > > > > > > understanding
>     > > > > > > > > > > > correct?
>     > > > > > > > > > > > > > If
>     > > > > > > > > > > > > > > > yes,
>     > > > > > > > > > > > > > > > > > > how
>     > > > > > > > > > > > > > > > > > > >> > about
>     > > > > > > > > > > > > > > > > > > >> > > > the following two
> scenarios
>     > > > > > > > > > > > > > > > > > > >> > > >
>     > > > > > > > > > > > > > > > > > > >> > > > 1) One Client (Client-ID)
>     > > registers
>     > > > > two
>     > > > > > > > > > different
>     > > > > > > > > > > > client
>     > > > > > > > > > > > > > > > > > instance
>     > > > > > > > > > > > > > > > > > > id
>     > > > > > > > > > > > > > > > > > > >> > via
>     > > > > > > > > > > > > > > > > > > >> > > > separate registration. Is
> it
>     > > > > permitted?
>     > > > > > If
>     > > > > > > > OK,
>     > > > > > > > > > how
>     > > > > > > > > > > > to
>     > > > > > > > > > > > > > > > > > distinguish
>     > > > > > > > > > > > > > > > > > > >> them
>     > > > > > > > > > > > > > > > > > > >> > > from
>     > > > > > > > > > > > > > > > > > > >> > > > the case 2 below.
>     > > > > > > > > > > > > > > > > > > >> > > >
>     > > > > > > > > > > > > > > > > > > >> > >
>     > > > > > > > > > > > > > > > > > > >> > > Hi Feng,
>     > > > > > > > > > > > > > > > > > > >> > >
>     > > > > > > > > > > > > > > > > > > >> > > My understanding, which
> Magnus can
>     > > > > > clarify I
>     > > > > > > > > > guess,
>     > > > > > > > > > > is
>     > > > > > > > > > > > > > that
>     > > > > > > > > > > > > > > > you
>     > > > > > > > > > > > > > > > > > > could
>     > > > > > > > > > > > > > > > > > > >> > have
>     > > > > > > > > > > > > > > > > > > >> > > something like two Producer
>     > > instances
>     > > > > > running
>     > > > > > > > > with
>     > > > > > > > > > > the
>     > > > > > > > > > > > > > same
>     > > > > > > > > > > > > > > > > > > client.id
>     > > > > > > > > > > > > > > > > > > >> > > (perhaps because they're
> using the
>     > > > same
>     > > > > > config
>     > > > > > > > > > file,
>     > > > > > > > > > > > for
>     > > > > > > > > > > > > > > > example).
>     > > > > > > > > > > > > > > > > > > >> They
>     > > > > > > > > > > > > > > > > > > >> > > could even be in the same
> process.
>     > > But
>     > > > > > they
>     > > > > > > > > would
>     > > > > > > > > > > get
>     > > > > > > > > > > > > > > separate
>     > > > > > > > > > > > > > > > > > > UUIDs.
>     > > > > > > > > > > > > > > > > > > >> > >
>     > > > > > > > > > > > > > > > > > > >> > > I believe Magnus used the
> term
>     > > client
>     > > > to
>     > > > > > mean
>     > > > > > > > > > > > "Producer or
>     > > > > > > > > > > > > > > > > > > Consumer".
>     > > > > > > > > > > > > > > > > > > >> So
>     > > > > > > > > > > > > > > > > > > >> > > if you have both a Producer
> and a
>     > > > > > Consumer in
>     > > > > > > > > your
>     > > > > > > > > > > > > > > > application I
>     > > > > > > > > > > > > > > > > > > would
>     > > > > > > > > > > > > > > > > > > >> > > expect you'd get separate
> UUIDs
>     > for
>     > > > > both.
>     > > > > > > > Again
>     > > > > > > > > > > > Magnus can
>     > > > > > > > > > > > > > > > chime
>     > > > > > > > > > > > > > > > > > in
>     > > > > > > > > > > > > > > > > > > >> > here, I
>     > > > > > > > > > > > > > > > > > > >> > > guess.
>     > > > > > > > > > > > > > > > > > > >> > >
>     > > > > > > > > > > > > > > > > > > >> >
>     > > > > > > > > > > > > > > > > > > >> > That's correct.
>     > > > > > > > > > > > > > > > > > > >> >
>     > > > > > > > > > > > > > > > > > > >> >
>     > > > > > > > > > > > > > > > > > > >> > >
>     > > > > > > > > > > > > > > > > > > >> > > > 2) How about the client
>     > > restarting?
>     > > > > > What's
>     > > > > > > > the
>     > > > > > > > > > > > > > > expectation?
>     > > > > > > > > > > > > > > > > > Should
>     > > > > > > > > > > > > > > > > > > >> the
>     > > > > > > > > > > > > > > > > > > >> > > > server expect the client
> to
>     > carry
>     > > a
>     > > > > > > > persisted
>     > > > > > > > > > > client
>     > > > > > > > > > > > > > > > instance id
>     > > > > > > > > > > > > > > > > > > or
>     > > > > > > > > > > > > > > > > > > >> > > should
>     > > > > > > > > > > > > > > > > > > >> > > > the client be treated as
> a new
>     > > > > instance?
>     > > > > > > > > > > > > > > > > > > >> > >
>     > > > > > > > > > > > > > > > > > > >> > > The KIP doesn't describe any
>     > > mechanism
>     > > > > for
>     > > > > > > > > > > > persistence,
>     > > > > > > > > > > > > > so I
>     > > > > > > > > > > > > > > > would
>     > > > > > > > > > > > > > > > > > > >> assume
>     > > > > > > > > > > > > > > > > > > >> > > that when you restart the
> client
>     > you
>     > > > get
>     > > > > > a new
>     > > > > > > > > > > UUID. I
>     > > > > > > > > > > > > > agree
>     > > > > > > > > > > > > > > > that
>     > > > > > > > > > > > > > > > > > it
>     > > > > > > > > > > > > > > > > > > >> > would
>     > > > > > > > > > > > > > > > > > > >> > > be good to spell this out.
>     > > > > > > > > > > > > > > > > > > >> > >
>     > > > > > > > > > > > > > > > > > > >> > >
>     > > > > > > > > > > > > > > > > > > >> > Right, it will not be
> persisted
>     > since
>     > > a
>     > > > > > client
>     > > > > > > > > > > instance
>     > > > > > > > > > > > > > can't
>     > > > > > > > > > > > > > > be
>     > > > > > > > > > > > > > > > > > > >> restarted.
>     > > > > > > > > > > > > > > > > > > >> >
>     > > > > > > > > > > > > > > > > > > >> > Will update the KIP to make
> this
>     > > > clearer.
>     > > > > > > > > > > > > > > > > > > >> >
>     > > > > > > > > > > > > > > > > > > >> > /Magnus
>     > > > > > > > > > > > > > > > > > > >> >
>     > > > > > > > > > > > > > > > > > > >>
>     > > > > > > > > > > > > > > > > > > >>
>     > > > > > > > > > > > > > > > > > > >> --
>     > > > > > > > > > > > > > > > > > > >> Gwen Shapira
>     > > > > > > > > > > > > > > > > > > >> Engineering Manager | Confluent
>     > > > > > > > > > > > > > > > > > > >> 650.450.2760 | @gwenshap
>     > > > > > > > > > > > > > > > > > > >> Follow us: Twitter | blog
>     > > > > > > > > > > > > > > > > > > >>
>     > > > > > > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > > >
>     > > > > > > > > > > > > > >
>     > > > > > > > > > > > > >
>     > > > > > > > > > > >
>     > > > > > > > > > >
>     > > > > > > > > >
>     > > > > > > > >
>     > > > > > > >
>     > > > > > >
>     > > > > >
>     > > > >
>     > > >
>     > >
>     >
>

Re: [DISCUSS] KIP-714: Client metrics and observability

Posted by Sarat Kakarla <sk...@confluent.io.INVALID>.
Jun,
 
  >>  28. For the broker metrics, could you spell out the full metric name
  >>   including groups, tags, etc? We typically don't add the broker_id label for
  >>   broker metrics. Also, brokers use Yammer metrics, which doesn't have type
  >>   Sum.

Sure, I will update KIP-714 with the above information and remove the broker_id label from the metrics.

Regarding the type: is CumulativeSum the right type to use in place of Sum?

Thanks
Sarat


On 3/8/22, 5:48 PM, "Jun Rao" <ju...@confluent.io.INVALID> wrote:

    Hi, Magnus, Sarat and Xavier,

    Thanks for the reply. A few more comments below.

    20. It seems that we are piggybacking the plugin on the
    existing MetricsReporter. So, this seems fine.

    21. That could work. Are we requiring any additional jar dependency on the
    client? Or, are you suggesting that we check the runtime dependency to pick
    the compression codec?

    28. For the broker metrics, could you spell out the full metric name
    including groups, tags, etc? We typically don't add the broker_id label for
    broker metrics. Also, brokers use Yammer metrics, which doesn't have type
    Sum.

    29. There are several client metrics listed as histogram. However, the java
    client currently doesn't support histogram type.

    30. Could you show an example of the metric payload in PushTelemetryRequest
    to help understand how we organize metrics at different levels (per
    instance, per topic, per partition, per broker, etc)?

    31. Could you add a bit more detail on which client thread sends the
    PushTelemetryRequest?

    Thanks,

    Jun

    On Mon, Mar 7, 2022 at 11:48 AM Magnus Edenhill <ma...@edenhill.se> wrote:

    > Hi Jun,
    >
    > Thanks for your well-informed questions; see my answers below.
    > There have been a number of clarifications to the KIP.
    >
    >
    >
    > On Thu, Jan 27, 2022 at 20:08, Jun Rao <ju...@confluent.io.invalid> wrote:
    >
    > > Hi, Magnus,
    > >
    > > Thanks for updating the KIP. The overall approach makes sense to me. A
    > few
    > > more detailed comments below.
    > >
    > > 20. ClientTelemetry: Should it be extending configurable and closable?
    > >
    >
    > I'll pass this question to Sarat and/or Xavier.
    >
    >
    >
    > > 21. Compression of the metrics on the client: what's the default?
    > >
    >
    > How about we specify a prioritized list: zstd, lz4, snappy, gzip?
    > But ultimately it is up to what the client supports.
    >
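
A prioritized codec list like the one above can be read as a first-match selection. The sketch below is only an illustration of that idea: `pick_codec`, `PREFERRED_CODECS`, and the convention of returning `None` when nothing matches are hypothetical and not taken from the KIP or from any client API.

```python
from typing import Iterable, Optional

# Preference order suggested in the thread, most preferred first.
PREFERRED_CODECS = ("zstd", "lz4", "snappy", "gzip")


def pick_codec(supported: Iterable[str],
               preferred: Iterable[str] = PREFERRED_CODECS) -> Optional[str]:
    """Return the first preferred codec that the client supports.

    Returns None when no preferred codec is available; what a client does
    in that case (for example, sending uncompressed) is an assumption here.
    """
    available = {name.lower() for name in supported}
    for codec in preferred:
        if codec in available:
            return codec
    return None


if __name__ == "__main__":
    # A client built with only gzip and lz4 support picks lz4.
    print(pick_codec(["gzip", "lz4"]))
```

The point of the ordering is that the outcome depends only on what the client was built with, which matches the remark that it is ultimately up to what the client supports.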
    >
    > 23. A client instance is considered a metric resource and the
    > > resource-level (thus client instance level) labels could include:
    > >     client_software_name=confluent-kafka-python
    > >     client_software_version=v2.1.3
    > >     client_instance_id=B64CD139-3975-440A-91D4
    > >     transactional_id=someTxnApp
    > > Are those labels added in PushTelemetryRequest? If so, are they per
    > metric
    > > or per request?
    > >
    >
    >
    > client_software* and client_instance_id are not added by the client, but
    > are available to
    > the broker-side metrics plugin to add as it sees fit; removed them from
    > the KIP.
    >
    > transactional_id, group_id, etc., which I believe will be useful in
    > troubleshooting,
    > are included only once (per push) as resource-level attributes (the client
    > instance is a singular resource).
    >
    >
    > >
    > > 24.  "the broker will only send
    > > GetTelemetrySubscriptionsResponse.DeltaTemporality=True" :
    > > 24.1 If it's always true, does it need to be part of the protocol?
    > >
    >
    > We're anticipating that it will take a lot longer to upgrade the majority
    > of clients than the
    > broker/plugin side, which is why we want the client to support both
    > temporalities out-of-the-box
    > so that cumulative reporting can be turned on seamlessly in the future.
    >
    >
    >
    > > 24.2 Does delta only apply to Counter type?
    > >
    >
    >
    > And Histograms. More details in Xavier's OTLP link.
    >
    >
    >
    > > 24.3 In the delta representation, the first request needs to send the
    > full
    > > value, how does the broker plugin know whether a value is full or delta?
    > >
    >
    > The client may (should) send the start time for each metric sample,
    > indicating when
    > the metric began to be collected.
    > We've discussed whether this should be the client instance start time or
    > the time when a matching
    > metric subscription for that metric is received.
    > For completeness we recommend using the former, the client instance start
    > time.
    >
    >
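
To make the delta representation concrete, here is a minimal sketch of a client-side counter that reports deltas and stamps each sample with the client instance start time, as recommended in the reply above. The names `DeltaCounter` and `DeltaSample` are hypothetical, and the field layout is an assumption for illustration rather than the OTLP wire format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeltaSample:
    start_time_ms: int  # client instance start time (the option recommended above)
    time_ms: int        # when this sample was taken
    value: int          # increase since the previous sample


class DeltaCounter:
    """Turns a monotonically increasing counter into delta samples."""

    def __init__(self, instance_start_ms: int) -> None:
        self._start_ms = instance_start_ms
        self._last_total = 0

    def sample(self, total: int, now_ms: int) -> DeltaSample:
        # The first call reports the full value, since the previous total is 0.
        delta = total - self._last_total
        self._last_total = total
        return DeltaSample(self._start_ms, now_ms, delta)


if __name__ == "__main__":
    counter = DeltaCounter(instance_start_ms=1_000)
    first = counter.sample(total=40, now_ms=61_000)
    second = counter.sample(total=55, now_ms=121_000)
    print(first.value, second.value)
```

Under these assumptions the first sample carries the full value accumulated since start, and summing all deltas recovers the cumulative total.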
    >
    > > 25. quota:
    > > 25.1 Since we are fitting PushTelemetryRequest into the existing request
    > > quota, it would be useful to document the impact, i.e. client metric
    > > throttling causes the data from the same client to be delayed.
    > > 25.2 Is PushTelemetryRequest subject to the write bandwidth quota like
    > the
    > > producer?
    > >
    >
    >
    > Yes, it should be, so as to protect the cluster from rogue clients.
    > But, in practice the size of metrics will be quite low (e.g., 1-10kb per
    > 60s interval), so I don't think this will pose a problem.
    > The KIP has been updated with more details on quota/throttling behaviour,
    > see the
    > "Throttling and rate-limiting" section.
    >
    >
    > 25.3 THROTTLING_QUOTA_EXCEEDED: Currently, we don't send this error when
    > > the request/bandwidth quota is exceeded since those requests are not
    > > rejected. We only set this error when the request is rejected (e.g.,
    > topic
    > > creation). It would be useful to clarify when this error is used.
    > >
    >
    > Right, I was trying to reuse an existing error-code. We can introduce
    > a new one for the case where a client pushes metrics at a higher frequency
    > than the configured push interval (e.g., out-of-profile sends).
    > This causes the broker to drop those metrics and send this error code back
    > to the client. There will be no connection throttling / channel-muting in
    > this
    > case (unless the standard quotas are exceeded).
    >
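
The out-of-profile check described above can be sketched as a per-client-instance timestamp comparison. This is a rough illustration under stated assumptions: `PushIntervalGate` is a hypothetical name, the thread does not specify how the broker tracks the last push, and the error code returned on rejection is left out because it had not been settled.

```python
from typing import Dict


class PushIntervalGate:
    """Accepts a push only if the configured interval has elapsed since the last accepted one."""

    def __init__(self, push_interval_ms: int) -> None:
        self._interval_ms = push_interval_ms
        self._last_accepted_ms: Dict[str, int] = {}

    def accept(self, client_instance_id: str, now_ms: int) -> bool:
        last = self._last_accepted_ms.get(client_instance_id)
        if last is not None and now_ms - last < self._interval_ms:
            # Out-of-profile send: the metrics would be dropped and an error returned.
            return False
        self._last_accepted_ms[client_instance_id] = now_ms
        return True


if __name__ == "__main__":
    gate = PushIntervalGate(push_interval_ms=60_000)
    print(gate.accept("b69cc35a", now_ms=0))       # first push is accepted
    print(gate.accept("b69cc35a", now_ms=30_000))  # too early, would be dropped
```

Note that a rejected push does not reset the timer in this sketch, so a client that retries too eagerly is not penalised beyond having those pushes dropped; whether the broker should behave that way is a design choice the thread leaves open.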
    >
    > > 27. kafka-client-metrics.sh: Could we add an example on how to disable a
    > > bad client?
    > >
    >
    > There's now a --block option to kafka-client-metrics.sh which overrides all
    > subscriptions
    > for the matched client(s). This allows silencing metrics for one or more
    > clients without having
    > to remove existing subscriptions. From the client's perspective it will
    > look like it no longer has
    > any subscriptions.
    >
    > # Block metrics collection for a specific client instance.
    > # A descriptive name makes it easier to clean up old subscriptions.
    > # The --match selector targets this specific client instance.
    > $ kafka-client-metrics.sh --bootstrap-server $BROKERS \
    >    --add \
    >    --name 'Disable_b69cc35a' \
    >    --match client_instance_id=b69cc35a-7a54-4790-aa69-cc2bd4ee4538 \
    >    --block
    >
    >
    >
    >
    > > 28. New broker side metrics: Could we spell out the details of the
    > metrics
    > > (e.g., group, tags, etc)?
    > >
    >
    > KIP has been updated accordingly (thanks Sarat).
    >
    >
    >
    > >
    > > 29. Client instance-level metrics: client.io.wait.time is a gauge not a
    > > histogram.
    > >
    >
    > I believe a population/distribution should preferably be represented as a
    > histogram, space permitting,
    > and only secondarily as a Gauge average.
    > While we might not want to maintain a bunch of histograms for each
    > partition, since that could be
    > quite space consuming, this client.io.wait.time is a single metric per
    > client instance and can
    > thus afford a Histogram representation.
    >
    >
    >
    > Thanks,
    > Magnus
    >
    >
    >
    > > Thanks,
    > >
    > > Jun
    > >
    > > On Wed, Jan 26, 2022 at 6:32 AM Magnus Edenhill <ma...@edenhill.se>
    > > wrote:
    > >
    > > > Hi all,
    > > >
    > > > I've updated the KIP with responses to the latest comments: Java client
    > > > dependencies (Thanks Kirk!), alternate designs (separate cluster,
    > > separate
    > > > producer, etc), etc.
    > > >
    > > > I will revive the vote thread.
    > > >
    > > > Thanks,
    > > > Magnus
    > > >
    > > >
    > > > On Mon, Dec 13, 2021 at 22:32, Ryanne Dolan <
    > ryannedolan@gmail.com
    > > > > wrote:
    > > >
    > > > > I think we should be very careful about introducing new runtime
    > > > > dependencies into the clients. Historically this has been rare and
    > > > > essentially necessary (e.g. compression libs).
    > > > >
    > > > > Ryanne
    > > > >
    > > > > On Mon, Dec 13, 2021, 1:06 PM Kirk True <ki...@mustardgrain.com>
    > wrote:
    > > > >
    > > > > > Hi Jun,
    > > > > >
    > > > > > On Thu, Dec 9, 2021, at 2:28 PM, Jun Rao wrote:
    > > > > > > 13. Using OpenTelemetry. Does that require runtime dependency
    > > > > > > on OpenTelemetry library? How good is the compatibility story
    > > > > > > of OpenTelemetry? This is important since an application could
    > have
    > > > > other
    > > > > > > OpenTelemetry dependencies than the Kafka client.
    > > > > >
    > > > > > The current design is that the OpenTelemetry JARs would ship with
    > the
    > > > > > client. Perhaps we can design the client such that the JARs aren't
    > > even
    > > > > > loaded if the user has opted out. The user could even exclude the
    > > JARs
    > > > > from
    > > > > > their dependencies if they so wished.
    > > > > >
    > > > > > I can't speak to the compatibility of the libraries. Is it possible
    > > > that
    > > > > > we include a shaded version?
    > > > > >
    > > > > > Thanks,
    > > > > > Kirk
    > > > > >
    > > > > > >
    > > > > > > 14. The proposal listed idempotence=true. This is more of a
    > > > > configuration
    > > > > > > than a metric. Are we including that as a metric? What other
    > > > > > configurations
    > > > > > > are we including? Should we separate the configurations from the
    > > > > metrics?
    > > > > > >
    > > > > > > Thanks,
    > > > > > >
    > > > > > > Jun
    > > > > > >
    > > > > > > On Mon, Nov 29, 2021 at 7:34 AM Magnus Edenhill <
    > > magnus@edenhill.se>
    > > > > > wrote:
    > > > > > >
    > > > > > > > Hey Bob,
    > > > > > > >
    > > > > > > > That's a good point.
    > > > > > > >
    > > > > > > > Request type labels were considered but since they're already
    > > > tracked
    > > > > > by
    > > > > > > > broker-side metrics
    > > > > > > > they were left out so as to avoid metric duplication. However,
    > those
    > > > > > metrics
    > > > > > > > are not per connection,
    > > > > > > > so they won't be that useful in practice for troubleshooting
    > > > specific
    > > > > > > > client instances.
    > > > > > > >
    > > > > > > > I'll add the request_type label to the relevant metrics.
    > > > > > > >
    > > > > > > > Thanks,
    > > > > > > > Magnus
    > > > > > > >
    > > > > > > >
    > > > > > > > On Tue, Nov 23, 2021 at 19:20, Bob Barrett
    > > > > > > > <bo...@confluent.io.invalid> wrote:
    > > > > > > >
    > > > > > > > > Hi Magnus,
    > > > > > > > >
    > > > > > > > > Thanks for the thorough KIP, this seems very useful.
    > > > > > > > >
    > > > > > > > > Would it make sense to include the request type as a label
    > for
    > > > the
    > > > > > > > > `client.request.success`, `client.request.errors` and
    > > > > > > > `client.request.rtt`
    > > > > > > > > metrics? I think it would be very useful to see which
    > specific
    > > > > > requests
    > > > > > > > are
    > > > > > > > > succeeding and failing for a client. One specific case I can
    > > > think
    > > > > of
    > > > > > > > where
    > > > > > > > > this could be useful is producer batch timeouts. If a Java
    > > > > > application
    > > > > > > > does
    > > > > > > > > not enable producer client logs (unfortunately, in my
    > > experience
    > > > > this
    > > > > > > > > happens more often than it should), the application logs will
    > > > only
    > > > > > > > contain
    > > > > > > > > the expiration error message, but no information about what
    > is
    > > > > > causing
    > > > > > > > the
    > > > > > > > > timeout. The requests might all be succeeding but taking too
    > > long
    > > > > to
    > > > > > > > > process batches, or metadata requests might be failing, or
    > some
    > > > or
    > > > > > all
    > > > > > > > > produce requests might be failing (if the bootstrap servers
    > are
    > > > > > reachable
    > > > > > > > > from the client but one or more other brokers are not, for
    > > > > example).
    > > > > > If
    > > > > > > > the
    > > > > > > > > cluster operator is able to identify the specific requests
    > that
    > > > are
    > > > > > slow
    > > > > > > > or
    > > > > > > > > failing for a client, they will be better able to diagnose
    > the
    > > > > issue
    > > > > > > > > causing batch timeouts.
    > > > > > > > >
    > > > > > > > > One drawback I can think of is that this will increase the
    > > > > > cardinality of
    > > > > > > > > the request metrics. But any given client is only going to
    > use
    > > a
    > > > > > small
    > > > > > > > > subset of the request types, and since we already have
    > > partition
    > > > > > labels
    > > > > > > > for
    > > > > > > > > the topic-level metrics, I think request labels will still
    > make
    > > > up
    > > > > a
    > > > > > > > > relatively small percentage of the set of metrics.
    > > > > > > > >
    > > > > > > > > Thanks,
    > > > > > > > > Bob
    > > > > > > > >
    > > > > > > > > On Mon, Nov 22, 2021 at 2:08 AM Viktor Somogyi-Vass <
    > > > > > > > > viktorsomogyi@gmail.com>
    > > > > > > > > wrote:
    > > > > > > > >
    > > > > > > > > > Hi Magnus,
    > > > > > > > > >
    > > > > > > > > > I think this is a very useful addition. We also have a
    > > similar
    > > > > (but
    > > > > > > > much
    > > > > > > > > > more simplistic) implementation of this. Maybe I missed it
    > in
    > > > the
    > > > > > KIP
    > > > > > > > but
    > > > > > > > > > what about adding metrics about the subscription cache
    > > itself?
    > > > > > That I
    > > > > > > > > think
    > > > > > > > > > would improve its usability and debuggability as we'd be
    > able
    > > > to
    > > > > > see
    > > > > > > > its
    > > > > > > > > > performance, hit/miss rates, eviction counts and others.
    > > > > > > > > >
    > > > > > > > > > Best,
    > > > > > > > > > Viktor
    > > > > > > > > >
    > > > > > > > > > On Thu, Nov 18, 2021 at 5:12 PM Magnus Edenhill <
    > > > > > magnus@edenhill.se>
    > > > > > > > > > wrote:
    > > > > > > > > >
    > > > > > > > > > > Hi Mickael,
    > > > > > > > > > >
    > > > > > > > > > > see inline.
    > > > > > > > > > >
    > > > > > > > > > > On Wed, Nov 10, 2021 at 3:21 PM Mickael Maison <
    > > > > > > > > > > mickael.maison@gmail.com
    > > > > > > > > > > > wrote:
    > > > > > > > > > >
    > > > > > > > > > > > Hi Magnus,
    > > > > > > > > > > >
    > > > > > > > > > > > I see you've addressed some of the points I raised
    > above
    > > > but
    > > > > > some
    > > > > > > > (4,
    > > > > > > > > > > > 5) have not been addressed yet.
    > > > > > > > > > > >
    > > > > > > > > > >
    > > > > > > > > > > Re 4) How will the user/app know metrics are being sent?
    > > > > > > > > > >
    > > > > > > > > > > One possibility is to add a JMX metric (thus for user
    > > > > > consumption)
    > > > > > > > for
    > > > > > > > > > the
    > > > > > > > > > > number of metric pushes the
    > > > > > > > > > > client has performed, or perhaps the number of metrics
    > > > > > subscriptions
    > > > > > > > > > > currently being collected.
    > > > > > > > > > > Would that be sufficient?
    > > > > > > > > > >
    > > > > > > > > > > Re 5) Metric sizes and rates
    > > > > > > > > > >
    > > > > > > > > > > A worst case scenario for a producer that is producing to
    > > 50
    > > > > > unique
    > > > > > > > > > topics
    > > > > > > > > > > and emitting all standard metrics yields
    > > > > > > > > > > a serialized size of around 100KB prior to compression,
    > > which
    > > > > > > > > compresses
    > > > > > > > > > > down to about 20-30% of that depending
    > > > > > > > > > > on compression type and topic name uniqueness.
    > > > > > > > > > > The numbers for a consumer would be similar.
    > > > > > > > > > >
    > > > > > > > > > > In practice the number of unique topics would be far
    > less,
    > > > and
    > > > > > the
    > > > > > > > > > > subscription set would typically be for a subset of
    > > metrics.
    > > > > > > > > > > So we're probably closer to 1KB, or less, compressed size
    > > per
    > > > > > client
    > > > > > > > > per
    > > > > > > > > > > push interval.
    > > > > > > > > > >
    > > > > > > > > > > As both the subscription set and push intervals are
    > > > controlled
    > > > > > by the
    > > > > > > > > > > cluster operator it shouldn't be too hard
    > > > > > > > > > > to strike a good balance between metrics overhead and
    > > > > > granularity.
    > > > > > > > > > >
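
As a rough illustration of the sizes quoted above, the arithmetic can be sketched in Python. The 100KB worst case, the 20-30% compression range and the roughly 1KB typical size come from the message; the fleet size and push interval in the usage lines are made-up assumptions, and both function names are hypothetical.

```python
def compressed_size_range(raw_bytes, low_pct=20, high_pct=30):
    """Return (min, max) compressed size in bytes for the quoted 20-30% range."""
    return (raw_bytes * low_pct // 100, raw_bytes * high_pct // 100)


def aggregate_push_rate(num_clients, bytes_per_push, push_interval_s):
    """Average bytes per second arriving at the cluster from all clients."""
    return num_clients * bytes_per_push / push_interval_s


if __name__ == "__main__":
    # Worst case from the message: ~100KB serialized before compression.
    print(compressed_size_range(100_000))
    # Hypothetical fleet: 1000 clients, ~1KB per push, one push every 60 s.
    print(aggregate_push_rate(1000, 1024, 60))
```

Since the operator controls both the subscription set and the push interval, both inputs to the second function are tunable on the cluster side.
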
    > > > > > > > > > >
    > > > > > > > > > >
    > > > > > > > > > > >
    > > > > > > > > > > > I'm really uneasy with this being enabled by default on
    > > the
    > > > > > client
    > > > > > > > > > > > side. When collecting data, I think the best practice
    > is
    > > to
    > > > > > ensure
    > > > > > > > > > > > users are explicitly enabling it.
    > > > > > > > > > > >
    > > > > > > > > > >
    > > > > > > > > > > Requiring metrics to be explicitly enabled on clients
    > > > severely
    > > > > > > > cripples
    > > > > > > > > > its
    > > > > > > > > > > usability and value.
    > > > > > > > > > >
    > > > > > > > > > > One of the problems that this KIP aims to solve is for
    > > useful
    > > > > > metrics
    > > > > > > > > to
    > > > > > > > > > be
    > > > > > > > > > > available on demand
    > > > > > > > > > > regardless of the technical expertise of the user. As
    > > Ryanne
    > > > > > points
    > > > > > > > > out,
    > > > > > > > > > a
    > > > > > > > > > > savvy user/organization
    > > > > > > > > > > will typically have metrics collection and monitoring in
    > > > place
    > > > > > > > already,
    > > > > > > > > > and
    > > > > > > > > > > the benefits of this KIP
    > > > > > > > > > > are then more of a common set and format of metrics across
    > > > client
    > > > > > > > > > > implementations and languages.
    > > > > > > > > > > But that is not the typical Kafka user in my experience,
    > > > > they're
    > > > > > not
    > > > > > > > > > Kafka
    > > > > > > > > > > experts and they don't have the
    > > > > > > > > > > knowledge of how to best instrument their clients.
    > > > > > > > > > > Having metrics enabled by default for this user base
    > allows
    > > > the
    > > > > > Kafka
    > > > > > > > > > > operators to proactively and reactively
    > > > > > > > > > > monitor and troubleshoot client issues, without the need
    > > for
    > > > > the
    > > > > > less
    > > > > > > > > > savvy
    > > > > > > > > > > user to do anything.
    > > > > > > > > > > It is often too late to tell a user to enable metrics
    > when
    > > > the
    > > > > > > > problem
    > > > > > > > > > has
    > > > > > > > > > > already occurred.
    > > > > > > > > > >
    > > > > > > > > > > Now, to be clear, even though metrics are enabled by
    > > default
    > > > on
    > > > > > > > clients
    > > > > > > > > > they
    > > > > > > > > > > are not enabled by default
    > > > > > > > > > > on the brokers; the Kafka operator needs to build and set
    > > up
    > > > a
    > > > > > > > metrics
    > > > > > > > > > > plugin and add metrics subscriptions
    > > > > > > > > > > before anything is sent from the client.
    > > > > > > > > > > It is opt-out on the clients and opt-in on the broker.
    > > > > > > > > > >
    > > > > > > > > > >
    > > > > > > > > > >
    > > > > > > > > > >
    > > > > > > > > > > > You mentioned brokers already have
    > > > > > > > > > > > some (most?) of the information contained in metrics, if
    > > so
    > > > > > then why
    > > > > > > > > > > > are we collecting it again? Surely there must be some
    > new
    > > > > > > > information
    > > > > > > > > > > > in the client metrics.
    > > > > > > > > > > >
    > > > > > > > > > >
    > > > > > > > > > > From the user's perspective the Kafka infrastructure
    > > extends
    > > > > from
    > > > > > > > > > > producer.send() to
    > > > > > > > > > > messages being returned from consumer.poll(), a giant
    > black
    > > > box
    > > > > > where
    > > > > > > > > > > there's a lot going on between those
    > > > > > > > > > > two points. The brokers currently only see what happens
    > > once
    > > > > > those
    > > > > > > > > > requests
    > > > > > > > > > > and messages hit the broker,
    > > > > > > > > > > but as Kafka clients are complex pieces of machinery
    > > there's
    > > > a
    > > > > > myriad
    > > > > > > > > of
    > > > > > > > > > > queues, timers, and state
    > > > > > > > > > > that's critical to the operation and infrastructure
    > that's
    > > > not
    > > > > > > > > currently
    > > > > > > > > > > visible to the operator.
    > > > > > > > > > > Relying on the user to accurately and promptly provide this
    > > > > missing
    > > > > > > > > > > information is not generally feasible.
    > > > > > > > > > >
    > > > > > > > > > >
    > > > > > > > > > > Most of the standard metrics listed in the KIP are data
    > > > points
    > > > > > that
    > > > > > > > the
    > > > > > > > > > > broker does not have.
    > > > > > > > > > > Only a small number of metrics are duplicates (like the
    > > > request
    > > > > > > > counts
    > > > > > > > > > and
    > > > > > > > > > > sizes), but they are included
    > > > > > > > > > > to ease correlation when inspecting these client metrics.
    > > > > > > > > > >
    > > > > > > > > > >
    > > > > > > > > > >
    > > > > > > > > > > > Moreover this is a brand new feature so it's even
    > harder
    > > to
    > > > > > justify
    > > > > > > > > > > > enabling it and forcing it onto all our users. If disabled
    > > by
    > > > > > default,
    > > > > > > > > > > > it's relatively easy to enable in a new release if we
    > > > decide
    > > > > > to,
    > > > > > > > but
    > > > > > > > > > > > once enabled by default it's much harder to disable.
    > Also
    > > > > this
    > > > > > > > > feature
    > > > > > > > > > > > will apply to all future metrics we will add.
    > > > > > > > > > > >
    > > > > > > > > > >
    > > > > > > > > > > I think maturity of a feature implementation should be
    > the
    > > > > > deciding
    > > > > > > > > > factor,
    > > > > > > > > > > rather than
    > > > > > > > > > > the design of it (which this KIP is). I.e., if the
    > > > > > implementation is
    > > > > > > > > not
    > > > > > > > > > > deemed mature enough
    > > > > > > > > > > for release X.Y it will be disabled.
    > > > > > > > > > >
    > > > > > > > > > >
    > > > > > > > > > >
    > > > > > > > > > > > Overall I think it's an interesting feature but I'd
    > > prefer
    > > > to
    > > > > > be
    > > > > > > > > > > > slightly defensive and see how it works in practice
    > > before
    > > > > > enabling
    > > > > > > > > it
    > > > > > > > > > > > everywhere.
    > > > > > > > > > > >
    > > > > > > > > > >
    > > > > > > > > > > Right, and I agree on being defensive, but since this
    > > feature
    > > > > > still
    > > > > > > > > > > requires manual
    > > > > > > > > > > enabling on the brokers before actually being used, I
    > think
    > > > > that
    > > > > > > > gives
    > > > > > > > > > > enough control
    > > > > > > > > > > to opt in or out of this feature as needed.
    > > > > > > > > > >
    > > > > > > > > > > Thanks for your comments!
    > > > > > > > > > >
    > > > > > > > > > > Regards,
    > > > > > > > > > > Magnus
    > > > > > > > > > >
    > > > > > > > > > >
    > > > > > > > > > >
    > > > > > > > > > > > Thanks,
    > > > > > > > > > > > Mickael
    > > > > > > > > > > >
    > > > > > > > > > > > On Mon, Nov 8, 2021 at 7:55 AM Magnus Edenhill <
    > > > > > magnus@edenhill.se
    > > > > > > > >
    > > > > > > > > > > wrote:
    > > > > > > > > > > > >
    > > > > > > > > > > > > Thanks David for pointing this out,
    > > > > > > > > > > > > I've updated the KIP to include client_id as a
    > matching
    > > > > > selector.
    > > > > > > > > > > > >
    > > > > > > > > > > > > Regards,
    > > > > > > > > > > > > Magnus
    > > > > > > > > > > > >
    > > > > > > > > > > > > On Thu, Nov 4, 2021 at 6:01 PM David Mao
    > > > > > > > > > > <dmao@confluent.io.invalid
    > > > > > > > > > > > > wrote:
    > > > > > > > > > > > >
    > > > > > > > > > > > > > Hey Magnus,
    > > > > > > > > > > > > >
    > > > > > > > > > > > > > I noticed that the KIP outlines the initial
    > selectors
    > > > > > supported
    > > > > > > > > as:
    > > > > > > > > > > > > >
    > > > > > > > > > > > > >    - client_instance_id - CLIENT_INSTANCE_ID UUID
    > > > string
    > > > > > > > > > > > representation.
    > > > > > > > > > > > > >    - client_software_name  - client software
    > > > > implementation
    > > > > > > > name.
    > > > > > > > > > > > > >    - client_software_version  - client software
    > > > > > implementation
    > > > > > > > > > > version.
    > > > > > > > > > > > > >
    > > > > > > > > > > > > > In the given reactive monitoring workflow, we
    > mention
    > > > > that
    > > > > > the
    > > > > > > > > > > > application
    > > > > > > > > > > > > > user does not know their client's client instance
    > ID,
    > > > but
    > > > > > it's
    > > > > > > > > > > outlined
    > > > > > > > > > > > > > that the operator can add a metrics subscription
    > > > > selecting
    > > > > > for
    > > > > > > > > > > > clientId. I
    > > > > > > > > > > > > > don't see clientId as one of the supported
    > selectors.
    > > > > > > > > > > > > > I can see how this would have made sense in a
    > > previous
    > > > > > > > iteration
    > > > > > > > > > > given
    > > > > > > > > > > > that
    > > > > > > > > > > > > > the previous client instance ID proposal was to
    > > > construct
    > > > > > the
    > > > > > > > > > client
    > > > > > > > > > > > > > instance ID using clientId as a prefix. Now that
    > the
    > > > > client
    > > > > > > > > > instance
    > > > > > > > > > > > ID is
    > > > > > > > > > > > > > a UUID, would we want to add clientId as a
    > supported
    > > > > > selector?
    > > > > > > > > > > > > > Let me know what you think.
    > > > > > > > > > > > > >
    > > > > > > > > > > > > > David
    > > > > > > > > > > > > >
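
To make the selector discussion above concrete, here is a minimal Python sketch of matching a client against metrics subscriptions. The selector names (client_instance_id, client_software_name, client_software_version, and the client_id that was added in response) come from the thread; the exact-equality matching rule, the `Subscription` class, the helper functions and the sample values are all assumptions for illustration, not the KIP's definition.

```python
from dataclasses import dataclass, field


@dataclass
class Subscription:
    # Selector name -> required value; an empty dict matches every client.
    selectors: dict = field(default_factory=dict)
    metrics: tuple = ()


def matches(subscription, client_attrs):
    """True if every selector equals the corresponding client attribute."""
    return all(client_attrs.get(k) == v for k, v in subscription.selectors.items())


def metrics_for(client_attrs, subscriptions):
    """Sorted union of metrics from all subscriptions that match this client."""
    wanted = set()
    for sub in subscriptions:
        if matches(sub, client_attrs):
            wanted.update(sub.metrics)
    return sorted(wanted)


if __name__ == "__main__":
    # Hypothetical client and subscriptions; metric names are placeholders.
    client = {
        "client_instance_id": "00000000-0000-0000-0000-000000000001",
        "client_id": "app-1",
        "client_software_name": "librdkafka",
        "client_software_version": "1.0",
    }
    subs = [
        Subscription({"client_id": "app-1"}, ("m1", "m2")),
        Subscription({"client_software_name": "librdkafka"}, ("m3",)),
    ]
    print(metrics_for(client, subs))
```

With client_id available as a selector, the operator can target a client by a name the application user already knows, instead of the UUID instance id.
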
    > > > > > > > > > > > > > On Tue, Oct 19, 2021 at 12:33 PM Magnus Edenhill <
    > > > > > > > > > magnus@edenhill.se
    > > > > > > > > > > >
    > > > > > > > > > > > > > wrote:
    > > > > > > > > > > > > >
    > > > > > > > > > > > > > > Hi Mickael!
    > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > On Tue, Oct 19, 2021 at 7:30 PM Mickael
    > Maison
    > > <
    > > > > > > > > > > > > > > mickael.maison@gmail.com
    > > > > > > > > > > > > > > > wrote:
    > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > Hi Magnus,
    > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > Thanks for the proposal.
    > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > 1. Looking at the protocol section, isn't
    > > > > > > > "ClientInstanceId"
    > > > > > > > > > > > expected
    > > > > > > > > > > > > > > > to be a field in
    > > > GetTelemetrySubscriptionsResponseV0?
    > > > > > > > > > Otherwise,
    > > > > > > > > > > > how
    > > > > > > > > > > > > > > > does a client retrieve this value?
    > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > Good catch, it got removed by mistake in one of
    > the
    > > > > > edits.
    > > > > > > > > > > > > > >
    > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > 2. In the client API section, you mention a new
    > > > > method
    > > > > > > > > > > > > > > > "clientInstanceId()". Can you clarify which
    > > > > interfaces
    > > > > > are
    > > > > > > > > > > > affected?
    > > > > > > > > > > > > > > > Is it only Consumer and Producer?
    > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > And Admin. Will update the KIP.
    > > > > > > > > > > > > > >
    > > > > > > > > > > > > > >
    > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > 3. I'm a bit concerned this is enabled by
    > > default.
    > > > > > Even if
    > > > > > > > > the
    > > > > > > > > > > data
    > > > > > > > > > > > > > > > collected is supposed to be not sensitive, I
    > > think
    > > > > > this can
    > > > > > > > > be
    > > > > > > > > > > > > > > > problematic in some environments. Also users
    > > don't
    > > > > > seem to
    > > > > > > > > have
    > > > > > > > > > > the
    > > > > > > > > > > > > > > > choice to only expose some metrics. Knowing how
    > > > much
    > > > > > data
    > > > > > > > > > transits
    > > > > > > > > > > > > > > > through some applications can be considered
    > > > critical.
    > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > The broker already knows how much data transits
    > > > through
    > > > > > the
    > > > > > > > > > client
    > > > > > > > > > > > > > though,
    > > > > > > > > > > > > > > right?
    > > > > > > > > > > > > > > Care has been taken not to expose information in
    > > the
    > > > > > standard
    > > > > > > > > > > metrics
    > > > > > > > > > > > > > that
    > > > > > > > > > > > > > > might
    > > > > > > > > > > > > > > reveal sensitive information.
    > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > Do you have an example of how the proposed
    > metrics
    > > > > could
    > > > > > leak
    > > > > > > > > > > > sensitive
    > > > > > > > > > > > > > > information?
    > > > > > > > > > > > > > > As for limiting what metrics to export, I
    > guess
    > > > > that
    > > > > > > > could
    > > > > > > > > > make
    > > > > > > > > > > > sense
    > > > > > > > > > > > > > > in some
    > > > > > > > > > > > > > > very sensitive use-cases, but those users might
    > > > disable
    > > > > > > > metrics
    > > > > > > > > > > > > > altogether
    > > > > > > > > > > > > > > for now.
    > > > > > > > > > > > > > > Could these concerns be addressed by a later KIP?
    > > > > > > > > > > > > > >
    > > > > > > > > > > > > > >
    > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > 4. As a user, how do you know if your
    > application
    > > > is
    > > > > > > > actively
    > > > > > > > > > > > sending
    > > > > > > > > > > > > > > > metrics? Are there new metrics exposing what's
    > > > going
    > > > > > on,
    > > > > > > > like
    > > > > > > > > > how
    > > > > > > > > > > > much
    > > > > > > > > > > > > > > > data is being sent?
    > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > That's a good question.
    > > > > > > > > > > > > > > Since the proposed metrics interface is not aimed
    > > at,
    > > > > or
    > > > > > > > > directly
    > > > > > > > > > > > > > available
    > > > > > > > > > > > > > > to, the application,
    > > > > > > > > > > > > > > I guess there's little point in adding it here,
    > but
    > > > > > instead
    > > > > > > > > > adding
    > > > > > > > > > > > > > > something to the
    > > > > > > > > > > > > > > existing JMX metrics?
    > > > > > > > > > > > > > > Do you have any suggestions?
    > > > > > > > > > > > > > >
    > > > > > > > > > > > > > >
    > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > 5. If all metrics are enabled on a regular
    > > Consumer
    > > > > or
    > > > > > > > > > Producer,
    > > > > > > > > > > do
    > > > > > > > > > > > > > > > you have an idea how much throughput this would
    > > > use?
    > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > It depends on the number of partitions/topics/etc.
    > > the
    > > > > > client
    > > > > > > > is
    > > > > > > > > > > > producing
    > > > > > > > > > > > > > > to/consuming from.
    > > > > > > > > > > > > > > I'll add some sizes to the KIP for some typical
    > > > > > use-cases.
    > > > > > > > > > > > > > >
    > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > Thanks,
    > > > > > > > > > > > > > > Magnus
    > > > > > > > > > > > > > >
    > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > Thanks
    > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > On Tue, Oct 19, 2021 at 5:06 PM Magnus
    > Edenhill <
    > > > > > > > > > > > magnus@edenhill.se>
    > > > > > > > > > > > > > > > wrote:
    > > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > > On Tue, Oct 19, 2021 at 1:22 PM Tom
    > Bentley <
    > > > > > > > > > > > tbentley@redhat.com
    > > > > > > > > > > > > > > > wrote:
    > > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > > > Hi Magnus,
    > > > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > > > I reviewed the KIP since you called the
    > vote
    > > > > > (sorry for
    > > > > > > > > not
    > > > > > > > > > > > > > reviewing
    > > > > > > > > > > > > > > > when
    > > > > > > > > > > > > > > > > > you announced your intention to call the
    > > > vote). I
    > > > > > have
    > > > > > > > a
    > > > > > > > > > few
    > > > > > > > > > > > > > > questions
    > > > > > > > > > > > > > > > on
    > > > > > > > > > > > > > > > > > some of the details.
    > > > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > > > 1. There's no Javadoc on
    > > > > > ClientTelemetryPayload.data(),
    > > > > > > > > so
    > > > > > > > > > I
    > > > > > > > > > > > don't
    > > > > > > > > > > > > > > know
    > > > > > > > > > > > > > > > > > whether the payload is exposed through this
    > > > > method
    > > > > > as
    > > > > > > > > > > > compressed or
    > > > > > > > > > > > > > > > not.
    > > > > > > > > > > > > > > > > > Later on you say "Decompression of the
    > > payloads
    > > > > > will be
    > > > > > > > > > > > handled by
    > > > > > > > > > > > > > > the
    > > > > > > > > > > > > > > > > > broker metrics plugin, the broker should
    > > > expose a
    > > > > > > > > suitable
    > > > > > > > > > > > > > > > decompression
    > > > > > > > > > > > > > > > > > API to the metrics plugin for this
    > purpose.",
    > > > > which
    > > > > > > > > > suggests
    > > > > > > > > > > > it's
    > > > > > > > > > > > > > the
    > > > > > > > > > > > > > > > > > compressed data in the buffer, but then we
    > > > don't
    > > > > > know
    > > > > > > > > which
    > > > > > > > > > > > codec
    > > > > > > > > > > > > > was
    > > > > > > > > > > > > > > > used,
    > > > > > > > > > > > > > > > > > nor the API via which the plugin should
    > > > > decompress
    > > > > > it
    > > > > > > > if
    > > > > > > > > > > > required
    > > > > > > > > > > > > > for
    > > > > > > > > > > > > > > > > > forwarding to the ultimate metrics store.
    > > > Should
    > > > > > the
    > > > > > > > > > > > > > > > ClientTelemetryPayload
    > > > > > > > > > > > > > > > > > expose a method to get the compression and
    > a
    > > > > > > > > decompressor?
    > > > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > > Good point, updated.
    > > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > > > 2. The client-side API is expressed as
    > > > > > StringOrError
    > > > > > > > > > > > > > > > > > ClientInstance::ClientInstanceId(int
    > > > > timeout_ms). I
    > > > > > > > > > > understand
    > > > > > > > > > > > that
    > > > > > > > > > > > > > > > you're
    > > > > > > > > > > > > > > > > > thinking about the librdkafka
    > implementation,
    > > > but
    > > > > > it
    > > > > > > > > would
    > > > > > > > > > be
    > > > > > > > > > > > good
    > > > > > > > > > > > > > to
    > > > > > > > > > > > > > > > show
    > > > > > > > > > > > > > > > > > the API as it would appear on the Apache
    > > Kafka
    > > > > > clients.
    > > > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > > This was meant as pseudo-code, but I changed
    > it
    > > > to
    > > > > > Java.
    > > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > > > 3. "PushTelemetryRequest|Response -
    > protocol
    > > > > > request
    > > > > > > > used
    > > > > > > > > > by
    > > > > > > > > > > > the
    > > > > > > > > > > > > > > > client to
    > > > > > > > > > > > > > > > > > send metrics to any broker it is connected
    > > to."
    > > > > To
    > > > > > be
    > > > > > > > > > clear,
    > > > > > > > > > > > this
    > > > > > > > > > > > > > > means
    > > > > > > > > > > > > > > > > > that the client can choose any of the
    > > connected
    > > > > > brokers
    > > > > > > > > and
    > > > > > > > > > > > push to
    > > > > > > > > > > > > > > > just
    > > > > > > > > > > > > > > > > > one of them? What should a supporting
    > client
    > > do
    > > > > if
    > > > > > it
    > > > > > > > > gets
    > > > > > > > > > an
    > > > > > > > > > > > error
    > > > > > > > > > > > > > > > when
    > > > > > > > > > > > > > > > > > pushing metrics to a broker, retry sending
    > to
    > > > the
    > > > > > same
    > > > > > > > > > broker
    > > > > > > > > > > > or
    > > > > > > > > > > > > > try
    > > > > > > > > > > > > > > > > > pushing to another broker, or drop the
    > > metrics?
    > > > > > Should
    > > > > > > > > > > > supporting
    > > > > > > > > > > > > > > > clients
    > > > > > > > > > > > > > > > > > send successive requests to a single
    > broker,
    > > or
    > > > > > round
    > > > > > > > > > robin,
    > > > > > > > > > > > or is
    > > > > > > > > > > > > > > > that up
    > > > > > > > > > > > > > > > > > to the client author? I'm guessing the
    > > > behaviour
    > > > > > should
    > > > > > > > > be
    > > > > > > > > > > > sticky
    > > > > > > > > > > > > > to
    > > > > > > > > > > > > > > > > > support the rate limiting features, but I
    > > think
    > > > > it
    > > > > > > > would
    > > > > > > > > be
    > > > > > > > > > > > good
    > > > > > > > > > > > > > for
    > > > > > > > > > > > > > > > client
    > > > > > > > > > > > > > > > > > authors if this section were explicit on
    > the
    > > > > > > > recommended
    > > > > > > > > > > > behaviour.
    > > > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > > You are right, I've updated the KIP to make
    > > this
    > > > > > clearer.
    > > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > > > 4. "Mapping the client instance id to an
    > > actual
    > > > > > > > > application
    > > > > > > > > > > > > > instance
    > > > > > > > > > > > > > > > > > running on a (virtual) machine can be done
    > by
    > > > > > > > inspecting
    > > > > > > > > > the
    > > > > > > > > > > > > > metrics
    > > > > > > > > > > > > > > > > > resource labels, such as the client source
    > > > > address
    > > > > > and
    > > > > > > > > > source
    > > > > > > > > > > > port,
    > > > > > > > > > > > > > > or
    > > > > > > > > > > > > > > > > > security principal, all of which are added
    > by
    > > > the
    > > > > > > > > receiving
    > > > > > > > > > > > broker.
    > > > > > > > > > > > > > > > This
    > > > > > > > > > > > > > > > > > will allow the operator together with the
    > > user
    > > > to
    > > > > > > > > identify
    > > > > > > > > > > the
    > > > > > > > > > > > > > actual
    > > > > > > > > > > > > > > > > > application instance." Is this really
    > always
    > > > > true?
    > > > > > The
    > > > > > > > > > source
    > > > > > > > > > > > IP
    > > > > > > > > > > > > > and
    > > > > > > > > > > > > > > > port
    > > > > > > > > > > > > > > > > > might be a loadbalancer/proxy in some
    > setups.
    > > > The
    > > > > > > > > > principal,
    > > > > > > > > > > as
    > > > > > > > > > > > > > > already
    > > > > > > > > > > > > > > > > > mentioned in the KIP, might be shared
    > between
    > > > > > multiple
    > > > > > > > > > > > > > applications.
    > > > > > > > > > > > > > > > So at
    > > > > > > > > > > > > > > > > > worst the organization running the clients
    > > > might
    > > > > > have
    > > > > > > > to
    > > > > > > > > > > > consult
    > > > > > > > > > > > > > the
    > > > > > > > > > > > > > > > logs
    > > > > > > > > > > > > > > > > > of a set of client applications, right?
    > > > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > > Yes, that's correct. There's no guaranteed
    > > > mapping
    > > > > > from
    > > > > > > > > > > > > > > > client_instance_id
    > > > > > > > > > > > > > > > > to
    > > > > > > > > > > > > > > > > an actual instance; that's why the KIP
    > > recommends
    > > > > > client
    > > > > > > > > > > > > > > implementations
    > > > > > > > > > > > > > > > to
    > > > > > > > > > > > > > > > > log the client instance id
    > > > > > > > > > > > > > > > > upon retrieval, and also provide an API for
    > the
    > > > > > > > application
    > > > > > > > > > to
    > > > > > > > > > > > > > retrieve
    > > > > > > > > > > > > > > > the
    > > > > > > > > > > > > > > > > instance id programmatically
    > > > > > > > > > > > > > > > > if it has a better way of exposing it.
    > > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > > > 5. "Tests indicate that a compression ratio
    > up
    > > to
    > > > > > 10x is
    > > > > > > > > > > > possible for
    > > > > > > > > > > > > > > the
    > > > > > > > > > > > > > > > > > standard metrics." Client authors might
    > > > > appreciate
    > > > > > your
    > > > > > > > > > > > mentioning
    > > > > > > > > > > > > > > > which
    > > > > > > > > > > > > > > > > > compression codec got these results.
    > > > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > > Good point. Updated.
    > > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > > > 6. "Should the client send a push request
    > > prior
    > > > > to
    > > > > > > > expiry
    > > > > > > > > > of
    > > > > > > > > > > > the
    > > > > > > > > > > > > > > > previously
    > > > > > > > > > > > > > > > > > calculated PushIntervalMs the broker will
    > > > discard
    > > > > > the
    > > > > > > > > > metrics
    > > > > > > > > > > > and
    > > > > > > > > > > > > > > > return a
    > > > > > > > > > > > > > > > > > PushTelemetryResponse with the ErrorCode
    > set
    > > to
    > > > > > > > > > RateLimited."
    > > > > > > > > > > > Is
    > > > > > > > > > > > > > this
    > > > > > > > > > > > > > > > > > RATE_LIMITED a new error code? It's not
    > > > mentioned
    > > > > > in
    > > > > > > > the
    > > > > > > > > > "New
    > > > > > > > > > > > Error
    > > > > > > > > > > > > > > > Codes"
    > > > > > > > > > > > > > > > > > section.
    > > > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > > That's a leftover; it should be using the
    > > > standard
    > > > > > > > > > ThrottleTime
    > > > > > > > > > > > > > > > mechanism.
    > > > > > > > > > > > > > > > > Fixed.
    > > > > > > > > > > > > > > > >
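
The early-push behaviour discussed in item 6 can be sketched as follows. This is only an illustration under stated assumptions: the idea that a push arriving before PushIntervalMs has elapsed is discarded and answered with a throttle time (rather than a dedicated error code) comes from the exchange above, while the `PushRateLimiter` class, its method name and the per-client bookkeeping are hypothetical and not the broker's actual implementation.

```python
class PushRateLimiter:
    """Hypothetical broker-side check; not the KIP's actual implementation."""

    def __init__(self, push_interval_ms):
        self.push_interval_ms = push_interval_ms
        # client_instance_id -> time (ms) of the last accepted push
        self._last_accepted_ms = {}

    def throttle_ms(self, client_instance_id, now_ms):
        """Return 0 if the push is accepted, else the remaining wait in ms."""
        last = self._last_accepted_ms.get(client_instance_id)
        if last is not None:
            remaining = self.push_interval_ms - (now_ms - last)
            if remaining > 0:
                # Too early: the metrics are discarded and the client is
                # told how long to wait before pushing again.
                return remaining
        self._last_accepted_ms[client_instance_id] = now_ms
        return 0


if __name__ == "__main__":
    limiter = PushRateLimiter(push_interval_ms=1000)
    print(limiter.throttle_ms("client-a", now_ms=0))     # first push
    print(limiter.throttle_ms("client-a", now_ms=400))   # early push
    print(limiter.throttle_ms("client-a", now_ms=1000))  # interval elapsed
```

A rejected push does not reset the timer in this sketch, so a client that retries too eagerly is not penalised beyond the original interval.
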
    > > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > > > 7. In the section "Standard client resource
    > > > > labels"
    > > > > > > > > > > > application_id
    > > > > > > > > > > > > > is
    > > > > > > > > > > > > > > > > > described as Kafka Streams only, but the
    > > > section
    > > > > of
    > > > > > > > > "Client
    > > > > > > > > > > > > > > > Identification"
    > > > > > > > > > > > > > > > > > talks about "application instance id as an
    > > > > optional
    > > > > > > > > future
    > > > > > > > > > > > > > > nice-to-have
    > > > > > > > > > > > > > > > > > that may be included as a metrics label if
    > it
    > > > has
    > > > > > been
    > > > > > > > > set
    > > > > > > > > > by
    > > > > > > > > > > > the
    > > > > > > > > > > > > > > > user", so
    > > > > > > > > > > > > > > > > > I'm confused whether non-Kafka Streams
    > > clients
    > > > > > should
    > > > > > > > set
    > > > > > > > > > an
    > > > > > > > > > > > > > > > application_id
    > > > > > > > > > > > > > > > > > or not.
    > > > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > > I'll clarify this in the KIP, but basically
    > we
    > > > > would
    > > > > > need
    > > > > > > > > to
    > > > > > > > > > > add
    > > > > > > > > > > > an `
    > > > > > > > > > > > > > > > > application.id` config
    > > > > > > > > > > > > > > > > property for non-streams clients for this
    > > > purpose,
    > > > > > and
    > > > > > > > > that's
    > > > > > > > > > > > outside
    > > > > > > > > > > > > > > the
    > > > > > > > > > > > > > > > > scope of this KIP since we want to make it
    > > > > > zero-conf-ish
    > > > > > > > on
    > > > > > > > > > the
    > > > > > > > > > > > > > client
    > > > > > > > > > > > > > > > side.
    > > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > > > Kind regards,
    > > > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > > > Tom
    > > > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > > Thanks for the review,
    > > > > > > > > > > > > > > > > Magnus
    > > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > > > On Thu, Oct 7, 2021 at 5:26 PM Magnus Edenhill <magnus@edenhill.se> wrote:
    > > > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > > > > Hi all,
    > > > > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > > > > I've updated the KIP following our recent discussions on the mailing list:
    > > > > > > > > > > > > > > > > > >  - split the protocol in two, one for getting the metrics subscriptions,
    > > > > > > > > > > > > > > > > > >    and one for pushing the metrics.
    > > > > > > > > > > > > > > > > > >  - simplifications: initially only one supported metrics format, no
    > > > > > > > > > > > > > > > > > >    client.id in the instance id, etc.
    > > > > > > > > > > > > > > > > > >  - made CLIENT_METRICS subscription configuration entries more structured
    > > > > > > > > > > > > > > > > > >    and allowing better client matching selectors (not only on the instance
    > > > > > > > > > > > > > > > > > >    id, but also the other client resource labels, such as
    > > > > > > > > > > > > > > > > > >    client_software_name, etc.).
    > > > > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > > > > Unless there are further comments I'll call the vote in a day or two.
    > > > > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > > > > Regards,
    > > > > > > > > > > > > > > > > > > Magnus
    > > > > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > > > > On Mon, Oct 4, 2021 at 20:57, Magnus Edenhill <magnus@edenhill.se> wrote:
    > > > > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > > > > > Hi Gwen,
    > > > > > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > > > > > I'm finishing up the KIP based on the last couple of discussion points in
    > > > > > > > > > > > > > > > > > > > this thread and will call the Vote later this week.
    > > > > > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > > > > > Best,
    > > > > > > > > > > > > > > > > > > > Magnus
    > > > > > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > > > > > On Sat, Oct 2, 2021 at 02:01, Gwen Shapira <gwen@confluent.io.invalid> wrote:
    > > > > > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > > > > >> Hey,
    > > > > > > > > > > > > > > > > > > >>
    > > > > > > > > > > > > > > > > > > > >> I noticed that there was no discussion for the last 10 days, but I
    > > > > > > > > > > > > > > > > > > > >> couldn't find the vote thread. Is there one that I'm missing?
    > > > > > > > > > > > > > > > > > > >>
    > > > > > > > > > > > > > > > > > > >> Gwen
    > > > > > > > > > > > > > > > > > > >>
    > > > > > > > > > > > > > > > > > > >> On Wed, Sep 22, 2021 at 4:58 AM Magnus
    > > > > > Edenhill <
    > > > > > > > > > > > > > > > magnus@edenhill.se>
    > > > > > > > > > > > > > > > > > > >> wrote:
    > > > > > > > > > > > > > > > > > > >>
    > > > > > > > > > > > > > > > > > > >> > Den tis 21 sep. 2021 kl 06:58 skrev
    > > > Colin
    > > > > > > > McCabe <
    > > > > > > > > > > > > > > > > > cmccabe@apache.org
    > > > > > > > > > > > > > > > > > > >:
    > > > > > > > > > > > > > > > > > > >> >
    > > > > > > > > > > > > > > > > > > >> > > On Mon, Sep 20, 2021, at 17:35,
    > Feng
    > > > Min
    > > > > > > > wrote:
    > > > > > > > > > > > > > > > > > > > >> > > > Thanks Magnus & Colin for the discussion.
    > > > > > > > > > > > > > > > > > > > >> > > >
    > > > > > > > > > > > > > > > > > > > >> > > > Based on KIP-714's stateless design, Client can pretty much use any
    > > > > > > > > > > > > > > > > > > > >> > > > connection to any broker to send metrics. We are not associating
    > > > > > > > > > > > > > > > > > > > >> > > > connection with client metric state. Is my understanding correct? If yes,
    > > > > > > > > > > > > > > > > > > > >> > > > how about the following two scenarios
    > > > > > > > > > > > > > > > > > > > >> > > >
    > > > > > > > > > > > > > > > > > > > >> > > > 1) One Client (Client-ID) registers two different client instance id via
    > > > > > > > > > > > > > > > > > > > >> > > > separate registration. Is it permitted? If OK, how to distinguish them
    > > > > > > > > > > > > > > > > > > > >> > > > from the case 2 below.
    > > > > > > > > > > > > > > > > > > >> > > >
    > > > > > > > > > > > > > > > > > > >> > >
    > > > > > > > > > > > > > > > > > > >> > > Hi Feng,
    > > > > > > > > > > > > > > > > > > >> > >
    > > > > > > > > > > > > > > > > > > > >> > > My understanding, which Magnus can clarify I guess, is that you could have
    > > > > > > > > > > > > > > > > > > > >> > > something like two Producer instances running with the same client.id
    > > > > > > > > > > > > > > > > > > > >> > > (perhaps because they're using the same config file, for example). They
    > > > > > > > > > > > > > > > > > > > >> > > could even be in the same process. But they would get separate UUIDs.
    > > > > > > > > > > > > > > > > > > > >> > >
    > > > > > > > > > > > > > > > > > > > >> > > I believe Magnus used the term client to mean "Producer or Consumer". So
    > > > > > > > > > > > > > > > > > > > >> > > if you have both a Producer and a Consumer in your application I would
    > > > > > > > > > > > > > > > > > > > >> > > expect you'd get separate UUIDs for both. Again Magnus can chime in here, I
    > > > > > > > > > > > > > > > > > > > >> > > guess.
    > > > > > > > > > > > > > > > > > > >> > >
    > > > > > > > > > > > > > > > > > > >> >
    > > > > > > > > > > > > > > > > > > >> > That's correct.
    > > > > > > > > > > > > > > > > > > >> >
    > > > > > > > > > > > > > > > > > > >> >
    > > > > > > > > > > > > > > > > > > >> > >
    > > > > > > > > > > > > > > > > > > >> > > > 2) How about the client
    > > restarting?
    > > > > > What's
    > > > > > > > the
    > > > > > > > > > > > > > > expectation?
    > > > > > > > > > > > > > > > > > Should
    > > > > > > > > > > > > > > > > > > >> the
    > > > > > > > > > > > > > > > > > > >> > > > server expect the client to
    > carry
    > > a
    > > > > > > > persisted
    > > > > > > > > > > client
    > > > > > > > > > > > > > > > instance id
    > > > > > > > > > > > > > > > > > > or
    > > > > > > > > > > > > > > > > > > >> > > should
    > > > > > > > > > > > > > > > > > > >> > > > the client be treated as a new
    > > > > instance?
    > > > > > > > > > > > > > > > > > > >> > >
    > > > > > > > > > > > > > > > > > > > >> > > The KIP doesn't describe any mechanism for persistence, so I would assume
    > > > > > > > > > > > > > > > > > > > >> > > that when you restart the client you get a new UUID. I agree that it would
    > > > > > > > > > > > > > > > > > > > >> > > be good to spell this out.
    > > > > > > > > > > > > > > > > > > >> > >
    > > > > > > > > > > > > > > > > > > >> > >
    > > > > > > > > > > > > > > > > > > > >> > Right, it will not be persisted since a client instance can't be restarted.
    > > > > > > > > > > > > > > > > > > > >> >
    > > > > > > > > > > > > > > > > > > > >> > Will update the KIP to make this clearer.
    > > > > > > > > > > > > > > > > > > >> >
    > > > > > > > > > > > > > > > > > > >> > /Magnus
    > > > > > > > > > > > > > > > > > > >> >
    > > > > > > > > > > > > > > > > > > >>
    > > > > > > > > > > > > > > > > > > >>
    > > > > > > > > > > > > > > > > > > >> --
    > > > > > > > > > > > > > > > > > > >> Gwen Shapira
    > > > > > > > > > > > > > > > > > > >> Engineering Manager | Confluent
    > > > > > > > > > > > > > > > > > > >> 650.450.2760 | @gwenshap
    > > > > > > > > > > > > > > > > > > >> Follow us: Twitter | blog
    > > > > > > > > > > > > > > > > > > >>
    > > > > > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > > >
    > > > > > > > > > > > > > >
    > > > > > > > > > > > > >
    > > > > > > > > > > >
    > > > > > > > > > >
    > > > > > > > > >
    > > > > > > > >
    > > > > > > >
    > > > > > >
    > > > > >
    > > > >
    > > >
    > >
    >

Re: [DISCUSS] KIP-714: Client metrics and observability

Posted by Kirk True <ki...@mustardgrain.com>.
Hi Jun,

On Tue, Mar 8, 2022, at 5:47 PM, Jun Rao wrote:
> Hi, Magnus, Sarat and Xavier,
> 
> Thanks for the reply. A few more comments below.
> 
> 20. It seems that we are piggybacking the plugin on the
> existing MetricsReporter. So, this seems fine.
> 
> 21. That could work. Are we requiring any additional jar dependency on the
> client? Or, are you suggesting that we check the runtime dependency to pick
> the compression codec?

The Java client doesn't require any additional libraries for compression, no.

> 28. For the broker metrics, could you spell out the full metric name
> including groups, tags, etc? We typically don't add the broker_id label for
> broker metrics. Also, brokers use Yammer metrics, which doesn't have type
> Sum.
> 
> 29. There are several client metrics listed as histogram. However, the java
> client currently doesn't support histogram type.

There does appear to be some code related to histograms in the org.apache.kafka.common.metrics.stats package. But we're still looking into the implementation to see if there's anything needed for KIP-714.

> 30. Could you show an example of the metric payload in PushTelemetryRequest
> to help understand how we organize metrics at different levels (per
> instance, per topic, per partition, per broker, etc)?
> 
> 31. Could you add a bit more detail on which client thread sends the
> PushTelemetryRequest?

Yes, I will add that to the KIP.

Thanks,
Kirk

> Thanks,
> 
> Jun
> 
> On Mon, Mar 7, 2022 at 11:48 AM Magnus Edenhill <ma...@edenhill.se> wrote:
> 
> > Hi Jun,
> >
> > Thanks for your initiated questions; see my answers below.
> > There have been a number of clarifications to the KIP.
> >
> >
> >
> > On Thu, Jan 27, 2022 at 20:08, Jun Rao <ju...@confluent.io.invalid> wrote:
> >
> > > Hi, Magnus,
> > >
> > > Thanks for updating the KIP. The overall approach makes sense to me. A few
> > > more detailed comments below.
> > >
> > > 20. ClientTelemetry: Should it be extending configurable and closable?
> > >
> >
> > I'll pass this question to Sarat and/or Xavier.
> >
> >
> >
> > > 21. Compression of the metrics on the client: what's the default?
> > >
> >
> > How about we specify a prioritized list: zstd, lz4, snappy, gzip?
> > But ultimately it is up to what the client supports.
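
A minimal sketch of the selection rule suggested above, assuming the client
simply walks the prioritized list (zstd, lz4, snappy, gzip) and takes the first
codec it has available. The function name `pick_codec` and the `None` fallback
(meaning "send uncompressed") are illustrative assumptions, not part of the KIP.

```python
# Priority order as suggested in this thread; the final KIP may differ.
PREFERRED_CODECS = ("zstd", "lz4", "snappy", "gzip")


def pick_codec(supported, preferred=PREFERRED_CODECS):
    """Return the first preferred codec the client supports.

    `supported` is whatever set of codec names the client library was built
    with. Returning None (send uncompressed) is an assumption of this sketch.
    """
    available = set(supported)
    for codec in preferred:
        if codec in available:
            return codec
    return None


# A client built with only lz4 and gzip support ends up using lz4.
chosen = pick_codec(["gzip", "lz4"])
```

The point of the ordering is that a client never needs to know why one codec is
preferred over another; it only needs to know which ones it can use.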
> >
> >
> > > 23. A client instance is considered a metric resource and the
> > > resource-level (thus client instance level) labels could include:
> > >     client_software_name=confluent-kafka-python
> > >     client_software_version=v2.1.3
> > >     client_instance_id=B64CD139-3975-440A-91D4
> > >     transactional_id=someTxnApp
> > > Are those labels added in PushTelemetryRequest? If so, are they per metric
> > > or per request?
> > >
> >
> >
> > client_software* and client_instance_id are not added by the client, but are
> > available to the broker-side metrics plugin to add as it sees fit; removed
> > them from the KIP.
> >
> > As for transactional_id, group_id, etc., which I believe will be useful in
> > troubleshooting, they are included only once (per push) as resource-level
> > attributes (the client instance is a singular resource).
> >
> >
> > >
> > > 24.  "the broker will only send
> > > GetTelemetrySubscriptionsResponse.DeltaTemporality=True" :
> > > 24.1 If it's always true, does it need to be part of the protocol?
> > >
> >
> > We're anticipating that it will take a lot longer to upgrade the majority
> > of clients than the
> > broker/plugin side, which is why we want the client to support both
> > temporalities out-of-the-box
> > so that cumulative reporting can be turned on seamlessly in the future.
> >
> >
> >
> > > 24.2 Does delta only apply to Counter type?
> > >
> >
> >
> > And Histograms. More details in Xavier's OTLP link.
> >
> >
> >
> > > 24.3 In the delta representation, the first request needs to send the full
> > > value, how does the broker plugin know whether a value is full or delta?
> > >
> >
> > The client may (should) send the start time for each metric sample,
> > indicating when
> > the metric began to be collected.
> > We've discussed whether this should be the client instance start time or
> > the time when a matching
> > metric subscription for that metric is received.
> > For completeness we recommend using the former, the client instance start
> > time.
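
A rough sketch of how a broker-side plugin could use the start time to fold
delta samples into running totals, under the assumption that every sample
carries the client instance start time as described above. The `Sample` fields
and the `CumulativeView` class are hypothetical names for illustration; they are
not the OTLP schema or anything defined by the KIP.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    metric: str
    start_time_ms: int  # when collection began (client instance start, assumed)
    time_ms: int        # when this sample was taken
    value: int
    is_delta: bool      # True: change since the previous sample; False: running total


class CumulativeView:
    """Keeps one running total per (metric, start time) pair."""

    def __init__(self):
        self._totals = {}

    def update(self, sample):
        key = (sample.metric, sample.start_time_ms)
        if sample.is_delta:
            # Delta samples are added to whatever has been seen so far.
            self._totals[key] = self._totals.get(key, 0) + sample.value
        else:
            # Cumulative samples already are the total since start_time_ms.
            self._totals[key] = sample.value
        return self._totals[key]


view = CumulativeView()
first = view.update(Sample("client.request.success", 1_000, 61_000, 5, True))
second = view.update(Sample("client.request.success", 1_000, 121_000, 3, True))
```

Because the key includes the start time, a sample with a different start time
begins a fresh series instead of being added to an old one.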
> >
> >
> >
> > > 25. quota:
> > > 25.1 Since we are fitting PushTelemetryRequest into the existing request
> > > quota, it would be useful to document the impact, i.e. client metric
> > > throttling causes the data from the same client to be delayed.
> > > 25.2 Is PushTelemetryRequest subject to the write bandwidth quota like the
> > > producer?
> > >
> >
> >
> > Yes, it should be, so as to protect the cluster from rogue clients.
> > But in practice the size of metrics will be quite low (e.g., 1-10 KB per
> > 60s interval), so I don't think this will pose a problem.
> > The KIP has been updated with more details on quota/throttling behaviour;
> > see the "Throttling and rate-limiting" section.
> >
> >
> > > 25.3 THROTTLING_QUOTA_EXCEEDED: Currently, we don't send this error when
> > > the request/bandwidth quota is exceeded since those requests are not
> > > rejected. We only set this error when the request is rejected (e.g., topic
> > > creation). It would be useful to clarify when this error is used.
> > >
> >
> > Right, I was trying to reuse an existing error-code. We can introduce
> > a new one for the case where a client pushes metrics at a higher frequency
> > than the configured push interval (e.g., out-of-profile sends).
> > This causes the broker to drop those metrics and send this error code back
> > to the client. There will be no connection throttling / channel-muting in
> > this case (unless the standard quotas are exceeded).
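
A small sketch of the drop rule described above, assuming the broker remembers
the last accepted push time per client instance id and rejects anything that
arrives sooner than the configured push interval. `PushIntervalGuard` and its
method names are hypothetical; how the broker actually tracks this is not
specified in this thread.

```python
class PushIntervalGuard:
    """Accepts at most one push per client instance per push interval."""

    def __init__(self, push_interval_ms):
        self.push_interval_ms = push_interval_ms
        self._last_accepted_ms = {}  # client_instance_id -> time of last accepted push

    def accept(self, client_instance_id, now_ms):
        last = self._last_accepted_ms.get(client_instance_id)
        if last is not None and now_ms - last < self.push_interval_ms:
            # Out-of-profile send: the metrics would be dropped and an error
            # code returned, without throttling or muting the connection.
            return False
        self._last_accepted_ms[client_instance_id] = now_ms
        return True


guard = PushIntervalGuard(push_interval_ms=60_000)
results = [
    guard.accept("b69cc35a", 0),       # first push is accepted
    guard.accept("b69cc35a", 1_000),   # too soon, dropped
    guard.accept("b69cc35a", 60_000),  # a full interval later, accepted
]
```

Only accepted pushes move the timestamp forward, so a client that keeps sending
too early does not push back its own next valid slot.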
> >
> >
> > > 27. kafka-client-metrics.sh: Could we add an example on how to disable a
> > > bad client?
> > >
> >
> > There's now a --block option to kafka-client-metrics.sh which overrides all
> > subscriptions
> > for the matched client(s). This allows silencing metrics for one or more
> > clients without having
> > to remove existing subscriptions. From the client's perspective it will
> > look like it no longer has
> > any subscriptions.
> >
> > # Block metrics collection for a specific client instance.
> > # A descriptive --name makes it easier to clean up old subscriptions;
> > # --match selects this specific client instance.
> > $ kafka-client-metrics.sh --bootstrap-server $BROKERS \
> >    --add \
> >    --name 'Disable_b69cc35a' \
> >    --match client_instance_id=b69cc35a-7a54-4790-aa69-cc2bd4ee4538 \
> >    --block
> >
> >
> >
> >
> > > 28. New broker side metrics: Could we spell out the details of the metrics
> > > (e.g., group, tags, etc)?
> > >
> >
> > KIP has been updated accordingly (thanks Sarat).
> >
> >
> >
> > >
> > > 29. Client instance-level metrics: client.io.wait.time is a gauge not a
> > > histogram.
> > >
> >
> > I believe a population/distribution should preferably be represented as a
> > histogram, space permitting,
> > and only secondarily as a Gauge average.
> > While we might not want to maintain a bunch of histograms for each
> > partition, since that could be
> > quite space consuming, this client.io.wait.time is a single metric per
> > client instance and can
> > thus afford a Histogram representation.
> >
> >
> >
> > Thanks,
> > Magnus
> >
> >
> >
> > > Thanks,
> > >
> > > Jun
> > >
> > > On Wed, Jan 26, 2022 at 6:32 AM Magnus Edenhill <ma...@edenhill.se>
> > > wrote:
> > >
> > > > Hi all,
> > > >
> > > > > I've updated the KIP with responses to the latest comments: Java client
> > > > > dependencies (Thanks Kirk!), alternate designs (separate cluster, separate
> > > > > producer, etc), etc.
> > > >
> > > > I will revive the vote thread.
> > > >
> > > > Thanks,
> > > > Magnus
> > > >
> > > >
> > > > > On Mon, Dec 13, 2021 at 22:32, Ryanne Dolan <ryannedolan@gmail.com> wrote:
> > > >
> > > > > I think we should be very careful about introducing new runtime
> > > > > dependencies into the clients. Historically this has been rare and
> > > > > essentially necessary (e.g. compression libs).
> > > > >
> > > > > Ryanne
> > > > >
> > > > > > On Mon, Dec 13, 2021, 1:06 PM Kirk True <ki...@mustardgrain.com> wrote:
> > > > >
> > > > > > Hi Jun,
> > > > > >
> > > > > > On Thu, Dec 9, 2021, at 2:28 PM, Jun Rao wrote:
> > > > > > > > 13. Using OpenTelemetry. Does that require runtime dependency
> > > > > > > > on OpenTelemetry library? How good is the compatibility story
> > > > > > > > of OpenTelemetry? This is important since an application could have other
> > > > > > > > OpenTelemetry dependencies than the Kafka client.
> > > > > >
> > > > > > > The current design is that the OpenTelemetry JARs would ship with the
> > > > > > > client. Perhaps we can design the client such that the JARs aren't even
> > > > > > > loaded if the user has opted out. The user could even exclude the JARs from
> > > > > > > their dependencies if they so wished.
> > > > > > >
> > > > > > > I can't speak to the compatibility of the libraries. Is it possible that
> > > > > > > we include a shaded version?
> > > > > >
> > > > > > Thanks,
> > > > > > Kirk
> > > > > >
> > > > > > >
> > > > > > > > 14. The proposal listed idempotence=true. This is more of a configuration
> > > > > > > > than a metric. Are we including that as a metric? What other configurations
> > > > > > > > are we including? Should we separate the configurations from the metrics?
> > > > > > >
> > > > > > > Thanks,
> > > > > > >
> > > > > > > Jun
> > > > > > >
> > > > > > > > On Mon, Nov 29, 2021 at 7:34 AM Magnus Edenhill <magnus@edenhill.se> wrote:
> > > > > > >
> > > > > > > > Hey Bob,
> > > > > > > >
> > > > > > > > That's a good point.
> > > > > > > >
> > > > > > > > > Request type labels were considered, but since they're already tracked by
> > > > > > > > > broker-side metrics they were left out so as to avoid metric duplication.
> > > > > > > > > However, those metrics are not per connection, so they won't be that useful
> > > > > > > > > in practice for troubleshooting specific client instances.
> > > > > > > >
> > > > > > > > I'll add the request_type label to the relevant metrics.
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Magnus
> > > > > > > >
> > > > > > > >
> > > > > > > > > On Tue, Nov 23, 2021 at 19:20, Bob Barrett <bo...@confluent.io.invalid> wrote:
> > > > > > > >
> > > > > > > > > Hi Magnus,
> > > > > > > > >
> > > > > > > > > Thanks for the thorough KIP, this seems very useful.
> > > > > > > > >
> > > > > > > > > > Would it make sense to include the request type as a label for the
> > > > > > > > > > `client.request.success`, `client.request.errors` and `client.request.rtt`
> > > > > > > > > > metrics? I think it would be very useful to see which specific requests are
> > > > > > > > > > succeeding and failing for a client. One specific case I can think of where
> > > > > > > > > > this could be useful is producer batch timeouts. If a Java application does
> > > > > > > > > > not enable producer client logs (unfortunately, in my experience this
> > > > > > > > > > happens more often than it should), the application logs will only contain
> > > > > > > > > > the expiration error message, but no information about what is causing the
> > > > > > > > > > timeout. The requests might all be succeeding but taking too long to
> > > > > > > > > > process batches, or metadata requests might be failing, or some or all
> > > > > > > > > > produce requests might be failing (if the bootstrap servers are reachable
> > > > > > > > > > from the client but one or more other brokers are not, for example). If the
> > > > > > > > > > cluster operator is able to identify the specific requests that are slow or
> > > > > > > > > > failing for a client, they will be better able to diagnose the issue
> > > > > > > > > > causing batch timeouts.
> > > > > > > > >
> > > > > > > > > > One drawback I can think of is that this will increase the cardinality of
> > > > > > > > > > the request metrics. But any given client is only going to use a small
> > > > > > > > > > subset of the request types, and since we already have partition labels for
> > > > > > > > > > the topic-level metrics, I think request labels will still make up a
> > > > > > > > > > relatively small percentage of the set of metrics.
> > > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > > Bob
> > > > > > > > >
> > > > > > > > > > On Mon, Nov 22, 2021 at 2:08 AM Viktor Somogyi-Vass <viktorsomogyi@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > > > Hi Magnus,
> > > > > > > > > >
> > > > > > > > > > > I think this is a very useful addition. We also have a similar (but much
> > > > > > > > > > > more simplistic) implementation of this. Maybe I missed it in the KIP but
> > > > > > > > > > > what about adding metrics about the subscription cache itself? That I think
> > > > > > > > > > > would improve its usability and debuggability as we'd be able to see its
> > > > > > > > > > > performance, hit/miss rates, eviction counts and others.
> > > > > > > > > >
> > > > > > > > > > Best,
> > > > > > > > > > Viktor
> > > > > > > > > >
> > > > > > > > > > > On Thu, Nov 18, 2021 at 5:12 PM Magnus Edenhill <magnus@edenhill.se> wrote:
> > > > > > > > > >
> > > > > > > > > > > Hi Mickael,
> > > > > > > > > > >
> > > > > > > > > > > see inline.
> > > > > > > > > > >
> > > > > > > > > > > > On Wed, Nov 10, 2021 at 15:21, Mickael Maison <mickael.maison@gmail.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Hi Magnus,
> > > > > > > > > > > >
> > > > > > > > > > > > > I see you've addressed some of the points I raised above but some (4,
> > > > > > > > > > > > > 5) have not been addressed yet.
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > Re 4) How will the user/app know metrics are being sent.
> > > > > > > > > > >
> > > > > > > > > > > > One possibility is to add a JMX metric (thus for user consumption) for
> > > > > > > > > > > > the number of metric pushes the client has performed, or perhaps the
> > > > > > > > > > > > number of metrics subscriptions currently being collected.
> > > > > > > > > > > > Would that be sufficient?
> > > > > > > > > > >
> > > > > > > > > > > Re 5) Metric sizes and rates
> > > > > > > > > > >
> > > > > > > > > > > > A worst case scenario for a producer that is producing to 50 unique
> > > > > > > > > > > > topics and emitting all standard metrics yields a serialized size of
> > > > > > > > > > > > around 100KB prior to compression, which compresses down to about 20-30%
> > > > > > > > > > > > of that depending on compression type and topic name uniqueness.
> > > > > > > > > > > > The numbers for a consumer would be similar.
> > > > > > > > > > > >
> > > > > > > > > > > > In practice the number of unique topics would be far less, and the
> > > > > > > > > > > > subscription set would typically be for a subset of metrics.
> > > > > > > > > > > > So we're probably closer to 1kb, or less, compressed size per client per
> > > > > > > > > > > > push interval.
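
To put the figures above into aggregate terms, here is a back-of-the-envelope
sketch. The 100KB worst case, the 20-30% compression range and the ~1kb typical
size come from the message above; the fleet of 10,000 clients and the 60 second
push interval are purely hypothetical inputs, and 1 KB is treated as 1,000 bytes.

```python
def compressed_bytes(raw_bytes, percent_of_raw):
    """Size after compression, given the compressed size as a percentage of raw."""
    return raw_bytes * percent_of_raw // 100


def ingest_bytes_per_second(clients, bytes_per_push, push_interval_s):
    """Aggregate telemetry arriving at the cluster from all clients."""
    return clients * bytes_per_push / push_interval_s


WORST_CASE_RAW_BYTES = 100_000  # ~100KB serialized, 50 topics, all standard metrics
worst_case_range = (
    compressed_bytes(WORST_CASE_RAW_BYTES, 20),
    compressed_bytes(WORST_CASE_RAW_BYTES, 30),
)

HYPOTHETICAL_CLIENTS = 10_000        # assumption for illustration only
HYPOTHETICAL_PUSH_INTERVAL_S = 60    # assumption for illustration only

typical_rate = ingest_bytes_per_second(
    HYPOTHETICAL_CLIENTS, 1_000, HYPOTHETICAL_PUSH_INTERVAL_S
)
worst_rate = ingest_bytes_per_second(
    HYPOTHETICAL_CLIENTS, worst_case_range[1], HYPOTHETICAL_PUSH_INTERVAL_S
)
```

Under these assumptions the worst case compresses to 20-30 KB per push, and the
typical case works out to roughly 167 KB/s across the whole hypothetical fleet,
against about 5 MB/s if every client were at the worst case.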
> > > > > > > > > > >
> > > > > > > > > > > > As both the subscription set and push intervals are controlled by the
> > > > > > > > > > > > cluster operator it shouldn't be too hard to strike a good balance
> > > > > > > > > > > > between metrics overhead and granularity.
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > > I'm really uneasy with this being enabled by default on the client
> > > > > > > > > > > > > side. When collecting data, I think the best practice is to ensure
> > > > > > > > > > > > > users are explicitly enabling it.
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > > Requiring metrics to be explicitly enabled on clients severely cripples
> > > > > > > > > > > > the feature's usability and value.
> > > > > > > > > > >
> > > > > > > > > > > > One of the problems that this KIP aims to solve is for useful metrics to
> > > > > > > > > > > > be available on demand regardless of the technical expertise of the user.
> > > > > > > > > > > > As Ryanne points out, a savvy user/organization will typically have
> > > > > > > > > > > > metrics collection and monitoring in place already, and the benefits of
> > > > > > > > > > > > this KIP are then more of a common set and format of metrics across
> > > > > > > > > > > > client implementations and languages.
> > > > > > > > > > > > But that is not the typical Kafka user in my experience; they're not
> > > > > > > > > > > > Kafka experts and they don't have the knowledge of how to best instrument
> > > > > > > > > > > > their clients.
> > > > > > > > > > > > Having metrics enabled by default for this user base allows the Kafka
> > > > > > > > > > > > operators to proactively and reactively monitor and troubleshoot client
> > > > > > > > > > > > issues, without the need for the less savvy user to do anything.
> > > > > > > > > > > > It is often too late to tell a user to enable metrics when the problem
> > > > > > > > > > > > has already occurred.
> > > > > > > > > > >
> > > > > > > > > > > > Now, to be clear, even though metrics are enabled by default on clients,
> > > > > > > > > > > > they are not enabled by default on the brokers; the Kafka operator needs
> > > > > > > > > > > > to build and set up a metrics plugin and add metrics subscriptions
> > > > > > > > > > > > before anything is sent from the client.
> > > > > > > > > > > > It is opt-out on the clients and opt-in on the broker.
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > > > You mentioned brokers already have some(most?) of the information
> > > > > > > > > > > > > contained in metrics, if so then why are we collecting it again?
> > > > > > > > > > > > > Surely there must be some new information in the client metrics.
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > From the user's perspective the Kafka infrastructure
> > > extends
> > > > > from
> > > > > > > > > > > producer.send() to
> > > > > > > > > > > messages being returned from consumer.poll(), a giant
> > black
> > > > box
> > > > > > where
> > > > > > > > > > > there's a lot going on between those
> > > > > > > > > > > two points. The brokers currently only see what happens
> > > once
> > > > > > those
> > > > > > > > > > requests
> > > > > > > > > > > > and messages hit the broker,
> > > > > > > > > > > but as Kafka clients are complex pieces of machinery
> > > there's
> > > > a
> > > > > > myriad
> > > > > > > > > of
> > > > > > > > > > > queues, timers, and state
> > > > > > > > > > > that's critical to the operation and infrastructure
> > that's
> > > > not
> > > > > > > > > currently
> > > > > > > > > > > visible to the operator.
> > > > > > > > > > > > Relying on the user to provide this missing information
> > > > > > > > > > > > accurately and in a timely manner is not generally feasible.
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > Most of the standard metrics listed in the KIP are data
> > > > points
> > > > > > that
> > > > > > > > the
> > > > > > > > > > > broker does not have.
> > > > > > > > > > > Only a small number of metrics are duplicates (like the
> > > > request
> > > > > > > > counts
> > > > > > > > > > and
> > > > > > > > > > > sizes), but they are included
> > > > > > > > > > > to ease correlation when inspecting these client metrics.
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > > Moreover this is a brand new feature so it's even
> > harder
> > > to
> > > > > > justify
> > > > > > > > > > > > > enabling it and forcing it onto all our users. If disabled
> > > by
> > > > > > default,
> > > > > > > > > > > > it's relatively easy to enable in a new release if we
> > > > decide
> > > > > > to,
> > > > > > > > but
> > > > > > > > > > > > once enabled by default it's much harder to disable.
> > Also
> > > > > this
> > > > > > > > > feature
> > > > > > > > > > > > will apply to all future metrics we will add.
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > I think maturity of a feature implementation should be
> > the
> > > > > > deciding
> > > > > > > > > > factor,
> > > > > > > > > > > rather than
> > > > > > > > > > > the design of it (which this KIP is). I.e., if the
> > > > > > implementation is
> > > > > > > > > not
> > > > > > > > > > > deemed mature enough
> > > > > > > > > > > for release X.Y it will be disabled.
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > > Overall I think it's an interesting feature but I'd
> > > prefer
> > > > to
> > > > > > be
> > > > > > > > > > > > slightly defensive and see how it works in practice
> > > before
> > > > > > enabling
> > > > > > > > > it
> > > > > > > > > > > > everywhere.
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > Right, and I agree on being defensive, but since this
> > > feature
> > > > > > still
> > > > > > > > > > > requires manual
> > > > > > > > > > > enabling on the brokers before actually being used, I
> > think
> > > > > that
> > > > > > > > gives
> > > > > > > > > > > enough control
> > > > > > > > > > > to opt-in or out of this feature as needed.
> > > > > > > > > > >
> > > > > > > > > > > Thanks for your comments!
> > > > > > > > > > >
> > > > > > > > > > > Regards,
> > > > > > > > > > > Magnus
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > > Thanks,
> > > > > > > > > > > > Mickael
> > > > > > > > > > > >
> > > > > > > > > > > > On Mon, Nov 8, 2021 at 7:55 AM Magnus Edenhill <
> > > > > > magnus@edenhill.se
> > > > > > > > >
> > > > > > > > > > > wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > Thanks David for pointing this out,
> > > > > > > > > > > > > I've updated the KIP to include client_id as a
> > matching
> > > > > > selector.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Regards,
> > > > > > > > > > > > > Magnus
> > > > > > > > > > > > >
> > > > > > > > > > > > > > On Thu, Nov 4, 2021 at 18:01, David Mao
> > > > > > > > > > > > > > <dmao@confluent.io.invalid> wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > Hey Magnus,
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > I noticed that the KIP outlines the initial
> > selectors
> > > > > > supported
> > > > > > > > > as:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >    - client_instance_id - CLIENT_INSTANCE_ID UUID
> > > > string
> > > > > > > > > > > > representation.
> > > > > > > > > > > > > >    - client_software_name  - client software
> > > > > implementation
> > > > > > > > name.
> > > > > > > > > > > > > >    - client_software_version  - client software
> > > > > > implementation
> > > > > > > > > > > version.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > In the given reactive monitoring workflow, we
> > mention
> > > > > that
> > > > > > the
> > > > > > > > > > > > application
> > > > > > > > > > > > > > user does not know their client's client instance
> > ID,
> > > > but
> > > > > > it's
> > > > > > > > > > > outlined
> > > > > > > > > > > > > > that the operator can add a metrics subscription
> > > > > selecting
> > > > > > for
> > > > > > > > > > > > clientId. I
> > > > > > > > > > > > > > don't see clientId as one of the supported
> > selectors.
> > > > > > > > > > > > > > I can see how this would have made sense in a
> > > previous
> > > > > > > > iteration
> > > > > > > > > > > given
> > > > > > > > > > > > that
> > > > > > > > > > > > > > the previous client instance ID proposal was to
> > > > construct
> > > > > > the
> > > > > > > > > > client
> > > > > > > > > > > > > > instance ID using clientId as a prefix. Now that
> > the
> > > > > client
> > > > > > > > > > instance
> > > > > > > > > > > > ID is
> > > > > > > > > > > > > > a UUID, would we want to add clientId as a
> > supported
> > > > > > selector?
> > > > > > > > > > > > > > Let me know what you think.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > David
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Tue, Oct 19, 2021 at 12:33 PM Magnus Edenhill <
> > > > > > > > > > magnus@edenhill.se
> > > > > > > > > > > >
> > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Hi Mickael!
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > On Tue, Oct 19, 2021 at 19:30, Mickael Maison
> > > > > > > > > > > > > > > > <mickael.maison@gmail.com> wrote:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Hi Magnus,
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Thanks for the proposal.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > 1. Looking at the protocol section, isn't
> > > > > > > > "ClientInstanceId"
> > > > > > > > > > > > expected
> > > > > > > > > > > > > > > > to be a field in
> > > > GetTelemetrySubscriptionsResponseV0?
> > > > > > > > > > Otherwise,
> > > > > > > > > > > > how
> > > > > > > > > > > > > > > > does a client retrieve this value?
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Good catch, it got removed by mistake in one of
> > the
> > > > > > edits.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > 2. In the client API section, you mention a new
> > > > > method
> > > > > > > > > > > > > > > > "clientInstanceId()". Can you clarify which
> > > > > interfaces
> > > > > > are
> > > > > > > > > > > > affected?
> > > > > > > > > > > > > > > > Is it only Consumer and Producer?
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > And Admin. Will update the KIP.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > 3. I'm a bit concerned this is enabled by
> > > default.
> > > > > > Even if
> > > > > > > > > the
> > > > > > > > > > > data
> > > > > > > > > > > > > > > > collected is supposed to be not sensitive, I
> > > think
> > > > > > this can
> > > > > > > > > be
> > > > > > > > > > > > > > > > problematic in some environments. Also users
> > > don't
> > > > > > seem to
> > > > > > > > > have
> > > > > > > > > > > the
> > > > > > > > > > > > > > > > choice to only expose some metrics. Knowing how
> > > > much
> > > > > > data
> > > > > > > > > > > transits
> > > > > > > > > > > > > > > > through some applications can be considered
> > > > critical.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > The broker already knows how much data transits
> > > > through
> > > > > > the
> > > > > > > > > > client
> > > > > > > > > > > > > > though,
> > > > > > > > > > > > > > > right?
> > > > > > > > > > > > > > > Care has been taken not to expose information in
> > > the
> > > > > > standard
> > > > > > > > > > > metrics
> > > > > > > > > > > > > > that
> > > > > > > > > > > > > > > might
> > > > > > > > > > > > > > > reveal sensitive information.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Do you have an example of how the proposed
> > metrics
> > > > > could
> > > > > > leak
> > > > > > > > > > > > sensitive
> > > > > > > > > > > > > > > information?
> > > > > > > > > > > > > > > > As for limiting which metrics to export, I
> > guess
> > > > > that
> > > > > > > > could
> > > > > > > > > > make
> > > > > > > > > > > > sense
> > > > > > > > > > > > > > > in some
> > > > > > > > > > > > > > > very sensitive use-cases, but those users might
> > > > disable
> > > > > > > > metrics
> > > > > > > > > > > > > > altogether
> > > > > > > > > > > > > > > for now.
> > > > > > > > > > > > > > > Could these concerns be addressed by a later KIP?
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > 4. As a user, how do you know if your
> > application
> > > > is
> > > > > > > > actively
> > > > > > > > > > > > sending
> > > > > > > > > > > > > > > > metrics? Are there new metrics exposing what's
> > > > going
> > > > > > on,
> > > > > > > > like
> > > > > > > > > > how
> > > > > > > > > > > > much
> > > > > > > > > > > > > > > > data is being sent?
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > That's a good question.
> > > > > > > > > > > > > > > Since the proposed metrics interface is not aimed
> > > at,
> > > > > or
> > > > > > > > > directly
> > > > > > > > > > > > > > available
> > > > > > > > > > > > > > > to, the application
> > > > > > > > > > > > > > > > I guess there's little point in adding it here,
> > but
> > > > > > instead
> > > > > > > > > > adding
> > > > > > > > > > > > > > > something to the
> > > > > > > > > > > > > > > existing JMX metrics?
> > > > > > > > > > > > > > > Do you have any suggestions?
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > 5. If all metrics are enabled on a regular
> > > Consumer
> > > > > or
> > > > > > > > > > Producer,
> > > > > > > > > > > do
> > > > > > > > > > > > > > > > you have an idea how much throughput this would
> > > > use?
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > It depends on the number of partitions/topics/etc.
> > > the
> > > > > > client
> > > > > > > > is
> > > > > > > > > > > > producing
> > > > > > > > > > > > > > > to/consuming from.
> > > > > > > > > > > > > > > I'll add some sizes to the KIP for some typical
> > > > > > use-cases.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > > > Magnus
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Thanks
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > On Tue, Oct 19, 2021 at 5:06 PM Magnus
> > Edenhill <
> > > > > > > > > > > > magnus@edenhill.se>
> > > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > On Tue, Oct 19, 2021 at 13:22, Tom Bentley
> > > > > > > > > > > > > > > > > > <tbentley@redhat.com> wrote:
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Hi Magnus,
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > I reviewed the KIP since you called the
> > vote
> > > > > > (sorry for
> > > > > > > > > not
> > > > > > > > > > > > > > reviewing
> > > > > > > > > > > > > > > > when
> > > > > > > > > > > > > > > > > > you announced your intention to call the
> > > > vote). I
> > > > > > have
> > > > > > > > a
> > > > > > > > > > few
> > > > > > > > > > > > > > > questions
> > > > > > > > > > > > > > > > on
> > > > > > > > > > > > > > > > > > some of the details.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > 1. There's no Javadoc on
> > > > > > ClientTelemetryPayload.data(),
> > > > > > > > > so
> > > > > > > > > > I
> > > > > > > > > > > > don't
> > > > > > > > > > > > > > > know
> > > > > > > > > > > > > > > > > > whether the payload is exposed through this
> > > > > method
> > > > > > as
> > > > > > > > > > > > compressed or
> > > > > > > > > > > > > > > > not.
> > > > > > > > > > > > > > > > > > Later on you say "Decompression of the
> > > payloads
> > > > > > will be
> > > > > > > > > > > > handled by
> > > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > > broker metrics plugin, the broker should
> > > > expose a
> > > > > > > > > suitable
> > > > > > > > > > > > > > > > decompression
> > > > > > > > > > > > > > > > > > API to the metrics plugin for this
> > purpose.",
> > > > > which
> > > > > > > > > > suggests
> > > > > > > > > > > > it's
> > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > > compressed data in the buffer, but then we
> > > > don't
> > > > > > know
> > > > > > > > > which
> > > > > > > > > > > > codec
> > > > > > > > > > > > > > was
> > > > > > > > > > > > > > > > used,
> > > > > > > > > > > > > > > > > > nor the API via which the plugin should
> > > > > decompress
> > > > > > it
> > > > > > > > if
> > > > > > > > > > > > required
> > > > > > > > > > > > > > for
> > > > > > > > > > > > > > > > > > forwarding to the ultimate metrics store.
> > > > Should
> > > > > > the
> > > > > > > > > > > > > > > > ClientTelemetryPayload
> > > > > > > > > > > > > > > > > > expose a method to get the compression and
> > a
> > > > > > > > > decompressor?
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Good point, updated.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > 2. The client-side API is expressed as
> > > > > > StringOrError
> > > > > > > > > > > > > > > > > > ClientInstance::ClientInstanceId(int
> > > > > timeout_ms). I
> > > > > > > > > > > understand
> > > > > > > > > > > > that
> > > > > > > > > > > > > > > > you're
> > > > > > > > > > > > > > > > > > thinking about the librdkafka
> > implementation,
> > > > but
> > > > > > it
> > > > > > > > > would
> > > > > > > > > > be
> > > > > > > > > > > > good
> > > > > > > > > > > > > > to
> > > > > > > > > > > > > > > > show
> > > > > > > > > > > > > > > > > > the API as it would appear on the Apache
> > > Kafka
> > > > > > clients.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > This was meant as pseudo-code, but I changed
> > it
> > > > to
> > > > > > Java.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > 3. "PushTelemetryRequest|Response -
> > protocol
> > > > > > request
> > > > > > > > used
> > > > > > > > > > by
> > > > > > > > > > > > the
> > > > > > > > > > > > > > > > client to
> > > > > > > > > > > > > > > > > > send metrics to any broker it is connected
> > > to."
> > > > > To
> > > > > > be
> > > > > > > > > > clear,
> > > > > > > > > > > > this
> > > > > > > > > > > > > > > means
> > > > > > > > > > > > > > > > > > that the client can choose any of the
> > > connected
> > > > > > brokers
> > > > > > > > > and
> > > > > > > > > > > > push to
> > > > > > > > > > > > > > > > just
> > > > > > > > > > > > > > > > > > one of them? What should a supporting
> > client
> > > do
> > > > > if
> > > > > > it
> > > > > > > > > gets
> > > > > > > > > > an
> > > > > > > > > > > > error
> > > > > > > > > > > > > > > > when
> > > > > > > > > > > > > > > > > > pushing metrics to a broker, retry sending
> > to
> > > > the
> > > > > > same
> > > > > > > > > > broker
> > > > > > > > > > > > or
> > > > > > > > > > > > > > try
> > > > > > > > > > > > > > > > > > pushing to another broker, or drop the
> > > metrics?
> > > > > > Should
> > > > > > > > > > > > supporting
> > > > > > > > > > > > > > > > clients
> > > > > > > > > > > > > > > > > > send successive requests to a single
> > broker,
> > > or
> > > > > > round
> > > > > > > > > > robin,
> > > > > > > > > > > > or is
> > > > > > > > > > > > > > > > that up
> > > > > > > > > > > > > > > > > > to the client author? I'm guessing the
> > > > behaviour
> > > > > > should
> > > > > > > > > be
> > > > > > > > > > > > sticky
> > > > > > > > > > > > > > to
> > > > > > > > > > > > > > > > > > support the rate limiting features, but I
> > > think
> > > > > it
> > > > > > > > would
> > > > > > > > > be
> > > > > > > > > > > > good
> > > > > > > > > > > > > > for
> > > > > > > > > > > > > > > > client
> > > > > > > > > > > > > > > > > > authors if this section were explicit on
> > the
> > > > > > > > recommended
> > > > > > > > > > > > behaviour.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > You are right, I've updated the KIP to make
> > > this
> > > > > > clearer.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > 4. "Mapping the client instance id to an
> > > actual
> > > > > > > > > application
> > > > > > > > > > > > > > instance
> > > > > > > > > > > > > > > > > > running on a (virtual) machine can be done
> > by
> > > > > > > > inspecting
> > > > > > > > > > the
> > > > > > > > > > > > > > metrics
> > > > > > > > > > > > > > > > > > resource labels, such as the client source
> > > > > address
> > > > > > and
> > > > > > > > > > source
> > > > > > > > > > > > port,
> > > > > > > > > > > > > > > or
> > > > > > > > > > > > > > > > > > security principal, all of which are added
> > by
> > > > the
> > > > > > > > > receiving
> > > > > > > > > > > > broker.
> > > > > > > > > > > > > > > > This
> > > > > > > > > > > > > > > > > > will allow the operator together with the
> > > user
> > > > to
> > > > > > > > > identify
> > > > > > > > > > > the
> > > > > > > > > > > > > > actual
> > > > > > > > > > > > > > > > > > application instance." Is this really
> > always
> > > > > true?
> > > > > > The
> > > > > > > > > > source
> > > > > > > > > > > > IP
> > > > > > > > > > > > > > and
> > > > > > > > > > > > > > > > port
> > > > > > > > > > > > > > > > > > might be a loadbalancer/proxy in some
> > setups.
> > > > The
> > > > > > > > > > principal,
> > > > > > > > > > > as
> > > > > > > > > > > > > > > already
> > > > > > > > > > > > > > > > > > mentioned in the KIP, might be shared
> > between
> > > > > > multiple
> > > > > > > > > > > > > > applications.
> > > > > > > > > > > > > > > > So at
> > > > > > > > > > > > > > > > > > worst the organization running the clients
> > > > might
> > > > > > have
> > > > > > > > to
> > > > > > > > > > > > consult
> > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > logs
> > > > > > > > > > > > > > > > > > of a set of client applications, right?
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Yes, that's correct. There's no guaranteed
> > > > mapping
> > > > > > from
> > > > > > > > > > > > > > > > client_instance_id
> > > > > > > > > > > > > > > > > to
> > > > > > > > > > > > > > > > > an actual instance, that's why the KIP
> > > recommends
> > > > > > client
> > > > > > > > > > > > > > > implementations
> > > > > > > > > > > > > > > > to
> > > > > > > > > > > > > > > > > log the client instance id
> > > > > > > > > > > > > > > > > upon retrieval, and also provide an API for
> > the
> > > > > > > > application
> > > > > > > > > > to
> > > > > > > > > > > > > > retrieve
> > > > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > instance id programmatically
> > > > > > > > > > > > > > > > > if it has a better way of exposing it.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > 5. "Tests indicate that a compression ratio
> > up
> > > to
> > > > > > 10x is
> > > > > > > > > > > > possible for
> > > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > > standard metrics." Client authors might
> > > > > appreciate
> > > > > > your
> > > > > > > > > > > > mentioning
> > > > > > > > > > > > > > > > which
> > > > > > > > > > > > > > > > > > compression codec got these results.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Good point. Updated.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > 6. "Should the client send a push request
> > > prior
> > > > > to
> > > > > > > > expiry
> > > > > > > > > > of
> > > > > > > > > > > > the
> > > > > > > > > > > > > > > > previously
> > > > > > > > > > > > > > > > > > calculated PushIntervalMs the broker will
> > > > discard
> > > > > > the
> > > > > > > > > > metrics
> > > > > > > > > > > > and
> > > > > > > > > > > > > > > > return a
> > > > > > > > > > > > > > > > > > PushTelemetryResponse with the ErrorCode
> > set
> > > to
> > > > > > > > > > RateLimited."
> > > > > > > > > > > > Is
> > > > > > > > > > > > > > this
> > > > > > > > > > > > > > > > > > RATE_LIMITED a new error code? It's not
> > > > mentioned
> > > > > > in
> > > > > > > > the
> > > > > > > > > > "New
> > > > > > > > > > > > Error
> > > > > > > > > > > > > > > > Codes"
> > > > > > > > > > > > > > > > > > section.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > That's a leftover, it should be using the
> > > > standard
> > > > > > > > > > ThrottleTime
> > > > > > > > > > > > > > > > mechanism.
> > > > > > > > > > > > > > > > > Fixed.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > 7. In the section "Standard client resource
> > > > > labels"
> > > > > > > > > > > > application_id
> > > > > > > > > > > > > > is
> > > > > > > > > > > > > > > > > > described as Kafka Streams only, but the
> > > > section
> > > > > of
> > > > > > > > > "Client
> > > > > > > > > > > > > > > > Identification"
> > > > > > > > > > > > > > > > > > talks about "application instance id as an
> > > > > optional
> > > > > > > > > future
> > > > > > > > > > > > > > > nice-to-have
> > > > > > > > > > > > > > > > > > that may be included as a metrics label if
> > it
> > > > has
> > > > > > been
> > > > > > > > > set
> > > > > > > > > > by
> > > > > > > > > > > > the
> > > > > > > > > > > > > > > > user", so
> > > > > > > > > > > > > > > > > > I'm confused whether non-Kafka Streams
> > > clients
> > > > > > should
> > > > > > > > set
> > > > > > > > > > an
> > > > > > > > > > > > > > > > application_id
> > > > > > > > > > > > > > > > > > or not.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > I'll clarify this in the KIP, but basically
> > we
> > > > > would
> > > > > > need
> > > > > > > > > to
> > > > > > > > > > > add
> > > > > > > > > > > > an `
> > > > > > > > > > > > > > > > > application.id` config
> > > > > > > > > > > > > > > > > property for non-streams clients for this
> > > > purpose,
> > > > > > and
> > > > > > > > > that's
> > > > > > > > > > > > outside
> > > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > scope of this KIP since we want to make it
> > > > > > zero-conf:ish
> > > > > > > > on
> > > > > > > > > > the
> > > > > > > > > > > > > > client
> > > > > > > > > > > > > > > > side.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Kind regards,
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Tom
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Thanks for the review,
> > > > > > > > > > > > > > > > > Magnus
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > On Thu, Oct 7, 2021 at 5:26 PM Magnus
> > > Edenhill
> > > > <
> > > > > > > > > > > > magnus@edenhill.se
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > Hi all,
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > I've updated the KIP following our recent
> > > > > > discussions
> > > > > > > > > on
> > > > > > > > > > > the
> > > > > > > > > > > > > > > mailing
> > > > > > > > > > > > > > > > > > list:
> > > > > > > > > > > > > > > > > > >  - split the protocol in two, one for
> > > getting
> > > > > the
> > > > > > > > > metrics
> > > > > > > > > > > > > > > > subscriptions,
> > > > > > > > > > > > > > > > > > > and one for pushing the metrics.
> > > > > > > > > > > > > > > > > > >  - simplifications: initially only one
> > > > > supported
> > > > > > > > > metrics
> > > > > > > > > > > > format,
> > > > > > > > > > > > > > no
> > > > > > > > > > > > > > > > > > > client.id in the instance id, etc.
> > > > > > > > > > > > > > > > > > >  - made CLIENT_METRICS subscription
> > > > > configuration
> > > > > > > > > entries
> > > > > > > > > > > > more
> > > > > > > > > > > > > > > > structured
> > > > > > > > > > > > > > > > > > >    and allowing better client matching
> > > > > selectors
> > > > > > (not
> > > > > > > > > > only
> > > > > > > > > > > > on the
> > > > > > > > > > > > > > > > > > instance
> > > > > > > > > > > > > > > > > > > id, but also the other
> > > > > > > > > > > > > > > > > > >    client resource labels, such as
> > > > > > > > > client_software_name,
> > > > > > > > > > > > etc.).
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > Unless there are further comments I'll
> > call
> > > > the
> > > > > > vote
> > > > > > > > > in a
> > > > > > > > > > > > day or
> > > > > > > > > > > > > > > two.
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > Regards,
> > > > > > > > > > > > > > > > > > > Magnus
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > On Mon, Oct 4, 2021 at 20:57, Magnus Edenhill
> > > > > > > > > > > > > > > > > > > > <magnus@edenhill.se> wrote:
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > Hi Gwen,
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > I'm finishing up the KIP based on the
> > > last
> > > > > > couple
> > > > > > > > of
> > > > > > > > > > > > discussion
> > > > > > > > > > > > > > > > points
> > > > > > > > > > > > > > > > > > in
> > > > > > > > > > > > > > > > > > > > this thread
> > > > > > > > > > > > > > > > > > > > and will call the Vote later this week.
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > Best,
> > > > > > > > > > > > > > > > > > > > Magnus
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > On Sat, Oct 2, 2021 at 02:01, Gwen Shapira
> > > > > > > > > > > > > > > > > > > > > <gwen@confluent.io.invalid> wrote:
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > >> Hey,
> > > > > > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > > > > > >> I noticed that there was no discussion
> > > for
> > > > > the
> > > > > > > > last
> > > > > > > > > 10
> > > > > > > > > > > > days,
> > > > > > > > > > > > > > > but I
> > > > > > > > > > > > > > > > > > > >> couldn't
> > > > > > > > > > > > > > > > > > > >> find the vote thread. Is there one
> > that
> > > > I'm
> > > > > > > > missing?
> > > > > > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > > > > > >> Gwen
> > > > > > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > > > > > >> On Wed, Sep 22, 2021 at 4:58 AM Magnus
> > > > > > Edenhill <
> > > > > > > > > > > > > > > > magnus@edenhill.se>
> > > > > > > > > > > > > > > > > > > >> wrote:
> > > > > > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > > > > > > >> > On Tue, Sep 21, 2021 at 06:58, Colin McCabe
> > > > > > > > > > > > > > > > > > > > >> > <cmccabe@apache.org> wrote:
> > > > > > > > > > > > > > > > > > > >> >
> > > > > > > > > > > > > > > > > > > >> > > On Mon, Sep 20, 2021, at 17:35,
> > Feng
> > > > Min
> > > > > > > > wrote:
> > > > > > > > > > > > > > > > > > > >> > > > Thanks Magnus & Colin for the
> > > > > > discussion.
> > > > > > > > > > > > > > > > > > > >> > > >
> > > > > > > > > > > > > > > > > > > >> > > > Based on KIP-714's stateless
> > > design,
> > > > > > Client
> > > > > > > > > can
> > > > > > > > > > > > pretty
> > > > > > > > > > > > > > > much
> > > > > > > > > > > > > > > > use
> > > > > > > > > > > > > > > > > > > any
> > > > > > > > > > > > > > > > > > > >> > > > connection to any broker to send
> > > > > > metrics. We
> > > > > > > > > are
> > > > > > > > > > > not
> > > > > > > > > > > > > > > > associating
> > > > > > > > > > > > > > > > > > > >> > > connection
> > > > > > > > > > > > > > > > > > > >> > > > with client metric state. Is my
> > > > > > > > understanding
> > > > > > > > > > > > correct?
> > > > > > > > > > > > > > If
> > > > > > > > > > > > > > > > yes,
> > > > > > > > > > > > > > > > > > > how
> > > > > > > > > > > > > > > > > > > >> > about
> > > > > > > > > > > > > > > > > > > >> > > > the following two scenarios
> > > > > > > > > > > > > > > > > > > >> > > >
> > > > > > > > > > > > > > > > > > > >> > > > 1) One Client (Client-ID)
> > > registers
> > > > > two
> > > > > > > > > > different
> > > > > > > > > > > > client
> > > > > > > > > > > > > > > > > > instance
> > > > > > > > > > > > > > > > > > > id
> > > > > > > > > > > > > > > > > > > >> > via
> > > > > > > > > > > > > > > > > > > >> > > > separate registration. Is it
> > > > > permitted?
> > > > > > If
> > > > > > > > OK,
> > > > > > > > > > how
> > > > > > > > > > > > to
> > > > > > > > > > > > > > > > > > distinguish
> > > > > > > > > > > > > > > > > > > >> them
> > > > > > > > > > > > > > > > > > > >> > > from
> > > > > > > > > > > > > > > > > > > >> > > > the case 2 below.
> > > > > > > > > > > > > > > > > > > >> > > >
> > > > > > > > > > > > > > > > > > > >> > >
> > > > > > > > > > > > > > > > > > > >> > > Hi Feng,
> > > > > > > > > > > > > > > > > > > >> > >
> > > > > > > > > > > > > > > > > > > > > >> > > My understanding, which Magnus can clarify I guess, is that you
> > > > > > > > > > > > > > > > > > > > > >> > > could have something like two Producer instances running with the
> > > > > > > > > > > > > > > > > > > > > >> > > same client.id (perhaps because they're using the same config
> > > > > > > > > > > > > > > > > > > > > >> > > file, for example). They could even be in the same process. But
> > > > > > > > > > > > > > > > > > > > > >> > > they would get separate UUIDs.
> > > > > > > > > > > > > > > > > > > >> > >
> > > > > > > > > > > > > > > > > > > > > >> > > I believe Magnus used the term client to mean "Producer or
> > > > > > > > > > > > > > > > > > > > > >> > > Consumer". So if you have both a Producer and a Consumer in your
> > > > > > > > > > > > > > > > > > > > > >> > > application I would expect you'd get separate UUIDs for both.
> > > > > > > > > > > > > > > > > > > > > >> > > Again Magnus can chime in here, I guess.
> > > > > > > > > > > > > > > > > > > >> > >
> > > > > > > > > > > > > > > > > > > >> >
> > > > > > > > > > > > > > > > > > > >> > That's correct.
> > > > > > > > > > > > > > > > > > > >> >
> > > > > > > > > > > > > > > > > > > >> >
> > > > > > > > > > > > > > > > > > > >> > >
> > > > > > > > > > > > > > > > > > > > > >> > > > 2) How about the client restarting? What's the expectation?
> > > > > > > > > > > > > > > > > > > > > >> > > > Should the server expect the client to carry a persisted client
> > > > > > > > > > > > > > > > > > > > > >> > > > instance id or should the client be treated as a new instance?
> > > > > > > > > > > > > > > > > > > >> > >
> > > > > > > > > > > > > > > > > > > > > >> > > The KIP doesn't describe any mechanism for persistence, so I
> > > > > > > > > > > > > > > > > > > > > >> > > would assume that when you restart the client you get a new UUID.
> > > > > > > > > > > > > > > > > > > > > >> > > I agree that it would be good to spell this out.
> > > > > > > > > > > > > > > > > > > >> > >
> > > > > > > > > > > > > > > > > > > >> > >
> > > > > > > > > > > > > > > > > > > > > >> > Right, it will not be persisted since a client instance can't be
> > > > > > > > > > > > > > > > > > > > > >> > restarted.
> > > > > > > > > > > > > > > > > > > > > >> >
> > > > > > > > > > > > > > > > > > > > > >> > Will update the KIP to make this clearer.
> > > > > > > > > > > > > > > > > > > >> >
> > > > > > > > > > > > > > > > > > > >> > /Magnus
> > > > > > > > > > > > > > > > > > > >> >
> > > > > > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > > > > > >> --
> > > > > > > > > > > > > > > > > > > >> Gwen Shapira
> > > > > > > > > > > > > > > > > > > >> Engineering Manager | Confluent
> > > > > > > > > > > > > > > > > > > >> 650.450.2760 | @gwenshap
> > > > > > > > > > > > > > > > > > > >> Follow us: Twitter | blog

Re: [DISCUSS] KIP-714: Client metrics and observability

Posted by Jun Rao <ju...@confluent.io.INVALID>.
Hi, Magnus, Sarat and Xavier,

Thanks for the reply. A few more comments below.

20. It seems that we are piggybacking the plugin on the
existing MetricsReporter. So, this seems fine.

21. That could work. Are we requiring any additional jar dependency on the
client? Or, are you suggesting that we check the runtime dependency to pick
the compression codec?
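
A minimal sketch of what such a runtime check could look like, using the
prioritized list from the quoted reply below (zstd, lz4, snappy, gzip). It is
written in Python for brevity. The module names, the pick_codec() helper and
the broker_accepted argument are assumptions for illustration, not anything
defined by the KIP.

```python
import importlib.util

# Preference order from the quoted reply: zstd, lz4, snappy, gzip.
# The module names are assumptions about commonly used Python packages;
# gzip ships with the standard library.
PREFERRED_CODECS = [
    ("zstd", "zstandard"),
    ("lz4", "lz4"),
    ("snappy", "snappy"),
    ("gzip", "gzip"),
]


def is_available(module_name):
    """Return True if the module can be imported in this runtime."""
    return importlib.util.find_spec(module_name) is not None


def pick_codec(broker_accepted, available=is_available):
    """Pick the first preferred codec that the broker accepts and the client can load.

    broker_accepted is a hypothetical set of codec names advertised by the broker.
    """
    for codec, module_name in PREFERRED_CODECS:
        if codec in broker_accepted and available(module_name):
            return codec
    return "none"  # fall back to sending the metrics uncompressed
```

With this shape no extra dependency is strictly required: a client without any
of the optional libraries still ends up with gzip or no compression.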

28. For the broker metrics, could you spell out the full metric name
including groups, tags, etc.? We typically don't add the broker_id label for
broker metrics. Also, brokers use Yammer metrics, which don't have a Sum
type.

29. There are several client metrics listed as histograms. However, the Java
client currently doesn't support the histogram type.

30. Could you show an example of the metric payload in PushTelemetryRequest
to help understand how we organize metrics at different levels (per
instance, per topic, per partition, per broker, etc)?

31. Could you add a bit more detail on which client thread sends the
PushTelemetryRequest?

Thanks,

Jun

On Mon, Mar 7, 2022 at 11:48 AM Magnus Edenhill <ma...@edenhill.se> wrote:

> Hi Jun,
>
> Thanks for your initiated questions; see my answers below.
> There have been a number of clarifications to the KIP.
>
>
>
> On Thu, Jan 27, 2022 at 20:08, Jun Rao <ju...@confluent.io.invalid> wrote:
>
> > Hi, Magnus,
> >
> > Thanks for updating the KIP. The overall approach makes sense to me. A few
> > more detailed comments below.
> >
> > 20. ClientTelemetry: Should it be extending configurable and closable?
> >
>
> I'll pass this question to Sarat and/or Xavier.
>
>
>
> > 21. Compression of the metrics on the client: what's the default?
> >
>
> How about we specify a prioritized list: zstd, lz4, snappy, gzip?
> But ultimately it is up to what the client supports.
>
>
> > 23. A client instance is considered a metric resource and the
> > resource-level (thus client instance level) labels could include:
> >     client_software_name=confluent-kafka-python
> >     client_software_version=v2.1.3
> >     client_instance_id=B64CD139-3975-440A-91D4
> >     transactional_id=someTxnApp
> > Are those labels added in PushTelemetryRequest? If so, are they per metric
> > or per request?
> >
>
>
> client_software* and client_instance_id are not added by the client, but are
> available to the broker-side metrics plugin to add as it sees fit; removing
> them from the KIP.
>
> As for transactional_id, group_id, etc., which I believe will be useful in
> troubleshooting, they are included only once (per push) as resource-level
> attributes (the client instance is a singular resource).
>
>
> >
> > 24.  "the broker will only send
> > GetTelemetrySubscriptionsResponse.DeltaTemporality=True" :
> > 24.1 If it's always true, does it need to be part of the protocol?
> >
>
> We're anticipating that it will take a lot longer to upgrade the majority
> of clients than the
> broker/plugin side, which is why we want the client to support both
> temporalities out-of-the-box
> so that cumulative reporting can be turned on seamlessly in the future.
>
>
>
> > 24.2 Does delta only apply to Counter type?
> >
>
>
> And Histograms. More details in Xavier's OTLP link.
>
>
>
> > 24.3 In the delta representation, the first request needs to send the full
> > value, how does the broker plugin know whether a value is full or delta?
> >
>
> The client may (should) send the start time for each metric sample,
> indicating when
> the metric began to be collected.
> We've discussed whether this should be the client instance start time or
> the time when a matching
> metric subscription for that metric is received.
> For completeness we recommend using the former, the client instance start
> time.
>
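
A minimal sketch of how a broker-side plugin might use the per-sample start
time described above to tell a first (full) value from a later delta. The
Sample fields and the Aggregator class are hypothetical names for
illustration, not part of the KIP or of OTLP. It assumes, as recommended
above, that the start time is the client instance start time, so a changed
start time is treated as a new series.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    name: str
    start_time_ms: int  # when collection of this metric began (assumed: client instance start)
    time_ms: int        # when this sample was taken
    value: float
    is_delta: bool      # True for delta temporality, False for cumulative


class Aggregator:
    """Keeps a running cumulative total per metric name."""

    def __init__(self):
        self._totals = {}  # name -> (start_time_ms, cumulative value)

    def apply(self, sample):
        previous = self._totals.get(sample.name)
        if previous is None or previous[0] != sample.start_time_ms:
            # First sample of a series: the value covers everything since start_time_ms.
            total = sample.value
        elif sample.is_delta:
            # Later delta sample: add the change since the previous push.
            total = previous[1] + sample.value
        else:
            # Cumulative sample: the value already is the running total.
            total = sample.value
        self._totals[sample.name] = (sample.start_time_ms, total)
        return total
```

Keying the series on the start time means the receiver does not need a
separate "this is the first push" flag from the client.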
>
>
> > 25. quota:
> > 25.1 Since we are fitting PushTelemetryRequest into the existing request
> > quota, it would be useful to document the impact, i.e. client metric
> > throttling causes the data from the same client to be delayed.
> > 25.2 Is PushTelemetryRequest subject to the write bandwidth quota like the
> > producer?
> >
>
>
> Yes, it should be, so as to protect the cluster from rogue clients.
> But in practice the size of metrics will be quite low (e.g., 1-10 kB per
> 60s interval), so I don't think this will pose a problem.
> The KIP has been updated with more details on quota/throttling behaviour;
> see the "Throttling and rate-limiting" section.
>
>
> > 25.3 THROTTLING_QUOTA_EXCEEDED: Currently, we don't send this error when
> > the request/bandwidth quota is exceeded since those requests are not
> > rejected. We only set this error when the request is rejected (e.g., topic
> > creation). It would be useful to clarify when this error is used.
> >
>
> Right, I was trying to reuse an existing error code. We can introduce
> a new one for the case where a client pushes metrics at a higher frequency
> than the configured push interval (e.g., out-of-profile sends).
> This causes the broker to drop those metrics and send this error code back
> to the client. There will be no connection throttling / channel-muting in
> this case (unless the standard quotas are exceeded).
>
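
A minimal sketch of the out-of-profile check described above: a push is
dropped when it arrives sooner than the configured push interval after the
last accepted push from the same client instance. The PushRateLimiter class
and its method names are hypothetical, and the sketch leaves out any
tolerance for clock jitter.

```python
class PushRateLimiter:
    """Tracks the last accepted push per client instance."""

    def __init__(self, push_interval_ms):
        self.push_interval_ms = push_interval_ms
        self._last_accepted_ms = {}  # client_instance_id -> time of last accepted push

    def accept(self, client_instance_id, now_ms):
        """Return True to process the push, False to drop it and return an error code."""
        last = self._last_accepted_ms.get(client_instance_id)
        if last is not None and now_ms - last < self.push_interval_ms:
            # Out-of-profile send: drop the metrics, but do not mute the connection.
            return False
        self._last_accepted_ms[client_instance_id] = now_ms
        return True
```

A dropped push does not move the timestamp, so a client that keeps pushing
too fast is still accepted once per interval.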
>
> > 27. kafka-client-metrics.sh: Could we add an example on how to disable a
> > bad client?
> >
>
> There's now a --block option to kafka-client-metrics.sh which overrides all
> subscriptions
> for the matched client(s). This allows silencing metrics for one or more
> clients without having
> to remove existing subscriptions. From the client's perspective it will
> look like it no longer has
> any subscriptions.
>
> # Block metrics collection for a specific client instance.
> # A descriptive --name makes it easier to clean up old subscriptions;
> # --match selects this specific client instance.
> $ kafka-client-metrics.sh --bootstrap-server $BROKERS \
>    --add \
>    --name 'Disable_b69cc35a' \
>    --match client_instance_id=b69cc35a-7a54-4790-aa69-cc2bd4ee4538 \
>    --block
>
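
A minimal sketch of the override semantics described above: if any matching
subscription is a block, the client sees an empty subscription set, while the
other subscriptions stay in place. The dictionary layout and the
effective_metrics() helper are hypothetical and only illustrate the
behaviour; they are not the broker's actual data model.

```python
def effective_metrics(subscriptions, client_labels):
    """Return the set of metric names a client would be asked to push."""
    matched = [
        sub for sub in subscriptions
        if all(client_labels.get(key) == value for key, value in sub["match"].items())
    ]
    # A matching block overrides every other subscription for this client.
    if any(sub.get("block", False) for sub in matched):
        return set()
    metrics = set()
    for sub in matched:
        metrics.update(sub.get("metrics", []))
    return metrics
```

Removing the block entry later restores the previous subscriptions without
having to re-create them.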
>
>
>
> > 28. New broker side metrics: Could we spell out the details of the metrics
> > (e.g., group, tags, etc)?
> >
>
> KIP has been updated accordingly (thanks Sarat).
>
>
>
> >
> > 29. Client instance-level metrics: client.io.wait.time is a gauge not a
> > histogram.
> >
>
> I believe a population/distribution should preferably be represented as a
> histogram, space permitting,
> and only secondarily as a Gauge average.
> While we might not want to maintain a bunch of histograms for each
> partition, since that could be
> quite space consuming, this client.io.wait.time is a single metric per
> client instance and can
> thus afford a Histogram representation.
>
>
>
> Thanks,
> Magnus
>
>
>
> > Thanks,
> >
> > Jun
> >
> > On Wed, Jan 26, 2022 at 6:32 AM Magnus Edenhill <ma...@edenhill.se>
> > wrote:
> >
> > > Hi all,
> > >
> > > I've updated the KIP with responses to the latest comments: Java client
> > > dependencies (Thanks Kirk!), alternate designs (separate cluster,
> > > separate producer, etc), etc.
> > >
> > > I will revive the vote thread.
> > >
> > > Thanks,
> > > Magnus
> > >
> > >
> > > On Mon, Dec 13, 2021 at 22:32, Ryanne Dolan <ryannedolan@gmail.com> wrote:
> > >
> > > > I think we should be very careful about introducing new runtime
> > > > dependencies into the clients. Historically this has been rare and
> > > > essentially necessary (e.g. compression libs).
> > > >
> > > > Ryanne
> > > >
> > > > On Mon, Dec 13, 2021, 1:06 PM Kirk True <ki...@mustardgrain.com> wrote:
> > > >
> > > > > Hi Jun,
> > > > >
> > > > > On Thu, Dec 9, 2021, at 2:28 PM, Jun Rao wrote:
> > > > > > 13. Using OpenTelemetry. Does that require runtime dependency
> > > > > > on OpenTelemetry library? How good is the compatibility story
> > > > > > of OpenTelemetry? This is important since an application could have
> > > > > > other OpenTelemetry dependencies than the Kafka client.
> > > > >
> > > > > The current design is that the OpenTelemetry JARs would ship with the
> > > > > client. Perhaps we can design the client such that the JARs aren't even
> > > > > loaded if the user has opted out. The user could even exclude the JARs
> > > > > from their dependencies if they so wished.
> > > > >
> > > > > I can't speak to the compatibility of the libraries. Is it possible
> > > > > that we include a shaded version?
> > > > >
> > > > > Thanks,
> > > > > Kirk
> > > > >
> > > > > >
> > > > > > > 14. The proposal listed idempotence=true. This is more of a
> > > > > > > configuration than a metric. Are we including that as a metric? What
> > > > > > > other configurations are we including? Should we separate the
> > > > > > > configurations from the metrics?
> > > > > >
> > > > > > Thanks,
> > > > > >
> > > > > > Jun
> > > > > >
> > > > > > > On Mon, Nov 29, 2021 at 7:34 AM Magnus Edenhill <magnus@edenhill.se> wrote:
> > > > > >
> > > > > > > Hey Bob,
> > > > > > >
> > > > > > > That's a good point.
> > > > > > >
> > > > > > > > Request type labels were considered, but since they're already
> > > > > > > > tracked by broker-side metrics they were left out to avoid metric
> > > > > > > > duplication. However, those metrics are not per connection, so they
> > > > > > > > won't be that useful in practice for troubleshooting specific
> > > > > > > > client instances.
> > > > > > >
> > > > > > > I'll add the request_type label to the relevant metrics.
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Magnus
> > > > > > >
> > > > > > >
> > > > > > > > On Tue, Nov 23, 2021 at 19:20, Bob Barrett <bo...@confluent.io.invalid> wrote:
> > > > > > >
> > > > > > > > Hi Magnus,
> > > > > > > >
> > > > > > > > Thanks for the thorough KIP, this seems very useful.
> > > > > > > >
> > > > > > > > > Would it make sense to include the request type as a label for the
> > > > > > > > > `client.request.success`, `client.request.errors` and
> > > > > > > > > `client.request.rtt` metrics? I think it would be very useful to see
> > > > > > > > > which specific requests are succeeding and failing for a client. One
> > > > > > > > > specific case I can think of where this could be useful is producer
> > > > > > > > > batch timeouts. If a Java application does not enable producer client
> > > > > > > > > logs (unfortunately, in my experience this happens more often than it
> > > > > > > > > should), the application logs will only contain the expiration error
> > > > > > > > > message, but no information about what is causing the timeout. The
> > > > > > > > > requests might all be succeeding but taking too long to process
> > > > > > > > > batches, or metadata requests might be failing, or some or all
> > > > > > > > > produce requests might be failing (if the bootstrap servers are
> > > > > > > > > reachable from the client but one or more other brokers are not, for
> > > > > > > > > example). If the cluster operator is able to identify the specific
> > > > > > > > > requests that are slow or failing for a client, they will be better
> > > > > > > > > able to diagnose the issue causing batch timeouts.
> > > > > > > >
> > > > > > > > > One drawback I can think of is that this will increase the
> > > > > > > > > cardinality of the request metrics. But any given client is only
> > > > > > > > > going to use a small subset of the request types, and since we
> > > > > > > > > already have partition labels for the topic-level metrics, I think
> > > > > > > > > request labels will still make up a relatively small percentage of
> > > > > > > > > the set of metrics.
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Bob
> > > > > > > >
> > > > > > > > > On Mon, Nov 22, 2021 at 2:08 AM Viktor Somogyi-Vass <viktorsomogyi@gmail.com> wrote:
> > > > > > > >
> > > > > > > > > Hi Magnus,
> > > > > > > > >
> > > > > > > > > > I think this is a very useful addition. We also have a similar (but
> > > > > > > > > > much more simplistic) implementation of this. Maybe I missed it in
> > > > > > > > > > the KIP but what about adding metrics about the subscription cache
> > > > > > > > > > itself? That I think would improve its usability and debuggability
> > > > > > > > > > as we'd be able to see its performance, hit/miss rates, eviction
> > > > > > > > > > counts and others.
> > > > > > > > >
> > > > > > > > > Best,
> > > > > > > > > Viktor
> > > > > > > > >
> > > > > > > > > > On Thu, Nov 18, 2021 at 5:12 PM Magnus Edenhill <magnus@edenhill.se> wrote:
> > > > > > > > >
> > > > > > > > > > Hi Mickael,
> > > > > > > > > >
> > > > > > > > > > see inline.
> > > > > > > > > >
> > > > > > > > > > > On Wed, Nov 10, 2021 at 15:21, Mickael Maison <mickael.maison@gmail.com> wrote:
> > > > > > > > > >
> > > > > > > > > > > Hi Magnus,
> > > > > > > > > > >
> > > > > > > > > > > > I see you've addressed some of the points I raised above but
> > > > > > > > > > > > some (4, 5) have not been addressed yet.
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Re 4) How will the user/app know metrics are being sent.
> > > > > > > > > >
> > > > > > > > > > > One possibility is to add a JMX metric (thus for user consumption)
> > > > > > > > > > > for the number of metric pushes the client has performed, or
> > > > > > > > > > > perhaps the number of metrics subscriptions currently being
> > > > > > > > > > > collected.
> > > > > > > > > > > Would that be sufficient?
> > > > > > > > > >
> > > > > > > > > > Re 5) Metric sizes and rates
> > > > > > > > > >
> > > > > > > > > > > A worst case scenario for a producer that is producing to 50
> > > > > > > > > > > unique topics and emitting all standard metrics yields a
> > > > > > > > > > > serialized size of around 100 kB prior to compression, which
> > > > > > > > > > > compresses down to about 20-30% of that depending on compression
> > > > > > > > > > > type and topic name uniqueness.
> > > > > > > > > > > The numbers for a consumer would be similar.
> > > > > > > > > > >
> > > > > > > > > > > In practice the number of unique topics would be far less, and the
> > > > > > > > > > > subscription set would typically be for a subset of metrics.
> > > > > > > > > > > So we're probably closer to 1 kB, or less, compressed size per
> > > > > > > > > > > client per push interval.
> > > > > > > > > > >
> > > > > > > > > > > As both the subscription set and push intervals are controlled by
> > > > > > > > > > > the cluster operator it shouldn't be too hard to strike a good
> > > > > > > > > > > balance between metrics overhead and granularity.
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > > I'm really uneasy with this being enabled by default on the
> > > > > > > > > > > > client side. When collecting data, I think the best practice is
> > > > > > > > > > > > to ensure users are explicitly enabling it.
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > Requiring metrics to be explicitly enabled on clients severely
> > > > > > > > > > > cripples its usability and value.
> > > > > > > > > > >
> > > > > > > > > > > One of the problems that this KIP aims to solve is for useful
> > > > > > > > > > > metrics to be available on demand regardless of the technical
> > > > > > > > > > > expertise of the user. As Ryanne points out, a savvy
> > > > > > > > > > > user/organization will typically have metrics collection and
> > > > > > > > > > > monitoring in place already, and the benefits of this KIP are then
> > > > > > > > > > > more of a common set and format of metrics across client
> > > > > > > > > > > implementations and languages.
> > > > > > > > > > > But that is not the typical Kafka user in my experience; they're
> > > > > > > > > > > not Kafka experts and they don't have the knowledge of how to best
> > > > > > > > > > > instrument their clients.
> > > > > > > > > > > Having metrics enabled by default for this user base allows the
> > > > > > > > > > > Kafka operators to proactively and reactively monitor and
> > > > > > > > > > > troubleshoot client issues, without the need for the less savvy
> > > > > > > > > > > user to do anything.
> > > > > > > > > > > It is often too late to tell a user to enable metrics when the
> > > > > > > > > > > problem has already occurred.
> > > > > > > > > > >
> > > > > > > > > > > Now, to be clear, even though metrics are enabled by default on
> > > > > > > > > > > clients, they are not enabled by default on the brokers; the Kafka
> > > > > > > > > > > operator needs to build and set up a metrics plugin and add
> > > > > > > > > > > metrics subscriptions before anything is sent from the client.
> > > > > > > > > > > It is opt-out on the clients and opt-in on the broker.
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > > You mentioned brokers already have some(most?) of the
> > > > > > > > > > > > information contained in metrics, if so then why are we
> > > > > > > > > > > > collecting it again? Surely there must be some new information
> > > > > > > > > > > > in the client metrics.
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > From the user's perspective the Kafka infrastructure extends from
> > > > > > > > > > > producer.send() to messages being returned from consumer.poll(), a
> > > > > > > > > > > giant black box where there's a lot going on between those two
> > > > > > > > > > > points. The brokers currently only see what happens once those
> > > > > > > > > > > requests and messages hit the broker, but as Kafka clients are
> > > > > > > > > > > complex pieces of machinery there's a myriad of queues, timers,
> > > > > > > > > > > and state that's critical to the operation and infrastructure
> > > > > > > > > > > that's not currently visible to the operator.
> > > > > > > > > > > Relying on the user to accurately and timely provide this missing
> > > > > > > > > > > information is not generally feasible.
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > Most of the standard metrics listed in the KIP are data points
> > > > > > > > > > > that the broker does not have.
> > > > > > > > > > > Only a small number of metrics are duplicates (like the request
> > > > > > > > > > > counts and sizes), but they are included to ease correlation when
> > > > > > > > > > > inspecting these client metrics.
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > > Moreover this is a brand new feature so it's even harder to
> > > > > > > > > > > > justify enabling it and forcing onto all our users. If disabled
> > > > > > > > > > > > by default, it's relatively easy to enable in a new release if
> > > > > > > > > > > > we decide to, but once enabled by default it's much harder to
> > > > > > > > > > > > disable. Also this feature will apply to all future metrics we
> > > > > > > > > > > > will add.
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > I think maturity of a feature implementation should be the
> > > > > > > > > > > deciding factor, rather than the design of it (which this KIP is).
> > > > > > > > > > > I.e., if the implementation is not deemed mature enough for
> > > > > > > > > > > release X.Y it will be disabled.
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > > Overall I think it's an interesting feature but I'd prefer to be
> > > > > > > > > > > > slightly defensive and see how it works in practice before
> > > > > > > > > > > > enabling it everywhere.
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > Right, and I agree on being defensive, but since this feature
> > > > > > > > > > > still requires manual enabling on the brokers before actually
> > > > > > > > > > > being used, I think that gives enough control to opt-in or out of
> > > > > > > > > > > this feature as needed.
> > > > > > > > > >
> > > > > > > > > > Thanks for your comments!
> > > > > > > > > >
> > > > > > > > > > Regards,
> > > > > > > > > > Magnus
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > Thanks,
> > > > > > > > > > > Mickael
> > > > > > > > > > >
> > > > > > > > > > > > On Mon, Nov 8, 2021 at 7:55 AM Magnus Edenhill <magnus@edenhill.se> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > Thanks David for pointing this out,
> > > > > > > > > > > > > I've updated the KIP to include client_id as a matching selector.
> > > > > > > > > > > >
> > > > > > > > > > > > Regards,
> > > > > > > > > > > > Magnus
> > > > > > > > > > > >
> > > > > > > > > > > > > On Thu, Nov 4, 2021 at 18:01, David Mao <dmao@confluent.io.invalid> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > Hey Magnus,
> > > > > > > > > > > > >
> > > > > > > > > > > > > > I noticed that the KIP outlines the initial selectors
> > > > > > > > > > > > > > supported as:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >    - client_instance_id - CLIENT_INSTANCE_ID UUID string
> > > > > > > > > > > > > >      representation.
> > > > > > > > > > > > > >    - client_software_name - client software implementation name.
> > > > > > > > > > > > > >    - client_software_version - client software implementation
> > > > > > > > > > > > > >      version.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > In the given reactive monitoring workflow, we mention that the
> > > > > > > > > > > > > > application user does not know their client's client instance
> > > > > > > > > > > > > > ID, but it's outlined that the operator can add a metrics
> > > > > > > > > > > > > > subscription selecting for clientId. I don't see clientId as
> > > > > > > > > > > > > > one of the supported selectors.
> > > > > > > > > > > > > > I can see how this would have made sense in a previous
> > > > > > > > > > > > > > iteration given that the previous client instance ID proposal
> > > > > > > > > > > > > > was to construct the client instance ID using clientId as a
> > > > > > > > > > > > > > prefix. Now that the client instance ID is a UUID, would we
> > > > > > > > > > > > > > want to add clientId as a supported selector?
> > > > > > > > > > > > > > Let me know what you think.
> > > > > > > > > > > > >
> > > > > > > > > > > > > David
> > > > > > > > > > > > >
> > > > > > > > > > > > > > On Tue, Oct 19, 2021 at 12:33 PM Magnus Edenhill <magnus@edenhill.se> wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > Hi Mickael!
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > On Tue, Oct 19, 2021 at 19:30, Mickael Maison <mickael.maison@gmail.com> wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Hi Magnus,
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Thanks for the proposal.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > 1. Looking at the protocol section, isn't "ClientInstanceId"
> > > > > > > > > > > > > > > > expected to be a field in
> > > > > > > > > > > > > > > > GetTelemetrySubscriptionsResponseV0? Otherwise, how does a
> > > > > > > > > > > > > > > > client retrieve this value?
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Good catch, it got removed by mistake in one of the edits.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > 2. In the client API section, you mention a new method
> > > > > > > > > > > > > > > > "clientInstanceId()". Can you clarify which interfaces are
> > > > > > > > > > > > > > > > affected? Is it only Consumer and Producer?
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > And Admin. Will update the KIP.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > 3. I'm a bit concerned this is enabled by default. Even if
> > > > > > > > > > > > > > > > the data collected is supposed to be not sensitive, I think
> > > > > > > > > > > > > > > > this can be problematic in some environments. Also users
> > > > > > > > > > > > > > > > don't seem to have the choice to only expose some metrics.
> > > > > > > > > > > > > > > > Knowing how much data transits through some applications
> > > > > > > > > > > > > > > > can be considered critical.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > The broker already knows how much data transits through the
> > > > > > > > > > > > > > > client though, right?
> > > > > > > > > > > > > > > Care has been taken not to expose information in the standard
> > > > > > > > > > > > > > > metrics that might reveal sensitive information.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Do you have an example of how the proposed metrics could leak
> > > > > > > > > > > > > > > sensitive information?
> > > > > > > > > > > > > > > As for limiting which metrics to export, I guess that could
> > > > > > > > > > > > > > > make sense in some very sensitive use-cases, but those users
> > > > > > > > > > > > > > > might disable metrics altogether for now.
> > > > > > > > > > > > > > > Could these concerns be addressed by a later KIP?
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > 4. As a user, how do you know if your
> application
> > > is
> > > > > > > actively
> > > > > > > > > > > sending
> > > > > > > > > > > > > > > metrics? Are there new metrics exposing what's
> > > going
> > > > > on,
> > > > > > > like
> > > > > > > > > how
> > > > > > > > > > > much
> > > > > > > > > > > > > > > data is being sent?
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > That's a good question.
> > > > > > > > > > > > > > Since the proposed metrics interface is not aimed
> > at,
> > > > or
> > > > > > > > directly
> > > > > > > > > > > > > available
> > > > > > > > > > > > > > to, the application
> > > > > > > > > > > > > > I guess there's little point of adding it here,
> but
> > > > > instead
> > > > > > > > > adding
> > > > > > > > > > > > > > something to the
> > > > > > > > > > > > > > existing JMX metrics?
> > > > > > > > > > > > > > Do you have any suggestions?
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > 5. If all metrics are enabled on a regular
> > Consumer
> > > > or
> > > > > > > > > Producer,
> > > > > > > > > > do
> > > > > > > > > > > > > > > you have an idea how much throughput this would
> > > use?
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > It depends on the number of partition/topics/etc
> > the
> > > > > client
> > > > > > > is
> > > > > > > > > > > producing
> > > > > > > > > > > > > > to/consuming from.
> > > > > > > > > > > > > > I'll add some sizes to the KIP for some typical
> > > > > use-cases.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > > Magnus
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Thanks
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > On Tue, Oct 19, 2021 at 5:06 PM Magnus
> Edenhill <
> > > > > > > > > > > magnus@edenhill.se>
> > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Den tis 19 okt. 2021 kl 13:22 skrev Tom
> > Bentley <
> > > > > > > > > > > tbentley@redhat.com
> > > > > > > > > > > > > >:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Hi Magnus,
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > I reviewed the KIP since you called the
> vote
> > > > > (sorry for
> > > > > > > > not
> > > > > > > > > > > > > reviewing
> > > > > > > > > > > > > > > when
> > > > > > > > > > > > > > > > > you announced your intention to call the
> > > vote). I
> > > > > have
> > > > > > > a
> > > > > > > > > few
> > > > > > > > > > > > > > questions
> > > > > > > > > > > > > > > on
> > > > > > > > > > > > > > > > > some of the details.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > 1. There's no Javadoc on
> > > > > ClientTelemetryPayload.data(),
> > > > > > > > so
> > > > > > > > > I
> > > > > > > > > > > don't
> > > > > > > > > > > > > > know
> > > > > > > > > > > > > > > > > whether the payload is exposed through this
> > > > method
> > > > > as
> > > > > > > > > > > compressed or
> > > > > > > > > > > > > > > not.
> > > > > > > > > > > > > > > > > Later on you say "Decompression of the
> > payloads
> > > > > will be
> > > > > > > > > > > handled by
> > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > broker metrics plugin, the broker should
> > > expose a
> > > > > > > > suitable
> > > > > > > > > > > > > > > decompression
> > > > > > > > > > > > > > > > > API to the metrics plugin for this
> purpose.",
> > > > which
> > > > > > > > > suggests
> > > > > > > > > > > it's
> > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > compressed data in the buffer, but then we
> > > don't
> > > > > know
> > > > > > > > which
> > > > > > > > > > > codec
> > > > > > > > > > > > > was
> > > > > > > > > > > > > > > used,
> > > > > > > > > > > > > > > > > nor the API via which the plugin should
> > > > decompress
> > > > > it
> > > > > > > if
> > > > > > > > > > > required
> > > > > > > > > > > > > for
> > > > > > > > > > > > > > > > > forwarding to the ultimate metrics store.
> > > Should
> > > > > the
> > > > > > > > > > > > > > > ClientTelemetryPayload
> > > > > > > > > > > > > > > > > expose a method to get the compression and
> a
> > > > > > > > decompressor?
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Good point, updated.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > 2. The client-side API is expressed as
> > > > > StringOrError
> > > > > > > > > > > > > > > > > ClientInstance::ClientInstanceId(int
> > > > timeout_ms). I
> > > > > > > > > > understand
> > > > > > > > > > > that
> > > > > > > > > > > > > > > you're
> > > > > > > > > > > > > > > > > thinking about the librdkafka
> implementation,
> > > but
> > > > > it
> > > > > > > > would
> > > > > > > > > be
> > > > > > > > > > > good
> > > > > > > > > > > > > to
> > > > > > > > > > > > > > > show
> > > > > > > > > > > > > > > > > the API as it would appear on the Apache
> > Kafka
> > > > > clients.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > This was meant as pseudo-code, but I changed
> it
> > > to
> > > > > Java.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > 3. "PushTelemetryRequest|Response -
> protocol
> > > > > request
> > > > > > > used
> > > > > > > > > by
> > > > > > > > > > > the
> > > > > > > > > > > > > > > client to
> > > > > > > > > > > > > > > > > send metrics to any broker it is connected
> > to."
> > > > To
> > > > > be
> > > > > > > > > clear,
> > > > > > > > > > > this
> > > > > > > > > > > > > > means
> > > > > > > > > > > > > > > > > that the client can choose any of the
> > connected
> > > > > brokers
> > > > > > > > and
> > > > > > > > > > > push to
> > > > > > > > > > > > > > > just
> > > > > > > > > > > > > > > > > one of them? What should a supporting
> client
> > do
> > > > if
> > > > > it
> > > > > > > > gets
> > > > > > > > > an
> > > > > > > > > > > error
> > > > > > > > > > > > > > > when
> > > > > > > > > > > > > > > > > pushing metrics to a broker, retry sending
> to
> > > the
> > > > > same
> > > > > > > > > broker
> > > > > > > > > > > or
> > > > > > > > > > > > > try
> > > > > > > > > > > > > > > > > pushing to another broker, or drop the
> > metrics?
> > > > > Should
> > > > > > > > > > > supporting
> > > > > > > > > > > > > > > clients
> > > > > > > > > > > > > > > > > send successive requests to a single
> broker,
> > or
> > > > > round
> > > > > > > > > robin,
> > > > > > > > > > > or is
> > > > > > > > > > > > > > > that up
> > > > > > > > > > > > > > > > > to the client author? I'm guessing the
> > > behaviour
> > > > > should
> > > > > > > > be
> > > > > > > > > > > sticky
> > > > > > > > > > > > > to
> > > > > > > > > > > > > > > > > support the rate limiting features, but I
> > think
> > > > it
> > > > > > > would
> > > > > > > > be
> > > > > > > > > > > good
> > > > > > > > > > > > > for
> > > > > > > > > > > > > > > client
> > > > > > > > > > > > > > > > > authors if this section were explicit on
> the
> > > > > > > recommended
> > > > > > > > > > > behaviour.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > You are right, I've updated the KIP to make
> > this
> > > > > clearer.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > 4. "Mapping the client instance id to an
> > actual
> > > > > > > > application
> > > > > > > > > > > > > instance
> > > > > > > > > > > > > > > > > running on a (virtual) machine can be done
> by
> > > > > > > inspecting
> > > > > > > > > the
> > > > > > > > > > > > > metrics
> > > > > > > > > > > > > > > > > resource labels, such as the client source
> > > > address
> > > > > and
> > > > > > > > > source
> > > > > > > > > > > port,
> > > > > > > > > > > > > > or
> > > > > > > > > > > > > > > > > security principal, all of which are added
> by
> > > the
> > > > > > > > receiving
> > > > > > > > > > > broker.
> > > > > > > > > > > > > > > This
> > > > > > > > > > > > > > > > > will allow the operator together with the
> > user
> > > to
> > > > > > > > identify
> > > > > > > > > > the
> > > > > > > > > > > > > actual
> > > > > > > > > > > > > > > > > application instance." Is this really
> always
> > > > true?
> > > > > The
> > > > > > > > > source
> > > > > > > > > > > IP
> > > > > > > > > > > > > and
> > > > > > > > > > > > > > > port
> > > > > > > > > > > > > > > > > might be a loadbalancer/proxy in some
> setups.
> > > The
> > > > > > > > > principal,
> > > > > > > > > > as
> > > > > > > > > > > > > > already
> > > > > > > > > > > > > > > > > mentioned in the KIP, might be shared
> between
> > > > > multiple
> > > > > > > > > > > > > applications.
> > > > > > > > > > > > > > > So at
> > > > > > > > > > > > > > > > > worst the organization running the clients
> > > might
> > > > > have
> > > > > > > to
> > > > > > > > > > > consult
> > > > > > > > > > > > > the
> > > > > > > > > > > > > > > logs
> > > > > > > > > > > > > > > > > of a set of client applications, right?
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Yes, that's correct. There's no guaranteed
> > > mapping
> > > > > from
> > > > > > > > > > > > > > > client_instance_id
> > > > > > > > > > > > > > > > to
> > > > > > > > > > > > > > > > an actual instance, that's why the KIP
> > recommends
> > > > > client
> > > > > > > > > > > > > > implementations
> > > > > > > > > > > > > > > to
> > > > > > > > > > > > > > > > log the client instance id
> > > > > > > > > > > > > > > > upon retrieval, and also provide an API for
> the
> > > > > > > application
> > > > > > > > > to
> > > > > > > > > > > > > retrieve
> > > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > instance id programmatically
> > > > > > > > > > > > > > > > if it has a better way of exposing it.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > 5. "Tests indicate that a compression ratio
> up
> > to
> > > > > 10x is
> > > > > > > > > > > possible for
> > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > standard metrics." Client authors might
> > > > appreciate
> > > > > your
> > > > > > > > > > > mentioning
> > > > > > > > > > > > > > > which
> > > > > > > > > > > > > > > > > compression codec got these results.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Good point. Updated.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > 6. "Should the client send a push request
> > prior
> > > > to
> > > > > > > expiry
> > > > > > > > > of
> > > > > > > > > > > the
> > > > > > > > > > > > > > > previously
> > > > > > > > > > > > > > > > > calculated PushIntervalMs the broker will
> > > discard
> > > > > the
> > > > > > > > > metrics
> > > > > > > > > > > and
> > > > > > > > > > > > > > > return a
> > > > > > > > > > > > > > > > > PushTelemetryResponse with the ErrorCode
> set
> > to
> > > > > > > > > RateLimited."
> > > > > > > > > > > Is
> > > > > > > > > > > > > this
> > > > > > > > > > > > > > > > > RATE_LIMITED a new error code? It's not
> > > mentioned
> > > > > in
> > > > > > > the
> > > > > > > > > "New
> > > > > > > > > > > Error
> > > > > > > > > > > > > > > Codes"
> > > > > > > > > > > > > > > > > section.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > That's a leftover, it should be using the
> > > standard
> > > > > > > > > ThrottleTime
> > > > > > > > > > > > > > > mechanism.
> > > > > > > > > > > > > > > > Fixed.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > 7. In the section "Standard client resource
> > > > labels"
> > > > > > > > > > > application_id
> > > > > > > > > > > > > is
> > > > > > > > > > > > > > > > > described as Kafka Streams only, but the
> > > section
> > > > of
> > > > > > > > "Client
> > > > > > > > > > > > > > > Identification"
> > > > > > > > > > > > > > > > > talks about "application instance id as an
> > > > optional
> > > > > > > > future
> > > > > > > > > > > > > > nice-to-have
> > > > > > > > > > > > > > > > > that may be included as a metrics label if
> it
> > > has
> > > > > been
> > > > > > > > set
> > > > > > > > > by
> > > > > > > > > > > the
> > > > > > > > > > > > > > > user", so
> > > > > > > > > > > > > > > > > I'm confused whether non-Kafka Streams
> > clients
> > > > > should
> > > > > > > set
> > > > > > > > > an
> > > > > > > > > > > > > > > application_id
> > > > > > > > > > > > > > > > > or not.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > I'll clarify this in the KIP, but basically
> we
> > > > would
> > > > > need
> > > > > > > > to
> > > > > > > > > > add
> > > > > > > > > > > an `
> > > > > > > > > > > > > > > > application.id` config
> > > > > > > > > > > > > > > > property for non-streams clients for this
> > > purpose,
> > > > > and
> > > > > > > > that's
> > > > > > > > > > > outside
> > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > scope of this KIP since we want to make it
> > > > > zero-conf:ish
> > > > > > > on
> > > > > > > > > the
> > > > > > > > > > > > > client
> > > > > > > > > > > > > > > side.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Kind regards,
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Tom
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Thanks for the review,
> > > > > > > > > > > > > > > > Magnus
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > On Thu, Oct 7, 2021 at 5:26 PM Magnus
> > Edenhill
> > > <
> > > > > > > > > > > magnus@edenhill.se
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Hi all,
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > I've updated the KIP following our recent
> > > > > discussions
> > > > > > > > on
> > > > > > > > > > the
> > > > > > > > > > > > > > mailing
> > > > > > > > > > > > > > > > > list:
> > > > > > > > > > > > > > > > > >  - split the protocol in two, one for
> > getting
> > > > the
> > > > > > > > metrics
> > > > > > > > > > > > > > > subscriptions,
> > > > > > > > > > > > > > > > > > and one for pushing the metrics.
> > > > > > > > > > > > > > > > > >  - simplifications: initially only one
> > > > supported
> > > > > > > > metrics
> > > > > > > > > > > format,
> > > > > > > > > > > > > no
> > > > > > > > > > > > > > > > > > client.id in the instance id, etc.
> > > > > > > > > > > > > > > > > >  - made CLIENT_METRICS subscription
> > > > configuration
> > > > > > > > entries
> > > > > > > > > > > more
> > > > > > > > > > > > > > > structured
> > > > > > > > > > > > > > > > > >    and allowing better client matching
> > > > selectors
> > > > > (not
> > > > > > > > > only
> > > > > > > > > > > on the
> > > > > > > > > > > > > > > > > instance
> > > > > > > > > > > > > > > > > > id, but also the other
> > > > > > > > > > > > > > > > > >    client resource labels, such as
> > > > > > > > client_software_name,
> > > > > > > > > > > etc.).
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Unless there are further comments I'll
> call
> > > the
> > > > > vote
> > > > > > > > in a
> > > > > > > > > > > day or
> > > > > > > > > > > > > > two.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Regards,
> > > > > > > > > > > > > > > > > > Magnus
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Den mån 4 okt. 2021 kl 20:57 skrev Magnus
> > > > > Edenhill <
> > > > > > > > > > > > > > > magnus@edenhill.se>:
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > Hi Gwen,
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > I'm finishing up the KIP based on the
> > last
> > > > > couple
> > > > > > > of
> > > > > > > > > > > discussion
> > > > > > > > > > > > > > > points
> > > > > > > > > > > > > > > > > in
> > > > > > > > > > > > > > > > > > > this thread
> > > > > > > > > > > > > > > > > > > and will call the Vote later this week.
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > Best,
> > > > > > > > > > > > > > > > > > > Magnus
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > Den lör 2 okt. 2021 kl 02:01 skrev Gwen
> > > > Shapira
> > > > > > > > > > > > > > > > > > <gwen@confluent.io.invalid
> > > > > > > > > > > > > > > > > > > >:
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > >> Hey,
> > > > > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > > > > >> I noticed that there was no discussion
> > for
> > > > the
> > > > > > > last
> > > > > > > > 10
> > > > > > > > > > > days,
> > > > > > > > > > > > > > but I
> > > > > > > > > > > > > > > > > > >> couldn't
> > > > > > > > > > > > > > > > > > >> find the vote thread. Is there one
> that
> > > I'm
> > > > > > > missing?
> > > > > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > > > > >> Gwen
> > > > > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > > > > >> On Wed, Sep 22, 2021 at 4:58 AM Magnus
> > > > > Edenhill <
> > > > > > > > > > > > > > > magnus@edenhill.se>
> > > > > > > > > > > > > > > > > > >> wrote:
> > > > > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > > > > >> > Den tis 21 sep. 2021 kl 06:58 skrev
> > > Colin
> > > > > > > McCabe <
> > > > > > > > > > > > > > > > > cmccabe@apache.org
> > > > > > > > > > > > > > > > > > >:
> > > > > > > > > > > > > > > > > > >> >
> > > > > > > > > > > > > > > > > > >> > > On Mon, Sep 20, 2021, at 17:35,
> Feng
> > > Min
> > > > > > > wrote:
> > > > > > > > > > > > > > > > > > >> > > > Thanks Magnus & Colin for the
> > > > > discussion.
> > > > > > > > > > > > > > > > > > >> > > >
> > > > > > > > > > > > > > > > > > >> > > > Based on KIP-714's stateless
> > design,
> > > > > Client
> > > > > > > > can
> > > > > > > > > > > pretty
> > > > > > > > > > > > > > much
> > > > > > > > > > > > > > > use
> > > > > > > > > > > > > > > > > > any
> > > > > > > > > > > > > > > > > > >> > > > connection to any broker to send
> > > > > metrics. We
> > > > > > > > are
> > > > > > > > > > not
> > > > > > > > > > > > > > > associating
> > > > > > > > > > > > > > > > > > >> > > connection
> > > > > > > > > > > > > > > > > > >> > > > with client metric state. Is my
> > > > > > > understanding
> > > > > > > > > > > correct?
> > > > > > > > > > > > > If
> > > > > > > > > > > > > > > yes,
> > > > > > > > > > > > > > > > > > how
> > > > > > > > > > > > > > > > > > >> > about
> > > > > > > > > > > > > > > > > > >> > > > the following two scenarios
> > > > > > > > > > > > > > > > > > >> > > >
> > > > > > > > > > > > > > > > > > >> > > > 1) One Client (Client-ID)
> > registers
> > > > two
> > > > > > > > > different
> > > > > > > > > > > client
> > > > > > > > > > > > > > > > > instance
> > > > > > > > > > > > > > > > > > id
> > > > > > > > > > > > > > > > > > >> > via
> > > > > > > > > > > > > > > > > > >> > > > separate registration. Is it
> > > > permitted?
> > > > > If
> > > > > > > OK,
> > > > > > > > > how
> > > > > > > > > > > to
> > > > > > > > > > > > > > > > > distinguish
> > > > > > > > > > > > > > > > > > >> them
> > > > > > > > > > > > > > > > > > >> > > from
> > > > > > > > > > > > > > > > > > >> > > > the case 2 below.
> > > > > > > > > > > > > > > > > > >> > > >
> > > > > > > > > > > > > > > > > > >> > >
> > > > > > > > > > > > > > > > > > >> > > Hi Feng,
> > > > > > > > > > > > > > > > > > >> > >
> > > > > > > > > > > > > > > > > > >> > > My understanding, which Magnus can
> > > > > clarify I
> > > > > > > > > guess,
> > > > > > > > > > is
> > > > > > > > > > > > > that
> > > > > > > > > > > > > > > you
> > > > > > > > > > > > > > > > > > could
> > > > > > > > > > > > > > > > > > >> > have
> > > > > > > > > > > > > > > > > > >> > > something like two Producer
> > instances
> > > > > running
> > > > > > > > with
> > > > > > > > > > the
> > > > > > > > > > > > > same
> > > > > > > > > > > > > > > > > > client.id
> > > > > > > > > > > > > > > > > > >> > > (perhaps because they're using the
> > > same
> > > > > config
> > > > > > > > > file,
> > > > > > > > > > > for
> > > > > > > > > > > > > > > example).
> > > > > > > > > > > > > > > > > > >> They
> > > > > > > > > > > > > > > > > > >> > > could even be in the same process.
> > But
> > > > > they
> > > > > > > > would
> > > > > > > > > > get
> > > > > > > > > > > > > > separate
> > > > > > > > > > > > > > > > > > UUIDs.
> > > > > > > > > > > > > > > > > > >> > >
> > > > > > > > > > > > > > > > > > >> > > I believe Magnus used the term
> > client
> > > to
> > > > > mean
> > > > > > > > > > > "Producer or
> > > > > > > > > > > > > > > > > > Consumer".
> > > > > > > > > > > > > > > > > > >> So
> > > > > > > > > > > > > > > > > > >> > > if you have both a Producer and a
> > > > > Consumer in
> > > > > > > > your
> > > > > > > > > > > > > > > application I
> > > > > > > > > > > > > > > > > > would
> > > > > > > > > > > > > > > > > > >> > > expect you'd get separate UUIDs
> for
> > > > both.
> > > > > > > Again
> > > > > > > > > > > Magnus can
> > > > > > > > > > > > > > > chime
> > > > > > > > > > > > > > > > > in
> > > > > > > > > > > > > > > > > > >> > here, I
> > > > > > > > > > > > > > > > > > >> > > guess.
> > > > > > > > > > > > > > > > > > >> > >
> > > > > > > > > > > > > > > > > > >> >
> > > > > > > > > > > > > > > > > > >> > That's correct.
> > > > > > > > > > > > > > > > > > >> >
> > > > > > > > > > > > > > > > > > >> >
> > > > > > > > > > > > > > > > > > >> > >
> > > > > > > > > > > > > > > > > > >> > > > 2) How about the client
> > restarting?
> > > > > What's
> > > > > > > the
> > > > > > > > > > > > > > expectation?
> > > > > > > > > > > > > > > > > Should
> > > > > > > > > > > > > > > > > > >> the
> > > > > > > > > > > > > > > > > > >> > > > server expect the client to
> carry
> > a
> > > > > > > persisted
> > > > > > > > > > client
> > > > > > > > > > > > > > > instance id
> > > > > > > > > > > > > > > > > > or
> > > > > > > > > > > > > > > > > > >> > > should
> > > > > > > > > > > > > > > > > > >> > > > the client be treated as a new
> > > > instance?
> > > > > > > > > > > > > > > > > > >> > >
> > > > > > > > > > > > > > > > > > >> > > The KIP doesn't describe any
> > mechanism
> > > > for
> > > > > > > > > > > persistence,
> > > > > > > > > > > > > so I
> > > > > > > > > > > > > > > would
> > > > > > > > > > > > > > > > > > >> assume
> > > > > > > > > > > > > > > > > > >> > > that when you restart the client
> you
> > > get
> > > > > a new
> > > > > > > > > > UUID. I
> > > > > > > > > > > > > agree
> > > > > > > > > > > > > > > that
> > > > > > > > > > > > > > > > > it
> > > > > > > > > > > > > > > > > > >> > would
> > > > > > > > > > > > > > > > > > >> > > be good to spell this out.
> > > > > > > > > > > > > > > > > > >> > >
> > > > > > > > > > > > > > > > > > >> > >
> > > > > > > > > > > > > > > > > > >> > Right, it will not be persisted
> since
> > a
> > > > > client
> > > > > > > > > > instance
> > > > > > > > > > > > > can't
> > > > > > > > > > > > > > be
> > > > > > > > > > > > > > > > > > >> restarted.
> > > > > > > > > > > > > > > > > > >> >
> > > > > > > > > > > > > > > > > > >> > Will update the KIP to make this
> > > clearer.
> > > > > > > > > > > > > > > > > > >> >
> > > > > > > > > > > > > > > > > > >> > /Magnus
> > > > > > > > > > > > > > > > > > >> >
> > > > > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > > > > >> --
> > > > > > > > > > > > > > > > > > >> Gwen Shapira
> > > > > > > > > > > > > > > > > > >> Engineering Manager | Confluent
> > > > > > > > > > > > > > > > > > >> 650.450.2760 | @gwenshap
> > > > > > > > > > > > > > > > > > >> Follow us: Twitter | blog
> > > > > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] KIP-714: Client metrics and observability

Posted by Magnus Edenhill <ma...@edenhill.se>.
Hi Jun,

Thanks for your well-informed questions; see my answers below.
There have been a number of clarifications to the KIP.



Den tors 27 jan. 2022 kl 20:08 skrev Jun Rao <ju...@confluent.io.invalid>:

> Hi, Magnus,
>
> Thanks for updating the KIP. The overall approach makes sense to me. A few
> more detailed comments below.
>
> 20. ClientTelemetry: Should it be extending configurable and closable?
>

I'll pass this question to Sarat and/or Xavier.



> 21. Compression of the metrics on the client: what's the default?
>

How about we specify a prioritized list: zstd, lz4, snappy, gzip?
But ultimately it is up to what the client supports.
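
A minimal sketch of how a client might apply such a prioritized list. The
function and constant names are hypothetical and not taken from the KIP; the
only thing borrowed from the reply above is the ordering zstd, lz4, snappy,
gzip and the idea that the client's own codec support has the final say.

```python
# Hypothetical sketch: pick the first codec in the preferred order that this
# client was built with. Names are illustrative, not part of the KIP.
PREFERRED_CODECS = ["zstd", "lz4", "snappy", "gzip"]


def pick_codec(client_supported, preferred=PREFERRED_CODECS):
    """Return the first preferred codec the client supports, or None
    (meaning: send uncompressed) when there is no overlap."""
    for codec in preferred:
        if codec in client_supported:
            return codec
    return None


# A client built without zstd support falls back to lz4.
print(pick_codec({"gzip", "lz4"}))  # prints: lz4
```

Returning None rather than raising keeps metrics flowing, uncompressed, when a
client supports none of the listed codecs; whether that is the desired
behaviour is an assumption of this sketch.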


> 23. A client instance is considered a metric resource and the
> resource-level (thus client instance level) labels could include:
>     client_software_name=confluent-kafka-python
>     client_software_version=v2.1.3
>     client_instance_id=B64CD139-3975-440A-91D4
>     transactional_id=someTxnApp
> Are those labels added in PushTelemetryRequest? If so, are they per metric
> or per request?
>


client_software* and client_instance_id are not added by the client, but are
available to the broker-side metrics plugin to add as it sees fit. I've
removed them from the KIP.

Labels such as transactional_id, group_id, etc., which I believe will be
useful in troubleshooting, are included only once (per push) as
resource-level attributes (the client instance is a singular resource).


>
> 24.  "the broker will only send
> GetTelemetrySubscriptionsResponse.DeltaTemporality=True" :
> 24.1 If it's always true, does it need to be part of the protocol?
>

We're anticipating that it will take a lot longer to upgrade the majority
of clients than the
broker/plugin side, which is why we want the client to support both
temporalities out-of-the-box
so that cumulative reporting can be turned on seamlessly in the future.



> 24.2 Does delta only apply to Counter type?
>


And Histograms. More details in Xavier's OTLP link.



> 24.3 In the delta representation, the first request needs to send the full
> value, how does the broker plugin know whether a value is full or delta?
>

The client may (should) send the start time for each metric sample,
indicating when
the metric began to be collected.
We've discussed whether this should be the client instance start time or
the time when a matching
metric subscription for that metric is received.
For completeness we recommend using the former, the client instance start
time.



> 25. quota:
> 25.1 Since we are fitting PushTelemetryRequest into the existing request
> quota, it would be useful to document the impact, i.e. client metric
> throttling causes the data from the same client to be delayed.
> 25.2 Is PushTelemetryRequest subject to the write bandwidth quota like the
> producer?
>


Yes, it should be, so as to protect the cluster from rogue clients.
In practice, though, the metrics payload will be quite small (e.g., 1-10 kB
per 60 s interval), so I don't think this will pose a problem.
The KIP has been updated with more details on quota/throttling behaviour,
see the
"Throttling and rate-limiting" section.


> 25.3 THROTTLING_QUOTA_EXCEEDED: Currently, we don't send this error when
> the request/bandwidth quota is exceeded since those requests are not
> rejected. We only set this error when the request is rejected (e.g., topic
> creation). It would be useful to clarify when this error is used.
>

Right, I was trying to reuse an existing error code. We can introduce
a new one for the case where a client pushes metrics at a higher frequency
than the configured push interval (i.e., out-of-profile sends).
This causes the broker to drop those metrics and send this error code back
to the client. There will be no connection throttling / channel-muting in
this case (unless the standard quotas are exceeded).
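
The out-of-profile check described above could look roughly like the sketch
below. All names are hypothetical, the error code itself is left unnamed
since it has not been settled in this thread, and per-client state is kept
in a plain dict purely for illustration.

```python
class PushIntervalGate:
    """Hypothetical broker-side check for out-of-profile pushes."""

    def __init__(self, push_interval_ms):
        self.push_interval_ms = push_interval_ms
        self._last_accepted_ms = {}  # client_instance_id -> last accepted push

    def accept(self, client_instance_id, now_ms):
        """Return True to keep the metrics, False to drop them and reply
        with the (not yet named) error code. No channel muting happens here."""
        last = self._last_accepted_ms.get(client_instance_id)
        if last is not None and now_ms - last < self.push_interval_ms:
            return False
        self._last_accepted_ms[client_instance_id] = now_ms
        return True


gate = PushIntervalGate(push_interval_ms=60_000)
results = [
    gate.accept("client-a", 0),       # first push: accepted
    gate.accept("client-a", 10_000),  # too early: dropped
    gate.accept("client-a", 60_000),  # interval elapsed: accepted
]
```

Note that a dropped push does not reset the timer in this sketch, so a client
that retries too eagerly is not penalised beyond losing the early samples.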


> 27. kafka-client-metrics.sh: Could we add an example on how to disable a
> bad client?
>

There's now a --block option to kafka-client-metrics.sh which overrides all
subscriptions
for the matched client(s). This allows silencing metrics for one or more
clients without having
to remove existing subscriptions. From the client's perspective it will
look like it no longer has
any subscriptions.

# Block metrics collection for a specific client instance.
# A descriptive name makes it easier to clean up old subscriptions.
# The --match selector targets this specific client instance.
$ kafka-client-metrics.sh --bootstrap-server $BROKERS \
   --add \
   --name 'Disable_b69cc35a' \
   --match client_instance_id=b69cc35a-7a54-4790-aa69-cc2bd4ee4538 \
   --block
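
As a rough model of the override semantics, the sketch below resolves the
metrics a client should push from a list of subscriptions. The data layout
and function name are hypothetical, and selectors are matched by exact
equality as a simplification; the only behaviour taken from the description
above is that a matching blocking entry makes the client look as if it has no
subscriptions, without deleting the others.

```python
def effective_metrics(client_labels, subscriptions):
    """Return the set of metric names a client should push.

    subscriptions: list of dicts with keys
      'match'   -> dict of label name to required value (exact match here)
      'metrics' -> list of metric names
      'block'   -> bool, True for an entry created with --block
    """
    matched = [
        s for s in subscriptions
        if all(client_labels.get(k) == v for k, v in s["match"].items())
    ]
    # A single matching block entry overrides every other subscription.
    if any(s.get("block") for s in matched):
        return set()
    result = set()
    for s in matched:
        result.update(s["metrics"])
    return result


subs = [
    {"match": {"client_software_name": "librdkafka"},
     "metrics": ["client.request.rtt"], "block": False},
    {"match": {"client_instance_id": "b69cc35a"},
     "metrics": [], "block": True},
]
blocked = effective_metrics(
    {"client_software_name": "librdkafka", "client_instance_id": "b69cc35a"}, subs)
normal = effective_metrics(
    {"client_software_name": "librdkafka", "client_instance_id": "other"}, subs)
```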




> 28. New broker side metrics: Could we spell out the details of the metrics
> (e.g., group, tags, etc)?
>

The KIP has been updated accordingly (thanks Sarat).



>
> 29. Client instance-level metrics: client.io.wait.time is a gauge not a
> histogram.
>

I believe a population/distribution should preferably be represented as a
histogram, space permitting,
and only secondarily as a Gauge average.
While we might not want to maintain a bunch of histograms for each
partition, since that could be
quite space consuming, this client.io.wait.time is a single metric per
client instance and can
thus afford a Histogram representation.
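
A small sketch of why a histogram can be preferable to a gauge average for a
distribution such as I/O wait time. The bucket bounds and sample values are
made up for illustration and are not from the KIP.

```python
from bisect import bisect_left

# Hypothetical bucket upper bounds, in milliseconds.
BOUNDS_MS = [1, 10, 100, 1000]


def histogram(samples_ms, bounds=BOUNDS_MS):
    """Count samples per bucket; the extra last bucket holds values
    greater than bounds[-1]."""
    counts = [0] * (len(bounds) + 1)
    for s in samples_ms:
        counts[bisect_left(bounds, s)] += 1
    return counts


steady = [50, 50, 50, 50]
spiky = [1, 1, 1, 197]

# Both workloads have the same average wait time ...
print(sum(steady) / len(steady), sum(spiky) / len(spiky))  # prints: 50.0 50.0
# ... but only the histogram shows that one of them has an outlier.
print(histogram(steady))  # prints: [0, 0, 4, 0, 0]
print(histogram(spiky))   # prints: [3, 0, 0, 1, 0]
```

The cost is one counter per bucket instead of a single number, which is why
this is affordable for a per-instance metric but may not be for per-partition
metrics.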



Thanks,
Magnus



> Thanks,
>
> Jun
>
> On Wed, Jan 26, 2022 at 6:32 AM Magnus Edenhill <ma...@edenhill.se>
> wrote:
>
> > Hi all,
> >
> > I've updated the KIP with responses to the latest comments: Java client
> > dependencies (Thanks Kirk!), alternate designs (separate cluster,
> separate
> > producer, etc), etc.
> >
> > I will revive the vote thread.
> >
> > Thanks,
> > Magnus
> >
> >
> > Den mån 13 dec. 2021 kl 22:32 skrev Ryanne Dolan <ryannedolan@gmail.com
> >:
> >
> > > I think we should be very careful about introducing new runtime
> > > dependencies into the clients. Historically this has been rare and
> > > essentially necessary (e.g. compression libs).
> > >
> > > Ryanne
> > >
> > > On Mon, Dec 13, 2021, 1:06 PM Kirk True <ki...@mustardgrain.com> wrote:
> > >
> > > > Hi Jun,
> > > >
> > > > On Thu, Dec 9, 2021, at 2:28 PM, Jun Rao wrote:
> > > > > 13. Using OpenTelemetry. Does that require runtime dependency
> > > > > on OpenTelemetry library? How good is the compatibility story
> > > > > of OpenTelemetry? This is important since an application could have
> > > other
> > > > > OpenTelemetry dependencies than the Kafka client.
> > > >
> > > > > The current design is that the OpenTelemetry JARs would ship with the
> > > > > client. Perhaps we can design the client such that the JARs aren't even
> > > > > loaded if the user has opted out. The user could even exclude the JARs
> > > > > from their dependencies if they so wished.
> > > > >
> > > > > I can't speak to the compatibility of the libraries. Is it possible that
> > > > > we include a shaded version?
> > > >
> > > > Thanks,
> > > > Kirk
> > > >
> > > > >
> > > > > > 14. The proposal listed idempotence=true. This is more of a
> > > > > > configuration than a metric. Are we including that as a metric? What
> > > > > > other configurations are we including? Should we separate the
> > > > > > configurations from the metrics?
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Jun
> > > > >
> > > > > > On Mon, Nov 29, 2021 at 7:34 AM Magnus Edenhill <magnus@edenhill.se>
> > > > > > wrote:
> > > > >
> > > > > > Hey Bob,
> > > > > >
> > > > > > That's a good point.
> > > > > >
> > > > > > > Request type labels were considered, but since they're already tracked
> > > > > > > by broker-side metrics they were left out to avoid metric duplication.
> > > > > > > However, those broker-side metrics are not per connection, so they
> > > > > > > won't be that useful in practice for troubleshooting specific client
> > > > > > > instances.
> > > > > >
> > > > > > I'll add the request_type label to the relevant metrics.
> > > > > >
> > > > > > Thanks,
> > > > > > Magnus
> > > > > >
> > > > > >
> > > > > > > On Tue, 23 Nov 2021 at 19:20, Bob Barrett
> > > > > > > <bo...@confluent.io.invalid> wrote:
> > > > > >
> > > > > > > Hi Magnus,
> > > > > > >
> > > > > > > Thanks for the thorough KIP, this seems very useful.
> > > > > > >
> > > > > > > > Would it make sense to include the request type as a label for the
> > > > > > > > `client.request.success`, `client.request.errors` and
> > > > > > > > `client.request.rtt` metrics? I think it would be very useful to see
> > > > > > > > which specific requests are succeeding and failing for a client. One
> > > > > > > > specific case I can think of where this could be useful is producer
> > > > > > > > batch timeouts. If a Java application does not enable producer client
> > > > > > > > logs (unfortunately, in my experience this happens more often than it
> > > > > > > > should), the application logs will only contain the expiration error
> > > > > > > > message, but no information about what is causing the timeout. The
> > > > > > > > requests might all be succeeding but taking too long to process
> > > > > > > > batches, or metadata requests might be failing, or some or all
> > > > > > > > produce requests might be failing (if the bootstrap servers are
> > > > > > > > reachable from the client but one or more other brokers are not, for
> > > > > > > > example). If the cluster operator is able to identify the specific
> > > > > > > > requests that are slow or failing for a client, they will be better
> > > > > > > > able to diagnose the issue causing batch timeouts.
> > > > > > > >
> > > > > > > > One drawback I can think of is that this will increase the
> > > > > > > > cardinality of the request metrics. But any given client is only
> > > > > > > > going to use a small subset of the request types, and since we
> > > > > > > > already have partition labels for the topic-level metrics, I think
> > > > > > > > request labels will still make up a relatively small percentage of
> > > > > > > > the set of metrics.
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Bob
> > > > > > >
> > > > > > > On Mon, Nov 22, 2021 at 2:08 AM Viktor Somogyi-Vass <
> > > > > > > viktorsomogyi@gmail.com>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Hi Magnus,
> > > > > > > >
> > > > > > > > > I think this is a very useful addition. We also have a similar (but
> > > > > > > > > much more simplistic) implementation of this. Maybe I missed it in
> > > > > > > > > the KIP, but what about adding metrics about the subscription cache
> > > > > > > > > itself? I think that would improve its usability and debuggability,
> > > > > > > > > as we'd be able to see its performance, hit/miss rates, eviction
> > > > > > > > > counts and so on.
> > > > > > > >
> > > > > > > > Best,
> > > > > > > > Viktor
> > > > > > > >
> > > > > > > > > On Thu, Nov 18, 2021 at 5:12 PM Magnus Edenhill <magnus@edenhill.se>
> > > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Hi Mickael,
> > > > > > > > >
> > > > > > > > > see inline.
> > > > > > > > >
> > > > > > > > > > On Wed, 10 Nov 2021 at 15:21, Mickael Maison
> > > > > > > > > > <mickael.maison@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > > > Hi Magnus,
> > > > > > > > > >
> > > > > > > > > > > I see you've addressed some of the points I raised above but
> > > > > > > > > > > some (4, 5) have not been addressed yet.
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > > Re 4) How will the user/app know metrics are being sent?
> > > > > > > > > >
> > > > > > > > > > One possibility is to add a JMX metric (thus for user consumption)
> > > > > > > > > > for the number of metric pushes the client has performed, or
> > > > > > > > > > perhaps the number of metrics subscriptions currently being
> > > > > > > > > > collected.
> > > > > > > > > > Would that be sufficient?
> > > > > > > > >
> > > > > > > > > Re 5) Metric sizes and rates
> > > > > > > > >
> > > > > > > > > > A worst-case scenario for a producer that is producing to 50
> > > > > > > > > > unique topics and emitting all standard metrics yields a
> > > > > > > > > > serialized size of around 100 KB prior to compression, which
> > > > > > > > > > compresses down to about 20-30% of that depending on compression
> > > > > > > > > > type and topic name uniqueness.
> > > > > > > > > > The numbers for a consumer would be similar.
> > > > > > > > > >
> > > > > > > > > > In practice the number of unique topics would be far lower, and
> > > > > > > > > > the subscription set would typically be for a subset of metrics.
> > > > > > > > > > So we're probably closer to 1 KB, or less, compressed size per
> > > > > > > > > > client per push interval.
> > > > > > > > > >
> > > > > > > > > > As both the subscription set and push intervals are controlled by
> > > > > > > > > > the cluster operator, it shouldn't be too hard to strike a good
> > > > > > > > > > balance between metrics overhead and granularity.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > I'm really uneasy with this being enabled by default on the
> > > > > > > > > > > client side. When collecting data, I think the best practice is
> > > > > > > > > > > to ensure users are explicitly enabling it.
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > > Requiring metrics to be explicitly enabled on clients severely
> > > > > > > > > > cripples the feature's usability and value.
> > > > > > > > > >
> > > > > > > > > > One of the problems that this KIP aims to solve is for useful
> > > > > > > > > > metrics to be available on demand regardless of the technical
> > > > > > > > > > expertise of the user. As Ryanne points out, a savvy
> > > > > > > > > > user/organization will typically have metrics collection and
> > > > > > > > > > monitoring in place already, and the benefits of this KIP are
> > > > > > > > > > then more of a common set and format of metrics across client
> > > > > > > > > > implementations and languages.
> > > > > > > > > > But that is not the typical Kafka user in my experience: they're
> > > > > > > > > > not Kafka experts and they don't have the knowledge of how to
> > > > > > > > > > best instrument their clients.
> > > > > > > > > > Having metrics enabled by default for this user base allows the
> > > > > > > > > > Kafka operators to proactively and reactively monitor and
> > > > > > > > > > troubleshoot client issues, without the need for the less savvy
> > > > > > > > > > user to do anything.
> > > > > > > > > > It is often too late to tell a user to enable metrics when the
> > > > > > > > > > problem has already occurred.
> > > > > > > > >
> > > > > > > > > > Now, to be clear, even though metrics are enabled by default on
> > > > > > > > > > clients, they are not enabled by default on the brokers; the
> > > > > > > > > > Kafka operator needs to build and set up a metrics plugin and add
> > > > > > > > > > metrics subscriptions before anything is sent from the client.
> > > > > > > > > > It is opt-out on the clients and opt-in on the broker.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > > > You mentioned brokers already have some (most?) of the
> > > > > > > > > > > information contained in metrics, if so then why are we
> > > > > > > > > > > collecting it again? Surely there must be some new information
> > > > > > > > > > > in the client metrics.
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > > From the user's perspective the Kafka infrastructure extends from
> > > > > > > > > > producer.send() to messages being returned from consumer.poll(),
> > > > > > > > > > a giant black box where there's a lot going on between those two
> > > > > > > > > > points. The brokers currently only see what happens once those
> > > > > > > > > > requests and messages hit the broker, but as Kafka clients are
> > > > > > > > > > complex pieces of machinery there's a myriad of queues, timers,
> > > > > > > > > > and state that's critical to the operation and infrastructure
> > > > > > > > > > that's not currently visible to the operator.
> > > > > > > > > > Relying on the user to provide this missing information
> > > > > > > > > > accurately and in a timely manner is not generally feasible.
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Most of the standard metrics listed in the KIP are data points
> > > > > > > > > > that the broker does not have.
> > > > > > > > > > Only a small number of metrics are duplicates (like the request
> > > > > > > > > > counts and sizes), but they are included to ease correlation when
> > > > > > > > > > inspecting these client metrics.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > > > Moreover this is a brand new feature so it's even harder to
> > > > > > > > > > > justify enabling it and forcing it onto all our users. If
> > > > > > > > > > > disabled by default, it's relatively easy to enable in a new
> > > > > > > > > > > release if we decide to, but once enabled by default it's much
> > > > > > > > > > > harder to disable. Also this feature will apply to all future
> > > > > > > > > > > metrics we will add.
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > > I think maturity of a feature implementation should be the
> > > > > > > > > > deciding factor, rather than the design of it (which this KIP
> > > > > > > > > > is). I.e., if the implementation is not deemed mature enough for
> > > > > > > > > > release X.Y it will be disabled.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > > > Overall I think it's an interesting feature but I'd prefer to
> > > > > > > > > > > be slightly defensive and see how it works in practice before
> > > > > > > > > > > enabling it everywhere.
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > > Right, and I agree on being defensive, but since this feature
> > > > > > > > > > still requires manual enabling on the brokers before actually
> > > > > > > > > > being used, I think that gives enough control to opt-in or out of
> > > > > > > > > > this feature as needed.
> > > > > > > > >
> > > > > > > > > Thanks for your comments!
> > > > > > > > >
> > > > > > > > > Regards,
> > > > > > > > > Magnus
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > > Thanks,
> > > > > > > > > > Mickael
> > > > > > > > > >
> > > > > > > > > > > On Mon, Nov 8, 2021 at 7:55 AM Magnus Edenhill
> > > > > > > > > > > <magnus@edenhill.se> wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Thanks David for pointing this out,
> > > > > > > > > > > > I've updated the KIP to include client_id as a matching
> > > > > > > > > > > > selector.
> > > > > > > > > > >
> > > > > > > > > > > Regards,
> > > > > > > > > > > Magnus
> > > > > > > > > > >
> > > > > > > > > > > > On Thu, 4 Nov 2021 at 18:01, David Mao
> > > > > > > > > > > > <dmao@confluent.io.invalid> wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Hey Magnus,
> > > > > > > > > > > >
> > > > > > > > > > > > > I noticed that the KIP outlines the initial selectors
> > > > > > > > > > > > > supported as:
> > > > > > > > > > > > >
> > > > > > > > > > > > >    - client_instance_id - CLIENT_INSTANCE_ID UUID string
> > > > > > > > > > > > >      representation.
> > > > > > > > > > > > >    - client_software_name - client software implementation
> > > > > > > > > > > > >      name.
> > > > > > > > > > > > >    - client_software_version - client software
> > > > > > > > > > > > >      implementation version.
> > > > > > > > > > > > >
> > > > > > > > > > > > > In the given reactive monitoring workflow, we mention that
> > > > > > > > > > > > > the application user does not know their client's client
> > > > > > > > > > > > > instance ID, but it's outlined that the operator can add a
> > > > > > > > > > > > > metrics subscription selecting for clientId. I don't see
> > > > > > > > > > > > > clientId as one of the supported selectors.
> > > > > > > > > > > > > I can see how this would have made sense in a previous
> > > > > > > > > > > > > iteration given that the previous client instance ID
> > > > > > > > > > > > > proposal was to construct the client instance ID using
> > > > > > > > > > > > > clientId as a prefix. Now that the client instance ID is a
> > > > > > > > > > > > > UUID, would we want to add clientId as a supported
> > > > > > > > > > > > > selector?
> > > > > > > > > > > > > Let me know what you think.
> > > > > > > > > > > >
> > > > > > > > > > > > David
> > > > > > > > > > > >
> > > > > > > > > > > > > On Tue, Oct 19, 2021 at 12:33 PM Magnus Edenhill
> > > > > > > > > > > > > <magnus@edenhill.se> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > Hi Mickael!
> > > > > > > > > > > > >
> > > > > > > > > > > > > > On Tue, 19 Oct 2021 at 19:30, Mickael Maison
> > > > > > > > > > > > > > <mickael.maison@gmail.com> wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > Hi Magnus,
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Thanks for the proposal.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > 1. Looking at the protocol section, isn't
> > > > > > > > > > > > > > > "ClientInstanceId" expected to be a field in
> > > > > > > > > > > > > > > GetTelemetrySubscriptionsResponseV0? Otherwise, how
> > > > > > > > > > > > > > > does a client retrieve this value?
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > > Good catch, it got removed by mistake in one of the
> > > > > > > > > > > > > > edits.
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > 2. In the client API section, you mention a new method
> > > > > > > > > > > > > > > "clientInstanceId()". Can you clarify which interfaces
> > > > > > > > > > > > > > > are affected? Is it only Consumer and Producer?
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > And Admin. Will update the KIP.
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > > > 3. I'm a bit concerned this is enabled by default. Even
> > > > > > > > > > > > > > > if the data collected is supposed to be not sensitive,
> > > > > > > > > > > > > > > I think this can be problematic in some environments.
> > > > > > > > > > > > > > > Also users don't seem to have the choice to only expose
> > > > > > > > > > > > > > > some metrics. Knowing how much data transits through
> > > > > > > > > > > > > > > some applications can be considered critical.
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > > The broker already knows how much data transits through
> > > > > > > > > > > > > > the client though, right?
> > > > > > > > > > > > > > Care has been taken not to expose information in the
> > > > > > > > > > > > > > standard metrics that might reveal sensitive information.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Do you have an example of how the proposed metrics could
> > > > > > > > > > > > > > leak sensitive information?
> > > > > > > > > > > > > > As for limiting which metrics to export, I guess that
> > > > > > > > > > > > > > could make sense in some very sensitive use-cases, but
> > > > > > > > > > > > > > those users might disable metrics altogether for now.
> > > > > > > > > > > > > > Could these concerns be addressed by a later KIP?
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > 4. As a user, how do you know if your application is
> > > > > > > > > > > > > > > actively sending metrics? Are there new metrics
> > > > > > > > > > > > > > > exposing what's going on, like how much data is being
> > > > > > > > > > > > > > > sent?
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > > That's a good question.
> > > > > > > > > > > > > > Since the proposed metrics interface is not aimed at, or
> > > > > > > > > > > > > > directly available to, the application, I guess there's
> > > > > > > > > > > > > > little point in adding it here; perhaps we could instead
> > > > > > > > > > > > > > add something to the existing JMX metrics?
> > > > > > > > > > > > > > Do you have any suggestions?
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > > > 5. If all metrics are enabled on a regular Consumer or
> > > > > > > > > > > > > > > Producer, do you have an idea how much throughput this
> > > > > > > > > > > > > > > would use?
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > > It depends on the number of partitions/topics/etc. the
> > > > > > > > > > > > > > client is producing to/consuming from.
> > > > > > > > > > > > > > I'll add some sizes to the KIP for some typical
> > > > > > > > > > > > > > use-cases.
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > Magnus
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > > Thanks
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > On Tue, Oct 19, 2021 at 5:06 PM Magnus Edenhill
> > > > > > > > > > > > > > > <magnus@edenhill.se> wrote:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > On Tue, 19 Oct 2021 at 13:22, Tom Bentley
> > > > > > > > > > > > > > > > <tbentley@redhat.com> wrote:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Hi Magnus,
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > I reviewed the KIP since you called the vote (sorry
> > > > > > > > > > > > > > > > > for not reviewing when you announced your intention
> > > > > > > > > > > > > > > > > to call the vote). I have a few questions on some
> > > > > > > > > > > > > > > > > of the details.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > 1. There's no Javadoc on
> > > > > > > > > > > > > > > > > ClientTelemetryPayload.data(), so I don't know
> > > > > > > > > > > > > > > > > whether the payload is exposed through this method
> > > > > > > > > > > > > > > > > as compressed or not. Later on you say
> > > > > > > > > > > > > > > > > "Decompression of the payloads will be handled by
> > > > > > > > > > > > > > > > > the broker metrics plugin, the broker should expose
> > > > > > > > > > > > > > > > > a suitable decompression API to the metrics plugin
> > > > > > > > > > > > > > > > > for this purpose.", which suggests it's the
> > > > > > > > > > > > > > > > > compressed data in the buffer, but then we don't
> > > > > > > > > > > > > > > > > know which codec was used, nor the API via which
> > > > > > > > > > > > > > > > > the plugin should decompress it if required for
> > > > > > > > > > > > > > > > > forwarding to the ultimate metrics store. Should
> > > > > > > > > > > > > > > > > the ClientTelemetryPayload expose a method to get
> > > > > > > > > > > > > > > > > the compression and a decompressor?
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Good point, updated.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > 2. The client-side API is expressed as StringOrError
> > > > > > > > > > > > > > > > > ClientInstance::ClientInstanceId(int timeout_ms). I
> > > > > > > > > > > > > > > > > understand that you're thinking about the
> > > > > > > > > > > > > > > > > librdkafka implementation, but it would be good to
> > > > > > > > > > > > > > > > > show the API as it would appear on the Apache Kafka
> > > > > > > > > > > > > > > > > clients.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > This was meant as pseudo-code, but I changed it to
> > > > > > > > > > > > > > > > Java.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > 3. "PushTelemetryRequest|Response - protocol
> > > > request
> > > > > > used
> > > > > > > > by
> > > > > > > > > > the
> > > > > > > > > > > > > > client to
> > > > > > > > > > > > > > > > send metrics to any broker it is connected
> to."
> > > To
> > > > be
> > > > > > > > clear,
> > > > > > > > > > this
> > > > > > > > > > > > > means
> > > > > > > > > > > > > > > > that the client can choose any of the
> connected
> > > > brokers
> > > > > > > and
> > > > > > > > > > push to
> > > > > > > > > > > > > > just
> > > > > > > > > > > > > > > > one of them? What should a supporting client
> do
> > > if
> > > > it
> > > > > > > gets
> > > > > > > > an
> > > > > > > > > > error
> > > > > > > > > > > > > > when
> > > > > > > > > > > > > > > > pushing metrics to a broker, retry sending to
> > the
> > > > same
> > > > > > > > broker
> > > > > > > > > > or
> > > > > > > > > > > > try
> > > > > > > > > > > > > > > > pushing to another broker, or drop the
> metrics?
> > > > Should
> > > > > > > > > > supporting
> > > > > > > > > > > > > > clients
> > > > > > > > > > > > > > > > send successive requests to a single broker,
> or
> > > > round
> > > > > > > > robin,
> > > > > > > > > > or is
> > > > > > > > > > > > > > that up
> > > > > > > > > > > > > > > > to the client author? I'm guessing the
> > behaviour
> > > > should
> > > > > > > be
> > > > > > > > > > sticky
> > > > > > > > > > > > to
> > > > > > > > > > > > > > > > support the rate limiting features, but I
> think
> > > it
> > > > > > would
> > > > > > > be
> > > > > > > > > > good
> > > > > > > > > > > > for
> > > > > > > > > > > > > > client
> > > > > > > > > > > > > > > > authors if this section were explicit on the
> > > > > > recommended
> > > > > > > > > > behaviour.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > You are right, I've updated the KIP to make this
> > > > > > > > > > > > > > > > clearer.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > 4. "Mapping the client instance id to an
> actual
> > > > > > > application
> > > > > > > > > > > > instance
> > > > > > > > > > > > > > > > running on a (virtual) machine can be done by
> > > > > > inspecting
> > > > > > > > the
> > > > > > > > > > > > metrics
> > > > > > > > > > > > > > > > resource labels, such as the client source
> > > address
> > > > and
> > > > > > > > source
> > > > > > > > > > port,
> > > > > > > > > > > > > or
> > > > > > > > > > > > > > > > security principal, all of which are added by
> > the
> > > > > > > receiving
> > > > > > > > > > broker.
> > > > > > > > > > > > > > This
> > > > > > > > > > > > > > > > will allow the operator together with the
> user
> > to
> > > > > > > identify
> > > > > > > > > the
> > > > > > > > > > > > actual
> > > > > > > > > > > > > > > > application instance." Is this really always
> > > true?
> > > > The
> > > > > > > > source
> > > > > > > > > > IP
> > > > > > > > > > > > and
> > > > > > > > > > > > > > port
> > > > > > > > > > > > > > > > might be a loadbalancer/proxy in some setups.
> > The
> > > > > > > > principal,
> > > > > > > > > as
> > > > > > > > > > > > > already
> > > > > > > > > > > > > > > > mentioned in the KIP, might be shared between
> > > > multiple
> > > > > > > > > > > > applications.
> > > > > > > > > > > > > > So at
> > > > > > > > > > > > > > > > worst the organization running the clients
> > might
> > > > have
> > > > > > to
> > > > > > > > > > consult
> > > > > > > > > > > > the
> > > > > > > > > > > > > > logs
> > > > > > > > > > > > > > > > of a set of client applications, right?
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Yes, that's correct. There's no guaranteed mapping
> > > > > > > > > > > > > > > > from client_instance_id to an actual instance, which
> > > > > > > > > > > > > > > > is why the KIP recommends that client implementations
> > > > > > > > > > > > > > > > log the client instance id upon retrieval, and also
> > > > > > > > > > > > > > > > provide an API for the application to retrieve the
> > > > > > > > > > > > > > > > instance id programmatically if it has a better way
> > > > > > > > > > > > > > > > of exposing it.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > 5. "Tests indicate that a compression ratio up
> to
> > > > 10x is
> > > > > > > > > > possible for
> > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > standard metrics." Client authors might
> > > appreciate
> > > > your
> > > > > > > > > > mentioning
> > > > > > > > > > > > > > which
> > > > > > > > > > > > > > > > compression codec got these results.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Good point. Updated.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > 6. "Should the client send a push request
> prior
> > > to
> > > > > > expiry
> > > > > > > > of
> > > > > > > > > > the
> > > > > > > > > > > > > > previously
> > > > > > > > > > > > > > > > calculated PushIntervalMs the broker will
> > discard
> > > > the
> > > > > > > > metrics
> > > > > > > > > > and
> > > > > > > > > > > > > > return a
> > > > > > > > > > > > > > > > PushTelemetryResponse with the ErrorCode set
> to
> > > > > > > > RateLimited."
> > > > > > > > > > Is
> > > > > > > > > > > > this
> > > > > > > > > > > > > > > > RATE_LIMITED a new error code? It's not
> > mentioned
> > > > in
> > > > > > the
> > > > > > > > "New
> > > > > > > > > > Error
> > > > > > > > > > > > > > Codes"
> > > > > > > > > > > > > > > > section.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > That's a leftover; it should be using the standard
> > > > > > > > > > > > > > > > ThrottleTime mechanism.
> > > > > > > > > > > > > > > > Fixed.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > 7. In the section "Standard client resource labels"
> > > > > > > > > > > > > > > > > application_id is described as Kafka Streams only,
> > > > > > > > > > > > > > > > > but the section of "Client Identification" talks
> > > > > > > > > > > > > > > > > about "application instance id as an optional
> > > > > > > > > > > > > > > > > future nice-to-have that may be included as a
> > > > > > > > > > > > > > > > > metrics label if it has been set by the user", so
> > > > > > > > > > > > > > > > > I'm confused whether non-Kafka Streams clients
> > > > > > > > > > > > > > > > > should set an application_id or not.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > I'll clarify this in the KIP, but basically we would
> > > > > > > > > > > > > > > > need to add an `application.id` config property for
> > > > > > > > > > > > > > > > non-streams clients for this purpose, and that's
> > > > > > > > > > > > > > > > outside the scope of this KIP since we want to make
> > > > > > > > > > > > > > > > it zero-conf:ish on the client side.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Kind regards,
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Tom
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Thanks for the review,
> > > > > > > > > > > > > > > Magnus
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > On Thu, Oct 7, 2021 at 5:26 PM Magnus Edenhill
> > > > > > > > > > > > > > > > > <magnus@edenhill.se> wrote:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Hi all,
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > I've updated the KIP following our recent
> > > > > > > > > > > > > > > > > > discussions on the mailing list:
> > > > > > > > > > > > > > > > > >  - split the protocol in two, one for getting the
> > > > > > > > > > > > > > > > > >    metrics subscriptions, and one for pushing the
> > > > > > > > > > > > > > > > > >    metrics.
> > > > > > > > > > > > > > > > > >  - simplifications: initially only one supported
> > > > > > > > > > > > > > > > > >    metrics format, no client.id in the instance
> > > > > > > > > > > > > > > > > >    id, etc.
> > > > > > > > > > > > > > > > > >  - made CLIENT_METRICS subscription configuration
> > > > > > > > > > > > > > > > > >    entries more structured and allowing better
> > > > > > > > > > > > > > > > > >    client matching selectors (not only on the
> > > > > > > > > > > > > > > > > >    instance id, but also the other client
> > > > > > > > > > > > > > > > > >    resource labels, such as client_software_name,
> > > > > > > > > > > > > > > > > >    etc.).
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Unless there are further comments I'll call the
> > > > > > > > > > > > > > > > > > vote in a day or two.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Regards,
> > > > > > > > > > > > > > > > > Magnus
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > On Mon, Oct 4, 2021 at 20:57, Magnus
> > > > Edenhill <
> > > > > > > > > > > > > > magnus@edenhill.se> wrote:
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Hi Gwen,
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > I'm finishing up the KIP based on the
> last
> > > > couple
> > > > > > of
> > > > > > > > > > discussion
> > > > > > > > > > > > > > points
> > > > > > > > > > > > > > > > in
> > > > > > > > > > > > > > > > > > this thread
> > > > > > > > > > > > > > > > > > and will call the Vote later this week.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Best,
> > > > > > > > > > > > > > > > > > Magnus
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > On Sat, Oct 2, 2021 at 02:01, Gwen
> > > Shapira
> > > > > > > > > > > > > > > > > <gwen@confluent.io.invalid
> > > > > > > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > >> Hey,
> > > > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > > > >> I noticed that there was no discussion
> for
> > > the
> > > > > > last
> > > > > > > 10
> > > > > > > > > > days,
> > > > > > > > > > > > > but I
> > > > > > > > > > > > > > > > > >> couldn't
> > > > > > > > > > > > > > > > > >> find the vote thread. Is there one that
> > I'm
> > > > > > missing?
> > > > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > > > >> Gwen
> > > > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > > > >> On Wed, Sep 22, 2021 at 4:58 AM Magnus
> > > > Edenhill <
> > > > > > > > > > > > > > magnus@edenhill.se>
> > > > > > > > > > > > > > > > > >> wrote:
> > > > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > > > > >> > On Tue, Sep 21, 2021 at 06:58,
> > Colin
> > > > > > McCabe <
> > > > > > > > > > > > > > > > cmccabe@apache.org
> > > > > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > > >> >
> > > > > > > > > > > > > > > > > >> > > On Mon, Sep 20, 2021, at 17:35, Feng
> > Min
> > > > > > wrote:
> > > > > > > > > > > > > > > > > >> > > > Thanks Magnus & Colin for the
> > > > discussion.
> > > > > > > > > > > > > > > > > >> > > >
> > > > > > > > > > > > > > > > > >> > > > Based on KIP-714's stateless
> design,
> > > > Client
> > > > > > > can
> > > > > > > > > > pretty
> > > > > > > > > > > > > much
> > > > > > > > > > > > > > use
> > > > > > > > > > > > > > > > > any
> > > > > > > > > > > > > > > > > >> > > > connection to any broker to send
> > > > metrics. We
> > > > > > > are
> > > > > > > > > not
> > > > > > > > > > > > > > associating
> > > > > > > > > > > > > > > > > >> > > connection
> > > > > > > > > > > > > > > > > >> > > > with client metric state. Is my
> > > > > > understanding
> > > > > > > > > > correct?
> > > > > > > > > > > > If
> > > > > > > > > > > > > > yes,
> > > > > > > > > > > > > > > > > how
> > > > > > > > > > > > > > > > > >> > about
> > > > > > > > > > > > > > > > > >> > > > the following two scenarios
> > > > > > > > > > > > > > > > > >> > > >
> > > > > > > > > > > > > > > > > >> > > > 1) One Client (Client-ID)
> registers
> > > two
> > > > > > > > different
> > > > > > > > > > client
> > > > > > > > > > > > > > > > instance
> > > > > > > > > > > > > > > > > id
> > > > > > > > > > > > > > > > > >> > via
> > > > > > > > > > > > > > > > > >> > > > separate registration. Is it
> > > permitted?
> > > > If
> > > > > > OK,
> > > > > > > > how
> > > > > > > > > > to
> > > > > > > > > > > > > > > > distinguish
> > > > > > > > > > > > > > > > > >> them
> > > > > > > > > > > > > > > > > >> > > from
> > > > > > > > > > > > > > > > > >> > > > the case 2 below.
> > > > > > > > > > > > > > > > > >> > > >
> > > > > > > > > > > > > > > > > >> > >
> > > > > > > > > > > > > > > > > >> > > Hi Feng,
> > > > > > > > > > > > > > > > > >> > >
> > > > > > > > > > > > > > > > > >> > > My understanding, which Magnus can
> > > > clarify I
> > > > > > > > guess,
> > > > > > > > > is
> > > > > > > > > > > > that
> > > > > > > > > > > > > > you
> > > > > > > > > > > > > > > > > could
> > > > > > > > > > > > > > > > > >> > have
> > > > > > > > > > > > > > > > > >> > > something like two Producer
> instances
> > > > running
> > > > > > > with
> > > > > > > > > the
> > > > > > > > > > > > same
> > > > > > > > > > > > > > > > > client.id
> > > > > > > > > > > > > > > > > >> > > (perhaps because they're using the
> > same
> > > > config
> > > > > > > > file,
> > > > > > > > > > for
> > > > > > > > > > > > > > example).
> > > > > > > > > > > > > > > > > >> They
> > > > > > > > > > > > > > > > > >> > > could even be in the same process.
> But
> > > > they
> > > > > > > would
> > > > > > > > > get
> > > > > > > > > > > > > separate
> > > > > > > > > > > > > > > > > UUIDs.
> > > > > > > > > > > > > > > > > >> > >
> > > > > > > > > > > > > > > > > >> > > I believe Magnus used the term
> client
> > to
> > > > mean
> > > > > > > > > > "Producer or
> > > > > > > > > > > > > > > > > Consumer".
> > > > > > > > > > > > > > > > > >> So
> > > > > > > > > > > > > > > > > >> > > if you have both a Producer and a
> > > > Consumer in
> > > > > > > your
> > > > > > > > > > > > > > application I
> > > > > > > > > > > > > > > > > would
> > > > > > > > > > > > > > > > > >> > > expect you'd get separate UUIDs for
> > > both.
> > > > > > Again
> > > > > > > > > > Magnus can
> > > > > > > > > > > > > > chime
> > > > > > > > > > > > > > > > in
> > > > > > > > > > > > > > > > > >> > here, I
> > > > > > > > > > > > > > > > > >> > > guess.
> > > > > > > > > > > > > > > > > >> > >
> > > > > > > > > > > > > > > > > >> >
> > > > > > > > > > > > > > > > > >> > That's correct.
> > > > > > > > > > > > > > > > > >> >
> > > > > > > > > > > > > > > > > >> >
> > > > > > > > > > > > > > > > > >> > >
> > > > > > > > > > > > > > > > > >> > > > 2) How about the client
> restarting?
> > > > What's
> > > > > > the
> > > > > > > > > > > > > expectation?
> > > > > > > > > > > > > > > > Should
> > > > > > > > > > > > > > > > > >> the
> > > > > > > > > > > > > > > > > >> > > > server expect the client to carry
> a
> > > > > > persisted
> > > > > > > > > client
> > > > > > > > > > > > > > instance id
> > > > > > > > > > > > > > > > > or
> > > > > > > > > > > > > > > > > >> > > should
> > > > > > > > > > > > > > > > > >> > > > the client be treated as a new
> > > instance?
> > > > > > > > > > > > > > > > > >> > >
> > > > > > > > > > > > > > > > > >> > > The KIP doesn't describe any
> mechanism
> > > for
> > > > > > > > > > persistence,
> > > > > > > > > > > > so I
> > > > > > > > > > > > > > would
> > > > > > > > > > > > > > > > > >> assume
> > > > > > > > > > > > > > > > > >> > > that when you restart the client you
> > get
> > > > a new
> > > > > > > > > UUID. I
> > > > > > > > > > > > agree
> > > > > > > > > > > > > > that
> > > > > > > > > > > > > > > > it
> > > > > > > > > > > > > > > > > >> > would
> > > > > > > > > > > > > > > > > >> > > be good to spell this out.
> > > > > > > > > > > > > > > > > >> > >
> > > > > > > > > > > > > > > > > >> > >
> > > > > > > > > > > > > > > > > >> > Right, it will not be persisted since
> a
> > > > client
> > > > > > > > > instance
> > > > > > > > > > > > can't
> > > > > > > > > > > > > be
> > > > > > > > > > > > > > > > > >> restarted.
> > > > > > > > > > > > > > > > > >> >
> > > > > > > > > > > > > > > > > >> > Will update the KIP to make this
> > clearer.
> > > > > > > > > > > > > > > > > >> >
> > > > > > > > > > > > > > > > > >> > /Magnus
> > > > > > > > > > > > > > > > > >> >
> > > > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > > > >> --
> > > > > > > > > > > > > > > > > >> Gwen Shapira
> > > > > > > > > > > > > > > > > >> Engineering Manager | Confluent
> > > > > > > > > > > > > > > > > >> 650.450.2760 | @gwenshap
> > > > > > > > > > > > > > > > > >> Follow us: Twitter | blog
> > > > > > > > > > > > > > > > > >>

Re: [DISCUSS] KIP-714: Client metrics and observability

Posted by Jun Rao <ju...@confluent.io.INVALID>.
Hi, Magnus,

Thanks for updating the KIP. The overall approach makes sense to me. A few
more detailed comments below.

20. ClientTelemetry: Should it extend Configurable and Closeable?

21. Compression of the metrics on the client: what's the default?

22. "Client metrics plugin / extending the MetricsReporter interface":
ClientTelemetry doesn't seem to extend MetricsReporter.

23. A client instance is considered a metric resource and the
resource-level (thus client instance level) labels could include:
    client_software_name=confluent-kafka-python
    client_software_version=v2.1.3
    client_instance_id=B64CD139-3975-440A-91D4
    transactional_id=someTxnApp
Are those labels added in PushTelemetryRequest? If so, are they per metric
or per request?

24. "the broker will only send
GetTelemetrySubscriptionsResponse.DeltaTemporality=True":
24.1 If it's always true, does it need to be part of the protocol?
24.2 Does delta only apply to Counter type?
24.3 In the delta representation, the first request needs to send the full
value. How does the broker plugin know whether a value is full or delta?

25. quota:
25.1 Since we are fitting PushTelemetryRequest into the existing request
quota, it would be useful to document the impact, i.e. client metric
throttling causes the data from the same client to be delayed.
25.2 Is PushTelemetryRequest subject to the write bandwidth quota like the
producer?
25.3 THROTTLING_QUOTA_EXCEEDED: Currently, we don't send this error when
the request/bandwidth quota is exceeded since those requests are not
rejected. We only set this error when the request is rejected (e.g., topic
creation). It would be useful to clarify when this error is used.

26. client-metrics entity:
26.1 It seems that we could add multiple entities that match the same
client. Which one takes precedence?
26.2 How do we persist the new client metrics entities? Do we need to add
new ZK paths and new records in KRaft?

27. kafka-client-metrics.sh: Could we add an example of how to disable a
bad client?

28. New broker side metrics: Could we spell out the details of the metrics
(e.g., group, tags, etc)?

29. Client instance-level metrics: client.io.wait.time is a gauge not a
histogram.
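
To make the ambiguity in 24.3 concrete, here is a minimal sketch of delta
encoding for a counter. It is not taken from the KIP: the names DeltaCounter,
to_delta and rebuild_total are hypothetical, and it assumes each counter
starts from zero on a new client instance.

```python
from typing import Dict, Iterable


class DeltaCounter:
    """Turns cumulative counter readings into deltas (hypothetical helper)."""

    def __init__(self) -> None:
        # Last cumulative value seen per metric name; absent means "never pushed".
        self._last: Dict[str, int] = {}

    def to_delta(self, name: str, cumulative: int) -> int:
        # Assumption: a counter that was never pushed has a baseline of zero,
        # so the first delta is numerically equal to the full value.
        previous = self._last.get(name, 0)
        self._last[name] = cumulative
        return cumulative - previous


def rebuild_total(deltas: Iterable[int]) -> int:
    # A receiver that has seen every delta since the start can rebuild the total.
    return sum(deltas)
```

Under that zero-baseline assumption the first push carries the full value and
every later push carries only the change, but nothing in the number itself
tells the receiver which of the two it is looking at.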

Thanks,

Jun

On Wed, Jan 26, 2022 at 6:32 AM Magnus Edenhill <ma...@edenhill.se> wrote:

> Hi all,
>
> I've updated the KIP with responses to the latest comments: Java client
> dependencies (Thanks Kirk!), alternate designs (separate cluster, separate
> producer, etc), etc.
>
> I will revive the vote thread.
>
> Thanks,
> Magnus
>
>
> On Mon, Dec 13, 2021 at 22:32, Ryanne Dolan <ry...@gmail.com> wrote:
>
> > I think we should be very careful about introducing new runtime
> > dependencies into the clients. Historically this has been rare and
> > essentially necessary (e.g. compression libs).
> >
> > Ryanne
> >
> > On Mon, Dec 13, 2021, 1:06 PM Kirk True <ki...@mustardgrain.com> wrote:
> >
> > > Hi Jun,
> > >
> > > On Thu, Dec 9, 2021, at 2:28 PM, Jun Rao wrote:
> > > > 13. Using OpenTelemetry. Does that require runtime dependency
> > > > on OpenTelemetry library? How good is the compatibility story
> > > > of OpenTelemetry? This is important since an application could have
> > other
> > > > OpenTelemetry dependencies than the Kafka client.
> > >
> > > The current design is that the OpenTelemetry JARs would ship with the
> > > client. Perhaps we can design the client such that the JARs aren't even
> > > loaded if the user has opted out. The user could even exclude the JARs
> > from
> > > their dependencies if they so wished.
> > >
> > > I can't speak to the compatibility of the libraries. Is it possible
> that
> > > we include a shaded version?
> > >
> > > Thanks,
> > > Kirk
> > >
> > > >
> > > > 14. The proposal listed idempotence=true. This is more of a
> > configuration
> > > > than a metric. Are we including that as a metric? What other
> > > configurations
> > > > are we including? Should we separate the configurations from the
> > metrics?
> > > >
> > > > Thanks,
> > > >
> > > > Jun
> > > >
> > > > On Mon, Nov 29, 2021 at 7:34 AM Magnus Edenhill <ma...@edenhill.se>
> > > wrote:
> > > >
> > > > > Hey Bob,
> > > > >
> > > > > That's a good point.
> > > > >
> > > > > Request type labels were considered but since they're already
> tracked
> > > by
> > > > > broker-side metrics
> > > > > > they were left out to avoid metric duplication; however, those
> > > metrics
> > > > > are not per connection,
> > > > > so they won't be that useful in practice for troubleshooting
> specific
> > > > > client instances.
> > > > >
> > > > > I'll add the request_type label to the relevant metrics.
> > > > >
> > > > > Thanks,
> > > > > Magnus
> > > > >
> > > > >
> > > > > > On Tue, Nov 23, 2021 at 19:20, Bob Barrett
> > > > > > <bo...@confluent.io.invalid> wrote:
> > > > >
> > > > > > Hi Magnus,
> > > > > >
> > > > > > Thanks for the thorough KIP, this seems very useful.
> > > > > >
> > > > > > Would it make sense to include the request type as a label for
> the
> > > > > > `client.request.success`, `client.request.errors` and
> > > > > `client.request.rtt`
> > > > > > metrics? I think it would be very useful to see which specific
> > > requests
> > > > > are
> > > > > > succeeding and failing for a client. One specific case I can
> think
> > of
> > > > > where
> > > > > > this could be useful is producer batch timeouts. If a Java
> > > application
> > > > > does
> > > > > > not enable producer client logs (unfortunately, in my experience
> > this
> > > > > > happens more often than it should), the application logs will
> only
> > > > > contain
> > > > > > the expiration error message, but no information about what is
> > > causing
> > > > > the
> > > > > > timeout. The requests might all be succeeding but taking too long
> > to
> > > > > > process batches, or metadata requests might be failing, or some
> or
> > > all
> > > > > > produce requests might be failing (if the bootstrap servers are
> > > reachable
> > > > > > from the client but one or more other brokers are not, for
> > example).
> > > If
> > > > > the
> > > > > > cluster operator is able to identify the specific requests that
> are
> > > slow
> > > > > or
> > > > > > failing for a client, they will be better able to diagnose the
> > issue
> > > > > > causing batch timeouts.
> > > > > >
> > > > > > One drawback I can think of is that this will increase the
> > > cardinality of
> > > > > > the request metrics. But any given client is only going to use a
> > > small
> > > > > > subset of the request types, and since we already have partition
> > > labels
> > > > > for
> > > > > > the topic-level metrics, I think request labels will still make
> up
> > a
> > > > > > relatively small percentage of the set of metrics.
> > > > > >
> > > > > > Thanks,
> > > > > > Bob
> > > > > >
> > > > > > On Mon, Nov 22, 2021 at 2:08 AM Viktor Somogyi-Vass <
> > > > > > viktorsomogyi@gmail.com>
> > > > > > wrote:
> > > > > >
> > > > > > > Hi Magnus,
> > > > > > >
> > > > > > > I think this is a very useful addition. We also have a similar
> > (but
> > > > > much
> > > > > > > more simplistic) implementation of this. Maybe I missed it in
> the
> > > KIP
> > > > > but
> > > > > > > what about adding metrics about the subscription cache itself?
> > > That I
> > > > > > think
> > > > > > > would improve its usability and debuggability as we'd be able
> to
> > > see
> > > > > its
> > > > > > > performance, hit/miss rates, eviction counts and others.
> > > > > > >
> > > > > > > Best,
> > > > > > > Viktor
> > > > > > >
> > > > > > > On Thu, Nov 18, 2021 at 5:12 PM Magnus Edenhill <
> > > magnus@edenhill.se>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Hi Mickael,
> > > > > > > >
> > > > > > > > see inline.
> > > > > > > >
> > > > > > > > > On Wed, Nov 10, 2021 at 15:21, Mickael Maison <
> > > > > > > > > mickael.maison@gmail.com
> > > > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Hi Magnus,
> > > > > > > > >
> > > > > > > > > I see you've addressed some of the points I raised above
> but
> > > some
> > > > > (4,
> > > > > > > > > 5) have not been addressed yet.
> > > > > > > > >
> > > > > > > >
> > > > > > > > Re 4) How will the user/app know metrics are being sent.
> > > > > > > >
> > > > > > > > One possibility is to add a JMX metric (thus for user
> > > consumption)
> > > > > for
> > > > > > > the
> > > > > > > > number of metric pushes the
> > > > > > > > client has performed, or perhaps the number of metrics
> > > subscriptions
> > > > > > > > currently being collected.
> > > > > > > > Would that be sufficient?
> > > > > > > >
> > > > > > > > Re 5) Metric sizes and rates
> > > > > > > >
> > > > > > > > A worst case scenario for a producer that is producing to 50
> > > unique
> > > > > > > topics
> > > > > > > > and emitting all standard metrics yields
> > > > > > > > a serialized size of around 100KB prior to compression, which
> > > > > > compresses
> > > > > > > > down to about 20-30% of that depending
> > > > > > > > on compression type and topic name uniqueness.
> > > > > > > > The numbers for a consumer would be similar.
> > > > > > > >
> > > > > > > > In practice the number of unique topics would be far less,
> and
> > > the
> > > > > > > > subscription set would typically be for a subset of metrics.
> > > > > > > > > So we're probably closer to 1KB, or less, compressed size per
> > > client
> > > > > > per
> > > > > > > > push interval.
> > > > > > > >
> > > > > > > > As both the subscription set and push intervals are
> controlled
> > > by the
> > > > > > > > cluster operator it shouldn't be too hard
> > > > > > > > to strike a good balance between metrics overhead and
> > > granularity.
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > > I'm really uneasy with this being enabled by default on the
> > > client
> > > > > > > > > side. When collecting data, I think the best practice is to
> > > ensure
> > > > > > > > > users are explicitly enabling it.
> > > > > > > > >
> > > > > > > >
> > > > > > > > Requiring metrics to be explicitly enabled on clients
> severely
> > > > > cripples
> > > > > > > its
> > > > > > > > usability and value.
> > > > > > > >
> > > > > > > > One of the problems that this KIP aims to solve is for useful
> > > metrics
> > > > > > to
> > > > > > > be
> > > > > > > > available on demand
> > > > > > > > regardless of the technical expertise of the user. As Ryanne
> > > points
> > > > > > > out,
> > > > > > > a
> > > > > > > > savvy user/organization
> > > > > > > > will typically have metrics collection and monitoring in
> place
> > > > > already,
> > > > > > > and
> > > > > > > > the benefits of this KIP
> > > > > > > > > are then more of a common set and format of metrics across
> client
> > > > > > > > implementations and languages.
> > > > > > > > But that is not the typical Kafka user in my experience,
> > they're
> > > not
> > > > > > > Kafka
> > > > > > > > experts and they don't have the
> > > > > > > > knowledge of how to best instrument their clients.
> > > > > > > > Having metrics enabled by default for this user base allows
> the
> > > Kafka
> > > > > > > > operators to proactively and reactively
> > > > > > > > monitor and troubleshoot client issues, without the need for
> > the
> > > less
> > > > > > > savvy
> > > > > > > > user to do anything.
> > > > > > > > It is often too late to tell a user to enable metrics when
> the
> > > > > problem
> > > > > > > has
> > > > > > > > already occurred.
> > > > > > > >
> > > > > > > > Now, to be clear, even though metrics are enabled by default
> on
> > > > > clients
> > > > > > > they
> > > > > > > > > are not enabled by default
> > > > > > > > on the brokers; the Kafka operator needs to build and set up
> a
> > > > > metrics
> > > > > > > > plugin and add metrics subscriptions
> > > > > > > > before anything is sent from the client.
> > > > > > > > It is opt-out on the clients and opt-in on the broker.
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > > You mentioned brokers already have
> > > > > > > > > some(most?) of the information contained in metrics, if so
> > > then why
> > > > > > > > > are we collecting it again? Surely there must be some new
> > > > > information
> > > > > > > > > in the client metrics.
> > > > > > > > >
> > > > > > > >
> > > > > > > > From the user's perspective the Kafka infrastructure extends
> > from
> > > > > > > > producer.send() to
> > > > > > > > messages being returned from consumer.poll(), a giant black
> box
> > > where
> > > > > > > > there's a lot going on between those
> > > > > > > > two points. The brokers currently only see what happens once
> > > those
> > > > > > > requests
> > > > > > > > > and messages hit the broker,
> > > > > > > > but as Kafka clients are complex pieces of machinery there's
> a
> > > myriad
> > > > > > of
> > > > > > > > queues, timers, and state
> > > > > > > > that's critical to the operation and infrastructure that's
> not
> > > > > > currently
> > > > > > > > visible to the operator.
> > > > > > > > Relying on the user to accurately and timely provide this
> > missing
> > > > > > > > information is not generally feasible.
> > > > > > > >
> > > > > > > >
> > > > > > > > Most of the standard metrics listed in the KIP are data
> points
> > > that
> > > > > the
> > > > > > > > broker does not have.
> > > > > > > > Only a small number of metrics are duplicates (like the
> request
> > > > > counts
> > > > > > > and
> > > > > > > > sizes), but they are included
> > > > > > > > to ease correlation when inspecting these client metrics.
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > > Moreover this is a brand new feature so it's even harder to
> > > justify
> > > > > > > > > enabling it and forcing onto all our users. If disabled by
> > > default,
> > > > > > > > > it's relatively easy to enable in a new release if we
> decide
> > > to,
> > > > > but
> > > > > > > > > once enabled by default it's much harder to disable. Also
> > this
> > > > > > feature
> > > > > > > > > will apply to all future metrics we will add.
> > > > > > > > >
> > > > > > > >
> > > > > > > > I think maturity of a feature implementation should be the
> > > deciding
> > > > > > > factor,
> > > > > > > > rather than
> > > > > > > > the design of it (which this KIP is). I.e., if the
> > > implementation is
> > > > > > not
> > > > > > > > deemed mature enough
> > > > > > > > for release X.Y it will be disabled.
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > > Overall I think it's an interesting feature but I'd prefer
> to
> > > be
> > > > > > > > > slightly defensive and see how it works in practice before
> > > enabling
> > > > > > it
> > > > > > > > > everywhere.
> > > > > > > > >
> > > > > > > >
> > > > > > > > Right, and I agree on being defensive, but since this feature
> > > still
> > > > > > > > requires manual
> > > > > > > > enabling on the brokers before actually being used, I think
> > that
> > > > > gives
> > > > > > > > enough control
> > > > > > > > to opt-in or out of this feature as needed.
> > > > > > > >
> > > > > > > > Thanks for your comments!
> > > > > > > >
> > > > > > > > Regards,
> > > > > > > > Magnus
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > > Mickael
> > > > > > > > >
> > > > > > > > > On Mon, Nov 8, 2021 at 7:55 AM Magnus Edenhill <
> > > magnus@edenhill.se
> > > > > >
> > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > Thanks David for pointing this out,
> > > > > > > > > > I've updated the KIP to include client_id as a matching
> > > selector.
> > > > > > > > > >
> > > > > > > > > > Regards,
> > > > > > > > > > Magnus
> > > > > > > > > >
> > > > > > > > > > > On Thu, Nov 4, 2021 at 18:01, David Mao
> > > > > > > > > <dmao@confluent.io.invalid
> > > > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > Hey Magnus,
> > > > > > > > > > >
> > > > > > > > > > > I noticed that the KIP outlines the initial selectors
> > > supported
> > > > > > as:
> > > > > > > > > > >
> > > > > > > > > > >    - client_instance_id - CLIENT_INSTANCE_ID UUID
> string
> > > > > > > > > representation.
> > > > > > > > > > >    - client_software_name  - client software
> > implementation
> > > > > name.
> > > > > > > > > > >    - client_software_version  - client software
> > > implementation
> > > > > > > > version.
> > > > > > > > > > >
> > > > > > > > > > > In the given reactive monitoring workflow, we mention
> > that
> > > the
> > > > > > > > > application
> > > > > > > > > > > user does not know their client's client instance ID,
> but
> > > it's
> > > > > > > > outlined
> > > > > > > > > > > that the operator can add a metrics subscription
> > selecting
> > > for
> > > > > > > > > clientId. I
> > > > > > > > > > > don't see clientId as one of the supported selectors.
> > > > > > > > > > > I can see how this would have made sense in a previous
> > > > > iteration
> > > > > > > > given
> > > > > > > > > that
> > > > > > > > > > > the previous client instance ID proposal was to
> construct
> > > the
> > > > > > > client
> > > > > > > > > > > instance ID using clientId as a prefix. Now that the
> > client
> > > > > > > instance
> > > > > > > > > ID is
> > > > > > > > > > > a UUID, would we want to add clientId as a supported
> > > selector?
> > > > > > > > > > > Let me know what you think.
> > > > > > > > > > >
> > > > > > > > > > > David
> > > > > > > > > > >
> > > > > > > > > > > On Tue, Oct 19, 2021 at 12:33 PM Magnus Edenhill <
> > > > > > > magnus@edenhill.se
> > > > > > > > >
> > > > > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Hi Mickael!
> > > > > > > > > > > >
> > > > > > > > > > > > > On Tue, Oct 19, 2021 at 19:30, Mickael Maison <
> > > > > > > > > > > > > mickael.maison@gmail.com
> > > > > > > > > > > > > > wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > Hi Magnus,
> > > > > > > > > > > > >
> > > > > > > > > > > > > Thanks for the proposal.
> > > > > > > > > > > > >
> > > > > > > > > > > > > 1. Looking at the protocol section, isn't
> > > > > "ClientInstanceId"
> > > > > > > > > expected
> > > > > > > > > > > > > to be a field in
> GetTelemetrySubscriptionsResponseV0?
> > > > > > > Otherwise,
> > > > > > > > > how
> > > > > > > > > > > > > does a client retrieve this value?
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > Good catch, it got removed by mistake in one of the
> > > edits.
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > 2. In the client API section, you mention a new
> > method
> > > > > > > > > > > > > "clientInstanceId()". Can you clarify which
> > interfaces
> > > are
> > > > > > > > > affected?
> > > > > > > > > > > > > Is it only Consumer and Producer?
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > And Admin. Will update the KIP.
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > > 3. I'm a bit concerned this is enabled by default.
> > > Even if
> > > > > > the
> > > > > > > > data
> > > > > > > > > > > > > collected is supposed to be not sensitive, I think
> > > this can
> > > > > > be
> > > > > > > > > > > > > problematic in some environments. Also users don't
> > > seem to
> > > > > > have
> > > > > > > > the
> > > > > > > > > > > > > choice to only expose some metrics. Knowing how
> much
> > > data
> > > > > > > transits
> > > > > > > > > > > > > through some applications can be considered
> critical.
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > The broker already knows how much data transits
> through
> > > the
> > > > > > > client
> > > > > > > > > > > though,
> > > > > > > > > > > > right?
> > > > > > > > > > > > Care has been taken not to expose information in the
> > > standard
> > > > > > > > metrics
> > > > > > > > > > > that
> > > > > > > > > > > > might
> > > > > > > > > > > > reveal sensitive information.
> > > > > > > > > > > >
> > > > > > > > > > > > Do you have an example of how the proposed metrics
> > could
> > > leak
> > > > > > > > > sensitive
> > > > > > > > > > > > information?
> > > > > > > > > > > > > As for limiting which metrics to export, I guess
> > that
> > > > > could
> > > > > > > make
> > > > > > > > > sense
> > > > > > > > > > > > in some
> > > > > > > > > > > > very sensitive use-cases, but those users might
> disable
> > > > > metrics
> > > > > > > > > > > altogether
> > > > > > > > > > > > for now.
> > > > > > > > > > > > Could these concerns be addressed by a later KIP?
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > 4. As a user, how do you know if your application
> is
> > > > > actively
> > > > > > > > > sending
> > > > > > > > > > > > > metrics? Are there new metrics exposing what's
> going
> > > on,
> > > > > like
> > > > > > > how
> > > > > > > > > much
> > > > > > > > > > > > > data is being sent?
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > That's a good question.
> > > > > > > > > > > > Since the proposed metrics interface is not aimed at,
> > or
> > > > > > directly
> > > > > > > > > > > available
> > > > > > > > > > > > to, the application
> > > > > > > > > > > > I guess there's little point of adding it here, but
> > > instead
> > > > > > > adding
> > > > > > > > > > > > something to the
> > > > > > > > > > > > existing JMX metrics?
> > > > > > > > > > > > Do you have any suggestions?
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > > 5. If all metrics are enabled on a regular Consumer
> > or
> > > > > > > Producer,
> > > > > > > > do
> > > > > > > > > > > > > you have an idea how much throughput this would
> use?
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > > It depends on the number of partitions/topics/etc. the
> > > client
> > > > > is
> > > > > > > > > producing
> > > > > > > > > > > > to/consuming from.
> > > > > > > > > > > > I'll add some sizes to the KIP for some typical
> > > use-cases.
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > Thanks,
> > > > > > > > > > > > Magnus
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > > Thanks
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Tue, Oct 19, 2021 at 5:06 PM Magnus Edenhill <
> > > > > > > > > magnus@edenhill.se>
> > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > On Tue, Oct 19, 2021 at 13:22, Tom Bentley
> > > > > > > > > > > > > > > <tbentley@redhat.com> wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Hi Magnus,
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > I reviewed the KIP since you called the vote
> > > (sorry for
> > > > > > not
> > > > > > > > > > > reviewing
> > > > > > > > > > > > > when
> > > > > > > > > > > > > > > you announced your intention to call the
> vote). I
> > > have
> > > > > a
> > > > > > > few
> > > > > > > > > > > > questions
> > > > > > > > > > > > > on
> > > > > > > > > > > > > > > some of the details.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > 1. There's no Javadoc on
> > > ClientTelemetryPayload.data(),
> > > > > > so
> > > > > > > I
> > > > > > > > > don't
> > > > > > > > > > > > know
> > > > > > > > > > > > > > > whether the payload is exposed through this
> > method
> > > as
> > > > > > > > > compressed or
> > > > > > > > > > > > > not.
> > > > > > > > > > > > > > > Later on you say "Decompression of the payloads
> > > will be
> > > > > > > > > handled by
> > > > > > > > > > > > the
> > > > > > > > > > > > > > > broker metrics plugin, the broker should
> expose a
> > > > > > suitable
> > > > > > > > > > > > > decompression
> > > > > > > > > > > > > > > API to the metrics plugin for this purpose.",
> > which
> > > > > > > suggests
> > > > > > > > > it's
> > > > > > > > > > > the
> > > > > > > > > > > > > > > compressed data in the buffer, but then we
> don't
> > > know
> > > > > > which
> > > > > > > > > codec
> > > > > > > > > > > was
> > > > > > > > > > > > > used,
> > > > > > > > > > > > > > > nor the API via which the plugin should
> > decompress
> > > it
> > > > > if
> > > > > > > > > required
> > > > > > > > > > > for
> > > > > > > > > > > > > > > forwarding to the ultimate metrics store.
> Should
> > > the
> > > > > > > > > > > > > ClientTelemetryPayload
> > > > > > > > > > > > > > > expose a method to get the compression and a
> > > > > > decompressor?
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Good point, updated.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > 2. The client-side API is expressed as
> > > StringOrError
> > > > > > > > > > > > > > > ClientInstance::ClientInstanceId(int
> > timeout_ms). I
> > > > > > > > understand
> > > > > > > > > that
> > > > > > > > > > > > > you're
> > > > > > > > > > > > > > > thinking about the librdkafka implementation,
> but
> > > it
> > > > > > would
> > > > > > > be
> > > > > > > > > good
> > > > > > > > > > > to
> > > > > > > > > > > > > show
> > > > > > > > > > > > > > > the API as it would appear on the Apache Kafka
> > > clients.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > This was meant as pseudo-code, but I changed it
> to
> > > Java.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > 3. "PushTelemetryRequest|Response - protocol
> > > request
> > > > > used
> > > > > > > by
> > > > > > > > > the
> > > > > > > > > > > > > client to
> > > > > > > > > > > > > > > send metrics to any broker it is connected to."
> > To
> > > be
> > > > > > > clear,
> > > > > > > > > this
> > > > > > > > > > > > means
> > > > > > > > > > > > > > > that the client can choose any of the connected
> > > brokers
> > > > > > and
> > > > > > > > > push to
> > > > > > > > > > > > > just
> > > > > > > > > > > > > > > one of them? What should a supporting client do
> > if
> > > it
> > > > > > gets
> > > > > > > an
> > > > > > > > > error
> > > > > > > > > > > > > when
> > > > > > > > > > > > > > > pushing metrics to a broker, retry sending to
> the
> > > same
> > > > > > > broker
> > > > > > > > > or
> > > > > > > > > > > try
> > > > > > > > > > > > > > > pushing to another broker, or drop the metrics?
> > > Should
> > > > > > > > > supporting
> > > > > > > > > > > > > clients
> > > > > > > > > > > > > > > send successive requests to a single broker, or
> > > round
> > > > > > > robin,
> > > > > > > > > or is
> > > > > > > > > > > > > that up
> > > > > > > > > > > > > > > to the client author? I'm guessing the
> behaviour
> > > should
> > > > > > be
> > > > > > > > > sticky
> > > > > > > > > > > to
> > > > > > > > > > > > > > > support the rate limiting features, but I think
> > it
> > > > > would
> > > > > > be
> > > > > > > > > good
> > > > > > > > > > > for
> > > > > > > > > > > > > client
> > > > > > > > > > > > > > > authors if this section were explicit on the
> > > > > recommended
> > > > > > > > > behaviour.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > You are right, I've updated the KIP to make this
> > > clearer.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > 4. "Mapping the client instance id to an actual
> > > > > > application
> > > > > > > > > > > instance
> > > > > > > > > > > > > > > running on a (virtual) machine can be done by
> > > > > inspecting
> > > > > > > the
> > > > > > > > > > > metrics
> > > > > > > > > > > > > > > resource labels, such as the client source
> > address
> > > and
> > > > > > > source
> > > > > > > > > port,
> > > > > > > > > > > > or
> > > > > > > > > > > > > > > security principal, all of which are added by
> the
> > > > > > receiving
> > > > > > > > > broker.
> > > > > > > > > > > > > This
> > > > > > > > > > > > > > > will allow the operator together with the user
> to
> > > > > > identify
> > > > > > > > the
> > > > > > > > > > > actual
> > > > > > > > > > > > > > > application instance." Is this really always
> > true?
> > > The
> > > > > > > source
> > > > > > > > > IP
> > > > > > > > > > > and
> > > > > > > > > > > > > port
> > > > > > > > > > > > > > > might be a loadbalancer/proxy in some setups.
> The
> > > > > > > principal,
> > > > > > > > as
> > > > > > > > > > > > already
> > > > > > > > > > > > > > > mentioned in the KIP, might be shared between
> > > multiple
> > > > > > > > > > > applications.
> > > > > > > > > > > > > So at
> > > > > > > > > > > > > > > worst the organization running the clients
> might
> > > have
> > > > > to
> > > > > > > > > consult
> > > > > > > > > > > the
> > > > > > > > > > > > > logs
> > > > > > > > > > > > > > > of a set of client applications, right?
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Yes, that's correct. There's no guaranteed mapping from
> > > > > > > > > > > > > > > client_instance_id to an actual instance. That's why the
> > > > > > > > > > > > > > > KIP recommends that client implementations log the client
> > > > > > > > > > > > > > > instance id upon retrieval, and also provide an API for
> > > > > > > > > > > > > > > the application to retrieve the instance id
> > > > > > > > > > > > > > > programmatically if it has a better way of exposing it.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > 5. "Tests indicate that a compression ratio up to
> > > 10x is
> > > > > > > > > possible for
> > > > > > > > > > > > the
> > > > > > > > > > > > > > > standard metrics." Client authors might
> > appreciate
> > > your
> > > > > > > > > mentioning
> > > > > > > > > > > > > which
> > > > > > > > > > > > > > > compression codec got these results.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Good point. Updated.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > 6. "Should the client send a push request prior
> > to
> > > > > expiry
> > > > > > > of
> > > > > > > > > the
> > > > > > > > > > > > > previously
> > > > > > > > > > > > > > > calculated PushIntervalMs the broker will
> discard
> > > the
> > > > > > > metrics
> > > > > > > > > and
> > > > > > > > > > > > > return a
> > > > > > > > > > > > > > > PushTelemetryResponse with the ErrorCode set to
> > > > > > > RateLimited."
> > > > > > > > > Is
> > > > > > > > > > > this
> > > > > > > > > > > > > > > RATE_LIMITED a new error code? It's not
> mentioned
> > > in
> > > > > the
> > > > > > > "New
> > > > > > > > > Error
> > > > > > > > > > > > > Codes"
> > > > > > > > > > > > > > > section.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > That's a leftover; it should be using the standard
> > > > > > > > > > > > > > > ThrottleTime mechanism. Fixed.
> > > > > > > > > > > > > >
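
A minimal sketch of the early-push case discussed in item 6, assuming the broker answers a push that arrives before PushIntervalMs has elapsed with a throttle time equal to the remaining wait. The thread only says the standard ThrottleTime mechanism is used; the function name, its parameters, and the way the throttle time is computed here are hypothetical.

```python
def throttle_time_ms(now_ms, last_push_ms, push_interval_ms):
    """Return how long (in ms) a client should wait before pushing again.

    Hypothetical broker-side helper: 0 means the push is accepted now.
    """
    if last_push_ms is None:
        # First push from this client instance: nothing to wait for.
        return 0
    elapsed_ms = now_ms - last_push_ms
    # Never return a negative wait once the interval has passed.
    return max(0, push_interval_ms - elapsed_ms)


# A push 6 s after the previous one, with a 30 s interval, is too early.
print(throttle_time_ms(now_ms=10_000, last_push_ms=4_000, push_interval_ms=30_000))
# A push after the full interval has passed is accepted immediately.
print(throttle_time_ms(now_ms=40_000, last_push_ms=4_000, push_interval_ms=30_000))
```
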
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > 7. In the section "Standard client resource
> > labels"
> > > > > > > > > application_id
> > > > > > > > > > > is
> > > > > > > > > > > > > > > described as Kafka Streams only, but the
> section
> > of
> > > > > > "Client
> > > > > > > > > > > > > Identification"
> > > > > > > > > > > > > > > talks about "application instance id as an
> > optional
> > > > > > future
> > > > > > > > > > > > nice-to-have
> > > > > > > > > > > > > > > that may be included as a metrics label if it
> has
> > > been
> > > > > > set
> > > > > > > by
> > > > > > > > > the
> > > > > > > > > > > > > user", so
> > > > > > > > > > > > > > > I'm confused whether non-Kafka Streams clients
> > > should
> > > > > set
> > > > > > > an
> > > > > > > > > > > > > application_id
> > > > > > > > > > > > > > > or not.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > I'll clarify this in the KIP, but basically we would need
> > > > > > > > > > > > > > > to add an `application.id` config property for non-streams
> > > > > > > > > > > > > > > clients for this purpose, and that's outside the scope of
> > > > > > > > > > > > > > > this KIP since we want to make it zero-conf:ish on the
> > > > > > > > > > > > > > > client side.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Kind regards,
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Tom
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Thanks for the review,
> > > > > > > > > > > > > > Magnus
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > On Thu, Oct 7, 2021 at 5:26 PM Magnus Edenhill
> <
> > > > > > > > > magnus@edenhill.se
> > > > > > > > > > > >
> > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Hi all,
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > I've updated the KIP following our recent discussions on
> > > > > > > > > > > > > > > > > the mailing list:
> > > > > > > > > > > > > > > > >  - split the protocol in two, one for getting the metrics
> > > > > > > > > > > > > > > > >    subscriptions, and one for pushing the metrics.
> > > > > > > > > > > > > > > > >  - simplifications: initially only one supported metrics
> > > > > > > > > > > > > > > > >    format, no client.id in the instance id, etc.
> > > > > > > > > > > > > > > > >  - made CLIENT_METRICS subscription configuration entries
> > > > > > > > > > > > > > > > >    more structured, allowing better client matching
> > > > > > > > > > > > > > > > >    selectors (not only on the instance id, but also the
> > > > > > > > > > > > > > > > >    other client resource labels, such as
> > > > > > > > > > > > > > > > >    client_software_name, etc.).
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Unless there are further comments I'll call the vote in a
> > > > > > > > > > > > > > > > > day or two.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Regards,
> > > > > > > > > > > > > > > > Magnus
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > On Mon, Oct 4, 2021 at 20:57, Magnus Edenhill
> > > > > > > > > > > > > > > > > <magnus@edenhill.se> wrote:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Hi Gwen,
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > I'm finishing up the KIP based on the last couple of
> > > > > > > > > > > > > > > > > > discussion points in this thread and will call the Vote
> > > > > > > > > > > > > > > > > > later this week.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Best,
> > > > > > > > > > > > > > > > > Magnus
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > On Sat, Oct 2, 2021 at 02:01, Gwen Shapira
> > > > > > > > > > > > > > > > > > <gwen@confluent.io.invalid> wrote:
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >> Hey,
> > > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > > >> I noticed that there was no discussion for
> > the
> > > > > last
> > > > > > 10
> > > > > > > > > days,
> > > > > > > > > > > > but I
> > > > > > > > > > > > > > > > >> couldn't
> > > > > > > > > > > > > > > > >> find the vote thread. Is there one that
> I'm
> > > > > missing?
> > > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > > >> Gwen
> > > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > > >> On Wed, Sep 22, 2021 at 4:58 AM Magnus
> > > Edenhill <
> > > > > > > > > > > > > magnus@edenhill.se>
> > > > > > > > > > > > > > > > >> wrote:
> > > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > > > >> > On Tue, Sep 21, 2021 at 06:58, Colin McCabe
> > > > > > > > > > > > > > > > > >> > <cmccabe@apache.org> wrote:
> > > > > > > > > > > > > > > > >> >
> > > > > > > > > > > > > > > > >> > > On Mon, Sep 20, 2021, at 17:35, Feng
> Min
> > > > > wrote:
> > > > > > > > > > > > > > > > >> > > > Thanks Magnus & Colin for the
> > > discussion.
> > > > > > > > > > > > > > > > >> > > >
> > > > > > > > > > > > > > > > >> > > > Based on KIP-714's stateless design,
> > > Client
> > > > > > can
> > > > > > > > > pretty
> > > > > > > > > > > > much
> > > > > > > > > > > > > use
> > > > > > > > > > > > > > > > any
> > > > > > > > > > > > > > > > >> > > > connection to any broker to send
> > > metrics. We
> > > > > > are
> > > > > > > > not
> > > > > > > > > > > > > associating
> > > > > > > > > > > > > > > > >> > > connection
> > > > > > > > > > > > > > > > >> > > > with client metric state. Is my
> > > > > understanding
> > > > > > > > > correct?
> > > > > > > > > > > If
> > > > > > > > > > > > > yes,
> > > > > > > > > > > > > > > > how
> > > > > > > > > > > > > > > > >> > about
> > > > > > > > > > > > > > > > >> > > > the following two scenarios
> > > > > > > > > > > > > > > > >> > > >
> > > > > > > > > > > > > > > > >> > > > 1) One Client (Client-ID) registers
> > two
> > > > > > > different
> > > > > > > > > client
> > > > > > > > > > > > > > > instance
> > > > > > > > > > > > > > > > id
> > > > > > > > > > > > > > > > >> > via
> > > > > > > > > > > > > > > > >> > > > separate registration. Is it
> > permitted?
> > > If
> > > > > OK,
> > > > > > > how
> > > > > > > > > to
> > > > > > > > > > > > > > > distinguish
> > > > > > > > > > > > > > > > >> them
> > > > > > > > > > > > > > > > >> > > from
> > > > > > > > > > > > > > > > >> > > > the case 2 below.
> > > > > > > > > > > > > > > > >> > > >
> > > > > > > > > > > > > > > > >> > >
> > > > > > > > > > > > > > > > >> > > Hi Feng,
> > > > > > > > > > > > > > > > >> > >
> > > > > > > > > > > > > > > > >> > > My understanding, which Magnus can
> > > clarify I
> > > > > > > guess,
> > > > > > > > is
> > > > > > > > > > > that
> > > > > > > > > > > > > you
> > > > > > > > > > > > > > > > could
> > > > > > > > > > > > > > > > >> > have
> > > > > > > > > > > > > > > > >> > > something like two Producer instances
> > > running
> > > > > > with
> > > > > > > > the
> > > > > > > > > > > same
> > > > > > > > > > > > > > > > client.id
> > > > > > > > > > > > > > > > >> > > (perhaps because they're using the
> same
> > > config
> > > > > > > file,
> > > > > > > > > for
> > > > > > > > > > > > > example).
> > > > > > > > > > > > > > > > >> They
> > > > > > > > > > > > > > > > >> > > could even be in the same process. But
> > > they
> > > > > > would
> > > > > > > > get
> > > > > > > > > > > > separate
> > > > > > > > > > > > > > > > UUIDs.
> > > > > > > > > > > > > > > > >> > >
> > > > > > > > > > > > > > > > >> > > I believe Magnus used the term client
> to
> > > mean
> > > > > > > > > "Producer or
> > > > > > > > > > > > > > > > Consumer".
> > > > > > > > > > > > > > > > >> So
> > > > > > > > > > > > > > > > >> > > if you have both a Producer and a
> > > Consumer in
> > > > > > your
> > > > > > > > > > > > > application I
> > > > > > > > > > > > > > > > would
> > > > > > > > > > > > > > > > >> > > expect you'd get separate UUIDs for
> > both.
> > > > > Again
> > > > > > > > > Magnus can
> > > > > > > > > > > > > chime
> > > > > > > > > > > > > > > in
> > > > > > > > > > > > > > > > >> > here, I
> > > > > > > > > > > > > > > > >> > > guess.
> > > > > > > > > > > > > > > > >> > >
> > > > > > > > > > > > > > > > >> >
> > > > > > > > > > > > > > > > >> > That's correct.
> > > > > > > > > > > > > > > > >> >
> > > > > > > > > > > > > > > > >> >
> > > > > > > > > > > > > > > > >> > >
> > > > > > > > > > > > > > > > >> > > > 2) How about the client restarting?
> > > What's
> > > > > the
> > > > > > > > > > > > expectation?
> > > > > > > > > > > > > > > Should
> > > > > > > > > > > > > > > > >> the
> > > > > > > > > > > > > > > > >> > > > server expect the client to carry a
> > > > > persisted
> > > > > > > > client
> > > > > > > > > > > > > instance id
> > > > > > > > > > > > > > > > or
> > > > > > > > > > > > > > > > >> > > should
> > > > > > > > > > > > > > > > >> > > > the client be treated as a new
> > instance?
> > > > > > > > > > > > > > > > >> > >
> > > > > > > > > > > > > > > > >> > > The KIP doesn't describe any mechanism
> > for
> > > > > > > > > persistence,
> > > > > > > > > > > so I
> > > > > > > > > > > > > would
> > > > > > > > > > > > > > > > >> assume
> > > > > > > > > > > > > > > > >> > > that when you restart the client you
> get
> > > a new
> > > > > > > > UUID. I
> > > > > > > > > > > agree
> > > > > > > > > > > > > that
> > > > > > > > > > > > > > > it
> > > > > > > > > > > > > > > > >> > would
> > > > > > > > > > > > > > > > >> > > be good to spell this out.
> > > > > > > > > > > > > > > > >> > >
> > > > > > > > > > > > > > > > >> > >
> > > > > > > > > > > > > > > > > >> > Right, it will not be persisted since a client instance
> > > > > > > > > > > > > > > > > >> > can't be restarted.
> > > > > > > > > > > > > > > > > >> >
> > > > > > > > > > > > > > > > > >> > Will update the KIP to make this clearer.
> > > > > > > > > > > > > > > > >> >
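
The behaviour agreed on above (two clients sharing a client.id still get separate UUIDs, and a restarted client gets a new one) can be sketched roughly as follows. The class name is hypothetical, and generating the UUID locally in the constructor is an assumption made for illustration; this exchange does not say which side assigns the id.

```python
import uuid


class ClientInstance:
    """Hypothetical stand-in for one Producer or Consumer instance."""

    def __init__(self, client_id):
        # client.id comes from configuration and may be shared.
        self.client_id = client_id
        # Assumption: a fresh random UUID per instance, never persisted.
        self.client_instance_id = uuid.uuid4()


# Two instances configured from the same file share a client.id ...
first = ClientInstance("orders-app")
second = ClientInstance("orders-app")
# ... and a "restart" is simply a new instance with a new UUID.
restarted = ClientInstance("orders-app")

print(first.client_id == second.client_id)
print(first.client_instance_id != second.client_instance_id)
print(first.client_instance_id != restarted.client_instance_id)
```
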
> > > > > > > > > > > > > > > > >> > /Magnus
> > > > > > > > > > > > > > > > >> >
> > > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > > >> --
> > > > > > > > > > > > > > > > >> Gwen Shapira
> > > > > > > > > > > > > > > > >> Engineering Manager | Confluent
> > > > > > > > > > > > > > > > >> 650.450.2760 | @gwenshap
> > > > > > > > > > > > > > > > >> Follow us: Twitter | blog
> > > > > > > > > > > > > > > > >>

Re: [DISCUSS] KIP-714: Client metrics and observability

Posted by Magnus Edenhill <ma...@edenhill.se>.
Hi all,

I've updated the KIP with responses to the latest comments: Java client
dependencies (Thanks Kirk!), alternate designs (separate cluster, separate
producer, etc), etc.

I will revive the vote thread.

Thanks,
Magnus


On Mon, Dec 13, 2021 at 22:32, Ryanne Dolan <ry...@gmail.com> wrote:

> I think we should be very careful about introducing new runtime
> dependencies into the clients. Historically this has been rare and
> essentially necessary (e.g. compression libs).
>
> Ryanne
>
> On Mon, Dec 13, 2021, 1:06 PM Kirk True <ki...@mustardgrain.com> wrote:
>
> > Hi Jun,
> >
> > On Thu, Dec 9, 2021, at 2:28 PM, Jun Rao wrote:
> > > 13. Using OpenTelemetry. Does that require runtime dependency
> > > on OpenTelemetry library? How good is the compatibility story
> > > of OpenTelemetry? This is important since an application could have
> > > other OpenTelemetry dependencies than the Kafka client.
> >
> > The current design is that the OpenTelemetry JARs would ship with the
> > client. Perhaps we can design the client such that the JARs aren't even
> > loaded if the user has opted out. The user could even exclude the JARs
> > from their dependencies if they so wished.
> >
> > I can't speak to the compatibility of the libraries. Is it possible that
> > we include a shaded version?
> >
> > Thanks,
> > Kirk
> >
> > >
> > > 14. The proposal listed idempotence=true. This is more of a
> > > configuration than a metric. Are we including that as a metric? What
> > > other configurations are we including? Should we separate the
> > > configurations from the metrics?
> > >
> > > Thanks,
> > >
> > > Jun
> > >
> > > On Mon, Nov 29, 2021 at 7:34 AM Magnus Edenhill <ma...@edenhill.se> wrote:
> > >
> > > > Hey Bob,
> > > >
> > > > That's a good point.
> > > >
> > > > Request type labels were considered, but since they're already
> > > > tracked by broker-side metrics they were left out to avoid metric
> > > > duplication. However, those broker-side metrics are not per
> > > > connection, so they won't be that useful in practice for
> > > > troubleshooting specific client instances.
> > > >
> > > > I'll add the request_type label to the relevant metrics.
> > > >
> > > > Thanks,
> > > > Magnus
> > > >
> > > >
> > > > On Tue, Nov 23, 2021 at 19:20, Bob Barrett
> > > > <bo...@confluent.io.invalid> wrote:
> > > >
> > > > > Hi Magnus,
> > > > >
> > > > > Thanks for the thorough KIP, this seems very useful.
> > > > >
> > > > > Would it make sense to include the request type as a label for the
> > > > > `client.request.success`, `client.request.errors` and
> > > > > `client.request.rtt` metrics? I think it would be very useful to see
> > > > > which specific requests are succeeding and failing for a client. One
> > > > > specific case I can think of where this could be useful is producer
> > > > > batch timeouts. If a Java application does not enable producer client
> > > > > logs (unfortunately, in my experience this happens more often than it
> > > > > should), the application logs will only contain the expiration error
> > > > > message, but no information about what is causing the timeout. The
> > > > > requests might all be succeeding but taking too long to process
> > > > > batches, or metadata requests might be failing, or some or all produce
> > > > > requests might be failing (if the bootstrap servers are reachable from
> > > > > the client but one or more other brokers are not, for example). If the
> > > > > cluster operator is able to identify the specific requests that are
> > > > > slow or failing for a client, they will be better able to diagnose the
> > > > > issue causing batch timeouts.
> > > > >
> > > > > One drawback I can think of is that this will increase the
> > > > > cardinality of the request metrics. But any given client is only
> > > > > going to use a small subset of the request types, and since we
> > > > > already have partition labels for the topic-level metrics, I think
> > > > > request labels will still make up a relatively small percentage of
> > > > > the set of metrics.
> > > > >
> > > > > Thanks,
> > > > > Bob
> > > > >
> > > > > On Mon, Nov 22, 2021 at 2:08 AM Viktor Somogyi-Vass
> > > > > <viktorsomogyi@gmail.com> wrote:
> > > > >
> > > > > > Hi Magnus,
> > > > > >
> > > > > > I think this is a very useful addition. We also have a similar (but
> > > > > > much more simplistic) implementation of this. Maybe I missed it in
> > > > > > the KIP, but what about adding metrics about the subscription cache
> > > > > > itself? I think that would improve its usability and debuggability,
> > > > > > as we'd be able to see its performance, hit/miss rates, eviction
> > > > > > counts and others.
> > > > > >
> > > > > > Best,
> > > > > > Viktor
> > > > > >
> > > > > > On Thu, Nov 18, 2021 at 5:12 PM Magnus Edenhill <magnus@edenhill.se>
> > > > > > wrote:
> > > > > >
> > > > > > > Hi Mickael,
> > > > > > >
> > > > > > > see inline.
> > > > > > >
> > > > > > > On Wed, Nov 10, 2021 at 15:21, Mickael Maison
> > > > > > > <mickael.maison@gmail.com> wrote:
> > > > > > >
> > > > > > > > Hi Magnus,
> > > > > > > >
> > > > > > > > I see you've addressed some of the points I raised above but
> > > > > > > > some (4, 5) have not been addressed yet.
> > > > > > > >
> > > > > > >
> > > > > > > Re 4) How will the user/app know metrics are being sent?
> > > > > > >
> > > > > > > One possibility is to add a JMX metric (thus for user consumption)
> > > > > > > for the number of metric pushes the client has performed, or
> > > > > > > perhaps the number of metrics subscriptions currently being
> > > > > > > collected.
> > > > > > > Would that be sufficient?
> > > > > > >
> > > > > > > Re 5) Metric sizes and rates
> > > > > > >
> > > > > > > A worst case scenario for a producer that is producing to 50
> > > > > > > unique topics and emitting all standard metrics yields a
> > > > > > > serialized size of around 100 KB prior to compression, which
> > > > > > > compresses down to about 20-30% of that depending on compression
> > > > > > > type and topic name uniqueness.
> > > > > > > The numbers for a consumer would be similar.
> > > > > > >
> > > > > > > In practice the number of unique topics would be far less, and the
> > > > > > > subscription set would typically be for a subset of metrics.
> > > > > > > So we're probably closer to 1 KB, or less, compressed size per
> > > > > > > client per push interval.
> > > > > > >
> > > > > > > As both the subscription set and push intervals are controlled by
> > > > > > > the cluster operator, it shouldn't be too hard to strike a good
> > > > > > > balance between metrics overhead and granularity.
> > > > > > >
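
As a rough illustration of the size estimate above, the sketch below derives the compressed payload range from the quoted figures (about 100 KB serialized, compressing down to 20-30% of that). The function names, the use of decimal kilobytes, and the 60-second push interval are assumptions made for the example; no push interval is given in this message.

```python
def compressed_size_range(uncompressed_bytes, low_pct=20, high_pct=30):
    """Return (low, high) compressed sizes in bytes for a percentage range."""
    low = uncompressed_bytes * low_pct // 100
    high = uncompressed_bytes * high_pct // 100
    return low, high


def bytes_per_second(payload_bytes, push_interval_ms):
    """Average per-client bandwidth when one payload is sent per interval."""
    return payload_bytes * 1000 / push_interval_ms


# Worst case quoted above: roughly 100 KB serialized before compression.
worst_low, worst_high = compressed_size_range(100_000)
print(worst_low, worst_high)

# Hypothetical 60 s push interval (not a value taken from this thread).
print(bytes_per_second(worst_high, 60_000))
```

With these inputs the worst case lands between 20 KB and 30 KB per push, well above the "closer to 1 KB" figure expected for a typical subscription.
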
> > > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > I'm really uneasy with this being enabled by default on the
> > client
> > > > > > > > side. When collecting data, I think the best practice is to
> > ensure
> > > > > > > > users are explicitly enabling it.
> > > > > > > >
> > > > > > >
> > > > > > > Requiring metrics to be explicitly enabled on clients severely
> > > > > > > cripples the feature's usability and value.
> > > > > > >
> > > > > > > One of the problems that this KIP aims to solve is for useful
> > > > > > > metrics to be available on demand regardless of the technical
> > > > > > > expertise of the user. As Ryanne points out, a savvy
> > > > > > > user/organization will typically have metrics collection and
> > > > > > > monitoring in place already, and the benefits of this KIP are then
> > > > > > > more of a common set and format of metrics across client
> > > > > > > implementations and languages.
> > > > > > > But that is not the typical Kafka user in my experience; they're
> > > > > > > not Kafka experts and they don't have the knowledge of how to best
> > > > > > > instrument their clients.
> > > > > > > Having metrics enabled by default for this user base allows the
> > > > > > > Kafka operators to proactively and reactively monitor and
> > > > > > > troubleshoot client issues, without the need for the less savvy
> > > > > > > user to do anything.
> > > > > > > It is often too late to tell a user to enable metrics when the
> > > > > > > problem has already occurred.
> > > > > > >
> > > > > > > Now, to be clear, even though metrics are enabled by default on
> > > > > > > clients, they are not enabled by default on the brokers; the Kafka
> > > > > > > operator needs to build and set up a metrics plugin and add
> > > > > > > metrics subscriptions before anything is sent from the client.
> > > > > > > It is opt-out on the clients and opt-in on the broker.
> > > > > > >
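
The opt-out/opt-in split described above amounts to a conjunction of three conditions. A minimal sketch, assuming hypothetical parameter names (none of them are actual Kafka configuration keys):

```python
def metrics_are_pushed(client_opted_out, broker_plugin_installed, subscriptions):
    """Return True only when every precondition described above holds."""
    if client_opted_out:
        # Client side is opt-out: the user explicitly disabled pushing.
        return False
    if not broker_plugin_installed:
        # Broker side is opt-in: no metrics plugin, nothing is requested.
        return False
    # The operator must also have added at least one metrics subscription.
    return len(subscriptions) > 0


# Default client against a broker with no plugin: nothing is sent.
print(metrics_are_pushed(False, False, []))
# Default client, plugin installed, one subscription: metrics flow.
print(metrics_are_pushed(False, True, ["client.request.rtt"]))
# Client opted out: nothing is sent regardless of broker setup.
print(metrics_are_pushed(True, True, ["client.request.rtt"]))
```
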
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > > You mentioned brokers already have some (most?) of the
> > > > > > > > information contained in metrics. If so, then why are we
> > > > > > > > collecting it again? Surely there must be some new information
> > > > > > > > in the client metrics.
> > > > > > > >
> > > > > > >
> > > > > > > From the user's perspective the Kafka infrastructure extends from
> > > > > > > producer.send() to messages being returned from consumer.poll(),
> > > > > > > a giant black box where there's a lot going on between those two
> > > > > > > points. The brokers currently only see what happens once those
> > > > > > > requests and messages hit the broker, but as Kafka clients are
> > > > > > > complex pieces of machinery there's a myriad of queues, timers,
> > > > > > > and state that's critical to the operation and infrastructure
> > > > > > > that's not currently visible to the operator.
> > > > > > > Relying on the user to provide this missing information
> > > > > > > accurately and in a timely manner is not generally feasible.
> > > > > > >
> > > > > > >
> > > > > > > Most of the standard metrics listed in the KIP are data points
> > > > > > > that the broker does not have.
> > > > > > > Only a small number of metrics are duplicates (like the request
> > > > > > > counts and sizes), but they are included to ease correlation when
> > > > > > > inspecting these client metrics.
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > > Moreover this is a brand new feature so it's even harder to
> > > > > > > > justify enabling it and forcing it onto all our users. If
> > > > > > > > disabled by default, it's relatively easy to enable in a new
> > > > > > > > release if we decide to, but once enabled by default it's much
> > > > > > > > harder to disable. Also this feature will apply to all future
> > > > > > > > metrics we will add.
> > > > > > > >
> > > > > > >
> > > > > > > I think maturity of a feature implementation should be the
> > > > > > > deciding factor, rather than the design of it (which this KIP
> > > > > > > is). I.e., if the implementation is not deemed mature enough for
> > > > > > > release X.Y it will be disabled.
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > > Overall I think it's an interesting feature but I'd prefer to
> > be
> > > > > > > > slightly defensive and see how it works in practice before
> > enabling
> > > > > it
> > > > > > > > everywhere.
> > > > > > > >
> > > > > > >
> > > > > > > Right, and I agree on being defensive, but since this feature
> > still
> > > > > > > requires manual
> > > > > > > enabling on the brokers before actually being used, I think
> that
> > > > gives
> > > > > > > enough control
> > > > > > > to opt-in or out of this feature as needed.
> > > > > > >
> > > > > > > Thanks for your comments!
> > > > > > >
> > > > > > > Regards,
> > > > > > > Magnus
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Mickael
> > > > > > > >
> > > > > > > > On Mon, Nov 8, 2021 at 7:55 AM Magnus Edenhill <
> > magnus@edenhill.se
> > > > >
> > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > Thanks David for pointing this out,
> > > > > > > > > I've updated the KIP to include client_id as a matching
> > selector.
> > > > > > > > >
> > > > > > > > > Regards,
> > > > > > > > > Magnus
> > > > > > > > >
> > > > > > > > > > On Thu, Nov 4, 2021 at 18:01, David Mao
> > > > > > > > > > <dmao@confluent.io.invalid> wrote:
> > > > > > > > >
> > > > > > > > > > Hey Magnus,
> > > > > > > > > >
> > > > > > > > > > I noticed that the KIP outlines the initial selectors
> > supported
> > > > > as:
> > > > > > > > > >
> > > > > > > > > >    - client_instance_id - CLIENT_INSTANCE_ID UUID string
> > > > > > > > representation.
> > > > > > > > > >    - client_software_name  - client software
> implementation
> > > > name.
> > > > > > > > > >    - client_software_version  - client software
> > implementation
> > > > > > > version.
> > > > > > > > > >
> > > > > > > > > > In the given reactive monitoring workflow, we mention
> that
> > the
> > > > > > > > application
> > > > > > > > > > user does not know their client's client instance ID, but
> > it's
> > > > > > > outlined
> > > > > > > > > > that the operator can add a metrics subscription
> selecting
> > for
> > > > > > > > clientId. I
> > > > > > > > > > don't see clientId as one of the supported selectors.
> > > > > > > > > > I can see how this would have made sense in a previous
> > > > iteration
> > > > > > > given
> > > > > > > > that
> > > > > > > > > > the previous client instance ID proposal was to construct
> > the
> > > > > > client
> > > > > > > > > > instance ID using clientId as a prefix. Now that the
> client
> > > > > > instance
> > > > > > > > ID is
> > > > > > > > > > a UUID, would we want to add clientId as a supported
> > selector?
> > > > > > > > > > Let me know what you think.
> > > > > > > > > >
> > > > > > > > > > David
> > > > > > > > > >
> > > > > > > > > > On Tue, Oct 19, 2021 at 12:33 PM Magnus Edenhill <
> > > > > > magnus@edenhill.se
> > > > > > > >
> > > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > Hi Mickael!
> > > > > > > > > > >
> > > > > > > > > > > > On Tue, Oct 19, 2021 at 19:30, Mickael Maison
> > > > > > > > > > > > <mickael.maison@gmail.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Hi Magnus,
> > > > > > > > > > > >
> > > > > > > > > > > > Thanks for the proposal.
> > > > > > > > > > > >
> > > > > > > > > > > > 1. Looking at the protocol section, isn't
> > > > "ClientInstanceId"
> > > > > > > > expected
> > > > > > > > > > > > to be a field in GetTelemetrySubscriptionsResponseV0?
> > > > > > Otherwise,
> > > > > > > > how
> > > > > > > > > > > > does a client retrieve this value?
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > Good catch, it got removed by mistake in one of the
> > edits.
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > 2. In the client API section, you mention a new
> method
> > > > > > > > > > > > "clientInstanceId()". Can you clarify which
> interfaces
> > are
> > > > > > > > affected?
> > > > > > > > > > > > Is it only Consumer and Producer?
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > And Admin. Will update the KIP.
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > > 3. I'm a bit concerned this is enabled by default.
> > Even if
> > > > > the
> > > > > > > data
> > > > > > > > > > > > collected is supposed to be not sensitive, I think
> > this can
> > > > > be
> > > > > > > > > > > > problematic in some environments. Also users don't
> > seem to
> > > > > have
> > > > > > > the
> > > > > > > > > > > > choice to only expose some metrics. Knowing how much
> > data
> > > > > > transits
> > > > > > > > > > > > through some applications can be considered critical.
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > The broker already knows how much data transits through
> > the
> > > > > > client
> > > > > > > > > > though,
> > > > > > > > > > > right?
> > > > > > > > > > > Care has been taken not to expose information in the
> > standard
> > > > > > > metrics
> > > > > > > > > > that
> > > > > > > > > > > might
> > > > > > > > > > > reveal sensitive information.
> > > > > > > > > > >
> > > > > > > > > > > Do you have an example of how the proposed metrics
> could
> > leak
> > > > > > > > sensitive
> > > > > > > > > > > information?
> > > > > > > > > > > > As for limiting which metrics to export, I guess
> that
> > > > could
> > > > > > make
> > > > > > > > sense
> > > > > > > > > > > in some
> > > > > > > > > > > very sensitive use-cases, but those users might disable
> > > > metrics
> > > > > > > > > > altogether
> > > > > > > > > > > for now.
> > > > > > > > > > > Could these concerns be addressed by a later KIP?
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > 4. As a user, how do you know if your application is
> > > > actively
> > > > > > > > sending
> > > > > > > > > > > > metrics? Are there new metrics exposing what's going
> > on,
> > > > like
> > > > > > how
> > > > > > > > much
> > > > > > > > > > > > data is being sent?
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > That's a good question.
> > > > > > > > > > > > Since the proposed metrics interface is not aimed at, or
> > > > > > > > > > > > directly available to, the application, I guess there's
> > > > > > > > > > > > little point in adding it here; should we instead add
> > > > > > > > > > > > something to the existing JMX metrics?
> > > > > > > > > > > Do you have any suggestions?
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > > 5. If all metrics are enabled on a regular Consumer
> or
> > > > > > Producer,
> > > > > > > do
> > > > > > > > > > > > you have an idea how much throughput this would use?
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > > It depends on the number of partitions/topics/etc. the
> > client
> > > > is
> > > > > > > > producing
> > > > > > > > > > > to/consuming from.
> > > > > > > > > > > I'll add some sizes to the KIP for some typical
> > use-cases.
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > Thanks,
> > > > > > > > > > > Magnus
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > > Thanks
> > > > > > > > > > > >
> > > > > > > > > > > > On Tue, Oct 19, 2021 at 5:06 PM Magnus Edenhill <
> > > > > > > > magnus@edenhill.se>
> > > > > > > > > > > > wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > On Tue, Oct 19, 2021 at 13:22, Tom Bentley
> > > > > > > > > > > > > > <tbentley@redhat.com> wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > Hi Magnus,
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > I reviewed the KIP since you called the vote
> > (sorry for
> > > > > not
> > > > > > > > > > reviewing
> > > > > > > > > > > > when
> > > > > > > > > > > > > > you announced your intention to call the vote). I
> > have
> > > > a
> > > > > > few
> > > > > > > > > > > questions
> > > > > > > > > > > > on
> > > > > > > > > > > > > > some of the details.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > 1. There's no Javadoc on
> > ClientTelemetryPayload.data(),
> > > > > so
> > > > > > I
> > > > > > > > don't
> > > > > > > > > > > know
> > > > > > > > > > > > > > whether the payload is exposed through this
> method
> > as
> > > > > > > > compressed or
> > > > > > > > > > > > not.
> > > > > > > > > > > > > > Later on you say "Decompression of the payloads
> > will be
> > > > > > > > handled by
> > > > > > > > > > > the
> > > > > > > > > > > > > > broker metrics plugin, the broker should expose a
> > > > > suitable
> > > > > > > > > > > > decompression
> > > > > > > > > > > > > > API to the metrics plugin for this purpose.",
> which
> > > > > > suggests
> > > > > > > > it's
> > > > > > > > > > the
> > > > > > > > > > > > > > compressed data in the buffer, but then we don't
> > know
> > > > > which
> > > > > > > > codec
> > > > > > > > > > was
> > > > > > > > > > > > used,
> > > > > > > > > > > > > > nor the API via which the plugin should
> decompress
> > it
> > > > if
> > > > > > > > required
> > > > > > > > > > for
> > > > > > > > > > > > > > forwarding to the ultimate metrics store. Should
> > the
> > > > > > > > > > > > ClientTelemetryPayload
> > > > > > > > > > > > > > expose a method to get the compression and a
> > > > > decompressor?
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > Good point, updated.
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > > 2. The client-side API is expressed as
> > StringOrError
> > > > > > > > > > > > > > ClientInstance::ClientInstanceId(int
> timeout_ms). I
> > > > > > > understand
> > > > > > > > that
> > > > > > > > > > > > you're
> > > > > > > > > > > > > > thinking about the librdkafka implementation, but
> > it
> > > > > would
> > > > > > be
> > > > > > > > good
> > > > > > > > > > to
> > > > > > > > > > > > show
> > > > > > > > > > > > > > the API as it would appear on the Apache Kafka
> > clients.
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > This was meant as pseudo-code, but I changed it to
> > Java.
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > > 3. "PushTelemetryRequest|Response - protocol
> > request
> > > > used
> > > > > > by
> > > > > > > > the
> > > > > > > > > > > > client to
> > > > > > > > > > > > > > send metrics to any broker it is connected to."
> To
> > be
> > > > > > clear,
> > > > > > > > this
> > > > > > > > > > > means
> > > > > > > > > > > > > > that the client can choose any of the connected
> > brokers
> > > > > and
> > > > > > > > push to
> > > > > > > > > > > > just
> > > > > > > > > > > > > > one of them? What should a supporting client do
> if
> > it
> > > > > gets
> > > > > > an
> > > > > > > > error
> > > > > > > > > > > > when
> > > > > > > > > > > > > > pushing metrics to a broker, retry sending to the
> > same
> > > > > > broker
> > > > > > > > or
> > > > > > > > > > try
> > > > > > > > > > > > > > pushing to another broker, or drop the metrics?
> > Should
> > > > > > > > supporting
> > > > > > > > > > > > clients
> > > > > > > > > > > > > > send successive requests to a single broker, or
> > round
> > > > > > robin,
> > > > > > > > or is
> > > > > > > > > > > > that up
> > > > > > > > > > > > > > to the client author? I'm guessing the behaviour
> > should
> > > > > be
> > > > > > > > sticky
> > > > > > > > > > to
> > > > > > > > > > > > > > support the rate limiting features, but I think
> it
> > > > would
> > > > > be
> > > > > > > > good
> > > > > > > > > > for
> > > > > > > > > > > > client
> > > > > > > > > > > > > > authors if this section were explicit on the
> > > > recommended
> > > > > > > > behaviour.
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > You are right, I've updated the KIP to make this
> > clearer.
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > > 4. "Mapping the client instance id to an actual
> > > > > application
> > > > > > > > > > instance
> > > > > > > > > > > > > > running on a (virtual) machine can be done by
> > > > inspecting
> > > > > > the
> > > > > > > > > > metrics
> > > > > > > > > > > > > > resource labels, such as the client source
> address
> > and
> > > > > > source
> > > > > > > > port,
> > > > > > > > > > > or
> > > > > > > > > > > > > > security principal, all of which are added by the
> > > > > receiving
> > > > > > > > broker.
> > > > > > > > > > > > This
> > > > > > > > > > > > > > will allow the operator together with the user to
> > > > > identify
> > > > > > > the
> > > > > > > > > > actual
> > > > > > > > > > > > > > application instance." Is this really always
> true?
> > The
> > > > > > source
> > > > > > > > IP
> > > > > > > > > > and
> > > > > > > > > > > > port
> > > > > > > > > > > > > > might be a loadbalancer/proxy in some setups. The
> > > > > > principal,
> > > > > > > as
> > > > > > > > > > > already
> > > > > > > > > > > > > > mentioned in the KIP, might be shared between
> > multiple
> > > > > > > > > > applications.
> > > > > > > > > > > > So at
> > > > > > > > > > > > > > worst the organization running the clients might
> > have
> > > > to
> > > > > > > > consult
> > > > > > > > > > the
> > > > > > > > > > > > logs
> > > > > > > > > > > > > > of a set of client applications, right?
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > Yes, that's correct. There's no guaranteed mapping
> > from
> > > > > > > > > > > > client_instance_id
> > > > > > > > > > > > > to
> > > > > > > > > > > > > an actual instance, that's why the KIP recommends
> > client
> > > > > > > > > > > implementations
> > > > > > > > > > > > to
> > > > > > > > > > > > > log the client instance id
> > > > > > > > > > > > > upon retrieval, and also provide an API for the
> > > > application
> > > > > > to
> > > > > > > > > > retrieve
> > > > > > > > > > > > the
> > > > > > > > > > > > > instance id programmatically
> > > > > > > > > > > > > if it has a better way of exposing it.
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > > > 5. "Tests indicate that a compression ratio up to
> > 10x is
> > > > > > > > possible for
> > > > > > > > > > > the
> > > > > > > > > > > > > > standard metrics." Client authors might
> appreciate
> > your
> > > > > > > > mentioning
> > > > > > > > > > > > which
> > > > > > > > > > > > > > compression codec got these results.
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > Good point. Updated.
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > > 6. "Should the client send a push request prior
> to
> > > > expiry
> > > > > > of
> > > > > > > > the
> > > > > > > > > > > > previously
> > > > > > > > > > > > > > calculated PushIntervalMs the broker will discard
> > the
> > > > > > metrics
> > > > > > > > and
> > > > > > > > > > > > return a
> > > > > > > > > > > > > > PushTelemetryResponse with the ErrorCode set to
> > > > > > RateLimited."
> > > > > > > > Is
> > > > > > > > > > this
> > > > > > > > > > > > > > RATE_LIMITED a new error code? It's not mentioned
> > in
> > > > the
> > > > > > "New
> > > > > > > > Error
> > > > > > > > > > > > Codes"
> > > > > > > > > > > > > > section.
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > That's a leftover, it should be using the standard
> > > > > > ThrottleTime
> > > > > > > > > > > > mechanism.
> > > > > > > > > > > > > Fixed.
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > > 7. In the section "Standard client resource
> labels"
> > > > > > > > application_id
> > > > > > > > > > is
> > > > > > > > > > > > > > described as Kafka Streams only, but the section
> of
> > > > > "Client
> > > > > > > > > > > > Identification"
> > > > > > > > > > > > > > talks about "application instance id as an
> optional
> > > > > future
> > > > > > > > > > > nice-to-have
> > > > > > > > > > > > > > that may be included as a metrics label if it has
> > been
> > > > > set
> > > > > > by
> > > > > > > > the
> > > > > > > > > > > > user", so
> > > > > > > > > > > > > > I'm confused whether non-Kafka Streams clients
> > should
> > > > set
> > > > > > an
> > > > > > > > > > > > application_id
> > > > > > > > > > > > > > or not.
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > I'll clarify this in the KIP, but basically we
> would
> > need
> > > > > to
> > > > > > > add
> > > > > > > > an `
> > > > > > > > > > > > > application.id` config
> > > > > > > > > > > > > property for non-streams clients for this purpose,
> > and
> > > > > that's
> > > > > > > > outside
> > > > > > > > > > > the
> > > > > > > > > > > > > scope of this KIP since we want to make it
> > zero-conf:ish
> > > > on
> > > > > > the
> > > > > > > > > > client
> > > > > > > > > > > > side.
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Kind regards,
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Tom
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > Thanks for the review,
> > > > > > > > > > > > > Magnus
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Thu, Oct 7, 2021 at 5:26 PM Magnus Edenhill <
> > > > > > > > magnus@edenhill.se
> > > > > > > > > > >
> > > > > > > > > > > > wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Hi all,
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > I've updated the KIP following our recent
> > discussions
> > > > > on
> > > > > > > the
> > > > > > > > > > > mailing
> > > > > > > > > > > > > > list:
> > > > > > > > > > > > > > >  - split the protocol in two, one for getting
> the
> > > > > metrics
> > > > > > > > > > > > subscriptions,
> > > > > > > > > > > > > > > and one for pushing the metrics.
> > > > > > > > > > > > > > >  - simplifications: initially only one
> supported
> > > > > metrics
> > > > > > > > format,
> > > > > > > > > > no
> > > > > > > > > > > > > > > client.id in the instance id, etc.
> > > > > > > > > > > > > > >  - made CLIENT_METRICS subscription
> configuration
> > > > > entries
> > > > > > > > more
> > > > > > > > > > > > structured
> > > > > > > > > > > > > > >    and allowing better client matching
> selectors
> > (not
> > > > > > only
> > > > > > > > on the
> > > > > > > > > > > > > > instance
> > > > > > > > > > > > > > > id, but also the other
> > > > > > > > > > > > > > >    client resource labels, such as
> > > > > client_software_name,
> > > > > > > > etc.).
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Unless there are further comments I'll call the
> > vote
> > > > > in a
> > > > > > > > day or
> > > > > > > > > > > two.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Regards,
> > > > > > > > > > > > > > > Magnus
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > On Mon, Oct 4, 2021 at 20:57, Magnus Edenhill
> > > > > > > > > > > > > > > > <magnus@edenhill.se> wrote:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Hi Gwen,
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > I'm finishing up the KIP based on the last
> > couple
> > > > of
> > > > > > > > discussion
> > > > > > > > > > > > points
> > > > > > > > > > > > > > in
> > > > > > > > > > > > > > > > this thread
> > > > > > > > > > > > > > > > and will call the Vote later this week.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Best,
> > > > > > > > > > > > > > > > Magnus
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > On Sat, Oct 2, 2021 at 02:01, Gwen Shapira
> > > > > > > > > > > > > > > > > <gwen@confluent.io.invalid> wrote:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >> Hey,
> > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > >> I noticed that there was no discussion for
> the
> > > > last
> > > > > 10
> > > > > > > > days,
> > > > > > > > > > > but I
> > > > > > > > > > > > > > > >> couldn't
> > > > > > > > > > > > > > > >> find the vote thread. Is there one that I'm
> > > > missing?
> > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > >> Gwen
> > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > >> On Wed, Sep 22, 2021 at 4:58 AM Magnus
> > Edenhill <
> > > > > > > > > > > > magnus@edenhill.se>
> > > > > > > > > > > > > > > >> wrote:
> > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > > >> > On Tue, Sep 21, 2021 at 06:58, Colin McCabe
> > > > > > > > > > > > > > > > >> > <cmccabe@apache.org> wrote:
> > > > > > > > > > > > > > > >> >
> > > > > > > > > > > > > > > >> > > On Mon, Sep 20, 2021, at 17:35, Feng Min
> > > > wrote:
> > > > > > > > > > > > > > > >> > > > Thanks Magnus & Colin for the
> > discussion.
> > > > > > > > > > > > > > > >> > > >
> > > > > > > > > > > > > > > >> > > > Based on KIP-714's stateless design,
> > Client
> > > > > can
> > > > > > > > pretty
> > > > > > > > > > > much
> > > > > > > > > > > > use
> > > > > > > > > > > > > > > any
> > > > > > > > > > > > > > > >> > > > connection to any broker to send
> > metrics. We
> > > > > are
> > > > > > > not
> > > > > > > > > > > > associating
> > > > > > > > > > > > > > > >> > > connection
> > > > > > > > > > > > > > > >> > > > with client metric state. Is my
> > > > understanding
> > > > > > > > correct?
> > > > > > > > > > If
> > > > > > > > > > > > yes,
> > > > > > > > > > > > > > > how
> > > > > > > > > > > > > > > >> > about
> > > > > > > > > > > > > > > >> > > > the following two scenarios
> > > > > > > > > > > > > > > >> > > >
> > > > > > > > > > > > > > > >> > > > 1) One Client (Client-ID) registers
> two
> > > > > > different
> > > > > > > > client
> > > > > > > > > > > > > > instance
> > > > > > > > > > > > > > > id
> > > > > > > > > > > > > > > >> > via
> > > > > > > > > > > > > > > >> > > > separate registration. Is it
> permitted?
> > If
> > > > OK,
> > > > > > how
> > > > > > > > to
> > > > > > > > > > > > > > distinguish
> > > > > > > > > > > > > > > >> them
> > > > > > > > > > > > > > > >> > > from
> > > > > > > > > > > > > > > >> > > > the case 2 below.
> > > > > > > > > > > > > > > >> > > >
> > > > > > > > > > > > > > > >> > >
> > > > > > > > > > > > > > > >> > > Hi Feng,
> > > > > > > > > > > > > > > >> > >
> > > > > > > > > > > > > > > >> > > My understanding, which Magnus can
> > clarify I
> > > > > > guess,
> > > > > > > is
> > > > > > > > > > that
> > > > > > > > > > > > you
> > > > > > > > > > > > > > > could
> > > > > > > > > > > > > > > >> > have
> > > > > > > > > > > > > > > >> > > something like two Producer instances
> > running
> > > > > with
> > > > > > > the
> > > > > > > > > > same
> > > > > > > > > > > > > > > client.id
> > > > > > > > > > > > > > > >> > > (perhaps because they're using the same
> > config
> > > > > > file,
> > > > > > > > for
> > > > > > > > > > > > example).
> > > > > > > > > > > > > > > >> They
> > > > > > > > > > > > > > > >> > > could even be in the same process. But
> > they
> > > > > would
> > > > > > > get
> > > > > > > > > > > separate
> > > > > > > > > > > > > > > UUIDs.
> > > > > > > > > > > > > > > >> > >
> > > > > > > > > > > > > > > >> > > I believe Magnus used the term client to
> > mean
> > > > > > > > "Producer or
> > > > > > > > > > > > > > > Consumer".
> > > > > > > > > > > > > > > >> So
> > > > > > > > > > > > > > > >> > > if you have both a Producer and a
> > Consumer in
> > > > > your
> > > > > > > > > > > > application I
> > > > > > > > > > > > > > > would
> > > > > > > > > > > > > > > >> > > expect you'd get separate UUIDs for
> both.
> > > > Again
> > > > > > > > Magnus can
> > > > > > > > > > > > chime
> > > > > > > > > > > > > > in
> > > > > > > > > > > > > > > >> > here, I
> > > > > > > > > > > > > > > >> > > guess.
> > > > > > > > > > > > > > > >> > >
> > > > > > > > > > > > > > > >> >
> > > > > > > > > > > > > > > >> > That's correct.
> > > > > > > > > > > > > > > >> >
> > > > > > > > > > > > > > > >> >
> > > > > > > > > > > > > > > >> > >
> > > > > > > > > > > > > > > >> > > > 2) How about the client restarting?
> > What's
> > > > the
> > > > > > > > > > > expectation?
> > > > > > > > > > > > > > Should
> > > > > > > > > > > > > > > >> the
> > > > > > > > > > > > > > > >> > > > server expect the client to carry a
> > > > persisted
> > > > > > > client
> > > > > > > > > > > > instance id
> > > > > > > > > > > > > > > or
> > > > > > > > > > > > > > > >> > > should
> > > > > > > > > > > > > > > >> > > > the client be treated as a new
> instance?
> > > > > > > > > > > > > > > >> > >
> > > > > > > > > > > > > > > >> > > The KIP doesn't describe any mechanism
> for
> > > > > > > > persistence,
> > > > > > > > > > so I
> > > > > > > > > > > > would
> > > > > > > > > > > > > > > >> assume
> > > > > > > > > > > > > > > >> > > that when you restart the client you get
> > a new
> > > > > > > UUID. I
> > > > > > > > > > agree
> > > > > > > > > > > > that
> > > > > > > > > > > > > > it
> > > > > > > > > > > > > > > >> > would
> > > > > > > > > > > > > > > >> > > be good to spell this out.
> > > > > > > > > > > > > > > >> > >
> > > > > > > > > > > > > > > >> > >
> > > > > > > > > > > > > > > >> > Right, it will not be persisted since a
> > client
> > > > > > > instance
> > > > > > > > > > can't
> > > > > > > > > > > be
> > > > > > > > > > > > > > > >> restarted.
> > > > > > > > > > > > > > > >> >
> > > > > > > > > > > > > > > >> > Will update the KIP to make this clearer.
> > > > > > > > > > > > > > > >> >
> > > > > > > > > > > > > > > >> > /Magnus
> > > > > > > > > > > > > > > >> >
> > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > >> --
> > > > > > > > > > > > > > > >> Gwen Shapira
> > > > > > > > > > > > > > > >> Engineering Manager | Confluent
> > > > > > > > > > > > > > > >> 650.450.2760 | @gwenshap
> > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] KIP-714: Client metrics and observability

Posted by Ryanne Dolan <ry...@gmail.com>.
I think we should be very careful about introducing new runtime
dependencies into the clients. Historically this has been rare and
essentially necessary (e.g. compression libs).

Ryanne

On Mon, Dec 13, 2021, 1:06 PM Kirk True <ki...@mustardgrain.com> wrote:

> Hi Jun,
>
> On Thu, Dec 9, 2021, at 2:28 PM, Jun Rao wrote:
> > 13. Using OpenTelemetry. Does that require runtime dependency
> > on OpenTelemetry library? How good is the compatibility story
> > of OpenTelemetry? This is important since an application could have other
> > OpenTelemetry dependencies than the Kafka client.
>
> The current design is that the OpenTelemetry JARs would ship with the
> client. Perhaps we can design the client such that the JARs aren't even
> loaded if the user has opted out. The user could even exclude the JARs from
> their dependencies if they so wished.
>
> I can't speak to the compatibility of the libraries. Is it possible that
> we include a shaded version?
>
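One way to read the "JARs aren't even loaded" idea is a guarded, reflective lookup. The sketch below is only an illustration under stated assumptions: the helper class and its method are hypothetical, and the OpenTelemetry class name is used purely as an example of what a client might probe for.

```java
/** Hypothetical helper: decides whether an optional telemetry backend can be used. */
final class OptionalTelemetryLoader {

    /**
     * Returns true only if telemetry is enabled AND the named class is on the classpath.
     * When telemetry is disabled the lookup is skipped, so the optional library is never touched.
     */
    static boolean isAvailable(boolean telemetryEnabled, String className) {
        if (!telemetryEnabled) {
            return false; // user opted out: do not probe the optional dependency
        }
        try {
            // initialize=false: locate the class without running its static initializers
            Class.forName(className, false, OptionalTelemetryLoader.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false; // JAR was excluded from the application's dependencies
        }
    }

    public static void main(String[] args) {
        // Example class name; whether it is present depends on the application's classpath.
        System.out.println(isAvailable(true, "io.opentelemetry.api.OpenTelemetry"));
        System.out.println(isAvailable(false, "java.lang.String"));
    }
}
```

This does not address version conflicts with other OpenTelemetry users in the same application, which is the case shading would be meant to cover.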
> Thanks,
> Kirk
>
> >
> > 14. The proposal listed idempotence=true. This is more of a configuration
> > than a metric. Are we including that as a metric? What other
> configurations
> > are we including? Should we separate the configurations from the metrics?
> >
> > Thanks,
> >
> > Jun
> >
> > On Mon, Nov 29, 2021 at 7:34 AM Magnus Edenhill <ma...@edenhill.se>
> wrote:
> >
> > > Hey Bob,
> > >
> > > That's a good point.
> > >
> > > Request type labels were considered but since they're already tracked
> by
> > > broker-side metrics
> > > > they were left out to avoid metric duplication. However, those
> metrics
> > > are not per connection,
> > > so they won't be that useful in practice for troubleshooting specific
> > > client instances.
> > >
> > > I'll add the request_type label to the relevant metrics.
> > >
> > > Thanks,
> > > Magnus
> > >
> > >
> > > > On Tue, Nov 23, 2021 at 19:20, Bob Barrett
> > > > <bo...@confluent.io.invalid> wrote:
> > >
> > > > Hi Magnus,
> > > >
> > > > Thanks for the thorough KIP, this seems very useful.
> > > >
> > > > Would it make sense to include the request type as a label for the
> > > > `client.request.success`, `client.request.errors` and
> > > `client.request.rtt`
> > > > metrics? I think it would be very useful to see which specific
> requests
> > > are
> > > > succeeding and failing for a client. One specific case I can think of
> > > where
> > > > this could be useful is producer batch timeouts. If a Java
> application
> > > does
> > > > not enable producer client logs (unfortunately, in my experience this
> > > > happens more often than it should), the application logs will only
> > > contain
> > > > the expiration error message, but no information about what is
> causing
> > > the
> > > > timeout. The requests might all be succeeding but taking too long to
> > > > process batches, or metadata requests might be failing, or some or
> all
> > > > produce requests might be failing (if the bootstrap servers are
> reachable
> > > > from the client but one or more other brokers are not, for example).
> If
> > > the
> > > > cluster operator is able to identify the specific requests that are
> slow
> > > or
> > > > failing for a client, they will be better able to diagnose the issue
> > > > causing batch timeouts.
> > > >
> > > > One drawback I can think of is that this will increase the
> cardinality of
> > > > the request metrics. But any given client is only going to use a
> small
> > > > subset of the request types, and since we already have partition
> labels
> > > for
> > > > the topic-level metrics, I think request labels will still make up a
> > > > relatively small percentage of the set of metrics.
> > > >
> > > > Thanks,
> > > > Bob
> > > >
> > > > On Mon, Nov 22, 2021 at 2:08 AM Viktor Somogyi-Vass <
> > > > viktorsomogyi@gmail.com>
> > > > wrote:
> > > >
> > > > > Hi Magnus,
> > > > >
> > > > > I think this is a very useful addition. We also have a similar (but
> > > much
> > > > > more simplistic) implementation of this. Maybe I missed it in the
> KIP
> > > but
> > > > > what about adding metrics about the subscription cache itself?
> That I
> > > > think
> > > > > would improve its usability and debuggability as we'd be able to
> see
> > > its
> > > > > performance, hit/miss rates, eviction counts and others.
> > > > >
> > > > > Best,
> > > > > Viktor
> > > > >
> > > > > On Thu, Nov 18, 2021 at 5:12 PM Magnus Edenhill <
> magnus@edenhill.se>
> > > > > wrote:
> > > > >
> > > > > > Hi Mickael,
> > > > > >
> > > > > > see inline.
> > > > > >
> > > > > > > On Wed, Nov 10, 2021 at 15:21, Mickael Maison
> > > > > > > <mickael.maison@gmail.com> wrote:
> > > > > >
> > > > > > > Hi Magnus,
> > > > > > >
> > > > > > > I see you've addressed some of the points I raised above but
> some
> > > (4,
> > > > > > > 5) have not been addressed yet.
> > > > > > >
> > > > > >
> > > > > > Re 4) How will the user/app know metrics are being sent.
> > > > > >
> > > > > > One possibility is to add a JMX metric (thus for user
> consumption)
> > > for
> > > > > the
> > > > > > number of metric pushes the
> > > > > > client has performed, or perhaps the number of metrics
> subscriptions
> > > > > > currently being collected.
> > > > > > Would that be sufficient?
> > > > > >
> > > > > > Re 5) Metric sizes and rates
> > > > > >
> > > > > > > A worst case scenario for a producer that is producing to 50 unique
> > > > > > > topics and emitting all standard metrics yields a serialized size of
> > > > > > > around 100 KB prior to compression, which compresses down to about
> > > > > > > 20-30% of that depending on compression type and topic name uniqueness.
> > > > > > > The numbers for a consumer would be similar.
> > > > > > >
> > > > > > > In practice the number of unique topics would be far lower, and the
> > > > > > > subscription set would typically be for a subset of metrics.
> > > > > > > So we're probably closer to 1 KB, or less, compressed size per client
> > > > > > > per push interval.
> > > > > > >
> > > > > > > As both the subscription set and push intervals are controlled by the
> > > > > > > cluster operator, it shouldn't be too hard to strike a good balance
> > > > > > > between metrics overhead and granularity.
> > > > > >
> > > > > >
> > > > > >
> > > > > > >
> > > > > > > I'm really uneasy with this being enabled by default on the
> client
> > > > > > > side. When collecting data, I think the best practice is to
> ensure
> > > > > > > users are explicitly enabling it.
> > > > > > >
> > > > > >
> > > > > > > Requiring metrics to be explicitly enabled on clients severely cripples
> > > > > > > the usability and value of this feature.
> > > > > > >
> > > > > > > One of the problems that this KIP aims to solve is for useful metrics to
> > > > > > > be available on demand regardless of the technical expertise of the user.
> > > > > > > As Ryanne points out, a savvy user/organization will typically have
> > > > > > > metrics collection and monitoring in place already, and the benefits of
> > > > > > > this KIP are then more of a common set and format of metrics across
> > > > > > > client implementations and languages.
> > > > > > > But that is not the typical Kafka user in my experience; they're not
> > > > > > > Kafka experts and they don't have the knowledge of how to best
> > > > > > > instrument their clients.
> > > > > > > Having metrics enabled by default for this user base allows the Kafka
> > > > > > > operators to proactively and reactively monitor and troubleshoot client
> > > > > > > issues, without the need for the less savvy user to do anything.
> > > > > > > It is often too late to tell a user to enable metrics when the problem
> > > > > > > has already occurred.
> > > > > > >
> > > > > > > Now, to be clear, even though metrics are enabled by default on clients,
> > > > > > > they are not enabled by default on the brokers; the Kafka operator needs
> > > > > > > to build and set up a metrics plugin and add metrics subscriptions
> > > > > > > before anything is sent from the client.
> > > > > > > It is opt-out on the clients and opt-in on the broker.
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > > You mentioned brokers already have
> > > > > > > some(most?) of the information contained in metrics, if so
> then why
> > > > > > > are we collecting it again? Surely there must be some new
> > > information
> > > > > > > in the client metrics.
> > > > > > >
> > > > > >
> > > > > > > From the user's perspective the Kafka infrastructure extends from
> > > > > > > producer.send() to messages being returned from consumer.poll(), a
> > > > > > > giant black box where there's a lot going on between those two points.
> > > > > > > The brokers currently only see what happens once those requests and
> > > > > > > messages hit the broker, but as Kafka clients are complex pieces of
> > > > > > > machinery there's a myriad of queues, timers, and state that's critical
> > > > > > > to the operation and infrastructure that's not currently visible to the
> > > > > > > operator.
> > > > > > > Relying on the user to provide this missing information accurately and
> > > > > > > in a timely manner is not generally feasible.
> > > > > >
> > > > > >
> > > > > > > Most of the standard metrics listed in the KIP are data points that the
> > > > > > > broker does not have.
> > > > > > > Only a small number of metrics are duplicates (like the request counts
> > > > > > > and sizes), but they are included to ease correlation when inspecting
> > > > > > > these client metrics.
> > > > > >
> > > > > >
> > > > > >
> > > > > > > Moreover this is a brand new feature so it's even harder to
> justify
> > > > > > > enabling it and forcing onto all our users. If disabled by
> default,
> > > > > > > it's relatively easy to enable in a new release if we decide
> to,
> > > but
> > > > > > > once enabled by default it's much harder to disable. Also this
> > > > feature
> > > > > > > will apply to all future metrics we will add.
> > > > > > >
> > > > > >
> > > > > > > I think maturity of a feature implementation should be the deciding
> > > > > > > factor, rather than the design of it (which this KIP is). I.e., if the
> > > > > > > implementation is not deemed mature enough for release X.Y it will be
> > > > > > > disabled.
> > > > > >
> > > > > >
> > > > > >
> > > > > > > Overall I think it's an interesting feature but I'd prefer to
> be
> > > > > > > slightly defensive and see how it works in practice before
> enabling
> > > > it
> > > > > > > everywhere.
> > > > > > >
> > > > > >
> > > > > > Right, and I agree on being defensive, but since this feature
> still
> > > > > > requires manual
> > > > > > enabling on the brokers before actually being used, I think that
> > > gives
> > > > > > enough control
> > > > > > to opt-in or out of this feature as needed.
> > > > > >
> > > > > > Thanks for your comments!
> > > > > >
> > > > > > Regards,
> > > > > > Magnus
> > > > > >
> > > > > >
> > > > > >
> > > > > > > Thanks,
> > > > > > > Mickael
> > > > > > >
> > > > > > > On Mon, Nov 8, 2021 at 7:55 AM Magnus Edenhill <
> magnus@edenhill.se
> > > >
> > > > > > wrote:
> > > > > > > >
> > > > > > > > Thanks David for pointing this out,
> > > > > > > > I've updated the KIP to include client_id as a matching
> selector.
> > > > > > > >
> > > > > > > > Regards,
> > > > > > > > Magnus
> > > > > > > >
> > > > > > > > > On Thu, Nov 4, 2021 at 18:01, David Mao
> > > > > > > > > <dmao@confluent.io.invalid> wrote:
> > > > > > > >
> > > > > > > > > Hey Magnus,
> > > > > > > > >
> > > > > > > > > I noticed that the KIP outlines the initial selectors
> supported
> > > > as:
> > > > > > > > >
> > > > > > > > >    - client_instance_id - CLIENT_INSTANCE_ID UUID string
> > > > > > > representation.
> > > > > > > > >    - client_software_name  - client software implementation
> > > name.
> > > > > > > > >    - client_software_version  - client software
> implementation
> > > > > > version.
> > > > > > > > >
> > > > > > > > > In the given reactive monitoring workflow, we mention that
> the
> > > > > > > application
> > > > > > > > > user does not know their client's client instance ID, but
> it's
> > > > > > outlined
> > > > > > > > > that the operator can add a metrics subscription selecting
> for
> > > > > > > clientId. I
> > > > > > > > > don't see clientId as one of the supported selectors.
> > > > > > > > > I can see how this would have made sense in a previous
> > > iteration
> > > > > > given
> > > > > > > that
> > > > > > > > > the previous client instance ID proposal was to construct
> the
> > > > > client
> > > > > > > > > instance ID using clientId as a prefix. Now that the client
> > > > > instance
> > > > > > > ID is
> > > > > > > > > a UUID, would we want to add clientId as a supported
> selector?
> > > > > > > > > Let me know what you think.
> > > > > > > > >
> > > > > > > > > David
> > > > > > > > >
> > > > > > > > > On Tue, Oct 19, 2021 at 12:33 PM Magnus Edenhill <
> > > > > magnus@edenhill.se
> > > > > > >
> > > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Hi Mickael!
> > > > > > > > > >
> > > > > > > > > > > On Tue, Oct 19, 2021 at 19:30, Mickael Maison <
> > > > > > > > > > > mickael.maison@gmail.com> wrote:
> > > > > > > > > >
> > > > > > > > > > > Hi Magnus,
> > > > > > > > > > >
> > > > > > > > > > > Thanks for the proposal.
> > > > > > > > > > >
> > > > > > > > > > > 1. Looking at the protocol section, isn't
> > > "ClientInstanceId"
> > > > > > > expected
> > > > > > > > > > > to be a field in GetTelemetrySubscriptionsResponseV0?
> > > > > Otherwise,
> > > > > > > how
> > > > > > > > > > > does a client retrieve this value?
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Good catch, it got removed by mistake in one of the
> edits.
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > 2. In the client API section, you mention a new method
> > > > > > > > > > > "clientInstanceId()". Can you clarify which interfaces
> are
> > > > > > > affected?
> > > > > > > > > > > Is it only Consumer and Producer?
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > And Admin. Will update the KIP.
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > > 3. I'm a bit concerned this is enabled by default. Even if the
> > > > > > > > > > > > data collected is not supposed to be sensitive, I think this
> > > > > > > > > > > > can be problematic in some environments. Also users don't seem
> > > > > > > > > > > > to have the choice to only expose some metrics. Knowing how
> > > > > > > > > > > > much data transits through some applications can be considered
> > > > > > > > > > > > critical.
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > The broker already knows how much data transits through the client
> > > > > > > > > > > though, right?
> > > > > > > > > > > Care has been taken not to expose information in the standard
> > > > > > > > > > > metrics that might reveal sensitive information.
> > > > > > > > > > >
> > > > > > > > > > > Do you have an example of how the proposed metrics could leak
> > > > > > > > > > > sensitive information?
> > > > > > > > > > > As for limiting which metrics to export: I guess that could make
> > > > > > > > > > > sense in some very sensitive use-cases, but those users might
> > > > > > > > > > > disable metrics altogether for now.
> > > > > > > > > > > Could these concerns be addressed by a later KIP?
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > 4. As a user, how do you know if your application is
> > > actively
> > > > > > > sending
> > > > > > > > > > > metrics? Are there new metrics exposing what's going
> on,
> > > like
> > > > > how
> > > > > > > much
> > > > > > > > > > > data is being sent?
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > That's a good question.
> > > > > > > > > > Since the proposed metrics interface is not aimed at, or
> > > > directly
> > > > > > > > > available
> > > > > > > > > > to, the application
> > > > > > > > > > I guess there's little point of adding it here, but
> instead
> > > > > adding
> > > > > > > > > > something to the
> > > > > > > > > > existing JMX metrics?
> > > > > > > > > > Do you have any suggestions?
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > 5. If all metrics are enabled on a regular Consumer or
> > > > > Producer,
> > > > > > do
> > > > > > > > > > > you have an idea how much throughput this would use?
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > It depends on the number of partition/topics/etc the
> client
> > > is
> > > > > > > producing
> > > > > > > > > > to/consuming from.
> > > > > > > > > > I'll add some sizes to the KIP for some typical
> use-cases.
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Thanks,
> > > > > > > > > > Magnus
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > Thanks
> > > > > > > > > > >
> > > > > > > > > > > On Tue, Oct 19, 2021 at 5:06 PM Magnus Edenhill <
> > > > > > > magnus@edenhill.se>
> > > > > > > > > > > wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > On Tue, Oct 19, 2021 at 13:22, Tom Bentley <
> > > > > > > > > > > > > tbentley@redhat.com> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > Hi Magnus,
> > > > > > > > > > > > >
> > > > > > > > > > > > > I reviewed the KIP since you called the vote
> (sorry for
> > > > not
> > > > > > > > > reviewing
> > > > > > > > > > > when
> > > > > > > > > > > > > you announced your intention to call the vote). I
> have
> > > a
> > > > > few
> > > > > > > > > > questions
> > > > > > > > > > > on
> > > > > > > > > > > > > some of the details.
> > > > > > > > > > > > >
> > > > > > > > > > > > > 1. There's no Javadoc on
> ClientTelemetryPayload.data(),
> > > > so
> > > > > I
> > > > > > > don't
> > > > > > > > > > know
> > > > > > > > > > > > > whether the payload is exposed through this method
> as
> > > > > > > compressed or
> > > > > > > > > > > not.
> > > > > > > > > > > > > Later on you say "Decompression of the payloads
> will be
> > > > > > > handled by
> > > > > > > > > > the
> > > > > > > > > > > > > broker metrics plugin, the broker should expose a
> > > > suitable
> > > > > > > > > > > decompression
> > > > > > > > > > > > > API to the metrics plugin for this purpose.", which
> > > > > suggests
> > > > > > > it's
> > > > > > > > > the
> > > > > > > > > > > > > compressed data in the buffer, but then we don't
> know
> > > > which
> > > > > > > codec
> > > > > > > > > was
> > > > > > > > > > > used,
> > > > > > > > > > > > > nor the API via which the plugin should decompress
> it
> > > if
> > > > > > > required
> > > > > > > > > for
> > > > > > > > > > > > > forwarding to the ultimate metrics store. Should
> the
> > > > > > > > > > > ClientTelemetryPayload
> > > > > > > > > > > > > expose a method to get the compression and a
> > > > decompressor?
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > Good point, updated.
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > > 2. The client-side API is expressed as
> StringOrError
> > > > > > > > > > > > > ClientInstance::ClientInstanceId(int timeout_ms). I
> > > > > > understand
> > > > > > > that
> > > > > > > > > > > you're
> > > > > > > > > > > > > thinking about the librdkafka implementation, but
> it
> > > > would
> > > > > be
> > > > > > > good
> > > > > > > > > to
> > > > > > > > > > > show
> > > > > > > > > > > > > the API as it would appear on the Apache Kafka
> clients.
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > This was meant as pseudo-code, but I changed it to
> Java.
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > > 3. "PushTelemetryRequest|Response - protocol
> request
> > > used
> > > > > by
> > > > > > > the
> > > > > > > > > > > client to
> > > > > > > > > > > > > send metrics to any broker it is connected to." To
> be
> > > > > clear,
> > > > > > > this
> > > > > > > > > > means
> > > > > > > > > > > > > that the client can choose any of the connected
> brokers
> > > > and
> > > > > > > push to
> > > > > > > > > > > just
> > > > > > > > > > > > > one of them? What should a supporting client do if
> it
> > > > gets
> > > > > an
> > > > > > > error
> > > > > > > > > > > when
> > > > > > > > > > > > > pushing metrics to a broker, retry sending to the
> same
> > > > > broker
> > > > > > > or
> > > > > > > > > try
> > > > > > > > > > > > > pushing to another broker, or drop the metrics?
> Should
> > > > > > > supporting
> > > > > > > > > > > clients
> > > > > > > > > > > > > send successive requests to a single broker, or
> round
> > > > > robin,
> > > > > > > or is
> > > > > > > > > > > that up
> > > > > > > > > > > > > to the client author? I'm guessing the behaviour
> should
> > > > be
> > > > > > > sticky
> > > > > > > > > to
> > > > > > > > > > > > > support the rate limiting features, but I think it
> > > would
> > > > be
> > > > > > > good
> > > > > > > > > for
> > > > > > > > > > > client
> > > > > > > > > > > > > authors if this section were explicit on the
> > > recommended
> > > > > > > behaviour.
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > You are right, I've updated the KIP to make this
> clearer.
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > > 4. "Mapping the client instance id to an actual
> > > > application
> > > > > > > > > instance
> > > > > > > > > > > > > running on a (virtual) machine can be done by
> > > inspecting
> > > > > the
> > > > > > > > > metrics
> > > > > > > > > > > > > resource labels, such as the client source address
> and
> > > > > source
> > > > > > > port,
> > > > > > > > > > or
> > > > > > > > > > > > > security principal, all of which are added by the
> > > > receiving
> > > > > > > broker.
> > > > > > > > > > > This
> > > > > > > > > > > > > will allow the operator together with the user to
> > > > identify
> > > > > > the
> > > > > > > > > actual
> > > > > > > > > > > > > application instance." Is this really always true?
> The
> > > > > source
> > > > > > > IP
> > > > > > > > > and
> > > > > > > > > > > port
> > > > > > > > > > > > > might be a loadbalancer/proxy in some setups. The
> > > > > principal,
> > > > > > as
> > > > > > > > > > already
> > > > > > > > > > > > > mentioned in the KIP, might be shared between
> multiple
> > > > > > > > > applications.
> > > > > > > > > > > So at
> > > > > > > > > > > > > worst the organization running the clients might
> have
> > > to
> > > > > > > consult
> > > > > > > > > the
> > > > > > > > > > > logs
> > > > > > > > > > > > > of a set of client applications, right?
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > Yes, that's correct. There's no guaranteed mapping
> from
> > > > > > > > > > > client_instance_id
> > > > > > > > > > > > to
> > > > > > > > > > > > an actual instance, that's why the KIP recommends
> client
> > > > > > > > > > implementations
> > > > > > > > > > > to
> > > > > > > > > > > > log the client instance id
> > > > > > > > > > > > upon retrieval, and also provide an API for the
> > > application
> > > > > to
> > > > > > > > > retrieve
> > > > > > > > > > > the
> > > > > > > > > > > > instance id programmatically
> > > > > > > > > > > > if it has a better way of exposing it.
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > 5. "Tests indicate that a compression ratio up to
> 10x is
> > > > > > > possible for
> > > > > > > > > > the
> > > > > > > > > > > > > standard metrics." Client authors might appreciate
> your
> > > > > > > mentioning
> > > > > > > > > > > which
> > > > > > > > > > > > > compression codec got these results.
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > Good point. Updated.
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > > 6. "Should the client send a push request prior to
> > > expiry
> > > > > of
> > > > > > > the
> > > > > > > > > > > previously
> > > > > > > > > > > > > calculated PushIntervalMs the broker will discard
> the
> > > > > metrics
> > > > > > > and
> > > > > > > > > > > return a
> > > > > > > > > > > > > PushTelemetryResponse with the ErrorCode set to
> > > > > RateLimited."
> > > > > > > Is
> > > > > > > > > this
> > > > > > > > > > > > > RATE_LIMITED a new error code? It's not mentioned
> in
> > > the
> > > > > "New
> > > > > > > Error
> > > > > > > > > > > Codes"
> > > > > > > > > > > > > section.
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > That's a leftover, it should be using the standard
> > > > > ThrottleTime
> > > > > > > > > > > mechanism.
> > > > > > > > > > > > Fixed.
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > > 7. In the section "Standard client resource labels"
> > > > > > > application_id
> > > > > > > > > is
> > > > > > > > > > > > > described as Kafka Streams only, but the section of
> > > > "Client
> > > > > > > > > > > Identification"
> > > > > > > > > > > > > talks about "application instance id as an optional
> > > > future
> > > > > > > > > > nice-to-have
> > > > > > > > > > > > > that may be included as a metrics label if it has
> been
> > > > set
> > > > > by
> > > > > > > the
> > > > > > > > > > > user", so
> > > > > > > > > > > > > I'm confused whether non-Kafka Streams clients
> should
> > > set
> > > > > an
> > > > > > > > > > > application_id
> > > > > > > > > > > > > or not.
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > I'll clarify this in the KIP, but basically we would
> need
> > > > to
> > > > > > add
> > > > > > > an `
> > > > > > > > > > > > application.id` config
> > > > > > > > > > > > property for non-streams clients for this purpose,
> and
> > > > that's
> > > > > > > outside
> > > > > > > > > > the
> > > > > > > > > > > > scope of this KIP since we want to make it
> zero-conf:ish
> > > on
> > > > > the
> > > > > > > > > client
> > > > > > > > > > > side.
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > Kind regards,
> > > > > > > > > > > > >
> > > > > > > > > > > > > Tom
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > Thanks for the review,
> > > > > > > > > > > > Magnus
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Thu, Oct 7, 2021 at 5:26 PM Magnus Edenhill <
> > > > > > > magnus@edenhill.se
> > > > > > > > > >
> > > > > > > > > > > wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > Hi all,
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > I've updated the KIP following our recent
> discussions
> > > > on
> > > > > > the
> > > > > > > > > > mailing
> > > > > > > > > > > > > list:
> > > > > > > > > > > > > >  - split the protocol in two, one for getting the
> > > > metrics
> > > > > > > > > > > subscriptions,
> > > > > > > > > > > > > > and one for pushing the metrics.
> > > > > > > > > > > > > >  - simplifications: initially only one supported
> > > > metrics
> > > > > > > format,
> > > > > > > > > no
> > > > > > > > > > > > > > client.id in the instance id, etc.
> > > > > > > > > > > > > >  - made CLIENT_METRICS subscription configuration
> > > > entries
> > > > > > > more
> > > > > > > > > > > structured
> > > > > > > > > > > > > >    and allowing better client matching selectors
> (not
> > > > > only
> > > > > > > on the
> > > > > > > > > > > > > instance
> > > > > > > > > > > > > > id, but also the other
> > > > > > > > > > > > > >    client resource labels, such as
> > > > client_software_name,
> > > > > > > etc.).
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Unless there are further comments I'll call the
> vote
> > > > in a
> > > > > > > day or
> > > > > > > > > > two.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Regards,
> > > > > > > > > > > > > > Magnus
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > On Mon, Oct 4, 2021 at 20:57, Magnus Edenhill
> > > > > > > > > > > > > > > <magnus@edenhill.se> wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Hi Gwen,
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > I'm finishing up the KIP based on the last
> couple
> > > of
> > > > > > > discussion
> > > > > > > > > > > points
> > > > > > > > > > > > > in
> > > > > > > > > > > > > > > this thread
> > > > > > > > > > > > > > > and will call the Vote later this week.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Best,
> > > > > > > > > > > > > > > Magnus
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > On Sat, Oct 2, 2021 at 02:01, Gwen Shapira
> > > > > > > > > > > > > > > > <gwen@confluent.io.invalid> wrote:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >> Hey,
> > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > >> I noticed that there was no discussion for the
> > > last
> > > > 10
> > > > > > > days,
> > > > > > > > > > but I
> > > > > > > > > > > > > > >> couldn't
> > > > > > > > > > > > > > >> find the vote thread. Is there one that I'm
> > > missing?
> > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > >> Gwen
> > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > >> On Wed, Sep 22, 2021 at 4:58 AM Magnus
> Edenhill <
> > > > > > > > > > > magnus@edenhill.se>
> > > > > > > > > > > > > > >> wrote:
> > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > >> > On Tue, Sep 21, 2021 at 06:58, Colin McCabe
> > > > > > > > > > > > > > > >> > <cmccabe@apache.org> wrote:
> > > > > > > > > > > > > > >> >
> > > > > > > > > > > > > > >> > > On Mon, Sep 20, 2021, at 17:35, Feng Min
> > > wrote:
> > > > > > > > > > > > > > >> > > > Thanks Magnus & Colin for the
> discussion.
> > > > > > > > > > > > > > >> > > >
> > > > > > > > > > > > > > >> > > > Based on KIP-714's stateless design,
> Client
> > > > can
> > > > > > > pretty
> > > > > > > > > > much
> > > > > > > > > > > use
> > > > > > > > > > > > > > any
> > > > > > > > > > > > > > >> > > > connection to any broker to send
> metrics. We
> > > > are
> > > > > > not
> > > > > > > > > > > associating
> > > > > > > > > > > > > > >> > > connection
> > > > > > > > > > > > > > >> > > > with client metric state. Is my
> > > understanding
> > > > > > > correct?
> > > > > > > > > If
> > > > > > > > > > > yes,
> > > > > > > > > > > > > > how
> > > > > > > > > > > > > > >> > about
> > > > > > > > > > > > > > >> > > > the following two scenarios
> > > > > > > > > > > > > > >> > > >
> > > > > > > > > > > > > > >> > > > 1) One Client (Client-ID) registers two
> > > > > different
> > > > > > > client
> > > > > > > > > > > > > instance
> > > > > > > > > > > > > > id
> > > > > > > > > > > > > > >> > via
> > > > > > > > > > > > > > >> > > > separate registration. Is it permitted?
> If
> > > OK,
> > > > > how
> > > > > > > to
> > > > > > > > > > > > > distinguish
> > > > > > > > > > > > > > >> them
> > > > > > > > > > > > > > >> > > from
> > > > > > > > > > > > > > >> > > > the case 2 below.
> > > > > > > > > > > > > > >> > > >
> > > > > > > > > > > > > > >> > >
> > > > > > > > > > > > > > >> > > Hi Feng,
> > > > > > > > > > > > > > >> > >
> > > > > > > > > > > > > > >> > > My understanding, which Magnus can
> clarify I
> > > > > guess,
> > > > > > is
> > > > > > > > > that
> > > > > > > > > > > you
> > > > > > > > > > > > > > could
> > > > > > > > > > > > > > >> > have
> > > > > > > > > > > > > > >> > > something like two Producer instances
> running
> > > > with
> > > > > > the
> > > > > > > > > same
> > > > > > > > > > > > > > client.id
> > > > > > > > > > > > > > >> > > (perhaps because they're using the same
> config
> > > > > file,
> > > > > > > for
> > > > > > > > > > > example).
> > > > > > > > > > > > > > >> They
> > > > > > > > > > > > > > >> > > could even be in the same process. But
> they
> > > > would
> > > > > > get
> > > > > > > > > > separate
> > > > > > > > > > > > > > UUIDs.
> > > > > > > > > > > > > > >> > >
> > > > > > > > > > > > > > >> > > I believe Magnus used the term client to
> mean
> > > > > > > "Producer or
> > > > > > > > > > > > > > Consumer".
> > > > > > > > > > > > > > >> So
> > > > > > > > > > > > > > >> > > if you have both a Producer and a
> Consumer in
> > > > your
> > > > > > > > > > > application I
> > > > > > > > > > > > > > would
> > > > > > > > > > > > > > >> > > expect you'd get separate UUIDs for both.
> > > Again
> > > > > > > Magnus can
> > > > > > > > > > > chime
> > > > > > > > > > > > > in
> > > > > > > > > > > > > > >> > here, I
> > > > > > > > > > > > > > >> > > guess.
> > > > > > > > > > > > > > >> > >
> > > > > > > > > > > > > > >> >
> > > > > > > > > > > > > > >> > That's correct.
> > > > > > > > > > > > > > >> >
> > > > > > > > > > > > > > >> >
> > > > > > > > > > > > > > >> > >
> > > > > > > > > > > > > > >> > > > 2) How about the client restarting?
> What's
> > > the
> > > > > > > > > > expectation?
> > > > > > > > > > > > > Should
> > > > > > > > > > > > > > >> the
> > > > > > > > > > > > > > >> > > > server expect the client to carry a
> > > persisted
> > > > > > client
> > > > > > > > > > > instance id
> > > > > > > > > > > > > > or
> > > > > > > > > > > > > > >> > > should
> > > > > > > > > > > > > > >> > > > the client be treated as a new instance?
> > > > > > > > > > > > > > >> > >
> > > > > > > > > > > > > > >> > > The KIP doesn't describe any mechanism for
> > > > > > > persistence,
> > > > > > > > > so I
> > > > > > > > > > > would
> > > > > > > > > > > > > > >> assume
> > > > > > > > > > > > > > >> > > that when you restart the client you get
> a new
> > > > > > UUID. I
> > > > > > > > > agree
> > > > > > > > > > > that
> > > > > > > > > > > > > it
> > > > > > > > > > > > > > >> > would
> > > > > > > > > > > > > > >> > > be good to spell this out.
> > > > > > > > > > > > > > >> > >
> > > > > > > > > > > > > > >> > >
> > > > > > > > > > > > > > >> > Right, it will not be persisted since a
> client
> > > > > > instance
> > > > > > > > > can't
> > > > > > > > > > be
> > > > > > > > > > > > > > >> restarted.
> > > > > > > > > > > > > > >> >
> > > > > > > > > > > > > > >> > Will update the KIP to make this clearer.
> > > > > > > > > > > > > > >> >
> > > > > > > > > > > > > > >> > /Magnus
> > > > > > > > > > > > > > >> >
> > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > >> --
> > > > > > > > > > > > > > >> Gwen Shapira
> > > > > > > > > > > > > > >> Engineering Manager | Confluent
> > > > > > > > > > > > > > >> 650.450.2760 | @gwenshap
> > > > > > > > > > > > > > >> Follow us: Twitter | blog
> > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] KIP-714: Client metrics and observability

Posted by Kirk True <ki...@mustardgrain.com>.
Hi Jun,

On Thu, Dec 9, 2021, at 2:28 PM, Jun Rao wrote:
> 13. Using OpenTelemetry. Does that require runtime dependency
> on OpenTelemetry library? How good is the compatibility story
> of OpenTelemetry? This is important since an application could have other
> OpenTelemetry dependencies than the Kafka client.

The current design is that the OpenTelemetry JARs would ship with the client. Perhaps we can design the client such that the JARs aren't even loaded if the user has opted out. The user could even exclude the JARs from their dependencies if they so wished.
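
As a rough illustration of the opt-out idea only (a sketch under stated assumptions, not the KIP's design or actual client code): the client could check the opt-out setting first and only then probe for an OpenTelemetry class, so that nothing from those JARs is touched when the user has opted out or excluded them. The class name TelemetryBootstrap, the config key "enable.metrics.push", and the probe class name are all assumptions made for this example.

```java
import java.util.Map;

// Hypothetical helper; not part of the Kafka client API.
class TelemetryBootstrap {
    // Assumed opt-out key, used here only for illustration.
    static final String ENABLE_KEY = "enable.metrics.push";

    // Assumed probe class; any class that lives only in the OpenTelemetry JARs would do.
    static final String OTEL_PROBE_CLASS = "io.opentelemetry.proto.metrics.v1.MetricsData";

    /** Returns true if the named class can be located, without running its static initializers. */
    public static boolean isClassPresent(String className) {
        try {
            Class.forName(className, false, TelemetryBootstrap.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // The JAR was excluded from the dependencies, or is incompatible.
            return false;
        }
    }

    /**
     * Telemetry is enabled only if the user has not opted out AND the probe class
     * is available. The && short-circuits, so an opted-out client never attempts
     * to load anything from the OpenTelemetry JARs.
     */
    public static boolean shouldEnableTelemetry(Map<String, String> config, String probeClass) {
        boolean optedIn = Boolean.parseBoolean(config.getOrDefault(ENABLE_KEY, "true"));
        return optedIn && isClassPresent(probeClass);
    }

    public static void main(String[] args) {
        Map<String, String> config = Map.of(ENABLE_KEY, "false");
        // prints "telemetry enabled: false" because the user opted out
        System.out.println("telemetry enabled: " + shouldEnableTelemetry(config, OTEL_PROBE_CLASS));
    }
}
```

With a check like this in front of the telemetry code path, a user who excludes the JARs would simply get telemetry silently disabled rather than a NoClassDefFoundError at startup.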

I can't speak to the compatibility of the libraries. Is it possible that we include a shaded version?

Thanks,
Kirk

> 
> 14. The proposal listed idempotence=true. This is more of a configuration
> than a metric. Are we including that as a metric? What other configurations
> are we including? Should we separate the configurations from the metrics?
> 
> Thanks,
> 
> Jun
> 
> On Mon, Nov 29, 2021 at 7:34 AM Magnus Edenhill <ma...@edenhill.se> wrote:
> 
> > Hey Bob,
> >
> > That's a good point.
> >
> > Request type labels were considered, but since they're already tracked by
> > broker-side metrics they were left out to avoid metric duplication.
> > However, those metrics are not per connection, so they won't be that
> > useful in practice for troubleshooting specific client instances.
> >
> > I'll add the request_type label to the relevant metrics.
> >
> > Thanks,
> > Magnus
> >
> >
> > On Tue, Nov 23, 2021 at 19:20, Bob Barrett
> > <bo...@confluent.io.invalid> wrote:
> >
> > > Hi Magnus,
> > >
> > > Thanks for the thorough KIP, this seems very useful.
> > >
> > > Would it make sense to include the request type as a label for the
> > > `client.request.success`, `client.request.errors` and
> > `client.request.rtt`
> > > metrics? I think it would be very useful to see which specific requests
> > are
> > > succeeding and failing for a client. One specific case I can think of
> > where
> > > this could be useful is producer batch timeouts. If a Java application
> > does
> > > not enable producer client logs (unfortunately, in my experience this
> > > happens more often than it should), the application logs will only
> > contain
> > > the expiration error message, but no information about what is causing
> > the
> > > timeout. The requests might all be succeeding but taking too long to
> > > process batches, or metadata requests might be failing, or some or all
> > > produce requests might be failing (if the bootstrap servers are reachable
> > > from the client but one or more other brokers are not, for example). If
> > the
> > > cluster operator is able to identify the specific requests that are slow
> > or
> > > failing for a client, they will be better able to diagnose the issue
> > > causing batch timeouts.
> > >
> > > One drawback I can think of is that this will increase the cardinality of
> > > the request metrics. But any given client is only going to use a small
> > > subset of the request types, and since we already have partition labels
> > for
> > > the topic-level metrics, I think request labels will still make up a
> > > relatively small percentage of the set of metrics.
> > >
> > > Thanks,
> > > Bob
> > >
> > > On Mon, Nov 22, 2021 at 2:08 AM Viktor Somogyi-Vass <
> > > viktorsomogyi@gmail.com>
> > > wrote:
> > >
> > > > Hi Magnus,
> > > >
> > > > I think this is a very useful addition. We also have a similar (but
> > much
> > > > more simplistic) implementation of this. Maybe I missed it in the KIP
> > but
> > > > what about adding metrics about the subscription cache itself? That I
> > > think
> > > > would improve its usability and debuggability as we'd be able to see
> > its
> > > > performance, hit/miss rates, eviction counts and others.
> > > >
> > > > Best,
> > > > Viktor
> > > >
> > > > On Thu, Nov 18, 2021 at 5:12 PM Magnus Edenhill <ma...@edenhill.se>
> > > > wrote:
> > > >
> > > > > Hi Mickael,
> > > > >
> > > > > see inline.
> > > > >
> > > > > Den ons 10 nov. 2021 kl 15:21 skrev Mickael Maison <
> > > > > mickael.maison@gmail.com
> > > > > >:
> > > > >
> > > > > > Hi Magnus,
> > > > > >
> > > > > > I see you've addressed some of the points I raised above but some
> > (4,
> > > > > > 5) have not been addressed yet.
> > > > > >
> > > > >
> > > > > Re 4) How will the user/app know metrics are being sent.
> > > > >
> > > > > One possibility is to add a JMX metric (thus for user consumption)
> > for
> > > > the
> > > > > number of metric pushes the
> > > > > client has performed, or perhaps the number of metrics subscriptions
> > > > > currently being collected.
> > > > > Would that be sufficient?
> > > > >
> > > > > Re 5) Metric sizes and rates
> > > > >
> > > > > A worst case scenario for a producer that is producing to 50 unique
> > > > topics
> > > > > and emitting all standard metrics yields
> > > > > a serialized size of around 100KB prior to compression, which
> > > compresses
> > > > > down to about 20-30% of that depending
> > > > > on compression type and topic name uniqueness.
> > > > > The numbers for a consumer would be similar.
> > > > >
> > > > > In practice the number of unique topics would be far less, and the
> > > > > subscription set would typically be for a subset of metrics.
> > > > > > So we're probably closer to 1KB, or less, compressed size per client
> > > per
> > > > > push interval.
> > > > >
> > > > > As both the subscription set and push intervals are controlled by the
> > > > > cluster operator it shouldn't be too hard
> > > > > to strike a good balance between metrics overhead and granularity.
> > > > >
> > > > >
> > > > >
> > > > > >
> > > > > > I'm really uneasy with this being enabled by default on the client
> > > > > > side. When collecting data, I think the best practice is to ensure
> > > > > > users are explicitly enabling it.
> > > > > >
> > > > >
> > > > > Requiring metrics to be explicitly enabled on clients severely
> > cripples
> > > > its
> > > > > usability and value.
> > > > >
> > > > > One of the problems that this KIP aims to solve is for useful metrics
> > > to
> > > > be
> > > > > available on demand
> > > > > > regardless of the technical expertise of the user. As Ryanne points
> > > > > > out, a savvy user/organization
> > > > > will typically have metrics collection and monitoring in place
> > already,
> > > > and
> > > > > the benefits of this KIP
> > > > > > are then more of a common set and format of metrics across client
> > > > > implementations and languages.
> > > > > But that is not the typical Kafka user in my experience, they're not
> > > > Kafka
> > > > > experts and they don't have the
> > > > > knowledge of how to best instrument their clients.
> > > > > Having metrics enabled by default for this user base allows the Kafka
> > > > > operators to proactively and reactively
> > > > > monitor and troubleshoot client issues, without the need for the less
> > > > savvy
> > > > > user to do anything.
> > > > > It is often too late to tell a user to enable metrics when the
> > problem
> > > > has
> > > > > already occurred.
> > > > >
> > > > > > Now, to be clear, even though metrics are enabled by default on
> > > > > > clients, they are not enabled by default
> > > > > > on the brokers; the Kafka operator needs to build and set up a
> > metrics
> > > > > plugin and add metrics subscriptions
> > > > > before anything is sent from the client.
> > > > > It is opt-out on the clients and opt-in on the broker.
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > > You mentioned brokers already have
> > > > > > some(most?) of the information contained in metrics, if so then why
> > > > > > are we collecting it again? Surely there must be some new
> > information
> > > > > > in the client metrics.
> > > > > >
> > > > >
> > > > > From the user's perspective the Kafka infrastructure extends from
> > > > > producer.send() to
> > > > > messages being returned from consumer.poll(), a giant black box where
> > > > > there's a lot going on between those
> > > > > > two points. The brokers currently only see what happens once those
> > > > > > requests and messages hit the broker,
> > > > > but as Kafka clients are complex pieces of machinery there's a myriad
> > > of
> > > > > queues, timers, and state
> > > > > that's critical to the operation and infrastructure that's not
> > > currently
> > > > > visible to the operator.
> > > > > Relying on the user to accurately and timely provide this missing
> > > > > information is not generally feasible.
> > > > >
> > > > >
> > > > > Most of the standard metrics listed in the KIP are data points that
> > the
> > > > > broker does not have.
> > > > > Only a small number of metrics are duplicates (like the request
> > counts
> > > > and
> > > > > sizes), but they are included
> > > > > to ease correlation when inspecting these client metrics.
> > > > >
> > > > >
> > > > >
> > > > > > Moreover this is a brand new feature so it's even harder to justify
> > > > > > enabling it and forcing onto all our users. If disabled by default,
> > > > > > it's relatively easy to enable in a new release if we decide to,
> > but
> > > > > > once enabled by default it's much harder to disable. Also this
> > > feature
> > > > > > will apply to all future metrics we will add.
> > > > > >
> > > > >
> > > > > I think maturity of a feature implementation should be the deciding
> > > > factor,
> > > > > rather than
> > > > > the design of it (which this KIP is). I.e., if the implementation is
> > > not
> > > > > deemed mature enough
> > > > > for release X.Y it will be disabled.
> > > > >
> > > > >
> > > > >
> > > > > > Overall I think it's an interesting feature but I'd prefer to be
> > > > > > slightly defensive and see how it works in practice before enabling
> > > it
> > > > > > everywhere.
> > > > > >
> > > > >
> > > > > Right, and I agree on being defensive, but since this feature still
> > > > > requires manual
> > > > > enabling on the brokers before actually being used, I think that
> > gives
> > > > > enough control
> > > > > to opt-in or out of this feature as needed.
> > > > >
> > > > > Thanks for your comments!
> > > > >
> > > > > Regards,
> > > > > Magnus
> > > > >
> > > > >
> > > > >
> > > > > > Thanks,
> > > > > > Mickael
> > > > > >
> > > > > > On Mon, Nov 8, 2021 at 7:55 AM Magnus Edenhill <magnus@edenhill.se
> > >
> > > > > wrote:
> > > > > > >
> > > > > > > Thanks David for pointing this out,
> > > > > > > I've updated the KIP to include client_id as a matching selector.
> > > > > > >
> > > > > > > Regards,
> > > > > > > Magnus
> > > > > > >
> > > > > > > Den tors 4 nov. 2021 kl 18:01 skrev David Mao
> > > > > <dmao@confluent.io.invalid
> > > > > > >:
> > > > > > >
> > > > > > > > Hey Magnus,
> > > > > > > >
> > > > > > > > I noticed that the KIP outlines the initial selectors supported
> > > as:
> > > > > > > >
> > > > > > > >    - client_instance_id - CLIENT_INSTANCE_ID UUID string
> > > > > > representation.
> > > > > > > >    - client_software_name  - client software implementation
> > name.
> > > > > > > >    - client_software_version  - client software implementation
> > > > > version.
> > > > > > > >
> > > > > > > > In the given reactive monitoring workflow, we mention that the
> > > > > > application
> > > > > > > > user does not know their client's client instance ID, but it's
> > > > > outlined
> > > > > > > > that the operator can add a metrics subscription selecting for
> > > > > > clientId. I
> > > > > > > > don't see clientId as one of the supported selectors.
> > > > > > > > I can see how this would have made sense in a previous
> > iteration
> > > > > given
> > > > > > that
> > > > > > > > the previous client instance ID proposal was to construct the
> > > > client
> > > > > > > > instance ID using clientId as a prefix. Now that the client
> > > > instance
> > > > > > ID is
> > > > > > > > a UUID, would we want to add clientId as a supported selector?
> > > > > > > > Let me know what you think.
> > > > > > > >
> > > > > > > > David
> > > > > > > >
> > > > > > > > On Tue, Oct 19, 2021 at 12:33 PM Magnus Edenhill <
> > > > magnus@edenhill.se
> > > > > >
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Hi Mickael!
> > > > > > > > >
> > > > > > > > > Den tis 19 okt. 2021 kl 19:30 skrev Mickael Maison <
> > > > > > > > > mickael.maison@gmail.com
> > > > > > > > > >:
> > > > > > > > >
> > > > > > > > > > Hi Magnus,
> > > > > > > > > >
> > > > > > > > > > Thanks for the proposal.
> > > > > > > > > >
> > > > > > > > > > 1. Looking at the protocol section, isn't
> > "ClientInstanceId"
> > > > > > expected
> > > > > > > > > > to be a field in GetTelemetrySubscriptionsResponseV0?
> > > > Otherwise,
> > > > > > how
> > > > > > > > > > does a client retrieve this value?
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > Good catch, it got removed by mistake in one of the edits.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > 2. In the client API section, you mention a new method
> > > > > > > > > > "clientInstanceId()". Can you clarify which interfaces are
> > > > > > affected?
> > > > > > > > > > Is it only Consumer and Producer?
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > And Admin. Will update the KIP.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > > 3. I'm a bit concerned this is enabled by default. Even if
> > > the
> > > > > data
> > > > > > > > > > collected is supposed to be not sensitive, I think this can
> > > be
> > > > > > > > > > problematic in some environments. Also users don't seem to
> > > have
> > > > > the
> > > > > > > > > > > choice to only expose some metrics. Knowing how much data transits
> > > > > > > > > > > through some applications can be considered critical.
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > The broker already knows how much data transits through the
> > > > client
> > > > > > > > though,
> > > > > > > > > right?
> > > > > > > > > Care has been taken not to expose information in the standard
> > > > > metrics
> > > > > > > > that
> > > > > > > > > might
> > > > > > > > > reveal sensitive information.
> > > > > > > > >
> > > > > > > > > Do you have an example of how the proposed metrics could leak
> > > > > > sensitive
> > > > > > > > > information?
> > > > > > > > > > As for limiting which metrics to export, I guess that could make sense
> > > > > > > > > > in some
> > > > > > > > > very sensitive use-cases, but those users might disable
> > metrics
> > > > > > > > altogether
> > > > > > > > > for now.
> > > > > > > > > Could these concerns be addressed by a later KIP?
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > 4. As a user, how do you know if your application is
> > actively
> > > > > > sending
> > > > > > > > > > metrics? Are there new metrics exposing what's going on,
> > like
> > > > how
> > > > > > much
> > > > > > > > > > data is being sent?
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > That's a good question.
> > > > > > > > > Since the proposed metrics interface is not aimed at, or
> > > directly
> > > > > > > > available
> > > > > > > > > to, the application
> > > > > > > > > I guess there's little point of adding it here, but instead
> > > > adding
> > > > > > > > > something to the
> > > > > > > > > existing JMX metrics?
> > > > > > > > > Do you have any suggestions?
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > > 5. If all metrics are enabled on a regular Consumer or
> > > > Producer,
> > > > > do
> > > > > > > > > > you have an idea how much throughput this would use?
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > It depends on the number of partition/topics/etc the client
> > is
> > > > > > producing
> > > > > > > > > to/consuming from.
> > > > > > > > > I'll add some sizes to the KIP for some typical use-cases.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > > Magnus
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > > Thanks
> > > > > > > > > >
> > > > > > > > > > On Tue, Oct 19, 2021 at 5:06 PM Magnus Edenhill <
> > > > > > magnus@edenhill.se>
> > > > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > Den tis 19 okt. 2021 kl 13:22 skrev Tom Bentley <
> > > > > > tbentley@redhat.com
> > > > > > > > >:
> > > > > > > > > > >
> > > > > > > > > > > > Hi Magnus,
> > > > > > > > > > > >
> > > > > > > > > > > > I reviewed the KIP since you called the vote (sorry for
> > > not
> > > > > > > > reviewing
> > > > > > > > > > when
> > > > > > > > > > > > you announced your intention to call the vote). I have
> > a
> > > > few
> > > > > > > > > questions
> > > > > > > > > > on
> > > > > > > > > > > > some of the details.
> > > > > > > > > > > >
> > > > > > > > > > > > 1. There's no Javadoc on ClientTelemetryPayload.data(),
> > > so
> > > > I
> > > > > > don't
> > > > > > > > > know
> > > > > > > > > > > > whether the payload is exposed through this method as
> > > > > > compressed or
> > > > > > > > > > not.
> > > > > > > > > > > > Later on you say "Decompression of the payloads will be
> > > > > > handled by
> > > > > > > > > the
> > > > > > > > > > > > broker metrics plugin, the broker should expose a
> > > suitable
> > > > > > > > > > decompression
> > > > > > > > > > > > API to the metrics plugin for this purpose.", which
> > > > suggests
> > > > > > it's
> > > > > > > > the
> > > > > > > > > > > > compressed data in the buffer, but then we don't know
> > > which
> > > > > > codec
> > > > > > > > was
> > > > > > > > > > used,
> > > > > > > > > > > > nor the API via which the plugin should decompress it
> > if
> > > > > > required
> > > > > > > > for
> > > > > > > > > > > > forwarding to the ultimate metrics store. Should the
> > > > > > > > > > ClientTelemetryPayload
> > > > > > > > > > > > expose a method to get the compression and a
> > > decompressor?
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > Good point, updated.
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > > 2. The client-side API is expressed as StringOrError
> > > > > > > > > > > > ClientInstance::ClientInstanceId(int timeout_ms). I
> > > > > understand
> > > > > > that
> > > > > > > > > > you're
> > > > > > > > > > > > thinking about the librdkafka implementation, but it
> > > would
> > > > be
> > > > > > good
> > > > > > > > to
> > > > > > > > > > show
> > > > > > > > > > > > the API as it would appear on the Apache Kafka clients.
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > This was meant as pseudo-code, but I changed it to Java.
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > > 3. "PushTelemetryRequest|Response - protocol request
> > used
> > > > by
> > > > > > the
> > > > > > > > > > client to
> > > > > > > > > > > > send metrics to any broker it is connected to." To be
> > > > clear,
> > > > > > this
> > > > > > > > > means
> > > > > > > > > > > > that the client can choose any of the connected brokers
> > > and
> > > > > > push to
> > > > > > > > > > just
> > > > > > > > > > > > one of them? What should a supporting client do if it
> > > gets
> > > > an
> > > > > > error
> > > > > > > > > > when
> > > > > > > > > > > > pushing metrics to a broker, retry sending to the same
> > > > broker
> > > > > > or
> > > > > > > > try
> > > > > > > > > > > > pushing to another broker, or drop the metrics? Should
> > > > > > supporting
> > > > > > > > > > clients
> > > > > > > > > > > > send successive requests to a single broker, or round
> > > > robin,
> > > > > > or is
> > > > > > > > > > that up
> > > > > > > > > > > > to the client author? I'm guessing the behaviour should
> > > be
> > > > > > sticky
> > > > > > > > to
> > > > > > > > > > > > support the rate limiting features, but I think it
> > would
> > > be
> > > > > > good
> > > > > > > > for
> > > > > > > > > > client
> > > > > > > > > > > > authors if this section were explicit on the
> > recommended
> > > > > > behaviour.
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > You are right, I've updated the KIP to make this clearer.
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > > 4. "Mapping the client instance id to an actual
> > > application
> > > > > > > > instance
> > > > > > > > > > > > running on a (virtual) machine can be done by
> > inspecting
> > > > the
> > > > > > > > metrics
> > > > > > > > > > > > resource labels, such as the client source address and
> > > > source
> > > > > > port,
> > > > > > > > > or
> > > > > > > > > > > > security principal, all of which are added by the
> > > receiving
> > > > > > broker.
> > > > > > > > > > This
> > > > > > > > > > > > will allow the operator together with the user to
> > > identify
> > > > > the
> > > > > > > > actual
> > > > > > > > > > > > application instance." Is this really always true? The
> > > > source
> > > > > > IP
> > > > > > > > and
> > > > > > > > > > port
> > > > > > > > > > > > might be a loadbalancer/proxy in some setups. The
> > > > principal,
> > > > > as
> > > > > > > > > already
> > > > > > > > > > > > mentioned in the KIP, might be shared between multiple
> > > > > > > > applications.
> > > > > > > > > > So at
> > > > > > > > > > > > worst the organization running the clients might have
> > to
> > > > > > consult
> > > > > > > > the
> > > > > > > > > > logs
> > > > > > > > > > > > of a set of client applications, right?
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > Yes, that's correct. There's no guaranteed mapping from
> > > > > > > > > > client_instance_id
> > > > > > > > > > > to
> > > > > > > > > > > an actual instance, that's why the KIP recommends client
> > > > > > > > > implementations
> > > > > > > > > > to
> > > > > > > > > > > log the client instance id
> > > > > > > > > > > upon retrieval, and also provide an API for the
> > application
> > > > to
> > > > > > > > retrieve
> > > > > > > > > > the
> > > > > > > > > > > instance id programmatically
> > > > > > > > > > > if it has a better way of exposing it.
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > 5. "Tests indicate that a compression ratio up to 10x is
> > > > > > possible for
> > > > > > > > > the
> > > > > > > > > > > > standard metrics." Client authors might appreciate your
> > > > > > mentioning
> > > > > > > > > > which
> > > > > > > > > > > > compression codec got these results.
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > Good point. Updated.
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > > 6. "Should the client send a push request prior to
> > expiry
> > > > of
> > > > > > the
> > > > > > > > > > previously
> > > > > > > > > > > > calculated PushIntervalMs the broker will discard the
> > > > metrics
> > > > > > and
> > > > > > > > > > return a
> > > > > > > > > > > > PushTelemetryResponse with the ErrorCode set to
> > > > RateLimited."
> > > > > > Is
> > > > > > > > this
> > > > > > > > > > > > RATE_LIMITED a new error code? It's not mentioned in
> > the
> > > > "New
> > > > > > Error
> > > > > > > > > > Codes"
> > > > > > > > > > > > section.
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > That's a leftover, it should be using the standard
> > > > ThrottleTime
> > > > > > > > > > mechanism.
> > > > > > > > > > > Fixed.
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > > 7. In the section "Standard client resource labels"
> > > > > > application_id
> > > > > > > > is
> > > > > > > > > > > > described as Kafka Streams only, but the section of
> > > "Client
> > > > > > > > > > Identification"
> > > > > > > > > > > > talks about "application instance id as an optional
> > > future
> > > > > > > > > nice-to-have
> > > > > > > > > > > > that may be included as a metrics label if it has been
> > > set
> > > > by
> > > > > > the
> > > > > > > > > > user", so
> > > > > > > > > > > > I'm confused whether non-Kafka Streams clients should
> > set
> > > > an
> > > > > > > > > > application_id
> > > > > > > > > > > > or not.
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > I'll clarify this in the KIP, but basically we would need
> > > to
> > > > > add
> > > > > > an `
> > > > > > > > > > > application.id` config
> > > > > > > > > > > property for non-streams clients for this purpose, and
> > > that's
> > > > > > outside
> > > > > > > > > the
> > > > > > > > > > > scope of this KIP since we want to make it zero-conf:ish
> > on
> > > > the
> > > > > > > > client
> > > > > > > > > > side.
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > Kind regards,
> > > > > > > > > > > >
> > > > > > > > > > > > Tom
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > Thanks for the review,
> > > > > > > > > > > Magnus
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > On Thu, Oct 7, 2021 at 5:26 PM Magnus Edenhill <
> > > > > > magnus@edenhill.se
> > > > > > > > >
> > > > > > > > > > wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > Hi all,
> > > > > > > > > > > > >
> > > > > > > > > > > > > I've updated the KIP following our recent discussions
> > > on
> > > > > the
> > > > > > > > > mailing
> > > > > > > > > > > > list:
> > > > > > > > > > > > >  - split the protocol in two, one for getting the
> > > metrics
> > > > > > > > > > subscriptions,
> > > > > > > > > > > > > and one for pushing the metrics.
> > > > > > > > > > > > >  - simplifications: initially only one supported
> > > metrics
> > > > > > format,
> > > > > > > > no
> > > > > > > > > > > > > client.id in the instance id, etc.
> > > > > > > > > > > > >  - made CLIENT_METRICS subscription configuration
> > > entries
> > > > > > more
> > > > > > > > > > structured
> > > > > > > > > > > > >    and allowing better client matching selectors (not
> > > > only
> > > > > > on the
> > > > > > > > > > > > instance
> > > > > > > > > > > > > id, but also the other
> > > > > > > > > > > > >    client resource labels, such as
> > > client_software_name,
> > > > > > etc.).
> > > > > > > > > > > > >
> > > > > > > > > > > > > Unless there are further comments I'll call the vote
> > > in a
> > > > > > day or
> > > > > > > > > two.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Regards,
> > > > > > > > > > > > > Magnus
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > Den mån 4 okt. 2021 kl 20:57 skrev Magnus Edenhill <
> > > > > > > > > > magnus@edenhill.se>:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > Hi Gwen,
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > I'm finishing up the KIP based on the last couple
> > of
> > > > > > discussion
> > > > > > > > > > points
> > > > > > > > > > > > in
> > > > > > > > > > > > > > this thread
> > > > > > > > > > > > > > and will call the Vote later this week.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Best,
> > > > > > > > > > > > > > Magnus
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Den lör 2 okt. 2021 kl 02:01 skrev Gwen Shapira
> > > > > > > > > > > > > <gwen@confluent.io.invalid
> > > > > > > > > > > > > > >:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >> Hey,
> > > > > > > > > > > > > >>
> > > > > > > > > > > > > >> I noticed that there was no discussion for the
> > last
> > > 10
> > > > > > days,
> > > > > > > > > but I
> > > > > > > > > > > > > >> couldn't
> > > > > > > > > > > > > >> find the vote thread. Is there one that I'm
> > missing?
> > > > > > > > > > > > > >>
> > > > > > > > > > > > > >> Gwen
> > > > > > > > > > > > > >>
> > > > > > > > > > > > > >> On Wed, Sep 22, 2021 at 4:58 AM Magnus Edenhill <
> > > > > > > > > > magnus@edenhill.se>
> > > > > > > > > > > > > >> wrote:
> > > > > > > > > > > > > >>
> > > > > > > > > > > > > >> > Den tis 21 sep. 2021 kl 06:58 skrev Colin
> > McCabe <
> > > > > > > > > > > > cmccabe@apache.org
> > > > > > > > > > > > > >:
> > > > > > > > > > > > > >> >
> > > > > > > > > > > > > >> > > On Mon, Sep 20, 2021, at 17:35, Feng Min
> > wrote:
> > > > > > > > > > > > > >> > > > Thanks Magnus & Colin for the discussion.
> > > > > > > > > > > > > >> > > >
> > > > > > > > > > > > > >> > > > Based on KIP-714's stateless design, Client
> > > can
> > > > > > pretty
> > > > > > > > > much
> > > > > > > > > > use
> > > > > > > > > > > > > any
> > > > > > > > > > > > > >> > > > connection to any broker to send metrics. We
> > > are
> > > > > not
> > > > > > > > > > associating
> > > > > > > > > > > > > >> > > connection
> > > > > > > > > > > > > >> > > > with client metric state. Is my
> > understanding
> > > > > > correct?
> > > > > > > > If
> > > > > > > > > > yes,
> > > > > > > > > > > > > how
> > > > > > > > > > > > > >> > about
> > > > > > > > > > > > > >> > > > the following two scenarios
> > > > > > > > > > > > > >> > > >
> > > > > > > > > > > > > >> > > > 1) One Client (Client-ID) registers two
> > > > different
> > > > > > client
> > > > > > > > > > > > instance
> > > > > > > > > > > > > id
> > > > > > > > > > > > > >> > via
> > > > > > > > > > > > > >> > > > separate registration. Is it permitted? If
> > OK,
> > > > how
> > > > > > to
> > > > > > > > > > > > distinguish
> > > > > > > > > > > > > >> them
> > > > > > > > > > > > > >> > > from
> > > > > > > > > > > > > >> > > > the case 2 below.
> > > > > > > > > > > > > >> > > >
> > > > > > > > > > > > > >> > >
> > > > > > > > > > > > > >> > > Hi Feng,
> > > > > > > > > > > > > >> > >
> > > > > > > > > > > > > >> > > My understanding, which Magnus can clarify I
> > > > guess,
> > > > > is
> > > > > > > > that
> > > > > > > > > > you
> > > > > > > > > > > > > could
> > > > > > > > > > > > > >> > have
> > > > > > > > > > > > > >> > > something like two Producer instances running
> > > with
> > > > > the
> > > > > > > > same
> > > > > > > > > > > > > client.id
> > > > > > > > > > > > > >> > > (perhaps because they're using the same config
> > > > file,
> > > > > > for
> > > > > > > > > > example).
> > > > > > > > > > > > > >> They
> > > > > > > > > > > > > >> > > could even be in the same process. But they
> > > would
> > > > > get
> > > > > > > > > separate
> > > > > > > > > > > > > UUIDs.
> > > > > > > > > > > > > >> > >
> > > > > > > > > > > > > >> > > I believe Magnus used the term client to mean
> > > > > > "Producer or
> > > > > > > > > > > > > Consumer".
> > > > > > > > > > > > > >> So
> > > > > > > > > > > > > >> > > if you have both a Producer and a Consumer in
> > > your
> > > > > > > > > > application I
> > > > > > > > > > > > > would
> > > > > > > > > > > > > >> > > expect you'd get separate UUIDs for both.
> > Again
> > > > > > Magnus can
> > > > > > > > > > chime
> > > > > > > > > > > > in
> > > > > > > > > > > > > >> > here, I
> > > > > > > > > > > > > >> > > guess.
> > > > > > > > > > > > > >> > >
> > > > > > > > > > > > > >> >
> > > > > > > > > > > > > >> > That's correct.
> > > > > > > > > > > > > >> >
> > > > > > > > > > > > > >> >
> > > > > > > > > > > > > >> > >
> > > > > > > > > > > > > >> > > > 2) How about the client restarting? What's
> > the
> > > > > > > > > expectation?
> > > > > > > > > > > > Should
> > > > > > > > > > > > > >> the
> > > > > > > > > > > > > >> > > > server expect the client to carry a
> > persisted
> > > > > client
> > > > > > > > > > instance id
> > > > > > > > > > > > > or
> > > > > > > > > > > > > >> > > should
> > > > > > > > > > > > > >> > > > the client be treated as a new instance?
> > > > > > > > > > > > > >> > >
> > > > > > > > > > > > > >> > > The KIP doesn't describe any mechanism for
> > > > > > persistence,
> > > > > > > > so I
> > > > > > > > > > would
> > > > > > > > > > > > > >> assume
> > > > > > > > > > > > > >> > > that when you restart the client you get a new
> > > > > UUID. I
> > > > > > > > agree
> > > > > > > > > > that
> > > > > > > > > > > > it
> > > > > > > > > > > > > >> > would
> > > > > > > > > > > > > >> > > be good to spell this out.
> > > > > > > > > > > > > >> > >
> > > > > > > > > > > > > >> > >
> > > > > > > > > > > > > >> > Right, it will not be persisted since a client
> > > > > instance
> > > > > > > > can't
> > > > > > > > > be
> > > > > > > > > > > > > >> restarted.
> > > > > > > > > > > > > >> >
> > > > > > > > > > > > > >> > Will update the KIP to make this clearer.
> > > > > > > > > > > > > >> >
> > > > > > > > > > > > > >> > /Magnus
> > > > > > > > > > > > > >> >
> > > > > > > > > > > > > >>
> > > > > > > > > > > > > >>
> > > > > > > > > > > > > >> --
> > > > > > > > > > > > > >> Gwen Shapira
> > > > > > > > > > > > > >> Engineering Manager | Confluent
> > > > > > > > > > > > > >> 650.450.2760 | @gwenshap
> > > > > > > > > > > > > >> Follow us: Twitter | blog
> > > > > > > > > > > > > >>
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>