Posted to dev@kafka.apache.org by Jorge Esteban Quilcate Otoya <qu...@gmail.com> on 2022/12/05 00:07:28 UTC

Re: [DISCUSS] KIP-864: Add End-To-End Latency Metrics to Connectors

Thanks for the reminder Chris!

I have added a note to the KIP to include benchmarking as part of it, as most
of the proposed metrics are per-record and keeping them all at DEBUG would
limit their benefits. I have also created
https://issues.apache.org/jira/browse/KAFKA-14441 to keep track of this task.

Cheers,
Jorge.

On Tue, 29 Nov 2022 at 19:40, Chris Egerton <ch...@aiven.io.invalid> wrote:

> Hi Jorge,
>
> Thanks! What were your thoughts on the possible benchmarking and/or
> downgrading of per-record metrics to DEBUG?
>
> Cheers,
>
> Chris
>
> On Thu, Nov 24, 2022 at 8:20 AM Jorge Esteban Quilcate Otoya <
> quilcate.jorge@gmail.com> wrote:
>
> > Thanks Chris! I have updated the KIP with "transform" instead of "alias".
> > Agree it's clearer.
> >
> > Cheers,
> > Jorge.
> >
> > On Mon, 21 Nov 2022 at 21:36, Chris Egerton <ch...@aiven.io.invalid>
> > wrote:
> >
> > > Hi Jorge,
> > >
> > > Thanks for the updates, and apologies for the delay. The new diagram
> > > directly under the "Proposed Changes" section is absolutely gorgeous!
> > >
> > >
> > > Follow-ups:
> > >
> > > RE 2: Good point. We can use the same level for these metrics, it's
> not a
> > > big deal.
> > >
> > > RE 3: As long as all the per-record metrics are kept at DEBUG level, it
> > > should be fine to leave JMH benchmarking for a follow-up. If we want to
> > add
> > > new per-record, INFO-level metrics, I would be more comfortable with
> > > including benchmarking as part of the testing plan for the KIP. One
> > > possible compromise could be to propose that these features be merged
> at
> > > DEBUG level, and then possibly upgraded to INFO level in the future
> > pending
> > > benchmarks to guard against performance degradation.
> > >
> > > RE 4: I think for a true "end-to-end" metric, it'd be useful to include
> > the
> > > time taken by the task to actually deliver the record. However, with
> the
> > > new metric names and descriptions provided in the KIP, I have no
> > objections
> > > with what's currently proposed, and a new "end-to-end" metric can be
> > taken
> > > on later in a follow-up KIP.
> > >
> > > RE 6: You're right, existing producer metrics should be enough for now.
> > We
> > > can revisit this later if/when we add delivery-centric metrics for sink
> > > tasks as well.
> > >
> > > RE 7: The new metric names in the KIP LGTM; I don't see any need to
> > expand
> > > beyond those but if you'd still like to pursue others, LMK.
> > >
> > >
> > > New thoughts:
> > >
> > > One small thought: instead of "alias" in "alias="{transform_alias}" for
> > the
> > > per-transform metrics, could we use "transform"? IMO it's clearer since
> > we
> > > don't use "alias" in the names of transform-related properties, and
> > "alias"
> > > may be confused with the classloading term where you can use, e.g.,
> > > "FileStreamSource" as the name of a connector class in a connector
> config
> > > instead of "org.apache.kafka.connect.file.FileStreamSourceConnector".
> > >
> > >
> > > Cheers,
> > >
> > > Chris
> > >
> > > On Fri, Nov 18, 2022 at 12:06 PM Jorge Esteban Quilcate Otoya <
> > > quilcate.jorge@gmail.com> wrote:
> > >
> > > > Thanks Mickael!
> > > >
> > > >
> > > > On Wed, 9 Nov 2022 at 15:54, Mickael Maison <
> mickael.maison@gmail.com>
> > > > wrote:
> > > >
> > > > > Hi Jorge,
> > > > >
> > > > > Thanks for the KIP, it is a nice improvement.
> > > > >
> > > > > 1) The per transformation metrics still have a question mark next
> to
> > > > > them in the KIP. Do you want to include them? If so we'll want to
> tag
> > > > > them, we should be able to include the aliases in
> TransformationChain
> > > > > and use them.
> > > > >
> > > >
> > > > Yes, I have added the changes to TransformationChain that will be
> > > > needed to add these metrics.
> > > >
> > > >
> > > > >
> > > > > 2) I see no references to predicates. If we don't want to measure
> > > > > their latency, can we say it explicitly?
> > > > >
> > > >
> > > > Good question, I hadn't considered these. However, as predicates are
> > > > materialized as PredicatedTransformation, they should be covered by
> > > > these changes. I'm adding a note about this.
> > > >
> > > >
> > > > >
> > > > > 3) Should we have sink-record-batch-latency-avg-ms? All other
> metrics
> > > > > have both the maximum and average values.
> > > > >
> > > > >
> > > > Good question. I will remove it and change the record latency from
> > > > DEBUG->INFO as it already covers the maximum metric.
> > > >
> > > > Hope it's clearer now; let me know if there is any additional feedback.
> > > > Thanks!
> > > >
> > > >
> > > >
> > > > > Thanks,
> > > > > Mickael
> > > > >
> > > > > On Thu, Oct 20, 2022 at 9:58 PM Jorge Esteban Quilcate Otoya
> > > > > <qu...@gmail.com> wrote:
> > > > > >
> > > > > > Thanks, Chris! Great feedback! Please, find my comments below:
> > > > > >
> > > > > > On Thu, 13 Oct 2022 at 18:52, Chris Egerton
> > <chrise@aiven.io.invalid
> > > >
> > > > > wrote:
> > > > > >
> > > > > > > Hi Jorge,
> > > > > > >
> > > > > > > Thanks for the KIP. I agree with the overall direction and
> think
> > > this
> > > > > would
> > > > > > > be a nice improvement to Kafka Connect. Here are my initial
> > > thoughts
> > > > > on the
> > > > > > > details:
> > > > > > >
> > > > > > > 1. The motivation section outlines the gaps in Kafka Connect's
> > task
> > > > > metrics
> > > > > > > nicely. I think it'd be useful to include more concrete details
> > on
> > > > why
> > > > > > > these gaps need to be filled in, and in which cases additional
> > > > metrics
> > > > > > > would be helpful. One goal could be to provide enhanced
> > monitoring
> > > of
> > > > > > > production deployments that allows for cluster administrators
> to
> > > set
> > > > up
> > > > > > > automatic alerts for latency spikes and, if triggered, quickly
> > > > > identify the
> > > > > > > root cause of those alerts, reducing the time to remediation.
> > > Another
> > > > > goal
> > > > > > > could be to provide more insight to developers or cluster
> > > > > administrators
> > > > > > > who want to do performance testing on connectors in
> > non-production
> > > > > > > environments. It may help guide our decision making process to
> > > have a
> > > > > > > clearer picture of the goals we're trying to achieve.
> > > > > > >
> > > > > >
> > > > > > Agree. The Motivation section has been updated.
> > > > > > Thanks for the examples, I see both of them being covered by the
> > KIP.
> > > > > > I see how these could give us a good distinction on whether to
> > > position
> > > > > > some metrics at INFO or DEBUG level.
> > > > > >
> > > > > >
> > > > > > > 2. If we're trying to address the alert-and-diagnose use case,
> > it'd
> > > > be
> > > > > > > useful to have as much information as possible at INFO level,
> > > rather
> > > > > than
> > > > > > > forcing cluster administrators to possibly reconfigure a
> > connector
> > > to
> > > > > emit
> > > > > > > DEBUG or TRACE level metrics in order to diagnose a potential
> > > > > > > production-impacting performance bottleneck. I can see the
> > > rationale
> > > > > for
> > > > > > > emitting per-record metrics that track an average value at
> DEBUG
> > > > > level, but
> > > > > > > for per-record metrics that track a maximum value, is there any
> > > > reason
> > > > > not
> > > > > > > to provide this information at INFO level?
> > > > > > >
> > > > > >
> > > > > > Agree. However, since the Max and Avg metrics are part of the same
> > > > > > sensor — where the metric recording level is defined — both metrics
> > > > > > get the same level.
> > > > > >
> > > > > >
> > > > > > > 3. I'm also curious about the performance testing suggested by
> > Yash
> > > > to
> > > > > > > gauge the potential impact of this change. Have you been able
> to
> > do
> > > > any
> > > > > > > testing with your draft implementation yet?
> > > > > > >
> > > > > >
> > > > > > No, not so far.
> > > > > > I think it would be valuable to discuss the scope of this testing
> > > > > > and maybe tackle it in a separate issue, as Sensors and Metrics are
> > > > > > used all over the place.
> > > > > > My initial understanding is that these tests should be placed in the
> > > > > > jmh-benchmarks[1].
> > > > > > Then, we could target testing Sensors and Metrics, and validate how
> > > > > > much overhead is added by having only Max vs Max,Avg(,Min), etc.
> > > > > > On the other hand, we could extend this to Transformations or other
> > > > > > Connect layers.
> > > > > >
> > > > > > Here are some pointers to the Sensors and Metrics implementations
> > > that
> > > > > > could be considered:
> > > > > > Path to metric recording:
> > > > > > -
> > > > > > https://github.com/apache/kafka/blob/5cab11cf525f6c06fcf9eb43f7f95ef33fe1cdbb/clients/src/main/java/org/apache/kafka/common/metrics/Sensor.java#L195-L199
> > > > > > -
> > > > > > https://github.com/apache/kafka/blob/5cab11cf525f6c06fcf9eb43f7f95ef33fe1cdbb/clients/src/main/java/org/apache/kafka/common/metrics/Sensor.java#L230-L244
> > > > > >
> > > > > > ```
> > > > > > // increment all the stats
> > > > > > for (StatAndConfig statAndConfig : this.stats) {
> > > > > >    statAndConfig.stat.record(statAndConfig.config(), value,
> > timeMs);
> > > > > > }
> > > > > > ```
> > > > > >
> > > > > > SampledStats:
> > > > > > - Avg:
> > > > > > https://github.com/apache/kafka/blob/068ab9cefae301f3187ea885d645c425955e77d2/clients/src/main/java/org/apache/kafka/common/metrics/stats/Avg.java
> > > > > > - Max:
> > > > > > https://github.com/apache/kafka/blob/068ab9cefae301f3187ea885d645c425955e77d2/clients/src/main/java/org/apache/kafka/common/metrics/stats/Max.java
> > > > > > - Min:
> > > > > > https://github.com/apache/kafka/blob/068ab9cefae301f3187ea885d645c425955e77d2/clients/src/main/java/org/apache/kafka/common/metrics/stats/Min.java
> > > > > >
> > > > > > `stat#record()` is implemented by the `update` method of each
> > > > > > SampledStat:
> > > > > >
> > > > > > ```Max.java
> > > > > >     @Override
> > > > > >     protected void update(Sample sample, MetricConfig config, double
> > > > > > value, long now) {
> > > > > >         sample.value = Math.max(sample.value, value);
> > > > > >     }
> > > > > > ```
> > > > > >
> > > > > > ```Avg.java
> > > > > >     @Override
> > > > > >     protected void update(Sample sample, MetricConfig config, double
> > > > > > value, long now) {
> > > > > >         sample.value += value;
> > > > > >     }
> > > > > > ```
> > > > > >
> > > > > > As far as I understand, most of the work of the stats happens in
> > > > > > the `combine` method, which is not part of the connector execution
> > > > > > path but is called when metrics are queried.
> > > > > >
> > > > > > I wonder whether we should consider Avg and Max for all metrics
> > > > proposed
> > > > > as
> > > > > > the impact on the execution path seems minimal, and even see if
> Min
> > > is
> > > > > also
> > > > > > valuable, and use DEBUG only for more granular metrics.
> > > > > >
> > > > > > [1] https://github.com/apache/kafka/tree/trunk/jmh-benchmarks
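As a rough illustration of what such a benchmark would measure, here is a minimal, dependency-free sketch of the per-record cost under discussion. The `MaxStat`/`AvgStat` classes below only mimic the real `Max`/`Avg` stats (they are not the Kafka API); a real benchmark would live in jmh-benchmarks and exercise the actual `Sensor` class.

```java
// Every stat attached to a sensor adds one update() call per Sensor::record(),
// so per-record metrics multiply this cost by batch size and stat count.
interface Stat {
    void update(double value);
}

class MaxStat implements Stat {
    double value = Double.NEGATIVE_INFINITY;
    public void update(double v) { value = Math.max(value, v); }
}

class AvgStat implements Stat {
    double sum = 0;
    long count = 0;
    public void update(double v) { sum += v; count++; }
    double value() { return count == 0 ? Double.NaN : sum / count; }
}

public class SensorOverheadSketch {
    public static void main(String[] args) {
        Stat[] stats = { new MaxStat(), new AvgStat() };
        long start = System.nanoTime();
        // Simulate recording one latency value per record across many batches.
        for (int record = 0; record < 1_000_000; record++) {
            double latencyMs = record % 100; // placeholder latency value
            for (Stat s : stats)            // the "increment all the stats" loop
                s.update(latencyMs);
        }
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println("1M records, 2 stats: " + elapsedMs + " ms");
    }
}
```

Comparing runs with one, two, or three stats in the array gives a first-order feel for the Max vs Max,Avg(,Min) overhead question before committing to a full JMH setup.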
> > > > > >
> > > > > >
> > > > > > > 4. Just to make sure I understand correctly--does "time when it
> > has
> > > > > been
> > > > > > > received by the Sink task" refer to the wallclock time directly
> > > > after a
> > > > > > > call to SinkTask::put has been completed (as opposed to
> directly
> > > > before
> > > > > > > that call is made, or something else entirely)?
> > > > > > >
> > > > > >
> > > > > > It currently means when it has been received by the Sink task
> > > > > > right after consumer poll and before conversions.
> > > > > > Would it be valuable to have it after put-sink-records?
> > > > > >
> > > > > >
> > > > > > > 5. If the goal is to identify performance bottlenecks (either
> in
> > > > > production
> > > > > > > or pre-production environments), would it make sense to
> introduce
> > > > > metrics
> > > > > > > for each individual converter (i.e., key/value/header) and
> > > > > transformation?
> > > > > > > It's definitely an improvement to be able to identify the total
> > > time
> > > > > for
> > > > > > > conversion and transformation, but then the immediate follow-up
> > > > > question if
> > > > > > > a bottleneck is found in that phase is "which
> > > > converter/transformation
> > > > > is
> > > > > > > responsible?" It'd be nice if we could provide a way to quickly
> > > > answer
> > > > > that
> > > > > > > question.
> > > > > > >
> > > > > >
> > > > > > This is a great idea. I'd like to consider this as well, though it
> > > > > > may be better to have these more granular metrics at DEBUG level.
> > > > > >
> > > > > >
> > > > > > > 6. Any thoughts about offering latency metrics for source tasks
> > > > between
> > > > > > > receipt of the record from the task and delivery of the record
> to
> > > > Kafka
> > > > > > > (which would be tracked by producer callback)? We could also
> use
> > > the
> > > > > record
> > > > > > > timestamp either instead of or in addition to receipt time if
> the
> > > > task
> > > > > > > provides a timestamp with its records.
> > > > > > >
> > > > > >
> > > > > > With source transform and convert metrics we get part of that
> > > > > > latency.
> > > > > > Looking at the producer metrics, `request-latency` (though a very
> > > > > > generic metric) sort of answers the time between send request and
> > > > > > ack — if my understanding is correct.
> > > > > > Would these be enough, or are you thinking about another approach?
> > > > > > Maybe a custom metric to cover the producer side?
> > > > > >
> > > > > >
> > > > > > > 7. We may end up introducing a way for sink tasks to record
> > > > per-record
> > > > > > > delivery to the sink system (see KIP-767 [1]). I'd like it if
> we
> > > > could
> > > > > keep
> > > > > > > the names of our metrics very precise in order to avoid
> confusing
> > > > users
> > > > > > > (who may think that we're providing metrics on actual delivery
> to
> > > the
> > > > > sink
> > > > > > > system, which may not be the case if the connector performs
> > > > > asynchronous
> > > > > > > writes), and in order to leave room for a metrics on true
> > delivery
> > > > > time by
> > > > > > > sink tasks. It'd also be nice if we could remain consistent
> with
> > > > > existing
> > > > > > > metrics such as "put-batch-avg-time-ms". With that in mind,
> what
> > do
> > > > you
> > > > > > > think about renaming these metrics:
> > > > > > > - "sink-record-batch-latency-max-ms" to
> > "put-batch-avg-latency-ms"
> > > > > > > - "sink-record-latency-max-ms" to
> > "put-sink-record-latency-max-ms"
> > > > > > > - "sink-record-latency-avg-ms" to
> > "put-sink-record-latency-avg-ms"
> > > > > > > - "sink-record-convert-transform-time-max-ms" to
> > > > > > > "convert-transform-sink-record-time-max-ms"
> > > > > > > - "sink-record-convert-transform-time-avg-ms" to
> > > > > > > "convert-transform-sink-record-time-avg-ms"
> > > > > > > - "source-record-transform-convert-time-max-ms" to
> > > > > > > "transform-convert-source-record-time-max-ms"
> > > > > > > - "source-record-transform-convert-time-avg-ms" to
> > > > > > > "transform-convert-source-record-time-avg-ms"
> > > > > > >
> > > > > >
> > > > > > Makes sense, thanks! I have updated the list of metrics, grouping
> > > > > > them by sensor and applying these suggestions.
> > > > > > The only ones that I want to review are: sink-record-* to
> > > > > > put-batch-* (first 3). Not sure if put-batch/put-sink-record
> > > > > > describes the purpose of the metric — neither does
> > > > > > `sink-record-latency`, to be honest.
> > > > > > My initial thought was to have something like Kafka Streams
> > > > > > e2e-latency.
> > > > > > Based on questions 4. and 6., an idea could be to add:
> > > > > > - source-batch-e2e-latency-before-send: measure wallclock - source
> > > > > > record timestamp after source connector poll.
> > > > > > - source-batch-e2e-latency-after-send: measure wallclock - record
> > > > > > timestamp on producer send callback.
> > > > > > - sink-batch-e2e-latency-before-put: measure wallclock - record
> > > > > > timestamp after consumer poll.
> > > > > > - sink-batch-e2e-latency-after-put: measure wallclock - record
> > > > > > timestamp after sink connector put.
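All four proposed e2e values reduce to the same computation: wallclock time minus the record's timestamp, sampled at different points in the task's processing. A minimal sketch (method names and the 250 ms offset are illustrative, not from the KIP):

```java
// Sketch of the computation behind the proposed e2e metrics: each value is
// wallclock minus the record timestamp, sampled before/after send (source
// tasks) or put (sink tasks).
public class E2eLatencySketch {

    static long e2eLatencyMs(long recordTimestampMs, long wallclockMs) {
        return wallclockMs - recordTimestampMs;
    }

    public static void main(String[] args) {
        long recordTs = System.currentTimeMillis() - 250; // record created 250 ms ago
        // sink-batch-e2e-latency-before-put: sampled right after consumer poll
        long beforePut = e2eLatencyMs(recordTs, System.currentTimeMillis());
        // ... the connector's put() would run here ...
        // sink-batch-e2e-latency-after-put: sampled after the connector's put
        long afterPut = e2eLatencyMs(recordTs, System.currentTimeMillis());
        System.out.println("before-put=" + beforePut + "ms after-put=" + afterPut + "ms");
    }
}
```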
> > > > > >
> > > > > >
> > > > > > > Thanks again for the KIP! Looking forward to your thoughts.
> > > > > > >
> > > > > > > Cheers,
> > > > > > >
> > > > > > > Chris
> > > > > > >
> > > > > > > [1] -
> > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-767%3A+Connect+Latency+Metrics
> > > > > > >
> > > > > > > On Thu, Sep 15, 2022 at 1:32 PM Jorge Esteban Quilcate Otoya <
> > > > > > > quilcate.jorge@gmail.com> wrote:
> > > > > > >
> > > > > > > > Hi everyone,
> > > > > > > >
> > > > > > > > I've made a slight addition to the KIP based on Yash
> feedback:
> > > > > > > >
> > > > > > > > - A new metric is added at INFO level to record the max
> latency
> > > > from
> > > > > the
> > > > > > > > batch timestamp, by keeping the oldest record timestamp per
> > > batch.
> > > > > > > > - A draft implementation is linked.
> > > > > > > >
> > > > > > > > Looking forward to your feedback.
> > > > > > > > Also, a kindly reminder that the vote thread is open.
> > > > > > > >
> > > > > > > > Thanks!
> > > > > > > > Jorge.
> > > > > > > >
> > > > > > > > On Thu, 8 Sept 2022 at 14:25, Jorge Esteban Quilcate Otoya <
> > > > > > > > quilcate.jorge@gmail.com> wrote:
> > > > > > > >
> > > > > > > > > Great. I have updated the KIP to reflect this.
> > > > > > > > >
> > > > > > > > > Cheers,
> > > > > > > > > Jorge.
> > > > > > > > >
> > > > > > > > > On Thu, 8 Sept 2022 at 12:26, Yash Mayya <
> > yash.mayya@gmail.com
> > > >
> > > > > wrote:
> > > > > > > > >
> > > > > > > > >> Thanks, I think it makes sense to define these metrics at
> a
> > > > DEBUG
> > > > > > > > >> recording
> > > > > > > > >> level.
> > > > > > > > >>
> > > > > > > > >> On Thu, Sep 8, 2022 at 2:51 PM Jorge Esteban Quilcate
> Otoya
> > <
> > > > > > > > >> quilcate.jorge@gmail.com> wrote:
> > > > > > > > >>
> > > > > > > > >> > On Thu, 8 Sept 2022 at 05:55, Yash Mayya <
> > > > yash.mayya@gmail.com>
> > > > > > > > wrote:
> > > > > > > > >> >
> > > > > > > > >> > > Hi Jorge,
> > > > > > > > >> > >
> > > > > > > > >> > > Thanks for the changes. With regard to having per
> batch
> > vs
> > > > per
> > > > > > > > record
> > > > > > > > >> > > metrics, the additional overhead I was referring to
> > wasn't
> > > > > about
> > > > > > > > >> whether
> > > > > > > > >> > or
> > > > > > > > >> > > not we would need to iterate over all the records in a
> > > > batch.
> > > > > I
> > > > > > > was
> > > > > > > > >> > > referring to the potential additional overhead caused
> by
> > > the
> > > > > > > higher
> > > > > > > > >> > volume
> > > > > > > > >> > > of calls to Sensor::record on the sensors for the new
> > > > metrics
> > > > > (as
> > > > > > > > >> > compared
> > > > > > > > >> > > to the existing batch only metrics), especially for
> high
> > > > > > > throughput
> > > > > > > > >> > > connectors where batch sizes could be large. I guess
> we
> > > may
> > > > > want
> > > > > > > to
> > > > > > > > do
> > > > > > > > >> > some
> > > > > > > > >> > > sort of performance testing and get concrete numbers
> to
> > > > verify
> > > > > > > > whether
> > > > > > > > >> > this
> > > > > > > > >> > > is a valid concern or not?
> > > > > > > > >> > >
> > > > > > > > >> >
> > > > > > > > >> > 6.1. Got it, thanks for clarifying. I guess there could be
> > > > > > > > >> > a benchmark test of `Sensor::record` to get an idea of the
> > > > > > > > >> > performance impact.
> > > > > > > > >> > Regardless, the fact that these are single-record metrics,
> > > > > > > > >> > compared to the existing batch-only ones, could be made
> > > > > > > > >> > explicit by setting these metrics at a DEBUG or TRACE
> > > > > > > > >> > metric recording level, leaving the existing ones at INFO
> > > > > > > > >> > level.
> > > > > > > > >> > wdyt?
> > > > > > > > >> >
> > > > > > > > >> >
> > > > > > > > >> > >
> > > > > > > > >> > > Thanks,
> > > > > > > > >> > > Yash
> > > > > > > > >> > >
> > > > > > > > >> > > On Tue, Sep 6, 2022 at 4:42 PM Jorge Esteban Quilcate
> > > Otoya
> > > > <
> > > > > > > > >> > > quilcate.jorge@gmail.com> wrote:
> > > > > > > > >> > >
> > > > > > > > >> > > > Hi Sagar and Yash,
> > > > > > > > >> > > >
> > > > > > > > >> > > > > the way it's defined in
> > > > > > > > >> > > >
> > > > https://kafka.apache.org/documentation/#connect_monitoring
> > > > > for
> > > > > > > > the
> > > > > > > > >> > > metrics
> > > > > > > > >> > > >
> > > > > > > > >> > > > 4.1. Got it. Added it to the KIP.
> > > > > > > > >> > > >
> > > > > > > > >> > > > > The only thing I would argue is do we need
> > > > > > > > >> sink-record-latency-min?
> > > > > > > > >> > > Maybe
> > > > > > > > >> > > > we
> > > > > > > > >> > > > > could remove this min metric as well and make all
> of
> > > the
> > > > > 3 e2e
> > > > > > > > >> > metrics
> > > > > > > > >> > > > > consistent
> > > > > > > > >> > > >
> > > > > > > > >> > > > 4.2 I see. Will remove it from the KIP.
> > > > > > > > >> > > >
> > > > > > > > >> > > > > Probably users can track the metrics at their end
> to
> > > > > > > > >> > > > > figure that out. Do you think that makes sense?
> > > > > > > > >> > > >
> > > > > > > > >> > > > 4.3. Yes, agree. With these new metrics it should be
> > > > easier
> > > > > for
> > > > > > > > >> users
> > > > > > > > >> > to
> > > > > > > > >> > > > track this.
> > > > > > > > >> > > >
> > > > > > > > >> > > > > I think it makes sense to not have a min metric
> for
> > > > > either to
> > > > > > > > >> remain
> > > > > > > > >> > > > > consistent with the existing put-batch and
> > poll-batch
> > > > > metrics
> > > > > > > > >> > > >
> > > > > > > > >> > > > 5.1. Got it. Same as 4.2
> > > > > > > > >> > > >
> > > > > > > > >> > > > > Another naming related suggestion I had was with
> the
> > > > > > > > >> > > > > "convert-time" metrics - we should probably
> include
> > > > > > > > >> transformations
> > > > > > > > >> > in
> > > > > > > > >> > > > the
> > > > > > > > >> > > > > name since SMTs could definitely be attributable
> to
> > a
> > > > > sizable
> > > > > > > > >> chunk
> > > > > > > > >> > of
> > > > > > > > >> > > > the
> > > > > > > > >> > > > > latency depending on the specific transformation
> > > chain.
> > > > > > > > >> > > >
> > > > > > > > >> > > > 5.2. Makes sense. I'm proposing to add
> > > > > > > > >> > > > `sink-record-convert-transform...` and
> > > > > > > > >> > > > `source-record-transform-convert...` to correctly
> > > > > > > > >> > > > represent the order of operations.
> > > > > > > > >> > > >
> > > > > > > > >> > > > > it seems like both source and sink tasks only
> record
> > > > > metrics
> > > > > > > at
> > > > > > > > a
> > > > > > > > >> > > "batch"
> > > > > > > > >> > > > > level, not on an individual record level. I think
> it
> > > > > might be
> > > > > > > > >> > > additional
> > > > > > > > >> > > > > overhead if we want to record these new metrics
> all
> > at
> > > > the
> > > > > > > > record
> > > > > > > > >> > > level?
> > > > > > > > >> > > >
> > > > > > > > >> > > > 5.3. I initially considered implementing all metrics at
> > > > > > > > >> > > > the batch level, but given how the framework processes
> > > > > > > > >> > > > records, I fell back to the proposed approach:
> > > > > > > > >> > > > - Sink Task:
> > > > > > > > >> > > >   - `WorkerSinkTask#convertMessages(msgs)` already
> > > > > > > > >> > > > iterates over records, so there is no additional
> > > > > > > > >> > > > overhead to capture record latency per record.
> > > > > > > > >> > > >     -
> > > > > > > > >> > > > https://github.com/apache/kafka/blob/9841647c4fe422532f448423c92d26e4fdcb8932/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/WorkerSinkTask.java#L490-L514
> > > > > > > > >> > > >   - `WorkerSinkTask#convertAndTransformRecord(record)`
> > > > > > > > >> > > > actually happens per record. Measuring this operation
> > > > > > > > >> > > > per batch would include processing that is not strictly
> > > > > > > > >> > > > part of "convert and transform".
> > > > > > > > >> > > >     -
> > > > > > > > >> > > > https://github.com/apache/kafka/blob/9841647c4fe422532f448423c92d26e4fdcb8932/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/WorkerSinkTask.java#L518
> > > > > > > > >> > > > - Source Task:
> > > > > > > > >> > > >   - `AbstractWorkerSourceTask#sendRecords` iterates
> > > > > > > > >> > > > over a batch and applies transforms and converts
> > > > > > > > >> > > > records individually as well:
> > > > > > > > >> > > >     -
> > > > > > > > >> > > > https://github.com/apache/kafka/blob/9841647c4fe422532f448423c92d26e4fdcb8932/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/AbstractWorkerSourceTask.java#L389-L390
> > > > > > > > >> > > >
> > > > > > > > >> > > > > This might require some additional changes -
> > > > > > > > >> > > > > for instance, with the "sink-record-latency"
> metric,
> > > we
> > > > > might
> > > > > > > > only
> > > > > > > > >> > want
> > > > > > > > >> > > > to
> > > > > > > > >> > > > > have a "max" metric since "avg" would require
> > > recording
> > > > a
> > > > > > > value
> > > > > > > > on
> > > > > > > > >> > the
> > > > > > > > >> > > > > sensor for each record (whereas we can get a "max"
> > by
> > > > only
> > > > > > > > >> recording
> > > > > > > > >> > a
> > > > > > > > >> > > > > metric value for the oldest record in each batch).
> > > > > > > > >> > > >
> > > > > > > > >> > > > 5.4. Recording record latency per batch may not be as
> > > > > > > > >> > > > useful, as there is no guarantee that the oldest record
> > > > > > > > >> > > > will be representative of the batch.
> > > > > > > > >> > > >
> > > > > > > > >> > > > On Sat, 3 Sept 2022 at 16:02, Yash Mayya <
> > > > > yash.mayya@gmail.com>
> > > > > > > > >> wrote:
> > > > > > > > >> > > >
> > > > > > > > >> > > > > Hi Jorge and Sagar,
> > > > > > > > >> > > > >
> > > > > > > > >> > > > > I think it makes sense to not have a min metric
> for
> > > > > either to
> > > > > > > > >> remain
> > > > > > > > >> > > > > consistent with the existing put-batch and
> > poll-batch
> > > > > metrics
> > > > > > > > (it
> > > > > > > > >> > > doesn't
> > > > > > > > >> > > > > seem particularly useful either anyway). Also, the
> > new
> > > > > > > > >> > > > > "sink-record-latency" metric name looks fine to
> me,
> > > > > thanks for
> > > > > > > > >> making
> > > > > > > > >> > > the
> > > > > > > > >> > > > > changes! Another naming related suggestion I had
> was
> > > > with
> > > > > the
> > > > > > > > >> > > > > "convert-time" metrics - we should probably
> include
> > > > > > > > >> transformations
> > > > > > > > >> > in
> > > > > > > > >> > > > the
> > > > > > > > >> > > > > name since SMTs could definitely be attributable
> to
> > a
> > > > > sizable
> > > > > > > > >> chunk
> > > > > > > > >> > of
> > > > > > > > >> > > > the
> > > > > > > > >> > > > > latency depending on the specific transformation
> > > chain.
> > > > > > > > >> > > > >
> > > > > > > > >> > > > > I have one high level question with respect to
> > > > > implementation
> > > > > > > -
> > > > > > > > >> > > > currently,
> > > > > > > > >> > > > > it seems like both source and sink tasks only
> record
> > > > > metrics
> > > > > > > at
> > > > > > > > a
> > > > > > > > >> > > "batch"
> > > > > > > > >> > > > > level, not on an individual record level. I think
> it
> > > > > might be
> > > > > > > > >> > > additional
> > > > > > > > >> > > > > overhead if we want to record these new metrics
> all
> > at
> > > > the
> > > > > > > > record
> > > > > > > > >> > > level?
> > > > > > > > >> > > > > Could we instead make all of these new metrics for
> > > > > batches of
> > > > > > > > >> records
> > > > > > > > >> > > > > rather than individual records in order to remain
> > > > > consistent
> > > > > > > > with
> > > > > > > > >> the
> > > > > > > > >> > > > > existing task level metrics? This might require
> some
> > > > > > > additional
> > > > > > > > >> > > changes -
> > > > > > > > >> > > > > for instance, with the "sink-record-latency"
> metric,
> > > we
> > > > > might
> > > > > > > > only
> > > > > > > > >> > want
> > > > > > > > >> > > > to
> > > > > > > > >> > > > > have a "max" metric since "avg" would require
> > > recording
> > > > a
> > > > > > > value
> > > > > > > > on
> > > > > > > > >> > the
> > > > > > > > >> > > > > sensor for each record (whereas we can get a "max"
> > by
> > > > only
> > > > > > > > >> recording
> > > > > > > > >> > a
> > > > > > > > >> > > > > metric value for the oldest record in each batch).
> > > > > > > > >> > > > >
> > > > > > > > >> > > > > Thanks,
> > > > > > > > >> > > > > Yash
> > > > > > > > >> > > > >
> > > > > > > > >> > > > > On Fri, Sep 2, 2022 at 3:16 PM Sagar <
> > > > > > > sagarmeansocean@gmail.com
> > > > > > > > >
> > > > > > > > >> > > wrote:
> > > > > > > > >> > > > >
> > > > > > > > >> > > > > > Hi Jorge,
> > > > > > > > >> > > > > >
> > > > > > > > >> > > > > > Thanks for the changes.
> > > > > > > > >> > > > > >
> > > > > > > > >> > > > > > Regarding the metrics, I meant something like this:
> > > > > > > > >> > > > > >
> > > > > > > > >> > > > > > kafka.connect:type=sink-task-metrics,connector="{connector}",task="{task}"
> > > > > > > > >> > > > > >
> > > > > > > > >> > > > > > the way it's defined in
> > > > > > > > >> > > > > > https://kafka.apache.org/documentation/#connect_monitoring
> > > > > > > > >> > > > > > for the metrics.
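As a side note, an MBean name in that pattern can be built and inspected with the standard JMX API; "my-connector" and task "0" below are example values, not part of the KIP:

```java
import javax.management.ObjectName;

// Builds a sink-task-metrics MBean name in the pattern quoted above.
public class ConnectMBeanName {
    public static void main(String[] args) throws Exception {
        ObjectName name = new ObjectName(
                "kafka.connect:type=sink-task-metrics,connector=my-connector,task=0");
        System.out.println(name.getCanonicalName());
        // Against a live worker, an attribute could then be read via
        // ManagementFactory.getPlatformMBeanServer().getAttribute(name, ...).
    }
}
```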
> > > > > > > > >> > > > > >
> > > > > > > > >> > > > > > I see what you mean by the 3 metrics and how they
> > > > > > > > >> > > > > > can be interpreted. The only thing I would argue
> > > > > > > > >> > > > > > is: do we need sink-record-latency-min? Maybe we
> > > > > > > > >> > > > > > could remove this min metric as well and make all 3
> > > > > > > > >> > > > > > e2e metrics consistent (since put-batch also
> > > > > > > > >> > > > > > doesn't expose a min, which makes sense to me). I
> > > > > > > > >> > > > > > think this is in contrast to what Yash pointed out
> > > > > > > > >> > > > > > above, so I would like to hear his thoughts as
> > > > > > > > >> > > > > > well.
> > > > > > > > >> > > > > >
> > > > > > > > >> > > > > > The other point Yash mentioned about the
> slightly
> > > > flawed
> > > > > > > > >> definition
> > > > > > > > >> > > of
> > > > > > > > >> > > > > e2e
> > > > > > > > >> > > > > > is also true in a sense. But I have a feeling
> > that's
> > > > > once the
> > > > > > > > >> > records
> > > > > > > > >> > > > are
> > > > > > > > >> > > > > > polled by the connector tasks, it would be
> > difficult
> > > > to
> > > > > > > track
> > > > > > > > >> the
> > > > > > > > >> > > final
> > > > > > > > >> > > > > leg
> > > > > > > > >> > > > > > via the framework. Probably users can track the
> > > > metrics
> > > > > at
> > > > > > > > their
> > > > > > > > >> > end
> > > > > > > > >> > > to
> > > > > > > > >> > > > > > figure that out. Do you think that makes sense?
> > > > > > > > >> > > > > >
> > > > > > > > >> > > > > > Thanks!
> > > > > > > > >> > > > > > Sagar.
> > > > > > > > >> > > > > >
> > > > > > > > >> > > > > >
> > > > > > > > >> > > > > >
> > > > > > > > >> > > > > >
> > > > > > > > >> > > > > > On Thu, Sep 1, 2022 at 11:40 PM Jorge Esteban
> > > Quilcate
> > > > > > > Otoya <
> > > > > > > > >> > > > > > quilcate.jorge@gmail.com> wrote:
> > > > > > > > >> > > > > >
> > > > > > > > >> > > > > > > Hi Sagar and Yash,
> > > > > > > > >> > > > > > >
> > > > > > > > >> > > > > > > Thanks for your feedback!
> > > > > > > > >> > > > > > >
> > > > > > > > >> > > > > > > > 1) I am assuming the new metrics would be
> task
> > > > level
> > > > > > > > metric.
> > > > > > > > >> > > > > > >
> > > > > > > > >> > > > > > > 1.1 Yes, it will be a task level metric,
> > > implemented
> > > > > on
> > > > > > > the
> > > > > > > > >> > > > > > > Worker[Source/Sink]Task.
> > > > > > > > >> > > > > > >
> > > > > > > > >> > > > > > > > Could you specify the way it's done for
> other
> > > > > > > sink/source
> > > > > > > > >> > > > connector?
> > > > > > > > >> > > > > > >
> > > > > > > > >> > > > > > > 1.2. Not sure what you mean by this. Could
> > you
> > > > > > > elaborate
> > > > > > > > a
> > > > > > > > >> bit
> > > > > > > > >> > > > more?
> > > > > > > > >> > > > > > >
> > > > > > > > >> > > > > > > > 2. I am slightly confused about the e2e
> > latency
> > > > > > > metric...
> > > > > > > > >> > > > > > >
> > > > > > > > >> > > > > > > 2.1. Yes, I see. I was trying to bring a
> similar
> > > > > concept
> > > > > > > as
> > > > > > > > in
> > > > > > > > >> > > > Streams
> > > > > > > > >> > > > > > with
> > > > > > > > >> > > > > > > KIP-613, though the e2e concept may not be
> > > > > translatable.
> > > > > > > > >> > > > > > > We could keep it as `sink-record-latency` to
> > avoid
> > > > > > > > conflating
> > > > > > > > >> > > > > concepts. A
> > > > > > > > >> > > > > > > similar metric naming was proposed in KIP-489
> > but
> > > at
> > > > > the
> > > > > > > > >> consumer
> > > > > > > > >> > > > > level —
> > > > > > > > >> > > > > > > though it seems dormant for a couple of years.
> > > > > > > > >> > > > > > >
> > > > > > > > >> > > > > > > > However, the put-batch time measures the
> > > > > > > > >> > > > > > > > time to put a batch of records to external
> > sink.
> > > > > So, I
> > > > > > > > would
> > > > > > > > >> > > assume
> > > > > > > > >> > > > > > the 2
> > > > > > > > >> > > > > > > > can't be added as is to compute the e2e
> > latency.
> > > > > Maybe I
> > > > > > > > am
> > > > > > > > >> > > missing
> > > > > > > > >> > > > > > > > something here. Could you plz clarify this.
> > > > > > > > >> > > > > > >
> > > > > > > > >> > > > > > > 2.2. Yes, agree. Not necessarily added, but
> with
> > > > the 3
> > > > > > > > >> latencies
> > > > > > > > >> > > > (poll,
> > > > > > > > >> > > > convert, putBatch) it will be clearer where the
> > > > > bottleneck
> > > > > > > may
> > > > > > > > >> be,
> > > > > > > > >> > and
> > > > > > > > >> > > > > > > represent the internal processing.
> > > > > > > > >> > > > > > >
> > > > > > > > >> > > > > > > > however, as per the KIP it looks like it
> will
> > be
> > > > > > > > >> > > > > > > > the latency between when the record was
> > written
> > > to
> > > > > Kafka
> > > > > > > > and
> > > > > > > > >> > when
> > > > > > > > >> > > > the
> > > > > > > > >> > > > > > > > record is returned by a sink task's
> consumer's
> > > > poll?
> > > > > > > > >> > > > > > >
> > > > > > > > >> > > > > > > 3.1. Agree. 2.1. could help to clarify this.
> > > > > > > > >> > > > > > >
> > > > > > > > >> > > > > > > > One more thing - I was wondering
> > > > > > > > >> > > > > > > > if there's a particular reason for having a
> > min
> > > > > metric
> > > > > > > for
> > > > > > > > >> e2e
> > > > > > > > >> > > > > latency
> > > > > > > > >> > > > > > > but
> > > > > > > > >> > > > > > > > not for convert time?
> > > > > > > > >> > > > > > >
> > > > > > > > >> > > > > > > 3.2. Was following KIP-613 for e2e which seems
> > > > useful
> > > > > to
> > > > > > > > >> compare
> > > > > > > > >> > > with
> > > > > > Max and
> > > > > > > > >> > > > > > > get an idea of the window of results, though
> > > current
> > > > > > > > >> latencies in
> > > > > > > > >> > > > > > Connector
> > > > > > > > >> > > > > > > do not include Min, and that's why I haven't
> > added
> > > > it
> > > > > for
> > > > > > > > >> convert
> > > > > > > > >> > > > > > latency.
> > > > > > > > >> > > > > > > Do you think it makes sense to extend latency
> > > metrics
> > > > > with
> > > > > > > > Min?
> > > > > > > > >> > > > > > >
> > > > > > > > >> > > > > > > KIP is updated to clarify some of these
> changes.
> > > > > > > > >> > > > > > >
> > > > > > > > >> > > > > > > Many thanks,
> > > > > > > > >> > > > > > > Jorge.
> > > > > > > > >> > > > > > >
> > > > > > > > >> > > > > > > On Thu, 1 Sept 2022 at 18:11, Yash Mayya <
> > > > > > > > >> yash.mayya@gmail.com>
> > > > > > > > >> > > > wrote:
> > > > > > > > >> > > > > > >
> > > > > > > > >> > > > > > > > Hi Jorge,
> > > > > > > > >> > > > > > > >
> > > > > > > > >> > > > > > > > Thanks for the KIP! I have the same
> confusion
> > > with
> > > > > the
> > > > > > > > >> > > e2e-latency
> > > > > > > > >> > > > > > > metrics
> > > > > > > > >> > > > > > > > as Sagar above. "e2e" would seem to indicate
> > the
> > > > > latency
> > > > > > > > >> > between
> > > > > > > > >> > > > when
> > > > > > > > >> > > > > > the
> > > > > > > > >> > > > > > > > record was written to Kafka and when the
> > record
> > > > was
> > > > > > > > written
> > > > > > > > >> to
> > > > > > > > >> > > the
> > > > > > > > >> > > > > sink
> > > > > > > > >> > > > > > > > system by the connector - however, as per
> the
> > > KIP
> > > > it
> > > > > > > looks
> > > > > > > > >> like
> > > > > > > > >> > > it
> > > > > > > > >> > > > > will
> > > > > > > > >> > > > > > > be
> > > > > > > > >> > > > > > > > the latency between when the record was
> > written
> > > to
> > > > > Kafka
> > > > > > > > and
> > > > > > > > >> > when
> > > > > > > > >> > > > the
> > > > > > > > >> > > > > > > > record is returned by a sink task's
> consumer's
> > > > > poll? I
> > > > > > > > think
> > > > > > > > >> > that
> > > > > > > > >> > > > > > metric
> > > > > > > > >> > > > > > > > will be a little confusing to interpret. One
> > > more
> > > > > thing
> > > > > > > -
> > > > > > > > I
> > > > > > > > >> was
> > > > > > > > >> > > > > > wondering
> > > > > > > > >> > > > > > > > if there's a particular reason for having a
> > min
> > > > > metric
> > > > > > > for
> > > > > > > > >> e2e
> > > > > > > > >> > > > > latency
> > > > > > > > >> > > > > > > but
> > > > > > > > >> > > > > > > > not for convert time?
> > > > > > > > >> > > > > > > >
> > > > > > > > >> > > > > > > > Thanks,
> > > > > > > > >> > > > > > > > Yash
> > > > > > > > >> > > > > > > >
> > > > > > > > >> > > > > > > > On Thu, Sep 1, 2022 at 8:59 PM Sagar <
> > > > > > > > >> > sagarmeansocean@gmail.com>
> > > > > > > > >> > > > > > wrote:
> > > > > > > > >> > > > > > > >
> > > > > > > > >> > > > > > > > > Hi Jorge,
> > > > > > > > >> > > > > > > > >
> > > > > > > > >> > > > > > > > > Thanks for the KIP. It looks like a very
> > good
> > > > > > > addition.
> > > > > > > > I
> > > > > > > > >> > > skimmed
> > > > > > > > >> > > > > > > through
> > > > > > > > >> > > > > > > > > once and had a couple of questions =>
> > > > > > > > >> > > > > > > > >
> > > > > > > > >> > > > > > > > > 1) I am assuming the new metrics would be
> > task
> > > > > level
> > > > > > > > >> metric.
> > > > > > > > >> > > > Could
> > > > > > > > >> > > > > > you
> > > > > > > > >> > > > > > > > > specify the way it's done for other
> > > sink/source
> > > > > > > > connector?
> > > > > > > > >> > > > > > > > > 2) I am slightly confused about the e2e
> > > latency
> > > > > > > metric.
> > > > > > > > >> Let's
> > > > > > > > >> > > > > > consider
> > > > > > > > >> > > > > > > > the
> > > > > > > > >> > > > > > > > > sink connector metric. If I look at the
> way
> > > it's
> > > > > > > > supposed
> > > > > > > > >> to
> > > > > > > > >> > be
> > > > > > > > >> > > > > > > > calculated,
> > > > > > > > >> > > > > > > > > i.e the difference between the record
> > > timestamp
> > > > > and
> > > > > > > the
> > > > > > > > >> wall
> > > > > > > > >> > > > clock
> > > > > > > > >> > > > > > > time,
> > > > > > > > >> > > > > > > > it
> > > > > > > > >> > > > > > > > > looks like a per record metric. However,
> the
> > > > > put-batch
> > > > > > > > >> time
> > > > > > > > >> > > > > measures
> > > > > > > > >> > > > > > > the
> > > > > > > > >> > > > > > > > > time to put a batch of records to external
> > > sink.
> > > > > So, I
> > > > > > > > >> would
> > > > > > > > >> > > > assume
> > > > > > > > >> > > > > > > the 2
> > > > > > > > >> > > > > > > > > can't be added as is to compute the e2e
> > > latency.
> > > > > > > Maybe I
> > > > > > > > >> am
> > > > > > > > >> > > > missing
> > > > > > > > >> > > > > > > > > something here. Could you plz clarify
> this.
> > > > > > > > >> > > > > > > > >
> > > > > > > > >> > > > > > > > > Thanks!
> > > > > > > > >> > > > > > > > > Sagar.
> > > > > > > > >> > > > > > > > >
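[Editor's sketch] The per-record calculation Sagar is asking about can be sketched in plain Java (this is not the actual WorkerSinkTask code): the e2e latency is simply the wall-clock time at receipt minus the record's Kafka timestamp, recorded once per record, and is independent of how long a later put-batch call takes.

```java
public class SinkRecordLatency {
    // Per-record e2e latency: wall clock at receipt minus the record's timestamp.
    public static long e2eLatencyMs(long recordTimestampMs, long wallClockMs) {
        return wallClockMs - recordTimestampMs;
    }

    public static void main(String[] args) {
        long recordTs = System.currentTimeMillis() - 250; // record written 250 ms ago
        long latency = e2eLatencyMs(recordTs, System.currentTimeMillis());
        System.out.println(latency >= 250); // prints true
    }
}
```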
> > > > > > > > >> > > > > > > > > On Tue, Aug 30, 2022 at 8:43 PM Jorge
> > Esteban
> > > > > Quilcate
> > > > > > > > >> Otoya
> > > > > > > > >> > <
> > > > > > > > >> > > > > > > > > quilcate.jorge@gmail.com> wrote:
> > > > > > > > >> > > > > > > > >
> > > > > > > > >> > > > > > > > > > Hi all,
> > > > > > > > >> > > > > > > > > >
> > > > > > > > >> > > > > > > > > > I'd like to start a discussion thread on
> > > > > KIP-864:
> > > > > > > Add
> > > > > > > > >> > > > End-To-End
> > > > > > > > >> > > > > > > > Latency
> > > > > > > > >> > > > > > > > > > Metrics to Connectors.
> > > > > > > > >> > > > > > > > > > This KIP aims to improve the metrics
> > > available
> > > > > on
> > > > > > > > Source
> > > > > > > > >> > and
> > > > > > > > >> > > > Sink
> > > > > > > > >> > > > > > > > > > Connectors to measure end-to-end
> latency,
> > > > > including
> > > > > > > > >> source
> > > > > > > > >> > > and
> > > > > > > > >> > > > > sink
> > > > > > > > >> > > > > > > > > record
> > > > > > > > >> > > > > > > > > > conversion time, and sink record e2e
> > latency
> > > > > > > (similar
> > > > > > > > to
> > > > > > > > >> > > > KIP-613
> > > > > > > > >> > > > > > for
> > > > > > > > >> > > > > > > > > > Streams).
> > > > > > > > >> > > > > > > > > >
> > > > > > > > >> > > > > > > > > > The KIP is here:
> > > > > > > > >> > > > > > > > > >
> > > > > > > > >> > > > > > > > > >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-864%3A+Add+End-To-End+Latency+Metrics+to+Connectors
> > > > > > > > >> > > > > > > > > >
> > > > > > > > >> > > > > > > > > > Please take a look and let me know what
> > you
> > > > > think.
> > > > > > > > >> > > > > > > > > >
> > > > > > > > >> > > > > > > > > > Cheers,
> > > > > > > > >> > > > > > > > > > Jorge.
> > > > > > > > >> > > > > > > > > >
> > > > > > > > >> > > > > > > > >
> > > > > > > > >> > > > > > > >
> > > > > > > > >> > > > > > >
> > > > > > > > >> > > > > >
> > > > > > > > >> > > > >
> > > > > > > > >> > > >
> > > > > > > > >> > >
> > > > > > > > >> >
> > > > > > > > >>
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] KIP-864: Add End-To-End Latency Metrics to Connectors

Posted by Jorge Esteban Quilcate Otoya <qu...@gmail.com>.
No worries, thanks Chris!

I think most feedback has been covered and the KIP is ready for vote. Will
be starting the vote thread soon.

Cheers,
Jorge.

On Mon, 5 Dec 2022 at 15:10, Chris Egerton <ch...@aiven.io.invalid> wrote:

> Hi Jorge,
>
> Thanks for indulging my paranoia. LGTM!
>
> Cheers,
>
> Chris
>
> On Mon, Dec 5, 2022 at 10:06 AM Jorge Esteban Quilcate Otoya <
> quilcate.jorge@gmail.com> wrote:
>
> > Sure! I have a added the following to the proposed changes section:
> >
> > ```
> > The per-record metrics will definitely be added to Kafka Connect as part
> of
> > this KIP, but their metric level will be changed pending the performance
> > testing described in KAFKA-14441, and will otherwise only be exposed at
> > a lower level (DEBUG instead of INFO, and TRACE instead of DEBUG)
> > ```
> >
> > Let me know how it looks.
> >
> > Many thanks!
> > Jorge.
> >
> > On Mon, 5 Dec 2022 at 14:11, Chris Egerton <ch...@aiven.io.invalid>
> > wrote:
> >
> > > Hi Jorge,
> > >
> > > Thanks for filing KAFKA-14441! In the ticket description we mention
> that
> > > "there will be more confidence whether to design metrics to be exposed
> > at a
> > > DEBUG or INFO level depending on their impact" but it doesn't seem like
> > > this is called out in the KIP and, just based on what's in the KIP, the
> > > proposal is still to have several per-record metrics exposed at INFO
> > level.
> > >
> > > Could we explicitly call out that the per-record metrics will
> definitely
> > be
> > > added to Kafka Connect as part of this KIP, but they will only be
> exposed
> > > at INFO level pending the performance testing described in
> > > KAFKA-14441, and will otherwise only be exposed at DEBUG level?
> > Otherwise,
> > > it's possible that a vote for the KIP as it's written today would be a
> > vote
> > > in favor of unconditionally exposing these metrics at INFO level, even
> if
> > > the performance testing reveals issues.
> > >
> > > Cheers,
> > >
> > > Chris
> > >
> > > On Sun, Dec 4, 2022 at 7:08 PM Jorge Esteban Quilcate Otoya <
> > > quilcate.jorge@gmail.com> wrote:
> > >
> > > > Thanks for the reminder Chris!
> > > >
> > > > I have added a note on the KIP to include this as part of the KIP as
> > most
> > > > of the metrics proposed are per-record and having all on DEBUG would
> > > limit
> > > > the benefits, and created
> > > > https://issues.apache.org/jira/browse/KAFKA-14441
> > > > to keep track of this task.
> > > >
> > > > Cheers,
> > > > Jorge.
> > > >
> > > > On Tue, 29 Nov 2022 at 19:40, Chris Egerton <chrise@aiven.io.invalid
> >
> > > > wrote:
> > > >
> > > > > Hi Jorge,
> > > > >
> > > > > Thanks! What were your thoughts on the possible benchmarking and/or
> > > > > downgrading of per-record metrics to DEBUG?
> > > > >
> > > > > Cheers,
> > > > >
> > > > > Chris
> > > > >
> > > > > On Thu, Nov 24, 2022 at 8:20 AM Jorge Esteban Quilcate Otoya <
> > > > > quilcate.jorge@gmail.com> wrote:
> > > > >
> > > > > > Thanks Chris! I have updated the KIP with "transform" instead of
> > > > "alias".
> > > > > > Agree it's clearer.
> > > > > >
> > > > > > Cheers,
> > > > > > Jorge.
> > > > > >
> > > > > > On Mon, 21 Nov 2022 at 21:36, Chris Egerton
> > <chrise@aiven.io.invalid
> > > >
> > > > > > wrote:
> > > > > >
> > > > > > > Hi Jorge,
> > > > > > >
> > > > > > > Thanks for the updates, and apologies for the delay. The new
> > > diagram
> > > > > > > directly under the "Proposed Changes" section is absolutely
> > > gorgeous!
> > > > > > >
> > > > > > >
> > > > > > > Follow-ups:
> > > > > > >
> > > > > > > RE 2: Good point. We can use the same level for these metrics,
> > it's
> > > > > not a
> > > > > > > big deal.
> > > > > > >
> > > > > > > RE 3: As long as all the per-record metrics are kept at DEBUG
> > > level,
> > > > it
> > > > > > > should be fine to leave JMH benchmarking for a follow-up. If we
> > > want
> > > > to
> > > > > > add
> > > > > > > new per-record, INFO-level metrics, I would be more comfortable
> > > with
> > > > > > > including benchmarking as part of the testing plan for the KIP.
> > One
> > > > > > > possible compromise could be to propose that these features be
> > > merged
> > > > > at
> > > > > > > DEBUG level, and then possibly upgraded to INFO level in the
> > future
> > > > > > pending
> > > > > > > benchmarks to guard against performance degradation.
> > > > > > >
> > > > > > > RE 4: I think for a true "end-to-end" metric, it'd be useful to
> > > > include
> > > > > > the
> > > > > > > time taken by the task to actually deliver the record. However,
> > > with
> > > > > the
> > > > > > > new metric names and descriptions provided in the KIP, I have
> no
> > > > > > objections
> > > > > > > with what's currently proposed, and a new "end-to-end" metric
> can
> > > be
> > > > > > taken
> > > > > > > on later in a follow-up KIP.
> > > > > > >
> > > > > > > RE 6: You're right, existing producer metrics should be enough
> > for
> > > > now.
> > > > > > We
> > > > > > > can revisit this later if/when we add delivery-centric metrics
> > for
> > > > sink
> > > > > > > tasks as well.
> > > > > > >
> > > > > > > RE 7: The new metric names in the KIP LGTM; I don't see any
> need
> > to
> > > > > > expand
> > > > > > > beyond those but if you'd still like to pursue others, LMK.
> > > > > > >
> > > > > > >
> > > > > > > New thoughts:
> > > > > > >
> > > > > > > One small thought: instead of "alias" in
> > "alias="{transform_alias}"
> > > > for
> > > > > > the
> > > > > > > per-transform metrics, could we use "transform"? IMO it's
> clearer
> > > > since
> > > > > > we
> > > > > > > don't use "alias" in the names of transform-related properties,
> > and
> > > > > > "alias"
> > > > > > > may be confused with the classloading term where you can use,
> > e.g.,
> > > > > > > "FileStreamSource" as the name of a connector class in a
> > connector
> > > > > config
> > > > > > > instead of
> > > "org.apache.kafka.connect.file.FileStreamSourceConnector".
> > > > > > >
> > > > > > >
> > > > > > > Cheers,
> > > > > > >
> > > > > > > Chris
> > > > > > >
> > > > > > > On Fri, Nov 18, 2022 at 12:06 PM Jorge Esteban Quilcate Otoya <
> > > > > > > quilcate.jorge@gmail.com> wrote:
> > > > > > >
> > > > > > > > Thanks Mickael!
> > > > > > > >
> > > > > > > >
> > > > > > > > On Wed, 9 Nov 2022 at 15:54, Mickael Maison <
> > > > > mickael.maison@gmail.com>
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Hi Jorge,
> > > > > > > > >
> > > > > > > > > Thanks for the KIP, it is a nice improvement.
> > > > > > > > >
> > > > > > > > > 1) The per transformation metrics still have a question
> mark
> > > next
> > > > > to
> > > > > > > > > them in the KIP. Do you want to include them? If so we'll
> > want
> > > to
> > > > > tag
> > > > > > > > > them, we should be able to include the aliases in
> > > > > TransformationChain
> > > > > > > > > and use them.
> > > > > > > > >
> > > > > > > >
> > > > > > > > Yes, I have added the changes on TransformChain that will be
> > > needed
> > > > > to
> > > > > > > add
> > > > > > > > these metrics.
> > > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > > 2) I see no references to predicates. If we don't want to
> > > measure
> > > > > > > > > their latency, can we say it explicitly?
> > > > > > > > >
> > > > > > > >
> > > > > > > > Good question, I haven't considered these. Though as these
> are
> > > > > > > materialized
> > > > > > > > as PredicatedTransformation, they should be covered by these
> > > > changes.
> > > > > > > > Adding a note about this.
> > > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > > 3) Should we have sink-record-batch-latency-avg-ms? All
> other
> > > > > metrics
> > > > > > > > > have both the maximum and average values.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > Good question. I will remove it and change the record latency
> > > from
> > > > > > > > DEBUG->INFO as it already covers the maximum metric.
> > > > > > > >
> > > > > > > > Hope it's clearer now, let me know if there is any additional
> > > > feedback.
> > > > > > > > Thanks!
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > > Mickael
> > > > > > > > >
> > > > > > > > > On Thu, Oct 20, 2022 at 9:58 PM Jorge Esteban Quilcate
> Otoya
> > > > > > > > > <qu...@gmail.com> wrote:
> > > > > > > > > >
> > > > > > > > > > Thanks, Chris! Great feedback! Please, find my comments
> > > below:
> > > > > > > > > >
> > > > > > > > > > On Thu, 13 Oct 2022 at 18:52, Chris Egerton
> > > > > > <chrise@aiven.io.invalid
> > > > > > > >
> > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > Hi Jorge,
> > > > > > > > > > >
> > > > > > > > > > > Thanks for the KIP. I agree with the overall direction
> > and
> > > > > think
> > > > > > > this
> > > > > > > > > would
> > > > > > > > > > > be a nice improvement to Kafka Connect. Here are my
> > initial
> > > > > > > thoughts
> > > > > > > > > on the
> > > > > > > > > > > details:
> > > > > > > > > > >
> > > > > > > > > > > 1. The motivation section outlines the gaps in Kafka
> > > > Connect's
> > > > > > task
> > > > > > > > > metrics
> > > > > > > > > > > nicely. I think it'd be useful to include more concrete
> > > > details
> > > > > > on
> > > > > > > > why
> > > > > > > > > > > these gaps need to be filled in, and in which cases
> > > > additional
> > > > > > > > metrics
> > > > > > > > > > > would be helpful. One goal could be to provide enhanced
> > > > > > monitoring
> > > > > > > of
> > > > > > > > > > > production deployments that allows for cluster
> > > administrators
> > > > > to
> > > > > > > set
> > > > > > > > up
> > > > > > > > > > > automatic alerts for latency spikes and, if triggered,
> > > > quickly
> > > > > > > > > identify the
> > > > > > > > > > > root cause of those alerts, reducing the time to
> > > remediation.
> > > > > > > Another
> > > > > > > > > goal
> > > > > > > > > > > could be to provide more insight to developers or
> cluster
> > > > > > > > > administrators
> > > > > > > > > > > who want to do performance testing on connectors in
> > > > > > non-production
> > > > > > > > > > > environments. It may help guide our decision making
> > process
> > > > to
> > > > > > > have a
> > > > > > > > > > > clearer picture of the goals we're trying to achieve.
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Agree. The Motivation section has been updated.
> > > > > > > > > > Thanks for the examples, I see both of them being covered
> > by
> > > > the
> > > > > > KIP.
> > > > > > > > > > I see how these could give us a good distinction on
> whether
> > > to
> > > > > > > position
> > > > > > > > > > some metrics at INFO or DEBUG level.
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > 2. If we're trying to address the alert-and-diagnose
> use
> > > > case,
> > > > > > it'd
> > > > > > > > be
> > > > > > > > > > > useful to have as much information as possible at INFO
> > > level,
> > > > > > > rather
> > > > > > > > > than
> > > > > > > > > > > forcing cluster administrators to possibly reconfigure
> a
> > > > > > connector
> > > > > > > to
> > > > > > > > > emit
> > > > > > > > > > > DEBUG or TRACE level metrics in order to diagnose a
> > > potential
> > > > > > > > > > > production-impacting performance bottleneck. I can see
> > the
> > > > > > > rationale
> > > > > > > > > for
> > > > > > > > > > > emitting per-record metrics that track an average value
> > at
> > > > > DEBUG
> > > > > > > > > level, but
> > > > > > > > > > > for per-record metrics that track a maximum value, is
> > there
> > > > any
> > > > > > > > reason
> > > > > > > > > not
> > > > > > > > > > > to provide this information at INFO level?
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Agree. Though with Max and Avg metrics being part of the
> > same
> > > > > > sensor
> > > > > > > —
> > > > > > > > > > where Metric Level is defined — both metrics get the
> > > same
> > > > > > level.
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > 3. I'm also curious about the performance testing
> > suggested
> > > > by
> > > > > > Yash
> > > > > > > > to
> > > > > > > > > > > gauge the potential impact of this change. Have you
> been
> > > able
> > > > > to
> > > > > > do
> > > > > > > > any
> > > > > > > > > > > testing with your draft implementation yet?
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > No, not so far.
> > > > > > > > > > I think it would be valuable to discuss the scope of this
> > > > testing
> > > > > > and
> > > > > > > > > maybe
> > > > > > > > > > tackle it
> > > > > > > > > > in a separate issue as Sensors and Metrics are used all
> > over
> > > > the
> > > > > > > place.
> > > > > > > > > > My initial understanding is that these tests should by
> > placed
> > > > in
> > > > > > the
> > > > > > > > > > jmh-benchmarks[1].
> > > > > > > > > > Then, we could target testing Sensors and Metrics, and
> > > validate
> > > > > how
> > > > > > > > much
> > > > > > > > > > overhead
> > > > > > > > > > is added by having only Max vs Max,Avg(,Min), etc.
> > > > > > > > > > In the other hand, we could extend this to Transformers
> or
> > > > other
> > > > > > > > Connect
> > > > > > > > > > layers.
> > > > > > > > > >
> > > > > > > > > > Here are some pointers to the Sensors and Metrics
> > > > implementations
> > > > > > > that
> > > > > > > > > > could be considered:
> > > > > > > > > > Path to metric recording:
> > > > > > > > > > -
> > > > > > > > > >
> https://github.com/apache/kafka/blob/5cab11cf525f6c06fcf9eb43f7f95ef33fe1cdbb/clients/src/main/java/org/apache/kafka/common/metrics/Sensor.java#L195-L199
> > > > > > > > > > -
> > > > > > > > > >
> https://github.com/apache/kafka/blob/5cab11cf525f6c06fcf9eb43f7f95ef33fe1cdbb/clients/src/main/java/org/apache/kafka/common/metrics/Sensor.java#L230-L244
> > > > > > > > > >
> > > > > > > > > > ```
> > > > > > > > > > // increment all the stats
> > > > > > > > > > for (StatAndConfig statAndConfig : this.stats) {
> > > > > > > > > >    statAndConfig.stat.record(statAndConfig.config(),
> value,
> > > > > > timeMs);
> > > > > > > > > > }
> > > > > > > > > > ```
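[Editor's sketch] To make the recording-path cost concrete, here is a stripped-down imitation of the Avg/Max sampled stats (these are not Kafka's actual classes): each record() call is a single add or compare, and the average is only computed when the metric is read, which supports the point that per-record overhead is small.

```java
public class MiniStats {
    public static final class MiniAvg {
        double sum; long count;
        public void record(double v) { sum += v; count++; }       // per-record: one add
        public double measure() { return count == 0 ? Double.NaN : sum / count; }
    }
    public static final class MiniMax {
        double max = Double.NEGATIVE_INFINITY;
        public void record(double v) { max = Math.max(max, v); }  // per-record: one compare
        public double measure() { return max; }
    }

    public static void main(String[] args) {
        MiniAvg avg = new MiniAvg();
        MiniMax max = new MiniMax();
        for (double v : new double[] {10, 20, 60}) { avg.record(v); max.record(v); }
        System.out.println(avg.measure()); // prints 30.0
        System.out.println(max.measure()); // prints 60.0
    }
}
```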
> > > > > > > > > >
> > > > > > > > > > SampledStats:
> > > > > > > > > > - Avg:
> > > > > > > > > >
> https://github.com/apache/kafka/blob/068ab9cefae301f3187ea885d645c425955e77d2/clients/src/main/java/org/apache/kafka/common/metrics/stats/Avg.java
> > > > > > > > > > - Max:
> > > > > > > > > >
> https://github.com/apache/kafka/blob/068ab9cefae301f3187ea885d645c425955e77d2/clients/src/main/java/org/apache/kafka/common/metrics/stats/Max.java
> > > > > > > > > > - Min:
> > > > > > > > > >
> https://github.com/apache/kafka/blob/068ab9cefae301f3187ea885d645c425955e77d2/clients/src/main/java/org/apache/kafka/common/metrics/stats/Min.java
> > > > > > > > > >
> > > > > > > > > > `stat#record()` are implemented by `update` method in
> > > > > SampledStat:
> > > > > > > > > >
> > > > > > > > > > ```Max.java
> > > > > > > > > >     @Override
> > > > > > > > > >     protected void update(Sample sample, MetricConfig
> > config,
> > > > > > double
> > > > > > > > > value,
> > > > > > > > > > long now) {
> > > > > > > > > >         sample.value = Math.max(sample.value, value);
> > > > > > > > > >     }
> > > > > > > > > > ```
> > > > > > > > > >
> > > > > > > > > > ```Avg.java
> > > > > > > > > >     @Override
> > > > > > > > > >     protected void update(Sample sample, MetricConfig
> > config,
> > > > > > double
> > > > > > > > > value,
> > > > > > > > > > long now) {
> > > > > > > > > >         sample.value += value;
> > > > > > > > > >     }
> > > > > > > > > > ```
> > > > > > > > > >
> > > > > > > > > > As far as I understand, most of the work of the stats
> > happens
> > > > on
> > > > > > the
> > > > > > > > > > `combine` method that is not part of the connector
> > execution
> > > > but
> > > > > > > called
> > > > > > > > > > when metrics are queried.
> > > > > > > > > >
> > > > > > > > > > I wonder whether we should consider Avg and Max for all
> > > metrics
> > > > > > > > proposed
> > > > > > > > > as
> > > > > > > > > > the impact on the execution path seems minimal, and even
> > see
> > > if
> > > > > Min
> > > > > > > is
> > > > > > > > > also
> > > > > > > > > > valuable, and use DEBUG only for more granular metrics.
> > > > > > > > > >
> > > > > > > > > > [1]
> > > https://github.com/apache/kafka/tree/trunk/jmh-benchmarks
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > 4. Just to make sure I understand correctly--does "time
> > > when
> > > > it
> > > > > > has
> > > > > > > > > been
> > > > > > > > > > > received by the Sink task" refer to the wallclock time
> > > > directly
> > > > > > > > after a
> > > > > > > > > > > call to SinkTask::put has been completed (as opposed to
> > > > > directly
> > > > > > > > before
> > > > > > > > > > > that call is made, or something else entirely)?
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > It currently means the time when the record has been received by the Sink
> > task
> > > > > > > > > > right after consumer poll and before conversions.
> > > > > > > > > > Would it be valuable to have it after put-sink-records?
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > 5. If the goal is to identify performance bottlenecks
> > > (either
> > > > > in
> > > > > > > > > production
> > > > > > > > > > > or pre-production environments), would it make sense to
> > > > > introduce
> > > > > > > > > metrics
> > > > > > > > > > > for each individual converter (i.e., key/value/header)
> > and
> > > > > > > > > transformation?
> > > > > > > > > > > It's definitely an improvement to be able to identify
> the
> > > > total
> > > > > > > time
> > > > > > > > > for
> > > > > > > > > > > conversion and transformation, but then the immediate
> > > > follow-up
> > > > > > > > > question if
> > > > > > > > > > > a bottleneck is found in that phase is "which
> > > > > > > > converter/transformation
> > > > > > > > > is
> > > > > > > > > > > responsible?" It'd be nice if we could provide a way to
> > > > quickly
> > > > > > > > answer
> > > > > > > > > that
> > > > > > > > > > > question.
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > This is a great idea. I'd like to consider this as well,
> > > though
> > > > > > maybe
> > > > > > > > > these
> > > > > > > > > > more granular
> > > > > > > > > > metrics would be good to have them as DEBUG.
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > 6. Any thoughts about offering latency metrics for source
> > > > > > > > > > > tasks between receipt of the record from the task and
> > > > > > > > > > > delivery of the record to Kafka (which would be tracked by
> > > > > > > > > > > producer callback)? We could also use the record timestamp
> > > > > > > > > > > either instead of or in addition to receipt time if the task
> > > > > > > > > > > provides a timestamp with its records.
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > With the source transform and convert metrics we get part of
> > > > > > > > > > that latency.
> > > > > > > > > > Looking at the Producer metrics, `request-latency` (though a
> > > > > > > > > > very generic metric) sort of answers the time between the send
> > > > > > > > > > request and the ack — if my understanding is correct.
> > > > > > > > > > Would these be enough, or are you thinking of another approach?
> > > > > > > > > > Maybe a custom metric to cover the producer side?
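[Editor's note: the "wallclock minus record timestamp on the producer send
callback" idea above can be sketched roughly as follows. This is illustrative
only, not the KIP's implementation; `SendCompletion` is a hypothetical stand-in
for the producer `Callback` interface, and `latencySensor` for a metrics
sensor, so the sketch stays self-contained.]

```java
import java.util.function.LongConsumer;

public class AfterSendLatency {

    // Hypothetical stand-in for org.apache.kafka.clients.producer.Callback,
    // so this sketch compiles without the Kafka client on the classpath.
    interface SendCompletion {
        void onCompletion(long wallclockMs);
    }

    // Builds a completion that records (wallclock - record timestamp) on a
    // sensor-like consumer once the producer acks the record.
    static SendCompletion afterSendLatencyTracker(long recordTimestampMs,
                                                  LongConsumer latencySensor) {
        return wallclockMs -> latencySensor.accept(wallclockMs - recordTimestampMs);
    }
}
```

In a real worker the completion would be invoked from the producer's ack path,
so the recorded value covers the leg from record creation to broker ack.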
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > 7. We may end up introducing a way for sink tasks to record
> > > > > > > > > > > per-record delivery to the sink system (see KIP-767 [1]).
> > > > > > > > > > > I'd like it if we could keep the names of our metrics very
> > > > > > > > > > > precise in order to avoid confusing users (who may think
> > > > > > > > > > > that we're providing metrics on actual delivery to the sink
> > > > > > > > > > > system, which may not be the case if the connector performs
> > > > > > > > > > > asynchronous writes), and in order to leave room for metrics
> > > > > > > > > > > on true delivery time by
> > > > > > > > > > > sink tasks. It'd also be nice if we could remain consistent
> > > > > > > > > > > with existing metrics such as "put-batch-avg-time-ms". With
> > > > > > > > > > > that in mind, what do you think about renaming these metrics:
> > > > > > > > > > > - "sink-record-batch-latency-max-ms" to
> > > > > > > > > > > "put-batch-avg-latency-ms"
> > > > > > > > > > > - "sink-record-latency-max-ms" to
> > > > > > > > > > > "put-sink-record-latency-max-ms"
> > > > > > > > > > > - "sink-record-latency-avg-ms" to
> > > > > > > > > > > "put-sink-record-latency-avg-ms"
> > > > > > > > > > > - "sink-record-convert-transform-time-max-ms" to
> > > > > > > > > > > "convert-transform-sink-record-time-max-ms"
> > > > > > > > > > > - "sink-record-convert-transform-time-avg-ms" to
> > > > > > > > > > > "convert-transform-sink-record-time-avg-ms"
> > > > > > > > > > > - "source-record-transform-convert-time-max-ms" to
> > > > > > > > > > > "transform-convert-source-record-time-max-ms"
> > > > > > > > > > > - "source-record-transform-convert-time-avg-ms" to
> > > > > > > > > > > "transform-convert-source-record-time-avg-ms"
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Makes sense, thanks! I have updated the list of metrics,
> > > > > > > > > > grouped them by sensor, and applied these suggestions.
> > > > > > > > > > The only ones that I want to review are the first three
> > > > > > > > > > (sink-record-* to put-batch-*). I'm not sure that
> > > > > > > > > > put-batch/put-sink-record describes the purpose of the metric —
> > > > > > > > > > nor does `sink-record-latency`, to be honest.
> > > > > > > > > > My initial thought was to have something like Kafka Streams
> > > > > > > > > > e2e-latency.
> > > > > > > > > > Based on questions 4. and 6., an idea could be to add:
> > > > > > > > > > - source-batch-e2e-latency-before-send: measure wallclock -
> > > > > > > > > >   source record timestamp after source connector poll.
> > > > > > > > > > - source-batch-e2e-latency-after-send: measure wallclock -
> > > > > > > > > >   record timestamp on the producer send callback.
> > > > > > > > > > - sink-batch-e2e-latency-before-put: measure wallclock -
> > > > > > > > > >   record timestamp after consumer poll.
> > > > > > > > > > - sink-batch-e2e-latency-after-put: measure wallclock -
> > > > > > > > > >   record timestamp after sink connector put.
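[Editor's note: all four measurements above share one shape — wallclock minus a
record timestamp, taken at a given point in the task loop. A minimal sketch of
the batch-level variant, assuming (as discussed elsewhere in the thread) that
only the oldest record per batch needs to be recorded to obtain a max, might
look like this. `BatchE2eLatency` and `maxLatencyMs` are hypothetical names,
not from the KIP.]

```java
import java.util.List;

public class BatchE2eLatency {

    // Max e2e latency for a non-empty batch: the oldest record dominates, so
    // a "max" metric needs only one Sensor-style recording per batch rather
    // than one per record.
    static long maxLatencyMs(List<Long> recordTimestampsMs, long wallclockMs) {
        long oldest = Long.MAX_VALUE;
        for (long ts : recordTimestampsMs) {
            oldest = Math.min(oldest, ts);
        }
        return wallclockMs - oldest;
    }
}
```

An "avg" variant, by contrast, would need a recording per record, which is why
the thread leans toward max-only for per-batch e2e metrics.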
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > Thanks again for the KIP! Looking forward to your
> > thoughts.
> > > > > > > > > > >
> > > > > > > > > > > Cheers,
> > > > > > > > > > >
> > > > > > > > > > > Chris
> > > > > > > > > > >
> > > > > > > > > > > [1] -
> > > > > > > > > > >
> > > > > > > > > > >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-767%3A+Connect+Latency+Metrics
> > > > > > > > > > >
> > > > > > > > > > > On Thu, Sep 15, 2022 at 1:32 PM Jorge Esteban Quilcate
> > > Otoya
> > > > <
> > > > > > > > > > > quilcate.jorge@gmail.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Hi everyone,
> > > > > > > > > > > >
> > > > > > > > > > > > I've made a slight addition to the KIP based on Yash
> > > > > feedback:
> > > > > > > > > > > >
> > > > > > > > > > > > - A new metric is added at INFO level to record the
> max
> > > > > latency
> > > > > > > > from
> > > > > > > > > the
> > > > > > > > > > > > batch timestamp, by keeping the oldest record
> timestamp
> > > per
> > > > > > > batch.
> > > > > > > > > > > > - A draft implementation is linked.
> > > > > > > > > > > >
> > > > > > > > > > > > Looking forward to your feedback.
> > > > > > > > > > > > Also, a kindly reminder that the vote thread is open.
> > > > > > > > > > > >
> > > > > > > > > > > > Thanks!
> > > > > > > > > > > > Jorge.
> > > > > > > > > > > >
> > > > > > > > > > > > On Thu, 8 Sept 2022 at 14:25, Jorge Esteban Quilcate
> > > Otoya
> > > > <
> > > > > > > > > > > > quilcate.jorge@gmail.com> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > Great. I have updated the KIP to reflect this.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Cheers,
> > > > > > > > > > > > > Jorge.
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Thu, 8 Sept 2022 at 12:26, Yash Mayya <
> > > > > > yash.mayya@gmail.com
> > > > > > > >
> > > > > > > > > wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > >> Thanks, I think it makes sense to define these
> > metrics
> > > > at
> > > > > a
> > > > > > > > DEBUG
> > > > > > > > > > > > >> recording
> > > > > > > > > > > > >> level.
> > > > > > > > > > > > >>
> > > > > > > > > > > > >> On Thu, Sep 8, 2022 at 2:51 PM Jorge Esteban
> > Quilcate
> > > > > Otoya
> > > > > > <
> > > > > > > > > > > > >> quilcate.jorge@gmail.com> wrote:
> > > > > > > > > > > > >>
> > > > > > > > > > > > >> > On Thu, 8 Sept 2022 at 05:55, Yash Mayya <
> > > > > > > > yash.mayya@gmail.com>
> > > > > > > > > > > > wrote:
> > > > > > > > > > > > >> >
> > > > > > > > > > > > >> > > Hi Jorge,
> > > > > > > > > > > > >> > >
> > > > > > > > > > > > >> > > Thanks for the changes. With regard to having per
> > > > > > > > > > > > >> > > batch vs per record metrics, the additional overhead
> > > > > > > > > > > > >> > > I was referring to wasn't about whether or not we
> > > > > > > > > > > > >> > > would need to iterate over all the records in a
> > > > > > > > > > > > >> > > batch. I was referring to the potential additional
> > > > > > > > > > > > >> > > overhead caused by the higher volume of calls to
> > > > > > > > > > > > >> > > Sensor::record on the sensors for the new metrics
> > > > > > > > > > > > >> > > (as compared to the existing batch only metrics),
> > > > > > > > > > > > >> > > especially for high throughput connectors where
> > > > > > > > > > > > >> > > batch sizes could be large. I guess we may want to
> > > > > > > > > > > > >> > > do some sort of performance testing and get concrete
> > > > > > > > > > > > >> > > numbers to verify whether this is a valid concern or
> > > > > > > > > > > > >> > > not?
> > > > > > > > > > > > >> > >
> > > > > > > > > > > > >> >
> > > > > > > > > > > > >> > 6.1. Got it, thanks for clarifying. I guess there
> > > > > > > > > > > > >> > could be a benchmark test of `Sensor::record` to get
> > > > > > > > > > > > >> > an idea of the performance impact.
> > > > > > > > > > > > >> > Regardless, the fact that these are single-record
> > > > > > > > > > > > >> > metrics, compared to the existing batch-only ones,
> > > > > > > > > > > > >> > could be made explicit by setting these metrics at a
> > > > > > > > > > > > >> > DEBUG or TRACE metric recording level, leaving the
> > > > > > > > > > > > >> > existing ones at INFO level.
> > > > > > > > > > > > >> > wdyt?
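[Editor's note: a rough harness for the `Sensor::record` cost question might
look like the following — purely illustrative, with a trivial stand-in for the
real `Sensor`, so it only ballparks a lower bound. A rigorous benchmark should
use JMH, as suggested earlier in the thread; `SensorRecordBench` and its
methods are hypothetical names.]

```java
public class SensorRecordBench {

    private static double max = Double.NEGATIVE_INFINITY;

    // Trivial stand-in for Sensor::record; a real Sensor also updates every
    // registered stat under synchronization, so this underestimates the cost.
    static void record(double value) {
        if (value > max) {
            max = value;
        }
    }

    static double currentMax() {
        return max;
    }

    // Rough nanos-per-call estimate over `calls` invocations. No warmup or
    // dead-code-elimination guards here — JMH provides those.
    static double nanosPerCall(int calls) {
        long start = System.nanoTime();
        for (int i = 0; i < calls; i++) {
            record(i);
        }
        return (System.nanoTime() - start) / (double) calls;
    }
}
```

Multiplying the per-call cost by records-per-second gives a first estimate of
whether per-record recording is visible at a connector's throughput.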
> > > > > > > > > > > > >> >
> > > > > > > > > > > > >> >
> > > > > > > > > > > > >> > >
> > > > > > > > > > > > >> > > Thanks,
> > > > > > > > > > > > >> > > Yash
> > > > > > > > > > > > >> > >
> > > > > > > > > > > > >> > > On Tue, Sep 6, 2022 at 4:42 PM Jorge Esteban
> > > > Quilcate
> > > > > > > Otoya
> > > > > > > > <
> > > > > > > > > > > > >> > > quilcate.jorge@gmail.com> wrote:
> > > > > > > > > > > > >> > >
> > > > > > > > > > > > >> > > > Hi Sagar and Yash,
> > > > > > > > > > > > >> > > >
> > > > > > > > > > > > >> > > > > the way it's defined in
> > > > > > > > > > > > >> > > >
> > > > > > > > https://kafka.apache.org/documentation/#connect_monitoring
> > > > > > > > > for
> > > > > > > > > > > > the
> > > > > > > > > > > > >> > > metrics
> > > > > > > > > > > > >> > > >
> > > > > > > > > > > > >> > > > 4.1. Got it. Add it to the KIP.
> > > > > > > > > > > > >> > > >
> > > > > > > > > > > > >> > > > > The only thing I would argue is do we need
> > > > > > > > > > > > >> sink-record-latency-min?
> > > > > > > > > > > > >> > > Maybe
> > > > > > > > > > > > >> > > > we
> > > > > > > > > > > > >> > > > > could remove this min metric as well and
> > make
> > > > all
> > > > > of
> > > > > > > the
> > > > > > > > > 3 e2e
> > > > > > > > > > > > >> > metrics
> > > > > > > > > > > > >> > > > > consistent
> > > > > > > > > > > > >> > > >
> > > > > > > > > > > > >> > > > 4.2 I see. Will remove it from the KIP.
> > > > > > > > > > > > >> > > >
> > > > > > > > > > > > >> > > > > Probably users can track the metrics at
> > their
> > > > end
> > > > > to
> > > > > > > > > > > > >> > > > > figure that out. Do you think that makes
> > > sense?
> > > > > > > > > > > > >> > > >
> > > > > > > > > > > > >> > > > 4.3. Yes, agree. With these new metrics it
> > > should
> > > > be
> > > > > > > > easier
> > > > > > > > > for
> > > > > > > > > > > > >> users
> > > > > > > > > > > > >> > to
> > > > > > > > > > > > >> > > > track this.
> > > > > > > > > > > > >> > > >
> > > > > > > > > > > > >> > > > > I think it makes sense to not have a min
> > > metric
> > > > > for
> > > > > > > > > either to
> > > > > > > > > > > > >> remain
> > > > > > > > > > > > >> > > > > consistent with the existing put-batch and
> > > > > > poll-batch
> > > > > > > > > metrics
> > > > > > > > > > > > >> > > >
> > > > > > > > > > > > >> > > > 5.1. Got it. Same as 4.2
> > > > > > > > > > > > >> > > >
> > > > > > > > > > > > >> > > > > Another naming related suggestion I had
> was
> > > with
> > > > > the
> > > > > > > > > > > > >> > > > > "convert-time" metrics - we should
> probably
> > > > > include
> > > > > > > > > > > > >> transformations
> > > > > > > > > > > > >> > in
> > > > > > > > > > > > >> > > > the
> > > > > > > > > > > > >> > > > > name since SMTs could definitely be
> > > attributable
> > > > > to
> > > > > > a
> > > > > > > > > sizable
> > > > > > > > > > > > >> chunk
> > > > > > > > > > > > >> > of
> > > > > > > > > > > > >> > > > the
> > > > > > > > > > > > >> > > > > latency depending on the specific
> > > transformation
> > > > > > > chain.
> > > > > > > > > > > > >> > > >
> > > > > > > > > > > > >> > > > 5.2. Make sense. I'm proposing to add
> > > > > > > > > > > > >> > `sink-record-convert-transform...`
> > > > > > > > > > > > >> > > > and `source-record-transform-convert...` to
> > > > > represent
> > > > > > > > > correctly
> > > > > > > > > > > > the
> > > > > > > > > > > > >> > order
> > > > > > > > > > > > >> > > > of operations.
> > > > > > > > > > > > >> > > >
> > > > > > > > > > > > >> > > > > it seems like both source and sink tasks
> > only
> > > > > record
> > > > > > > > > metrics
> > > > > > > > > > > at
> > > > > > > > > > > > a
> > > > > > > > > > > > >> > > "batch"
> > > > > > > > > > > > >> > > > > level, not on an individual record level.
> I
> > > > think
> > > > > it
> > > > > > > > > might be
> > > > > > > > > > > > >> > > additional
> > > > > > > > > > > > >> > > > > overhead if we want to record these new
> > > metrics
> > > > > all
> > > > > > at
> > > > > > > > the
> > > > > > > > > > > > record
> > > > > > > > > > > > >> > > level?
> > > > > > > > > > > > >> > > >
> > > > > > > > > > > > >> > > > 5.3. I initially considered implementing all
> > > > > > > > > > > > >> > > > metrics at the batch level, but given how the
> > > > > > > > > > > > >> > > > framework processes records, I fell back to the
> > > > > > > > > > > > >> > > > proposed approach:
> > > > > > > > > > > > >> > > > - Sink Task:
> > > > > > > > > > > > >> > > >   - `WorkerSinkTask#convertMessages(msgs)` already
> > > > > > > > > > > > >> > > > iterates over records, so there is no additional
> > > > > > > > > > > > >> > > > overhead to capture record latency per record.
> > > > > > > > > > > > >> > > >     -
> > > > > > > > > > > > >> > > >
> > > > > > > > > > > > >> > > >
> https://github.com/apache/kafka/blob/9841647c4fe422532f448423c92d26e4fdcb8932/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/WorkerSinkTask.java#L490-L514
> > > > > > > > > > > > >> > > >   - `WorkerSinkTask#convertAndTransformRecord(record)`
> > > > > > > > > > > > >> > > > actually happens individually. Measuring this
> > > > > > > > > > > > >> > > > operation per batch would include processing that
> > > > > > > > > > > > >> > > > is not strictly part of "convert and transform"
> > > > > > > > > > > > >> > > >     -
> > > > > > > > > > > > >> > > >
> > > > > > > > > > > > >> > > >
> https://github.com/apache/kafka/blob/9841647c4fe422532f448423c92d26e4fdcb8932/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/WorkerSinkTask.java#L518
> > > > > > > > > > > > >> > > > - Source Task:
> > > > > > > > > > > > >> > > >   - `AbstractWorkerSourceTask#sendRecords` iterates
> > > > > > > > > > > > >> > > > over a batch and applies transforms and converts
> > > > > > > > > > > > >> > > > records individually as well:
> > > > > > > > > > > > >> > > >     -
> > > > > > > > > > > > >> > > >
> > > > > > > > > > > > >> > > >
> https://github.com/apache/kafka/blob/9841647c4fe422532f448423c92d26e4fdcb8932/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/AbstractWorkerSourceTask.java#L389-L390
> > > > > > > > > > > > >> > > >
> > > > > > > > > > > > >> > > > > This might require some additional
> changes -
> > > > > > > > > > > > >> > > > > for instance, with the
> "sink-record-latency"
> > > > > metric,
> > > > > > > we
> > > > > > > > > might
> > > > > > > > > > > > only
> > > > > > > > > > > > >> > want
> > > > > > > > > > > > >> > > > to
> > > > > > > > > > > > >> > > > > have a "max" metric since "avg" would
> > require
> > > > > > > recording
> > > > > > > > a
> > > > > > > > > > > value
> > > > > > > > > > > > on
> > > > > > > > > > > > >> > the
> > > > > > > > > > > > >> > > > > sensor for each record (whereas we can
> get a
> > > > "max"
> > > > > > by
> > > > > > > > only
> > > > > > > > > > > > >> recording
> > > > > > > > > > > > >> > a
> > > > > > > > > > > > >> > > > > metric value for the oldest record in each
> > > > batch).
> > > > > > > > > > > > >> > > >
> > > > > > > > > > > > >> > > > 5.4. Recording record-latency per batch may not be
> > > > > > > > > > > > >> > > > as useful, as there is no guarantee that the oldest
> > > > > > > > > > > > >> > > > record will be representative of the batch.
> > > > > > > > > > > > >> > > >
> > > > > > > > > > > > >> > > > On Sat, 3 Sept 2022 at 16:02, Yash Mayya <
> > > > > > > > > yash.mayya@gmail.com>
> > > > > > > > > > > > >> wrote:
> > > > > > > > > > > > >> > > >
> > > > > > > > > > > > >> > > > > Hi Jorge and Sagar,
> > > > > > > > > > > > >> > > > >
> > > > > > > > > > > > >> > > > > I think it makes sense to not have a min
> > > metric
> > > > > for
> > > > > > > > > either to
> > > > > > > > > > > > >> remain
> > > > > > > > > > > > >> > > > > consistent with the existing put-batch and
> > > > > > poll-batch
> > > > > > > > > metrics
> > > > > > > > > > > > (it
> > > > > > > > > > > > >> > > doesn't
> > > > > > > > > > > > >> > > > > seem particularly useful either anyway).
> > Also,
> > > > the
> > > > > > new
> > > > > > > > > > > > >> > > > > "sink-record-latency" metric name looks
> fine
> > > to
> > > > > me,
> > > > > > > > > thanks for
> > > > > > > > > > > > >> making
> > > > > > > > > > > > >> > > the
> > > > > > > > > > > > >> > > > > changes! Another naming related
> suggestion I
> > > had
> > > > > was
> > > > > > > > with
> > > > > > > > > the
> > > > > > > > > > > > >> > > > > "convert-time" metrics - we should
> probably
> > > > > include
> > > > > > > > > > > > >> transformations
> > > > > > > > > > > > >> > in
> > > > > > > > > > > > >> > > > the
> > > > > > > > > > > > >> > > > > name since SMTs could definitely be
> > > attributable
> > > > > to
> > > > > > a
> > > > > > > > > sizable
> > > > > > > > > > > > >> chunk
> > > > > > > > > > > > >> > of
> > > > > > > > > > > > >> > > > the
> > > > > > > > > > > > >> > > > > latency depending on the specific
> > > transformation
> > > > > > > chain.
> > > > > > > > > > > > >> > > > >
> > > > > > > > > > > > >> > > > > I have one high level question with
> respect
> > to
> > > > > > > > > implementation
> > > > > > > > > > > -
> > > > > > > > > > > > >> > > > currently,
> > > > > > > > > > > > >> > > > > it seems like both source and sink tasks
> > only
> > > > > record
> > > > > > > > > metrics
> > > > > > > > > > > at
> > > > > > > > > > > > a
> > > > > > > > > > > > >> > > "batch"
> > > > > > > > > > > > >> > > > > level, not on an individual record level.
> I
> > > > think
> > > > > it
> > > > > > > > > might be
> > > > > > > > > > > > >> > > additional
> > > > > > > > > > > > >> > > > > overhead if we want to record these new
> > > metrics
> > > > > all
> > > > > > at
> > > > > > > > the
> > > > > > > > > > > > record
> > > > > > > > > > > > >> > > level?
> > > > > > > > > > > > >> > > > > Could we instead make all of these new
> > metrics
> > > > for
> > > > > > > > > batches of
> > > > > > > > > > > > >> records
> > > > > > > > > > > > >> > > > > rather than individual records in order to
> > > > remain
> > > > > > > > > consistent
> > > > > > > > > > > > with
> > > > > > > > > > > > >> the
> > > > > > > > > > > > >> > > > > existing task level metrics? This might
> > > require
> > > > > some
> > > > > > > > > > > additional
> > > > > > > > > > > > >> > > changes -
> > > > > > > > > > > > >> > > > > for instance, with the
> "sink-record-latency"
> > > > > metric,
> > > > > > > we
> > > > > > > > > might
> > > > > > > > > > > > only
> > > > > > > > > > > > >> > want
> > > > > > > > > > > > >> > > > to
> > > > > > > > > > > > >> > > > > have a "max" metric since "avg" would
> > require
> > > > > > > recording
> > > > > > > > a
> > > > > > > > > > > value
> > > > > > > > > > > > on
> > > > > > > > > > > > >> > the
> > > > > > > > > > > > >> > > > > sensor for each record (whereas we can
> get a
> > > > "max"
> > > > > > by
> > > > > > > > only
> > > > > > > > > > > > >> recording
> > > > > > > > > > > > >> > a
> > > > > > > > > > > > >> > > > > metric value for the oldest record in each
> > > > batch).
> > > > > > > > > > > > >> > > > >
> > > > > > > > > > > > >> > > > > Thanks,
> > > > > > > > > > > > >> > > > > Yash
> > > > > > > > > > > > >> > > > >
> > > > > > > > > > > > >> > > > > On Fri, Sep 2, 2022 at 3:16 PM Sagar <
> > > > > > > > > > > sagarmeansocean@gmail.com
> > > > > > > > > > > > >
> > > > > > > > > > > > >> > > wrote:
> > > > > > > > > > > > >> > > > >
> > > > > > > > > > > > >> > > > > > Hi Jorge,
> > > > > > > > > > > > >> > > > > >
> > > > > > > > > > > > >> > > > > > Thanks for the changes.
> > > > > > > > > > > > >> > > > > >
> > > > > > > > > > > > >> > > > > > Regarding the metrics, I meant something
> > > like
> > > > > > this:
> > > > > > > > > > > > >> > > > > >
> kafka.connect:type=sink-task-metrics,connector="{connector}",task="{task}"
> > > > > > > > > > > > >> > > > > >
> > > > > > > > > > > > >> > > > > > the way it's defined in
> > > > > > > > > > > > >> > > > > >
> > > > > > > > > https://kafka.apache.org/documentation/#connect_monitoring
> > > > > > > > > > > > for
> > > > > > > > > > > > >> the
> > > > > > > > > > > > >> > > > > > metrics.
> > > > > > > > > > > > >> > > > > >
> > > > > > > > > > > > >> > > > > > I see what you mean by the 3 metrics and
> > how
> > > > it
> > > > > > can
> > > > > > > be
> > > > > > > > > > > > >> interpreted.
> > > > > > > > > > > > >> > > The
> > > > > > > > > > > > >> > > > > > only thing I would argue is do we need
> > > > > > > > > > > > sink-record-latency-min?
> > > > > > > > > > > > >> > Maybe
> > > > > > > > > > > > >> > > > we
> > > > > > > > > > > > >> > > > > > could remove this min metric as well and
> > > make
> > > > > all
> > > > > > of
> > > > > > > > > the 3
> > > > > > > > > > > e2e
> > > > > > > > > > > > >> > > metrics
> > > > > > > > > > > > >> > > > > > consistent(since put-batch also doesn't
> > > > expose a
> > > > > > min
> > > > > > > > > which
> > > > > > > > > > > > makes
> > > > > > > > > > > > >> > > sense
> > > > > > > > > > > > >> > > > to
> > > > > > > > > > > > >> > > > > > me). I think this is in contrast to what
> > > Yash
> > > > > > > pointed
> > > > > > > > > out
> > > > > > > > > > > > above
> > > > > > > > > > > > >> so
> > > > > > > > > > > > >> > I
> > > > > > > > > > > > >> > > > > would
> > > > > > > > > > > > >> > > > > > like to hear his thoughts as well.
> > > > > > > > > > > > >> > > > > >
> > > > > > > > > > > > >> > > > > > The other point Yash mentioned about the
> > > > > slightly
> > > > > > > > flawed
> > > > > > > > > > > > >> definition
> > > > > > > > > > > > >> > > of
> > > > > > > > > > > > >> > > > > e2e
> > > > > > > > > > > > >> > > > > > is also true in a sense. But I have a
> > > feeling
> > > > > > that's
> > > > > > > > > once the
> > > > > > > > > > > > >> > records
> > > > > > > > > > > > >> > > > are
> > > > > > > > > > > > >> > > > > > polled by the connector tasks, it would
> be
> > > > > > difficult
> > > > > > > > to
> > > > > > > > > > > track
> > > > > > > > > > > > >> the
> > > > > > > > > > > > >> > > final
> > > > > > > > > > > > >> > > > > leg
> > > > > > > > > > > > >> > > > > > via the framework. Probably users can
> > track
> > > > the
> > > > > > > > metrics
> > > > > > > > > at
> > > > > > > > > > > > their
> > > > > > > > > > > > >> > end
> > > > > > > > > > > > >> > > to
> > > > > > > > > > > > >> > > > > > figure that out. Do you think that makes
> > > > sense?
> > > > > > > > > > > > >> > > > > >
> > > > > > > > > > > > >> > > > > > Thanks!
> > > > > > > > > > > > >> > > > > > Sagar.
> > > > > > > > > > > > >> > > > > >
> > > > > > > > > > > > >> > > > > >
> > > > > > > > > > > > >> > > > > >
> > > > > > > > > > > > >> > > > > >
> > > > > > > > > > > > >> > > > > > On Thu, Sep 1, 2022 at 11:40 PM Jorge
> > > Esteban
> > > > > > > Quilcate
> > > > > > > > > > > Otoya <
> > > > > > > > > > > > >> > > > > > quilcate.jorge@gmail.com> wrote:
> > > > > > > > > > > > >> > > > > >
> > > > > > > > > > > > >> > > > > > > Hi Sagar and Yash,
> > > > > > > > > > > > >> > > > > > >
> > > > > > > > > > > > >> > > > > > > Thanks for your feedback!
> > > > > > > > > > > > >> > > > > > >
> > > > > > > > > > > > >> > > > > > > > 1) I am assuming the new metrics
> would
> > > be
> > > > > task
> > > > > > > > level
> > > > > > > > > > > > metric.
> > > > > > > > > > > > >> > > > > > >
> > > > > > > > > > > > >> > > > > > > 1.1 Yes, it will be a task level
> metric,
> > > > > > > implemented
> > > > > > > > > on
> > > > > > > > > > > the
> > > > > > > > > > > > >> > > > > > > Worker[Source/Sink]Task.
> > > > > > > > > > > > >> > > > > > >
> > > > > > > > > > > > >> > > > > > > > Could you specify the way it's done
> > for
> > > > > other
> > > > > > > > > > > sink/source
> > > > > > > > > > > > >> > > > connector?
> > > > > > > > > > > > >> > > > > > >
> > > > > > > > > > > > >> > > > > > > 1.2. Not sure what do you mean by
> this.
> > > > Could
> > > > > > you
> > > > > > > > > > > elaborate
> > > > > > > > > > > > a
> > > > > > > > > > > > >> bit
> > > > > > > > > > > > >> > > > more?
> > > > > > > > > > > > >> > > > > > >
> > > > > > > > > > > > >> > > > > > > > 2. I am slightly confused about the
> > e2e
> > > > > > latency
> > > > > > > > > > > metric...
> > > > > > > > > > > > >> > > > > > >
> > > > > > > > > > > > >> > > > > > > 2.1. Yes, I see. I was trying to
> bring a
> > > > > similar
> > > > > > > > > concept
> > > > > > > > > > > as
> > > > > > > > > > > > in
> > > > > > > > > > > > >> > > > Streams
> > > > > > > > > > > > >> > > > > > with
> > > > > > > > > > > > >> > > > > > > KIP-613, though the e2e concept may
> not
> > be
> > > > > > > > > translatable.
> > > > > > > > > > > > >> > > > > > > We could keep it as
> > `sink-record-latency`
> > > to
> > > > > > avoid
> > > > > > > > > > > > conflating
> > > > > > > > > > > > >> > > > > concepts. A
> > > > > > > > > > > > >> > > > > > > similar metric naming was proposed in
> > > > KIP-489
> > > > > > but
> > > > > > > at
> > > > > > > > > the
> > > > > > > > > > > > >> consumer
> > > > > > > > > > > > >> > > > > level —
> > > > > > > > > > > > >> > > > > > > though it seems dormant for a couple
> of
> > > > years.
> > > > > > > > > > > > >> > > > > > >
> > > > > > > > > > > > >> > > > > > > > However, the put-batch time measures
> > the
> > > > > > > > > > > > >> > > > > > > > time to put a batch of records to
> > > external
> > > > > > sink.
> > > > > > > > > So, I
> > > > > > > > > > > > would
> > > > > > > > > > > > >> > > assume
> > > > > > > > > > > > >> > > > > > the 2
> > > > > > > > > > > > >> > > > > > > > can't be added as is to compute the
> > e2e
> > > > > > latency.
> > > > > > > > > Maybe I
> > > > > > > > > > > > am
> > > > > > > > > > > > >> > > missing
> > > > > > > > > > > > >> > > > > > > > something here. Could you plz
> clarify
> > > > this.
> > > > > > > > > > > > >> > > > > > >
> > > > > > > > > > > > >> > > > > > > 2.2. Yes, agree. Not necessarily
> added,
> > > but
> > > > > with
> > > > > > > > the 3
> > > > > > > > > > > > >> latencies
> > > > > > > > > > > > >> > > > (poll,
> > > > > > > > > > > > >> > > > > > > convert, putBatch) will be clearer
> where
> > > the
> > > > > > > > > bottleneck
> > > > > > > > > > > may
> > > > > > > > > > > > >> be,
> > > > > > > > > > > > >> > and
> > > > > > > > > > > > >> > > > > > > represent the internal processing.
> > > > > > > > > > > > >> > > > > > >
> > > > > > > > > > > > >> > > > > > > > however, as per the KIP it looks
> like
> > it
> > > > > will
> > > > > > be
> > > > > > > > > > > > >> > > > > > > > the latency between when the record
> > was
> > > > > > written
> > > > > > > to
> > > > > > > > > Kafka
> > > > > > > > > > > > and
> > > > > > > > > > > > >> > when
> > > > > > > > > > > > >> > > > the
> > > > > > > > > > > > >> > > > > > > > record is returned by a sink task's
> > > > > consumer's
> > > > > > > > poll?
> > > > > > > > > > > > >> > > > > > >
> > > > > > > > > > > > >> > > > > > > 3.1. Agree. 2.1. could help to clarify
> > > this.
> > > > > > > > > > > > >> > > > > > >
> > > > > > > > > > > > >> > > > > > > > One more thing - I was wondering
> > > > > > > > > > > > >> > > > > > > > if there's a particular reason for
> > > having
> > > > a
> > > > > > min
> > > > > > > > > metric
> > > > > > > > > > > for
> > > > > > > > > > > > >> e2e
> > > > > > > > > > > > >> > > > > latency
> > > > > > > > > > > > >> > > > > > > but
> > > > > > > > > > > > >> > > > > > > > not for convert time?
> > > > > > > > > > > > >> > > > > > >
> > > > > > > > > > > > >> > > > > > > 3.2. Was following KIP-613 for e2e
> which
> > > > seems
> > > > > > > > useful
> > > > > > > > > to
> > > > > > > > > > > > >> compare
> > > > > > > > > > > > >> > > with
> > > > > > > > > > > > >> > > > > > Max a
> > > > > > > > > > > > >> > > > > > > get an idea of the window of results,
> > > though
> > > > > > > current
> > > > > > > > > > > > >> latencies in
> > > > > > > > > > > > >> > > > > > Connector
> > > > > > > > > > > > >> > > > > > > do not include Min, and that's why I
> > > haven't
> > > > > > added
> > > > > > > > it
> > > > > > > > > for
> > > > > > > > > > > > >> convert
> > > > > > > > > > > > >> > > > > > latency.
> > > > > > > > > > > > >> > > > > > > Do you think it make sense to extend
> > > latency
> > > > > > > metrics
> > > > > > > > > with
> > > > > > > > > > > > Min?
> > > > > > > > > > > > >> > > > > > >
> > > > > > > > > > > > >> > > > > > > KIP is updated to clarify some of
> these
> > > > > changes.
> > > > > > > > > > > > >> > > > > > >
> > > > > > > > > > > > >> > > > > > > Many thanks,
> > > > > > > > > > > > >> > > > > > > Jorge.
> > > > > > > > > > > > >> > > > > > >
> > > > > > > > > > > > >> > > > > > > On Thu, 1 Sept 2022 at 18:11, Yash
> > Mayya <
> > > > > > > > > > > > >> yash.mayya@gmail.com>
> > > > > > > > > > > > >> > > > wrote:
> > > > > > > > > > > > >> > > > > > >
> > > > > > > > > > > > >> > > > > > > > Hi Jorge,
> > > > > > > > > > > > >> > > > > > > >
> > > > > > > > > > > > >> > > > > > > > Thanks for the KIP! I have the same
> > > > > confusion
> > > > > > > with
> > > > > > > > > the
> > > > > > > > > > > > >> > > e2e-latency
> > > > > > > > > > > > >> > > > > > > metrics
> > > > > > > > > > > > >> > > > > > > > as Sagar above. "e2e" would seem to
> > > > indicate
> > > > > > the
> > > > > > > > > latency
> > > > > > > > > > > > >> > between
> > > > > > > > > > > > >> > > > when
> > > > > > > > > > > > >> > > > > > the
> > > > > > > > > > > > >> > > > > > > > record was written to Kafka and when
> > the
> > > > > > record
> > > > > > > > was
> > > > > > > > > > > > written
> > > > > > > > > > > > >> to
> > > > > > > > > > > > >> > > the
> > > > > > > > > > > > >> > > > > sink
> > > > > > > > > > > > >> > > > > > > > system by the connector - however,
> as
> > > per
> > > > > the
> > > > > > > KIP
> > > > > > > > it
> > > > > > > > > > > looks
> > > > > > > > > > > > >> like
> > > > > > > > > > > > >> > > it
> > > > > > > > > > > > >> > > > > will
> > > > > > > > > > > > >> > > > > > > be
> > > > > > > > > > > > >> > > > > > > > the latency between when the record
> > was
> > > > > > written
> > > > > > > to
> > > > > > > > > Kafka
> > > > > > > > > > > > and
> > > > > > > > > > > > >> > when
> > > > > > > > > > > > >> > > > the
> > > > > > > > > > > > >> > > > > > > > record is returned by a sink task's
> > > > > consumer's
> > > > > > > > > poll? I
> > > > > > > > > > > > think
> > > > > > > > > > > > >> > that
> > > > > > > > > > > > >> > > > > > metric
> > > > > > > > > > > > >> > > > > > > > will be a little confusing to
> > interpret.
> > > > One
> > > > > > > more
> > > > > > > > > thing
> > > > > > > > > > > -
> > > > > > > > > > > > I
> > > > > > > > > > > > >> was
> > > > > > > > > > > > >> > > > > > wondering
> > > > > > > > > > > > >> > > > > > > > if there's a particular reason for
> > > having
> > > > a
> > > > > > min
> > > > > > > > > metric
> > > > > > > > > > > for
> > > > > > > > > > > > >> e2e
> > > > > > > > > > > > >> > > > > latency
> > > > > > > > > > > > >> > > > > > > but
> > > > > > > > > > > > >> > > > > > > > not for convert time?
> > > > > > > > > > > > >> > > > > > > >
> > > > > > > > > > > > >> > > > > > > > Thanks,
> > > > > > > > > > > > >> > > > > > > > Yash
> > > > > > > > > > > > >> > > > > > > >
> > > > > > > > > > > > >> > > > > > > > On Thu, Sep 1, 2022 at 8:59 PM
> Sagar <
> > > > > > > > > > > > >> > sagarmeansocean@gmail.com>
> > > > > > > > > > > > >> > > > > > wrote:
> > > > > > > > > > > > >> > > > > > > >
> > > > > > > > > > > > >> > > > > > > > > Hi Jorge,
> > > > > > > > > > > > >> > > > > > > > >
> > > > > > > > > > > > >> > > > > > > > > Thanks for the KIP. It looks like
> a
> > > very
> > > > > > good
> > > > > > > > > > > addition.
> > > > > > > > > > > > I
> > > > > > > > > > > > >> > > skimmed
> > > > > > > > > > > > >> > > > > > > through
> > > > > > > > > > > > >> > > > > > > > > once and had a couple of questions
> > =>
> > > > > > > > > > > > >> > > > > > > > >
> > > > > > > > > > > > >> > > > > > > > > 1) I am assuming the new metrics
> > would
> > > > be
> > > > > > task
> > > > > > > > > level
> > > > > > > > > > > > >> metric.
> > > > > > > > > > > > >> > > > Could
> > > > > > > > > > > > >> > > > > > you
> > > > > > > > > > > > >> > > > > > > > > specify the way it's done for
> other
> > > > > > > sink/source
> > > > > > > > > > > > connector?
> > > > > > > > > > > > >> > > > > > > > > 2) I am slightly confused about
> the
> > > e2e
> > > > > > > latency
> > > > > > > > > > > metric.
> > > > > > > > > > > > >> Let's
> > > > > > > > > > > > >> > > > > > consider
> > > > > > > > > > > > >> > > > > > > > the
> > > > > > > > > > > > >> > > > > > > > > sink connector metric. If I look
> at
> > > the
> > > > > way
> > > > > > > it's
> > > > > > > > > > > > supposed
> > > > > > > > > > > > >> to
> > > > > > > > > > > > >> > be
> > > > > > > > > > > > >> > > > > > > > calculated,
> > > > > > > > > > > > >> > > > > > > > > i.e the difference between the
> > record
> > > > > > > timestamp
> > > > > > > > > and
> > > > > > > > > > > the
> > > > > > > > > > > > >> wall
> > > > > > > > > > > > >> > > > clock
> > > > > > > > > > > > >> > > > > > > time,
> > > > > > > > > > > > >> > > > > > > > it
> > > > > > > > > > > > >> > > > > > > > > looks like a per record metric.
> > > However,
> > > > > the
> > > > > > > > > put-batch
> > > > > > > > > > > > >> time
> > > > > > > > > > > > >> > > > > measures
> > > > > > > > > > > > >> > > > > > > the
> > > > > > > > > > > > >> > > > > > > > > time to put a batch of records to
> > > > external
> > > > > > > sink.
> > > > > > > > > So, I
> > > > > > > > > > > > >> would
> > > > > > > > > > > > >> > > > assume
> > > > > > > > > > > > >> > > > > > > the 2
> > > > > > > > > > > > >> > > > > > > > > can't be added as is to compute
> the
> > > e2e
> > > > > > > latency.
> > > > > > > > > > > Maybe I
> > > > > > > > > > > > >> am
> > > > > > > > > > > > >> > > > missing
> > > > > > > > > > > > >> > > > > > > > > something here. Could you please
> > clarify
> > > > > this.
> > > > > > > > > > > > >> > > > > > > > >
> > > > > > > > > > > > >> > > > > > > > > Thanks!
> > > > > > > > > > > > >> > > > > > > > > Sagar.
> > > > > > > > > > > > >> > > > > > > > >
> > > > > > > > > > > > >> > > > > > > > > On Tue, Aug 30, 2022 at 8:43 PM
> > Jorge
> > > > > > Esteban
> > > > > > > > > Quilcate
> > > > > > > > > > > > >> Otoya
> > > > > > > > > > > > >> > <
> > > > > > > > > > > > >> > > > > > > > > quilcate.jorge@gmail.com> wrote:
> > > > > > > > > > > > >> > > > > > > > >
> > > > > > > > > > > > >> > > > > > > > > > Hi all,
> > > > > > > > > > > > >> > > > > > > > > >
> > > > > > > > > > > > >> > > > > > > > > > I'd like to start a discussion
> > > thread
> > > > on
> > > > > > > > > KIP-864:
> > > > > > > > > > > Add
> > > > > > > > > > > > >> > > > End-To-End
> > > > > > > > > > > > >> > > > > > > > Latency
> > > > > > > > > > > > >> > > > > > > > > > Metrics to Connectors.
> > > > > > > > > > > > >> > > > > > > > > > This KIP aims to improve the
> > metrics
> > > > > > > available
> > > > > > > > > on
> > > > > > > > > > > > Source
> > > > > > > > > > > > >> > and
> > > > > > > > > > > > >> > > > Sink
> > > > > > > > > > > > >> > > > > > > > > > Connectors to measure end-to-end
> > > > > latency,
> > > > > > > > > including
> > > > > > > > > > > > >> source
> > > > > > > > > > > > >> > > and
> > > > > > > > > > > > >> > > > > sink
> > > > > > > > > > > > >> > > > > > > > > record
> > > > > > > > > > > > >> > > > > > > > > > conversion time, and sink record
> > e2e
> > > > > > latency
> > > > > > > > > > > (similar
> > > > > > > > > > > > to
> > > > > > > > > > > > >> > > > KIP-613
> > > > > > > > > > > > >> > > > > > for
> > > > > > > > > > > > >> > > > > > > > > > Streams).
> > > > > > > > > > > > >> > > > > > > > > >
> > > > > > > > > > > > >> > > > > > > > > > The KIP is here:
> > > > > > > > > > > > >> > > > > > > > > >
> > > > > > > > > > > > >> > > > > > > > > >
> > > > > > > > > > > > >> > > > > > > > >
> > > > > > > > > > > > >> > > > > > > >
> > > > > > > > > > > > >> > > > > > >
> > > > > > > > > > > > >> > > > > >
> > > > > > > > > > > > >> > > > >
> > > > > > > > > > > > >> > > >
> > > > > > > > > > > > >> > >
> > > > > > > > > > > > >> >
> > > > > > > > > > > > >>
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-864%3A+Add+End-To-End+Latency+Metrics+to+Connectors
> > > > > > > > > > > > >> > > > > > > > > >
> > > > > > > > > > > > >> > > > > > > > > > Please take a look and let me
> know
> > > > what
> > > > > > you
> > > > > > > > > think.
> > > > > > > > > > > > >> > > > > > > > > >
> > > > > > > > > > > > >> > > > > > > > > > Cheers,
> > > > > > > > > > > > >> > > > > > > > > > Jorge.
> > > > > > > > > > > > >> > > > > > > > > >
> > > > > > > > > > > > >> > > > > > > > >
> > > > > > > > > > > > >> > > > > > > >
> > > > > > > > > > > > >> > > > > > >
> > > > > > > > > > > > >> > > > > >
> > > > > > > > > > > > >> > > > >
> > > > > > > > > > > > >> > > >
> > > > > > > > > > > > >> > >
> > > > > > > > > > > > >> >
> > > > > > > > > > > > >>
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] KIP-864: Add End-To-End Latency Metrics to Connectors

Posted by Chris Egerton <ch...@aiven.io.INVALID>.
Hi Jorge,

Thanks for indulging my paranoia. LGTM!

Cheers,

Chris

On Mon, Dec 5, 2022 at 10:06 AM Jorge Esteban Quilcate Otoya <
quilcate.jorge@gmail.com> wrote:

> Sure! I have added the following to the proposed changes section:
>
> ```
> The per-record metrics will definitely be added to Kafka Connect as part of
> this KIP, but their metric level will be changed pending the performance
> testing described in KAFKA-14441, and will otherwise only be exposed at
> a lower level (DEBUG instead of INFO, and TRACE instead of DEBUG)
> ```
>
> Let me know how it looks.
>
> Many thanks!
> Jorge.
>
> On Mon, 5 Dec 2022 at 14:11, Chris Egerton <ch...@aiven.io.invalid>
> wrote:
>
> > Hi Jorge,
> >
> > Thanks for filing KAFKA-14441! In the ticket description we mention that
> > "there will be more confidence whether to design metrics to be exposed
> at a
> > DEBUG or INFO level depending on their impact" but it doesn't seem like
> > this is called out in the KIP and, just based on what's in the KIP, the
> > proposal is still to have several per-record metrics exposed at INFO
> level.
> >
> > Could we explicitly call out that the per-record metrics will definitely
> be
> > added to Kafka Connect as part of this KIP, but they will only be exposed
> > at INFO level pending the performance testing described in
> > KAFKA-14441, and will otherwise only be exposed at DEBUG level?
> Otherwise,
> > it's possible that a vote for the KIP as it's written today would be a
> vote
> > in favor of unconditionally exposing these metrics at INFO level, even if
> > the performance testing reveals issues.
> >
> > Cheers,
> >
> > Chris
> >
> > On Sun, Dec 4, 2022 at 7:08 PM Jorge Esteban Quilcate Otoya <
> > quilcate.jorge@gmail.com> wrote:
> >
> > > Thanks for the reminder Chris!
> > >
> > > I have added a note on the KIP to include this as part of the KIP as
> most
> > > of the metrics proposed are per-record and having all on DEBUG would
> > limit
> > > the benefits, and created
> > > https://issues.apache.org/jira/browse/KAFKA-14441
> > > to keep track of this task.
> > >
> > > Cheers,
> > > Jorge.
> > >
> > > On Tue, 29 Nov 2022 at 19:40, Chris Egerton <ch...@aiven.io.invalid>
> > > wrote:
> > >
> > > > Hi Jorge,
> > > >
> > > > Thanks! What were your thoughts on the possible benchmarking and/or
> > > > downgrading of per-record metrics to DEBUG?
> > > >
> > > > Cheers,
> > > >
> > > > Chris
> > > >
> > > > On Thu, Nov 24, 2022 at 8:20 AM Jorge Esteban Quilcate Otoya <
> > > > quilcate.jorge@gmail.com> wrote:
> > > >
> > > > > Thanks Chris! I have updated the KIP with "transform" instead of
> > > "alias".
> > > > > Agree it's clearer.
> > > > >
> > > > > Cheers,
> > > > > Jorge.
> > > > >
> > > > > On Mon, 21 Nov 2022 at 21:36, Chris Egerton
> <chrise@aiven.io.invalid
> > >
> > > > > wrote:
> > > > >
> > > > > > Hi Jorge,
> > > > > >
> > > > > > Thanks for the updates, and apologies for the delay. The new
> > diagram
> > > > > > directly under the "Proposed Changes" section is absolutely
> > gorgeous!
> > > > > >
> > > > > >
> > > > > > Follow-ups:
> > > > > >
> > > > > > RE 2: Good point. We can use the same level for these metrics,
> it's
> > > > not a
> > > > > > big deal.
> > > > > >
> > > > > > RE 3: As long as all the per-record metrics are kept at DEBUG
> > level,
> > > it
> > > > > > should be fine to leave JMH benchmarking for a follow-up. If we
> > want
> > > to
> > > > > add
> > > > > > new per-record, INFO-level metrics, I would be more comfortable
> > with
> > > > > > including benchmarking as part of the testing plan for the KIP.
> One
> > > > > > possible compromise could be to propose that these features be
> > merged
> > > > at
> > > > > > DEBUG level, and then possibly upgraded to INFO level in the
> future
> > > > > pending
> > > > > > benchmarks to guard against performance degradation.
> > > > > >
> > > > > > RE 4: I think for a true "end-to-end" metric, it'd be useful to
> > > include
> > > > > the
> > > > > > time taken by the task to actually deliver the record. However,
> > with
> > > > the
> > > > > > new metric names and descriptions provided in the KIP, I have no
> > > > > objections
> > > > > > with what's currently proposed, and a new "end-to-end" metric can
> > be
> > > > > taken
> > > > > > on later in a follow-up KIP.
> > > > > >
> > > > > > RE 6: You're right, existing producer metrics should be enough
> for
> > > now.
> > > > > We
> > > > > > can revisit this later if/when we add delivery-centric metrics
> for
> > > sink
> > > > > > tasks as well.
> > > > > >
> > > > > > RE 7: The new metric names in the KIP LGTM; I don't see any need
> to
> > > > > expand
> > > > > > beyond those but if you'd still like to pursue others, LMK.
> > > > > >
> > > > > >
> > > > > > New thoughts:
> > > > > >
> > > > > > One small thought: instead of "alias" in
> "alias="{transform_alias}"
> > > for
> > > > > the
> > > > > > per-transform metrics, could we use "transform"? IMO it's clearer
> > > since
> > > > > we
> > > > > > don't use "alias" in the names of transform-related properties,
> and
> > > > > "alias"
> > > > > > may be confused with the classloading term where you can use,
> e.g.,
> > > > > > "FileStreamSource" as the name of a connector class in a
> connector
> > > > config
> > > > > > instead of
> > "org.apache.kafka.connect.file.FileStreamSourceConnector".
> > > > > >
> > > > > >
> > > > > > Cheers,
> > > > > >
> > > > > > Chris
> > > > > >
> > > > > > On Fri, Nov 18, 2022 at 12:06 PM Jorge Esteban Quilcate Otoya <
> > > > > > quilcate.jorge@gmail.com> wrote:
> > > > > >
> > > > > > > Thanks Mickael!
> > > > > > >
> > > > > > >
> > > > > > > On Wed, 9 Nov 2022 at 15:54, Mickael Maison <
> > > > mickael.maison@gmail.com>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Hi Jorge,
> > > > > > > >
> > > > > > > > Thanks for the KIP, it is a nice improvement.
> > > > > > > >
> > > > > > > > 1) The per transformation metrics still have a question mark
> > next
> > > > to
> > > > > > > > them in the KIP. Do you want to include them? If so we'll
> want
> > to
> > > > tag
> > > > > > > > them, we should be able to include the aliases in
> > > > TransformationChain
> > > > > > > > and use them.
> > > > > > > >
> > > > > > >
> > > > > > > Yes, I have added the changes on TransformChain that will be
> > needed
> > > > to
> > > > > > add
> > > > > > > these metrics.
> > > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > 2) I see no references to predicates. If we don't want to
> > measure
> > > > > > > > their latency, can we say it explicitly?
> > > > > > > >
> > > > > > >
> > > > > > > Good question, I haven't considered these. Though as these are
> > > > > > materialized
> > > > > > > as PredicatedTransformation, they should be covered by these
> > > changes.
> > > > > > > Adding a note about this.
> > > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > 3) Should we have sink-record-batch-latency-avg-ms? All other
> > > > metrics
> > > > > > > > have both the maximum and average values.
> > > > > > > >
> > > > > > > >
> > > > > > > Good question. I will remove it and change the record latency
> > from
> > > > > > > DEBUG->INFO as it already covers the maximum metric.
> > > > > > >
> > > > > > > Hope it's clearer now, let me know if there is any additional
> > > feedback.
> > > > > > > Thanks!
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Mickael
> > > > > > > >
> > > > > > > > On Thu, Oct 20, 2022 at 9:58 PM Jorge Esteban Quilcate Otoya
> > > > > > > > <qu...@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > > Thanks, Chris! Great feedback! Please, find my comments
> > below:
> > > > > > > > >
> > > > > > > > > On Thu, 13 Oct 2022 at 18:52, Chris Egerton
> > > > > <chrise@aiven.io.invalid
> > > > > > >
> > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Hi Jorge,
> > > > > > > > > >
> > > > > > > > > > Thanks for the KIP. I agree with the overall direction
> and
> > > > think
> > > > > > this
> > > > > > > > would
> > > > > > > > > > be a nice improvement to Kafka Connect. Here are my
> initial
> > > > > > thoughts
> > > > > > > > on the
> > > > > > > > > > details:
> > > > > > > > > >
> > > > > > > > > > 1. The motivation section outlines the gaps in Kafka
> > > Connect's
> > > > > task
> > > > > > > > metrics
> > > > > > > > > > nicely. I think it'd be useful to include more concrete
> > > details
> > > > > on
> > > > > > > why
> > > > > > > > > > these gaps need to be filled in, and in which cases
> > > additional
> > > > > > > metrics
> > > > > > > > > > would be helpful. One goal could be to provide enhanced
> > > > > monitoring
> > > > > > of
> > > > > > > > > > production deployments that allows for cluster
> > administrators
> > > > to
> > > > > > set
> > > > > > > up
> > > > > > > > > > automatic alerts for latency spikes and, if triggered,
> > > quickly
> > > > > > > > identify the
> > > > > > > > > > root cause of those alerts, reducing the time to
> > remediation.
> > > > > > Another
> > > > > > > > goal
> > > > > > > > > > could be to provide more insight to developers or cluster
> > > > > > > > administrators
> > > > > > > > > > who want to do performance testing on connectors in
> > > > > non-production
> > > > > > > > > > environments. It may help guide our decision making
> process
> > > to
> > > > > > have a
> > > > > > > > > > clearer picture of the goals we're trying to achieve.
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > Agree. The Motivation section has been updated.
> > > > > > > > > Thanks for the examples, I see both of them being covered
> by
> > > the
> > > > > KIP.
> > > > > > > > > I see how these could give us a good distinction on whether
> > to
> > > > > > position
> > > > > > > > > some metrics at INFO or DEBUG level.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > > 2. If we're trying to address the alert-and-diagnose use
> > > case,
> > > > > it'd
> > > > > > > be
> > > > > > > > > > useful to have as much information as possible at INFO
> > level,
> > > > > > rather
> > > > > > > > than
> > > > > > > > > > forcing cluster administrators to possibly reconfigure a
> > > > > connector
> > > > > > to
> > > > > > > > emit
> > > > > > > > > > DEBUG or TRACE level metrics in order to diagnose a
> > potential
> > > > > > > > > > production-impacting performance bottleneck. I can see
> the
> > > > > > rationale
> > > > > > > > for
> > > > > > > > > > emitting per-record metrics that track an average value
> at
> > > > DEBUG
> > > > > > > > level, but
> > > > > > > > > > for per-record metrics that track a maximum value, is
> there
> > > any
> > > > > > > reason
> > > > > > > > not
> > > > > > > > > > to provide this information at INFO level?
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > Agree. Though with Max and Avg metrics being part of the
> same
> > > > > sensor
> > > > > > —
> > > > > > > > > where Metric Level is defined — both metrics get the same
> > same
> > > > > level.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > > 3. I'm also curious about the performance testing
> suggested
> > > by
> > > > > Yash
> > > > > > > to
> > > > > > > > > > gauge the potential impact of this change. Have you been
> > able
> > > > to
> > > > > do
> > > > > > > any
> > > > > > > > > > testing with your draft implementation yet?
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > No, not so far.
> > > > > > > > > I think it would be valuable to discuss the scope of this
> > > testing
> > > > > and
> > > > > > > > maybe
> > > > > > > > > tackle it
> > > > > > > > > in a separate issue as Sensors and Metrics are used all
> over
> > > the
> > > > > > place.
> > > > > > > > > My initial understanding is that these tests should by
> placed
> > > in
> > > > > the
> > > > > > > > > jmh-benchmarks[1].
> > > > > > > > > Then, we could target testing Sensors and Metrics, and
> > validate
> > > > how
> > > > > > > much
> > > > > > > > > overhead
> > > > > > > > > is added by having only Max vs Max,Avg(,Min), etc.
> > > > > > > > > In the other hand, we could extend this to Transformers or
> > > other
> > > > > > > Connect
> > > > > > > > > layers.
> > > > > > > > >
> > > > > > > > > Here are some pointers to the Sensors and Metrics
> > > implementations
> > > > > > that
> > > > > > > > > could be considered:
> > > > > > > > > Path to metric recording:
> > > > > > > > > -
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/apache/kafka/blob/5cab11cf525f6c06fcf9eb43f7f95ef33fe1cdbb/clients/src/main/java/org/apache/kafka/common/metrics/Sensor.java#L195-L199
> > > > > > > > > -
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/apache/kafka/blob/5cab11cf525f6c06fcf9eb43f7f95ef33fe1cdbb/clients/src/main/java/org/apache/kafka/common/metrics/Sensor.java#L230-L244
> > > > > > > > >
> > > > > > > > > ```
> > > > > > > > > // increment all the stats
> > > > > > > > > for (StatAndConfig statAndConfig : this.stats) {
> > > > > > > > >    statAndConfig.stat.record(statAndConfig.config(), value, timeMs);
> > > > > > > > > }
> > > > > > > > > ```
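[Editor's aside: to make the cost of the loop quoted above concrete, here is a self-contained mimic — not the actual Kafka `Sensor`/`SampledStat` classes, and all names below are invented for illustration. Each call to `record` folds the new value into every registered stat, so the per-record cost grows linearly with the number of stats.]

```java
import java.util.List;
import java.util.function.DoubleBinaryOperator;

// Self-contained mimic of the Sensor.record() path: fold the value into each stat.
public class SensorSketch {
    static final class Stat {
        final DoubleBinaryOperator update; // (current, value) -> next
        double current;
        Stat(double initial, DoubleBinaryOperator update) {
            this.current = initial;
            this.update = update;
        }
    }

    private final List<Stat> stats;
    SensorSketch(List<Stat> stats) { this.stats = stats; }

    void record(double value) {
        // "increment all the stats" -- one primitive op per stat per record
        for (Stat s : stats) {
            s.current = s.update.applyAsDouble(s.current, value);
        }
    }

    public static void main(String[] args) {
        Stat max = new Stat(Double.NEGATIVE_INFINITY, Math::max);
        Stat sum = new Stat(0.0, Double::sum); // Avg divides later, when read
        SensorSketch sensor = new SensorSketch(List.of(max, sum));
        for (double latencyMs : new double[] {5.0, 12.0, 3.0}) {
            sensor.record(latencyMs);
        }
        System.out.println("max=" + max.current + " sum=" + sum.current);
        // prints max=12.0 sum=20.0
    }
}
```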
> > > > > > > > >
> > > > > > > > > SampledStats:
> > > > > > > > > - Avg:
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/apache/kafka/blob/068ab9cefae301f3187ea885d645c425955e77d2/clients/src/main/java/org/apache/kafka/common/metrics/stats/Avg.java
> > > > > > > > > - Max:
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/apache/kafka/blob/068ab9cefae301f3187ea885d645c425955e77d2/clients/src/main/java/org/apache/kafka/common/metrics/stats/Max.java
> > > > > > > > > - Min:
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/apache/kafka/blob/068ab9cefae301f3187ea885d645c425955e77d2/clients/src/main/java/org/apache/kafka/common/metrics/stats/Min.java
> > > > > > > > >
> > > > > > > > > `stat#record()` is implemented by the `update` method in
> > > > SampledStat:
> > > > > > > > >
> > > > > > > > > ```Max.java
> > > > > > > > >     @Override
> > > > > > > > >     protected void update(Sample sample, MetricConfig config, double value, long now) {
> > > > > > > > >         sample.value = Math.max(sample.value, value);
> > > > > > > > >     }
> > > > > > > > > ```
> > > > > > > > >
> > > > > > > > > ```Avg.java
> > > > > > > > >     @Override
> > > > > > > > >     protected void update(Sample sample, MetricConfig config, double value, long now) {
> > > > > > > > >         sample.value += value;
> > > > > > > > >     }
> > > > > > > > > ```
> > > > > > > > >
> > > > > > > > > As far as I understand, most of the work of the stats
> happens
> > > on
> > > > > the
> > > > > > > > > `combine` method that is not part of the connector
> execution
> > > but
> > > > > > called
> > > > > > > > > when metrics are queried.
> > > > > > > > >
> > > > > > > > > I wonder whether we should consider Avg and Max for all
> > metrics
> > > > > > > proposed
> > > > > > > > as
> > > > > > > > > the impact on the execution path seems minimal, and even
> see
> > if
> > > > Min
> > > > > > is
> > > > > > > > also
> > > > > > > > > valuable, and use DEBUG only for more granular metrics.
> > > > > > > > >
> > > > > > > > > [1]
> > https://github.com/apache/kafka/tree/trunk/jmh-benchmarks
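[Editor's aside: before full JMH coverage lands in jmh-benchmarks, a back-of-the-envelope harness can already bound the per-record cost of maintaining Max, Avg and Min together. Everything below, class and method names included, is invented for illustration and is not a substitute for a proper benchmark.]

```java
// Rough timing sketch, not a JMH benchmark: a real measurement belongs in
// jmh-benchmarks to account for warmup, dead-code elimination, etc.
public class StatOverheadSketch {
    // stats = {max, sumForAvg, min}; each extra stat is one primitive op
    static void recordInto(double[] stats, double value) {
        stats[0] = Math.max(stats[0], value);
        stats[1] += value;
        stats[2] = Math.min(stats[2], value);
    }

    public static void main(String[] args) {
        double[] stats = {Double.NEGATIVE_INFINITY, 0.0, Double.POSITIVE_INFINITY};
        int records = 1_000_000;
        long start = System.nanoTime();
        for (int i = 0; i < records; i++) {
            recordInto(stats, i % 100);
        }
        long elapsed = System.nanoTime() - start;
        System.out.println("ns/record=" + ((double) elapsed / records)
                + " max=" + stats[0] + " min=" + stats[2]);
    }
}
```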
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > > 4. Just to make sure I understand correctly--does "time
> > when
> > > it
> > > > > has
> > > > > > > > been
> > > > > > > > > > received by the Sink task" refer to the wallclock time
> > > directly
> > > > > > > after a
> > > > > > > > > > call to SinkTask::put has been completed (as opposed to
> > > > directly
> > > > > > > before
> > > > > > > > > > that call is made, or something else entirely)?
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > It currently means when it has been received by the Sink
> task
> > > > > > > > > right after consumer poll and before conversions.
> > > > > > > > > Would it be valuable to have it after put-sink-records?
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > > 5. If the goal is to identify performance bottlenecks
> > (either
> > > > in
> > > > > > > > production
> > > > > > > > > > or pre-production environments), would it make sense to
> > > > introduce
> > > > > > > > metrics
> > > > > > > > > > for each individual converter (i.e., key/value/header)
> and
> > > > > > > > transformation?
> > > > > > > > > > It's definitely an improvement to be able to identify the
> > > total
> > > > > > time
> > > > > > > > for
> > > > > > > > > > conversion and transformation, but then the immediate
> > > follow-up
> > > > > > > > question if
> > > > > > > > > > a bottleneck is found in that phase is "which
> > > > > > > converter/transformation
> > > > > > > > is
> > > > > > > > > > responsible?" It'd be nice if we could provide a way to
> > > quickly
> > > > > > > answer
> > > > > > > > that
> > > > > > > > > > question.
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > This is a great idea. I'd like to consider this as well,
> > though
> > > > > maybe
> > > > > > > > these
> > > > > > > > > more granular
> > > > > > > > > metrics would be good to have them as DEBUG.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > > 6. Any thoughts about offering latency metrics for source
> > > tasks
> > > > > > > between
> > > > > > > > > > receipt of the record from the task and delivery of the
> > > record
> > > > to
> > > > > > > Kafka
> > > > > > > > > > (which would be tracked by producer callback)? We could
> > also
> > > > use
> > > > > > the
> > > > > > > > record
> > > > > > > > > > timestamp either instead of or in addition to receipt
> time
> > if
> > > > the
> > > > > > > task
> > > > > > > > > > provides a timestamp with its records.
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > With source transform and convert metrics we get part of
> that
> > > > > > latency.
> > > > > > > > > Looking at the Producer metrics, `request-latency` (though
> a
> > > very
> > > > > > > generic
> > > > > > > > > metric)
> > > > > > > > > sort of captures the time between send request and ack — if
> my
> > > > > > > > understanding
> > > > > > > > > is correct.
> > > > > > > > > Would these be enough or you're thinking about another
> > > approach?
> > > > > > > > > maybe a custom metric to cover the producer side?
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > > 7. We may end up introducing a way for sink tasks to
> record
> > > > > > > per-record
> > > > > > > > > > delivery to the sink system (see KIP-767 [1]). I'd like
> it
> > if
> > > > we
> > > > > > > could
> > > > > > > > keep
> > > > > > > > > > the names of our metrics very precise in order to avoid
> > > > confusing
> > > > > > > users
> > > > > > > > > > (who may think that we're providing metrics on actual
> > > delivery
> > > > to
> > > > > > the
> > > > > > > > sink
> > > > > > > > > > system, which may not be the case if the connector
> performs
> > > > > > > > asynchronous
> > > > > > > > > > writes), and in order to leave room for a metrics on true
> > > > > delivery
> > > > > > > > time by
> > > > > > > > > > sink tasks. It'd also be nice if we could remain
> consistent
> > > > with
> > > > > > > > existing
> > > > > > > > > > metrics such as "put-batch-avg-time-ms". With that in
> mind,
> > > > what
> > > > > do
> > > > > > > you
> > > > > > > > > > think about renaming these metrics:
> > > > > > > > > > - "sink-record-batch-latency-max-ms" to
> > > > > "put-batch-avg-latency-ms"
> > > > > > > > > > - "sink-record-latency-max-ms" to
> > > > > "put-sink-record-latency-max-ms"
> > > > > > > > > > - "sink-record-latency-avg-ms" to
> > > > > "put-sink-record-latency-avg-ms"
> > > > > > > > > > - "sink-record-convert-transform-time-max-ms" to
> > > > > > > > > > "convert-transform-sink-record-time-max-ms"
> > > > > > > > > > - "sink-record-convert-transform-time-avg-ms" to
> > > > > > > > > > "convert-transform-sink-record-time-avg-ms"
> > > > > > > > > > - "source-record-transform-convert-time-max-ms" to
> > > > > > > > > > "transform-convert-source-record-time-max-ms"
> > > > > > > > > > - "source-record-transform-convert-time-avg-ms" to
> > > > > > > > > > "transform-convert-source-record-time-avg-ms"
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > Makes sense, thanks! I have updated the list of metrics,
> > > > > > > > > grouped them by sensor, and applied these suggestions.
> > > > > > > > > The only ones that I want to review are: sink-record-* to
> > > > > put-batch-*
> > > > > > > > > (first 3). Not sure if put-batch/put-sink-record describes
> > the
> > > > > > purpose
> > > > > > > of
> > > > > > > > > the metric — nor does `sink-record-latency`, to be honest.
> > > > > > > > > My initial thought was to have something like Kafka Streams
> > > > > > > e2e-latency.
> > > > > > > > > Based on 4. and 6. questions, an idea could be to add:
> > > > > > > > > - source-batch-e2e-latency-before-send: measure wallclock -
> > > > source
> > > > > > > record
> > > > > > > > > timestamp after source connector poll.
> > > > > > > > > - source-batch-e2e-latency-after-send: measure wallclock -
> > > record
> > > > > > > > timestamp
> > > > > > > > > on producer send callback
> > > > > > > > > - sink-batch-e2e-latency-before-put: measure time
> wallclock -
> > > > > record
> > > > > > > > > timestamp after consumer poll
> > > > > > > > > - sink-batch-e2e-latency-after-put: measure time wallclock
> -
> > > > record
> > > > > > > > > timestamp after sink connector put.
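[Editor's aside: all four proposed measurements reduce to the same computation — wallclock time at the measurement point minus the timestamp carried by the record. A minimal sketch follows; the names are invented, and a real task would read the timestamp from a SourceRecord/SinkRecord rather than a plain long.]

```java
// Minimal sketch of the proposed e2e measurement: wallclock minus record
// timestamp, taken at four points (before/after send, before/after put).
public class E2eLatencySketch {
    static long e2eLatencyMs(long wallClockMs, long recordTimestampMs) {
        return wallClockMs - recordTimestampMs;
    }

    public static void main(String[] args) {
        long recordTs = 1_000L;  // timestamp stamped on the record
        long afterPut = 1_250L;  // wallclock right after SinkTask::put returns
        System.out.println(e2eLatencyMs(afterPut, recordTs)); // prints 250
    }
}
```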
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > > Thanks again for the KIP! Looking forward to your
> thoughts.
> > > > > > > > > >
> > > > > > > > > > Cheers,
> > > > > > > > > >
> > > > > > > > > > Chris
> > > > > > > > > >
> > > > > > > > > > [1] -
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-767%3A+Connect+Latency+Metrics
> > > > > > > > > >
> > > > > > > > > > On Thu, Sep 15, 2022 at 1:32 PM Jorge Esteban Quilcate
> > Otoya
> > > <
> > > > > > > > > > quilcate.jorge@gmail.com> wrote:
> > > > > > > > > >
> > > > > > > > > > > Hi everyone,
> > > > > > > > > > >
> > > > > > > > > > > I've made a slight addition to the KIP based on Yash
> > > > feedback:
> > > > > > > > > > >
> > > > > > > > > > > - A new metric is added at INFO level to record the max
> > > > latency
> > > > > > > from
> > > > > > > > the
> > > > > > > > > > > batch timestamp, by keeping the oldest record timestamp
> > per
> > > > > > batch.
> > > > > > > > > > > - A draft implementation is linked.
> > > > > > > > > > >
> > > > > > > > > > > Looking forward to your feedback.
> > > > > > > > > > > Also, a kindly reminder that the vote thread is open.
> > > > > > > > > > >
> > > > > > > > > > > Thanks!
> > > > > > > > > > > Jorge.
> > > > > > > > > > >
> > > > > > > > > > > On Thu, 8 Sept 2022 at 14:25, Jorge Esteban Quilcate
> > Otoya
> > > <
> > > > > > > > > > > quilcate.jorge@gmail.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Great. I have updated the KIP to reflect this.
> > > > > > > > > > > >
> > > > > > > > > > > > Cheers,
> > > > > > > > > > > > Jorge.
> > > > > > > > > > > >
> > > > > > > > > > > > On Thu, 8 Sept 2022 at 12:26, Yash Mayya <
> > > > > yash.mayya@gmail.com
> > > > > > >
> > > > > > > > wrote:
> > > > > > > > > > > >
> > > > > > > > > > > >> Thanks, I think it makes sense to define these
> metrics
> > > at
> > > > a
> > > > > > > DEBUG
> > > > > > > > > > > >> recording
> > > > > > > > > > > >> level.
> > > > > > > > > > > >>
> > > > > > > > > > > >> On Thu, Sep 8, 2022 at 2:51 PM Jorge Esteban
> Quilcate
> > > > Otoya
> > > > > <
> > > > > > > > > > > >> quilcate.jorge@gmail.com> wrote:
> > > > > > > > > > > >>
> > > > > > > > > > > >> > On Thu, 8 Sept 2022 at 05:55, Yash Mayya <
> > > > > > > yash.mayya@gmail.com>
> > > > > > > > > > > wrote:
> > > > > > > > > > > >> >
> > > > > > > > > > > >> > > Hi Jorge,
> > > > > > > > > > > >> > >
> > > > > > > > > > > >> > > Thanks for the changes. With regard to having per-batch vs per-record
> > > > > > > > > > > >> > > metrics, the additional overhead I was referring to wasn't about whether
> > > > > > > > > > > >> > > or not we would need to iterate over all the records in a batch. I was
> > > > > > > > > > > >> > > referring to the potential additional overhead caused by the higher
> > > > > > > > > > > >> > > volume of calls to Sensor::record on the sensors for the new metrics (as
> > > > > > > > > > > >> > > compared to the existing batch-only metrics), especially for
> > > > > > > > > > > >> > > high-throughput connectors where batch sizes could be large. I guess we
> > > > > > > > > > > >> > > may want to do some sort of performance testing and get concrete numbers
> > > > > > > > > > > >> > > to verify whether this is a valid concern or not?
> > > > > > > > > > > >> > >
> > > > > > > > > > > >> >
> > > > > > > > > > > >> > 6.1. Got it, thanks for clarifying. I guess there could be a benchmark
> > > > > > > > > > > >> > test of `Sensor::record` to get an idea of the performance impact.
> > > > > > > > > > > >> > Regardless, the fact that these are single-record metrics, compared to
> > > > > > > > > > > >> > the existing batch-only ones, could be made explicit by setting these
> > > > > > > > > > > >> > metrics at a DEBUG or TRACE metric recording level, leaving the existing
> > > > > > > > > > > >> > ones at INFO level.
> > > > > > > > > > > >> > wdyt?
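As a rough illustration of what such a benchmark would compare (a proper benchmark should use JMH rather than a hand-rolled timing loop), here is a sketch that times N record() calls. The max-tracking "sensor" is a stand-in, not Kafka's Sensor class:

```java
// Crude, illustrative comparison of per-batch vs per-record metric
// recording cost. A real benchmark should use JMH; this only shows the
// shape of the experiment. The "sensor" here is a simple max-tracking
// stat, not org.apache.kafka.common.metrics.Sensor.
import java.util.concurrent.atomic.AtomicLong;

public class RecordOverheadSketch {
    static final AtomicLong max = new AtomicLong(Long.MIN_VALUE);

    // Stand-in for Sensor::record on a Max stat.
    static void record(long value) {
        max.accumulateAndGet(value, Math::max);
    }

    // Returns elapsed nanos for `calls` invocations of record().
    static long timeCalls(int calls) {
        long start = System.nanoTime();
        for (int i = 0; i < calls; i++) {
            record(i);
        }
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        long perBatch = timeCalls(1_000);        // one call per batch
        long perRecord = timeCalls(1_000 * 500); // one call per record, batch size 500
        System.out.println("per-batch ns: " + perBatch
                + ", per-record ns: " + perRecord);
    }
}
```

The interesting output is the ratio between the two timings under realistic batch sizes, which is what would inform the DEBUG-vs-INFO decision.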
> > > > > > > > > > > >> >
> > > > > > > > > > > >> >
> > > > > > > > > > > >> > >
> > > > > > > > > > > >> > > Thanks,
> > > > > > > > > > > >> > > Yash
> > > > > > > > > > > >> > >
> > > > > > > > > > > >> > > On Tue, Sep 6, 2022 at 4:42 PM Jorge Esteban
> > > Quilcate
> > > > > > Otoya
> > > > > > > <
> > > > > > > > > > > >> > > quilcate.jorge@gmail.com> wrote:
> > > > > > > > > > > >> > >
> > > > > > > > > > > >> > > > Hi Sagar and Yash,
> > > > > > > > > > > >> > > >
> > > > > > > > > > > >> > > > > the way it's defined in
> > > > > > > > > > > >> > > >
> > > > > > > https://kafka.apache.org/documentation/#connect_monitoring
> > > > > > > > for
> > > > > > > > > > > the
> > > > > > > > > > > >> > > metrics
> > > > > > > > > > > >> > > >
> > > > > > > > > > > >> > > > 4.1. Got it. Add it to the KIP.
> > > > > > > > > > > >> > > >
> > > > > > > > > > > >> > > > > The only thing I would argue is do we need
> > > > > > > > > > > >> sink-record-latency-min?
> > > > > > > > > > > >> > > Maybe
> > > > > > > > > > > >> > > > we
> > > > > > > > > > > >> > > > > could remove this min metric as well and
> make
> > > all
> > > > of
> > > > > > the
> > > > > > > > 3 e2e
> > > > > > > > > > > >> > metrics
> > > > > > > > > > > >> > > > > consistent
> > > > > > > > > > > >> > > >
> > > > > > > > > > > >> > > > 4.2 I see. Will remove it from the KIP.
> > > > > > > > > > > >> > > >
> > > > > > > > > > > >> > > > > Probably users can track the metrics at
> their
> > > end
> > > > to
> > > > > > > > > > > >> > > > > figure that out. Do you think that makes
> > sense?
> > > > > > > > > > > >> > > >
> > > > > > > > > > > >> > > > 4.3. Yes, agree. With these new metrics it
> > should
> > > be
> > > > > > > easier
> > > > > > > > for
> > > > > > > > > > > >> users
> > > > > > > > > > > >> > to
> > > > > > > > > > > >> > > > track this.
> > > > > > > > > > > >> > > >
> > > > > > > > > > > >> > > > > I think it makes sense to not have a min
> > metric
> > > > for
> > > > > > > > either to
> > > > > > > > > > > >> remain
> > > > > > > > > > > >> > > > > consistent with the existing put-batch and
> > > > > poll-batch
> > > > > > > > metrics
> > > > > > > > > > > >> > > >
> > > > > > > > > > > >> > > > 5.1. Got it. Same as 4.2
> > > > > > > > > > > >> > > >
> > > > > > > > > > > >> > > > > Another naming related suggestion I had was
> > with
> > > > the
> > > > > > > > > > > >> > > > > "convert-time" metrics - we should probably
> > > > include
> > > > > > > > > > > >> transformations
> > > > > > > > > > > >> > in
> > > > > > > > > > > >> > > > the
> > > > > > > > > > > >> > > > > name since SMTs could definitely be
> > attributable
> > > > to
> > > > > a
> > > > > > > > sizable
> > > > > > > > > > > >> chunk
> > > > > > > > > > > >> > of
> > > > > > > > > > > >> > > > the
> > > > > > > > > > > >> > > > > latency depending on the specific
> > transformation
> > > > > > chain.
> > > > > > > > > > > >> > > >
> > > > > > > > > > > >> > > > 5.2. Makes sense. I'm proposing to add
> > > > > > > > > > > >> > > > `sink-record-convert-transform...` and
> > > > > > > > > > > >> > > > `source-record-transform-convert...` to correctly represent the order
> > > > > > > > > > > >> > > > of operations.
> > > > > > > > > > > >> > > >
> > > > > > > > > > > >> > > > > it seems like both source and sink tasks
> only
> > > > record
> > > > > > > > metrics
> > > > > > > > > > at
> > > > > > > > > > > a
> > > > > > > > > > > >> > > "batch"
> > > > > > > > > > > >> > > > > level, not on an individual record level. I
> > > think
> > > > it
> > > > > > > > might be
> > > > > > > > > > > >> > > additional
> > > > > > > > > > > >> > > > > overhead if we want to record these new
> > metrics
> > > > all
> > > > > at
> > > > > > > the
> > > > > > > > > > > record
> > > > > > > > > > > >> > > level?
> > > > > > > > > > > >> > > >
> > > > > > > > > > > >> > > > 5.3. I considered at the beginning implementing all metrics at the
> > > > > > > > > > > >> > > > batch level, but given how the framework processes records, I fell back
> > > > > > > > > > > >> > > > to the proposed approach:
> > > > > > > > > > > >> > > > - Sink Task:
> > > > > > > > > > > >> > > >   - `WorkerSinkTask#convertMessages(msgs)` already iterates over
> > > > > > > > > > > >> > > > records, so there is no additional overhead to capture record latency
> > > > > > > > > > > >> > > > per record.
> > > > > > > > > > > >> > > >     -
> https://github.com/apache/kafka/blob/9841647c4fe422532f448423c92d26e4fdcb8932/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/WorkerSinkTask.java#L490-L514
> > > > > > > > > > > >> > > >   - `WorkerSinkTask#convertAndTransformRecord(record)` actually happens
> > > > > > > > > > > >> > > > individually. Measuring this operation per batch would include
> > > > > > > > > > > >> > > > processing that is not strictly part of "convert and transform"
> > > > > > > > > > > >> > > >     -
> https://github.com/apache/kafka/blob/9841647c4fe422532f448423c92d26e4fdcb8932/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/WorkerSinkTask.java#L518
> > > > > > > > > > > >> > > > - Source Task:
> > > > > > > > > > > >> > > >   - `AbstractWorkerSourceTask#sendRecords` iterates over a batch and
> > > > > > > > > > > >> > > > applies transforms and converts records individually as well:
> > > > > > > > > > > >> > > >     -
> https://github.com/apache/kafka/blob/9841647c4fe422532f448423c92d26e4fdcb8932/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/AbstractWorkerSourceTask.java#L389-L390
> > > > > > > > > > > >> > > >
> > > > > > > > > > > >> > > > > This might require some additional changes - for instance, with the
> > > > > > > > > > > >> > > > > "sink-record-latency" metric, we might only want to have a "max"
> > > > > > > > > > > >> > > > > metric since "avg" would require recording a value on the sensor for
> > > > > > > > > > > >> > > > > each record (whereas we can get a "max" by only recording a metric
> > > > > > > > > > > >> > > > > value for the oldest record in each batch).
> > > > > > > > > > > >> > > >
> > > > > > > > > > > >> > > > 5.4. Recording record-latency per batch may not be as useful, as there
> > > > > > > > > > > >> > > > is no guarantee that the oldest record will be representative of the
> > > > > > > > > > > >> > > > batch.
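To make the "max via oldest record" idea discussed above concrete, here is a sketch (names are illustrative): recording one value per batch, derived from the minimum record timestamp, yields the batch's maximum latency without a per-record Sensor::record call.

```java
// Sketch of computing a batch's maximum e2e latency from only the oldest
// (minimum) record timestamp, so a sensor is updated once per batch
// instead of once per record. Names are illustrative.
import java.util.List;

public class BatchMaxLatency {

    public static long maxLatencyMs(long wallclockMs, List<Long> recordTimestamps) {
        // The oldest record in the batch has the smallest timestamp and
        // therefore the largest latency. An empty batch yields zero.
        long oldest = recordTimestamps.stream()
                .mapToLong(Long::longValue)
                .min()
                .orElse(wallclockMs);
        return Math.max(0L, wallclockMs - oldest);
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        System.out.println(maxLatencyMs(now, List.of(now - 10, now - 300, now - 50)));
    }
}
```

An "avg" over the batch, by contrast, cannot be derived from a single per-batch value, which is the trade-off raised above.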
> > > > > > > > > > > >> > > >
> > > > > > > > > > > >> > > > On Sat, 3 Sept 2022 at 16:02, Yash Mayya <
> > > > > > > > yash.mayya@gmail.com>
> > > > > > > > > > > >> wrote:
> > > > > > > > > > > >> > > >
> > > > > > > > > > > >> > > > > Hi Jorge and Sagar,
> > > > > > > > > > > >> > > > >
> > > > > > > > > > > >> > > > > I think it makes sense to not have a min
> > metric
> > > > for
> > > > > > > > either to
> > > > > > > > > > > >> remain
> > > > > > > > > > > >> > > > > consistent with the existing put-batch and
> > > > > poll-batch
> > > > > > > > metrics
> > > > > > > > > > > (it
> > > > > > > > > > > >> > > doesn't
> > > > > > > > > > > >> > > > > seem particularly useful either anyway).
> Also,
> > > the
> > > > > new
> > > > > > > > > > > >> > > > > "sink-record-latency" metric name looks fine
> > to
> > > > me,
> > > > > > > > thanks for
> > > > > > > > > > > >> making
> > > > > > > > > > > >> > > the
> > > > > > > > > > > >> > > > > changes! Another naming related suggestion I
> > had
> > > > was
> > > > > > > with
> > > > > > > > the
> > > > > > > > > > > >> > > > > "convert-time" metrics - we should probably
> > > > include
> > > > > > > > > > > >> transformations
> > > > > > > > > > > >> > in
> > > > > > > > > > > >> > > > the
> > > > > > > > > > > >> > > > > name since SMTs could definitely be
> > attributable
> > > > to
> > > > > a
> > > > > > > > sizable
> > > > > > > > > > > >> chunk
> > > > > > > > > > > >> > of
> > > > > > > > > > > >> > > > the
> > > > > > > > > > > >> > > > > latency depending on the specific
> > transformation
> > > > > > chain.
> > > > > > > > > > > >> > > > >
> > > > > > > > > > > >> > > > > I have one high level question with respect
> to
> > > > > > > > implementation
> > > > > > > > > > -
> > > > > > > > > > > >> > > > currently,
> > > > > > > > > > > >> > > > > it seems like both source and sink tasks
> only
> > > > record
> > > > > > > > metrics
> > > > > > > > > > at
> > > > > > > > > > > a
> > > > > > > > > > > >> > > "batch"
> > > > > > > > > > > >> > > > > level, not on an individual record level. I
> > > think
> > > > it
> > > > > > > > might be
> > > > > > > > > > > >> > > additional
> > > > > > > > > > > >> > > > > overhead if we want to record these new
> > metrics
> > > > all
> > > > > at
> > > > > > > the
> > > > > > > > > > > record
> > > > > > > > > > > >> > > level?
> > > > > > > > > > > >> > > > > Could we instead make all of these new
> metrics
> > > for
> > > > > > > > batches of
> > > > > > > > > > > >> records
> > > > > > > > > > > >> > > > > rather than individual records in order to
> > > remain
> > > > > > > > consistent
> > > > > > > > > > > with
> > > > > > > > > > > >> the
> > > > > > > > > > > >> > > > > existing task level metrics? This might
> > require
> > > > some
> > > > > > > > > > additional
> > > > > > > > > > > >> > > changes -
> > > > > > > > > > > >> > > > > for instance, with the "sink-record-latency"
> > > > metric,
> > > > > > we
> > > > > > > > might
> > > > > > > > > > > only
> > > > > > > > > > > >> > want
> > > > > > > > > > > >> > > > to
> > > > > > > > > > > >> > > > > have a "max" metric since "avg" would
> require
> > > > > > recording
> > > > > > > a
> > > > > > > > > > value
> > > > > > > > > > > on
> > > > > > > > > > > >> > the
> > > > > > > > > > > >> > > > > sensor for each record (whereas we can get a
> > > "max"
> > > > > by
> > > > > > > only
> > > > > > > > > > > >> recording
> > > > > > > > > > > >> > a
> > > > > > > > > > > >> > > > > metric value for the oldest record in each
> > > batch).
> > > > > > > > > > > >> > > > >
> > > > > > > > > > > >> > > > > Thanks,
> > > > > > > > > > > >> > > > > Yash
> > > > > > > > > > > >> > > > >
> > > > > > > > > > > >> > > > > On Fri, Sep 2, 2022 at 3:16 PM Sagar <
> > > > > > > > > > sagarmeansocean@gmail.com
> > > > > > > > > > > >
> > > > > > > > > > > >> > > wrote:
> > > > > > > > > > > >> > > > >
> > > > > > > > > > > >> > > > > > Hi Jorge,
> > > > > > > > > > > >> > > > > >
> > > > > > > > > > > >> > > > > > Thanks for the changes.
> > > > > > > > > > > >> > > > > >
> > > > > > > > > > > >> > > > > > Regarding the metrics, I meant something
> > like
> > > > > this:
> > > > > > > > > > > >> > > > > >
> kafka.connect:type=sink-task-metrics,connector="{connector}",task="{task}"
> > > > > > > > > > > >> > > > > >
> > > > > > > > > > > >> > > > > > the way it's defined in
> > > > > > > > > > > >> > > > > >
> > > > > > > > https://kafka.apache.org/documentation/#connect_monitoring
> > > > > > > > > > > for
> > > > > > > > > > > >> the
> > > > > > > > > > > >> > > > > > metrics.
> > > > > > > > > > > >> > > > > >
> > > > > > > > > > > >> > > > > > I see what you mean by the 3 metrics and
> how
> > > it
> > > > > can
> > > > > > be
> > > > > > > > > > > >> interpreted.
> > > > > > > > > > > >> > > The
> > > > > > > > > > > >> > > > > > only thing I would argue is do we need
> > > > > > > > > > > sink-record-latency-min?
> > > > > > > > > > > >> > Maybe
> > > > > > > > > > > >> > > > we
> > > > > > > > > > > >> > > > > > could remove this min metric as well and
> > make
> > > > all
> > > > > of
> > > > > > > > the 3
> > > > > > > > > > e2e
> > > > > > > > > > > >> > > metrics
> > > > > > > > > > > >> > > > > > consistent(since put-batch also doesn't
> > > expose a
> > > > > min
> > > > > > > > which
> > > > > > > > > > > makes
> > > > > > > > > > > >> > > sense
> > > > > > > > > > > >> > > > to
> > > > > > > > > > > >> > > > > > me). I think this is in contrast to what
> > Yash
> > > > > > pointed
> > > > > > > > out
> > > > > > > > > > > above
> > > > > > > > > > > >> so
> > > > > > > > > > > >> > I
> > > > > > > > > > > >> > > > > would
> > > > > > > > > > > >> > > > > > like to hear his thoughts as well.
> > > > > > > > > > > >> > > > > >
> > > > > > > > > > > >> > > > > > The other point Yash mentioned about the
> > > > slightly
> > > > > > > flawed
> > > > > > > > > > > >> definition
> > > > > > > > > > > >> > > of
> > > > > > > > > > > >> > > > > e2e
> > > > > > > > > > > >> > > > > > is also true in a sense. But I have a
> > feeling
> > > > > that's
> > > > > > > > one the
> > > > > > > > > > > >> > records
> > > > > > > > > > > >> > > > are
> > > > > > > > > > > >> > > > > > polled by the connector tasks, it would be
> > > > > difficult
> > > > > > > to
> > > > > > > > > > track
> > > > > > > > > > > >> the
> > > > > > > > > > > >> > > final
> > > > > > > > > > > >> > > > > leg
> > > > > > > > > > > >> > > > > > via the framework. Probably users can
> track
> > > the
> > > > > > > metrics
> > > > > > > > at
> > > > > > > > > > > their
> > > > > > > > > > > >> > end
> > > > > > > > > > > >> > > to
> > > > > > > > > > > >> > > > > > figure that out. Do you think that makes
> > > sense?
> > > > > > > > > > > >> > > > > >
> > > > > > > > > > > >> > > > > > Thanks!
> > > > > > > > > > > >> > > > > > Sagar.
> > > > > > > > > > > >> > > > > >
> > > > > > > > > > > >> > > > > >
> > > > > > > > > > > >> > > > > >
> > > > > > > > > > > >> > > > > >
> > > > > > > > > > > >> > > > > > On Thu, Sep 1, 2022 at 11:40 PM Jorge
> > Esteban
> > > > > > Quilcate
> > > > > > > > > > Otoya <
> > > > > > > > > > > >> > > > > > quilcate.jorge@gmail.com> wrote:
> > > > > > > > > > > >> > > > > >
> > > > > > > > > > > >> > > > > > > Hi Sagar and Yash,
> > > > > > > > > > > >> > > > > > >
> > > > > > > > > > > >> > > > > > > Thanks for your feedback!
> > > > > > > > > > > >> > > > > > >
> > > > > > > > > > > >> > > > > > > > 1) I am assuming the new metrics would
> > be
> > > > task
> > > > > > > level
> > > > > > > > > > > metric.
> > > > > > > > > > > >> > > > > > >
> > > > > > > > > > > >> > > > > > > 1.1 Yes, it will be a task level metric,
> > > > > > implemented
> > > > > > > > on
> > > > > > > > > > the
> > > > > > > > > > > >> > > > > > > Worker[Source/Sink]Task.
> > > > > > > > > > > >> > > > > > >
> > > > > > > > > > > >> > > > > > > > Could you specify the way it's done
> for
> > > > other
> > > > > > > > > > sink/source
> > > > > > > > > > > >> > > > connector?
> > > > > > > > > > > >> > > > > > >
> > > > > > > > > > > >> > > > > > > 1.2. Not sure what do you mean by this.
> > > Could
> > > > > you
> > > > > > > > > > elaborate
> > > > > > > > > > > a
> > > > > > > > > > > >> bit
> > > > > > > > > > > >> > > > more?
> > > > > > > > > > > >> > > > > > >
> > > > > > > > > > > >> > > > > > > > 2. I am slightly confused about the
> e2e
> > > > > latency
> > > > > > > > > > metric...
> > > > > > > > > > > >> > > > > > >
> > > > > > > > > > > >> > > > > > > 2.1. Yes, I see. I was trying to bring a similar concept as in
> > > > > > > > > > > >> > > > > > > Streams with KIP-613, though the e2e concept may not be
> > > > > > > > > > > >> > > > > > > translatable. We could keep it as `sink-record-latency` to avoid
> > > > > > > > > > > >> > > > > > > conflating concepts. A similar metric naming was proposed in
> > > > > > > > > > > >> > > > > > > KIP-489 but at the consumer level, though it seems dormant for a
> > > > > > > > > > > >> > > > > > > couple of years.
> > > > > > > > > > > >> > > > > > >
> > > > > > > > > > > >> > > > > > > > However, the put-batch time measures
> the
> > > > > > > > > > > >> > > > > > > > time to put a batch of records to
> > external
> > > > > sink.
> > > > > > > > So, I
> > > > > > > > > > > would
> > > > > > > > > > > >> > > assume
> > > > > > > > > > > >> > > > > > the 2
> > > > > > > > > > > >> > > > > > > > can't be added as is to compute the
> e2e
> > > > > latency.
> > > > > > > > Maybe I
> > > > > > > > > > > am
> > > > > > > > > > > >> > > missing
> > > > > > > > > > > >> > > > > > > > something here. Could you plz clarify
> > > this.
> > > > > > > > > > > >> > > > > > >
> > > > > > > > > > > >> > > > > > > 2.2. Yes, agree. Not necessarily added, but with the 3 latencies
> > > > > > > > > > > >> > > > > > > (poll, convert, putBatch) it will be clearer where the bottleneck
> > > > > > > > > > > >> > > > > > > may be, and they represent the internal processing.
> > > > > > > > > > > >> > > > > > >
> > > > > > > > > > > >> > > > > > > > however, as per the KIP it looks like
> it
> > > > will
> > > > > be
> > > > > > > > > > > >> > > > > > > > the latency between when the record
> was
> > > > > written
> > > > > > to
> > > > > > > > Kafka
> > > > > > > > > > > and
> > > > > > > > > > > >> > when
> > > > > > > > > > > >> > > > the
> > > > > > > > > > > >> > > > > > > > record is returned by a sink task's
> > > > consumer's
> > > > > > > poll?
> > > > > > > > > > > >> > > > > > >
> > > > > > > > > > > >> > > > > > > 3.1. Agree. 2.1. could help to clarify
> > this.
> > > > > > > > > > > >> > > > > > >
> > > > > > > > > > > >> > > > > > > > One more thing - I was wondering
> > > > > > > > > > > >> > > > > > > > if there's a particular reason for
> > having
> > > a
> > > > > min
> > > > > > > > metric
> > > > > > > > > > for
> > > > > > > > > > > >> e2e
> > > > > > > > > > > >> > > > > latency
> > > > > > > > > > > >> > > > > > > but
> > > > > > > > > > > >> > > > > > > > not for convert time?
> > > > > > > > > > > >> > > > > > >
> > > > > > > > > > > >> > > > > > > 3.2. Was following KIP-613 for e2e, where Min seems useful to
> > > > > > > > > > > >> > > > > > > compare with Max and get an idea of the window of results, though
> > > > > > > > > > > >> > > > > > > current latencies in Connect do not include Min, and that's why I
> > > > > > > > > > > >> > > > > > > haven't added it for convert latency. Do you think it makes sense
> > > > > > > > > > > >> > > > > > > to extend latency metrics with Min?
> > > > > > > > > > > >> > > > > > >
> > > > > > > > > > > >> > > > > > > KIP is updated to clarify some of these
> > > > changes.
> > > > > > > > > > > >> > > > > > >
> > > > > > > > > > > >> > > > > > > Many thanks,
> > > > > > > > > > > >> > > > > > > Jorge.
> > > > > > > > > > > >> > > > > > >
> > > > > > > > > > > >> > > > > > > On Thu, 1 Sept 2022 at 18:11, Yash
> Mayya <
> > > > > > > > > > > >> yash.mayya@gmail.com>
> > > > > > > > > > > >> > > > wrote:
> > > > > > > > > > > >> > > > > > >
> > > > > > > > > > > >> > > > > > > > Hi Jorge,
> > > > > > > > > > > >> > > > > > > >
> > > > > > > > > > > >> > > > > > > > Thanks for the KIP! I have the same
> > > > confusion
> > > > > > with
> > > > > > > > the
> > > > > > > > > > > >> > > e2e-latency
> > > > > > > > > > > >> > > > > > > metrics
> > > > > > > > > > > >> > > > > > > > as Sagar above. "e2e" would seem to
> > > indicate
> > > > > the
> > > > > > > > latency
> > > > > > > > > > > >> > between
> > > > > > > > > > > >> > > > when
> > > > > > > > > > > >> > > > > > the
> > > > > > > > > > > >> > > > > > > > record was written to Kafka and when
> the
> > > > > record
> > > > > > > was
> > > > > > > > > > > written
> > > > > > > > > > > >> to
> > > > > > > > > > > >> > > the
> > > > > > > > > > > >> > > > > sink
> > > > > > > > > > > >> > > > > > > > system by the connector - however, as
> > per
> > > > the
> > > > > > KIP
> > > > > > > it
> > > > > > > > > > looks
> > > > > > > > > > > >> like
> > > > > > > > > > > >> > > it
> > > > > > > > > > > >> > > > > will
> > > > > > > > > > > >> > > > > > > be
> > > > > > > > > > > >> > > > > > > > the latency between when the record
> was
> > > > > written
> > > > > > to
> > > > > > > > Kafka
> > > > > > > > > > > and
> > > > > > > > > > > >> > when
> > > > > > > > > > > >> > > > the
> > > > > > > > > > > >> > > > > > > > record is returned by a sink task's
> > > > consumer's
> > > > > > > > poll? I
> > > > > > > > > > > think
> > > > > > > > > > > >> > that
> > > > > > > > > > > >> > > > > > metric
> > > > > > > > > > > >> > > > > > > > will be a little confusing to
> interpret.
> > > One
> > > > > > more
> > > > > > > > thing
> > > > > > > > > > -
> > > > > > > > > > > I
> > > > > > > > > > > >> was
> > > > > > > > > > > >> > > > > > wondering
> > > > > > > > > > > >> > > > > > > > if there's a particular reason for
> > having
> > > a
> > > > > min
> > > > > > > > metric
> > > > > > > > > > for
> > > > > > > > > > > >> e2e
> > > > > > > > > > > >> > > > > latency
> > > > > > > > > > > >> > > > > > > but
> > > > > > > > > > > >> > > > > > > > not for convert time?
> > > > > > > > > > > >> > > > > > > >
> > > > > > > > > > > >> > > > > > > > Thanks,
> > > > > > > > > > > >> > > > > > > > Yash
> > > > > > > > > > > >> > > > > > > >
> > > > > > > > > > > >> > > > > > > > On Thu, Sep 1, 2022 at 8:59 PM Sagar <
> > > > > > > > > > > >> > sagarmeansocean@gmail.com>
> > > > > > > > > > > >> > > > > > wrote:
> > > > > > > > > > > >> > > > > > > >
> > > > > > > > > > > >> > > > > > > > > Hi Jorge,
> > > > > > > > > > > >> > > > > > > > >
> > > > > > > > > > > >> > > > > > > > > Thanks for the KIP. It looks like a
> > very
> > > > > good
> > > > > > > > > > addition.
> > > > > > > > > > > I
> > > > > > > > > > > >> > > skimmed
> > > > > > > > > > > >> > > > > > > through
> > > > > > > > > > > >> > > > > > > > > once and had a couple of questions
> =>
> > > > > > > > > > > >> > > > > > > > >
> > > > > > > > > > > >> > > > > > > > > 1) I am assuming the new metrics
> would
> > > be
> > > > > task
> > > > > > > > level
> > > > > > > > > > > >> metric.
> > > > > > > > > > > >> > > > Could
> > > > > > > > > > > >> > > > > > you
> > > > > > > > > > > >> > > > > > > > > specify the way it's done for other
> > > > > > sink/source
> > > > > > > > > > > connector?
> > > > > > > > > > > >> > > > > > > > > 2) I am slightly confused about the
> > e2e
> > > > > > latency
> > > > > > > > > > metric.
> > > > > > > > > > > >> Let's
> > > > > > > > > > > >> > > > > > consider
> > > > > > > > > > > >> > > > > > > > the
> > > > > > > > > > > >> > > > > > > > > sink connector metric. If I look at
> > the
> > > > way
> > > > > > it's
> > > > > > > > > > > supposed
> > > > > > > > > > > >> to
> > > > > > > > > > > >> > be
> > > > > > > > > > > >> > > > > > > > calculated,
> > > > > > > > > > > >> > > > > > > > > i.e the difference between the
> record
> > > > > > timestamp
> > > > > > > > and
> > > > > > > > > > the
> > > > > > > > > > > >> wall
> > > > > > > > > > > >> > > > clock
> > > > > > > > > > > >> > > > > > > time,
> > > > > > > > > > > >> > > > > > > > it
> > > > > > > > > > > >> > > > > > > > > looks like a per record metric.
> > However,
> > > > the
> > > > > > > > put-batch
> > > > > > > > > > > >> time
> > > > > > > > > > > >> > > > > measures
> > > > > > > > > > > >> > > > > > > the
> > > > > > > > > > > >> > > > > > > > > time to put a batch of records to
> > > external
> > > > > > sink.
> > > > > > > > So, I
> > > > > > > > > > > >> would
> > > > > > > > > > > >> > > > assume
> > > > > > > > > > > >> > > > > > > the 2
> > > > > > > > > > > >> > > > > > > > > can't be added as is to compute the
> > e2e
> > > > > > latency.
> > > > > > > > > > Maybe I
> > > > > > > > > > > >> am
> > > > > > > > > > > >> > > > missing
> > > > > > > > > > > >> > > > > > > > > something here. Could you plz
> clarify
> > > > this.
> > > > > > > > > > > >> > > > > > > > >
> > > > > > > > > > > >> > > > > > > > > Thanks!
> > > > > > > > > > > >> > > > > > > > > Sagar.
> > > > > > > > > > > >> > > > > > > > >
> > > > > > > > > > > >> > > > > > > > > On Tue, Aug 30, 2022 at 8:43 PM
> Jorge
> > > > > Esteban
> > > > > > > > Quilcate
> > > > > > > > > > > >> Otoya
> > > > > > > > > > > >> > <
> > > > > > > > > > > >> > > > > > > > > quilcate.jorge@gmail.com> wrote:
> > > > > > > > > > > >> > > > > > > > >
> > > > > > > > > > > >> > > > > > > > > > Hi all,
> > > > > > > > > > > >> > > > > > > > > >
> > > > > > > > > > > >> > > > > > > > > > I'd like to start a discussion
> > thread
> > > on
> > > > > > > > KIP-864:
> > > > > > > > > > Add
> > > > > > > > > > > >> > > > End-To-End
> > > > > > > > > > > >> > > > > > > > Latency
> > > > > > > > > > > >> > > > > > > > > > Metrics to Connectors.
> > > > > > > > > > > >> > > > > > > > > > This KIP aims to improve the
> metrics
> > > > > > available
> > > > > > > > on
> > > > > > > > > > > Source
> > > > > > > > > > > >> > and
> > > > > > > > > > > >> > > > Sink
> > > > > > > > > > > >> > > > > > > > > > Connectors to measure end-to-end
> > > > latency,
> > > > > > > > including
> > > > > > > > > > > >> source
> > > > > > > > > > > >> > > and
> > > > > > > > > > > >> > > > > sink
> > > > > > > > > > > >> > > > > > > > > record
> > > > > > > > > > > >> > > > > > > > > > conversion time, and sink record
> e2e
> > > > > latency
> > > > > > > > > > (similar
> > > > > > > > > > > to
> > > > > > > > > > > >> > > > KIP-613
> > > > > > > > > > > >> > > > > > for
> > > > > > > > > > > >> > > > > > > > > > Streams).
> > > > > > > > > > > >> > > > > > > > > >
> > > > > > > > > > > >> > > > > > > > > > The KIP is here:
> > > > > > > > > > > >> > > > > > > > > >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-864%3A+Add+End-To-End+Latency+Metrics+to+Connectors
> > > > > > > > > > > >> > > > > > > > > >
> > > > > > > > > > > >> > > > > > > > > > Please take a look and let me know
> > > what
> > > > > you
> > > > > > > > think.
> > > > > > > > > > > >> > > > > > > > > >
> > > > > > > > > > > >> > > > > > > > > > Cheers,
> > > > > > > > > > > >> > > > > > > > > > Jorge.
> > > > > > > > > > > >> > > > > > > > > >
> > > > > > > > > > > >> > > > > > > > >
> > > > > > > > > > > >> > > > > > > >
> > > > > > > > > > > >> > > > > > >
> > > > > > > > > > > >> > > > > >
> > > > > > > > > > > >> > > > >
> > > > > > > > > > > >> > > >
> > > > > > > > > > > >> > >
> > > > > > > > > > > >> >
> > > > > > > > > > > >>
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] KIP-864: Add End-To-End Latency Metrics to Connectors

Posted by Jorge Esteban Quilcate Otoya <qu...@gmail.com>.
Sure! I have added the following to the proposed changes section:

```
The per-record metrics will definitely be added to Kafka Connect as part of
this KIP, but they will only be exposed at the proposed levels pending the
performance testing described in KAFKA-14441; otherwise they will be exposed
at a lower level (DEBUG instead of INFO, and TRACE instead of DEBUG).
```
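
For context on why leaving these at DEBUG is cheap when the level is not
enabled: each sensor carries a recording level, and `Sensor::record`
short-circuits when the worker's configured level is lower. Below is a
simplified, self-contained sketch of that gate — it only models the behaviour
of `org.apache.kafka.common.metrics.Sensor.RecordingLevel`, it does not use
the real classes, and class names here are made up for illustration:

```java
// Simplified model of Kafka's Sensor recording-level gate.
// Enum ordering mirrors Sensor.RecordingLevel: a sensor records only
// when its declared level is at or below the configured level.
enum RecordingLevel { INFO, DEBUG, TRACE }

class SimpleSensor {
    private final RecordingLevel sensorLevel;   // level this sensor was declared at
    private final RecordingLevel configLevel;   // level configured on the worker
    double max = Double.NEGATIVE_INFINITY;
    double sum = 0.0;
    long count = 0;

    SimpleSensor(RecordingLevel sensorLevel, RecordingLevel configLevel) {
        this.sensorLevel = sensorLevel;
        this.configLevel = configLevel;
    }

    boolean shouldRecord() {
        // A DEBUG sensor records under a DEBUG or TRACE config, but not INFO.
        return sensorLevel.ordinal() <= configLevel.ordinal();
    }

    void record(double value) {
        if (!shouldRecord())
            return; // per-record cost when disabled: one enum comparison
        max = Math.max(max, value);
        sum += value;
        count++;
    }

    double avg() { return count == 0 ? Double.NaN : sum / count; }
}

public class RecordingLevelDemo {
    public static void main(String[] args) {
        SimpleSensor enabled = new SimpleSensor(RecordingLevel.DEBUG, RecordingLevel.DEBUG);
        SimpleSensor disabled = new SimpleSensor(RecordingLevel.DEBUG, RecordingLevel.INFO);
        for (double latencyMs : new double[] {5.0, 15.0, 10.0}) {
            enabled.record(latencyMs);
            disabled.record(latencyMs);
        }
        System.out.println(enabled.max + " " + enabled.avg() + " " + disabled.count);
        // prints: 15.0 10.0 0
    }
}
```

So with the worker's `metrics.recording.level` left at its default of INFO, a
DEBUG-level per-record sensor should cost roughly one enum comparison per
record.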

Let me know how it looks.

Many thanks!
Jorge.

On Mon, 5 Dec 2022 at 14:11, Chris Egerton <ch...@aiven.io.invalid> wrote:

> Hi Jorge,
>
> Thanks for filing KAFKA-14441! In the ticket description we mention that
> "there will be more confidence whether to design metrics to be exposed at a
> DEBUG or INFO level depending on their impact" but it doesn't seem like
> this is called out in the KIP and, just based on what's in the KIP, the
> proposal is still to have several per-record metrics exposed at INFO level.
>
> Could we explicitly call out that the per-record metrics will definitely be
> added to Kafka Connect as part of this KIP, but they will only be exposed
> at INFO level pending the performance testing described in
> KAFKA-14441, and will otherwise only be exposed at DEBUG level? Otherwise,
> it's possible that a vote for the KIP as it's written today would be a vote
> in favor of unconditionally exposing these metrics at INFO level, even if
> the performance testing reveals issues.
>
> Cheers,
>
> Chris
>
> On Sun, Dec 4, 2022 at 7:08 PM Jorge Esteban Quilcate Otoya <
> quilcate.jorge@gmail.com> wrote:
>
> > Thanks for the reminder Chris!
> >
> > I have added a note on the KIP to include this as part of the KIP as most
> > of the metrics proposed are per-record and having all on DEBUG would limit
> > the benefits, and created https://issues.apache.org/jira/browse/KAFKA-14441
> > to keep track of this task.
> >
> > Cheers,
> > Jorge.
> >
> > On Tue, 29 Nov 2022 at 19:40, Chris Egerton <ch...@aiven.io.invalid>
> > wrote:
> >
> > > Hi Jorge,
> > >
> > > Thanks! What were your thoughts on the possible benchmarking and/or
> > > downgrading of per-record metrics to DEBUG?
> > >
> > > Cheers,
> > >
> > > Chris
> > >
> > > On Thu, Nov 24, 2022 at 8:20 AM Jorge Esteban Quilcate Otoya <
> > > quilcate.jorge@gmail.com> wrote:
> > >
> > > > Thanks Chris! I have updated the KIP with "transform" instead of
> > "alias".
> > > > Agree it's clearer.
> > > >
> > > > Cheers,
> > > > Jorge.
> > > >
> > > > On Mon, 21 Nov 2022 at 21:36, Chris Egerton <chrise@aiven.io.invalid
> >
> > > > wrote:
> > > >
> > > > > Hi Jorge,
> > > > >
> > > > > Thanks for the updates, and apologies for the delay. The new diagram
> > > > > directly under the "Proposed Changes" section is absolutely gorgeous!
> > > > >
> > > > >
> > > > > Follow-ups:
> > > > >
> > > > > RE 2: Good point. We can use the same level for these metrics, it's not a
> > > > > big deal.
> > > > >
> > > > > RE 3: As long as all the per-record metrics are kept at DEBUG level, it
> > > > > should be fine to leave JMH benchmarking for a follow-up. If we want to
> > > > > add new per-record, INFO-level metrics, I would be more comfortable with
> > > > > including benchmarking as part of the testing plan for the KIP. One
> > > > > possible compromise could be to propose that these features be merged at
> > > > > DEBUG level, and then possibly upgraded to INFO level in the future
> > > > > pending benchmarks to guard against performance degradation.
> > > > >
> > > > > RE 4: I think for a true "end-to-end" metric, it'd be useful to include
> > > > > the time taken by the task to actually deliver the record. However, with
> > > > > the new metric names and descriptions provided in the KIP, I have no
> > > > > objections with what's currently proposed, and a new "end-to-end" metric
> > > > > can be taken on later in a follow-up KIP.
> > > > >
> > > > > RE 6: You're right, existing producer metrics should be enough for now.
> > > > > We can revisit this later if/when we add delivery-centric metrics for
> > > > > sink tasks as well.
> > > > >
> > > > > RE 7: The new metric names in the KIP LGTM; I don't see any need to
> > > > > expand beyond those but if you'd still like to pursue others, LMK.
> > > > >
> > > > >
> > > > > New thoughts:
> > > > >
> > > > > One small thought: instead of "alias" in "alias="{transform_alias}" for
> > > > > the per-transform metrics, could we use "transform"? IMO it's clearer
> > > > > since we don't use "alias" in the names of transform-related properties,
> > > > > and "alias" may be confused with the classloading term where you can use,
> > > > > e.g., "FileStreamSource" as the name of a connector class in a connector
> > > > > config instead of "org.apache.kafka.connect.file.FileStreamSourceConnector".
> > > > >
> > > > >
> > > > > Cheers,
> > > > >
> > > > > Chris
> > > > >
> > > > > On Fri, Nov 18, 2022 at 12:06 PM Jorge Esteban Quilcate Otoya <
> > > > > quilcate.jorge@gmail.com> wrote:
> > > > >
> > > > > > Thanks Mickael!
> > > > > >
> > > > > >
> > > > > > On Wed, 9 Nov 2022 at 15:54, Mickael Maison <
> > > mickael.maison@gmail.com>
> > > > > > wrote:
> > > > > >
> > > > > > > Hi Jorge,
> > > > > > >
> > > > > > > Thanks for the KIP, it is a nice improvement.
> > > > > > >
> > > > > > > 1) The per transformation metrics still have a question mark next to
> > > > > > > them in the KIP. Do you want to include them? If so we'll want to tag
> > > > > > > them, we should be able to include the aliases in TransformationChain
> > > > > > > and use them.
> > > > > > >
> > > > > >
> > > > > > Yes, I have added the changes on TransformChain that will be needed to
> > > > > > add these metrics.
> > > > > >
> > > > > >
> > > > > > >
> > > > > > > 2) I see no references to predicates. If we don't want to measure
> > > > > > > their latency, can we say it explicitly?
> > > > > > >
> > > > > >
> > > > > > Good question, I haven't considered these. Though as these are
> > > > > > materialized as PredicatedTransformation, they should be covered by
> > > > > > these changes. Adding a note about this.
> > > > > >
> > > > > >
> > > > > > >
> > > > > > > 3) Should we have sink-record-batch-latency-avg-ms? All other metrics
> > > > > > > have both the maximum and average values.
> > > > > > >
> > > > > > >
> > > > > > Good question. I will remove it and change the record latency from
> > > > > > DEBUG->INFO as it already covers the maximum metric.
> > > > > >
> > > > > > Hope it's clearer now, let me know if there is any additional feedback.
> > > > > > Thanks!
> > > > > >
> > > > > >
> > > > > >
> > > > > > > Thanks,
> > > > > > > Mickael
> > > > > > >
> > > > > > > On Thu, Oct 20, 2022 at 9:58 PM Jorge Esteban Quilcate Otoya
> > > > > > > <qu...@gmail.com> wrote:
> > > > > > > >
> > > > > > > > Thanks, Chris! Great feedback! Please, find my comments
> below:
> > > > > > > >
> > > > > > > > On Thu, 13 Oct 2022 at 18:52, Chris Egerton
> > > > <chrise@aiven.io.invalid
> > > > > >
> > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Hi Jorge,
> > > > > > > > >
> > > > > > > > > Thanks for the KIP. I agree with the overall direction and
> > > think
> > > > > this
> > > > > > > would
> > > > > > > > > be a nice improvement to Kafka Connect. Here are my initial
> > > > > thoughts
> > > > > > > on the
> > > > > > > > > details:
> > > > > > > > >
> > > > > > > > > 1. The motivation section outlines the gaps in Kafka
> > Connect's
> > > > task
> > > > > > > metrics
> > > > > > > > > nicely. I think it'd be useful to include more concrete
> > details
> > > > on
> > > > > > why
> > > > > > > > > these gaps need to be filled in, and in which cases
> > additional
> > > > > > metrics
> > > > > > > > > would be helpful. One goal could be to provide enhanced
> > > > monitoring
> > > > > of
> > > > > > > > > production deployments that allows for cluster
> administrators
> > > to
> > > > > set
> > > > > > up
> > > > > > > > > automatic alerts for latency spikes and, if triggered,
> > quickly
> > > > > > > identify the
> > > > > > > > > root cause of those alerts, reducing the time to
> remediation.
> > > > > Another
> > > > > > > goal
> > > > > > > > > could be to provide more insight to developers or cluster
> > > > > > > administrators
> > > > > > > > > who want to do performance testing on connectors in
> > > > non-production
> > > > > > > > > environments. It may help guide our decision making process
> > to
> > > > > have a
> > > > > > > > > clearer picture of the goals we're trying to achieve.
> > > > > > > > >
> > > > > > > >
> > > > > > > > Agree. The Motivation section has been updated.
> > > > > > > > Thanks for the examples, I see both of them being covered by
> > the
> > > > KIP.
> > > > > > > > I see how these could give us a good distinction on whether
> to
> > > > > position
> > > > > > > > some metrics at INFO or DEBUG level.
> > > > > > > >
> > > > > > > >
> > > > > > > > > 2. If we're trying to address the alert-and-diagnose use
> > case,
> > > > it'd
> > > > > > be
> > > > > > > > > useful to have as much information as possible at INFO
> level,
> > > > > rather
> > > > > > > than
> > > > > > > > > forcing cluster administrators to possibly reconfigure a
> > > > connector
> > > > > to
> > > > > > > emit
> > > > > > > > > DEBUG or TRACE level metrics in order to diagnose a
> potential
> > > > > > > > > production-impacting performance bottleneck. I can see the
> > > > > rationale
> > > > > > > for
> > > > > > > > > emitting per-record metrics that track an average value at
> > > DEBUG
> > > > > > > level, but
> > > > > > > > > for per-record metrics that track a maximum value, is there
> > any
> > > > > > reason
> > > > > > > not
> > > > > > > > > to provide this information at INFO level?
> > > > > > > > >
> > > > > > > >
> > > > > > > > Agree. Though with Max and Avg metrics being part of the same
> > > > sensor
> > > > > —
> > > > > > > > where Metric Level is defined — then both metrics get the
> same
> > > > level.
> > > > > > > >
> > > > > > > >
> > > > > > > > > 3. I'm also curious about the performance testing suggested
> > by
> > > > Yash
> > > > > > to
> > > > > > > > > gauge the potential impact of this change. Have you been
> able
> > > to
> > > > do
> > > > > > any
> > > > > > > > > testing with your draft implementation yet?
> > > > > > > > >
> > > > > > > >
> > > > > > > > No, not so far.
> > > > > > > > I think it would be valuable to discuss the scope of this
> > testing
> > > > and
> > > > > > > maybe
> > > > > > > > tackle it
> > > > > > > > in a separate issue as Sensors and Metrics are used all over
> > the
> > > > > place.
> > > > > > > > My initial understanding is that these tests should by placed
> > in
> > > > the
> > > > > > > > jmh-benchmarks[1].
> > > > > > > > Then, we could target testing Sensors and Metrics, and
> validate
> > > how
> > > > > > much
> > > > > > > > overhead
> > > > > > > > is added by having only Max vs Max,Avg(,Min), etc.
> > > > > > > > In the other hand, we could extend this to Transformers or
> > other
> > > > > > Connect
> > > > > > > > layers.
> > > > > > > >
> > > > > > > > Here are some pointers to the Sensors and Metrics
> > implementations
> > > > > that
> > > > > > > > could be considered:
> > > > > > > > Path to metric recording:
> > > > > > > > - https://github.com/apache/kafka/blob/5cab11cf525f6c06fcf9eb43f7f95ef33fe1cdbb/clients/src/main/java/org/apache/kafka/common/metrics/Sensor.java#L195-L199
> > > > > > > > - https://github.com/apache/kafka/blob/5cab11cf525f6c06fcf9eb43f7f95ef33fe1cdbb/clients/src/main/java/org/apache/kafka/common/metrics/Sensor.java#L230-L244
> > > > > > > >
> > > > > > > > ```
> > > > > > > > // increment all the stats
> > > > > > > > for (StatAndConfig statAndConfig : this.stats) {
> > > > > > > >    statAndConfig.stat.record(statAndConfig.config(), value, timeMs);
> > > > > > > > }
> > > > > > > >
> > > > > > > > SampledStats:
> > > > > > > > - Avg: https://github.com/apache/kafka/blob/068ab9cefae301f3187ea885d645c425955e77d2/clients/src/main/java/org/apache/kafka/common/metrics/stats/Avg.java
> > > > > > > > - Max: https://github.com/apache/kafka/blob/068ab9cefae301f3187ea885d645c425955e77d2/clients/src/main/java/org/apache/kafka/common/metrics/stats/Max.java
> > > > > > > > - Min: https://github.com/apache/kafka/blob/068ab9cefae301f3187ea885d645c425955e77d2/clients/src/main/java/org/apache/kafka/common/metrics/stats/Min.java
> > > > > > > >
> > > > > > > > `stat#record()` is implemented by the `update` method in SampledStat:
> > > > > > > >
> > > > > > > > ```Avg.java
> > > > > > > >     @Override
> > > > > > > >     protected void update(Sample sample, MetricConfig config,
> > > > double
> > > > > > > value,
> > > > > > > > long now) {
> > > > > > > >         sample.value += value;
> > > > > > > >     }
> > > > > > > > ```
> > > > > > > >
> > > > > > > > ```Max.java
> > > > > > > >     @Override
> > > > > > > >     protected void update(Sample sample, MetricConfig config,
> > > > double
> > > > > > > value,
> > > > > > > > long now) {
> > > > > > > >         sample.value = Math.max(sample.value, value);
> > > > > > > >     }
> > > > > > > > ```
> > > > > > > >
> > > > > > > > As far as I understand, most of the work of the stats happens in
> > > > > > > > the `combine` method, which is not part of the connector execution
> > > > > > > > path but called when metrics are queried.
> > > > > > > >
> > > > > > > > I wonder whether we should consider Avg and Max for all metrics
> > > > > > > > proposed, as the impact on the execution path seems minimal, and
> > > > > > > > even see if Min is also valuable, and use DEBUG only for more
> > > > > > > > granular metrics.
> > > > > > > >
> > > > > > > > [1] https://github.com/apache/kafka/tree/trunk/jmh-benchmarks
> > > > > > > >
> > > > > > > >
> > > > > > > > > 4. Just to make sure I understand correctly--does "time
> when
> > it
> > > > has
> > > > > > > been
> > > > > > > > > received by the Sink task" refer to the wallclock time
> > directly
> > > > > > after a
> > > > > > > > > call to SinkTask::put has been completed (as opposed to
> > > directly
> > > > > > before
> > > > > > > > > that call is made, or something else entirely)?
> > > > > > > > >
> > > > > > > >
> > > > > > > > It currently means when it has been received by the Sink task
> > > > > > > > right after consumer poll and before conversions.
> > > > > > > > Would it be valuable to have it after put-sink-records?
> > > > > > > >
> > > > > > > >
> > > > > > > > > 5. If the goal is to identify performance bottlenecks
> (either
> > > in
> > > > > > > production
> > > > > > > > > or pre-production environments), would it make sense to
> > > introduce
> > > > > > > metrics
> > > > > > > > > for each individual converter (i.e., key/value/header) and
> > > > > > > transformation?
> > > > > > > > > It's definitely an improvement to be able to identify the
> > total
> > > > > time
> > > > > > > for
> > > > > > > > > conversion and transformation, but then the immediate
> > follow-up
> > > > > > > question if
> > > > > > > > > a bottleneck is found in that phase is "which
> > > > > > converter/transformation
> > > > > > > is
> > > > > > > > > responsible?" It'd be nice if we could provide a way to
> > quickly
> > > > > > answer
> > > > > > > that
> > > > > > > > > question.
> > > > > > > > >
> > > > > > > >
> > > > > > > > This is a great idea. I'd like to consider this as well,
> though
> > > > maybe
> > > > > > > these
> > > > > > > > more granular
> > > > > > > > metrics would be good to have them as DEBUG.
> > > > > > > >
> > > > > > > >
> > > > > > > > > 6. Any thoughts about offering latency metrics for source
> > tasks
> > > > > > between
> > > > > > > > > receipt of the record from the task and delivery of the
> > record
> > > to
> > > > > > Kafka
> > > > > > > > > (which would be tracked by producer callback)? We could
> also
> > > use
> > > > > the
> > > > > > > record
> > > > > > > > > timestamp either instead of or in addition to receipt time
> if
> > > the
> > > > > > task
> > > > > > > > > provides a timestamp with its records.
> > > > > > > > >
> > > > > > > >
> > > > > > > > With source transform and convert metrics we get part of that
> > > > > latency.
> > > > > > > > Looking at the Producer metrics, `request-latency` (though a
> > very
> > > > > > generic
> > > > > > > > metric)
> > > > > > > > sort of answer the time between send request and ack — if my
> > > > > > > understanding
> > > > > > > > is correct.
> > > > > > > > Would these be enough or you're thinking about another
> > approach?
> > > > > > > > maybe a custom metric to cover the producer side?
> > > > > > > >
> > > > > > > >
> > > > > > > > > 7. We may end up introducing a way for sink tasks to record
> > > > > > per-record
> > > > > > > > > delivery to the sink system (see KIP-767 [1]). I'd like it
> if
> > > we
> > > > > > could
> > > > > > > keep
> > > > > > > > > the names of our metrics very precise in order to avoid
> > > confusing
> > > > > > users
> > > > > > > > > (who may think that we're providing metrics on actual
> > delivery
> > > to
> > > > > the
> > > > > > > sink
> > > > > > > > > system, which may not be the case if the connector performs
> > > > > > > asynchronous
> > > > > > > > > writes), and in order to leave room for a metrics on true
> > > > delivery
> > > > > > > time by
> > > > > > > > > sink tasks. It'd also be nice if we could remain consistent
> > > with
> > > > > > > existing
> > > > > > > > > metrics such as "put-batch-avg-time-ms". With that in mind,
> > > what
> > > > do
> > > > > > you
> > > > > > > > > think about renaming these metrics:
> > > > > > > > > - "sink-record-batch-latency-max-ms" to
> > > > "put-batch-avg-latency-ms"
> > > > > > > > > - "sink-record-latency-max-ms" to
> > > > "put-sink-record-latency-max-ms"
> > > > > > > > > - "sink-record-latency-avg-ms" to
> > > > "put-sink-record-latency-avg-ms"
> > > > > > > > > - "sink-record-convert-transform-time-max-ms" to
> > > > > > > > > "convert-transform-sink-record-time-max-ms"
> > > > > > > > > - "sink-record-convert-transform-time-avg-ms" to
> > > > > > > > > "convert-transform-sink-record-time-avg-ms"
> > > > > > > > > - "source-record-transform-convert-time-max-ms" to
> > > > > > > > > "transform-convert-source-record-time-max-ms"
> > > > > > > > > - "source-record-transform-convert-time-avg-ms" to
> > > > > > > > > "transform-convert-source-record-time-avg-ms"
> > > > > > > > >
> > > > > > > >
> > > > > > > > Make sense, thanks! I have updated the list of metrics and
> > group
> > > > them
> > > > > > by
> > > > > > > > sensor and applying these suggestions.
> > > > > > > > The only ones that I want to review are: sink-record-* to
> > > > put-batch-*
> > > > > > > > (first 3). Not sure if put-batch/put-sink-record describes
> the
> > > > > purpose
> > > > > > of
> > > > > > > > the metric — neither `sink-record-latency` to be honest.
> > > > > > > > My initial thought was to have something like Kafka Streams
> > > > > > e2e-latency.
> > > > > > > > Based on 4. and 6. questions, an idea could be to add:
> > > > > > > > - source-batch-e2e-latency-before-send: measure wallclock -
> > > source
> > > > > > record
> > > > > > > > timestamp after source connector poll.
> > > > > > > > - source-batch-e2e-latency-after-send: measure wallclock -
> > record
> > > > > > > timestamp
> > > > > > > > on producer send callback
> > > > > > > > - sink-batch-e2e-latency-before-put: measure time wallclock -
> > > > record
> > > > > > > > timestamp after consumer poll
> > > > > > > > - sink-batch-e2e-latency-after-put: measure time wallclock -
> > > record
> > > > > > > > timestamp after sink connector put.
> > > > > > > >
> > > > > > > >
> > > > > > > > > Thanks again for the KIP! Looking forward to your thoughts.
> > > > > > > > >
> > > > > > > > > Cheers,
> > > > > > > > >
> > > > > > > > > Chris
> > > > > > > > >
> > > > > > > > > [1] - https://cwiki.apache.org/confluence/display/KAFKA/KIP-767%3A+Connect+Latency+Metrics
> > > > > > > > >
> > > > > > > > > On Thu, Sep 15, 2022 at 1:32 PM Jorge Esteban Quilcate
> Otoya
> > <
> > > > > > > > > quilcate.jorge@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > > > Hi everyone,
> > > > > > > > > >
> > > > > > > > > > I've made a slight addition to the KIP based on Yash
> > > feedback:
> > > > > > > > > >
> > > > > > > > > > - A new metric is added at INFO level to record the max
> > > latency
> > > > > > from
> > > > > > > the
> > > > > > > > > > batch timestamp, by keeping the oldest record timestamp
> per
> > > > > batch.
> > > > > > > > > > - A draft implementation is linked.
> > > > > > > > > >
> > > > > > > > > > Looking forward to your feedback.
> > > > > > > > > > Also, a kindly reminder that the vote thread is open.
> > > > > > > > > >
> > > > > > > > > > Thanks!
> > > > > > > > > > Jorge.
> > > > > > > > > >
> > > > > > > > > > On Thu, 8 Sept 2022 at 14:25, Jorge Esteban Quilcate
> Otoya
> > <
> > > > > > > > > > quilcate.jorge@gmail.com> wrote:
> > > > > > > > > >
> > > > > > > > > > > Great. I have updated the KIP to reflect this.
> > > > > > > > > > >
> > > > > > > > > > > Cheers,
> > > > > > > > > > > Jorge.
> > > > > > > > > > >
> > > > > > > > > > > On Thu, 8 Sept 2022 at 12:26, Yash Mayya <
> > > > yash.mayya@gmail.com
> > > > > >
> > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > >> Thanks, I think it makes sense to define these metrics
> > at
> > > a
> > > > > > DEBUG
> > > > > > > > > > >> recording
> > > > > > > > > > >> level.
> > > > > > > > > > >>
> > > > > > > > > > >> On Thu, Sep 8, 2022 at 2:51 PM Jorge Esteban Quilcate
> > > Otoya
> > > > <
> > > > > > > > > > >> quilcate.jorge@gmail.com> wrote:
> > > > > > > > > > >>
> > > > > > > > > > >> > On Thu, 8 Sept 2022 at 05:55, Yash Mayya <
> > > > > > yash.mayya@gmail.com>
> > > > > > > > > > wrote:
> > > > > > > > > > >> >
> > > > > > > > > > >> > > Hi Jorge,
> > > > > > > > > > >> > >
> > > > > > > > > > >> > > Thanks for the changes. With regard to having per
> > > batch
> > > > vs
> > > > > > per
> > > > > > > > > > record
> > > > > > > > > > >> > > metrics, the additional overhead I was referring
> to
> > > > wasn't
> > > > > > > about
> > > > > > > > > > >> whether
> > > > > > > > > > >> > or
> > > > > > > > > > >> > > not we would need to iterate over all the records
> > in a
> > > > > > batch.
> > > > > > > I
> > > > > > > > > was
> > > > > > > > > > >> > > referring to the potential additional overhead
> > caused
> > > by
> > > > > the
> > > > > > > > > higher
> > > > > > > > > > >> > volume
> > > > > > > > > > >> > > of calls to Sensor::record on the sensors for the
> > new
> > > > > > metrics
> > > > > > > (as
> > > > > > > > > > >> > compared
> > > > > > > > > > >> > > to the existing batch only metrics), especially
> for
> > > high
> > > > > > > > > throughput
> > > > > > > > > > >> > > connectors where batch sizes could be large. I
> guess
> > > we
> > > > > may
> > > > > > > want
> > > > > > > > > to
> > > > > > > > > > do
> > > > > > > > > > >> > some
> > > > > > > > > > >> > > sort of performance testing and get concrete
> numbers
> > > to
> > > > > > verify
> > > > > > > > > > whether
> > > > > > > > > > >> > this
> > > > > > > > > > >> > > is a valid concern or not?
> > > > > > > > > > >> > >
> > > > > > > > > > >> >
> > > > > > > > > > >> > 6.1. Got it, thanks for clarifying. I guess there
> > could
> > > > be a
> > > > > > > > > benchmark
> > > > > > > > > > >> test
> > > > > > > > > > >> > of the `Sensor::record` to get an idea of the
> > > performance
> > > > > > > impact.
> > > > > > > > > > >> > Regardless, the fact that these are single-record
> > > metrics
> > > > > > > compared
> > > > > > > > > to
> > > > > > > > > > >> > existing batch-only could be explicitly defined by
> > > setting
> > > > > > these
> > > > > > > > > > >> metrics at
> > > > > > > > > > >> > a DEBUG or TRACE metric recording level, leaving the
> > > > > existing
> > > > > > at
> > > > > > > > > INFO
> > > > > > > > > > >> > level.
> > > > > > > > > > >> > wdyt?
> > > > > > > > > > >> >
> > > > > > > > > > >> >
> > > > > > > > > > >> > >
> > > > > > > > > > >> > > Thanks,
> > > > > > > > > > >> > > Yash
> > > > > > > > > > >> > >
> > > > > > > > > > >> > > On Tue, Sep 6, 2022 at 4:42 PM Jorge Esteban
> > Quilcate
> > > > > Otoya
> > > > > > <
> > > > > > > > > > >> > > quilcate.jorge@gmail.com> wrote:
> > > > > > > > > > >> > >
> > > > > > > > > > >> > > > Hi Sagar and Yash,
> > > > > > > > > > >> > > >
> > > > > > > > > > >> > > > > the way it's defined in
> > > > > > > > > > >> > > >
> > > > > > https://kafka.apache.org/documentation/#connect_monitoring
> > > > > > > for
> > > > > > > > > > the
> > > > > > > > > > >> > > metrics
> > > > > > > > > > >> > > >
> > > > > > > > > > >> > > > 4.1. Got it. Add it to the KIP.
> > > > > > > > > > >> > > >
> > > > > > > > > > >> > > > > The only thing I would argue is do we need
> > > > > > > > > > >> sink-record-latency-min?
> > > > > > > > > > >> > > Maybe
> > > > > > > > > > >> > > > we
> > > > > > > > > > >> > > > > could remove this min metric as well and make
> > all
> > > of
> > > > > the
> > > > > > > 3 e2e
> > > > > > > > > > >> > metrics
> > > > > > > > > > >> > > > > consistent
> > > > > > > > > > >> > > >
> > > > > > > > > > >> > > > 4.2 I see. Will remove it from the KIP.
> > > > > > > > > > >> > > >
> > > > > > > > > > >> > > > > Probably users can track the metrics at their
> > end
> > > to
> > > > > > > > > > >> > > > > figure that out. Do you think that makes
> sense?
> > > > > > > > > > >> > > >
> > > > > > > > > > >> > > > 4.3. Yes, agree. With these new metrics it
> should
> > be
> > > > > > easier
> > > > > > > for
> > > > > > > > > > >> users
> > > > > > > > > > >> > to
> > > > > > > > > > >> > > > track this.
> > > > > > > > > > >> > > >
> > > > > > > > > > >> > > > > I think it makes sense to not have a min
> metric
> > > for
> > > > > > > either to
> > > > > > > > > > >> remain
> > > > > > > > > > >> > > > > consistent with the existing put-batch and
> > > > poll-batch
> > > > > > > metrics
> > > > > > > > > > >> > > >
> > > > > > > > > > >> > > > 5.1. Got it. Same as 4.2
> > > > > > > > > > >> > > >
> > > > > > > > > > >> > > > > Another naming related suggestion I had was
> with
> > > the
> > > > > > > > > > >> > > > > "convert-time" metrics - we should probably
> > > include
> > > > > > > > > > >> transformations
> > > > > > > > > > >> > in
> > > > > > > > > > >> > > > the
> > > > > > > > > > >> > > > > name since SMTs could definitely be
> attributable
> > > to
> > > > a
> > > > > > > sizable
> > > > > > > > > > >> chunk
> > > > > > > > > > >> > of
> > > > > > > > > > >> > > > the
> > > > > > > > > > >> > > > > latency depending on the specific
> transformation
> > > > > chain.
> > > > > > > > > > >> > > >
> > > > > > > > > > >> > > > 5.2. Make sense. I'm proposing to add
> > > > > > > > > > >> > `sink-record-convert-transform...`
> > > > > > > > > > >> > > > and `source-record-transform-convert...` to
> > > represent
> > > > > > > correctly
> > > > > > > > > > the
> > > > > > > > > > >> > order
> > > > > > > > > > >> > > > of operations.
> > > > > > > > > > >> > > >
> > > > > > > > > > >> > > > > it seems like both source and sink tasks only
> > > record
> > > > > > > metrics
> > > > > > > > > at
> > > > > > > > > > a
> > > > > > > > > > >> > > "batch"
> > > > > > > > > > >> > > > > level, not on an individual record level. I
> > think
> > > it
> > > > > > > might be
> > > > > > > > > > >> > > additional
> > > > > > > > > > >> > > > > overhead if we want to record these new
> metrics
> > > all
> > > > at
> > > > > > the
> > > > > > > > > > record
> > > > > > > > > > >> > > level?
> > > > > > > > > > >> > > >
> > > > > > > > > > >> > > > 5.3. I considered at the beginning to implement
> > all
> > > > > > metrics
> > > > > > > at
> > > > > > > > > the
> > > > > > > > > > >> > batch
> > > > > > > > > > >> > > > level, but given how the framework process
> > records,
> > > I
> > > > > > > fallback
> > > > > > > > > to
> > > > > > > > > > >> the
> > > > > > > > > > >> > > > proposed approach:
> > > > > > > > > > >> > > > - Sink Task:
> > > > > > > > > > >> > > >   - `WorkerSinkTask#convertMessages(msgs)`
> already
> > > > > > iterates
> > > > > > > over
> > > > > > > > > > >> > records,
> > > > > > > > > > >> > > > so there is no additional overhead to capture
> > record
> > > > > > > latency per
> > > > > > > > > > >> > record.
> > > > > > > > > > >> > > >     - https://github.com/apache/kafka/blob/9841647c4fe422532f448423c92d26e4fdcb8932/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/WorkerSinkTask.java#L490-L514
> > > > > > > > > > >> > > >   -
> > > `WorkerSinkTask#convertAndTransformRecord(record)`
> > > > > > > actually
> > > > > > > > > > >> happens
> > > > > > > > > > >> > > > individually. Measuring this operation per batch
> > > would
> > > > > > > include
> > > > > > > > > > >> > processing
> > > > > > > > > > >> > > > that is not strictly part of "convert and
> > transform"
> > > > > > > > > > >> > > >     - https://github.com/apache/kafka/blob/9841647c4fe422532f448423c92d26e4fdcb8932/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/WorkerSinkTask.java#L518
> > > > > > > > > > >> > > > - Source Task:
> > > > > > > > > > >> > > >   - `AbstractWorkerSourceTask#sendRecords`
> > iterates
> > > > > over a
> > > > > > > batch
> > > > > > > > > > and
> > > > > > > > > > >> > > > applies transforms and convert record
> individually
> > > as
> > > > > > well:
> > > > > > > > > > >> > > >     - https://github.com/apache/kafka/blob/9841647c4fe422532f448423c92d26e4fdcb8932/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/AbstractWorkerSourceTask.java#L389-L390
> > > > > > > > > > >> > > >
> > > > > > > > > > >> > > > > This might require some additional changes -
> > > > > > > > > > >> > > > > for instance, with the "sink-record-latency"
> > > metric,
> > > > > we
> > > > > > > might
> > > > > > > > > > only
> > > > > > > > > > >> > want
> > > > > > > > > > >> > > > to
> > > > > > > > > > >> > > > > have a "max" metric since "avg" would require
> > > > > recording
> > > > > > a
> > > > > > > > > value
> > > > > > > > > > on
> > > > > > > > > > >> > the
> > > > > > > > > > >> > > > > sensor for each record (whereas we can get a
> > "max"
> > > > by
> > > > > > only
> > > > > > > > > > >> recording
> > > > > > > > > > >> > a
> > > > > > > > > > >> > > > > metric value for the oldest record in each
> > batch).
> > > > > > > > > > >> > > >
> > > > > > > > > > >> > > > 5.4. Recording record-latency per batch may not
> be
> > > as
> > > > > > > useful as
> > > > > > > > > > >> there
> > > > > > > > > > >> > is
> > > > > > > > > > >> > > no
> > > > > > > > > > >> > > > guarantee that the oldest record will be
> > > > representative
> > > > > of
> > > > > > > the
> > > > > > > > > > >> batch.
> > > > > > > > > > >> > > >
> > > > > > > > > > >> > > > On Sat, 3 Sept 2022 at 16:02, Yash Mayya <
> > > > > > > yash.mayya@gmail.com>
> > > > > > > > > > >> wrote:
> > > > > > > > > > >> > > >
> > > > > > > > > > >> > > > > Hi Jorge and Sagar,
> > > > > > > > > > >> > > > >
> > > > > > > > > > >> > > > > I think it makes sense to not have a min
> metric
> > > for
> > > > > > > either to
> > > > > > > > > > >> remain
> > > > > > > > > > >> > > > > consistent with the existing put-batch and
> > > > poll-batch
> > > > > > > metrics
> > > > > > > > > > (it
> > > > > > > > > > >> > > doesn't
> > > > > > > > > > >> > > > > seem particularly useful either anyway). Also,
> > the
> > > > new
> > > > > > > > > > >> > > > > "sink-record-latency" metric name looks fine
> to
> > > me,
> > > > > > > thanks for
> > > > > > > > > > >> making
> > > > > > > > > > >> > > the
> > > > > > > > > > >> > > > > changes! Another naming related suggestion I
> had
> > > was
> > > > > > with
> > > > > > > the
> > > > > > > > > > >> > > > > "convert-time" metrics - we should probably
> > > include
> > > > > > > > > > >> transformations
> > > > > > > > > > >> > in
> > > > > > > > > > >> > > > the
> > > > > > > > > > >> > > > > name since SMTs could definitely be
> attributable
> > > to
> > > > a
> > > > > > > sizable
> > > > > > > > > > >> chunk
> > > > > > > > > > >> > of
> > > > > > > > > > >> > > > the
> > > > > > > > > > >> > > > > latency depending on the specific
> transformation
> > > > > chain.
> > > > > > > > > > >> > > > >
> > > > > > > > > > >> > > > > I have one high level question with respect to
> > > > > > > implementation
> > > > > > > > > -
> > > > > > > > > > >> > > > currently,
> > > > > > > > > > >> > > > > it seems like both source and sink tasks only
> > > record
> > > > > > > metrics
> > > > > > > > > at
> > > > > > > > > > a
> > > > > > > > > > >> > > "batch"
> > > > > > > > > > >> > > > > level, not on an individual record level. I
> > think
> > > it
> > > > > > > might be
> > > > > > > > > > >> > > additional
> > > > > > > > > > >> > > > > overhead if we want to record these new
> metrics
> > > all
> > > > at
> > > > > > the
> > > > > > > > > > record
> > > > > > > > > > >> > > level?
> > > > > > > > > > >> > > > > Could we instead make all of these new metrics
> > for
> > > > > > > batches of
> > > > > > > > > > >> records
> > > > > > > > > > >> > > > > rather than individual records in order to
> > remain
> > > > > > > consistent
> > > > > > > > > > with
> > > > > > > > > > >> the
> > > > > > > > > > >> > > > > existing task level metrics? This might
> require
> > > some
> > > > > > > > > additional
> > > > > > > > > > >> > > changes -
> > > > > > > > > > >> > > > > for instance, with the "sink-record-latency"
> > > metric,
> > > > > we
> > > > > > > might
> > > > > > > > > > only
> > > > > > > > > > >> > want
> > > > > > > > > > >> > > > to
> > > > > > > > > > >> > > > > have a "max" metric since "avg" would require
> > > > > recording
> > > > > > a
> > > > > > > > > value
> > > > > > > > > > on
> > > > > > > > > > >> > the
> > > > > > > > > > >> > > > > sensor for each record (whereas we can get a
> > "max"
> > > > by
> > > > > > only
> > > > > > > > > > >> recording
> > > > > > > > > > >> > a
> > > > > > > > > > >> > > > > metric value for the oldest record in each
> > batch).
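[Editor's note: a minimal, self-contained sketch of the batching optimization described above — hypothetical helper names, not Connect's actual API. Because the oldest record in a batch has the largest latency, a "max" metric needs only one sensor recording per batch:

```java
import java.util.List;

// Illustrative sketch only -- not Kafka Connect's implementation.
// For a max-latency metric it suffices to record one value per batch:
// the latency of the oldest (smallest-timestamp) record.
public class BatchMaxLatencySketch {

    // Stand-in for a consumer record; only the Kafka timestamp matters here.
    static final class SinkRecordStub {
        final long timestampMs;
        SinkRecordStub(long timestampMs) { this.timestampMs = timestampMs; }
    }

    static long batchMaxLatencyMs(List<SinkRecordStub> batch, long nowMs) {
        long oldest = nowMs;
        for (SinkRecordStub record : batch) {
            oldest = Math.min(oldest, record.timestampMs);
        }
        return nowMs - oldest; // oldest record => maximum latency in the batch
    }

    public static void main(String[] args) {
        List<SinkRecordStub> batch = List.of(
                new SinkRecordStub(900L),
                new SinkRecordStub(1000L),
                new SinkRecordStub(1100L));
        // One recording covers the whole batch for the "max" statistic.
        System.out.println(batchMaxLatencyMs(batch, 1250L)); // prints 350
    }
}
```

An "avg" statistic cannot use this shortcut, which is the trade-off discussed above.]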
> > > > > > > > > > >> > > > >
> > > > > > > > > > >> > > > > Thanks,
> > > > > > > > > > >> > > > > Yash
> > > > > > > > > > >> > > > >
> > > > > > > > > > >> > > > > On Fri, Sep 2, 2022 at 3:16 PM Sagar <
> > > > > > > > > sagarmeansocean@gmail.com
> > > > > > > > > > >
> > > > > > > > > > >> > > wrote:
> > > > > > > > > > >> > > > >
> > > > > > > > > > >> > > > > > Hi Jorge,
> > > > > > > > > > >> > > > > >
> > > > > > > > > > >> > > > > > Thanks for the changes.
> > > > > > > > > > >> > > > > >
> > > > > > > > > > >> > > > > > Regarding the metrics, I meant something
> like
> > > > this:
> > > > > > > > > > >> > > > > >
> > > > > > > > > > >> > > > >
> > > > > > > > > > >> > > >
> > > > > > > > > > >> > >
> > > > > > > > > > >> >
> > > > > > > > > > >>
> > > > > > > > > >
> > > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> kafka.connect:type=sink-task-metrics,connector="{connector}",task="{task}"
> > > > > > > > > > >> > > > > >
> > > > > > > > > > >> > > > > > the way it's defined in
> > > > > > > > > > >> > > > > >
> > > > > > > https://kafka.apache.org/documentation/#connect_monitoring
> > > > > > > > > > for
> > > > > > > > > > >> the
> > > > > > > > > > >> > > > > > metrics.
> > > > > > > > > > >> > > > > >
> > > > > > > > > > >> > > > > > I see what you mean by the 3 metrics and how
> > it
> > > > can
> > > > > be
> > > > > > > > > > >> interpreted.
> > > > > > > > > > >> > > The
> > > > > > > > > > >> > > > > > only thing I would argue is do we need
> > > > > > > > > > sink-record-latency-min?
> > > > > > > > > > >> > Maybe
> > > > > > > > > > >> > > > we
> > > > > > > > > > >> > > > > > could remove this min metric as well and
> make
> > > all
> > > > of
> > > > > > > the 3
> > > > > > > > > e2e
> > > > > > > > > > >> > > metrics
> > > > > > > > > > >> > > > > > consistent(since put-batch also doesn't
> > expose a
> > > > min
> > > > > > > which
> > > > > > > > > > makes
> > > > > > > > > > >> > > sense
> > > > > > > > > > >> > > > to
> > > > > > > > > > >> > > > > > me). I think this is in contrast to what
> Yash
> > > > > pointed
> > > > > > > out
> > > > > > > > > > above
> > > > > > > > > > >> so
> > > > > > > > > > >> > I
> > > > > > > > > > >> > > > > would
> > > > > > > > > > >> > > > > > like to hear his thoughts as well.
> > > > > > > > > > >> > > > > >
> > > > > > > > > > >> > > > > > The other point Yash mentioned about the
> > > slightly
> > > > > > flawed
> > > > > > > > > > >> definition
> > > > > > > > > > >> > > of
> > > > > > > > > > >> > > > > e2e
> > > > > > > > > > >> > > > > > is also true in a sense. But I have a
> feeling
> > > > that once
> > > > > > > the
> > > > > > > > > > >> > records
> > > > > > > > > > >> > > > are
> > > > > > > > > > >> > > > > > polled by the connector tasks, it would be
> > > > difficult
> > > > > > to
> > > > > > > > > track
> > > > > > > > > > >> the
> > > > > > > > > > >> > > final
> > > > > > > > > > >> > > > > leg
> > > > > > > > > > >> > > > > > via the framework. Probably users can track
> > the
> > > > > > metrics
> > > > > > > at
> > > > > > > > > > their
> > > > > > > > > > >> > end
> > > > > > > > > > >> > > to
> > > > > > > > > > >> > > > > > figure that out. Do you think that makes
> > sense?
> > > > > > > > > > >> > > > > >
> > > > > > > > > > >> > > > > > Thanks!
> > > > > > > > > > >> > > > > > Sagar.
> > > > > > > > > > >> > > > > >
> > > > > > > > > > >> > > > > >
> > > > > > > > > > >> > > > > >
> > > > > > > > > > >> > > > > >
> > > > > > > > > > >> > > > > > On Thu, Sep 1, 2022 at 11:40 PM Jorge
> Esteban
> > > > > Quilcate
> > > > > > > > > Otoya <
> > > > > > > > > > >> > > > > > quilcate.jorge@gmail.com> wrote:
> > > > > > > > > > >> > > > > >
> > > > > > > > > > >> > > > > > > Hi Sagar and Yash,
> > > > > > > > > > >> > > > > > >
> > > > > > > > > > >> > > > > > > Thanks for your feedback!
> > > > > > > > > > >> > > > > > >
> > > > > > > > > > >> > > > > > > > 1) I am assuming the new metrics would
> be
> > > task
> > > > > > level
> > > > > > > > > > metric.
> > > > > > > > > > >> > > > > > >
> > > > > > > > > > >> > > > > > > 1.1 Yes, it will be a task level metric,
> > > > > implemented
> > > > > > > on
> > > > > > > > > the
> > > > > > > > > > >> > > > > > > Worker[Source/Sink]Task.
> > > > > > > > > > >> > > > > > >
> > > > > > > > > > >> > > > > > > > Could you specify the way it's done for
> > > other
> > > > > > > > > sink/source
> > > > > > > > > > >> > > > connector?
> > > > > > > > > > >> > > > > > >
> > > > > > > > > > >> > > > > > > 1.2. Not sure what you mean by this.
> > Could
> > > > you
> > > > > > > > > elaborate
> > > > > > > > > > a
> > > > > > > > > > >> bit
> > > > > > > > > > >> > > > more?
> > > > > > > > > > >> > > > > > >
> > > > > > > > > > >> > > > > > > > 2. I am slightly confused about the e2e
> > > > latency
> > > > > > > > > metric...
> > > > > > > > > > >> > > > > > >
> > > > > > > > > > >> > > > > > > 2.1. Yes, I see. I was trying to bring a
> > > similar
> > > > > > > concept
> > > > > > > > > as
> > > > > > > > > > in
> > > > > > > > > > >> > > > Streams
> > > > > > > > > > >> > > > > > with
> > > > > > > > > > >> > > > > > > KIP-613, though the e2e concept may not be
> > > > > > > translatable.
> > > > > > > > > > >> > > > > > > We could keep it as `sink-record-latency`
> to
> > > > avoid
> > > > > > > > > > conflating
> > > > > > > > > > >> > > > > concepts. A
> > > > > > > > > > >> > > > > > > similar metric naming was proposed in
> > KIP-489
> > > > but
> > > > > at
> > > > > > > the
> > > > > > > > > > >> consumer
> > > > > > > > > > >> > > > > level —
> > > > > > > > > > >> > > > > > > though it seems dormant for a couple of
> > years.
> > > > > > > > > > >> > > > > > >
> > > > > > > > > > >> > > > > > > > However, the put-batch time measures the
> > > > > > > > > > >> > > > > > > > time to put a batch of records to
> external
> > > > sink.
> > > > > > > So, I
> > > > > > > > > > would
> > > > > > > > > > >> > > assume
> > > > > > > > > > >> > > > > > the 2
> > > > > > > > > > >> > > > > > > > can't be added as is to compute the e2e
> > > > latency.
> > > > > > > Maybe I
> > > > > > > > > > am
> > > > > > > > > > >> > > missing
> > > > > > > > > > >> > > > > > > > something here. Could you plz clarify
> > this.
> > > > > > > > > > >> > > > > > >
> > > > > > > > > > >> > > > > > > 2.2. Yes, agree. Not necessarily added,
> but
> > > with
> > > > > > the 3
> > > > > > > > > > >> latencies
> > > > > > > > > > >> > > > (poll,
> > > > > > > > > > >> > > > > > > convert, putBatch) will be clearer where
> the
> > > > > > > bottleneck
> > > > > > > > > may
> > > > > > > > > > >> be,
> > > > > > > > > > >> > and
> > > > > > > > > > >> > > > > > > represent the internal processing.
> > > > > > > > > > >> > > > > > >
> > > > > > > > > > >> > > > > > > > however, as per the KIP it looks like it
> > > will
> > > > be
> > > > > > > > > > >> > > > > > > > the latency between when the record was
> > > > written
> > > > > to
> > > > > > > Kafka
> > > > > > > > > > and
> > > > > > > > > > >> > when
> > > > > > > > > > >> > > > the
> > > > > > > > > > >> > > > > > > > record is returned by a sink task's
> > > consumer's
> > > > > > poll?
> > > > > > > > > > >> > > > > > >
> > > > > > > > > > >> > > > > > > 3.1. Agree. 2.1. could help to clarify
> this.
> > > > > > > > > > >> > > > > > >
> > > > > > > > > > >> > > > > > > > One more thing - I was wondering
> > > > > > > > > > >> > > > > > > > if there's a particular reason for
> having
> > a
> > > > min
> > > > > > > metric
> > > > > > > > > for
> > > > > > > > > > >> e2e
> > > > > > > > > > >> > > > > latency
> > > > > > > > > > >> > > > > > > but
> > > > > > > > > > >> > > > > > > > not for convert time?
> > > > > > > > > > >> > > > > > >
> > > > > > > > > > >> > > > > > > 3.2. Was following KIP-613 for e2e which
> > seems
> > > > > > useful
> > > > > > > to
> > > > > > > > > > >> compare
> > > > > > > > > > >> > > with
> > > > > > > > > > >> > > > > > Max and
> > > > > > > > > > >> > > > > > > get an idea of the window of results,
> though
> > > > > current
> > > > > > > > > > >> latencies in
> > > > > > > > > > >> > > > > > Connector
> > > > > > > > > > >> > > > > > > do not include Min, and that's why I
> haven't
> > > > added
> > > > > > it
> > > > > > > for
> > > > > > > > > > >> convert
> > > > > > > > > > >> > > > > > latency.
> > > > > > > > > > >> > > > > > > Do you think it makes sense to extend
> latency
> > > > > metrics
> > > > > > > with
> > > > > > > > > > Min?
> > > > > > > > > > >> > > > > > >
> > > > > > > > > > >> > > > > > > KIP is updated to clarify some of these
> > > changes.
> > > > > > > > > > >> > > > > > >
> > > > > > > > > > >> > > > > > > Many thanks,
> > > > > > > > > > >> > > > > > > Jorge.
> > > > > > > > > > >> > > > > > >
> > > > > > > > > > >> > > > > > > On Thu, 1 Sept 2022 at 18:11, Yash Mayya <
> > > > > > > > > > >> yash.mayya@gmail.com>
> > > > > > > > > > >> > > > wrote:
> > > > > > > > > > >> > > > > > >
> > > > > > > > > > >> > > > > > > > Hi Jorge,
> > > > > > > > > > >> > > > > > > >
> > > > > > > > > > >> > > > > > > > Thanks for the KIP! I have the same
> > > confusion
> > > > > with
> > > > > > > the
> > > > > > > > > > >> > > e2e-latency
> > > > > > > > > > >> > > > > > > metrics
> > > > > > > > > > >> > > > > > > > as Sagar above. "e2e" would seem to
> > indicate
> > > > the
> > > > > > > latency
> > > > > > > > > > >> > between
> > > > > > > > > > >> > > > when
> > > > > > > > > > >> > > > > > the
> > > > > > > > > > >> > > > > > > > record was written to Kafka and when the
> > > > record
> > > > > > was
> > > > > > > > > > written
> > > > > > > > > > >> to
> > > > > > > > > > >> > > the
> > > > > > > > > > >> > > > > sink
> > > > > > > > > > >> > > > > > > > system by the connector - however, as
> per
> > > the
> > > > > KIP
> > > > > > it
> > > > > > > > > looks
> > > > > > > > > > >> like
> > > > > > > > > > >> > > it
> > > > > > > > > > >> > > > > will
> > > > > > > > > > >> > > > > > > be
> > > > > > > > > > >> > > > > > > > the latency between when the record was
> > > > written
> > > > > to
> > > > > > > Kafka
> > > > > > > > > > and
> > > > > > > > > > >> > when
> > > > > > > > > > >> > > > the
> > > > > > > > > > >> > > > > > > > record is returned by a sink task's
> > > consumer's
> > > > > > > poll? I
> > > > > > > > > > think
> > > > > > > > > > >> > that
> > > > > > > > > > >> > > > > > metric
> > > > > > > > > > >> > > > > > > > will be a little confusing to interpret.
> > One
> > > > > more
> > > > > > > thing
> > > > > > > > > -
> > > > > > > > > > I
> > > > > > > > > > >> was
> > > > > > > > > > >> > > > > > wondering
> > > > > > > > > > >> > > > > > > > if there's a particular reason for
> having
> > a
> > > > min
> > > > > > > metric
> > > > > > > > > for
> > > > > > > > > > >> e2e
> > > > > > > > > > >> > > > > latency
> > > > > > > > > > >> > > > > > > but
> > > > > > > > > > >> > > > > > > > not for convert time?
> > > > > > > > > > >> > > > > > > >
> > > > > > > > > > >> > > > > > > > Thanks,
> > > > > > > > > > >> > > > > > > > Yash
> > > > > > > > > > >> > > > > > > >
> > > > > > > > > > >> > > > > > > > On Thu, Sep 1, 2022 at 8:59 PM Sagar <
> > > > > > > > > > >> > sagarmeansocean@gmail.com>
> > > > > > > > > > >> > > > > > wrote:
> > > > > > > > > > >> > > > > > > >
> > > > > > > > > > >> > > > > > > > > Hi Jorge,
> > > > > > > > > > >> > > > > > > > >
> > > > > > > > > > >> > > > > > > > > Thanks for the KIP. It looks like a
> very
> > > > good
> > > > > > > > > addition.
> > > > > > > > > > I
> > > > > > > > > > >> > > skimmed
> > > > > > > > > > >> > > > > > > through
> > > > > > > > > > >> > > > > > > > > once and had a couple of questions =>
> > > > > > > > > > >> > > > > > > > >
> > > > > > > > > > >> > > > > > > > > 1) I am assuming the new metrics would
> > be
> > > > task
> > > > > > > level
> > > > > > > > > > >> metric.
> > > > > > > > > > >> > > > Could
> > > > > > > > > > >> > > > > > you
> > > > > > > > > > >> > > > > > > > > specify the way it's done for other
> > > > > sink/source
> > > > > > > > > > connector?
> > > > > > > > > > >> > > > > > > > > 2) I am slightly confused about the
> e2e
> > > > > latency
> > > > > > > > > metric.
> > > > > > > > > > >> Let's
> > > > > > > > > > >> > > > > > consider
> > > > > > > > > > >> > > > > > > > the
> > > > > > > > > > >> > > > > > > > > sink connector metric. If I look at
> the
> > > way
> > > > > it's
> > > > > > > > > > supposed
> > > > > > > > > > >> to
> > > > > > > > > > >> > be
> > > > > > > > > > >> > > > > > > > calculated,
> > > > > > > > > > >> > > > > > > > > i.e the difference between the record
> > > > > timestamp
> > > > > > > and
> > > > > > > > > the
> > > > > > > > > > >> wall
> > > > > > > > > > >> > > > clock
> > > > > > > > > > >> > > > > > > time,
> > > > > > > > > > >> > > > > > > > it
> > > > > > > > > > >> > > > > > > > > looks like a per record metric.
> However,
> > > the
> > > > > > > put-batch
> > > > > > > > > > >> time
> > > > > > > > > > >> > > > > measures
> > > > > > > > > > >> > > > > > > the
> > > > > > > > > > >> > > > > > > > > time to put a batch of records to
> > external
> > > > > sink.
> > > > > > > So, I
> > > > > > > > > > >> would
> > > > > > > > > > >> > > > assume
> > > > > > > > > > >> > > > > > > the 2
> > > > > > > > > > >> > > > > > > > > can't be added as is to compute the
> e2e
> > > > > latency.
> > > > > > > > > Maybe I
> > > > > > > > > > >> am
> > > > > > > > > > >> > > > missing
> > > > > > > > > > >> > > > > > > > > something here. Could you plz clarify
> > > this.
> > > > > > > > > > >> > > > > > > > >
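[Editor's note: the per-record computation being discussed can be sketched as follows — illustrative only; Connect's actual implementation may differ:

```java
// Hypothetical sketch (not Connect's implementation) of the per-record
// latency described above: the gap between the record's Kafka timestamp
// and the wall-clock time when the sink task observes it.
public class RecordLatencySketch {

    static long recordLatencyMs(long recordTimestampMs, long wallClockMs) {
        return wallClockMs - recordTimestampMs;
    }

    public static void main(String[] args) {
        // A record written to Kafka at t=1000 ms, observed by the task at t=1250 ms.
        System.out.println(recordLatencyMs(1000L, 1250L)); // prints 250
    }
}
```

This is a per-record value, which is why it cannot simply be added to the batch-level put-batch time.]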
> > > > > > > > > > >> > > > > > > > > Thanks!
> > > > > > > > > > >> > > > > > > > > Sagar.
> > > > > > > > > > >> > > > > > > > >
> > > > > > > > > > >> > > > > > > > > On Tue, Aug 30, 2022 at 8:43 PM Jorge
> > > > Esteban
> > > > > > > Quilcate
> > > > > > > > > > >> Otoya
> > > > > > > > > > >> > <
> > > > > > > > > > >> > > > > > > > > quilcate.jorge@gmail.com> wrote:
> > > > > > > > > > >> > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > Hi all,
> > > > > > > > > > >> > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > I'd like to start a discussion
> thread
> > on
> > > > > > > KIP-864:
> > > > > > > > > Add
> > > > > > > > > > >> > > > End-To-End
> > > > > > > > > > >> > > > > > > > Latency
> > > > > > > > > > >> > > > > > > > > > Metrics to Connectors.
> > > > > > > > > > >> > > > > > > > > > This KIP aims to improve the metrics
> > > > > available
> > > > > > > on
> > > > > > > > > > Source
> > > > > > > > > > >> > and
> > > > > > > > > > >> > > > Sink
> > > > > > > > > > >> > > > > > > > > > Connectors to measure end-to-end
> > > latency,
> > > > > > > including
> > > > > > > > > > >> source
> > > > > > > > > > >> > > and
> > > > > > > > > > >> > > > > sink
> > > > > > > > > > >> > > > > > > > > record
> > > > > > > > > > >> > > > > > > > > > conversion time, and sink record e2e
> > > > latency
> > > > > > > > > (similar
> > > > > > > > > > to
> > > > > > > > > > >> > > > KIP-613
> > > > > > > > > > >> > > > > > for
> > > > > > > > > > >> > > > > > > > > > Streams).
> > > > > > > > > > >> > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > The KIP is here:
> > > > > > > > > > >> > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > >
> > > > > > > > > > >> > > > > > > > >
> > > > > > > > > > >> > > > > > > >
> > > > > > > > > > >> > > > > > >
> > > > > > > > > > >> > > > > >
> > > > > > > > > > >> > > > >
> > > > > > > > > > >> > > >
> > > > > > > > > > >> > >
> > > > > > > > > > >> >
> > > > > > > > > > >>
> > > > > > > > > >
> > > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-864%3A+Add+End-To-End+Latency+Metrics+to+Connectors
> > > > > > > > > > >> > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > Please take a look and let me know
> > what
> > > > you
> > > > > > > think.
> > > > > > > > > > >> > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > Cheers,
> > > > > > > > > > >> > > > > > > > > > Jorge.
> > > > > > > > > > >> > > > > > > > > >
> > > > > > > > > > >> > > > > > > > >
> > > > > > > > > > >> > > > > > > >
> > > > > > > > > > >> > > > > > >
> > > > > > > > > > >> > > > > >
> > > > > > > > > > >> > > > >
> > > > > > > > > > >> > > >
> > > > > > > > > > >> > >
> > > > > > > > > > >> >
> > > > > > > > > > >>
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] KIP-864: Add End-To-End Latency Metrics to Connectors

Posted by Chris Egerton <ch...@aiven.io.INVALID>.
Hi Jorge,

Thanks for filing KAFKA-14441! In the ticket description we mention that
"there will be more confidence whether to design metrics to be exposed at a
DEBUG or INFO level depending on their impact" but it doesn't seem like
this is called out in the KIP and, just based on what's in the KIP, the
proposal is still to have several per-record metrics exposed at INFO level.

Could we explicitly call out that the per-record metrics will definitely be
added to Kafka Connect as part of this KIP, but they will only be exposed
at INFO level pending the performance testing described in
KAFKA-14441, and will otherwise only be exposed at DEBUG level? Otherwise,
it's possible that a vote for the KIP as it's written today would be a vote
in favor of unconditionally exposing these metrics at INFO level, even if
the performance testing reveals issues.

Cheers,

Chris

On Sun, Dec 4, 2022 at 7:08 PM Jorge Esteban Quilcate Otoya <
quilcate.jorge@gmail.com> wrote:

> Thanks for the reminder Chris!
>
> I have added a note on the KIP to include this as part of the KIP as most
> of the metrics proposed are per-record and having all on DEBUG would limit
> the benefits, and created
> https://issues.apache.org/jira/browse/KAFKA-14441
> to keep track of this task.
>
> Cheers,
> Jorge.
>
> On Tue, 29 Nov 2022 at 19:40, Chris Egerton <ch...@aiven.io.invalid>
> wrote:
>
> > Hi Jorge,
> >
> > Thanks! What were your thoughts on the possible benchmarking and/or
> > downgrading of per-record metrics to DEBUG?
> >
> > Cheers,
> >
> > Chris
> >
> > On Thu, Nov 24, 2022 at 8:20 AM Jorge Esteban Quilcate Otoya <
> > quilcate.jorge@gmail.com> wrote:
> >
> > > Thanks Chris! I have updated the KIP with "transform" instead of
> "alias".
> > > Agree it's clearer.
> > >
> > > Cheers,
> > > Jorge.
> > >
> > > On Mon, 21 Nov 2022 at 21:36, Chris Egerton <ch...@aiven.io.invalid>
> > > wrote:
> > >
> > > > Hi Jorge,
> > > >
> > > > Thanks for the updates, and apologies for the delay. The new diagram
> > > > directly under the "Proposed Changes" section is absolutely gorgeous!
> > > >
> > > >
> > > > Follow-ups:
> > > >
> > > > RE 2: Good point. We can use the same level for these metrics, it's
> > not a
> > > > big deal.
> > > >
> > > > RE 3: As long as all the per-record metrics are kept at DEBUG level,
> it
> > > > should be fine to leave JMH benchmarking for a follow-up. If we want
> to
> > > add
> > > > new per-record, INFO-level metrics, I would be more comfortable with
> > > > including benchmarking as part of the testing plan for the KIP. One
> > > > possible compromise could be to propose that these features be merged
> > at
> > > > DEBUG level, and then possibly upgraded to INFO level in the future
> > > pending
> > > > benchmarks to guard against performance degradation.
> > > >
> > > > RE 4: I think for a true "end-to-end" metric, it'd be useful to
> include
> > > the
> > > > time taken by the task to actually deliver the record. However, with
> > the
> > > > new metric names and descriptions provided in the KIP, I have no
> > > objections
> > > > with what's currently proposed, and a new "end-to-end" metric can be
> > > taken
> > > > on later in a follow-up KIP.
> > > >
> > > > RE 6: You're right, existing producer metrics should be enough for
> now.
> > > We
> > > > can revisit this later if/when we add delivery-centric metrics for
> sink
> > > > tasks as well.
> > > >
> > > > RE 7: The new metric names in the KIP LGTM; I don't see any need to
> > > expand
> > > > beyond those but if you'd still like to pursue others, LMK.
> > > >
> > > >
> > > > New thoughts:
> > > >
> > > > One small thought: instead of "alias" in "alias="{transform_alias}"
> for
> > > the
> > > > per-transform metrics, could we use "transform"? IMO it's clearer
> since
> > > we
> > > > don't use "alias" in the names of transform-related properties, and
> > > "alias"
> > > > may be confused with the classloading term where you can use, e.g.,
> > > > "FileStreamSource" as the name of a connector class in a connector
> > config
> > > > instead of "org.apache.kafka.connect.file.FileStreamSourceConnector".
> > > >
> > > >
> > > > Cheers,
> > > >
> > > > Chris
> > > >
> > > > On Fri, Nov 18, 2022 at 12:06 PM Jorge Esteban Quilcate Otoya <
> > > > quilcate.jorge@gmail.com> wrote:
> > > >
> > > > > Thanks Mickael!
> > > > >
> > > > >
> > > > > On Wed, 9 Nov 2022 at 15:54, Mickael Maison <
> > mickael.maison@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > Hi Jorge,
> > > > > >
> > > > > > Thanks for the KIP, it is a nice improvement.
> > > > > >
> > > > > > 1) The per transformation metrics still have a question mark next
> > to
> > > > > > them in the KIP. Do you want to include them? If so we'll want to
> > tag
> > > > > > them, we should be able to include the aliases in
> > TransformationChain
> > > > > > and use them.
> > > > > >
> > > > >
> > > > > Yes, I have added the changes on TransformChain that will be needed
> > to
> > > > add
> > > > > these metrics.
> > > > >
> > > > >
> > > > > >
> > > > > > 2) I see no references to predicates. If we don't want to measure
> > > > > > their latency, can we say it explicitly?
> > > > > >
> > > > >
> > > > > Good question, I haven't considered these. Though as these are
> > > > materialized
> > > > > as PredicatedTransformation, they should be covered by these
> changes.
> > > > > Adding a note about this.
> > > > >
> > > > >
> > > > > >
> > > > > > 3) Should we have sink-record-batch-latency-avg-ms? All other
> > metrics
> > > > > > have both the maximum and average values.
> > > > > >
> > > > > >
> > > > > Good question. I will remove it and change the record latency from
> > > > > DEBUG->INFO as it already covers the maximum metric.
> > > > >
> > > > > Hope it's clearer now, let me know if there is any additional
> feedback.
> > > > > Thanks!
> > > > >
> > > > >
> > > > >
> > > > > > Thanks,
> > > > > > Mickael
> > > > > >
> > > > > > On Thu, Oct 20, 2022 at 9:58 PM Jorge Esteban Quilcate Otoya
> > > > > > <qu...@gmail.com> wrote:
> > > > > > >
> > > > > > > Thanks, Chris! Great feedback! Please, find my comments below:
> > > > > > >
> > > > > > > On Thu, 13 Oct 2022 at 18:52, Chris Egerton
> > > <chrise@aiven.io.invalid
> > > > >
> > > > > > wrote:
> > > > > > >
> > > > > > > > Hi Jorge,
> > > > > > > >
> > > > > > > > Thanks for the KIP. I agree with the overall direction and
> > think
> > > > this
> > > > > > would
> > > > > > > > be a nice improvement to Kafka Connect. Here are my initial
> > > > thoughts
> > > > > > on the
> > > > > > > > details:
> > > > > > > >
> > > > > > > > 1. The motivation section outlines the gaps in Kafka
> Connect's
> > > task
> > > > > > metrics
> > > > > > > > nicely. I think it'd be useful to include more concrete
> details
> > > on
> > > > > why
> > > > > > > > these gaps need to be filled in, and in which cases
> additional
> > > > > metrics
> > > > > > > > would be helpful. One goal could be to provide enhanced
> > > monitoring
> > > > of
> > > > > > > > production deployments that allows for cluster administrators
> > to
> > > > set
> > > > > up
> > > > > > > > automatic alerts for latency spikes and, if triggered,
> quickly
> > > > > > identify the
> > > > > > > > root cause of those alerts, reducing the time to remediation.
> > > > Another
> > > > > > goal
> > > > > > > > could be to provide more insight to developers or cluster
> > > > > > administrators
> > > > > > > > who want to do performance testing on connectors in
> > > non-production
> > > > > > > > environments. It may help guide our decision making process
> to
> > > > have a
> > > > > > > > clearer picture of the goals we're trying to achieve.
> > > > > > > >
> > > > > > >
> > > > > > > Agree. The Motivation section has been updated.
> > > > > > > Thanks for the examples, I see both of them being covered by
> the
> > > KIP.
> > > > > > > I see how these could give us a good distinction on whether to
> > > > position
> > > > > > > some metrics at INFO or DEBUG level.
> > > > > > >
> > > > > > >
> > > > > > > > 2. If we're trying to address the alert-and-diagnose use
> case,
> > > it'd
> > > > > be
> > > > > > > > useful to have as much information as possible at INFO level,
> > > > rather
> > > > > > than
> > > > > > > > forcing cluster administrators to possibly reconfigure a
> > > connector
> > > > to
> > > > > > emit
> > > > > > > > DEBUG or TRACE level metrics in order to diagnose a potential
> > > > > > > > production-impacting performance bottleneck. I can see the
> > > > rationale
> > > > > > for
> > > > > > > > emitting per-record metrics that track an average value at
> > DEBUG
> > > > > > level, but
> > > > > > > > for per-record metrics that track a maximum value, is there
> any
> > > > > reason
> > > > > > not
> > > > > > > > to provide this information at INFO level?
> > > > > > > >
> > > > > > >
> > > > > > > Agree. Though with Max and Avg metrics being part of the same
> > > sensor
> > > > —
> > > > > > > where Metric Level is defined — then both metrics get the same
> > > level.
> > > > > > >
> > > > > > >
> > > > > > > > 3. I'm also curious about the performance testing suggested
> by
> > > Yash
> > > > > to
> > > > > > > > gauge the potential impact of this change. Have you been able
> > to
> > > do
> > > > > any
> > > > > > > > testing with your draft implementation yet?
> > > > > > > >
> > > > > > >
> > > > > > > No, not so far.
> > > > > > > I think it would be valuable to discuss the scope of this
> testing
> > > and
> > > > > > maybe
> > > > > > > tackle it
> > > > > > > in a separate issue as Sensors and Metrics are used all over
> the
> > > > place.
> > > > > > > My initial understanding is that these tests should be placed
> in
> > > the
> > > > > > > jmh-benchmarks[1].
> > > > > > > Then, we could target testing Sensors and Metrics, and validate
> > how
> > > > > much
> > > > > > > overhead
> > > > > > > is added by having only Max vs Max,Avg(,Min), etc.
> > > > > > > On the other hand, we could extend this to Transformers or
> other
> > > > > Connect
> > > > > > > layers.
> > > > > > >
> > > > > > > Here are some pointers to the Sensors and Metrics
> implementations
> > > > that
> > > > > > > could be considered:
> > > > > > > Path to metric recording:
> > > > > > > -
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/apache/kafka/blob/5cab11cf525f6c06fcf9eb43f7f95ef33fe1cdbb/clients/src/main/java/org/apache/kafka/common/metrics/Sensor.java#L195-L199
> > > > > > > -
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/apache/kafka/blob/5cab11cf525f6c06fcf9eb43f7f95ef33fe1cdbb/clients/src/main/java/org/apache/kafka/common/metrics/Sensor.java#L230-L244
> > > > > > >
> > > > > > > ```
> > > > > > > // increment all the stats
> > > > > > > for (StatAndConfig statAndConfig : this.stats) {
> > > > > > >    statAndConfig.stat.record(statAndConfig.config(), value,
> > > timeMs);
> > > > > > > }
> > > > > > > ```
> > > > > > >
> > > > > > > SampledStats:
> > > > > > > - Avg:
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/apache/kafka/blob/068ab9cefae301f3187ea885d645c425955e77d2/clients/src/main/java/org/apache/kafka/common/metrics/stats/Avg.java
> > > > > > > - Max:
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/apache/kafka/blob/068ab9cefae301f3187ea885d645c425955e77d2/clients/src/main/java/org/apache/kafka/common/metrics/stats/Max.java
> > > > > > > - Min:
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/apache/kafka/blob/068ab9cefae301f3187ea885d645c425955e77d2/clients/src/main/java/org/apache/kafka/common/metrics/stats/Min.java
> > > > > > >
> > > > > > > `stat#record()` are implemented by `update` method in
> > SampledStat:
> > > > > > >
> > > > > > > ```Max.java
> > > > > > >     @Override
> > > > > > >     protected void update(Sample sample, MetricConfig config,
> > > > > > > double value, long now) {
> > > > > > >         sample.value = Math.max(sample.value, value);
> > > > > > >     }
> > > > > > > ```
> > > > > > >
> > > > > > > ```Avg.java
> > > > > > >     @Override
> > > > > > >     protected void update(Sample sample, MetricConfig config,
> > > > > > > double value, long now) {
> > > > > > >         sample.value += value;
> > > > > > >     }
> > > > > > > ```
> > > > > > >
> > > > > > > As far as I understand, most of the work of the stats happens in
> > > the
> > > > > > > `combine` method that is not part of the connector execution
> but
> > > > called
> > > > > > > when metrics are queried.
> > > > > > >
> > > > > > > I wonder whether we should consider Avg and Max for all metrics
> > > > > proposed
> > > > > > as
> > > > > > > the impact on the execution path seems minimal, and even see if
> > Min
> > > > is
> > > > > > also
> > > > > > > valuable, and use DEBUG only for more granular metrics.
> > > > > > >
> > > > > > > [1] https://github.com/apache/kafka/tree/trunk/jmh-benchmarks
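To make the point about overhead concrete, here is a self-contained sketch (NOT the actual Kafka `SampledStat` classes; all names below are illustrative) showing that a Max/Avg/Min-style update is a single arithmetic operation per record, with the heavier aggregation deferred to query time:

```java
// Illustrative stand-in for SampledStat-style updates: per-record cost is
// one addition plus two comparisons; combine()-style work happens only when
// the metric is read.
public class StatUpdateSketch {
    double sum = 0.0;                      // Avg: accumulate, divide at query time
    long count = 0;
    double max = Double.NEGATIVE_INFINITY; // Max: one comparison per record
    double min = Double.POSITIVE_INFINITY; // Min: one comparison per record

    void record(double value) {
        sum += value;
        count++;
        max = Math.max(max, value);
        min = Math.min(min, value);
    }

    double avg() {
        return count == 0 ? Double.NaN : sum / count;
    }

    public static void main(String[] args) {
        StatUpdateSketch s = new StatUpdateSketch();
        for (double v : new double[] {5.0, 1.0, 3.0}) {
            s.record(v);
        }
        System.out.println("avg=" + s.avg() + " max=" + s.max + " min=" + s.min);
    }
}
```

A JMH benchmark in jmh-benchmarks would still be needed to quantify the `Sensor::record` call overhead itself (synchronization, sample rotation), which this sketch does not cover.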
> > > > > > >
> > > > > > >
> > > > > > > > 4. Just to make sure I understand correctly--does "time when
> it
> > > has
> > > > > > been
> > > > > > > > received by the Sink task" refer to the wallclock time
> directly
> > > > > after a
> > > > > > > > call to SinkTask::put has been completed (as opposed to
> > directly
> > > > > before
> > > > > > > > that call is made, or something else entirely)?
> > > > > > > >
> > > > > > >
> > > > > > > It currently means when it has been received by the Sink task
> > > > > > > right after consumer poll and before conversions.
> > > > > > > Would it be valuable to have it after put-sink-records?
> > > > > > >
> > > > > > >
> > > > > > > > 5. If the goal is to identify performance bottlenecks (either
> > in
> > > > > > production
> > > > > > > > or pre-production environments), would it make sense to
> > introduce
> > > > > > metrics
> > > > > > > > for each individual converter (i.e., key/value/header) and
> > > > > > transformation?
> > > > > > > > It's definitely an improvement to be able to identify the
> total
> > > > time
> > > > > > for
> > > > > > > > conversion and transformation, but then the immediate
> follow-up
> > > > > > question if
> > > > > > > > a bottleneck is found in that phase is "which
> > > > > converter/transformation
> > > > > > is
> > > > > > > > responsible?" It'd be nice if we could provide a way to
> quickly
> > > > > answer
> > > > > > that
> > > > > > > > question.
> > > > > > > >
> > > > > > >
> > > > > > > This is a great idea. I'd like to consider this as well, though
> > > maybe
> > > > > > these
> > > > > > > more granular
> > > > > > > metrics would be good to have them as DEBUG.
> > > > > > >
> > > > > > >
> > > > > > > > 6. Any thoughts about offering latency metrics for source
> tasks
> > > > > between
> > > > > > > > receipt of the record from the task and delivery of the
> record
> > to
> > > > > Kafka
> > > > > > > > (which would be tracked by producer callback)? We could also
> > use
> > > > the
> > > > > > record
> > > > > > > > timestamp either instead of or in addition to receipt time if
> > the
> > > > > task
> > > > > > > > provides a timestamp with its records.
> > > > > > > >
> > > > > > >
> > > > > > > With source transform and convert metrics we get part of that
> > > > latency.
> > > > > > > Looking at the Producer metrics, `request-latency` (though a
> very
> > > > > generic
> > > > > > > metric)
> > > > > > > sort of answers the time between send request and ack — if my
> > > > > > understanding
> > > > > > > is correct.
> > > > > > > Would these be enough, or are you thinking about another
> approach?
> > > > > > > maybe a custom metric to cover the producer side?
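If a custom producer-side metric were added, it could hook into the producer send callback as suggested in point 6. A self-contained stand-in (the real API is `KafkaProducer.send(record, Callback)`; the interface and method names below are invented for illustration):

```java
import java.util.function.LongConsumer;

// Illustrative sketch of measuring "record timestamp -> broker ack" latency
// from a completion callback; AckListener stands in for the producer Callback.
public class AckLatencySketch {
    interface AckListener { void onAck(long ackWallClockMs); }

    // Returns a listener that, on ack, records wallclock minus record timestamp.
    static AckListener latencyOnAck(long recordTimestampMs, LongConsumer sensor) {
        return ackMs -> sensor.accept(ackMs - recordTimestampMs);
    }

    public static void main(String[] args) {
        long[] recorded = new long[1];
        AckListener listener = latencyOnAck(100L, v -> recorded[0] = v);
        listener.onAck(250L); // simulate the broker ack at wallclock 250
        System.out.println(recorded[0]);
    }
}
```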
> > > > > > >
> > > > > > >
> > > > > > > > 7. We may end up introducing a way for sink tasks to record
> > > > > per-record
> > > > > > > > delivery to the sink system (see KIP-767 [1]). I'd like it if
> > we
> > > > > could
> > > > > > keep
> > > > > > > > the names of our metrics very precise in order to avoid
> > confusing
> > > > > users
> > > > > > > > (who may think that we're providing metrics on actual
> delivery
> > to
> > > > the
> > > > > > sink
> > > > > > > > system, which may not be the case if the connector performs
> > > > > > asynchronous
> > > > > > > > writes), and in order to leave room for a metrics on true
> > > delivery
> > > > > > time by
> > > > > > > > sink tasks. It'd also be nice if we could remain consistent
> > with
> > > > > > existing
> > > > > > > > metrics such as "put-batch-avg-time-ms". With that in mind,
> > what
> > > do
> > > > > you
> > > > > > > > think about renaming these metrics:
> > > > > > > > - "sink-record-batch-latency-max-ms" to
> > > "put-batch-avg-latency-ms"
> > > > > > > > - "sink-record-latency-max-ms" to
> > > "put-sink-record-latency-max-ms"
> > > > > > > > - "sink-record-latency-avg-ms" to
> > > "put-sink-record-latency-avg-ms"
> > > > > > > > - "sink-record-convert-transform-time-max-ms" to
> > > > > > > > "convert-transform-sink-record-time-max-ms"
> > > > > > > > - "sink-record-convert-transform-time-avg-ms" to
> > > > > > > > "convert-transform-sink-record-time-avg-ms"
> > > > > > > > - "source-record-transform-convert-time-max-ms" to
> > > > > > > > "transform-convert-source-record-time-max-ms"
> > > > > > > > - "source-record-transform-convert-time-avg-ms" to
> > > > > > > > "transform-convert-source-record-time-avg-ms"
> > > > > > > >
> > > > > > >
> > > > > > > Makes sense, thanks! I have updated the list of metrics, grouped
> > > > > > > them by sensor, and applied these suggestions.
> > > > > > > The only ones that I want to review are: sink-record-* to
> > > put-batch-*
> > > > > > > (first 3). Not sure if put-batch/put-sink-record describes the
> > > > > > > purpose of the metric — neither does `sink-record-latency`, to be
> > > > > > > honest.
> > > > > > > My initial thought was to have something like Kafka Streams
> > > > > e2e-latency.
> > > > > > > Based on 4. and 6. questions, an idea could be to add:
> > > > > > > - source-batch-e2e-latency-before-send: measure wallclock -
> > source
> > > > > record
> > > > > > > timestamp after source connector poll.
> > > > > > > - source-batch-e2e-latency-after-send: measure wallclock -
> record
> > > > > > timestamp
> > > > > > > on producer send callback
> > > > > > > - sink-batch-e2e-latency-before-put: measure time wallclock -
> > > record
> > > > > > > timestamp after consumer poll
> > > > > > > - sink-batch-e2e-latency-after-put: measure time wallclock -
> > record
> > > > > > > timestamp after sink connector put.
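As a rough sketch of how these per-batch e2e values could be derived (the class and method names below are made up for illustration, not taken from the draft implementation): tracking only the oldest record timestamp in a batch yields the batch's max e2e latency with a single recorded value, instead of one `Sensor::record` call per record:

```java
import java.util.List;

// Illustrative helper: max end-to-end latency of a batch equals wallclock
// minus the oldest record timestamp, so one measurement covers the batch.
public class BatchLatencySketch {
    static long maxBatchLatencyMs(List<Long> recordTimestampsMs, long nowMs) {
        if (recordTimestampsMs.isEmpty()) {
            return 0L;
        }
        long oldest = Long.MAX_VALUE;
        for (long ts : recordTimestampsMs) {
            oldest = Math.min(oldest, ts); // keep the oldest timestamp in the batch
        }
        return nowMs - oldest;
    }

    public static void main(String[] args) {
        System.out.println(maxBatchLatencyMs(List.of(100L, 50L, 80L), 200L));
    }
}
```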
> > > > > > >
> > > > > > >
> > > > > > > > Thanks again for the KIP! Looking forward to your thoughts.
> > > > > > > >
> > > > > > > > Cheers,
> > > > > > > >
> > > > > > > > Chris
> > > > > > > >
> > > > > > > > [1] -
> > > > > > > >
> > > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-767%3A+Connect+Latency+Metrics
> > > > > > > >
> > > > > > > > On Thu, Sep 15, 2022 at 1:32 PM Jorge Esteban Quilcate Otoya
> <
> > > > > > > > quilcate.jorge@gmail.com> wrote:
> > > > > > > >
> > > > > > > > > Hi everyone,
> > > > > > > > >
> > > > > > > > > I've made a slight addition to the KIP based on Yash
> > feedback:
> > > > > > > > >
> > > > > > > > > - A new metric is added at INFO level to record the max
> > latency
> > > > > from
> > > > > > the
> > > > > > > > > batch timestamp, by keeping the oldest record timestamp per
> > > > batch.
> > > > > > > > > - A draft implementation is linked.
> > > > > > > > >
> > > > > > > > > Looking forward to your feedback.
> > > > > > > > > Also, a kindly reminder that the vote thread is open.
> > > > > > > > >
> > > > > > > > > Thanks!
> > > > > > > > > Jorge.
> > > > > > > > >
> > > > > > > > > On Thu, 8 Sept 2022 at 14:25, Jorge Esteban Quilcate Otoya
> <
> > > > > > > > > quilcate.jorge@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > > > Great. I have updated the KIP to reflect this.
> > > > > > > > > >
> > > > > > > > > > Cheers,
> > > > > > > > > > Jorge.
> > > > > > > > > >
> > > > > > > > > > On Thu, 8 Sept 2022 at 12:26, Yash Mayya <
> > > yash.mayya@gmail.com
> > > > >
> > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > >> Thanks, I think it makes sense to define these metrics
> at
> > a
> > > > > DEBUG
> > > > > > > > > >> recording
> > > > > > > > > >> level.
> > > > > > > > > >>
> > > > > > > > > >> On Thu, Sep 8, 2022 at 2:51 PM Jorge Esteban Quilcate
> > Otoya
> > > <
> > > > > > > > > >> quilcate.jorge@gmail.com> wrote:
> > > > > > > > > >>
> > > > > > > > > >> > On Thu, 8 Sept 2022 at 05:55, Yash Mayya <
> > > > > yash.mayya@gmail.com>
> > > > > > > > > wrote:
> > > > > > > > > >> >
> > > > > > > > > >> > > Hi Jorge,
> > > > > > > > > >> > >
> > > > > > > > > >> > > Thanks for the changes. With regard to having per
> > batch
> > > vs
> > > > > per
> > > > > > > > > record
> > > > > > > > > >> > > metrics, the additional overhead I was referring to
> > > wasn't
> > > > > > about
> > > > > > > > > >> whether
> > > > > > > > > >> > or
> > > > > > > > > >> > > not we would need to iterate over all the records
> in a
> > > > > batch.
> > > > > > I
> > > > > > > > was
> > > > > > > > > >> > > referring to the potential additional overhead
> caused
> > by
> > > > the
> > > > > > > > higher
> > > > > > > > > >> > volume
> > > > > > > > > >> > > of calls to Sensor::record on the sensors for the
> new
> > > > > metrics
> > > > > > (as
> > > > > > > > > >> > compared
> > > > > > > > > >> > > to the existing batch only metrics), especially for
> > high
> > > > > > > > throughput
> > > > > > > > > >> > > connectors where batch sizes could be large. I guess
> > we
> > > > may
> > > > > > want
> > > > > > > > to
> > > > > > > > > do
> > > > > > > > > >> > some
> > > > > > > > > >> > > sort of performance testing and get concrete numbers
> > to
> > > > > verify
> > > > > > > > > whether
> > > > > > > > > >> > this
> > > > > > > > > >> > > is a valid concern or not?
> > > > > > > > > >> > >
> > > > > > > > > >> >
> > > > > > > > > >> > 6.1. Got it, thanks for clarifying. I guess there
> could
> > > be a
> > > > > > > > benchmark
> > > > > > > > > >> test
> > > > > > > > > >> > of the `Sensor::record` to get an idea of the
> > performance
> > > > > > impact.
> > > > > > > > > >> > Regardless, the fact that these are single-record
> > metrics
> > > > > > compared
> > > > > > > > to
> > > > > > > > > >> > existing batch-only could be explicitly defined by
> > setting
> > > > > these
> > > > > > > > > >> metrics at
> > > > > > > > > >> > a DEBUG or TRACE metric recording level, leaving the
> > > > existing
> > > > > at
> > > > > > > > INFO
> > > > > > > > > >> > level.
> > > > > > > > > >> > wdyt?
> > > > > > > > > >> >
> > > > > > > > > >> >
> > > > > > > > > >> > >
> > > > > > > > > >> > > Thanks,
> > > > > > > > > >> > > Yash
> > > > > > > > > >> > >
> > > > > > > > > >> > > On Tue, Sep 6, 2022 at 4:42 PM Jorge Esteban
> Quilcate
> > > > Otoya
> > > > > <
> > > > > > > > > >> > > quilcate.jorge@gmail.com> wrote:
> > > > > > > > > >> > >
> > > > > > > > > >> > > > Hi Sagar and Yash,
> > > > > > > > > >> > > >
> > > > > > > > > >> > > > > the way it's defined in
> > > > > > > > > >> > > >
> > > > > https://kafka.apache.org/documentation/#connect_monitoring
> > > > > > for
> > > > > > > > > the
> > > > > > > > > >> > > metrics
> > > > > > > > > >> > > >
> > > > > > > > > >> > > > 4.1. Got it. Added it to the KIP.
> > > > > > > > > >> > > >
> > > > > > > > > >> > > > > The only thing I would argue is do we need
> > > > > > > > > >> sink-record-latency-min?
> > > > > > > > > >> > > Maybe
> > > > > > > > > >> > > > we
> > > > > > > > > >> > > > > could remove this min metric as well and make
> all
> > of
> > > > the
> > > > > > 3 e2e
> > > > > > > > > >> > metrics
> > > > > > > > > >> > > > > consistent
> > > > > > > > > >> > > >
> > > > > > > > > >> > > > 4.2 I see. Will remove it from the KIP.
> > > > > > > > > >> > > >
> > > > > > > > > >> > > > > Probably users can track the metrics at their
> end
> > to
> > > > > > > > > >> > > > > figure that out. Do you think that makes sense?
> > > > > > > > > >> > > >
> > > > > > > > > >> > > > 4.3. Yes, agree. With these new metrics it should
> be
> > > > > easier
> > > > > > for
> > > > > > > > > >> users
> > > > > > > > > >> > to
> > > > > > > > > >> > > > track this.
> > > > > > > > > >> > > >
> > > > > > > > > >> > > > > I think it makes sense to not have a min metric
> > for
> > > > > > either to
> > > > > > > > > >> remain
> > > > > > > > > >> > > > > consistent with the existing put-batch and
> > > poll-batch
> > > > > > metrics
> > > > > > > > > >> > > >
> > > > > > > > > >> > > > 5.1. Got it. Same as 4.2
> > > > > > > > > >> > > >
> > > > > > > > > >> > > > > Another naming related suggestion I had was with
> > the
> > > > > > > > > >> > > > > "convert-time" metrics - we should probably
> > include
> > > > > > > > > >> transformations
> > > > > > > > > >> > in
> > > > > > > > > >> > > > the
> > > > > > > > > >> > > > > name since SMTs could definitely be attributable
> > to
> > > a
> > > > > > sizable
> > > > > > > > > >> chunk
> > > > > > > > > >> > of
> > > > > > > > > >> > > > the
> > > > > > > > > >> > > > > latency depending on the specific transformation
> > > > chain.
> > > > > > > > > >> > > >
> > > > > > > > > >> > > > 5.2. Make sense. I'm proposing to add
> > > > > > > > > >> > `sink-record-convert-transform...`
> > > > > > > > > >> > > > and `source-record-transform-convert...` to
> > represent
> > > > > > correctly
> > > > > > > > > the
> > > > > > > > > >> > order
> > > > > > > > > >> > > > of operations.
> > > > > > > > > >> > > >
> > > > > > > > > >> > > > > it seems like both source and sink tasks only
> > record
> > > > > > metrics
> > > > > > > > at
> > > > > > > > > a
> > > > > > > > > >> > > "batch"
> > > > > > > > > >> > > > > level, not on an individual record level. I
> think
> > it
> > > > > > might be
> > > > > > > > > >> > > additional
> > > > > > > > > >> > > > > overhead if we want to record these new metrics
> > all
> > > at
> > > > > the
> > > > > > > > > record
> > > > > > > > > >> > > level?
> > > > > > > > > >> > > >
> > > > > > > > > >> > > > 5.3. I initially considered implementing all metrics
> > > > > > > > > >> > > > at the batch level, but given how the framework
> > > > > > > > > >> > > > processes records, I fell back to the proposed
> > > > > > > > > >> > > > approach:
> > > > > > > > > >> > > > - Sink Task:
> > > > > > > > > >> > > >   - `WorkerSinkTask#convertMessages(msgs)` already
> > > > > iterates
> > > > > > over
> > > > > > > > > >> > records,
> > > > > > > > > >> > > > so there is no additional overhead to capture
> record
> > > > > > latency per
> > > > > > > > > >> > record.
> > > > > > > > > >> > > >     -
> > > > > > > > > >> > > >
> > > > > > > > > >> > > >
> > > > > > > > > >> > >
> > > > > > > > > >> >
> > > > > > > > > >>
> > > > > > > > >
> > > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/apache/kafka/blob/9841647c4fe422532f448423c92d26e4fdcb8932/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/WorkerSinkTask.java#L490-L514
> > > > > > > > > >> > > >   -
> > `WorkerSinkTask#convertAndTransformRecord(record)`
> > > > > > actually
> > > > > > > > > >> happens
> > > > > > > > > >> > > > individually. Measuring this operation per batch
> > would
> > > > > > include
> > > > > > > > > >> > processing
> > > > > > > > > >> > > > that is not strictly part of "convert and
> transform"
> > > > > > > > > >> > > >     -
> > > > > > > > > >> > > >
> > > > > > > > > >> > > >
> > > > > > > > > >> > >
> > > > > > > > > >> >
> > > > > > > > > >>
> > > > > > > > >
> > > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/apache/kafka/blob/9841647c4fe422532f448423c92d26e4fdcb8932/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/WorkerSinkTask.java#L518
> > > > > > > > > >> > > > - Source Task:
> > > > > > > > > >> > > >   - `AbstractWorkerSourceTask#sendRecords`
> iterates
> > > > over a
> > > > > > batch
> > > > > > > > > and
> > > > > > > > > >> > > > applies transforms and convert record individually
> > as
> > > > > well:
> > > > > > > > > >> > > >     -
> > > > > > > > > >> > > >
> > > > > > > > > >> > > >
> > > > > > > > > >> > >
> > > > > > > > > >> >
> > > > > > > > > >>
> > > > > > > > >
> > > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/apache/kafka/blob/9841647c4fe422532f448423c92d26e4fdcb8932/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/AbstractWorkerSourceTask.java#L389-L390
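Since both code paths above already call convert/transform once per record, a per-record timing hook can simply wrap that call. A self-contained sketch (`TimedStep` and `recordedNanos` are invented names; the real change would record into a metrics `Sensor` inside `WorkerSinkTask`/`AbstractWorkerSourceTask`):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Illustrative helper: wraps a per-record step (e.g. convert-and-transform)
// and captures its duration without adding a second pass over the batch.
public class TimedStep {
    final List<Long> recordedNanos = new ArrayList<>(); // stand-in for a Sensor

    <I, O> O apply(Function<I, O> step, I input) {
        long start = System.nanoTime();
        O out = step.apply(input);
        recordedNanos.add(System.nanoTime() - start); // one measurement per record
        return out;
    }

    public static void main(String[] args) {
        TimedStep timed = new TimedStep();
        int result = timed.apply((Integer x) -> x + 1, 41);
        System.out.println(result + " measurements=" + timed.recordedNanos.size());
    }
}
```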
> > > > > > > > > >> > > >
> > > > > > > > > >> > > > > This might require some additional changes -
> > > > > > > > > >> > > > > for instance, with the "sink-record-latency"
> > metric,
> > > > we
> > > > > > might
> > > > > > > > > only
> > > > > > > > > >> > want
> > > > > > > > > >> > > > to
> > > > > > > > > >> > > > > have a "max" metric since "avg" would require
> > > > recording
> > > > > a
> > > > > > > > value
> > > > > > > > > on
> > > > > > > > > >> > the
> > > > > > > > > >> > > > > sensor for each record (whereas we can get a
> "max"
> > > by
> > > > > only
> > > > > > > > > >> recording
> > > > > > > > > >> > a
> > > > > > > > > >> > > > > metric value for the oldest record in each
> batch).
> > > > > > > > > >> > > >
> > > > > > > > > >> > > > 5.4. Recording record-latency per batch may not be
> > as
> > > > > > useful as
> > > > > > > > > >> there
> > > > > > > > > >> > is
> > > > > > > > > >> > > no
> > > > > > > > > >> > > > guarantee that the oldest record will be
> > > representative
> > > > of
> > > > > > the
> > > > > > > > > >> batch.
> > > > > > > > > >> > > >
> > > > > > > > > >> > > > On Sat, 3 Sept 2022 at 16:02, Yash Mayya <
> > > > > > yash.mayya@gmail.com>
> > > > > > > > > >> wrote:
> > > > > > > > > >> > > >
> > > > > > > > > >> > > > > Hi Jorge and Sagar,
> > > > > > > > > >> > > > >
> > > > > > > > > >> > > > > I think it makes sense to not have a min metric
> > for
> > > > > > either to
> > > > > > > > > >> remain
> > > > > > > > > >> > > > > consistent with the existing put-batch and
> > > poll-batch
> > > > > > metrics
> > > > > > > > > (it
> > > > > > > > > >> > > doesn't
> > > > > > > > > >> > > > > seem particularly useful either anyway). Also,
> the
> > > new
> > > > > > > > > >> > > > > "sink-record-latency" metric name looks fine to
> > me,
> > > > > > thanks for
> > > > > > > > > >> making
> > > > > > > > > >> > > the
> > > > > > > > > >> > > > > changes! Another naming related suggestion I had
> > was
> > > > > with
> > > > > > the
> > > > > > > > > >> > > > > "convert-time" metrics - we should probably
> > include
> > > > > > > > > >> transformations
> > > > > > > > > >> > in
> > > > > > > > > >> > > > the
> > > > > > > > > >> > > > > name since SMTs could definitely be attributable
> > to
> > > a
> > > > > > sizable
> > > > > > > > > >> chunk
> > > > > > > > > >> > of
> > > > > > > > > >> > > > the
> > > > > > > > > >> > > > > latency depending on the specific transformation
> > > > chain.
> > > > > > > > > >> > > > >
> > > > > > > > > >> > > > > I have one high level question with respect to
> > > > > > implementation
> > > > > > > > -
> > > > > > > > > >> > > > currently,
> > > > > > > > > >> > > > > it seems like both source and sink tasks only
> > record
> > > > > > metrics
> > > > > > > > at
> > > > > > > > > a
> > > > > > > > > >> > > "batch"
> > > > > > > > > >> > > > > level, not on an individual record level. I
> think
> > it
> > > > > > might be
> > > > > > > > > >> > > additional
> > > > > > > > > >> > > > > overhead if we want to record these new metrics
> > all
> > > at
> > > > > the
> > > > > > > > > record
> > > > > > > > > >> > > level?
> > > > > > > > > >> > > > > Could we instead make all of these new metrics
> for
> > > > > > batches of
> > > > > > > > > >> records
> > > > > > > > > >> > > > > rather than individual records in order to
> remain
> > > > > > consistent
> > > > > > > > > with
> > > > > > > > > >> the
> > > > > > > > > >> > > > > existing task level metrics? This might require
> > some
> > > > > > > > additional
> > > > > > > > > >> > > changes -
> > > > > > > > > >> > > > > for instance, with the "sink-record-latency"
> > metric,
> > > > we
> > > > > > might
> > > > > > > > > only
> > > > > > > > > >> > want
> > > > > > > > > >> > > > to
> > > > > > > > > >> > > > > have a "max" metric since "avg" would require
> > > > recording
> > > > > a
> > > > > > > > value
> > > > > > > > > on
> > > > > > > > > >> > the
> > > > > > > > > >> > > > > sensor for each record (whereas we can get a
> "max"
> > > by
> > > > > only
> > > > > > > > > >> recording
> > > > > > > > > >> > a
> > > > > > > > > >> > > > > metric value for the oldest record in each
> batch).
> > > > > > > > > >> > > > >
> > > > > > > > > >> > > > > Thanks,
> > > > > > > > > >> > > > > Yash
> > > > > > > > > >> > > > >
> > > > > > > > > >> > > > > On Fri, Sep 2, 2022 at 3:16 PM Sagar <
> > > > > > > > sagarmeansocean@gmail.com
> > > > > > > > > >
> > > > > > > > > >> > > wrote:
> > > > > > > > > >> > > > >
> > > > > > > > > >> > > > > > Hi Jorge,
> > > > > > > > > >> > > > > >
> > > > > > > > > >> > > > > > Thanks for the changes.
> > > > > > > > > >> > > > > >
> > > > > > > > > >> > > > > > Regarding the metrics, I meant something like
> > > this:
> > > > > > > > > >> > > > > >
> > > > > > > > > >> > > > >
> > > > > > > > > >> > > >
> > > > > > > > > >> > >
> > > > > > > > > >> >
> > > > > > > > > >>
> > > > > > > > >
> > > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> kafka.connect:type=sink-task-metrics,connector="{connector}",task="{task}"
> > > > > > > > > >> > > > > >
> > > > > > > > > >> > > > > > the way it's defined in
> > > > > > > > > >> > > > > >
> > > > > > https://kafka.apache.org/documentation/#connect_monitoring
> > > > > > > > > for
> > > > > > > > > >> the
> > > > > > > > > >> > > > > > metrics.
> > > > > > > > > >> > > > > >
> > > > > > > > > >> > > > > > I see what you mean by the 3 metrics and how
> it
> > > can
> > > > be
> > > > > > > > > >> interpreted.
> > > > > > > > > >> > > The
> > > > > > > > > >> > > > > > only thing I would argue is do we need
> > > > > > > > > sink-record-latency-min?
> > > > > > > > > >> > Maybe
> > > > > > > > > >> > > > we
> > > > > > > > > >> > > > > > could remove this min metric as well and make
> > all
> > > of
> > > > > > the 3
> > > > > > > > e2e
> > > > > > > > > >> > > metrics
> > > > > > > > > >> > > > > > consistent(since put-batch also doesn't
> expose a
> > > min
> > > > > > which
> > > > > > > > > makes
> > > > > > > > > >> > > sense
> > > > > > > > > >> > > > to
> > > > > > > > > >> > > > > > me). I think this is in contrast to what Yash
> > > > pointed
> > > > > > out
> > > > > > > > > above
> > > > > > > > > >> so
> > > > > > > > > >> > I
> > > > > > > > > >> > > > > would
> > > > > > > > > >> > > > > > like to hear his thoughts as well.
> > > > > > > > > >> > > > > >
> > > > > > > > > >> > > > > > The other point Yash mentioned about the
> > slightly
> > > > > flawed
> > > > > > > > > >> definition
> > > > > > > > > >> > > of
> > > > > > > > > >> > > > > e2e
> > > > > > > > > >> > > > > is also true in a sense. But I have a feeling that
> > > > > > > > > >> > > > > once the records are polled by the connector tasks,
> > > > > > > > > >> > > > > it would be difficult to track the final leg
> > > > > > > > > >> > > > > > via the framework. Probably users can track
> the
> > > > > metrics
> > > > > > at
> > > > > > > > > their
> > > > > > > > > >> > end
> > > > > > > > > >> > > to
> > > > > > > > > >> > > > > > figure that out. Do you think that makes
> sense?
> > > > > > > > > >> > > > > >
> > > > > > > > > >> > > > > > Thanks!
> > > > > > > > > >> > > > > > Sagar.
> > > > > > > > > >> > > > > >
> > > > > > > > > >> > > > > >
> > > > > > > > > >> > > > > >
> > > > > > > > > >> > > > > >
> > > > > > > > > >> > > > > > On Thu, Sep 1, 2022 at 11:40 PM Jorge Esteban
> > > > Quilcate
> > > > > > > > Otoya <
> > > > > > > > > >> > > > > > quilcate.jorge@gmail.com> wrote:
> > > > > > > > > >> > > > > >
> > > > > > > > > >> > > > > > > Hi Sagar and Yash,
> > > > > > > > > >> > > > > > >
> > > > > > > > > >> > > > > > > Thanks for your feedback!
> > > > > > > > > >> > > > > > >
> > > > > > > > > >> > > > > > > > 1) I am assuming the new metrics would be
> > task
> > > > > level
> > > > > > > > > metric.
> > > > > > > > > >> > > > > > >
> > > > > > > > > >> > > > > > > 1.1 Yes, it will be a task level metric,
> > > > implemented
> > > > > > on
> > > > > > > > the
> > > > > > > > > >> > > > > > > Worker[Source/Sink]Task.
> > > > > > > > > >> > > > > > >
> > > > > > > > > >> > > > > > > > Could you specify the way it's done for
> > other
> > > > > > > > sink/source
> > > > > > > > > >> > > > connector?
> > > > > > > > > >> > > > > > >
> > > > > > > > > >> > > > > > > 1.2. Not sure what you mean by this. Could you
> > > > > > > > > >> > > > > > > elaborate a bit more?
> > > > > > > > > >> > > > > > >
> > > > > > > > > >> > > > > > > > 2. I am slightly confused about the e2e
> > > latency
> > > > > > > > metric...
> > > > > > > > > >> > > > > > >
> > > > > > > > > >> > > > > > > 2.1. Yes, I see. I was trying to bring a
> > similar
> > > > > > concept
> > > > > > > > as
> > > > > > > > > in
> > > > > > > > > >> > > > Streams
> > > > > > > > > >> > > > > > with
> > > > > > > > > >> > > > > > > KIP-613, though the e2e concept may not be
> > > > > > translatable.
> > > > > > > > > >> > > > > > > We could keep it as `sink-record-latency` to
> > > avoid
> > > > > > > > > conflating
> > > > > > > > > >> > > > > concepts. A
> > > > > > > > > >> > > > > > > similar metric naming was proposed in
> KIP-489
> > > but
> > > > at
> > > > > > the
> > > > > > > > > >> consumer
> > > > > > > > > >> > > > > level —
> > > > > > > > > >> > > > > > > though it seems dormant for a couple of
> years.
> > > > > > > > > >> > > > > > >
> > > > > > > > > >> > > > > > > > However, the put-batch time measures the
> > > > > > > > > >> > > > > > > > time to put a batch of records to external
> > > sink.
> > > > > > So, I
> > > > > > > > > would
> > > > > > > > > >> > > assume
> > > > > > > > > >> > > > > > the 2
> > > > > > > > > >> > > > > > > > can't be added as is to compute the e2e
> > > latency.
> > > > > > Maybe I
> > > > > > > > > am
> > > > > > > > > >> > > missing
> > > > > > > > > >> > > > > > > > something here. Could you plz clarify
> this.
> > > > > > > > > >> > > > > > >
> > > > > > > > > >> > > > > > > 2.2. Yes, agree. Not necessarily added, but
> > with
> > > > > the 3
> > > > > > > > > >> latencies
> > > > > > > > > >> > > > (poll,
> > > > > > > > > >> > > > > > > convert, putBatch) will be clearer where the
> > > > > > bottleneck
> > > > > > > > may
> > > > > > > > > >> be,
> > > > > > > > > >> > and
> > > > > > > > > >> > > > > > > represent the internal processing.
> > > > > > > > > >> > > > > > >
> > > > > > > > > >> > > > > > > > however, as per the KIP it looks like it
> > will
> > > be
> > > > > > > > > >> > > > > > > > the latency between when the record was
> > > written
> > > > to
> > > > > > Kafka
> > > > > > > > > and
> > > > > > > > > >> > when
> > > > > > > > > >> > > > the
> > > > > > > > > >> > > > > > > > record is returned by a sink task's
> > consumer's
> > > > > poll?
> > > > > > > > > >> > > > > > >
> > > > > > > > > >> > > > > > > 3.1. Agree. 2.1. could help to clarify this.
> > > > > > > > > >> > > > > > >
> > > > > > > > > >> > > > > > > > One more thing - I was wondering
> > > > > > > > > >> > > > > > > > if there's a particular reason for having
> a
> > > min
> > > > > > metric
> > > > > > > > for
> > > > > > > > > >> e2e
> > > > > > > > > >> > > > > latency
> > > > > > > > > >> > > > > > > but
> > > > > > > > > >> > > > > > > > not for convert time?
> > > > > > > > > >> > > > > > >
> > > > > > > > > >> > > > > > > 3.2. Was following KIP-613 for e2e which
> seems
> > > > > useful
> > > > > > to
> > > > > > > > > >> compare
> > > > > > > > > >> > > with
> > > > > > > > > >> > > > > > Max a
> > > > > > > > > >> > > > > > > get an idea of the window of results, though
> > > > current
> > > > > > > > > >> latencies in
> > > > > > > > > >> > > > > > Connector
> > > > > > > > > >> > > > > > > do not include Min, and that's why I haven't
> > > added
> > > > > it
> > > > > > for
> > > > > > > > > >> convert
> > > > > > > > > >> > > > > > latency.
> > > > > > > > > >> > > > > > > Do you think it make sense to extend latency
> > > > metrics
> > > > > > with
> > > > > > > > > Min?
> > > > > > > > > >> > > > > > >
> > > > > > > > > >> > > > > > > KIP is updated to clarify some of these
> > changes.
> > > > > > > > > >> > > > > > >
> > > > > > > > > >> > > > > > > Many thanks,
> > > > > > > > > >> > > > > > > Jorge.
> > > > > > > > > >> > > > > > >
> > > > > > > > > >> > > > > > > On Thu, 1 Sept 2022 at 18:11, Yash Mayya <
> > > > > > > > > >> yash.mayya@gmail.com>
> > > > > > > > > >> > > > wrote:
> > > > > > > > > >> > > > > > >
> > > > > > > > > >> > > > > > > > Hi Jorge,
> > > > > > > > > >> > > > > > > >
> > > > > > > > > >> > > > > > > > Thanks for the KIP! I have the same
> > > > > > > > > >> > > > > > > > confusion with the e2e-latency metrics as
> > > > > > > > > >> > > > > > > > Sagar above. "e2e" would seem to indicate
> > > > > > > > > >> > > > > > > > the latency between when the record was
> > > > > > > > > >> > > > > > > > written to Kafka and when the record was
> > > > > > > > > >> > > > > > > > written to the sink system by the
> > > > > > > > > >> > > > > > > > connector - however, as per the KIP it
> > > > > > > > > >> > > > > > > > looks like it will be the latency between
> > > > > > > > > >> > > > > > > > when the record was written to Kafka and
> > > > > > > > > >> > > > > > > > when the record is returned by a sink
> > > > > > > > > >> > > > > > > > task's consumer's poll? I think that
> > > > > > > > > >> > > > > > > > metric will be a little confusing to
> > > > > > > > > >> > > > > > > > interpret. One more thing - I was
> > > > > > > > > >> > > > > > > > wondering if there's a particular reason
> > > > > > > > > >> > > > > > > > for having a min metric for e2e latency
> > > > > > > > > >> > > > > > > > but not for convert time?
> > > > > > > > > >> > > > > > > >
> > > > > > > > > >> > > > > > > > Thanks,
> > > > > > > > > >> > > > > > > > Yash
> > > > > > > > > >> > > > > > > >
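[Editor's note] The distinction Yash draws between the two readings of "e2e" can be made concrete. A minimal sketch, assuming epoch-millisecond timestamps; the function names are illustrative only and are not part of the KIP or the Kafka Connect API:

```python
# Two candidate definitions of sink-record "e2e latency" from the discussion.
# Timestamps are epoch milliseconds; names are illustrative, not the KIP's.

def latency_at_poll(record_timestamp_ms: int, poll_time_ms: int) -> int:
    """Latency measured when the sink task's consumer returns the record:
    time from the record being written to Kafka until poll() hands it over."""
    return poll_time_ms - record_timestamp_ms

def latency_at_delivery(record_timestamp_ms: int, put_complete_ms: int) -> int:
    """'True' end-to-end latency: time from the record being written to Kafka
    until the connector finishes writing it to the sink system."""
    return put_complete_ms - record_timestamp_ms
```

For example, a record produced at t=1000 ms, returned by poll at t=1400 ms, and delivered to the sink at t=1500 ms reports 400 ms under the first definition and 500 ms under the second, which is why the two readings should not be conflated under one metric name.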
> > > > > > > > > >> > > > > > > > On Thu, Sep 1, 2022 at 8:59 PM Sagar <
> > > > > > > > > >> > > > > > > > sagarmeansocean@gmail.com> wrote:
> > > > > > > > > >> > > > > > > >
> > > > > > > > > >> > > > > > > > > Hi Jorge,
> > > > > > > > > >> > > > > > > > >
> > > > > > > > > >> > > > > > > > > Thanks for the KIP. It looks like a very
> > > > > > > > > >> > > > > > > > > good addition. I skimmed through once
> > > > > > > > > >> > > > > > > > > and had a couple of questions =>
> > > > > > > > > >> > > > > > > > >
> > > > > > > > > >> > > > > > > > > 1) I am assuming the new metrics would
> > > > > > > > > >> > > > > > > > > be task-level metrics. Could you specify
> > > > > > > > > >> > > > > > > > > the way it's done for other sink/source
> > > > > > > > > >> > > > > > > > > connectors?
> > > > > > > > > >> > > > > > > > > 2) I am slightly confused about the e2e
> > > > > > > > > >> > > > > > > > > latency metric. Let's consider the sink
> > > > > > > > > >> > > > > > > > > connector metric. If I look at the way
> > > > > > > > > >> > > > > > > > > it's supposed to be calculated, i.e. the
> > > > > > > > > >> > > > > > > > > difference between the record timestamp
> > > > > > > > > >> > > > > > > > > and the wall clock time, it looks like a
> > > > > > > > > >> > > > > > > > > per-record metric. However, the
> > > > > > > > > >> > > > > > > > > put-batch time measures the time to put
> > > > > > > > > >> > > > > > > > > a batch of records to the external sink.
> > > > > > > > > >> > > > > > > > > So, I would assume the two can't be
> > > > > > > > > >> > > > > > > > > added as is to compute the e2e latency.
> > > > > > > > > >> > > > > > > > > Maybe I am missing something here. Could
> > > > > > > > > >> > > > > > > > > you please clarify this.
> > > > > > > > > >> > > > > > > > >
> > > > > > > > > >> > > > > > > > > Thanks!
> > > > > > > > > >> > > > > > > > > Sagar.
> > > > > > > > > >> > > > > > > > >
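[Editor's note] Sagar's second point - that a per-record e2e value and a per-batch put time are measured at different granularities and cannot simply be summed - can be sketched as follows. Metric names here are illustrative, not necessarily the names the KIP proposes:

```python
def sink_batch_metrics(record_timestamps_ms, put_batch_ms, now_ms):
    """Sketch: put-batch time is a single value per batch, while e2e latency
    is computed once per record against the wall clock, so the two are
    reported side by side rather than added together."""
    e2e = [now_ms - ts for ts in record_timestamps_ms]  # one value per record
    return {
        "put-batch-time-ms": put_batch_ms,              # one value per batch
        "sink-record-e2e-latency-min": min(e2e),
        "sink-record-e2e-latency-max": max(e2e),
        "sink-record-e2e-latency-avg": sum(e2e) / len(e2e),
    }
```

For a batch with record timestamps 1000 and 1200 ms, a 50 ms put call, and wall clock 1500 ms, this yields min 300, max 500, avg 400 - the 50 ms put-batch time is reported alongside the per-record latencies, not added into them.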
> > > > > > > > > >> > > > > > > > > On Tue, Aug 30, 2022 at 8:43 PM Jorge Esteban Quilcate Otoya <
> > > > > > > > > >> > > > > > > > > quilcate.jorge@gmail.com> wrote:
> > > > > > > > > >> > > > > > > > >
> > > > > > > > > >> > > > > > > > > > Hi all,
> > > > > > > > > >> > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > I'd like to start a discussion thread
> > > > > > > > > >> > > > > > > > > > on KIP-864: Add End-To-End Latency
> > > > > > > > > >> > > > > > > > > > Metrics to Connectors.
> > > > > > > > > >> > > > > > > > > > This KIP aims to improve the metrics
> > > > > > > > > >> > > > > > > > > > available on Source and Sink
> > > > > > > > > >> > > > > > > > > > Connectors to measure end-to-end
> > > > > > > > > >> > > > > > > > > > latency, including source and sink
> > > > > > > > > >> > > > > > > > > > record conversion time, and sink
> > > > > > > > > >> > > > > > > > > > record e2e latency (similar to
> > > > > > > > > >> > > > > > > > > > KIP-613 for Streams).
> > > > > > > > > >> > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > The KIP is here:
> > > > > > > > > >> > > > > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-864%3A+Add+End-To-End+Latency+Metrics+to+Connectors
> > > > > > > > > >> > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > Please take a look and let me know
> > > > > > > > > >> > > > > > > > > > what you think.
> > > > > > > > > >> > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > Cheers,
> > > > > > > > > >> > > > > > > > > > Jorge.
> > > > > > > > > >> > > > > > > > > >
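[Editor's note] The "record conversion time" mentioned in the KIP summary amounts to timing the converter call for each record. A minimal sketch, where `convert` stands in for a Converter's serialization or deserialization call (the helper name is hypothetical, not Connect's internal API):

```python
import time

def timed_convert(convert, raw_record):
    """Run one conversion and measure how long it took, in milliseconds.
    'convert' is any callable standing in for a Converter invocation."""
    start_ns = time.monotonic_ns()
    converted = convert(raw_record)
    elapsed_ms = (time.monotonic_ns() - start_ns) / 1_000_000
    return converted, elapsed_ms
```

Each per-record measurement like this would then be fed into min/max/avg sensors, which is also why the discussion above weighs the per-record overhead of keeping such metrics at INFO rather than DEBUG level.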