Posted to dev@kafka.apache.org by Sagar <sa...@gmail.com> on 2022/09/01 15:28:16 UTC

Re: [DISCUSS] KIP-864: Add End-To-End Latency Metrics to Connectors

Hi Jorge,

Thanks for the KIP. It looks like a very good addition. I skimmed through
once and had a couple of questions =>

1) I am assuming the new metrics would be task-level metrics. Could you
specify how this is done for other sink/source connectors?
2) I am slightly confused about the e2e latency metric. Let's consider the
sink connector metric. If I look at the way it's supposed to be calculated,
i.e. the difference between the record timestamp and the wall clock time, it
looks like a per-record metric. However, the put-batch time measures the
time to put a batch of records to the external sink. So, I would assume the
two can't be added as-is to compute the e2e latency. Maybe I am missing
something here. Could you please clarify this?
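
For concreteness, the per-record reading in question — record timestamp vs. wall clock — can be sketched as follows (a minimal illustration only; the class and method names are invented, not Connect or KIP code):

```java
import java.util.List;

// Hypothetical sketch of the per-record reading discussed above:
// e2e latency of a record is wall-clock time minus the record timestamp,
// while put-batch time is a single measurement for the whole batch.
public class E2eLatencySketch {
    static long recordLatencyMs(long recordTimestampMs, long nowMs) {
        // Clamp at zero in case of clock skew between broker and worker.
        return Math.max(0, nowMs - recordTimestampMs);
    }

    public static void main(String[] args) {
        long now = 1_000_000L;
        // Three records from one batch: each has its own e2e latency...
        List<Long> timestamps = List.of(995_000L, 998_000L, 999_000L);
        timestamps.forEach(ts -> System.out.println(recordLatencyMs(ts, now)));
        // prints 5000, 2000, 1000
        // ...whereas a batch put is timed once per call, so the two readings
        // measure different things and cannot simply be summed.
    }
}
```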

Thanks!
Sagar.

On Tue, Aug 30, 2022 at 8:43 PM Jorge Esteban Quilcate Otoya <
quilcate.jorge@gmail.com> wrote:

> Hi all,
>
> I'd like to start a discussion thread on KIP-864: Add End-To-End Latency
> Metrics to Connectors.
> This KIP aims to improve the metrics available on Source and Sink
> Connectors to measure end-to-end latency, including source and sink record
> conversion time, and sink record e2e latency (similar to KIP-613 for
> Streams).
>
> The KIP is here:
>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-864%3A+Add+End-To-End+Latency+Metrics+to+Connectors
>
> Please take a look and let me know what you think.
>
> Cheers,
> Jorge.
>

Re: [DISCUSS] KIP-864: Add End-To-End Latency Metrics to Connectors

Posted by Geertjan Wielenga <ge...@googlemail.com.INVALID>.
On Sat, 3 Sep 2022 at 17:02, Yash Mayya <ya...@gmail.com> wrote:

> Hi Jorge and Sagar,
>
> I think it makes sense to not have a min metric for either to remain
> consistent with the existing put-batch and poll-batch metrics (it doesn't
> seem particularly useful either anyway). Also, the new
> "sink-record-latency" metric name looks fine to me, thanks for making the
> changes! Another naming related suggestion I had was with the
> "convert-time" metrics - we should probably include transformations in the
> name since SMTs could definitely be attributable to a sizable chunk of the
> latency depending on the specific transformation chain.
>
> I have one high level question with respect to implementation - currently,
> it seems like both source and sink tasks only record metrics at a "batch"
> level, not on an individual record level. I think it might be additional
> overhead if we want to record these new metrics all at the record level?
> Could we instead make all of these new metrics for batches of records
> rather than individual records in order to remain consistent with the
> existing task level metrics? This might require some additional changes -
> for instance, with the "sink-record-latency" metric, we might only want to
> have a "max" metric since "avg" would require recording a value on the
> sensor for each record (whereas we can get a "max" by only recording a
> metric value for the oldest record in each batch).
>
> Thanks,
> Yash
>
> On Fri, Sep 2, 2022 at 3:16 PM Sagar <sa...@gmail.com> wrote:
>
> > Hi Jorge,
> >
> > Thanks for the changes.
> >
> > Regarding the metrics, I meant something like this:
> >
> > kafka.connect:type=sink-task-metrics,connector="{connector}",task="{task}"
> >
> > the way it's defined in
> > https://kafka.apache.org/documentation/#connect_monitoring for the
> > metrics.
> >
> > I see what you mean by the 3 metrics and how it can be interpreted. The
> > only thing I would argue is do we need sink-record-latency-min? Maybe we
> > could remove this min metric as well and make all of the 3 e2e metrics
> > consistent(since put-batch also doesn't expose a min which makes sense to
> > me). I think this is in contrast to what Yash pointed out above so I
> would
> > like to hear his thoughts as well.
> >
> > The other point Yash mentioned about the slightly flawed definition of
> > e2e is also true in a sense. But I have a feeling that once the records
> > are polled by the connector tasks, it would be difficult to track the
> > final leg via the framework. Probably users can track the metrics at
> > their end to figure that out. Do you think that makes sense?
> >
> > Thanks!
> > Sagar.
> >
> >
> >
> >
> > On Thu, Sep 1, 2022 at 11:40 PM Jorge Esteban Quilcate Otoya <
> > quilcate.jorge@gmail.com> wrote:
> >
> > > Hi Sagar and Yash,
> > >
> > > Thanks for your feedback!
> > >
> > > > 1) I am assuming the new metrics would be task level metric.
> > >
> > > 1.1 Yes, it will be a task level metric, implemented on the
> > > Worker[Source/Sink]Task.
> > >
> > > > Could you specify the way it's done for other sink/source connector?
> > >
> > > 1.2. Not sure what you mean by this. Could you elaborate a bit more?
> > >
> > > > 2. I am slightly confused about the e2e latency metric...
> > >
> > > 2.1. Yes, I see. I was trying to bring a similar concept as in Streams
> > with
> > > KIP-613, though the e2e concept may not be translatable.
> > > We could keep it as `sink-record-latency` to avoid conflating
> concepts. A
> > > similar metric naming was proposed in KIP-489 but at the consumer
> level —
> > > though it seems dormant for a couple of years.
> > >
> > > > However, the put-batch time measures the
> > > > time to put a batch of records to external sink. So, I would assume
> > the 2
> > > > can't be added as is to compute the e2e latency. Maybe I am missing
> > > > something here. Could you plz clarify this.
> > >
> > > 2.2. Yes, agree. Not necessarily added, but with the 3 latencies (poll,
> > > convert, putBatch) will be clearer where the bottleneck may be, and
> > > represent the internal processing.
> > >
> > > > however, as per the KIP it looks like it will be
> > > > the latency between when the record was written to Kafka and when the
> > > > record is returned by a sink task's consumer's poll?
> > >
> > > 3.1. Agree. 2.1. could help to clarify this.
> > >
> > > > One more thing - I was wondering
> > > > if there's a particular reason for having a min metric for e2e
> latency
> > > but
> > > > not for convert time?
> > >
> > > 3.2. I was following KIP-613 for e2e, which seems useful to compare
> > > with Max and get an idea of the window of results, though current
> > > latencies in Connector do not include Min, and that's why I haven't
> > > added it for convert latency.
> > > Do you think it makes sense to extend latency metrics with Min?
> > >
> > > KIP is updated to clarify some of these changes.
> > >
> > > Many thanks,
> > > Jorge.
> > >
> > > On Thu, 1 Sept 2022 at 18:11, Yash Mayya <ya...@gmail.com> wrote:
> > >
> > > > Hi Jorge,
> > > >
> > > > Thanks for the KIP! I have the same confusion with the e2e-latency
> > > metrics
> > > > as Sagar above. "e2e" would seem to indicate the latency between when
> > the
> > > > record was written to Kafka and when the record was written to the
> sink
> > > > system by the connector - however, as per the KIP it looks like it
> will
> > > be
> > > > the latency between when the record was written to Kafka and when the
> > > > record is returned by a sink task's consumer's poll? I think that
> > metric
> > > > will be a little confusing to interpret. One more thing - I was
> > wondering
> > > > if there's a particular reason for having a min metric for e2e
> latency
> > > but
> > > > not for convert time?
> > > >
> > > > Thanks,
> > > > Yash
> > > >
> > > > On Thu, Sep 1, 2022 at 8:59 PM Sagar <sa...@gmail.com>
> > wrote:
> > > >
> > > > > Hi Jorge,
> > > > >
> > > > > Thanks for the KIP. It looks like a very good addition. I skimmed
> > > through
> > > > > once and had a couple of questions =>
> > > > >
> > > > > 1) I am assuming the new metrics would be task level metric. Could
> > you
> > > > > specify the way it's done for other sink/source connector?
> > > > > 2) I am slightly confused about the e2e latency metric. Let's
> > consider
> > > > the
> > > > > sink connector metric. If I look at the way it's supposed to be
> > > > calculated,
> > > > > i.e the difference between the record timestamp and the wall clock
> > > time,
> > > > it
> > > > > looks like a per record metric. However, the put-batch time
> measures
> > > the
> > > > > time to put a batch of records to external sink. So, I would assume
> > > the 2
> > > > > can't be added as is to compute the e2e latency. Maybe I am missing
> > > > > something here. Could you plz clarify this.
> > > > >
> > > > > Thanks!
> > > > > Sagar.
> > > > >
> > > > > On Tue, Aug 30, 2022 at 8:43 PM Jorge Esteban Quilcate Otoya <
> > > > > quilcate.jorge@gmail.com> wrote:
> > > > >
> > > > > > Hi all,
> > > > > >
> > > > > > I'd like to start a discussion thread on KIP-864: Add End-To-End
> > > > Latency
> > > > > > Metrics to Connectors.
> > > > > > This KIP aims to improve the metrics available on Source and Sink
> > > > > > Connectors to measure end-to-end latency, including source and
> sink
> > > > > record
> > > > > > conversion time, and sink record e2e latency (similar to KIP-613
> > for
> > > > > > Streams).
> > > > > >
> > > > > > The KIP is here:
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-864%3A+Add+End-To-End+Latency+Metrics+to+Connectors
> > > > > >
> > > > > > Please take a look and let me know what you think.
> > > > > >
> > > > > > Cheers,
> > > > > > Jorge.
> > > > > >
> > > > >
> > > >
> > >
> >
>
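
Yash's batch-level idea above — recording a single value per batch, taken from the oldest record, to obtain a correct max — could be sketched roughly like this (illustrative only; the helper names are invented and this is not Connect code):

```java
import java.util.List;

// Hypothetical sketch of the batch-level recording idea discussed above:
// instead of updating the sensor once per record, record a single sample
// per batch derived from the oldest record's timestamp. The max over
// batches then equals the max over records; an avg would not, which is
// why only a "max" metric survives this optimization.
public class BatchMaxLatencySketch {
    static long batchMaxLatencyMs(List<Long> recordTimestampsMs, long nowMs) {
        long oldest = recordTimestampsMs.stream()
                .mapToLong(Long::longValue)
                .min()
                .orElse(nowMs); // an empty batch contributes zero latency
        return nowMs - oldest;
    }

    public static void main(String[] args) {
        long now = 1_000_000L;
        System.out.println(batchMaxLatencyMs(List.of(995_000L, 999_500L), now));
        // prints 5000
    }
}
```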

Re: [DISCUSS] KIP-864: Add End-To-End Latency Metrics to Connectors

Posted by Jorge Esteban Quilcate Otoya <qu...@gmail.com>.
No worries, thanks Chris!

I think most feedback has been covered and the KIP is ready for vote. Will
be starting the vote thread soon.

Cheers,
Jorge.

On Mon, 5 Dec 2022 at 15:10, Chris Egerton <ch...@aiven.io.invalid> wrote:

> Hi Jorge,
>
> Thanks for indulging my paranoia. LGTM!
>
> Cheers,
>
> Chris
>
> On Mon, Dec 5, 2022 at 10:06 AM Jorge Esteban Quilcate Otoya <
> quilcate.jorge@gmail.com> wrote:
>
> > Sure! I have added the following to the proposed changes section:
> >
> > ```
> > The per-record metrics will definitely be added to Kafka Connect as part
> > of this KIP, but their metric level will be changed pending the
> > performance testing described in KAFKA-14441, and will otherwise only be
> > exposed at a lower level (DEBUG instead of INFO, and TRACE instead of
> > DEBUG)
> > ```
> >
> > Let me know how it looks.
> >
> > Many thanks!
> > Jorge.
> >
> > On Mon, 5 Dec 2022 at 14:11, Chris Egerton <ch...@aiven.io.invalid>
> > wrote:
> >
> > > Hi Jorge,
> > >
> > > Thanks for filing KAFKA-14441! In the ticket description we mention
> that
> > > "there will be more confidence whether to design metrics to be exposed
> > at a
> > > DEBUG or INFO level depending on their impact" but it doesn't seem like
> > > this is called out in the KIP and, just based on what's in the KIP, the
> > > proposal is still to have several per-record metrics exposed at INFO
> > level.
> > >
> > > Could we explicitly call out that the per-record metrics will
> definitely
> > be
> > > added to Kafka Connect as part of this KIP, but they will only be
> exposed
> > > at INFO level pending the performance testing described in
> > > KAFKA-14441, and will otherwise only be exposed at DEBUG level?
> > Otherwise,
> > > it's possible that a vote for the KIP as it's written today would be a
> > vote
> > > in favor of unconditionally exposing these metrics at INFO level, even
> if
> > > the performance testing reveals issues.
> > >
> > > Cheers,
> > >
> > > Chris
> > >
> > > On Sun, Dec 4, 2022 at 7:08 PM Jorge Esteban Quilcate Otoya <
> > > quilcate.jorge@gmail.com> wrote:
> > >
> > > > Thanks for the reminder Chris!
> > > >
> > > > I have added a note to the KIP to include this, as most of the
> > > > metrics proposed are per-record and having them all at DEBUG would
> > > > limit the benefits, and created
> > > > https://issues.apache.org/jira/browse/KAFKA-14441
> > > > to keep track of this task.
> > > >
> > > > Cheers,
> > > > Jorge.
> > > >
> > > > On Tue, 29 Nov 2022 at 19:40, Chris Egerton <chrise@aiven.io.invalid
> >
> > > > wrote:
> > > >
> > > > > Hi Jorge,
> > > > >
> > > > > Thanks! What were your thoughts on the possible benchmarking and/or
> > > > > downgrading of per-record metrics to DEBUG?
> > > > >
> > > > > Cheers,
> > > > >
> > > > > Chris
> > > > >
> > > > > On Thu, Nov 24, 2022 at 8:20 AM Jorge Esteban Quilcate Otoya <
> > > > > quilcate.jorge@gmail.com> wrote:
> > > > >
> > > > > > Thanks Chris! I have updated the KIP with "transform" instead of
> > > > "alias".
> > > > > > Agree it's clearer.
> > > > > >
> > > > > > Cheers,
> > > > > > Jorge.
> > > > > >
> > > > > > On Mon, 21 Nov 2022 at 21:36, Chris Egerton
> > <chrise@aiven.io.invalid
> > > >
> > > > > > wrote:
> > > > > >
> > > > > > > Hi Jorge,
> > > > > > >
> > > > > > > Thanks for the updates, and apologies for the delay. The new
> > > diagram
> > > > > > > directly under the "Proposed Changes" section is absolutely
> > > gorgeous!
> > > > > > >
> > > > > > >
> > > > > > > Follow-ups:
> > > > > > >
> > > > > > > RE 2: Good point. We can use the same level for these metrics,
> > it's
> > > > > not a
> > > > > > > big deal.
> > > > > > >
> > > > > > > RE 3: As long as all the per-record metrics are kept at DEBUG
> > > level,
> > > > it
> > > > > > > should be fine to leave JMH benchmarking for a follow-up. If we
> > > want
> > > > to
> > > > > > add
> > > > > > > new per-record, INFO-level metrics, I would be more comfortable
> > > with
> > > > > > > including benchmarking as part of the testing plan for the KIP.
> > One
> > > > > > > possible compromise could be to propose that these features be
> > > merged
> > > > > at
> > > > > > > DEBUG level, and then possibly upgraded to INFO level in the
> > future
> > > > > > pending
> > > > > > > benchmarks to guard against performance degradation.
> > > > > > >
> > > > > > > RE 4: I think for a true "end-to-end" metric, it'd be useful to
> > > > include
> > > > > > the
> > > > > > > time taken by the task to actually deliver the record. However,
> > > with
> > > > > the
> > > > > > > new metric names and descriptions provided in the KIP, I have
> no
> > > > > > objections
> > > > > > > with what's currently proposed, and a new "end-to-end" metric
> can
> > > be
> > > > > > taken
> > > > > > > on later in a follow-up KIP.
> > > > > > >
> > > > > > > RE 6: You're right, existing producer metrics should be enough
> > for
> > > > now.
> > > > > > We
> > > > > > > can revisit this later if/when we add delivery-centric metrics
> > for
> > > > sink
> > > > > > > tasks as well.
> > > > > > >
> > > > > > > RE 7: The new metric names in the KIP LGTM; I don't see any
> need
> > to
> > > > > > expand
> > > > > > > beyond those but if you'd still like to pursue others, LMK.
> > > > > > >
> > > > > > >
> > > > > > > New thoughts:
> > > > > > >
> > > > > > > One small thought: instead of "alias" in
> > "alias="{transform_alias}"
> > > > for
> > > > > > the
> > > > > > > per-transform metrics, could we use "transform"? IMO it's
> clearer
> > > > since
> > > > > > we
> > > > > > > don't use "alias" in the names of transform-related properties,
> > and
> > > > > > "alias"
> > > > > > > may be confused with the classloading term where you can use,
> > e.g.,
> > > > > > > "FileStreamSource" as the name of a connector class in a
> > connector
> > > > > config
> > > > > > > instead of
> > > "org.apache.kafka.connect.file.FileStreamSourceConnector".
> > > > > > >
> > > > > > >
> > > > > > > Cheers,
> > > > > > >
> > > > > > > Chris
> > > > > > >
> > > > > > > On Fri, Nov 18, 2022 at 12:06 PM Jorge Esteban Quilcate Otoya <
> > > > > > > quilcate.jorge@gmail.com> wrote:
> > > > > > >
> > > > > > > > Thanks Mickael!
> > > > > > > >
> > > > > > > >
> > > > > > > > On Wed, 9 Nov 2022 at 15:54, Mickael Maison <
> > > > > mickael.maison@gmail.com>
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Hi Jorge,
> > > > > > > > >
> > > > > > > > > Thanks for the KIP, it is a nice improvement.
> > > > > > > > >
> > > > > > > > > 1) The per transformation metrics still have a question
> mark
> > > next
> > > > > to
> > > > > > > > > them in the KIP. Do you want to include them? If so we'll
> > want
> > > to
> > > > > tag
> > > > > > > > > them, we should be able to include the aliases in
> > > > > TransformationChain
> > > > > > > > > and use them.
> > > > > > > > >
> > > > > > > >
> > > > > > > > Yes, I have added the changes on TransformChain that will be
> > > needed
> > > > > to
> > > > > > > add
> > > > > > > > these metrics.
> > > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > > 2) I see no references to predicates. If we don't want to
> > > measure
> > > > > > > > > their latency, can we say it explicitly?
> > > > > > > > >
> > > > > > > >
> > > > > > > > Good question, I haven't considered these. Though as these
> are
> > > > > > > materialized
> > > > > > > > as PredicatedTransformation, they should be covered by these
> > > > changes.
> > > > > > > > Adding a note about this.
> > > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > > 3) Should we have sink-record-batch-latency-avg-ms? All
> other
> > > > > metrics
> > > > > > > > > have both the maximum and average values.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > Good question. I will remove it and change the record latency
> > > > > > > > from DEBUG->INFO, as it already covers the maximum metric.
> > > > > > > >
> > > > > > > > Hope it's clearer now; let me know if there is any additional
> > > > > > > > feedback.
> > > > > > > > Thanks!
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > > Mickael
> > > > > > > > >
> > > > > > > > > On Thu, Oct 20, 2022 at 9:58 PM Jorge Esteban Quilcate
> Otoya
> > > > > > > > > <qu...@gmail.com> wrote:
> > > > > > > > > >
> > > > > > > > > > Thanks, Chris! Great feedback! Please, find my comments
> > > below:
> > > > > > > > > >
> > > > > > > > > > On Thu, 13 Oct 2022 at 18:52, Chris Egerton
> > > > > > <chrise@aiven.io.invalid
> > > > > > > >
> > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > Hi Jorge,
> > > > > > > > > > >
> > > > > > > > > > > Thanks for the KIP. I agree with the overall direction
> > and
> > > > > think
> > > > > > > this
> > > > > > > > > would
> > > > > > > > > > > be a nice improvement to Kafka Connect. Here are my
> > initial
> > > > > > > thoughts
> > > > > > > > > on the
> > > > > > > > > > > details:
> > > > > > > > > > >
> > > > > > > > > > > 1. The motivation section outlines the gaps in Kafka
> > > > Connect's
> > > > > > task
> > > > > > > > > metrics
> > > > > > > > > > > nicely. I think it'd be useful to include more concrete
> > > > details
> > > > > > on
> > > > > > > > why
> > > > > > > > > > > these gaps need to be filled in, and in which cases
> > > > additional
> > > > > > > > metrics
> > > > > > > > > > > would be helpful. One goal could be to provide enhanced
> > > > > > monitoring
> > > > > > > of
> > > > > > > > > > > production deployments that allows for cluster
> > > administrators
> > > > > to
> > > > > > > set
> > > > > > > > up
> > > > > > > > > > > automatic alerts for latency spikes and, if triggered,
> > > > quickly
> > > > > > > > > identify the
> > > > > > > > > > > root cause of those alerts, reducing the time to
> > > remediation.
> > > > > > > Another
> > > > > > > > > goal
> > > > > > > > > > > could be to provide more insight to developers or
> cluster
> > > > > > > > > administrators
> > > > > > > > > > > who want to do performance testing on connectors in
> > > > > > non-production
> > > > > > > > > > > environments. It may help guide our decision making
> > process
> > > > to
> > > > > > > have a
> > > > > > > > > > > clearer picture of the goals we're trying to achieve.
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Agree. The Motivation section has been updated.
> > > > > > > > > > Thanks for the examples, I see both of them being covered
> > by
> > > > the
> > > > > > KIP.
> > > > > > > > > > I see how these could give us a good distinction on
> whether
> > > to
> > > > > > > position
> > > > > > > > > > some metrics at INFO or DEBUG level.
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > 2. If we're trying to address the alert-and-diagnose
> use
> > > > case,
> > > > > > it'd
> > > > > > > > be
> > > > > > > > > > > useful to have as much information as possible at INFO
> > > level,
> > > > > > > rather
> > > > > > > > > than
> > > > > > > > > > > forcing cluster administrators to possibly reconfigure
> a
> > > > > > connector
> > > > > > > to
> > > > > > > > > emit
> > > > > > > > > > > DEBUG or TRACE level metrics in order to diagnose a
> > > potential
> > > > > > > > > > > production-impacting performance bottleneck. I can see
> > the
> > > > > > > rationale
> > > > > > > > > for
> > > > > > > > > > > emitting per-record metrics that track an average value
> > at
> > > > > DEBUG
> > > > > > > > > level, but
> > > > > > > > > > > for per-record metrics that track a maximum value, is
> > there
> > > > any
> > > > > > > > reason
> > > > > > > > > not
> > > > > > > > > > > to provide this information at INFO level?
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Agree. Though with Max and Avg metrics being part of the
> > same
> > > > > > sensor
> > > > > > > —
> > > > > > > > > > where Metric Level is defined — then both metrics get the
> > > same
> > > > > > level.
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > 3. I'm also curious about the performance testing
> > suggested
> > > > by
> > > > > > Yash
> > > > > > > > to
> > > > > > > > > > > gauge the potential impact of this change. Have you
> been
> > > able
> > > > > to
> > > > > > do
> > > > > > > > any
> > > > > > > > > > > testing with your draft implementation yet?
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > No, not so far.
> > > > > > > > > > I think it would be valuable to discuss the scope of this
> > > > > > > > > > testing and maybe tackle it in a separate issue, as Sensors
> > > > > > > > > > and Metrics are used all over the place.
> > > > > > > > > > My initial understanding is that these tests should be
> > > > > > > > > > placed in the jmh-benchmarks[1].
> > > > > > > > > > Then, we could target testing Sensors and Metrics, and
> > > > > > > > > > validate how much overhead is added by having only Max vs
> > > > > > > > > > Max,Avg(,Min), etc.
> > > > > > > > > > On the other hand, we could extend this to Transformers or
> > > > > > > > > > other Connect layers.
> > > > > > > > > >
> > > > > > > > > > Here are some pointers to the Sensors and Metrics
> > > > implementations
> > > > > > > that
> > > > > > > > > > could be considered:
> > > > > > > > > > Path to metric recording:
> > > > > > > > > > -
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/apache/kafka/blob/5cab11cf525f6c06fcf9eb43f7f95ef33fe1cdbb/clients/src/main/java/org/apache/kafka/common/metrics/Sensor.java#L195-L199
> > > > > > > > > > -
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/apache/kafka/blob/5cab11cf525f6c06fcf9eb43f7f95ef33fe1cdbb/clients/src/main/java/org/apache/kafka/common/metrics/Sensor.java#L230-L244
> > > > > > > > > >
> > > > > > > > > > ```
> > > > > > > > > > // increment all the stats
> > > > > > > > > > for (StatAndConfig statAndConfig : this.stats) {
> > > > > > > > > >    statAndConfig.stat.record(statAndConfig.config(),
> value,
> > > > > > timeMs);
> > > > > > > > > > }
> > > > > > > > > > ```
> > > > > > > > > >
> > > > > > > > > > SampledStats:
> > > > > > > > > > - Avg:
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/apache/kafka/blob/068ab9cefae301f3187ea885d645c425955e77d2/clients/src/main/java/org/apache/kafka/common/metrics/stats/Avg.java
> > > > > > > > > > - Max:
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/apache/kafka/blob/068ab9cefae301f3187ea885d645c425955e77d2/clients/src/main/java/org/apache/kafka/common/metrics/stats/Max.java
> > > > > > > > > > - Min:
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/apache/kafka/blob/068ab9cefae301f3187ea885d645c425955e77d2/clients/src/main/java/org/apache/kafka/common/metrics/stats/Min.java
> > > > > > > > > >
> > > > > > > > > > `stat#record()` is implemented by the `update` method in
> > > > > > > > > > each SampledStat:
> > > > > > > > > >
> > > > > > > > > > ```Avg.java
> > > > > > > > > >     @Override
> > > > > > > > > >     protected void update(Sample sample, MetricConfig config,
> > > > > > > > > > double value, long now) {
> > > > > > > > > >         sample.value += value;
> > > > > > > > > >     }
> > > > > > > > > > ```
> > > > > > > > > >
> > > > > > > > > > ```Max.java
> > > > > > > > > >     @Override
> > > > > > > > > >     protected void update(Sample sample, MetricConfig config,
> > > > > > > > > > double value, long now) {
> > > > > > > > > >         sample.value = Math.max(sample.value, value);
> > > > > > > > > >     }
> > > > > > > > > > ```
> > > > > > > > > >
> > > > > > > > > > As far as I understand, most of the work of the stats
> > > > > > > > > > happens in the `combine` method, which is not part of the
> > > > > > > > > > connector execution path but is called when metrics are
> > > > > > > > > > queried.
> > > > > > > > > >
> > > > > > > > > > I wonder whether we should consider Avg and Max for all
> > > metrics
> > > > > > > > proposed
> > > > > > > > > as
> > > > > > > > > > the impact on the execution path seems minimal, and even
> > see
> > > if
> > > > > Min
> > > > > > > is
> > > > > > > > > also
> > > > > > > > > > valuable, and use DEBUG only for more granular metrics.
> > > > > > > > > >
> > > > > > > > > > [1]
> > > https://github.com/apache/kafka/tree/trunk/jmh-benchmarks
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > 4. Just to make sure I understand correctly--does "time
> > > when
> > > > it
> > > > > > has
> > > > > > > > > been
> > > > > > > > > > > received by the Sink task" refer to the wallclock time
> > > > directly
> > > > > > > > after a
> > > > > > > > > > > call to SinkTask::put has been completed (as opposed to
> > > > > directly
> > > > > > > > before
> > > > > > > > > > > that call is made, or something else entirely)?
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > It currently means the time when the record has been
> > > > > > > > > > received by the Sink task, right after consumer poll and
> > > > > > > > > > before conversions.
> > > > > > > > > > Would it be valuable to have it after put-sink-records?
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > 5. If the goal is to identify performance bottlenecks
> > > (either
> > > > > in
> > > > > > > > > production
> > > > > > > > > > > or pre-production environments), would it make sense to
> > > > > introduce
> > > > > > > > > metrics
> > > > > > > > > > > for each individual converter (i.e., key/value/header)
> > and
> > > > > > > > > transformation?
> > > > > > > > > > > It's definitely an improvement to be able to identify
> the
> > > > total
> > > > > > > time
> > > > > > > > > for
> > > > > > > > > > > conversion and transformation, but then the immediate
> > > > follow-up
> > > > > > > > > question if
> > > > > > > > > > > a bottleneck is found in that phase is "which
> > > > > > > > converter/transformation
> > > > > > > > > is
> > > > > > > > > > > responsible?" It'd be nice if we could provide a way to
> > > > quickly
> > > > > > > > answer
> > > > > > > > > that
> > > > > > > > > > > question.
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > This is a great idea. I'd like to consider this as well,
> > > > > > > > > > though maybe these more granular metrics would be better
> > > > > > > > > > exposed at DEBUG.
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > 6. Any thoughts about offering latency metrics for
> source
> > > > tasks
> > > > > > > > between
> > > > > > > > > > > receipt of the record from the task and delivery of the
> > > > record
> > > > > to
> > > > > > > > Kafka
> > > > > > > > > > > (which would be tracked by producer callback)? We could
> > > also
> > > > > use
> > > > > > > the
> > > > > > > > > record
> > > > > > > > > > > timestamp either instead of or in addition to receipt
> > time
> > > if
> > > > > the
> > > > > > > > task
> > > > > > > > > > > provides a timestamp with its records.
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > With source transform and convert metrics we get part of
> > that
> > > > > > > latency.
> > > > > > > > > > Looking at the Producer metrics, `request-latency`
> (though
> > a
> > > > very
> > > > > > > > generic
> > > > > > > > > > metric)
> > > > > > > > > > sort of answers the time between send request and ack — if
> > my
> > > > > > > > > understanding
> > > > > > > > > > is correct.
> > > > > > > > > > Would these be enough, or are you thinking about another
> > > > > > > > > > approach? Maybe a custom metric to cover the producer side?
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > 7. We may end up introducing a way for sink tasks to
> > record
> > > > > > > > per-record
> > > > > > > > > > > delivery to the sink system (see KIP-767 [1]). I'd like
> > it
> > > if
> > > > > we
> > > > > > > > could
> > > > > > > > > keep
> > > > > > > > > > > the names of our metrics very precise in order to avoid
> > > > > confusing
> > > > > > > > users
> > > > > > > > > > > (who may think that we're providing metrics on actual
> > > > delivery
> > > > > to
> > > > > > > the
> > > > > > > > > sink
> > > > > > > > > > > system, which may not be the case if the connector
> > performs
> > > > > > > > > asynchronous
> > > > > > > > > > > writes), and in order to leave room for a metrics on
> true
> > > > > > delivery
> > > > > > > > > time by
> > > > > > > > > > > sink tasks. It'd also be nice if we could remain
> > consistent
> > > > > with
> > > > > > > > > existing
> > > > > > > > > > > metrics such as "put-batch-avg-time-ms". With that in
> > mind,
> > > > > what
> > > > > > do
> > > > > > > > you
> > > > > > > > > > > think about renaming these metrics:
> > > > > > > > > > > - "sink-record-batch-latency-max-ms" to
> > > > > > "put-batch-avg-latency-ms"
> > > > > > > > > > > - "sink-record-latency-max-ms" to
> > > > > > "put-sink-record-latency-max-ms"
> > > > > > > > > > > - "sink-record-latency-avg-ms" to
> > > > > > "put-sink-record-latency-avg-ms"
> > > > > > > > > > > - "sink-record-convert-transform-time-max-ms" to
> > > > > > > > > > > "convert-transform-sink-record-time-max-ms"
> > > > > > > > > > > - "sink-record-convert-transform-time-avg-ms" to
> > > > > > > > > > > "convert-transform-sink-record-time-avg-ms"
> > > > > > > > > > > - "source-record-transform-convert-time-max-ms" to
> > > > > > > > > > > "transform-convert-source-record-time-max-ms"
> > > > > > > > > > > - "source-record-transform-convert-time-avg-ms" to
> > > > > > > > > > > "transform-convert-source-record-time-avg-ms"
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Makes sense, thanks! I have updated the list of metrics and
> > > > > > > > > > grouped them by sensor, applying these suggestions.
> > > > > > > > > > The only ones that I want to review are the first
> > > > > > > > > > three: sink-record-* to put-batch-*. I'm not sure that
> > > > > > > > > > put-batch/put-sink-record describes the purpose of the
> > > > > > > > > > metric — to be honest, neither does `sink-record-latency`.
> > > > > > > > > > My initial thought was to have something like Kafka
> > > > > > > > > > Streams e2e-latency. Based on questions 4 and 6, an
> > > > > > > > > > idea could be to add:
> > > > > > > > > > - source-batch-e2e-latency-before-send: measure
> > > > > > > > > >   wallclock - source record timestamp after source
> > > > > > > > > >   connector poll.
> > > > > > > > > > - source-batch-e2e-latency-after-send: measure
> > > > > > > > > >   wallclock - record timestamp on producer send callback.
> > > > > > > > > > - sink-batch-e2e-latency-before-put: measure
> > > > > > > > > >   wallclock - record timestamp after consumer poll.
> > > > > > > > > > - sink-batch-e2e-latency-after-put: measure
> > > > > > > > > >   wallclock - record timestamp after sink connector put.
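[Editor's note] For concreteness, the measurement being proposed above (wall-clock time minus record timestamp, taken at a before/after hook point) could be sketched roughly as follows. `SinkRecord`, `e2eLatencyMs`, and `maxBatchLatencyMs` here are illustrative stand-ins, not the real Connect API:

```java
import java.util.List;

// Illustrative sketch only: SinkRecord here is a stand-in, not Connect's class.
public class E2eLatencySketch {
    record SinkRecord(String value, long timestampMs) {}

    // e2e latency of one record: wall clock minus the record's Kafka timestamp.
    static long e2eLatencyMs(SinkRecord r, long nowMs) {
        return Math.max(0L, nowMs - r.timestampMs());
    }

    // A "before-put" style measurement: max latency across the polled batch.
    static long maxBatchLatencyMs(List<SinkRecord> batch, long nowMs) {
        long max = 0L;
        for (SinkRecord r : batch) {
            max = Math.max(max, e2eLatencyMs(r, nowMs));
        }
        return max;
    }

    public static void main(String[] args) {
        long now = 1_000L;
        List<SinkRecord> batch = List.of(
                new SinkRecord("a", now - 250),  // 250 ms old
                new SinkRecord("b", now - 40));  // 40 ms old
        System.out.println(maxBatchLatencyMs(batch, now)); // prints 250
    }
}
```

The same subtraction applies at each of the four proposed hook points; only the wall-clock sampling moment (before/after send, before/after put) differs.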
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > Thanks again for the KIP! Looking forward to your
> > thoughts.
> > > > > > > > > > >
> > > > > > > > > > > Cheers,
> > > > > > > > > > >
> > > > > > > > > > > Chris
> > > > > > > > > > >
> > > > > > > > > > > [1] -
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-767%3A+Connect+Latency+Metrics
> > > > > > > > > > >
> > > > > > > > > > > On Thu, Sep 15, 2022 at 1:32 PM Jorge Esteban Quilcate
> > > Otoya
> > > > <
> > > > > > > > > > > quilcate.jorge@gmail.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Hi everyone,
> > > > > > > > > > > >
> > > > > > > > > > > > I've made a slight addition to the KIP based on Yash
> > > > > feedback:
> > > > > > > > > > > >
> > > > > > > > > > > > - A new metric is added at INFO level to record the
> max
> > > > > latency
> > > > > > > > from
> > > > > > > > > the
> > > > > > > > > > > > batch timestamp, by keeping the oldest record
> timestamp
> > > per
> > > > > > > batch.
> > > > > > > > > > > > - A draft implementation is linked.
> > > > > > > > > > > >
> > > > > > > > > > > > Looking forward to your feedback.
> > > > > > > > > > > > Also, a kindly reminder that the vote thread is open.
> > > > > > > > > > > >
> > > > > > > > > > > > Thanks!
> > > > > > > > > > > > Jorge.
> > > > > > > > > > > >
> > > > > > > > > > > > On Thu, 8 Sept 2022 at 14:25, Jorge Esteban Quilcate
> > > Otoya
> > > > <
> > > > > > > > > > > > quilcate.jorge@gmail.com> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > Great. I have updated the KIP to reflect this.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Cheers,
> > > > > > > > > > > > > Jorge.
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Thu, 8 Sept 2022 at 12:26, Yash Mayya <
> > > > > > yash.mayya@gmail.com
> > > > > > > >
> > > > > > > > > wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > >> Thanks, I think it makes sense to define these
> > metrics
> > > > at
> > > > > a
> > > > > > > > DEBUG
> > > > > > > > > > > > >> recording
> > > > > > > > > > > > >> level.
> > > > > > > > > > > > >>
> > > > > > > > > > > > >> On Thu, Sep 8, 2022 at 2:51 PM Jorge Esteban
> > Quilcate
> > > > > Otoya
> > > > > > <
> > > > > > > > > > > > >> quilcate.jorge@gmail.com> wrote:
> > > > > > > > > > > > >>
> > > > > > > > > > > > >> > On Thu, 8 Sept 2022 at 05:55, Yash Mayya <
> > > > > > > > yash.mayya@gmail.com>
> > > > > > > > > > > > wrote:
> > > > > > > > > > > > >> >
> > > > > > > > > > > > >> > > Hi Jorge,
> > > > > > > > > > > > >> > >
> > > > > > > > > > > > >> > > Thanks for the changes. With regard to having
> > per
> > > > > batch
> > > > > > vs
> > > > > > > > per
> > > > > > > > > > > > record
> > > > > > > > > > > > >> > > metrics, the additional overhead I was
> referring
> > > to
> > > > > > wasn't
> > > > > > > > > about
> > > > > > > > > > > > >> whether
> > > > > > > > > > > > >> > or
> > > > > > > > > > > > >> > > not we would need to iterate over all the
> > records
> > > > in a
> > > > > > > > batch.
> > > > > > > > > I
> > > > > > > > > > > was
> > > > > > > > > > > > >> > > referring to the potential additional overhead
> > > > caused
> > > > > by
> > > > > > > the
> > > > > > > > > > > higher
> > > > > > > > > > > > >> > volume
> > > > > > > > > > > > >> > > of calls to Sensor::record on the sensors for
> > the
> > > > new
> > > > > > > > metrics
> > > > > > > > > (as
> > > > > > > > > > > > >> > compared
> > > > > > > > > > > > >> > > to the existing batch only metrics),
> especially
> > > for
> > > > > high
> > > > > > > > > > > throughput
> > > > > > > > > > > > >> > > connectors where batch sizes could be large. I
> > > guess
> > > > > we
> > > > > > > may
> > > > > > > > > want
> > > > > > > > > > > to
> > > > > > > > > > > > do
> > > > > > > > > > > > >> > some
> > > > > > > > > > > > >> > > sort of performance testing and get concrete
> > > numbers
> > > > > to
> > > > > > > > verify
> > > > > > > > > > > > whether
> > > > > > > > > > > > >> > this
> > > > > > > > > > > > >> > > is a valid concern or not?
> > > > > > > > > > > > >> > >
> > > > > > > > > > > > >> >
> > > > > > > > > > > > >> > 6.1. Got it, thanks for clarifying. I guess there
> > > > > > > > > > > > >> > could be a benchmark test of `Sensor::record` to
> > > > > > > > > > > > >> > get an idea of the performance impact. Regardless,
> > > > > > > > > > > > >> > the fact that these are single-record metrics,
> > > > > > > > > > > > >> > compared to the existing batch-only ones, could be
> > > > > > > > > > > > >> > made explicit by defining these metrics at a DEBUG
> > > > > > > > > > > > >> > or TRACE metric recording level, leaving the
> > > > > > > > > > > > >> > existing ones at INFO level.
> > > > > > > > > > > > >> > wdyt?
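[Editor's note] A plain-Java sketch of the recording-level idea discussed above. Kafka's real implementation lives in `org.apache.kafka.common.metrics.Sensor`; this only illustrates the concept that a DEBUG-level sensor's record call becomes a cheap no-op when only INFO recording is enabled:

```java
// Conceptual sketch only; the enum, Sensor class, and field names here are
// illustrative, not Kafka's actual metrics API.
public class RecordingLevelSketch {
    enum Level { INFO, DEBUG, TRACE }

    static class Sensor {
        final Level declaredAt;   // the level this sensor was declared at
        long recordedCount = 0;   // how many values actually reached the stats

        Sensor(Level declaredAt) { this.declaredAt = declaredAt; }

        // Record only when the configured level is at least as verbose as the
        // sensor's declared level; otherwise the call is a cheap no-op.
        void record(double value, Level configured) {
            if (configured.ordinal() >= declaredAt.ordinal()) {
                recordedCount++;
            }
        }
    }

    public static void main(String[] args) {
        Sensor perRecordLatency = new Sensor(Level.DEBUG);
        perRecordLatency.record(12.0, Level.INFO);   // skipped: INFO < DEBUG
        perRecordLatency.record(34.0, Level.DEBUG);  // recorded
        System.out.println(perRecordLatency.recordedCount); // prints 1
    }
}
```

This is why declaring the per-record metrics at DEBUG keeps the INFO-level overhead close to the existing batch-only behavior.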
> > > > > > > > > > > > >> >
> > > > > > > > > > > > >> >
> > > > > > > > > > > > >> > >
> > > > > > > > > > > > >> > > Thanks,
> > > > > > > > > > > > >> > > Yash
> > > > > > > > > > > > >> > >
> > > > > > > > > > > > >> > > On Tue, Sep 6, 2022 at 4:42 PM Jorge Esteban
> > > > Quilcate
> > > > > > > Otoya
> > > > > > > > <
> > > > > > > > > > > > >> > > quilcate.jorge@gmail.com> wrote:
> > > > > > > > > > > > >> > >
> > > > > > > > > > > > >> > > > Hi Sagar and Yash,
> > > > > > > > > > > > >> > > >
> > > > > > > > > > > > >> > > > > the way it's defined in
> > > > > > > > > > > > >> > > >
> > > > > > > > https://kafka.apache.org/documentation/#connect_monitoring
> > > > > > > > > for
> > > > > > > > > > > > the
> > > > > > > > > > > > >> > > metrics
> > > > > > > > > > > > >> > > >
> > > > > > > > > > > > >> > > > 4.1. Got it. Add it to the KIP.
> > > > > > > > > > > > >> > > >
> > > > > > > > > > > > >> > > > > The only thing I would argue is do we need
> > > > > > > > > > > > >> sink-record-latency-min?
> > > > > > > > > > > > >> > > Maybe
> > > > > > > > > > > > >> > > > we
> > > > > > > > > > > > >> > > > > could remove this min metric as well and
> > make
> > > > all
> > > > > of
> > > > > > > the
> > > > > > > > > 3 e2e
> > > > > > > > > > > > >> > metrics
> > > > > > > > > > > > >> > > > > consistent
> > > > > > > > > > > > >> > > >
> > > > > > > > > > > > >> > > > 4.2 I see. Will remove it from the KIP.
> > > > > > > > > > > > >> > > >
> > > > > > > > > > > > >> > > > > Probably users can track the metrics at
> > their
> > > > end
> > > > > to
> > > > > > > > > > > > >> > > > > figure that out. Do you think that makes
> > > sense?
> > > > > > > > > > > > >> > > >
> > > > > > > > > > > > >> > > > 4.3. Yes, agree. With these new metrics it
> > > should
> > > > be
> > > > > > > > easier
> > > > > > > > > for
> > > > > > > > > > > > >> users
> > > > > > > > > > > > >> > to
> > > > > > > > > > > > >> > > > track this.
> > > > > > > > > > > > >> > > >
> > > > > > > > > > > > >> > > > > I think it makes sense to not have a min
> > > metric
> > > > > for
> > > > > > > > > either to
> > > > > > > > > > > > >> remain
> > > > > > > > > > > > >> > > > > consistent with the existing put-batch and
> > > > > > poll-batch
> > > > > > > > > metrics
> > > > > > > > > > > > >> > > >
> > > > > > > > > > > > >> > > > 5.1. Got it. Same as 4.2
> > > > > > > > > > > > >> > > >
> > > > > > > > > > > > >> > > > > Another naming related suggestion I had
> was
> > > with
> > > > > the
> > > > > > > > > > > > >> > > > > "convert-time" metrics - we should
> probably
> > > > > include
> > > > > > > > > > > > >> transformations
> > > > > > > > > > > > >> > in
> > > > > > > > > > > > >> > > > the
> > > > > > > > > > > > >> > > > > name since SMTs could definitely be
> > > attributable
> > > > > to
> > > > > > a
> > > > > > > > > sizable
> > > > > > > > > > > > >> chunk
> > > > > > > > > > > > >> > of
> > > > > > > > > > > > >> > > > the
> > > > > > > > > > > > >> > > > > latency depending on the specific
> > > transformation
> > > > > > > chain.
> > > > > > > > > > > > >> > > >
> > > > > > > > > > > > >> > > > 5.2. Make sense. I'm proposing to add
> > > > > > > > > > > > >> > `sink-record-convert-transform...`
> > > > > > > > > > > > >> > > > and `source-record-transform-convert...` to
> > > > > represent
> > > > > > > > > correctly
> > > > > > > > > > > > the
> > > > > > > > > > > > >> > order
> > > > > > > > > > > > >> > > > of operations.
> > > > > > > > > > > > >> > > >
> > > > > > > > > > > > >> > > > > it seems like both source and sink tasks
> > only
> > > > > record
> > > > > > > > > metrics
> > > > > > > > > > > at
> > > > > > > > > > > > a
> > > > > > > > > > > > >> > > "batch"
> > > > > > > > > > > > >> > > > > level, not on an individual record level.
> I
> > > > think
> > > > > it
> > > > > > > > > might be
> > > > > > > > > > > > >> > > additional
> > > > > > > > > > > > >> > > > > overhead if we want to record these new
> > > metrics
> > > > > all
> > > > > > at
> > > > > > > > the
> > > > > > > > > > > > record
> > > > > > > > > > > > >> > > level?
> > > > > > > > > > > > >> > > >
> > > > > > > > > > > > >> > > > 5.3. Initially I considered implementing all
> > > > > > > > > > > > >> > > > metrics at the batch level, but given how the
> > > > > > > > > > > > >> > > > framework processes records, I fell back to the
> > > > > > > > > > > > >> > > > proposed approach:
> > > > > > > > > > > > >> > > > - Sink Task:
> > > > > > > > > > > > >> > > >   - `WorkerSinkTask#convertMessages(msgs)`
> > > already
> > > > > > > > iterates
> > > > > > > > > over
> > > > > > > > > > > > >> > records,
> > > > > > > > > > > > >> > > > so there is no additional overhead to
> capture
> > > > record
> > > > > > > > > latency per
> > > > > > > > > > > > >> > record.
> > > > > > > > > > > > >> > > >     -
> > > > > > > > > > > > >> > > >
> > > > > > > > > > > > >> > > >
> > > > > > > > > > > > >> > >
> > > > > > > > > > > > >> >
> > > > > > > > > > > > >>
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/apache/kafka/blob/9841647c4fe422532f448423c92d26e4fdcb8932/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/WorkerSinkTask.java#L490-L514
> > > > > > > > > > > > >> > > >   -
> > > > > `WorkerSinkTask#convertAndTransformRecord(record)`
> > > > > > > > > actually
> > > > > > > > > > > > >> happens
> > > > > > > > > > > > >> > > > individually. Measuring this operation per
> > batch
> > > > > would
> > > > > > > > > include
> > > > > > > > > > > > >> > processing
> > > > > > > > > > > > >> > > > that is not strictly part of "convert and
> > > > transform"
> > > > > > > > > > > > >> > > >     -
> > > > > > > > > > > > >> > > >
> > > > > > > > > > > > >> > > >
> > > > > > > > > > > > >> > >
> > > > > > > > > > > > >> >
> > > > > > > > > > > > >>
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/apache/kafka/blob/9841647c4fe422532f448423c92d26e4fdcb8932/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/WorkerSinkTask.java#L518
> > > > > > > > > > > > >> > > > - Source Task:
> > > > > > > > > > > > >> > > >   - `AbstractWorkerSourceTask#sendRecords`
> > > > iterates
> > > > > > > over a
> > > > > > > > > batch
> > > > > > > > > > > > and
> > > > > > > > > > > > >> > > > applies transforms and convert record
> > > individually
> > > > > as
> > > > > > > > well:
> > > > > > > > > > > > >> > > >     -
> > > > > > > > > > > > >> > > >
> > > > > > > > > > > > >> > > >
> > > > > > > > > > > > >> > >
> > > > > > > > > > > > >> >
> > > > > > > > > > > > >>
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/apache/kafka/blob/9841647c4fe422532f448423c92d26e4fdcb8932/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/AbstractWorkerSourceTask.java#L389-L390
> > > > > > > > > > > > >> > > >
> > > > > > > > > > > > >> > > > > This might require some additional
> changes -
> > > > > > > > > > > > >> > > > > for instance, with the
> "sink-record-latency"
> > > > > metric,
> > > > > > > we
> > > > > > > > > might
> > > > > > > > > > > > only
> > > > > > > > > > > > >> > want
> > > > > > > > > > > > >> > > > to
> > > > > > > > > > > > >> > > > > have a "max" metric since "avg" would
> > require
> > > > > > > recording
> > > > > > > > a
> > > > > > > > > > > value
> > > > > > > > > > > > on
> > > > > > > > > > > > >> > the
> > > > > > > > > > > > >> > > > > sensor for each record (whereas we can
> get a
> > > > "max"
> > > > > > by
> > > > > > > > only
> > > > > > > > > > > > >> recording
> > > > > > > > > > > > >> > a
> > > > > > > > > > > > >> > > > > metric value for the oldest record in each
> > > > batch).
> > > > > > > > > > > > >> > > >
> > > > > > > > > > > > >> > > > 5.4. Recording record-latency per batch may not
> > > > > > > > > > > > >> > > > be as useful, since there is no guarantee that
> > > > > > > > > > > > >> > > > the oldest record will be representative of the
> > > > > > > > > > > > >> > > > batch.
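[Editor's note] The trade-off under discussion can be sketched like this (illustrative code, not the KIP's implementation): a per-batch "max" latency needs only the oldest record's timestamp, i.e. one sensor update per batch instead of one per record, but it says nothing about the other records in the batch:

```java
import java.util.List;

// Illustrative sketch: deriving a per-batch max latency from only the oldest
// record's timestamp, i.e. a single Sensor::record call per batch.
public class OldestRecordSketch {
    static long batchMaxLatencyMs(List<Long> recordTimestampsMs, long nowMs) {
        long oldestTs = Long.MAX_VALUE;
        for (long ts : recordTimestampsMs) {
            oldestTs = Math.min(oldestTs, ts);  // oldest = smallest timestamp
        }
        // The max latency in the batch belongs to the oldest record.
        return nowMs - oldestTs;
    }

    public static void main(String[] args) {
        // Timestamps 900, 750 and 980 against a wall clock of 1000.
        System.out.println(batchMaxLatencyMs(List.of(900L, 750L, 980L), 1_000L)); // prints 250
    }
}
```

An "avg" over the same batch would require touching every record's timestamp, which is what motivates the per-record recording in the first place.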
> > > > > > > > > > > > >> > > >
> > > > > > > > > > > > >> > > > On Sat, 3 Sept 2022 at 16:02, Yash Mayya <
> > > > > > > > > yash.mayya@gmail.com>
> > > > > > > > > > > > >> wrote:
> > > > > > > > > > > > >> > > >
> > > > > > > > > > > > >> > > > > Hi Jorge and Sagar,
> > > > > > > > > > > > >> > > > >
> > > > > > > > > > > > >> > > > > I think it makes sense to not have a min
> > > metric
> > > > > for
> > > > > > > > > either to
> > > > > > > > > > > > >> remain
> > > > > > > > > > > > >> > > > > consistent with the existing put-batch and
> > > > > > poll-batch
> > > > > > > > > metrics
> > > > > > > > > > > > (it
> > > > > > > > > > > > >> > > doesn't
> > > > > > > > > > > > >> > > > > seem particularly useful either anyway).
> > Also,
> > > > the
> > > > > > new
> > > > > > > > > > > > >> > > > > "sink-record-latency" metric name looks
> fine
> > > to
> > > > > me,
> > > > > > > > > thanks for
> > > > > > > > > > > > >> making
> > > > > > > > > > > > >> > > the
> > > > > > > > > > > > >> > > > > changes! Another naming related
> suggestion I
> > > had
> > > > > was
> > > > > > > > with
> > > > > > > > > the
> > > > > > > > > > > > >> > > > > "convert-time" metrics - we should
> probably
> > > > > include
> > > > > > > > > > > > >> transformations
> > > > > > > > > > > > >> > in
> > > > > > > > > > > > >> > > > the
> > > > > > > > > > > > >> > > > > name since SMTs could definitely be
> > > attributable
> > > > > to
> > > > > > a
> > > > > > > > > sizable
> > > > > > > > > > > > >> chunk
> > > > > > > > > > > > >> > of
> > > > > > > > > > > > >> > > > the
> > > > > > > > > > > > >> > > > > latency depending on the specific
> > > transformation
> > > > > > > chain.
> > > > > > > > > > > > >> > > > >
> > > > > > > > > > > > >> > > > > I have one high level question with
> respect
> > to
> > > > > > > > > implementation
> > > > > > > > > > > -
> > > > > > > > > > > > >> > > > currently,
> > > > > > > > > > > > >> > > > > it seems like both source and sink tasks
> > only
> > > > > record
> > > > > > > > > metrics
> > > > > > > > > > > at
> > > > > > > > > > > > a
> > > > > > > > > > > > >> > > "batch"
> > > > > > > > > > > > >> > > > > level, not on an individual record level.
> I
> > > > think
> > > > > it
> > > > > > > > > might be
> > > > > > > > > > > > >> > > additional
> > > > > > > > > > > > >> > > > > overhead if we want to record these new
> > > metrics
> > > > > all
> > > > > > at
> > > > > > > > the
> > > > > > > > > > > > record
> > > > > > > > > > > > >> > > level?
> > > > > > > > > > > > >> > > > > Could we instead make all of these new
> > metrics
> > > > for
> > > > > > > > > batches of
> > > > > > > > > > > > >> records
> > > > > > > > > > > > >> > > > > rather than individual records in order to
> > > > remain
> > > > > > > > > consistent
> > > > > > > > > > > > with
> > > > > > > > > > > > >> the
> > > > > > > > > > > > >> > > > > existing task level metrics? This might
> > > require
> > > > > some
> > > > > > > > > > > additional
> > > > > > > > > > > > >> > > changes -
> > > > > > > > > > > > >> > > > > for instance, with the
> "sink-record-latency"
> > > > > metric,
> > > > > > > we
> > > > > > > > > might
> > > > > > > > > > > > only
> > > > > > > > > > > > >> > want
> > > > > > > > > > > > >> > > > to
> > > > > > > > > > > > >> > > > > have a "max" metric since "avg" would
> > require
> > > > > > > recording
> > > > > > > > a
> > > > > > > > > > > value
> > > > > > > > > > > > on
> > > > > > > > > > > > >> > the
> > > > > > > > > > > > >> > > > > sensor for each record (whereas we can
> get a
> > > > "max"
> > > > > > by
> > > > > > > > only
> > > > > > > > > > > > >> recording
> > > > > > > > > > > > >> > a
> > > > > > > > > > > > >> > > > > metric value for the oldest record in each
> > > > batch).
> > > > > > > > > > > > >> > > > >
> > > > > > > > > > > > >> > > > > Thanks,
> > > > > > > > > > > > >> > > > > Yash
> > > > > > > > > > > > >> > > > >
> > > > > > > > > > > > >> > > > > On Fri, Sep 2, 2022 at 3:16 PM Sagar <
> > > > > > > > > > > sagarmeansocean@gmail.com
> > > > > > > > > > > > >
> > > > > > > > > > > > >> > > wrote:
> > > > > > > > > > > > >> > > > >
> > > > > > > > > > > > >> > > > > > Hi Jorge,
> > > > > > > > > > > > >> > > > > >
> > > > > > > > > > > > >> > > > > > Thanks for the changes.
> > > > > > > > > > > > >> > > > > >
> > > > > > > > > > > > >> > > > > > Regarding the metrics, I meant something
> > > like
> > > > > > this:
> > > > > > > > > > > > >> > > > > >
> > > > > > > > > > > > >> > > > >
> > > > > > > > > > > > >> > > >
> > > > > > > > > > > > >> > >
> > > > > > > > > > > > >> >
> > > > > > > > > > > > >>
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> kafka.connect:type=sink-task-metrics,connector="{connector}",task="{task}"
> > > > > > > > > > > > >> > > > > >
> > > > > > > > > > > > >> > > > > > the way it's defined in
> > > > > > > > > > > > >> > > > > >
> > > > > > > > > https://kafka.apache.org/documentation/#connect_monitoring
> > > > > > > > > > > > for
> > > > > > > > > > > > >> the
> > > > > > > > > > > > >> > > > > > metrics.
> > > > > > > > > > > > >> > > > > >
> > > > > > > > > > > > >> > > > > > I see what you mean by the 3 metrics and
> > how
> > > > it
> > > > > > can
> > > > > > > be
> > > > > > > > > > > > >> interpreted.
> > > > > > > > > > > > >> > > The
> > > > > > > > > > > > >> > > > > > only thing I would argue is do we need
> > > > > > > > > > > > sink-record-latency-min?
> > > > > > > > > > > > >> > Maybe
> > > > > > > > > > > > >> > > > we
> > > > > > > > > > > > >> > > > > > could remove this min metric as well and
> > > make
> > > > > all
> > > > > > of
> > > > > > > > > the 3
> > > > > > > > > > > e2e
> > > > > > > > > > > > >> > > metrics
> > > > > > > > > > > > >> > > > > > consistent (since put-batch also doesn't
> > > > > > > > > > > > >> > > > > > expose a min, which makes sense to me). I
> > > > > > > > > > > > >> > > > > > think this is in contrast to what Yash
> > > > > > > > > > > > >> > > > > > pointed out above, so I would like to hear
> > > > > > > > > > > > >> > > > > > his thoughts as well.
> > > > > > > > > > > > >> > > > > >
> > > > > > > > > > > > >> > > > > > The other point Yash mentioned about the
> > > > > slightly
> > > > > > > > flawed
> > > > > > > > > > > > >> definition
> > > > > > > > > > > > >> > > of
> > > > > > > > > > > > >> > > > > e2e
> > > > > > > > > > > > >> > > > > > is also true in a sense. But I have a
> > > feeling
> > > > > > that's
> > > > > > > > > one the
> > > > > > > > > > > > >> > records
> > > > > > > > > > > > >> > > > are
> > > > > > > > > > > > >> > > > > > polled by the connector tasks, it would
> be
> > > > > > difficult
> > > > > > > > to
> > > > > > > > > > > track
> > > > > > > > > > > > >> the
> > > > > > > > > > > > >> > > final
> > > > > > > > > > > > >> > > > > leg
> > > > > > > > > > > > >> > > > > > via the framework. Probably users can
> > track
> > > > the
> > > > > > > > metrics
> > > > > > > > > at
> > > > > > > > > > > > their
> > > > > > > > > > > > >> > end
> > > > > > > > > > > > >> > > to
> > > > > > > > > > > > >> > > > > > figure that out. Do you think that makes
> > > > sense?
> > > > > > > > > > > > >> > > > > >
> > > > > > > > > > > > >> > > > > > Thanks!
> > > > > > > > > > > > >> > > > > > Sagar.
> > > > > > > > > > > > >> > > > > >
> > > > > > > > > > > > >> > > > > >
> > > > > > > > > > > > >> > > > > >
> > > > > > > > > > > > >> > > > > >
> > > > > > > > > > > > >> > > > > > On Thu, Sep 1, 2022 at 11:40 PM Jorge
> > > Esteban
> > > > > > > Quilcate
> > > > > > > > > > > Otoya <
> > > > > > > > > > > > >> > > > > > quilcate.jorge@gmail.com> wrote:
> > > > > > > > > > > > >> > > > > >
> > > > > > > > > > > > >> > > > > > > Hi Sagar and Yash,
> > > > > > > > > > > > >> > > > > > >
> > > > > > > > > > > > >> > > > > > > Thanks for your feedback!
> > > > > > > > > > > > >> > > > > > >
> > > > > > > > > > > > >> > > > > > > > 1) I am assuming the new metrics
> would
> > > be
> > > > > task
> > > > > > > > level
> > > > > > > > > > > > metric.
> > > > > > > > > > > > >> > > > > > >
> > > > > > > > > > > > >> > > > > > > 1.1 Yes, it will be a task level
> metric,
> > > > > > > implemented
> > > > > > > > > on
> > > > > > > > > > > the
> > > > > > > > > > > > >> > > > > > > Worker[Source/Sink]Task.
> > > > > > > > > > > > >> > > > > > >
> > > > > > > > > > > > >> > > > > > > > Could you specify the way it's done
> > for
> > > > > other
> > > > > > > > > > > sink/source
> > > > > > > > > > > > >> > > > connector?
> > > > > > > > > > > > >> > > > > > >
> > > > > > > > > > > > >> > > > > > > 1.2. Not sure what do you mean by
> this.
> > > > Could
> > > > > > you
> > > > > > > > > > > elaborate
> > > > > > > > > > > > a
> > > > > > > > > > > > >> bit
> > > > > > > > > > > > >> > > > more?
> > > > > > > > > > > > >> > > > > > >
> > > > > > > > > > > > >> > > > > > > > 2. I am slightly confused about the
> > e2e
> > > > > > latency
> > > > > > > > > > > metric...
> > > > > > > > > > > > >> > > > > > >
> > > > > > > > > > > > >> > > > > > > 2.1. Yes, I see. I was trying to
> bring a
> > > > > similar
> > > > > > > > > concept
> > > > > > > > > > > as
> > > > > > > > > > > > in
> > > > > > > > > > > > >> > > > Streams
> > > > > > > > > > > > >> > > > > > with
> > > > > > > > > > > > >> > > > > > > KIP-613, though the e2e concept may
> not
> > be
> > > > > > > > > translatable.
> > > > > > > > > > > > >> > > > > > > We could keep it as
> > `sink-record-latency`
> > > to
> > > > > > avoid
> > > > > > > > > > > > conflating
> > > > > > > > > > > > >> > > > > concepts. A
> > > > > > > > > > > > >> > > > > > > similar metric naming was proposed in
> > > > KIP-489
> > > > > > but
> > > > > > > at
> > > > > > > > > the
> > > > > > > > > > > > >> consumer
> > > > > > > > > > > > >> > > > > level —
> > > > > > > > > > > > >> > > > > > > though it seems dormant for a couple
> of
> > > > years.
> > > > > > > > > > > > >> > > > > > >
> > > > > > > > > > > > >> > > > > > > > However, the put-batch time measures
> > the
> > > > > > > > > > > > >> > > > > > > > time to put a batch of records to
> > > external
> > > > > > sink.
> > > > > > > > > So, I
> > > > > > > > > > > > would
> > > > > > > > > > > > >> > > assume
> > > > > > > > > > > > >> > > > > > the 2
> > > > > > > > > > > > >> > > > > > > > can't be added as is to compute the
> > e2e
> > > > > > latency.
> > > > > > > > > Maybe I
> > > > > > > > > > > > am
> > > > > > > > > > > > >> > > missing
> > > > > > > > > > > > >> > > > > > > > something here. Could you plz
> clarify
> > > > this.
> > > > > > > > > > > > >> > > > > > >
> > > > > > > > > > > > >> > > > > > > 2.2. Yes, agree. Not necessarily added
> > > > > > > > > > > > >> > > > > > > together, but with the 3 latencies (poll,
> > > > > > > > > > > > >> > > > > > > convert, putBatch) it will be clearer where
> > > > > > > > > > > > >> > > > > > > the bottleneck may be, and they represent
> > > > > > > > > > > > >> > > > > > > the internal processing.
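[Editor's note] A rough sketch of why the three-way split helps (the phase bodies are placeholders, not the actual WorkerSinkTask code): timing each phase separately lets each feed its own metric, so the slow phase stands out, whereas a single combined latency would not show where time is spent:

```java
// Hypothetical sketch: timing the poll, convert and put phases separately so
// each can feed its own metric.
public class PhaseTimingSketch {
    static long elapsedMs(Runnable phase) {
        long start = System.nanoTime();
        phase.run();
        return (System.nanoTime() - start) / 1_000_000L;
    }

    public static void main(String[] args) {
        long pollMs = elapsedMs(() -> { /* consumer.poll(...) would run here */ });
        long convertMs = elapsedMs(() -> { /* convert + transform each record */ });
        long putMs = elapsedMs(() -> { /* task.put(records) */ });
        // Three separate metrics make the bottleneck phase directly visible.
        System.out.println(pollMs >= 0 && convertMs >= 0 && putMs >= 0); // prints true
    }
}
```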
> > > > > > > > > > > > >> > > > > > >
> > > > > > > > > > > > >> > > > > > > > however, as per the KIP it looks
> like
> > it
> > > > > will
> > > > > > be
> > > > > > > > > > > > >> > > > > > > > the latency between when the record
> > was
> > > > > > written
> > > > > > > to
> > > > > > > > > Kafka
> > > > > > > > > > > > and
> > > > > > > > > > > > >> > when
> > > > > > > > > > > > >> > > > the
> > > > > > > > > > > > >> > > > > > > > record is returned by a sink task's
> > > > > consumer's
> > > > > > > > poll?
> > > > > > > > > > > > >> > > > > > >
> > > > > > > > > > > > >> > > > > > > 3.1. Agree. 2.1. could help to clarify
> > > this.
> > > > > > > > > > > > >> > > > > > >
> > > > > > > > > > > > >> > > > > > > > One more thing - I was wondering
> > > > > > > > > > > > >> > > > > > > > if there's a particular reason for
> > > having
> > > > a
> > > > > > min
> > > > > > > > > metric
> > > > > > > > > > > for
> > > > > > > > > > > > >> e2e
> > > > > > > > > > > > >> > > > > latency
> > > > > > > > > > > > >> > > > > > > but
> > > > > > > > > > > > >> > > > > > > > not for convert time?
> > > > > > > > > > > > >> > > > > > >
> > > > > > > > > > > > >> > > > > > 3.2. I was following KIP-613 for e2e, where
> > > > > > > > > > > > >> > > > > > Min seems useful to compare with Max and get
> > > > > > > > > > > > >> > > > > > an idea of the window of results. However,
> > > > > > > > > > > > >> > > > > > current latencies in Connect do not include
> > > > > > > > > > > > >> > > > > > Min, which is why I haven't added it for
> > > > > > > > > > > > >> > > > > > convert latency. Do you think it makes sense
> > > > > > > > > > > > >> > > > > > to extend latency metrics with Min?
> > > > > > > > > > > > Min?
> > > > > > > > > > > > >> > > > > > >
> > > > > > > > > > > > >> > > > > > > KIP is updated to clarify some of
> these
> > > > > changes.
> > > > > > > > > > > > >> > > > > > >
> > > > > > > > > > > > >> > > > > > > Many thanks,
> > > > > > > > > > > > >> > > > > > > Jorge.
> > > > > > > > > > > > >> > > > > > >
> > > > > > > > > > > > >> > > > > > > On Thu, 1 Sept 2022 at 18:11, Yash
> > Mayya <
> > > > > > > > > > > > >> yash.mayya@gmail.com>
> > > > > > > > > > > > >> > > > wrote:
> > > > > > > > > > > > >> > > > > > >
> > > > > > > > > > > > >> > > > > > > > Hi Jorge,
> > > > > > > > > > > > >> > > > > > > >
> > > > > > > > > > > > >> > > > > > > > Thanks for the KIP! I have the same
> > > > > confusion
> > > > > > > with
> > > > > > > > > the
> > > > > > > > > > > > >> > > e2e-latency
> > > > > > > > > > > > >> > > > > > > metrics
> > > > > > > > > > > > >> > > > > > > > as Sagar above. "e2e" would seem to
> > > > indicate
> > > > > > the
> > > > > > > > > latency
> > > > > > > > > > > > >> > between
> > > > > > > > > > > > >> > > > when
> > > > > > > > > > > > >> > > > > > the
> > > > > > > > > > > > >> > > > > > > > record was written to Kafka and when
> > the
> > > > > > record
> > > > > > > > was
> > > > > > > > > > > > written
> > > > > > > > > > > > >> to
> > > > > > > > > > > > >> > > the
> > > > > > > > > > > > >> > > > > sink
> > > > > > > > > > > > >> > > > > > > > system by the connector - however,
> as
> > > per
> > > > > the
> > > > > > > KIP
> > > > > > > > it
> > > > > > > > > > > looks
> > > > > > > > > > > > >> like
> > > > > > > > > > > > >> > > it
> > > > > > > > > > > > >> > > > > will
> > > > > > > > > > > > >> > > > > > > be
> > > the latency between when the record was written to Kafka and when the
> > > record is returned by a sink task's consumer's poll? I think that
> > > metric will be a little confusing to interpret. One more thing - I was
> > > wondering if there's a particular reason for having a min metric for
> > > e2e latency but not for convert time?
> > >
> > > Thanks,
> > > Yash
> > >
> > > On Thu, Sep 1, 2022 at 8:59 PM Sagar <sagarmeansocean@gmail.com> wrote:
> > >
> > > > Hi Jorge,
> > > >
> > > > Thanks for the KIP. It looks like a very good addition. I skimmed
> > > > through once and had a couple of questions =>
> > > >
> > > > 1) I am assuming the new metrics would be task level metric. Could
> > > > you specify the way it's done for other sink/source connector?
> > > > 2) I am slightly confused about the e2e latency metric. Let's
> > > > consider the sink connector metric. If I look at the way it's
> > > > supposed to be calculated, i.e the difference between the record
> > > > timestamp and the wall clock time, it looks like a per record
> > > > metric. However, the put-batch time measures the time to put a
> > > > batch of records to external sink. So, I would assume the 2 can't
> > > > be added as is to compute the e2e latency. Maybe I am missing
> > > > something here. Could you plz clarify this.
> > > >
> > > > Thanks!
> > > > Sagar.
> > > >
> > > > On Tue, Aug 30, 2022 at 8:43 PM Jorge Esteban Quilcate Otoya <
> > > > quilcate.jorge@gmail.com> wrote:
> > > >
> > > > > Hi all,
> > > > >
> > > > > I'd like to start a discussion thread on KIP-864: Add End-To-End
> > > > > Latency Metrics to Connectors.
> > > > > This KIP aims to improve the metrics available on Source and Sink
> > > > > Connectors to measure end-to-end latency, including source and
> > > > > sink record conversion time, and sink record e2e latency (similar
> > > > > to KIP-613 for Streams).
> > > > >
> > > > > The KIP is here:
> > > > >
> > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-864%3A+Add+End-To-End+Latency+Metrics+to+Connectors
> > > > >
> > > > > Please take a look and let me know what you think.
> > > > >
> > > > > Cheers,
> > > > > Jorge.

Re: [DISCUSS] KIP-864: Add End-To-End Latency Metrics to Connectors

Posted by Chris Egerton <ch...@aiven.io.INVALID>.
Hi Jorge,

Thanks for indulging my paranoia. LGTM!

Cheers,

Chris

On Mon, Dec 5, 2022 at 10:06 AM Jorge Esteban Quilcate Otoya <
quilcate.jorge@gmail.com> wrote:

> Sure! I have added the following to the proposed changes section:
>
> ```
> The per-record metrics will definitely be added to Kafka Connect as part of
> this KIP, but their metric level will be changed pending the performance
> testing described in KAFKA-14441, and will otherwise only be exposed at
> a lower level (DEBUG instead of INFO, and TRACE instead of DEBUG)
> ```
>
> Let me know how it looks.
>
> Many thanks!
> Jorge.
>
> On Mon, 5 Dec 2022 at 14:11, Chris Egerton <ch...@aiven.io.invalid>
> wrote:
>
> > Hi Jorge,
> >
> > Thanks for filing KAFKA-14441! In the ticket description we mention that
> > "there will be more confidence whether to design metrics to be exposed
> at a
> > DEBUG or INFO level depending on their impact" but it doesn't seem like
> > this is called out in the KIP and, just based on what's in the KIP, the
> > proposal is still to have several per-record metrics exposed at INFO
> level.
> >
> > Could we explicitly call out that the per-record metrics will definitely
> be
> > added to Kafka Connect as part of this KIP, but they will only be exposed
> > at INFO level pending the performance testing described in
> > KAFKA-14441, and will otherwise only be exposed at DEBUG level?
> Otherwise,
> > it's possible that a vote for the KIP as it's written today would be a
> vote
> > in favor of unconditionally exposing these metrics at INFO level, even if
> > the performance testing reveals issues.
> >
> > Cheers,
> >
> > Chris
> >
> > On Sun, Dec 4, 2022 at 7:08 PM Jorge Esteban Quilcate Otoya <
> > quilcate.jorge@gmail.com> wrote:
> >
> > > Thanks for the reminder Chris!
> > >
> > > I have added a note to the KIP to include this, as most of the metrics
> > > proposed are per-record and having them all at DEBUG would limit the
> > > benefits, and created
> > > https://issues.apache.org/jira/browse/KAFKA-14441
> > > to keep track of this task.
> > >
> > > Cheers,
> > > Jorge.
> > >
> > > On Tue, 29 Nov 2022 at 19:40, Chris Egerton <ch...@aiven.io.invalid>
> > > wrote:
> > >
> > > > Hi Jorge,
> > > >
> > > > Thanks! What were your thoughts on the possible benchmarking and/or
> > > > downgrading of per-record metrics to DEBUG?
> > > >
> > > > Cheers,
> > > >
> > > > Chris
> > > >
> > > > On Thu, Nov 24, 2022 at 8:20 AM Jorge Esteban Quilcate Otoya <
> > > > quilcate.jorge@gmail.com> wrote:
> > > >
> > > > > Thanks Chris! I have updated the KIP with "transform" instead of
> > > "alias".
> > > > > Agree it's clearer.
> > > > >
> > > > > Cheers,
> > > > > Jorge.
> > > > >
> > > > > On Mon, 21 Nov 2022 at 21:36, Chris Egerton
> <chrise@aiven.io.invalid
> > >
> > > > > wrote:
> > > > >
> > > > > > Hi Jorge,
> > > > > >
> > > > > > Thanks for the updates, and apologies for the delay. The new
> > diagram
> > > > > > directly under the "Proposed Changes" section is absolutely
> > gorgeous!
> > > > > >
> > > > > >
> > > > > > Follow-ups:
> > > > > >
> > > > > > RE 2: Good point. We can use the same level for these metrics,
> it's
> > > > not a
> > > > > > big deal.
> > > > > >
> > > > > > RE 3: As long as all the per-record metrics are kept at DEBUG
> > level,
> > > it
> > > > > > should be fine to leave JMH benchmarking for a follow-up. If we
> > want
> > > to
> > > > > add
> > > > > > new per-record, INFO-level metrics, I would be more comfortable
> > with
> > > > > > including benchmarking as part of the testing plan for the KIP.
> One
> > > > > > possible compromise could be to propose that these features be
> > merged
> > > > at
> > > > > > DEBUG level, and then possibly upgraded to INFO level in the
> future
> > > > > pending
> > > > > > benchmarks to guard against performance degradation.
> > > > > >
> > > > > > RE 4: I think for a true "end-to-end" metric, it'd be useful to
> > > include
> > > > > the
> > > > > > time taken by the task to actually deliver the record. However,
> > with
> > > > the
> > > > > > new metric names and descriptions provided in the KIP, I have no
> > > > > objections
> > > > > > with what's currently proposed, and a new "end-to-end" metric can
> > be
> > > > > taken
> > > > > > on later in a follow-up KIP.
> > > > > >
> > > > > > RE 6: You're right, existing producer metrics should be enough
> for
> > > now.
> > > > > We
> > > > > > can revisit this later if/when we add delivery-centric metrics
> for
> > > sink
> > > > > > tasks as well.
> > > > > >
> > > > > > RE 7: The new metric names in the KIP LGTM; I don't see any need
> to
> > > > > expand
> > > > > > beyond those but if you'd still like to pursue others, LMK.
> > > > > >
> > > > > >
> > > > > > New thoughts:
> > > > > >
> > > > > > One small thought: instead of "alias" in
> "alias="{transform_alias}"
> > > for
> > > > > the
> > > > > > per-transform metrics, could we use "transform"? IMO it's clearer
> > > since
> > > > > we
> > > > > > don't use "alias" in the names of transform-related properties,
> and
> > > > > "alias"
> > > > > > may be confused with the classloading term where you can use,
> e.g.,
> > > > > > "FileStreamSource" as the name of a connector class in a
> connector
> > > > config
> > > > > > instead of
> > "org.apache.kafka.connect.file.FileStreamSourceConnector".
> > > > > >
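To make the suggested rename concrete, here is a small illustrative sketch of how a per-transform metric identifier could look once a "transform" tag is used instead of "alias". The metric group and tag values below are assumptions for illustration only, not the final names from the KIP:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class TransformMetricName {
    // Builds an MBean-style metric name from a group and ordered tags,
    // mirroring the kafka.connect:type=...,tag1=...,tag2=... convention.
    static String mbeanName(String group, Map<String, String> tags) {
        StringBuilder sb = new StringBuilder("kafka.connect:type=").append(group);
        for (Map.Entry<String, String> e : tags.entrySet()) {
            sb.append(',').append(e.getKey()).append('=').append(e.getValue());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> tags = new LinkedHashMap<>();
        tags.put("connector", "my-connector");
        tags.put("task", "0");
        tags.put("transform", "insertTs"); // "transform" rather than "alias"
        // prints kafka.connect:type=connector-transform-metrics,connector=my-connector,task=0,transform=insertTs
        System.out.println(mbeanName("connector-transform-metrics", tags));
    }
}
```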
> > > > > >
> > > > > > Cheers,
> > > > > >
> > > > > > Chris
> > > > > >
> > > > > > On Fri, Nov 18, 2022 at 12:06 PM Jorge Esteban Quilcate Otoya <
> > > > > > quilcate.jorge@gmail.com> wrote:
> > > > > >
> > > > > > > Thanks Mickael!
> > > > > > >
> > > > > > >
> > > > > > > On Wed, 9 Nov 2022 at 15:54, Mickael Maison <
> > > > mickael.maison@gmail.com>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Hi Jorge,
> > > > > > > >
> > > > > > > > Thanks for the KIP, it is a nice improvement.
> > > > > > > >
> > > > > > > > 1) The per transformation metrics still have a question mark
> > next
> > > > to
> > > > > > > > them in the KIP. Do you want to include them? If so we'll
> want
> > to
> > > > tag
> > > > > > > > them, we should be able to include the aliases in
> > > > TransformationChain
> > > > > > > > and use them.
> > > > > > > >
> > > > > > >
> > > > > > > Yes, I have added the changes on TransformChain that will be
> > needed
> > > > to
> > > > > > add
> > > > > > > these metrics.
> > > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > 2) I see no references to predicates. If we don't want to
> > measure
> > > > > > > > their latency, can we say it explicitly?
> > > > > > > >
> > > > > > >
> > > > > > > Good question, I haven't considered these. Though as these are
> > > > > > materialized
> > > > > > > as PredicatedTransformation, they should be covered by these
> > > changes.
> > > > > > > Adding a note about this.
> > > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > 3) Should we have sink-record-batch-latency-avg-ms? All other
> > > > metrics
> > > > > > > > have both the maximum and average values.
> > > > > > > >
> > > > > > > >
> > > > > > > Good question. I will remove it and change the record latency
> > from
> > > > > > > DEBUG->INFO as it already covers the maximum metric.
> > > > > > >
> > > > > > > Hope it's clearer now, let me know if there any additional
> > > feedback.
> > > > > > > Thanks!
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Mickael
> > > > > > > >
> > > > > > > > On Thu, Oct 20, 2022 at 9:58 PM Jorge Esteban Quilcate Otoya
> > > > > > > > <qu...@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > > Thanks, Chris! Great feedback! Please, find my comments
> > below:
> > > > > > > > >
> > > > > > > > > On Thu, 13 Oct 2022 at 18:52, Chris Egerton
> > > > > <chrise@aiven.io.invalid
> > > > > > >
> > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Hi Jorge,
> > > > > > > > > >
> > > > > > > > > > Thanks for the KIP. I agree with the overall direction
> and
> > > > think
> > > > > > this
> > > > > > > > would
> > > > > > > > > > be a nice improvement to Kafka Connect. Here are my
> initial
> > > > > > thoughts
> > > > > > > > on the
> > > > > > > > > > details:
> > > > > > > > > >
> > > > > > > > > > 1. The motivation section outlines the gaps in Kafka
> > > Connect's
> > > > > task
> > > > > > > > metrics
> > > > > > > > > > nicely. I think it'd be useful to include more concrete
> > > details
> > > > > on
> > > > > > > why
> > > > > > > > > > these gaps need to be filled in, and in which cases
> > > additional
> > > > > > > metrics
> > > > > > > > > > would be helpful. One goal could be to provide enhanced
> > > > > monitoring
> > > > > > of
> > > > > > > > > > production deployments that allows for cluster
> > administrators
> > > > to
> > > > > > set
> > > > > > > up
> > > > > > > > > > automatic alerts for latency spikes and, if triggered,
> > > quickly
> > > > > > > > identify the
> > > > > > > > > > root cause of those alerts, reducing the time to
> > remediation.
> > > > > > Another
> > > > > > > > goal
> > > > > > > > > > could be to provide more insight to developers or cluster
> > > > > > > > administrators
> > > > > > > > > > who want to do performance testing on connectors in
> > > > > non-production
> > > > > > > > > > environments. It may help guide our decision making
> process
> > > to
> > > > > > have a
> > > > > > > > > > clearer picture of the goals we're trying to achieve.
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > Agree. The Motivation section has been updated.
> > > > > > > > > Thanks for the examples, I see both of them being covered
> by
> > > the
> > > > > KIP.
> > > > > > > > > I see how these could give us a good distinction on whether
> > to
> > > > > > position
> > > > > > > > > some metrics at INFO or DEBUG level.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > > 2. If we're trying to address the alert-and-diagnose use
> > > case,
> > > > > it'd
> > > > > > > be
> > > > > > > > > > useful to have as much information as possible at INFO
> > level,
> > > > > > rather
> > > > > > > > than
> > > > > > > > > > forcing cluster administrators to possibly reconfigure a
> > > > > connector
> > > > > > to
> > > > > > > > emit
> > > > > > > > > > DEBUG or TRACE level metrics in order to diagnose a
> > potential
> > > > > > > > > > production-impacting performance bottleneck. I can see
> the
> > > > > > rationale
> > > > > > > > for
> > > > > > > > > > emitting per-record metrics that track an average value
> at
> > > > DEBUG
> > > > > > > > level, but
> > > > > > > > > > for per-record metrics that track a maximum value, is
> there
> > > any
> > > > > > > reason
> > > > > > > > not
> > > > > > > > > > to provide this information at INFO level?
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > Agree. Though with Max and Avg metrics being part of the
> same
> > > > > sensor
> > > > > > —
> > > > > > > > > where Metric Level is defined — then both metrics get the
> > same
> > > > > level.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > > 3. I'm also curious about the performance testing
> suggested
> > > by
> > > > > Yash
> > > > > > > to
> > > > > > > > > > gauge the potential impact of this change. Have you been
> > able
> > > > to
> > > > > do
> > > > > > > any
> > > > > > > > > > testing with your draft implementation yet?
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > No, not so far.
> > > > > > > > > I think it would be valuable to discuss the scope of this
> > > testing
> > > > > and
> > > > > > > > maybe
> > > > > > > > > tackle it
> > > > > > > > > in a separate issue as Sensors and Metrics are used all
> over
> > > the
> > > > > > place.
> > > > > > > > > My initial understanding is that these tests should by
> placed
> > > in
> > > > > the
> > > > > > > > > jmh-benchmarks[1].
> > > > > > > > > Then, we could target testing Sensors and Metrics, and
> > validate
> > > > how
> > > > > > > much
> > > > > > > > > overhead
> > > > > > > > > is added by having only Max vs Max,Avg(,Min), etc.
> > > > > > > > > On the other hand, we could extend this to Transformers or
> > > other
> > > > > > > Connect
> > > > > > > > > layers.
> > > > > > > > >
> > > > > > > > > Here are some pointers to the Sensors and Metrics
> > > implementations
> > > > > > that
> > > > > > > > > could be considered:
> > > > > > > > > Path to metric recording:
> > > > > > > > > -
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/apache/kafka/blob/5cab11cf525f6c06fcf9eb43f7f95ef33fe1cdbb/clients/src/main/java/org/apache/kafka/common/metrics/Sensor.java#L195-L199
> > > > > > > > > -
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/apache/kafka/blob/5cab11cf525f6c06fcf9eb43f7f95ef33fe1cdbb/clients/src/main/java/org/apache/kafka/common/metrics/Sensor.java#L230-L244
> > > > > > > > >
> > > > > > > > > ```
> > > > > > > > > // increment all the stats
> > > > > > > > > for (StatAndConfig statAndConfig : this.stats) {
> > > > > > > > >    statAndConfig.stat.record(statAndConfig.config(), value,
> > > > > timeMs);
> > > > > > > > > }
> > > > > > > > > ```
> > > > > > > > >
> > > > > > > > > SampledStats:
> > > > > > > > > - Avg:
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/apache/kafka/blob/068ab9cefae301f3187ea885d645c425955e77d2/clients/src/main/java/org/apache/kafka/common/metrics/stats/Avg.java
> > > > > > > > > - Max:
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/apache/kafka/blob/068ab9cefae301f3187ea885d645c425955e77d2/clients/src/main/java/org/apache/kafka/common/metrics/stats/Max.java
> > > > > > > > > - Min:
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/apache/kafka/blob/068ab9cefae301f3187ea885d645c425955e77d2/clients/src/main/java/org/apache/kafka/common/metrics/stats/Min.java
> > > > > > > > >
> > > > > > > > > `stat#record()` are implemented by `update` method in
> > > > SampledStat:
> > > > > > > > >
> > > > > > > > > ```Max.java
> > > > > > > > >     @Override
> > > > > > > > >     protected void update(Sample sample, MetricConfig config,
> > > > > > > > > double value, long now) {
> > > > > > > > >         sample.value = Math.max(sample.value, value);
> > > > > > > > >     }
> > > > > > > > > ```
> > > > > > > > >
> > > > > > > > > ```Avg.java
> > > > > > > > >     @Override
> > > > > > > > >     protected void update(Sample sample, MetricConfig config,
> > > > > > > > > double value, long now) {
> > > > > > > > >         sample.value += value;
> > > > > > > > >     }
> > > > > > > > > ```
> > > > > > > > >
> > > > > > > > > As far as I understand, most of the work of the stats
> happens
> > > on
> > > > > the
> > > > > > > > > `combine` method that is not part of the connector
> execution
> > > but
> > > > > > called
> > > > > > > > > when metrics are queried.
> > > > > > > > >
> > > > > > > > > I wonder whether we should consider Avg and Max for all
> > metrics
> > > > > > > proposed
> > > > > > > > as
> > > > > > > > > the impact on the execution path seems minimal, and even
> see
> > if
> > > > Min
> > > > > > is
> > > > > > > > also
> > > > > > > > > valuable, and use DEBUG only for more granular metrics.
> > > > > > > > >
> > > > > > > > > [1]
> > https://github.com/apache/kafka/tree/trunk/jmh-benchmarks
> > > > > > > > >
> > > > > > > > >
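To illustrate the update/combine split discussed above, here is a minimal self-contained sketch (plain Java, not Kafka's actual SampledStat classes) showing that the per-record work for both Avg and Max is a constant-time update, with the aggregation deferred to read time:

```java
public class StatSketch {
    // Per-record updates are O(1): Avg accumulates a running sum and count,
    // Max keeps the largest value seen so far. The division for Avg only
    // happens when the metric is read, mirroring SampledStat#combine.
    static double sum = 0.0;
    static long count = 0;
    static double max = Double.NEGATIVE_INFINITY;

    static void record(double value) {
        sum += value;                 // Avg#update analogue
        count++;
        max = Math.max(max, value);   // Max#update analogue
    }

    // combine/measure analogue: aggregation happens at query time
    static double avg() { return count == 0 ? Double.NaN : sum / count; }
    static double max() { return max; }

    public static void main(String[] args) {
        for (double v : new double[] {5.0, 1.0, 3.0}) record(v);
        System.out.println("avg=" + avg() + " max=" + max());
    }
}
```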
> > > > > > > > > > 4. Just to make sure I understand correctly--does "time
> > when
> > > it
> > > > > has
> > > > > > > > been
> > > > > > > > > > received by the Sink task" refer to the wallclock time
> > > directly
> > > > > > > after a
> > > > > > > > > > call to SinkTask::put has been completed (as opposed to
> > > > directly
> > > > > > > before
> > > > > > > > > > that call is made, or something else entirely)?
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > It currently means when it has been received by the Sink
> task
> > > > > > > > > right after consumer poll and before conversions.
> > > > > > > > > Would it be valuable to have it after put-sink-records?
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > > 5. If the goal is to identify performance bottlenecks
> > (either
> > > > in
> > > > > > > > production
> > > > > > > > > > or pre-production environments), would it make sense to
> > > > introduce
> > > > > > > > metrics
> > > > > > > > > > for each individual converter (i.e., key/value/header)
> and
> > > > > > > > transformation?
> > > > > > > > > > It's definitely an improvement to be able to identify the
> > > total
> > > > > > time
> > > > > > > > for
> > > > > > > > > > conversion and transformation, but then the immediate
> > > follow-up
> > > > > > > > question if
> > > > > > > > > > a bottleneck is found in that phase is "which
> > > > > > > converter/transformation
> > > > > > > > is
> > > > > > > > > > responsible?" It'd be nice if we could provide a way to
> > > quickly
> > > > > > > answer
> > > > > > > > that
> > > > > > > > > > question.
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > This is a great idea. I'd like to consider this as well,
> > though
> > > > > maybe
> > > > > > > > these
> > > > > > > > > more granular
> > > > > > > > > metrics would be good to have them as DEBUG.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > > 6. Any thoughts about offering latency metrics for source
> > > tasks
> > > > > > > between
> > > > > > > > > > receipt of the record from the task and delivery of the
> > > record
> > > > to
> > > > > > > Kafka
> > > > > > > > > > (which would be tracked by producer callback)? We could
> > also
> > > > use
> > > > > > the
> > > > > > > > record
> > > > > > > > > > timestamp either instead of or in addition to receipt
> time
> > if
> > > > the
> > > > > > > task
> > > > > > > > > > provides a timestamp with its records.
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > With source transform and convert metrics we get part of
> that
> > > > > > latency.
> > > > > > > > > Looking at the Producer metrics, `request-latency` (though
> a
> > > very
> > > > > > > generic
> > > > > > > > > metric)
> > > > > > > > > sort of answer the time between send request and ack — if
> my
> > > > > > > > understanding
> > > > > > > > > is correct.
> > > > > > > > > Would these be enough, or are you thinking about another
> > > approach?
> > > > > > > > > maybe a custom metric to cover the producer side?
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > > 7. We may end up introducing a way for sink tasks to
> record
> > > > > > > per-record
> > > > > > > > > > delivery to the sink system (see KIP-767 [1]). I'd like
> it
> > if
> > > > we
> > > > > > > could
> > > > > > > > keep
> > > > > > > > > > the names of our metrics very precise in order to avoid
> > > > confusing
> > > > > > > users
> > > > > > > > > > (who may think that we're providing metrics on actual
> > > delivery
> > > > to
> > > > > > the
> > > > > > > > sink
> > > > > > > > > > system, which may not be the case if the connector
> performs
> > > > > > > > asynchronous
> > > > > > > > > > writes), and in order to leave room for a metrics on true
> > > > > delivery
> > > > > > > > time by
> > > > > > > > > > sink tasks. It'd also be nice if we could remain
> consistent
> > > > with
> > > > > > > > existing
> > > > > > > > > > metrics such as "put-batch-avg-time-ms". With that in
> mind,
> > > > what
> > > > > do
> > > > > > > you
> > > > > > > > > > think about renaming these metrics:
> > > > > > > > > > - "sink-record-batch-latency-max-ms" to
> > > > > "put-batch-avg-latency-ms"
> > > > > > > > > > - "sink-record-latency-max-ms" to
> > > > > "put-sink-record-latency-max-ms"
> > > > > > > > > > - "sink-record-latency-avg-ms" to
> > > > > "put-sink-record-latency-avg-ms"
> > > > > > > > > > - "sink-record-convert-transform-time-max-ms" to
> > > > > > > > > > "convert-transform-sink-record-time-max-ms"
> > > > > > > > > > - "sink-record-convert-transform-time-avg-ms" to
> > > > > > > > > > "convert-transform-sink-record-time-avg-ms"
> > > > > > > > > > - "source-record-transform-convert-time-max-ms" to
> > > > > > > > > > "transform-convert-source-record-time-max-ms"
> > > > > > > > > > - "source-record-transform-convert-time-avg-ms" to
> > > > > > > > > > "transform-convert-source-record-time-avg-ms"
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > Make sense, thanks! I have updated the list of metrics and
> > > group
> > > > > them
> > > > > > > by
> > > > > > > > > sensor and applying these suggestions.
> > > > > > > > > The only ones that I want to review are: sink-record-* to
> > > > > put-batch-*
> > > > > > > > > (first 3). Not sure if put-batch/put-sink-record describes
> > the
> > > > > > purpose
> > > > > > > of
> > > > > > > > > the metric — nor does `sink-record-latency`, to be honest.
> > > > > > > > > My initial thought was to have something like Kafka Streams
> > > > > > > e2e-latency.
> > > > > > > > > Based on 4. and 6. questions, an idea could be to add:
> > > > > > > > > - source-batch-e2e-latency-before-send: measure wallclock -
> > > > source
> > > > > > > record
> > > > > > > > > timestamp after source connector poll.
> > > > > > > > > - source-batch-e2e-latency-after-send: measure wallclock -
> > > record
> > > > > > > > timestamp
> > > > > > > > > on producer send callback
> > > > > > > > > - sink-batch-e2e-latency-before-put: measure time
> wallclock -
> > > > > record
> > > > > > > > > timestamp after consumer poll
> > > > > > > > > - sink-batch-e2e-latency-after-put: measure time wallclock
> -
> > > > record
> > > > > > > > > timestamp after sink connector put.
> > > > > > > > >
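The batch variants above could be computed by tracking the oldest record timestamp per batch, as mentioned later in the thread. A hedged sketch of that idea (the method name and empty-batch behavior are assumptions for illustration, not part of the KIP):

```java
import java.util.List;

public class E2eLatency {
    // For a batch, the max end-to-end latency is driven by the oldest
    // record, so it suffices to keep the minimum record timestamp in the
    // batch and subtract it from the wall-clock time at the measurement
    // point (e.g. before put for sink-batch-e2e-latency-before-put).
    static long batchMaxE2eLatencyMs(List<Long> recordTimestampsMs, long nowMs) {
        long oldest = Long.MAX_VALUE;
        for (long ts : recordTimestampsMs) {
            oldest = Math.min(oldest, ts);
        }
        return recordTimestampsMs.isEmpty() ? 0L : nowMs - oldest;
    }

    public static void main(String[] args) {
        long now = 10_000L;
        // prints 800
        System.out.println(batchMaxE2eLatencyMs(List.of(9_500L, 9_200L, 9_900L), now));
    }
}
```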
> > > > > > > > >
> > > > > > > > > > Thanks again for the KIP! Looking forward to your
> thoughts.
> > > > > > > > > >
> > > > > > > > > > Cheers,
> > > > > > > > > >
> > > > > > > > > > Chris
> > > > > > > > > >
> > > > > > > > > > [1] -
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-767%3A+Connect+Latency+Metrics
> > > > > > > > > >
> > > > > > > > > > On Thu, Sep 15, 2022 at 1:32 PM Jorge Esteban Quilcate
> > Otoya
> > > <
> > > > > > > > > > quilcate.jorge@gmail.com> wrote:
> > > > > > > > > >
> > > > > > > > > > > Hi everyone,
> > > > > > > > > > >
> > > > > > > > > > > I've made a slight addition to the KIP based on Yash
> > > > feedback:
> > > > > > > > > > >
> > > > > > > > > > > - A new metric is added at INFO level to record the max
> > > > latency
> > > > > > > from
> > > > > > > > the
> > > > > > > > > > > batch timestamp, by keeping the oldest record timestamp
> > per
> > > > > > batch.
> > > > > > > > > > > - A draft implementation is linked.
> > > > > > > > > > >
> > > > > > > > > > > Looking forward to your feedback.
> > > > > > > > > > > Also, a kindly reminder that the vote thread is open.
> > > > > > > > > > >
> > > > > > > > > > > Thanks!
> > > > > > > > > > > Jorge.
> > > > > > > > > > >
> > > > > > > > > > > On Thu, 8 Sept 2022 at 14:25, Jorge Esteban Quilcate
> > Otoya
> > > <
> > > > > > > > > > > quilcate.jorge@gmail.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Great. I have updated the KIP to reflect this.
> > > > > > > > > > > >
> > > > > > > > > > > > Cheers,
> > > > > > > > > > > > Jorge.
> > > > > > > > > > > >
> > > > > > > > > > > > On Thu, 8 Sept 2022 at 12:26, Yash Mayya <
> > > > > yash.mayya@gmail.com
> > > > > > >
> > > > > > > > wrote:
> > > > > > > > > > > >
> > > > > > > > > > > >> Thanks, I think it makes sense to define these
> metrics
> > > at
> > > > a
> > > > > > > DEBUG
> > > > > > > > > > > >> recording
> > > > > > > > > > > >> level.
> > > > > > > > > > > >>
> > > > > > > > > > > >> On Thu, Sep 8, 2022 at 2:51 PM Jorge Esteban
> Quilcate
> > > > Otoya
> > > > > <
> > > > > > > > > > > >> quilcate.jorge@gmail.com> wrote:
> > > > > > > > > > > >>
> > > > > > > > > > > >> > On Thu, 8 Sept 2022 at 05:55, Yash Mayya <
> > > > > > > yash.mayya@gmail.com>
> > > > > > > > > > > wrote:
> > > > > > > > > > > >> >
> > > > > > > > > > > >> > > Hi Jorge,
> > > > > > > > > > > >> > >
> > > > > > > > > > > >> > > Thanks for the changes. With regard to having
> per
> > > > batch
> > > > > vs
> > > > > > > per
> > > > > > > > > > > record
> > > > > > > > > > > >> > > metrics, the additional overhead I was referring
> > to
> > > > > wasn't
> > > > > > > > about
> > > > > > > > > > > >> whether
> > > > > > > > > > > >> > or
> > > > > > > > > > > >> > > not we would need to iterate over all the
> records
> > > in a
> > > > > > > batch.
> > > > > > > > I
> > > > > > > > > > was
> > > > > > > > > > > >> > > referring to the potential additional overhead
> > > caused
> > > > by
> > > > > > the
> > > > > > > > > > higher
> > > > > > > > > > > >> > volume
> > > > > > > > > > > >> > > of calls to Sensor::record on the sensors for
> the
> > > new
> > > > > > > metrics
> > > > > > > > (as
> > > > > > > > > > > >> > compared
> > > > > > > > > > > >> > > to the existing batch only metrics), especially
> > for
> > > > high
> > > > > > > > > > throughput
> > > > > > > > > > > >> > > connectors where batch sizes could be large. I
> > guess
> > > > we
> > > > > > may
> > > > > > > > want
> > > > > > > > > > to
> > > > > > > > > > > do
> > > > > > > > > > > >> > some
> > > > > > > > > > > >> > > sort of performance testing and get concrete
> > numbers
> > > > to
> > > > > > > verify
> > > > > > > > > > > whether
> > > > > > > > > > > >> > this
> > > > > > > > > > > >> > > is a valid concern or not?
> > > > > > > > > > > >> > >
> > > > > > > > > > > >> >
> > > > > > > > > > > >> > 6.1. Got it, thanks for clarifying. I guess there
> > > could
> > > > > be a
> > > > > > > > > > benchmark
> > > > > > > > > > > >> test
> > > > > > > > > > > >> > of the `Sensor::record` to get an idea of the
> > > > performance
> > > > > > > > impact.
> > > > > > > > > > > >> > Regardless, the fact that these are single-record
> > > > metrics
> > > > > > > > compared
> > > > > > > > > > to
> > > > > > > > > > > >> > existing batch-only could be explicitly defined by
> > > > setting
> > > > > > > these
> > > > > > > > > > > >> metrics at
> > > > > > > > > > > >> > a DEBUG or TRACE metric recording level, leaving
> the
> > > > > > existing
> > > > > > > at
> > > > > > > > > > INFO
> > > > > > > > > > > >> > level.
> > > > > > > > > > > >> > wdyt?
> > > > > > > > > > > >> >
> > > > > > > > > > > >> >
> > > > > > > > > > > >> > >
> > > > > > > > > > > >> > > Thanks,
> > > > > > > > > > > >> > > Yash
> > > > > > > > > > > >> > >
> > > > > > > > > > > >> > > On Tue, Sep 6, 2022 at 4:42 PM Jorge Esteban
> > > Quilcate
> > > > > > Otoya
> > > > > > > <
> > > > > > > > > > > >> > > quilcate.jorge@gmail.com> wrote:
> > > > > > > > > > > >> > >
> > > > > > > > > > > >> > > > Hi Sagar and Yash,
> > > > > > > > > > > >> > > >
> > > > > > > > > > > >> > > > > the way it's defined in
> > > > > > > > > > > >> > > >
> > > > > > > https://kafka.apache.org/documentation/#connect_monitoring
> > > > > > > > for
> > > > > > > > > > > the
> > > > > > > > > > > >> > > metrics
> > > > > > > > > > > >> > > >
> > > > > > > > > > > >> > > > 4.1. Got it. Add it to the KIP.
> > > > > > > > > > > >> > > >
> > > > > > > > > > > >> > > > > The only thing I would argue is do we need
> > > > > > > > > > > >> sink-record-latency-min?
> > > > > > > > > > > >> > > Maybe
> > > > > > > > > > > >> > > > we
> > > > > > > > > > > >> > > > > could remove this min metric as well and
> make
> > > all
> > > > of
> > > > > > the
> > > > > > > > 3 e2e
> > > > > > > > > > > >> > metrics
> > > > > > > > > > > >> > > > > consistent
> > > > > > > > > > > >> > > >
> > > > > > > > > > > >> > > > 4.2 I see. Will remove it from the KIP.
> > > > > > > > > > > >> > > >
> > > > > > > > > > > >> > > > > Probably users can track the metrics at
> their
> > > end
> > > > to
> > > > > > > > > > > >> > > > > figure that out. Do you think that makes
> > sense?
> > > > > > > > > > > >> > > >
> > > > > > > > > > > >> > > > 4.3. Yes, agree. With these new metrics it
> > should
> > > be
> > > > > > > easier
> > > > > > > > for
> > > > > > > > > > > >> users
> > > > > > > > > > > >> > to
> > > > > > > > > > > >> > > > track this.
> > > > > > > > > > > >> > > >
> > > > > > > > > > > >> > > > > I think it makes sense to not have a min
> > metric
> > > > for
> > > > > > > > either to
> > > > > > > > > > > >> remain
> > > > > > > > > > > >> > > > > consistent with the existing put-batch and
> > > > > poll-batch
> > > > > > > > metrics
> > > > > > > > > > > >> > > >
> > > > > > > > > > > >> > > > 5.1. Got it. Same as 4.2
> > > > > > > > > > > >> > > >
> > > > > > > > > > > >> > > > > Another naming related suggestion I had was
> > with
> > > > the
> > > > > > > > > > > >> > > > > "convert-time" metrics - we should probably
> > > > include
> > > > > > > > > > > >> transformations
> > > > > > > > > > > >> > in
> > > > > > > > > > > >> > > > the
> > > > > > > > > > > >> > > > > name since SMTs could definitely be
> > attributable
> > > > to
> > > > > a
> > > > > > > > sizable
> > > > > > > > > > > >> chunk
> > > > > > > > > > > >> > of
> > > > > > > > > > > >> > > > the
> > > > > > > > > > > >> > > > > latency depending on the specific
> > transformation
> > > > > > chain.
> > > > > > > > > > > >> > > >
> > > > > > > > > > > >> > > > 5.2. Make sense. I'm proposing to add
> > > > > > > > > > > >> > `sink-record-convert-transform...`
> > > > > > > > > > > >> > > > and `source-record-transform-convert...` to
> > > > represent
> > > > > > > > correctly
> > > > > > > > > > > the
> > > > > > > > > > > >> > order
> > > > > > > > > > > >> > > > of operations.
> > > > > > > > > > > >> > > >
> > > > > > > > > > > >> > > > > it seems like both source and sink tasks
> only
> > > > record
> > > > > > > > metrics
> > > > > > > > > > at
> > > > > > > > > > > a
> > > > > > > > > > > >> > > "batch"
> > > > > > > > > > > >> > > > > level, not on an individual record level. I
> > > think
> > > > it
> > > > > > > > might be
> > > > > > > > > > > >> > > additional
> > > > > > > > > > > >> > > > > overhead if we want to record these new
> > metrics
> > > > all
> > > > > at
> > > > > > > the
> > > > > > > > > > > record
> > > > > > > > > > > >> > > level?
> > > > > > > > > > > >> > > >
> > > > > > > > > > > >> > > > 5.3. I considered at the beginning to
> implement
> > > all
> > > > > > > metrics
> > > > > > > > at
> > > > > > > > > > the
> > > > > > > > > > > >> > batch
> > > > > > > > > > > >> > > > level, but given how the framework processes
> > > records,
> > > > I
> > > > > > > > fallback
> > > > > > > > > > to
> > > > > > > > > > > >> the
> > > > > > > > > > > >> > > > proposed approach:
> > > > > > > > > > > >> > > > - Sink Task:
> > > > > > > > > > > >> > > >   - `WorkerSinkTask#convertMessages(msgs)`
> > already
> > > > > > > iterates
> > > > > > > > over
> > > > > > > > > > > >> > records,
> > > > > > > > > > > >> > > > so there is no additional overhead to capture
> > > record
> > > > > > > > latency per
> > > > > > > > > > > >> > record.
> > > > > > > > > > > >> > > >     -
> > > > > > > > > > > >> > > >
> > > > > > > > > > > >> > > >
> > > > > > > > > > > >> > >
> > > > > > > > > > > >> >
> > > > > > > > > > > >>
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/apache/kafka/blob/9841647c4fe422532f448423c92d26e4fdcb8932/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/WorkerSinkTask.java#L490-L514
> > > > > > > > > > > >> > > >   -
> > > > `WorkerSinkTask#convertAndTransformRecord(record)`
> > > > > > > > actually
> > > > > > > > > > > >> happens
> > > > > > > > > > > >> > > > individually. Measuring this operation per
> batch
> > > > would
> > > > > > > > include
> > > > > > > > > > > >> > processing
> > > > > > > > > > > >> > > > that is not strictly part of "convert and
> > > transform"
> > > > > > > > > > > >> > > >     -
> > > > > > > > > > > >> > > >
> > > > > > > > > > > >> > > >
> > > > > > > > > > > >> > >
> > > > > > > > > > > >> >
> > > > > > > > > > > >>
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/apache/kafka/blob/9841647c4fe422532f448423c92d26e4fdcb8932/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/WorkerSinkTask.java#L518
> > > > > > > > > > > >> > > > - Source Task:
> > > > > > > > > > > >> > > >   - `AbstractWorkerSourceTask#sendRecords`
> > > iterates
> > > > > > over a
> > > > > > > > batch
> > > > > > > > > > > and
> > > > > > > > > > > >> > > > applies transforms and convert record
> > individually
> > > > as
> > > > > > > well:
> > > > > > > > > > > >> > > >     -
> > > > > > > > > > > >> > > >
> > > > > > > > > > > >> > > >
> > > > > > > > > > > >> > >
> > > > > > > > > > > >> >
> > > > > > > > > > > >>
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/apache/kafka/blob/9841647c4fe422532f448423c92d26e4fdcb8932/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/AbstractWorkerSourceTask.java#L389-L390
> > > > > > > > > > > >> > > >
> > > > > > > > > > > >> > > > > This might require some additional changes -
> > > > > > > > > > > >> > > > > for instance, with the "sink-record-latency"
> > > > metric,
> > > > > > we
> > > > > > > > might
> > > > > > > > > > > only
> > > > > > > > > > > >> > want
> > > > > > > > > > > >> > > > to
> > > > > > > > > > > >> > > > > have a "max" metric since "avg" would
> require
> > > > > > recording
> > > > > > > a
> > > > > > > > > > value
> > > > > > > > > > > on
> > > > > > > > > > > >> > the
> > > > > > > > > > > >> > > > > sensor for each record (whereas we can get a
> > > "max"
> > > > > by
> > > > > > > only
> > > > > > > > > > > >> recording
> > > > > > > > > > > >> > a
> > > > > > > > > > > >> > > > > metric value for the oldest record in each
> > > batch).
> > > > > > > > > > > >> > > >
> > > > > > > > > > > >> > > > 5.4. Recording record-latency per batch may
> not
> > be
> > > > as
> > > > > > > > useful as
> > > > > > > > > > > >> there
> > > > > > > > > > > >> > is
> > > > > > > > > > > >> > > no
> > > > > > > > > > > >> > > > guarantee that the oldest record will be
> > > > > representative
> > > > > > of
> > > > > > > > the
> > > > > > > > > > > >> batch.
> > > > > > > > > > > >> > > >
> > > > > > > > > > > >> > > > On Sat, 3 Sept 2022 at 16:02, Yash Mayya <
> > > > > > > > yash.mayya@gmail.com>
> > > > > > > > > > > >> wrote:
> > > > > > > > > > > >> > > >
> > > > > > > > > > > >> > > > > Hi Jorge and Sagar,
> > > > > > > > > > > >> > > > >
> > > > > > > > > > > >> > > > > I think it makes sense to not have a min
> > metric
> > > > for
> > > > > > > > either to
> > > > > > > > > > > >> remain
> > > > > > > > > > > >> > > > > consistent with the existing put-batch and
> > > > > poll-batch
> > > > > > > > metrics
> > > > > > > > > > > (it
> > > > > > > > > > > >> > > doesn't
> > > > > > > > > > > >> > > > > seem particularly useful either anyway).
> Also,
> > > the
> > > > > new
> > > > > > > > > > > >> > > > > "sink-record-latency" metric name looks fine
> > to
> > > > me,
> > > > > > > > thanks for
> > > > > > > > > > > >> making
> > > > > > > > > > > >> > > the
> > > > > > > > > > > >> > > > > changes! Another naming related suggestion I
> > had
> > > > was
> > > > > > > with
> > > > > > > > the
> > > > > > > > > > > >> > > > > "convert-time" metrics - we should probably
> > > > include
> > > > > > > > > > > >> transformations
> > > > > > > > > > > >> > in
> > > > > > > > > > > >> > > > the
> > > > > > > > > > > >> > > > > name since SMTs could definitely be
> > attributable
> > > > to
> > > > > a
> > > > > > > > sizable
> > > > > > > > > > > >> chunk
> > > > > > > > > > > >> > of
> > > > > > > > > > > >> > > > the
> > > > > > > > > > > >> > > > > latency depending on the specific
> > transformation
> > > > > > chain.
> > > > > > > > > > > >> > > > >
> > > > > > > > > > > >> > > > > I have one high level question with respect
> to
> > > > > > > > implementation
> > > > > > > > > > -
> > > > > > > > > > > >> > > > currently,
> > > > > > > > > > > >> > > > > it seems like both source and sink tasks
> only
> > > > record
> > > > > > > > metrics
> > > > > > > > > > at
> > > > > > > > > > > a
> > > > > > > > > > > >> > > "batch"
> > > > > > > > > > > >> > > > > level, not on an individual record level. I
> > > think
> > > > it
> > > > > > > > might be
> > > > > > > > > > > >> > > additional
> > > > > > > > > > > >> > > > > overhead if we want to record these new
> > metrics
> > > > all
> > > > > at
> > > > > > > the
> > > > > > > > > > > record
> > > > > > > > > > > >> > > level?
> > > > > > > > > > > >> > > > > Could we instead make all of these new
> metrics
> > > for
> > > > > > > > batches of
> > > > > > > > > > > >> records
> > > > > > > > > > > >> > > > > rather than individual records in order to
> > > remain
> > > > > > > > consistent
> > > > > > > > > > > with
> > > > > > > > > > > >> the
> > > > > > > > > > > >> > > > > existing task level metrics? This might
> > require
> > > > some
> > > > > > > > > > additional
> > > > > > > > > > > >> > > changes -
> > > > > > > > > > > >> > > > > for instance, with the "sink-record-latency"
> > > > metric,
> > > > > > we
> > > > > > > > might
> > > > > > > > > > > only
> > > > > > > > > > > >> > want
> > > > > > > > > > > >> > > > to
> > > > > > > > > > > >> > > > > have a "max" metric since "avg" would
> require
> > > > > > recording
> > > > > > > a
> > > > > > > > > > value
> > > > > > > > > > > on
> > > > > > > > > > > >> > the
> > > > > > > > > > > >> > > > > sensor for each record (whereas we can get a
> > > "max"
> > > > > by
> > > > > > > only
> > > > > > > > > > > >> recording
> > > > > > > > > > > >> > a
> > > > > > > > > > > >> > > > > metric value for the oldest record in each
> > > batch).
> > > > > > > > > > > >> > > > >
> > > > > > > > > > > >> > > > > Thanks,
> > > > > > > > > > > >> > > > > Yash
> > > > > > > > > > > >> > > > >
> > > > > > > > > > > >> > > > > On Fri, Sep 2, 2022 at 3:16 PM Sagar <
> > > > > > > > > > sagarmeansocean@gmail.com
> > > > > > > > > > > >
> > > > > > > > > > > >> > > wrote:
> > > > > > > > > > > >> > > > >
> > > > > > > > > > > >> > > > > > Hi Jorge,
> > > > > > > > > > > >> > > > > >
> > > > > > > > > > > >> > > > > > Thanks for the changes.
> > > > > > > > > > > >> > > > > >
> > > > > > > > > > > >> > > > > > Regarding the metrics, I meant something
> > like
> > > > > this:
> > > > > > > > > > > >> > > > > >
> > > > > > > > > > > >> > > > >
> > > > > > > > > > > >> > > >
> > > > > > > > > > > >> > >
> > > > > > > > > > > >> >
> > > > > > > > > > > >>
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> kafka.connect:type=sink-task-metrics,connector="{connector}",task="{task}"
> > > > > > > > > > > >> > > > > >
> > > > > > > > > > > >> > > > > > the way it's defined in
> > > > > > > > > > > >> > > > > >
> > > > > > > > https://kafka.apache.org/documentation/#connect_monitoring
> > > > > > > > > > > for
> > > > > > > > > > > >> the
> > > > > > > > > > > >> > > > > > metrics.
> > > > > > > > > > > >> > > > > >
> > > > > > > > > > > >> > > > > > I see what you mean by the 3 metrics and
> how
> > > it
> > > > > can
> > > > > > be
> > > > > > > > > > > >> interpreted.
> > > > > > > > > > > >> > > The
> > > > > > > > > > > >> > > > > > only thing I would argue is do we need
> > > > > > > > > > > sink-record-latency-min?
> > > > > > > > > > > >> > Maybe
> > > > > > > > > > > >> > > > we
> > > > > > > > > > > >> > > > > > could remove this min metric as well and
> > make
> > > > all
> > > > > of
> > > > > > > > the 3
> > > > > > > > > > e2e
> > > > > > > > > > > >> > > metrics
> > > > > > > > > > > >> > > > > > consistent(since put-batch also doesn't
> > > expose a
> > > > > min
> > > > > > > > which
> > > > > > > > > > > makes
> > > > > > > > > > > >> > > sense
> > > > > > > > > > > >> > > > to
> > > > > > > > > > > >> > > > > > me). I think this is in contrast to what
> > Yash
> > > > > > pointed
> > > > > > > > out
> > > > > > > > > > > above
> > > > > > > > > > > >> so
> > > > > > > > > > > >> > I
> > > > > > > > > > > >> > > > > would
> > > > > > > > > > > >> > > > > > like to hear his thoughts as well.
> > > > > > > > > > > >> > > > > >
> > > > > > > > > > > >> > > > > > The other point Yash mentioned about the
> > > > slightly
> > > > > > > flawed
> > > > > > > > > > > >> definition
> > > > > > > > > > > >> > > of
> > > > > > > > > > > >> > > > > e2e
> > > > > > > > > > > >> > > > > > is also true in a sense. But I have a
> > feeling
> > > > > that's
> > > > > > > > one the
> > > > > > > > > > > >> > records
> > > > > > > > > > > >> > > > are
> > > > > > > > > > > >> > > > > > polled by the connector tasks, it would be
> > > > > difficult
> > > > > > > to
> > > > > > > > > > track
> > > > > > > > > > > >> the
> > > > > > > > > > > >> > > final
> > > > > > > > > > > >> > > > > leg
> > > > > > > > > > > >> > > > > > via the framework. Probably users can
> track
> > > the
> > > > > > > metrics
> > > > > > > > at
> > > > > > > > > > > their
> > > > > > > > > > > >> > end
> > > > > > > > > > > >> > > to
> > > > > > > > > > > >> > > > > > figure that out. Do you think that makes
> > > sense?
> > > > > > > > > > > >> > > > > >
> > > > > > > > > > > >> > > > > > Thanks!
> > > > > > > > > > > >> > > > > > Sagar.
> > > > > > > > > > > >> > > > > >
> > > > > > > > > > > >> > > > > >
> > > > > > > > > > > >> > > > > >
> > > > > > > > > > > >> > > > > >
> > > > > > > > > > > >> > > > > > On Thu, Sep 1, 2022 at 11:40 PM Jorge
> > Esteban
> > > > > > Quilcate
> > > > > > > > > > Otoya <
> > > > > > > > > > > >> > > > > > quilcate.jorge@gmail.com> wrote:
> > > > > > > > > > > >> > > > > >
> > > > > > > > > > > >> > > > > > > Hi Sagar and Yash,
> > > > > > > > > > > >> > > > > > >
> > > > > > > > > > > >> > > > > > > Thanks for your feedback!
> > > > > > > > > > > >> > > > > > >
> > > > > > > > > > > >> > > > > > > > 1) I am assuming the new metrics would
> > be
> > > > task
> > > > > > > level
> > > > > > > > > > > metric.
> > > > > > > > > > > >> > > > > > >
> > > > > > > > > > > >> > > > > > > 1.1 Yes, it will be a task level metric,
> > > > > > implemented
> > > > > > > > on
> > > > > > > > > > the
> > > > > > > > > > > >> > > > > > > Worker[Source/Sink]Task.
> > > > > > > > > > > >> > > > > > >
> > > > > > > > > > > >> > > > > > > > Could you specify the way it's done
> for
> > > > other
> > > > > > > > > > sink/source
> > > > > > > > > > > >> > > > connector?
> > > > > > > > > > > >> > > > > > >
> > > > > > > > > > > >> > > > > > > 1.2. Not sure what do you mean by this.
> > > Could
> > > > > you
> > > > > > > > > > elaborate
> > > > > > > > > > > a
> > > > > > > > > > > >> bit
> > > > > > > > > > > >> > > > more?
> > > > > > > > > > > >> > > > > > >
> > > > > > > > > > > >> > > > > > > > 2. I am slightly confused about the
> e2e
> > > > > latency
> > > > > > > > > > metric...
> > > > > > > > > > > >> > > > > > >
> > > > > > > > > > > >> > > > > > > 2.1. Yes, I see. I was trying to bring a
> > > > similar
> > > > > > > > concept
> > > > > > > > > > as
> > > > > > > > > > > in
> > > > > > > > > > > >> > > > Streams
> > > > > > > > > > > >> > > > > > with
> > > > > > > > > > > >> > > > > > > KIP-613, though the e2e concept may not
> be
> > > > > > > > translatable.
> > > > > > > > > > > >> > > > > > > We could keep it as
> `sink-record-latency`
> > to
> > > > > avoid
> > > > > > > > > > > conflating
> > > > > > > > > > > >> > > > > concepts. A
> > > > > > > > > > > >> > > > > > > similar metric naming was proposed in
> > > KIP-489
> > > > > but
> > > > > > at
> > > > > > > > the
> > > > > > > > > > > >> consumer
> > > > > > > > > > > >> > > > > level —
> > > > > > > > > > > >> > > > > > > though it seems dormant for a couple of
> > > years.
> > > > > > > > > > > >> > > > > > >
> > > > > > > > > > > >> > > > > > > > However, the put-batch time measures
> the
> > > > > > > > > > > >> > > > > > > > time to put a batch of records to
> > external
> > > > > sink.
> > > > > > > > So, I
> > > > > > > > > > > would
> > > > > > > > > > > >> > > assume
> > > > > > > > > > > >> > > > > > the 2
> > > > > > > > > > > >> > > > > > > > can't be added as is to compute the
> e2e
> > > > > latency.
> > > > > > > > Maybe I
> > > > > > > > > > > am
> > > > > > > > > > > >> > > missing
> > > > > > > > > > > >> > > > > > > > something here. Could you plz clarify
> > > this.
> > > > > > > > > > > >> > > > > > >
> > > > > > > > > > > >> > > > > > > 2.2. Yes, agree. Not necessarily added,
> > but
> > > > with
> > > > > > > the 3
> > > > > > > > > > > >> latencies
> > > > > > > > > > > >> > > > (poll,
> > > > > > > > > > > >> > > > > > > convert, putBatch) will be clearer where
> > the
> > > > > > > > bottleneck
> > > > > > > > > > may
> > > > > > > > > > > >> be,
> > > > > > > > > > > >> > and
> > > > > > > > > > > >> > > > > > > represent the internal processing.
> > > > > > > > > > > >> > > > > > >
> > > > > > > > > > > >> > > > > > > > however, as per the KIP it looks like
> it
> > > > will
> > > > > be
> > > > > > > > > > > >> > > > > > > > the latency between when the record
> was
> > > > > written
> > > > > > to
> > > > > > > > Kafka
> > > > > > > > > > > and
> > > > > > > > > > > >> > when
> > > > > > > > > > > >> > > > the
> > > > > > > > > > > >> > > > > > > > record is returned by a sink task's
> > > > consumer's
> > > > > > > poll?
> > > > > > > > > > > >> > > > > > >
> > > > > > > > > > > >> > > > > > > 3.1. Agree. 2.1. could help to clarify
> > this.
> > > > > > > > > > > >> > > > > > >
> > > > > > > > > > > >> > > > > > > > One more thing - I was wondering
> > > > > > > > > > > >> > > > > > > > if there's a particular reason for
> > having
> > > a
> > > > > min
> > > > > > > > metric
> > > > > > > > > > for
> > > > > > > > > > > >> e2e
> > > > > > > > > > > >> > > > > latency
> > > > > > > > > > > >> > > > > > > but
> > > > > > > > > > > >> > > > > > > > not for convert time?
> > > > > > > > > > > >> > > > > > >
> > > > > > > > > > > >> > > > > > > 3.2. Was following KIP-613 for e2e which
> > > seems
> > > > > > > useful
> > > > > > > > to
> > > > > > > > > > > >> compare
> > > > > > > > > > > >> > > with
> > > > > > > > > > > >> > > > > > Max a
> > > > > > > > > > > >> > > > > > > get an idea of the window of results,
> > though
> > > > > > current
> > > > > > > > > > > >> latencies in
> > > > > > > > > > > >> > > > > > Connector
> > > > > > > > > > > >> > > > > > > do not include Min, and that's why I
> > haven't
> > > > > added
> > > > > > > it
> > > > > > > > for
> > > > > > > > > > > >> convert
> > > > > > > > > > > >> > > > > > latency.
> > > > > > > > > > > >> > > > > > > Do you think it make sense to extend
> > latency
> > > > > > metrics
> > > > > > > > with
> > > > > > > > > > > Min?
> > > > > > > > > > > >> > > > > > >
> > > > > > > > > > > >> > > > > > > KIP is updated to clarify some of these
> > > > changes.
> > > > > > > > > > > >> > > > > > >
> > > > > > > > > > > >> > > > > > > Many thanks,
> > > > > > > > > > > >> > > > > > > Jorge.
> > > > > > > > > > > >> > > > > > >
> > > > > > > > > > > >> > > > > > > On Thu, 1 Sept 2022 at 18:11, Yash
> Mayya <
> > > > > > > > > > > >> yash.mayya@gmail.com>
> > > > > > > > > > > >> > > > wrote:
> > > > > > > > > > > >> > > > > > >
> > > > > > > > > > > >> > > > > > > > Hi Jorge,
> > > > > > > > > > > >> > > > > > > >
> > > > > > > > > > > >> > > > > > > > Thanks for the KIP! I have the same
> > > > confusion
> > > > > > with
> > > > > > > > the
> > > > > > > > > > > >> > > e2e-latency
> > > > > > > > > > > >> > > > > > > metrics
> > > > > > > > > > > >> > > > > > > > as Sagar above. "e2e" would seem to
> > > indicate
> > > > > the
> > > > > > > > latency
> > > > > > > > > > > >> > between
> > > > > > > > > > > >> > > > when
> > > > > > > > > > > >> > > > > > the
> > > > > > > > > > > >> > > > > > > > record was written to Kafka and when
> the
> > > > > record
> > > > > > > was
> > > > > > > > > > > written
> > > > > > > > > > > >> to
> > > > > > > > > > > >> > > the
> > > > > > > > > > > >> > > > > sink
> > > > > > > > > > > >> > > > > > > > system by the connector - however, as
> > per
> > > > the
> > > > > > KIP
> > > > > > > it
> > > > > > > > > > looks
> > > > > > > > > > > >> like
> > > > > > > > > > > >> > > it
> > > > > > > > > > > >> > > > > will
> > > > > > > > > > > >> > > > > > > be
> > > > > > > > > > > >> > > > > > > > the latency between when the record
> was
> > > > > written
> > > > > > to
> > > > > > > > Kafka
> > > > > > > > > > > and
> > > > > > > > > > > >> > when
> > > > > > > > > > > >> > > > the
> > > > > > > > > > > >> > > > > > > > record is returned by a sink task's
> > > > consumer's
> > > > > > > > poll? I
> > > > > > > > > > > think
> > > > > > > > > > > >> > that
> > > > > > > > > > > >> > > > > > metric
> > > > > > > > > > > >> > > > > > > > will be a little confusing to
> interpret.
> > > One
> > > > > > more
> > > > > > > > thing
> > > > > > > > > > -
> > > > > > > > > > > I
> > > > > > > > > > > >> was
> > > > > > > > > > > >> > > > > > wondering
> > > > > > > > > > > >> > > > > > > > if there's a particular reason for
> > having
> > > a
> > > > > min
> > > > > > > > metric
> > > > > > > > > > for
> > > > > > > > > > > >> e2e
> > > > > > > > > > > >> > > > > latency
> > > > > > > > > > > >> > > > > > > but
> > > > > > > > > > > >> > > > > > > > not for convert time?
> > > > > > > > > > > >> > > > > > > >
> > > > > > > > > > > >> > > > > > > > Thanks,
> > > > > > > > > > > >> > > > > > > > Yash
> > > > > > > > > > > >> > > > > > > >
> > > > > > > > > > > >> > > > > > > > On Thu, Sep 1, 2022 at 8:59 PM Sagar <
> > > > > > > > > > > >> > sagarmeansocean@gmail.com>
> > > > > > > > > > > >> > > > > > wrote:
> > > > > > > > > > > >> > > > > > > >
> > > > > > > > > > > >> > > > > > > > > Hi Jorge,
> > > > > > > > > > > >> > > > > > > > >
> > > > > > > > > > > >> > > > > > > > > Thanks for the KIP. It looks like a
> > very
> > > > > good
> > > > > > > > > > addition.
> > > > > > > > > > > I
> > > > > > > > > > > >> > > skimmed
> > > > > > > > > > > >> > > > > > > through
> > > > > > > > > > > >> > > > > > > > > once and had a couple of questions
> =>
> > > > > > > > > > > >> > > > > > > > >
> > > > > > > > > > > >> > > > > > > > > 1) I am assuming the new metrics
> would
> > > be
> > > > > task
> > > > > > > > level
> > > > > > > > > > > >> metric.
> > > > > > > > > > > >> > > > Could
> > > > > > > > > > > >> > > > > > you
> > > > > > > > > > > >> > > > > > > > > specify the way it's done for other
> > > > > > sink/source
> > > > > > > > > > > connector?
> > > > > > > > > > > >> > > > > > > > > 2) I am slightly confused about the
> > e2e
> > > > > > latency
> > > > > > > > > > metric.
> > > > > > > > > > > >> Let's
> > > > > > > > > > > >> > > > > > consider
> > > > > > > > > > > >> > > > > > > > the
> > > > > > > > > > > >> > > > > > > > > sink connector metric. If I look at
> > the
> > > > way
> > > > > > it's
> > > > > > > > > > > supposed
> > > > > > > > > > > >> to
> > > > > > > > > > > >> > be
> > > > > > > > > > > >> > > > > > > > calculated,
> > > > > > > > > > > >> > > > > > > > > i.e the difference between the
> record
> > > > > > timestamp
> > > > > > > > and
> > > > > > > > > > the
> > > > > > > > > > > >> wall
> > > > > > > > > > > >> > > > clock
> > > > > > > > > > > >> > > > > > > time,
> > > > > > > > > > > >> > > > > > > > it
> > > > > > > > > > > >> > > > > > > > > looks like a per record metric.
> > However,
> > > > the
> > > > > > > > put-batch
> > > > > > > > > > > >> time
> > > > > > > > > > > >> > > > > measures
> > > > > > > > > > > >> > > > > > > the
> > > > > > > > > > > >> > > > > > > > > time to put a batch of records to
> > > external
> > > > > > sink.
> > > > > > > > So, I
> > > > > > > > > > > >> would
> > > > > > > > > > > >> > > > assume
> > > > > > > > > > > >> > > > > > > the 2
> > > > > > > > > > > >> > > > > > > > > can't be added as is to compute the
> > e2e
> > > > > > latency.
> > > > > > > > > > Maybe I
> > > > > > > > > > > >> am
> > > > > > > > > > > >> > > > missing
> > > > > > > > > > > >> > > > > > > > > something here. Could you plz
> clarify
> > > > this.
> > > > > > > > > > > >> > > > > > > > >
> > > > > > > > > > > >> > > > > > > > > Thanks!
> > > > > > > > > > > >> > > > > > > > > Sagar.
> > > > > > > > > > > >> > > > > > > > >
> > > > > > > > > > > >> > > > > > > > > On Tue, Aug 30, 2022 at 8:43 PM
> Jorge
> > > > > Esteban
> > > > > > > > Quilcate
> > > > > > > > > > > >> Otoya
> > > > > > > > > > > >> > <
> > > > > > > > > > > >> > > > > > > > > quilcate.jorge@gmail.com> wrote:
> > > > > > > > > > > >> > > > > > > > >
> > > > > > > > > > > >> > > > > > > > > > Hi all,
> > > > > > > > > > > >> > > > > > > > > >
> > > > > > > > > > > >> > > > > > > > > > I'd like to start a discussion
> > thread
> > > on
> > > > > > > > KIP-864:
> > > > > > > > > > Add
> > > > > > > > > > > >> > > > End-To-End
> > > > > > > > > > > >> > > > > > > > Latency
> > > > > > > > > > > >> > > > > > > > > > Metrics to Connectors.
> > > > > > > > > > > >> > > > > > > > > > This KIP aims to improve the
> metrics
> > > > > > available
> > > > > > > > on
> > > > > > > > > > > Source
> > > > > > > > > > > >> > and
> > > > > > > > > > > >> > > > Sink
> > > > > > > > > > > >> > > > > > > > > > Connectors to measure end-to-end
> > > > latency,
> > > > > > > > including
> > > > > > > > > > > >> source
> > > > > > > > > > > >> > > and
> > > > > > > > > > > >> > > > > sink
> > > > > > > > > > > >> > > > > > > > > record
> > > > > > > > > > > >> > > > > > > > > > conversion time, and sink record
> e2e
> > > > > latency
> > > > > > > > > > (similar
> > > > > > > > > > > to
> > > > > > > > > > > >> > > > KIP-613
> > > > > > > > > > > >> > > > > > for
> > > > > > > > > > > >> > > > > > > > > > Streams).
> > > > > > > > > > > >> > > > > > > > > >
> > > > > > > > > > > >> > > > > > > > > > The KIP is here:
> > > > > > > > > > > >> > > > > > > > > >
> > > > > > > > > > > >> > > > > > > > > >
> > > > > > > > > > > >> > > > > > > > >
> > > > > > > > > > > >> > > > > > > >
> > > > > > > > > > > >> > > > > > >
> > > > > > > > > > > >> > > > > >
> > > > > > > > > > > >> > > > >
> > > > > > > > > > > >> > > >
> > > > > > > > > > > >> > >
> > > > > > > > > > > >> >
> > > > > > > > > > > >>
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-864%3A+Add+End-To-End+Latency+Metrics+to+Connectors
> > > > > > > > > > > >> > > > > > > > > >
> > > > > > > > > > > >> > > > > > > > > > Please take a look and let me know
> > > what
> > > > > you
> > > > > > > > think.
> > > > > > > > > > > >> > > > > > > > > >
> > > > > > > > > > > >> > > > > > > > > > Cheers,
> > > > > > > > > > > >> > > > > > > > > > Jorge.
> > > > > > > > > > > >> > > > > > > > > >
> > > > > > > > > > > >> > > > > > > > >
> > > > > > > > > > > >> > > > > > > >
> > > > > > > > > > > >> > > > > > >
> > > > > > > > > > > >> > > > > >
> > > > > > > > > > > >> > > > >
> > > > > > > > > > > >> > > >
> > > > > > > > > > > >> > >
> > > > > > > > > > > >> >
> > > > > > > > > > > >>
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] KIP-864: Add End-To-End Latency Metrics to Connectors

Posted by Jorge Esteban Quilcate Otoya <qu...@gmail.com>.
Sure! I have added the following to the proposed changes section:

```
The per-record metrics will definitely be added to Kafka Connect as part of
this KIP, but their metric level will be changed pending the performance
testing described in KAFKA-14441, and will otherwise only be exposed at a
lower level (DEBUG instead of INFO, and TRACE instead of DEBUG)
```
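For context, the gating behaviour these recording levels imply could be sketched as follows. This is my own illustration of the INFO/DEBUG/TRACE ordering, not code from the KIP or from Kafka's Metrics implementation:

```python
from enum import IntEnum

class RecordingLevel(IntEnum):
    # Illustrative ordering: higher value = more verbose
    INFO = 0
    DEBUG = 1
    TRACE = 2

def should_record(sensor_level: RecordingLevel, config_level: RecordingLevel) -> bool:
    """A sensor records only when the configured level is at least as verbose
    as the level the sensor was registered at."""
    return config_level >= sensor_level

# A per-record latency sensor registered at DEBUG is skipped under the
# default INFO config, so the hot path pays no Sensor::record cost:
assert should_record(RecordingLevel.DEBUG, RecordingLevel.INFO) is False
assert should_record(RecordingLevel.DEBUG, RecordingLevel.DEBUG) is True
assert should_record(RecordingLevel.TRACE, RecordingLevel.DEBUG) is False
```

The point being that demoting a metric to DEBUG (or TRACE) makes the per-record recording opt-in rather than an unconditional cost.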

Let me know how it looks.

Many thanks!
Jorge.

On Mon, 5 Dec 2022 at 14:11, Chris Egerton <ch...@aiven.io.invalid> wrote:

> Hi Jorge,
>
> Thanks for filing KAFKA-14441! In the ticket description we mention that
> "there will be more confidence whether to design metrics to be exposed at a
> DEBUG or INFO level depending on their impact" but it doesn't seem like
> this is called out in the KIP and, just based on what's in the KIP, the
> proposal is still to have several per-record metrics exposed at INFO level.
>
> Could we explicitly call out that the per-record metrics will definitely be
> added to Kafka Connect as part of this KIP, but they will only be exposed
> at INFO level pending the performance testing described in
> KAFKA-14441, and will otherwise only be exposed at DEBUG level? Otherwise,
> it's possible that a vote for the KIP as it's written today would be a vote
> in favor of unconditionally exposing these metrics at INFO level, even if
> the performance testing reveals issues.
>
> Cheers,
>
> Chris
>
> On Sun, Dec 4, 2022 at 7:08 PM Jorge Esteban Quilcate Otoya <
> quilcate.jorge@gmail.com> wrote:
>
> > Thanks for the reminder Chris!
> >
> > I have added a note on the KIP to include this as part of the KIP as most
> > of the metrics proposed are per-record and having all on DEBUG would
> limit
> > the benefits, and created
> > https://issues.apache.org/jira/browse/KAFKA-14441
> > to keep track of this task.
> >
> > Cheers,
> > Jorge.
> >
> > On Tue, 29 Nov 2022 at 19:40, Chris Egerton <ch...@aiven.io.invalid>
> > wrote:
> >
> > > Hi Jorge,
> > >
> > > Thanks! What were your thoughts on the possible benchmarking and/or
> > > downgrading of per-record metrics to DEBUG?
> > >
> > > Cheers,
> > >
> > > Chris
> > >
> > > On Thu, Nov 24, 2022 at 8:20 AM Jorge Esteban Quilcate Otoya <
> > > quilcate.jorge@gmail.com> wrote:
> > >
> > > > Thanks Chris! I have updated the KIP with "transform" instead of
> > "alias".
> > > > Agree it's clearer.
> > > >
> > > > Cheers,
> > > > Jorge.
> > > >
> > > > On Mon, 21 Nov 2022 at 21:36, Chris Egerton <chrise@aiven.io.invalid
> >
> > > > wrote:
> > > >
> > > > > Hi Jorge,
> > > > >
> > > > > Thanks for the updates, and apologies for the delay. The new
> diagram
> > > > > directly under the "Proposed Changes" section is absolutely
> gorgeous!
> > > > >
> > > > >
> > > > > Follow-ups:
> > > > >
> > > > > RE 2: Good point. We can use the same level for these metrics, it's
> > > not a
> > > > > big deal.
> > > > >
> > > > > RE 3: As long as all the per-record metrics are kept at DEBUG
> level,
> > it
> > > > > should be fine to leave JMH benchmarking for a follow-up. If we
> want
> > to
> > > > add
> > > > > new per-record, INFO-level metrics, I would be more comfortable
> with
> > > > > including benchmarking as part of the testing plan for the KIP. One
> > > > > possible compromise could be to propose that these features be
> merged
> > > at
> > > > > DEBUG level, and then possibly upgraded to INFO level in the future
> > > > pending
> > > > > benchmarks to guard against performance degradation.
> > > > >
> > > > > RE 4: I think for a true "end-to-end" metric, it'd be useful to
> > include
> > > > the
> > > > > time taken by the task to actually deliver the record. However,
> with
> > > the
> > > > > new metric names and descriptions provided in the KIP, I have no
> > > > objections
> > > > > with what's currently proposed, and a new "end-to-end" metric can
> be
> > > > taken
> > > > > on later in a follow-up KIP.
> > > > >
> > > > > RE 6: You're right, existing producer metrics should be enough for
> > now.
> > > > We
> > > > > can revisit this later if/when we add delivery-centric metrics for
> > sink
> > > > > tasks as well.
> > > > >
> > > > > RE 7: The new metric names in the KIP LGTM; I don't see any need to
> > > > expand
> > > > > beyond those but if you'd still like to pursue others, LMK.
> > > > >
> > > > >
> > > > > New thoughts:
> > > > >
> > > > > One small thought: instead of "alias" in "alias="{transform_alias}"
> > for
> > > > the
> > > > > per-transform metrics, could we use "transform"? IMO it's clearer
> > since
> > > > we
> > > > > don't use "alias" in the names of transform-related properties, and
> > > > "alias"
> > > > > may be confused with the classloading term where you can use, e.g.,
> > > > > "FileStreamSource" as the name of a connector class in a connector
> > > config
> > > > > instead of
> "org.apache.kafka.connect.file.FileStreamSourceConnector".
> > > > >
> > > > >
> > > > > Cheers,
> > > > >
> > > > > Chris
> > > > >
> > > > > On Fri, Nov 18, 2022 at 12:06 PM Jorge Esteban Quilcate Otoya <
> > > > > quilcate.jorge@gmail.com> wrote:
> > > > >
> > > > > > Thanks Mickael!
> > > > > >
> > > > > >
> > > > > > On Wed, 9 Nov 2022 at 15:54, Mickael Maison <
> > > mickael.maison@gmail.com>
> > > > > > wrote:
> > > > > >
> > > > > > > Hi Jorge,
> > > > > > >
> > > > > > > Thanks for the KIP, it is a nice improvement.
> > > > > > >
> > > > > > > 1) The per transformation metrics still have a question mark
> next
> > > to
> > > > > > > them in the KIP. Do you want to include them? If so we'll want
> to
> > > tag
> > > > > > > them, we should be able to include the aliases in
> > > TransformationChain
> > > > > > > and use them.
> > > > > > >
> > > > > >
> > > > > > Yes, I have added the changes on TransformChain that will be
> needed
> > > to
> > > > > add
> > > > > > these metrics.
> > > > > >
> > > > > >
> > > > > > >
> > > > > > > 2) I see no references to predicates. If we don't want to
> measure
> > > > > > > their latency, can we say it explicitly?
> > > > > > >
> > > > > >
> > > > > > Good question, I haven't considered these. Though as these are
> > > > > materialized
> > > > > > as PredicatedTransformation, they should be covered by these
> > changes.
> > > > > > Adding a note about this.
> > > > > >
> > > > > >
> > > > > > >
> > > > > > > 3) Should we have sink-record-batch-latency-avg-ms? All other
> > > metrics
> > > > > > > have both the maximum and average values.
> > > > > > >
> > > > > > >
> > > > > > Good question. I will remove it and change the record latency
> from
> > > > > > DEBUG->INFO as it already covers the maximum metric.
> > > > > >
> > > > > > Hope it's clearer now, let me know if there any additional
> > feedback.
> > > > > > Thanks!
> > > > > >
> > > > > >
> > > > > >
> > > > > > > Thanks,
> > > > > > > Mickael
> > > > > > >
> > > > > > > On Thu, Oct 20, 2022 at 9:58 PM Jorge Esteban Quilcate Otoya
> > > > > > > <qu...@gmail.com> wrote:
> > > > > > > >
> > > > > > > > Thanks, Chris! Great feedback! Please, find my comments
> below:
> > > > > > > >
> > > > > > > > On Thu, 13 Oct 2022 at 18:52, Chris Egerton
> > > > <chrise@aiven.io.invalid
> > > > > >
> > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Hi Jorge,
> > > > > > > > >
> > > > > > > > > Thanks for the KIP. I agree with the overall direction and
> > > think
> > > > > this
> > > > > > > would
> > > > > > > > > be a nice improvement to Kafka Connect. Here are my initial
> > > > > thoughts
> > > > > > > on the
> > > > > > > > > details:
> > > > > > > > >
> > > > > > > > > 1. The motivation section outlines the gaps in Kafka
> > Connect's
> > > > task
> > > > > > > metrics
> > > > > > > > > nicely. I think it'd be useful to include more concrete
> > details
> > > > on
> > > > > > why
> > > > > > > > > these gaps need to be filled in, and in which cases
> > additional
> > > > > > metrics
> > > > > > > > > would be helpful. One goal could be to provide enhanced
> > > > monitoring
> > > > > of
> > > > > > > > > production deployments that allows for cluster
> administrators
> > > to
> > > > > set
> > > > > > up
> > > > > > > > > automatic alerts for latency spikes and, if triggered,
> > quickly
> > > > > > > identify the
> > > > > > > > > root cause of those alerts, reducing the time to
> remediation.
> > > > > Another
> > > > > > > goal
> > > > > > > > > could be to provide more insight to developers or cluster
> > > > > > > administrators
> > > > > > > > > who want to do performance testing on connectors in
> > > > non-production
> > > > > > > > > environments. It may help guide our decision making process
> > to
> > > > > have a
> > > > > > > > > clearer picture of the goals we're trying to achieve.
> > > > > > > > >
> > > > > > > >
> > > > > > > > Agree. The Motivation section has been updated.
> > > > > > > > Thanks for the examples, I see both of them being covered by
> > the
> > > > KIP.
> > > > > > > > I see how these could give us a good distinction on whether
> to
> > > > > position
> > > > > > > > some metrics at INFO or DEBUG level.
> > > > > > > >
> > > > > > > >
> > > > > > > > > 2. If we're trying to address the alert-and-diagnose use
> > case,
> > > > it'd
> > > > > > be
> > > > > > > > > useful to have as much information as possible at INFO
> level,
> > > > > rather
> > > > > > > than
> > > > > > > > > forcing cluster administrators to possibly reconfigure a
> > > > connector
> > > > > to
> > > > > > > emit
> > > > > > > > > DEBUG or TRACE level metrics in order to diagnose a
> potential
> > > > > > > > > production-impacting performance bottleneck. I can see the
> > > > > rationale
> > > > > > > for
> > > > > > > > > emitting per-record metrics that track an average value at
> > > DEBUG
> > > > > > > level, but
> > > > > > > > > for per-record metrics that track a maximum value, is there
> > any
> > > > > > reason
> > > > > > > not
> > > > > > > > > to provide this information at INFO level?
> > > > > > > > >
> > > > > > > >
> > > > > > > > Agree. Though with Max and Avg metrics being part of the same
> > > > sensor
> > > > > —
> > > > > > > > where Metric Level is defined — then both metrics get the
> same
> > > > level.
> > > > > > > >
> > > > > > > >
> > > > > > > > > 3. I'm also curious about the performance testing suggested
> > by
> > > > Yash
> > > > > > to
> > > > > > > > > gauge the potential impact of this change. Have you been
> able
> > > to
> > > > do
> > > > > > any
> > > > > > > > > testing with your draft implementation yet?
> > > > > > > > >
> > > > > > > >
> > > > > > > > No, not so far.
> > > > > > > > I think it would be valuable to discuss the scope of this
> > testing
> > > > and
> > > > > > > maybe
> > > > > > > > tackle it
> > > > > > > > in a separate issue as Sensors and Metrics are used all over
> > the
> > > > > place.
> > > > > > > > My initial understanding is that these tests should be placed
> > in
> > > > the
> > > > > > > > jmh-benchmarks[1].
> > > > > > > > Then, we could target testing Sensors and Metrics, and
> validate
> > > how
> > > > > > much
> > > > > > > > overhead
> > > > > > > > is added by having only Max vs Max,Avg(,Min), etc.
> > > > > > > > In the other hand, we could extend this to Transformers or
> > other
> > > > > > Connect
> > > > > > > > layers.
> > > > > > > >
> > > > > > > > Here are some pointers to the Sensors and Metrics
> > implementations
> > > > > that
> > > > > > > > could be considered:
> > > > > > > > Path to metric recording:
> > > > > > > > -
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/apache/kafka/blob/5cab11cf525f6c06fcf9eb43f7f95ef33fe1cdbb/clients/src/main/java/org/apache/kafka/common/metrics/Sensor.java#L195-L199
> > > > > > > > -
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/apache/kafka/blob/5cab11cf525f6c06fcf9eb43f7f95ef33fe1cdbb/clients/src/main/java/org/apache/kafka/common/metrics/Sensor.java#L230-L244
> > > > > > > >
> > > > > > > > ```
> > > > > > > > // increment all the stats
> > > > > > > > for (StatAndConfig statAndConfig : this.stats) {
> > > > > > > >    statAndConfig.stat.record(statAndConfig.config(), value,
> > > > timeMs);
> > > > > > > > }
> > > > > > > > ```
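As a concrete illustration of the hot path quoted above (a hypothetical stand-in, not the real Kafka classes — `SensorSketch` and its fields are made up for this sketch): each `record()` call fans out the value to every registered stat, and each stat update reduces to a single primitive operation, so per-record cost grows with the number of stats on the sensor:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.DoubleConsumer;

// Hypothetical stand-in for Sensor.record: not Kafka code, names invented.
public class SensorSketch {
    final List<DoubleConsumer> stats = new ArrayList<>();
    double sum = 0;                          // Avg-style accumulator
    double max = Double.NEGATIVE_INFINITY;   // Max-style accumulator

    SensorSketch() {
        stats.add(v -> sum += v);               // one addition per record
        stats.add(v -> max = Math.max(max, v)); // one comparison per record
    }

    // Analogous to Sensor.record: fan the value out to every stat.
    void record(double value) {
        for (DoubleConsumer stat : stats) {
            stat.accept(value);
        }
    }

    public static void main(String[] args) {
        SensorSketch sensor = new SensorSketch();
        for (double v : new double[] {2, 9, 4}) {
            sensor.record(v);
        }
        System.out.println(sensor.sum); // prints 15.0
        System.out.println(sensor.max); // prints 9.0
    }
}
```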
> > > > > > > >
> > > > > > > > SampledStats:
> > > > > > > > - Avg:
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/apache/kafka/blob/068ab9cefae301f3187ea885d645c425955e77d2/clients/src/main/java/org/apache/kafka/common/metrics/stats/Avg.java
> > > > > > > > - Max:
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/apache/kafka/blob/068ab9cefae301f3187ea885d645c425955e77d2/clients/src/main/java/org/apache/kafka/common/metrics/stats/Max.java
> > > > > > > > - Min:
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/apache/kafka/blob/068ab9cefae301f3187ea885d645c425955e77d2/clients/src/main/java/org/apache/kafka/common/metrics/stats/Min.java
> > > > > > > >
> > > > > > > > `stat#record()` are implemented by `update` method in
> > > SampledStat:
> > > > > > > >
> > > > > > > > ```Max.java
> > > > > > > >     @Override
> > > > > > > >     protected void update(Sample sample, MetricConfig config,
> > > > > > > > double value, long now) {
> > > > > > > >         sample.value = Math.max(sample.value, value);
> > > > > > > >     }
> > > > > > > > ```
> > > > > > > >
> > > > > > > > ```Avg.java
> > > > > > > >     @Override
> > > > > > > >     protected void update(Sample sample, MetricConfig config,
> > > > > > > > double value, long now) {
> > > > > > > >         sample.value += value;
> > > > > > > >     }
> > > > > > > > ```
> > > > > > > >
> > > > > > > > As far as I understand, most of the work of the stats happens
> > on
> > > > the
> > > > > > > > `combine` method that is not part of the connector execution
> > but
> > > > > called
> > > > > > > > when metrics are queried.
> > > > > > > >
> > > > > > > > I wonder whether we should consider Avg and Max for all
> metrics
> > > > > > proposed
> > > > > > > as
> > > > > > > > the impact on the execution path seems minimal, and even see
> if
> > > Min
> > > > > is
> > > > > > > also
> > > > > > > > valuable, and use DEBUG only for more granular metrics.
> > > > > > > >
> > > > > > > > [1]
> https://github.com/apache/kafka/tree/trunk/jmh-benchmarks
> > > > > > > >
> > > > > > > >
> > > > > > > > > 4. Just to make sure I understand correctly--does "time
> when
> > it
> > > > has
> > > > > > > been
> > > > > > > > > received by the Sink task" refer to the wallclock time
> > directly
> > > > > > after a
> > > > > > > > > call to SinkTask::put has been completed (as opposed to
> > > directly
> > > > > > before
> > > > > > > > > that call is made, or something else entirely)?
> > > > > > > > >
> > > > > > > >
> > > > > > > > It currently means when it has been received by the Sink task
> > > > > > > > right after consumer poll and before conversions.
> > > > > > > > Would it be valuable to have it after put-sink-records?
> > > > > > > >
> > > > > > > >
> > > > > > > > > 5. If the goal is to identify performance bottlenecks
> (either
> > > in
> > > > > > > production
> > > > > > > > > or pre-production environments), would it make sense to
> > > introduce
> > > > > > > metrics
> > > > > > > > > for each individual converter (i.e., key/value/header) and
> > > > > > > transformation?
> > > > > > > > > It's definitely an improvement to be able to identify the
> > total
> > > > > time
> > > > > > > for
> > > > > > > > > conversion and transformation, but then the immediate
> > follow-up
> > > > > > > question if
> > > > > > > > > a bottleneck is found in that phase is "which
> > > > > > converter/transformation
> > > > > > > is
> > > > > > > > > responsible?" It'd be nice if we could provide a way to
> > quickly
> > > > > > answer
> > > > > > > that
> > > > > > > > > question.
> > > > > > > > >
> > > > > > > >
> > > > > > > > This is a great idea. I'd like to consider this as well,
> though
> > > > maybe
> > > > > > > these
> > > > > > > > more granular
> > > > > > > > metrics would be good to have them as DEBUG.
> > > > > > > >
> > > > > > > >
> > > > > > > > > 6. Any thoughts about offering latency metrics for source
> > tasks
> > > > > > between
> > > > > > > > > receipt of the record from the task and delivery of the
> > record
> > > to
> > > > > > Kafka
> > > > > > > > > (which would be tracked by producer callback)? We could
> also
> > > use
> > > > > the
> > > > > > > record
> > > > > > > > > timestamp either instead of or in addition to receipt time
> if
> > > the
> > > > > > task
> > > > > > > > > provides a timestamp with its records.
> > > > > > > > >
> > > > > > > >
> > > > > > > > With source transform and convert metrics we get part of that
> > > > > latency.
> > > > > > > > Looking at the Producer metrics, `request-latency` (though a
> > very
> > > > > > generic
> > > > > > > > metric)
> > > > > > > > sort of captures the time between send request and ack — if my
> > > > > > > understanding
> > > > > > > > is correct.
> > > > > > > > Would these be enough or you're thinking about another
> > approach?
> > > > > > > > maybe a custom metric to cover the producer side?
> > > > > > > >
> > > > > > > >
> > > > > > > > > 7. We may end up introducing a way for sink tasks to record
> > > > > > per-record
> > > > > > > > > delivery to the sink system (see KIP-767 [1]). I'd like it
> if
> > > we
> > > > > > could
> > > > > > > keep
> > > > > > > > > the names of our metrics very precise in order to avoid
> > > confusing
> > > > > > users
> > > > > > > > > (who may think that we're providing metrics on actual
> > delivery
> > > to
> > > > > the
> > > > > > > sink
> > > > > > > > > system, which may not be the case if the connector performs
> > > > > > > asynchronous
> > > > > > > > > writes), and in order to leave room for a metrics on true
> > > > delivery
> > > > > > > time by
> > > > > > > > > sink tasks. It'd also be nice if we could remain consistent
> > > with
> > > > > > > existing
> > > > > > > > > metrics such as "put-batch-avg-time-ms". With that in mind,
> > > what
> > > > do
> > > > > > you
> > > > > > > > > think about renaming these metrics:
> > > > > > > > > - "sink-record-batch-latency-max-ms" to
> > > > "put-batch-avg-latency-ms"
> > > > > > > > > - "sink-record-latency-max-ms" to
> > > > "put-sink-record-latency-max-ms"
> > > > > > > > > - "sink-record-latency-avg-ms" to
> > > > "put-sink-record-latency-avg-ms"
> > > > > > > > > - "sink-record-convert-transform-time-max-ms" to
> > > > > > > > > "convert-transform-sink-record-time-max-ms"
> > > > > > > > > - "sink-record-convert-transform-time-avg-ms" to
> > > > > > > > > "convert-transform-sink-record-time-avg-ms"
> > > > > > > > > - "source-record-transform-convert-time-max-ms" to
> > > > > > > > > "transform-convert-source-record-time-max-ms"
> > > > > > > > > - "source-record-transform-convert-time-avg-ms" to
> > > > > > > > > "transform-convert-source-record-time-avg-ms"
> > > > > > > > >
> > > > > > > >
> > > > > > > > Makes sense, thanks! I have updated the list of metrics and
> > group
> > > > them
> > > > > > by
> > > > > > > > sensor and applying these suggestions.
> > > > > > > > The only ones that I want to review are: sink-record-* to
> > > > put-batch-*
> > > > > > > > (first 3). Not sure if put-batch/put-sink-record describes
> the
> > > > > purpose
> > > > > > of
> > > > > > > > the metric — neither `sink-record-latency` to be honest.
> > > > > > > > My initial thought was to have something like Kafka Streams
> > > > > > e2e-latency.
> > > > > > > > Based on 4. and 6. questions, an idea could be to add:
> > > > > > > > - source-batch-e2e-latency-before-send: measure wallclock -
> > > source
> > > > > > record
> > > > > > > > timestamp after source connector poll.
> > > > > > > > - source-batch-e2e-latency-after-send: measure wallclock -
> > record
> > > > > > > timestamp
> > > > > > > > on producer send callback
> > > > > > > > - sink-batch-e2e-latency-before-put: measure time wallclock -
> > > > record
> > > > > > > > timestamp after consumer poll
> > > > > > > > - sink-batch-e2e-latency-after-put: measure time wallclock -
> > > record
> > > > > > > > timestamp after sink connector put.
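A rough sketch of how these measurements could be taken (illustrative only — `e2eLatencyMs` and the sample timestamps below are assumptions, not the KIP's implementation): each proposed metric is the wall-clock time at its measurement point minus the record timestamp:

```java
// Hypothetical sketch of the proposed e2e latency computation.
public class E2eLatencySketch {
    static long e2eLatencyMs(long wallClockMs, long recordTimestampMs) {
        return wallClockMs - recordTimestampMs;
    }

    public static void main(String[] args) {
        long recordTs = 1_000L;    // record timestamp (source or broker)
        long afterPollMs = 1_150L; // "before-put" point: right after consumer poll
        long afterPutMs = 1_400L;  // "after-put" point: after the sink task's put
        System.out.println(e2eLatencyMs(afterPollMs, recordTs)); // prints 150
        System.out.println(e2eLatencyMs(afterPutMs, recordTs));  // prints 400
    }
}
```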
> > > > > > > >
> > > > > > > >
> > > > > > > > > Thanks again for the KIP! Looking forward to your thoughts.
> > > > > > > > >
> > > > > > > > > Cheers,
> > > > > > > > >
> > > > > > > > > Chris
> > > > > > > > >
> > > > > > > > > [1] -
> > > > > > > > >
> > > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-767%3A+Connect+Latency+Metrics
> > > > > > > > >
> > > > > > > > > On Thu, Sep 15, 2022 at 1:32 PM Jorge Esteban Quilcate
> Otoya
> > <
> > > > > > > > > quilcate.jorge@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > > > Hi everyone,
> > > > > > > > > >
> > > > > > > > > > I've made a slight addition to the KIP based on Yash
> > > feedback:
> > > > > > > > > >
> > > > > > > > > > - A new metric is added at INFO level to record the max
> > > latency
> > > > > > from
> > > > > > > the
> > > > > > > > > > batch timestamp, by keeping the oldest record timestamp
> per
> > > > > batch.
> > > > > > > > > > - A draft implementation is linked.
> > > > > > > > > >
> > > > > > > > > > Looking forward to your feedback.
> > > > > > > > > > Also, a kindly reminder that the vote thread is open.
> > > > > > > > > >
> > > > > > > > > > Thanks!
> > > > > > > > > > Jorge.
> > > > > > > > > >
> > > > > > > > > > On Thu, 8 Sept 2022 at 14:25, Jorge Esteban Quilcate
> Otoya
> > <
> > > > > > > > > > quilcate.jorge@gmail.com> wrote:
> > > > > > > > > >
> > > > > > > > > > > Great. I have updated the KIP to reflect this.
> > > > > > > > > > >
> > > > > > > > > > > Cheers,
> > > > > > > > > > > Jorge.
> > > > > > > > > > >
> > > > > > > > > > > On Thu, 8 Sept 2022 at 12:26, Yash Mayya <
> > > > yash.mayya@gmail.com
> > > > > >
> > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > >> Thanks, I think it makes sense to define these metrics
> > at
> > > a
> > > > > > DEBUG
> > > > > > > > > > >> recording
> > > > > > > > > > >> level.
> > > > > > > > > > >>
> > > > > > > > > > >> On Thu, Sep 8, 2022 at 2:51 PM Jorge Esteban Quilcate
> > > Otoya
> > > > <
> > > > > > > > > > >> quilcate.jorge@gmail.com> wrote:
> > > > > > > > > > >>
> > > > > > > > > > >> > On Thu, 8 Sept 2022 at 05:55, Yash Mayya <
> > > > > > yash.mayya@gmail.com>
> > > > > > > > > > wrote:
> > > > > > > > > > >> >
> > > > > > > > > > >> > > Hi Jorge,
> > > > > > > > > > >> > >
> > > > > > > > > > >> > > Thanks for the changes. With regard to having per
> > > batch
> > > > vs
> > > > > > per
> > > > > > > > > > record
> > > > > > > > > > >> > > metrics, the additional overhead I was referring
> to
> > > > wasn't
> > > > > > > about
> > > > > > > > > > >> whether
> > > > > > > > > > >> > or
> > > > > > > > > > >> > > not we would need to iterate over all the records
> > in a
> > > > > > batch.
> > > > > > > I
> > > > > > > > > was
> > > > > > > > > > >> > > referring to the potential additional overhead
> > caused
> > > by
> > > > > the
> > > > > > > > > higher
> > > > > > > > > > >> > volume
> > > > > > > > > > >> > > of calls to Sensor::record on the sensors for the
> > new
> > > > > > metrics
> > > > > > > (as
> > > > > > > > > > >> > compared
> > > > > > > > > > >> > > to the existing batch only metrics), especially
> for
> > > high
> > > > > > > > > throughput
> > > > > > > > > > >> > > connectors where batch sizes could be large. I
> guess
> > > we
> > > > > may
> > > > > > > want
> > > > > > > > > to
> > > > > > > > > > do
> > > > > > > > > > >> > some
> > > > > > > > > > >> > > sort of performance testing and get concrete
> numbers
> > > to
> > > > > > verify
> > > > > > > > > > whether
> > > > > > > > > > >> > this
> > > > > > > > > > >> > > is a valid concern or not?
> > > > > > > > > > >> > >
> > > > > > > > > > >> >
> > > > > > > > > > >> > 6.1. Got it, thanks for clarifying. I guess there
> > could
> > > > be a
> > > > > > > > > benchmark
> > > > > > > > > > >> test
> > > > > > > > > > >> > of the `Sensor::record` to get an idea of the
> > > performance
> > > > > > > impact.
> > > > > > > > > > >> > Regardless, the fact that these are single-record
> > > metrics
> > > > > > > compared
> > > > > > > > > to
> > > > > > > > > > >> > existing batch-only could be explicitly defined by
> > > setting
> > > > > > these
> > > > > > > > > > >> metrics at
> > > > > > > > > > >> > a DEBUG or TRACE metric recording level, leaving the
> > > > > existing
> > > > > > at
> > > > > > > > > INFO
> > > > > > > > > > >> > level.
> > > > > > > > > > >> > wdyt?
> > > > > > > > > > >> >
> > > > > > > > > > >> >
> > > > > > > > > > >> > >
> > > > > > > > > > >> > > Thanks,
> > > > > > > > > > >> > > Yash
> > > > > > > > > > >> > >
> > > > > > > > > > >> > > On Tue, Sep 6, 2022 at 4:42 PM Jorge Esteban
> > Quilcate
> > > > > Otoya
> > > > > > <
> > > > > > > > > > >> > > quilcate.jorge@gmail.com> wrote:
> > > > > > > > > > >> > >
> > > > > > > > > > >> > > > Hi Sagar and Yash,
> > > > > > > > > > >> > > >
> > > > > > > > > > >> > > > > the way it's defined in
> > > > > > > > > > >> > > >
> > > > > > https://kafka.apache.org/documentation/#connect_monitoring
> > > > > > > for
> > > > > > > > > > the
> > > > > > > > > > >> > > metrics
> > > > > > > > > > >> > > >
> > > > > > > > > > >> > > > 4.1. Got it. Add it to the KIP.
> > > > > > > > > > >> > > >
> > > > > > > > > > >> > > > > The only thing I would argue is do we need
> > > > > > > > > > >> sink-record-latency-min?
> > > > > > > > > > >> > > Maybe
> > > > > > > > > > >> > > > we
> > > > > > > > > > >> > > > > could remove this min metric as well and make
> > all
> > > of
> > > > > the
> > > > > > > 3 e2e
> > > > > > > > > > >> > metrics
> > > > > > > > > > >> > > > > consistent
> > > > > > > > > > >> > > >
> > > > > > > > > > >> > > > 4.2 I see. Will remove it from the KIP.
> > > > > > > > > > >> > > >
> > > > > > > > > > >> > > > > Probably users can track the metrics at their
> > end
> > > to
> > > > > > > > > > >> > > > > figure that out. Do you think that makes
> sense?
> > > > > > > > > > >> > > >
> > > > > > > > > > >> > > > 4.3. Yes, agree. With these new metrics it
> should
> > be
> > > > > > easier
> > > > > > > for
> > > > > > > > > > >> users
> > > > > > > > > > >> > to
> > > > > > > > > > >> > > > track this.
> > > > > > > > > > >> > > >
> > > > > > > > > > >> > > > > I think it makes sense to not have a min
> metric
> > > for
> > > > > > > either to
> > > > > > > > > > >> remain
> > > > > > > > > > >> > > > > consistent with the existing put-batch and
> > > > poll-batch
> > > > > > > metrics
> > > > > > > > > > >> > > >
> > > > > > > > > > >> > > > 5.1. Got it. Same as 4.2
> > > > > > > > > > >> > > >
> > > > > > > > > > >> > > > > Another naming related suggestion I had was
> with
> > > the
> > > > > > > > > > >> > > > > "convert-time" metrics - we should probably
> > > include
> > > > > > > > > > >> transformations
> > > > > > > > > > >> > in
> > > > > > > > > > >> > > > the
> > > > > > > > > > >> > > > > name since SMTs could definitely be
> attributable
> > > to
> > > > a
> > > > > > > sizable
> > > > > > > > > > >> chunk
> > > > > > > > > > >> > of
> > > > > > > > > > >> > > > the
> > > > > > > > > > >> > > > > latency depending on the specific
> transformation
> > > > > chain.
> > > > > > > > > > >> > > >
> > > > > > > > > >> > > > 5.2. Makes sense. I'm proposing to add
> > > > > > > > > > >> > `sink-record-convert-transform...`
> > > > > > > > > > >> > > > and `source-record-transform-convert...` to
> > > represent
> > > > > > > correctly
> > > > > > > > > > the
> > > > > > > > > > >> > order
> > > > > > > > > > >> > > > of operations.
> > > > > > > > > > >> > > >
> > > > > > > > > > >> > > > > it seems like both source and sink tasks only
> > > record
> > > > > > > metrics
> > > > > > > > > at
> > > > > > > > > > a
> > > > > > > > > > >> > > "batch"
> > > > > > > > > > >> > > > > level, not on an individual record level. I
> > think
> > > it
> > > > > > > might be
> > > > > > > > > > >> > > additional
> > > > > > > > > > >> > > > > overhead if we want to record these new
> metrics
> > > all
> > > > at
> > > > > > the
> > > > > > > > > > record
> > > > > > > > > > >> > > level?
> > > > > > > > > > >> > > >
> > > > > > > > > > >> > > > 5.3. I considered at the beginning to implement
> > all
> > > > > > metrics
> > > > > > > at
> > > > > > > > > the
> > > > > > > > > > >> > batch
> > > > > > > > > > >> > > > level, but given how the framework process
> > records,
> > > I
> > > > > > > fallback
> > > > > > > > > to
> > > > > > > > > > >> the
> > > > > > > > > > >> > > > proposed approach:
> > > > > > > > > > >> > > > - Sink Task:
> > > > > > > > > > >> > > >   - `WorkerSinkTask#convertMessages(msgs)`
> already
> > > > > > iterates
> > > > > > > over
> > > > > > > > > > >> > records,
> > > > > > > > > > >> > > > so there is no additional overhead to capture
> > record
> > > > > > > latency per
> > > > > > > > > > >> > record.
> > > > > > > > > > >> > > >     -
> > > > > > > > > > >> > > >
> > > > > > > > > > >> > > >
> > > > > > > > > > >> > >
> > > > > > > > > > >> >
> > > > > > > > > > >>
> > > > > > > > > >
> > > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/apache/kafka/blob/9841647c4fe422532f448423c92d26e4fdcb8932/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/WorkerSinkTask.java#L490-L514
> > > > > > > > > > >> > > >   -
> > > `WorkerSinkTask#convertAndTransformRecord(record)`
> > > > > > > actually
> > > > > > > > > > >> happens
> > > > > > > > > > >> > > > individually. Measuring this operation per batch
> > > would
> > > > > > > include
> > > > > > > > > > >> > processing
> > > > > > > > > > >> > > > that is not strictly part of "convert and
> > transform"
> > > > > > > > > > >> > > >     -
> > > > > > > > > > >> > > >
> > > > > > > > > > >> > > >
> > > > > > > > > > >> > >
> > > > > > > > > > >> >
> > > > > > > > > > >>
> > > > > > > > > >
> > > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/apache/kafka/blob/9841647c4fe422532f448423c92d26e4fdcb8932/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/WorkerSinkTask.java#L518
> > > > > > > > > > >> > > > - Source Task:
> > > > > > > > > > >> > > >   - `AbstractWorkerSourceTask#sendRecords`
> > iterates
> > > > > over a
> > > > > > > batch
> > > > > > > > > > and
> > > > > > > > > > >> > > > applies transforms and convert record
> individually
> > > as
> > > > > > well:
> > > > > > > > > > >> > > >     -
> > > > > > > > > > >> > > >
> > > > > > > > > > >> > > >
> > > > > > > > > > >> > >
> > > > > > > > > > >> >
> > > > > > > > > > >>
> > > > > > > > > >
> > > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/apache/kafka/blob/9841647c4fe422532f448423c92d26e4fdcb8932/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/AbstractWorkerSourceTask.java#L389-L390
> > > > > > > > > > >> > > >
> > > > > > > > > > >> > > > > This might require some additional changes -
> > > > > > > > > > >> > > > > for instance, with the "sink-record-latency"
> > > metric,
> > > > > we
> > > > > > > might
> > > > > > > > > > only
> > > > > > > > > > >> > want
> > > > > > > > > > >> > > > to
> > > > > > > > > > >> > > > > have a "max" metric since "avg" would require
> > > > > recording
> > > > > > a
> > > > > > > > > value
> > > > > > > > > > on
> > > > > > > > > > >> > the
> > > > > > > > > > >> > > > > sensor for each record (whereas we can get a
> > "max"
> > > > by
> > > > > > only
> > > > > > > > > > >> recording
> > > > > > > > > > >> > a
> > > > > > > > > > >> > > > > metric value for the oldest record in each
> > batch).
> > > > > > > > > > >> > > >
> > > > > > > > > > >> > > > 5.4. Recording record-latency per batch may not
> be
> > > as
> > > > > > > useful as
> > > > > > > > > > >> there
> > > > > > > > > > >> > is
> > > > > > > > > > >> > > no
> > > > > > > > > > >> > > > guarantee that the oldest record will be
> > > > representative
> > > > > of
> > > > > > > the
> > > > > > > > > > >> batch.
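To make that max-vs-avg point concrete (a hypothetical sketch; the class and method names are illustrative): the batch maximum is fully determined by the oldest record alone, while the average needs a value per record and a single straggler pulls the two numbers far apart:

```java
// Illustrative only: why "max from the oldest record" works but is not
// representative of the batch as a whole.
public class BatchLatencySketch {
    // Max latency over the batch == latency of the oldest timestamp.
    static long maxFromOldest(long wallClockMs, long[] timestampsMs) {
        long oldest = Long.MAX_VALUE;
        for (long ts : timestampsMs) {
            oldest = Math.min(oldest, ts);
        }
        return wallClockMs - oldest;
    }

    // Average latency requires touching every record in the batch.
    static double avg(long wallClockMs, long[] timestampsMs) {
        double sum = 0;
        for (long ts : timestampsMs) {
            sum += wallClockMs - ts;
        }
        return sum / timestampsMs.length;
    }

    public static void main(String[] args) {
        long now = 1_000L;
        long[] batch = {now - 300, now - 10, now - 20}; // one old straggler
        System.out.println(maxFromOldest(now, batch)); // prints 300
        System.out.println(avg(now, batch));           // prints 110.0
    }
}
```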
> > > > > > > > > > >> > > >
> > > > > > > > > > >> > > > On Sat, 3 Sept 2022 at 16:02, Yash Mayya <
> > > > > > > yash.mayya@gmail.com>
> > > > > > > > > > >> wrote:
> > > > > > > > > > >> > > >
> > > > > > > > > > >> > > > > Hi Jorge and Sagar,
> > > > > > > > > > >> > > > >
> > > > > > > > > > >> > > > > I think it makes sense to not have a min
> metric
> > > for
> > > > > > > either to
> > > > > > > > > > >> remain
> > > > > > > > > > >> > > > > consistent with the existing put-batch and
> > > > poll-batch
> > > > > > > metrics
> > > > > > > > > > (it
> > > > > > > > > > >> > > doesn't
> > > > > > > > > > >> > > > > seem particularly useful either anyway). Also,
> > the
> > > > new
> > > > > > > > > > >> > > > > "sink-record-latency" metric name looks fine
> to
> > > me,
> > > > > > > thanks for
> > > > > > > > > > >> making
> > > > > > > > > > >> > > the
> > > > > > > > > > >> > > > > changes! Another naming related suggestion I
> had
> > > was
> > > > > > with
> > > > > > > the
> > > > > > > > > > >> > > > > "convert-time" metrics - we should probably
> > > include
> > > > > > > > > > >> transformations
> > > > > > > > > > >> > in
> > > > > > > > > > >> > > > the
> > > > > > > > > > >> > > > > name since SMTs could definitely be
> attributable
> > > to
> > > > a
> > > > > > > sizable
> > > > > > > > > > >> chunk
> > > > > > > > > > >> > of
> > > > > > > > > > >> > > > the
> > > > > > > > > > >> > > > > latency depending on the specific
> transformation
> > > > > chain.
> > > > > > > > > > >> > > > >
> > > > > > > > > > >> > > > > I have one high level question with respect to
> > > > > > > implementation
> > > > > > > > > -
> > > > > > > > > > >> > > > currently,
> > > > > > > > > > >> > > > > it seems like both source and sink tasks only
> > > record
> > > > > > > metrics
> > > > > > > > > at
> > > > > > > > > > a
> > > > > > > > > > >> > > "batch"
> > > > > > > > > > >> > > > > level, not on an individual record level. I
> > think
> > > it
> > > > > > > might be
> > > > > > > > > > >> > > additional
> > > > > > > > > > >> > > > > overhead if we want to record these new
> metrics
> > > all
> > > > at
> > > > > > the
> > > > > > > > > > record
> > > > > > > > > > >> > > level?
> > > > > > > > > > >> > > > > Could we instead make all of these new metrics
> > for
> > > > > > > batches of
> > > > > > > > > > >> records
> > > > > > > > > > >> > > > > rather than individual records in order to
> > remain
> > > > > > > consistent
> > > > > > > > > > with
> > > > > > > > > > >> the
> > > > > > > > > > >> > > > > existing task level metrics? This might
> require
> > > some
> > > > > > > > > additional
> > > > > > > > > > >> > > changes -
> > > > > > > > > > >> > > > > for instance, with the "sink-record-latency"
> > > metric,
> > > > > we
> > > > > > > might
> > > > > > > > > > only
> > > > > > > > > > >> > want
> > > > > > > > > > >> > > > to
> > > > > > > > > > >> > > > > have a "max" metric since "avg" would require
> > > > > recording
> > > > > > a
> > > > > > > > > value
> > > > > > > > > > on
> > > > > > > > > > >> > the
> > > > > > > > > > >> > > > > sensor for each record (whereas we can get a
> > "max"
> > > > by
> > > > > > only
> > > > > > > > > > >> recording
> > > > > > > > > > >> > a
> > > > > > > > > > >> > > > > metric value for the oldest record in each
> > batch).
> > > > > > > > > > >> > > > >
> > > > > > > > > > >> > > > > Thanks,
> > > > > > > > > > >> > > > > Yash
> > > > > > > > > > >> > > > >
> > > > > > > > > > >> > > > > On Fri, Sep 2, 2022 at 3:16 PM Sagar <
> > > > > > > > > sagarmeansocean@gmail.com
> > > > > > > > > > >
> > > > > > > > > > >> > > wrote:
> > > > > > > > > > >> > > > >
> > > > > > > > > > >> > > > > > Hi Jorge,
> > > > > > > > > > >> > > > > >
> > > > > > > > > > >> > > > > > Thanks for the changes.
> > > > > > > > > > >> > > > > >
> > > > > > > > > > >> > > > > > Regarding the metrics, I meant something
> like
> > > > this:
> > > > > > > > > > >> > > > > >
> > > > > > > > > > >> > > > >
> > > > > > > > > > >> > > >
> > > > > > > > > > >> > >
> > > > > > > > > > >> >
> > > > > > > > > > >>
> > > > > > > > > >
> > > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> kafka.connect:type=sink-task-metrics,connector="{connector}",task="{task}"
> > > > > > > > > > >> > > > > >
> > > > > > > > > > >> > > > > > the way it's defined in
> > > > > > > > > > >> > > > > >
> > > > > > > https://kafka.apache.org/documentation/#connect_monitoring
> > > > > > > > > > for
> > > > > > > > > > >> the
> > > > > > > > > > >> > > > > > metrics.
> > > > > > > > > > >> > > > > >
> > > > > > > > > > >> > > > > > I see what you mean by the 3 metrics and how
> > it
> > > > can
> > > > > be
> > > > > > > > > > >> interpreted.
> > > > > > > > > > >> > > The
> > > > > > > > > > >> > > > > > only thing I would argue is do we need
> > > > > > > > > > sink-record-latency-min?
> > > > > > > > > > >> > Maybe
> > > > > > > > > > >> > > > we
> > > > > > > > > > >> > > > > > could remove this min metric as well and
> make
> > > all
> > > > of
> > > > > > > the 3
> > > > > > > > > e2e
> > > > > > > > > > >> > > metrics
> > > > > > > > > > >> > > > > > consistent(since put-batch also doesn't
> > expose a
> > > > min
> > > > > > > which
> > > > > > > > > > makes
> > > > > > > > > > >> > > sense
> > > > > > > > > > >> > > > to
> > > > > > > > > > >> > > > > > me). I think this is in contrast to what
> Yash
> > > > > pointed
> > > > > > > out
> > > > > > > > > > above
> > > > > > > > > > >> so
> > > > > > > > > > >> > I
> > > > > > > > > > >> > > > > would
> > > > > > > > > > >> > > > > > like to hear his thoughts as well.
> > > > > > > > > > >> > > > > >
> > > > > > > > > > >> > > > > > The other point Yash mentioned about the
> > > slightly
> > > > > > flawed
> > > > > > > > > > >> definition
> > > > > > > > > > >> > > of
> > > > > > > > > > >> > > > > e2e
> > > > > > > > > > >> > > > > > is also true in a sense. But I have a
> feeling
> > > > that's
> > > > > > > one the
> > > > > > > > > > >> > records
> > > > > > > > > > >> > > > are
> > > > > > > > > > >> > > > > > polled by the connector tasks, it would be
> > > > difficult
> > > > > > to
> > > > > > > > > track
> > > > > > > > > > >> the
> > > > > > > > > > >> > > final
> > > > > > > > > > >> > > > > leg
> > > > > > > > > > >> > > > > > via the framework. Probably users can track
> > the
> > > > > > metrics
> > > > > > > at
> > > > > > > > > > their
> > > > > > > > > > >> > end
> > > > > > > > > > >> > > to
> > > > > > > > > > >> > > > > > figure that out. Do you think that makes
> > sense?
> > > > > > > > > > >> > > > > >
> > > > > > > > > > >> > > > > > Thanks!
> > > > > > > > > > >> > > > > > Sagar.
> > > > > > > > > > >> > > > > >
> > > > > > > > > > >> > > > > >
> > > > > > > > > > >> > > > > >
> > > > > > > > > > >> > > > > >
> > > > > > > > > > >> > > > > > On Thu, Sep 1, 2022 at 11:40 PM Jorge
> Esteban
> > > > > Quilcate
> > > > > > > > > Otoya <
> > > > > > > > > > >> > > > > > quilcate.jorge@gmail.com> wrote:
> > > > > > > > > > >> > > > > >
> > > > > > > > > > >> > > > > > > Hi Sagar and Yash,
> > > > > > > > > > >> > > > > > >
> > > > > > > > > > >> > > > > > > Thanks for your feedback!
> > > > > > > > > > >> > > > > > >
> > > > > > > > > > >> > > > > > > > 1) I am assuming the new metrics would
> be
> > > task
> > > > > > level
> > > > > > > > > > metric.
> > > > > > > > > > >> > > > > > >
> > > > > > > > > > >> > > > > > > 1.1 Yes, it will be a task level metric,
> > > > > implemented
> > > > > > > on
> > > > > > > > > the
> > > > > > > > > > >> > > > > > > Worker[Source/Sink]Task.
> > > > > > > > > > >> > > > > > >
> > > > > > > > > > >> > > > > > > > Could you specify the way it's done for
> > > other
> > > > > > > > > sink/source
> > > > > > > > > > >> > > > connector?
> > > > > > > > > > >> > > > > > >
> > > > > > > > > > >> > > > > > > 1.2. Not sure what you mean by this.
> > Could
> > > > you
> > > > > > > > > elaborate
> > > > > > > > > > a
> > > > > > > > > > >> bit
> > > > > > > > > > >> > > > more?
> > > > > > > > > > >> > > > > > >
> > > > > > > > > > >> > > > > > > > 2. I am slightly confused about the e2e
> > > > latency
> > > > > > > > > metric...
> > > > > > > > > > >> > > > > > >
> > > > > > > > > > >> > > > > > > 2.1. Yes, I see. I was trying to bring a
> > > similar
> > > > > > > concept
> > > > > > > > > as
> > > > > > > > > > in
> > > > > > > > > > >> > > > Streams
> > > > > > > > > > >> > > > > > with
> > > > > > > > > > >> > > > > > > KIP-613, though the e2e concept may not be
> > > > > > > translatable.
> > > > > > > > > > >> > > > > > > We could keep it as `sink-record-latency`
> to
> > > > avoid
> > > > > > > > > > conflating
> > > > > > > > > > >> > > > > concepts. A
> > > > > > > > > > >> > > > > > > similar metric naming was proposed in
> > KIP-489
> > > > but
> > > > > at
> > > > > > > the
> > > > > > > > > > >> consumer
> > > > > > > > > > >> > > > > level —
> > > > > > > > > > >> > > > > > > though it seems dormant for a couple of
> > years.
> > > > > > > > > > >> > > > > > >
> > > > > > > > > > >> > > > > > > > However, the put-batch time measures the
> > > > > > > > > > >> > > > > > > > time to put a batch of records to
> external
> > > > sink.
> > > > > > > So, I
> > > > > > > > > > would
> > > > > > > > > > >> > > assume
> > > > > > > > > > >> > > > > > the 2
> > > > > > > > > > >> > > > > > > > can't be added as is to compute the e2e
> > > > latency.
> > > > > > > Maybe I
> > > > > > > > > > am
> > > > > > > > > > >> > > missing
> > > > > > > > > > >> > > > > > > > something here. Could you plz clarify
> > this.
> > > > > > > > > > >> > > > > > >
> > > > > > > > > > >> > > > > > > 2.2. Yes, agree. Not necessarily added,
> but
> > > with
> > > > > > the 3
> > > > > > > > > > >> latencies
> > > > > > > > > > >> > > > (poll,
> > > > > > > > > > >> > > > > > > convert, putBatch) it will be clearer where
> the
> > > > > > > bottleneck
> > > > > > > > > may
> > > > > > > > > > >> be,
> > > > > > > > > > >> > and
> > > > > > > > > > >> > > > > > > represent the internal processing.
> > > > > > > > > > >> > > > > > >
> > > > > > > > > > >> > > > > > > > however, as per the KIP it looks like it
> > > will
> > > > be
> > > > > > > > > > >> > > > > > > > the latency between when the record was
> > > > written
> > > > > to
> > > > > > > Kafka
> > > > > > > > > > and
> > > > > > > > > > >> > when
> > > > > > > > > > >> > > > the
> > > > > > > > > > >> > > > > > > > record is returned by a sink task's
> > > consumer's
> > > > > > poll?
> > > > > > > > > > >> > > > > > >
> > > > > > > > > > >> > > > > > > 3.1. Agree. 2.1. could help to clarify
> this.
> > > > > > > > > > >> > > > > > >
> > > > > > > > > > >> > > > > > > > One more thing - I was wondering
> > > > > > > > > > >> > > > > > > > if there's a particular reason for
> having
> > a
> > > > min
> > > > > > > metric
> > > > > > > > > for
> > > > > > > > > > >> e2e
> > > > > > > > > > >> > > > > latency
> > > > > > > > > > >> > > > > > > but
> > > > > > > > > > >> > > > > > > > not for convert time?
> > > > > > > > > > >> > > > > > >
> > > > > > > > > > >> > > > > > > 3.2. Was following KIP-613 for e2e which
> > seems
> > > > > > useful
> > > > > > > to
> > > > > > > > > > >> compare
> > > > > > > > > > >> > > with
> > > > > > > > > > >> > > > > > Max and
> > > > > > > > > > >> > > > > > > get an idea of the window of results,
> though
> > > > > current
> > > > > > > > > > >> latencies in
> > > > > > > > > > >> > > > > > Connector
> > > > > > > > > > >> > > > > > > do not include Min, and that's why I
> haven't
> > > > added
> > > > > > it
> > > > > > > for
> > > > > > > > > > >> convert
> > > > > > > > > > >> > > > > > latency.
> > > > > > > > > > >> > > > > > > Do you think it makes sense to extend
> latency
> > > > > metrics
> > > > > > > with
> > > > > > > > > > Min?
> > > > > > > > > > >> > > > > > >
> > > > > > > > > > >> > > > > > > KIP is updated to clarify some of these
> > > changes.
> > > > > > > > > > >> > > > > > >
> > > > > > > > > > >> > > > > > > Many thanks,
> > > > > > > > > > >> > > > > > > Jorge.
> > > > > > > > > > >> > > > > > >
> > > > > > > > > > >> > > > > > > On Thu, 1 Sept 2022 at 18:11, Yash Mayya <
> > > > > > > > > > >> yash.mayya@gmail.com>
> > > > > > > > > > >> > > > wrote:
> > > > > > > > > > >> > > > > > >
> > > > > > > > > > >> > > > > > > > Hi Jorge,
> > > > > > > > > > >> > > > > > > >
> > > > > > > > > > >> > > > > > > > Thanks for the KIP! I have the same
> > > confusion
> > > > > with
> > > > > > > the
> > > > > > > > > > >> > > e2e-latency
> > > > > > > > > > >> > > > > > > metrics
> > > > > > > > > > >> > > > > > > > as Sagar above. "e2e" would seem to
> > indicate
> > > > the
> > > > > > > latency
> > > > > > > > > > >> > between
> > > > > > > > > > >> > > > when
> > > > > > > > > > >> > > > > > the
> > > > > > > > > > >> > > > > > > > record was written to Kafka and when the
> > > > record
> > > > > > was
> > > > > > > > > > written
> > > > > > > > > > >> to
> > > > > > > > > > >> > > the
> > > > > > > > > > >> > > > > sink
> > > > > > > > > > >> > > > > > > > system by the connector - however, as
> per
> > > the
> > > > > KIP
> > > > > > it
> > > > > > > > > looks
> > > > > > > > > > >> like
> > > > > > > > > > >> > > it
> > > > > > > > > > >> > > > > will
> > > > > > > > > > >> > > > > > > be
> > > > > > > > > > >> > > > > > > > the latency between when the record was
> > > > written
> > > > > to
> > > > > > > Kafka
> > > > > > > > > > and
> > > > > > > > > > >> > when
> > > > > > > > > > >> > > > the
> > > > > > > > > > >> > > > > > > > record is returned by a sink task's
> > > consumer's
> > > > > > > poll? I
> > > > > > > > > > think
> > > > > > > > > > >> > that
> > > > > > > > > > >> > > > > > metric
> > > > > > > > > > >> > > > > > > > will be a little confusing to interpret.
> > One
> > > > > more
> > > > > > > thing
> > > > > > > > > -
> > > > > > > > > > I
> > > > > > > > > > >> was
> > > > > > > > > > >> > > > > > wondering
> > > > > > > > > > >> > > > > > > > if there's a particular reason for
> having
> > a
> > > > min
> > > > > > > metric
> > > > > > > > > for
> > > > > > > > > > >> e2e
> > > > > > > > > > >> > > > > latency
> > > > > > > > > > >> > > > > > > but
> > > > > > > > > > >> > > > > > > > not for convert time?
> > > > > > > > > > >> > > > > > > >
> > > > > > > > > > >> > > > > > > > Thanks,
> > > > > > > > > > >> > > > > > > > Yash
> > > > > > > > > > >> > > > > > > >
> > > > > > > > > > >> > > > > > > > On Thu, Sep 1, 2022 at 8:59 PM Sagar <
> > > > > > > > > > >> > sagarmeansocean@gmail.com>
> > > > > > > > > > >> > > > > > wrote:
> > > > > > > > > > >> > > > > > > >
> > > > > > > > > > >> > > > > > > > > Hi Jorge,
> > > > > > > > > > >> > > > > > > > >
> > > > > > > > > > >> > > > > > > > > Thanks for the KIP. It looks like a
> very
> > > > good
> > > > > > > > > addition.
> > > > > > > > > > I
> > > > > > > > > > >> > > skimmed
> > > > > > > > > > >> > > > > > > through
> > > > > > > > > > >> > > > > > > > > once and had a couple of questions =>
> > > > > > > > > > >> > > > > > > > >
> > > > > > > > > > >> > > > > > > > > 1) I am assuming the new metrics would
> > be
> > > > task
> > > > > > > level
> > > > > > > > > > >> metric.
> > > > > > > > > > >> > > > Could
> > > > > > > > > > >> > > > > > you
> > > > > > > > > > >> > > > > > > > > specify the way it's done for other
> > > > > sink/source
> > > > > > > > > > connector?
> > > > > > > > > > >> > > > > > > > > 2) I am slightly confused about the
> e2e
> > > > > latency
> > > > > > > > > metric.
> > > > > > > > > > >> Let's
> > > > > > > > > > >> > > > > > consider
> > > > > > > > > > >> > > > > > > > the
> > > > > > > > > > >> > > > > > > > > sink connector metric. If I look at
> the
> > > way
> > > > > it's
> > > > > > > > > > supposed
> > > > > > > > > > >> to
> > > > > > > > > > >> > be
> > > > > > > > > > >> > > > > > > > calculated,
> > > > > > > > > > >> > > > > > > > > i.e the difference between the record
> > > > > timestamp
> > > > > > > and
> > > > > > > > > the
> > > > > > > > > > >> wall
> > > > > > > > > > >> > > > clock
> > > > > > > > > > >> > > > > > > time,
> > > > > > > > > > >> > > > > > > > it
> > > > > > > > > > >> > > > > > > > > looks like a per record metric.
> However,
> > > the
> > > > > > > put-batch
> > > > > > > > > > >> time
> > > > > > > > > > >> > > > > measures
> > > > > > > > > > >> > > > > > > the
> > > > > > > > > > >> > > > > > > > > time to put a batch of records to
> > external
> > > > > sink.
> > > > > > > So, I
> > > > > > > > > > >> would
> > > > > > > > > > >> > > > assume
> > > > > > > > > > >> > > > > > > the 2
> > > > > > > > > > >> > > > > > > > > can't be added as is to compute the
> e2e
> > > > > latency.
> > > > > > > > > Maybe I
> > > > > > > > > > >> am
> > > > > > > > > > >> > > > missing
> > > > > > > > > > >> > > > > > > > > something here. Could you plz clarify
> > > this.
> > > > > > > > > > >> > > > > > > > >
> > > > > > > > > > >> > > > > > > > > Thanks!
> > > > > > > > > > >> > > > > > > > > Sagar.
> > > > > > > > > > >> > > > > > > > >
> > > > > > > > > > >> > > > > > > > > On Tue, Aug 30, 2022 at 8:43 PM Jorge
> > > > Esteban
> > > > > > > Quilcate
> > > > > > > > > > >> Otoya
> > > > > > > > > > >> > <
> > > > > > > > > > >> > > > > > > > > quilcate.jorge@gmail.com> wrote:
> > > > > > > > > > >> > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > Hi all,
> > > > > > > > > > >> > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > I'd like to start a discussion
> thread
> > on
> > > > > > > KIP-864:
> > > > > > > > > Add
> > > > > > > > > > >> > > > End-To-End
> > > > > > > > > > >> > > > > > > > Latency
> > > > > > > > > > >> > > > > > > > > > Metrics to Connectors.
> > > > > > > > > > >> > > > > > > > > > This KIP aims to improve the metrics
> > > > > available
> > > > > > > on
> > > > > > > > > > Source
> > > > > > > > > > >> > and
> > > > > > > > > > >> > > > Sink
> > > > > > > > > > >> > > > > > > > > > Connectors to measure end-to-end
> > > latency,
> > > > > > > including
> > > > > > > > > > >> source
> > > > > > > > > > >> > > and
> > > > > > > > > > >> > > > > sink
> > > > > > > > > > >> > > > > > > > > record
> > > > > > > > > > >> > > > > > > > > > conversion time, and sink record e2e
> > > > latency
> > > > > > > > > (similar
> > > > > > > > > > to
> > > > > > > > > > >> > > > KIP-613
> > > > > > > > > > >> > > > > > for
> > > > > > > > > > >> > > > > > > > > > Streams).
> > > > > > > > > > >> > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > The KIP is here:
> > > > > > > > > > >> > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > >
> > > > > > > > > > >> > > > > > > > >
> > > > > > > > > > >> > > > > > > >
> > > > > > > > > > >> > > > > > >
> > > > > > > > > > >> > > > > >
> > > > > > > > > > >> > > > >
> > > > > > > > > > >> > > >
> > > > > > > > > > >> > >
> > > > > > > > > > >> >
> > > > > > > > > > >>
> > > > > > > > > >
> > > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-864%3A+Add+End-To-End+Latency+Metrics+to+Connectors
> > > > > > > > > > >> > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > Please take a look and let me know
> > what
> > > > you
> > > > > > > think.
> > > > > > > > > > >> > > > > > > > > >
> > > > > > > > > > >> > > > > > > > > > Cheers,
> > > > > > > > > > >> > > > > > > > > > Jorge.
> > > > > > > > > > >> > > > > > > > > >
> > > > > > > > > > >> > > > > > > > >
> > > > > > > > > > >> > > > > > > >
> > > > > > > > > > >> > > > > > >
> > > > > > > > > > >> > > > > >
> > > > > > > > > > >> > > > >
> > > > > > > > > > >> > > >
> > > > > > > > > > >> > >
> > > > > > > > > > >> >
> > > > > > > > > > >>
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] KIP-864: Add End-To-End Latency Metrics to Connectors

Posted by Chris Egerton <ch...@aiven.io.INVALID>.
Hi Jorge,

Thanks for filing KAFKA-14441! In the ticket description we mention that
"there will be more confidence whether to design metrics to be exposed at a
DEBUG or INFO level depending on their impact" but it doesn't seem like
this is called out in the KIP and, just based on what's in the KIP, the
proposal is still to have several per-record metrics exposed at INFO level.

Could we explicitly call out that the per-record metrics will definitely be
added to Kafka Connect as part of this KIP, but they will only be exposed
at INFO level pending the performance testing described in
KAFKA-14441, and will otherwise only be exposed at DEBUG level? Otherwise,
it's possible that a vote for the KIP as it's written today would be a vote
in favor of unconditionally exposing these metrics at INFO level, even if
the performance testing reveals issues.
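
To make the cost model concrete, here's a minimal, self-contained sketch of
the recording-level gating being discussed. This is not the actual Connect
implementation (class, method, and level names here are hypothetical); it only
illustrates the idea that a per-record latency sensor registered at DEBUG
costs a single comparison when the metrics config is at INFO:

```java
// Hypothetical sketch of recording-level gating for a per-record latency
// sensor; not the actual Kafka Connect implementation.
public class LatencySensorSketch {
    enum RecordingLevel { INFO, DEBUG }

    private final RecordingLevel configLevel; // level enabled in the metrics config
    private final RecordingLevel sensorLevel; // level this sensor is registered at

    private long count = 0;
    private double sum = 0.0;
    private double max = Double.NEGATIVE_INFINITY;

    LatencySensorSketch(RecordingLevel configLevel, RecordingLevel sensorLevel) {
        this.configLevel = configLevel;
        this.sensorLevel = sensorLevel;
    }

    // Called once per record: latency = wall clock time - record timestamp.
    void record(double latencyMs) {
        // A DEBUG sensor is skipped entirely when the config is INFO, so
        // the per-record overhead is just this comparison.
        if (sensorLevel == RecordingLevel.DEBUG && configLevel == RecordingLevel.INFO)
            return;
        count++;
        sum += latencyMs;               // feeds the -avg metric
        max = Math.max(max, latencyMs); // feeds the -max metric
    }

    double avg() { return count == 0 ? Double.NaN : sum / count; }
    double max() { return max; }
    long count() { return count; }

    public static void main(String[] args) {
        LatencySensorSketch info =
            new LatencySensorSketch(RecordingLevel.INFO, RecordingLevel.INFO);
        long recordTimestamp = 1_000L;                // record's Kafka timestamp
        long[] wallClockAtReceipt = {1_050L, 1_200L}; // when the task saw each record
        for (long now : wallClockAtReceipt)
            info.record(now - recordTimestamp);
        System.out.println("avg=" + info.avg() + " max=" + info.max());
        // prints: avg=125.0 max=200.0

        // Same records against a DEBUG sensor under an INFO config: no-op.
        LatencySensorSketch debug =
            new LatencySensorSketch(RecordingLevel.INFO, RecordingLevel.DEBUG);
        for (long now : wallClockAtReceipt)
            debug.record(now - recordTimestamp);
        System.out.println("debug sensor records under INFO config: " + debug.count());
        // prints: debug sensor records under INFO config: 0
    }
}
```

Under that scheme, merging the metrics at DEBUG and later promoting them to
INFO is purely a config-level change, which is why gating the vote on the
KAFKA-14441 benchmarks seems safe.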

Cheers,

Chris

On Sun, Dec 4, 2022 at 7:08 PM Jorge Esteban Quilcate Otoya <
quilcate.jorge@gmail.com> wrote:

> Thanks for the reminder Chris!
>
> I have added a note on the KIP to include this as part of the KIP as most
> of the metrics proposed are per-record and having all on DEBUG would limit
> the benefits, and created
> https://issues.apache.org/jira/browse/KAFKA-14441
> to keep track of this task.
>
> Cheers,
> Jorge.
>
> On Tue, 29 Nov 2022 at 19:40, Chris Egerton <ch...@aiven.io.invalid>
> wrote:
>
> > Hi Jorge,
> >
> > Thanks! What were your thoughts on the possible benchmarking and/or
> > downgrading of per-record metrics to DEBUG?
> >
> > Cheers,
> >
> > Chris
> >
> > On Thu, Nov 24, 2022 at 8:20 AM Jorge Esteban Quilcate Otoya <
> > quilcate.jorge@gmail.com> wrote:
> >
> > > Thanks Chris! I have updated the KIP with "transform" instead of
> "alias".
> > > Agree it's clearer.
> > >
> > > Cheers,
> > > Jorge.
> > >
> > > On Mon, 21 Nov 2022 at 21:36, Chris Egerton <ch...@aiven.io.invalid>
> > > wrote:
> > >
> > > > Hi Jorge,
> > > >
> > > > Thanks for the updates, and apologies for the delay. The new diagram
> > > > directly under the "Proposed Changes" section is absolutely gorgeous!
> > > >
> > > >
> > > > Follow-ups:
> > > >
> > > > RE 2: Good point. We can use the same level for these metrics, it's
> > not a
> > > > big deal.
> > > >
> > > > RE 3: As long as all the per-record metrics are kept at DEBUG level,
> it
> > > > should be fine to leave JMH benchmarking for a follow-up. If we want
> to
> > > add
> > > > new per-record, INFO-level metrics, I would be more comfortable with
> > > > including benchmarking as part of the testing plan for the KIP. One
> > > > possible compromise could be to propose that these features be merged
> > at
> > > > DEBUG level, and then possibly upgraded to INFO level in the future
> > > pending
> > > > benchmarks to guard against performance degradation.
> > > >
> > > > RE 4: I think for a true "end-to-end" metric, it'd be useful to
> include
> > > the
> > > > time taken by the task to actually deliver the record. However, with
> > the
> > > > new metric names and descriptions provided in the KIP, I have no
> > > objections
> > > > with what's currently proposed, and a new "end-to-end" metric can be
> > > taken
> > > > on later in a follow-up KIP.
> > > >
> > > > RE 6: You're right, existing producer metrics should be enough for
> now.
> > > We
> > > > can revisit this later if/when we add delivery-centric metrics for
> sink
> > > > tasks as well.
> > > >
> > > > RE 7: The new metric names in the KIP LGTM; I don't see any need to
> > > expand
> > > > beyond those but if you'd still like to pursue others, LMK.
> > > >
> > > >
> > > > New thoughts:
> > > >
> > > > One small thought: instead of "alias" in "alias="{transform_alias}"
> for
> > > the
> > > > per-transform metrics, could we use "transform"? IMO it's clearer
> since
> > > we
> > > > don't use "alias" in the names of transform-related properties, and
> > > "alias"
> > > > may be confused with the classloading term where you can use, e.g.,
> > > > "FileStreamSource" as the name of a connector class in a connector
> > config
> > > > instead of "org.apache.kafka.connect.file.FileStreamSourceConnector".
> > > >
> > > >
> > > > Cheers,
> > > >
> > > > Chris
> > > >
> > > > On Fri, Nov 18, 2022 at 12:06 PM Jorge Esteban Quilcate Otoya <
> > > > quilcate.jorge@gmail.com> wrote:
> > > >
> > > > > Thanks Mickael!
> > > > >
> > > > >
> > > > > On Wed, 9 Nov 2022 at 15:54, Mickael Maison <
> > mickael.maison@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > Hi Jorge,
> > > > > >
> > > > > > Thanks for the KIP, it is a nice improvement.
> > > > > >
> > > > > > 1) The per transformation metrics still have a question mark next
> > to
> > > > > > them in the KIP. Do you want to include them? If so we'll want to
> > tag
> > > > > > them, we should be able to include the aliases in
> > TransformationChain
> > > > > > and use them.
> > > > > >
> > > > >
> > > > > Yes, I have added the changes on TransformChain that will be needed
> > to
> > > > add
> > > > > these metrics.
> > > > >
> > > > >
> > > > > >
> > > > > > 2) I see no references to predicates. If we don't want to measure
> > > > > > their latency, can we say it explicitly?
> > > > > >
> > > > >
> > > > > Good question, I haven't considered these. Though as these are
> > > > materialized
> > > > > as PredicatedTransformation, they should be covered by these
> changes.
> > > > > Adding a note about this.
> > > > >
> > > > >
> > > > > >
> > > > > > 3) Should we have sink-record-batch-latency-avg-ms? All other
> > metrics
> > > > > > have both the maximum and average values.
> > > > > >
> > > > > >
> > > > > Good question. I will remove it and change the record latency from
> > > > > DEBUG->INFO as it already covers the maximum metric.
> > > > >
> > > > > Hope it's clearer now, let me know if there is any additional
> feedback.
> > > > > Thanks!
> > > > >
> > > > >
> > > > >
> > > > > > Thanks,
> > > > > > Mickael
> > > > > >
> > > > > > On Thu, Oct 20, 2022 at 9:58 PM Jorge Esteban Quilcate Otoya
> > > > > > <qu...@gmail.com> wrote:
> > > > > > >
> > > > > > > Thanks, Chris! Great feedback! Please, find my comments below:
> > > > > > >
> > > > > > > On Thu, 13 Oct 2022 at 18:52, Chris Egerton
> > > <chrise@aiven.io.invalid
> > > > >
> > > > > > wrote:
> > > > > > >
> > > > > > > > Hi Jorge,
> > > > > > > >
> > > > > > > > Thanks for the KIP. I agree with the overall direction and
> > think
> > > > this
> > > > > > would
> > > > > > > > be a nice improvement to Kafka Connect. Here are my initial
> > > > thoughts
> > > > > > on the
> > > > > > > > details:
> > > > > > > >
> > > > > > > > 1. The motivation section outlines the gaps in Kafka
> Connect's
> > > task
> > > > > > metrics
> > > > > > > > nicely. I think it'd be useful to include more concrete
> details
> > > on
> > > > > why
> > > > > > > > these gaps need to be filled in, and in which cases
> additional
> > > > > metrics
> > > > > > > > would be helpful. One goal could be to provide enhanced
> > > monitoring
> > > > of
> > > > > > > > production deployments that allows for cluster administrators
> > to
> > > > set
> > > > > up
> > > > > > > > automatic alerts for latency spikes and, if triggered,
> quickly
> > > > > > identify the
> > > > > > > > root cause of those alerts, reducing the time to remediation.
> > > > Another
> > > > > > goal
> > > > > > > > could be to provide more insight to developers or cluster
> > > > > > administrators
> > > > > > > > who want to do performance testing on connectors in
> > > non-production
> > > > > > > > environments. It may help guide our decision making process
> to
> > > > have a
> > > > > > > > clearer picture of the goals we're trying to achieve.
> > > > > > > >
> > > > > > >
> > > > > > > Agree. The Motivation section has been updated.
> > > > > > > Thanks for the examples, I see both of them being covered by
> the
> > > KIP.
> > > > > > > I see how these could give us a good distinction on whether to
> > > > position
> > > > > > > some metrics at INFO or DEBUG level.
> > > > > > >
> > > > > > >
> > > > > > > > 2. If we're trying to address the alert-and-diagnose use
> case,
> > > it'd
> > > > > be
> > > > > > > > useful to have as much information as possible at INFO level,
> > > > rather
> > > > > > than
> > > > > > > > forcing cluster administrators to possibly reconfigure a
> > > connector
> > > > to
> > > > > > emit
> > > > > > > > DEBUG or TRACE level metrics in order to diagnose a potential
> > > > > > > > production-impacting performance bottleneck. I can see the
> > > > rationale
> > > > > > for
> > > > > > > > emitting per-record metrics that track an average value at
> > DEBUG
> > > > > > level, but
> > > > > > > > for per-record metrics that track a maximum value, is there
> any
> > > > > reason
> > > > > > not
> > > > > > > > to provide this information at INFO level?
> > > > > > > >
> > > > > > >
> > > > > > > Agree. Though with Max and Avg metrics being part of the same
> > > sensor
> > > > —
> > > > > > > where Metric Level is defined — then both metrics get the same
> > > level.
> > > > > > >
> > > > > > >
> > > > > > > > 3. I'm also curious about the performance testing suggested
> by
> > > Yash
> > > > > to
> > > > > > > > gauge the potential impact of this change. Have you been able
> > to
> > > do
> > > > > any
> > > > > > > > testing with your draft implementation yet?
> > > > > > > >
> > > > > > >
> > > > > > > No, not so far.
> > > > > > > I think it would be valuable to discuss the scope of this
> testing
> > > and
> > > > > > maybe
> > > > > > > tackle it
> > > > > > > in a separate issue as Sensors and Metrics are used all over
> the
> > > > place.
> > > > > > > My initial understanding is that these tests should be placed
> in
> > > the
> > > > > > > jmh-benchmarks[1].
> > > > > > > Then, we could target testing Sensors and Metrics, and validate
> > how
> > > > > much
> > > > > > > overhead
> > > > > > > is added by having only Max vs Max,Avg(,Min), etc.
> > > > > > > In the other hand, we could extend this to Transformers or
> other
> > > > > Connect
> > > > > > > layers.
> > > > > > >
> > > > > > > Here are some pointers to the Sensors and Metrics
> implementations
> > > > that
> > > > > > > could be considered:
> > > > > > > Path to metric recording:
> > > > > > > -
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/apache/kafka/blob/5cab11cf525f6c06fcf9eb43f7f95ef33fe1cdbb/clients/src/main/java/org/apache/kafka/common/metrics/Sensor.java#L195-L199
> > > > > > > -
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/apache/kafka/blob/5cab11cf525f6c06fcf9eb43f7f95ef33fe1cdbb/clients/src/main/java/org/apache/kafka/common/metrics/Sensor.java#L230-L244
> > > > > > >
> > > > > > > ```
> > > > > > > // increment all the stats
> > > > > > > for (StatAndConfig statAndConfig : this.stats) {
> > > > > > >    statAndConfig.stat.record(statAndConfig.config(), value,
> > > timeMs);
> > > > > > > }
> > > > > > > ```
> > > > > > >
> > > > > > > SampledStats:
> > > > > > > - Avg:
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/apache/kafka/blob/068ab9cefae301f3187ea885d645c425955e77d2/clients/src/main/java/org/apache/kafka/common/metrics/stats/Avg.java
> > > > > > > - Max:
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/apache/kafka/blob/068ab9cefae301f3187ea885d645c425955e77d2/clients/src/main/java/org/apache/kafka/common/metrics/stats/Max.java
> > > > > > > - Min:
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/apache/kafka/blob/068ab9cefae301f3187ea885d645c425955e77d2/clients/src/main/java/org/apache/kafka/common/metrics/stats/Min.java
> > > > > > >
> > > > > > > `stat#record()` are implemented by `update` method in
> > SampledStat:
> > > > > > >
> > > > > > > ```Avg.java
> > > > > > >     @Override
> > > > > > >     protected void update(Sample sample, MetricConfig config,
> > > > > > > double value, long now) {
> > > > > > >         sample.value += value;
> > > > > > >     }
> > > > > > > ```
> > > > > > >
> > > > > > > ```Max.java
> > > > > > >     @Override
> > > > > > >     protected void update(Sample sample, MetricConfig config,
> > > > > > > double value, long now) {
> > > > > > >         sample.value = Math.max(sample.value, value);
> > > > > > >     }
> > > > > > > ```
> > > > > > >
> > > > > > > As far as I understand, most of the work of the stats happens
> on
> > > the
> > > > > > > `combine` method that is not part of the connector execution
> but
> > > > called
> > > > > > > when metrics are queried.
> > > > > > >
> > > > > > > I wonder whether we should consider Avg and Max for all metrics
> > > > > proposed
> > > > > > as
> > > > > > > the impact on the execution path seems minimal, and even see if
> > Min
> > > > is
> > > > > > also
> > > > > > > valuable, and use DEBUG only for more granular metrics.
> > > > > > >
> > > > > > > [1] https://github.com/apache/kafka/tree/trunk/jmh-benchmarks
> > > > > > >
> > > > > > >
> > > > > > > > 4. Just to make sure I understand correctly--does "time when
> it
> > > has
> > > > > > been
> > > > > > > > received by the Sink task" refer to the wallclock time
> directly
> > > > > after a
> > > > > > > > call to SinkTask::put has been completed (as opposed to
> > directly
> > > > > before
> > > > > > > > that call is made, or something else entirely)?
> > > > > > > >
> > > > > > >
> > > > > > > It currently means when it has been received by the Sink task
> > > > > > > right after consumer poll and before conversions.
> > > > > > > Would it be valuable to have it after put-sink-records?
> > > > > > >
> > > > > > >
> > > > > > > > 5. If the goal is to identify performance bottlenecks (either
> > in
> > > > > > production
> > > > > > > > or pre-production environments), would it make sense to
> > introduce
> > > > > > metrics
> > > > > > > > for each individual converter (i.e., key/value/header) and
> > > > > > transformation?
> > > > > > > > It's definitely an improvement to be able to identify the
> total
> > > > time
> > > > > > for
> > > > > > > > conversion and transformation, but then the immediate
> follow-up
> > > > > > question if
> > > > > > > > a bottleneck is found in that phase is "which
> > > > > converter/transformation
> > > > > > is
> > > > > > > > responsible?" It'd be nice if we could provide a way to
> quickly
> > > > > answer
> > > > > > that
> > > > > > > > question.
> > > > > > > >
> > > > > > >
> > > > > > > This is a great idea. I'd like to consider this as well, though
> > > maybe
> > > > > > these
> > > > > > > more granular
> > > > > > > metrics would be good to have them as DEBUG.
> > > > > > >
> > > > > > >
> > > > > > > > 6. Any thoughts about offering latency metrics for source
> tasks
> > > > > between
> > > > > > > > receipt of the record from the task and delivery of the
> record
> > to
> > > > > Kafka
> > > > > > > > (which would be tracked by producer callback)? We could also
> > use
> > > > the
> > > > > > record
> > > > > > > > timestamp either instead of or in addition to receipt time if
> > the
> > > > > task
> > > > > > > > provides a timestamp with its records.
> > > > > > > >
> > > > > > >
> > > > > > > With source transform and convert metrics we get part of that
> > > > latency.
> > > > > > > Looking at the Producer metrics, `request-latency` (though a
> very
> > > > > generic
> > > > > > > metric)
> > > > > > > sort of answers the time between send request and ack — if my
> > > > > > understanding
> > > > > > > is correct.
> > > > > > > Would these be enough or you're thinking about another
> approach?
> > > > > > > maybe a custom metric to cover the producer side?
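In spirit, a producer-side measurement could hook the send acknowledgement like this (simulated async ack via `CompletableFuture`; the real `KafkaProducer` callback API is not used here):

```java
import java.util.concurrent.CompletableFuture;

// Sketch: compute source-record e2e latency when the (simulated) broker
// ack arrives, as wallclock time minus the record timestamp.
public class SendLatency {
    public static CompletableFuture<Long> sendAndMeasure(
            long recordTimestampMs, CompletableFuture<Void> ack) {
        // The latency is only known once the ack completes.
        return ack.thenApply(v -> System.currentTimeMillis() - recordTimestampMs);
    }
}
```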
> > > > > > >
> > > > > > >
> > > > > > > > 7. We may end up introducing a way for sink tasks to record
> > > > > per-record
> > > > > > > > delivery to the sink system (see KIP-767 [1]). I'd like it if
> > we
> > > > > could
> > > > > > keep
> > > > > > > > the names of our metrics very precise in order to avoid
> > confusing
> > > > > users
> > > > > > > > (who may think that we're providing metrics on actual
> delivery
> > to
> > > > the
> > > > > > sink
> > > > > > > > system, which may not be the case if the connector performs
> > > > > > asynchronous
> > > > > > > > writes), and in order to leave room for a metrics on true
> > > delivery
> > > > > > time by
> > > > > > > > sink tasks. It'd also be nice if we could remain consistent
> > with
> > > > > > existing
> > > > > > > > metrics such as "put-batch-avg-time-ms". With that in mind,
> > what
> > > do
> > > > > you
> > > > > > > > think about renaming these metrics:
> > > > > > > > - "sink-record-batch-latency-max-ms" to
> > > "put-batch-avg-latency-ms"
> > > > > > > > - "sink-record-latency-max-ms" to
> > > "put-sink-record-latency-max-ms"
> > > > > > > > - "sink-record-latency-avg-ms" to
> > > "put-sink-record-latency-avg-ms"
> > > > > > > > - "sink-record-convert-transform-time-max-ms" to
> > > > > > > > "convert-transform-sink-record-time-max-ms"
> > > > > > > > - "sink-record-convert-transform-time-avg-ms" to
> > > > > > > > "convert-transform-sink-record-time-avg-ms"
> > > > > > > > - "source-record-transform-convert-time-max-ms" to
> > > > > > > > "transform-convert-source-record-time-max-ms"
> > > > > > > > - "source-record-transform-convert-time-avg-ms" to
> > > > > > > > "transform-convert-source-record-time-avg-ms"
> > > > > > > >
> > > > > > >
> > > > > > > Makes sense, thanks! I have updated the list of metrics and
> group
> > > them
> > > > > by
> > > > > > > sensor and applying these suggestions.
> > > > > > > The only ones that I want to review are: sink-record-* to
> > > put-batch-*
> > > > > > > (first 3). Not sure if put-batch/put-sink-record describes the
> > > > purpose
> > > > > of
> > > > > > > the metric — nor does `sink-record-latency`, to be honest.
> > > > > > > My initial thought was to have something like Kafka Streams
> > > > > e2e-latency.
> > > > > > > Based on 4. and 6. questions, an idea could be to add:
> > > > > > > - source-batch-e2e-latency-before-send: measure wallclock -
> > source
> > > > > record
> > > > > > > timestamp after source connector poll.
> > > > > > > - source-batch-e2e-latency-after-send: measure wallclock -
> record
> > > > > > timestamp
> > > > > > > on producer send callback
> > > > > > > - sink-batch-e2e-latency-before-put: measure time wallclock -
> > > record
> > > > > > > timestamp after consumer poll
> > > > > > > - sink-batch-e2e-latency-after-put: measure time wallclock -
> > record
> > > > > > > timestamp after sink connector put.
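A rough sketch of the arithmetic behind these four proposals — each is wallclock time minus a record (or oldest-in-batch) timestamp (hypothetical helper, names are illustrative):

```java
import java.util.List;

// Illustrative helper for the proposed e2e measurements.
public class E2eLatency {
    // Per-record latency, e.g. sink-batch-e2e-latency-before-put
    // measured right after consumer poll.
    public static long recordLatencyMs(long recordTimestampMs, long nowMs) {
        return nowMs - recordTimestampMs;
    }

    // Per-batch max latency driven by the oldest record timestamp,
    // so only one sensor update is needed per batch.
    public static long batchMaxLatencyMs(List<Long> recordTimestampsMs, long nowMs) {
        long oldest = Long.MAX_VALUE;
        for (long ts : recordTimestampsMs) oldest = Math.min(oldest, ts);
        return nowMs - oldest;
    }
}
```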
> > > > > > >
> > > > > > >
> > > > > > > > Thanks again for the KIP! Looking forward to your thoughts.
> > > > > > > >
> > > > > > > > Cheers,
> > > > > > > >
> > > > > > > > Chris
> > > > > > > >
> > > > > > > > [1] -
> > > > > > > >
> > > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-767%3A+Connect+Latency+Metrics
> > > > > > > >
> > > > > > > > On Thu, Sep 15, 2022 at 1:32 PM Jorge Esteban Quilcate Otoya
> <
> > > > > > > > quilcate.jorge@gmail.com> wrote:
> > > > > > > >
> > > > > > > > > Hi everyone,
> > > > > > > > >
> > > > > > > > > I've made a slight addition to the KIP based on Yash
> > feedback:
> > > > > > > > >
> > > > > > > > > - A new metric is added at INFO level to record the max
> > latency
> > > > > from
> > > > > > the
> > > > > > > > > batch timestamp, by keeping the oldest record timestamp per
> > > > batch.
> > > > > > > > > - A draft implementation is linked.
> > > > > > > > >
> > > > > > > > > Looking forward to your feedback.
> > > > > > > > > Also, a kindly reminder that the vote thread is open.
> > > > > > > > >
> > > > > > > > > Thanks!
> > > > > > > > > Jorge.
> > > > > > > > >
> > > > > > > > > On Thu, 8 Sept 2022 at 14:25, Jorge Esteban Quilcate Otoya
> <
> > > > > > > > > quilcate.jorge@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > > > Great. I have updated the KIP to reflect this.
> > > > > > > > > >
> > > > > > > > > > Cheers,
> > > > > > > > > > Jorge.
> > > > > > > > > >
> > > > > > > > > > On Thu, 8 Sept 2022 at 12:26, Yash Mayya <
> > > yash.mayya@gmail.com
> > > > >
> > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > >> Thanks, I think it makes sense to define these metrics
> at
> > a
> > > > > DEBUG
> > > > > > > > > >> recording
> > > > > > > > > >> level.
> > > > > > > > > >>
> > > > > > > > > >> On Thu, Sep 8, 2022 at 2:51 PM Jorge Esteban Quilcate
> > Otoya
> > > <
> > > > > > > > > >> quilcate.jorge@gmail.com> wrote:
> > > > > > > > > >>
> > > > > > > > > >> > On Thu, 8 Sept 2022 at 05:55, Yash Mayya <
> > > > > yash.mayya@gmail.com>
> > > > > > > > > wrote:
> > > > > > > > > >> >
> > > > > > > > > >> > > Hi Jorge,
> > > > > > > > > >> > >
> > > > > > > > > >> > > Thanks for the changes. With regard to having per
> > batch
> > > vs
> > > > > per
> > > > > > > > > record
> > > > > > > > > >> > > metrics, the additional overhead I was referring to
> > > wasn't
> > > > > > about
> > > > > > > > > >> whether
> > > > > > > > > >> > or
> > > > > > > > > >> > > not we would need to iterate over all the records
> in a
> > > > > batch.
> > > > > > I
> > > > > > > > was
> > > > > > > > > >> > > referring to the potential additional overhead
> caused
> > by
> > > > the
> > > > > > > > higher
> > > > > > > > > >> > volume
> > > > > > > > > >> > > of calls to Sensor::record on the sensors for the
> new
> > > > > metrics
> > > > > > (as
> > > > > > > > > >> > compared
> > > > > > > > > >> > > to the existing batch only metrics), especially for
> > high
> > > > > > > > throughput
> > > > > > > > > >> > > connectors where batch sizes could be large. I guess
> > we
> > > > may
> > > > > > want
> > > > > > > > to
> > > > > > > > > do
> > > > > > > > > >> > some
> > > > > > > > > >> > > sort of performance testing and get concrete numbers
> > to
> > > > > verify
> > > > > > > > > whether
> > > > > > > > > >> > this
> > > > > > > > > >> > > is a valid concern or not?
> > > > > > > > > >> > >
> > > > > > > > > >> >
> > > > > > > > > >> > 6.1. Got it, thanks for clarifying. I guess there
> could
> > > be a
> > > > > > > > benchmark
> > > > > > > > > >> test
> > > > > > > > > >> > of the `Sensor::record` to get an idea of the
> > performance
> > > > > > impact.
> > > > > > > > > >> > Regardless, the fact that these are single-record
> > metrics
> > > > > > compared
> > > > > > > > to
> > > > > > > > > >> > existing batch-only could be explicitly defined by
> > setting
> > > > > these
> > > > > > > > > >> metrics at
> > > > > > > > > >> > a DEBUG or TRACE metric recording level, leaving the
> > > > existing
> > > > > at
> > > > > > > > INFO
> > > > > > > > > >> > level.
> > > > > > > > > >> > wdyt?
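The gating suggested here could look roughly like this (simplified level enum and sensor, not Connect's actual `Sensor`/`RecordingLevel` classes, though the `shouldRecord` check mirrors their idea):

```java
// Sketch of recording-level gating: per-record sensors only pay the
// record() cost when the configured level includes them.
public class GatedSensor {
    public enum Level { INFO, DEBUG, TRACE }

    private final Level sensorLevel;      // level this sensor is defined at
    private final Level configuredLevel;  // level configured on the worker
    private double max = Double.NEGATIVE_INFINITY;
    private long count = 0;

    public GatedSensor(Level sensorLevel, Level configuredLevel) {
        this.sensorLevel = sensorLevel;
        this.configuredLevel = configuredLevel;
    }

    public boolean shouldRecord() {
        return sensorLevel.ordinal() <= configuredLevel.ordinal();
    }

    // Hot path: a no-op unless the sensor's level is enabled.
    public void record(double value) {
        if (!shouldRecord()) return;
        max = Math.max(max, value);
        count++;
    }

    public long count() { return count; }
}
```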
> > > > > > > > > >> >
> > > > > > > > > >> >
> > > > > > > > > >> > >
> > > > > > > > > >> > > Thanks,
> > > > > > > > > >> > > Yash
> > > > > > > > > >> > >
> > > > > > > > > >> > > On Tue, Sep 6, 2022 at 4:42 PM Jorge Esteban
> Quilcate
> > > > Otoya
> > > > > <
> > > > > > > > > >> > > quilcate.jorge@gmail.com> wrote:
> > > > > > > > > >> > >
> > > > > > > > > >> > > > Hi Sagar and Yash,
> > > > > > > > > >> > > >
> > > > > > > > > >> > > > > the way it's defined in
> > > > > > > > > >> > > >
> > > > > https://kafka.apache.org/documentation/#connect_monitoring
> > > > > > for
> > > > > > > > > the
> > > > > > > > > >> > > metrics
> > > > > > > > > >> > > >
> > > > > > > > > >> > > > 4.1. Got it. Add it to the KIP.
> > > > > > > > > >> > > >
> > > > > > > > > >> > > > > The only thing I would argue is do we need
> > > > > > > > > >> sink-record-latency-min?
> > > > > > > > > >> > > Maybe
> > > > > > > > > >> > > > we
> > > > > > > > > >> > > > > could remove this min metric as well and make
> all
> > of
> > > > the
> > > > > > 3 e2e
> > > > > > > > > >> > metrics
> > > > > > > > > >> > > > > consistent
> > > > > > > > > >> > > >
> > > > > > > > > >> > > > 4.2 I see. Will remove it from the KIP.
> > > > > > > > > >> > > >
> > > > > > > > > >> > > > > Probably users can track the metrics at their
> end
> > to
> > > > > > > > > >> > > > > figure that out. Do you think that makes sense?
> > > > > > > > > >> > > >
> > > > > > > > > >> > > > 4.3. Yes, agree. With these new metrics it should
> be
> > > > > easier
> > > > > > for
> > > > > > > > > >> users
> > > > > > > > > >> > to
> > > > > > > > > >> > > > track this.
> > > > > > > > > >> > > >
> > > > > > > > > >> > > > > I think it makes sense to not have a min metric
> > for
> > > > > > either to
> > > > > > > > > >> remain
> > > > > > > > > >> > > > > consistent with the existing put-batch and
> > > poll-batch
> > > > > > metrics
> > > > > > > > > >> > > >
> > > > > > > > > >> > > > 5.1. Got it. Same as 4.2
> > > > > > > > > >> > > >
> > > > > > > > > >> > > > > Another naming related suggestion I had was with
> > the
> > > > > > > > > >> > > > > "convert-time" metrics - we should probably
> > include
> > > > > > > > > >> transformations
> > > > > > > > > >> > in
> > > > > > > > > >> > > > the
> > > > > > > > > >> > > > > name since SMTs could definitely be attributable
> > to
> > > a
> > > > > > sizable
> > > > > > > > > >> chunk
> > > > > > > > > >> > of
> > > > > > > > > >> > > > the
> > > > > > > > > >> > > > > latency depending on the specific transformation
> > > > chain.
> > > > > > > > > >> > > >
> > > > > > > > > >> > > > 5.2. Makes sense. I'm proposing to add
> > > > > > > > > >> > `sink-record-convert-transform...`
> > > > > > > > > >> > > > and `source-record-transform-convert...` to
> > represent
> > > > > > correctly
> > > > > > > > > the
> > > > > > > > > >> > order
> > > > > > > > > >> > > > of operations.
> > > > > > > > > >> > > >
> > > > > > > > > >> > > > > it seems like both source and sink tasks only
> > record
> > > > > > metrics
> > > > > > > > at
> > > > > > > > > a
> > > > > > > > > >> > > "batch"
> > > > > > > > > >> > > > > level, not on an individual record level. I
> think
> > it
> > > > > > might be
> > > > > > > > > >> > > additional
> > > > > > > > > >> > > > > overhead if we want to record these new metrics
> > all
> > > at
> > > > > the
> > > > > > > > > record
> > > > > > > > > >> > > level?
> > > > > > > > > >> > > >
> > > > > > > > > >> > > > 5.3. I considered at the beginning to implement
> all
> > > > > metrics
> > > > > > at
> > > > > > > > the
> > > > > > > > > >> > batch
> > > > > > > > > >> > > > level, but given how the framework processes
> records,
> > I
> > > > > > fallback
> > > > > > > > to
> > > > > > > > > >> the
> > > > > > > > > >> > > > proposed approach:
> > > > > > > > > >> > > > - Sink Task:
> > > > > > > > > >> > > >   - `WorkerSinkTask#convertMessages(msgs)` already
> > > > > iterates
> > > > > > over
> > > > > > > > > >> > records,
> > > > > > > > > >> > > > so there is no additional overhead to capture
> record
> > > > > > latency per
> > > > > > > > > >> > record.
> > > > > > > > > >> > > >     -
> > > > > > > > > >> > > >
> > > > > > > > > >> > > >
> > > > > > > > > >> > >
> > > > > > > > > >> >
> > > > > > > > > >>
> > > > > > > > >
> > > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/apache/kafka/blob/9841647c4fe422532f448423c92d26e4fdcb8932/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/WorkerSinkTask.java#L490-L514
> > > > > > > > > >> > > >   -
> > `WorkerSinkTask#convertAndTransformRecord(record)`
> > > > > > actually
> > > > > > > > > >> happens
> > > > > > > > > >> > > > individually. Measuring this operation per batch
> > would
> > > > > > include
> > > > > > > > > >> > processing
> > > > > > > > > >> > > > that is not strictly part of "convert and
> transform"
> > > > > > > > > >> > > >     -
> > > > > > > > > >> > > >
> > > > > > > > > >> > > >
> > > > > > > > > >> > >
> > > > > > > > > >> >
> > > > > > > > > >>
> > > > > > > > >
> > > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/apache/kafka/blob/9841647c4fe422532f448423c92d26e4fdcb8932/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/WorkerSinkTask.java#L518
> > > > > > > > > >> > > > - Source Task:
> > > > > > > > > >> > > >   - `AbstractWorkerSourceTask#sendRecords`
> iterates
> > > > over a
> > > > > > batch
> > > > > > > > > and
> > > > > > > > > >> > > > applies transforms and convert record individually
> > as
> > > > > well:
> > > > > > > > > >> > > >     -
> > > > > > > > > >> > > >
> > > > > > > > > >> > > >
> > > > > > > > > >> > >
> > > > > > > > > >> >
> > > > > > > > > >>
> > > > > > > > >
> > > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/apache/kafka/blob/9841647c4fe422532f448423c92d26e4fdcb8932/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/AbstractWorkerSourceTask.java#L389-L390
> > > > > > > > > >> > > >
> > > > > > > > > >> > > > > This might require some additional changes -
> > > > > > > > > >> > > > > for instance, with the "sink-record-latency"
> > metric,
> > > > we
> > > > > > might
> > > > > > > > > only
> > > > > > > > > >> > want
> > > > > > > > > >> > > > to
> > > > > > > > > >> > > > > have a "max" metric since "avg" would require
> > > > recording
> > > > > a
> > > > > > > > value
> > > > > > > > > on
> > > > > > > > > >> > the
> > > > > > > > > >> > > > > sensor for each record (whereas we can get a
> "max"
> > > by
> > > > > only
> > > > > > > > > >> recording
> > > > > > > > > >> > a
> > > > > > > > > >> > > > > metric value for the oldest record in each
> batch).
> > > > > > > > > >> > > >
> > > > > > > > > >> > > > 5.4. Recording record-latency per batch may not be
> > as
> > > > > > useful as
> > > > > > > > > >> there
> > > > > > > > > >> > is
> > > > > > > > > >> > > no
> > > > > > > > > >> > > > guarantee that the oldest record will be
> > > representative
> > > > of
> > > > > > the
> > > > > > > > > >> batch.
> > > > > > > > > >> > > >
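A toy example (hypothetical numbers) of why the oldest record per batch preserves the max but not the average:

```java
import java.util.List;

// For a batch with per-record latencies [100, 20, 30] ms, tracking only
// the oldest record yields the true max (100) while the per-record
// average (50) is lost — the oldest record is not representative.
public class BatchVsRecord {
    public static double perRecordAvg(List<Long> latenciesMs) {
        long sum = 0;
        for (long l : latenciesMs) sum += l;
        return (double) sum / latenciesMs.size();
    }

    public static long batchOldestMax(List<Long> latenciesMs) {
        long max = 0;
        for (long l : latenciesMs) max = Math.max(max, l);
        return max;
    }
}
```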
> > > > > > > > > >> > > > On Sat, 3 Sept 2022 at 16:02, Yash Mayya <
> > > > > > yash.mayya@gmail.com>
> > > > > > > > > >> wrote:
> > > > > > > > > >> > > >
> > > > > > > > > >> > > > > Hi Jorge and Sagar,
> > > > > > > > > >> > > > >
> > > > > > > > > >> > > > > I think it makes sense to not have a min metric
> > for
> > > > > > either to
> > > > > > > > > >> remain
> > > > > > > > > >> > > > > consistent with the existing put-batch and
> > > poll-batch
> > > > > > metrics
> > > > > > > > > (it
> > > > > > > > > >> > > doesn't
> > > > > > > > > >> > > > > seem particularly useful either anyway). Also,
> the
> > > new
> > > > > > > > > >> > > > > "sink-record-latency" metric name looks fine to
> > me,
> > > > > > thanks for
> > > > > > > > > >> making
> > > > > > > > > >> > > the
> > > > > > > > > >> > > > > changes! Another naming related suggestion I had
> > was
> > > > > with
> > > > > > the
> > > > > > > > > >> > > > > "convert-time" metrics - we should probably
> > include
> > > > > > > > > >> transformations
> > > > > > > > > >> > in
> > > > > > > > > >> > > > the
> > > > > > > > > >> > > > > name since SMTs could definitely be attributable
> > to
> > > a
> > > > > > sizable
> > > > > > > > > >> chunk
> > > > > > > > > >> > of
> > > > > > > > > >> > > > the
> > > > > > > > > >> > > > > latency depending on the specific transformation
> > > > chain.
> > > > > > > > > >> > > > >
> > > > > > > > > >> > > > > I have one high level question with respect to
> > > > > > implementation
> > > > > > > > -
> > > > > > > > > >> > > > currently,
> > > > > > > > > >> > > > > it seems like both source and sink tasks only
> > record
> > > > > > metrics
> > > > > > > > at
> > > > > > > > > a
> > > > > > > > > >> > > "batch"
> > > > > > > > > >> > > > > level, not on an individual record level. I
> think
> > it
> > > > > > might be
> > > > > > > > > >> > > additional
> > > > > > > > > >> > > > > overhead if we want to record these new metrics
> > all
> > > at
> > > > > the
> > > > > > > > > record
> > > > > > > > > >> > > level?
> > > > > > > > > >> > > > > Could we instead make all of these new metrics
> for
> > > > > > batches of
> > > > > > > > > >> records
> > > > > > > > > >> > > > > rather than individual records in order to
> remain
> > > > > > consistent
> > > > > > > > > with
> > > > > > > > > >> the
> > > > > > > > > >> > > > > existing task level metrics? This might require
> > some
> > > > > > > > additional
> > > > > > > > > >> > > changes -
> > > > > > > > > >> > > > > for instance, with the "sink-record-latency"
> > metric,
> > > > we
> > > > > > might
> > > > > > > > > only
> > > > > > > > > >> > want
> > > > > > > > > >> > > > to
> > > > > > > > > >> > > > > have a "max" metric since "avg" would require
> > > > recording
> > > > > a
> > > > > > > > value
> > > > > > > > > on
> > > > > > > > > >> > the
> > > > > > > > > >> > > > > sensor for each record (whereas we can get a
> "max"
> > > by
> > > > > only
> > > > > > > > > >> recording
> > > > > > > > > >> > a
> > > > > > > > > >> > > > > metric value for the oldest record in each
> batch).
> > > > > > > > > >> > > > >
> > > > > > > > > >> > > > > Thanks,
> > > > > > > > > >> > > > > Yash
> > > > > > > > > >> > > > >
> > > > > > > > > >> > > > > On Fri, Sep 2, 2022 at 3:16 PM Sagar <
> > > > > > > > sagarmeansocean@gmail.com
> > > > > > > > > >
> > > > > > > > > >> > > wrote:
> > > > > > > > > >> > > > >
> > > > > > > > > >> > > > > > Hi Jorge,
> > > > > > > > > >> > > > > >
> > > > > > > > > >> > > > > > Thanks for the changes.
> > > > > > > > > >> > > > > >
> > > > > > > > > >> > > > > > Regarding the metrics, I meant something like
> > > this:
> > > > > > > > > >> > > > > >
> > > > > > > > > >> > > > >
> > > > > > > > > >> > > >
> > > > > > > > > >> > >
> > > > > > > > > >> >
> > > > > > > > > >>
> > > > > > > > >
> > > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> kafka.connect:type=sink-task-metrics,connector="{connector}",task="{task}"
> > > > > > > > > >> > > > > >
> > > > > > > > > >> > > > > > the way it's defined in
> > > > > > > > > >> > > > > >
> > > > > > https://kafka.apache.org/documentation/#connect_monitoring
> > > > > > > > > for
> > > > > > > > > >> the
> > > > > > > > > >> > > > > > metrics.
> > > > > > > > > >> > > > > >
> > > > > > > > > >> > > > > > I see what you mean by the 3 metrics and how
> it
> > > can
> > > > be
> > > > > > > > > >> interpreted.
> > > > > > > > > >> > > The
> > > > > > > > > >> > > > > > only thing I would argue is do we need
> > > > > > > > > sink-record-latency-min?
> > > > > > > > > >> > Maybe
> > > > > > > > > >> > > > we
> > > > > > > > > >> > > > > > could remove this min metric as well and make
> > all
> > > of
> > > > > > the 3
> > > > > > > > e2e
> > > > > > > > > >> > > metrics
> > > > > > > > > >> > > > > > consistent(since put-batch also doesn't
> expose a
> > > min
> > > > > > which
> > > > > > > > > makes
> > > > > > > > > >> > > sense
> > > > > > > > > >> > > > to
> > > > > > > > > >> > > > > > me). I think this is in contrast to what Yash
> > > > pointed
> > > > > > out
> > > > > > > > > above
> > > > > > > > > >> so
> > > > > > > > > >> > I
> > > > > > > > > >> > > > > would
> > > > > > > > > >> > > > > > like to hear his thoughts as well.
> > > > > > > > > >> > > > > >
> > > > > > > > > >> > > > > > The other point Yash mentioned about the
> > slightly
> > > > > flawed
> > > > > > > > > >> definition
> > > > > > > > > >> > > of
> > > > > > > > > >> > > > > e2e
> > > > > > > > > >> > > > > > is also true in a sense. But I have a feeling
> > > that's
> > > > > > > > > >> > > > > > once the
> > > > > > > > > >> > records
> > > > > > > > > >> > > > are
> > > > > > > > > >> > > > > > polled by the connector tasks, it would be
> > > difficult
> > > > > to
> > > > > > > > track
> > > > > > > > > >> the
> > > > > > > > > >> > > final
> > > > > > > > > >> > > > > leg
> > > > > > > > > >> > > > > > via the framework. Probably users can track
> the
> > > > > metrics
> > > > > > at
> > > > > > > > > their
> > > > > > > > > >> > end
> > > > > > > > > >> > > to
> > > > > > > > > >> > > > > > figure that out. Do you think that makes
> sense?
> > > > > > > > > >> > > > > >
> > > > > > > > > >> > > > > > Thanks!
> > > > > > > > > >> > > > > > Sagar.
> > > > > > > > > >> > > > > >
> > > > > > > > > >> > > > > >
> > > > > > > > > >> > > > > >
> > > > > > > > > >> > > > > >
> > > > > > > > > >> > > > > > On Thu, Sep 1, 2022 at 11:40 PM Jorge Esteban
> > > > Quilcate
> > > > > > > > Otoya <
> > > > > > > > > >> > > > > > quilcate.jorge@gmail.com> wrote:
> > > > > > > > > >> > > > > >
> > > > > > > > > >> > > > > > > Hi Sagar and Yash,
> > > > > > > > > >> > > > > > >
> > > > > > > > > >> > > > > > > Thanks for your feedback!
> > > > > > > > > >> > > > > > >
> > > > > > > > > >> > > > > > > > 1) I am assuming the new metrics would be
> > task
> > > > > level
> > > > > > > > > metric.
> > > > > > > > > >> > > > > > >
> > > > > > > > > >> > > > > > > 1.1 Yes, it will be a task level metric,
> > > > implemented
> > > > > > on
> > > > > > > > the
> > > > > > > > > >> > > > > > > Worker[Source/Sink]Task.
> > > > > > > > > >> > > > > > >
> > > > > > > > > >> > > > > > > > Could you specify the way it's done for
> > other
> > > > > > > > sink/source
> > > > > > > > > >> > > > connector?
> > > > > > > > > >> > > > > > >
> > > > > > > > > >> > > > > > > 1.2. Not sure what do you mean by this.
> Could
> > > you
> > > > > > > > elaborate
> > > > > > > > > a
> > > > > > > > > >> bit
> > > > > > > > > >> > > > more?
> > > > > > > > > >> > > > > > >
> > > > > > > > > >> > > > > > > > 2. I am slightly confused about the e2e
> > > latency
> > > > > > > > metric...
> > > > > > > > > >> > > > > > >
> > > > > > > > > >> > > > > > > 2.1. Yes, I see. I was trying to bring a
> > similar
> > > > > > concept
> > > > > > > > as
> > > > > > > > > in
> > > > > > > > > >> > > > Streams
> > > > > > > > > >> > > > > > with
> > > > > > > > > >> > > > > > > KIP-613, though the e2e concept may not be
> > > > > > translatable.
> > > > > > > > > >> > > > > > > We could keep it as `sink-record-latency` to
> > > avoid
> > > > > > > > > conflating
> > > > > > > > > >> > > > > concepts. A
> > > > > > > > > >> > > > > > > similar metric naming was proposed in
> KIP-489
> > > but
> > > > at
> > > > > > the
> > > > > > > > > >> consumer
> > > > > > > > > >> > > > > level —
> > > > > > > > > >> > > > > > > though it seems dormant for a couple of
> years.
> > > > > > > > > >> > > > > > >
> > > > > > > > > >> > > > > > > > However, the put-batch time measures the
> > > > > > > > > >> > > > > > > > time to put a batch of records to external
> > > sink.
> > > > > > So, I
> > > > > > > > > would
> > > > > > > > > >> > > assume
> > > > > > > > > >> > > > > > the 2
> > > > > > > > > >> > > > > > > > can't be added as is to compute the e2e
> > > latency.
> > > > > > Maybe I
> > > > > > > > > am
> > > > > > > > > >> > > missing
> > > > > > > > > >> > > > > > > > something here. Could you plz clarify
> this.
> > > > > > > > > >> > > > > > >
> > > > > > > > > >> > > > > > > 2.2. Yes, agree. Not necessarily added, but
> > with
> > > > > the 3
> > > > > > > > > >> latencies
> > > > > > > > > >> > > > (poll,
> > > > > > > > > >> > > > > > > convert, putBatch) will be clearer where the
> > > > > > bottleneck
> > > > > > > > may
> > > > > > > > > >> be,
> > > > > > > > > >> > and
> > > > > > > > > >> > > > > > > represent the internal processing.
> > > > > > > > > >> > > > > > >
> > > > > > > > > >> > > > > > > > however, as per the KIP it looks like it
> > will
> > > be
> > > > > > > > > >> > > > > > > > the latency between when the record was
> > > written
> > > > to
> > > > > > Kafka
> > > > > > > > > and
> > > > > > > > > >> > when
> > > > > > > > > >> > > > the
> > > > > > > > > >> > > > > > > > record is returned by a sink task's
> > consumer's
> > > > > poll?
> > > > > > > > > >> > > > > > >
> > > > > > > > > >> > > > > > > 3.1. Agree. 2.1. could help to clarify this.
> > > > > > > > > >> > > > > > >
> > > > > > > > > >> > > > > > > > One more thing - I was wondering
> > > > > > > > > >> > > > > > > > if there's a particular reason for having
> a
> > > min
> > > > > > metric
> > > > > > > > for
> > > > > > > > > >> e2e
> > > > > > > > > >> > > > > latency
> > > > > > > > > >> > > > > > > but
> > > > > > > > > >> > > > > > > > not for convert time?
> > > > > > > > > >> > > > > > >
> > > > > > > > > >> > > > > > > 3.2. Was following KIP-613 for e2e which
> seems
> > > > > useful
> > > > > > to
> > > > > > > > > >> compare
> > > > > > > > > >> > > with
> > > > > > > > > >> > > > > > Max and
> > > > > > > > > >> > > > > > > get an idea of the window of results, though
> > > > current
> > > > > > > > > >> latencies in
> > > > > > > > > >> > > > > > Connector
> > > > > > > > > >> > > > > > > do not include Min, and that's why I haven't
> > > added
> > > > > it
> > > > > > for
> > > > > > > > > >> convert
> > > > > > > > > >> > > > > > latency.
> > > > > > > > > >> > > > > > > Do you think it makes sense to extend latency
> > > > metrics
> > > > > > with
> > > > > > > > > Min?
> > > > > > > > > >> > > > > > >
> > > > > > > > > >> > > > > > > KIP is updated to clarify some of these
> > changes.
> > > > > > > > > >> > > > > > >
> > > > > > > > > >> > > > > > > Many thanks,
> > > > > > > > > >> > > > > > > Jorge.
> > > > > > > > > >> > > > > > >
> > > > > > > > > >> > > > > > > On Thu, 1 Sept 2022 at 18:11, Yash Mayya <
> > > > > > > > > >> yash.mayya@gmail.com>
> > > > > > > > > >> > > > wrote:
> > > > > > > > > >> > > > > > >
> > > > > > > > > >> > > > > > > > Hi Jorge,
> > > > > > > > > >> > > > > > > >
> > > > > > > > > >> > > > > > > > Thanks for the KIP! I have the same
> > confusion
> > > > with
> > > > > > the
> > > > > > > > > >> > > e2e-latency
> > > > > > > > > >> > > > > > > metrics
> > > > > > > > > >> > > > > > > > as Sagar above. "e2e" would seem to
> indicate
> > > the
> > > > > > latency
> > > > > > > > > >> > between
> > > > > > > > > >> > > > when
> > > > > > > > > >> > > > > > the
> > > > > > > > > >> > > > > > > > record was written to Kafka and when the
> > > record
> > > > > was
> > > > > > > > > written
> > > > > > > > > >> to
> > > > > > > > > >> > > the
> > > > > > > > > >> > > > > sink
> > > > > > > > > >> > > > > > > > system by the connector - however, as per
> > the
> > > > KIP
> > > > > it
> > > > > > > > looks
> > > > > > > > > >> like
> > > > > > > > > >> > > it
> > > > > > > > > >> > > > > will
> > > > > > > > > >> > > > > > > be
> > > > > > > > > >> > > > > > > > the latency between when the record was
> > > written
> > > > to
> > > > > > Kafka
> > > > > > > > > and
> > > > > > > > > >> > when
> > > > > > > > > >> > > > the
> > > > > > > > > >> > > > > > > > record is returned by a sink task's
> > consumer's
> > > > > > poll? I
> > > > > > > > > think
> > > > > > > > > >> > that
> > > > > > > > > >> > > > > > metric
> > > > > > > > > >> > > > > > > > will be a little confusing to interpret.
> One
> > > > more
> > > > > > thing
> > > > > > > > -
> > > > > > > > > I
> > > > > > > > > >> was
> > > > > > > > > >> > > > > > wondering
> > > > > > > > > >> > > > > > > > if there's a particular reason for having
> a
> > > min
> > > > > > metric
> > > > > > > > for
> > > > > > > > > >> e2e
> > > > > > > > > >> > > > > latency
> > > > > > > > > >> > > > > > > but
> > > > > > > > > >> > > > > > > > not for convert time?
> > > > > > > > > >> > > > > > > >
> > > > > > > > > >> > > > > > > > Thanks,
> > > > > > > > > >> > > > > > > > Yash
> > > > > > > > > >> > > > > > > >
> > > > > > > > > >> > > > > > > > On Thu, Sep 1, 2022 at 8:59 PM Sagar <
> > > > > > > > > >> > sagarmeansocean@gmail.com>
> > > > > > > > > >> > > > > > wrote:
> > > > > > > > > >> > > > > > > >
> > > > > > > > > >> > > > > > > > > Hi Jorge,
> > > > > > > > > >> > > > > > > > >
> > > > > > > > > >> > > > > > > > > Thanks for the KIP. It looks like a very
> > > good
> > > > > > > > addition.
> > > > > > > > > I
> > > > > > > > > >> > > skimmed
> > > > > > > > > >> > > > > > > through
> > > > > > > > > >> > > > > > > > > once and had a couple of questions =>
> > > > > > > > > >> > > > > > > > >
> > > > > > > > > >> > > > > > > > > 1) I am assuming the new metrics would
> be
> > > task
> > > > > > level
> > > > > > > > > >> metric.
> > > > > > > > > >> > > > Could
> > > > > > > > > >> > > > > > you
> > > > > > > > > >> > > > > > > > > specify the way it's done for other sink/source connector?
> > > > > > > > > >> > > > > > > > > 2) I am slightly confused about the e2e latency metric. Let's consider the
> > > > > > > > > >> > > > > > > > > sink connector metric. If I look at the way it's supposed to be calculated,
> > > > > > > > > >> > > > > > > > > i.e the difference between the record timestamp and the wall clock time, it
> > > > > > > > > >> > > > > > > > > looks like a per record metric. However, the put-batch time measures the
> > > > > > > > > >> > > > > > > > > time to put a batch of records to external sink. So, I would assume the 2
> > > > > > > > > >> > > > > > > > > can't be added as is to compute the e2e latency. Maybe I am missing
> > > > > > > > > >> > > > > > > > > something here. Could you plz clarify this.
> > > > > > > > > >> > > > > > > > >
> > > > > > > > > >> > > > > > > > > Thanks!
> > > > > > > > > >> > > > > > > > > Sagar.
> > > > > > > > > >> > > > > > > > >
> > > > > > > > > >> > > > > > > > > On Tue, Aug 30, 2022 at 8:43 PM Jorge Esteban Quilcate Otoya <
> > > > > > > > > >> > > > > > > > > quilcate.jorge@gmail.com> wrote:
> > > > > > > > > >> > > > > > > > >
> > > > > > > > > >> > > > > > > > > > Hi all,
> > > > > > > > > >> > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > I'd like to start a discussion thread on KIP-864: Add End-To-End Latency
> > > > > > > > > >> > > > > > > > > > Metrics to Connectors.
> > > > > > > > > >> > > > > > > > > > This KIP aims to improve the metrics available on Source and Sink
> > > > > > > > > >> > > > > > > > > > Connectors to measure end-to-end latency, including source and sink record
> > > > > > > > > >> > > > > > > > > > conversion time, and sink record e2e latency (similar to KIP-613 for
> > > > > > > > > >> > > > > > > > > > Streams).
> > > > > > > > > >> > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > The KIP is here:
> > > > > > > > > >> > > > > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-864%3A+Add+End-To-End+Latency+Metrics+to+Connectors
> > > > > > > > > >> > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > Please take a look and let me know what you think.
> > > > > > > > > >> > > > > > > > > >
> > > > > > > > > >> > > > > > > > > > Cheers,
> > > > > > > > > >> > > > > > > > > > Jorge.

Re: [DISCUSS] KIP-864: Add End-To-End Latency Metrics to Connectors

Posted by Jorge Esteban Quilcate Otoya <qu...@gmail.com>.
Thanks for the reminder Chris!

I have added a note to the KIP to address this, as most of the proposed
metrics are per-record and keeping them all at DEBUG would limit the
benefits. I have also created https://issues.apache.org/jira/browse/KAFKA-14441
to track this task.

Cheers,
Jorge.
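As context for the recording-level trade-off above, here is a minimal, self-contained sketch (illustrative only, not the actual `org.apache.kafka.common.metrics.Sensor` implementation) of how a recording-level gate makes DEBUG-level per-record sensors close to free when the worker runs at INFO:

```java
// Illustrative simplification of Kafka's Sensor recording-level gate:
// a DEBUG sensor under an INFO config never records, so per-record
// metrics cost almost nothing unless explicitly enabled.
enum RecordingLevel { INFO, DEBUG }

class GatedSensor {
    private final RecordingLevel sensorLevel;
    private final RecordingLevel configLevel;
    private double max = Double.NEGATIVE_INFINITY;

    GatedSensor(RecordingLevel sensorLevel, RecordingLevel configLevel) {
        this.sensorLevel = sensorLevel;
        this.configLevel = configLevel;
    }

    boolean shouldRecord() {
        // INFO sensors always record; DEBUG sensors only under a DEBUG config
        return sensorLevel == RecordingLevel.INFO || configLevel == RecordingLevel.DEBUG;
    }

    void record(double value) {
        if (!shouldRecord())
            return; // cheap early exit on the per-record hot path
        max = Math.max(max, value);
    }

    double max() {
        return max;
    }
}
```

Upgrading a metric from DEBUG to INFO later would then be a change only in how the sensor is registered, not in the per-record call sites.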

On Tue, 29 Nov 2022 at 19:40, Chris Egerton <ch...@aiven.io.invalid> wrote:

> Hi Jorge,
>
> Thanks! What were your thoughts on the possible benchmarking and/or
> downgrading of per-record metrics to DEBUG?
>
> Cheers,
>
> Chris
>
> On Thu, Nov 24, 2022 at 8:20 AM Jorge Esteban Quilcate Otoya <
> quilcate.jorge@gmail.com> wrote:
>
> > Thanks Chris! I have updated the KIP with "transform" instead of "alias".
> > Agree it's clearer.
> >
> > Cheers,
> > Jorge.
> >
> > On Mon, 21 Nov 2022 at 21:36, Chris Egerton <ch...@aiven.io.invalid>
> > wrote:
> >
> > > Hi Jorge,
> > >
> > > Thanks for the updates, and apologies for the delay. The new diagram
> > > directly under the "Proposed Changes" section is absolutely gorgeous!
> > >
> > >
> > > Follow-ups:
> > >
> > > RE 2: Good point. We can use the same level for these metrics, it's
> not a
> > > big deal.
> > >
> > > RE 3: As long as all the per-record metrics are kept at DEBUG level, it
> > > should be fine to leave JMH benchmarking for a follow-up. If we want to
> > add
> > > new per-record, INFO-level metrics, I would be more comfortable with
> > > including benchmarking as part of the testing plan for the KIP. One
> > > possible compromise could be to propose that these features be merged
> at
> > > DEBUG level, and then possibly upgraded to INFO level in the future
> > pending
> > > benchmarks to guard against performance degradation.
> > >
> > > RE 4: I think for a true "end-to-end" metric, it'd be useful to include
> > the
> > > time taken by the task to actually deliver the record. However, with
> the
> > > new metric names and descriptions provided in the KIP, I have no
> > objections
> > > with what's currently proposed, and a new "end-to-end" metric can be
> > taken
> > > on later in a follow-up KIP.
> > >
> > > RE 6: You're right, existing producer metrics should be enough for now.
> > We
> > > can revisit this later if/when we add delivery-centric metrics for sink
> > > tasks as well.
> > >
> > > RE 7: The new metric names in the KIP LGTM; I don't see any need to
> > expand
> > > beyond those but if you'd still like to pursue others, LMK.
> > >
> > >
> > > New thoughts:
> > >
> > > One small thought: instead of "alias" in "alias="{transform_alias}" for
> > the
> > > per-transform metrics, could we use "transform"? IMO it's clearer since
> > we
> > > don't use "alias" in the names of transform-related properties, and
> > "alias"
> > > may be confused with the classloading term where you can use, e.g.,
> > > "FileStreamSource" as the name of a connector class in a connector
> config
> > > instead of "org.apache.kafka.connect.file.FileStreamSourceConnector".
> > >
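For illustration, assuming the existing task-metrics MBean pattern quoted later in this thread is extended with the renamed tag (the exact group name is for the KIP to define), a per-transform metric would then be addressable as:

```
kafka.connect:type=sink-task-metrics,connector="{connector}",task="{task}",transform="{transform_alias}"
```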
> > >
> > > Cheers,
> > >
> > > Chris
> > >
> > > On Fri, Nov 18, 2022 at 12:06 PM Jorge Esteban Quilcate Otoya <
> > > quilcate.jorge@gmail.com> wrote:
> > >
> > > > Thanks Mickael!
> > > >
> > > >
> > > > On Wed, 9 Nov 2022 at 15:54, Mickael Maison <
> mickael.maison@gmail.com>
> > > > wrote:
> > > >
> > > > > Hi Jorge,
> > > > >
> > > > > Thanks for the KIP, it is a nice improvement.
> > > > >
> > > > > 1) The per transformation metrics still have a question mark next
> to
> > > > > them in the KIP. Do you want to include them? If so we'll want to
> tag
> > > > > them, we should be able to include the aliases in
> TransformationChain
> > > > > and use them.
> > > > >
> > > >
> > > > Yes, I have added the changes on TransformChain that will be needed
> to
> > > add
> > > > these metrics.
> > > >
> > > >
> > > > >
> > > > > 2) I see no references to predicates. If we don't want to measure
> > > > > their latency, can we say it explicitly?
> > > > >
> > > >
> > > > Good question, I haven't considered these. Though as these are
> > > materialized
> > > > as PredicatedTransformation, they should be covered by these changes.
> > > > Adding a note about this.
> > > >
> > > >
> > > > >
> > > > > 3) Should we have sink-record-batch-latency-avg-ms? All other
> metrics
> > > > > have both the maximum and average values.
> > > > >
> > > > >
> > > > Good question. I will remove it and change the record latency from
> > > > DEBUG->INFO as it already covers the maximum metric.
> > > >
> > > > Hope it's clearer now, let me know if there any additional feedback.
> > > > Thanks!
> > > >
> > > >
> > > >
> > > > > Thanks,
> > > > > Mickael
> > > > >
> > > > > On Thu, Oct 20, 2022 at 9:58 PM Jorge Esteban Quilcate Otoya
> > > > > <qu...@gmail.com> wrote:
> > > > > >
> > > > > > Thanks, Chris! Great feedback! Please, find my comments below:
> > > > > >
> > > > > > On Thu, 13 Oct 2022 at 18:52, Chris Egerton
> > <chrise@aiven.io.invalid
> > > >
> > > > > wrote:
> > > > > >
> > > > > > > Hi Jorge,
> > > > > > >
> > > > > > > Thanks for the KIP. I agree with the overall direction and
> think
> > > this
> > > > > would
> > > > > > > be a nice improvement to Kafka Connect. Here are my initial
> > > thoughts
> > > > > on the
> > > > > > > details:
> > > > > > >
> > > > > > > 1. The motivation section outlines the gaps in Kafka Connect's
> > task
> > > > > metrics
> > > > > > > nicely. I think it'd be useful to include more concrete details
> > on
> > > > why
> > > > > > > these gaps need to be filled in, and in which cases additional
> > > > metrics
> > > > > > > would be helpful. One goal could be to provide enhanced
> > monitoring
> > > of
> > > > > > > production deployments that allows for cluster administrators
> to
> > > set
> > > > up
> > > > > > > automatic alerts for latency spikes and, if triggered, quickly
> > > > > identify the
> > > > > > > root cause of those alerts, reducing the time to remediation.
> > > Another
> > > > > goal
> > > > > > > could be to provide more insight to developers or cluster
> > > > > administrators
> > > > > > > who want to do performance testing on connectors in
> > non-production
> > > > > > > environments. It may help guide our decision making process to
> > > have a
> > > > > > > clearer picture of the goals we're trying to achieve.
> > > > > > >
> > > > > >
> > > > > > Agree. The Motivation section has been updated.
> > > > > > Thanks for the examples, I see both of them being covered by the
> > KIP.
> > > > > > I see how these could give us a good distinction on whether to
> > > position
> > > > > > some metrics at INFO or DEBUG level.
> > > > > >
> > > > > >
> > > > > > > 2. If we're trying to address the alert-and-diagnose use case,
> > it'd
> > > > be
> > > > > > > useful to have as much information as possible at INFO level,
> > > rather
> > > > > than
> > > > > > > forcing cluster administrators to possibly reconfigure a
> > connector
> > > to
> > > > > emit
> > > > > > > DEBUG or TRACE level metrics in order to diagnose a potential
> > > > > > > production-impacting performance bottleneck. I can see the
> > > rationale
> > > > > for
> > > > > > > emitting per-record metrics that track an average value at
> DEBUG
> > > > > level, but
> > > > > > > for per-record metrics that track a maximum value, is there any
> > > > reason
> > > > > not
> > > > > > > to provide this information at INFO level?
> > > > > > >
> > > > > >
> > > > > > Agree. Though with Max and Avg metrics being part of the same
> > sensor
> > > —
> > > > > > where Metric Level is defined — then both metrics get the same
> > level.
> > > > > >
> > > > > >
> > > > > > > 3. I'm also curious about the performance testing suggested by
> > Yash
> > > > to
> > > > > > > gauge the potential impact of this change. Have you been able
> to
> > do
> > > > any
> > > > > > > testing with your draft implementation yet?
> > > > > > >
> > > > > >
> > > > > > No, not so far.
> > > > > > I think it would be valuable to discuss the scope of this testing
> > and
> > > > > maybe
> > > > > > tackle it
> > > > > > in a separate issue as Sensors and Metrics are used all over the
> > > place.
> > > > > > My initial understanding is that these tests should be placed in the
> > > > > > jmh-benchmarks module [1].
> > > > > > Then, we could target testing Sensors and Metrics, and validate
> how
> > > > much
> > > > > > overhead
> > > > > > is added by having only Max vs Max,Avg(,Min), etc.
> > > > > > In the other hand, we could extend this to Transformers or other
> > > > Connect
> > > > > > layers.
> > > > > >
> > > > > > Here are some pointers to the Sensors and Metrics implementations
> > > that
> > > > > > could be considered:
> > > > > > Path to metric recording:
> > > > > > -
> > > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/apache/kafka/blob/5cab11cf525f6c06fcf9eb43f7f95ef33fe1cdbb/clients/src/main/java/org/apache/kafka/common/metrics/Sensor.java#L195-L199
> > > > > > -
> > > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/apache/kafka/blob/5cab11cf525f6c06fcf9eb43f7f95ef33fe1cdbb/clients/src/main/java/org/apache/kafka/common/metrics/Sensor.java#L230-L244
> > > > > >
> > > > > > ```
> > > > > > // increment all the stats
> > > > > > for (StatAndConfig statAndConfig : this.stats) {
> > > > > >    statAndConfig.stat.record(statAndConfig.config(), value,
> > timeMs);
> > > > > > }
> > > > > > ```
> > > > > >
> > > > > > SampledStats:
> > > > > > - Avg:
> > > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/apache/kafka/blob/068ab9cefae301f3187ea885d645c425955e77d2/clients/src/main/java/org/apache/kafka/common/metrics/stats/Avg.java
> > > > > > - Max:
> > > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/apache/kafka/blob/068ab9cefae301f3187ea885d645c425955e77d2/clients/src/main/java/org/apache/kafka/common/metrics/stats/Max.java
> > > > > > - Min:
> > > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/apache/kafka/blob/068ab9cefae301f3187ea885d645c425955e77d2/clients/src/main/java/org/apache/kafka/common/metrics/stats/Min.java
> > > > > >
> > > > > > `stat#record()` are implemented by `update` method in
> SampledStat:
> > > > > >
> > > > > > ```Max.java
> > > > > >     @Override
> > > > > >     protected void update(Sample sample, MetricConfig config, double value, long now) {
> > > > > >         sample.value = Math.max(sample.value, value);
> > > > > >     }
> > > > > > ```
> > > > > >
> > > > > > ```Avg.java
> > > > > >     @Override
> > > > > >     protected void update(Sample sample, MetricConfig config, double value, long now) {
> > > > > >         sample.value += value;
> > > > > >     }
> > > > > > ```
> > > > > >
> > > > > > As far as I understand, most of the work of the stats happens on
> > the
> > > > > > `combine` method that is not part of the connector execution but
> > > called
> > > > > > when metrics are queried.
> > > > > >
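As a self-contained illustration of that point (a simplified stand-in, not Kafka's actual `SampledStat`, which also maintains time-windowed samples): the per-record update is a single arithmetic operation per stat, while anything heavier is deferred to query time:

```java
// Simplified stand-in for a Max + Avg stat pair attached to one sensor:
// record() is O(1) per call; avg() divides only when the metric is read.
class MiniStats {
    private double max = Double.NEGATIVE_INFINITY;
    private double sum = 0.0;
    private long count = 0;

    void record(double value) {
        // hot path: one update per stat, mirroring SampledStat#update
        max = Math.max(max, value);
        sum += value;
        count++;
    }

    double max() {
        return max; // query time
    }

    double avg() {
        return count == 0 ? Double.NaN : sum / count; // deferred aggregation
    }
}
```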
> > > > > > I wonder whether we should consider Avg and Max for all metrics
> > > > proposed
> > > > > as
> > > > > > the impact on the execution path seems minimal, and even see if
> Min
> > > is
> > > > > also
> > > > > > valuable, and use DEBUG only for more granular metrics.
> > > > > >
> > > > > > [1] https://github.com/apache/kafka/tree/trunk/jmh-benchmarks
> > > > > >
> > > > > >
> > > > > > > 4. Just to make sure I understand correctly--does "time when it
> > has
> > > > > been
> > > > > > > received by the Sink task" refer to the wallclock time directly
> > > > after a
> > > > > > > call to SinkTask::put has been completed (as opposed to
> directly
> > > > before
> > > > > > > that call is made, or something else entirely)?
> > > > > > >
> > > > > >
> > > > > > It currently means when it has been received by the Sink task
> > > > > > right after consumer poll and before conversions.
> > > > > > Would it be valuable to have it after put-sink-records?
> > > > > >
> > > > > >
> > > > > > > 5. If the goal is to identify performance bottlenecks (either
> in
> > > > > production
> > > > > > > or pre-production environments), would it make sense to
> introduce
> > > > > metrics
> > > > > > > for each individual converter (i.e., key/value/header) and
> > > > > transformation?
> > > > > > > It's definitely an improvement to be able to identify the total
> > > time
> > > > > for
> > > > > > > conversion and transformation, but then the immediate follow-up
> > > > > question if
> > > > > > > a bottleneck is found in that phase is "which
> > > > converter/transformation
> > > > > is
> > > > > > > responsible?" It'd be nice if we could provide a way to quickly
> > > > answer
> > > > > that
> > > > > > > question.
> > > > > > >
> > > > > >
> > > > > > This is a great idea. I'd like to consider this as well, though maybe
> > > > > > it would be good to have these more granular metrics at DEBUG level.
> > > > > >
> > > > > >
> > > > > > > 6. Any thoughts about offering latency metrics for source tasks
> > > > between
> > > > > > > receipt of the record from the task and delivery of the record
> to
> > > > Kafka
> > > > > > > (which would be tracked by producer callback)? We could also
> use
> > > the
> > > > > record
> > > > > > > timestamp either instead of or in addition to receipt time if
> the
> > > > task
> > > > > > > provides a timestamp with its records.
> > > > > > >
> > > > > >
> > > > > > With source transform and convert metrics we get part of that latency.
> > > > > > Looking at the Producer metrics, `request-latency` (though a very generic
> > > > > > metric) sort of answers the time between send request and ack — if my
> > > > > > understanding is correct.
> > > > > > Would these be enough or you're thinking about another approach?
> > > > > > maybe a custom metric to cover the producer side?
> > > > > >
> > > > > >
> > > > > > > 7. We may end up introducing a way for sink tasks to record
> > > > per-record
> > > > > > > delivery to the sink system (see KIP-767 [1]). I'd like it if
> we
> > > > could
> > > > > keep
> > > > > > > the names of our metrics very precise in order to avoid
> confusing
> > > > users
> > > > > > > (who may think that we're providing metrics on actual delivery
> to
> > > the
> > > > > sink
> > > > > > > system, which may not be the case if the connector performs
> > > > > asynchronous
> > > > > > > writes), and in order to leave room for a metrics on true
> > delivery
> > > > > time by
> > > > > > > sink tasks. It'd also be nice if we could remain consistent
> with
> > > > > existing
> > > > > > > metrics such as "put-batch-avg-time-ms". With that in mind,
> what
> > do
> > > > you
> > > > > > > think about renaming these metrics:
> > > > > > > - "sink-record-batch-latency-max-ms" to
> > "put-batch-avg-latency-ms"
> > > > > > > - "sink-record-latency-max-ms" to
> > "put-sink-record-latency-max-ms"
> > > > > > > - "sink-record-latency-avg-ms" to
> > "put-sink-record-latency-avg-ms"
> > > > > > > - "sink-record-convert-transform-time-max-ms" to
> > > > > > > "convert-transform-sink-record-time-max-ms"
> > > > > > > - "sink-record-convert-transform-time-avg-ms" to
> > > > > > > "convert-transform-sink-record-time-avg-ms"
> > > > > > > - "source-record-transform-convert-time-max-ms" to
> > > > > > > "transform-convert-source-record-time-max-ms"
> > > > > > > - "source-record-transform-convert-time-avg-ms" to
> > > > > > > "transform-convert-source-record-time-avg-ms"
> > > > > > >
> > > > > >
> > > > > > Makes sense, thanks! I have updated the list of metrics, grouping them
> > > > > > by sensor and applying these suggestions.
> > > > > > The only ones that I want to review are: sink-record-* to put-batch-*
> > > > > > (first 3). Not sure that put-batch/put-sink-record describes the purpose
> > > > > > of the metric — nor `sink-record-latency`, to be honest.
> > > > > > My initial thought was to have something like Kafka Streams e2e-latency.
> > > > > > Based on questions 4. and 6., an idea could be to add:
> > > > > > - source-batch-e2e-latency-before-send: measure wallclock - source record
> > > > > >   timestamp after source connector poll.
> > > > > > - source-batch-e2e-latency-after-send: measure wallclock - record timestamp
> > > > > >   on producer send callback
> > > > > > - sink-batch-e2e-latency-before-put: measure wallclock - record timestamp
> > > > > >   after consumer poll
> > > > > > - sink-batch-e2e-latency-after-put: measure wallclock - record timestamp
> > > > > >   after sink connector put.
> > > > > >
> > > > > >
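To make the four proposed measurement points concrete, a small sketch (names and timestamps are illustrative, not the Connect runtime code): each metric is the same wall-clock-minus-record-timestamp formula, evaluated at a different point in the task lifecycle:

```java
class E2eLatency {
    // Same formula for all four metrics; only the wall-clock instant differs
    // (after poll vs. after send for sources, after poll vs. after put for sinks).
    static long latencyMs(long recordTimestampMs, long wallClockMs) {
        return wallClockMs - recordTimestampMs;
    }
}
```

For example, a source record created at t=1000 ms, polled at t=1040 ms, and acknowledged by the producer at t=1075 ms would report before-send = 40 ms and after-send = 75 ms.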
> > > > > > > Thanks again for the KIP! Looking forward to your thoughts.
> > > > > > >
> > > > > > > Cheers,
> > > > > > >
> > > > > > > Chris
> > > > > > >
> > > > > > > [1] -
> > > > > > >
> > > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-767%3A+Connect+Latency+Metrics
> > > > > > >
> > > > > > > On Thu, Sep 15, 2022 at 1:32 PM Jorge Esteban Quilcate Otoya <
> > > > > > > quilcate.jorge@gmail.com> wrote:
> > > > > > >
> > > > > > > > Hi everyone,
> > > > > > > >
> > > > > > > > I've made a slight addition to the KIP based on Yash
> feedback:
> > > > > > > >
> > > > > > > > - A new metric is added at INFO level to record the max
> latency
> > > > from
> > > > > the
> > > > > > > > batch timestamp, by keeping the oldest record timestamp per
> > > batch.
> > > > > > > > - A draft implementation is linked.
> > > > > > > >
> > > > > > > > Looking forward to your feedback.
> > > > > > > > Also, a kindly reminder that the vote thread is open.
> > > > > > > >
> > > > > > > > Thanks!
> > > > > > > > Jorge.
> > > > > > > >
> > > > > > > > On Thu, 8 Sept 2022 at 14:25, Jorge Esteban Quilcate Otoya <
> > > > > > > > quilcate.jorge@gmail.com> wrote:
> > > > > > > >
> > > > > > > > > Great. I have updated the KIP to reflect this.
> > > > > > > > >
> > > > > > > > > Cheers,
> > > > > > > > > Jorge.
> > > > > > > > >
> > > > > > > > > On Thu, 8 Sept 2022 at 12:26, Yash Mayya <
> > yash.mayya@gmail.com
> > > >
> > > > > wrote:
> > > > > > > > >
> > > > > > > > >> Thanks, I think it makes sense to define these metrics at
> a
> > > > DEBUG
> > > > > > > > >> recording
> > > > > > > > >> level.
> > > > > > > > >>
> > > > > > > > >> On Thu, Sep 8, 2022 at 2:51 PM Jorge Esteban Quilcate
> Otoya
> > <
> > > > > > > > >> quilcate.jorge@gmail.com> wrote:
> > > > > > > > >>
> > > > > > > > >> > On Thu, 8 Sept 2022 at 05:55, Yash Mayya <
> > > > yash.mayya@gmail.com>
> > > > > > > > wrote:
> > > > > > > > >> >
> > > > > > > > >> > > Hi Jorge,
> > > > > > > > >> > >
> > > > > > > > >> > > Thanks for the changes. With regard to having per
> batch
> > vs
> > > > per
> > > > > > > > record
> > > > > > > > >> > > metrics, the additional overhead I was referring to
> > wasn't
> > > > > about
> > > > > > > > >> whether
> > > > > > > > >> > or
> > > > > > > > >> > > not we would need to iterate over all the records in a
> > > > batch.
> > > > > I
> > > > > > > was
> > > > > > > > >> > > referring to the potential additional overhead caused
> by
> > > the
> > > > > > > higher
> > > > > > > > >> > volume
> > > > > > > > >> > > of calls to Sensor::record on the sensors for the new
> > > > metrics
> > > > > (as
> > > > > > > > >> > compared
> > > > > > > > >> > > to the existing batch only metrics), especially for
> high
> > > > > > > throughput
> > > > > > > > >> > > connectors where batch sizes could be large. I guess
> we
> > > may
> > > > > want
> > > > > > > to
> > > > > > > > do
> > > > > > > > >> > some
> > > > > > > > >> > > sort of performance testing and get concrete numbers
> to
> > > > verify
> > > > > > > > whether
> > > > > > > > >> > this
> > > > > > > > >> > > is a valid concern or not?
> > > > > > > > >> > >
> > > > > > > > >> >
> > > > > > > > >> > 6.1. Got it, thanks for clarifying. I guess there could
> > be a
> > > > > > > benchmark
> > > > > > > > >> test
> > > > > > > > >> > of the `Sensor::record` to get an idea of the
> performance
> > > > > impact.
> > > > > > > > >> > Regardless, the fact that these are single-record
> metrics
> > > > > compared
> > > > > > > to
> > > > > > > > >> > existing batch-only could be explicitly defined by
> setting
> > > > these
> > > > > > > > >> metrics at
> > > > > > > > >> > a DEBUG or TRACE metric recording level, leaving the
> > > existing
> > > > at
> > > > > > > INFO
> > > > > > > > >> > level.
> > > > > > > > >> > wdyt?
> > > > > > > > >> >
> > > > > > > > >> >
> > > > > > > > >> > >
> > > > > > > > >> > > Thanks,
> > > > > > > > >> > > Yash
> > > > > > > > >> > >
> > > > > > > > >> > > On Tue, Sep 6, 2022 at 4:42 PM Jorge Esteban Quilcate
> > > Otoya
> > > > <
> > > > > > > > >> > > quilcate.jorge@gmail.com> wrote:
> > > > > > > > >> > >
> > > > > > > > >> > > > Hi Sagar and Yash,
> > > > > > > > >> > > >
> > > > > > > > >> > > > > the way it's defined in
> > > > > > > > >> > > >
> > > > https://kafka.apache.org/documentation/#connect_monitoring
> > > > > for
> > > > > > > > the
> > > > > > > > >> > > metrics
> > > > > > > > >> > > >
> > > > > > > > >> > > > 4.1. Got it. Add it to the KIP.
> > > > > > > > >> > > >
> > > > > > > > >> > > > > The only thing I would argue is do we need
> > > > > > > > >> sink-record-latency-min?
> > > > > > > > >> > > Maybe
> > > > > > > > >> > > > we
> > > > > > > > >> > > > > could remove this min metric as well and make all
> of
> > > the
> > > > > 3 e2e
> > > > > > > > >> > metrics
> > > > > > > > >> > > > > consistent
> > > > > > > > >> > > >
> > > > > > > > >> > > > 4.2 I see. Will remove it from the KIP.
> > > > > > > > >> > > >
> > > > > > > > >> > > > > Probably users can track the metrics at their end
> to
> > > > > > > > >> > > > > figure that out. Do you think that makes sense?
> > > > > > > > >> > > >
> > > > > > > > >> > > > 4.3. Yes, agree. With these new metrics it should be
> > > > easier
> > > > > for
> > > > > > > > >> users
> > > > > > > > >> > to
> > > > > > > > >> > > > track this.
> > > > > > > > >> > > >
> > > > > > > > >> > > > > I think it makes sense to not have a min metric
> for
> > > > > either to
> > > > > > > > >> remain
> > > > > > > > >> > > > > consistent with the existing put-batch and
> > poll-batch
> > > > > metrics
> > > > > > > > >> > > >
> > > > > > > > >> > > > 5.1. Got it. Same as 4.2
> > > > > > > > >> > > >
> > > > > > > > >> > > > > Another naming related suggestion I had was with
> the
> > > > > > > > >> > > > > "convert-time" metrics - we should probably
> include
> > > > > > > > >> transformations
> > > > > > > > >> > in
> > > > > > > > >> > > > the
> > > > > > > > >> > > > > name since SMTs could definitely be attributable
> to
> > a
> > > > > sizable
> > > > > > > > >> chunk
> > > > > > > > >> > of
> > > > > > > > >> > > > the
> > > > > > > > >> > > > > latency depending on the specific transformation
> > > chain.
> > > > > > > > >> > > >
> > > > > > > > >> > > > 5.2. Make sense. I'm proposing to add
> > > > > > > > >> > `sink-record-convert-transform...`
> > > > > > > > >> > > > and `source-record-transform-convert...` to
> represent
> > > > > correctly
> > > > > > > > the
> > > > > > > > >> > order
> > > > > > > > >> > > > of operations.
> > > > > > > > >> > > >
> > > > > > > > >> > > > > it seems like both source and sink tasks only
> record
> > > > > metrics
> > > > > > > at
> > > > > > > > a
> > > > > > > > >> > > "batch"
> > > > > > > > >> > > > > level, not on an individual record level. I think
> it
> > > > > might be
> > > > > > > > >> > > additional
> > > > > > > > >> > > > > overhead if we want to record these new metrics
> all
> > at
> > > > the
> > > > > > > > record
> > > > > > > > >> > > level?
> > > > > > > > >> > > >
> > > > > > > > >> > > > 5.3. I initially considered implementing all metrics at the
> > > > > > > > >> > > > batch level, but given how the framework processes records, I
> > > > > > > > >> > > > fell back to the proposed approach:
> > > > > > > > >> > > > - Sink Task:
> > > > > > > > >> > > >   - `WorkerSinkTask#convertMessages(msgs)` already
> > > > iterates
> > > > > over
> > > > > > > > >> > records,
> > > > > > > > >> > > > so there is no additional overhead to capture record
> > > > > latency per
> > > > > > > > >> > record.
> > > > > > > > >> > > >     -
> > > > > > > > >> > > >
> > > > > > > > >> > > >
> > > > > > > > >> > >
> > > > > > > > >> >
> > > > > > > > >>
> > > > > > > >
> > > > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/apache/kafka/blob/9841647c4fe422532f448423c92d26e4fdcb8932/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/WorkerSinkTask.java#L490-L514
> > > > > > > > >> > > >   -
> `WorkerSinkTask#convertAndTransformRecord(record)`
> > > > > actually
> > > > > > > > >> happens
> > > > > > > > >> > > > individually. Measuring this operation per batch
> would
> > > > > include
> > > > > > > > >> > processing
> > > > > > > > >> > > > that is not strictly part of "convert and transform"
> > > > > > > > >> > > >     -
> > > > > > > > >> > > >
> > > > > > > > >> > > >
> > > > > > > > >> > >
> > > > > > > > >> >
> > > > > > > > >>
> > > > > > > >
> > > > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/apache/kafka/blob/9841647c4fe422532f448423c92d26e4fdcb8932/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/WorkerSinkTask.java#L518
> > > > > > > > >> > > > - Source Task:
> > > > > > > > >> > > >   - `AbstractWorkerSourceTask#sendRecords` iterates
> > > over a
> > > > > batch
> > > > > > > > and
> > > > > > > > >> > > > applies transforms and convert record individually
> as
> > > > well:
> > > > > > > > >> > > >     -
> > > > > > > > >> > > >
> > > > > > > > >> > > >
> > > > > > > > >> > >
> > > > > > > > >> >
> > > > > > > > >>
> > > > > > > >
> > > > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/apache/kafka/blob/9841647c4fe422532f448423c92d26e4fdcb8932/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/AbstractWorkerSourceTask.java#L389-L390
> > > > > > > > >> > > >
> > > > > > > > >> > > > > This might require some additional changes -
> > > > > > > > >> > > > > for instance, with the "sink-record-latency"
> metric,
> > > we
> > > > > might
> > > > > > > > only
> > > > > > > > >> > want
> > > > > > > > >> > > > to
> > > > > > > > >> > > > > have a "max" metric since "avg" would require
> > > recording
> > > > a
> > > > > > > value
> > > > > > > > on
> > > > > > > > >> > the
> > > > > > > > >> > > > > sensor for each record (whereas we can get a "max"
> > by
> > > > only
> > > > > > > > >> recording
> > > > > > > > >> > a
> > > > > > > > >> > > > > metric value for the oldest record in each batch).
> > > > > > > > >> > > >
> > > > > > > > >> > > > 5.4. Recording record-latency per batch may not be
> as
> > > > > useful as
> > > > > > > > >> there
> > > > > > > > >> > is
> > > > > > > > >> > > no
> > > > > > > > >> > > > guarantee that the oldest record will be
> > representative
> > > of
> > > > > the
> > > > > > > > >> batch.
> > > > > > > > >> > > >
> > > > > > > > >> > > > On Sat, 3 Sept 2022 at 16:02, Yash Mayya <
> > > > > yash.mayya@gmail.com>
> > > > > > > > >> wrote:
> > > > > > > > >> > > >
> > > > > > > > >> > > > > Hi Jorge and Sagar,
> > > > > > > > >> > > > >
> > > > > > > > >> > > > > I think it makes sense to not have a min metric
> for
> > > > > either to
> > > > > > > > >> remain
> > > > > > > > >> > > > > consistent with the existing put-batch and
> > poll-batch
> > > > > metrics
> > > > > > > > (it
> > > > > > > > >> > > doesn't
> > > > > > > > >> > > > > seem particularly useful either anyway). Also, the
> > new
> > > > > > > > >> > > > > "sink-record-latency" metric name looks fine to
> me,
> > > > > thanks for
> > > > > > > > >> making
> > > > > > > > >> > > the
> > > > > > > > >> > > > > changes! Another naming related suggestion I had
> was
> > > > with
> > > > > the
> > > > > > > > >> > > > > "convert-time" metrics - we should probably
> include
> > > > > > > > >> transformations
> > > > > > > > >> > in
> > > > > > > > >> > > > the
> > > > > > > > >> > > > > name since SMTs could definitely be attributable
> to
> > a
> > > > > sizable
> > > > > > > > >> chunk
> > > > > > > > >> > of
> > > > > > > > >> > > > the
> > > > > > > > >> > > > > latency depending on the specific transformation
> > > chain.
> > > > > > > > >> > > > >
> > > > > > > > >> > > > > I have one high level question with respect to
> > > > > implementation
> > > > > > > -
> > > > > > > > >> > > > currently,
> > > > > > > > >> > > > > it seems like both source and sink tasks only
> record
> > > > > metrics
> > > > > > > at
> > > > > > > > a
> > > > > > > > >> > > "batch"
> > > > > > > > >> > > > > level, not on an individual record level. I think
> it
> > > > > might be
> > > > > > > > >> > > additional
> > > > > > > > >> > > > > overhead if we want to record these new metrics
> all
> > at
> > > > the
> > > > > > > > record
> > > > > > > > >> > > level?
> > > > > > > > >> > > > > Could we instead make all of these new metrics for
> > > > > batches of
> > > > > > > > >> records
> > > > > > > > >> > > > > rather than individual records in order to remain
> > > > > consistent
> > > > > > > > with
> > > > > > > > >> the
> > > > > > > > >> > > > > existing task level metrics? This might require
> some
> > > > > > > additional
> > > > > > > > >> > > changes -
> > > > > > > > >> > > > > for instance, with the "sink-record-latency"
> metric,
> > > we
> > > > > might
> > > > > > > > only
> > > > > > > > >> > want
> > > > > > > > >> > > > to
> > > > > > > > >> > > > > have a "max" metric since "avg" would require
> > > recording
> > > > a
> > > > > > > value
> > > > > > > > on
> > > > > > > > >> > the
> > > > > > > > >> > > > > sensor for each record (whereas we can get a "max"
> > by
> > > > only
> > > > > > > > >> recording
> > > > > > > > >> > a
> > > > > > > > >> > > > > metric value for the oldest record in each batch).
> > > > > > > > >> > > > >
> > > > > > > > >> > > > > Thanks,
> > > > > > > > >> > > > > Yash
> > > > > > > > >> > > > >
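[The batch-level approach suggested above can be sketched in a few lines: to expose a per-batch "max" it is enough to record a single value against the sensor, computed from the oldest record timestamp in the batch. A minimal, stdlib-only illustration; the `Rec` type and `maxLatencyMs` helper are stand-ins for illustration, not the actual Connect `SinkRecord` API.]

```java
import java.util.List;

public class BatchLatency {

    // Stand-in for SinkRecord: only the Kafka record timestamp matters here.
    static class Rec {
        final long timestampMs;
        Rec(long timestampMs) { this.timestampMs = timestampMs; }
    }

    /**
     * Max end-to-end latency for a batch: wall clock minus the oldest record
     * timestamp. One sensor recording per batch instead of one per record.
     */
    static long maxLatencyMs(List<Rec> batch, long nowMs) {
        long oldest = Long.MAX_VALUE;
        for (Rec r : batch) {
            oldest = Math.min(oldest, r.timestampMs);
        }
        return nowMs - oldest;
    }

    public static void main(String[] args) {
        List<Rec> batch = List.of(new Rec(1_000L), new Rec(1_500L), new Rec(900L));
        // Oldest timestamp is 900, so with "now" at 2_000 the batch max is 1_100 ms.
        System.out.println(maxLatencyMs(batch, 2_000L));
    }
}
```

[Recording only this single value per batch keeps the volume of Sensor::record calls at the level of the existing batch metrics while still exposing a meaningful max.]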
> > > > > > > > >> > > > > On Fri, Sep 2, 2022 at 3:16 PM Sagar <
> > > > > > > sagarmeansocean@gmail.com
> > > > > > > > >
> > > > > > > > >> > > wrote:
> > > > > > > > >> > > > >
> > > > > > > > >> > > > > > Hi Jorge,
> > > > > > > > >> > > > > >
> > > > > > > > >> > > > > > Thanks for the changes.
> > > > > > > > >> > > > > >
> > > > > > > > >> > > > > > Regarding the metrics, I meant something like
> > this:
> > > > > > > > >> > > > > >
> > > > > > > > >> > > > >
> > > > > > > > >> > > >
> > > > > > > > >> > >
> > > > > > > > >> >
> > > > > > > > >>
> > > > > > > >
> > > > > > >
> > > > >
> > > >
> > >
> >
> kafka.connect:type=sink-task-metrics,connector="{connector}",task="{task}"
> > > > > > > > >> > > > > >
> > > > > > > > >> > > > > > the way it's defined in
> > > > > > > > >> > > > > >
> > > > > https://kafka.apache.org/documentation/#connect_monitoring
> > > > > > > > for
> > > > > > > > >> the
> > > > > > > > >> > > > > > metrics.
> > > > > > > > >> > > > > >
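[For reference, the task-level MBean pattern quoted above can be built and queried with the standard JMX API. A small sketch; the connector and task names are placeholders, and the attribute-read call in the comment assumes a live Connect worker JVM.]

```java
import javax.management.ObjectName;

public class TaskMetricName {

    /** Build the task-level MBean name used for Connect sink task metrics. */
    static ObjectName sinkTaskMetrics(String connector, int task) throws Exception {
        return new ObjectName(String.format(
                "kafka.connect:type=sink-task-metrics,connector=%s,task=%d",
                connector, task));
    }

    public static void main(String[] args) throws Exception {
        ObjectName name = sinkTaskMetrics("my-sink", 0);
        System.out.println(name.getKeyProperty("connector")); // my-sink
        // Against a live worker, a proposed metric would then be read with e.g.
        // ManagementFactory.getPlatformMBeanServer().getAttribute(name, "<metric-name>")
    }
}
```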
> > > > > > > > >> > > > > > I see what you mean by the 3 metrics and how it
> > can
> > > be
> > > > > > > > >> interpreted.
> > > > > > > > >> > > The
> > > > > > > > >> > > > > > only thing I would argue is do we need
> > > > > > > > sink-record-latency-min?
> > > > > > > > >> > Maybe
> > > > > > > > >> > > > we
> > > > > > > > >> > > > > > could remove this min metric as well and make
> all
> > of
> > > > > the 3
> > > > > > > e2e
> > > > > > > > >> > > metrics
> > > > > > > > >> > > > > > consistent(since put-batch also doesn't expose a
> > min
> > > > > which
> > > > > > > > makes
> > > > > > > > >> > > sense
> > > > > > > > >> > > > to
> > > > > > > > >> > > > > > me). I think this is in contrast to what Yash
> > > pointed
> > > > > out
> > > > > > > > above
> > > > > > > > >> so
> > > > > > > > >> > I
> > > > > > > > >> > > > > would
> > > > > > > > >> > > > > > like to hear his thoughts as well.
> > > > > > > > >> > > > > >
> > > > > > > > >> > > > > > The other point Yash mentioned about the
> slightly
> > > > flawed
> > > > > > > > >> definition
> > > > > > > > >> > > of
> > > > > > > > >> > > > > e2e
> > > > > > > > >> > > > > > is also true in a sense. But I have a feeling
> > that's
> > > > > once the
> > > > > > > > >> > records
> > > > > > > > >> > > > are
> > > > > > > > >> > > > > > polled by the connector tasks, it would be
> > difficult
> > > > to
> > > > > > > track
> > > > > > > > >> the
> > > > > > > > >> > > final
> > > > > > > > >> > > > > leg
> > > > > > > > >> > > > > > via the framework. Probably users can track the
> > > > metrics
> > > > > at
> > > > > > > > their
> > > > > > > > >> > end
> > > > > > > > >> > > to
> > > > > > > > >> > > > > > figure that out. Do you think that makes sense?
> > > > > > > > >> > > > > >
> > > > > > > > >> > > > > > Thanks!
> > > > > > > > >> > > > > > Sagar.
> > > > > > > > >> > > > > >
> > > > > > > > >> > > > > >
> > > > > > > > >> > > > > >
> > > > > > > > >> > > > > >
> > > > > > > > >> > > > > > On Thu, Sep 1, 2022 at 11:40 PM Jorge Esteban
> > > Quilcate
> > > > > > > Otoya <
> > > > > > > > >> > > > > > quilcate.jorge@gmail.com> wrote:
> > > > > > > > >> > > > > >
> > > > > > > > >> > > > > > > Hi Sagar and Yash,
> > > > > > > > >> > > > > > >
> > > > > > > > >> > > > > > > Thanks for your feedback!
> > > > > > > > >> > > > > > >
> > > > > > > > >> > > > > > > > 1) I am assuming the new metrics would be
> task
> > > > level
> > > > > > > > metric.
> > > > > > > > >> > > > > > >
> > > > > > > > >> > > > > > > 1.1 Yes, it will be a task level metric,
> > > implemented
> > > > > on
> > > > > > > the
> > > > > > > > >> > > > > > > Worker[Source/Sink]Task.
> > > > > > > > >> > > > > > >
> > > > > > > > >> > > > > > > > Could you specify the way it's done for
> other
> > > > > > > sink/source
> > > > > > > > >> > > > connector?
> > > > > > > > >> > > > > > >
> > > > > > > > >> > > > > > > 1.2. Not sure what you mean by this. Could
> > you
> > > > > > > elaborate
> > > > > > > > a
> > > > > > > > >> bit
> > > > > > > > >> > > > more?
> > > > > > > > >> > > > > > >
> > > > > > > > >> > > > > > > > 2. I am slightly confused about the e2e
> > latency
> > > > > > > metric...
> > > > > > > > >> > > > > > >
> > > > > > > > >> > > > > > > 2.1. Yes, I see. I was trying to bring a
> similar
> > > > > concept
> > > > > > > as
> > > > > > > > in
> > > > > > > > >> > > > Streams
> > > > > > > > >> > > > > > with
> > > > > > > > >> > > > > > > KIP-613, though the e2e concept may not be
> > > > > translatable.
> > > > > > > > >> > > > > > > We could keep it as `sink-record-latency` to
> > avoid
> > > > > > > > conflating
> > > > > > > > >> > > > > concepts. A
> > > > > > > > >> > > > > > > similar metric naming was proposed in KIP-489
> > but
> > > at
> > > > > the
> > > > > > > > >> consumer
> > > > > > > > >> > > > > level —
> > > > > > > > >> > > > > > > though it seems dormant for a couple of years.
> > > > > > > > >> > > > > > >
> > > > > > > > >> > > > > > > > However, the put-batch time measures the
> > > > > > > > >> > > > > > > > time to put a batch of records to external
> > sink.
> > > > > So, I
> > > > > > > > would
> > > > > > > > >> > > assume
> > > > > > > > >> > > > > > the 2
> > > > > > > > >> > > > > > > > can't be added as is to compute the e2e
> > latency.
> > > > > Maybe I
> > > > > > > > am
> > > > > > > > >> > > missing
> > > > > > > > >> > > > > > > > something here. Could you plz clarify this.
> > > > > > > > >> > > > > > >
> > > > > > > > >> > > > > > > 2.2. Yes, agree. Not necessarily added, but
> with
> > > > the 3
> > > > > > > > >> latencies
> > > > > > > > >> > > > (poll,
> > > > > > > > >> > > > > > > convert, putBatch) will be clearer where the
> > > > > bottleneck
> > > > > > > may
> > > > > > > > >> be,
> > > > > > > > >> > and
> > > > > > > > >> > > > > > > represent the internal processing.
> > > > > > > > >> > > > > > >
> > > > > > > > >> > > > > > > > however, as per the KIP it looks like it
> will
> > be
> > > > > > > > >> > > > > > > > the latency between when the record was
> > written
> > > to
> > > > > Kafka
> > > > > > > > and
> > > > > > > > >> > when
> > > > > > > > >> > > > the
> > > > > > > > >> > > > > > > > record is returned by a sink task's
> consumer's
> > > > poll?
> > > > > > > > >> > > > > > >
> > > > > > > > >> > > > > > > 3.1. Agree. 2.1. could help to clarify this.
> > > > > > > > >> > > > > > >
> > > > > > > > >> > > > > > > > One more thing - I was wondering
> > > > > > > > >> > > > > > > > if there's a particular reason for having a
> > min
> > > > > metric
> > > > > > > for
> > > > > > > > >> e2e
> > > > > > > > >> > > > > latency
> > > > > > > > >> > > > > > > but
> > > > > > > > >> > > > > > > > not for convert time?
> > > > > > > > >> > > > > > >
> > > > > > > > >> > > > > > > 3.2. Was following KIP-613 for e2e, which seems
> > > > > > > > >> > > > > > > useful to compare with Max and get an idea of the
> > > > > > > > >> > > > > > > window of results, though current latencies in
> > > > > > > > >> > > > > > > Connector do not include Min, and that's why I
> > > > > > > > >> > > > > > > haven't added it for convert latency.
> > > > > > > > >> > > > > > > Do you think it makes sense to extend latency
> > > > > > > > >> > > > > > > metrics with Min?
> > > > > > > > >> > > > > > >
> > > > > > > > >> > > > > > > KIP is updated to clarify some of these
> changes.
> > > > > > > > >> > > > > > >
> > > > > > > > >> > > > > > > Many thanks,
> > > > > > > > >> > > > > > > Jorge.
> > > > > > > > >> > > > > > >
> > > > > > > > >> > > > > > > On Thu, 1 Sept 2022 at 18:11, Yash Mayya <
> > > > > > > > >> yash.mayya@gmail.com>
> > > > > > > > >> > > > wrote:
> > > > > > > > >> > > > > > >
> > > > > > > > >> > > > > > > > Hi Jorge,
> > > > > > > > >> > > > > > > >
> > > > > > > > >> > > > > > > > Thanks for the KIP! I have the same
> confusion
> > > with
> > > > > the
> > > > > > > > >> > > e2e-latency
> > > > > > > > >> > > > > > > metrics
> > > > > > > > >> > > > > > > > as Sagar above. "e2e" would seem to indicate
> > the
> > > > > latency
> > > > > > > > >> > between
> > > > > > > > >> > > > when
> > > > > > > > >> > > > > > the
> > > > > > > > >> > > > > > > > record was written to Kafka and when the
> > record
> > > > was
> > > > > > > > written
> > > > > > > > >> to
> > > > > > > > >> > > the
> > > > > > > > >> > > > > sink
> > > > > > > > >> > > > > > > > system by the connector - however, as per
> the
> > > KIP
> > > > it
> > > > > > > looks
> > > > > > > > >> like
> > > > > > > > >> > > it
> > > > > > > > >> > > > > will
> > > > > > > > >> > > > > > > be
> > > > > > > > >> > > > > > > > the latency between when the record was
> > written
> > > to
> > > > > Kafka
> > > > > > > > and
> > > > > > > > >> > when
> > > > > > > > >> > > > the
> > > > > > > > >> > > > > > > > record is returned by a sink task's
> consumer's
> > > > > poll? I
> > > > > > > > think
> > > > > > > > >> > that
> > > > > > > > >> > > > > > metric
> > > > > > > > >> > > > > > > > will be a little confusing to interpret. One
> > > more
> > > > > thing
> > > > > > > -
> > > > > > > > I
> > > > > > > > >> was
> > > > > > > > >> > > > > > wondering
> > > > > > > > >> > > > > > > > if there's a particular reason for having a
> > min
> > > > > metric
> > > > > > > for
> > > > > > > > >> e2e
> > > > > > > > >> > > > > latency
> > > > > > > > >> > > > > > > but
> > > > > > > > >> > > > > > > > not for convert time?
> > > > > > > > >> > > > > > > >
> > > > > > > > >> > > > > > > > Thanks,
> > > > > > > > >> > > > > > > > Yash
> > > > > > > > >> > > > > > > >
> > > > > > > > >> > > > > > > > On Thu, Sep 1, 2022 at 8:59 PM Sagar <
> > > > > > > > >> > sagarmeansocean@gmail.com>
> > > > > > > > >> > > > > > wrote:
> > > > > > > > >> > > > > > > >
> > > > > > > > >> > > > > > > > > Hi Jorge,
> > > > > > > > >> > > > > > > > >
> > > > > > > > >> > > > > > > > > Thanks for the KIP. It looks like a very
> > good
> > > > > > > addition.
> > > > > > > > I
> > > > > > > > >> > > skimmed
> > > > > > > > >> > > > > > > through
> > > > > > > > >> > > > > > > > > once and had a couple of questions =>
> > > > > > > > >> > > > > > > > >
> > > > > > > > >> > > > > > > > > 1) I am assuming the new metrics would be
> > task
> > > > > level
> > > > > > > > >> metric.
> > > > > > > > >> > > > Could
> > > > > > > > >> > > > > > you
> > > > > > > > >> > > > > > > > > specify the way it's done for other
> > > sink/source
> > > > > > > > connector?
> > > > > > > > >> > > > > > > > > 2) I am slightly confused about the e2e
> > > latency
> > > > > > > metric.
> > > > > > > > >> Let's
> > > > > > > > >> > > > > > consider
> > > > > > > > >> > > > > > > > the
> > > > > > > > >> > > > > > > > > sink connector metric. If I look at the
> way
> > > it's
> > > > > > > > supposed
> > > > > > > > >> to
> > > > > > > > >> > be
> > > > > > > > >> > > > > > > > calculated,
> > > > > > > > >> > > > > > > > > i.e the difference between the record
> > > timestamp
> > > > > and
> > > > > > > the
> > > > > > > > >> wall
> > > > > > > > >> > > > clock
> > > > > > > > >> > > > > > > time,
> > > > > > > > >> > > > > > > > it
> > > > > > > > >> > > > > > > > > looks like a per record metric. However,
> the
> > > > > put-batch
> > > > > > > > >> time
> > > > > > > > >> > > > > measures
> > > > > > > > >> > > > > > > the
> > > > > > > > >> > > > > > > > > time to put a batch of records to external
> > > sink.
> > > > > So, I
> > > > > > > > >> would
> > > > > > > > >> > > > assume
> > > > > > > > >> > > > > > > the 2
> > > > > > > > >> > > > > > > > > can't be added as is to compute the e2e
> > > latency.
> > > > > > > Maybe I
> > > > > > > > >> am
> > > > > > > > >> > > > missing
> > > > > > > > >> > > > > > > > > something here. Could you plz clarify
> this.
> > > > > > > > >> > > > > > > > >
> > > > > > > > >> > > > > > > > > Thanks!
> > > > > > > > >> > > > > > > > > Sagar.
> > > > > > > > >> > > > > > > > >
> > > > > > > > >> > > > > > > > > On Tue, Aug 30, 2022 at 8:43 PM Jorge
> > Esteban
> > > > > Quilcate
> > > > > > > > >> Otoya
> > > > > > > > >> > <
> > > > > > > > >> > > > > > > > > quilcate.jorge@gmail.com> wrote:
> > > > > > > > >> > > > > > > > >
> > > > > > > > >> > > > > > > > > > Hi all,
> > > > > > > > >> > > > > > > > > >
> > > > > > > > >> > > > > > > > > > I'd like to start a discussion thread on
> > > > > KIP-864:
> > > > > > > Add
> > > > > > > > >> > > > End-To-End
> > > > > > > > >> > > > > > > > Latency
> > > > > > > > >> > > > > > > > > > Metrics to Connectors.
> > > > > > > > >> > > > > > > > > > This KIP aims to improve the metrics
> > > available
> > > > > on
> > > > > > > > Source
> > > > > > > > >> > and
> > > > > > > > >> > > > Sink
> > > > > > > > >> > > > > > > > > > Connectors to measure end-to-end
> latency,
> > > > > including
> > > > > > > > >> source
> > > > > > > > >> > > and
> > > > > > > > >> > > > > sink
> > > > > > > > >> > > > > > > > > record
> > > > > > > > >> > > > > > > > > > conversion time, and sink record e2e
> > latency
> > > > > > > (similar
> > > > > > > > to
> > > > > > > > >> > > > KIP-613
> > > > > > > > >> > > > > > for
> > > > > > > > >> > > > > > > > > > Streams).
> > > > > > > > >> > > > > > > > > >
> > > > > > > > >> > > > > > > > > > The KIP is here:
> > > > > > > > >> > > > > > > > > >
> > > > > > > > >> > > > > > > > > >
> > > > > > > > >> > > > > > > > >
> > > > > > > > >> > > > > > > >
> > > > > > > > >> > > > > > >
> > > > > > > > >> > > > > >
> > > > > > > > >> > > > >
> > > > > > > > >> > > >
> > > > > > > > >> > >
> > > > > > > > >> >
> > > > > > > > >>
> > > > > > > >
> > > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-864%3A+Add+End-To-End+Latency+Metrics+to+Connectors
> > > > > > > > >> > > > > > > > > >
> > > > > > > > >> > > > > > > > > > Please take a look and let me know what
> > you
> > > > > think.
> > > > > > > > >> > > > > > > > > >
> > > > > > > > >> > > > > > > > > > Cheers,
> > > > > > > > >> > > > > > > > > > Jorge.
> > > > > > > > >> > > > > > > > > >
> > > > > > > > >> > > > > > > > >
> > > > > > > > >> > > > > > > >
> > > > > > > > >> > > > > > >
> > > > > > > > >> > > > > >
> > > > > > > > >> > > > >
> > > > > > > > >> > > >
> > > > > > > > >> > >
> > > > > > > > >> >
> > > > > > > > >>
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] KIP-864: Add End-To-End Latency Metrics to Connectors

Posted by Chris Egerton <ch...@aiven.io.INVALID>.
Hi Jorge,

Thanks! What were your thoughts on the possible benchmarking and/or
downgrading of per-record metrics to DEBUG?

Cheers,

Chris

On Thu, Nov 24, 2022 at 8:20 AM Jorge Esteban Quilcate Otoya <
quilcate.jorge@gmail.com> wrote:

> Thanks Chris! I have updated the KIP with "transform" instead of "alias".
> Agree it's clearer.
>
> Cheers,
> Jorge.
>
> On Mon, 21 Nov 2022 at 21:36, Chris Egerton <ch...@aiven.io.invalid>
> wrote:
>
> > Hi Jorge,
> >
> > Thanks for the updates, and apologies for the delay. The new diagram
> > directly under the "Proposed Changes" section is absolutely gorgeous!
> >
> >
> > Follow-ups:
> >
> > RE 2: Good point. We can use the same level for these metrics, it's not a
> > big deal.
> >
> > RE 3: As long as all the per-record metrics are kept at DEBUG level, it
> > should be fine to leave JMH benchmarking for a follow-up. If we want to
> add
> > new per-record, INFO-level metrics, I would be more comfortable with
> > including benchmarking as part of the testing plan for the KIP. One
> > possible compromise could be to propose that these features be merged at
> > DEBUG level, and then possibly upgraded to INFO level in the future
> pending
> > benchmarks to guard against performance degradation.
> >
> > RE 4: I think for a true "end-to-end" metric, it'd be useful to include
> the
> > time taken by the task to actually deliver the record. However, with the
> > new metric names and descriptions provided in the KIP, I have no objections
> > to what's currently proposed, and a new "end-to-end" metric can be taken
> > on later in a follow-up KIP.
> >
> > RE 6: You're right, existing producer metrics should be enough for now.
> We
> > can revisit this later if/when we add delivery-centric metrics for sink
> > tasks as well.
> >
> > RE 7: The new metric names in the KIP LGTM; I don't see any need to
> expand
> > beyond those but if you'd still like to pursue others, LMK.
> >
> >
> > New thoughts:
> >
> > One small thought: instead of "alias" in "alias="{transform_alias}" for
> the
> > per-transform metrics, could we use "transform"? IMO it's clearer since
> we
> > don't use "alias" in the names of transform-related properties, and
> "alias"
> > may be confused with the classloading term where you can use, e.g.,
> > "FileStreamSource" as the name of a connector class in a connector config
> > instead of "org.apache.kafka.connect.file.FileStreamSourceConnector".
> >
> >
> > Cheers,
> >
> > Chris
> >
> > On Fri, Nov 18, 2022 at 12:06 PM Jorge Esteban Quilcate Otoya <
> > quilcate.jorge@gmail.com> wrote:
> >
> > > Thanks Mickael!
> > >
> > >
> > > On Wed, 9 Nov 2022 at 15:54, Mickael Maison <mi...@gmail.com>
> > > wrote:
> > >
> > > > Hi Jorge,
> > > >
> > > > Thanks for the KIP, it is a nice improvement.
> > > >
> > > > 1) The per transformation metrics still have a question mark next to
> > > > them in the KIP. Do you want to include them? If so we'll want to tag
> > > > them, we should be able to include the aliases in TransformationChain
> > > > and use them.
> > > >
> > >
> > > Yes, I have added the changes on TransformChain that will be needed to
> > add
> > > these metrics.
> > >
> > >
> > > >
> > > > 2) I see no references to predicates. If we don't want to measure
> > > > their latency, can we say it explicitly?
> > > >
> > >
> > > Good question, I haven't considered these. Though as these are
> > materialized
> > > as PredicatedTransformation, they should be covered by these changes.
> > > Adding a note about this.
> > >
> > >
> > > >
> > > > 3) Should we have sink-record-batch-latency-avg-ms? All other metrics
> > > > have both the maximum and average values.
> > > >
> > > >
> > > Good question. I will remove it and change the record latency from
> > > DEBUG->INFO as it already covers the maximum metric.
> > >
> > > Hope it's clearer now, let me know if there any additional feedback.
> > > Thanks!
> > >
> > >
> > >
> > > > Thanks,
> > > > Mickael
> > > >
> > > > On Thu, Oct 20, 2022 at 9:58 PM Jorge Esteban Quilcate Otoya
> > > > <qu...@gmail.com> wrote:
> > > > >
> > > > > Thanks, Chris! Great feedback! Please, find my comments below:
> > > > >
> > > > > On Thu, 13 Oct 2022 at 18:52, Chris Egerton
> <chrise@aiven.io.invalid
> > >
> > > > wrote:
> > > > >
> > > > > > Hi Jorge,
> > > > > >
> > > > > > Thanks for the KIP. I agree with the overall direction and think
> > this
> > > > would
> > > > > > be a nice improvement to Kafka Connect. Here are my initial
> > thoughts
> > > > on the
> > > > > > details:
> > > > > >
> > > > > > 1. The motivation section outlines the gaps in Kafka Connect's
> task
> > > > metrics
> > > > > > nicely. I think it'd be useful to include more concrete details
> on
> > > why
> > > > > > these gaps need to be filled in, and in which cases additional
> > > metrics
> > > > > > would be helpful. One goal could be to provide enhanced
> monitoring
> > of
> > > > > > production deployments that allows for cluster administrators to
> > set
> > > up
> > > > > > automatic alerts for latency spikes and, if triggered, quickly
> > > > identify the
> > > > > > root cause of those alerts, reducing the time to remediation.
> > Another
> > > > goal
> > > > > > could be to provide more insight to developers or cluster
> > > > administrators
> > > > > > who want to do performance testing on connectors in
> non-production
> > > > > > environments. It may help guide our decision making process to
> > have a
> > > > > > clearer picture of the goals we're trying to achieve.
> > > > > >
> > > > >
> > > > > Agree. The Motivation section has been updated.
> > > > > Thanks for the examples, I see both of them being covered by the
> KIP.
> > > > > I see how these could give us a good distinction on whether to
> > position
> > > > > some metrics at INFO or DEBUG level.
> > > > >
> > > > >
> > > > > > 2. If we're trying to address the alert-and-diagnose use case,
> it'd
> > > be
> > > > > > useful to have as much information as possible at INFO level,
> > rather
> > > > than
> > > > > > forcing cluster administrators to possibly reconfigure a
> connector
> > to
> > > > emit
> > > > > > DEBUG or TRACE level metrics in order to diagnose a potential
> > > > > > production-impacting performance bottleneck. I can see the
> > rationale
> > > > for
> > > > > > emitting per-record metrics that track an average value at DEBUG
> > > > level, but
> > > > > > for per-record metrics that track a maximum value, is there any
> > > reason
> > > > not
> > > > > > to provide this information at INFO level?
> > > > > >
> > > > >
> > > > > Agree. Though with Max and Avg metrics being part of the same
> sensor
> > —
> > > > > where Metric Level is defined — then both metrics get the same
> level.
> > > > >
> > > > >
> > > > > > 3. I'm also curious about the performance testing suggested by
> Yash
> > > to
> > > > > > gauge the potential impact of this change. Have you been able to
> do
> > > any
> > > > > > testing with your draft implementation yet?
> > > > > >
> > > > >
> > > > > No, not so far.
> > > > > I think it would be valuable to discuss the scope of this testing
> > > > > and maybe tackle it in a separate issue, as Sensors and Metrics are
> > > > > used all over the place.
> > > > > My initial understanding is that these tests should be placed in the
> > > > > jmh-benchmarks[1].
> > > > > Then, we could target testing Sensors and Metrics, and validate how
> > > much
> > > > > overhead
> > > > > is added by having only Max vs Max,Avg(,Min), etc.
> > > > > In the other hand, we could extend this to Transformers or other
> > > Connect
> > > > > layers.
> > > > >
> > > > > Here are some pointers to the Sensors and Metrics implementations
> > that
> > > > > could be considered:
> > > > > Path to metric recording:
> > > > > -
> > > > >
> > > >
> > >
> >
> https://github.com/apache/kafka/blob/5cab11cf525f6c06fcf9eb43f7f95ef33fe1cdbb/clients/src/main/java/org/apache/kafka/common/metrics/Sensor.java#L195-L199
> > > > > -
> > > > >
> > > >
> > >
> >
> https://github.com/apache/kafka/blob/5cab11cf525f6c06fcf9eb43f7f95ef33fe1cdbb/clients/src/main/java/org/apache/kafka/common/metrics/Sensor.java#L230-L244
> > > > >
> > > > > ```
> > > > > // increment all the stats
> > > > > for (StatAndConfig statAndConfig : this.stats) {
> > > > >    statAndConfig.stat.record(statAndConfig.config(), value,
> timeMs);
> > > > > }
> > > > > ```
> > > > >
> > > > > SampledStats:
> > > > > - Avg:
> > > > >
> > > >
> > >
> >
> https://github.com/apache/kafka/blob/068ab9cefae301f3187ea885d645c425955e77d2/clients/src/main/java/org/apache/kafka/common/metrics/stats/Avg.java
> > > > > - Max:
> > > > >
> > > >
> > >
> >
> https://github.com/apache/kafka/blob/068ab9cefae301f3187ea885d645c425955e77d2/clients/src/main/java/org/apache/kafka/common/metrics/stats/Max.java
> > > > > - Min:
> > > > >
> > > >
> > >
> >
> https://github.com/apache/kafka/blob/068ab9cefae301f3187ea885d645c425955e77d2/clients/src/main/java/org/apache/kafka/common/metrics/stats/Min.java
> > > > >
> > > > > `stat#record()` is implemented by the `update` method in SampledStat:
> > > > >
> > > > > ```Avg.java
> > > > >     @Override
> > > > >     protected void update(Sample sample, MetricConfig config, double value, long now) {
> > > > >         sample.value += value;
> > > > >     }
> > > > > ```
> > > > >
> > > > > ```Max.java
> > > > >     @Override
> > > > >     protected void update(Sample sample, MetricConfig config, double value, long now) {
> > > > >         sample.value = Math.max(sample.value, value);
> > > > >     }
> > > > > ```
> > > > >
> > > > > As far as I understand, most of the work of the stats happens on
> the
> > > > > `combine` method that is not part of the connector execution but
> > called
> > > > > when metrics are queried.
> > > > >
> > > > > I wonder whether we should consider Avg and Max for all metrics
> > > proposed
> > > > as
> > > > > the impact on the execution path seems minimal, and even see if Min
> > is
> > > > also
> > > > > valuable, and use DEBUG only for more granular metrics.
> > > > >
> > > > > [1] https://github.com/apache/kafka/tree/trunk/jmh-benchmarks
> > > > >
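[To make the cost argument concrete: the hot path of both stats above is a single arithmetic update per record, with the heavier aggregation deferred to the read side. A stripped-down, stdlib-only sketch of that split; this is an illustration, not the actual Kafka `SampledStat`, which additionally manages time-windowed samples.]

```java
public class MiniStats {

    // Per-record path: O(1) updates, mirroring the SampledStat.update() shape.
    static final class MaxStat {
        private double value = Double.NEGATIVE_INFINITY;
        void record(double v) { value = Math.max(value, v); }
        double measure() { return value; }   // the "combine" step, done on read
    }

    static final class AvgStat {
        private double sum = 0;
        private long count = 0;
        void record(double v) { sum += v; count++; }
        double measure() { return count == 0 ? Double.NaN : sum / count; }
    }

    public static void main(String[] args) {
        MaxStat max = new MaxStat();
        AvgStat avg = new AvgStat();
        for (double latencyMs : new double[] {5, 12, 7}) {
            max.record(latencyMs);
            avg.record(latencyMs);
        }
        System.out.println(max.measure()); // 12.0
        System.out.println(avg.measure()); // 8.0
    }
}
```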
> > > > >
> > > > > > 4. Just to make sure I understand correctly--does "time when it
> has
> > > > been
> > > > > > received by the Sink task" refer to the wallclock time directly
> > > after a
> > > > > > call to SinkTask::put has been completed (as opposed to directly
> > > before
> > > > > > that call is made, or something else entirely)?
> > > > > >
> > > > >
> > > > > It currently means when it has been received by the Sink task
> > > > > right after consumer poll and before conversions.
> > > > > Would it be valuable to have it after put-sink-records?
> > > > >
> > > > >
> > > > > > 5. If the goal is to identify performance bottlenecks (either in
> > > > production
> > > > > > or pre-production environments), would it make sense to introduce
> > > > metrics
> > > > > > for each individual converter (i.e., key/value/header) and
> > > > transformation?
> > > > > > It's definitely an improvement to be able to identify the total
> > time
> > > > for
> > > > > > conversion and transformation, but then the immediate follow-up
> > > > question if
> > > > > > a bottleneck is found in that phase is "which
> > > converter/transformation
> > > > is
> > > > > > responsible?" It'd be nice if we could provide a way to quickly
> > > answer
> > > > that
> > > > > > question.
> > > > > >
> > > > >
> > > > > This is a great idea. I'd like to consider this as well, though maybe
> > > > > these more granular metrics would be better kept at DEBUG.
> > > > >
> > > > >
> > > > > > 6. Any thoughts about offering latency metrics for source tasks
> > > between
> > > > > > receipt of the record from the task and delivery of the record to
> > > Kafka
> > > > > > (which would be tracked by producer callback)? We could also use
> > the
> > > > record
> > > > > > timestamp either instead of or in addition to receipt time if the
> > > task
> > > > > > provides a timestamp with its records.
> > > > > >
> > > > >
> > > > > With source transform and convert metrics we get part of that
> > latency.
> > > > > Looking at the Producer metrics, `request-latency` (though a very
> > > generic
> > > > > metric)
> > > > > sort of captures the time between send request and ack — if my
> > > > understanding
> > > > > is correct.
> > > > > Would these be enough or you're thinking about another approach?
> > > > > maybe a custom metric to cover the producer side?
> > > > >
> > > > >
> > > > > > 7. We may end up introducing a way for sink tasks to record
> > > per-record
> > > > > > delivery to the sink system (see KIP-767 [1]). I'd like it if we
> > > could
> > > > keep
> > > > > > the names of our metrics very precise in order to avoid confusing
> > > users
> > > > > > (who may think that we're providing metrics on actual delivery to
> > the
> > > > sink
> > > > > > system, which may not be the case if the connector performs
> > > > asynchronous
> > > > > > writes), and in order to leave room for a metrics on true
> delivery
> > > > time by
> > > > > > sink tasks. It'd also be nice if we could remain consistent with
> > > > existing
> > > > > > metrics such as "put-batch-avg-time-ms". With that in mind, what
> do
> > > you
> > > > > > think about renaming these metrics:
> > > > > > - "sink-record-batch-latency-max-ms" to
> "put-batch-avg-latency-ms"
> > > > > > - "sink-record-latency-max-ms" to
> "put-sink-record-latency-max-ms"
> > > > > > - "sink-record-latency-avg-ms" to
> "put-sink-record-latency-avg-ms"
> > > > > > - "sink-record-convert-transform-time-max-ms" to
> > > > > > "convert-transform-sink-record-time-max-ms"
> > > > > > - "sink-record-convert-transform-time-avg-ms" to
> > > > > > "convert-transform-sink-record-time-avg-ms"
> > > > > > - "source-record-transform-convert-time-max-ms" to
> > > > > > "transform-convert-source-record-time-max-ms"
> > > > > > - "source-record-transform-convert-time-avg-ms" to
> > > > > > "transform-convert-source-record-time-avg-ms"
> > > > > >
> > > > >
> > > > > Makes sense, thanks! I have updated the list of metrics, grouped them
> > > > > by sensor, and applied these suggestions.
> > > > > sensor and applying these suggestions.
> > > > > The only ones that I want to review are: sink-record-* to put-batch-*
> > > > > (first 3). Not sure if put-batch/put-sink-record describes the purpose
> > > > > of the metric — nor does `sink-record-latency`, to be honest.
> > > > > My initial thought was to have something like Kafka Streams
> > > e2e-latency.
> > > > > Based on 4. and 6. questions, an idea could be to add:
> > > > > - source-batch-e2e-latency-before-send: measure wallclock - source
> > > record
> > > > > timestamp after source connector poll.
> > > > > - source-batch-e2e-latency-after-send: measure wallclock - record
> > > > timestamp
> > > > > on producer send callback
> > > > > - sink-batch-e2e-latency-before-put: measure time wallclock -
> record
> > > > > timestamp after consumer poll
> > > > > - sink-batch-e2e-latency-after-put: measure time wallclock - record
> > > > > timestamp after sink connector put.
> > > > >
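[The "after-send" variant for source tasks would naturally hang off the producer completion callback. A rough sketch of the idea; the `recordLatency` hook stands in for a sensor and the clock is injected for testability, none of these names are from the Connect codebase.]

```java
import java.util.function.LongConsumer;
import java.util.function.LongSupplier;

public class AfterSendLatency {

    /**
     * Returns a completion callback that, when the (simulated) producer acks,
     * records wall clock minus the source record timestamp.
     */
    static Runnable onAck(long recordTimestampMs, LongConsumer recordLatency,
                          LongSupplier clock) {
        return () -> recordLatency.accept(clock.getAsLong() - recordTimestampMs);
    }

    public static void main(String[] args) {
        long[] captured = new long[1];
        // Record created at t=100, ack arrives at t=130: latency is 30 ms.
        Runnable callback = onAck(100L, v -> captured[0] = v, () -> 130L);
        callback.run(); // in real code this would fire from the producer ack
        System.out.println(captured[0]); // 30
    }
}
```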
> > > > >
> > > > > > Thanks again for the KIP! Looking forward to your thoughts.
> > > > > >
> > > > > > Cheers,
> > > > > >
> > > > > > Chris
> > > > > >
> > > > > > [1] -
> > > > > >
> > > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-767%3A+Connect+Latency+Metrics
> > > > > >
> > > > > > On Thu, Sep 15, 2022 at 1:32 PM Jorge Esteban Quilcate Otoya <
> > > > > > quilcate.jorge@gmail.com> wrote:
> > > > > >
> > > > > > > Hi everyone,
> > > > > > >
> > > > > > > I've made a slight addition to the KIP based on Yash feedback:
> > > > > > >
> > > > > > > - A new metric is added at INFO level to record the max latency
> > > from
> > > > the
> > > > > > > batch timestamp, by keeping the oldest record timestamp per
> > batch.
> > > > > > > - A draft implementation is linked.
> > > > > > >
> > > > > > > Looking forward to your feedback.
> > > > > > > Also, a kindly reminder that the vote thread is open.
> > > > > > >
> > > > > > > Thanks!
> > > > > > > Jorge.
> > > > > > >
> > > > > > > On Thu, 8 Sept 2022 at 14:25, Jorge Esteban Quilcate Otoya <
> > > > > > > quilcate.jorge@gmail.com> wrote:
> > > > > > >
> > > > > > > > Great. I have updated the KIP to reflect this.
> > > > > > > >
> > > > > > > > Cheers,
> > > > > > > > Jorge.
> > > > > > > >
> > > > > > > > On Thu, 8 Sept 2022 at 12:26, Yash Mayya <
> yash.mayya@gmail.com
> > >
> > > > wrote:
> > > > > > > >
> > > > > > > >> Thanks, I think it makes sense to define these metrics at a
> > > DEBUG
> > > > > > > >> recording
> > > > > > > >> level.
> > > > > > > >>
> > > > > > > >> On Thu, Sep 8, 2022 at 2:51 PM Jorge Esteban Quilcate Otoya
> <
> > > > > > > >> quilcate.jorge@gmail.com> wrote:
> > > > > > > >>
> > > > > > > >> > On Thu, 8 Sept 2022 at 05:55, Yash Mayya <
> > > yash.mayya@gmail.com>
> > > > > > > wrote:
> > > > > > > >> >
> > > > > > > >> > > Hi Jorge,
> > > > > > > >> > >
> > > > > > > >> > > Thanks for the changes. With regard to having per batch
> vs
> > > per
> > > > > > > record
> > > > > > > >> > > metrics, the additional overhead I was referring to
> wasn't
> > > > about
> > > > > > > >> whether
> > > > > > > >> > or
> > > > > > > >> > > not we would need to iterate over all the records in a
> > > batch.
> > > > I
> > > > > > was
> > > > > > > >> > > referring to the potential additional overhead caused by
> > the
> > > > > > higher
> > > > > > > >> > volume
> > > > > > > >> > > of calls to Sensor::record on the sensors for the new
> > > metrics
> > > > (as
> > > > > > > >> > compared
> > > > > > > >> > > to the existing batch only metrics), especially for high
> > > > > > throughput
> > > > > > > >> > > connectors where batch sizes could be large. I guess we
> > may
> > > > want
> > > > > > to
> > > > > > > do
> > > > > > > >> > some
> > > > > > > >> > > sort of performance testing and get concrete numbers to
> > > verify
> > > > > > > whether
> > > > > > > >> > this
> > > > > > > >> > > is a valid concern or not?
> > > > > > > >> > >
> > > > > > > >> >
> > > > > > > >> > 6.1. Got it, thanks for clarifying. I guess there could be a
> > > > > > > >> > benchmark test of `Sensor::record` to get an idea of the
> > > > > > > >> > performance impact. Regardless, the fact that these are
> > > > > > > >> > single-record metrics, compared to the existing batch-only
> > > > > > > >> > ones, could be made explicit by setting these metrics at a
> > > > > > > >> > DEBUG or TRACE metric recording level, leaving the existing
> > > > > > > >> > ones at INFO level.
> > > > > > > >> > wdyt?
> > > > > > > >> >
> > > > > > > >> >
> > > > > > > >> > >
> > > > > > > >> > > Thanks,
> > > > > > > >> > > Yash
> > > > > > > >> > >
> > > > > > > >> > > On Tue, Sep 6, 2022 at 4:42 PM Jorge Esteban Quilcate
> > Otoya
> > > <
> > > > > > > >> > > quilcate.jorge@gmail.com> wrote:
> > > > > > > >> > >
> > > > > > > >> > > > Hi Sagar and Yash,
> > > > > > > >> > > >
> > > > > > > >> > > > > the way it's defined in
> > > > > > > >> > > >
> > > https://kafka.apache.org/documentation/#connect_monitoring
> > > > for
> > > > > > > the
> > > > > > > >> > > metrics
> > > > > > > >> > > >
> > > > > > > >> > > > 4.1. Got it. Added it to the KIP.
> > > > > > > >> > > >
> > > > > > > >> > > > > The only thing I would argue is do we need
> > > > > > > >> sink-record-latency-min?
> > > > > > > >> > > Maybe
> > > > > > > >> > > > we
> > > > > > > >> > > > > could remove this min metric as well and make all of
> > the
> > > > 3 e2e
> > > > > > > >> > metrics
> > > > > > > >> > > > > consistent
> > > > > > > >> > > >
> > > > > > > >> > > > 4.2 I see. Will remove it from the KIP.
> > > > > > > >> > > >
> > > > > > > >> > > > > Probably users can track the metrics at their end to
> > > > > > > >> > > > > figure that out. Do you think that makes sense?
> > > > > > > >> > > >
> > > > > > > >> > > > 4.3. Yes, agree. With these new metrics it should be
> > > easier
> > > > for
> > > > > > > >> users
> > > > > > > >> > to
> > > > > > > >> > > > track this.
> > > > > > > >> > > >
> > > > > > > >> > > > > I think it makes sense to not have a min metric for
> > > > either to
> > > > > > > >> remain
> > > > > > > >> > > > > consistent with the existing put-batch and
> poll-batch
> > > > metrics
> > > > > > > >> > > >
> > > > > > > >> > > > 5.1. Got it. Same as 4.2
> > > > > > > >> > > >
> > > > > > > >> > > > > Another naming related suggestion I had was with the
> > > > > > > >> > > > > "convert-time" metrics - we should probably include
> > > > > > > >> transformations
> > > > > > > >> > in
> > > > > > > >> > > > the
> > > > > > > >> > > > > name since SMTs could definitely be attributable to
> a
> > > > sizable
> > > > > > > >> chunk
> > > > > > > >> > of
> > > > > > > >> > > > the
> > > > > > > >> > > > > latency depending on the specific transformation
> > chain.
> > > > > > > >> > > >
> > > > > > > >> > > > 5.2. Makes sense. I'm proposing to add
> > > > > > > >> > `sink-record-convert-transform...`
> > > > > > > >> > > > and `source-record-transform-convert...` to represent
> > > > correctly
> > > > > > > the
> > > > > > > >> > order
> > > > > > > >> > > > of operations.
> > > > > > > >> > > >
> > > > > > > >> > > > > it seems like both source and sink tasks only record
> > > > metrics
> > > > > > at
> > > > > > > a
> > > > > > > >> > > "batch"
> > > > > > > >> > > > > level, not on an individual record level. I think it
> > > > might be
> > > > > > > >> > > additional
> > > > > > > >> > > > > overhead if we want to record these new metrics all
> at
> > > the
> > > > > > > record
> > > > > > > >> > > level?
> > > > > > > >> > > >
> > > > > > > >> > > > 5.3. I considered at the beginning implementing all
> > > > > > > >> > > > metrics at the batch level, but given how the framework
> > > > > > > >> > > > processes records, I fell back to the proposed approach:
> > > > > > > >> > > > - Sink Task:
> > > > > > > >> > > >   - `WorkerSinkTask#convertMessages(msgs)` already
> > > iterates
> > > > over
> > > > > > > >> > records,
> > > > > > > >> > > > so there is no additional overhead to capture record
> > > > latency per
> > > > > > > >> > record.
> > > > > > > >> > > >     -
> > > > > > > >> > > >
> > > > > > > >> > > >
> > > > > > > >> > >
> > > > > > > >> >
> > > > > > > >>
> > > > > > >
> > > > > >
> > > >
> > >
> >
> https://github.com/apache/kafka/blob/9841647c4fe422532f448423c92d26e4fdcb8932/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/WorkerSinkTask.java#L490-L514
> > > > > > > >> > > >   - `WorkerSinkTask#convertAndTransformRecord(record)`
> > > > actually
> > > > > > > >> happens
> > > > > > > >> > > > individually. Measuring this operation per batch would
> > > > include
> > > > > > > >> > processing
> > > > > > > >> > > > that is not strictly part of "convert and transform"
> > > > > > > >> > > >     -
> > > > > > > >> > > >
> > > > > > > >> > > >
> > > > > > > >> > >
> > > > > > > >> >
> > > > > > > >>
> > > > > > >
> > > > > >
> > > >
> > >
> >
> https://github.com/apache/kafka/blob/9841647c4fe422532f448423c92d26e4fdcb8932/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/WorkerSinkTask.java#L518
> > > > > > > >> > > > - Source Task:
> > > > > > > >> > > >   - `AbstractWorkerSourceTask#sendRecords` iterates
> > over a
> > > > batch
> > > > > > > and
> > > > > > > >> > > > applies transforms and convert record individually as
> > > well:
> > > > > > > >> > > >     -
> > > > > > > >> > > >
> > > > > > > >> > > >
> > > > > > > >> > >
> > > > > > > >> >
> > > > > > > >>
> > > > > > >
> > > > > >
> > > >
> > >
> >
> https://github.com/apache/kafka/blob/9841647c4fe422532f448423c92d26e4fdcb8932/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/AbstractWorkerSourceTask.java#L389-L390
> > > > > > > >> > > >
> > > > > > > >> > > > > This might require some additional changes -
> > > > > > > >> > > > > for instance, with the "sink-record-latency" metric,
> > we
> > > > might
> > > > > > > only
> > > > > > > >> > want
> > > > > > > >> > > > to
> > > > > > > >> > > > > have a "max" metric since "avg" would require
> > recording
> > > a
> > > > > > value
> > > > > > > on
> > > > > > > >> > the
> > > > > > > >> > > > > sensor for each record (whereas we can get a "max"
> by
> > > only
> > > > > > > >> recording
> > > > > > > >> > a
> > > > > > > >> > > > > metric value for the oldest record in each batch).
> > > > > > > >> > > >
> > > > > > > >> > > > 5.4. Recording record-latency per batch may not be as
> > > > useful as
> > > > > > > >> there
> > > > > > > >> > is
> > > > > > > >> > > no
> > > > > > > >> > > > guarantee that the oldest record will be
> representative
> > of
> > > > the
> > > > > > > >> batch.
> > > > > > > >> > > >
> > > > > > > >> > > > On Sat, 3 Sept 2022 at 16:02, Yash Mayya <
> > > > yash.mayya@gmail.com>
> > > > > > > >> wrote:
> > > > > > > >> > > >
> > > > > > > >> > > > > Hi Jorge and Sagar,
> > > > > > > >> > > > >
> > > > > > > >> > > > > I think it makes sense to not have a min metric for
> > > > either to
> > > > > > > >> remain
> > > > > > > >> > > > > consistent with the existing put-batch and
> poll-batch
> > > > metrics
> > > > > > > (it
> > > > > > > >> > > doesn't
> > > > > > > >> > > > > seem particularly useful either anyway). Also, the
> new
> > > > > > > >> > > > > "sink-record-latency" metric name looks fine to me,
> > > > thanks for
> > > > > > > >> making
> > > > > > > >> > > the
> > > > > > > >> > > > > changes! Another naming related suggestion I had was
> > > with
> > > > the
> > > > > > > >> > > > > "convert-time" metrics - we should probably include
> > > > > > > >> transformations
> > > > > > > >> > in
> > > > > > > >> > > > the
> > > > > > > >> > > > > name since SMTs could definitely be attributable to
> a
> > > > sizable
> > > > > > > >> chunk
> > > > > > > >> > of
> > > > > > > >> > > > the
> > > > > > > >> > > > > latency depending on the specific transformation
> > chain.
> > > > > > > >> > > > >
> > > > > > > >> > > > > I have one high level question with respect to
> > > > implementation
> > > > > > -
> > > > > > > >> > > > currently,
> > > > > > > >> > > > > it seems like both source and sink tasks only record
> > > > metrics
> > > > > > at
> > > > > > > a
> > > > > > > >> > > "batch"
> > > > > > > >> > > > > level, not on an individual record level. I think it
> > > > might be
> > > > > > > >> > > additional
> > > > > > > >> > > > > overhead if we want to record these new metrics all
> at
> > > the
> > > > > > > record
> > > > > > > >> > > level?
> > > > > > > >> > > > > Could we instead make all of these new metrics for
> > > > batches of
> > > > > > > >> records
> > > > > > > >> > > > > rather than individual records in order to remain
> > > > consistent
> > > > > > > with
> > > > > > > >> the
> > > > > > > >> > > > > existing task level metrics? This might require some
> > > > > > additional
> > > > > > > >> > > changes -
> > > > > > > >> > > > > for instance, with the "sink-record-latency" metric,
> > we
> > > > might
> > > > > > > only
> > > > > > > >> > want
> > > > > > > >> > > > to
> > > > > > > >> > > > > have a "max" metric since "avg" would require
> > recording
> > > a
> > > > > > value
> > > > > > > on
> > > > > > > >> > the
> > > > > > > >> > > > > sensor for each record (whereas we can get a "max"
> by
> > > only
> > > > > > > >> recording
> > > > > > > >> > a
> > > > > > > >> > > > > metric value for the oldest record in each batch).
> > > > > > > >> > > > >
> > > > > > > >> > > > > Thanks,
> > > > > > > >> > > > > Yash
> > > > > > > >> > > > >
> > > > > > > >> > > > > On Fri, Sep 2, 2022 at 3:16 PM Sagar <
> > > > > > sagarmeansocean@gmail.com
> > > > > > > >
> > > > > > > >> > > wrote:
> > > > > > > >> > > > >
> > > > > > > >> > > > > > Hi Jorge,
> > > > > > > >> > > > > >
> > > > > > > >> > > > > > Thanks for the changes.
> > > > > > > >> > > > > >
> > > > > > > >> > > > > > Regarding the metrics, I meant something like
> this:
> > > > > > > >> > > > > >
> > > > > > > >> > > > >
> > > > > > > >> > > >
> > > > > > > >> > >
> > > > > > > >> >
> > > > > > > >>
> > > > > > >
> > > > > >
> > > >
> > >
> >
> kafka.connect:type=sink-task-metrics,connector="{connector}",task="{task}"
> > > > > > > >> > > > > >
> > > > > > > >> > > > > > the way it's defined in
> > > > > > > >> > > > > >
> > > > https://kafka.apache.org/documentation/#connect_monitoring
> > > > > > > for
> > > > > > > >> the
> > > > > > > >> > > > > > metrics.
> > > > > > > >> > > > > >
> > > > > > > >> > > > > > I see what you mean by the 3 metrics and how it
> can
> > be
> > > > > > > >> interpreted.
> > > > > > > >> > > The
> > > > > > > >> > > > > > only thing I would argue is do we need
> > > > > > > sink-record-latency-min?
> > > > > > > >> > Maybe
> > > > > > > >> > > > we
> > > > > > > >> > > > > > could remove this min metric as well and make all
> of
> > > > the 3
> > > > > > e2e
> > > > > > > >> > > metrics
> > > > > > > >> > > > > > consistent(since put-batch also doesn't expose a
> min
> > > > which
> > > > > > > makes
> > > > > > > >> > > sense
> > > > > > > >> > > > to
> > > > > > > >> > > > > > me). I think this is in contrast to what Yash
> > pointed
> > > > out
> > > > > > > above
> > > > > > > >> so
> > > > > > > >> > I
> > > > > > > >> > > > > would
> > > > > > > >> > > > > > like to hear his thoughts as well.
> > > > > > > >> > > > > >
> > > > > > > >> > > > > > The other point Yash mentioned about the slightly
> > > flawed
> > > > > > > >> definition
> > > > > > > >> > > of
> > > > > > > >> > > > > e2e
> > > > > > > >> > > > > > is also true in a sense. But I have a feeling that once
> > > > > > > >> > > > > > the records are polled by the connector tasks, it would
> > > > > > > >> > > > > > be difficult to track the final leg via the framework.
> > > > > > > >> > > > > > Probably users can track the metrics at their end to
> > > > > > > >> > > > > > figure that out. Do you think that makes sense?
> > > > > > > >> > > > > >
> > > > > > > >> > > > > > Thanks!
> > > > > > > >> > > > > > Sagar.
> > > > > > > >> > > > > >
> > > > > > > >> > > > > >
> > > > > > > >> > > > > >
> > > > > > > >> > > > > >
> > > > > > > >> > > > > > On Thu, Sep 1, 2022 at 11:40 PM Jorge Esteban
> > Quilcate
> > > > > > Otoya <
> > > > > > > >> > > > > > quilcate.jorge@gmail.com> wrote:
> > > > > > > >> > > > > >
> > > > > > > >> > > > > > > Hi Sagar and Yash,
> > > > > > > >> > > > > > >
> > > > > > > >> > > > > > > Thanks for your feedback!
> > > > > > > >> > > > > > >
> > > > > > > >> > > > > > > > 1) I am assuming the new metrics would be task
> > > level
> > > > > > > metric.
> > > > > > > >> > > > > > >
> > > > > > > >> > > > > > > 1.1 Yes, it will be a task level metric,
> > implemented
> > > > on
> > > > > > the
> > > > > > > >> > > > > > > Worker[Source/Sink]Task.
> > > > > > > >> > > > > > >
> > > > > > > >> > > > > > > > Could you specify the way it's done for other
> > > > > > sink/source
> > > > > > > >> > > > connector?
> > > > > > > >> > > > > > >
> > > > > > > >> > > > > > > 1.2. Not sure what do you mean by this. Could
> you
> > > > > > elaborate
> > > > > > > a
> > > > > > > >> bit
> > > > > > > >> > > > more?
> > > > > > > >> > > > > > >
> > > > > > > >> > > > > > > > 2. I am slightly confused about the e2e
> latency
> > > > > > metric...
> > > > > > > >> > > > > > >
> > > > > > > >> > > > > > > 2.1. Yes, I see. I was trying to bring a similar
> > > > concept
> > > > > > as
> > > > > > > in
> > > > > > > >> > > > Streams
> > > > > > > >> > > > > > with
> > > > > > > >> > > > > > > KIP-613, though the e2e concept may not be
> > > > translatable.
> > > > > > > >> > > > > > > We could keep it as `sink-record-latency` to
> avoid
> > > > > > > conflating
> > > > > > > >> > > > > concepts. A
> > > > > > > >> > > > > > > similar metric naming was proposed in KIP-489
> but
> > at
> > > > the
> > > > > > > >> consumer
> > > > > > > >> > > > > level —
> > > > > > > >> > > > > > > though it seems dormant for a couple of years.
> > > > > > > >> > > > > > >
> > > > > > > >> > > > > > > > However, the put-batch time measures the
> > > > > > > >> > > > > > > > time to put a batch of records to external
> sink.
> > > > So, I
> > > > > > > would
> > > > > > > >> > > assume
> > > > > > > >> > > > > > the 2
> > > > > > > >> > > > > > > > can't be added as is to compute the e2e
> latency.
> > > > Maybe I
> > > > > > > am
> > > > > > > >> > > missing
> > > > > > > >> > > > > > > > something here. Could you plz clarify this.
> > > > > > > >> > > > > > >
> > > > > > > >> > > > > > > 2.2. Yes, agree. Not necessarily added, but with the
> > > > > > > >> > > > > > > 3 latencies (poll, convert, putBatch) it will be
> > > > > > > >> > > > > > > clearer where the bottleneck may be, and they
> > > > > > > >> > > > > > > represent the internal processing.
> > > > > > > >> > > > > > >
> > > > > > > >> > > > > > > > however, as per the KIP it looks like it will
> be
> > > > > > > >> > > > > > > > the latency between when the record was
> written
> > to
> > > > Kafka
> > > > > > > and
> > > > > > > >> > when
> > > > > > > >> > > > the
> > > > > > > >> > > > > > > > record is returned by a sink task's consumer's
> > > poll?
> > > > > > > >> > > > > > >
> > > > > > > >> > > > > > > 3.1. Agree. 2.1. could help to clarify this.
> > > > > > > >> > > > > > >
> > > > > > > >> > > > > > > > One more thing - I was wondering
> > > > > > > >> > > > > > > > if there's a particular reason for having a
> min
> > > > metric
> > > > > > for
> > > > > > > >> e2e
> > > > > > > >> > > > > latency
> > > > > > > >> > > > > > > but
> > > > > > > >> > > > > > > > not for convert time?
> > > > > > > >> > > > > > >
> > > > > > > >> > > > > > > 3.2. Was following KIP-613 for e2e, which seems
> > > > > > > >> > > > > > > useful to compare with Max and get an idea of the
> > > > > > > >> > > > > > > window of results, though current latencies in
> > > > > > > >> > > > > > > Connect do not include Min, and that's why I haven't
> > > > > > > >> > > > > > > added it for convert latency. Do you think it makes
> > > > > > > >> > > > > > > sense to extend latency metrics with Min?
> > > > > > > >> > > > > > >
> > > > > > > >> > > > > > > KIP is updated to clarify some of these changes.
> > > > > > > >> > > > > > >
> > > > > > > >> > > > > > > Many thanks,
> > > > > > > >> > > > > > > Jorge.
> > > > > > > >> > > > > > >
> > > > > > > >> > > > > > > On Thu, 1 Sept 2022 at 18:11, Yash Mayya <
> > > > > > > >> yash.mayya@gmail.com>
> > > > > > > >> > > > wrote:
> > > > > > > >> > > > > > >
> > > > > > > >> > > > > > > > Hi Jorge,
> > > > > > > >> > > > > > > >
> > > > > > > >> > > > > > > > Thanks for the KIP! I have the same confusion
> > with
> > > > the
> > > > > > > >> > > e2e-latency
> > > > > > > >> > > > > > > metrics
> > > > > > > >> > > > > > > > as Sagar above. "e2e" would seem to indicate
> the
> > > > latency
> > > > > > > >> > between
> > > > > > > >> > > > when
> > > > > > > >> > > > > > the
> > > > > > > >> > > > > > > > record was written to Kafka and when the
> record
> > > was
> > > > > > > written
> > > > > > > >> to
> > > > > > > >> > > the
> > > > > > > >> > > > > sink
> > > > > > > >> > > > > > > > system by the connector - however, as per the
> > KIP
> > > it
> > > > > > looks
> > > > > > > >> like
> > > > > > > >> > > it
> > > > > > > >> > > > > will
> > > > > > > >> > > > > > > be
> > > > > > > >> > > > > > > > the latency between when the record was
> written
> > to
> > > > Kafka
> > > > > > > and
> > > > > > > >> > when
> > > > > > > >> > > > the
> > > > > > > >> > > > > > > > record is returned by a sink task's consumer's
> > > > poll? I
> > > > > > > think
> > > > > > > >> > that
> > > > > > > >> > > > > > metric
> > > > > > > >> > > > > > > > will be a little confusing to interpret. One
> > more
> > > > thing
> > > > > > -
> > > > > > > I
> > > > > > > >> was
> > > > > > > >> > > > > > wondering
> > > > > > > >> > > > > > > > if there's a particular reason for having a
> min
> > > > metric
> > > > > > for
> > > > > > > >> e2e
> > > > > > > >> > > > > latency
> > > > > > > >> > > > > > > but
> > > > > > > >> > > > > > > > not for convert time?
> > > > > > > >> > > > > > > >
> > > > > > > >> > > > > > > > Thanks,
> > > > > > > >> > > > > > > > Yash
> > > > > > > >> > > > > > > >
> > > > > > > >> > > > > > > > On Thu, Sep 1, 2022 at 8:59 PM Sagar <
> > > > > > > >> > sagarmeansocean@gmail.com>
> > > > > > > >> > > > > > wrote:
> > > > > > > >> > > > > > > >
> > > > > > > >> > > > > > > > > Hi Jorge,
> > > > > > > >> > > > > > > > >
> > > > > > > >> > > > > > > > > Thanks for the KIP. It looks like a very
> good
> > > > > > addition.
> > > > > > > I
> > > > > > > >> > > skimmed
> > > > > > > >> > > > > > > through
> > > > > > > >> > > > > > > > > once and had a couple of questions =>
> > > > > > > >> > > > > > > > >
> > > > > > > >> > > > > > > > > 1) I am assuming the new metrics would be
> task
> > > > level
> > > > > > > >> metric.
> > > > > > > >> > > > Could
> > > > > > > >> > > > > > you
> > > > > > > >> > > > > > > > > specify the way it's done for other
> > sink/source
> > > > > > > connector?
> > > > > > > >> > > > > > > > > 2) I am slightly confused about the e2e
> > latency
> > > > > > metric.
> > > > > > > >> Let's
> > > > > > > >> > > > > > consider
> > > > > > > >> > > > > > > > the
> > > > > > > >> > > > > > > > > sink connector metric. If I look at the way
> > it's
> > > > > > > supposed
> > > > > > > >> to
> > > > > > > >> > be
> > > > > > > >> > > > > > > > calculated,
> > > > > > > >> > > > > > > > > i.e the difference between the record
> > timestamp
> > > > and
> > > > > > the
> > > > > > > >> wall
> > > > > > > >> > > > clock
> > > > > > > >> > > > > > > time,
> > > > > > > >> > > > > > > > it
> > > > > > > >> > > > > > > > > looks like a per record metric. However, the
> > > > put-batch
> > > > > > > >> time
> > > > > > > >> > > > > measures
> > > > > > > >> > > > > > > the
> > > > > > > >> > > > > > > > > time to put a batch of records to external
> > sink.
> > > > So, I
> > > > > > > >> would
> > > > > > > >> > > > assume
> > > > > > > >> > > > > > > the 2
> > > > > > > >> > > > > > > > > can't be added as is to compute the e2e
> > latency.
> > > > > > Maybe I
> > > > > > > >> am
> > > > > > > >> > > > missing
> > > > > > > >> > > > > > > > > something here. Could you plz clarify this.
> > > > > > > >> > > > > > > > >
> > > > > > > >> > > > > > > > > Thanks!
> > > > > > > >> > > > > > > > > Sagar.
> > > > > > > >> > > > > > > > >
> > > > > > > >> > > > > > > > > On Tue, Aug 30, 2022 at 8:43 PM Jorge
> Esteban
> > > > Quilcate
> > > > > > > >> Otoya
> > > > > > > >> > <
> > > > > > > >> > > > > > > > > quilcate.jorge@gmail.com> wrote:
> > > > > > > >> > > > > > > > >
> > > > > > > >> > > > > > > > > > Hi all,
> > > > > > > >> > > > > > > > > >
> > > > > > > >> > > > > > > > > > I'd like to start a discussion thread on
> > > > KIP-864:
> > > > > > Add
> > > > > > > >> > > > End-To-End
> > > > > > > >> > > > > > > > Latency
> > > > > > > >> > > > > > > > > > Metrics to Connectors.
> > > > > > > >> > > > > > > > > > This KIP aims to improve the metrics
> > available
> > > > on
> > > > > > > Source
> > > > > > > >> > and
> > > > > > > >> > > > Sink
> > > > > > > >> > > > > > > > > > Connectors to measure end-to-end latency,
> > > > including
> > > > > > > >> source
> > > > > > > >> > > and
> > > > > > > >> > > > > sink
> > > > > > > >> > > > > > > > > record
> > > > > > > >> > > > > > > > > > conversion time, and sink record e2e
> latency
> > > > > > (similar
> > > > > > > to
> > > > > > > >> > > > KIP-613
> > > > > > > >> > > > > > for
> > > > > > > >> > > > > > > > > > Streams).
> > > > > > > >> > > > > > > > > >
> > > > > > > >> > > > > > > > > > The KIP is here:
> > > > > > > >> > > > > > > > > >
> > > > > > > >> > > > > > > > > >
> > > > > > > >> > > > > > > > >
> > > > > > > >> > > > > > > >
> > > > > > > >> > > > > > >
> > > > > > > >> > > > > >
> > > > > > > >> > > > >
> > > > > > > >> > > >
> > > > > > > >> > >
> > > > > > > >> >
> > > > > > > >>
> > > > > > >
> > > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-864%3A+Add+End-To-End+Latency+Metrics+to+Connectors
> > > > > > > >> > > > > > > > > >
> > > > > > > >> > > > > > > > > > Please take a look and let me know what
> you
> > > > think.
> > > > > > > >> > > > > > > > > >
> > > > > > > >> > > > > > > > > > Cheers,
> > > > > > > >> > > > > > > > > > Jorge.
> > > > > > > >> > > > > > > > > >
> > > > > > > >> > > > > > > > >
> > > > > > > >> > > > > > > >
> > > > > > > >> > > > > > >
> > > > > > > >> > > > > >
> > > > > > > >> > > > >
> > > > > > > >> > > >
> > > > > > > >> > >
> > > > > > > >> >
> > > > > > > >>
> > > > > > > >
> > > > > > >
> > > > > >
> > > >
> > >
> >
>
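
The per-record calculation discussed in this thread (wallclock time minus the record timestamp, captured at different stages) can be sketched in plain Java as follows. The names and sample values are illustrative only; the real measurements would live in WorkerSinkTask / AbstractWorkerSourceTask:

```java
// Sketch of the proposed latency computation: wallclock minus record timestamp.
public class E2eLatency {
    // recordTimestampMs: the Kafka record timestamp (CreateTime/LogAppendTime).
    static long latencyMs(long recordTimestampMs, long wallClockMs) {
        return wallClockMs - recordTimestampMs;
    }

    public static void main(String[] args) {
        long recordTs = 1_000L;  // record timestamp in epoch millis
        long afterPoll = 1_250L; // wallclock right after consumer poll ("before-put")
        long afterPut = 1_290L;  // wallclock after the connector's put() returns ("after-put")
        System.out.println("before-put=" + latencyMs(recordTs, afterPoll) + "ms"); // 250ms
        System.out.println("after-put=" + latencyMs(recordTs, afterPut) + "ms");   // 290ms
    }
}
```

The difference between the two measurement points isolates the time spent inside the connector's put, which is what separates the "before-put" and "after-put" variants above.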

Re: [DISCUSS] KIP-864: Add End-To-End Latency Metrics to Connectors

Posted by Jorge Esteban Quilcate Otoya <qu...@gmail.com>.
Thanks Chris! I have updated the KIP with "transform" instead of "alias".
Agree it's clearer.
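
For illustration, a per-transform MBean name with the "transform" tag could be assembled like this (a sketch only; the group name and tag set here are assumptions, and the real names are built by Connect's ConnectMetricsRegistry):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of a per-transform MBean name using a "transform" tag instead of "alias".
public class TransformMetricName {
    static String mbeanName(String connector, int task, String transform) {
        Map<String, String> tags = new LinkedHashMap<>(); // keep insertion order
        tags.put("connector", connector);
        tags.put("task", String.valueOf(task));
        tags.put("transform", transform); // was "alias" in earlier KIP drafts
        StringBuilder sb = new StringBuilder("kafka.connect:type=connector-task-metrics");
        tags.forEach((k, v) -> sb.append(',').append(k).append("=\"").append(v).append('"'));
        return sb.toString();
    }

    public static void main(String[] args) {
        // e.g. kafka.connect:type=connector-task-metrics,connector="file-source",task="0",transform="insertTsField"
        System.out.println(mbeanName("file-source", 0, "insertTsField"));
    }
}
```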

Cheers,
Jorge.

On Mon, 21 Nov 2022 at 21:36, Chris Egerton <ch...@aiven.io.invalid> wrote:

> Hi Jorge,
>
> Thanks for the updates, and apologies for the delay. The new diagram
> directly under the "Proposed Changes" section is absolutely gorgeous!
>
>
> Follow-ups:
>
> RE 2: Good point. We can use the same level for these metrics, it's not a
> big deal.
>
> RE 3: As long as all the per-record metrics are kept at DEBUG level, it
> should be fine to leave JMH benchmarking for a follow-up. If we want to add
> new per-record, INFO-level metrics, I would be more comfortable with
> including benchmarking as part of the testing plan for the KIP. One
> possible compromise could be to propose that these features be merged at
> DEBUG level, and then possibly upgraded to INFO level in the future pending
> benchmarks to guard against performance degradation.
>
> RE 4: I think for a true "end-to-end" metric, it'd be useful to include the
> time taken by the task to actually deliver the record. However, with the
> new metric names and descriptions provided in the KIP, I have no objections
> with what's currently proposed, and a new "end-to-end" metric can be taken
> on later in a follow-up KIP.
>
> RE 6: You're right, existing producer metrics should be enough for now. We
> can revisit this later if/when we add delivery-centric metrics for sink
> tasks as well.
>
> RE 7: The new metric names in the KIP LGTM; I don't see any need to expand
> beyond those but if you'd still like to pursue others, LMK.
>
>
> New thoughts:
>
> One small thought: instead of "alias" in alias="{transform_alias}" for the
> per-transform metrics, could we use "transform"? IMO it's clearer since we
> don't use "alias" in the names of transform-related properties, and "alias"
> may be confused with the classloading term where you can use, e.g.,
> "FileStreamSource" as the name of a connector class in a connector config
> instead of "org.apache.kafka.connect.file.FileStreamSourceConnector".
>
>
> Cheers,
>
> Chris
>
> On Fri, Nov 18, 2022 at 12:06 PM Jorge Esteban Quilcate Otoya <
> quilcate.jorge@gmail.com> wrote:
>
> > Thanks Mickael!
> >
> >
> > On Wed, 9 Nov 2022 at 15:54, Mickael Maison <mi...@gmail.com>
> > wrote:
> >
> > > Hi Jorge,
> > >
> > > Thanks for the KIP, it is a nice improvement.
> > >
> > > 1) The per transformation metrics still have a question mark next to
> > > them in the KIP. Do you want to include them? If so, we'll want to tag
> > > them; we should be able to include the aliases in TransformationChain
> > > and use them.
> > >
> >
> > Yes, I have added the changes to TransformationChain that will be
> > needed to add these metrics.
> >
> >
> > >
> > > 2) I see no references to predicates. If we don't want to measure
> > > their latency, can we say it explicitly?
> > >
> >
> > Good question, I hadn't considered these. Though as these are
> > materialized as PredicatedTransformation, they should be covered by
> > these changes.
> > Adding a note about this.
> >
> >
> > >
> > > 3) Should we have sink-record-batch-latency-avg-ms? All other metrics
> > > have both the maximum and average values.
> > >
> > >
> > Good question. I will remove it and change the record latency from
> > DEBUG->INFO, as it already covers the maximum metric.
> >
> > Hope it's clearer now; let me know if there is any additional feedback.
> > Thanks!
> >
> >
> >
> > > Thanks,
> > > Mickael
> > >
> > > On Thu, Oct 20, 2022 at 9:58 PM Jorge Esteban Quilcate Otoya
> > > <qu...@gmail.com> wrote:
> > > >
> > > > Thanks, Chris! Great feedback! Please, find my comments below:
> > > >
> > > > On Thu, 13 Oct 2022 at 18:52, Chris Egerton <chrise@aiven.io.invalid
> >
> > > wrote:
> > > >
> > > > > Hi Jorge,
> > > > >
> > > > > Thanks for the KIP. I agree with the overall direction and think
> this
> > > would
> > > > > be a nice improvement to Kafka Connect. Here are my initial
> thoughts
> > > on the
> > > > > details:
> > > > >
> > > > > 1. The motivation section outlines the gaps in Kafka Connect's task
> > > metrics
> > > > > nicely. I think it'd be useful to include more concrete details on
> > why
> > > > > these gaps need to be filled in, and in which cases additional
> > metrics
> > > > > would be helpful. One goal could be to provide enhanced monitoring
> of
> > > > > production deployments that allows for cluster administrators to
> set
> > up
> > > > > automatic alerts for latency spikes and, if triggered, quickly
> > > identify the
> > > > > root cause of those alerts, reducing the time to remediation.
> Another
> > > goal
> > > > > could be to provide more insight to developers or cluster
> > > administrators
> > > > > who want to do performance testing on connectors in non-production
> > > > > environments. It may help guide our decision making process to
> have a
> > > > > clearer picture of the goals we're trying to achieve.
> > > > >
> > > >
> > > > Agree. The Motivation section has been updated.
> > > > Thanks for the examples, I see both of them being covered by the KIP.
> > > > I see how these could give us a good distinction on whether to
> position
> > > > some metrics at INFO or DEBUG level.
> > > >
> > > >
> > > > > 2. If we're trying to address the alert-and-diagnose use case, it'd
> > be
> > > > > useful to have as much information as possible at INFO level,
> rather
> > > than
> > > > > forcing cluster administrators to possibly reconfigure a connector
> to
> > > emit
> > > > > DEBUG or TRACE level metrics in order to diagnose a potential
> > > > > production-impacting performance bottleneck. I can see the
> rationale
> > > for
> > > > > emitting per-record metrics that track an average value at DEBUG
> > > level, but
> > > > > for per-record metrics that track a maximum value, is there any
> > reason
> > > not
> > > > > to provide this information at INFO level?
> > > > >
> > > >
> > > > Agree. Though since Max and Avg metrics are part of the same sensor —
> > > > where the metric recording level is defined — both metrics get the
> > > > same level.
> > > >
> > > >
> > > > > 3. I'm also curious about the performance testing suggested by Yash
> > to
> > > > > gauge the potential impact of this change. Have you been able to do
> > any
> > > > > testing with your draft implementation yet?
> > > > >
> > > >
> > > > No, not so far.
> > > > I think it would be valuable to discuss the scope of this testing and
> > > maybe
> > > > tackle it
> > > > in a separate issue as Sensors and Metrics are used all over the
> place.
> > > > My initial understanding is that these tests should be placed in the
> > > > jmh-benchmarks[1].
> > > > Then, we could target testing Sensors and Metrics, and validate how
> > much
> > > > overhead
> > > > is added by having only Max vs Max,Avg(,Min), etc.
> > > > In the other hand, we could extend this to Transformers or other
> > Connect
> > > > layers.
> > > >
> > > > Here are some pointers to the Sensors and Metrics implementations
> that
> > > > could be considered:
> > > > Path to metric recording:
> > > > -
> > > >
> > >
> >
> https://github.com/apache/kafka/blob/5cab11cf525f6c06fcf9eb43f7f95ef33fe1cdbb/clients/src/main/java/org/apache/kafka/common/metrics/Sensor.java#L195-L199
> > > > -
> > > >
> > >
> >
> https://github.com/apache/kafka/blob/5cab11cf525f6c06fcf9eb43f7f95ef33fe1cdbb/clients/src/main/java/org/apache/kafka/common/metrics/Sensor.java#L230-L244
> > > >
> > > > ```
> > > > // increment all the stats
> > > > for (StatAndConfig statAndConfig : this.stats) {
> > > >    statAndConfig.stat.record(statAndConfig.config(), value, timeMs);
> > > > }
> > > > ```
> > > >
> > > > SampledStats:
> > > > - Avg:
> > > >
> > >
> >
> https://github.com/apache/kafka/blob/068ab9cefae301f3187ea885d645c425955e77d2/clients/src/main/java/org/apache/kafka/common/metrics/stats/Avg.java
> > > > - Max:
> > > >
> > >
> >
> https://github.com/apache/kafka/blob/068ab9cefae301f3187ea885d645c425955e77d2/clients/src/main/java/org/apache/kafka/common/metrics/stats/Max.java
> > > > - Min:
> > > >
> > >
> >
> https://github.com/apache/kafka/blob/068ab9cefae301f3187ea885d645c425955e77d2/clients/src/main/java/org/apache/kafka/common/metrics/stats/Min.java
> > > >
> > > > `stat#record()` are implemented by `update` method in SampledStat:
> > > >
> > > > ```Max.java
> > > >     @Override
> > > >     protected void update(Sample sample, MetricConfig config, double
> > > value,
> > > > long now) {
> > > >         sample.value = Math.max(sample.value, value);
> > > >     }
> > > > ```
> > > >
> > > > ```Avg.java
> > > >     @Override
> > > >     protected void update(Sample sample, MetricConfig config, double
> > > value,
> > > > long now) {
> > > >         sample.value += value;
> > > >     }
> > > > ```
> > > >
> > > > As far as I understand, most of the work of the stats happens on the
> > > > `combine` method that is not part of the connector execution but
> called
> > > > when metrics are queried.
> > > >
> > > > I wonder whether we should consider Avg and Max for all metrics
> > proposed
> > > as
> > > > the impact on the execution path seems minimal, and even see if Min
> is
> > > also
> > > > valuable, and use DEBUG only for more granular metrics.
> > > >
> > > > [1] https://github.com/apache/kafka/tree/trunk/jmh-benchmarks
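As a quick way to reason about the per-record recording cost before a proper JMH run, here is a rough, self-contained sketch. Note these are stand-in stat classes that only mimic the `update()` logic quoted above (they are not Kafka's Sensor/SampledStat classes), so treat the numbers and names as illustrative:

```java
// Rough sketch only: stand-in stats that mimic the update() logic of
// Kafka's Max and Avg SampledStats, to get a first feel for the cost of
// recording per record vs per batch. Not Kafka code; a real measurement
// should live in jmh-benchmarks.
import java.util.List;
import java.util.function.DoubleConsumer;

public class RecordingOverheadSketch {

    static final class MaxStat implements DoubleConsumer {
        double value = Double.NEGATIVE_INFINITY;
        @Override public void accept(double v) { value = Math.max(value, v); }
    }

    static final class AvgStat implements DoubleConsumer {
        double sum;
        long count;
        @Override public void accept(double v) { sum += v; count++; }
        double avg() { return count == 0 ? Double.NaN : sum / count; }
    }

    // Simulates a sensor recording one value per record for all of its stats;
    // the cost grows linearly with records x stats, which is the overhead
    // being discussed for per-record (vs per-batch) recording.
    static void recordAll(List<DoubleConsumer> stats, double[] latencies) {
        for (double latency : latencies) {
            for (DoubleConsumer stat : stats) {
                stat.accept(latency);
            }
        }
    }

    public static void main(String[] args) {
        MaxStat max = new MaxStat();
        AvgStat avg = new AvgStat();
        double[] latencies = {5.0, 12.0, 7.0, 3.0};

        recordAll(List.of(max, avg), latencies);

        System.out.println("max=" + max.value + " avg=" + avg.avg());
        // max=12.0 avg=6.75
    }
}
```

The update step itself is a single comparison or addition per stat, which supports the point that most of the work happens in `combine()` when metrics are queried.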
> > > >
> > > >
> > > > > 4. Just to make sure I understand correctly--does "time when it has
> > > been
> > > > > received by the Sink task" refer to the wallclock time directly
> > after a
> > > > > call to SinkTask::put has been completed (as opposed to directly
> > before
> > > > > that call is made, or something else entirely)?
> > > > >
> > > >
> > > > It currently means the time when the record has been received by the
> > > > sink task, right after consumer poll and before conversions.
> > > > Would it be valuable to have it after put-sink-records?
> > > >
> > > >
> > > > > 5. If the goal is to identify performance bottlenecks (either in
> > > production
> > > > > or pre-production environments), would it make sense to introduce
> > > metrics
> > > > > for each individual converter (i.e., key/value/header) and
> > > transformation?
> > > > > It's definitely an improvement to be able to identify the total
> time
> > > for
> > > > > conversion and transformation, but then the immediate follow-up
> > > question if
> > > > > a bottleneck is found in that phase is "which
> > converter/transformation
> > > is
> > > > > responsible?" It'd be nice if we could provide a way to quickly
> > answer
> > > that
> > > > > question.
> > > > >
> > > >
> > > > This is a great idea. I'd like to consider this as well, though it
> > > > may be best to have these more granular metrics at DEBUG level.
> > > >
> > > >
> > > > > 6. Any thoughts about offering latency metrics for source tasks
> > between
> > > > > receipt of the record from the task and delivery of the record to
> > Kafka
> > > > > (which would be tracked by producer callback)? We could also use
> the
> > > record
> > > > > timestamp either instead of or in addition to receipt time if the
> > task
> > > > > provides a timestamp with its records.
> > > > >
> > > >
> > > > With source transform and convert metrics we get part of that
> latency.
> > > > Looking at the Producer metrics, `request-latency` (though a very
> > generic
> > > > metric)
> > > > sort of answers the time between send request and ack — if my
> > > understanding
> > > > is correct.
> > > > Would these be enough or you're thinking about another approach?
> > > > maybe a custom metric to cover the producer side?
> > > >
> > > >
> > > > > 7. We may end up introducing a way for sink tasks to record
> > per-record
> > > > > delivery to the sink system (see KIP-767 [1]). I'd like it if we
> > could
> > > keep
> > > > > the names of our metrics very precise in order to avoid confusing
> > users
> > > > > (who may think that we're providing metrics on actual delivery to
> the
> > > sink
> > > > > system, which may not be the case if the connector performs
> > > asynchronous
> > > > > writes), and in order to leave room for a metrics on true delivery
> > > time by
> > > > > sink tasks. It'd also be nice if we could remain consistent with
> > > existing
> > > > > metrics such as "put-batch-avg-time-ms". With that in mind, what do
> > you
> > > > > think about renaming these metrics:
> > > > > - "sink-record-batch-latency-max-ms" to "put-batch-avg-latency-ms"
> > > > > - "sink-record-latency-max-ms" to "put-sink-record-latency-max-ms"
> > > > > - "sink-record-latency-avg-ms" to "put-sink-record-latency-avg-ms"
> > > > > - "sink-record-convert-transform-time-max-ms" to
> > > > > "convert-transform-sink-record-time-max-ms"
> > > > > - "sink-record-convert-transform-time-avg-ms" to
> > > > > "convert-transform-sink-record-time-avg-ms"
> > > > > - "source-record-transform-convert-time-max-ms" to
> > > > > "transform-convert-source-record-time-max-ms"
> > > > > - "source-record-transform-convert-time-avg-ms" to
> > > > > "transform-convert-source-record-time-avg-ms"
> > > > >
> > > >
> > > > Makes sense, thanks! I have updated the list of metrics, grouping
> > > > them by sensor and applying these suggestions.
> > > > The only ones that I want to review are: sink-record-* to put-batch-*
> > > > (first 3). Not sure if put-batch/put-sink-record describes the
> purpose
> > of
> > > > the metric — nor does `sink-record-latency`, to be honest.
> > > > My initial thought was to have something like Kafka Streams
> > e2e-latency.
> > > > Based on 4. and 6. questions, an idea could be to add:
> > > > - source-batch-e2e-latency-before-send: measure wallclock - source
> > record
> > > > timestamp after source connector poll.
> > > > - source-batch-e2e-latency-after-send: measure wallclock - record
> > > timestamp
> > > > on producer send callback
> > > > - sink-batch-e2e-latency-before-put: measure time wallclock - record
> > > > timestamp after consumer poll
> > > > - sink-batch-e2e-latency-after-put: measure time wallclock - record
> > > > timestamp after sink connector put.
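To make the semantics of these four proposals concrete, here is a small hedged sketch of the batch-level computation (the class and method names are placeholders for illustration, not the final KIP names):

```java
// Illustrative sketch of the proposed e2e computation: wall-clock time minus
// record timestamp, with the max per batch obtained by tracking the oldest
// record timestamp. Class and method names are placeholders, not KIP names.
import java.util.List;

public class E2eLatencySketch {

    // Stand-in for a sink/source record carrying a Kafka timestamp (epoch ms).
    record RecordStub(long timestampMs) {}

    // Max e2e latency of a batch: the oldest record yields the largest value,
    // so only one metric value needs to be recorded per batch.
    static long maxE2eLatencyMs(List<RecordStub> batch, long wallClockMs) {
        long oldest = Long.MAX_VALUE;
        for (RecordStub r : batch) {
            oldest = Math.min(oldest, r.timestampMs());
        }
        return wallClockMs - oldest;
    }

    public static void main(String[] args) {
        List<RecordStub> batch = List.of(
            new RecordStub(9_500L),
            new RecordStub(9_200L),
            new RecordStub(9_800L));
        System.out.println(maxE2eLatencyMs(batch, 10_000L)); // 800
    }
}
```

The same shape would apply before/after send and before/after put; only the point at which the wall-clock time is sampled changes.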
> > > >
> > > >
> > > > > Thanks again for the KIP! Looking forward to your thoughts.
> > > > >
> > > > > Cheers,
> > > > >
> > > > > Chris
> > > > >
> > > > > [1] -
> > > > >
> > > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-767%3A+Connect+Latency+Metrics
> > > > >
> > > > > On Thu, Sep 15, 2022 at 1:32 PM Jorge Esteban Quilcate Otoya <
> > > > > quilcate.jorge@gmail.com> wrote:
> > > > >
> > > > > > Hi everyone,
> > > > > >
> > > > > > I've made a slight addition to the KIP based on Yash feedback:
> > > > > >
> > > > > > - A new metric is added at INFO level to record the max latency
> > from
> > > the
> > > > > > batch timestamp, by keeping the oldest record timestamp per
> batch.
> > > > > > - A draft implementation is linked.
> > > > > >
> > > > > > Looking forward to your feedback.
> > > > > > Also, a kindly reminder that the vote thread is open.
> > > > > >
> > > > > > Thanks!
> > > > > > Jorge.
> > > > > >
> > > > > > On Thu, 8 Sept 2022 at 14:25, Jorge Esteban Quilcate Otoya <
> > > > > > quilcate.jorge@gmail.com> wrote:
> > > > > >
> > > > > > > Great. I have updated the KIP to reflect this.
> > > > > > >
> > > > > > > Cheers,
> > > > > > > Jorge.
> > > > > > >
> > > > > > > On Thu, 8 Sept 2022 at 12:26, Yash Mayya <yash.mayya@gmail.com
> >
> > > wrote:
> > > > > > >
> > > > > > >> Thanks, I think it makes sense to define these metrics at a
> > DEBUG
> > > > > > >> recording
> > > > > > >> level.
> > > > > > >>
> > > > > > >> On Thu, Sep 8, 2022 at 2:51 PM Jorge Esteban Quilcate Otoya <
> > > > > > >> quilcate.jorge@gmail.com> wrote:
> > > > > > >>
> > > > > > >> > On Thu, 8 Sept 2022 at 05:55, Yash Mayya <
> > yash.mayya@gmail.com>
> > > > > > wrote:
> > > > > > >> >
> > > > > > >> > > Hi Jorge,
> > > > > > >> > >
> > > > > > >> > > Thanks for the changes. With regard to having per batch vs
> > per
> > > > > > record
> > > > > > >> > > metrics, the additional overhead I was referring to wasn't
> > > about
> > > > > > >> whether
> > > > > > >> > or
> > > > > > >> > > not we would need to iterate over all the records in a
> > batch.
> > > I
> > > > > was
> > > > > > >> > > referring to the potential additional overhead caused by
> the
> > > > > higher
> > > > > > >> > volume
> > > > > > >> > > of calls to Sensor::record on the sensors for the new
> > metrics
> > > (as
> > > > > > >> > compared
> > > > > > >> > > to the existing batch only metrics), especially for high
> > > > > throughput
> > > > > > >> > > connectors where batch sizes could be large. I guess we
> may
> > > want
> > > > > to
> > > > > > do
> > > > > > >> > some
> > > > > > >> > > sort of performance testing and get concrete numbers to
> > verify
> > > > > > whether
> > > > > > >> > this
> > > > > > >> > > is a valid concern or not?
> > > > > > >> > >
> > > > > > >> >
> > > > > > >> > 6.1. Got it, thanks for clarifying. I guess there could be a
> > > > > benchmark
> > > > > > >> test
> > > > > > >> > of the `Sensor::record` to get an idea of the performance
> > > impact.
> > > > > > >> > Regardless, the fact that these are single-record metrics
> > > compared
> > > > > to
> > > > > > >> > existing batch-only could be explicitly defined by setting
> > these
> > > > > > >> metrics at
> > > > > > >> > a DEBUG or TRACE metric recording level, leaving the
> existing
> > at
> > > > > INFO
> > > > > > >> > level.
> > > > > > >> > wdyt?
> > > > > > >> >
> > > > > > >> >
> > > > > > >> > >
> > > > > > >> > > Thanks,
> > > > > > >> > > Yash
> > > > > > >> > >
> > > > > > >> > > On Tue, Sep 6, 2022 at 4:42 PM Jorge Esteban Quilcate
> Otoya
> > <
> > > > > > >> > > quilcate.jorge@gmail.com> wrote:
> > > > > > >> > >
> > > > > > >> > > > Hi Sagar and Yash,
> > > > > > >> > > >
> > > > > > >> > > > > the way it's defined in
> > > > > > >> > > >
> > https://kafka.apache.org/documentation/#connect_monitoring
> > > for
> > > > > > the
> > > > > > >> > > metrics
> > > > > > >> > > >
> > > > > > >> > > > 4.1. Got it. Add it to the KIP.
> > > > > > >> > > >
> > > > > > >> > > > > The only thing I would argue is do we need
> > > > > > >> sink-record-latency-min?
> > > > > > >> > > Maybe
> > > > > > >> > > > we
> > > > > > >> > > > > could remove this min metric as well and make all of
> the
> > > 3 e2e
> > > > > > >> > metrics
> > > > > > >> > > > > consistent
> > > > > > >> > > >
> > > > > > >> > > > 4.2 I see. Will remove it from the KIP.
> > > > > > >> > > >
> > > > > > >> > > > > Probably users can track the metrics at their end to
> > > > > > >> > > > > figure that out. Do you think that makes sense?
> > > > > > >> > > >
> > > > > > >> > > > 4.3. Yes, agree. With these new metrics it should be
> > easier
> > > for
> > > > > > >> users
> > > > > > >> > to
> > > > > > >> > > > track this.
> > > > > > >> > > >
> > > > > > >> > > > > I think it makes sense to not have a min metric for
> > > either to
> > > > > > >> remain
> > > > > > >> > > > > consistent with the existing put-batch and poll-batch
> > > metrics
> > > > > > >> > > >
> > > > > > >> > > > 5.1. Got it. Same as 4.2
> > > > > > >> > > >
> > > > > > >> > > > > Another naming related suggestion I had was with the
> > > > > > >> > > > > "convert-time" metrics - we should probably include
> > > > > > >> transformations
> > > > > > >> > in
> > > > > > >> > > > the
> > > > > > >> > > > > name since SMTs could definitely be attributable to a
> > > sizable
> > > > > > >> chunk
> > > > > > >> > of
> > > > > > >> > > > the
> > > > > > >> > > > > latency depending on the specific transformation
> chain.
> > > > > > >> > > >
> > > > > > >> > > > 5.2. Make sense. I'm proposing to add
> > > > > > >> > `sink-record-convert-transform...`
> > > > > > >> > > > and `source-record-transform-convert...` to represent
> > > correctly
> > > > > > the
> > > > > > >> > order
> > > > > > >> > > > of operations.
> > > > > > >> > > >
> > > > > > >> > > > > it seems like both source and sink tasks only record
> > > metrics
> > > > > at
> > > > > > a
> > > > > > >> > > "batch"
> > > > > > >> > > > > level, not on an individual record level. I think it
> > > might be
> > > > > > >> > > additional
> > > > > > >> > > > > overhead if we want to record these new metrics all at
> > the
> > > > > > record
> > > > > > >> > > level?
> > > > > > >> > > >
> > > > > > >> > > > 5.3. I considered at the beginning to implement all
> > metrics
> > > at
> > > > > the
> > > > > > >> > batch
> > > > > > >> > > > level, but given how the framework processes records, I
> > > fell back
> > > > > to
> > > > > > >> the
> > > > > > >> > > > proposed approach:
> > > > > > >> > > > - Sink Task:
> > > > > > >> > > >   - `WorkerSinkTask#convertMessages(msgs)` already
> > iterates
> > > over
> > > > > > >> > records,
> > > > > > >> > > > so there is no additional overhead to capture record
> > > latency per
> > > > > > >> > record.
> > > > > > >> > > >     -
> > > > > > >> > > >
> > > > > > >> > > >
> > > > > > >> > >
> > > > > > >> >
> > > > > > >>
> > > > > >
> > > > >
> > >
> >
> https://github.com/apache/kafka/blob/9841647c4fe422532f448423c92d26e4fdcb8932/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/WorkerSinkTask.java#L490-L514
> > > > > > >> > > >   - `WorkerSinkTask#convertAndTransformRecord(record)`
> > > actually
> > > > > > >> happens
> > > > > > >> > > > individually. Measuring this operation per batch would
> > > include
> > > > > > >> > processing
> > > > > > >> > > > that is not strictly part of "convert and transform"
> > > > > > >> > > >     -
> > > > > > >> > > >
> > > > > > >> > > >
> > > > > > >> > >
> > > > > > >> >
> > > > > > >>
> > > > > >
> > > > >
> > >
> >
> https://github.com/apache/kafka/blob/9841647c4fe422532f448423c92d26e4fdcb8932/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/WorkerSinkTask.java#L518
> > > > > > >> > > > - Source Task:
> > > > > > >> > > >   - `AbstractWorkerSourceTask#sendRecords` iterates
> over a
> > > batch
> > > > > > and
> > > > > > >> > > > applies transforms and convert record individually as
> > well:
> > > > > > >> > > >     -
> > > > > > >> > > >
> > > > > > >> > > >
> > > > > > >> > >
> > > > > > >> >
> > > > > > >>
> > > > > >
> > > > >
> > >
> >
> https://github.com/apache/kafka/blob/9841647c4fe422532f448423c92d26e4fdcb8932/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/AbstractWorkerSourceTask.java#L389-L390
> > > > > > >> > > >
> > > > > > >> > > > > This might require some additional changes -
> > > > > > >> > > > > for instance, with the "sink-record-latency" metric,
> we
> > > might
> > > > > > only
> > > > > > >> > want
> > > > > > >> > > > to
> > > > > > >> > > > > have a "max" metric since "avg" would require
> recording
> > a
> > > > > value
> > > > > > on
> > > > > > >> > the
> > > > > > >> > > > > sensor for each record (whereas we can get a "max" by
> > only
> > > > > > >> recording
> > > > > > >> > a
> > > > > > >> > > > > metric value for the oldest record in each batch).
> > > > > > >> > > >
> > > > > > >> > > > 5.4. Recording record-latency per batch may not be as
> > > useful as
> > > > > > >> there
> > > > > > >> > is
> > > > > > >> > > no
> > > > > > >> > > > guarantee that the oldest record will be representative
> of
> > > the
> > > > > > >> batch.
> > > > > > >> > > >
> > > > > > >> > > > On Sat, 3 Sept 2022 at 16:02, Yash Mayya <
> > > yash.mayya@gmail.com>
> > > > > > >> wrote:
> > > > > > >> > > >
> > > > > > >> > > > > Hi Jorge and Sagar,
> > > > > > >> > > > >
> > > > > > >> > > > > I think it makes sense to not have a min metric for
> > > either to
> > > > > > >> remain
> > > > > > >> > > > > consistent with the existing put-batch and poll-batch
> > > metrics
> > > > > > (it
> > > > > > >> > > doesn't
> > > > > > >> > > > > seem particularly useful either anyway). Also, the new
> > > > > > >> > > > > "sink-record-latency" metric name looks fine to me,
> > > thanks for
> > > > > > >> making
> > > > > > >> > > the
> > > > > > >> > > > > changes! Another naming related suggestion I had was
> > with
> > > the
> > > > > > >> > > > > "convert-time" metrics - we should probably include
> > > > > > >> transformations
> > > > > > >> > in
> > > > > > >> > > > the
> > > > > > >> > > > > name since SMTs could definitely be attributable to a
> > > sizable
> > > > > > >> chunk
> > > > > > >> > of
> > > > > > >> > > > the
> > > > > > >> > > > > latency depending on the specific transformation
> chain.
> > > > > > >> > > > >
> > > > > > >> > > > > I have one high level question with respect to
> > > implementation
> > > > > -
> > > > > > >> > > > currently,
> > > > > > >> > > > > it seems like both source and sink tasks only record
> > > metrics
> > > > > at
> > > > > > a
> > > > > > >> > > "batch"
> > > > > > >> > > > > level, not on an individual record level. I think it
> > > might be
> > > > > > >> > > additional
> > > > > > >> > > > > overhead if we want to record these new metrics all at
> > the
> > > > > > record
> > > > > > >> > > level?
> > > > > > >> > > > > Could we instead make all of these new metrics for
> > > batches of
> > > > > > >> records
> > > > > > >> > > > > rather than individual records in order to remain
> > > consistent
> > > > > > with
> > > > > > >> the
> > > > > > >> > > > > existing task level metrics? This might require some
> > > > > additional
> > > > > > >> > > changes -
> > > > > > >> > > > > for instance, with the "sink-record-latency" metric,
> we
> > > might
> > > > > > only
> > > > > > >> > want
> > > > > > >> > > > to
> > > > > > >> > > > > have a "max" metric since "avg" would require
> recording
> > a
> > > > > value
> > > > > > on
> > > > > > >> > the
> > > > > > >> > > > > sensor for each record (whereas we can get a "max" by
> > only
> > > > > > >> recording
> > > > > > >> > a
> > > > > > >> > > > > metric value for the oldest record in each batch).
> > > > > > >> > > > >
> > > > > > >> > > > > Thanks,
> > > > > > >> > > > > Yash
> > > > > > >> > > > >
> > > > > > >> > > > > On Fri, Sep 2, 2022 at 3:16 PM Sagar <
> > > > > sagarmeansocean@gmail.com
> > > > > > >
> > > > > > >> > > wrote:
> > > > > > >> > > > >
> > > > > > >> > > > > > Hi Jorge,
> > > > > > >> > > > > >
> > > > > > >> > > > > > Thanks for the changes.
> > > > > > >> > > > > >
> > > > > > >> > > > > > Regarding the metrics, I meant something like this:
> > > > > > >> > > > > >
> > > > > > >> > > > >
> > > > > > >> > > >
> > > > > > >> > >
> > > > > > >> >
> > > > > > >>
> > > > > >
> > > > >
> > >
> >
> kafka.connect:type=sink-task-metrics,connector="{connector}",task="{task}"
> > > > > > >> > > > > >
> > > > > > >> > > > > > the way it's defined in
> > > > > > >> > > > > >
> > > https://kafka.apache.org/documentation/#connect_monitoring
> > > > > > for
> > > > > > >> the
> > > > > > >> > > > > > metrics.
> > > > > > >> > > > > >
> > > > > > >> > > > > > I see what you mean by the 3 metrics and how it can
> be
> > > > > > >> interpreted.
> > > > > > >> > > The
> > > > > > >> > > > > > only thing I would argue is do we need
> > > > > > sink-record-latency-min?
> > > > > > >> > Maybe
> > > > > > >> > > > we
> > > > > > >> > > > > > could remove this min metric as well and make all of
> > > the 3
> > > > > e2e
> > > > > > >> > > metrics
> > > > > > >> > > > > > consistent (since put-batch also doesn't expose a min
> > > which
> > > > > > makes
> > > > > > >> > > sense
> > > > > > >> > > > to
> > > > > > >> > > > > > me). I think this is in contrast to what Yash
> pointed
> > > out
> > > > > > above
> > > > > > >> so
> > > > > > >> > I
> > > > > > >> > > > > would
> > > > > > >> > > > > > like to hear his thoughts as well.
> > > > > > >> > > > > >
> > > > > > >> > > > > > The other point Yash mentioned about the slightly
> > flawed
> > > > > > >> definition
> > > > > > >> > > of
> > > > > > >> > > > > e2e
> > > > > > >> > > > > > is also true in a sense. But I have a feeling that
> > > once the
> > > > > > >> > records
> > > > > > >> > > > are
> > > > > > >> > > > > > polled by the connector tasks, it would be difficult
> > to
> > > > > track
> > > > > > >> the
> > > > > > >> > > final
> > > > > > >> > > > > leg
> > > > > > >> > > > > > via the framework. Probably users can track the
> > metrics
> > > at
> > > > > > their
> > > > > > >> > end
> > > > > > >> > > to
> > > > > > >> > > > > > figure that out. Do you think that makes sense?
> > > > > > >> > > > > >
> > > > > > >> > > > > > Thanks!
> > > > > > >> > > > > > Sagar.
> > > > > > >> > > > > >
> > > > > > >> > > > > >
> > > > > > >> > > > > >
> > > > > > >> > > > > >
> > > > > > >> > > > > > On Thu, Sep 1, 2022 at 11:40 PM Jorge Esteban
> Quilcate
> > > > > Otoya <
> > > > > > >> > > > > > quilcate.jorge@gmail.com> wrote:
> > > > > > >> > > > > >
> > > > > > >> > > > > > > Hi Sagar and Yash,
> > > > > > >> > > > > > >
> > > > > > >> > > > > > > Thanks for your feedback!
> > > > > > >> > > > > > >
> > > > > > >> > > > > > > > 1) I am assuming the new metrics would be task
> > level
> > > > > > metric.
> > > > > > >> > > > > > >
> > > > > > >> > > > > > > 1.1 Yes, it will be a task level metric,
> implemented
> > > on
> > > > > the
> > > > > > >> > > > > > > Worker[Source/Sink]Task.
> > > > > > >> > > > > > >
> > > > > > >> > > > > > > > Could you specify the way it's done for other
> > > > > sink/source
> > > > > > >> > > > connector?
> > > > > > >> > > > > > >
> > > > > > >> > > > > > > 1.2. Not sure what do you mean by this. Could you
> > > > > elaborate
> > > > > > a
> > > > > > >> bit
> > > > > > >> > > > more?
> > > > > > >> > > > > > >
> > > > > > >> > > > > > > > 2. I am slightly confused about the e2e latency
> > > > > metric...
> > > > > > >> > > > > > >
> > > > > > >> > > > > > > 2.1. Yes, I see. I was trying to bring a similar
> > > concept
> > > > > as
> > > > > > in
> > > > > > >> > > > Streams
> > > > > > >> > > > > > with
> > > > > > >> > > > > > > KIP-613, though the e2e concept may not be
> > > translatable.
> > > > > > >> > > > > > > We could keep it as `sink-record-latency` to avoid
> > > > > > conflating
> > > > > > >> > > > > concepts. A
> > > > > > >> > > > > > > similar metric naming was proposed in KIP-489 but
> at
> > > the
> > > > > > >> consumer
> > > > > > >> > > > > level —
> > > > > > >> > > > > > > though it seems dormant for a couple of years.
> > > > > > >> > > > > > >
> > > > > > >> > > > > > > > However, the put-batch time measures the
> > > > > > >> > > > > > > > time to put a batch of records to external sink.
> > > So, I
> > > > > > would
> > > > > > >> > > assume
> > > > > > >> > > > > > the 2
> > > > > > >> > > > > > > > can't be added as is to compute the e2e latency.
> > > Maybe I
> > > > > > am
> > > > > > >> > > missing
> > > > > > >> > > > > > > > something here. Could you plz clarify this.
> > > > > > >> > > > > > >
> > > > > > >> > > > > > > 2.2. Yes, agree. Not necessarily added, but with
> > the 3
> > > > > > >> latencies
> > > > > > >> > > > (poll,
> > > > > > >> > > > > > > convert, putBatch) will be clearer where the
> > > bottleneck
> > > > > may
> > > > > > >> be,
> > > > > > >> > and
> > > > > > >> > > > > > > represent the internal processing.
> > > > > > >> > > > > > >
> > > > > > >> > > > > > > > however, as per the KIP it looks like it will be
> > > > > > >> > > > > > > > the latency between when the record was written
> to
> > > Kafka
> > > > > > and
> > > > > > >> > when
> > > > > > >> > > > the
> > > > > > >> > > > > > > > record is returned by a sink task's consumer's
> > poll?
> > > > > > >> > > > > > >
> > > > > > >> > > > > > > 3.1. Agree. 2.1. could help to clarify this.
> > > > > > >> > > > > > >
> > > > > > >> > > > > > > > One more thing - I was wondering
> > > > > > >> > > > > > > > if there's a particular reason for having a min
> > > metric
> > > > > for
> > > > > > >> e2e
> > > > > > >> > > > > latency
> > > > > > >> > > > > > > but
> > > > > > >> > > > > > > > not for convert time?
> > > > > > >> > > > > > >
> > > > > > >> > > > > > > 3.2. Was following KIP-613 for e2e which seems
> > useful
> > > to
> > > > > > >> compare
> > > > > > >> > > with
> > > > > > >> > > > > > Max a
> > > > > > >> > > > > > > get an idea of the window of results, though
> current
> > > > > > >> latencies in
> > > > > > >> > > > > > Connector
> > > > > > >> > > > > > > do not include Min, and that's why I haven't added
> > it
> > > for
> > > > > > >> convert
> > > > > > >> > > > > > latency.
> > > > > > >> > > > > > > Do you think it makes sense to extend latency metrics
> metrics
> > > with
> > > > > > Min?
> > > > > > >> > > > > > >
> > > > > > >> > > > > > > KIP is updated to clarify some of these changes.
> > > > > > >> > > > > > >
> > > > > > >> > > > > > > Many thanks,
> > > > > > >> > > > > > > Jorge.
> > > > > > >> > > > > > >
> > > > > > >> > > > > > > On Thu, 1 Sept 2022 at 18:11, Yash Mayya <
> > > > > > >> yash.mayya@gmail.com>
> > > > > > >> > > > wrote:
> > > > > > >> > > > > > >
> > > > > > >> > > > > > > > Hi Jorge,
> > > > > > >> > > > > > > >
> > > > > > >> > > > > > > > Thanks for the KIP! I have the same confusion
> with
> > > the
> > > > > > >> > > e2e-latency
> > > > > > >> > > > > > > metrics
> > > > > > >> > > > > > > > as Sagar above. "e2e" would seem to indicate the
> > > latency
> > > > > > >> > between
> > > > > > >> > > > when
> > > > > > >> > > > > > the
> > > > > > >> > > > > > > > record was written to Kafka and when the record
> > was
> > > > > > written
> > > > > > >> to
> > > > > > >> > > the
> > > > > > >> > > > > sink
> > > > > > >> > > > > > > > system by the connector - however, as per the
> KIP
> > it
> > > > > looks
> > > > > > >> like
> > > > > > >> > > it
> > > > > > >> > > > > will
> > > > > > >> > > > > > > be
> > > > > > >> > > > > > > > the latency between when the record was written
> to
> > > Kafka
> > > > > > and
> > > > > > >> > when
> > > > > > >> > > > the
> > > > > > >> > > > > > > > record is returned by a sink task's consumer's
> > > poll? I
> > > > > > think
> > > > > > >> > that
> > > > > > >> > > > > > metric
> > > > > > >> > > > > > > > will be a little confusing to interpret. One
> more
> > > thing
> > > > > -
> > > > > > I
> > > > > > >> was
> > > > > > >> > > > > > wondering
> > > > > > >> > > > > > > > if there's a particular reason for having a min
> > > metric
> > > > > for
> > > > > > >> e2e
> > > > > > >> > > > > latency
> > > > > > >> > > > > > > but
> > > > > > >> > > > > > > > not for convert time?
> > > > > > >> > > > > > > >
> > > > > > >> > > > > > > > Thanks,
> > > > > > >> > > > > > > > Yash
> > > > > > >> > > > > > > >
> > > > > > >> > > > > > > > On Thu, Sep 1, 2022 at 8:59 PM Sagar <
> > > > > > >> > sagarmeansocean@gmail.com>
> > > > > > >> > > > > > wrote:
> > > > > > >> > > > > > > >
> > > > > > >> > > > > > > > > Hi Jorge,
> > > > > > >> > > > > > > > >
> > > > > > >> > > > > > > > > Thanks for the KIP. It looks like a very good
> > > > > addition.
> > > > > > I
> > > > > > >> > > skimmed
> > > > > > >> > > > > > > through
> > > > > > >> > > > > > > > > once and had a couple of questions =>
> > > > > > >> > > > > > > > >
> > > > > > >> > > > > > > > > 1) I am assuming the new metrics would be task
> > > level
> > > > > > >> metric.
> > > > > > >> > > > Could
> > > > > > >> > > > > > you
> > > > > > >> > > > > > > > > specify the way it's done for other
> sink/source
> > > > > > connector?
> > > > > > >> > > > > > > > > 2) I am slightly confused about the e2e
> latency
> > > > > metric.
> > > > > > >> Let's
> > > > > > >> > > > > > consider
> > > > > > >> > > > > > > > the
> > > > > > >> > > > > > > > > sink connector metric. If I look at the way
> it's
> > > > > > supposed
> > > > > > >> to
> > > > > > >> > be
> > > > > > >> > > > > > > > calculated,
> > > > > > >> > > > > > > > > i.e the difference between the record
> timestamp
> > > and
> > > > > the
> > > > > > >> wall
> > > > > > >> > > > clock
> > > > > > >> > > > > > > time,
> > > > > > >> > > > > > > > it
> > > > > > >> > > > > > > > > looks like a per record metric. However, the
> > > put-batch
> > > > > > >> time
> > > > > > >> > > > > measures
> > > > > > >> > > > > > > the
> > > > > > >> > > > > > > > > time to put a batch of records to external
> sink.
> > > So, I
> > > > > > >> would
> > > > > > >> > > > assume
> > > > > > >> > > > > > > the 2
> > > > > > >> > > > > > > > > can't be added as is to compute the e2e
> latency.
> > > > > Maybe I
> > > > > > >> am
> > > > > > >> > > > missing
> > > > > > >> > > > > > > > > something here. Could you plz clarify this.
> > > > > > >> > > > > > > > >
> > > > > > >> > > > > > > > > Thanks!
> > > > > > >> > > > > > > > > Sagar.
> > > > > > >> > > > > > > > >
> > > > > > >> > > > > > > > > On Tue, Aug 30, 2022 at 8:43 PM Jorge Esteban
> > > Quilcate
> > > > > > >> Otoya
> > > > > > >> > <
> > > > > > >> > > > > > > > > quilcate.jorge@gmail.com> wrote:
> > > > > > >> > > > > > > > >
> > > > > > >> > > > > > > > > > Hi all,
> > > > > > >> > > > > > > > > >
> > > > > > >> > > > > > > > > > I'd like to start a discussion thread on
> > > KIP-864:
> > > > > Add
> > > > > > >> > > > End-To-End
> > > > > > >> > > > > > > > Latency
> > > > > > >> > > > > > > > > > Metrics to Connectors.
> > > > > > >> > > > > > > > > > This KIP aims to improve the metrics
> available
> > > on
> > > > > > Source
> > > > > > >> > and
> > > > > > >> > > > Sink
> > > > > > >> > > > > > > > > > Connectors to measure end-to-end latency,
> > > including
> > > > > > >> source
> > > > > > >> > > and
> > > > > > >> > > > > sink
> > > > > > >> > > > > > > > > record
> > > > > > >> > > > > > > > > > conversion time, and sink record e2e latency
> > > > > (similar
> > > > > > to
> > > > > > >> > > > KIP-613
> > > > > > >> > > > > > for
> > > > > > >> > > > > > > > > > Streams).
> > > > > > >> > > > > > > > > >
> > > > > > >> > > > > > > > > > The KIP is here:
> > > > > > >> > > > > > > > > >
> > > > > > >> > > > > > > > > >
> > > > > > >> > > > > > > > >
> > > > > > >> > > > > > > >
> > > > > > >> > > > > > >
> > > > > > >> > > > > >
> > > > > > >> > > > >
> > > > > > >> > > >
> > > > > > >> > >
> > > > > > >> >
> > > > > > >>
> > > > > >
> > > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-864%3A+Add+End-To-End+Latency+Metrics+to+Connectors
> > > > > > >> > > > > > > > > >
> > > > > > >> > > > > > > > > > Please take a look and let me know what you
> > > think.
> > > > > > >> > > > > > > > > >
> > > > > > >> > > > > > > > > > Cheers,
> > > > > > >> > > > > > > > > > Jorge.
> > > > > > >> > > > > > > > > >
> > > > > > >> > > > > > > > >
> > > > > > >> > > > > > > >
> > > > > > >> > > > > > >
> > > > > > >> > > > > >
> > > > > > >> > > > >
> > > > > > >> > > >
> > > > > > >> > >
> > > > > > >> >
> > > > > > >>
> > > > > > >
> > > > > >
> > > > >
> > >
> >
>

Re: [DISCUSS] KIP-864: Add End-To-End Latency Metrics to Connectors

Posted by Chris Egerton <ch...@aiven.io.INVALID>.
Hi Jorge,

Thanks for the updates, and apologies for the delay. The new diagram
directly under the "Proposed Changes" section is absolutely gorgeous!


Follow-ups:

RE 2: Good point. We can use the same level for these metrics, it's not a
big deal.

RE 3: As long as all the per-record metrics are kept at DEBUG level, it
should be fine to leave JMH benchmarking for a follow-up. If we want to add
new per-record, INFO-level metrics, I would be more comfortable with
including benchmarking as part of the testing plan for the KIP. One
possible compromise could be to propose that these features be merged at
DEBUG level, and then possibly upgraded to INFO level in the future pending
benchmarks to guard against performance degradation.
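For context on the per-record cost being weighed here, a rough standalone sketch (not Kafka's actual Sensor/SampledStat classes; all names are illustrative) of the constant work a sensor does for each recorded value:

```java
// Standalone sketch mirroring the shape of Sensor::record's loop over its
// registered stats. Not Kafka's real classes; names are illustrative only.
final class LatencyStats {
    double max = Double.NEGATIVE_INFINITY;
    double sum = 0.0;
    long count = 0;

    // Constant work per value, like SampledStat#update: this is the cost
    // that scales with record volume when metrics move to per-record level.
    void record(double valueMs) {
        max = Math.max(max, valueMs);
        sum += valueMs;
        count++;
    }

    double avg() {
        return count == 0 ? Double.NaN : sum / count;
    }
}
```

The expensive aggregation (combining samples) only happens when metrics are read, so the per-record hot path stays cheap in principle; a JMH run would still be needed to confirm that at high throughput.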

RE 4: I think for a true "end-to-end" metric, it'd be useful to include the
time taken by the task to actually deliver the record. However, with the
new metric names and descriptions provided in the KIP, I have no objections
to what's currently proposed, and a new "end-to-end" metric can be taken
on later in a follow-up KIP.
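To illustrate the measurement as currently proposed (wall-clock time minus record timestamp, tracking only the oldest record per batch to bound per-record overhead), here is a minimal sketch; the class and method names are assumptions, not Connect APIs:

```java
import java.util.List;

// Illustrative sketch of the proposed batch latency measurement: wall-clock
// time minus the oldest record timestamp in the batch. Names here are
// assumptions, not part of the Connect framework.
final class BatchLatency {
    static long maxLatencyMs(List<Long> recordTimestampsMs, long nowMs) {
        long oldest = Long.MAX_VALUE;
        for (long ts : recordTimestampsMs) {
            oldest = Math.min(oldest, ts);
        }
        // Empty batches contribute no latency sample.
        return recordTimestampsMs.isEmpty() ? 0L : nowMs - oldest;
    }
}
```

A true "end-to-end" variant would record the same difference after delivery rather than after the consumer poll.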

RE 6: You're right, existing producer metrics should be enough for now. We
can revisit this later if/when we add delivery-centric metrics for sink
tasks as well.

RE 7: The new metric names in the KIP LGTM; I don't see any need to expand
beyond those but if you'd still like to pursue others, LMK.


New thoughts:

One small thought: instead of "alias" in alias="{transform_alias}" for the
per-transform metrics, could we use "transform"? IMO it's clearer since we
don't use "alias" in the names of transform-related properties, and "alias"
may be confused with the classloading term where you can use, e.g.,
"FileStreamSource" as the name of a connector class in a connector config
instead of "org.apache.kafka.connect.file.FileStreamSourceConnector".


Cheers,

Chris

On Fri, Nov 18, 2022 at 12:06 PM Jorge Esteban Quilcate Otoya <
quilcate.jorge@gmail.com> wrote:

> Thanks Mickael!
>
>
> On Wed, 9 Nov 2022 at 15:54, Mickael Maison <mi...@gmail.com>
> wrote:
>
> > Hi Jorge,
> >
> > Thanks for the KIP, it is a nice improvement.
> >
> > 1) The per transformation metrics still have a question mark next to
> > them in the KIP. Do you want to include them? If so, we'll want to tag
> > them; we should be able to include the aliases in TransformationChain
> > and use them.
> >
>
> Yes, I have added the changes on TransformChain that will be needed to add
> these metrics.
>
>
> >
> > 2) I see no references to predicates. If we don't want to measure
> > their latency, can we say it explicitly?
> >
>
> Good question, I haven't considered these. Though as these are materialized
> as PredicatedTransformation, they should be covered by these changes.
> Adding a note about this.
>
>
> >
> > 3) Should we have sink-record-batch-latency-avg-ms? All other metrics
> > have both the maximum and average values.
> >
> >
> Good question. I will remove it and change the record latency from
> DEBUG->INFO as it already covers the maximum metric.
>
> Hope it's clearer now, let me know if there is any additional feedback.
> Thanks!
>
>
>
> > Thanks,
> > Mickael
> >
> > On Thu, Oct 20, 2022 at 9:58 PM Jorge Esteban Quilcate Otoya
> > <qu...@gmail.com> wrote:
> > >
> > > Thanks, Chris! Great feedback! Please, find my comments below:
> > >
> > > On Thu, 13 Oct 2022 at 18:52, Chris Egerton <ch...@aiven.io.invalid>
> > wrote:
> > >
> > > > Hi Jorge,
> > > >
> > > > Thanks for the KIP. I agree with the overall direction and think this
> > would
> > > > be a nice improvement to Kafka Connect. Here are my initial thoughts
> > on the
> > > > details:
> > > >
> > > > 1. The motivation section outlines the gaps in Kafka Connect's task
> > metrics
> > > > nicely. I think it'd be useful to include more concrete details on
> why
> > > > these gaps need to be filled in, and in which cases additional
> metrics
> > > > would be helpful. One goal could be to provide enhanced monitoring of
> > > > production deployments that allows for cluster administrators to set
> up
> > > > automatic alerts for latency spikes and, if triggered, quickly
> > identify the
> > > > root cause of those alerts, reducing the time to remediation. Another
> > goal
> > > > could be to provide more insight to developers or cluster
> > administrators
> > > > who want to do performance testing on connectors in non-production
> > > > environments. It may help guide our decision making process to have a
> > > > clearer picture of the goals we're trying to achieve.
> > > >
> > >
> > > Agree. The Motivation section has been updated.
> > > Thanks for the examples, I see both of them being covered by the KIP.
> > > I see how these could give us a good distinction on whether to position
> > > some metrics at INFO or DEBUG level.
> > >
> > >
> > > > 2. If we're trying to address the alert-and-diagnose use case, it'd
> be
> > > > useful to have as much information as possible at INFO level, rather
> > than
> > > > forcing cluster administrators to possibly reconfigure a connector to
> > emit
> > > > DEBUG or TRACE level metrics in order to diagnose a potential
> > > > production-impacting performance bottleneck. I can see the rationale
> > for
> > > > emitting per-record metrics that track an average value at DEBUG
> > level, but
> > > > for per-record metrics that track a maximum value, is there any
> reason
> > not
> > > > to provide this information at INFO level?
> > > >
> > >
> > > Agree. Though since the Max and Avg metrics are part of the same sensor
> > > (where the metric recording level is defined), both metrics get the same level.
> > >
> > >
> > > > 3. I'm also curious about the performance testing suggested by Yash
> to
> > > > gauge the potential impact of this change. Have you been able to do
> any
> > > > testing with your draft implementation yet?
> > > >
> > >
> > > No, not so far.
> > > I think it would be valuable to discuss the scope of this testing and
> > maybe
> > > tackle it
> > > in a separate issue as Sensors and Metrics are used all over the place.
> > > My initial understanding is that these tests should be placed in the
> > > jmh-benchmarks[1].
> > > Then, we could target testing Sensors and Metrics, and validate how
> much
> > > overhead
> > > is added by having only Max vs Max,Avg(,Min), etc.
> > > On the other hand, we could extend this to Transformers or other
> Connect
> > > layers.
> > >
> > > Here are some pointers to the Sensors and Metrics implementations that
> > > could be considered:
> > > Path to metric recording:
> > > -
> > >
> >
> https://github.com/apache/kafka/blob/5cab11cf525f6c06fcf9eb43f7f95ef33fe1cdbb/clients/src/main/java/org/apache/kafka/common/metrics/Sensor.java#L195-L199
> > > -
> > >
> >
> https://github.com/apache/kafka/blob/5cab11cf525f6c06fcf9eb43f7f95ef33fe1cdbb/clients/src/main/java/org/apache/kafka/common/metrics/Sensor.java#L230-L244
> > >
> > > ```
> > > // increment all the stats
> > > for (StatAndConfig statAndConfig : this.stats) {
> > >    statAndConfig.stat.record(statAndConfig.config(), value, timeMs);
> > > }
> > > ```
> > >
> > > SampledStats:
> > > - Avg:
> > >
> >
> https://github.com/apache/kafka/blob/068ab9cefae301f3187ea885d645c425955e77d2/clients/src/main/java/org/apache/kafka/common/metrics/stats/Avg.java
> > > - Max:
> > >
> >
> https://github.com/apache/kafka/blob/068ab9cefae301f3187ea885d645c425955e77d2/clients/src/main/java/org/apache/kafka/common/metrics/stats/Max.java
> > > - Min:
> > >
> >
> https://github.com/apache/kafka/blob/068ab9cefae301f3187ea885d645c425955e77d2/clients/src/main/java/org/apache/kafka/common/metrics/stats/Min.java
> > >
> > > `stat#record()` is implemented by the `update` method in each SampledStat:
> > >
> > > ```Max.java
> > >     @Override
> > >     protected void update(Sample sample, MetricConfig config, double value, long now) {
> > >         sample.value = Math.max(sample.value, value);
> > >     }
> > > ```
> > >
> > > ```Avg.java
> > >     @Override
> > >     protected void update(Sample sample, MetricConfig config, double value, long now) {
> > >         sample.value += value;
> > >     }
> > > ```
> > >
> > > As far as I understand, most of the work of the stats happens in the
> > > `combine` method, which is not part of the connector execution path but
> > > is called when metrics are queried.
> > >
> > > I wonder whether we should consider Avg and Max for all metrics
> proposed
> > as
> > > the impact on the execution path seems minimal, and even see if Min is
> > also
> > > valuable, and use DEBUG only for more granular metrics.
> > >
> > > [1] https://github.com/apache/kafka/tree/trunk/jmh-benchmarks
> > >
> > >
> > > > 4. Just to make sure I understand correctly--does "time when it has
> > been
> > > > received by the Sink task" refer to the wallclock time directly
> after a
> > > > call to SinkTask::put has been completed (as opposed to directly
> before
> > > > that call is made, or something else entirely)?
> > > >
> > >
> > > It currently means the time when the record has been received by the
> > > sink task, right after the consumer poll and before conversions.
> > > Would it be valuable to have it after put-sink-records?
> > >
> > >
> > > > 5. If the goal is to identify performance bottlenecks (either in
> > production
> > > > or pre-production environments), would it make sense to introduce
> > metrics
> > > > for each individual converter (i.e., key/value/header) and
> > transformation?
> > > > It's definitely an improvement to be able to identify the total time
> > for
> > > > conversion and transformation, but then the immediate follow-up
> > question if
> > > > a bottleneck is found in that phase is "which
> converter/transformation
> > is
> > > > responsible?" It'd be nice if we could provide a way to quickly
> answer
> > that
> > > > question.
> > > >
> > >
> > > This is a great idea. I'd like to consider this as well, though maybe
> > > these more granular metrics would be good to have at DEBUG level.
> > >
> > >
> > > > 6. Any thoughts about offering latency metrics for source tasks
> between
> > > > receipt of the record from the task and delivery of the record to
> Kafka
> > > > (which would be tracked by producer callback)? We could also use the
> > record
> > > > timestamp either instead of or in addition to receipt time if the
> task
> > > > provides a timestamp with its records.
> > > >
> > >
> > > With source transform and convert metrics we get part of that latency.
> > > Looking at the Producer metrics, `request-latency` (though a very
> > > generic metric) sort of answers the time between send request and ack,
> > > if my understanding is correct.
> > > Would these be enough or you're thinking about another approach?
> > > maybe a custom metric to cover the producer side?
> > >
> > >
> > > > 7. We may end up introducing a way for sink tasks to record
> per-record
> > > > delivery to the sink system (see KIP-767 [1]). I'd like it if we
> could
> > keep
> > > > the names of our metrics very precise in order to avoid confusing
> users
> > > > (who may think that we're providing metrics on actual delivery to the
> > sink
> > > > system, which may not be the case if the connector performs
> > asynchronous
> > > > writes), and in order to leave room for a metrics on true delivery
> > time by
> > > > sink tasks. It'd also be nice if we could remain consistent with
> > existing
> > > > metrics such as "put-batch-avg-time-ms". With that in mind, what do
> you
> > > > think about renaming these metrics:
> > > > - "sink-record-batch-latency-max-ms" to "put-batch-avg-latency-ms"
> > > > - "sink-record-latency-max-ms" to "put-sink-record-latency-max-ms"
> > > > - "sink-record-latency-avg-ms" to "put-sink-record-latency-avg-ms"
> > > > - "sink-record-convert-transform-time-max-ms" to
> > > > "convert-transform-sink-record-time-max-ms"
> > > > - "sink-record-convert-transform-time-avg-ms" to
> > > > "convert-transform-sink-record-time-avg-ms"
> > > > - "source-record-transform-convert-time-max-ms" to
> > > > "transform-convert-source-record-time-max-ms"
> > > > - "source-record-transform-convert-time-avg-ms" to
> > > > "transform-convert-source-record-time-avg-ms"
> > > >
> > >
> > > Makes sense, thanks! I have updated the list of metrics, grouping them
> > > by sensor and applying these suggestions.
> > > The only ones that I want to review are: sink-record-* to put-batch-*
> > > (first 3). Not sure if put-batch/put-sink-record describes the purpose
> > > of the metric, nor does `sink-record-latency` to be honest.
> > > My initial thought was to have something like Kafka Streams
> e2e-latency.
> > > Based on 4. and 6. questions, an idea could be to add:
> > > - source-batch-e2e-latency-before-send: measure wallclock - source
> record
> > > timestamp after source connector poll.
> > > - source-batch-e2e-latency-after-send: measure wallclock - record
> > timestamp
> > > on producer send callback
> > > - sink-batch-e2e-latency-before-put: measure time wallclock - record
> > > timestamp after consumer poll
> > > - sink-batch-e2e-latency-after-put: measure time wallclock - record
> > > timestamp after sink connector put.
> > >
> > >
> > > > Thanks again for the KIP! Looking forward to your thoughts.
> > > >
> > > > Cheers,
> > > >
> > > > Chris
> > > >
> > > > [1] -
> > > >
> > > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-767%3A+Connect+Latency+Metrics
> > > >
> > > > On Thu, Sep 15, 2022 at 1:32 PM Jorge Esteban Quilcate Otoya <
> > > > quilcate.jorge@gmail.com> wrote:
> > > >
> > > > > Hi everyone,
> > > > >
> > > > > I've made a slight addition to the KIP based on Yash feedback:
> > > > >
> > > > > - A new metric is added at INFO level to record the max latency
> from
> > the
> > > > > batch timestamp, by keeping the oldest record timestamp per batch.
> > > > > - A draft implementation is linked.
> > > > >
> > > > > Looking forward to your feedback.
> > > > > Also, a kindly reminder that the vote thread is open.
> > > > >
> > > > > Thanks!
> > > > > Jorge.
> > > > >
> > > > > On Thu, 8 Sept 2022 at 14:25, Jorge Esteban Quilcate Otoya <
> > > > > quilcate.jorge@gmail.com> wrote:
> > > > >
> > > > > > Great. I have updated the KIP to reflect this.
> > > > > >
> > > > > > Cheers,
> > > > > > Jorge.
> > > > > >
> > > > > > On Thu, 8 Sept 2022 at 12:26, Yash Mayya <ya...@gmail.com>
> > wrote:
> > > > > >
> > > > > >> Thanks, I think it makes sense to define these metrics at a
> DEBUG
> > > > > >> recording
> > > > > >> level.
> > > > > >>
> > > > > >> On Thu, Sep 8, 2022 at 2:51 PM Jorge Esteban Quilcate Otoya <
> > > > > >> quilcate.jorge@gmail.com> wrote:
> > > > > >>
> > > > > >> > On Thu, 8 Sept 2022 at 05:55, Yash Mayya <
> yash.mayya@gmail.com>
> > > > > wrote:
> > > > > >> >
> > > > > >> > > Hi Jorge,
> > > > > >> > >
> > > > > >> > > Thanks for the changes. With regard to having per batch vs
> per
> > > > > record
> > > > > >> > > metrics, the additional overhead I was referring to wasn't
> > about
> > > > > >> whether
> > > > > >> > or
> > > > > >> > > not we would need to iterate over all the records in a
> batch.
> > I
> > > > was
> > > > > >> > > referring to the potential additional overhead caused by the
> > > > higher
> > > > > >> > volume
> > > > > >> > > of calls to Sensor::record on the sensors for the new
> metrics
> > (as
> > > > > >> > compared
> > > > > >> > > to the existing batch only metrics), especially for high
> > > > throughput
> > > > > >> > > connectors where batch sizes could be large. I guess we may
> > want
> > > > to
> > > > > do
> > > > > >> > some
> > > > > >> > > sort of performance testing and get concrete numbers to
> verify
> > > > > whether
> > > > > >> > this
> > > > > >> > > is a valid concern or not?
> > > > > >> > >
> > > > > >> >
> > > > > >> > 6.1. Got it, thanks for clarifying. I guess there could be a
> > > > benchmark
> > > > > >> test
> > > > > >> > of the `Sensor::record` to get an idea of the performance
> > impact.
> > > > > >> > Regardless, the fact that these are single-record metrics
> > compared
> > > > to
> > > > > >> > existing batch-only could be explicitly defined by setting
> these
> > > > > >> metrics at
> > > > > >> > a DEBUG or TRACE metric recording level, leaving the existing
> at
> > > > INFO
> > > > > >> > level.
> > > > > >> > wdyt?
> > > > > >> >
> > > > > >> >
> > > > > >> > >
> > > > > >> > > Thanks,
> > > > > >> > > Yash
> > > > > >> > >
> > > > > >> > > On Tue, Sep 6, 2022 at 4:42 PM Jorge Esteban Quilcate Otoya
> <
> > > > > >> > > quilcate.jorge@gmail.com> wrote:
> > > > > >> > >
> > > > > >> > > > Hi Sagar and Yash,
> > > > > >> > > >
> > > > > >> > > > > the way it's defined in
> > > > > >> > > >
> https://kafka.apache.org/documentation/#connect_monitoring
> > for
> > > > > the
> > > > > >> > > metrics
> > > > > >> > > >
> > > > > >> > > > 4.1. Got it. Add it to the KIP.
> > > > > >> > > >
> > > > > >> > > > > The only thing I would argue is do we need
> > > > > >> sink-record-latency-min?
> > > > > >> > > Maybe
> > > > > >> > > > we
> > > > > >> > > > > could remove this min metric as well and make all of the
> > 3 e2e
> > > > > >> > metrics
> > > > > >> > > > > consistent
> > > > > >> > > >
> > > > > >> > > > 4.2 I see. Will remove it from the KIP.
> > > > > >> > > >
> > > > > >> > > > > Probably users can track the metrics at their end to
> > > > > >> > > > > figure that out. Do you think that makes sense?
> > > > > >> > > >
> > > > > >> > > > 4.3. Yes, agree. With these new metrics it should be
> easier
> > for
> > > > > >> users
> > > > > >> > to
> > > > > >> > > > track this.
> > > > > >> > > >
> > > > > >> > > > > I think it makes sense to not have a min metric for
> > either to
> > > > > >> remain
> > > > > >> > > > > consistent with the existing put-batch and poll-batch
> > metrics
> > > > > >> > > >
> > > > > >> > > > 5.1. Got it. Same as 4.2
> > > > > >> > > >
> > > > > >> > > > > Another naming related suggestion I had was with the
> > > > > >> > > > > "convert-time" metrics - we should probably include
> > > > > >> transformations
> > > > > >> > in
> > > > > >> > > > the
> > > > > >> > > > > name since SMTs could definitely be attributable to a
> > sizable
> > > > > >> chunk
> > > > > >> > of
> > > > > >> > > > the
> > > > > >> > > > > latency depending on the specific transformation chain.
> > > > > >> > > >
> > > > > >> > > > 5.2. Make sense. I'm proposing to add
> > > > > >> > `sink-record-convert-transform...`
> > > > > >> > > > and `source-record-transform-convert...` to represent
> > correctly
> > > > > the
> > > > > >> > order
> > > > > >> > > > of operations.
> > > > > >> > > >
> > > > > >> > > > > it seems like both source and sink tasks only record
> > metrics
> > > > at
> > > > > a
> > > > > >> > > "batch"
> > > > > >> > > > > level, not on an individual record level. I think it
> > might be
> > > > > >> > > additional
> > > > > >> > > > > overhead if we want to record these new metrics all at
> the
> > > > > record
> > > > > >> > > level?
> > > > > >> > > >
> > > > > >> > > > 5.3. I considered at the beginning to implement all
> metrics
> > at
> > > > the
> > > > > >> > batch
> > > > > >> > > > level, but given how the framework process records, I
> > fallback
> > > > to
> > > > > >> the
> > > > > >> > > > proposed approach:
> > > > > >> > > > - Sink Task:
> > > > > >> > > >   - `WorkerSinkTask#convertMessages(msgs)` already
> iterates
> > over
> > > > > >> > records,
> > > > > >> > > > so there is no additional overhead to capture record
> > latency per
> > > > > >> > record.
> > > > > >> > > >     -
> > > > > >> > > >
> > > > > >> > > >
> > > > > >> > >
> > > > > >> >
> > > > > >>
> > > > >
> > > >
> >
> https://github.com/apache/kafka/blob/9841647c4fe422532f448423c92d26e4fdcb8932/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/WorkerSinkTask.java#L490-L514
> > > > > >> > > >   - `WorkerSinkTask#convertAndTransformRecord(record)`
> > actually
> > > > > >> happens
> > > > > >> > > > individually. Measuring this operation per batch would
> > include
> > > > > >> > processing
> > > > > >> > > > that is not strictly part of "convert and transform"
> > > > > >> > > >     -
> > > > > >> > > >
> > > > > >> > > >
> > > > > >> > >
> > > > > >> >
> > > > > >>
> > > > >
> > > >
> >
> https://github.com/apache/kafka/blob/9841647c4fe422532f448423c92d26e4fdcb8932/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/WorkerSinkTask.java#L518
> > > > > >> > > > - Source Task:
> > > > > >> > > >   - `AbstractWorkerSourceTask#sendRecords` iterates over a
> > batch
> > > > > and
> > > > > >> > > > applies transforms and convert record individually as
> well:
> > > > > >> > > >     -
> > > > > >> > > >
> > > > > >> > > >
> > > > > >> > >
> > > > > >> >
> > > > > >>
> > > > >
> > > >
> >
> https://github.com/apache/kafka/blob/9841647c4fe422532f448423c92d26e4fdcb8932/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/AbstractWorkerSourceTask.java#L389-L390
> > > > > >> > > >
> > > > > >> > > > > This might require some additional changes -
> > > > > >> > > > > for instance, with the "sink-record-latency" metric, we
> > might
> > > > > only
> > > > > >> > want
> > > > > >> > > > to
> > > > > >> > > > > have a "max" metric since "avg" would require recording
> a
> > > > value
> > > > > on
> > > > > >> > the
> > > > > >> > > > > sensor for each record (whereas we can get a "max" by
> only
> > > > > >> recording
> > > > > >> > a
> > > > > >> > > > > metric value for the oldest record in each batch).
> > > > > >> > > >
> > > > > >> > > > 5.4. Recording record-latency per batch may not be as
> > useful as
> > > > > >> there
> > > > > >> > is
> > > > > >> > > no
> > > > > >> > > > guarantee that the oldest record will be representative of
> > the
> > > > > >> batch.
> > > > > >> > > >
> > > > > >> > > > On Sat, 3 Sept 2022 at 16:02, Yash Mayya <
> > yash.mayya@gmail.com>
> > > > > >> wrote:
> > > > > >> > > >
> > > > > >> > > > > Hi Jorge and Sagar,
> > > > > >> > > > >
> > > > > >> > > > > I think it makes sense to not have a min metric for
> > either to
> > > > > >> remain
> > > > > >> > > > > consistent with the existing put-batch and poll-batch
> > metrics
> > > > > (it
> > > > > >> > > doesn't
> > > > > >> > > > > seem particularly useful either anyway). Also, the new
> > > > > >> > > > > "sink-record-latency" metric name looks fine to me,
> > thanks for
> > > > > >> making
> > > > > >> > > the
> > > > > >> > > > > changes! Another naming related suggestion I had was
> with
> > the
> > > > > >> > > > > "convert-time" metrics - we should probably include
> > > > > >> transformations
> > > > > >> > in
> > > > > >> > > > the
> > > > > >> > > > > name since SMTs could definitely be attributable to a
> > sizable
> > > > > >> chunk
> > > > > >> > of
> > > > > >> > > > the
> > > > > >> > > > > latency depending on the specific transformation chain.
> > > > > >> > > > >
> > > > > >> > > > > I have one high level question with respect to
> > implementation
> > > > -
> > > > > >> > > > currently,
> > > > > >> > > > > it seems like both source and sink tasks only record
> > metrics
> > > > at
> > > > > a
> > > > > >> > > "batch"
> > > > > >> > > > > level, not on an individual record level. I think it
> > might be
> > > > > >> > > additional
> > > > > >> > > > > overhead if we want to record these new metrics all at
> the
> > > > > record
> > > > > >> > > level?
> > > > > >> > > > > Could we instead make all of these new metrics for
> > batches of
> > > > > >> records
> > > > > >> > > > > rather than individual records in order to remain
> > consistent
> > > > > with
> > > > > >> the
> > > > > >> > > > > existing task level metrics? This might require some
> > > > additional
> > > > > >> > > changes -
> > > > > >> > > > > for instance, with the "sink-record-latency" metric, we
> > might
> > > > > only
> > > > > >> > want
> > > > > >> > > > to
> > > > > >> > > > > have a "max" metric since "avg" would require recording
> a
> > > > value
> > > > > on
> > > > > >> > the
> > > > > >> > > > > sensor for each record (whereas we can get a "max" by
> only
> > > > > >> recording
> > > > > >> > a
> > > > > >> > > > > metric value for the oldest record in each batch).
> > > > > >> > > > >
> > > > > >> > > > > Thanks,
> > > > > >> > > > > Yash
> > > > > >> > > > >
> > > > > >> > > > > On Fri, Sep 2, 2022 at 3:16 PM Sagar <
> > > > sagarmeansocean@gmail.com
> > > > > >
> > > > > >> > > wrote:
> > > > > >> > > > >
> > > > > >> > > > > > Hi Jorge,
> > > > > >> > > > > >
> > > > > >> > > > > > Thanks for the changes.
> > > > > >> > > > > >
> > > > > >> > > > > > Regarding the metrics, I meant something like this:
> > > > > >> > > > > >
> > > > > >> > > > >
> > > > > >> > > >
> > > > > >> > >
> > > > > >> >
> > > > > >>
> > > > >
> > > >
> >
> kafka.connect:type=sink-task-metrics,connector="{connector}",task="{task}"
> > > > > >> > > > > >
> > > > > >> > > > > > the way it's defined in
> > > > > >> > > > > >
> > https://kafka.apache.org/documentation/#connect_monitoring
> > > > > for
> > > > > >> the
> > > > > >> > > > > > metrics.
> > > > > >> > > > > >
> > > > > >> > > > > > I see what you mean by the 3 metrics and how it can be
> > > > > >> interpreted.
> > > > > >> > > The
> > > > > >> > > > > > only thing I would argue is do we need
> > > > > sink-record-latency-min?
> > > > > >> > Maybe
> > > > > >> > > > we
> > > > > >> > > > > > could remove this min metric as well and make all of
> > the 3
> > > > e2e
> > > > > >> > > metrics
> > > > > >> > > > > > consistent(since put-batch also doesn't expose a min
> > which
> > > > > makes
> > > > > >> > > sense
> > > > > >> > > > to
> > > > > >> > > > > > me). I think this is in contrast to what Yash pointed
> > out
> > > > > above
> > > > > >> so
> > > > > >> > I
> > > > > >> > > > > would
> > > > > >> > > > > > like to hear his thoughts as well.
> > > > > >> > > > > >
> > > > > >> > > > > > The other point Yash mentioned about the slightly
> flawed
> > > > > >> definition
> > > > > >> > > of
> > > > > >> > > > > e2e
> > > > > >> > > > > > is also true in a sense. But I have a feeling that's
> > one the
> > > > > >> > records
> > > > > >> > > > are
> > > > > >> > > > > > polled by the connector tasks, it would be difficult
> to
> > > > track
> > > > > >> the
> > > > > >> > > final
> > > > > >> > > > > leg
> > > > > >> > > > > > via the framework. Probably users can track the
> metrics
> > at
> > > > > their
> > > > > >> > end
> > > > > >> > > to
> > > > > >> > > > > > figure that out. Do you think that makes sense?
> > > > > >> > > > > >
> > > > > >> > > > > > Thanks!
> > > > > >> > > > > > Sagar.
> > > > > >> > > > > >
> > > > > >> > > > > >
> > > > > >> > > > > >
> > > > > >> > > > > >
> > > > > >> > > > > > On Thu, Sep 1, 2022 at 11:40 PM Jorge Esteban Quilcate
> > > > Otoya <
> > > > > >> > > > > > quilcate.jorge@gmail.com> wrote:
> > > > > >> > > > > >
> > > > > >> > > > > > > Hi Sagar and Yash,
> > > > > >> > > > > > >
> > > > > >> > > > > > > Thanks for your feedback!
> > > > > >> > > > > > >
> > > > > >> > > > > > > > 1) I am assuming the new metrics would be task
> level
> > > > > metric.
> > > > > >> > > > > > >
> > > > > >> > > > > > > 1.1 Yes, it will be a task level metric, implemented
> > on
> > > > the
> > > > > >> > > > > > > Worker[Source/Sink]Task.
> > > > > >> > > > > > >
> > > > > >> > > > > > > > Could you specify the way it's done for other
> > > > sink/source
> > > > > >> > > > connector?
> > > > > >> > > > > > >
> > > > > >> > > > > > > 1.2. Not sure what do you mean by this. Could you
> > > > elaborate
> > > > > a
> > > > > >> bit
> > > > > >> > > > more?
> > > > > >> > > > > > >
> > > > > >> > > > > > > > 2. I am slightly confused about the e2e latency
> > > > metric...
> > > > > >> > > > > > >
> > > > > >> > > > > > > 2.1. Yes, I see. I was trying to bring a similar
> > concept
> > > > as
> > > > > in
> > > > > >> > > > Streams
> > > > > >> > > > > > with
> > > > > >> > > > > > > KIP-613, though the e2e concept may not be
> > translatable.
> > > > > >> > > > > > > We could keep it as `sink-record-latency` to avoid
> > > > > conflating
> > > > > >> > > > > concepts. A
> > > > > >> > > > > > > similar metric naming was proposed in KIP-489 but at
> > the
> > > > > >> consumer
> > > > > >> > > > > level —
> > > > > >> > > > > > > though it seems dormant for a couple of years.
> > > > > >> > > > > > >
> > > > > >> > > > > > > > However, the put-batch time measures the
> > > > > >> > > > > > > > time to put a batch of records to external sink.
> > So, I
> > > > > would
> > > > > >> > > assume
> > > > > >> > > > > > the 2
> > > > > >> > > > > > > > can't be added as is to compute the e2e latency.
> > Maybe I
> > > > > am
> > > > > >> > > missing
> > > > > >> > > > > > > > something here. Could you plz clarify this.
> > > > > >> > > > > > >
> > > > > >> > > > > > > 2.2. Yes, agree. Not necessarily added, but with
> the 3
> > > > > >> latencies
> > > > > >> > > > (poll,
> > > > > >> > > > > > > convert, putBatch) will be clearer where the
> > bottleneck
> > > > may
> > > > > >> be,
> > > > > >> > and
> > > > > >> > > > > > > represent the internal processing.
> > > > > >> > > > > > >
> > > > > >> > > > > > > > however, as per the KIP it looks like it will be
> > > > > >> > > > > > > > the latency between when the record was written to
> > Kafka
> > > > > and
> > > > > >> > when
> > > > > >> > > > the
> > > > > >> > > > > > > > record is returned by a sink task's consumer's
> poll?
> > > > > >> > > > > > >
> > > > > >> > > > > > > 3.1. Agree. 2.1. could help to clarify this.
> > > > > >> > > > > > >
> > > > > >> > > > > > > > One more thing - I was wondering
> > > > > >> > > > > > > > if there's a particular reason for having a min
> > metric
> > > > for
> > > > > >> e2e
> > > > > >> > > > > latency
> > > > > >> > > > > > > but
> > > > > >> > > > > > > > not for convert time?
> > > > > >> > > > > > >
> > > > > >> > > > > > > 3.2. Was following KIP-613 for e2e which seems
> useful
> > to
> > > > > >> compare
> > > > > >> > > with
> > > > > >> > > > > > Max a
> > > > > >> > > > > > > get an idea of the window of results, though current
> > > > > >> latencies in
> > > > > >> > > > > > Connector
> > > > > >> > > > > > > do not include Min, and that's why I haven't added
> it
> > for
> > > > > >> convert
> > > > > >> > > > > > latency.
> > > > > >> > > > > > > Do you think it make sense to extend latency metrics
> > with
> > > > > Min?
> > > > > >> > > > > > >
> > > > > >> > > > > > > KIP is updated to clarify some of these changes.
> > > > > >> > > > > > >
> > > > > >> > > > > > > Many thanks,
> > > > > >> > > > > > > Jorge.
> > > > > >> > > > > > >
> > > > > >> > > > > > > On Thu, 1 Sept 2022 at 18:11, Yash Mayya <
> > > > > >> yash.mayya@gmail.com>
> > > > > >> > > > wrote:
> > > > > >> > > > > > >
> > > > > >> > > > > > > > Hi Jorge,
> > > > > >> > > > > > > >
> > > > > >> > > > > > > > Thanks for the KIP! I have the same confusion with
> > the
> > > > > >> > > e2e-latency
> > > > > >> > > > > > > metrics
> > > > > >> > > > > > > > as Sagar above. "e2e" would seem to indicate the
> > latency
> > > > > >> > between
> > > > > >> > > > when
> > > > > >> > > > > > the
> > > > > >> > > > > > > > record was written to Kafka and when the record
> was
> > > > > written
> > > > > >> to
> > > > > >> > > the
> > > > > >> > > > > sink
> > > > > >> > > > > > > > system by the connector - however, as per the KIP
> it
> > > > looks
> > > > > >> like
> > > > > >> > > it
> > > > > >> > > > > will
> > > > > >> > > > > > > be
> > > > > >> > > > > > > > the latency between when the record was written to
> > Kafka
> > > > > and
> > > > > >> > when
> > > > > >> > > > the
> > > > > >> > > > > > > > record is returned by a sink task's consumer's
> > poll? I
> > > > > think
> > > > > >> > that
> > > > > >> > > > > > metric
> > > > > >> > > > > > > > will be a little confusing to interpret. One more
> > thing
> > > > -
> > > > > I
> > > > > >> was
> > > > > >> > > > > > wondering
> > > > > >> > > > > > > > if there's a particular reason for having a min
> > metric
> > > > for
> > > > > >> e2e
> > > > > >> > > > > latency
> > > > > >> > > > > > > but
> > > > > >> > > > > > > > not for convert time?
> > > > > >> > > > > > > >
> > > > > >> > > > > > > > Thanks,
> > > > > >> > > > > > > > Yash
> > > > > >> > > > > > > >
> > > > > >> > > > > > > > On Thu, Sep 1, 2022 at 8:59 PM Sagar <
> > > > > >> > sagarmeansocean@gmail.com>
> > > > > >> > > > > > wrote:
> > > > > >> > > > > > > >
> > > > > >> > > > > > > > > Hi Jorge,
> > > > > >> > > > > > > > >
> > > > > >> > > > > > > > > Thanks for the KIP. It looks like a very good
> > > > addition.
> > > > > I
> > > > > >> > > skimmed
> > > > > >> > > > > > > through
> > > > > >> > > > > > > > > once and had a couple of questions =>
> > > > > >> > > > > > > > >
> > > > > >> > > > > > > > > 1) I am assuming the new metrics would be task
> > level
> > > > > >> metric.
> > > > > >> > > > Could
> > > > > >> > > > > > you
> > > > > >> > > > > > > > > specify the way it's done for other sink/source
> > > > > connector?
> > > > > >> > > > > > > > > 2) I am slightly confused about the e2e latency
> > > > metric.
> > > > > >> Let's
> > > > > >> > > > > > consider
> > > > > >> > > > > > > > the
> > > > > >> > > > > > > > > sink connector metric. If I look at the way it's
> > > > > supposed
> > > > > >> to
> > > > > >> > be
> > > > > >> > > > > > > > calculated,
> > > > > >> > > > > > > > > i.e the difference between the record timestamp
> > and
> > > > the
> > > > > >> wall
> > > > > >> > > > clock
> > > > > >> > > > > > > time,
> > > > > >> > > > > > > > it
> > > > > >> > > > > > > > > looks like a per record metric. However, the
> > put-batch
> > > > > >> time
> > > > > >> > > > > measures
> > > > > >> > > > > > > the
> > > > > >> > > > > > > > > time to put a batch of records to external sink.
> > So, I
> > > > > >> would
> > > > > >> > > > assume
> > > > > >> > > > > > > the 2
> > > > > >> > > > > > > > > can't be added as is to compute the e2e latency.
> > > > Maybe I
> > > > > >> am
> > > > > >> > > > missing
> > > > > >> > > > > > > > > something here. Could you plz clarify this.
> > > > > >> > > > > > > > >
> > > > > >> > > > > > > > > Thanks!
> > > > > >> > > > > > > > > Sagar.
> > > > > >> > > > > > > > >
> > > > > >> > > > > > > > > On Tue, Aug 30, 2022 at 8:43 PM Jorge Esteban
> > Quilcate
> > > > > >> Otoya
> > > > > >> > <
> > > > > >> > > > > > > > > quilcate.jorge@gmail.com> wrote:
> > > > > >> > > > > > > > >
> > > > > >> > > > > > > > > > Hi all,
> > > > > >> > > > > > > > > >
> > > > > >> > > > > > > > > > I'd like to start a discussion thread on
> > KIP-864:
> > > > Add
> > > > > >> > > > End-To-End
> > > > > >> > > > > > > > Latency
> > > > > >> > > > > > > > > > Metrics to Connectors.
> > > > > >> > > > > > > > > > This KIP aims to improve the metrics available
> > on
> > > > > Source
> > > > > >> > and
> > > > > >> > > > Sink
> > > > > >> > > > > > > > > > Connectors to measure end-to-end latency,
> > including
> > > > > >> source
> > > > > >> > > and
> > > > > >> > > > > sink
> > > > > >> > > > > > > > > record
> > > > > >> > > > > > > > > > conversion time, and sink record e2e latency
> > > > (similar
> > > > > to
> > > > > >> > > > KIP-613
> > > > > >> > > > > > for
> > > > > >> > > > > > > > > > Streams).
> > > > > >> > > > > > > > > >
> > > > > >> > > > > > > > > > The KIP is here:
> > > > > >> > > > > > > > > >
> > > > > >> > > > > > > > > >
> > > > > >> > > > > > > > >
> > > > > >> > > > > > > >
> > > > > >> > > > > > >
> > > > > >> > > > > >
> > > > > >> > > > >
> > > > > >> > > >
> > > > > >> > >
> > > > > >> >
> > > > > >>
> > > > >
> > > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-864%3A+Add+End-To-End+Latency+Metrics+to+Connectors
> > > > > >> > > > > > > > > >
> > > > > >> > > > > > > > > > Please take a look and let me know what you
> > think.
> > > > > >> > > > > > > > > >
> > > > > >> > > > > > > > > > Cheers,
> > > > > >> > > > > > > > > > Jorge.
> > > > > >> > > > > > > > > >
> > > > > >> > > > > > > > >
> > > > > >> > > > > > > >
> > > > > >> > > > > > >
> > > > > >> > > > > >
> > > > > >> > > > >
> > > > > >> > > >
> > > > > >> > >
> > > > > >> >
> > > > > >>
> > > > > >
> > > > >
> > > >
> >
>

Re: [DISCUSS] KIP-864: Add End-To-End Latency Metrics to Connectors

Posted by Jorge Esteban Quilcate Otoya <qu...@gmail.com>.
Thanks Mickael!


On Wed, 9 Nov 2022 at 15:54, Mickael Maison <mi...@gmail.com>
wrote:

> Hi Jorge,
>
> Thanks for the KIP, it is a nice improvement.
>
> 1) The per transformation metrics still have a question mark next to
> them in the KIP. Do you want to include them? If so, we'll want to tag
> them; we should be able to include the aliases in TransformationChain
> and use them.
>

Yes, I have added the changes to TransformationChain that will be needed to
add these metrics.
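
To illustrate what tagging per-transformation sensors enables, here is a
minimal, self-contained sketch (all names here are hypothetical and only stand
in for the actual TransformationChain and Sensor code): each SMT's apply() is
timed individually and the elapsed time is recorded against that
transformation's alias.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.UnaryOperator;

// Hypothetical sketch: time each transformation in a chain and record the
// elapsed nanos against the transformation's alias, so per-transformation
// max/avg latency metrics can be tagged by alias.
public class TransformChainTimingSketch {

    // alias -> {count, totalNanos, maxNanos}; stands in for a Kafka Sensor
    static final Map<String, long[]> STATS = new LinkedHashMap<>();

    static void recordLatency(String alias, long elapsedNanos) {
        long[] s = STATS.computeIfAbsent(alias, a -> new long[3]);
        s[0]++;                               // count (avg = total / count)
        s[1] += elapsedNanos;                 // total
        s[2] = Math.max(s[2], elapsedNanos);  // max
    }

    static String applyChain(Map<String, UnaryOperator<String>> chain, String record) {
        for (Map.Entry<String, UnaryOperator<String>> e : chain.entrySet()) {
            long start = System.nanoTime();
            record = e.getValue().apply(record);          // the SMT's apply()
            recordLatency(e.getKey(), System.nanoTime() - start);
        }
        return record;
    }

    public static void main(String[] args) {
        Map<String, UnaryOperator<String>> chain = new LinkedHashMap<>();
        chain.put("mask", r -> r.replace("secret", "***"));
        chain.put("upper", String::toUpperCase);
        System.out.println(applyChain(chain, "a secret value")); // A *** VALUE
    }
}
```

In the real implementation the per-alias stats would of course be Kafka
Sensors with Max/Avg stats rather than a plain map; this only shows where the
alias tag would come from.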


>
> 2) I see no references to predicates. If we don't want to measure
> their latency, can we say it explicitly?
>

Good question, I hadn't considered these. However, since predicates are
materialized as PredicatedTransformation, they should be covered by these
changes. I am adding a note about this to the KIP.
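
A rough, self-contained sketch of why that should hold (names here are
illustrative, not Connect's actual API): the wrapper's apply() evaluates the
predicate before delegating, so any clock around the wrapper covers both.

```java
import java.util.function.Predicate;
import java.util.function.UnaryOperator;

// Illustrative sketch: a predicated SMT is materialized as a wrapper whose
// apply() evaluates the predicate first, so timing the wrapper's apply()
// already includes predicate evaluation; no separate predicate metric needed.
public class PredicatedTransformSketch {

    static UnaryOperator<String> predicated(Predicate<String> predicate,
                                            UnaryOperator<String> delegate) {
        // mirrors the PredicatedTransformation pattern: delegate only on match
        return r -> predicate.test(r) ? delegate.apply(r) : r;
    }

    public static void main(String[] args) {
        UnaryOperator<String> smt =
                predicated(r -> r.startsWith("a"), String::toUpperCase);
        System.out.println(smt.apply("abc")); // ABC
        System.out.println(smt.apply("xyz")); // xyz (predicate false, passthrough)
    }
}
```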


>
> 3) Should we have sink-record-batch-latency-avg-ms? All other metrics
> have both the maximum and average values.
>
>
Good question. I will remove it and change the record latency metric from
DEBUG to INFO, as it already covers the maximum.
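
For reference, a small self-contained sketch (hypothetical names, assuming the
batch metric is computed as wallclock minus the oldest record timestamp in the
batch) of how a single per-batch value still captures the maximum record
latency:

```java
import java.util.List;

// Hypothetical sketch: record one value per batch instead of one per record.
// Using the oldest record timestamp yields the batch's maximum latency,
// since the oldest record maximizes (wallclock - timestamp).
public class BatchLatencySketch {

    static long maxBatchLatencyMs(List<Long> recordTimestampsMs, long nowMs) {
        long oldest = Long.MAX_VALUE;
        for (long ts : recordTimestampsMs) {
            oldest = Math.min(oldest, ts);
        }
        return recordTimestampsMs.isEmpty() ? 0L : nowMs - oldest;
    }

    public static void main(String[] args) {
        // batch of three records; the 8200ms one is the oldest
        System.out.println(maxBatchLatencyMs(List.of(9_000L, 9_500L, 8_200L), 10_000L)); // 1800
    }
}
```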

Hope it's clearer now; let me know if there is any additional feedback.
Thanks!



> Thanks,
> Mickael
>
> On Thu, Oct 20, 2022 at 9:58 PM Jorge Esteban Quilcate Otoya
> <qu...@gmail.com> wrote:
> >
> > Thanks, Chris! Great feedback! Please, find my comments below:
> >
> > On Thu, 13 Oct 2022 at 18:52, Chris Egerton <ch...@aiven.io.invalid>
> wrote:
> >
> > > Hi Jorge,
> > >
> > > Thanks for the KIP. I agree with the overall direction and think this
> would
> > > be a nice improvement to Kafka Connect. Here are my initial thoughts
> on the
> > > details:
> > >
> > > 1. The motivation section outlines the gaps in Kafka Connect's task
> metrics
> > > nicely. I think it'd be useful to include more concrete details on why
> > > these gaps need to be filled in, and in which cases additional metrics
> > > would be helpful. One goal could be to provide enhanced monitoring of
> > > production deployments that allows for cluster administrators to set up
> > > automatic alerts for latency spikes and, if triggered, quickly
> identify the
> > > root cause of those alerts, reducing the time to remediation. Another
> goal
> > > could be to provide more insight to developers or cluster
> administrators
> > > who want to do performance testing on connectors in non-production
> > > environments. It may help guide our decision making process to have a
> > > clearer picture of the goals we're trying to achieve.
> > >
> >
> > Agree. The Motivation section has been updated.
> > Thanks for the examples, I see both of them being covered by the KIP.
> > I see how these could give us a good distinction on whether to position
> > some metrics at INFO or DEBUG level.
> >
> >
> > > 2. If we're trying to address the alert-and-diagnose use case, it'd be
> > > useful to have as much information as possible at INFO level, rather
> than
> > > forcing cluster administrators to possibly reconfigure a connector to
> emit
> > > DEBUG or TRACE level metrics in order to diagnose a potential
> > > production-impacting performance bottleneck. I can see the rationale
> for
> > > emitting per-record metrics that track an average value at DEBUG
> level, but
> > > for per-record metrics that track a maximum value, is there any reason
> not
> > > to provide this information at INFO level?
> > >
> >
> > Agree. Though, since Max and Avg metrics are part of the same sensor —
> > where the metric recording level is defined — both metrics get the same level.
> >
> >
> > > 3. I'm also curious about the performance testing suggested by Yash to
> > > gauge the potential impact of this change. Have you been able to do any
> > > testing with your draft implementation yet?
> > >
> >
> > No, not so far.
> > I think it would be valuable to discuss the scope of this testing and
> maybe
> > tackle it
> > in a separate issue as Sensors and Metrics are used all over the place.
> > My initial understanding is that these tests should be placed in the
> > jmh-benchmarks[1].
> > Then, we could target testing Sensors and Metrics, and validate how much
> > overhead
> > is added by having only Max vs Max,Avg(,Min), etc.
> > In the other hand, we could extend this to Transformers or other Connect
> > layers.
> >
> > Here are some pointers to the Sensors and Metrics implementations that
> > could be considered:
> > Path to metric recording:
> > -
> >
> https://github.com/apache/kafka/blob/5cab11cf525f6c06fcf9eb43f7f95ef33fe1cdbb/clients/src/main/java/org/apache/kafka/common/metrics/Sensor.java#L195-L199
> > -
> >
> https://github.com/apache/kafka/blob/5cab11cf525f6c06fcf9eb43f7f95ef33fe1cdbb/clients/src/main/java/org/apache/kafka/common/metrics/Sensor.java#L230-L244
> >
> > ```
> > // increment all the stats
> > for (StatAndConfig statAndConfig : this.stats) {
> >    statAndConfig.stat.record(statAndConfig.config(), value, timeMs);
> > }
> > ```
> >
> > SampledStats:
> > - Avg:
> >
> https://github.com/apache/kafka/blob/068ab9cefae301f3187ea885d645c425955e77d2/clients/src/main/java/org/apache/kafka/common/metrics/stats/Avg.java
> > - Max:
> >
> https://github.com/apache/kafka/blob/068ab9cefae301f3187ea885d645c425955e77d2/clients/src/main/java/org/apache/kafka/common/metrics/stats/Max.java
> > - Min:
> >
> https://github.com/apache/kafka/blob/068ab9cefae301f3187ea885d645c425955e77d2/clients/src/main/java/org/apache/kafka/common/metrics/stats/Min.java
> >
> > `stat#record()` are implemented by `update` method in SampledStat:
> >
> > ```Max.java
> >     @Override
> >     protected void update(Sample sample, MetricConfig config, double
> value,
> > long now) {
> >         sample.value = Math.max(sample.value, value);
> >     }
> > ```
> >
> > ```Avg.java
> >     @Override
> >     protected void update(Sample sample, MetricConfig config, double
> value,
> > long now) {
> >         sample.value += value;
> >     }
> > ```
> >
> > As far as I understand, most of the work of the stats happens on the
> > `combine` method that is not part of the connector execution but called
> > when metrics are queried.
> >
> > I wonder whether we should consider Avg and Max for all metrics proposed
> as
> > the impact on the execution path seems minimal, and even see if Min is
> also
> > valuable, and use DEBUG only for more granular metrics.
> >
> > [1] https://github.com/apache/kafka/tree/trunk/jmh-benchmarks
> >
> >
> > > 4. Just to make sure I understand correctly--does "time when it has
> been
> > > received by the Sink task" refer to the wallclock time directly after a
> > > call to SinkTask::put has been completed (as opposed to directly before
> > > that call is made, or something else entirely)?
> > >
> >
> > It is currently measured when the record has been received by the sink task,
> > right after consumer poll and before conversions.
> > Would it be valuable to have it after put-sink-records?
> >
> >
> > > 5. If the goal is to identify performance bottlenecks (either in
> production
> > > or pre-production environments), would it make sense to introduce
> metrics
> > > for each individual converter (i.e., key/value/header) and
> transformation?
> > > It's definitely an improvement to be able to identify the total time
> for
> > > conversion and transformation, but then the immediate follow-up
> question if
> > > a bottleneck is found in that phase is "which converter/transformation
> is
> > > responsible?" It'd be nice if we could provide a way to quickly answer
> that
> > > question.
> > >
> >
> > This is a great idea. I'd like to consider this as well, though maybe
> these
> > more granular
> > metrics would be better kept at DEBUG level.
> >
> >
> > > 6. Any thoughts about offering latency metrics for source tasks between
> > > receipt of the record from the task and delivery of the record to Kafka
> > > (which would be tracked by producer callback)? We could also use the
> record
> > > timestamp either instead of or in addition to receipt time if the task
> > > provides a timestamp with its records.
> > >
> >
> > With source transform and convert metrics we get part of that latency.
> > Looking at the Producer metrics, `request-latency` (though a very generic
> > metric)
> > sort of answers the time between send request and ack — if my
> understanding
> > is correct.
> > Would these be enough or you're thinking about another approach?
> > maybe a custom metric to cover the producer side?
> >
> >
> > > 7. We may end up introducing a way for sink tasks to record per-record
> > > delivery to the sink system (see KIP-767 [1]). I'd like it if we could
> keep
> > > the names of our metrics very precise in order to avoid confusing users
> > > (who may think that we're providing metrics on actual delivery to the
> sink
> > > system, which may not be the case if the connector performs
> asynchronous
> > > writes), and in order to leave room for a metrics on true delivery
> time by
> > > sink tasks. It'd also be nice if we could remain consistent with
> existing
> > > metrics such as "put-batch-avg-time-ms". With that in mind, what do you
> > > think about renaming these metrics:
> > > - "sink-record-batch-latency-max-ms" to "put-batch-avg-latency-ms"
> > > - "sink-record-latency-max-ms" to "put-sink-record-latency-max-ms"
> > > - "sink-record-latency-avg-ms" to "put-sink-record-latency-avg-ms"
> > > - "sink-record-convert-transform-time-max-ms" to
> > > "convert-transform-sink-record-time-max-ms"
> > > - "sink-record-convert-transform-time-avg-ms" to
> > > "convert-transform-sink-record-time-avg-ms"
> > > - "source-record-transform-convert-time-max-ms" to
> > > "transform-convert-source-record-time-max-ms"
> > > - "source-record-transform-convert-time-avg-ms" to
> > > "transform-convert-source-record-time-avg-ms"
> > >
> >
> > Makes sense, thanks! I have updated the list of metrics, grouped them by
> > sensor, and applied these suggestions.
> > The only ones that I want to review are: sink-record-* to put-batch-*
> > (first 3). Not sure if put-batch/put-sink-record describes the purpose of
> > the metric — nor does `sink-record-latency`, to be honest.
> > My initial thought was to have something like Kafka Streams e2e-latency.
> > Based on 4. and 6. questions, an idea could be to add:
> > - source-batch-e2e-latency-before-send: measure wallclock - source record
> > timestamp after source connector poll.
> > - source-batch-e2e-latency-after-send: measure wallclock - record
> timestamp
> > on producer send callback
> > - sink-batch-e2e-latency-before-put: measure time wallclock - record
> > timestamp after consumer poll
> > - sink-batch-e2e-latency-after-put: measure time wallclock - record
> > timestamp after sink connector put.
> >
> >
> > > Thanks again for the KIP! Looking forward to your thoughts.
> > >
> > > Cheers,
> > >
> > > Chris
> > >
> > > [1] -
> > >
> > >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-767%3A+Connect+Latency+Metrics
> > >
> > > On Thu, Sep 15, 2022 at 1:32 PM Jorge Esteban Quilcate Otoya <
> > > quilcate.jorge@gmail.com> wrote:
> > >
> > > > Hi everyone,
> > > >
> > > > I've made a slight addition to the KIP based on Yash feedback:
> > > >
> > > > - A new metric is added at INFO level to record the max latency from
> the
> > > > batch timestamp, by keeping the oldest record timestamp per batch.
> > > > - A draft implementation is linked.
> > > >
> > > > Looking forward to your feedback.
> > > > Also, a kindly reminder that the vote thread is open.
> > > >
> > > > Thanks!
> > > > Jorge.
> > > >
> > > > On Thu, 8 Sept 2022 at 14:25, Jorge Esteban Quilcate Otoya <
> > > > quilcate.jorge@gmail.com> wrote:
> > > >
> > > > > Great. I have updated the KIP to reflect this.
> > > > >
> > > > > Cheers,
> > > > > Jorge.
> > > > >
> > > > > On Thu, 8 Sept 2022 at 12:26, Yash Mayya <ya...@gmail.com>
> wrote:
> > > > >
> > > > >> Thanks, I think it makes sense to define these metrics at a DEBUG
> > > > >> recording
> > > > >> level.
> > > > >>
> > > > >> On Thu, Sep 8, 2022 at 2:51 PM Jorge Esteban Quilcate Otoya <
> > > > >> quilcate.jorge@gmail.com> wrote:
> > > > >>
> > > > >> > On Thu, 8 Sept 2022 at 05:55, Yash Mayya <ya...@gmail.com>
> > > > wrote:
> > > > >> >
> > > > >> > > Hi Jorge,
> > > > >> > >
> > > > >> > > Thanks for the changes. With regard to having per batch vs per
> > > > record
> > > > >> > > metrics, the additional overhead I was referring to wasn't
> about
> > > > >> whether
> > > > >> > or
> > > > >> > > not we would need to iterate over all the records in a batch.
> I
> > > was
> > > > >> > > referring to the potential additional overhead caused by the
> > > higher
> > > > >> > volume
> > > > >> > > of calls to Sensor::record on the sensors for the new metrics
> (as
> > > > >> > compared
> > > > >> > > to the existing batch only metrics), especially for high
> > > throughput
> > > > >> > > connectors where batch sizes could be large. I guess we may
> want
> > > to
> > > > do
> > > > >> > some
> > > > >> > > sort of performance testing and get concrete numbers to verify
> > > > whether
> > > > >> > this
> > > > >> > > is a valid concern or not?
> > > > >> > >
> > > > >> >
> > > > >> > 6.1. Got it, thanks for clarifying. I guess there could be a
> > > benchmark
> > > > >> test
> > > > >> > of the `Sensor::record` to get an idea of the performance
> impact.
> > > > >> > Regardless, the fact that these are single-record metrics
> compared
> > > to
> > > > >> > existing batch-only could be explicitly defined by setting these
> > > > >> metrics at
> > > > >> > a DEBUG or TRACE metric recording level, leaving the existing at
> > > INFO
> > > > >> > level.
> > > > >> > wdyt?
> > > > >> >
> > > > >> >
> > > > >> > >
> > > > >> > > Thanks,
> > > > >> > > Yash
> > > > >> > >
> > > > >> > > On Tue, Sep 6, 2022 at 4:42 PM Jorge Esteban Quilcate Otoya <
> > > > >> > > quilcate.jorge@gmail.com> wrote:
> > > > >> > >
> > > > >> > > > Hi Sagar and Yash,
> > > > >> > > >
> > > > >> > > > > the way it's defined in
> > > > >> > > > https://kafka.apache.org/documentation/#connect_monitoring
> for
> > > > the
> > > > >> > > metrics
> > > > >> > > >
> > > > >> > > > 4.1. Got it. Add it to the KIP.
> > > > >> > > >
> > > > >> > > > > The only thing I would argue is do we need
> > > > >> sink-record-latency-min?
> > > > >> > > Maybe
> > > > >> > > > we
> > > > >> > > > > could remove this min metric as well and make all of the
> 3 e2e
> > > > >> > metrics
> > > > >> > > > > consistent
> > > > >> > > >
> > > > >> > > > 4.2 I see. Will remove it from the KIP.
> > > > >> > > >
> > > > >> > > > > Probably users can track the metrics at their end to
> > > > >> > > > > figure that out. Do you think that makes sense?
> > > > >> > > >
> > > > >> > > > 4.3. Yes, agree. With these new metrics it should be easier
> for
> > > > >> users
> > > > >> > to
> > > > >> > > > track this.
> > > > >> > > >
> > > > >> > > > > I think it makes sense to not have a min metric for
> either to
> > > > >> remain
> > > > >> > > > > consistent with the existing put-batch and poll-batch
> metrics
> > > > >> > > >
> > > > >> > > > 5.1. Got it. Same as 4.2
> > > > >> > > >
> > > > >> > > > > Another naming related suggestion I had was with the
> > > > >> > > > > "convert-time" metrics - we should probably include
> > > > >> transformations
> > > > >> > in
> > > > >> > > > the
> > > > >> > > > > name since SMTs could definitely be attributable to a
> sizable
> > > > >> chunk
> > > > >> > of
> > > > >> > > > the
> > > > >> > > > > latency depending on the specific transformation chain.
> > > > >> > > >
> > > > >> > > > 5.2. Make sense. I'm proposing to add
> > > > >> > `sink-record-convert-transform...`
> > > > >> > > > and `source-record-transform-convert...` to represent
> correctly
> > > > the
> > > > >> > order
> > > > >> > > > of operations.
> > > > >> > > >
> > > > >> > > > > it seems like both source and sink tasks only record
> metrics
> > > at
> > > > a
> > > > >> > > "batch"
> > > > >> > > > > level, not on an individual record level. I think it
> might be
> > > > >> > > additional
> > > > >> > > > > overhead if we want to record these new metrics all at the
> > > > record
> > > > >> > > level?
> > > > >> > > >
> > > > >> > > > 5.3. I considered at the beginning to implement all metrics
> at
> > > the
> > > > >> > batch
> > > > >> > > > level, but given how the framework process records, I
> fallback
> > > to
> > > > >> the
> > > > >> > > > proposed approach:
> > > > >> > > > - Sink Task:
> > > > >> > > >   - `WorkerSinkTask#convertMessages(msgs)` already iterates
> over
> > > > >> > records,
> > > > >> > > > so there is no additional overhead to capture record
> latency per
> > > > >> > record.
> > > > >> > > >     -
> > > > >> > > >
> > > > >> > > >
> > > > >> > >
> > > > >> >
> > > > >>
> > > >
> > >
> https://github.com/apache/kafka/blob/9841647c4fe422532f448423c92d26e4fdcb8932/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/WorkerSinkTask.java#L490-L514
> > > > >> > > >   - `WorkerSinkTask#convertAndTransformRecord(record)`
> actually
> > > > >> happens
> > > > >> > > > individually. Measuring this operation per batch would
> include
> > > > >> > processing
> > > > >> > > > that is not strictly part of "convert and transform"
> > > > >> > > >     -
> > > > >> > > >
> > > > >> > > >
> > > > >> > >
> > > > >> >
> > > > >>
> > > >
> > >
> https://github.com/apache/kafka/blob/9841647c4fe422532f448423c92d26e4fdcb8932/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/WorkerSinkTask.java#L518
> > > > >> > > > - Source Task:
> > > > >> > > >   - `AbstractWorkerSourceTask#sendRecords` iterates over a
> batch
> > > > and
> > > > >> > > > applies transforms and convert record individually as well:
> > > > >> > > >     -
> > > > >> > > >
> > > > >> > > >
> > > > >> > >
> > > > >> >
> > > > >>
> > > >
> > >
> https://github.com/apache/kafka/blob/9841647c4fe422532f448423c92d26e4fdcb8932/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/AbstractWorkerSourceTask.java#L389-L390
> > > > >> > > >
> > > > >> > > > > This might require some additional changes -
> > > > >> > > > > for instance, with the "sink-record-latency" metric, we
> might
> > > > only
> > > > >> > want
> > > > >> > > > to
> > > > >> > > > > have a "max" metric since "avg" would require recording a
> > > value
> > > > on
> > > > >> > the
> > > > >> > > > > sensor for each record (whereas we can get a "max" by only
> > > > >> recording
> > > > >> > a
> > > > >> > > > > metric value for the oldest record in each batch).
> > > > >> > > >
> > > > >> > > > 5.4. Recording record-latency per batch may not be as
> useful as
> > > > >> there
> > > > >> > is
> > > > >> > > no
> > > > >> > > > guarantee that the oldest record will be representative of
> the
> > > > >> batch.
> > > > >> > > >
> > > > >> > > > On Sat, 3 Sept 2022 at 16:02, Yash Mayya <
> yash.mayya@gmail.com>
> > > > >> wrote:
> > > > >> > > >
> > > > >> > > > > Hi Jorge and Sagar,
> > > > >> > > > >
> > > > >> > > > > I think it makes sense to not have a min metric for
> either to
> > > > >> remain
> > > > >> > > > > consistent with the existing put-batch and poll-batch
> metrics
> > > > (it
> > > > >> > > doesn't
> > > > >> > > > > seem particularly useful either anyway). Also, the new
> > > > >> > > > > "sink-record-latency" metric name looks fine to me,
> thanks for
> > > > >> making
> > > > >> > > the
> > > > >> > > > > changes! Another naming related suggestion I had was with
> the
> > > > >> > > > > "convert-time" metrics - we should probably include
> > > > >> transformations
> > > > >> > in
> > > > >> > > > the
> > > > >> > > > > name since SMTs could definitely be attributable to a
> sizable
> > > > >> chunk
> > > > >> > of
> > > > >> > > > the
> > > > >> > > > > latency depending on the specific transformation chain.
> > > > >> > > > >
> > > > >> > > > > I have one high level question with respect to
> implementation
> > > -
> > > > >> > > > currently,
> > > > >> > > > > it seems like both source and sink tasks only record
> metrics
> > > at
> > > > a
> > > > >> > > "batch"
> > > > >> > > > > level, not on an individual record level. I think it
> might be
> > > > >> > > additional
> > > > >> > > > > overhead if we want to record these new metrics all at the
> > > > record
> > > > >> > > level?
> > > > >> > > > > Could we instead make all of these new metrics for
> batches of
> > > > >> records
> > > > >> > > > > rather than individual records in order to remain
> consistent
> > > > with
> > > > >> the
> > > > >> > > > > existing task level metrics? This might require some
> > > additional
> > > > >> > > changes -
> > > > >> > > > > for instance, with the "sink-record-latency" metric, we
> might
> > > > only
> > > > >> > want
> > > > >> > > > to
> > > > >> > > > > have a "max" metric since "avg" would require recording a
> > > value
> > > > on
> > > > >> > the
> > > > >> > > > > sensor for each record (whereas we can get a "max" by only
> > > > >> recording
> > > > >> > a
> > > > >> > > > > metric value for the oldest record in each batch).
> > > > >> > > > >
> > > > >> > > > > Thanks,
> > > > >> > > > > Yash
> > > > >> > > > >
> > > > >> > > > > On Fri, Sep 2, 2022 at 3:16 PM Sagar <
> > > sagarmeansocean@gmail.com
> > > > >
> > > > >> > > wrote:
> > > > >> > > > >
> > > > >> > > > > > Hi Jorge,
> > > > >> > > > > >
> > > > >> > > > > > Thanks for the changes.
> > > > >> > > > > >
> > > > >> > > > > > Regarding the metrics, I meant something like this:
> > > > >> > > > > >
> > > > >> > > > >
> > > > >> > > >
> > > > >> > >
> > > > >> >
> > > > >>
> > > >
> > >
> kafka.connect:type=sink-task-metrics,connector="{connector}",task="{task}"
> > > > >> > > > > >
> > > > >> > > > > > the way it's defined in
> > > > >> > > > > >
> https://kafka.apache.org/documentation/#connect_monitoring
> > > > for
> > > > >> the
> > > > >> > > > > > metrics.
> > > > >> > > > > >
> > > > >> > > > > > I see what you mean by the 3 metrics and how it can be
> > > > >> interpreted.
> > > > >> > > The
> > > > >> > > > > > only thing I would argue is do we need
> > > > sink-record-latency-min?
> > > > >> > Maybe
> > > > >> > > > we
> > > > >> > > > > > could remove this min metric as well and make all of
> the 3
> > > e2e
> > > > >> > > metrics
> > > > >> > > > > > consistent(since put-batch also doesn't expose a min
> which
> > > > makes
> > > > >> > > sense
> > > > >> > > > to
> > > > >> > > > > > me). I think this is in contrast to what Yash pointed
> out
> > > > above
> > > > >> so
> > > > >> > I
> > > > >> > > > > would
> > > > >> > > > > > like to hear his thoughts as well.
> > > > >> > > > > >
> > > > >> > > > > > The other point Yash mentioned about the slightly flawed
> > > > >> definition
> > > > >> > > of
> > > > >> > > > > e2e
> > > > >> > > > > > is also true in a sense. But I have a feeling that's
> one the
> > > > >> > records
> > > > >> > > > are
> > > > >> > > > > > polled by the connector tasks, it would be difficult to
> > > track
> > > > >> the
> > > > >> > > final
> > > > >> > > > > leg
> > > > >> > > > > > via the framework. Probably users can track the metrics
> at
> > > > their
> > > > >> > end
> > > > >> > > to
> > > > >> > > > > > figure that out. Do you think that makes sense?
> > > > >> > > > > >
> > > > >> > > > > > Thanks!
> > > > >> > > > > > Sagar.
> > > > >> > > > > >
> > > > >> > > > > >
> > > > >> > > > > >
> > > > >> > > > > >
> > > > >> > > > > > On Thu, Sep 1, 2022 at 11:40 PM Jorge Esteban Quilcate
> > > Otoya <
> > > > >> > > > > > quilcate.jorge@gmail.com> wrote:
> > > > >> > > > > >
> > > > >> > > > > > > Hi Sagar and Yash,
> > > > >> > > > > > >
> > > > >> > > > > > > Thanks for your feedback!
> > > > >> > > > > > >
> > > > >> > > > > > > > 1) I am assuming the new metrics would be task level
> > > > metric.
> > > > >> > > > > > >
> > > > >> > > > > > > 1.1 Yes, it will be a task level metric, implemented
> on
> > > the
> > > > >> > > > > > > Worker[Source/Sink]Task.
> > > > >> > > > > > >
> > > > >> > > > > > > > Could you specify the way it's done for other
> > > sink/source
> > > > >> > > > connector?
> > > > >> > > > > > >
> > > > >> > > > > > > 1.2. Not sure what do you mean by this. Could you
> > > elaborate
> > > > a
> > > > >> bit
> > > > >> > > > more?
> > > > >> > > > > > >
> > > > >> > > > > > > > 2. I am slightly confused about the e2e latency
> > > metric...
> > > > >> > > > > > >
> > > > >> > > > > > > 2.1. Yes, I see. I was trying to bring a similar
> concept
> > > as
> > > > in
> > > > >> > > > Streams
> > > > >> > > > > > with
> > > > >> > > > > > > KIP-613, though the e2e concept may not be
> translatable.
> > > > >> > > > > > > We could keep it as `sink-record-latency` to avoid
> > > > conflating
> > > > >> > > > > concepts. A
> > > > >> > > > > > > similar metric naming was proposed in KIP-489 but at
> the
> > > > >> consumer
> > > > >> > > > > level —
> > > > >> > > > > > > though it seems dormant for a couple of years.
> > > > >> > > > > > >
> > > > >> > > > > > > > However, the put-batch time measures the
> > > > >> > > > > > > > time to put a batch of records to external sink.
> So, I
> > > > would
> > > > >> > > assume
> > > > >> > > > > > the 2
> > > > >> > > > > > > > can't be added as is to compute the e2e latency.
> Maybe I
> > > > am
> > > > >> > > missing
> > > > >> > > > > > > > something here. Could you plz clarify this.
> > > > >> > > > > > >
> > > > >> > > > > > > 2.2. Yes, agree. Not necessarily added, but with the 3
> > > > >> latencies
> > > > >> > > > (poll,
> > > > >> > > > > > > convert, putBatch) will be clearer where the
> bottleneck
> > > may
> > > > >> be,
> > > > >> > and
> > > > >> > > > > > > represent the internal processing.
> > > > >> > > > > > >
> > > > >> > > > > > > > however, as per the KIP it looks like it will be
> > > > >> > > > > > > > the latency between when the record was written to
> Kafka
> > > > and
> > > > >> > when
> > > > >> > > > the
> > > > >> > > > > > > > record is returned by a sink task's consumer's poll?
> > > > >> > > > > > >
> > > > >> > > > > > > 3.1. Agree. 2.1. could help to clarify this.
> > > > >> > > > > > >
> > > > >> > > > > > > > One more thing - I was wondering
> > > > >> > > > > > > > if there's a particular reason for having a min
> metric
> > > for
> > > > >> e2e
> > > > >> > > > > latency
> > > > >> > > > > > > but
> > > > >> > > > > > > > not for convert time?
> > > > >> > > > > > >
> > > > >> > > > > > > 3.2. Was following KIP-613 for e2e which seems useful
> to
> > > > >> compare
> > > > >> > > with
> > > > >> > > > > > Max a
> > > > >> > > > > > > get an idea of the window of results, though current
> > > > >> latencies in
> > > > >> > > > > > Connector
> > > > >> > > > > > > do not include Min, and that's why I haven't added it
> for
> > > > >> convert
> > > > >> > > > > > latency.
> > > > > >> > > > > > > Do you think it makes sense to extend latency metrics
> with
> > > > Min?
> > > > >> > > > > > >
> > > > >> > > > > > > KIP is updated to clarify some of these changes.
> > > > >> > > > > > >
> > > > >> > > > > > > Many thanks,
> > > > >> > > > > > > Jorge.
> > > > >> > > > > > >
> > > > >> > > > > > > On Thu, 1 Sept 2022 at 18:11, Yash Mayya <
> > > > >> yash.mayya@gmail.com>
> > > > >> > > > wrote:
> > > > >> > > > > > >
> > > > >> > > > > > > > Hi Jorge,
> > > > >> > > > > > > >
> > > > >> > > > > > > > Thanks for the KIP! I have the same confusion with
> the
> > > > >> > > e2e-latency
> > > > >> > > > > > > metrics
> > > > >> > > > > > > > as Sagar above. "e2e" would seem to indicate the
> latency
> > > > >> > between
> > > > >> > > > when
> > > > >> > > > > > the
> > > > >> > > > > > > > record was written to Kafka and when the record was
> > > > written
> > > > >> to
> > > > >> > > the
> > > > >> > > > > sink
> > > > >> > > > > > > > system by the connector - however, as per the KIP it
> > > looks
> > > > >> like
> > > > >> > > it
> > > > >> > > > > will
> > > > >> > > > > > > be
> > > > >> > > > > > > > the latency between when the record was written to
> Kafka
> > > > and
> > > > >> > when
> > > > >> > > > the
> > > > >> > > > > > > > record is returned by a sink task's consumer's
> poll? I
> > > > think
> > > > >> > that
> > > > >> > > > > > metric
> > > > >> > > > > > > > will be a little confusing to interpret. One more
> thing
> > > -
> > > > I
> > > > >> was
> > > > >> > > > > > wondering
> > > > >> > > > > > > > if there's a particular reason for having a min
> metric
> > > for
> > > > >> e2e
> > > > >> > > > > latency
> > > > >> > > > > > > but
> > > > >> > > > > > > > not for convert time?
> > > > >> > > > > > > >
> > > > >> > > > > > > > Thanks,
> > > > >> > > > > > > > Yash
> > > > >> > > > > > > >
> > > > >> > > > > > > > On Thu, Sep 1, 2022 at 8:59 PM Sagar <
> > > > >> > sagarmeansocean@gmail.com>
> > > > >> > > > > > wrote:
> > > > >> > > > > > > >
> > > > >> > > > > > > > > Hi Jorge,
> > > > >> > > > > > > > >
> > > > >> > > > > > > > > Thanks for the KIP. It looks like a very good
> > > addition.
> > > > I
> > > > >> > > skimmed
> > > > >> > > > > > > through
> > > > >> > > > > > > > > once and had a couple of questions =>
> > > > >> > > > > > > > >
> > > > >> > > > > > > > > 1) I am assuming the new metrics would be task
> level
> > > > >> metric.
> > > > >> > > > Could
> > > > >> > > > > > you
> > > > >> > > > > > > > > specify the way it's done for other sink/source
> > > > connector?
> > > > >> > > > > > > > > 2) I am slightly confused about the e2e latency
> > > metric.
> > > > >> Let's
> > > > >> > > > > > consider
> > > > >> > > > > > > > the
> > > > >> > > > > > > > > sink connector metric. If I look at the way it's
> > > > supposed
> > > > >> to
> > > > >> > be
> > > > >> > > > > > > > calculated,
> > > > >> > > > > > > > > i.e the difference between the record timestamp
> and
> > > the
> > > > >> wall
> > > > >> > > > clock
> > > > >> > > > > > > time,
> > > > >> > > > > > > > it
> > > > >> > > > > > > > > looks like a per record metric. However, the
> put-batch
> > > > >> time
> > > > >> > > > > measures
> > > > >> > > > > > > the
> > > > >> > > > > > > > > time to put a batch of records to external sink.
> So, I
> > > > >> would
> > > > >> > > > assume
> > > > >> > > > > > > the 2
> > > > >> > > > > > > > > can't be added as is to compute the e2e latency.
> > > Maybe I
> > > > >> am
> > > > >> > > > missing
> > > > >> > > > > > > > > something here. Could you plz clarify this.
> > > > >> > > > > > > > >
> > > > >> > > > > > > > > Thanks!
> > > > >> > > > > > > > > Sagar.
> > > > >> > > > > > > > >
> > > > >> > > > > > > > > On Tue, Aug 30, 2022 at 8:43 PM Jorge Esteban
> Quilcate
> > > > >> Otoya
> > > > >> > <
> > > > >> > > > > > > > > quilcate.jorge@gmail.com> wrote:
> > > > >> > > > > > > > >
> > > > >> > > > > > > > > > Hi all,
> > > > >> > > > > > > > > >
> > > > >> > > > > > > > > > I'd like to start a discussion thread on
> KIP-864:
> > > Add
> > > > >> > > > End-To-End
> > > > >> > > > > > > > Latency
> > > > >> > > > > > > > > > Metrics to Connectors.
> > > > >> > > > > > > > > > This KIP aims to improve the metrics available
> on
> > > > Source
> > > > >> > and
> > > > >> > > > Sink
> > > > >> > > > > > > > > > Connectors to measure end-to-end latency,
> including
> > > > >> source
> > > > >> > > and
> > > > >> > > > > sink
> > > > >> > > > > > > > > record
> > > > >> > > > > > > > > > conversion time, and sink record e2e latency
> > > (similar
> > > > to
> > > > >> > > > KIP-613
> > > > >> > > > > > for
> > > > >> > > > > > > > > > Streams).
> > > > >> > > > > > > > > >
> > > > >> > > > > > > > > > The KIP is here:
> > > > >> > > > > > > > > >
> > > > >> > > > > > > > > >
> > > > >> > > > > > > > >
> > > > >> > > > > > > >
> > > > >> > > > > > >
> > > > >> > > > > >
> > > > >> > > > >
> > > > >> > > >
> > > > >> > >
> > > > >> >
> > > > >>
> > > >
> > >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-864%3A+Add+End-To-End+Latency+Metrics+to+Connectors
> > > > >> > > > > > > > > >
> > > > >> > > > > > > > > > Please take a look and let me know what you
> think.
> > > > >> > > > > > > > > >
> > > > >> > > > > > > > > > Cheers,
> > > > >> > > > > > > > > > Jorge.
> > > > >> > > > > > > > > >
> > > > >> > > > > > > > >
> > > > >> > > > > > > >
> > > > >> > > > > > >
> > > > >> > > > > >
> > > > >> > > > >
> > > > >> > > >
> > > > >> > >
> > > > >> >
> > > > >>
> > > > >
> > > >
> > >
>

Re: [DISCUSS] KIP-864: Add End-To-End Latency Metrics to Connectors

Posted by Mickael Maison <mi...@gmail.com>.
Hi Jorge,

Thanks for the KIP, it is a nice improvement.

1) The per transformation metrics still have a question mark next to
them in the KIP. Do you want to include them? If so, we'll want to tag
them; we should be able to include the aliases in TransformationChain
and use them.

2) I see no references to predicates. If we don't want to measure
their latency, can we say it explicitly?

3) Should we have sink-record-batch-latency-avg-ms? All other metrics
have both the maximum and average values.

Thanks,
Mickael

On Thu, Oct 20, 2022 at 9:58 PM Jorge Esteban Quilcate Otoya
<qu...@gmail.com> wrote:
>
> Thanks, Chris! Great feedback! Please, find my comments below:
>
> On Thu, 13 Oct 2022 at 18:52, Chris Egerton <ch...@aiven.io.invalid> wrote:
>
> > Hi Jorge,
> >
> > Thanks for the KIP. I agree with the overall direction and think this would
> > be a nice improvement to Kafka Connect. Here are my initial thoughts on the
> > details:
> >
> > 1. The motivation section outlines the gaps in Kafka Connect's task metrics
> > nicely. I think it'd be useful to include more concrete details on why
> > these gaps need to be filled in, and in which cases additional metrics
> > would be helpful. One goal could be to provide enhanced monitoring of
> > production deployments that allows for cluster administrators to set up
> > automatic alerts for latency spikes and, if triggered, quickly identify the
> > root cause of those alerts, reducing the time to remediation. Another goal
> > could be to provide more insight to developers or cluster administrators
> > who want to do performance testing on connectors in non-production
> > environments. It may help guide our decision making process to have a
> > clearer picture of the goals we're trying to achieve.
> >
>
> Agree. The Motivation section has been updated.
> Thanks for the examples, I see both of them being covered by the KIP.
> I see how these could give us a good distinction on whether to position
> some metrics at INFO or DEBUG level.
>
>
> > 2. If we're trying to address the alert-and-diagnose use case, it'd be
> > useful to have as much information as possible at INFO level, rather than
> > forcing cluster administrators to possibly reconfigure a connector to emit
> > DEBUG or TRACE level metrics in order to diagnose a potential
> > production-impacting performance bottleneck. I can see the rationale for
> > emitting per-record metrics that track an average value at DEBUG level, but
> > for per-record metrics that track a maximum value, is there any reason not
> > to provide this information at INFO level?
> >
>
> Agree. Though with Max and Avg metrics being part of the same sensor —
> where the metric recording level is defined — both metrics get the same level.
>
>
> > 3. I'm also curious about the performance testing suggested by Yash to
> > gauge the potential impact of this change. Have you been able to do any
> > testing with your draft implementation yet?
> >
>
> No, not so far.
> I think it would be valuable to discuss the scope of this testing and maybe
> tackle it
> in a separate issue as Sensors and Metrics are used all over the place.
> My initial understanding is that these tests should be placed in the
> jmh-benchmarks[1].
> Then, we could target testing Sensors and Metrics, and validate how much
> overhead
> is added by having only Max vs Max,Avg(,Min), etc.
> On the other hand, we could extend this to Transformers or other Connect
> layers.
>
> Here are some pointers to the Sensors and Metrics implementations that
> could be considered:
> Path to metric recording:
> -
> https://github.com/apache/kafka/blob/5cab11cf525f6c06fcf9eb43f7f95ef33fe1cdbb/clients/src/main/java/org/apache/kafka/common/metrics/Sensor.java#L195-L199
> -
> https://github.com/apache/kafka/blob/5cab11cf525f6c06fcf9eb43f7f95ef33fe1cdbb/clients/src/main/java/org/apache/kafka/common/metrics/Sensor.java#L230-L244
>
> ```
> // increment all the stats
> for (StatAndConfig statAndConfig : this.stats) {
>    statAndConfig.stat.record(statAndConfig.config(), value, timeMs);
> }
> ```
>
> SampledStats:
> - Avg:
> https://github.com/apache/kafka/blob/068ab9cefae301f3187ea885d645c425955e77d2/clients/src/main/java/org/apache/kafka/common/metrics/stats/Avg.java
> - Max:
> https://github.com/apache/kafka/blob/068ab9cefae301f3187ea885d645c425955e77d2/clients/src/main/java/org/apache/kafka/common/metrics/stats/Max.java
> - Min:
> https://github.com/apache/kafka/blob/068ab9cefae301f3187ea885d645c425955e77d2/clients/src/main/java/org/apache/kafka/common/metrics/stats/Min.java
>
> `stat#record()` is implemented by the `update` method of each SampledStat:
>
> ```Max.java
>     @Override
>     protected void update(Sample sample, MetricConfig config, double value,
> long now) {
>         sample.value = Math.max(sample.value, value);
>     }
> ```
>
> ```Avg.java
>     @Override
>     protected void update(Sample sample, MetricConfig config, double value,
> long now) {
>         sample.value += value;
>     }
> ```
>
> As far as I understand, most of the work of the stats happens on the
> `combine` method that is not part of the connector execution but called
> when metrics are queried.
>
> I wonder whether we should consider Avg and Max for all metrics proposed as
> the impact on the execution path seems minimal, and even see if Min is also
> valuable, and use DEBUG only for more granular metrics.
>
> [1] https://github.com/apache/kafka/tree/trunk/jmh-benchmarks
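As a rough illustration of the per-record cost under discussion (ahead of any proper JMH run), here is a self-contained sketch of the Max/Avg update logic; the class and method names are hypothetical and only mirror the shape of `SampledStat#update`, not Kafka's actual API:

```java
// Hypothetical, self-contained sketch of the per-record work done by
// Max/Avg sampled stats on the hot path.
public class StatUpdateSketch {
    double maxSample = Double.NEGATIVE_INFINITY;
    double sum = 0.0;
    long count = 0;

    // One call per record: a comparison for Max, an addition for Avg.
    void record(double value) {
        maxSample = Math.max(maxSample, value); // Max#update
        sum += value;                           // Avg#update; average derived on read
        count++;
    }

    // The average is only computed when the metric is queried.
    double avg() {
        return count == 0 ? Double.NaN : sum / count;
    }

    public static void main(String[] args) {
        StatUpdateSketch stat = new StatUpdateSketch();
        for (double latencyMs : new double[] {12.0, 3.5, 40.25, 7.25}) {
            stat.record(latencyMs);
        }
        System.out.println("max=" + stat.maxSample + " avg=" + stat.avg());
    }
}
```

Under this assumption, the per-record cost is a comparison plus an addition, which supports the point that tracking Avg alongside Max adds little on the execution path.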
>
>
> > 4. Just to make sure I understand correctly--does "time when it has been
> > received by the Sink task" refer to the wallclock time directly after a
> > call to SinkTask::put has been completed (as opposed to directly before
> > that call is made, or something else entirely)?
> >
>
> It currently means the point when the record has been received by the sink
> task, right after the consumer poll and before conversions.
> Would it be valuable to have it after put-sink-records?
>
>
> > 5. If the goal is to identify performance bottlenecks (either in production
> > or pre-production environments), would it make sense to introduce metrics
> > for each individual converter (i.e., key/value/header) and transformation?
> > It's definitely an improvement to be able to identify the total time for
> > conversion and transformation, but then the immediate follow-up question if
> > a bottleneck is found in that phase is "which converter/transformation is
> > responsible?" It'd be nice if we could provide a way to quickly answer that
> > question.
> >
>
> This is a great idea. I'd like to consider this as well, though maybe these
> more granular
> metrics would be better kept at DEBUG level.
>
>
> > 6. Any thoughts about offering latency metrics for source tasks between
> > receipt of the record from the task and delivery of the record to Kafka
> > (which would be tracked by producer callback)? We could also use the record
> > timestamp either instead of or in addition to receipt time if the task
> > provides a timestamp with its records.
> >
>
> With source transform and convert metrics we get part of that latency.
> Looking at the Producer metrics, `request-latency` (though a very generic
> metric) sort of captures the time between send request and ack — if my
> understanding is correct.
> Would these be enough, or are you thinking about another approach,
> maybe a custom metric to cover the producer side?
>
>
> > 7. We may end up introducing a way for sink tasks to record per-record
> > delivery to the sink system (see KIP-767 [1]). I'd like it if we could keep
> > the names of our metrics very precise in order to avoid confusing users
> > (who may think that we're providing metrics on actual delivery to the sink
> > system, which may not be the case if the connector performs asynchronous
> > writes), and in order to leave room for a metrics on true delivery time by
> > sink tasks. It'd also be nice if we could remain consistent with existing
> > metrics such as "put-batch-avg-time-ms". With that in mind, what do you
> > think about renaming these metrics:
> > - "sink-record-batch-latency-max-ms" to "put-batch-avg-latency-ms"
> > - "sink-record-latency-max-ms" to "put-sink-record-latency-max-ms"
> > - "sink-record-latency-avg-ms" to "put-sink-record-latency-avg-ms"
> > - "sink-record-convert-transform-time-max-ms" to
> > "convert-transform-sink-record-time-max-ms"
> > - "sink-record-convert-transform-time-avg-ms" to
> > "convert-transform-sink-record-time-avg-ms"
> > - "source-record-transform-convert-time-max-ms" to
> > "transform-convert-source-record-time-max-ms"
> > - "source-record-transform-convert-time-avg-ms" to
> > "transform-convert-source-record-time-avg-ms"
> >
>
> Make sense, thanks! I have updated the list of metrics and group them by
> sensor and applying these suggestions.
> The only ones that I want to review are: sink-record-* to put-batch-*
> (first 3). Not sure if put-batch/put-sink-record describes the purpose of
> the metric — nor does `sink-record-latency`, to be honest.
> My initial thought was to have something like Kafka Streams e2e-latency.
> Based on 4. and 6. questions, an idea could be to add:
> - source-batch-e2e-latency-before-send: measure wallclock - source record
> timestamp after source connector poll.
> - source-batch-e2e-latency-after-send: measure wallclock - record timestamp
> on producer send callback
> - sink-batch-e2e-latency-before-put: measure time wallclock - record
> timestamp after consumer poll
> - sink-batch-e2e-latency-after-put: measure time wallclock - record
> timestamp after sink connector put.
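A minimal sketch of how each of these per-record values could be computed (wallclock and record timestamp in epoch millis; the class and method names are hypothetical, not part of the KIP):

```java
// Hypothetical sketch: e2e latency as (wallclock - record timestamp),
// clamped at zero to guard against clock skew between brokers and workers.
public class E2eLatencySketch {
    static long e2eLatencyMs(long recordTimestampMs, long wallclockMs) {
        return Math.max(0L, wallclockMs - recordTimestampMs);
    }

    public static void main(String[] args) {
        long recordTs = 1_000L;   // timestamp carried by the Kafka record
        long afterPutMs = 1_250L; // wallclock right after the sink put
        System.out.println(e2eLatencyMs(recordTs, afterPutMs)); // prints 250
    }
}
```

The same computation would apply at each of the four measurement points; only the wallclock capture point (before/after send, before/after put) changes.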
>
>
> > Thanks again for the KIP! Looking forward to your thoughts.
> >
> > Cheers,
> >
> > Chris
> >
> > [1] -
> >
> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-767%3A+Connect+Latency+Metrics
> >
> > On Thu, Sep 15, 2022 at 1:32 PM Jorge Esteban Quilcate Otoya <
> > quilcate.jorge@gmail.com> wrote:
> >
> > > Hi everyone,
> > >
> > > I've made a slight addition to the KIP based on Yash feedback:
> > >
> > > - A new metric is added at INFO level to record the max latency from the
> > > batch timestamp, by keeping the oldest record timestamp per batch.
> > > - A draft implementation is linked.
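The batch-level approach can be sketched as keeping only the oldest (minimum) record timestamp while iterating over the batch, then recording a single max-latency value per batch; the names below are hypothetical:

```java
// Hypothetical sketch: one sensor update per batch, using the oldest
// record timestamp, which bounds the max latency of the whole batch.
public class OldestRecordTimestampSketch {
    static long oldestTimestampMs(long[] recordTimestampsMs) {
        long oldest = Long.MAX_VALUE;
        for (long ts : recordTimestampsMs) {
            oldest = Math.min(oldest, ts);
        }
        return oldest;
    }

    public static void main(String[] args) {
        long[] batch = {1_500L, 1_200L, 1_800L};
        long wallclockMs = 2_000L;
        // single value fed to the max-latency sensor for this batch
        System.out.println(wallclockMs - oldestTimestampMs(batch)); // prints 800
    }
}
```

This keeps the recording cost per batch rather than per record, at the price of only supporting a max (not an avg) for this metric.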
> > >
> > > Looking forward to your feedback.
> > > Also, a kindly reminder that the vote thread is open.
> > >
> > > Thanks!
> > > Jorge.
> > >
> > > On Thu, 8 Sept 2022 at 14:25, Jorge Esteban Quilcate Otoya <
> > > quilcate.jorge@gmail.com> wrote:
> > >
> > > > Great. I have updated the KIP to reflect this.
> > > >
> > > > Cheers,
> > > > Jorge.
> > > >
> > > > On Thu, 8 Sept 2022 at 12:26, Yash Mayya <ya...@gmail.com> wrote:
> > > >
> > > >> Thanks, I think it makes sense to define these metrics at a DEBUG
> > > >> recording
> > > >> level.
> > > >>
> > > >> On Thu, Sep 8, 2022 at 2:51 PM Jorge Esteban Quilcate Otoya <
> > > >> quilcate.jorge@gmail.com> wrote:
> > > >>
> > > >> > On Thu, 8 Sept 2022 at 05:55, Yash Mayya <ya...@gmail.com>
> > > wrote:
> > > >> >
> > > >> > > Hi Jorge,
> > > >> > >
> > > >> > > Thanks for the changes. With regard to having per batch vs per
> > > record
> > > >> > > metrics, the additional overhead I was referring to wasn't about
> > > >> whether
> > > >> > or
> > > >> > > not we would need to iterate over all the records in a batch. I
> > was
> > > >> > > referring to the potential additional overhead caused by the
> > higher
> > > >> > volume
> > > >> > > of calls to Sensor::record on the sensors for the new metrics (as
> > > >> > compared
> > > >> > > to the existing batch only metrics), especially for high
> > throughput
> > > >> > > connectors where batch sizes could be large. I guess we may want
> > to
> > > do
> > > >> > some
> > > >> > > sort of performance testing and get concrete numbers to verify
> > > whether
> > > >> > this
> > > >> > > is a valid concern or not?
> > > >> > >
> > > >> >
> > > >> > 6.1. Got it, thanks for clarifying. I guess there could be a
> > benchmark
> > > >> test
> > > >> > of the `Sensor::record` to get an idea of the performance impact.
> > > >> > Regardless, the fact that these are single-record metrics compared
> > to
> > > >> > existing batch-only could be explicitly defined by setting these
> > > >> metrics at
> > > >> > a DEBUG or TRACE metric recording level, leaving the existing at
> > INFO
> > > >> > level.
> > > >> > wdyt?
> > > >> >
> > > >> >
> > > >> > >
> > > >> > > Thanks,
> > > >> > > Yash
> > > >> > >
> > > >> > > On Tue, Sep 6, 2022 at 4:42 PM Jorge Esteban Quilcate Otoya <
> > > >> > > quilcate.jorge@gmail.com> wrote:
> > > >> > >
> > > >> > > > Hi Sagar and Yash,
> > > >> > > >
> > > >> > > > > the way it's defined in
> > > >> > > > https://kafka.apache.org/documentation/#connect_monitoring for
> > > the
> > > >> > > metrics
> > > >> > > >
> > > >> > > > 4.1. Got it. Will add it to the KIP.
> > > >> > > >
> > > >> > > > > The only thing I would argue is do we need
> > > >> sink-record-latency-min?
> > > >> > > Maybe
> > > >> > > > we
> > > >> > > > > could remove this min metric as well and make all of the 3 e2e
> > > >> > metrics
> > > >> > > > > consistent
> > > >> > > >
> > > >> > > > 4.2 I see. Will remove it from the KIP.
> > > >> > > >
> > > >> > > > > Probably users can track the metrics at their end to
> > > >> > > > > figure that out. Do you think that makes sense?
> > > >> > > >
> > > >> > > > 4.3. Yes, agree. With these new metrics it should be easier for
> > > >> users
> > > >> > to
> > > >> > > > track this.
> > > >> > > >
> > > >> > > > > I think it makes sense to not have a min metric for either to
> > > >> remain
> > > >> > > > > consistent with the existing put-batch and poll-batch metrics
> > > >> > > >
> > > >> > > > 5.1. Got it. Same as 4.2
> > > >> > > >
> > > >> > > > > Another naming related suggestion I had was with the
> > > >> > > > > "convert-time" metrics - we should probably include
> > > >> transformations
> > > >> > in
> > > >> > > > the
> > > >> > > > > name since SMTs could definitely be attributable to a sizable
> > > >> chunk
> > > >> > of
> > > >> > > > the
> > > >> > > > > latency depending on the specific transformation chain.
> > > >> > > >
> > > >> > > > 5.2. Make sense. I'm proposing to add
> > > >> > `sink-record-convert-transform...`
> > > >> > > > and `source-record-transform-convert...` to represent correctly
> > > the
> > > >> > order
> > > >> > > > of operations.
> > > >> > > >
> > > >> > > > > it seems like both source and sink tasks only record metrics
> > at
> > > a
> > > >> > > "batch"
> > > >> > > > > level, not on an individual record level. I think it might be
> > > >> > > additional
> > > >> > > > > overhead if we want to record these new metrics all at the
> > > record
> > > >> > > level?
> > > >> > > >
> > > >> > > > 5.3. I considered at the beginning to implement all metrics at
> > the
> > > >> > batch
> > > >> > > > level, but given how the framework processes records, I fell back
> > to
> > > >> the
> > > >> > > > proposed approach:
> > > >> > > > - Sink Task:
> > > >> > > >   - `WorkerSinkTask#convertMessages(msgs)` already iterates over
> > > >> > records,
> > > >> > > > so there is no additional overhead to capture record latency per
> > > >> > record.
> > > >> > > >     -
> > > >> > > >
> > > >> > > >
> > > >> > >
> > > >> >
> > > >>
> > >
> > https://github.com/apache/kafka/blob/9841647c4fe422532f448423c92d26e4fdcb8932/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/WorkerSinkTask.java#L490-L514
> > > >> > > >   - `WorkerSinkTask#convertAndTransformRecord(record)` actually
> > > >> happens
> > > >> > > > individually. Measuring this operation per batch would include
> > > >> > processing
> > > >> > > > that is not strictly part of "convert and transform"
> > > >> > > >     -
> > > >> > > >
> > > >> > > >
> > > >> > >
> > > >> >
> > > >>
> > >
> > https://github.com/apache/kafka/blob/9841647c4fe422532f448423c92d26e4fdcb8932/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/WorkerSinkTask.java#L518
> > > >> > > > - Source Task:
> > > >> > > >   - `AbstractWorkerSourceTask#sendRecords` iterates over a batch
> > > and
> > > >> > > > applies transforms and convert record individually as well:
> > > >> > > >     -
> > > >> > > >
> > > >> > > >
> > > >> > >
> > > >> >
> > > >>
> > >
> > https://github.com/apache/kafka/blob/9841647c4fe422532f448423c92d26e4fdcb8932/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/AbstractWorkerSourceTask.java#L389-L390
> > > >> > > >
> > > >> > > > > This might require some additional changes -
> > > >> > > > > for instance, with the "sink-record-latency" metric, we might
> > > only
> > > >> > want
> > > >> > > > to
> > > >> > > > > have a "max" metric since "avg" would require recording a
> > value
> > > on
> > > >> > the
> > > >> > > > > sensor for each record (whereas we can get a "max" by only
> > > >> recording
> > > >> > a
> > > >> > > > > metric value for the oldest record in each batch).
> > > >> > > >
> > > >> > > > 5.4. Recording record-latency per batch may not be as useful as
> > > >> there
> > > >> > is
> > > >> > > no
> > > >> > > > guarantee that the oldest record will be representative of the
> > > >> batch.
> > > >> > > >
> > > >> > > > On Sat, 3 Sept 2022 at 16:02, Yash Mayya <ya...@gmail.com>
> > > >> wrote:
> > > >> > > >
> > > >> > > > > Hi Jorge and Sagar,
> > > >> > > > >
> > > >> > > > > I think it makes sense to not have a min metric for either to
> > > >> remain
> > > >> > > > > consistent with the existing put-batch and poll-batch metrics
> > > (it
> > > >> > > doesn't
> > > >> > > > > seem particularly useful either anyway). Also, the new
> > > >> > > > > "sink-record-latency" metric name looks fine to me, thanks for
> > > >> making
> > > >> > > the
> > > >> > > > > changes! Another naming related suggestion I had was with the
> > > >> > > > > "convert-time" metrics - we should probably include
> > > >> transformations
> > > >> > in
> > > >> > > > the
> > > >> > > > > name since SMTs could definitely be attributable to a sizable
> > > >> chunk
> > > >> > of
> > > >> > > > the
> > > >> > > > > latency depending on the specific transformation chain.
> > > >> > > > >
> > > >> > > > > I have one high level question with respect to implementation
> > -
> > > >> > > > currently,
> > > >> > > > > it seems like both source and sink tasks only record metrics
> > at
> > > a
> > > >> > > "batch"
> > > >> > > > > level, not on an individual record level. I think it might be
> > > >> > > additional
> > > >> > > > > overhead if we want to record these new metrics all at the
> > > record
> > > >> > > level?
> > > >> > > > > Could we instead make all of these new metrics for batches of
> > > >> records
> > > >> > > > > rather than individual records in order to remain consistent
> > > with
> > > >> the
> > > >> > > > > existing task level metrics? This might require some
> > additional
> > > >> > > changes -
> > > >> > > > > for instance, with the "sink-record-latency" metric, we might
> > > only
> > > >> > want
> > > >> > > > to
> > > >> > > > > have a "max" metric since "avg" would require recording a
> > value
> > > on
> > > >> > the
> > > >> > > > > sensor for each record (whereas we can get a "max" by only
> > > >> recording
> > > >> > a
> > > >> > > > > metric value for the oldest record in each batch).
> > > >> > > > >
> > > >> > > > > Thanks,
> > > >> > > > > Yash
> > > >> > > > >
> > > >> > > > > On Fri, Sep 2, 2022 at 3:16 PM Sagar <
> > sagarmeansocean@gmail.com
> > > >
> > > >> > > wrote:
> > > >> > > > >
> > > >> > > > > > Hi Jorge,
> > > >> > > > > >
> > > >> > > > > > Thanks for the changes.
> > > >> > > > > >
> > > >> > > > > > Regarding the metrics, I meant something like this:
> > > >> > > > > >
> > > >> > > > >
> > > >> > > >
> > > >> > >
> > > >> >
> > > >>
> > >
> > kafka.connect:type=sink-task-metrics,connector="{connector}",task="{task}"
> > > >> > > > > >
> > > >> > > > > > the way it's defined in
> > > >> > > > > > https://kafka.apache.org/documentation/#connect_monitoring
> > > for
> > > >> the
> > > >> > > > > > metrics.
> > > >> > > > > >
> > > >> > > > > > I see what you mean by the 3 metrics and how it can be
> > > >> interpreted.
> > > >> > > The
> > > >> > > > > > only thing I would argue is do we need
> > > sink-record-latency-min?
> > > >> > Maybe
> > > >> > > > we
> > > >> > > > > > could remove this min metric as well and make all of the 3
> > e2e
> > > >> > > metrics
> > > >> > > > > > consistent(since put-batch also doesn't expose a min which
> > > makes
> > > >> > > sense
> > > >> > > > to
> > > >> > > > > > me). I think this is in contrast to what Yash pointed out
> > > above
> > > >> so
> > > >> > I
> > > >> > > > > would
> > > >> > > > > > like to hear his thoughts as well.
> > > >> > > > > >
> > > >> > > > > > The other point Yash mentioned about the slightly flawed
> > > >> definition
> > > >> > > of
> > > >> > > > > e2e
> > > >> > > > > > is also true in a sense. But I have a feeling that's once the
> > > >> > records
> > > >> > > > are
> > > >> > > > > > polled by the connector tasks, it would be difficult to
> > track
> > > >> the
> > > >> > > final
> > > >> > > > > leg
> > > >> > > > > > via the framework. Probably users can track the metrics at
> > > their
> > > >> > end
> > > >> > > to
> > > >> > > > > > figure that out. Do you think that makes sense?
> > > >> > > > > >
> > > >> > > > > > Thanks!
> > > >> > > > > > Sagar.
> > > >> > > > > >
> > > >> > > > > >
> > > >> > > > > >
> > > >> > > > > >
> > > >> > > > > > On Thu, Sep 1, 2022 at 11:40 PM Jorge Esteban Quilcate
> > Otoya <
> > > >> > > > > > quilcate.jorge@gmail.com> wrote:
> > > >> > > > > >
> > > >> > > > > > > Hi Sagar and Yash,
> > > >> > > > > > >
> > > >> > > > > > > Thanks for your feedback!
> > > >> > > > > > >
> > > >> > > > > > > > 1) I am assuming the new metrics would be task level
> > > metric.
> > > >> > > > > > >
> > > >> > > > > > > 1.1 Yes, it will be a task level metric, implemented on
> > the
> > > >> > > > > > > Worker[Source/Sink]Task.
> > > >> > > > > > >
> > > >> > > > > > > > Could you specify the way it's done for other
> > sink/source
> > > >> > > > connector?
> > > >> > > > > > >
> > > >> > > > > > > 1.2. Not sure what do you mean by this. Could you
> > elaborate
> > > a
> > > >> bit
> > > >> > > > more?
> > > >> > > > > > >
> > > >> > > > > > > > 2. I am slightly confused about the e2e latency
> > metric...
> > > >> > > > > > >
> > > >> > > > > > > 2.1. Yes, I see. I was trying to bring a similar concept
> > as
> > > in
> > > >> > > > Streams
> > > >> > > > > > with
> > > >> > > > > > > KIP-613, though the e2e concept may not be translatable.
> > > >> > > > > > > We could keep it as `sink-record-latency` to avoid
> > > conflating
> > > >> > > > > concepts. A
> > > >> > > > > > > similar metric naming was proposed in KIP-489 but at the
> > > >> consumer
> > > >> > > > > level —
> > > >> > > > > > > though it seems dormant for a couple of years.
> > > >> > > > > > >
> > > >> > > > > > > > However, the put-batch time measures the
> > > >> > > > > > > > time to put a batch of records to external sink. So, I
> > > would
> > > >> > > assume
> > > >> > > > > > the 2
> > > >> > > > > > > > can't be added as is to compute the e2e latency. Maybe I
> > > am
> > > >> > > missing
> > > >> > > > > > > > something here. Could you plz clarify this.
> > > >> > > > > > >
> > > >> > > > > > > 2.2. Yes, agree. Not necessarily added, but with the 3
> > > >> latencies
> > > >> > > > (poll,
> > > >> > > > > > > convert, putBatch) will be clearer where the bottleneck
> > may
> > > >> be,
> > > >> > and
> > > >> > > > > > > represent the internal processing.
> > > >> > > > > > >
> > > >> > > > > > > > however, as per the KIP it looks like it will be
> > > >> > > > > > > > the latency between when the record was written to Kafka
> > > and
> > > >> > when
> > > >> > > > the
> > > >> > > > > > > > record is returned by a sink task's consumer's poll?
> > > >> > > > > > >
> > > >> > > > > > > 3.1. Agree. 2.1. could help to clarify this.
> > > >> > > > > > >
> > > >> > > > > > > > One more thing - I was wondering
> > > >> > > > > > > > if there's a particular reason for having a min metric
> > for
> > > >> e2e
> > > >> > > > > latency
> > > >> > > > > > > but
> > > >> > > > > > > > not for convert time?
> > > >> > > > > > >
> > > >> > > > > > > 3.2. Was following KIP-613 for e2e which seems useful to
> > > >> compare
> > > >> > > with
> > > >> > > > > > Max and
> > > >> > > > > > > get an idea of the window of results, though current
> > > >> latencies in
> > > >> > > > > > Connector
> > > >> > > > > > > do not include Min, and that's why I haven't added it for
> > > >> convert
> > > >> > > > > > latency.
> > > >> > > > > > > Do you think it makes sense to extend latency metrics with
> > > Min?
> > > >> > > > > > >
> > > >> > > > > > > KIP is updated to clarify some of these changes.
> > > >> > > > > > >
> > > >> > > > > > > Many thanks,
> > > >> > > > > > > Jorge.
> > > >> > > > > > >
> > > >> > > > > > > On Thu, 1 Sept 2022 at 18:11, Yash Mayya <
> > > >> yash.mayya@gmail.com>
> > > >> > > > wrote:
> > > >> > > > > > >
> > > >> > > > > > > > Hi Jorge,
> > > >> > > > > > > >
> > > >> > > > > > > > Thanks for the KIP! I have the same confusion with the
> > > >> > > e2e-latency
> > > >> > > > > > > metrics
> > > >> > > > > > > > as Sagar above. "e2e" would seem to indicate the latency
> > > >> > between
> > > >> > > > when
> > > >> > > > > > the
> > > >> > > > > > > > record was written to Kafka and when the record was
> > > written
> > > >> to
> > > >> > > the
> > > >> > > > > sink
> > > >> > > > > > > > system by the connector - however, as per the KIP it

Re: [DISCUSS] KIP-864: Add End-To-End Latency Metrics to Connectors

Posted by Jorge Esteban Quilcate Otoya <qu...@gmail.com>.
Thanks, Chris! Great feedback! Please find my comments below:

On Thu, 13 Oct 2022 at 18:52, Chris Egerton <ch...@aiven.io.invalid> wrote:

> Hi Jorge,
>
> Thanks for the KIP. I agree with the overall direction and think this would
> be a nice improvement to Kafka Connect. Here are my initial thoughts on the
> details:
>
> 1. The motivation section outlines the gaps in Kafka Connect's task metrics
> nicely. I think it'd be useful to include more concrete details on why
> these gaps need to be filled in, and in which cases additional metrics
> would be helpful. One goal could be to provide enhanced monitoring of
> production deployments that allows for cluster administrators to set up
> automatic alerts for latency spikes and, if triggered, quickly identify the
> root cause of those alerts, reducing the time to remediation. Another goal
> could be to provide more insight to developers or cluster administrators
> who want to do performance testing on connectors in non-production
> environments. It may help guide our decision making process to have a
> clearer picture of the goals we're trying to achieve.
>

Agree. The Motivation section has been updated.
Thanks for the examples; I see both of them being covered by the KIP.
These could also help us decide whether to position some metrics at INFO or
DEBUG level.


> 2. If we're trying to address the alert-and-diagnose use case, it'd be
> useful to have as much information as possible at INFO level, rather than
> forcing cluster administrators to possibly reconfigure a connector to emit
> DEBUG or TRACE level metrics in order to diagnose a potential
> production-impacting performance bottleneck. I can see the rationale for
> emitting per-record metrics that track an average value at DEBUG level, but
> for per-record metrics that track a maximum value, is there any reason not
> to provide this information at INFO level?
>

Agree. Though, since Max and Avg metrics are part of the same sensor (where
the metric recording level is defined), both metrics end up with the same level.


> 3. I'm also curious about the performance testing suggested by Yash to
> gauge the potential impact of this change. Have you been able to do any
> testing with your draft implementation yet?
>

No, not so far.
I think it would be valuable to discuss the scope of this testing, and maybe
tackle it in a separate issue, as Sensors and Metrics are used all over the
place.
My initial understanding is that these tests should be placed in the
jmh-benchmarks[1].
Then we could target Sensors and Metrics, and validate how much overhead is
added by having only Max vs. Max,Avg(,Min), etc.
On the other hand, we could extend this to Transformers or other Connect
layers.

Here are some pointers to the Sensors and Metrics implementations that
could be considered:
Path to metric recording:
-
https://github.com/apache/kafka/blob/5cab11cf525f6c06fcf9eb43f7f95ef33fe1cdbb/clients/src/main/java/org/apache/kafka/common/metrics/Sensor.java#L195-L199
-
https://github.com/apache/kafka/blob/5cab11cf525f6c06fcf9eb43f7f95ef33fe1cdbb/clients/src/main/java/org/apache/kafka/common/metrics/Sensor.java#L230-L244

```
// increment all the stats
for (StatAndConfig statAndConfig : this.stats) {
   statAndConfig.stat.record(statAndConfig.config(), value, timeMs);
}
```

SampledStats:
- Avg:
https://github.com/apache/kafka/blob/068ab9cefae301f3187ea885d645c425955e77d2/clients/src/main/java/org/apache/kafka/common/metrics/stats/Avg.java
- Max:
https://github.com/apache/kafka/blob/068ab9cefae301f3187ea885d645c425955e77d2/clients/src/main/java/org/apache/kafka/common/metrics/stats/Max.java
- Min:
https://github.com/apache/kafka/blob/068ab9cefae301f3187ea885d645c425955e77d2/clients/src/main/java/org/apache/kafka/common/metrics/stats/Min.java

`stat#record()` is implemented by the `update` method of each SampledStat:

```Max.java
    @Override
    protected void update(Sample sample, MetricConfig config, double value,
long now) {
        sample.value = Math.max(sample.value, value);
    }
```

```Avg.java
    @Override
    protected void update(Sample sample, MetricConfig config, double value,
long now) {
        sample.value += value;
    }
```

As far as I understand, most of the work of the stats happens in the
`combine` method, which is not part of the connector execution path but is
called when metrics are queried.

I wonder whether we should consider Avg and Max for all proposed metrics, as
the impact on the execution path seems minimal; we could even evaluate
whether Min is also valuable, and use DEBUG only for more granular metrics.

[1] https://github.com/apache/kafka/tree/trunk/jmh-benchmarks
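Before writing the JMH benchmarks, a first-order sense of the per-record cost could come from a self-contained sketch. Note the `Stat` class below is a simplified stand-in for Kafka's SampledStat subclasses (not the real `org.apache.kafka.common.metrics` classes), just to illustrate that each additional stat adds one cheap operation per `record()` call:

```java
import java.util.function.DoubleBinaryOperator;

// Simplified stand-ins for Kafka's SampledStat subclasses, to estimate the
// per-record cost of recording into several stats at once.
public class SensorStatSketch {

    // Mirrors SampledStat#update: each record() applies one binary operation.
    static final class Stat {
        double value;
        final DoubleBinaryOperator op;
        Stat(double initial, DoubleBinaryOperator op) {
            this.value = initial;
            this.op = op;
        }
        void record(double v) {
            value = op.applyAsDouble(value, v);
        }
    }

    // Records `count` deterministic synthetic latencies into a Max stat and a
    // running Sum (Avg = sum / count, as in Kafka's Avg); returns {max, avg}.
    static double[] run(int count) {
        Stat max = new Stat(Double.NEGATIVE_INFINITY, Math::max);
        Stat sum = new Stat(0.0, Double::sum);
        for (int i = 1; i <= count; i++) {
            double latency = i % 100; // synthetic value in [0, 99]
            max.record(latency);
            sum.record(latency);
        }
        return new double[] { max.value, sum.value / count };
    }

    public static void main(String[] args) {
        long start = System.nanoTime();
        double[] stats = run(1_000_000);
        long elapsed = System.nanoTime() - start;
        System.out.println("max=" + stats[0] + " avg=" + stats[1]
                + " ns/record=" + (elapsed / 1_000_000.0));
    }
}
```

This only approximates the hot path; a proper JMH benchmark against the real `Sensor::record` would still be needed to account for synchronization and sample windowing.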


> 4. Just to make sure I understand correctly--does "time when it has been
> received by the Sink task" refer to the wallclock time directly after a
> call to SinkTask::put has been completed (as opposed to directly before
> that call is made, or something else entirely)?
>

It currently means the moment the records are received by the Sink task,
right after the consumer poll and before conversions.
Would it be valuable to also measure it after put-sink-records?
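As an illustration of "received by the Sink task" (a hypothetical helper, not the actual WorkerSinkTask code), the per-record value would be the wall-clock time right after consumer poll minus the record timestamp:

```java
public class SinkRecordLatencySketch {

    // Latency from the record's timestamp to the moment the sink task receives
    // it, i.e. right after consumer poll and before conversion. Clamped at zero
    // in case of clock skew between brokers and workers.
    static long recordLatencyMs(long recordTimestampMs, long wallClockMs) {
        return Math.max(0L, wallClockMs - recordTimestampMs);
    }

    public static void main(String[] args) {
        // a record stamped at t=1000 ms, polled at t=1500 ms
        System.out.println(recordLatencyMs(1_000L, 1_500L)); // 500
    }
}
```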


> 5. If the goal is to identify performance bottlenecks (either in production
> or pre-production environments), would it make sense to introduce metrics
> for each individual converter (i.e., key/value/header) and transformation?
> It's definitely an improvement to be able to identify the total time for
> conversion and transformation, but then the immediate follow-up question if
> a bottleneck is found in that phase is "which converter/transformation is
> responsible?" It'd be nice if we could provide a way to quickly answer that
> question.
>

This is a great idea. I'd like to consider this as well, though these more
granular metrics may be better placed at DEBUG level.


> 6. Any thoughts about offering latency metrics for source tasks between
> receipt of the record from the task and delivery of the record to Kafka
> (which would be tracked by producer callback)? We could also use the record
> timestamp either instead of or in addition to receipt time if the task
> provides a timestamp with its records.
>

With the source transform and convert metrics we get part of that latency.
Looking at the producer metrics, `request-latency` (though a very generic
metric) sort of answers the time between the send request and the ack, if my
understanding is correct.
Would these be enough, or are you thinking about another approach, maybe a
custom metric to cover the producer side?


> 7. We may end up introducing a way for sink tasks to record per-record
> delivery to the sink system (see KIP-767 [1]). I'd like it if we could keep
> the names of our metrics very precise in order to avoid confusing users
> (who may think that we're providing metrics on actual delivery to the sink
> system, which may not be the case if the connector performs asynchronous
> writes), and in order to leave room for a metrics on true delivery time by
> sink tasks. It'd also be nice if we could remain consistent with existing
> metrics such as "put-batch-avg-time-ms". With that in mind, what do you
> think about renaming these metrics:
> - "sink-record-batch-latency-max-ms" to "put-batch-max-latency-ms"
> - "sink-record-latency-max-ms" to "put-sink-record-latency-max-ms"
> - "sink-record-latency-avg-ms" to "put-sink-record-latency-avg-ms"
> - "sink-record-convert-transform-time-max-ms" to
> "convert-transform-sink-record-time-max-ms"
> - "sink-record-convert-transform-time-avg-ms" to
> "convert-transform-sink-record-time-avg-ms"
> - "source-record-transform-convert-time-max-ms" to
> "transform-convert-source-record-time-max-ms"
> - "source-record-transform-convert-time-avg-ms" to
> "transform-convert-source-record-time-avg-ms"
>

Makes sense, thanks! I have updated the list of metrics, grouping them by
sensor and applying these suggestions.
The only ones I still want to review are the sink-record-* to put-batch-*
renames (the first 3). I'm not sure put-batch/put-sink-record describes the
purpose of the metric; to be honest, neither does `sink-record-latency`.
My initial thought was to have something like Kafka Streams' e2e-latency.
Based on questions 4. and 6., an idea could be to add:
- source-batch-e2e-latency-before-send: wallclock - source record timestamp,
measured after the source connector poll.
- source-batch-e2e-latency-after-send: wallclock - record timestamp, measured
in the producer send callback.
- sink-batch-e2e-latency-before-put: wallclock - record timestamp, measured
after the consumer poll.
- sink-batch-e2e-latency-after-put: wallclock - record timestamp, measured
after the sink connector put.
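These batch-level values could be derived cheaply by tracking only the oldest record timestamp per batch, since a Max stat only needs the worst case. A minimal sketch (hypothetical helper, not KIP wording):

```java
import java.util.List;

public class BatchE2eLatencySketch {

    // e2e latency for a batch, measured against the oldest record timestamp in
    // the batch; recording this single value per batch is enough to feed a Max
    // stat, avoiding one Sensor::record call per record.
    static long batchE2eLatencyMs(List<Long> recordTimestampsMs, long nowMs) {
        if (recordTimestampsMs.isEmpty()) {
            return 0L;
        }
        long oldest = Long.MAX_VALUE;
        for (long ts : recordTimestampsMs) {
            oldest = Math.min(oldest, ts);
        }
        // clamp at zero in case of clock skew
        return Math.max(0L, nowMs - oldest);
    }

    public static void main(String[] args) {
        System.out.println(batchE2eLatencyMs(List.of(100L, 50L, 80L), 200L)); // 150
    }
}
```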


> Thanks again for the KIP! Looking forward to your thoughts.
>
> Cheers,
>
> Chris
>
> [1] -
>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-767%3A+Connect+Latency+Metrics
>
> On Thu, Sep 15, 2022 at 1:32 PM Jorge Esteban Quilcate Otoya <
> quilcate.jorge@gmail.com> wrote:
>
> > Hi everyone,
> >
> > I've made a slight addition to the KIP based on Yash feedback:
> >
> > - A new metric is added at INFO level to record the max latency from the
> > batch timestamp, by keeping the oldest record timestamp per batch.
> > - A draft implementation is linked.
> >
> > Looking forward to your feedback.
> > Also, a kindly reminder that the vote thread is open.
> >
> > Thanks!
> > Jorge.
> >
> > On Thu, 8 Sept 2022 at 14:25, Jorge Esteban Quilcate Otoya <
> > quilcate.jorge@gmail.com> wrote:
> >
> > > Great. I have updated the KIP to reflect this.
> > >
> > > Cheers,
> > > Jorge.
> > >
> > > On Thu, 8 Sept 2022 at 12:26, Yash Mayya <ya...@gmail.com> wrote:
> > >
> > >> Thanks, I think it makes sense to define these metrics at a DEBUG
> > >> recording
> > >> level.
> > >>
> > >> On Thu, Sep 8, 2022 at 2:51 PM Jorge Esteban Quilcate Otoya <
> > >> quilcate.jorge@gmail.com> wrote:
> > >>
> > >> > On Thu, 8 Sept 2022 at 05:55, Yash Mayya <ya...@gmail.com>
> > wrote:
> > >> >
> > >> > > Hi Jorge,
> > >> > >
> > >> > > Thanks for the changes. With regard to having per batch vs per
> > record
> > >> > > metrics, the additional overhead I was referring to wasn't about
> > >> whether
> > >> > or
> > >> > > not we would need to iterate over all the records in a batch. I
> was
> > >> > > referring to the potential additional overhead caused by the
> higher
> > >> > volume
> > >> > > of calls to Sensor::record on the sensors for the new metrics (as
> > >> > compared
> > >> > > to the existing batch only metrics), especially for high
> throughput
> > >> > > connectors where batch sizes could be large. I guess we may want
> to
> > do
> > >> > some
> > >> > > sort of performance testing and get concrete numbers to verify
> > whether
> > >> > this
> > >> > > is a valid concern or not?
> > >> > >
> > >> >
> > >> > 6.1. Got it, thanks for clarifying. I guess there could be a
> benchmark
> > >> test
> > >> > of the `Sensor::record` to get an idea of the performance impact.
> > >> > Regardless, the fact that these are single-record metrics compared
> to
> > >> > existing batch-only could be explicitly defined by setting these
> > >> metrics at
> > >> > a DEBUG or TRACE metric recording level, leaving the existing at
> INFO
> > >> > level.
> > >> > wdyt?
> > >> >
> > >> >
> > >> > >
> > >> > > Thanks,
> > >> > > Yash
> > >> > >
> > >> > > On Tue, Sep 6, 2022 at 4:42 PM Jorge Esteban Quilcate Otoya <
> > >> > > quilcate.jorge@gmail.com> wrote:
> > >> > >
> > >> > > > Hi Sagar and Yash,
> > >> > > >
> > >> > > > > the way it's defined in
> > >> > > > https://kafka.apache.org/documentation/#connect_monitoring for
> > the
> > >> > > metrics
> > >> > > >
> > >> > > > 4.1. Got it. Add it to the KIP.
> > >> > > >
> > >> > > > > The only thing I would argue is do we need
> > >> sink-record-latency-min?
> > >> > > Maybe
> > >> > > > we
> > >> > > > > could remove this min metric as well and make all of the 3 e2e
> > >> > metrics
> > >> > > > > consistent
> > >> > > >
> > >> > > > 4.2 I see. Will remove it from the KIP.
> > >> > > >
> > >> > > > > Probably users can track the metrics at their end to
> > >> > > > > figure that out. Do you think that makes sense?
> > >> > > >
> > >> > > > 4.3. Yes, agree. With these new metrics it should be easier for
> > >> users
> > >> > to
> > >> > > > track this.
> > >> > > >
> > >> > > > > I think it makes sense to not have a min metric for either to
> > >> remain
> > >> > > > > consistent with the existing put-batch and poll-batch metrics
> > >> > > >
> > >> > > > 5.1. Got it. Same as 4.2
> > >> > > >
> > >> > > > > Another naming related suggestion I had was with the
> > >> > > > > "convert-time" metrics - we should probably include
> > >> transformations
> > >> > in
> > >> > > > the
> > >> > > > > name since SMTs could definitely be attributable to a sizable
> > >> chunk
> > >> > of
> > >> > > > the
> > >> > > > > latency depending on the specific transformation chain.
> > >> > > >
> > >> > > > 5.2. Make sense. I'm proposing to add
> > >> > `sink-record-convert-transform...`
> > >> > > > and `source-record-transform-convert...` to represent correctly
> > the
> > >> > order
> > >> > > > of operations.
> > >> > > >
> > >> > > > > it seems like both source and sink tasks only record metrics
> at
> > a
> > >> > > "batch"
> > >> > > > > level, not on an individual record level. I think it might be
> > >> > > additional
> > >> > > > > overhead if we want to record these new metrics all at the
> > record
> > >> > > level?
> > >> > > >
> > >> > > > 5.3. I considered at the beginning to implement all metrics at
> the
> > >> > batch
> > >> > > > level, but given how the framework process records, I fallback
> to
> > >> the
> > >> > > > proposed approach:
> > >> > > > - Sink Task:
> > >> > > >   - `WorkerSinkTask#convertMessages(msgs)` already iterates over
> > >> > records,
> > >> > > > so there is no additional overhead to capture record latency per
> > >> > record.
> > >> > > >     -
> > >> > > >
> > >> > > >
> > >> > >
> > >> >
> > >>
> >
> https://github.com/apache/kafka/blob/9841647c4fe422532f448423c92d26e4fdcb8932/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/WorkerSinkTask.java#L490-L514
> > >> > > >   - `WorkerSinkTask#convertAndTransformRecord(record)` actually
> > >> happens
> > >> > > > individually. Measuring this operation per batch would include
> > >> > processing
> > >> > > > that is not strictly part of "convert and transform"
> > >> > > >     -
> > >> > > >
> > >> > > >
> > >> > >
> > >> >
> > >>
> >
> https://github.com/apache/kafka/blob/9841647c4fe422532f448423c92d26e4fdcb8932/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/WorkerSinkTask.java#L518
> > >> > > > - Source Task:
> > >> > > >   - `AbstractWorkerSourceTask#sendRecords` iterates over a batch
> > and
> > >> > > > applies transforms and convert record individually as well:
> > >> > > >     -
> > >> > > >
> > >> > > >
> > >> > >
> > >> >
> > >>
> >
> https://github.com/apache/kafka/blob/9841647c4fe422532f448423c92d26e4fdcb8932/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/AbstractWorkerSourceTask.java#L389-L390
> > >> > > >
> > >> > > > > This might require some additional changes -
> > >> > > > > for instance, with the "sink-record-latency" metric, we might
> > only
> > >> > want
> > >> > > > to
> > >> > > > > have a "max" metric since "avg" would require recording a
> value
> > on
> > >> > the
> > >> > > > > sensor for each record (whereas we can get a "max" by only
> > >> recording
> > >> > a
> > >> > > > > metric value for the oldest record in each batch).
> > >> > > >
> > >> > > > 5.4. Recording record-latency per batch may not be as useful as
> > >> there
> > >> > is
> > >> > > no
> > >> > > > guarantee that the oldest record will be representative of the
> > >> batch.
> > >> > > >
> > >> > > > On Sat, 3 Sept 2022 at 16:02, Yash Mayya <ya...@gmail.com>
> > >> wrote:
> > >> > > >
> > >> > > > > Hi Jorge and Sagar,
> > >> > > > >
> > >> > > > > I think it makes sense to not have a min metric for either to
> > >> remain
> > >> > > > > consistent with the existing put-batch and poll-batch metrics
> > (it
> > >> > > doesn't
> > >> > > > > seem particularly useful either anyway). Also, the new
> > >> > > > > "sink-record-latency" metric name looks fine to me, thanks for
> > >> making
> > >> > > the
> > >> > > > > changes! Another naming related suggestion I had was with the
> > >> > > > > "convert-time" metrics - we should probably include
> > >> transformations
> > >> > in
> > >> > > > the
> > >> > > > > name since SMTs could definitely be attributable to a sizable
> > >> chunk
> > >> > of
> > >> > > > the
> > >> > > > > latency depending on the specific transformation chain.
> > >> > > > >
> > >> > > > > I have one high level question with respect to implementation
> -
> > >> > > > currently,
> > >> > > > > it seems like both source and sink tasks only record metrics
> at
> > a
> > >> > > "batch"
> > >> > > > > level, not on an individual record level. I think it might be
> > >> > > additional
> > >> > > > > overhead if we want to record these new metrics all at the
> > record
> > >> > > level?
> > >> > > > > Could we instead make all of these new metrics for batches of
> > >> records
> > >> > > > > rather than individual records in order to remain consistent
> > with
> > >> the
> > >> > > > > existing task level metrics? This might require some
> additional
> > >> > > changes -
> > >> > > > > for instance, with the "sink-record-latency" metric, we might
> > only
> > >> > want
> > >> > > > to
> > >> > > > > have a "max" metric since "avg" would require recording a
> value
> > on
> > >> > the
> > >> > > > > sensor for each record (whereas we can get a "max" by only
> > >> recording
> > >> > a
> > >> > > > > metric value for the oldest record in each batch).
> > >> > > > >
> > >> > > > > Thanks,
> > >> > > > > Yash
> > >> > > > >
> > >> > > > > On Fri, Sep 2, 2022 at 3:16 PM Sagar <
> sagarmeansocean@gmail.com
> > >
> > >> > > wrote:
> > >> > > > >
> > >> > > > > > Hi Jorge,
> > >> > > > > >
> > >> > > > > > Thanks for the changes.
> > >> > > > > >
> > >> > > > > > Regarding the metrics, I meant something like this:
> > >> > > > > >
> > >> > > > >
> > >> > > >
> > >> > >
> > >> >
> > >>
> >
> kafka.connect:type=sink-task-metrics,connector="{connector}",task="{task}"
> > >> > > > > >
> > >> > > > > > the way it's defined in
> > >> > > > > > https://kafka.apache.org/documentation/#connect_monitoring
> > for
> > >> the
> > >> > > > > > metrics.
> > >> > > > > >
> > >> > > > > > I see what you mean by the 3 metrics and how it can be
> > >> interpreted.
> > >> > > The
> > >> > > > > > only thing I would argue is do we need
> > sink-record-latency-min?
> > >> > Maybe
> > >> > > > we
> > >> > > > > > could remove this min metric as well and make all of the 3
> e2e
> > >> > > metrics
> > >> > > > > > consistent(since put-batch also doesn't expose a min which
> > makes
> > >> > > sense
> > >> > > > to
> > >> > > > > > me). I think this is in contrast to what Yash pointed out
> > above
> > >> so
> > >> > I
> > >> > > > > would
> > >> > > > > > like to hear his thoughts as well.
> > >> > > > > >
> > >> > > > > > The other point Yash mentioned about the slightly flawed
> > >> definition
> > >> > > of
> > >> > > > > e2e
> > >> > > > > > is also true in a sense. But I have a feeling that's one the
> > >> > records
> > >> > > > are
> > >> > > > > > polled by the connector tasks, it would be difficult to
> track
> > >> the
> > >> > > final
> > >> > > > > leg
> > >> > > > > > via the framework. Probably users can track the metrics at
> > their
> > >> > end
> > >> > > to
> > >> > > > > > figure that out. Do you think that makes sense?
> > >> > > > > >
> > >> > > > > > Thanks!
> > >> > > > > > Sagar.
> > >> > > > > >
> > >> > > > > >
> > >> > > > > >
> > >> > > > > >
> > >> > > > > > On Thu, Sep 1, 2022 at 11:40 PM Jorge Esteban Quilcate
> Otoya <
> > >> > > > > > quilcate.jorge@gmail.com> wrote:
> > >> > > > > >
> > >> > > > > > > Hi Sagar and Yash,
> > >> > > > > > >
> > >> > > > > > > Thanks for your feedback!
> > >> > > > > > >
> > >> > > > > > > > 1) I am assuming the new metrics would be task level
> > metric.
> > >> > > > > > >
> > >> > > > > > > 1.1 Yes, it will be a task level metric, implemented on
> the
> > >> > > > > > > Worker[Source/Sink]Task.
> > >> > > > > > >
> > >> > > > > > > > Could you specify the way it's done for other
> sink/source
> > >> > > > connector?
> > >> > > > > > >
> > >> > > > > > > 1.2. Not sure what do you mean by this. Could you
> elaborate
> > a
> > >> bit
> > >> > > > more?
> > >> > > > > > >
> > >> > > > > > > > 2. I am slightly confused about the e2e latency
> metric...
> > >> > > > > > >
> > >> > > > > > > 2.1. Yes, I see. I was trying to bring a similar concept
> as
> > in
> > >> > > > Streams
> > >> > > > > > with
> > >> > > > > > > KIP-613, though the e2e concept may not be translatable.
> > >> > > > > > > We could keep it as `sink-record-latency` to avoid
> > conflating
> > >> > > > > concepts. A
> > >> > > > > > > similar metric naming was proposed in KIP-489 but at the
> > >> consumer
> > >> > > > > level —
> > >> > > > > > > though it seems dormant for a couple of years.
> > >> > > > > > >
> > >> > > > > > > > However, the put-batch time measures the
> > >> > > > > > > > time to put a batch of records to external sink. So, I
> > would
> > >> > > assume
> > >> > > > > > the 2
> > >> > > > > > > > can't be added as is to compute the e2e latency. Maybe I
> > am
> > >> > > missing
> > >> > > > > > > > something here. Could you plz clarify this.
> > >> > > > > > >
> > >> > > > > > > 2.2. Yes, agree. Not necessarily added, but with the 3
> > >> latencies
> > >> > > > (poll,
> > >> > > > > > > convert, putBatch) will be clearer where the bottleneck
> may
> > >> be,
> > >> > and
> > >> > > > > > > represent the internal processing.
> > >> > > > > > >
> > >> > > > > > > > however, as per the KIP it looks like it will be
> > >> > > > > > > > the latency between when the record was written to Kafka
> > and
> > >> > when
> > >> > > > the
> > >> > > > > > > > record is returned by a sink task's consumer's poll?
> > >> > > > > > >
> > >> > > > > > > 3.1. Agree. 2.1. could help to clarify this.
> > >> > > > > > >
> > >> > > > > > > > One more thing - I was wondering
> > >> > > > > > > > if there's a particular reason for having a min metric
> for
> > >> e2e
> > >> > > > > latency
> > >> > > > > > > but
> > >> > > > > > > > not for convert time?
> > >> > > > > > >
> > >> > > > > > > 3.2. Was following KIP-613 for e2e which seems useful to
> > >> compare
> > >> > > with
> > >> > > > > > Max a
> > >> > > > > > > get an idea of the window of results, though current
> > >> latencies in
> > >> > > > > > Connector
> > >> > > > > > > do not include Min, and that's why I haven't added it for
> > >> convert
> > >> > > > > > latency.
> > >> > > > > > > Do you think it make sense to extend latency metrics with
> > Min?
> > >> > > > > > >
> > >> > > > > > > KIP is updated to clarify some of these changes.
> > >> > > > > > >
> > >> > > > > > > Many thanks,
> > >> > > > > > > Jorge.
> > >> > > > > > >
> > >> > > > > > > On Thu, 1 Sept 2022 at 18:11, Yash Mayya <
> > >> yash.mayya@gmail.com>
> > >> > > > wrote:
> > >> > > > > > >
> > >> > > > > > > > Hi Jorge,
> > >> > > > > > > >
> > >> > > > > > > > Thanks for the KIP! I have the same confusion with the
> > >> > > e2e-latency
> > >> > > > > > > metrics
> > >> > > > > > > > as Sagar above. "e2e" would seem to indicate the latency
> > >> > between
> > >> > > > when
> > >> > > > > > the
> > >> > > > > > > > record was written to Kafka and when the record was
> > written
> > >> to
> > >> > > the
> > >> > > > > sink
> > >> > > > > > > > system by the connector - however, as per the KIP it
> looks
> > >> like
> > >> > > it
> > >> > > > > will
> > >> > > > > > > be
> > >> > > > > > > > the latency between when the record was written to Kafka
> > and
> > >> > when
> > >> > > > the
> > >> > > > > > > > record is returned by a sink task's consumer's poll? I
> > think
> > >> > that
> > >> > > > > > metric
> > >> > > > > > > > will be a little confusing to interpret. One more thing
> -
> > I
> > >> was
> > >> > > > > > wondering
> > >> > > > > > > > if there's a particular reason for having a min metric
> for
> > >> e2e
> > >> > > > > latency
> > >> > > > > > > but
> > >> > > > > > > > not for convert time?
> > >> > > > > > > >
> > >> > > > > > > > Thanks,
> > >> > > > > > > > Yash
> > >> > > > > > > >
> > >> > > > > > > > On Thu, Sep 1, 2022 at 8:59 PM Sagar <
> > >> > sagarmeansocean@gmail.com>
> > >> > > > > > wrote:
> > >> > > > > > > >
> > >> > > > > > > > > Hi Jorge,
> > >> > > > > > > > >
> > >> > > > > > > > > Thanks for the KIP. It looks like a very good
> addition.
> > I
> > >> > > skimmed
> > >> > > > > > > through
> > >> > > > > > > > > once and had a couple of questions =>
> > >> > > > > > > > >
> > >> > > > > > > > > 1) I am assuming the new metrics would be task level
> > >> metric.
> > >> > > > Could
> > >> > > > > > you
> > >> > > > > > > > > specify the way it's done for other sink/source
> > connector?
> > >> > > > > > > > > 2) I am slightly confused about the e2e latency
> metric.
> > >> Let's
> > >> > > > > > consider
> > >> > > > > > > > the
> > >> > > > > > > > > sink connector metric. If I look at the way it's
> > supposed
> > >> to
> > >> > be
> > >> > > > > > > > calculated,
> > >> > > > > > > > > i.e the difference between the record timestamp and
> the
> > >> wall
> > >> > > > clock
> > >> > > > > > > time,
> > >> > > > > > > > it
> > >> > > > > > > > > looks like a per record metric. However, the put-batch
> > >> time
> > >> > > > > measures
> > >> > > > > > > the
> > >> > > > > > > > > time to put a batch of records to external sink. So, I
> > >> would
> > >> > > > assume
> > >> > > > > > > the 2
> > >> > > > > > > > > can't be added as is to compute the e2e latency.
> Maybe I
> > >> am
> > >> > > > missing
> > >> > > > > > > > > something here. Could you plz clarify this.
> > >> > > > > > > > >
> > >> > > > > > > > > Thanks!
> > >> > > > > > > > > Sagar.
> > >> > > > > > > > >
> > >> > > > > > > > > On Tue, Aug 30, 2022 at 8:43 PM Jorge Esteban Quilcate
> > >> Otoya
> > >> > <
> > >> > > > > > > > > quilcate.jorge@gmail.com> wrote:
> > >> > > > > > > > >
> > >> > > > > > > > > > Hi all,
> > >> > > > > > > > > >
> > >> > > > > > > > > > I'd like to start a discussion thread on KIP-864:
> Add
> > >> > > > End-To-End
> > >> > > > > > > > Latency
> > >> > > > > > > > > > Metrics to Connectors.
> > >> > > > > > > > > > This KIP aims to improve the metrics available on
> > Source
> > >> > and
> > >> > > > Sink
> > >> > > > > > > > > > Connectors to measure end-to-end latency, including
> > >> source
> > >> > > and
> > >> > > > > sink
> > >> > > > > > > > > record
> > >> > > > > > > > > > conversion time, and sink record e2e latency
> (similar
> > to
> > >> > > > KIP-613
> > >> > > > > > for
> > >> > > > > > > > > > Streams).
> > >> > > > > > > > > >
> > >> > > > > > > > > > The KIP is here:
> > >> > > > > > > > > >
> > >> > > > > > > > > >
> > >> > > > > > > > >
> > >> > > > > > > >
> > >> > > > > > >
> > >> > > > > >
> > >> > > > >
> > >> > > >
> > >> > >
> > >> >
> > >>
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-864%3A+Add+End-To-End+Latency+Metrics+to+Connectors
> > >> > > > > > > > > >
> > >> > > > > > > > > > Please take a look and let me know what you think.
> > >> > > > > > > > > >
> > >> > > > > > > > > > Cheers,
> > >> > > > > > > > > > Jorge.
> > >> > > > > > > > > >
> > >> > > > > > > > >
> > >> > > > > > > >
> > >> > > > > > >
> > >> > > > > >
> > >> > > > >
> > >> > > >
> > >> > >
> > >> >
> > >>
> > >
> >
>

Re: [DISCUSS] KIP-864: Add End-To-End Latency Metrics to Connectors

Posted by Chris Egerton <ch...@aiven.io.INVALID>.
Hi Jorge,

Thanks for the KIP. I agree with the overall direction and think this would
be a nice improvement to Kafka Connect. Here are my initial thoughts on the
details:

1. The motivation section outlines the gaps in Kafka Connect's task metrics
nicely. I think it'd be useful to include more concrete details on why
these gaps need to be filled in, and in which cases additional metrics
would be helpful. One goal could be to provide enhanced monitoring of
production deployments that allows for cluster administrators to set up
automatic alerts for latency spikes and, if triggered, quickly identify the
root cause of those alerts, reducing the time to remediation. Another goal
could be to provide more insight to developers or cluster administrators
who want to do performance testing on connectors in non-production
environments. It may help guide our decision making process to have a
clearer picture of the goals we're trying to achieve.
2. If we're trying to address the alert-and-diagnose use case, it'd be
useful to have as much information as possible at INFO level, rather than
forcing cluster administrators to possibly reconfigure a connector to emit
DEBUG or TRACE level metrics in order to diagnose a potential
production-impacting performance bottleneck. I can see the rationale for
emitting per-record metrics that track an average value at DEBUG level, but
for per-record metrics that track a maximum value, is there any reason not
to provide this information at INFO level?
3. I'm also curious about the performance testing suggested by Yash to
gauge the potential impact of this change. Have you been able to do any
testing with your draft implementation yet?
4. Just to make sure I understand correctly--does "time when it has been
received by the Sink task" refer to the wallclock time directly after a
call to SinkTask::put has been completed (as opposed to directly before
that call is made, or something else entirely)?
5. If the goal is to identify performance bottlenecks (either in production
or pre-production environments), would it make sense to introduce metrics
for each individual converter (i.e., key/value/header) and transformation?
It's definitely an improvement to be able to identify the total time for
conversion and transformation, but then the immediate follow-up question if
a bottleneck is found in that phase is "which converter/transformation is
responsible?" It'd be nice if we could provide a way to quickly answer that
question.
6. Any thoughts about offering latency metrics for source tasks between
receipt of the record from the task and delivery of the record to Kafka
(which would be tracked by producer callback)? We could also use the record
timestamp either instead of or in addition to receipt time if the task
provides a timestamp with its records.
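For what it's worth, a minimal sketch of what point 6 could look like — stamping a source record when the task hands it over and recording the delta when the producer acknowledges it. All names here are illustrative assumptions, not actual Connect or producer APIs:

```java
import java.util.concurrent.atomic.AtomicLong;

// Illustrative sketch only (not Connect's implementation): stamp each source
// record when the task hands it to the worker, then record the latency when
// the producer acknowledges it. In a real worker, onAck would be invoked from
// the producer callback; here both timestamps are passed in for clarity.
class SourceProduceLatency {
    private final AtomicLong maxLatencyMs = new AtomicLong();

    void onAck(long handedOverAtMs, long ackedAtMs) {
        long latency = ackedAtMs - handedOverAtMs;
        // Keep only the max, so a single gauge-style value can be exposed.
        maxLatencyMs.accumulateAndGet(latency, Math::max);
    }

    long maxLatencyMs() {
        return maxLatencyMs.get();
    }
}
```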
7. We may end up introducing a way for sink tasks to record per-record
delivery to the sink system (see KIP-767 [1]). I'd like it if we could keep
the names of our metrics very precise in order to avoid confusing users
(who may think that we're providing metrics on actual delivery to the sink
system, which may not be the case if the connector performs asynchronous
writes), and in order to leave room for metrics on true delivery time by
sink tasks. It'd also be nice if we could remain consistent with existing
metrics such as "put-batch-avg-time-ms". With that in mind, what do you
think about renaming these metrics:
- "sink-record-batch-latency-max-ms" to "put-batch-avg-latency-ms"
- "sink-record-latency-max-ms" to "put-sink-record-latency-max-ms"
- "sink-record-latency-avg-ms" to "put-sink-record-latency-avg-ms"
- "sink-record-convert-transform-time-max-ms" to
"convert-transform-sink-record-time-max-ms"
- "sink-record-convert-transform-time-avg-ms" to
"convert-transform-sink-record-time-avg-ms"
- "source-record-transform-convert-time-max-ms" to
"transform-convert-source-record-time-max-ms"
- "source-record-transform-convert-time-avg-ms" to
"transform-convert-source-record-time-avg-ms"

Thanks again for the KIP! Looking forward to your thoughts.

Cheers,

Chris

[1] -
https://cwiki.apache.org/confluence/display/KAFKA/KIP-767%3A+Connect+Latency+Metrics

On Thu, Sep 15, 2022 at 1:32 PM Jorge Esteban Quilcate Otoya <
quilcate.jorge@gmail.com> wrote:

> Hi everyone,
>
> I've made a slight addition to the KIP based on Yash feedback:
>
> - A new metric is added at INFO level to record the max latency from the
> batch timestamp, by keeping the oldest record timestamp per batch.
> - A draft implementation is linked.
>
> Looking forward to your feedback.
> Also, a kind reminder that the vote thread is open.
>
> Thanks!
> Jorge.
>
> On Thu, 8 Sept 2022 at 14:25, Jorge Esteban Quilcate Otoya <
> quilcate.jorge@gmail.com> wrote:
>
> > Great. I have updated the KIP to reflect this.
> >
> > Cheers,
> > Jorge.
> >
> > On Thu, 8 Sept 2022 at 12:26, Yash Mayya <ya...@gmail.com> wrote:
> >
> >> Thanks, I think it makes sense to define these metrics at a DEBUG
> >> recording
> >> level.
> >>
> >> On Thu, Sep 8, 2022 at 2:51 PM Jorge Esteban Quilcate Otoya <
> >> quilcate.jorge@gmail.com> wrote:
> >>
> >> > On Thu, 8 Sept 2022 at 05:55, Yash Mayya <ya...@gmail.com>
> wrote:
> >> >
> >> > > Hi Jorge,
> >> > >
> >> > > Thanks for the changes. With regard to having per batch vs per
> record
> >> > > metrics, the additional overhead I was referring to wasn't about
> >> whether
> >> > or
> >> > > not we would need to iterate over all the records in a batch. I was
> >> > > referring to the potential additional overhead caused by the higher
> >> > volume
> >> > > of calls to Sensor::record on the sensors for the new metrics (as
> >> > compared
> >> > > to the existing batch only metrics), especially for high throughput
> >> > > connectors where batch sizes could be large. I guess we may want to
> do
> >> > some
> >> > > sort of performance testing and get concrete numbers to verify
> whether
> >> > this
> >> > > is a valid concern or not?
> >> > >
> >> >
> >> > 6.1. Got it, thanks for clarifying. I guess there could be a benchmark
> >> test
> >> > of the `Sensor::record` to get an idea of the performance impact.
> >> > Regardless, the fact that these are single-record metrics, compared to
> >> > the existing batch-only ones, could be made explicit by setting these
> >> > metrics at a DEBUG or TRACE metric recording level, leaving the
> >> > existing ones at INFO level.
> >> > wdyt?
> >> >
> >> >
> >> > >
> >> > > Thanks,
> >> > > Yash
> >> > >
> >> > > On Tue, Sep 6, 2022 at 4:42 PM Jorge Esteban Quilcate Otoya <
> >> > > quilcate.jorge@gmail.com> wrote:
> >> > >
> >> > > > Hi Sagar and Yash,
> >> > > >
> >> > > > > the way it's defined in
> >> > > > https://kafka.apache.org/documentation/#connect_monitoring for
> the
> >> > > metrics
> >> > > >
> >> > > > 4.1. Got it. Added it to the KIP.
> >> > > >
> >> > > > > The only thing I would argue is do we need
> >> sink-record-latency-min?
> >> > > Maybe
> >> > > > we
> >> > > > > could remove this min metric as well and make all of the 3 e2e
> >> > metrics
> >> > > > > consistent
> >> > > >
> >> > > > 4.2 I see. Will remove it from the KIP.
> >> > > >
> >> > > > > Probably users can track the metrics at their end to
> >> > > > > figure that out. Do you think that makes sense?
> >> > > >
> >> > > > 4.3. Yes, agree. With these new metrics it should be easier for
> >> users
> >> > to
> >> > > > track this.
> >> > > >
> >> > > > > I think it makes sense to not have a min metric for either to
> >> remain
> >> > > > > consistent with the existing put-batch and poll-batch metrics
> >> > > >
> >> > > > 5.1. Got it. Same as 4.2
> >> > > >
> >> > > > > Another naming related suggestion I had was with the
> >> > > > > "convert-time" metrics - we should probably include
> >> transformations
> >> > in
> >> > > > the
> >> > > > > name since SMTs could definitely be attributable to a sizable
> >> chunk
> >> > of
> >> > > > the
> >> > > > > latency depending on the specific transformation chain.
> >> > > >
> >> > > > 5.2. Makes sense. I'm proposing to add
> >> > `sink-record-convert-transform...`
> >> > > > and `source-record-transform-convert...` to represent correctly
> the
> >> > order
> >> > > > of operations.
> >> > > >
> >> > > > > it seems like both source and sink tasks only record metrics at
> a
> >> > > "batch"
> >> > > > > level, not on an individual record level. I think it might be
> >> > > additional
> >> > > > > overhead if we want to record these new metrics all at the
> record
> >> > > level?
> >> > > >
> >> > > > 5.3. I initially considered implementing all metrics at the batch
> >> > > > level, but given how the framework processes records, I fell back to
> >> > > > the proposed approach:
> >> > > > - Sink Task:
> >> > > >   - `WorkerSinkTask#convertMessages(msgs)` already iterates over
> >> > records,
> >> > > > so there is no additional overhead to capture record latency per
> >> > record.
> >> > > >     -
> >> > > >
> >> > > >
> >> > >
> >> >
> >>
> https://github.com/apache/kafka/blob/9841647c4fe422532f448423c92d26e4fdcb8932/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/WorkerSinkTask.java#L490-L514
> >> > > >   - `WorkerSinkTask#convertAndTransformRecord(record)` actually
> >> happens
> >> > > > individually. Measuring this operation per batch would include
> >> > processing
> >> > > > that is not strictly part of "convert and transform"
> >> > > >     -
> >> > > >
> >> > > >
> >> > >
> >> >
> >>
> https://github.com/apache/kafka/blob/9841647c4fe422532f448423c92d26e4fdcb8932/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/WorkerSinkTask.java#L518
> >> > > > - Source Task:
> >> > > >   - `AbstractWorkerSourceTask#sendRecords` iterates over a batch
> and
> >> > > > applies transforms and convert record individually as well:
> >> > > >     -
> >> > > >
> >> > > >
> >> > >
> >> >
> >>
> https://github.com/apache/kafka/blob/9841647c4fe422532f448423c92d26e4fdcb8932/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/AbstractWorkerSourceTask.java#L389-L390
> >> > > >
> >> > > > > This might require some additional changes -
> >> > > > > for instance, with the "sink-record-latency" metric, we might
> only
> >> > want
> >> > > > to
> >> > > > > have a "max" metric since "avg" would require recording a value
> on
> >> > the
> >> > > > > sensor for each record (whereas we can get a "max" by only
> >> recording
> >> > a
> >> > > > > metric value for the oldest record in each batch).
> >> > > >
> >> > > > 5.4. Recording record-latency per batch may not be as useful as
> >> there
> >> > is
> >> > > no
> >> > > > guarantee that the oldest record will be representative of the
> >> batch.
> >> > > >
> >> > > > On Sat, 3 Sept 2022 at 16:02, Yash Mayya <ya...@gmail.com>
> >> wrote:
> >> > > >
> >> > > > > Hi Jorge and Sagar,
> >> > > > >
> >> > > > > I think it makes sense to not have a min metric for either to
> >> remain
> >> > > > > consistent with the existing put-batch and poll-batch metrics
> (it
> >> > > doesn't
> >> > > > > seem particularly useful either anyway). Also, the new
> >> > > > > "sink-record-latency" metric name looks fine to me, thanks for
> >> making
> >> > > the
> >> > > > > changes! Another naming related suggestion I had was with the
> >> > > > > "convert-time" metrics - we should probably include
> >> transformations
> >> > in
> >> > > > the
> >> > > > > name since SMTs could definitely be attributable to a sizable
> >> chunk
> >> > of
> >> > > > the
> >> > > > > latency depending on the specific transformation chain.
> >> > > > >
> >> > > > > I have one high level question with respect to implementation -
> >> > > > currently,
> >> > > > > it seems like both source and sink tasks only record metrics at
> a
> >> > > "batch"
> >> > > > > level, not on an individual record level. I think it might be
> >> > > additional
> >> > > > > overhead if we want to record these new metrics all at the
> record
> >> > > level?
> >> > > > > Could we instead make all of these new metrics for batches of
> >> records
> >> > > > > rather than individual records in order to remain consistent
> with
> >> the
> >> > > > > existing task level metrics? This might require some additional
> >> > > changes -
> >> > > > > for instance, with the "sink-record-latency" metric, we might
> only
> >> > want
> >> > > > to
> >> > > > > have a "max" metric since "avg" would require recording a value
> on
> >> > the
> >> > > > > sensor for each record (whereas we can get a "max" by only
> >> recording
> >> > a
> >> > > > > metric value for the oldest record in each batch).
> >> > > > >
> >> > > > > Thanks,
> >> > > > > Yash
> >> > > > >
> >> > > > > On Fri, Sep 2, 2022 at 3:16 PM Sagar <sagarmeansocean@gmail.com
> >
> >> > > wrote:
> >> > > > >
> >> > > > > > Hi Jorge,
> >> > > > > >
> >> > > > > > Thanks for the changes.
> >> > > > > >
> >> > > > > > Regarding the metrics, I meant something like this:
> >> > > > > >
> >> > > > >
> >> > > >
> >> > >
> >> >
> >>
> kafka.connect:type=sink-task-metrics,connector="{connector}",task="{task}"
> >> > > > > >
> >> > > > > > the way it's defined in
> >> > > > > > https://kafka.apache.org/documentation/#connect_monitoring
> for
> >> the
> >> > > > > > metrics.
> >> > > > > >
> >> > > > > > I see what you mean by the 3 metrics and how it can be
> >> interpreted.
> >> > > The
> >> > > > > > only thing I would argue is do we need
> sink-record-latency-min?
> >> > Maybe
> >> > > > we
> >> > > > > > could remove this min metric as well and make all of the 3 e2e
> >> > > metrics
> >> > > > > > consistent (since put-batch also doesn't expose a min, which makes
> makes
> >> > > sense
> >> > > > to
> >> > > > > > me). I think this is in contrast to what Yash pointed out
> above
> >> so
> >> > I
> >> > > > > would
> >> > > > > > like to hear his thoughts as well.
> >> > > > > >
> >> > > > > > The other point Yash mentioned about the slightly flawed
> >> definition
> >> > > of
> >> > > > > e2e
> >> > > > > > is also true in a sense. But I have a feeling that once the
> >> > > > > > records are
> >> > > > > > polled by the connector tasks, it would be difficult to track
> >> the
> >> > > final
> >> > > > > leg
> >> > > > > > via the framework. Probably users can track the metrics at
> their
> >> > end
> >> > > to
> >> > > > > > figure that out. Do you think that makes sense?
> >> > > > > >
> >> > > > > > Thanks!
> >> > > > > > Sagar.
> >> > > > > >
> >> > > > > >
> >> > > > > >
> >> > > > > >
> >> > > > > > On Thu, Sep 1, 2022 at 11:40 PM Jorge Esteban Quilcate Otoya <
> >> > > > > > quilcate.jorge@gmail.com> wrote:
> >> > > > > >
> >> > > > > > > Hi Sagar and Yash,
> >> > > > > > >
> >> > > > > > > Thanks for your feedback!
> >> > > > > > >
> >> > > > > > > > 1) I am assuming the new metrics would be task level
> metric.
> >> > > > > > >
> >> > > > > > > 1.1 Yes, it will be a task level metric, implemented on the
> >> > > > > > > Worker[Source/Sink]Task.
> >> > > > > > >
> >> > > > > > > > Could you specify the way it's done for other sink/source
> >> > > > connector?
> >> > > > > > >
> >> > > > > > > 1.2. Not sure what you mean by this. Could you elaborate a
> a
> >> bit
> >> > > > more?
> >> > > > > > >
> >> > > > > > > > 2. I am slightly confused about the e2e latency metric...
> >> > > > > > >
> >> > > > > > > 2.1. Yes, I see. I was trying to bring a similar concept as
> in
> >> > > > Streams
> >> > > > > > with
> >> > > > > > > KIP-613, though the e2e concept may not be translatable.
> >> > > > > > > We could keep it as `sink-record-latency` to avoid
> conflating
> >> > > > > concepts. A
> >> > > > > > > similar metric naming was proposed in KIP-489 but at the
> >> consumer
> >> > > > > level —
> >> > > > > > > though it seems dormant for a couple of years.
> >> > > > > > >
> >> > > > > > > > However, the put-batch time measures the
> >> > > > > > > > time to put a batch of records to external sink. So, I
> would
> >> > > assume
> >> > > > > > the 2
> >> > > > > > > > can't be added as is to compute the e2e latency. Maybe I
> am
> >> > > missing
> >> > > > > > > > something here. Could you plz clarify this.
> >> > > > > > >
> >> > > > > > > 2.2. Yes, agree. Not necessarily added, but with the 3 latencies
> >> > > > > > > (poll, convert, putBatch) it will be clearer where the bottleneck
> >> > > > > > > may be, and they will represent the internal processing.
> >> > > > > > >
> >> > > > > > > > however, as per the KIP it looks like it will be
> >> > > > > > > > the latency between when the record was written to Kafka
> and
> >> > when
> >> > > > the
> >> > > > > > > > record is returned by a sink task's consumer's poll?
> >> > > > > > >
> >> > > > > > > 3.1. Agree. 2.1. could help to clarify this.
> >> > > > > > >
> >> > > > > > > > One more thing - I was wondering
> >> > > > > > > > if there's a particular reason for having a min metric for
> >> e2e
> >> > > > > latency
> >> > > > > > > but
> >> > > > > > > > not for convert time?
> >> > > > > > >
> >> > > > > > > 3.2. I was following KIP-613 for e2e, where Min seems useful to
> >> > > > > > > compare with Max and get an idea of the window of results, though
> >> > > > > > > current latencies in Connector do not include Min, and that's why
> >> > > > > > > I haven't added it for convert latency.
> >> > > > > > > Do you think it makes sense to extend latency metrics with Min?
> Min?
> >> > > > > > >
> >> > > > > > > KIP is updated to clarify some of these changes.
> >> > > > > > >
> >> > > > > > > Many thanks,
> >> > > > > > > Jorge.
> >> > > > > > >
> >> > > > > > > On Thu, 1 Sept 2022 at 18:11, Yash Mayya <
> >> yash.mayya@gmail.com>
> >> > > > wrote:
> >> > > > > > >
> >> > > > > > > > Hi Jorge,
> >> > > > > > > >
> >> > > > > > > > Thanks for the KIP! I have the same confusion with the
> >> > > e2e-latency
> >> > > > > > > metrics
> >> > > > > > > > as Sagar above. "e2e" would seem to indicate the latency
> >> > between
> >> > > > when
> >> > > > > > the
> >> > > > > > > > record was written to Kafka and when the record was
> written
> >> to
> >> > > the
> >> > > > > sink
> >> > > > > > > > system by the connector - however, as per the KIP it looks
> >> like
> >> > > it
> >> > > > > will
> >> > > > > > > be
> >> > > > > > > > the latency between when the record was written to Kafka
> and
> >> > when
> >> > > > the
> >> > > > > > > > record is returned by a sink task's consumer's poll? I
> think
> >> > that
> >> > > > > > metric
> >> > > > > > > > will be a little confusing to interpret. One more thing -
> I
> >> was
> >> > > > > > wondering
> >> > > > > > > > if there's a particular reason for having a min metric for
> >> e2e
> >> > > > > latency
> >> > > > > > > but
> >> > > > > > > > not for convert time?
> >> > > > > > > >
> >> > > > > > > > Thanks,
> >> > > > > > > > Yash
> >> > > > > > > >
> >> > > > > > > > On Thu, Sep 1, 2022 at 8:59 PM Sagar <
> >> > sagarmeansocean@gmail.com>
> >> > > > > > wrote:
> >> > > > > > > >
> >> > > > > > > > > Hi Jorge,
> >> > > > > > > > >
> >> > > > > > > > > Thanks for the KIP. It looks like a very good addition.
> I
> >> > > skimmed
> >> > > > > > > through
> >> > > > > > > > > once and had a couple of questions =>
> >> > > > > > > > >
> >> > > > > > > > > 1) I am assuming the new metrics would be task level
> >> metric.
> >> > > > Could
> >> > > > > > you
> >> > > > > > > > > specify the way it's done for other sink/source
> connector?
> >> > > > > > > > > 2) I am slightly confused about the e2e latency metric.
> >> Let's
> >> > > > > > consider
> >> > > > > > > > the
> >> > > > > > > > > sink connector metric. If I look at the way it's
> supposed
> >> to
> >> > be
> >> > > > > > > > calculated,
> >> > > > > > > > > i.e the difference between the record timestamp and the
> >> wall
> >> > > > clock
> >> > > > > > > time,
> >> > > > > > > > it
> >> > > > > > > > > looks like a per record metric. However, the put-batch
> >> time
> >> > > > > measures
> >> > > > > > > the
> >> > > > > > > > > time to put a batch of records to external sink. So, I
> >> would
> >> > > > assume
> >> > > > > > > the 2
> >> > > > > > > > > can't be added as is to compute the e2e latency. Maybe I
> >> am
> >> > > > missing
> >> > > > > > > > > something here. Could you plz clarify this.
> >> > > > > > > > >
> >> > > > > > > > > Thanks!
> >> > > > > > > > > Sagar.
> >> > > > > > > > >
> >> > > > > > > > > On Tue, Aug 30, 2022 at 8:43 PM Jorge Esteban Quilcate
> >> Otoya
> >> > <
> >> > > > > > > > > quilcate.jorge@gmail.com> wrote:
> >> > > > > > > > >
> >> > > > > > > > > > Hi all,
> >> > > > > > > > > >
> >> > > > > > > > > > I'd like to start a discussion thread on KIP-864: Add
> >> > > > End-To-End
> >> > > > > > > > Latency
> >> > > > > > > > > > Metrics to Connectors.
> >> > > > > > > > > > This KIP aims to improve the metrics available on
> Source
> >> > and
> >> > > > Sink
> >> > > > > > > > > > Connectors to measure end-to-end latency, including
> >> source
> >> > > and
> >> > > > > sink
> >> > > > > > > > > record
> >> > > > > > > > > > conversion time, and sink record e2e latency (similar
> to
> >> > > > KIP-613
> >> > > > > > for
> >> > > > > > > > > > Streams).
> >> > > > > > > > > >
> >> > > > > > > > > > The KIP is here:
> >> > > > > > > > > >
> >> > > > > > > > > >
> >> > > > > > > > >
> >> > > > > > > >
> >> > > > > > >
> >> > > > > >
> >> > > > >
> >> > > >
> >> > >
> >> >
> >>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-864%3A+Add+End-To-End+Latency+Metrics+to+Connectors
> >> > > > > > > > > >
> >> > > > > > > > > > Please take a look and let me know what you think.
> >> > > > > > > > > >
> >> > > > > > > > > > Cheers,
> >> > > > > > > > > > Jorge.
> >> > > > > > > > > >
> >> > > > > > > > >
> >> > > > > > > >
> >> > > > > > >
> >> > > > > >
> >> > > > >
> >> > > >
> >> > >
> >> >
> >>
> >
>

Re: [DISCUSS] KIP-864: Add End-To-End Latency Metrics to Connectors

Posted by Jorge Esteban Quilcate Otoya <qu...@gmail.com>.
Hi everyone,

I've made a slight addition to the KIP based on Yash feedback:

- A new metric is added at INFO level to record the max latency from the
batch timestamp, by keeping the oldest record timestamp per batch.
- A draft implementation is linked.
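A minimal sketch of the batch-level metric described above, under the stated assumption that tracking only the oldest record timestamp per batch is enough to derive the batch's max latency with one sensor recording per batch (class and method names here are illustrative, not the draft implementation):

```java
import java.util.List;

// Sketch: the oldest record's timestamp bounds the latency of every record
// in the batch, so "now - oldest" gives the batch max e2e latency.
class BatchMaxLatency {
    // recordTimestampsMs: Kafka record timestamps (epoch millis) in one batch.
    static long maxLatencyMs(List<Long> recordTimestampsMs, long nowMs) {
        long oldest = Long.MAX_VALUE;
        for (long ts : recordTimestampsMs) {
            oldest = Math.min(oldest, ts);
        }
        return recordTimestampsMs.isEmpty() ? 0L : nowMs - oldest;
    }
}
```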

Looking forward to your feedback.
Also, a kindly reminder that the vote thread is open.

Thanks!
Jorge.

On Thu, 8 Sept 2022 at 14:25, Jorge Esteban Quilcate Otoya <
quilcate.jorge@gmail.com> wrote:

> Great. I have updated the KIP to reflect this.
>
> Cheers,
> Jorge.
>
> On Thu, 8 Sept 2022 at 12:26, Yash Mayya <ya...@gmail.com> wrote:
>
>> Thanks, I think it makes sense to define these metrics at a DEBUG
>> recording
>> level.
>>
>> On Thu, Sep 8, 2022 at 2:51 PM Jorge Esteban Quilcate Otoya <
>> quilcate.jorge@gmail.com> wrote:
>>
>> > On Thu, 8 Sept 2022 at 05:55, Yash Mayya <ya...@gmail.com> wrote:
>> >
>> > > Hi Jorge,
>> > >
>> > > Thanks for the changes. With regard to having per batch vs per record
>> > > metrics, the additional overhead I was referring to wasn't about
>> whether
>> > or
>> > > not we would need to iterate over all the records in a batch. I was
>> > > referring to the potential additional overhead caused by the higher
>> > volume
>> > > of calls to Sensor::record on the sensors for the new metrics (as
>> > compared
>> > > to the existing batch only metrics), especially for high throughput
>> > > connectors where batch sizes could be large. I guess we may want to do
>> > some
>> > > sort of performance testing and get concrete numbers to verify whether
>> > this
>> > > is a valid concern or not?
>> > >
>> >
>> > 6.1. Got it, thanks for clarifying. I guess there could be a benchmark
>> test
>> > of the `Sensor::record` to get an idea of the performance impact.
>> > Regardless, the fact that these are single-record metrics compared to
>> > existing batch-only could be explicitly defined by setting these
>> metrics at
>> > a DEBUG or TRACE metric recording level, leaving the existing at INFO
>> > level.
>> > wdyt?
>> >
>> >
>> > >
>> > > Thanks,
>> > > Yash
>> > >
>> > > On Tue, Sep 6, 2022 at 4:42 PM Jorge Esteban Quilcate Otoya <
>> > > quilcate.jorge@gmail.com> wrote:
>> > >
>> > > > Hi Sagar and Yash,
>> > > >
>> > > > > the way it's defined in
>> > > > https://kafka.apache.org/documentation/#connect_monitoring for the
>> > > metrics
>> > > >
>> > > > 4.1. Got it. Add it to the KIP.
>> > > >
>> > > > > The only thing I would argue is do we need
>> sink-record-latency-min?
>> > > Maybe
>> > > > we
>> > > > > could remove this min metric as well and make all of the 3 e2e
>> > metrics
>> > > > > consistent
>> > > >
>> > > > 4.2 I see. Will remove it from the KIP.
>> > > >
>> > > > > Probably users can track the metrics at their end to
>> > > > > figure that out. Do you think that makes sense?
>> > > >
>> > > > 4.3. Yes, agree. With these new metrics it should be easier for
>> users
>> > to
>> > > > track this.
>> > > >
>> > > > > I think it makes sense to not have a min metric for either to
>> remain
>> > > > > consistent with the existing put-batch and poll-batch metrics
>> > > >
>> > > > 5.1. Got it. Same as 4.2
>> > > >
>> > > > > Another naming related suggestion I had was with the
>> > > > > "convert-time" metrics - we should probably include
>> transformations
>> > in
>> > > > the
>> > > > > name since SMTs could definitely be attributable to a sizable
>> chunk
>> > of
>> > > > the
>> > > > > latency depending on the specific transformation chain.
>> > > >
>> > > > 5.2. Make sense. I'm proposing to add
>> > `sink-record-convert-transform...`
>> > > > and `source-record-transform-convert...` to represent correctly the
>> > order
>> > > > of operations.
>> > > >
>> > > > > it seems like both source and sink tasks only record metrics at a
>> > > "batch"
>> > > > > level, not on an individual record level. I think it might be
>> > > additional
>> > > > > overhead if we want to record these new metrics all at the record
>> > > level?
>> > > >
>> > > > 5.3. I considered at the beginning to implement all metrics at the
>> > batch
>> > > > level, but given how the framework process records, I fallback to
>> the
>> > > > proposed approach:
>> > > > - Sink Task:
>> > > >   - `WorkerSinkTask#convertMessages(msgs)` already iterates over
>> > records,
>> > > > so there is no additional overhead to capture record latency per
>> > record.
>> > > >     -
>> > > >
>> > > >
>> > >
>> >
>> https://github.com/apache/kafka/blob/9841647c4fe422532f448423c92d26e4fdcb8932/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/WorkerSinkTask.java#L490-L514
>> > > >   - `WorkerSinkTask#convertAndTransformRecord(record)` actually
>> happens
>> > > > individually. Measuring this operation per batch would include
>> > processing
>> > > > that is not strictly part of "convert and transform"
>> > > >     -
>> > > >
>> > > >
>> > >
>> >
>> https://github.com/apache/kafka/blob/9841647c4fe422532f448423c92d26e4fdcb8932/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/WorkerSinkTask.java#L518
>> > > > - Source Task:
>> > > >   - `AbstractWorkerSourceTask#sendRecords` iterates over a batch and
>> > > > applies transforms and convert record individually as well:
>> > > >     -
>> > > >
>> > > >
>> > >
>> >
>> https://github.com/apache/kafka/blob/9841647c4fe422532f448423c92d26e4fdcb8932/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/AbstractWorkerSourceTask.java#L389-L390
>> > > >
>> > > > > This might require some additional changes -
>> > > > > for instance, with the "sink-record-latency" metric, we might only
>> > want
>> > > > to
>> > > > > have a "max" metric since "avg" would require recording a value on
>> > the
>> > > > > sensor for each record (whereas we can get a "max" by only
>> recording
>> > a
>> > > > > metric value for the oldest record in each batch).
>> > > >
>> > > > 5.4. Recording record-latency per batch may not be as useful as
>> there
>> > is
>> > > no
>> > > > guarantee that the oldest record will be representative of the
>> batch.
>> > > >
>> > > > On Sat, 3 Sept 2022 at 16:02, Yash Mayya <ya...@gmail.com>
>> wrote:
>> > > >
>> > > > > Hi Jorge and Sagar,
>> > > > >
>> > > > > I think it makes sense to not have a min metric for either to
>> remain
>> > > > > consistent with the existing put-batch and poll-batch metrics (it
>> > > doesn't
>> > > > > seem particularly useful either anyway). Also, the new
>> > > > > "sink-record-latency" metric name looks fine to me, thanks for
>> making
>> > > the
>> > > > > changes! Another naming related suggestion I had was with the
>> > > > > "convert-time" metrics - we should probably include
>> transformations
>> > in
>> > > > the
>> > > > > name since SMTs could definitely be attributable to a sizable
>> chunk
>> > of
>> > > > the
>> > > > > latency depending on the specific transformation chain.
>> > > > >
>> > > > > I have one high level question with respect to implementation -
>> > > > currently,
>> > > > > it seems like both source and sink tasks only record metrics at a
>> > > "batch"
>> > > > > level, not on an individual record level. I think it might be
>> > > additional
>> > > > > overhead if we want to record these new metrics all at the record
>> > > level?
>> > > > > Could we instead make all of these new metrics for batches of
>> records
>> > > > > rather than individual records in order to remain consistent with
>> the
>> > > > > existing task level metrics? This might require some additional
>> > > changes -
>> > > > > for instance, with the "sink-record-latency" metric, we might only
>> > want
>> > > > to
>> > > > > have a "max" metric since "avg" would require recording a value on
>> > the
>> > > > > sensor for each record (whereas we can get a "max" by only
>> recording
>> > a
>> > > > > metric value for the oldest record in each batch).
>> > > > >
>> > > > > Thanks,
>> > > > > Yash
>> > > > >
>> > > > > On Fri, Sep 2, 2022 at 3:16 PM Sagar <sa...@gmail.com>
>> > > wrote:
>> > > > >
>> > > > > > Hi Jorge,
>> > > > > >
>> > > > > > Thanks for the changes.
>> > > > > >
>> > > > > > Regarding the metrics, I meant something like this:
>> > > > > >
>> > > > >
>> > > >
>> > >
>> >
>> kafka.connect:type=sink-task-metrics,connector="{connector}",task="{task}"
>> > > > > >
>> > > > > > the way it's defined in
>> > > > > > https://kafka.apache.org/documentation/#connect_monitoring for
>> the
>> > > > > > metrics.
>> > > > > >
>> > > > > > I see what you mean by the 3 metrics and how it can be
>> interpreted.
>> > > The
>> > > > > > only thing I would argue is do we need sink-record-latency-min?
>> > Maybe
>> > > > we
>> > > > > > could remove this min metric as well and make all of the 3 e2e
>> > > metrics
>> > > > > > consistent(since put-batch also doesn't expose a min which makes
>> > > sense
>> > > > to
>> > > > > > me). I think this is in contrast to what Yash pointed out above
>> so
>> > I
>> > > > > would
>> > > > > > like to hear his thoughts as well.
>> > > > > >
>> > > > > > The other point Yash mentioned about the slightly flawed
>> definition
>> > > of
>> > > > > e2e
>> > > > > > is also true in a sense. But I have a feeling that's one the
>> > records
>> > > > are
>> > > > > > polled by the connector tasks, it would be difficult to track
>> the
>> > > final
>> > > > > leg
>> > > > > > via the framework. Probably users can track the metrics at their
>> > end
>> > > to
>> > > > > > figure that out. Do you think that makes sense?
>> > > > > >
>> > > > > > Thanks!
>> > > > > > Sagar.
>> > > > > >
>> > > > > >
>> > > > > >
>> > > > > >
>> > > > > > On Thu, Sep 1, 2022 at 11:40 PM Jorge Esteban Quilcate Otoya <
>> > > > > > quilcate.jorge@gmail.com> wrote:
>> > > > > >
>> > > > > > > Hi Sagar and Yash,
>> > > > > > >
>> > > > > > > Thanks for your feedback!
>> > > > > > >
>> > > > > > > > 1) I am assuming the new metrics would be task level metric.
>> > > > > > >
>> > > > > > > 1.1 Yes, it will be a task level metric, implemented on the
>> > > > > > > Worker[Source/Sink]Task.
>> > > > > > >
>> > > > > > > > Could you specify the way it's done for other sink/source
>> > > > connector?
>> > > > > > >
>> > > > > > > 1.2. Not sure what do you mean by this. Could you elaborate a
>> bit
>> > > > more?
>> > > > > > >
>> > > > > > > > 2. I am slightly confused about the e2e latency metric...
>> > > > > > >
>> > > > > > > 2.1. Yes, I see. I was trying to bring a similar concept as in
>> > > > Streams
>> > > > > > with
>> > > > > > > KIP-613, though the e2e concept may not be translatable.
>> > > > > > > We could keep it as `sink-record-latency` to avoid conflating
>> > > > > concepts. A
>> > > > > > > similar metric naming was proposed in KIP-489 but at the
>> consumer
>> > > > > level —
>> > > > > > > though it seems dormant for a couple of years.
>> > > > > > >
>> > > > > > > > However, the put-batch time measures the
>> > > > > > > > time to put a batch of records to external sink. So, I would
>> > > assume
>> > > > > > the 2
>> > > > > > > > can't be added as is to compute the e2e latency. Maybe I am
>> > > missing
>> > > > > > > > something here. Could you plz clarify this.
>> > > > > > >
>> > > > > > > 2.2. Yes, agree. Not necessarily added, but with the 3
>> latencies
>> > > > (poll,
>> > > > > > > convert, putBatch) will be clearer where the bottleneck may
>> be,
>> > and
>> > > > > > > represent the internal processing.
>> > > > > > >
>> > > > > > > > however, as per the KIP it looks like it will be
>> > > > > > > > the latency between when the record was written to Kafka and
>> > when
>> > > > the
>> > > > > > > > record is returned by a sink task's consumer's poll?
>> > > > > > >
>> > > > > > > 3.1. Agree. 2.1. could help to clarify this.
>> > > > > > >
>> > > > > > > > One more thing - I was wondering
>> > > > > > > > if there's a particular reason for having a min metric for
>> e2e
>> > > > > latency
>> > > > > > > but
>> > > > > > > > not for convert time?
>> > > > > > >
>> > > > > > > 3.2. Was following KIP-613 for e2e which seems useful to
>> compare
>> > > with
>> > > > > > Max and
>> > > > > > > get an idea of the window of results, though current
>> latencies in
>> > > > > > Connector
>> > > > > > > do not include Min, and that's why I haven't added it for
>> convert
>> > > > > > latency.
>> > > > > > > Do you think it makes sense to extend latency metrics with Min?
>> > > > > > >
>> > > > > > > KIP is updated to clarify some of these changes.
>> > > > > > >
>> > > > > > > Many thanks,
>> > > > > > > Jorge.
>> > > > > > >
>> > > > > > > On Thu, 1 Sept 2022 at 18:11, Yash Mayya <
>> yash.mayya@gmail.com>
>> > > > wrote:
>> > > > > > >
>> > > > > > > > Hi Jorge,
>> > > > > > > >
>> > > > > > > > Thanks for the KIP! I have the same confusion with the
>> > > e2e-latency
>> > > > > > > metrics
>> > > > > > > > as Sagar above. "e2e" would seem to indicate the latency
>> > between
>> > > > when
>> > > > > > the
>> > > > > > > > record was written to Kafka and when the record was written
>> to
>> > > the
>> > > > > sink
>> > > > > > > > system by the connector - however, as per the KIP it looks
>> like
>> > > it
>> > > > > will
>> > > > > > > be
>> > > > > > > > the latency between when the record was written to Kafka and
>> > when
>> > > > the
>> > > > > > > > record is returned by a sink task's consumer's poll? I think
>> > that
>> > > > > > metric
>> > > > > > > > will be a little confusing to interpret. One more thing - I
>> was
>> > > > > > wondering
>> > > > > > > > if there's a particular reason for having a min metric for
>> e2e
>> > > > > latency
>> > > > > > > but
>> > > > > > > > not for convert time?
>> > > > > > > >
>> > > > > > > > Thanks,
>> > > > > > > > Yash
>> > > > > > > >
>> > > > > > > > On Thu, Sep 1, 2022 at 8:59 PM Sagar <
>> > sagarmeansocean@gmail.com>
>> > > > > > wrote:
>> > > > > > > >
>> > > > > > > > > Hi Jorge,
>> > > > > > > > >
>> > > > > > > > > Thanks for the KIP. It looks like a very good addition. I
>> > > skimmed
>> > > > > > > through
>> > > > > > > > > once and had a couple of questions =>
>> > > > > > > > >
>> > > > > > > > > 1) I am assuming the new metrics would be task level
>> metric.
>> > > > Could
>> > > > > > you
>> > > > > > > > > specify the way it's done for other sink/source connector?
>> > > > > > > > > 2) I am slightly confused about the e2e latency metric.
>> Let's
>> > > > > > consider
>> > > > > > > > the
>> > > > > > > > > sink connector metric. If I look at the way it's supposed
>> to
>> > be
>> > > > > > > > calculated,
>> > > > > > > > > i.e the difference between the record timestamp and the
>> wall
>> > > > clock
>> > > > > > > time,
>> > > > > > > > it
>> > > > > > > > > looks like a per record metric. However, the put-batch
>> time
>> > > > > measures
>> > > > > > > the
>> > > > > > > > > time to put a batch of records to external sink. So, I
>> would
>> > > > assume
>> > > > > > > the 2
>> > > > > > > > > can't be added as is to compute the e2e latency. Maybe I
>> am
>> > > > missing
>> > > > > > > > > something here. Could you plz clarify this.
>> > > > > > > > >
>> > > > > > > > > Thanks!
>> > > > > > > > > Sagar.
>> > > > > > > > >
>> > > > > > > > > On Tue, Aug 30, 2022 at 8:43 PM Jorge Esteban Quilcate
>> Otoya
>> > <
>> > > > > > > > > quilcate.jorge@gmail.com> wrote:
>> > > > > > > > >
>> > > > > > > > > > Hi all,
>> > > > > > > > > >
>> > > > > > > > > > I'd like to start a discussion thread on KIP-864: Add
>> > > > End-To-End
>> > > > > > > > Latency
>> > > > > > > > > > Metrics to Connectors.
>> > > > > > > > > > This KIP aims to improve the metrics available on Source
>> > and
>> > > > Sink
>> > > > > > > > > > Connectors to measure end-to-end latency, including
>> source
>> > > and
>> > > > > sink
>> > > > > > > > > record
>> > > > > > > > > > conversion time, and sink record e2e latency (similar to
>> > > > KIP-613
>> > > > > > for
>> > > > > > > > > > Streams).
>> > > > > > > > > >
>> > > > > > > > > > The KIP is here:
>> > > > > > > > > >
>> > > > > > > > > >
>> > > > > > > > >
>> > > > > > > >
>> > > > > > >
>> > > > > >
>> > > > >
>> > > >
>> > >
>> >
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-864%3A+Add+End-To-End+Latency+Metrics+to+Connectors
>> > > > > > > > > >
>> > > > > > > > > > Please take a look and let me know what you think.
>> > > > > > > > > >
>> > > > > > > > > > Cheers,
>> > > > > > > > > > Jorge.
>> > > > > > > > > >
>> > > > > > > > >
>> > > > > > > >
>> > > > > > >
>> > > > > >
>> > > > >
>> > > >
>> > >
>> >
>>
>

Re: [DISCUSS] KIP-864: Add End-To-End Latency Metrics to Connectors

Posted by Jorge Esteban Quilcate Otoya <qu...@gmail.com>.
Great. I have updated the KIP to reflect this.

Cheers,
Jorge.
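[Editor's note: for readers following the thread, the gating that a DEBUG recording level gives you mirrors Kafka's `Sensor.RecordingLevel` semantics. Below is a minimal, self-contained sketch of that check; the enum here is illustrative, not the actual `org.apache.kafka.common.metrics` class.]

```java
// Sketch of recording-level gating, mirroring the semantics of Kafka's
// Sensor.RecordingLevel: a DEBUG-level sensor only records when the
// configured metrics.recording.level is DEBUG or TRACE.
public class RecordingLevelSketch {
    enum RecordingLevel {
        INFO(0), DEBUG(1), TRACE(2);
        final int id;
        RecordingLevel(int id) { this.id = id; }
        // A sensor at 'this' level records only if the configured level
        // is at least as verbose as the sensor's own level.
        boolean shouldRecord(RecordingLevel configured) {
            return configured.id >= this.id;
        }
    }

    public static void main(String[] args) {
        // Existing batch metrics stay at INFO; new per-record metrics sit at DEBUG.
        System.out.println(RecordingLevel.DEBUG.shouldRecord(RecordingLevel.INFO));  // false
        System.out.println(RecordingLevel.DEBUG.shouldRecord(RecordingLevel.DEBUG)); // true
        System.out.println(RecordingLevel.INFO.shouldRecord(RecordingLevel.DEBUG));  // true
    }
}
```

With this gating, the new per-record sensors cost little beyond the level check unless an operator opts in via `metrics.recording.level=DEBUG`.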

On Thu, 8 Sept 2022 at 12:26, Yash Mayya <ya...@gmail.com> wrote:

> Thanks, I think it makes sense to define these metrics at a DEBUG recording
> level.
>
> On Thu, Sep 8, 2022 at 2:51 PM Jorge Esteban Quilcate Otoya <
> quilcate.jorge@gmail.com> wrote:
>
> > On Thu, 8 Sept 2022 at 05:55, Yash Mayya <ya...@gmail.com> wrote:
> >
> > > Hi Jorge,
> > >
> > > Thanks for the changes. With regard to having per batch vs per record
> > > metrics, the additional overhead I was referring to wasn't about
> whether
> > or
> > > not we would need to iterate over all the records in a batch. I was
> > > referring to the potential additional overhead caused by the higher
> > volume
> > > of calls to Sensor::record on the sensors for the new metrics (as
> > compared
> > > to the existing batch only metrics), especially for high throughput
> > > connectors where batch sizes could be large. I guess we may want to do
> > some
> > > sort of performance testing and get concrete numbers to verify whether
> > this
> > > is a valid concern or not?
> > >
> >
> > 6.1. Got it, thanks for clarifying. I guess there could be a benchmark
> test
> > of the `Sensor::record` to get an idea of the performance impact.
> > Regardless, the fact that these are single-record metrics compared to
> > existing batch-only could be explicitly defined by setting these metrics
> at
> > a DEBUG or TRACE metric recording level, leaving the existing at INFO
> > level.
> > wdyt?
> >
> >
> > >
> > > Thanks,
> > > Yash
> > >
> > > On Tue, Sep 6, 2022 at 4:42 PM Jorge Esteban Quilcate Otoya <
> > > quilcate.jorge@gmail.com> wrote:
> > >
> > > > Hi Sagar and Yash,
> > > >
> > > > > the way it's defined in
> > > > https://kafka.apache.org/documentation/#connect_monitoring for the
> > > metrics
> > > >
> > > > 4.1. Got it. Will add it to the KIP.
> > > >
> > > > > The only thing I would argue is do we need sink-record-latency-min?
> > > Maybe
> > > > we
> > > > > could remove this min metric as well and make all of the 3 e2e
> > metrics
> > > > > consistent
> > > >
> > > > 4.2 I see. Will remove it from the KIP.
> > > >
> > > > > Probably users can track the metrics at their end to
> > > > > figure that out. Do you think that makes sense?
> > > >
> > > > 4.3. Yes, agree. With these new metrics it should be easier for users
> > to
> > > > track this.
> > > >
> > > > > I think it makes sense to not have a min metric for either to
> remain
> > > > > consistent with the existing put-batch and poll-batch metrics
> > > >
> > > > 5.1. Got it. Same as 4.2
> > > >
> > > > > Another naming related suggestion I had was with the
> > > > > "convert-time" metrics - we should probably include transformations
> > in
> > > > the
> > > > > name since SMTs could definitely be attributable to a sizable chunk
> > of
> > > > the
> > > > > latency depending on the specific transformation chain.
> > > >
> > > > 5.2. Makes sense. I'm proposing to add
> > `sink-record-convert-transform...`
> > > > and `source-record-transform-convert...` to represent correctly the
> > order
> > > > of operations.
> > > >
> > > > > it seems like both source and sink tasks only record metrics at a
> > > "batch"
> > > > > level, not on an individual record level. I think it might be
> > > additional
> > > > > overhead if we want to record these new metrics all at the record
> > > level?
> > > >
> > > > 5.3. I initially considered implementing all metrics at the
> > batch
> > > > level, but given how the framework processes records, I fell back to the
> > > > proposed approach:
> > > > - Sink Task:
> > > >   - `WorkerSinkTask#convertMessages(msgs)` already iterates over
> > records,
> > > > so there is no additional overhead to capture record latency per
> > record.
> > > >     -
> > > >
> > > >
> > >
> >
> https://github.com/apache/kafka/blob/9841647c4fe422532f448423c92d26e4fdcb8932/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/WorkerSinkTask.java#L490-L514
> > > >   - `WorkerSinkTask#convertAndTransformRecord(record)` actually
> happens
> > > > individually. Measuring this operation per batch would include
> > processing
> > > > that is not strictly part of "convert and transform"
> > > >     -
> > > >
> > > >
> > >
> >
> https://github.com/apache/kafka/blob/9841647c4fe422532f448423c92d26e4fdcb8932/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/WorkerSinkTask.java#L518
> > > > - Source Task:
> > > >   - `AbstractWorkerSourceTask#sendRecords` iterates over a batch and
> > > > applies transforms and convert record individually as well:
> > > >     -
> > > >
> > > >
> > >
> >
> https://github.com/apache/kafka/blob/9841647c4fe422532f448423c92d26e4fdcb8932/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/AbstractWorkerSourceTask.java#L389-L390
> > > >
> > > > > This might require some additional changes -
> > > > > for instance, with the "sink-record-latency" metric, we might only
> > want
> > > > to
> > > > > have a "max" metric since "avg" would require recording a value on
> > the
> > > > > sensor for each record (whereas we can get a "max" by only
> recording
> > a
> > > > > metric value for the oldest record in each batch).
> > > >
> > > > 5.4. Recording record-latency per batch may not be as useful as there
> > is
> > > no
> > > > guarantee that the oldest record will be representative of the batch.
> > > >
> > > > On Sat, 3 Sept 2022 at 16:02, Yash Mayya <ya...@gmail.com>
> wrote:
> > > >
> > > > > Hi Jorge and Sagar,
> > > > >
> > > > > I think it makes sense to not have a min metric for either to
> remain
> > > > > consistent with the existing put-batch and poll-batch metrics (it
> > > doesn't
> > > > > seem particularly useful either anyway). Also, the new
> > > > > "sink-record-latency" metric name looks fine to me, thanks for
> making
> > > the
> > > > > changes! Another naming related suggestion I had was with the
> > > > > "convert-time" metrics - we should probably include transformations
> > in
> > > > the
> > > > > name since SMTs could definitely be attributable to a sizable chunk
> > of
> > > > the
> > > > > latency depending on the specific transformation chain.
> > > > >
> > > > > I have one high level question with respect to implementation -
> > > > currently,
> > > > > it seems like both source and sink tasks only record metrics at a
> > > "batch"
> > > > > level, not on an individual record level. I think it might be
> > > additional
> > > > > overhead if we want to record these new metrics all at the record
> > > level?
> > > > > Could we instead make all of these new metrics for batches of
> records
> > > > > rather than individual records in order to remain consistent with
> the
> > > > > existing task level metrics? This might require some additional
> > > changes -
> > > > > for instance, with the "sink-record-latency" metric, we might only
> > want
> > > > to
> > > > > have a "max" metric since "avg" would require recording a value on
> > the
> > > > > sensor for each record (whereas we can get a "max" by only
> recording
> > a
> > > > > metric value for the oldest record in each batch).
> > > > >
> > > > > Thanks,
> > > > > Yash
> > > > >
> > > > > On Fri, Sep 2, 2022 at 3:16 PM Sagar <sa...@gmail.com>
> > > wrote:
> > > > >
> > > > > > Hi Jorge,
> > > > > >
> > > > > > Thanks for the changes.
> > > > > >
> > > > > > Regarding the metrics, I meant something like this:
> > > > > >
> > > > >
> > > >
> > >
> >
> kafka.connect:type=sink-task-metrics,connector="{connector}",task="{task}"
> > > > > >
> > > > > > the way it's defined in
> > > > > > https://kafka.apache.org/documentation/#connect_monitoring for
> the
> > > > > > metrics.
> > > > > >
> > > > > > I see what you mean by the 3 metrics and how they can be
> interpreted.
> > > The
> > > > > > only thing I would argue is do we need sink-record-latency-min?
> > Maybe
> > > > we
> > > > > > could remove this min metric as well and make all of the 3 e2e
> > > metrics
> > > > > > consistent(since put-batch also doesn't expose a min which makes
> > > sense
> > > > to
> > > > > > me). I think this is in contrast to what Yash pointed out above
> so
> > I
> > > > > would
> > > > > > like to hear his thoughts as well.
> > > > > >
> > > > > > The other point Yash mentioned about the slightly flawed
> definition
> > > of
> > > > > e2e
> > > > > > is also true in a sense. But I have a feeling that once the
> > records
> > > > are
> > > > > > polled by the connector tasks, it would be difficult to track the
> > > final
> > > > > leg
> > > > > > via the framework. Probably users can track the metrics at their
> > end
> > > to
> > > > > > figure that out. Do you think that makes sense?
> > > > > >
> > > > > > Thanks!
> > > > > > Sagar.
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Thu, Sep 1, 2022 at 11:40 PM Jorge Esteban Quilcate Otoya <
> > > > > > quilcate.jorge@gmail.com> wrote:
> > > > > >
> > > > > > > Hi Sagar and Yash,
> > > > > > >
> > > > > > > Thanks for your feedback!
> > > > > > >
> > > > > > > > 1) I am assuming the new metrics would be task level metric.
> > > > > > >
> > > > > > > 1.1 Yes, it will be a task level metric, implemented on the
> > > > > > > Worker[Source/Sink]Task.
> > > > > > >
> > > > > > > > Could you specify the way it's done for other sink/source
> > > > connector?
> > > > > > >
> > > > > > > 1.2. Not sure what do you mean by this. Could you elaborate a
> bit
> > > > more?
> > > > > > >
> > > > > > > > 2. I am slightly confused about the e2e latency metric...
> > > > > > >
> > > > > > > 2.1. Yes, I see. I was trying to bring a similar concept as in
> > > > Streams
> > > > > > with
> > > > > > > KIP-613, though the e2e concept may not be translatable.
> > > > > > > We could keep it as `sink-record-latency` to avoid conflating
> > > > > concepts. A
> > > > > > > similar metric naming was proposed in KIP-489 but at the
> consumer
> > > > > level —
> > > > > > > though it seems dormant for a couple of years.
> > > > > > >
> > > > > > > > However, the put-batch time measures the
> > > > > > > > time to put a batch of records to external sink. So, I would
> > > assume
> > > > > > the 2
> > > > > > > > can't be added as is to compute the e2e latency. Maybe I am
> > > missing
> > > > > > > > something here. Could you plz clarify this.
> > > > > > >
> > > > > > > 2.2. Yes, agree. Not necessarily added, but with the 3
> latencies
> > > > (poll,
> > > > > > > convert, putBatch) will be clearer where the bottleneck may be,
> > and
> > > > > > > represent the internal processing.
> > > > > > >
> > > > > > > > however, as per the KIP it looks like it will be
> > > > > > > > the latency between when the record was written to Kafka and
> > when
> > > > the
> > > > > > > > record is returned by a sink task's consumer's poll?
> > > > > > >
> > > > > > > 3.1. Agree. 2.1. could help to clarify this.
> > > > > > >
> > > > > > > > One more thing - I was wondering
> > > > > > > > if there's a particular reason for having a min metric for
> e2e
> > > > > latency
> > > > > > > but
> > > > > > > > not for convert time?
> > > > > > >
> > > > > > > 3.2. Was following KIP-613 for e2e which seems useful to
> compare
> > > with
> > > > > Max and
> > > > > > > get an idea of the window of results, though current latencies
> in
> > > > > > Connector
> > > > > > > do not include Min, and that's why I haven't added it for
> convert
> > > > > > latency.
> > > > > > > Do you think it makes sense to extend latency metrics with Min?
> > > > > > >
> > > > > > > KIP is updated to clarify some of these changes.
> > > > > > >
> > > > > > > Many thanks,
> > > > > > > Jorge.
> > > > > > >
> > > > > > > On Thu, 1 Sept 2022 at 18:11, Yash Mayya <yash.mayya@gmail.com
> >
> > > > wrote:
> > > > > > >
> > > > > > > > Hi Jorge,
> > > > > > > >
> > > > > > > > Thanks for the KIP! I have the same confusion with the
> > > e2e-latency
> > > > > > > metrics
> > > > > > > > as Sagar above. "e2e" would seem to indicate the latency
> > between
> > > > when
> > > > > > the
> > > > > > > > record was written to Kafka and when the record was written
> to
> > > the
> > > > > sink
> > > > > > > > system by the connector - however, as per the KIP it looks
> like
> > > it
> > > > > will
> > > > > > > be
> > > > > > > > the latency between when the record was written to Kafka and
> > when
> > > > the
> > > > > > > > record is returned by a sink task's consumer's poll? I think
> > that
> > > > > > metric
> > > > > > > > will be a little confusing to interpret. One more thing - I
> was
> > > > > > wondering
> > > > > > > > if there's a particular reason for having a min metric for
> e2e
> > > > > latency
> > > > > > > but
> > > > > > > > not for convert time?
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Yash
> > > > > > > >
> > > > > > > > On Thu, Sep 1, 2022 at 8:59 PM Sagar <
> > sagarmeansocean@gmail.com>
> > > > > > wrote:
> > > > > > > >
> > > > > > > > > Hi Jorge,
> > > > > > > > >
> > > > > > > > > Thanks for the KIP. It looks like a very good addition. I
> > > skimmed
> > > > > > > through
> > > > > > > > > once and had a couple of questions =>
> > > > > > > > >
> > > > > > > > > 1) I am assuming the new metrics would be task level
> metric.
> > > > Could
> > > > > > you
> > > > > > > > > specify the way it's done for other sink/source connector?
> > > > > > > > > 2) I am slightly confused about the e2e latency metric.
> Let's
> > > > > > consider
> > > > > > > > the
> > > > > > > > > sink connector metric. If I look at the way it's supposed
> to
> > be
> > > > > > > > calculated,
> > > > > > > > > i.e the difference between the record timestamp and the
> wall
> > > > clock
> > > > > > > time,
> > > > > > > > it
> > > > > > > > > looks like a per record metric. However, the put-batch time
> > > > > measures
> > > > > > > the
> > > > > > > > > time to put a batch of records to external sink. So, I
> would
> > > > assume
> > > > > > > the 2
> > > > > > > > > can't be added as is to compute the e2e latency. Maybe I am
> > > > missing
> > > > > > > > > something here. Could you plz clarify this.
> > > > > > > > >
> > > > > > > > > Thanks!
> > > > > > > > > Sagar.
> > > > > > > > >
> > > > > > > > > On Tue, Aug 30, 2022 at 8:43 PM Jorge Esteban Quilcate
> Otoya
> > <
> > > > > > > > > quilcate.jorge@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > > > Hi all,
> > > > > > > > > >
> > > > > > > > > > I'd like to start a discussion thread on KIP-864: Add
> > > > End-To-End
> > > > > > > > Latency
> > > > > > > > > > Metrics to Connectors.
> > > > > > > > > > This KIP aims to improve the metrics available on Source
> > and
> > > > Sink
> > > > > > > > > > Connectors to measure end-to-end latency, including
> source
> > > and
> > > > > sink
> > > > > > > > > record
> > > > > > > > > > conversion time, and sink record e2e latency (similar to
> > > > KIP-613
> > > > > > for
> > > > > > > > > > Streams).
> > > > > > > > > >
> > > > > > > > > > The KIP is here:
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-864%3A+Add+End-To-End+Latency+Metrics+to+Connectors
> > > > > > > > > >
> > > > > > > > > > Please take a look and let me know what you think.
> > > > > > > > > >
> > > > > > > > > > Cheers,
> > > > > > > > > > Jorge.
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] KIP-864: Add End-To-End Latency Metrics to Connectors

Posted by Yash Mayya <ya...@gmail.com>.
Thanks, I think it makes sense to define these metrics at a DEBUG recording
level.

On Thu, Sep 8, 2022 at 2:51 PM Jorge Esteban Quilcate Otoya <
quilcate.jorge@gmail.com> wrote:

> On Thu, 8 Sept 2022 at 05:55, Yash Mayya <ya...@gmail.com> wrote:
>
> > Hi Jorge,
> >
> > Thanks for the changes. With regard to having per batch vs per record
> > metrics, the additional overhead I was referring to wasn't about whether
> or
> > not we would need to iterate over all the records in a batch. I was
> > referring to the potential additional overhead caused by the higher
> volume
> > of calls to Sensor::record on the sensors for the new metrics (as
> compared
> > to the existing batch only metrics), especially for high throughput
> > connectors where batch sizes could be large. I guess we may want to do
> some
> > sort of performance testing and get concrete numbers to verify whether
> this
> > is a valid concern or not?
> >
>
> 6.1. Got it, thanks for clarifying. I guess there could be a benchmark test
> of the `Sensor::record` to get an idea of the performance impact.
> Regardless, the fact that these are single-record metrics compared to
> existing batch-only could be explicitly defined by setting these metrics at
> a DEBUG or TRACE metric recording level, leaving the existing at INFO
> level.
> wdyt?
>
>
> >
> > Thanks,
> > Yash
> >
> > On Tue, Sep 6, 2022 at 4:42 PM Jorge Esteban Quilcate Otoya <
> > quilcate.jorge@gmail.com> wrote:
> >
> > > Hi Sagar and Yash,
> > >
> > > > the way it's defined in
> > > https://kafka.apache.org/documentation/#connect_monitoring for the
> > metrics
> > >
> > > 4.1. Got it. Will add it to the KIP.
> > >
> > > > The only thing I would argue is do we need sink-record-latency-min?
> > Maybe
> > > we
> > > > could remove this min metric as well and make all of the 3 e2e
> metrics
> > > > consistent
> > >
> > > 4.2 I see. Will remove it from the KIP.
> > >
> > > > Probably users can track the metrics at their end to
> > > > figure that out. Do you think that makes sense?
> > >
> > > 4.3. Yes, agree. With these new metrics it should be easier for users
> to
> > > track this.
> > >
> > > > I think it makes sense to not have a min metric for either to remain
> > > > consistent with the existing put-batch and poll-batch metrics
> > >
> > > 5.1. Got it. Same as 4.2
> > >
> > > > Another naming related suggestion I had was with the
> > > > "convert-time" metrics - we should probably include transformations
> in
> > > the
> > > > name since SMTs could definitely be attributable to a sizable chunk
> of
> > > the
> > > > latency depending on the specific transformation chain.
> > >
> > > 5.2. Makes sense. I'm proposing to add
> `sink-record-convert-transform...`
> > > and `source-record-transform-convert...` to represent correctly the
> order
> > > of operations.
> > >
> > > > it seems like both source and sink tasks only record metrics at a
> > "batch"
> > > > level, not on an individual record level. I think it might be
> > additional
> > > > overhead if we want to record these new metrics all at the record
> > level?
> > >
> > > 5.3. I initially considered implementing all metrics at the
> batch
> > > level, but given how the framework processes records, I fell back to the
> > > proposed approach:
> > > - Sink Task:
> > >   - `WorkerSinkTask#convertMessages(msgs)` already iterates over
> records,
> > > so there is no additional overhead to capture record latency per
> record.
> > >     -
> > >
> > >
> >
> https://github.com/apache/kafka/blob/9841647c4fe422532f448423c92d26e4fdcb8932/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/WorkerSinkTask.java#L490-L514
> > >   - `WorkerSinkTask#convertAndTransformRecord(record)` actually happens
> > > individually. Measuring this operation per batch would include
> processing
> > > that is not strictly part of "convert and transform"
> > >     -
> > >
> > >
> >
> https://github.com/apache/kafka/blob/9841647c4fe422532f448423c92d26e4fdcb8932/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/WorkerSinkTask.java#L518
> > > - Source Task:
> > >   - `AbstractWorkerSourceTask#sendRecords` iterates over a batch and
> > > applies transforms and convert record individually as well:
> > >     -
> > >
> > >
> >
> https://github.com/apache/kafka/blob/9841647c4fe422532f448423c92d26e4fdcb8932/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/AbstractWorkerSourceTask.java#L389-L390
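[Editor's note: the per-record call sites listed above mean the timing can wrap the existing per-record invocation without adding a new loop. A rough, self-contained sketch of that shape follows; the list stands in for a real Kafka `Sensor`, and the operator for the converter plus SMT chain.]

```java
import java.util.List;
import java.util.function.UnaryOperator;

// Sketch: timing convert+transform per record inside an existing per-record
// loop, as WorkerSinkTask#convertMessages already iterates record by record.
public class ConvertTimingSketch {
    // Stand-in for Sensor::record; collects observed latencies in millis.
    static final List<Double> observed = new java.util.ArrayList<>();

    static String convertAndTransform(String raw, UnaryOperator<String> chain) {
        long start = System.nanoTime();
        String out = chain.apply(raw);                 // converter + SMT chain
        double millis = (System.nanoTime() - start) / 1_000_000.0;
        observed.add(millis);                          // would be sensor.record(millis)
        return out;
    }

    public static void main(String[] args) {
        for (String raw : List.of("a", "b", "c")) {
            convertAndTransform(raw, String::toUpperCase);
        }
        System.out.println(observed.size()); // one measurement per record -> 3
    }
}
```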
> > >
> > > > This might require some additional changes -
> > > > for instance, with the "sink-record-latency" metric, we might only
> want
> > > to
> > > > have a "max" metric since "avg" would require recording a value on
> the
> > > > sensor for each record (whereas we can get a "max" by only recording
> a
> > > > metric value for the oldest record in each batch).
> > >
> > > 5.4. Recording record-latency per batch may not be as useful as there
> is
> > no
> > > guarantee that the oldest record will be representative of the batch.
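[Editor's note: the trade-off in 5.4 can be made concrete. Because latency is wall-clock time minus record timestamp, the oldest record in a batch does yield the batch's max latency, but an avg still needs one recording per record. A small sketch with hypothetical helper names:]

```java
import java.util.List;

// Sketch: per-batch vs per-record recording of sink record latency
// (latency = wall-clock time minus record timestamp). Recording only the
// oldest record per batch is enough for a max, but not for an avg.
public class BatchLatencySketch {
    static long maxLatency(List<Long> recordTimestamps, long nowMs) {
        // The oldest (smallest) timestamp yields the largest latency.
        long oldest = recordTimestamps.stream().min(Long::compare).orElse(nowMs);
        return nowMs - oldest;
    }

    static double avgLatency(List<Long> recordTimestamps, long nowMs) {
        // Avg needs one Sensor::record call per record -- the overhead under discussion.
        return recordTimestamps.stream().mapToLong(t -> nowMs - t).average().orElse(0.0);
    }

    public static void main(String[] args) {
        List<Long> batch = List.of(1_000L, 1_500L, 1_900L);
        long now = 2_000L;
        System.out.println(maxLatency(batch, now)); // 1000
        System.out.println(avgLatency(batch, now)); // 533.33...
    }
}
```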
> > >
> > > On Sat, 3 Sept 2022 at 16:02, Yash Mayya <ya...@gmail.com> wrote:
> > >
> > > > Hi Jorge and Sagar,
> > > >
> > > > I think it makes sense to not have a min metric for either to remain
> > > > consistent with the existing put-batch and poll-batch metrics (it
> > doesn't
> > > > seem particularly useful either anyway). Also, the new
> > > > "sink-record-latency" metric name looks fine to me, thanks for making
> > the
> > > > changes! Another naming related suggestion I had was with the
> > > > "convert-time" metrics - we should probably include transformations
> in
> > > the
> > > > name since SMTs could definitely be attributable to a sizable chunk
> of
> > > the
> > > > latency depending on the specific transformation chain.
> > > >
> > > > I have one high level question with respect to implementation -
> > > currently,
> > > > it seems like both source and sink tasks only record metrics at a
> > "batch"
> > > > level, not on an individual record level. I think it might be
> > additional
> > > > overhead if we want to record these new metrics all at the record
> > level?
> > > > Could we instead make all of these new metrics for batches of records
> > > > rather than individual records in order to remain consistent with the
> > > > existing task level metrics? This might require some additional
> > changes -
> > > > for instance, with the "sink-record-latency" metric, we might only
> want
> > > to
> > > > have a "max" metric since "avg" would require recording a value on
> the
> > > > sensor for each record (whereas we can get a "max" by only recording
> a
> > > > metric value for the oldest record in each batch).
> > > >
> > > > Thanks,
> > > > Yash
> > > >
> > > > On Fri, Sep 2, 2022 at 3:16 PM Sagar <sa...@gmail.com>
> > wrote:
> > > >
> > > > > Hi Jorge,
> > > > >
> > > > > Thanks for the changes.
> > > > >
> > > > > Regarding the metrics, I meant something like this:
> > > > >
> > > >
> > >
> >
> kafka.connect:type=sink-task-metrics,connector="{connector}",task="{task}"
> > > > >
> > > > > the way it's defined in
> > > > > https://kafka.apache.org/documentation/#connect_monitoring for the
> > > > > metrics.
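[Editor's note: the MBean pattern quoted above expands to one JMX object name per connector task. A tiny sketch of how such a name is assembled; the helper is illustrative, not part of the Connect runtime:]

```java
// Sketch: building the JMX object name pattern used for task-level Connect
// metrics (group "sink-task-metrics" with connector/task tags).
public class MetricNameSketch {
    static String mbeanName(String group, String connector, int task) {
        return String.format("kafka.connect:type=%s,connector=\"%s\",task=\"%d\"",
                group, connector, task);
    }

    public static void main(String[] args) {
        System.out.println(mbeanName("sink-task-metrics", "my-connector", 0));
        // kafka.connect:type=sink-task-metrics,connector="my-connector",task="0"
    }
}
```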
> > > > >
> > > > > I see what you mean by the 3 metrics and how they can be interpreted.
> > The
> > > > > only thing I would argue is do we need sink-record-latency-min?
> Maybe
> > > we
> > > > > could remove this min metric as well and make all of the 3 e2e
> > metrics
> > > > > consistent(since put-batch also doesn't expose a min which makes
> > sense
> > > to
> > > > > me). I think this is in contrast to what Yash pointed out above so
> I
> > > > would
> > > > > like to hear his thoughts as well.
> > > > >
> > > > > The other point Yash mentioned about the slightly flawed definition
> > of
> > > > e2e
> > > > > is also true in a sense. But I have a feeling that once the
> records
> > > are
> > > > > polled by the connector tasks, it would be difficult to track the
> > final
> > > > leg
> > > > > via the framework. Probably users can track the metrics at their
> end
> > to
> > > > > figure that out. Do you think that makes sense?
> > > > >
> > > > > Thanks!
> > > > > Sagar.
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > On Thu, Sep 1, 2022 at 11:40 PM Jorge Esteban Quilcate Otoya <
> > > > > quilcate.jorge@gmail.com> wrote:
> > > > >
> > > > > > Hi Sagar and Yash,
> > > > > >
> > > > > > Thanks for your feedback!
> > > > > >
> > > > > > > 1) I am assuming the new metrics would be task level metric.
> > > > > >
> > > > > > 1.1 Yes, it will be a task level metric, implemented on the
> > > > > > Worker[Source/Sink]Task.
> > > > > >
> > > > > > > Could you specify the way it's done for other sink/source
> > > connector?
> > > > > >
> > > > > > 1.2. Not sure what you mean by this. Could you elaborate a bit
> > > more?
> > > > > >
> > > > > > > 2. I am slightly confused about the e2e latency metric...
> > > > > >
> > > > > > 2.1. Yes, I see. I was trying to bring a similar concept as in
> > > Streams
> > > > > with
> > > > > > KIP-613, though the e2e concept may not be translatable.
> > > > > > We could keep it as `sink-record-latency` to avoid conflating
> > > > concepts. A
> > > > > > similar metric naming was proposed in KIP-489 but at the consumer
> > > > level —
> > > > > > though it seems dormant for a couple of years.
> > > > > >
> > > > > > > However, the put-batch time measures the
> > > > > > > time to put a batch of records to external sink. So, I would
> > assume
> > > > > the 2
> > > > > > > can't be added as is to compute the e2e latency. Maybe I am
> > missing
> > > > > > > something here. Could you plz clarify this.
> > > > > >
> > > > > > 2.2. Yes, agree. Not necessarily added, but with the 3 latencies
> > > (poll,
> > > > > > convert, putBatch) will be clearer where the bottleneck may be,
> and
> > > > > > represent the internal processing.
> > > > > >
> > > > > > > however, as per the KIP it looks like it will be
> > > > > > > the latency between when the record was written to Kafka and
> when
> > > the
> > > > > > > record is returned by a sink task's consumer's poll?
> > > > > >
> > > > > > 3.1. Agree. 2.1. could help to clarify this.
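[Editor's note: to pin down the semantics agreed in 3.1 — the proposed sink-record latency is measured per record as wall-clock time at consumption minus the record's Kafka timestamp, i.e. it covers produce-to-poll, not delivery into the sink system. A minimal sketch; the method name is illustrative:]

```java
// Sketch: sink-record latency as discussed -- wall-clock time at the point the
// sink task observes the record, minus the record's Kafka timestamp.
public class SinkRecordLatencySketch {
    static long recordLatencyMs(long recordTimestampMs, long wallClockMs) {
        // Guard against clock skew producing negative values.
        return Math.max(0, wallClockMs - recordTimestampMs);
    }

    public static void main(String[] args) {
        long ts = 1_662_000_000_000L;        // record timestamp from Kafka
        long now = ts + 250;                 // task sees the record 250 ms later
        System.out.println(recordLatencyMs(ts, now)); // 250
    }
}
```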
> > > > > >
> > > > > > > One more thing - I was wondering
> > > > > > > if there's a particular reason for having a min metric for e2e
> > > > latency
> > > > > > but
> > > > > > > not for convert time?
> > > > > >
> > > > > > 3.2. Was following KIP-613 for e2e which seems useful to compare
> > with
> > > > > Max and
> > > > > > get an idea of the window of results, though current latencies in
> > > > > Connector
> > > > > > do not include Min, and that's why I haven't added it for convert
> > > > > latency.
> > > > > > Do you think it makes sense to extend latency metrics with Min?
> > > > > >
> > > > > > KIP is updated to clarify some of these changes.
> > > > > >
> > > > > > Many thanks,
> > > > > > Jorge.
> > > > > >
> > > > > > On Thu, 1 Sept 2022 at 18:11, Yash Mayya <ya...@gmail.com>
> > > wrote:
> > > > > >
> > > > > > > Hi Jorge,
> > > > > > >
> > > > > > > Thanks for the KIP! I have the same confusion with the
> > e2e-latency
> > > > > > metrics
> > > > > > > as Sagar above. "e2e" would seem to indicate the latency
> between
> > > when
> > > > > the
> > > > > > > record was written to Kafka and when the record was written to
> > the
> > > > sink
> > > > > > > system by the connector - however, as per the KIP it looks like
> > it
> > > > will
> > > > > > be
> > > > > > > the latency between when the record was written to Kafka and
> when
> > > the
> > > > > > > record is returned by a sink task's consumer's poll? I think
> that
> > > > > metric
> > > > > > > will be a little confusing to interpret. One more thing - I was
> > > > > wondering
> > > > > > > if there's a particular reason for having a min metric for e2e
> > > > latency
> > > > > > but
> > > > > > > not for convert time?
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Yash
> > > > > > >
> > > > > > > On Thu, Sep 1, 2022 at 8:59 PM Sagar <
> sagarmeansocean@gmail.com>
> > > > > wrote:
> > > > > > >
> > > > > > > > Hi Jorge,
> > > > > > > >
> > > > > > > > Thanks for the KIP. It looks like a very good addition. I
> > skimmed
> > > > > > through
> > > > > > > > once and had a couple of questions =>
> > > > > > > >
> > > > > > > > 1) I am assuming the new metrics would be task level metric.
> > > Could
> > > > > you
> > > > > > > > specify the way it's done for other sink/source connector?
> > > > > > > > 2) I am slightly confused about the e2e latency metric. Let's
> > > > > consider
> > > > > > > the
> > > > > > > > sink connector metric. If I look at the way it's supposed to
> be
> > > > > > > calculated,
> > > > > > > > i.e the difference between the record timestamp and the wall
> > > clock
> > > > > > time,
> > > > > > > it
> > > > > > > > looks like a per record metric. However, the put-batch time
> > > > measures
> > > > > > the
> > > > > > > > time to put a batch of records to external sink. So, I would
> > > assume
> > > > > > the 2
> > > > > > > > can't be added as is to compute the e2e latency. Maybe I am
> > > missing
> > > > > > > > something here. Could you plz clarify this.
> > > > > > > >
> > > > > > > > Thanks!
> > > > > > > > Sagar.
> > > > > > > >
> > > > > > > > On Tue, Aug 30, 2022 at 8:43 PM Jorge Esteban Quilcate Otoya
> <
> > > > > > > > quilcate.jorge@gmail.com> wrote:
> > > > > > > >
> > > > > > > > > Hi all,
> > > > > > > > >
> > > > > > > > > I'd like to start a discussion thread on KIP-864: Add
> > > End-To-End
> > > > > > > Latency
> > > > > > > > > Metrics to Connectors.
> > > > > > > > > This KIP aims to improve the metrics available on Source
> and
> > > Sink
> > > > > > > > > Connectors to measure end-to-end latency, including source
> > and
> > > > sink
> > > > > > > > record
> > > > > > > > > conversion time, and sink record e2e latency (similar to
> > > KIP-613
> > > > > for
> > > > > > > > > Streams).
> > > > > > > > >
> > > > > > > > > The KIP is here:
> > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-864%3A+Add+End-To-End+Latency+Metrics+to+Connectors
> > > > > > > > >
> > > > > > > > > Please take a look and let me know what you think.
> > > > > > > > >
> > > > > > > > > Cheers,
> > > > > > > > > Jorge.
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] KIP-864: Add End-To-End Latency Metrics to Connectors

Posted by Jorge Esteban Quilcate Otoya <qu...@gmail.com>.
On Thu, 8 Sept 2022 at 05:55, Yash Mayya <ya...@gmail.com> wrote:

> Hi Jorge,
>
> Thanks for the changes. With regard to having per batch vs per record
> metrics, the additional overhead I was referring to wasn't about whether or
> not we would need to iterate over all the records in a batch. I was
> referring to the potential additional overhead caused by the higher volume
> of calls to Sensor::record on the sensors for the new metrics (as compared
> to the existing batch only metrics), especially for high throughput
> connectors where batch sizes could be large. I guess we may want to do some
> sort of performance testing and get concrete numbers to verify whether this
> is a valid concern or not?
>

6.1. Got it, thanks for clarifying. I guess we could benchmark
`Sensor::record` itself to get an idea of the performance impact.
Regardless, the fact that these are single-record metrics, compared to the
existing batch-only ones, could be made explicit by defining the new
metrics at a DEBUG or TRACE metric recording level and leaving the
existing ones at INFO. wdyt?
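To illustrate the idea, here is a standalone sketch (hypothetical names, not
the actual Kafka `Sensor` code): a sensor registered at DEBUG becomes a
cheap no-op when the runtime metrics level is INFO, so per-record calls
would cost only one comparison.

```java
// Sketch of metric recording levels, loosely modeled on Kafka's
// Sensor.RecordingLevel. All names here are hypothetical.
enum RecordingLevel {
    INFO(0), DEBUG(1), TRACE(2);

    final int id;

    RecordingLevel(int id) { this.id = id; }

    // A sensor registered at this level records only when the
    // configured runtime level is at least as verbose.
    boolean shouldRecord(RecordingLevel configured) {
        return configured.id >= this.id;
    }
}

class LatencySensor {
    private final RecordingLevel sensorLevel;
    private final RecordingLevel configuredLevel;
    private long count = 0;
    private double maxMs = Double.NEGATIVE_INFINITY;

    LatencySensor(RecordingLevel sensorLevel, RecordingLevel configuredLevel) {
        this.sensorLevel = sensorLevel;
        this.configuredLevel = configuredLevel;
    }

    // Per-record call: a single comparison and early return when the
    // sensor is more verbose than the configured level.
    void record(double latencyMs) {
        if (!sensorLevel.shouldRecord(configuredLevel)) return;
        count++;
        if (latencyMs > maxMs) maxMs = latencyMs;
    }

    long count() { return count; }
}
```

With this gating, deployments that keep the default INFO level would pay
almost nothing for the new per-record sensors.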


>
> Thanks,
> Yash
>
> On Tue, Sep 6, 2022 at 4:42 PM Jorge Esteban Quilcate Otoya <
> quilcate.jorge@gmail.com> wrote:
>
> > Hi Sagar and Yash,
> >
> > > the way it's defined in
> > https://kafka.apache.org/documentation/#connect_monitoring for the
> metrics
> >
> > 4.1. Got it. Will add it to the KIP.
> >
> > > The only thing I would argue is do we need sink-record-latency-min?
> Maybe
> > we
> > > could remove this min metric as well and make all of the 3 e2e metrics
> > > consistent
> >
> > 4.2 I see. Will remove it from the KIP.
> >
> > > Probably users can track the metrics at their end to
> > > figure that out. Do you think that makes sense?
> >
> > 4.3. Yes, agree. With these new metrics it should be easier for users to
> > track this.
> >
> > > I think it makes sense to not have a min metric for either to remain
> > > consistent with the existing put-batch and poll-batch metrics
> >
> > 5.1. Got it. Same as 4.2
> >
> > > Another naming related suggestion I had was with the
> > > "convert-time" metrics - we should probably include transformations in
> > the
> > > name since SMTs could definitely be attributable to a sizable chunk of
> > the
> > > latency depending on the specific transformation chain.
> >
> > 5.2. Makes sense. I'm proposing to add `sink-record-convert-transform...`
> > and `source-record-transform-convert...` to represent correctly the order
> > of operations.
> >
> > > it seems like both source and sink tasks only record metrics at a
> "batch"
> > > level, not on an individual record level. I think it might be
> additional
> > > overhead if we want to record these new metrics all at the record
> level?
> >
> > 5.3. I initially considered implementing all metrics at the batch
> > level, but given how the framework processes records, I fell back to
> > the proposed approach:
> > - Sink Task:
> >   - `WorkerSinkTask#convertMessages(msgs)` already iterates over records,
> > so there is no additional overhead to capture record latency per record.
> >     -
> >
> >
> https://github.com/apache/kafka/blob/9841647c4fe422532f448423c92d26e4fdcb8932/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/WorkerSinkTask.java#L490-L514
> >   - `WorkerSinkTask#convertAndTransformRecord(record)` actually happens
> > individually. Measuring this operation per batch would include processing
> > that is not strictly part of "convert and transform"
> >     -
> >
> >
> https://github.com/apache/kafka/blob/9841647c4fe422532f448423c92d26e4fdcb8932/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/WorkerSinkTask.java#L518
> > - Source Task:
> >   - `AbstractWorkerSourceTask#sendRecords` iterates over a batch and
> > applies transforms and convert record individually as well:
> >     -
> >
> >
> https://github.com/apache/kafka/blob/9841647c4fe422532f448423c92d26e4fdcb8932/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/AbstractWorkerSourceTask.java#L389-L390
> >
> > > This might require some additional changes -
> > > for instance, with the "sink-record-latency" metric, we might only want
> > to
> > > have a "max" metric since "avg" would require recording a value on the
> > > sensor for each record (whereas we can get a "max" by only recording a
> > > metric value for the oldest record in each batch).
> >
> > 5.4. Recording record-latency per batch may not be as useful as there is
> no
> > guarantee that the oldest record will be representative of the batch.
> >
> > On Sat, 3 Sept 2022 at 16:02, Yash Mayya <ya...@gmail.com> wrote:
> >
> > > Hi Jorge and Sagar,
> > >
> > > I think it makes sense to not have a min metric for either to remain
> > > consistent with the existing put-batch and poll-batch metrics (it
> doesn't
> > > seem particularly useful either anyway). Also, the new
> > > "sink-record-latency" metric name looks fine to me, thanks for making
> the
> > > changes! Another naming related suggestion I had was with the
> > > "convert-time" metrics - we should probably include transformations in
> > the
> > > name since SMTs could definitely be attributable to a sizable chunk of
> > the
> > > latency depending on the specific transformation chain.
> > >
> > > I have one high level question with respect to implementation -
> > currently,
> > > it seems like both source and sink tasks only record metrics at a
> "batch"
> > > level, not on an individual record level. I think it might be
> additional
> > > overhead if we want to record these new metrics all at the record
> level?
> > > Could we instead make all of these new metrics for batches of records
> > > rather than individual records in order to remain consistent with the
> > > existing task level metrics? This might require some additional
> changes -
> > > for instance, with the "sink-record-latency" metric, we might only want
> > to
> > > have a "max" metric since "avg" would require recording a value on the
> > > sensor for each record (whereas we can get a "max" by only recording a
> > > metric value for the oldest record in each batch).
> > >
> > > Thanks,
> > > Yash
> > >
> > > On Fri, Sep 2, 2022 at 3:16 PM Sagar <sa...@gmail.com>
> wrote:
> > >
> > > > Hi Jorge,
> > > >
> > > > Thanks for the changes.
> > > >
> > > > Regarding the metrics, I meant something like this:
> > > >
> > >
> >
> kafka.connect:type=sink-task-metrics,connector="{connector}",task="{task}"
> > > >
> > > > the way it's defined in
> > > > https://kafka.apache.org/documentation/#connect_monitoring for the
> > > > metrics.
> > > >
> > > > I see what you mean by the 3 metrics and how it can be interpreted.
> The
> > > > only thing I would argue is do we need sink-record-latency-min? Maybe
> > we
> > > > could remove this min metric as well and make all of the 3 e2e
> metrics
> > > > consistent(since put-batch also doesn't expose a min which makes
> sense
> > to
> > > > me). I think this is in contrast to what Yash pointed out above so I
> > > would
> > > > like to hear his thoughts as well.
> > > >
> > > > The other point Yash mentioned about the slightly flawed definition
> of
> > > e2e
> > > is also true in a sense. But I have a feeling that once the records are
> > > > polled by the connector tasks, it would be difficult to track the
> final
> > > leg
> > > > via the framework. Probably users can track the metrics at their end
> to
> > > > figure that out. Do you think that makes sense?
> > > >
> > > > Thanks!
> > > > Sagar.
> > > >
> > > >
> > > >
> > > >
> > > > On Thu, Sep 1, 2022 at 11:40 PM Jorge Esteban Quilcate Otoya <
> > > > quilcate.jorge@gmail.com> wrote:
> > > >
> > > > > Hi Sagar and Yash,
> > > > >
> > > > > Thanks for your feedback!
> > > > >
> > > > > > 1) I am assuming the new metrics would be task level metric.
> > > > >
> > > > > 1.1 Yes, it will be a task level metric, implemented on the
> > > > > Worker[Source/Sink]Task.
> > > > >
> > > > > > Could you specify the way it's done for other sink/source
> > connector?
> > > > >
> > > > > 1.2. Not sure what you mean by this. Could you elaborate a bit
> > more?
> > > > >
> > > > > > 2. I am slightly confused about the e2e latency metric...
> > > > >
> > > > > 2.1. Yes, I see. I was trying to bring a similar concept as in
> > Streams
> > > > with
> > > > > KIP-613, though the e2e concept may not be translatable.
> > > > > We could keep it as `sink-record-latency` to avoid conflating
> > > concepts. A
> > > > > similar metric naming was proposed in KIP-489 but at the consumer
> > > level —
> > > > > though it seems dormant for a couple of years.
> > > > >
> > > > > > However, the put-batch time measures the
> > > > > > time to put a batch of records to external sink. So, I would
> assume
> > > > the 2
> > > > > > can't be added as is to compute the e2e latency. Maybe I am
> missing
> > > > > > something here. Could you plz clarify this.
> > > > >
> > > > > 2.2. Yes, agree. Not necessarily added, but with the 3 latencies
> > (poll,
> > > > > convert, putBatch) will be clearer where the bottleneck may be, and
> > > > > represent the internal processing.
> > > > >
> > > > > > however, as per the KIP it looks like it will be
> > > > > > the latency between when the record was written to Kafka and when
> > the
> > > > > > record is returned by a sink task's consumer's poll?
> > > > >
> > > > > 3.1. Agree. 2.1. could help to clarify this.
> > > > >
> > > > > > One more thing - I was wondering
> > > > > > if there's a particular reason for having a min metric for e2e
> > > latency
> > > > > but
> > > > > > not for convert time?
> > > > >
> > > > > 3.2. Was following KIP-613 for e2e which seems useful to compare
> with
> > > > Max a
> > > > > get an idea of the window of results, though current latencies in
> > > > Connector
> > > > > do not include Min, and that's why I haven't added it for convert
> > > > latency.
> > > > > Do you think it makes sense to extend latency metrics with Min?
> > > > >
> > > > > KIP is updated to clarify some of these changes.
> > > > >
> > > > > Many thanks,
> > > > > Jorge.
> > > > >
> > > > > On Thu, 1 Sept 2022 at 18:11, Yash Mayya <ya...@gmail.com>
> > wrote:
> > > > >
> > > > > > Hi Jorge,
> > > > > >
> > > > > > Thanks for the KIP! I have the same confusion with the
> e2e-latency
> > > > > metrics
> > > > > > as Sagar above. "e2e" would seem to indicate the latency between
> > when
> > > > the
> > > > > > record was written to Kafka and when the record was written to
> the
> > > sink
> > > > > > system by the connector - however, as per the KIP it looks like
> it
> > > will
> > > > > be
> > > > > > the latency between when the record was written to Kafka and when
> > the
> > > > > > record is returned by a sink task's consumer's poll? I think that
> > > > metric
> > > > > > will be a little confusing to interpret. One more thing - I was
> > > > wondering
> > > > > > if there's a particular reason for having a min metric for e2e
> > > latency
> > > > > but
> > > > > > not for convert time?
> > > > > >
> > > > > > Thanks,
> > > > > > Yash
> > > > > >
> > > > > > On Thu, Sep 1, 2022 at 8:59 PM Sagar <sa...@gmail.com>
> > > > wrote:
> > > > > >
> > > > > > > Hi Jorge,
> > > > > > >
> > > > > > > Thanks for the KIP. It looks like a very good addition. I
> skimmed
> > > > > through
> > > > > > > once and had a couple of questions =>
> > > > > > >
> > > > > > > 1) I am assuming the new metrics would be task level metric.
> > Could
> > > > you
> > > > > > > specify the way it's done for other sink/source connector?
> > > > > > > 2) I am slightly confused about the e2e latency metric. Let's
> > > > consider
> > > > > > the
> > > > > > > sink connector metric. If I look at the way it's supposed to be
> > > > > > calculated,
> > > > > > > i.e the difference between the record timestamp and the wall
> > clock
> > > > > time,
> > > > > > it
> > > > > > > looks like a per record metric. However, the put-batch time
> > > measures
> > > > > the
> > > > > > > time to put a batch of records to external sink. So, I would
> > assume
> > > > > the 2
> > > > > > > can't be added as is to compute the e2e latency. Maybe I am
> > missing
> > > > > > > something here. Could you plz clarify this.
> > > > > > >
> > > > > > > Thanks!
> > > > > > > Sagar.
> > > > > > >
> > > > > > > On Tue, Aug 30, 2022 at 8:43 PM Jorge Esteban Quilcate Otoya <
> > > > > > > quilcate.jorge@gmail.com> wrote:
> > > > > > >
> > > > > > > > Hi all,
> > > > > > > >
> > > > > > > > I'd like to start a discussion thread on KIP-864: Add
> > End-To-End
> > > > > > Latency
> > > > > > > > Metrics to Connectors.
> > > > > > > > This KIP aims to improve the metrics available on Source and
> > Sink
> > > > > > > > Connectors to measure end-to-end latency, including source
> and
> > > sink
> > > > > > > record
> > > > > > > > conversion time, and sink record e2e latency (similar to
> > KIP-613
> > > > for
> > > > > > > > Streams).
> > > > > > > >
> > > > > > > > The KIP is here:
> > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-864%3A+Add+End-To-End+Latency+Metrics+to+Connectors
> > > > > > > >
> > > > > > > > Please take a look and let me know what you think.
> > > > > > > >
> > > > > > > > Cheers,
> > > > > > > > Jorge.
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] KIP-864: Add End-To-End Latency Metrics to Connectors

Posted by Yash Mayya <ya...@gmail.com>.
Hi Jorge,

Thanks for the changes. With regard to having per batch vs per record
metrics, the additional overhead I was referring to wasn't about whether or
not we would need to iterate over all the records in a batch. I was
referring to the potential additional overhead caused by the higher volume
of calls to Sensor::record on the sensors for the new metrics (as compared
to the existing batch only metrics), especially for high throughput
connectors where batch sizes could be large. I guess we may want to do some
sort of performance testing and get concrete numbers to verify whether this
is a valid concern or not?
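A rough harness for getting such numbers might look like the following (a
hypothetical stand-in recorder, not Kafka's actual `Sensor`; real
measurements would need something like JMH against the real class):

```java
// Rough micro-benchmark sketch for per-record metric recording cost.
// SimpleSensor is a hypothetical stand-in, not Kafka's Sensor; the
// timing is illustrative only and subject to JIT/warm-up effects.
class SimpleSensor {
    private double max = Double.NEGATIVE_INFINITY;
    private long count = 0;

    void record(double v) {
        count++;
        if (v > max) max = v;
    }

    long count() { return count; }
}

class RecordOverheadBench {
    // Returns elapsed nanos for n record() calls, to compare the cost
    // of per-record recording against a single per-batch call.
    static long timeRecords(SimpleSensor sensor, int n) {
        long start = System.nanoTime();
        for (int i = 0; i < n; i++) sensor.record(i);
        return System.nanoTime() - start;
    }
}
```

Dividing the elapsed time by the batch size would give a per-record cost to
weigh against typical put/poll latencies.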

Thanks,
Yash

On Tue, Sep 6, 2022 at 4:42 PM Jorge Esteban Quilcate Otoya <
quilcate.jorge@gmail.com> wrote:

> Hi Sagar and Yash,
>
> > the way it's defined in
> https://kafka.apache.org/documentation/#connect_monitoring for the metrics
>
> 4.1. Got it. Will add it to the KIP.
>
> > The only thing I would argue is do we need sink-record-latency-min? Maybe
> we
> > could remove this min metric as well and make all of the 3 e2e metrics
> > consistent
>
> 4.2 I see. Will remove it from the KIP.
>
> > Probably users can track the metrics at their end to
> > figure that out. Do you think that makes sense?
>
> 4.3. Yes, agree. With these new metrics it should be easier for users to
> track this.
>
> > I think it makes sense to not have a min metric for either to remain
> > consistent with the existing put-batch and poll-batch metrics
>
> 5.1. Got it. Same as 4.2
>
> > Another naming related suggestion I had was with the
> > "convert-time" metrics - we should probably include transformations in
> the
> > name since SMTs could definitely be attributable to a sizable chunk of
> the
> > latency depending on the specific transformation chain.
>
> 5.2. Makes sense. I'm proposing to add `sink-record-convert-transform...`
> and `source-record-transform-convert...` to represent correctly the order
> of operations.
>
> > it seems like both source and sink tasks only record metrics at a "batch"
> > level, not on an individual record level. I think it might be additional
> > overhead if we want to record these new metrics all at the record level?
>
> 5.3. I initially considered implementing all metrics at the batch
> level, but given how the framework processes records, I fell back to the
> proposed approach:
> - Sink Task:
>   - `WorkerSinkTask#convertMessages(msgs)` already iterates over records,
> so there is no additional overhead to capture record latency per record.
>     -
>
> https://github.com/apache/kafka/blob/9841647c4fe422532f448423c92d26e4fdcb8932/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/WorkerSinkTask.java#L490-L514
>   - `WorkerSinkTask#convertAndTransformRecord(record)` actually happens
> individually. Measuring this operation per batch would include processing
> that is not strictly part of "convert and transform"
>     -
>
> https://github.com/apache/kafka/blob/9841647c4fe422532f448423c92d26e4fdcb8932/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/WorkerSinkTask.java#L518
> - Source Task:
>   - `AbstractWorkerSourceTask#sendRecords` iterates over a batch and
> applies transforms and convert record individually as well:
>     -
>
> https://github.com/apache/kafka/blob/9841647c4fe422532f448423c92d26e4fdcb8932/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/AbstractWorkerSourceTask.java#L389-L390
>
> > This might require some additional changes -
> > for instance, with the "sink-record-latency" metric, we might only want
> to
> > have a "max" metric since "avg" would require recording a value on the
> > sensor for each record (whereas we can get a "max" by only recording a
> > metric value for the oldest record in each batch).
>
> 5.4. Recording record-latency per batch may not be as useful as there is no
> guarantee that the oldest record will be representative of the batch.
>
> On Sat, 3 Sept 2022 at 16:02, Yash Mayya <ya...@gmail.com> wrote:
>
> > Hi Jorge and Sagar,
> >
> > I think it makes sense to not have a min metric for either to remain
> > consistent with the existing put-batch and poll-batch metrics (it doesn't
> > seem particularly useful either anyway). Also, the new
> > "sink-record-latency" metric name looks fine to me, thanks for making the
> > changes! Another naming related suggestion I had was with the
> > "convert-time" metrics - we should probably include transformations in
> the
> > name since SMTs could definitely be attributable to a sizable chunk of
> the
> > latency depending on the specific transformation chain.
> >
> > I have one high level question with respect to implementation -
> currently,
> > it seems like both source and sink tasks only record metrics at a "batch"
> > level, not on an individual record level. I think it might be additional
> > overhead if we want to record these new metrics all at the record level?
> > Could we instead make all of these new metrics for batches of records
> > rather than individual records in order to remain consistent with the
> > existing task level metrics? This might require some additional changes -
> > for instance, with the "sink-record-latency" metric, we might only want
> to
> > have a "max" metric since "avg" would require recording a value on the
> > sensor for each record (whereas we can get a "max" by only recording a
> > metric value for the oldest record in each batch).
> >
> > Thanks,
> > Yash
> >
> > On Fri, Sep 2, 2022 at 3:16 PM Sagar <sa...@gmail.com> wrote:
> >
> > > Hi Jorge,
> > >
> > > Thanks for the changes.
> > >
> > > Regarding the metrics, I meant something like this:
> > >
> >
> kafka.connect:type=sink-task-metrics,connector="{connector}",task="{task}"
> > >
> > > the way it's defined in
> > > https://kafka.apache.org/documentation/#connect_monitoring for the
> > > metrics.
> > >
> > > I see what you mean by the 3 metrics and how it can be interpreted. The
> > > only thing I would argue is do we need sink-record-latency-min? Maybe
> we
> > > could remove this min metric as well and make all of the 3 e2e metrics
> > > consistent(since put-batch also doesn't expose a min which makes sense
> to
> > > me). I think this is in contrast to what Yash pointed out above so I
> > would
> > > like to hear his thoughts as well.
> > >
> > > The other point Yash mentioned about the slightly flawed definition of
> > e2e
> > > is also true in a sense. But I have a feeling that once the records are
> > > polled by the connector tasks, it would be difficult to track the final
> > leg
> > > via the framework. Probably users can track the metrics at their end to
> > > figure that out. Do you think that makes sense?
> > >
> > > Thanks!
> > > Sagar.
> > >
> > >
> > >
> > >
> > > On Thu, Sep 1, 2022 at 11:40 PM Jorge Esteban Quilcate Otoya <
> > > quilcate.jorge@gmail.com> wrote:
> > >
> > > > Hi Sagar and Yash,
> > > >
> > > > Thanks for your feedback!
> > > >
> > > > > 1) I am assuming the new metrics would be task level metric.
> > > >
> > > > 1.1 Yes, it will be a task level metric, implemented on the
> > > > Worker[Source/Sink]Task.
> > > >
> > > > > Could you specify the way it's done for other sink/source
> connector?
> > > >
> > > > 1.2. Not sure what you mean by this. Could you elaborate a bit
> more?
> > > >
> > > > > 2. I am slightly confused about the e2e latency metric...
> > > >
> > > > 2.1. Yes, I see. I was trying to bring a similar concept as in
> Streams
> > > with
> > > > KIP-613, though the e2e concept may not be translatable.
> > > > We could keep it as `sink-record-latency` to avoid conflating
> > concepts. A
> > > > similar metric naming was proposed in KIP-489 but at the consumer
> > level —
> > > > though it seems dormant for a couple of years.
> > > >
> > > > > However, the put-batch time measures the
> > > > > time to put a batch of records to external sink. So, I would assume
> > > the 2
> > > > > can't be added as is to compute the e2e latency. Maybe I am missing
> > > > > something here. Could you plz clarify this.
> > > >
> > > > 2.2. Yes, agree. Not necessarily added, but with the 3 latencies
> (poll,
> > > > convert, putBatch) will be clearer where the bottleneck may be, and
> > > > represent the internal processing.
> > > >
> > > > > however, as per the KIP it looks like it will be
> > > > > the latency between when the record was written to Kafka and when
> the
> > > > > record is returned by a sink task's consumer's poll?
> > > >
> > > > 3.1. Agree. 2.1. could help to clarify this.
> > > >
> > > > > One more thing - I was wondering
> > > > > if there's a particular reason for having a min metric for e2e
> > latency
> > > > but
> > > > > not for convert time?
> > > >
> > > > 3.2. Was following KIP-613 for e2e which seems useful to compare with
> > > Max a
> > > > get an idea of the window of results, though current latencies in
> > > Connector
> > > > do not include Min, and that's why I haven't added it for convert
> > > latency.
> > > > Do you think it makes sense to extend latency metrics with Min?
> > > >
> > > > KIP is updated to clarify some of these changes.
> > > >
> > > > Many thanks,
> > > > Jorge.
> > > >
> > > > On Thu, 1 Sept 2022 at 18:11, Yash Mayya <ya...@gmail.com>
> wrote:
> > > >
> > > > > Hi Jorge,
> > > > >
> > > > > Thanks for the KIP! I have the same confusion with the e2e-latency
> > > > metrics
> > > > > as Sagar above. "e2e" would seem to indicate the latency between
> when
> > > the
> > > > > record was written to Kafka and when the record was written to the
> > sink
> > > > > system by the connector - however, as per the KIP it looks like it
> > will
> > > > be
> > > > > the latency between when the record was written to Kafka and when
> the
> > > > > record is returned by a sink task's consumer's poll? I think that
> > > metric
> > > > > will be a little confusing to interpret. One more thing - I was
> > > wondering
> > > > > if there's a particular reason for having a min metric for e2e
> > latency
> > > > but
> > > > > not for convert time?
> > > > >
> > > > > Thanks,
> > > > > Yash
> > > > >
> > > > > On Thu, Sep 1, 2022 at 8:59 PM Sagar <sa...@gmail.com>
> > > wrote:
> > > > >
> > > > > > Hi Jorge,
> > > > > >
> > > > > > Thanks for the KIP. It looks like a very good addition. I skimmed
> > > > through
> > > > > > once and had a couple of questions =>
> > > > > >
> > > > > > 1) I am assuming the new metrics would be task level metric.
> Could
> > > you
> > > > > > specify the way it's done for other sink/source connector?
> > > > > > 2) I am slightly confused about the e2e latency metric. Let's
> > > consider
> > > > > the
> > > > > > sink connector metric. If I look at the way it's supposed to be
> > > > > calculated,
> > > > > > i.e the difference between the record timestamp and the wall
> clock
> > > > time,
> > > > > it
> > > > > > looks like a per record metric. However, the put-batch time
> > measures
> > > > the
> > > > > > time to put a batch of records to external sink. So, I would
> assume
> > > > the 2
> > > > > > can't be added as is to compute the e2e latency. Maybe I am
> missing
> > > > > > something here. Could you plz clarify this.
> > > > > >
> > > > > > Thanks!
> > > > > > Sagar.
> > > > > >
> > > > > > On Tue, Aug 30, 2022 at 8:43 PM Jorge Esteban Quilcate Otoya <
> > > > > > quilcate.jorge@gmail.com> wrote:
> > > > > >
> > > > > > > Hi all,
> > > > > > >
> > > > > > > I'd like to start a discussion thread on KIP-864: Add
> End-To-End
> > > > > Latency
> > > > > > > Metrics to Connectors.
> > > > > > > This KIP aims to improve the metrics available on Source and
> Sink
> > > > > > > Connectors to measure end-to-end latency, including source and
> > sink
> > > > > > record
> > > > > > > conversion time, and sink record e2e latency (similar to
> KIP-613
> > > for
> > > > > > > Streams).
> > > > > > >
> > > > > > > The KIP is here:
> > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-864%3A+Add+End-To-End+Latency+Metrics+to+Connectors
> > > > > > >
> > > > > > > Please take a look and let me know what you think.
> > > > > > >
> > > > > > > Cheers,
> > > > > > > Jorge.
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] KIP-864: Add End-To-End Latency Metrics to Connectors

Posted by Jorge Esteban Quilcate Otoya <qu...@gmail.com>.
Hi Sagar and Yash,

> the way it's defined in
https://kafka.apache.org/documentation/#connect_monitoring for the metrics

4.1. Got it. Will add it to the KIP.

> The only thing I would argue is do we need sink-record-latency-min? Maybe
we
> could remove this min metric as well and make all of the 3 e2e metrics
> consistent

4.2. I see. Will remove it from the KIP.

> Probably users can track the metrics at their end to
> figure that out. Do you think that makes sense?

4.3. Yes, agree. With these new metrics it should be easier for users to
track this.

> I think it makes sense to not have a min metric for either to remain
> consistent with the existing put-batch and poll-batch metrics

5.1. Got it. Same as 4.2

> Another naming related suggestion I had was with the
> "convert-time" metrics - we should probably include transformations in the
> name since SMTs could definitely be attributable to a sizable chunk of the
> latency depending on the specific transformation chain.

5.2. Makes sense. I'm proposing to add `sink-record-convert-transform...`
and `source-record-transform-convert...` to represent correctly the order
of operations.

> it seems like both source and sink tasks only record metrics at a "batch"
> level, not on an individual record level. I think it might be additional
> overhead if we want to record these new metrics all at the record level?

5.3. I initially considered implementing all metrics at the batch
level, but given how the framework processes records, I fell back to the
proposed approach:
- Sink Task:
  - `WorkerSinkTask#convertMessages(msgs)` already iterates over records,
so there is no additional overhead to capture record latency per record.
    -
https://github.com/apache/kafka/blob/9841647c4fe422532f448423c92d26e4fdcb8932/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/WorkerSinkTask.java#L490-L514
  - `WorkerSinkTask#convertAndTransformRecord(record)` actually happens
individually. Measuring this operation per batch would include processing
that is not strictly part of "convert and transform"
    -
https://github.com/apache/kafka/blob/9841647c4fe422532f448423c92d26e4fdcb8932/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/WorkerSinkTask.java#L518
- Source Task:
  - `AbstractWorkerSourceTask#sendRecords` iterates over a batch and
applies transforms and convert record individually as well:
    -
https://github.com/apache/kafka/blob/9841647c4fe422532f448423c92d26e4fdcb8932/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/AbstractWorkerSourceTask.java#L389-L390
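For reference, a minimal sketch of the per-record hook (hypothetical names,
not the actual `WorkerSinkTask` code), assuming record timestamps and the
wall clock are both epoch millis:

```java
// Sketch: recording sink-record e2e latency per record, inside the
// iteration that convertMessages already performs. The tracker and
// its names are hypothetical stand-ins for a Kafka metrics Sensor.
class RecordLatencyTracker {
    private double maxMs = 0.0;
    private double sumMs = 0.0;
    private long count = 0;

    // Called once per record from the existing conversion loop, so no
    // extra pass over the batch is required.
    void record(long recordTimestampMs, long nowMs) {
        double latencyMs = Math.max(0L, nowMs - recordTimestampMs);
        maxMs = Math.max(maxMs, latencyMs);
        sumMs += latencyMs;
        count++;
    }

    double maxMs() { return maxMs; }

    double avgMs() { return count == 0 ? 0.0 : sumMs / count; }
}
```

A convertMessages-style loop would then just call
`tracker.record(msg.timestamp(), now)` for each polled record.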

> This might require some additional changes -
> for instance, with the "sink-record-latency" metric, we might only want to
> have a "max" metric since "avg" would require recording a value on the
> sensor for each record (whereas we can get a "max" by only recording a
> metric value for the oldest record in each batch).

5.4. Recording record-latency per batch may not be as useful, since there is
no guarantee that the oldest record will be representative of the batch.
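A tiny numeric illustration of this point (the timestamps are made up): for
a batch polled at t=2000ms, the per-record average differs a lot from what
the oldest record alone would suggest.

```java
// Sketch: per-record vs oldest-record-per-batch latency stats.
// Timestamps and the batch itself are hypothetical.
class BatchVsRecordLatency {
    // Per-record average latency of a batch, relative to nowMs.
    static double perRecordAvg(long[] timestampsMs, long nowMs) {
        double sum = 0.0;
        for (long ts : timestampsMs) sum += (nowMs - ts);
        return sum / timestampsMs.length;
    }

    // Latency seen when only the oldest record of the batch is
    // recorded: for a single batch this is simply the maximum latency.
    static double oldestOnly(long[] timestampsMs, long nowMs) {
        long oldest = Long.MAX_VALUE;
        for (long ts : timestampsMs) oldest = Math.min(oldest, ts);
        return nowMs - oldest;
    }
}
```

With timestamps {1000, 1900, 1990} and now=2000, the per-record average is
370ms while the oldest record alone reports 1000ms, so the max survives but
the average would be badly skewed.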

On Sat, 3 Sept 2022 at 16:02, Yash Mayya <ya...@gmail.com> wrote:

> Hi Jorge and Sagar,
>
> I think it makes sense to not have a min metric for either to remain
> consistent with the existing put-batch and poll-batch metrics (it doesn't
> seem particularly useful either anyway). Also, the new
> "sink-record-latency" metric name looks fine to me, thanks for making the
> changes! Another naming related suggestion I had was with the
> "convert-time" metrics - we should probably include transformations in the
> name since SMTs could definitely be attributable to a sizable chunk of the
> latency depending on the specific transformation chain.
>
> I have one high level question with respect to implementation - currently,
> it seems like both source and sink tasks only record metrics at a "batch"
> level, not on an individual record level. I think it might be additional
> overhead if we want to record these new metrics all at the record level?
> Could we instead make all of these new metrics for batches of records
> rather than individual records in order to remain consistent with the
> existing task level metrics? This might require some additional changes -
> for instance, with the "sink-record-latency" metric, we might only want to
> have a "max" metric since "avg" would require recording a value on the
> sensor for each record (whereas we can get a "max" by only recording a
> metric value for the oldest record in each batch).
>
> Thanks,
> Yash
>
> On Fri, Sep 2, 2022 at 3:16 PM Sagar <sa...@gmail.com> wrote:
>
> > Hi Jorge,
> >
> > Thanks for the changes.
> >
> > Regarding the metrics, I meant something like this:
> >
> kafka.connect:type=sink-task-metrics,connector="{connector}",task="{task}"
> >
> > the way it's defined in
> > https://kafka.apache.org/documentation/#connect_monitoring for the
> > metrics.
> >
> > I see what you mean by the 3 metrics and how it can be interpreted. The
> > only thing I would argue is do we need sink-record-latency-min? Maybe we
> > could remove this min metric as well and make all of the 3 e2e metrics
> > consistent(since put-batch also doesn't expose a min which makes sense to
> > me). I think this is in contrast to what Yash pointed out above so I
> would
> > like to hear his thoughts as well.
> >
> > The other point Yash mentioned about the slightly flawed definition of
> e2e
> > is also true in a sense. But I have a feeling that's one the records are
> > polled by the connector tasks, it would be difficult to track the final
> leg
> > via the framework. Probably users can track the metrics at their end to
> > figure that out. Do you think that makes sense?
> >
> > Thanks!
> > Sagar.
> >
> >
> >
> >
> > On Thu, Sep 1, 2022 at 11:40 PM Jorge Esteban Quilcate Otoya <
> > quilcate.jorge@gmail.com> wrote:
> >
> > > Hi Sagar and Yash,
> > >
> > > Thanks for your feedback!
> > >
> > > > 1) I am assuming the new metrics would be task level metric.
> > >
> > > 1.1 Yes, it will be a task level metric, implemented on the
> > > Worker[Source/Sink]Task.
> > >
> > > > Could you specify the way it's done for other sink/source connector?
> > >
> > > 1.2. Not sure what do you mean by this. Could you elaborate a bit more?
> > >
> > > > 2. I am slightly confused about the e2e latency metric...
> > >
> > > 2.1. Yes, I see. I was trying to bring a similar concept as in Streams
> > with
> > > KIP-613, though the e2e concept may not be translatable.
> > > We could keep it as `sink-record-latency` to avoid conflating
> concepts. A
> > > similar metric naming was proposed in KIP-489 but at the consumer
> level —
> > > though it seems dormant for a couple of years.
> > >
> > > > However, the put-batch time measures the
> > > > time to put a batch of records to external sink. So, I would assume
> > the 2
> > > > can't be added as is to compute the e2e latency. Maybe I am missing
> > > > something here. Could you plz clarify this.
> > >
> > > 2.2. Yes, agree. Not necessarily added, but with the 3 latencies (poll,
> > > convert, putBatch) will be clearer where the bottleneck may be, and
> > > represent the internal processing.
> > >
> > > > however, as per the KIP it looks like it will be
> > > > the latency between when the record was written to Kafka and when the
> > > > record is returned by a sink task's consumer's poll?
> > >
> > > 3.1. Agree. 2.1. could help to clarify this.
> > >
> > > > One more thing - I was wondering
> > > > if there's a particular reason for having a min metric for e2e
> latency
> > > but
> > > > not for convert time?
> > >
> > > 3.2. Was following KIP-613 for e2e which seems useful to compare with
> > Max a
> > > get an idea of the window of results, though current latencies in
> > Connector
> > > do not include Min, and that's why I haven't added it for convert
> > latency.
> > > Do you think it make sense to extend latency metrics with Min?
> > >
> > > KIP is updated to clarify some of these changes.
> > >
> > > Many thanks,
> > > Jorge.
> > >
> > > On Thu, 1 Sept 2022 at 18:11, Yash Mayya <ya...@gmail.com> wrote:
> > >
> > > > Hi Jorge,
> > > >
> > > > Thanks for the KIP! I have the same confusion with the e2e-latency
> > > metrics
> > > > as Sagar above. "e2e" would seem to indicate the latency between when
> > the
> > > > record was written to Kafka and when the record was written to the
> sink
> > > > system by the connector - however, as per the KIP it looks like it
> will
> > > be
> > > > the latency between when the record was written to Kafka and when the
> > > > record is returned by a sink task's consumer's poll? I think that
> > metric
> > > > will be a little confusing to interpret. One more thing - I was
> > wondering
> > > > if there's a particular reason for having a min metric for e2e
> latency
> > > but
> > > > not for convert time?
> > > >
> > > > Thanks,
> > > > Yash
> > > >
> > > > On Thu, Sep 1, 2022 at 8:59 PM Sagar <sa...@gmail.com>
> > wrote:
> > > >
> > > > > Hi Jorge,
> > > > >
> > > > > Thanks for the KIP. It looks like a very good addition. I skimmed
> > > through
> > > > > once and had a couple of questions =>
> > > > >
> > > > > 1) I am assuming the new metrics would be task level metric. Could
> > you
> > > > > specify the way it's done for other sink/source connector?
> > > > > 2) I am slightly confused about the e2e latency metric. Let's
> > consider
> > > > the
> > > > > sink connector metric. If I look at the way it's supposed to be
> > > > calculated,
> > > > > i.e the difference between the record timestamp and the wall clock
> > > time,
> > > > it
> > > > > looks like a per record metric. However, the put-batch time
> measures
> > > the
> > > > > time to put a batch of records to external sink. So, I would assume
> > > the 2
> > > > > can't be added as is to compute the e2e latency. Maybe I am missing
> > > > > something here. Could you plz clarify this.
> > > > >
> > > > > Thanks!
> > > > > Sagar.
> > > > >
> > > > > On Tue, Aug 30, 2022 at 8:43 PM Jorge Esteban Quilcate Otoya <
> > > > > quilcate.jorge@gmail.com> wrote:
> > > > >
> > > > > > Hi all,
> > > > > >
> > > > > > I'd like to start a discussion thread on KIP-864: Add End-To-End
> > > > Latency
> > > > > > Metrics to Connectors.
> > > > > > This KIP aims to improve the metrics available on Source and Sink
> > > > > > Connectors to measure end-to-end latency, including source and
> sink
> > > > > record
> > > > > > conversion time, and sink record e2e latency (similar to KIP-613
> > for
> > > > > > Streams).
> > > > > >
> > > > > > The KIP is here:
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-864%3A+Add+End-To-End+Latency+Metrics+to+Connectors
> > > > > >
> > > > > > Please take a look and let me know what you think.
> > > > > >
> > > > > > Cheers,
> > > > > > Jorge.
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] KIP-864: Add End-To-End Latency Metrics to Connectors

Posted by Yash Mayya <ya...@gmail.com>.
Hi Jorge and Sagar,

I think it makes sense to not have a min metric for either to remain
consistent with the existing put-batch and poll-batch metrics (it doesn't
seem particularly useful either anyway). Also, the new
"sink-record-latency" metric name looks fine to me, thanks for making the
changes! Another naming related suggestion I had was with the
"convert-time" metrics - we should probably include transformations in the
name since SMTs could definitely account for a sizable chunk of the
latency depending on the specific transformation chain.

I have one high level question with respect to implementation - currently,
it seems like both source and sink tasks only record metrics at a "batch"
level, not on an individual record level. I think it might be additional
overhead if we want to record these new metrics all at the record level?
Could we instead make all of these new metrics for batches of records
rather than individual records in order to remain consistent with the
existing task level metrics? This might require some additional changes -
for instance, with the "sink-record-latency" metric, we might only want to
have a "max" metric since "avg" would require recording a value on the
sensor for each record (whereas we can get a "max" by only recording a
metric value for the oldest record in each batch).
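
A standalone sketch of that batch-level "max only" idea (made-up names and a
stub record type, not Connect internals): since the oldest record in a batch
yields the largest latency, one sensor update per batch is enough for a max
metric.

```java
import java.util.List;

// Hypothetical sketch: derive one latency value per batch from the oldest
// record, which bounds the max. Names are illustrative, not Connect code.
public class BatchMaxLatencySketch {

    record SinkRecordStub(long timestampMs) {}

    // One value per batch instead of one per record.
    static long batchMaxLatency(List<SinkRecordStub> batch, long nowMs) {
        long oldestTs = batch.stream()
            .mapToLong(SinkRecordStub::timestampMs)
            .min()
            .orElse(nowMs); // empty batch -> zero latency
        return nowMs - oldestTs;
    }

    public static void main(String[] args) {
        List<SinkRecordStub> batch =
            List.of(new SinkRecordStub(700L), new SinkRecordStub(250L));
        System.out.println(batchMaxLatency(batch, 1_000L)); // prints 750
    }
}
```

Note that averaging these per-batch values would be skewed toward each
batch's oldest record, so it would not match a true per-record avg.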

Thanks,
Yash

On Fri, Sep 2, 2022 at 3:16 PM Sagar <sa...@gmail.com> wrote:

> Hi Jorge,
>
> Thanks for the changes.
>
> Regarding the metrics, I meant something like this:
> kafka.connect:type=sink-task-metrics,connector="{connector}",task="{task}"
>
> the way it's defined in
> https://kafka.apache.org/documentation/#connect_monitoring for the
> metrics.
>
> I see what you mean by the 3 metrics and how it can be interpreted. The
> only thing I would argue is do we need sink-record-latency-min? Maybe we
> could remove this min metric as well and make all of the 3 e2e metrics
> consistent(since put-batch also doesn't expose a min which makes sense to
> me). I think this is in contrast to what Yash pointed out above so I would
> like to hear his thoughts as well.
>
> The other point Yash mentioned about the slightly flawed definition of e2e
> is also true in a sense. But I have a feeling that's one the records are
> polled by the connector tasks, it would be difficult to track the final leg
> via the framework. Probably users can track the metrics at their end to
> figure that out. Do you think that makes sense?
>
> Thanks!
> Sagar.
>
>
>
>
> On Thu, Sep 1, 2022 at 11:40 PM Jorge Esteban Quilcate Otoya <
> quilcate.jorge@gmail.com> wrote:
>
> > Hi Sagar and Yash,
> >
> > Thanks for your feedback!
> >
> > > 1) I am assuming the new metrics would be task level metric.
> >
> > 1.1 Yes, it will be a task level metric, implemented on the
> > Worker[Source/Sink]Task.
> >
> > > Could you specify the way it's done for other sink/source connector?
> >
> > 1.2. Not sure what do you mean by this. Could you elaborate a bit more?
> >
> > > 2. I am slightly confused about the e2e latency metric...
> >
> > 2.1. Yes, I see. I was trying to bring a similar concept as in Streams
> with
> > KIP-613, though the e2e concept may not be translatable.
> > We could keep it as `sink-record-latency` to avoid conflating concepts. A
> > similar metric naming was proposed in KIP-489 but at the consumer level —
> > though it seems dormant for a couple of years.
> >
> > > However, the put-batch time measures the
> > > time to put a batch of records to external sink. So, I would assume
> the 2
> > > can't be added as is to compute the e2e latency. Maybe I am missing
> > > something here. Could you plz clarify this.
> >
> > 2.2. Yes, agree. Not necessarily added, but with the 3 latencies (poll,
> > convert, putBatch) will be clearer where the bottleneck may be, and
> > represent the internal processing.
> >
> > > however, as per the KIP it looks like it will be
> > > the latency between when the record was written to Kafka and when the
> > > record is returned by a sink task's consumer's poll?
> >
> > 3.1. Agree. 2.1. could help to clarify this.
> >
> > > One more thing - I was wondering
> > > if there's a particular reason for having a min metric for e2e latency
> > but
> > > not for convert time?
> >
> > 3.2. Was following KIP-613 for e2e which seems useful to compare with
> Max a
> > get an idea of the window of results, though current latencies in
> Connector
> > do not include Min, and that's why I haven't added it for convert
> latency.
> > Do you think it make sense to extend latency metrics with Min?
> >
> > KIP is updated to clarify some of these changes.
> >
> > Many thanks,
> > Jorge.
> >
> > On Thu, 1 Sept 2022 at 18:11, Yash Mayya <ya...@gmail.com> wrote:
> >
> > > Hi Jorge,
> > >
> > > Thanks for the KIP! I have the same confusion with the e2e-latency
> > metrics
> > > as Sagar above. "e2e" would seem to indicate the latency between when
> the
> > > record was written to Kafka and when the record was written to the sink
> > > system by the connector - however, as per the KIP it looks like it will
> > be
> > > the latency between when the record was written to Kafka and when the
> > > record is returned by a sink task's consumer's poll? I think that
> metric
> > > will be a little confusing to interpret. One more thing - I was
> wondering
> > > if there's a particular reason for having a min metric for e2e latency
> > but
> > > not for convert time?
> > >
> > > Thanks,
> > > Yash
> > >
> > > On Thu, Sep 1, 2022 at 8:59 PM Sagar <sa...@gmail.com>
> wrote:
> > >
> > > > Hi Jorge,
> > > >
> > > > Thanks for the KIP. It looks like a very good addition. I skimmed
> > through
> > > > once and had a couple of questions =>
> > > >
> > > > 1) I am assuming the new metrics would be task level metric. Could
> you
> > > > specify the way it's done for other sink/source connector?
> > > > 2) I am slightly confused about the e2e latency metric. Let's
> consider
> > > the
> > > > sink connector metric. If I look at the way it's supposed to be
> > > calculated,
> > > > i.e the difference between the record timestamp and the wall clock
> > time,
> > > it
> > > > looks like a per record metric. However, the put-batch time measures
> > the
> > > > time to put a batch of records to external sink. So, I would assume
> > the 2
> > > > can't be added as is to compute the e2e latency. Maybe I am missing
> > > > something here. Could you plz clarify this.
> > > >
> > > > Thanks!
> > > > Sagar.
> > > >
> > > > On Tue, Aug 30, 2022 at 8:43 PM Jorge Esteban Quilcate Otoya <
> > > > quilcate.jorge@gmail.com> wrote:
> > > >
> > > > > Hi all,
> > > > >
> > > > > I'd like to start a discussion thread on KIP-864: Add End-To-End
> > > Latency
> > > > > Metrics to Connectors.
> > > > > This KIP aims to improve the metrics available on Source and Sink
> > > > > Connectors to measure end-to-end latency, including source and sink
> > > > record
> > > > > conversion time, and sink record e2e latency (similar to KIP-613
> for
> > > > > Streams).
> > > > >
> > > > > The KIP is here:
> > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-864%3A+Add+End-To-End+Latency+Metrics+to+Connectors
> > > > >
> > > > > Please take a look and let me know what you think.
> > > > >
> > > > > Cheers,
> > > > > Jorge.
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] KIP-864: Add End-To-End Latency Metrics to Connectors

Posted by Sagar <sa...@gmail.com>.
Hi Jorge,

Thanks for the changes.

Regarding the metrics, I meant something like this:
kafka.connect:type=sink-task-metrics,connector="{connector}",task="{task}"

the way it's defined in
https://kafka.apache.org/documentation/#connect_monitoring for the metrics.

I see what you mean by the 3 metrics and how it can be interpreted. The
only thing I would argue is do we need sink-record-latency-min? Maybe we
could remove this min metric as well and make all of the 3 e2e metrics
consistent(since put-batch also doesn't expose a min which makes sense to
me). I think this is in contrast to what Yash pointed out above so I would
like to hear his thoughts as well.

The other point Yash mentioned about the slightly flawed definition of e2e
is also true in a sense. But I have a feeling that once the records are
polled by the connector tasks, it would be difficult to track the final leg
via the framework. Probably users can track the metrics at their end to
figure that out. Do you think that makes sense?

Thanks!
Sagar.




On Thu, Sep 1, 2022 at 11:40 PM Jorge Esteban Quilcate Otoya <
quilcate.jorge@gmail.com> wrote:

> Hi Sagar and Yash,
>
> Thanks for your feedback!
>
> > 1) I am assuming the new metrics would be task level metric.
>
> 1.1 Yes, it will be a task level metric, implemented on the
> Worker[Source/Sink]Task.
>
> > Could you specify the way it's done for other sink/source connector?
>
> 1.2. Not sure what do you mean by this. Could you elaborate a bit more?
>
> > 2. I am slightly confused about the e2e latency metric...
>
> 2.1. Yes, I see. I was trying to bring a similar concept as in Streams with
> KIP-613, though the e2e concept may not be translatable.
> We could keep it as `sink-record-latency` to avoid conflating concepts. A
> similar metric naming was proposed in KIP-489 but at the consumer level —
> though it seems dormant for a couple of years.
>
> > However, the put-batch time measures the
> > time to put a batch of records to external sink. So, I would assume the 2
> > can't be added as is to compute the e2e latency. Maybe I am missing
> > something here. Could you plz clarify this.
>
> 2.2. Yes, agree. Not necessarily added, but with the 3 latencies (poll,
> convert, putBatch) will be clearer where the bottleneck may be, and
> represent the internal processing.
>
> > however, as per the KIP it looks like it will be
> > the latency between when the record was written to Kafka and when the
> > record is returned by a sink task's consumer's poll?
>
> 3.1. Agree. 2.1. could help to clarify this.
>
> > One more thing - I was wondering
> > if there's a particular reason for having a min metric for e2e latency
> but
> > not for convert time?
>
> 3.2. Was following KIP-613 for e2e which seems useful to compare with Max a
> get an idea of the window of results, though current latencies in Connector
> do not include Min, and that's why I haven't added it for convert latency.
> Do you think it make sense to extend latency metrics with Min?
>
> KIP is updated to clarify some of these changes.
>
> Many thanks,
> Jorge.
>
> On Thu, 1 Sept 2022 at 18:11, Yash Mayya <ya...@gmail.com> wrote:
>
> > Hi Jorge,
> >
> > Thanks for the KIP! I have the same confusion with the e2e-latency
> metrics
> > as Sagar above. "e2e" would seem to indicate the latency between when the
> > record was written to Kafka and when the record was written to the sink
> > system by the connector - however, as per the KIP it looks like it will
> be
> > the latency between when the record was written to Kafka and when the
> > record is returned by a sink task's consumer's poll? I think that metric
> > will be a little confusing to interpret. One more thing - I was wondering
> > if there's a particular reason for having a min metric for e2e latency
> but
> > not for convert time?
> >
> > Thanks,
> > Yash
> >
> > On Thu, Sep 1, 2022 at 8:59 PM Sagar <sa...@gmail.com> wrote:
> >
> > > Hi Jorge,
> > >
> > > Thanks for the KIP. It looks like a very good addition. I skimmed
> through
> > > once and had a couple of questions =>
> > >
> > > 1) I am assuming the new metrics would be task level metric. Could you
> > > specify the way it's done for other sink/source connector?
> > > 2) I am slightly confused about the e2e latency metric. Let's consider
> > the
> > > sink connector metric. If I look at the way it's supposed to be
> > calculated,
> > > i.e the difference between the record timestamp and the wall clock
> time,
> > it
> > > looks like a per record metric. However, the put-batch time measures
> the
> > > time to put a batch of records to external sink. So, I would assume
> the 2
> > > can't be added as is to compute the e2e latency. Maybe I am missing
> > > something here. Could you plz clarify this.
> > >
> > > Thanks!
> > > Sagar.
> > >
> > > On Tue, Aug 30, 2022 at 8:43 PM Jorge Esteban Quilcate Otoya <
> > > quilcate.jorge@gmail.com> wrote:
> > >
> > > > Hi all,
> > > >
> > > > I'd like to start a discussion thread on KIP-864: Add End-To-End
> > Latency
> > > > Metrics to Connectors.
> > > > This KIP aims to improve the metrics available on Source and Sink
> > > > Connectors to measure end-to-end latency, including source and sink
> > > record
> > > > conversion time, and sink record e2e latency (similar to KIP-613 for
> > > > Streams).
> > > >
> > > > The KIP is here:
> > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-864%3A+Add+End-To-End+Latency+Metrics+to+Connectors
> > > >
> > > > Please take a look and let me know what you think.
> > > >
> > > > Cheers,
> > > > Jorge.
> > > >
> > >
> >
>

Re: [DISCUSS] KIP-864: Add End-To-End Latency Metrics to Connectors

Posted by Jorge Esteban Quilcate Otoya <qu...@gmail.com>.
Hi Sagar and Yash,

Thanks for your feedback!

> 1) I am assuming the new metrics would be task level metric.

1.1 Yes, it will be a task level metric, implemented on the
Worker[Source/Sink]Task.

> Could you specify the way it's done for other sink/source connector?

1.2. Not sure what you mean by this. Could you elaborate a bit more?

> 2. I am slightly confused about the e2e latency metric...

2.1. Yes, I see. I was trying to bring a similar concept as in Streams with
KIP-613, though the e2e concept may not be translatable.
We could keep it as `sink-record-latency` to avoid conflating concepts. A
similar metric naming was proposed in KIP-489 but at the consumer level —
though it seems dormant for a couple of years.

> However, the put-batch time measures the
> time to put a batch of records to external sink. So, I would assume the 2
> can't be added as is to compute the e2e latency. Maybe I am missing
> something here. Could you plz clarify this.

2.2. Yes, agree. Not necessarily added, but with the 3 latencies (poll,
convert, putBatch) it will be clearer where the bottleneck may be, and they
represent the internal processing.

> however, as per the KIP it looks like it will be
> the latency between when the record was written to Kafka and when the
> record is returned by a sink task's consumer's poll?

3.1. Agree. 2.1. could help to clarify this.

> One more thing - I was wondering
> if there's a particular reason for having a min metric for e2e latency but
> not for convert time?

3.2. I was following KIP-613 for e2e, which seems useful to compare with Max and
get an idea of the window of results, though current latencies in Connector
do not include Min, and that's why I haven't added it for convert latency.
Do you think it makes sense to extend latency metrics with Min?
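
For illustration, a minimal standalone accumulator showing what a latency
sensor exposing min/max/avg would report (Kafka's metrics library provides
equivalent measurable stats — Min, Max, Avg — that a Sensor can register;
this sketch just avoids that dependency):

```java
// Hypothetical stand-in for a latency sensor tracking min/max/avg.
public class LatencyStatsSketch {
    private long count = 0;
    private long sum = 0;
    private long min = Long.MAX_VALUE;
    private long max = Long.MIN_VALUE;

    void record(long latencyMs) {
        count++;
        sum += latencyMs;
        min = Math.min(min, latencyMs);
        max = Math.max(max, latencyMs);
    }

    double avg() { return count == 0 ? Double.NaN : (double) sum / count; }
    long min() { return min; }
    long max() { return max; }

    public static void main(String[] args) {
        LatencyStatsSketch stats = new LatencyStatsSketch();
        for (long l : new long[] {10, 30, 20}) stats.record(l);
        System.out.println(stats.min() + " " + stats.max() + " " + stats.avg());
        // prints: 10 30 20.0
    }
}
```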

KIP is updated to clarify some of these changes.

Many thanks,
Jorge.

On Thu, 1 Sept 2022 at 18:11, Yash Mayya <ya...@gmail.com> wrote:

> Hi Jorge,
>
> Thanks for the KIP! I have the same confusion with the e2e-latency metrics
> as Sagar above. "e2e" would seem to indicate the latency between when the
> record was written to Kafka and when the record was written to the sink
> system by the connector - however, as per the KIP it looks like it will be
> the latency between when the record was written to Kafka and when the
> record is returned by a sink task's consumer's poll? I think that metric
> will be a little confusing to interpret. One more thing - I was wondering
> if there's a particular reason for having a min metric for e2e latency but
> not for convert time?
>
> Thanks,
> Yash
>
> On Thu, Sep 1, 2022 at 8:59 PM Sagar <sa...@gmail.com> wrote:
>
> > Hi Jorge,
> >
> > Thanks for the KIP. It looks like a very good addition. I skimmed through
> > once and had a couple of questions =>
> >
> > 1) I am assuming the new metrics would be task level metric. Could you
> > specify the way it's done for other sink/source connector?
> > 2) I am slightly confused about the e2e latency metric. Let's consider
> the
> > sink connector metric. If I look at the way it's supposed to be
> calculated,
> > i.e the difference between the record timestamp and the wall clock time,
> it
> > looks like a per record metric. However, the put-batch time measures the
> > time to put a batch of records to external sink. So, I would assume the 2
> > can't be added as is to compute the e2e latency. Maybe I am missing
> > something here. Could you plz clarify this.
> >
> > Thanks!
> > Sagar.
> >
> > On Tue, Aug 30, 2022 at 8:43 PM Jorge Esteban Quilcate Otoya <
> > quilcate.jorge@gmail.com> wrote:
> >
> > > Hi all,
> > >
> > > I'd like to start a discussion thread on KIP-864: Add End-To-End
> Latency
> > > Metrics to Connectors.
> > > This KIP aims to improve the metrics available on Source and Sink
> > > Connectors to measure end-to-end latency, including source and sink
> > record
> > > conversion time, and sink record e2e latency (similar to KIP-613 for
> > > Streams).
> > >
> > > The KIP is here:
> > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-864%3A+Add+End-To-End+Latency+Metrics+to+Connectors
> > >
> > > Please take a look and let me know what you think.
> > >
> > > Cheers,
> > > Jorge.
> > >
> >
>

Re: [DISCUSS] KIP-864: Add End-To-End Latency Metrics to Connectors

Posted by Yash Mayya <ya...@gmail.com>.
Hi Jorge,

Thanks for the KIP! I have the same confusion with the e2e-latency metrics
as Sagar above. "e2e" would seem to indicate the latency between when the
record was written to Kafka and when the record was written to the sink
system by the connector - however, as per the KIP it looks like it will be
the latency between when the record was written to Kafka and when the
record is returned by a sink task's consumer's poll? I think that metric
will be a little confusing to interpret. One more thing - I was wondering
if there's a particular reason for having a min metric for e2e latency but
not for convert time?
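
To make the two readings concrete — a standalone sketch with made-up
timestamps, not Connect internals:

```java
// Hypothetical sketch contrasting the two readings of "e2e latency"
// discussed above. All timestamps are invented; none of this is Connect code.
public class E2eDefinitionsSketch {

    // Latency of a record observed at some point, relative to its Kafka
    // (producer or log-append) timestamp.
    static long latencyMs(long recordTimestampMs, long observedAtMs) {
        return observedAtMs - recordTimestampMs;
    }

    public static void main(String[] args) {
        long recordTimestampMs = 1_000L; // record written to Kafka
        long pollReturnMs      = 1_250L; // sink task's consumer returned it
        long putCompleteMs     = 1_900L; // connector finished writing to the sink

        // Reading 1 (what the KIP measures): Kafka write -> consumer poll.
        long latencyAtPoll = latencyMs(recordTimestampMs, pollReturnMs);

        // Reading 2 (what "e2e" may suggest): Kafka write -> sink-system write.
        long latencyAtPut = latencyMs(recordTimestampMs, putCompleteMs);

        System.out.println(latencyAtPoll + " " + latencyAtPut); // prints: 250 900
    }
}
```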

Thanks,
Yash

On Thu, Sep 1, 2022 at 8:59 PM Sagar <sa...@gmail.com> wrote:

> Hi Jorge,
>
> Thanks for the KIP. It looks like a very good addition. I skimmed through
> once and had a couple of questions =>
>
> 1) I am assuming the new metrics would be task level metric. Could you
> specify the way it's done for other sink/source connector?
> 2) I am slightly confused about the e2e latency metric. Let's consider the
> sink connector metric. If I look at the way it's supposed to be calculated,
> i.e the difference between the record timestamp and the wall clock time, it
> looks like a per record metric. However, the put-batch time measures the
> time to put a batch of records to external sink. So, I would assume the 2
> can't be added as is to compute the e2e latency. Maybe I am missing
> something here. Could you plz clarify this.
>
> Thanks!
> Sagar.
>
> On Tue, Aug 30, 2022 at 8:43 PM Jorge Esteban Quilcate Otoya <
> quilcate.jorge@gmail.com> wrote:
>
> > Hi all,
> >
> > I'd like to start a discussion thread on KIP-864: Add End-To-End Latency
> > Metrics to Connectors.
> > This KIP aims to improve the metrics available on Source and Sink
> > Connectors to measure end-to-end latency, including source and sink
> record
> > conversion time, and sink record e2e latency (similar to KIP-613 for
> > Streams).
> >
> > The KIP is here:
> >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-864%3A+Add+End-To-End+Latency+Metrics+to+Connectors
> >
> > Please take a look and let me know what you think.
> >
> > Cheers,
> > Jorge.
> >
>