You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@flink.apache.org by Qingsheng Ren <re...@apache.org> on 2022/10/09 02:35:12 UTC

[DISCUSS] Reverting sink metric name changes made in 1.15

Hi devs and users,

I’d like to start a discussion about reverting a breaking change about sink metrics made in 1.15 by FLINK-26126 [1] and FLINK-26492 [2].

TL;DR

All sink metrics with name “numXXXOut” defined in FLIP-33 are replace by “numXXXSend” in FLINK-26126 and FLINK-26492. Considering metric names are public APIs, this is a breaking change to end users and not backward compatible. Also unfortunately this breaking change was not discussed in the mailing list before.

Background

As defined previously in FLIP-33 (the FLIP page has been changed so please refer to the old version [3] ), metric “numRecordsOut” is used for reporting the total number of output records since the sink started (number of records written to the external system), and similarly for “numRecordsOutPerSecond”, “numBytesOut”, “numBytesOutPerSecond” and “numRecordsOutError”. Most sinks are following this naming and definition. However, these metrics are ambiguous in the new Sink API as “numXXXOut” could be used by the output of SinkWriterOperator for reporting number of Committables delivered to SinkCommitterOperator. In order to resolve the conflict, FLINK-26126 and FLINK-26492 changed names of these metrics with “numXXXSend”.

Necessity of reverting this change

- Metric names are actually public API, as end users need to configure metric collecting and alerting system with metric names. Users have to reset all configurations related to affected metrics.
- This could also affect custom and external sinks not maintained by Flink, which might have implemented with numXXXOut metrics.
- The number of records sent to external system is way more important than the number of Committables sent to SinkCommitterOperator, as the latter one is just an internal implementation of sink. We could have a new metric name for the latter one instead.
- We could avoid splitting the project by version (like “plz use numXXXOut before 1.15 and use numXXXSend after”) if we revert it ASAP, cosidering 1.16 is still not released for now.

As a consequence, I’d like to hear from devs and users about your opinion on changing these metrics back to “numXXXOut”.

Looking forward to your reply!

[1] https://issues.apache.org/jira/browse/FLINK-26126
[2] https://issues.apache.org/jira/browse/FLINK-26492
[1] FLIP-33, version 18: https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=211883136

Best,
Qingsheng

Re: [DISCUSS] Reverting sink metric name changes made in 1.15

Posted by Xingbo Huang <hx...@apache.org>.
+1 for reverting these changes in Flink 1.16, so I will cancel 1.16.0-rc1.
+1 for `numXXXSend` as the alias of `numXXXOut` in 1.15.3.

Best,
Xingbo

Chesnay Schepler <ch...@apache.org> 于2022年10月10日周一 19:13写道:

> > I’m with Xintong’s idea to treat numXXXSend as an alias of numXXXOut
>
> But that's not possible. If it were that simple there would have never
> been a need to introduce another metric in the first place.
>
> It's a rather fundamental issue with how the new sinks work, in that they
> emit data to the external system (usually considered as "numRecordsOut" of
> sinks) while _also_ sending data to a downstream operator (usually
> considered as "numRecordsOut" of tasks).
> The original issue was that the numRecordsOut of the sink counted both
> (which is completely wrong).
>
> A new metric was always required; otherwise you inevitably end up breaking
> *some* semantic.
> Adding a new metric for what the sink writes to the external system is,
> for better or worse, more consistent with how these metrics usually work in
> Flink.
>
> On 10/10/2022 12:45, Qingsheng Ren wrote:
>
> Thanks everyone for joining the discussion!
>
> > Do you have any idea what has happened in the process here?
>
> The discussion in this PR [1] shows some details and could be helpful to
> understand the original motivation of the renaming. We do have a test case
> for guarding metrics but unfortunaly the case was also modified so the
> defense was broken.
>
> I think the reason why both the developer and the reviewer forgot to
> trigger an discussion and gave a green pass on the change is that metrics
> are quite “trivial” to be noticed as public APIs. As mentioned by Martijn I
> couldn’t find a place noting that metrics are public APIs and should be
> treated carefully while contributing and reviewing.
>
> IMHO three actions could be made to prevent this kind of changes in the
> future:
>
> a. Add test case for metrics (which we already have in SinkMetricsITCase)
> b. We emphasize that any public-interface breaking changes should be
> proposed by a FLIP or discussed in mailing list, and should be listed in
> the release note.
> c. We remind contributors and reviewers about what should be considered as
> public API, and include metric names in it.
>
> For b and c these two pages [2][3] might be proper places.
>
> About the patch to revert this, it looks like we have a consensus on 1.16.
> As of 1.15 I think it’s worthy to trigger a minor version. I didn’t see
> complaints about this for now so it should be OK to save the situation
> asap. I’m with Xintong’s idea to treat numXXXSend as an alias of numXXXOut
> considering there could possibly some users have already adapted their
> system to the new naming, and have another internal metric for reflecting
> number of outgoing committable batches (actually the numRecordsIn of sink
> committer operator should be carrying this info already).
>
> [1] https://github.com/apache/flink/pull/18825
> [2] https://flink.apache.org/contributing/contribute-code.html
> [3] https://flink.apache.org/contributing/reviewing-prs.html
>
> Best,
> Qingsheng
> On Oct 10, 2022, 17:40 +0800, Xintong Song <to...@gmail.com>
> <to...@gmail.com>, wrote:
>
> +1 for reverting these changes in Flink 1.16.
>
> For 1.15.3, can we make these metrics available via both names (numXXXOut
> and numXXXSend)? In this way we don't break it for those who already
> migrated to 1.15 and numXXXSend. That means we still need to change
> SinkWriterOperator to use another metric name in 1.15.3, which IIUC is
> internal to Flink sink.
>
> I'm overall +1 to change numXXXOut back to its original semantics. AFAIK
> (from meetup / flink-forward questionaires), most users do not migrate to a
> new Flink release immediately, until the next 1-2 major releases are out.
>
> Best,
>
> Xintong
>
>
> On Mon, Oct 10, 2022 at 5:26 PM Martijn Visser <ma...@apache.org>
> wrote:
>
>> Hi Qingsheng,
>>
>> Do you have any idea what has happened in the process here? Do we know why
>> they were changed? I was under the impression that these metric names were
>> newly introduced due to the new interfaces and because it still depends on
>> each connector implementing these.
>>
>> Sidenote: metric names are not mentioned in the FLIP process as a public
>> API. Might make sense to have a separate follow-up to add that to the list
>> (I do think we should list them there).
>>
>> +1 for reverting this and make this change in Flink 1.16
>>
>> I'm not in favour of releasing a Flink 1.15.3 with this change: I think
>> the
>> impact is too big for a patch version, especially given how long Flink
>> 1.15
>> is already out there.
>>
>> Best regards,
>>
>> Martijn
>>
>> On Mon, Oct 10, 2022 at 11:13 AM Leonard Xu <xb...@gmail.com> wrote:
>>
>> > Thanks Qingsheng for starting this thread.
>> >
>> > +1 on reverting sink metric name and releasing 1.15.3 to fix this
>> > inconsistent behavior.
>> >
>> >
>> > Best,
>> > Leonard
>> >
>> >
>> >
>> >
>> >
>> > 2022年10月10日 下午3:06,Jark Wu <im...@gmail.com> 写道:
>> >
>> > Thanks for discovering this problem, Qingsheng!
>> >
>> > I'm also +1 for reverting the breaking changes.
>> >
>> > IIUC, currently, the behavior of "numXXXOut" metrics of the new and old
>> > sink is inconsistent.
>> > We have to break one of them to have consistent behavior. Sink V2 is an
>> > evolving API which is just introduced in 1.15.
>> > I think it makes sense to break the unstable API instead of the stable
>> API
>> > which many connectors and users depend on.
>> >
>> > Best,
>> > Jark
>> >
>> >
>> >
>> > On Mon, 10 Oct 2022 at 11:36, Jingsong Li <ji...@gmail.com>
>> wrote:
>> >
>> >> Thanks for driving, Qingsheng.
>> >>
>> >> +1 for reverting sink metric name.
>> >>
>> >> We often forget that metric is also one of the important APIs.
>> >>
>> >> +1 for releasing 1.15.3 to fix this.
>> >>
>> >> Best,
>> >> Jingsong
>> >>
>> >> On Sun, Oct 9, 2022 at 11:35 PM Becket Qin <be...@gmail.com>
>> wrote:
>> >> >
>> >> > Thanks for raising the discussion, Qingsheng,
>> >> >
>> >> > +1 on reverting the breaking changes.
>> >> >
>> >> > In addition, we might want to release a 1.15.3 to fix this and update
>> >> the previous release docs with this known issue, so that users can
>> upgrade
>> >> to 1.15.3 when they hit it. It would also be good to add some backwards
>> >> compatibility tests on metrics to avoid unintended breaking changes
>> like
>> >> this in the future.
>> >> >
>> >> > Thanks,
>> >> >
>> >> > Jiangjie (Becket) Qin
>> >> >
>> >> > On Sun, Oct 9, 2022 at 10:35 AM Qingsheng Ren <re...@apache.org>
>> wrote:
>> >> >>
>> >> >> Hi devs and users,
>> >> >>
>> >> >> I’d like to start a discussion about reverting a breaking change
>> about
>> >> sink metrics made in 1.15 by FLINK-26126 [1] and FLINK-26492 [2].
>> >> >>
>> >> >> TL;DR
>> >> >>
>> >> >> All sink metrics with name “numXXXOut” defined in FLIP-33 are
>> replace
>> >> by “numXXXSend” in FLINK-26126 and FLINK-26492. Considering metric
>> names
>> >> are public APIs, this is a breaking change to end users and not
>> backward
>> >> compatible. Also unfortunately this breaking change was not discussed
>> in
>> >> the mailing list before.
>> >> >>
>> >> >> Background
>> >> >>
>> >> >> As defined previously in FLIP-33 (the FLIP page has been changed so
>> >> please refer to the old version [3] ), metric “numRecordsOut” is used
>> for
>> >> reporting the total number of output records since the sink started
>> (number
>> >> of records written to the external system), and similarly for
>> >> “numRecordsOutPerSecond”, “numBytesOut”, “numBytesOutPerSecond” and
>> >> “numRecordsOutError”. Most sinks are following this naming and
>> definition.
>> >> However, these metrics are ambiguous in the new Sink API as “numXXXOut”
>> >> could be used by the output of SinkWriterOperator for reporting number
>> of
>> >> Committables delivered to SinkCommitterOperator. In order to resolve
>> the
>> >> conflict, FLINK-26126 and FLINK-26492 changed names of these metrics
>> with
>> >> “numXXXSend”.
>> >> >>
>> >> >> Necessity of reverting this change
>> >> >>
>> >> >> - Metric names are actually public API, as end users need to
>> configure
>> >> metric collecting and alerting system with metric names. Users have to
>> >> reset all configurations related to affected metrics.
>> >> >> - This could also affect custom and external sinks not maintained by
>> >> Flink, which might have implemented with numXXXOut metrics.
>> >> >> - The number of records sent to external system is way more
>> important
>> >> than the number of Committables sent to SinkCommitterOperator, as the
>> >> latter one is just an internal implementation of sink. We could have a
>> new
>> >> metric name for the latter one instead.
>> >> >> - We could avoid splitting the project by version (like “plz use
>> >> numXXXOut before 1.15 and use numXXXSend after”) if we revert it ASAP,
>> >> cosidering 1.16 is still not released for now.
>> >> >>
>> >> >> As a consequence, I’d like to hear from devs and users about your
>> >> opinion on changing these metrics back to “numXXXOut”.
>> >> >>
>> >> >> Looking forward to your reply!
>> >> >>
>> >> >> [1] https://issues.apache.org/jira/browse/FLINK-26126
>> >> >> [2] https://issues.apache.org/jira/browse/FLINK-26492
>> >> >> [1] FLIP-33, version 18:
>> >>
>> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=211883136
>> >> >>
>> >> >> Best,
>> >> >> Qingsheng
>> >>
>> >
>> >
>>
>
>

回复: [DISCUSS] Reverting sink metric name changes made in 1.15

Posted by Ruan Hang <ru...@hotmail.com>.
Thanks for raising the discussion, Qingsheng,

+1 on reverting the breaking changes.
+1 for making a unified and clearer metric definition in Flink 2.0

Best,
Hang

________________________________
发件人: Jing Ge <ji...@ververica.com>
发送时间: 2022年10月12日 19:20
收件人: Qingsheng Ren <re...@apache.org>
抄送: Chesnay Schepler <ch...@apache.org>; dev <de...@flink.apache.org>; user <us...@flink.apache.org>; Martijn Visser <ma...@apache.org>; Becket Qin <be...@gmail.com>; Jingsong Li <ji...@gmail.com>; Jark Wu <im...@gmail.com>; Leonard Xu <xb...@gmail.com>; Xintong Song <to...@gmail.com>
主题: Re: [DISCUSS] Reverting sink metric name changes made in 1.15

Hi Qingsheng,

Just want to make sure we are on the same page. Are you suggesting switching the naming between "numXXXSend" and "numXXXOut" or reverting all the changes we did with FLINK-26126 and FLINK-26492?

For the naming switch, please pay attention that the behaviour has been changed since we introduced SinkV2[1]. So, please be aware of different numbers(behaviour change) even with the same metrics name. Sticking with the old name with the new behaviour (very bad idea, IMHO) might seem like saving the effort in the first place, but it might end up with monitoring unexpected metrics, which is even worse for users, i.e. I didn't change anything, but something has been broken since the last update.

For reverting, I am not sure how to fix the issue mentioned in FLINK-26126 after reverting all changes. Like Chesnay has already pointed out, with SinkV2 we have two different output lines - one with the external system and the other with the downstream operator. In this case, "numXXXSend" is rather a new metric than a replacement of "numXXXOut". The "numXXXOut" metric can still be used, depending on what the user wants to monitor.


Best regards,
Jing

[1] https://github.com/apache/flink/blob/51fc20db30d001a95de95b3b9993eeb06f558f6c/flink-metrics/flink-metrics-core/src/main/java/org/apache/flink/metrics/groups/SinkWriterMetricGroup.java#L48

On Wed, Oct 12, 2022 at 12:48 PM Qingsheng Ren <re...@apache.org>> wrote:
As a supplement, considering it could be a big reconstruction
redefining internal and external traffic and touching metric names in
almost all operators, this requires a lot of discussions and we might
do it finally in Flink 2.0. I think compatibility is a bigger blocker
in front of us, as the output of sink is a metric that users care a
lot about.

Thanks,
Qingsheng

On Wed, Oct 12, 2022 at 6:20 PM Qingsheng Ren <re...@apache.org>> wrote:
>
> Thanks Chesnay for the reply. +1 for making a unified and clearer
> metric definition distinguishing internal and external data transfers.
> As you described, having IO in operators is quite common such as
> dimension tables in Table/SQL API. This definitely deserves a FLIP and
> an overall design.
>
> However I think it's necessary to change the metric back to
> numRecordsOut instead of sticking with numRecordsSend in 1.15 and
> 1.16. The most important argument is for compatibility as I mentioned
> in my previous email, otherwise all users have to modify their configs
> of metric systems after upgrading to Flink 1.15+, and all custom
> connectors have to change their implementations to migrate to the new
> metric name. I believe other ones participating and approving this
> proposal share the same concern about compatibility too. Also
> considering this issue is blocking the release of 1.16, maybe we could
> fix this asap, and as for defining a new metric for internal data
> transfers we can have an in-depth discussion later. WDYT?
>
> Best,
> Qingsheng
>
> On Tue, Oct 11, 2022 at 6:06 PM Chesnay Schepler <ch...@apache.org>> wrote:
> >
> > Currently I think that would be a mistake.
> >
> > Ultimately what we have here is the culmination of us never really considering how the numRecordsOut metric should behave for operators that emit data to other operators _and_ external systems. This goes beyond sinks.
> > This even applies to numRecordsIn, for cases where functions query/write data from/to the outside, (e.g., Async IO).
> >
> > Having 2 separate metrics for that, 1 exclusively for internal data transfers, and 1 exclusively for external data transfers, is the only way to get a consistent metric definition in the long-run.
> > We can jump back-and-forth now or just commit to it.
> >
> > I don't think we can really judge this based on FLIP-33. It was IIRC written before the two phase sinks were added, which heavily blurred the lines of what a sink even is. Because it definitely is _not_ the last operator in a chain anymore.
> >
> > What I would suggest is to stick with what we got (although I despise the name numRecordsSend), and alias the numRecordsOut metric for all non-TwoPhaseCommittingSink.
> >
> > On 11/10/2022 05:54, Qingsheng Ren wrote:
> >
> > Thanks for the details Chesnay!
> >
> > By “alias” I mean to respect the original definition made in FLIP-33 for numRecordsOut, which is the number of records written to the external system, and keep numRecordsSend as the same value as numRecordsOut for compatibility.
> >
> > I think keeping numRecordsOut for the output to the external system is more intuitive to end users because in most cases the metric of data flow output is more essential. I agree with you that a new metric is required, but considering compatibility and users’ intuition I prefer to keep the initial definition of numRecordsOut in FLIP-33 and name a new metric for sink writer’s output to downstream operators. This might be against consistency with metrics in other operators in Flink but maybe it’s acceptable to have the sink as a special case.
> >
> > Best,
> > Qingsheng
> > On Oct 10, 2022, 19:13 +0800, Chesnay Schepler <ch...@apache.org>>, wrote:
> >
> > > I’m with Xintong’s idea to treat numXXXSend as an alias of numXXXOut
> >
> > But that's not possible. If it were that simple there would have never been a need to introduce another metric in the first place.
> >
> > It's a rather fundamental issue with how the new sinks work, in that they emit data to the external system (usually considered as "numRecordsOut" of sinks) while _also_ sending data to a downstream operator (usually considered as "numRecordsOut" of tasks).
> > The original issue was that the numRecordsOut of the sink counted both (which is completely wrong).
> >
> > A new metric was always required; otherwise you inevitably end up breaking some semantic.
> > Adding a new metric for what the sink writes to the external system is, for better or worse, more consistent with how these metrics usually work in Flink.
> >
> > On 10/10/2022 12:45, Qingsheng Ren wrote:
> >
> > Thanks everyone for joining the discussion!
> >
> > > Do you have any idea what has happened in the process here?
> >
> > The discussion in this PR [1] shows some details and could be helpful to understand the original motivation of the renaming. We do have a test case for guarding metrics but unfortunaly the case was also modified so the defense was broken.
> >
> > I think the reason why both the developer and the reviewer forgot to trigger an discussion and gave a green pass on the change is that metrics are quite “trivial” to be noticed as public APIs. As mentioned by Martijn I couldn’t find a place noting that metrics are public APIs and should be treated carefully while contributing and reviewing.
> >
> > IMHO three actions could be made to prevent this kind of changes in the future:
> >
> > a. Add test case for metrics (which we already have in SinkMetricsITCase)
> > b. We emphasize that any public-interface breaking changes should be proposed by a FLIP or discussed in mailing list, and should be listed in the release note.
> > c. We remind contributors and reviewers about what should be considered as public API, and include metric names in it.
> >
> > For b and c these two pages [2][3] might be proper places.
> >
> > About the patch to revert this, it looks like we have a consensus on 1.16. As of 1.15 I think it’s worthy to trigger a minor version. I didn’t see complaints about this for now so it should be OK to save the situation asap. I’m with Xintong’s idea to treat numXXXSend as an alias of numXXXOut considering there could possibly some users have already adapted their system to the new naming, and have another internal metric for reflecting number of outgoing committable batches (actually the numRecordsIn of sink committer operator should be carrying this info already).
> >
> > [1] https://github.com/apache/flink/pull/18825
> > [2] https://flink.apache.org/contributing/contribute-code.html
> > [3] https://flink.apache.org/contributing/reviewing-prs.html
> >
> > Best,
> > Qingsheng
> > On Oct 10, 2022, 17:40 +0800, Xintong Song <to...@gmail.com>>, wrote:
> >
> > +1 for reverting these changes in Flink 1.16.
> >
> > For 1.15.3, can we make these metrics available via both names (numXXXOut and numXXXSend)? In this way we don't break it for those who already migrated to 1.15 and numXXXSend. That means we still need to change SinkWriterOperator to use another metric name in 1.15.3, which IIUC is internal to Flink sink.
> >
> > I'm overall +1 to change numXXXOut back to its original semantics. AFAIK (from meetup / flink-forward questionaires), most users do not migrate to a new Flink release immediately, until the next 1-2 major releases are out.
> >
> > Best,
> >
> > Xintong
> >
> >
> >
> > On Mon, Oct 10, 2022 at 5:26 PM Martijn Visser <ma...@apache.org>> wrote:
> >>
> >> Hi Qingsheng,
> >>
> >> Do you have any idea what has happened in the process here? Do we know why
> >> they were changed? I was under the impression that these metric names were
> >> newly introduced due to the new interfaces and because it still depends on
> >> each connector implementing these.
> >>
> >> Sidenote: metric names are not mentioned in the FLIP process as a public
> >> API. Might make sense to have a separate follow-up to add that to the list
> >> (I do think we should list them there).
> >>
> >> +1 for reverting this and make this change in Flink 1.16
> >>
> >> I'm not in favour of releasing a Flink 1.15.3 with this change: I think the
> >> impact is too big for a patch version, especially given how long Flink 1.15
> >> is already out there.
> >>
> >> Best regards,
> >>
> >> Martijn
> >>
> >> On Mon, Oct 10, 2022 at 11:13 AM Leonard Xu <xb...@gmail.com>> wrote:
> >>
> >> > Thanks Qingsheng for starting this thread.
> >> >
> >> > +1 on reverting sink metric name and releasing 1.15.3 to fix this
> >> > inconsistent behavior.
> >> >
> >> >
> >> > Best,
> >> > Leonard
> >> >
> >> >
> >> >
> >> >
> >> >
> >> > 2022年10月10日 下午3:06,Jark Wu <im...@gmail.com>> 写道:
> >> >
> >> > Thanks for discovering this problem, Qingsheng!
> >> >
> >> > I'm also +1 for reverting the breaking changes.
> >> >
> >> > IIUC, currently, the behavior of "numXXXOut" metrics of the new and old
> >> > sink is inconsistent.
> >> > We have to break one of them to have consistent behavior. Sink V2 is an
> >> > evolving API which is just introduced in 1.15.
> >> > I think it makes sense to break the unstable API instead of the stable API
> >> > which many connectors and users depend on.
> >> >
> >> > Best,
> >> > Jark
> >> >
> >> >
> >> >
> >> > On Mon, 10 Oct 2022 at 11:36, Jingsong Li <ji...@gmail.com>> wrote:
> >> >
> >> >> Thanks for driving, Qingsheng.
> >> >>
> >> >> +1 for reverting sink metric name.
> >> >>
> >> >> We often forget that metric is also one of the important APIs.
> >> >>
> >> >> +1 for releasing 1.15.3 to fix this.
> >> >>
> >> >> Best,
> >> >> Jingsong
> >> >>
> >> >> On Sun, Oct 9, 2022 at 11:35 PM Becket Qin <be...@gmail.com>> wrote:
> >> >> >
> >> >> > Thanks for raising the discussion, Qingsheng,
> >> >> >
> >> >> > +1 on reverting the breaking changes.
> >> >> >
> >> >> > In addition, we might want to release a 1.15.3 to fix this and update
> >> >> the previous release docs with this known issue, so that users can upgrade
> >> >> to 1.15.3 when they hit it. It would also be good to add some backwards
> >> >> compatibility tests on metrics to avoid unintended breaking changes like
> >> >> this in the future.
> >> >> >
> >> >> > Thanks,
> >> >> >
> >> >> > Jiangjie (Becket) Qin
> >> >> >
> >> >> > On Sun, Oct 9, 2022 at 10:35 AM Qingsheng Ren <re...@apache.org>> wrote:
> >> >> >>
> >> >> >> Hi devs and users,
> >> >> >>
> >> >> >> I’d like to start a discussion about reverting a breaking change about
> >> >> sink metrics made in 1.15 by FLINK-26126 [1] and FLINK-26492 [2].
> >> >> >>
> >> >> >> TL;DR
> >> >> >>
> >> >> >> All sink metrics with name “numXXXOut” defined in FLIP-33 are replace
> >> >> by “numXXXSend” in FLINK-26126 and FLINK-26492. Considering metric names
> >> >> are public APIs, this is a breaking change to end users and not backward
> >> >> compatible. Also unfortunately this breaking change was not discussed in
> >> >> the mailing list before.
> >> >> >>
> >> >> >> Background
> >> >> >>
> >> >> >> As defined previously in FLIP-33 (the FLIP page has been changed so
> >> >> please refer to the old version [3] ), metric “numRecordsOut” is used for
> >> >> reporting the total number of output records since the sink started (number
> >> >> of records written to the external system), and similarly for
> >> >> “numRecordsOutPerSecond”, “numBytesOut”, “numBytesOutPerSecond” and
> >> >> “numRecordsOutError”. Most sinks are following this naming and definition.
> >> >> However, these metrics are ambiguous in the new Sink API as “numXXXOut”
> >> >> could be used by the output of SinkWriterOperator for reporting number of
> >> >> Committables delivered to SinkCommitterOperator. In order to resolve the
> >> >> conflict, FLINK-26126 and FLINK-26492 changed names of these metrics with
> >> >> “numXXXSend”.
> >> >> >>
> >> >> >> Necessity of reverting this change
> >> >> >>
> >> >> >> - Metric names are actually public API, as end users need to configure
> >> >> metric collecting and alerting system with metric names. Users have to
> >> >> reset all configurations related to affected metrics.
> >> >> >> - This could also affect custom and external sinks not maintained by
> >> >> Flink, which might have implemented with numXXXOut metrics.
> >> >> >> - The number of records sent to external system is way more important
> >> >> than the number of Committables sent to SinkCommitterOperator, as the
> >> >> latter one is just an internal implementation of sink. We could have a new
> >> >> metric name for the latter one instead.
> >> >> >> - We could avoid splitting the project by version (like “plz use
> >> >> numXXXOut before 1.15 and use numXXXSend after”) if we revert it ASAP,
> >> >> cosidering 1.16 is still not released for now.
> >> >> >>
> >> >> >> As a consequence, I’d like to hear from devs and users about your
> >> >> opinion on changing these metrics back to “numXXXOut”.
> >> >> >>
> >> >> >> Looking forward to your reply!
> >> >> >>
> >> >> >> [1] https://issues.apache.org/jira/browse/FLINK-26126
> >> >> >> [2] https://issues.apache.org/jira/browse/FLINK-26492
> >> >> >> [1] FLIP-33, version 18:
> >> >> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=211883136
> >> >> >>
> >> >> >> Best,
> >> >> >> Qingsheng
> >> >>
> >> >
> >> >
> >
> >
> >

回复: [DISCUSS] Reverting sink metric name changes made in 1.15

Posted by Ruan Hang <ru...@hotmail.com>.
Thanks for raising the discussion, Qingsheng,

+1 on reverting the breaking changes.
+1 for making a unified and clearer metric definition in Flink 2.0

Best,
Hang

________________________________
发件人: Jing Ge <ji...@ververica.com>
发送时间: 2022年10月12日 19:20
收件人: Qingsheng Ren <re...@apache.org>
抄送: Chesnay Schepler <ch...@apache.org>; dev <de...@flink.apache.org>; user <us...@flink.apache.org>; Martijn Visser <ma...@apache.org>; Becket Qin <be...@gmail.com>; Jingsong Li <ji...@gmail.com>; Jark Wu <im...@gmail.com>; Leonard Xu <xb...@gmail.com>; Xintong Song <to...@gmail.com>
主题: Re: [DISCUSS] Reverting sink metric name changes made in 1.15

Hi Qingsheng,

Just want to make sure we are on the same page. Are you suggesting switching the naming between "numXXXSend" and "numXXXOut" or reverting all the changes we did with FLINK-26126 and FLINK-26492?

For the naming switch, please pay attention that the behaviour has been changed since we introduced SinkV2[1]. So, please be aware of different numbers(behaviour change) even with the same metrics name. Sticking with the old name with the new behaviour (very bad idea, IMHO) might seem like saving the effort in the first place, but it might end up with monitoring unexpected metrics, which is even worse for users, i.e. I didn't change anything, but something has been broken since the last update.

For reverting, I am not sure how to fix the issue mentioned in FLINK-26126 after reverting all changes. Like Chesnay has already pointed out, with SinkV2 we have two different output lines - one with the external system and the other with the downstream operator. In this case, "numXXXSend" is rather a new metric than a replacement of "numXXXOut". The "numXXXOut" metric can still be used, depending on what the user wants to monitor.


Best regards,
Jing

[1] https://github.com/apache/flink/blob/51fc20db30d001a95de95b3b9993eeb06f558f6c/flink-metrics/flink-metrics-core/src/main/java/org/apache/flink/metrics/groups/SinkWriterMetricGroup.java#L48

On Wed, Oct 12, 2022 at 12:48 PM Qingsheng Ren <re...@apache.org>> wrote:
As a supplement, considering it could be a big reconstruction
redefining internal and external traffic and touching metric names in
almost all operators, this requires a lot of discussions and we might
do it finally in Flink 2.0. I think compatibility is a bigger blocker
in front of us, as the output of sink is a metric that users care a
lot about.

Thanks,
Qingsheng

On Wed, Oct 12, 2022 at 6:20 PM Qingsheng Ren <re...@apache.org>> wrote:
>
> Thanks Chesnay for the reply. +1 for making a unified and clearer
> metric definition distinguishing internal and external data transfers.
> As you described, having IO in operators is quite common such as
> dimension tables in Table/SQL API. This definitely deserves a FLIP and
> an overall design.
>
> However I think it's necessary to change the metric back to
> numRecordsOut instead of sticking with numRecordsSend in 1.15 and
> 1.16. The most important argument is for compatibility as I mentioned
> in my previous email, otherwise all users have to modify their configs
> of metric systems after upgrading to Flink 1.15+, and all custom
> connectors have to change their implementations to migrate to the new
> metric name. I believe other ones participating and approving this
> proposal share the same concern about compatibility too. Also
> considering this issue is blocking the release of 1.16, maybe we could
> fix this asap, and as for defining a new metric for internal data
> transfers we can have an in-depth discussion later. WDYT?
>
> Best,
> Qingsheng
>
> On Tue, Oct 11, 2022 at 6:06 PM Chesnay Schepler <ch...@apache.org>> wrote:
> >
> > Currently I think that would be a mistake.
> >
> > Ultimately what we have here is the culmination of us never really considering how the numRecordsOut metric should behave for operators that emit data to other operators _and_ external systems. This goes beyond sinks.
> > This even applies to numRecordsIn, for cases where functions query/write data from/to the outside, (e.g., Async IO).
> >
> > Having 2 separate metrics for that, 1 exclusively for internal data transfers, and 1 exclusively for external data transfers, is the only way to get a consistent metric definition in the long-run.
> > We can jump back-and-forth now or just commit to it.
> >
> > I don't think we can really judge this based on FLIP-33. It was IIRC written before the two phase sinks were added, which heavily blurred the lines of what a sink even is. Because it definitely is _not_ the last operator in a chain anymore.
> >
> > What I would suggest is to stick with what we got (although I despise the name numRecordsSend), and alias the numRecordsOut metric for all non-TwoPhaseCommittingSink.
> >
> > On 11/10/2022 05:54, Qingsheng Ren wrote:
> >
> > Thanks for the details Chesnay!
> >
> > By “alias” I mean to respect the original definition made in FLIP-33 for numRecordsOut, which is the number of records written to the external system, and keep numRecordsSend as the same value as numRecordsOut for compatibility.
> >
> > I think keeping numRecordsOut for the output to the external system is more intuitive to end users because in most cases the metric of data flow output is more essential. I agree with you that a new metric is required, but considering compatibility and users’ intuition I prefer to keep the initial definition of numRecordsOut in FLIP-33 and name a new metric for sink writer’s output to downstream operators. This might be against consistency with metrics in other operators in Flink but maybe it’s acceptable to have the sink as a special case.
> >
> > Best,
> > Qingsheng
> > On Oct 10, 2022, 19:13 +0800, Chesnay Schepler <ch...@apache.org>>, wrote:
> >
> > > I’m with Xintong’s idea to treat numXXXSend as an alias of numXXXOut
> >
> > But that's not possible. If it were that simple there would have never been a need to introduce another metric in the first place.
> >
> > It's a rather fundamental issue with how the new sinks work, in that they emit data to the external system (usually considered as "numRecordsOut" of sinks) while _also_ sending data to a downstream operator (usually considered as "numRecordsOut" of tasks).
> > The original issue was that the numRecordsOut of the sink counted both (which is completely wrong).
> >
> > A new metric was always required; otherwise you inevitably end up breaking some semantic.
> > Adding a new metric for what the sink writes to the external system is, for better or worse, more consistent with how these metrics usually work in Flink.
> >
> > On 10/10/2022 12:45, Qingsheng Ren wrote:
> >
> > Thanks everyone for joining the discussion!
> >
> > > Do you have any idea what has happened in the process here?
> >
> > The discussion in this PR [1] shows some details and could be helpful to understand the original motivation of the renaming. We do have a test case for guarding metrics but unfortunaly the case was also modified so the defense was broken.
> >
> > I think the reason why both the developer and the reviewer forgot to trigger an discussion and gave a green pass on the change is that metrics are quite “trivial” to be noticed as public APIs. As mentioned by Martijn I couldn’t find a place noting that metrics are public APIs and should be treated carefully while contributing and reviewing.
> >
> > IMHO three actions could be made to prevent this kind of changes in the future:
> >
> > a. Add test case for metrics (which we already have in SinkMetricsITCase)
> > b. We emphasize that any public-interface breaking changes should be proposed by a FLIP or discussed in mailing list, and should be listed in the release note.
> > c. We remind contributors and reviewers about what should be considered as public API, and include metric names in it.
> >
> > For b and c these two pages [2][3] might be proper places.
> >
> > About the patch to revert this, it looks like we have a consensus on 1.16. As of 1.15 I think it’s worthy to trigger a minor version. I didn’t see complaints about this for now so it should be OK to save the situation asap. I’m with Xintong’s idea to treat numXXXSend as an alias of numXXXOut considering there could possibly some users have already adapted their system to the new naming, and have another internal metric for reflecting number of outgoing committable batches (actually the numRecordsIn of sink committer operator should be carrying this info already).
> >
> > [1] https://github.com/apache/flink/pull/18825
> > [2] https://flink.apache.org/contributing/contribute-code.html
> > [3] https://flink.apache.org/contributing/reviewing-prs.html
> >
> > Best,
> > Qingsheng
> > On Oct 10, 2022, 17:40 +0800, Xintong Song <to...@gmail.com>>, wrote:
> >
> > +1 for reverting these changes in Flink 1.16.
> >
> > For 1.15.3, can we make these metrics available via both names (numXXXOut and numXXXSend)? In this way we don't break it for those who already migrated to 1.15 and numXXXSend. That means we still need to change SinkWriterOperator to use another metric name in 1.15.3, which IIUC is internal to Flink sink.
> >
> > I'm overall +1 to change numXXXOut back to its original semantics. AFAIK (from meetup / flink-forward questionaires), most users do not migrate to a new Flink release immediately, until the next 1-2 major releases are out.
> >
> > Best,
> >
> > Xintong
> >
> >
> >
> > On Mon, Oct 10, 2022 at 5:26 PM Martijn Visser <ma...@apache.org>> wrote:
> >>
> >> Hi Qingsheng,
> >>
> >> Do you have any idea what has happened in the process here? Do we know why
> >> they were changed? I was under the impression that these metric names were
> >> newly introduced due to the new interfaces and because it still depends on
> >> each connector implementing these.
> >>
> >> Sidenote: metric names are not mentioned in the FLIP process as a public
> >> API. Might make sense to have a separate follow-up to add that to the list
> >> (I do think we should list them there).
> >>
> >> +1 for reverting this and make this change in Flink 1.16
> >>
> >> I'm not in favour of releasing a Flink 1.15.3 with this change: I think the
> >> impact is too big for a patch version, especially given how long Flink 1.15
> >> is already out there.
> >>
> >> Best regards,
> >>
> >> Martijn
> >>
> >> On Mon, Oct 10, 2022 at 11:13 AM Leonard Xu <xb...@gmail.com>> wrote:
> >>
> >> > Thanks Qingsheng for starting this thread.
> >> >
> >> > +1 on reverting sink metric name and releasing 1.15.3 to fix this
> >> > inconsistent behavior.
> >> >
> >> >
> >> > Best,
> >> > Leonard
> >> >
> >> >
> >> >
> >> >
> >> >
> >> > 2022年10月10日 下午3:06,Jark Wu <im...@gmail.com>> 写道:
> >> >
> >> > Thanks for discovering this problem, Qingsheng!
> >> >
> >> > I'm also +1 for reverting the breaking changes.
> >> >
> >> > IIUC, currently, the behavior of "numXXXOut" metrics of the new and old
> >> > sink is inconsistent.
> >> > We have to break one of them to have consistent behavior. Sink V2 is an
> >> > evolving API which is just introduced in 1.15.
> >> > I think it makes sense to break the unstable API instead of the stable API
> >> > which many connectors and users depend on.
> >> >
> >> > Best,
> >> > Jark
> >> >
> >> >
> >> >
> >> > On Mon, 10 Oct 2022 at 11:36, Jingsong Li <ji...@gmail.com>> wrote:
> >> >
> >> >> Thanks for driving, Qingsheng.
> >> >>
> >> >> +1 for reverting sink metric name.
> >> >>
> >> >> We often forget that metric is also one of the important APIs.
> >> >>
> >> >> +1 for releasing 1.15.3 to fix this.
> >> >>
> >> >> Best,
> >> >> Jingsong
> >> >>
> >> >> On Sun, Oct 9, 2022 at 11:35 PM Becket Qin <be...@gmail.com>> wrote:
> >> >> >
> >> >> > Thanks for raising the discussion, Qingsheng,
> >> >> >
> >> >> > +1 on reverting the breaking changes.
> >> >> >
> >> >> > In addition, we might want to release a 1.15.3 to fix this and update
> >> >> the previous release docs with this known issue, so that users can upgrade
> >> >> to 1.15.3 when they hit it. It would also be good to add some backwards
> >> >> compatibility tests on metrics to avoid unintended breaking changes like
> >> >> this in the future.
> >> >> >
> >> >> > Thanks,
> >> >> >
> >> >> > Jiangjie (Becket) Qin
> >> >> >
> >> >> > On Sun, Oct 9, 2022 at 10:35 AM Qingsheng Ren <re...@apache.org>> wrote:
> >> >> >>
> >> >> >> Hi devs and users,
> >> >> >>
> >> >> >> I’d like to start a discussion about reverting a breaking change about
> >> >> sink metrics made in 1.15 by FLINK-26126 [1] and FLINK-26492 [2].
> >> >> >>
> >> >> >> TL;DR
> >> >> >>
> >> >> >> All sink metrics with name “numXXXOut” defined in FLIP-33 are replace
> >> >> by “numXXXSend” in FLINK-26126 and FLINK-26492. Considering metric names
> >> >> are public APIs, this is a breaking change to end users and not backward
> >> >> compatible. Also unfortunately this breaking change was not discussed in
> >> >> the mailing list before.
> >> >> >>
> >> >> >> Background
> >> >> >>
> >> >> >> As defined previously in FLIP-33 (the FLIP page has been changed so
> >> >> please refer to the old version [3] ), metric “numRecordsOut” is used for
> >> >> reporting the total number of output records since the sink started (number
> >> >> of records written to the external system), and similarly for
> >> >> “numRecordsOutPerSecond”, “numBytesOut”, “numBytesOutPerSecond” and
> >> >> “numRecordsOutError”. Most sinks are following this naming and definition.
> >> >> However, these metrics are ambiguous in the new Sink API as “numXXXOut”
> >> >> could be used by the output of SinkWriterOperator for reporting number of
> >> >> Committables delivered to SinkCommitterOperator. In order to resolve the
> >> >> conflict, FLINK-26126 and FLINK-26492 changed names of these metrics with
> >> >> “numXXXSend”.
> >> >> >>
> >> >> >> Necessity of reverting this change
> >> >> >>
> >> >> >> - Metric names are actually public API, as end users need to configure
> >> >> metric collecting and alerting system with metric names. Users have to
> >> >> reset all configurations related to affected metrics.
> >> >> >> - This could also affect custom and external sinks not maintained by
> >> >> Flink, which might have implemented with numXXXOut metrics.
> >> >> >> - The number of records sent to external system is way more important
> >> >> than the number of Committables sent to SinkCommitterOperator, as the
> >> >> latter one is just an internal implementation of sink. We could have a new
> >> >> metric name for the latter one instead.
> >> >> >> - We could avoid splitting the project by version (like “plz use
> >> >> numXXXOut before 1.15 and use numXXXSend after”) if we revert it ASAP,
> >> >> cosidering 1.16 is still not released for now.
> >> >> >>
> >> >> >> As a consequence, I’d like to hear from devs and users about your
> >> >> opinion on changing these metrics back to “numXXXOut”.
> >> >> >>
> >> >> >> Looking forward to your reply!
> >> >> >>
> >> >> >> [1] https://issues.apache.org/jira/browse/FLINK-26126
> >> >> >> [2] https://issues.apache.org/jira/browse/FLINK-26492
> >> >> >> [1] FLIP-33, version 18:
> >> >> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=211883136
> >> >> >>
> >> >> >> Best,
> >> >> >> Qingsheng
> >> >>
> >> >
> >> >
> >
> >
> >

Re: [DISCUSS] Reverting sink metric name changes made in 1.15

Posted by Jark Wu <im...@gmail.com>.
Hi Xintong,

In terms of code, I think it's not complicated. It's all about we need a
public discussion for the new metric name.
And we don't want to block the release for the rarely used metric.

Best,
Jark

On Fri, 14 Oct 2022 at 10:07, Xintong Song <to...@gmail.com> wrote:

> @Qingsheng,
>
> I'm overall +1 to your proposal, with only one question: How complicated
> is it to come up with a metric for the internal traffic?
>
> I'm asking because, as the new feature is already out for 1.15 & 1.16, it
> would be nice if the corresponding new metrics can also be available in
> these versions.
>
> I'm not saying this should block the 1.16 release. Just trying to
> understand the efforts needed. If adding such metrics is not something that
> can be done shortly, I'd also be fine with releasing 1.16 without them.
>
> Best,
>
> Xintong
>
>
>
> On Fri, Oct 14, 2022 at 1:31 AM Qingsheng Ren <re...@apache.org> wrote:
>
>> Hi devs and users,
>>
>> It looks like we are getting an initial consensus in the discussion so I
>> started a voting thread [1] just now. Looking forward to your feedback!
>>
>> [1] https://lists.apache.org/thread/ozlf82mkm6ndx2n1vdgq532h156p4lt6
>>
>> Best,
>> Qingsheng
>>
>>
>> On Thu, Oct 13, 2022 at 10:41 PM Jing Ge <ji...@ververica.com> wrote:
>>
>>> Hi Qingsheng,
>>>
>>> Thanks for the clarification. +1, I like the idea. Pointing both
>>> numXXXOut and numXXXSend to the same external data transfer metric does not
>>> really break the new SinkV2 design, since there was no requirement to
>>> monitor the internal traffic. So, I think both developer and user can live
>>> with it. It might not be the perfect solution but is indeed the currently
>>> best trade-off solution after considering the backward compatibility.  I
>>> would suggest firing a follow-up ticket after the PR to take care of the
>>> new metric for the internal traffic in the future.
>>>
>>> Best regards,
>>> Jing
>>>
>>>
>>> On Thu, Oct 13, 2022 at 3:08 PM Qingsheng Ren <re...@apache.org> wrote:
>>>
>>>> Hi Jing,
>>>>
>>>> Thanks for the reply!
>>>>
>>>> Let me rephrase my proposal: we’d like to use numXXXOut registered on
>>>> SinkWriterOperator to reflect the traffic to the external system for
>>>> compatibility with old versions before 1.15, and make numXXXSend have the
>>>> same value as numXXXOut for compatibility within 1.15. That means both
>>>> numXXXOut and numXXXSend are used for external data transfers, which end
>>>> users care more about. As for the internal traffic within the sink, we
>>>> could name a new metric for it because this is a _new_ feature in the _new_
>>>> sink, and end users usually don’t pay attention to internal implementation.
>>>> The name of the new metric could be discussed later after 1.16 release.
>>>>
>>>> > but it might end up with monitoring unexpected metrics, which is
>>>> even worse for users, i.e. I didn't change anything, but something has been
>>>> broken since the last update.
>>>>
>>>> Yeah this is exactly what we are trying to fix with this proposal. I
>>>> believe users are more concerned with the output to the external system
>>>> than the internal data delivery in the sink, so I think we’ll have more
>>>> cases reporting like “I set up a panel on numRecordsOut in sink to monitor
>>>> the output of the job, but after upgrading to 1.15 this value is extremely
>>>> low and I didn’t change anything” if we stick to the current situation. I
>>>> think only a few end users care about the number of committables sending to
>>>> downstream as most of them don’t care how the sink works.
>>>>
>>>> We do need a re-design to fully distinguish the internal and external
>>>> traffic on metrics, not only in sink but in all operators as it’s quite
>>>> common for operators to make IO. This needs time to design, discuss, adjust
>>>> and vote, but considering this is blocking 1.16, maybe it’s better to
>>>> rescue the compatibility for now, and leave the huge reconstruction to
>>>> future versions (maybe 2.0).
>>>>
>>>> Best,
>>>> Qingsheng
>>>>
>>>> On Wed, Oct 12, 2022 at 7:21 PM Jing Ge <ji...@ververica.com> wrote:
>>>>
>>>>> Hi Qingsheng,
>>>>>
>>>>> Just want to make sure we are on the same page. Are you suggesting
>>>>> switching the naming between "numXXXSend" and "numXXXOut" or reverting all
>>>>> the changes we did with FLINK-26126 and FLINK-26492?
>>>>>
>>>>> For the naming switch, please pay attention that the behaviour has
>>>>> been changed since we introduced SinkV2[1]. So, please be aware of
>>>>> different numbers(behaviour change) even with the same metrics name.
>>>>> Sticking with the old name with the new behaviour (very bad idea, IMHO)
>>>>> might seem like saving the effort in the first place, but it might end up
>>>>> with monitoring unexpected metrics, which is even worse for users, i.e. I
>>>>> didn't change anything, but something has been broken since
>>>>> the last update.
>>>>>
>>>>> For reverting, I am not sure how to fix the issue mentioned in
>>>>> FLINK-26126 after reverting all changes. Like Chesnay has already pointed
>>>>> out, with SinkV2 we have two different output lines - one with the external
>>>>> system and the other with the downstream operator. In this case,
>>>>> "numXXXSend" is rather a new metric than a replacement of "numXXXOut". The
>>>>> "numXXXOut" metric can still be used, depending on what the user wants to
>>>>> monitor.
>>>>>
>>>>>
>>>>> Best regards,
>>>>> Jing
>>>>>
>>>>> [1]
>>>>> https://github.com/apache/flink/blob/51fc20db30d001a95de95b3b9993eeb06f558f6c/flink-metrics/flink-metrics-core/src/main/java/org/apache/flink/metrics/groups/SinkWriterMetricGroup.java#L48
>>>>>
>>>>>
>>>>> On Wed, Oct 12, 2022 at 12:48 PM Qingsheng Ren <re...@apache.org>
>>>>> wrote:
>>>>>
>>>>>> As a supplement, considering it could be a big reconstruction
>>>>>> redefining internal and external traffic and touching metric names in
>>>>>> almost all operators, this requires a lot of discussions and we might
>>>>>> do it finally in Flink 2.0. I think compatibility is a bigger blocker
>>>>>> in front of us, as the output of sink is a metric that users care a
>>>>>> lot about.
>>>>>>
>>>>>> Thanks,
>>>>>> Qingsheng
>>>>>>
>>>>>> On Wed, Oct 12, 2022 at 6:20 PM Qingsheng Ren <re...@apache.org>
>>>>>> wrote:
>>>>>> >
>>>>>> > Thanks Chesnay for the reply. +1 for making a unified and clearer
>>>>>> > metric definition distinguishing internal and external data
>>>>>> transfers.
>>>>>> > As you described, having IO in operators is quite common such as
>>>>>> > dimension tables in Table/SQL API. This definitely deserves a FLIP
>>>>>> and
>>>>>> > an overall design.
>>>>>> >
>>>>>> > However I think it's necessary to change the metric back to
>>>>>> > numRecordsOut instead of sticking with numRecordsSend in 1.15 and
>>>>>> > 1.16. The most important argument is for compatibility as I
>>>>>> mentioned
>>>>>> > in my previous email, otherwise all users have to modify their
>>>>>> configs
>>>>>> > of metric systems after upgrading to Flink 1.15+, and all custom
>>>>>> > connectors have to change their implementations to migrate to the
>>>>>> new
>>>>>> > metric name. I believe other ones participating and approving this
>>>>>> > proposal share the same concern about compatibility too. Also
>>>>>> > considering this issue is blocking the release of 1.16, maybe we
>>>>>> could
>>>>>> > fix this asap, and as for defining a new metric for internal data
>>>>>> > transfers we can have an in-depth discussion later. WDYT?
>>>>>> >
>>>>>> > Best,
>>>>>> > Qingsheng
>>>>>> >
>>>>>> > On Tue, Oct 11, 2022 at 6:06 PM Chesnay Schepler <
>>>>>> chesnay@apache.org> wrote:
>>>>>> > >
>>>>>> > > Currently I think that would be a mistake.
>>>>>> > >
>>>>>> > > Ultimately what we have here is the culmination of us never
>>>>>> really considering how the numRecordsOut metric should behave for operators
>>>>>> that emit data to other operators _and_ external systems. This goes beyond
>>>>>> sinks.
>>>>>> > > This even applies to numRecordsIn, for cases where functions
>>>>>> query/write data from/to the outside, (e.g., Async IO).
>>>>>> > >
>>>>>> > > Having 2 separate metrics for that, 1 exclusively for internal
>>>>>> data transfers, and 1 exclusively for external data transfers, is the only
>>>>>> way to get a consistent metric definition in the long-run.
>>>>>> > > We can jump back-and-forth now or just commit to it.
>>>>>> > >
>>>>>> > > I don't think we can really judge this based on FLIP-33. It was
>>>>>> IIRC written before the two phase sinks were added, which heavily blurred
>>>>>> the lines of what a sink even is. Because it definitely is _not_ the last
>>>>>> operator in a chain anymore.
>>>>>> > >
>>>>>> > > What I would suggest is to stick with what we got (although I
>>>>>> despise the name numRecordsSend), and alias the numRecordsOut metric for
>>>>>> all non-TwoPhaseCommittingSink.
>>>>>> > >
>>>>>> > > On 11/10/2022 05:54, Qingsheng Ren wrote:
>>>>>> > >
>>>>>> > > Thanks for the details Chesnay!
>>>>>> > >
>>>>>> > > By “alias” I mean to respect the original definition made in
>>>>>> FLIP-33 for numRecordsOut, which is the number of records written to the
>>>>>> external system, and keep numRecordsSend as the same value as numRecordsOut
>>>>>> for compatibility.
>>>>>> > >
>>>>>> > > I think keeping numRecordsOut for the output to the external
>>>>>> system is more intuitive to end users because in most cases the metric of
>>>>>> data flow output is more essential. I agree with you that a new metric is
>>>>>> required, but considering compatibility and users’ intuition I prefer to
>>>>>> keep the initial definition of numRecordsOut in FLIP-33 and name a new
>>>>>> metric for sink writer’s output to downstream operators. This might be
>>>>>> against consistency with metrics in other operators in Flink but maybe it’s
>>>>>> acceptable to have the sink as a special case.
>>>>>> > >
>>>>>> > > Best,
>>>>>> > > Qingsheng
>>>>>> > > On Oct 10, 2022, 19:13 +0800, Chesnay Schepler <
>>>>>> chesnay@apache.org>, wrote:
>>>>>> > >
>>>>>> > > > I’m with Xintong’s idea to treat numXXXSend as an alias of
>>>>>> numXXXOut
>>>>>> > >
>>>>>> > > But that's not possible. If it were that simple there would have
>>>>>> never been a need to introduce another metric in the first place.
>>>>>> > >
>>>>>> > > It's a rather fundamental issue with how the new sinks work, in
>>>>>> that they emit data to the external system (usually considered as
>>>>>> "numRecordsOut" of sinks) while _also_ sending data to a downstream
>>>>>> operator (usually considered as "numRecordsOut" of tasks).
>>>>>> > > The original issue was that the numRecordsOut of the sink counted
>>>>>> both (which is completely wrong).
>>>>>> > >
>>>>>> > > A new metric was always required; otherwise you inevitably end up
>>>>>> breaking some semantic.
>>>>>> > > Adding a new metric for what the sink writes to the external
>>>>>> system is, for better or worse, more consistent with how these metrics
>>>>>> usually work in Flink.
>>>>>> > >
>>>>>> > > On 10/10/2022 12:45, Qingsheng Ren wrote:
>>>>>> > >
>>>>>> > > Thanks everyone for joining the discussion!
>>>>>> > >
>>>>>> > > > Do you have any idea what has happened in the process here?
>>>>>> > >
>>>>>> > > The discussion in this PR [1] shows some details and could be
>>>>>> helpful to understand the original motivation of the renaming. We do have a
>>>>>> test case for guarding metrics but unfortunaly the case was also modified
>>>>>> so the defense was broken.
>>>>>> > >
>>>>>> > > I think the reason why both the developer and the reviewer forgot
>>>>>> to trigger an discussion and gave a green pass on the change is that
>>>>>> metrics are quite “trivial” to be noticed as public APIs. As mentioned by
>>>>>> Martijn I couldn’t find a place noting that metrics are public APIs and
>>>>>> should be treated carefully while contributing and reviewing.
>>>>>> > >
>>>>>> > > IMHO three actions could be made to prevent this kind of changes
>>>>>> in the future:
>>>>>> > >
>>>>>> > > a. Add test case for metrics (which we already have in
>>>>>> SinkMetricsITCase)
>>>>>> > > b. We emphasize that any public-interface breaking changes should
>>>>>> be proposed by a FLIP or discussed in mailing list, and should be listed in
>>>>>> the release note.
>>>>>> > > c. We remind contributors and reviewers about what should be
>>>>>> considered as public API, and include metric names in it.
>>>>>> > >
>>>>>> > > For b and c these two pages [2][3] might be proper places.
>>>>>> > >
>>>>>> > > About the patch to revert this, it looks like we have a consensus
>>>>>> on 1.16. As of 1.15 I think it’s worthy to trigger a minor version. I
>>>>>> didn’t see complaints about this for now so it should be OK to save the
>>>>>> situation asap. I’m with Xintong’s idea to treat numXXXSend as an alias of
>>>>>> numXXXOut considering there could possibly some users have already adapted
>>>>>> their system to the new naming, and have another internal metric for
>>>>>> reflecting number of outgoing committable batches (actually the
>>>>>> numRecordsIn of sink committer operator should be carrying this info
>>>>>> already).
>>>>>> > >
>>>>>> > > [1] https://github.com/apache/flink/pull/18825
>>>>>> > > [2] https://flink.apache.org/contributing/contribute-code.html
>>>>>> > > [3] https://flink.apache.org/contributing/reviewing-prs.html
>>>>>> > >
>>>>>> > > Best,
>>>>>> > > Qingsheng
>>>>>> > > On Oct 10, 2022, 17:40 +0800, Xintong Song <to...@gmail.com>,
>>>>>> wrote:
>>>>>> > >
>>>>>> > > +1 for reverting these changes in Flink 1.16.
>>>>>> > >
>>>>>> > > For 1.15.3, can we make these metrics available via both names
>>>>>> (numXXXOut and numXXXSend)? In this way we don't break it for those who
>>>>>> already migrated to 1.15 and numXXXSend. That means we still need to change
>>>>>> SinkWriterOperator to use another metric name in 1.15.3, which IIUC is
>>>>>> internal to Flink sink.
>>>>>> > >
>>>>>> > > I'm overall +1 to change numXXXOut back to its original
>>>>>> semantics. AFAIK (from meetup / flink-forward questionaires), most users do
>>>>>> not migrate to a new Flink release immediately, until the next 1-2 major
>>>>>> releases are out.
>>>>>> > >
>>>>>> > > Best,
>>>>>> > >
>>>>>> > > Xintong
>>>>>> > >
>>>>>> > >
>>>>>> > >
>>>>>> > > On Mon, Oct 10, 2022 at 5:26 PM Martijn Visser <
>>>>>> martijnvisser@apache.org> wrote:
>>>>>> > >>
>>>>>> > >> Hi Qingsheng,
>>>>>> > >>
>>>>>> > >> Do you have any idea what has happened in the process here? Do
>>>>>> we know why
>>>>>> > >> they were changed? I was under the impression that these metric
>>>>>> names were
>>>>>> > >> newly introduced due to the new interfaces and because it still
>>>>>> depends on
>>>>>> > >> each connector implementing these.
>>>>>> > >>
>>>>>> > >> Sidenote: metric names are not mentioned in the FLIP process as
>>>>>> a public
>>>>>> > >> API. Might make sense to have a separate follow-up to add that
>>>>>> to the list
>>>>>> > >> (I do think we should list them there).
>>>>>> > >>
>>>>>> > >> +1 for reverting this and make this change in Flink 1.16
>>>>>> > >>
>>>>>> > >> I'm not in favour of releasing a Flink 1.15.3 with this change:
>>>>>> I think the
>>>>>> > >> impact is too big for a patch version, especially given how long
>>>>>> Flink 1.15
>>>>>> > >> is already out there.
>>>>>> > >>
>>>>>> > >> Best regards,
>>>>>> > >>
>>>>>> > >> Martijn
>>>>>> > >>
>>>>>> > >> On Mon, Oct 10, 2022 at 11:13 AM Leonard Xu <xb...@gmail.com>
>>>>>> wrote:
>>>>>> > >>
>>>>>> > >> > Thanks Qingsheng for starting this thread.
>>>>>> > >> >
>>>>>> > >> > +1 on reverting sink metric name and releasing 1.15.3 to fix
>>>>>> this
>>>>>> > >> > inconsistent behavior.
>>>>>> > >> >
>>>>>> > >> >
>>>>>> > >> > Best,
>>>>>> > >> > Leonard
>>>>>> > >> >
>>>>>> > >> >
>>>>>> > >> >
>>>>>> > >> >
>>>>>> > >> >
>>>>>> > >> > 2022年10月10日 下午3:06,Jark Wu <im...@gmail.com> 写道:
>>>>>> > >> >
>>>>>> > >> > Thanks for discovering this problem, Qingsheng!
>>>>>> > >> >
>>>>>> > >> > I'm also +1 for reverting the breaking changes.
>>>>>> > >> >
>>>>>> > >> > IIUC, currently, the behavior of "numXXXOut" metrics of the
>>>>>> new and old
>>>>>> > >> > sink is inconsistent.
>>>>>> > >> > We have to break one of them to have consistent behavior. Sink
>>>>>> V2 is an
>>>>>> > >> > evolving API which is just introduced in 1.15.
>>>>>> > >> > I think it makes sense to break the unstable API instead of
>>>>>> the stable API
>>>>>> > >> > which many connectors and users depend on.
>>>>>> > >> >
>>>>>> > >> > Best,
>>>>>> > >> > Jark
>>>>>> > >> >
>>>>>> > >> >
>>>>>> > >> >
>>>>>> > >> > On Mon, 10 Oct 2022 at 11:36, Jingsong Li <
>>>>>> jingsonglee0@gmail.com> wrote:
>>>>>> > >> >
>>>>>> > >> >> Thanks for driving, Qingsheng.
>>>>>> > >> >>
>>>>>> > >> >> +1 for reverting sink metric name.
>>>>>> > >> >>
>>>>>> > >> >> We often forget that metric is also one of the important APIs.
>>>>>> > >> >>
>>>>>> > >> >> +1 for releasing 1.15.3 to fix this.
>>>>>> > >> >>
>>>>>> > >> >> Best,
>>>>>> > >> >> Jingsong
>>>>>> > >> >>
>>>>>> > >> >> On Sun, Oct 9, 2022 at 11:35 PM Becket Qin <
>>>>>> becket.qin@gmail.com> wrote:
>>>>>> > >> >> >
>>>>>> > >> >> > Thanks for raising the discussion, Qingsheng,
>>>>>> > >> >> >
>>>>>> > >> >> > +1 on reverting the breaking changes.
>>>>>> > >> >> >
>>>>>> > >> >> > In addition, we might want to release a 1.15.3 to fix this
>>>>>> and update
>>>>>> > >> >> the previous release docs with this known issue, so that
>>>>>> users can upgrade
>>>>>> > >> >> to 1.15.3 when they hit it. It would also be good to add some
>>>>>> backwards
>>>>>> > >> >> compatibility tests on metrics to avoid unintended breaking
>>>>>> changes like
>>>>>> > >> >> this in the future.
>>>>>> > >> >> >
>>>>>> > >> >> > Thanks,
>>>>>> > >> >> >
>>>>>> > >> >> > Jiangjie (Becket) Qin
>>>>>> > >> >> >
>>>>>> > >> >> > On Sun, Oct 9, 2022 at 10:35 AM Qingsheng Ren <
>>>>>> renqs@apache.org> wrote:
>>>>>> > >> >> >>
>>>>>> > >> >> >> Hi devs and users,
>>>>>> > >> >> >>
>>>>>> > >> >> >> I’d like to start a discussion about reverting a breaking
>>>>>> change about
>>>>>> > >> >> sink metrics made in 1.15 by FLINK-26126 [1] and FLINK-26492
>>>>>> [2].
>>>>>> > >> >> >>
>>>>>> > >> >> >> TL;DR
>>>>>> > >> >> >>
>>>>>> > >> >> >> All sink metrics with name “numXXXOut” defined in FLIP-33
>>>>>> are replace
>>>>>> > >> >> by “numXXXSend” in FLINK-26126 and FLINK-26492. Considering
>>>>>> metric names
>>>>>> > >> >> are public APIs, this is a breaking change to end users and
>>>>>> not backward
>>>>>> > >> >> compatible. Also unfortunately this breaking change was not
>>>>>> discussed in
>>>>>> > >> >> the mailing list before.
>>>>>> > >> >> >>
>>>>>> > >> >> >> Background
>>>>>> > >> >> >>
>>>>>> > >> >> >> As defined previously in FLIP-33 (the FLIP page has been
>>>>>> changed so
>>>>>> > >> >> please refer to the old version [3] ), metric “numRecordsOut”
>>>>>> is used for
>>>>>> > >> >> reporting the total number of output records since the sink
>>>>>> started (number
>>>>>> > >> >> of records written to the external system), and similarly for
>>>>>> > >> >> “numRecordsOutPerSecond”, “numBytesOut”,
>>>>>> “numBytesOutPerSecond” and
>>>>>> > >> >> “numRecordsOutError”. Most sinks are following this naming
>>>>>> and definition.
>>>>>> > >> >> However, these metrics are ambiguous in the new Sink API as
>>>>>> “numXXXOut”
>>>>>> > >> >> could be used by the output of SinkWriterOperator for
>>>>>> reporting number of
>>>>>> > >> >> Committables delivered to SinkCommitterOperator. In order to
>>>>>> resolve the
>>>>>> > >> >> conflict, FLINK-26126 and FLINK-26492 changed names of these
>>>>>> metrics with
>>>>>> > >> >> “numXXXSend”.
>>>>>> > >> >> >>
>>>>>> > >> >> >> Necessity of reverting this change
>>>>>> > >> >> >>
>>>>>> > >> >> >> - Metric names are actually public API, as end users need
>>>>>> to configure
>>>>>> > >> >> metric collecting and alerting system with metric names.
>>>>>> Users have to
>>>>>> > >> >> reset all configurations related to affected metrics.
>>>>>> > >> >> >> - This could also affect custom and external sinks not
>>>>>> maintained by
>>>>>> > >> >> Flink, which might have implemented with numXXXOut metrics.
>>>>>> > >> >> >> - The number of records sent to external system is way
>>>>>> more important
>>>>>> > >> >> than the number of Committables sent to
>>>>>> SinkCommitterOperator, as the
>>>>>> > >> >> latter one is just an internal implementation of sink. We
>>>>>> could have a new
>>>>>> > >> >> metric name for the latter one instead.
>>>>>> > >> >> >> - We could avoid splitting the project by version (like
>>>>>> “plz use
>>>>>> > >> >> numXXXOut before 1.15 and use numXXXSend after”) if we revert
>>>>>> it ASAP,
>>>>>> > >> >> cosidering 1.16 is still not released for now.
>>>>>> > >> >> >>
>>>>>> > >> >> >> As a consequence, I’d like to hear from devs and users
>>>>>> about your
>>>>>> > >> >> opinion on changing these metrics back to “numXXXOut”.
>>>>>> > >> >> >>
>>>>>> > >> >> >> Looking forward to your reply!
>>>>>> > >> >> >>
>>>>>> > >> >> >> [1] https://issues.apache.org/jira/browse/FLINK-26126
>>>>>> > >> >> >> [2] https://issues.apache.org/jira/browse/FLINK-26492
>>>>>> > >> >> >> [1] FLIP-33, version 18:
>>>>>> > >> >>
>>>>>> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=211883136
>>>>>> > >> >> >>
>>>>>> > >> >> >> Best,
>>>>>> > >> >> >> Qingsheng
>>>>>> > >> >>
>>>>>> > >> >
>>>>>> > >> >
>>>>>> > >
>>>>>> > >
>>>>>> > >
>>>>>>
>>>>>

Re: [DISCUSS] Reverting sink metric name changes made in 1.15

Posted by Jark Wu <im...@gmail.com>.
Hi Xintong,

In terms of code, I think it's not complicated. It's all about we need a
public discussion for the new metric name.
And we don't want to block the release for the rarely used metric.

Best,
Jark

On Fri, 14 Oct 2022 at 10:07, Xintong Song <to...@gmail.com> wrote:

> @Qingsheng,
>
> I'm overall +1 to your proposal, with only one question: How complicated
> is it to come up with a metric for the internal traffic?
>
> I'm asking because, as the new feature is already out for 1.15 & 1.16, it
> would be nice if the corresponding new metrics can also be available in
> these versions.
>
> I'm not saying this should block the 1.16 release. Just trying to
> understand the efforts needed. If adding such metrics is not something that
> can be done shortly, I'd also be fine with releasing 1.16 without them.
>
> Best,
>
> Xintong
>
>
>
> On Fri, Oct 14, 2022 at 1:31 AM Qingsheng Ren <re...@apache.org> wrote:
>
>> Hi devs and users,
>>
>> It looks like we are getting an initial consensus in the discussion so I
>> started a voting thread [1] just now. Looking forward to your feedback!
>>
>> [1] https://lists.apache.org/thread/ozlf82mkm6ndx2n1vdgq532h156p4lt6
>>
>> Best,
>> Qingsheng
>>
>>
>> On Thu, Oct 13, 2022 at 10:41 PM Jing Ge <ji...@ververica.com> wrote:
>>
>>> Hi Qingsheng,
>>>
>>> Thanks for the clarification. +1, I like the idea. Pointing both
>>> numXXXOut and numXXXSend to the same external data transfer metric does not
>>> really break the new SinkV2 design, since there was no requirement to
>>> monitor the internal traffic. So, I think both developer and user can live
>>> with it. It might not be the perfect solution but is indeed the currently
>>> best trade-off solution after considering the backward compatibility.  I
>>> would suggest firing a follow-up ticket after the PR to take care of the
>>> new metric for the internal traffic in the future.
>>>
>>> Best regards,
>>> Jing
>>>
>>>
>>> On Thu, Oct 13, 2022 at 3:08 PM Qingsheng Ren <re...@apache.org> wrote:
>>>
>>>> Hi Jing,
>>>>
>>>> Thanks for the reply!
>>>>
>>>> Let me rephrase my proposal: we’d like to use numXXXOut registered on
>>>> SinkWriterOperator to reflect the traffic to the external system for
>>>> compatibility with old versions before 1.15, and make numXXXSend have the
>>>> same value as numXXXOut for compatibility within 1.15. That means both
>>>> numXXXOut and numXXXSend are used for external data transfers, which end
>>>> users care more about. As for the internal traffic within the sink, we
>>>> could name a new metric for it because this is a _new_ feature in the _new_
>>>> sink, and end users usually don’t pay attention to internal implementation.
>>>> The name of the new metric could be discussed later after 1.16 release.
>>>>
>>>> > but it might end up with monitoring unexpected metrics, which is
>>>> even worse for users, i.e. I didn't change anything, but something has been
>>>> broken since the last update.
>>>>
>>>> Yeah this is exactly what we are trying to fix with this proposal. I
>>>> believe users are more concerned with the output to the external system
>>>> than the internal data delivery in the sink, so I think we’ll have more
>>>> cases reporting like “I set up a panel on numRecordsOut in sink to monitor
>>>> the output of the job, but after upgrading to 1.15 this value is extremely
>>>> low and I didn’t change anything” if we stick to the current situation. I
>>>> think only a few end users care about the number of committables sending to
>>>> downstream as most of them don’t care how the sink works.
>>>>
>>>> We do need a re-design to fully distinguish the internal and external
>>>> traffic on metrics, not only in sink but in all operators as it’s quite
>>>> common for operators to make IO. This needs time to design, discuss, adjust
>>>> and vote, but considering this is blocking 1.16, maybe it’s better to
>>>> rescue the compatibility for now, and leave the huge reconstruction to
>>>> future versions (maybe 2.0).
>>>>
>>>> Best,
>>>> Qingsheng
>>>>
>>>> On Wed, Oct 12, 2022 at 7:21 PM Jing Ge <ji...@ververica.com> wrote:
>>>>
>>>>> Hi Qingsheng,
>>>>>
>>>>> Just want to make sure we are on the same page. Are you suggesting
>>>>> switching the naming between "numXXXSend" and "numXXXOut" or reverting all
>>>>> the changes we did with FLINK-26126 and FLINK-26492?
>>>>>
>>>>> For the naming switch, please pay attention that the behaviour has
>>>>> been changed since we introduced SinkV2[1]. So, please be aware of
>>>>> different numbers(behaviour change) even with the same metrics name.
>>>>> Sticking with the old name with the new behaviour (very bad idea, IMHO)
>>>>> might seem like saving the effort in the first place, but it might end up
>>>>> with monitoring unexpected metrics, which is even worse for users, i.e. I
>>>>> didn't change anything, but something has been broken since
>>>>> the last update.
>>>>>
>>>>> For reverting, I am not sure how to fix the issue mentioned in
>>>>> FLINK-26126 after reverting all changes. Like Chesnay has already pointed
>>>>> out, with SinkV2 we have two different output lines - one with the external
>>>>> system and the other with the downstream operator. In this case,
>>>>> "numXXXSend" is rather a new metric than a replacement of "numXXXOut". The
>>>>> "numXXXOut" metric can still be used, depending on what the user wants to
>>>>> monitor.
>>>>>
>>>>>
>>>>> Best regards,
>>>>> Jing
>>>>>
>>>>> [1]
>>>>> https://github.com/apache/flink/blob/51fc20db30d001a95de95b3b9993eeb06f558f6c/flink-metrics/flink-metrics-core/src/main/java/org/apache/flink/metrics/groups/SinkWriterMetricGroup.java#L48
>>>>>
>>>>>
>>>>> On Wed, Oct 12, 2022 at 12:48 PM Qingsheng Ren <re...@apache.org>
>>>>> wrote:
>>>>>
>>>>>> As a supplement, considering it could be a big reconstruction
>>>>>> redefining internal and external traffic and touching metric names in
>>>>>> almost all operators, this requires a lot of discussions and we might
>>>>>> do it finally in Flink 2.0. I think compatibility is a bigger blocker
>>>>>> in front of us, as the output of sink is a metric that users care a
>>>>>> lot about.
>>>>>>
>>>>>> Thanks,
>>>>>> Qingsheng
>>>>>>
>>>>>> On Wed, Oct 12, 2022 at 6:20 PM Qingsheng Ren <re...@apache.org>
>>>>>> wrote:
>>>>>> >
>>>>>> > Thanks Chesnay for the reply. +1 for making a unified and clearer
>>>>>> > metric definition distinguishing internal and external data
>>>>>> transfers.
>>>>>> > As you described, having IO in operators is quite common such as
>>>>>> > dimension tables in Table/SQL API. This definitely deserves a FLIP
>>>>>> and
>>>>>> > an overall design.
>>>>>> >
>>>>>> > However I think it's necessary to change the metric back to
>>>>>> > numRecordsOut instead of sticking with numRecordsSend in 1.15 and
>>>>>> > 1.16. The most important argument is for compatibility as I
>>>>>> mentioned
>>>>>> > in my previous email, otherwise all users have to modify their
>>>>>> configs
>>>>>> > of metric systems after upgrading to Flink 1.15+, and all custom
>>>>>> > connectors have to change their implementations to migrate to the
>>>>>> new
>>>>>> > metric name. I believe other ones participating and approving this
>>>>>> > proposal share the same concern about compatibility too. Also
>>>>>> > considering this issue is blocking the release of 1.16, maybe we
>>>>>> could
>>>>>> > fix this asap, and as for defining a new metric for internal data
>>>>>> > transfers we can have an in-depth discussion later. WDYT?
>>>>>> >
>>>>>> > Best,
>>>>>> > Qingsheng
>>>>>> >
>>>>>> > On Tue, Oct 11, 2022 at 6:06 PM Chesnay Schepler <
>>>>>> chesnay@apache.org> wrote:
>>>>>> > >
>>>>>> > > Currently I think that would be a mistake.
>>>>>> > >
>>>>>> > > Ultimately what we have here is the culmination of us never
>>>>>> really considering how the numRecordsOut metric should behave for operators
>>>>>> that emit data to other operators _and_ external systems. This goes beyond
>>>>>> sinks.
>>>>>> > > This even applies to numRecordsIn, for cases where functions
>>>>>> query/write data from/to the outside, (e.g., Async IO).
>>>>>> > >
>>>>>> > > Having 2 separate metrics for that, 1 exclusively for internal
>>>>>> data transfers, and 1 exclusively for external data transfers, is the only
>>>>>> way to get a consistent metric definition in the long-run.
>>>>>> > > We can jump back-and-forth now or just commit to it.
>>>>>> > >
>>>>>> > > I don't think we can really judge this based on FLIP-33. It was
>>>>>> IIRC written before the two phase sinks were added, which heavily blurred
>>>>>> the lines of what a sink even is. Because it definitely is _not_ the last
>>>>>> operator in a chain anymore.
>>>>>> > >
>>>>>> > > What I would suggest is to stick with what we got (although I
>>>>>> despise the name numRecordsSend), and alias the numRecordsOut metric for
>>>>>> all non-TwoPhaseCommittingSink.
>>>>>> > >
>>>>>> > > On 11/10/2022 05:54, Qingsheng Ren wrote:
>>>>>> > >
>>>>>> > > Thanks for the details Chesnay!
>>>>>> > >
>>>>>> > > By “alias” I mean to respect the original definition made in
>>>>>> FLIP-33 for numRecordsOut, which is the number of records written to the
>>>>>> external system, and keep numRecordsSend as the same value as numRecordsOut
>>>>>> for compatibility.
>>>>>> > >
>>>>>> > > I think keeping numRecordsOut for the output to the external
>>>>>> system is more intuitive to end users because in most cases the metric of
>>>>>> data flow output is more essential. I agree with you that a new metric is
>>>>>> required, but considering compatibility and users’ intuition I prefer to
>>>>>> keep the initial definition of numRecordsOut in FLIP-33 and name a new
>>>>>> metric for sink writer’s output to downstream operators. This might be
>>>>>> against consistency with metrics in other operators in Flink but maybe it’s
>>>>>> acceptable to have the sink as a special case.
>>>>>> > >
>>>>>> > > Best,
>>>>>> > > Qingsheng
>>>>>> > > On Oct 10, 2022, 19:13 +0800, Chesnay Schepler <
>>>>>> chesnay@apache.org>, wrote:
>>>>>> > >
>>>>>> > > > I’m with Xintong’s idea to treat numXXXSend as an alias of
>>>>>> numXXXOut
>>>>>> > >
>>>>>> > > But that's not possible. If it were that simple there would have
>>>>>> never been a need to introduce another metric in the first place.
>>>>>> > >
>>>>>> > > It's a rather fundamental issue with how the new sinks work, in
>>>>>> that they emit data to the external system (usually considered as
>>>>>> "numRecordsOut" of sinks) while _also_ sending data to a downstream
>>>>>> operator (usually considered as "numRecordsOut" of tasks).
>>>>>> > > The original issue was that the numRecordsOut of the sink counted
>>>>>> both (which is completely wrong).
>>>>>> > >
>>>>>> > > A new metric was always required; otherwise you inevitably end up
>>>>>> breaking some semantic.
>>>>>> > > Adding a new metric for what the sink writes to the external
>>>>>> system is, for better or worse, more consistent with how these metrics
>>>>>> usually work in Flink.
>>>>>> > >
>>>>>> > > On 10/10/2022 12:45, Qingsheng Ren wrote:
>>>>>> > >
>>>>>> > > Thanks everyone for joining the discussion!
>>>>>> > >
>>>>>> > > > Do you have any idea what has happened in the process here?
>>>>>> > >
>>>>>> > > The discussion in this PR [1] shows some details and could be
>>>>>> helpful to understand the original motivation of the renaming. We do have a
>>>>>> test case for guarding metrics but unfortunaly the case was also modified
>>>>>> so the defense was broken.
>>>>>> > >
>>>>>> > > I think the reason why both the developer and the reviewer forgot
>>>>>> to trigger an discussion and gave a green pass on the change is that
>>>>>> metrics are quite “trivial” to be noticed as public APIs. As mentioned by
>>>>>> Martijn I couldn’t find a place noting that metrics are public APIs and
>>>>>> should be treated carefully while contributing and reviewing.
>>>>>> > >
>>>>>> > > IMHO three actions could be made to prevent this kind of changes
>>>>>> in the future:
>>>>>> > >
>>>>>> > > a. Add test case for metrics (which we already have in
>>>>>> SinkMetricsITCase)
>>>>>> > > b. We emphasize that any public-interface breaking changes should
>>>>>> be proposed by a FLIP or discussed in mailing list, and should be listed in
>>>>>> the release note.
>>>>>> > > c. We remind contributors and reviewers about what should be
>>>>>> considered as public API, and include metric names in it.
>>>>>> > >
>>>>>> > > For b and c these two pages [2][3] might be proper places.
>>>>>> > >
>>>>>> > > About the patch to revert this, it looks like we have a consensus
>>>>>> on 1.16. As of 1.15 I think it’s worthy to trigger a minor version. I
>>>>>> didn’t see complaints about this for now so it should be OK to save the
>>>>>> situation asap. I’m with Xintong’s idea to treat numXXXSend as an alias of
>>>>>> numXXXOut considering there could possibly some users have already adapted
>>>>>> their system to the new naming, and have another internal metric for
>>>>>> reflecting number of outgoing committable batches (actually the
>>>>>> numRecordsIn of sink committer operator should be carrying this info
>>>>>> already).
>>>>>> > >
>>>>>> > > [1] https://github.com/apache/flink/pull/18825
>>>>>> > > [2] https://flink.apache.org/contributing/contribute-code.html
>>>>>> > > [3] https://flink.apache.org/contributing/reviewing-prs.html
>>>>>> > >
>>>>>> > > Best,
>>>>>> > > Qingsheng
>>>>>> > > On Oct 10, 2022, 17:40 +0800, Xintong Song <to...@gmail.com>,
>>>>>> wrote:
>>>>>> > >
>>>>>> > > +1 for reverting these changes in Flink 1.16.
>>>>>> > >
>>>>>> > > For 1.15.3, can we make these metrics available via both names
>>>>>> (numXXXOut and numXXXSend)? In this way we don't break it for those who
>>>>>> already migrated to 1.15 and numXXXSend. That means we still need to change
>>>>>> SinkWriterOperator to use another metric name in 1.15.3, which IIUC is
>>>>>> internal to Flink sink.
>>>>>> > >
>>>>>> > > I'm overall +1 to change numXXXOut back to its original
>>>>>> semantics. AFAIK (from meetup / flink-forward questionaires), most users do
>>>>>> not migrate to a new Flink release immediately, until the next 1-2 major
>>>>>> releases are out.
>>>>>> > >
>>>>>> > > Best,
>>>>>> > >
>>>>>> > > Xintong
>>>>>> > >
>>>>>> > >
>>>>>> > >
>>>>>> > > On Mon, Oct 10, 2022 at 5:26 PM Martijn Visser <
>>>>>> martijnvisser@apache.org> wrote:
>>>>>> > >>
>>>>>> > >> Hi Qingsheng,
>>>>>> > >>
>>>>>> > >> Do you have any idea what has happened in the process here? Do
>>>>>> we know why
>>>>>> > >> they were changed? I was under the impression that these metric
>>>>>> names were
>>>>>> > >> newly introduced due to the new interfaces and because it still
>>>>>> depends on
>>>>>> > >> each connector implementing these.
>>>>>> > >>
>>>>>> > >> Sidenote: metric names are not mentioned in the FLIP process as
>>>>>> a public
>>>>>> > >> API. Might make sense to have a separate follow-up to add that
>>>>>> to the list
>>>>>> > >> (I do think we should list them there).
>>>>>> > >>
>>>>>> > >> +1 for reverting this and make this change in Flink 1.16
>>>>>> > >>
>>>>>> > >> I'm not in favour of releasing a Flink 1.15.3 with this change:
>>>>>> I think the
>>>>>> > >> impact is too big for a patch version, especially given how long
>>>>>> Flink 1.15
>>>>>> > >> is already out there.
>>>>>> > >>
>>>>>> > >> Best regards,
>>>>>> > >>
>>>>>> > >> Martijn
>>>>>> > >>
>>>>>> > >> On Mon, Oct 10, 2022 at 11:13 AM Leonard Xu <xb...@gmail.com>
>>>>>> wrote:
>>>>>> > >>
>>>>>> > >> > Thanks Qingsheng for starting this thread.
>>>>>> > >> >
>>>>>> > >> > +1 on reverting sink metric name and releasing 1.15.3 to fix
>>>>>> this
>>>>>> > >> > inconsistent behavior.
>>>>>> > >> >
>>>>>> > >> >
>>>>>> > >> > Best,
>>>>>> > >> > Leonard
>>>>>> > >> >
>>>>>> > >> >
>>>>>> > >> >
>>>>>> > >> >
>>>>>> > >> >
>>>>>> > >> > 2022年10月10日 下午3:06,Jark Wu <im...@gmail.com> 写道:
>>>>>> > >> >
>>>>>> > >> > Thanks for discovering this problem, Qingsheng!
>>>>>> > >> >
>>>>>> > >> > I'm also +1 for reverting the breaking changes.
>>>>>> > >> >
>>>>>> > >> > IIUC, currently, the behavior of "numXXXOut" metrics of the
>>>>>> new and old
>>>>>> > >> > sink is inconsistent.
>>>>>> > >> > We have to break one of them to have consistent behavior. Sink
>>>>>> V2 is an
>>>>>> > >> > evolving API which is just introduced in 1.15.
>>>>>> > >> > I think it makes sense to break the unstable API instead of
>>>>>> the stable API
>>>>>> > >> > which many connectors and users depend on.
>>>>>> > >> >
>>>>>> > >> > Best,
>>>>>> > >> > Jark
>>>>>> > >> >
>>>>>> > >> >
>>>>>> > >> >
>>>>>> > >> > On Mon, 10 Oct 2022 at 11:36, Jingsong Li <
>>>>>> jingsonglee0@gmail.com> wrote:
>>>>>> > >> >
>>>>>> > >> >> Thanks for driving, Qingsheng.
>>>>>> > >> >>
>>>>>> > >> >> +1 for reverting sink metric name.
>>>>>> > >> >>
>>>>>> > >> >> We often forget that metric is also one of the important APIs.
>>>>>> > >> >>
>>>>>> > >> >> +1 for releasing 1.15.3 to fix this.
>>>>>> > >> >>
>>>>>> > >> >> Best,
>>>>>> > >> >> Jingsong
>>>>>> > >> >>
>>>>>> > >> >> On Sun, Oct 9, 2022 at 11:35 PM Becket Qin <
>>>>>> becket.qin@gmail.com> wrote:
>>>>>> > >> >> >
>>>>>> > >> >> > Thanks for raising the discussion, Qingsheng,
>>>>>> > >> >> >
>>>>>> > >> >> > +1 on reverting the breaking changes.
>>>>>> > >> >> >
>>>>>> > >> >> > In addition, we might want to release a 1.15.3 to fix this
>>>>>> and update
>>>>>> > >> >> the previous release docs with this known issue, so that
>>>>>> users can upgrade
>>>>>> > >> >> to 1.15.3 when they hit it. It would also be good to add some
>>>>>> backwards
>>>>>> > >> >> compatibility tests on metrics to avoid unintended breaking
>>>>>> changes like
>>>>>> > >> >> this in the future.
>>>>>> > >> >> >
>>>>>> > >> >> > Thanks,
>>>>>> > >> >> >
>>>>>> > >> >> > Jiangjie (Becket) Qin
>>>>>> > >> >> >
>>>>>> > >> >> > On Sun, Oct 9, 2022 at 10:35 AM Qingsheng Ren <
>>>>>> renqs@apache.org> wrote:
>>>>>> > >> >> >>
>>>>>> > >> >> >> Hi devs and users,
>>>>>> > >> >> >>
>>>>>> > >> >> >> I’d like to start a discussion about reverting a breaking
>>>>>> change about
>>>>>> > >> >> sink metrics made in 1.15 by FLINK-26126 [1] and FLINK-26492
>>>>>> [2].
>>>>>> > >> >> >>
>>>>>> > >> >> >> TL;DR
>>>>>> > >> >> >>
>>>>>> > >> >> >> All sink metrics with name “numXXXOut” defined in FLIP-33
>>>>>> are replace
>>>>>> > >> >> by “numXXXSend” in FLINK-26126 and FLINK-26492. Considering
>>>>>> metric names
>>>>>> > >> >> are public APIs, this is a breaking change to end users and
>>>>>> not backward
>>>>>> > >> >> compatible. Also unfortunately this breaking change was not
>>>>>> discussed in
>>>>>> > >> >> the mailing list before.
>>>>>> > >> >> >>
>>>>>> > >> >> >> Background
>>>>>> > >> >> >>
>>>>>> > >> >> >> As defined previously in FLIP-33 (the FLIP page has been
>>>>>> changed so
>>>>>> > >> >> please refer to the old version [3] ), metric “numRecordsOut”
>>>>>> is used for
>>>>>> > >> >> reporting the total number of output records since the sink
>>>>>> started (number
>>>>>> > >> >> of records written to the external system), and similarly for
>>>>>> > >> >> “numRecordsOutPerSecond”, “numBytesOut”,
>>>>>> “numBytesOutPerSecond” and
>>>>>> > >> >> “numRecordsOutError”. Most sinks are following this naming
>>>>>> and definition.
>>>>>> > >> >> However, these metrics are ambiguous in the new Sink API as
>>>>>> “numXXXOut”
>>>>>> > >> >> could be used by the output of SinkWriterOperator for
>>>>>> reporting number of
>>>>>> > >> >> Committables delivered to SinkCommitterOperator. In order to
>>>>>> resolve the
>>>>>> > >> >> conflict, FLINK-26126 and FLINK-26492 changed names of these
>>>>>> metrics with
>>>>>> > >> >> “numXXXSend”.
>>>>>> > >> >> >>
>>>>>> > >> >> >> Necessity of reverting this change
>>>>>> > >> >> >>
>>>>>> > >> >> >> - Metric names are actually public API, as end users need
>>>>>> to configure
>>>>>> > >> >> metric collecting and alerting system with metric names.
>>>>>> Users have to
>>>>>> > >> >> reset all configurations related to affected metrics.
>>>>>> > >> >> >> - This could also affect custom and external sinks not
>>>>>> maintained by
>>>>>> > >> >> Flink, which might have implemented with numXXXOut metrics.
>>>>>> > >> >> >> - The number of records sent to external system is way
>>>>>> more important
>>>>>> > >> >> than the number of Committables sent to
>>>>>> SinkCommitterOperator, as the
>>>>>> > >> >> latter one is just an internal implementation of sink. We
>>>>>> could have a new
>>>>>> > >> >> metric name for the latter one instead.
>>>>>> > >> >> >> - We could avoid splitting the project by version (like
>>>>>> “plz use
>>>>>> > >> >> numXXXOut before 1.15 and use numXXXSend after”) if we revert
>>>>>> it ASAP,
>>>>>> > >> >> cosidering 1.16 is still not released for now.
>>>>>> > >> >> >>
>>>>>> > >> >> >> As a consequence, I’d like to hear from devs and users
>>>>>> about your
>>>>>> > >> >> opinion on changing these metrics back to “numXXXOut”.
>>>>>> > >> >> >>
>>>>>> > >> >> >> Looking forward to your reply!
>>>>>> > >> >> >>
>>>>>> > >> >> >> [1] https://issues.apache.org/jira/browse/FLINK-26126
>>>>>> > >> >> >> [2] https://issues.apache.org/jira/browse/FLINK-26492
>>>>>> > >> >> >> [1] FLIP-33, version 18:
>>>>>> > >> >>
>>>>>> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=211883136
>>>>>> > >> >> >>
>>>>>> > >> >> >> Best,
>>>>>> > >> >> >> Qingsheng
>>>>>> > >> >>
>>>>>> > >> >
>>>>>> > >> >
>>>>>> > >
>>>>>> > >
>>>>>> > >
>>>>>>
>>>>>

Re: [DISCUSS] Reverting sink metric name changes made in 1.15

Posted by Xintong Song <to...@gmail.com>.
@Qingsheng,

I'm overall +1 to your proposal, with only one question: How complicated is
it to come up with a metric for the internal traffic?

I'm asking because, as the new feature is already out for 1.15 & 1.16, it
would be nice if the corresponding new metrics can also be available in
these versions.

I'm not saying this should block the 1.16 release. Just trying to
understand the efforts needed. If adding such metrics is not something that
can be done shortly, I'd also be fine with releasing 1.16 without them.

Best,

Xintong



On Fri, Oct 14, 2022 at 1:31 AM Qingsheng Ren <re...@apache.org> wrote:

> Hi devs and users,
>
> It looks like we are getting an initial consensus in the discussion so I
> started a voting thread [1] just now. Looking forward to your feedback!
>
> [1] https://lists.apache.org/thread/ozlf82mkm6ndx2n1vdgq532h156p4lt6
>
> Best,
> Qingsheng
>
>
> On Thu, Oct 13, 2022 at 10:41 PM Jing Ge <ji...@ververica.com> wrote:
>
>> Hi Qingsheng,
>>
>> Thanks for the clarification. +1, I like the idea. Pointing both
>> numXXXOut and numXXXSend to the same external data transfer metric does not
>> really break the new SinkV2 design, since there was no requirement to
>> monitor the internal traffic. So, I think both developer and user can live
>> with it. It might not be the perfect solution but is indeed the currently
>> best trade-off solution after considering the backward compatibility.  I
>> would suggest firing a follow-up ticket after the PR to take care of the
>> new metric for the internal traffic in the future.
>>
>> Best regards,
>> Jing
>>
>>
>> On Thu, Oct 13, 2022 at 3:08 PM Qingsheng Ren <re...@apache.org> wrote:
>>
>>> Hi Jing,
>>>
>>> Thanks for the reply!
>>>
>>> Let me rephrase my proposal: we’d like to use numXXXOut registered on
>>> SinkWriterOperator to reflect the traffic to the external system for
>>> compatibility with old versions before 1.15, and make numXXXSend have the
>>> same value as numXXXOut for compatibility within 1.15. That means both
>>> numXXXOut and numXXXSend are used for external data transfers, which end
>>> users care more about. As for the internal traffic within the sink, we
>>> could name a new metric for it because this is a _new_ feature in the _new_
>>> sink, and end users usually don’t pay attention to internal implementation.
>>> The name of the new metric could be discussed later after 1.16 release.
>>>
>>> > but it might end up with monitoring unexpected metrics, which is even
>>> worse for users, i.e. I didn't change anything, but something has been
>>> broken since the last update.
>>>
>>> Yeah this is exactly what we are trying to fix with this proposal. I
>>> believe users are more concerned with the output to the external system
>>> than the internal data delivery in the sink, so I think we’ll have more
>>> cases reporting like “I set up a panel on numRecordsOut in sink to monitor
>>> the output of the job, but after upgrading to 1.15 this value is extremely
>>> low and I didn’t change anything” if we stick to the current situation. I
>>> think only a few end users care about the number of committables sending to
>>> downstream as most of them don’t care how the sink works.
>>>
>>> We do need a re-design to fully distinguish the internal and external
>>> traffic on metrics, not only in sink but in all operators as it’s quite
>>> common for operators to make IO. This needs time to design, discuss, adjust
>>> and vote, but considering this is blocking 1.16, maybe it’s better to
>>> rescue the compatibility for now, and leave the huge reconstruction to
>>> future versions (maybe 2.0).
>>>
>>> Best,
>>> Qingsheng
>>>
>>> On Wed, Oct 12, 2022 at 7:21 PM Jing Ge <ji...@ververica.com> wrote:
>>>
>>>> Hi Qingsheng,
>>>>
>>>> Just want to make sure we are on the same page. Are you suggesting
>>>> switching the naming between "numXXXSend" and "numXXXOut" or reverting all
>>>> the changes we did with FLINK-26126 and FLINK-26492?
>>>>
>>>> For the naming switch, please pay attention that the behaviour has been
>>>> changed since we introduced SinkV2[1]. So, please be aware of different
>>>> numbers(behaviour change) even with the same metrics name. Sticking with
>>>> the old name with the new behaviour (very bad idea, IMHO) might seem like
>>>> saving the effort in the first place, but it might end up with monitoring
>>>> unexpected metrics, which is even worse for users, i.e. I didn't change
>>>> anything, but something has been broken since the last update.
>>>>
>>>> For reverting, I am not sure how to fix the issue mentioned in
>>>> FLINK-26126 after reverting all changes. Like Chesnay has already pointed
>>>> out, with SinkV2 we have two different output lines - one with the external
>>>> system and the other with the downstream operator. In this case,
>>>> "numXXXSend" is rather a new metric than a replacement of "numXXXOut". The
>>>> "numXXXOut" metric can still be used, depending on what the user wants to
>>>> monitor.
>>>>
>>>>
>>>> Best regards,
>>>> Jing
>>>>
>>>> [1]
>>>> https://github.com/apache/flink/blob/51fc20db30d001a95de95b3b9993eeb06f558f6c/flink-metrics/flink-metrics-core/src/main/java/org/apache/flink/metrics/groups/SinkWriterMetricGroup.java#L48
>>>>
>>>>
>>>> On Wed, Oct 12, 2022 at 12:48 PM Qingsheng Ren <re...@apache.org>
>>>> wrote:
>>>>
>>>>> As a supplement, considering it could be a big reconstruction
>>>>> redefining internal and external traffic and touching metric names in
>>>>> almost all operators, this requires a lot of discussions and we might
>>>>> do it finally in Flink 2.0. I think compatibility is a bigger blocker
>>>>> in front of us, as the output of sink is a metric that users care a
>>>>> lot about.
>>>>>
>>>>> Thanks,
>>>>> Qingsheng
>>>>>
>>>>> On Wed, Oct 12, 2022 at 6:20 PM Qingsheng Ren <re...@apache.org>
>>>>> wrote:
>>>>> >
>>>>> > Thanks Chesnay for the reply. +1 for making a unified and clearer
>>>>> > metric definition distinguishing internal and external data
>>>>> transfers.
>>>>> > As you described, having IO in operators is quite common such as
>>>>> > dimension tables in Table/SQL API. This definitely deserves a FLIP
>>>>> and
>>>>> > an overall design.
>>>>> >
>>>>> > However I think it's necessary to change the metric back to
>>>>> > numRecordsOut instead of sticking with numRecordsSend in 1.15 and
>>>>> > 1.16. The most important argument is for compatibility as I mentioned
>>>>> > in my previous email, otherwise all users have to modify their
>>>>> configs
>>>>> > of metric systems after upgrading to Flink 1.15+, and all custom
>>>>> > connectors have to change their implementations to migrate to the new
>>>>> > metric name. I believe other ones participating and approving this
>>>>> > proposal share the same concern about compatibility too. Also
>>>>> > considering this issue is blocking the release of 1.16, maybe we
>>>>> could
>>>>> > fix this asap, and as for defining a new metric for internal data
>>>>> > transfers we can have an in-depth discussion later. WDYT?
>>>>> >
>>>>> > Best,
>>>>> > Qingsheng
>>>>> >
>>>>> > On Tue, Oct 11, 2022 at 6:06 PM Chesnay Schepler <ch...@apache.org>
>>>>> wrote:
>>>>> > >
>>>>> > > Currently I think that would be a mistake.
>>>>> > >
>>>>> > > Ultimately what we have here is the culmination of us never really
>>>>> considering how the numRecordsOut metric should behave for operators that
>>>>> emit data to other operators _and_ external systems. This goes beyond sinks.
>>>>> > > This even applies to numRecordsIn, for cases where functions
>>>>> query/write data from/to the outside, (e.g., Async IO).
>>>>> > >
>>>>> > > Having 2 separate metrics for that, 1 exclusively for internal
>>>>> data transfers, and 1 exclusively for external data transfers, is the only
>>>>> way to get a consistent metric definition in the long-run.
>>>>> > > We can jump back-and-forth now or just commit to it.
>>>>> > >
>>>>> > > I don't think we can really judge this based on FLIP-33. It was
>>>>> IIRC written before the two phase sinks were added, which heavily blurred
>>>>> the lines of what a sink even is. Because it definitely is _not_ the last
>>>>> operator in a chain anymore.
>>>>> > >
>>>>> > > What I would suggest is to stick with what we got (although I
>>>>> despise the name numRecordsSend), and alias the numRecordsOut metric for
>>>>> all non-TwoPhaseCommittingSink.
>>>>> > >
>>>>> > > On 11/10/2022 05:54, Qingsheng Ren wrote:
>>>>> > >
>>>>> > > Thanks for the details Chesnay!
>>>>> > >
>>>>> > > By “alias” I mean to respect the original definition made in
>>>>> FLIP-33 for numRecordsOut, which is the number of records written to the
>>>>> external system, and keep numRecordsSend as the same value as numRecordsOut
>>>>> for compatibility.
>>>>> > >
>>>>> > > I think keeping numRecordsOut for the output to the external
>>>>> system is more intuitive to end users because in most cases the metric of
>>>>> data flow output is more essential. I agree with you that a new metric is
>>>>> required, but considering compatibility and users’ intuition I prefer to
>>>>> keep the initial definition of numRecordsOut in FLIP-33 and name a new
>>>>> metric for sink writer’s output to downstream operators. This might be
>>>>> against consistency with metrics in other operators in Flink but maybe it’s
>>>>> acceptable to have the sink as a special case.
>>>>> > >
>>>>> > > Best,
>>>>> > > Qingsheng
>>>>> > > On Oct 10, 2022, 19:13 +0800, Chesnay Schepler <ch...@apache.org>,
>>>>> wrote:
>>>>> > >
>>>>> > > > I’m with Xintong’s idea to treat numXXXSend as an alias of
>>>>> numXXXOut
>>>>> > >
>>>>> > > But that's not possible. If it were that simple there would have
>>>>> never been a need to introduce another metric in the first place.
>>>>> > >
>>>>> > > It's a rather fundamental issue with how the new sinks work, in
>>>>> that they emit data to the external system (usually considered as
>>>>> "numRecordsOut" of sinks) while _also_ sending data to a downstream
>>>>> operator (usually considered as "numRecordsOut" of tasks).
>>>>> > > The original issue was that the numRecordsOut of the sink counted
>>>>> both (which is completely wrong).
>>>>> > >
>>>>> > > A new metric was always required; otherwise you inevitably end up
>>>>> breaking some semantic.
>>>>> > > Adding a new metric for what the sink writes to the external
>>>>> system is, for better or worse, more consistent with how these metrics
>>>>> usually work in Flink.
>>>>> > >
>>>>> > > On 10/10/2022 12:45, Qingsheng Ren wrote:
>>>>> > >
>>>>> > > Thanks everyone for joining the discussion!
>>>>> > >
>>>>> > > > Do you have any idea what has happened in the process here?
>>>>> > >
>>>>> > > The discussion in this PR [1] shows some details and could be
>>>>> helpful to understand the original motivation of the renaming. We do have a
>>>>> test case for guarding metrics but unfortunaly the case was also modified
>>>>> so the defense was broken.
>>>>> > >
>>>>> > > I think the reason why both the developer and the reviewer forgot
>>>>> to trigger an discussion and gave a green pass on the change is that
>>>>> metrics are quite “trivial” to be noticed as public APIs. As mentioned by
>>>>> Martijn I couldn’t find a place noting that metrics are public APIs and
>>>>> should be treated carefully while contributing and reviewing.
>>>>> > >
>>>>> > > IMHO three actions could be made to prevent this kind of changes
>>>>> in the future:
>>>>> > >
>>>>> > > a. Add test case for metrics (which we already have in
>>>>> SinkMetricsITCase)
>>>>> > > b. We emphasize that any public-interface breaking changes should
>>>>> be proposed by a FLIP or discussed in mailing list, and should be listed in
>>>>> the release note.
>>>>> > > c. We remind contributors and reviewers about what should be
>>>>> considered as public API, and include metric names in it.
>>>>> > >
>>>>> > > For b and c these two pages [2][3] might be proper places.
>>>>> > >
>>>>> > > About the patch to revert this, it looks like we have a consensus
>>>>> on 1.16. As of 1.15 I think it’s worthy to trigger a minor version. I
>>>>> didn’t see complaints about this for now so it should be OK to save the
>>>>> situation asap. I’m with Xintong’s idea to treat numXXXSend as an alias of
>>>>> numXXXOut considering there could possibly some users have already adapted
>>>>> their system to the new naming, and have another internal metric for
>>>>> reflecting number of outgoing committable batches (actually the
>>>>> numRecordsIn of sink committer operator should be carrying this info
>>>>> already).
>>>>> > >
>>>>> > > [1] https://github.com/apache/flink/pull/18825
>>>>> > > [2] https://flink.apache.org/contributing/contribute-code.html
>>>>> > > [3] https://flink.apache.org/contributing/reviewing-prs.html
>>>>> > >
>>>>> > > Best,
>>>>> > > Qingsheng
>>>>> > > On Oct 10, 2022, 17:40 +0800, Xintong Song <to...@gmail.com>,
>>>>> wrote:
>>>>> > >
>>>>> > > +1 for reverting these changes in Flink 1.16.
>>>>> > >
>>>>> > > For 1.15.3, can we make these metrics available via both names
>>>>> (numXXXOut and numXXXSend)? In this way we don't break it for those who
>>>>> already migrated to 1.15 and numXXXSend. That means we still need to change
>>>>> SinkWriterOperator to use another metric name in 1.15.3, which IIUC is
>>>>> internal to Flink sink.
>>>>> > >
>>>>> > > I'm overall +1 to change numXXXOut back to its original semantics.
>>>>> AFAIK (from meetup / flink-forward questionaires), most users do not
>>>>> migrate to a new Flink release immediately, until the next 1-2 major
>>>>> releases are out.
>>>>> > >
>>>>> > > Best,
>>>>> > >
>>>>> > > Xintong
>>>>> > >
>>>>> > >
>>>>> > >
>>>>> > > On Mon, Oct 10, 2022 at 5:26 PM Martijn Visser <
>>>>> martijnvisser@apache.org> wrote:
>>>>> > >>
>>>>> > >> Hi Qingsheng,
>>>>> > >>
>>>>> > >> Do you have any idea what has happened in the process here? Do we
>>>>> know why
>>>>> > >> they were changed? I was under the impression that these metric
>>>>> names were
>>>>> > >> newly introduced due to the new interfaces and because it still
>>>>> depends on
>>>>> > >> each connector implementing these.
>>>>> > >>
>>>>> > >> Sidenote: metric names are not mentioned in the FLIP process as a
>>>>> public
>>>>> > >> API. Might make sense to have a separate follow-up to add that to
>>>>> the list
>>>>> > >> (I do think we should list them there).
>>>>> > >>
>>>>> > >> +1 for reverting this and make this change in Flink 1.16
>>>>> > >>
>>>>> > >> I'm not in favour of releasing a Flink 1.15.3 with this change: I
>>>>> think the
>>>>> > >> impact is too big for a patch version, especially given how long
>>>>> Flink 1.15
>>>>> > >> is already out there.
>>>>> > >>
>>>>> > >> Best regards,
>>>>> > >>
>>>>> > >> Martijn
>>>>> > >>
>>>>> > >> On Mon, Oct 10, 2022 at 11:13 AM Leonard Xu <xb...@gmail.com>
>>>>> wrote:
>>>>> > >>
>>>>> > >> > Thanks Qingsheng for starting this thread.
>>>>> > >> >
>>>>> > >> > +1 on reverting sink metric name and releasing 1.15.3 to fix
>>>>> this
>>>>> > >> > inconsistent behavior.
>>>>> > >> >
>>>>> > >> >
>>>>> > >> > Best,
>>>>> > >> > Leonard
>>>>> > >> >
>>>>> > >> >
>>>>> > >> >
>>>>> > >> >
>>>>> > >> >
>>>>> > >> > 2022年10月10日 下午3:06,Jark Wu <im...@gmail.com> 写道:
>>>>> > >> >
>>>>> > >> > Thanks for discovering this problem, Qingsheng!
>>>>> > >> >
>>>>> > >> > I'm also +1 for reverting the breaking changes.
>>>>> > >> >
>>>>> > >> > IIUC, currently, the behavior of "numXXXOut" metrics of the new
>>>>> and old
>>>>> > >> > sink is inconsistent.
>>>>> > >> > We have to break one of them to have consistent behavior. Sink
>>>>> V2 is an
>>>>> > >> > evolving API which is just introduced in 1.15.
>>>>> > >> > I think it makes sense to break the unstable API instead of the
>>>>> stable API
>>>>> > >> > which many connectors and users depend on.
>>>>> > >> >
>>>>> > >> > Best,
>>>>> > >> > Jark
>>>>> > >> >
>>>>> > >> >
>>>>> > >> >
>>>>> > >> > On Mon, 10 Oct 2022 at 11:36, Jingsong Li <
>>>>> jingsonglee0@gmail.com> wrote:
>>>>> > >> >
>>>>> > >> >> Thanks for driving, Qingsheng.
>>>>> > >> >>
>>>>> > >> >> +1 for reverting sink metric name.
>>>>> > >> >>
>>>>> > >> >> We often forget that metric is also one of the important APIs.
>>>>> > >> >>
>>>>> > >> >> +1 for releasing 1.15.3 to fix this.
>>>>> > >> >>
>>>>> > >> >> Best,
>>>>> > >> >> Jingsong
>>>>> > >> >>
>>>>> > >> >> On Sun, Oct 9, 2022 at 11:35 PM Becket Qin <
>>>>> becket.qin@gmail.com> wrote:
>>>>> > >> >> >
>>>>> > >> >> > Thanks for raising the discussion, Qingsheng,
>>>>> > >> >> >
>>>>> > >> >> > +1 on reverting the breaking changes.
>>>>> > >> >> >
>>>>> > >> >> > In addition, we might want to release a 1.15.3 to fix this
>>>>> and update
>>>>> > >> >> the previous release docs with this known issue, so that users
>>>>> can upgrade
>>>>> > >> >> to 1.15.3 when they hit it. It would also be good to add some
>>>>> backwards
>>>>> > >> >> compatibility tests on metrics to avoid unintended breaking
>>>>> changes like
>>>>> > >> >> this in the future.
>>>>> > >> >> >
>>>>> > >> >> > Thanks,
>>>>> > >> >> >
>>>>> > >> >> > Jiangjie (Becket) Qin
>>>>> > >> >> >
>>>>> > >> >> > On Sun, Oct 9, 2022 at 10:35 AM Qingsheng Ren <
>>>>> renqs@apache.org> wrote:
>>>>> > >> >> >>
>>>>> > >> >> >> Hi devs and users,
>>>>> > >> >> >>
>>>>> > >> >> >> I’d like to start a discussion about reverting a breaking
>>>>> change about
>>>>> > >> >> sink metrics made in 1.15 by FLINK-26126 [1] and FLINK-26492
>>>>> [2].
>>>>> > >> >> >>
>>>>> > >> >> >> TL;DR
>>>>> > >> >> >>
>>>>> > >> >> >> All sink metrics with name “numXXXOut” defined in FLIP-33
>>>>> are replace
>>>>> > >> >> by “numXXXSend” in FLINK-26126 and FLINK-26492. Considering
>>>>> metric names
>>>>> > >> >> are public APIs, this is a breaking change to end users and
>>>>> not backward
>>>>> > >> >> compatible. Also unfortunately this breaking change was not
>>>>> discussed in
>>>>> > >> >> the mailing list before.
>>>>> > >> >> >>
>>>>> > >> >> >> Background
>>>>> > >> >> >>
>>>>> > >> >> >> As defined previously in FLIP-33 (the FLIP page has been
>>>>> changed so
>>>>> > >> >> please refer to the old version [3] ), metric “numRecordsOut”
>>>>> is used for
>>>>> > >> >> reporting the total number of output records since the sink
>>>>> started (number
>>>>> > >> >> of records written to the external system), and similarly for
>>>>> > >> >> “numRecordsOutPerSecond”, “numBytesOut”,
>>>>> “numBytesOutPerSecond” and
>>>>> > >> >> “numRecordsOutError”. Most sinks are following this naming and
>>>>> definition.
>>>>> > >> >> However, these metrics are ambiguous in the new Sink API as
>>>>> “numXXXOut”
>>>>> > >> >> could be used by the output of SinkWriterOperator for
>>>>> reporting number of
>>>>> > >> >> Committables delivered to SinkCommitterOperator. In order to
>>>>> resolve the
>>>>> > >> >> conflict, FLINK-26126 and FLINK-26492 changed names of these
>>>>> metrics with
>>>>> > >> >> “numXXXSend”.
>>>>> > >> >> >>
>>>>> > >> >> >> Necessity of reverting this change
>>>>> > >> >> >>
>>>>> > >> >> >> - Metric names are actually public API, as end users need
>>>>> to configure
>>>>> > >> >> metric collecting and alerting system with metric names. Users
>>>>> have to
>>>>> > >> >> reset all configurations related to affected metrics.
>>>>> > >> >> >> - This could also affect custom and external sinks not
>>>>> maintained by
>>>>> > >> >> Flink, which might have implemented with numXXXOut metrics.
>>>>> > >> >> >> - The number of records sent to external system is way more
>>>>> important
>>>>> > >> >> than the number of Committables sent to SinkCommitterOperator,
>>>>> as the
>>>>> > >> >> latter one is just an internal implementation of sink. We
>>>>> could have a new
>>>>> > >> >> metric name for the latter one instead.
>>>>> > >> >> >> - We could avoid splitting the project by version (like
>>>>> “plz use
>>>>> > >> >> numXXXOut before 1.15 and use numXXXSend after”) if we revert
>>>>> it ASAP,
>>>>> > >> >> cosidering 1.16 is still not released for now.
>>>>> > >> >> >>
>>>>> > >> >> >> As a consequence, I’d like to hear from devs and users
>>>>> about your
>>>>> > >> >> opinion on changing these metrics back to “numXXXOut”.
>>>>> > >> >> >>
>>>>> > >> >> >> Looking forward to your reply!
>>>>> > >> >> >>
>>>>> > >> >> >> [1] https://issues.apache.org/jira/browse/FLINK-26126
>>>>> > >> >> >> [2] https://issues.apache.org/jira/browse/FLINK-26492
>>>>> > >> >> >> [1] FLIP-33, version 18:
>>>>> > >> >>
>>>>> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=211883136
>>>>> > >> >> >>
>>>>> > >> >> >> Best,
>>>>> > >> >> >> Qingsheng
>>>>> > >> >>
>>>>> > >> >
>>>>> > >> >
>>>>> > >
>>>>> > >
>>>>> > >
>>>>>
>>>>

Re: [DISCUSS] Reverting sink metric name changes made in 1.15

Posted by Xintong Song <to...@gmail.com>.
@Qingsheng,

I'm overall +1 to your proposal, with only one question: How complicated is
it to come up with a metric for the internal traffic?

I'm asking because, as the new feature is already out for 1.15 & 1.16, it
would be nice if the corresponding new metrics can also be available in
these versions.

I'm not saying this should block the 1.16 release. Just trying to
understand the efforts needed. If adding such metrics is not something that
can be done shortly, I'd also be fine with releasing 1.16 without them.

Best,

Xintong



On Fri, Oct 14, 2022 at 1:31 AM Qingsheng Ren <re...@apache.org> wrote:

> Hi devs and users,
>
> It looks like we are getting an initial consensus in the discussion so I
> started a voting thread [1] just now. Looking forward to your feedback!
>
> [1] https://lists.apache.org/thread/ozlf82mkm6ndx2n1vdgq532h156p4lt6
>
> Best,
> Qingsheng
>
>
> On Thu, Oct 13, 2022 at 10:41 PM Jing Ge <ji...@ververica.com> wrote:
>
>> Hi Qingsheng,
>>
>> Thanks for the clarification. +1, I like the idea. Pointing both
>> numXXXOut and numXXXSend to the same external data transfer metric does not
>> really break the new SinkV2 design, since there was no requirement to
>> monitor the internal traffic. So, I think both developer and user can live
>> with it. It might not be the perfect solution but is indeed the currently
>> best trade-off solution after considering the backward compatibility.  I
>> would suggest firing a follow-up ticket after the PR to take care of the
>> new metric for the internal traffic in the future.
>>
>> Best regards,
>> Jing
>>
>>
>> On Thu, Oct 13, 2022 at 3:08 PM Qingsheng Ren <re...@apache.org> wrote:
>>
>>> Hi Jing,
>>>
>>> Thanks for the reply!
>>>
>>> Let me rephrase my proposal: we’d like to use numXXXOut registered on
>>> SinkWriterOperator to reflect the traffic to the external system for
>>> compatibility with old versions before 1.15, and make numXXXSend have the
>>> same value as numXXXOut for compatibility within 1.15. That means both
>>> numXXXOut and numXXXSend are used for external data transfers, which end
>>> users care more about. As for the internal traffic within the sink, we
>>> could name a new metric for it because this is a _new_ feature in the _new_
>>> sink, and end users usually don’t pay attention to internal implementation.
>>> The name of the new metric could be discussed later after 1.16 release.
>>>
>>> > but it might end up with monitoring unexpected metrics, which is even
>>> worse for users, i.e. I didn't change anything, but something has been
>>> broken since the last update.
>>>
>>> Yeah this is exactly what we are trying to fix with this proposal. I
>>> believe users are more concerned with the output to the external system
>>> than the internal data delivery in the sink, so I think we’ll have more
>>> cases reporting like “I set up a panel on numRecordsOut in sink to monitor
>>> the output of the job, but after upgrading to 1.15 this value is extremely
>>> low and I didn’t change anything” if we stick to the current situation. I
>>> think only a few end users care about the number of committables sending to
>>> downstream as most of them don’t care how the sink works.
>>>
>>> We do need a re-design to fully distinguish the internal and external
>>> traffic on metrics, not only in sink but in all operators as it’s quite
>>> common for operators to make IO. This needs time to design, discuss, adjust
>>> and vote, but considering this is blocking 1.16, maybe it’s better to
>>> rescue the compatibility for now, and leave the huge reconstruction to
>>> future versions (maybe 2.0).
>>>
>>> Best,
>>> Qingsheng
>>>
>>> On Wed, Oct 12, 2022 at 7:21 PM Jing Ge <ji...@ververica.com> wrote:
>>>
>>>> Hi Qingsheng,
>>>>
>>>> Just want to make sure we are on the same page. Are you suggesting
>>>> switching the naming between "numXXXSend" and "numXXXOut" or reverting all
>>>> the changes we did with FLINK-26126 and FLINK-26492?
>>>>
>>>> For the naming switch, please pay attention that the behaviour has been
>>>> changed since we introduced SinkV2[1]. So, please be aware of different
>>>> numbers(behaviour change) even with the same metrics name. Sticking with
>>>> the old name with the new behaviour (very bad idea, IMHO) might seem like
>>>> saving the effort in the first place, but it might end up with monitoring
>>>> unexpected metrics, which is even worse for users, i.e. I didn't change
>>>> anything, but something has been broken since the last update.
>>>>
>>>> For reverting, I am not sure how to fix the issue mentioned in
>>>> FLINK-26126 after reverting all changes. Like Chesnay has already pointed
>>>> out, with SinkV2 we have two different output lines - one with the external
>>>> system and the other with the downstream operator. In this case,
>>>> "numXXXSend" is rather a new metric than a replacement of "numXXXOut". The
>>>> "numXXXOut" metric can still be used, depending on what the user wants to
>>>> monitor.
>>>>
>>>>
>>>> Best regards,
>>>> Jing
>>>>
>>>> [1]
>>>> https://github.com/apache/flink/blob/51fc20db30d001a95de95b3b9993eeb06f558f6c/flink-metrics/flink-metrics-core/src/main/java/org/apache/flink/metrics/groups/SinkWriterMetricGroup.java#L48
>>>>
>>>>
>>>> On Wed, Oct 12, 2022 at 12:48 PM Qingsheng Ren <re...@apache.org>
>>>> wrote:
>>>>
>>>>> As a supplement, considering it could be a big reconstruction
>>>>> redefining internal and external traffic and touching metric names in
>>>>> almost all operators, this requires a lot of discussions and we might
>>>>> do it finally in Flink 2.0. I think compatibility is a bigger blocker
>>>>> in front of us, as the output of sink is a metric that users care a
>>>>> lot about.
>>>>>
>>>>> Thanks,
>>>>> Qingsheng
>>>>>
>>>>> On Wed, Oct 12, 2022 at 6:20 PM Qingsheng Ren <re...@apache.org>
>>>>> wrote:
>>>>> >
>>>>> > Thanks Chesnay for the reply. +1 for making a unified and clearer
>>>>> > metric definition distinguishing internal and external data
>>>>> transfers.
>>>>> > As you described, having IO in operators is quite common such as
>>>>> > dimension tables in Table/SQL API. This definitely deserves a FLIP
>>>>> and
>>>>> > an overall design.
>>>>> >
>>>>> > However I think it's necessary to change the metric back to
>>>>> > numRecordsOut instead of sticking with numRecordsSend in 1.15 and
>>>>> > 1.16. The most important argument is for compatibility as I mentioned
>>>>> > in my previous email, otherwise all users have to modify their
>>>>> configs
>>>>> > of metric systems after upgrading to Flink 1.15+, and all custom
>>>>> > connectors have to change their implementations to migrate to the new
>>>>> > metric name. I believe other ones participating and approving this
>>>>> > proposal share the same concern about compatibility too. Also
>>>>> > considering this issue is blocking the release of 1.16, maybe we
>>>>> could
>>>>> > fix this asap, and as for defining a new metric for internal data
>>>>> > transfers we can have an in-depth discussion later. WDYT?
>>>>> >
>>>>> > Best,
>>>>> > Qingsheng
>>>>> >
>>>>> > On Tue, Oct 11, 2022 at 6:06 PM Chesnay Schepler <ch...@apache.org>
>>>>> wrote:
>>>>> > >
>>>>> > > Currently I think that would be a mistake.
>>>>> > >
>>>>> > > Ultimately what we have here is the culmination of us never really
>>>>> considering how the numRecordsOut metric should behave for operators that
>>>>> emit data to other operators _and_ external systems. This goes beyond sinks.
>>>>> > > This even applies to numRecordsIn, for cases where functions
>>>>> query/write data from/to the outside, (e.g., Async IO).
>>>>> > >
>>>>> > > Having 2 separate metrics for that, 1 exclusively for internal
>>>>> data transfers, and 1 exclusively for external data transfers, is the only
>>>>> way to get a consistent metric definition in the long-run.
>>>>> > > We can jump back-and-forth now or just commit to it.
>>>>> > >
>>>>> > > I don't think we can really judge this based on FLIP-33. It was
>>>>> IIRC written before the two phase sinks were added, which heavily blurred
>>>>> the lines of what a sink even is. Because it definitely is _not_ the last
>>>>> operator in a chain anymore.
>>>>> > >
>>>>> > > What I would suggest is to stick with what we got (although I
>>>>> despise the name numRecordsSend), and alias the numRecordsOut metric for
>>>>> all non-TwoPhaseCommittingSink.
>>>>> > >
>>>>> > > On 11/10/2022 05:54, Qingsheng Ren wrote:
>>>>> > >
>>>>> > > Thanks for the details Chesnay!
>>>>> > >
>>>>> > > By “alias” I mean to respect the original definition made in
>>>>> FLIP-33 for numRecordsOut, which is the number of records written to the
>>>>> external system, and keep numRecordsSend as the same value as numRecordsOut
>>>>> for compatibility.
>>>>> > >
>>>>> > > I think keeping numRecordsOut for the output to the external
>>>>> system is more intuitive to end users because in most cases the metric of
>>>>> data flow output is more essential. I agree with you that a new metric is
>>>>> required, but considering compatibility and users’ intuition I prefer to
>>>>> keep the initial definition of numRecordsOut in FLIP-33 and name a new
>>>>> metric for sink writer’s output to downstream operators. This might be
>>>>> against consistency with metrics in other operators in Flink but maybe it’s
>>>>> acceptable to have the sink as a special case.
>>>>> > >
>>>>> > > Best,
>>>>> > > Qingsheng
>>>>> > > On Oct 10, 2022, 19:13 +0800, Chesnay Schepler <ch...@apache.org>,
>>>>> wrote:
>>>>> > >
>>>>> > > > I’m with Xintong’s idea to treat numXXXSend as an alias of
>>>>> numXXXOut
>>>>> > >
>>>>> > > But that's not possible. If it were that simple there would have
>>>>> never been a need to introduce another metric in the first place.
>>>>> > >
>>>>> > > It's a rather fundamental issue with how the new sinks work, in
>>>>> that they emit data to the external system (usually considered as
>>>>> "numRecordsOut" of sinks) while _also_ sending data to a downstream
>>>>> operator (usually considered as "numRecordsOut" of tasks).
>>>>> > > The original issue was that the numRecordsOut of the sink counted
>>>>> both (which is completely wrong).
>>>>> > >
>>>>> > > A new metric was always required; otherwise you inevitably end up
>>>>> breaking some semantic.
>>>>> > > Adding a new metric for what the sink writes to the external
>>>>> system is, for better or worse, more consistent with how these metrics
>>>>> usually work in Flink.
>>>>> > >
>>>>> > > On 10/10/2022 12:45, Qingsheng Ren wrote:
>>>>> > >
>>>>> > > Thanks everyone for joining the discussion!
>>>>> > >
>>>>> > > > Do you have any idea what has happened in the process here?
>>>>> > >
>>>>> > > The discussion in this PR [1] shows some details and could be
>>>>> helpful to understand the original motivation of the renaming. We do have a
>>>>> test case for guarding metrics but unfortunaly the case was also modified
>>>>> so the defense was broken.
>>>>> > >
>>>>> > > I think the reason why both the developer and the reviewer forgot
>>>>> to trigger an discussion and gave a green pass on the change is that
>>>>> metrics are quite “trivial” to be noticed as public APIs. As mentioned by
>>>>> Martijn I couldn’t find a place noting that metrics are public APIs and
>>>>> should be treated carefully while contributing and reviewing.
>>>>> > >
>>>>> > > IMHO three actions could be made to prevent this kind of changes
>>>>> in the future:
>>>>> > >
>>>>> > > a. Add test case for metrics (which we already have in
>>>>> SinkMetricsITCase)
>>>>> > > b. We emphasize that any public-interface breaking changes should
>>>>> be proposed by a FLIP or discussed in mailing list, and should be listed in
>>>>> the release note.
>>>>> > > c. We remind contributors and reviewers about what should be
>>>>> considered as public API, and include metric names in it.
>>>>> > >
>>>>> > > For b and c these two pages [2][3] might be proper places.
>>>>> > >
>>>>> > > About the patch to revert this, it looks like we have a consensus
>>>>> on 1.16. As of 1.15 I think it’s worthy to trigger a minor version. I
>>>>> didn’t see complaints about this for now so it should be OK to save the
>>>>> situation asap. I’m with Xintong’s idea to treat numXXXSend as an alias of
>>>>> numXXXOut considering there could possibly some users have already adapted
>>>>> their system to the new naming, and have another internal metric for
>>>>> reflecting number of outgoing committable batches (actually the
>>>>> numRecordsIn of sink committer operator should be carrying this info
>>>>> already).
>>>>> > >
>>>>> > > [1] https://github.com/apache/flink/pull/18825
>>>>> > > [2] https://flink.apache.org/contributing/contribute-code.html
>>>>> > > [3] https://flink.apache.org/contributing/reviewing-prs.html
>>>>> > >
>>>>> > > Best,
>>>>> > > Qingsheng
>>>>> > > On Oct 10, 2022, 17:40 +0800, Xintong Song <to...@gmail.com>,
>>>>> wrote:
>>>>> > >
>>>>> > > +1 for reverting these changes in Flink 1.16.
>>>>> > >
>>>>> > > For 1.15.3, can we make these metrics available via both names
>>>>> (numXXXOut and numXXXSend)? In this way we don't break it for those who
>>>>> already migrated to 1.15 and numXXXSend. That means we still need to change
>>>>> SinkWriterOperator to use another metric name in 1.15.3, which IIUC is
>>>>> internal to Flink sink.
>>>>> > >
>>>>> > > I'm overall +1 to change numXXXOut back to its original semantics.
>>>>> AFAIK (from meetup / flink-forward questionaires), most users do not
>>>>> migrate to a new Flink release immediately, until the next 1-2 major
>>>>> releases are out.
>>>>> > >
>>>>> > > Best,
>>>>> > >
>>>>> > > Xintong
>>>>> > >
>>>>> > >
>>>>> > >
>>>>> > > On Mon, Oct 10, 2022 at 5:26 PM Martijn Visser <
>>>>> martijnvisser@apache.org> wrote:
>>>>> > >>
>>>>> > >> Hi Qingsheng,
>>>>> > >>
>>>>> > >> Do you have any idea what has happened in the process here? Do we
>>>>> know why
>>>>> > >> they were changed? I was under the impression that these metric
>>>>> names were
>>>>> > >> newly introduced due to the new interfaces and because it still
>>>>> depends on
>>>>> > >> each connector implementing these.
>>>>> > >>
>>>>> > >> Sidenote: metric names are not mentioned in the FLIP process as a
>>>>> public
>>>>> > >> API. Might make sense to have a separate follow-up to add that to
>>>>> the list
>>>>> > >> (I do think we should list them there).
>>>>> > >>
>>>>> > >> +1 for reverting this and make this change in Flink 1.16
>>>>> > >>
>>>>> > >> I'm not in favour of releasing a Flink 1.15.3 with this change: I
>>>>> think the
>>>>> > >> impact is too big for a patch version, especially given how long
>>>>> Flink 1.15
>>>>> > >> is already out there.
>>>>> > >>
>>>>> > >> Best regards,
>>>>> > >>
>>>>> > >> Martijn
>>>>> > >>
>>>>> > >> On Mon, Oct 10, 2022 at 11:13 AM Leonard Xu <xb...@gmail.com>
>>>>> wrote:
>>>>> > >>
>>>>> > >> > Thanks Qingsheng for starting this thread.
>>>>> > >> >
>>>>> > >> > +1 on reverting sink metric name and releasing 1.15.3 to fix
>>>>> this
>>>>> > >> > inconsistent behavior.
>>>>> > >> >
>>>>> > >> >
>>>>> > >> > Best,
>>>>> > >> > Leonard
>>>>> > >> >
>>>>> > >> >
>>>>> > >> >
>>>>> > >> >
>>>>> > >> >
>>>>> > >> > 2022年10月10日 下午3:06,Jark Wu <im...@gmail.com> 写道:
>>>>> > >> >
>>>>> > >> > Thanks for discovering this problem, Qingsheng!
>>>>> > >> >
>>>>> > >> > I'm also +1 for reverting the breaking changes.
>>>>> > >> >
>>>>> > >> > IIUC, currently, the behavior of "numXXXOut" metrics of the new
>>>>> and old
>>>>> > >> > sink is inconsistent.
>>>>> > >> > We have to break one of them to have consistent behavior. Sink
>>>>> V2 is an
>>>>> > >> > evolving API which is just introduced in 1.15.
>>>>> > >> > I think it makes sense to break the unstable API instead of the
>>>>> stable API
>>>>> > >> > which many connectors and users depend on.
>>>>> > >> >
>>>>> > >> > Best,
>>>>> > >> > Jark
>>>>> > >> >
>>>>> > >> >
>>>>> > >> >
>>>>> > >> > On Mon, 10 Oct 2022 at 11:36, Jingsong Li <
>>>>> jingsonglee0@gmail.com> wrote:
>>>>> > >> >
>>>>> > >> >> Thanks for driving, Qingsheng.
>>>>> > >> >>
>>>>> > >> >> +1 for reverting sink metric name.
>>>>> > >> >>
>>>>> > >> >> We often forget that metric is also one of the important APIs.
>>>>> > >> >>
>>>>> > >> >> +1 for releasing 1.15.3 to fix this.
>>>>> > >> >>
>>>>> > >> >> Best,
>>>>> > >> >> Jingsong
>>>>> > >> >>
>>>>> > >> >> On Sun, Oct 9, 2022 at 11:35 PM Becket Qin <
>>>>> becket.qin@gmail.com> wrote:
>>>>> > >> >> >
>>>>> > >> >> > Thanks for raising the discussion, Qingsheng,
>>>>> > >> >> >
>>>>> > >> >> > +1 on reverting the breaking changes.
>>>>> > >> >> >
>>>>> > >> >> > In addition, we might want to release a 1.15.3 to fix this
>>>>> and update
>>>>> > >> >> the previous release docs with this known issue, so that users
>>>>> can upgrade
>>>>> > >> >> to 1.15.3 when they hit it. It would also be good to add some
>>>>> backwards
>>>>> > >> >> compatibility tests on metrics to avoid unintended breaking
>>>>> changes like
>>>>> > >> >> this in the future.
>>>>> > >> >> >
>>>>> > >> >> > Thanks,
>>>>> > >> >> >
>>>>> > >> >> > Jiangjie (Becket) Qin
>>>>> > >> >> >
>>>>> > >> >> > On Sun, Oct 9, 2022 at 10:35 AM Qingsheng Ren <
>>>>> renqs@apache.org> wrote:
>>>>> > >> >> >>
>>>>> > >> >> >> Hi devs and users,
>>>>> > >> >> >>
>>>>> > >> >> >> I’d like to start a discussion about reverting a breaking
>>>>> change about
>>>>> > >> >> sink metrics made in 1.15 by FLINK-26126 [1] and FLINK-26492
>>>>> [2].
>>>>> > >> >> >>
>>>>> > >> >> >> TL;DR
>>>>> > >> >> >>
>>>>> > >> >> >> All sink metrics with name “numXXXOut” defined in FLIP-33
>>>>> are replace
>>>>> > >> >> by “numXXXSend” in FLINK-26126 and FLINK-26492. Considering
>>>>> metric names
>>>>> > >> >> are public APIs, this is a breaking change to end users and
>>>>> not backward
>>>>> > >> >> compatible. Also unfortunately this breaking change was not
>>>>> discussed in
>>>>> > >> >> the mailing list before.
>>>>> > >> >> >>
>>>>> > >> >> >> Background
>>>>> > >> >> >>
>>>>> > >> >> >> As defined previously in FLIP-33 (the FLIP page has been
>>>>> changed so
>>>>> > >> >> please refer to the old version [3] ), metric “numRecordsOut”
>>>>> is used for
>>>>> > >> >> reporting the total number of output records since the sink
>>>>> started (number
>>>>> > >> >> of records written to the external system), and similarly for
>>>>> > >> >> “numRecordsOutPerSecond”, “numBytesOut”,
>>>>> “numBytesOutPerSecond” and
>>>>> > >> >> “numRecordsOutError”. Most sinks are following this naming and
>>>>> definition.
>>>>> > >> >> However, these metrics are ambiguous in the new Sink API as
>>>>> “numXXXOut”
>>>>> > >> >> could be used by the output of SinkWriterOperator for
>>>>> reporting number of
>>>>> > >> >> Committables delivered to SinkCommitterOperator. In order to
>>>>> resolve the
>>>>> > >> >> conflict, FLINK-26126 and FLINK-26492 changed names of these
>>>>> metrics with
>>>>> > >> >> “numXXXSend”.
>>>>> > >> >> >>
>>>>> > >> >> >> Necessity of reverting this change
>>>>> > >> >> >>
>>>>> > >> >> >> - Metric names are actually public API, as end users need
>>>>> to configure
>>>>> > >> >> metric collecting and alerting system with metric names. Users
>>>>> have to
>>>>> > >> >> reset all configurations related to affected metrics.
>>>>> > >> >> >> - This could also affect custom and external sinks not
>>>>> maintained by
>>>>> > >> >> Flink, which might have implemented with numXXXOut metrics.
>>>>> > >> >> >> - The number of records sent to external system is way more
>>>>> important
>>>>> > >> >> than the number of Committables sent to SinkCommitterOperator,
>>>>> as the
>>>>> > >> >> latter one is just an internal implementation of sink. We
>>>>> could have a new
>>>>> > >> >> metric name for the latter one instead.
>>>>> > >> >> >> - We could avoid splitting the project by version (like
>>>>> “plz use
>>>>> > >> >> numXXXOut before 1.15 and use numXXXSend after”) if we revert
>>>>> it ASAP,
>>>>> > >> >> cosidering 1.16 is still not released for now.
>>>>> > >> >> >>
>>>>> > >> >> >> As a consequence, I’d like to hear from devs and users
>>>>> about your
>>>>> > >> >> opinion on changing these metrics back to “numXXXOut”.
>>>>> > >> >> >>
>>>>> > >> >> >> Looking forward to your reply!
>>>>> > >> >> >>
>>>>> > >> >> >> [1] https://issues.apache.org/jira/browse/FLINK-26126
>>>>> > >> >> >> [2] https://issues.apache.org/jira/browse/FLINK-26492
>>>>> > >> >> >> [1] FLIP-33, version 18:
>>>>> > >> >>
>>>>> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=211883136
>>>>> > >> >> >>
>>>>> > >> >> >> Best,
>>>>> > >> >> >> Qingsheng
>>>>> > >> >>
>>>>> > >> >
>>>>> > >> >
>>>>> > >
>>>>> > >
>>>>> > >
>>>>>
>>>>

Re: [DISCUSS] Reverting sink metric name changes made in 1.15

Posted by Qingsheng Ren <re...@apache.org>.
Hi devs and users,

It looks like we are getting an initial consensus in the discussion so I
started a voting thread [1] just now. Looking forward to your feedback!

[1] https://lists.apache.org/thread/ozlf82mkm6ndx2n1vdgq532h156p4lt6

Best,
Qingsheng


On Thu, Oct 13, 2022 at 10:41 PM Jing Ge <ji...@ververica.com> wrote:

> Hi Qingsheng,
>
> Thanks for the clarification. +1, I like the idea. Pointing both numXXXOut
> and numXXXSend to the same external data transfer metric does not really
> break the new SinkV2 design, since there was no requirement to monitor the
> internal traffic. So, I think both developer and user can live with it. It
> might not be the perfect solution but is indeed the currently best
> trade-off solution after considering the backward compatibility.  I would
> suggest firing a follow-up ticket after the PR to take care of the new
> metric for the internal traffic in the future.
>
> Best regards,
> Jing
>
>
> On Thu, Oct 13, 2022 at 3:08 PM Qingsheng Ren <re...@apache.org> wrote:
>
>> Hi Jing,
>>
>> Thanks for the reply!
>>
>> Let me rephrase my proposal: we’d like to use numXXXOut registered on
>> SinkWriterOperator to reflect the traffic to the external system for
>> compatibility with old versions before 1.15, and make numXXXSend have the
>> same value as numXXXOut for compatibility within 1.15. That means both
>> numXXXOut and numXXXSend are used for external data transfers, which end
>> users care more about. As for the internal traffic within the sink, we
>> could name a new metric for it because this is a _new_ feature in the _new_
>> sink, and end users usually don’t pay attention to internal implementation.
>> The name of the new metric could be discussed later after 1.16 release.
>>
>> > but it might end up with monitoring unexpected metrics, which is even
>> worse for users, i.e. I didn't change anything, but something has been
>> broken since the last update.
>>
>> Yeah this is exactly what we are trying to fix with this proposal. I
>> believe users are more concerned with the output to the external system
>> than the internal data delivery in the sink, so I think we’ll have more
>> cases reporting like “I set up a panel on numRecordsOut in sink to monitor
>> the output of the job, but after upgrading to 1.15 this value is extremely
>> low and I didn’t change anything” if we stick to the current situation. I
>> think only a few end users care about the number of committables sending to
>> downstream as most of them don’t care how the sink works.
>>
>> We do need a re-design to fully distinguish the internal and external
>> traffic on metrics, not only in sink but in all operators as it’s quite
>> common for operators to make IO. This needs time to design, discuss, adjust
>> and vote, but considering this is blocking 1.16, maybe it’s better to
>> rescue the compatibility for now, and leave the huge reconstruction to
>> future versions (maybe 2.0).
>>
>> Best,
>> Qingsheng
>>
>> On Wed, Oct 12, 2022 at 7:21 PM Jing Ge <ji...@ververica.com> wrote:
>>
>>> Hi Qingsheng,
>>>
>>> Just want to make sure we are on the same page. Are you suggesting
>>> switching the naming between "numXXXSend" and "numXXXOut" or reverting all
>>> the changes we did with FLINK-26126 and FLINK-26492?
>>>
>>> For the naming switch, please pay attention that the behaviour has been
>>> changed since we introduced SinkV2[1]. So, please be aware of different
>>> numbers(behaviour change) even with the same metrics name. Sticking with
>>> the old name with the new behaviour (very bad idea, IMHO) might seem like
>>> saving the effort in the first place, but it might end up with monitoring
>>> unexpected metrics, which is even worse for users, i.e. I didn't change
>>> anything, but something has been broken since the last update.
>>>
>>> For reverting, I am not sure how to fix the issue mentioned in
>>> FLINK-26126 after reverting all changes. Like Chesnay has already pointed
>>> out, with SinkV2 we have two different output lines - one with the external
>>> system and the other with the downstream operator. In this case,
>>> "numXXXSend" is rather a new metric than a replacement of "numXXXOut". The
>>> "numXXXOut" metric can still be used, depending on what the user wants to
>>> monitor.
>>>
>>>
>>> Best regards,
>>> Jing
>>>
>>> [1]
>>> https://github.com/apache/flink/blob/51fc20db30d001a95de95b3b9993eeb06f558f6c/flink-metrics/flink-metrics-core/src/main/java/org/apache/flink/metrics/groups/SinkWriterMetricGroup.java#L48
>>>
>>>
>>> On Wed, Oct 12, 2022 at 12:48 PM Qingsheng Ren <re...@apache.org> wrote:
>>>
>>>> As a supplement, considering it could be a big reconstruction
>>>> redefining internal and external traffic and touching metric names in
>>>> almost all operators, this requires a lot of discussions and we might
>>>> do it finally in Flink 2.0. I think compatibility is a bigger blocker
>>>> in front of us, as the output of sink is a metric that users care a
>>>> lot about.
>>>>
>>>> Thanks,
>>>> Qingsheng
>>>>
>>>> On Wed, Oct 12, 2022 at 6:20 PM Qingsheng Ren <re...@apache.org> wrote:
>>>> >
>>>> > Thanks Chesnay for the reply. +1 for making a unified and clearer
>>>> > metric definition distinguishing internal and external data transfers.
>>>> > As you described, having IO in operators is quite common such as
>>>> > dimension tables in Table/SQL API. This definitely deserves a FLIP and
>>>> > an overall design.
>>>> >
>>>> > However I think it's necessary to change the metric back to
>>>> > numRecordsOut instead of sticking with numRecordsSend in 1.15 and
>>>> > 1.16. The most important argument is for compatibility as I mentioned
>>>> > in my previous email, otherwise all users have to modify their configs
>>>> > of metric systems after upgrading to Flink 1.15+, and all custom
>>>> > connectors have to change their implementations to migrate to the new
>>>> > metric name. I believe other ones participating and approving this
>>>> > proposal share the same concern about compatibility too. Also
>>>> > considering this issue is blocking the release of 1.16, maybe we could
>>>> > fix this asap, and as for defining a new metric for internal data
>>>> > transfers we can have an in-depth discussion later. WDYT?
>>>> >
>>>> > Best,
>>>> > Qingsheng
>>>> >
>>>> > On Tue, Oct 11, 2022 at 6:06 PM Chesnay Schepler <ch...@apache.org>
>>>> wrote:
>>>> > >
>>>> > > Currently I think that would be a mistake.
>>>> > >
>>>> > > Ultimately what we have here is the culmination of us never really
>>>> considering how the numRecordsOut metric should behave for operators that
>>>> emit data to other operators _and_ external systems. This goes beyond sinks.
>>>> > > This even applies to numRecordsIn, for cases where functions
>>>> query/write data from/to the outside, (e.g., Async IO).
>>>> > >
>>>> > > Having 2 separate metrics for that, 1 exclusively for internal data
>>>> transfers, and 1 exclusively for external data transfers, is the only way
>>>> to get a consistent metric definition in the long-run.
>>>> > > We can jump back-and-forth now or just commit to it.
>>>> > >
>>>> > > I don't think we can really judge this based on FLIP-33. It was
>>>> IIRC written before the two phase sinks were added, which heavily blurred
>>>> the lines of what a sink even is. Because it definitely is _not_ the last
>>>> operator in a chain anymore.
>>>> > >
>>>> > > What I would suggest is to stick with what we got (although I
>>>> despise the name numRecordsSend), and alias the numRecordsOut metric for
>>>> all non-TwoPhaseCommittingSink.
>>>> > >
>>>> > > On 11/10/2022 05:54, Qingsheng Ren wrote:
>>>> > >
>>>> > > Thanks for the details Chesnay!
>>>> > >
>>>> > > By “alias” I mean to respect the original definition made in
>>>> FLIP-33 for numRecordsOut, which is the number of records written to the
>>>> external system, and keep numRecordsSend as the same value as numRecordsOut
>>>> for compatibility.
>>>> > >
>>>> > > I think keeping numRecordsOut for the output to the external system
>>>> is more intuitive to end users because in most cases the metric of data
>>>> flow output is more essential. I agree with you that a new metric is
>>>> required, but considering compatibility and users’ intuition I prefer to
>>>> keep the initial definition of numRecordsOut in FLIP-33 and name a new
>>>> metric for sink writer’s output to downstream operators. This might be
>>>> against consistency with metrics in other operators in Flink but maybe it’s
>>>> acceptable to have the sink as a special case.
>>>> > >
>>>> > > Best,
>>>> > > Qingsheng
>>>> > > On Oct 10, 2022, 19:13 +0800, Chesnay Schepler <ch...@apache.org>,
>>>> wrote:
>>>> > >
>>>> > > > I’m with Xintong’s idea to treat numXXXSend as an alias of
>>>> numXXXOut
>>>> > >
>>>> > > But that's not possible. If it were that simple there would have
>>>> never been a need to introduce another metric in the first place.
>>>> > >
>>>> > > It's a rather fundamental issue with how the new sinks work, in
>>>> that they emit data to the external system (usually considered as
>>>> "numRecordsOut" of sinks) while _also_ sending data to a downstream
>>>> operator (usually considered as "numRecordsOut" of tasks).
>>>> > > The original issue was that the numRecordsOut of the sink counted
>>>> both (which is completely wrong).
>>>> > >
>>>> > > A new metric was always required; otherwise you inevitably end up
>>>> breaking some semantic.
>>>> > > Adding a new metric for what the sink writes to the external system
>>>> is, for better or worse, more consistent with how these metrics usually
>>>> work in Flink.
>>>> > >
>>>> > > On 10/10/2022 12:45, Qingsheng Ren wrote:
>>>> > >
>>>> > > Thanks everyone for joining the discussion!
>>>> > >
>>>> > > > Do you have any idea what has happened in the process here?
>>>> > >
>>>> > > The discussion in this PR [1] shows some details and could be
>>>> helpful to understand the original motivation of the renaming. We do have a
>>>> test case for guarding metrics but unfortunaly the case was also modified
>>>> so the defense was broken.
>>>> > >
>>>> > > I think the reason why both the developer and the reviewer forgot
>>>> to trigger an discussion and gave a green pass on the change is that
>>>> metrics are quite “trivial” to be noticed as public APIs. As mentioned by
>>>> Martijn I couldn’t find a place noting that metrics are public APIs and
>>>> should be treated carefully while contributing and reviewing.
>>>> > >
>>>> > > IMHO three actions could be made to prevent this kind of changes in
>>>> the future:
>>>> > >
>>>> > > a. Add test case for metrics (which we already have in
>>>> SinkMetricsITCase)
>>>> > > b. We emphasize that any public-interface breaking changes should
>>>> be proposed by a FLIP or discussed in mailing list, and should be listed in
>>>> the release note.
>>>> > > c. We remind contributors and reviewers about what should be
>>>> considered as public API, and include metric names in it.
>>>> > >
>>>> > > For b and c these two pages [2][3] might be proper places.
>>>> > >
>>>> > > About the patch to revert this, it looks like we have a consensus
>>>> on 1.16. As of 1.15 I think it’s worthy to trigger a minor version. I
>>>> didn’t see complaints about this for now so it should be OK to save the
>>>> situation asap. I’m with Xintong’s idea to treat numXXXSend as an alias of
>>>> numXXXOut considering there could possibly some users have already adapted
>>>> their system to the new naming, and have another internal metric for
>>>> reflecting number of outgoing committable batches (actually the
>>>> numRecordsIn of sink committer operator should be carrying this info
>>>> already).
>>>> > >
>>>> > > [1] https://github.com/apache/flink/pull/18825
>>>> > > [2] https://flink.apache.org/contributing/contribute-code.html
>>>> > > [3] https://flink.apache.org/contributing/reviewing-prs.html
>>>> > >
>>>> > > Best,
>>>> > > Qingsheng
>>>> > > On Oct 10, 2022, 17:40 +0800, Xintong Song <to...@gmail.com>,
>>>> wrote:
>>>> > >
>>>> > > +1 for reverting these changes in Flink 1.16.
>>>> > >
>>>> > > For 1.15.3, can we make these metrics available via both names
>>>> (numXXXOut and numXXXSend)? In this way we don't break it for those who
>>>> already migrated to 1.15 and numXXXSend. That means we still need to change
>>>> SinkWriterOperator to use another metric name in 1.15.3, which IIUC is
>>>> internal to Flink sink.
>>>> > >
>>>> > > I'm overall +1 to change numXXXOut back to its original semantics.
>>>> AFAIK (from meetup / flink-forward questionaires), most users do not
>>>> migrate to a new Flink release immediately, until the next 1-2 major
>>>> releases are out.
>>>> > >
>>>> > > Best,
>>>> > >
>>>> > > Xintong
>>>> > >
>>>> > >
>>>> > >
>>>> > > On Mon, Oct 10, 2022 at 5:26 PM Martijn Visser <
>>>> martijnvisser@apache.org> wrote:
>>>> > >>
>>>> > >> Hi Qingsheng,
>>>> > >>
>>>> > >> Do you have any idea what has happened in the process here? Do we
>>>> know why
>>>> > >> they were changed? I was under the impression that these metric
>>>> names were
>>>> > >> newly introduced due to the new interfaces and because it still
>>>> depends on
>>>> > >> each connector implementing these.
>>>> > >>
>>>> > >> Sidenote: metric names are not mentioned in the FLIP process as a
>>>> public
>>>> > >> API. Might make sense to have a separate follow-up to add that to
>>>> the list
>>>> > >> (I do think we should list them there).
>>>> > >>
>>>> > >> +1 for reverting this and make this change in Flink 1.16
>>>> > >>
>>>> > >> I'm not in favour of releasing a Flink 1.15.3 with this change: I
>>>> think the
>>>> > >> impact is too big for a patch version, especially given how long
>>>> Flink 1.15
>>>> > >> is already out there.
>>>> > >>
>>>> > >> Best regards,
>>>> > >>
>>>> > >> Martijn
>>>> > >>
>>>> > >> On Mon, Oct 10, 2022 at 11:13 AM Leonard Xu <xb...@gmail.com>
>>>> wrote:
>>>> > >>
>>>> > >> > Thanks Qingsheng for starting this thread.
>>>> > >> >
>>>> > >> > +1 on reverting sink metric name and releasing 1.15.3 to fix this
>>>> > >> > inconsistent behavior.
>>>> > >> >
>>>> > >> >
>>>> > >> > Best,
>>>> > >> > Leonard
>>>> > >> >
>>>> > >> >
>>>> > >> >
>>>> > >> >
>>>> > >> >
>>>> > >> > 2022年10月10日 下午3:06,Jark Wu <im...@gmail.com> 写道:
>>>> > >> >
>>>> > >> > Thanks for discovering this problem, Qingsheng!
>>>> > >> >
>>>> > >> > I'm also +1 for reverting the breaking changes.
>>>> > >> >
>>>> > >> > IIUC, currently, the behavior of "numXXXOut" metrics of the new
>>>> and old
>>>> > >> > sink is inconsistent.
>>>> > >> > We have to break one of them to have consistent behavior. Sink
>>>> V2 is an
>>>> > >> > evolving API which is just introduced in 1.15.
>>>> > >> > I think it makes sense to break the unstable API instead of the
>>>> stable API
>>>> > >> > which many connectors and users depend on.
>>>> > >> >
>>>> > >> > Best,
>>>> > >> > Jark
>>>> > >> >
>>>> > >> >
>>>> > >> >
>>>> > >> > On Mon, 10 Oct 2022 at 11:36, Jingsong Li <
>>>> jingsonglee0@gmail.com> wrote:
>>>> > >> >
>>>> > >> >> Thanks for driving, Qingsheng.
>>>> > >> >>
>>>> > >> >> +1 for reverting sink metric name.
>>>> > >> >>
>>>> > >> >> We often forget that metric is also one of the important APIs.
>>>> > >> >>
>>>> > >> >> +1 for releasing 1.15.3 to fix this.
>>>> > >> >>
>>>> > >> >> Best,
>>>> > >> >> Jingsong
>>>> > >> >>
>>>> > >> >> On Sun, Oct 9, 2022 at 11:35 PM Becket Qin <
>>>> becket.qin@gmail.com> wrote:
>>>> > >> >> >
>>>> > >> >> > Thanks for raising the discussion, Qingsheng,
>>>> > >> >> >
>>>> > >> >> > +1 on reverting the breaking changes.
>>>> > >> >> >
>>>> > >> >> > In addition, we might want to release a 1.15.3 to fix this
>>>> and update
>>>> > >> >> the previous release docs with this known issue, so that users
>>>> can upgrade
>>>> > >> >> to 1.15.3 when they hit it. It would also be good to add some
>>>> backwards
>>>> > >> >> compatibility tests on metrics to avoid unintended breaking
>>>> changes like
>>>> > >> >> this in the future.
>>>> > >> >> >
>>>> > >> >> > Thanks,
>>>> > >> >> >
>>>> > >> >> > Jiangjie (Becket) Qin
>>>> > >> >> >
>>>> > >> >> > On Sun, Oct 9, 2022 at 10:35 AM Qingsheng Ren <
>>>> renqs@apache.org> wrote:
>>>> > >> >> >>
>>>> > >> >> >> Hi devs and users,
>>>> > >> >> >>
>>>> > >> >> >> I’d like to start a discussion about reverting a breaking
>>>> change about
>>>> > >> >> sink metrics made in 1.15 by FLINK-26126 [1] and FLINK-26492
>>>> [2].
>>>> > >> >> >>
>>>> > >> >> >> TL;DR
>>>> > >> >> >>
>>>> > >> >> >> All sink metrics with name “numXXXOut” defined in FLIP-33
>>>> are replace
>>>> > >> >> by “numXXXSend” in FLINK-26126 and FLINK-26492. Considering
>>>> metric names
>>>> > >> >> are public APIs, this is a breaking change to end users and not
>>>> backward
>>>> > >> >> compatible. Also unfortunately this breaking change was not
>>>> discussed in
>>>> > >> >> the mailing list before.
>>>> > >> >> >>
>>>> > >> >> >> Background
>>>> > >> >> >>
>>>> > >> >> >> As defined previously in FLIP-33 (the FLIP page has been
>>>> changed so
>>>> > >> >> please refer to the old version [3] ), metric “numRecordsOut”
>>>> is used for
>>>> > >> >> reporting the total number of output records since the sink
>>>> started (number
>>>> > >> >> of records written to the external system), and similarly for
>>>> > >> >> “numRecordsOutPerSecond”, “numBytesOut”, “numBytesOutPerSecond”
>>>> and
>>>> > >> >> “numRecordsOutError”. Most sinks are following this naming and
>>>> definition.
>>>> > >> >> However, these metrics are ambiguous in the new Sink API as
>>>> “numXXXOut”
>>>> > >> >> could be used by the output of SinkWriterOperator for reporting
>>>> number of
>>>> > >> >> Committables delivered to SinkCommitterOperator. In order to
>>>> resolve the
>>>> > >> >> conflict, FLINK-26126 and FLINK-26492 changed names of these
>>>> metrics with
>>>> > >> >> “numXXXSend”.
>>>> > >> >> >>
>>>> > >> >> >> Necessity of reverting this change
>>>> > >> >> >>
>>>> > >> >> >> - Metric names are actually public API, as end users need to
>>>> configure
>>>> > >> >> metric collecting and alerting system with metric names. Users
>>>> have to
>>>> > >> >> reset all configurations related to affected metrics.
>>>> > >> >> >> - This could also affect custom and external sinks not
>>>> maintained by
>>>> > >> >> Flink, which might have implemented with numXXXOut metrics.
>>>> > >> >> >> - The number of records sent to external system is way more
>>>> important
>>>> > >> >> than the number of Committables sent to SinkCommitterOperator,
>>>> as the
>>>> > >> >> latter one is just an internal implementation of sink. We could
>>>> have a new
>>>> > >> >> metric name for the latter one instead.
>>>> > >> >> >> - We could avoid splitting the project by version (like “plz
>>>> use
>>>> > >> >> numXXXOut before 1.15 and use numXXXSend after”) if we revert
>>>> it ASAP,
>>>> > >> >> cosidering 1.16 is still not released for now.
>>>> > >> >> >>
>>>> > >> >> >> As a consequence, I’d like to hear from devs and users about
>>>> your
>>>> > >> >> opinion on changing these metrics back to “numXXXOut”.
>>>> > >> >> >>
>>>> > >> >> >> Looking forward to your reply!
>>>> > >> >> >>
>>>> > >> >> >> [1] https://issues.apache.org/jira/browse/FLINK-26126
>>>> > >> >> >> [2] https://issues.apache.org/jira/browse/FLINK-26492
>>>> > >> >> >> [1] FLIP-33, version 18:
>>>> > >> >>
>>>> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=211883136
>>>> > >> >> >>
>>>> > >> >> >> Best,
>>>> > >> >> >> Qingsheng
>>>> > >> >>
>>>> > >> >
>>>> > >> >
>>>> > >
>>>> > >
>>>> > >
>>>>
>>>

Re: [DISCUSS] Reverting sink metric name changes made in 1.15

Posted by Qingsheng Ren <re...@apache.org>.
Hi devs and users,

It looks like we are getting an initial consensus in the discussion so I
started a voting thread [1] just now. Looking forward to your feedback!

[1] https://lists.apache.org/thread/ozlf82mkm6ndx2n1vdgq532h156p4lt6

Best,
Qingsheng


On Thu, Oct 13, 2022 at 10:41 PM Jing Ge <ji...@ververica.com> wrote:

> Hi Qingsheng,
>
> Thanks for the clarification. +1, I like the idea. Pointing both numXXXOut
> and numXXXSend to the same external data transfer metric does not really
> break the new SinkV2 design, since there was no requirement to monitor the
> internal traffic. So, I think both developer and user can live with it. It
> might not be the perfect solution but is indeed the currently best
> trade-off solution after considering the backward compatibility.  I would
> suggest firing a follow-up ticket after the PR to take care of the new
> metric for the internal traffic in the future.
>
> Best regards,
> Jing
>
>
> On Thu, Oct 13, 2022 at 3:08 PM Qingsheng Ren <re...@apache.org> wrote:
>
>> Hi Jing,
>>
>> Thanks for the reply!
>>
>> Let me rephrase my proposal: we’d like to use numXXXOut registered on
>> SinkWriterOperator to reflect the traffic to the external system for
>> compatibility with old versions before 1.15, and make numXXXSend have the
>> same value as numXXXOut for compatibility within 1.15. That means both
>> numXXXOut and numXXXSend are used for external data transfers, which end
>> users care more about. As for the internal traffic within the sink, we
>> could name a new metric for it because this is a _new_ feature in the _new_
>> sink, and end users usually don’t pay attention to internal implementation.
>> The name of the new metric could be discussed later after 1.16 release.
>>
>> > but it might end up with monitoring unexpected metrics, which is even
>> worse for users, i.e. I didn't change anything, but something has been
>> broken since the last update.
>>
>> Yeah this is exactly what we are trying to fix with this proposal. I
>> believe users are more concerned with the output to the external system
>> than the internal data delivery in the sink, so I think we’ll have more
>> cases reporting like “I set up a panel on numRecordsOut in sink to monitor
>> the output of the job, but after upgrading to 1.15 this value is extremely
>> low and I didn’t change anything” if we stick to the current situation. I
>> think only a few end users care about the number of committables sending to
>> downstream as most of them don’t care how the sink works.
>>
>> We do need a re-design to fully distinguish the internal and external
>> traffic on metrics, not only in sink but in all operators as it’s quite
>> common for operators to make IO. This needs time to design, discuss, adjust
>> and vote, but considering this is blocking 1.16, maybe it’s better to
>> rescue the compatibility for now, and leave the huge reconstruction to
>> future versions (maybe 2.0).
>>
>> Best,
>> Qingsheng
>>
>> On Wed, Oct 12, 2022 at 7:21 PM Jing Ge <ji...@ververica.com> wrote:
>>
>>> Hi Qingsheng,
>>>
>>> Just want to make sure we are on the same page. Are you suggesting
>>> switching the naming between "numXXXSend" and "numXXXOut" or reverting all
>>> the changes we did with FLINK-26126 and FLINK-26492?
>>>
>>> For the naming switch, please pay attention that the behaviour has been
>>> changed since we introduced SinkV2[1]. So, please be aware of different
>>> numbers(behaviour change) even with the same metrics name. Sticking with
>>> the old name with the new behaviour (very bad idea, IMHO) might seem like
>>> saving the effort in the first place, but it might end up with monitoring
>>> unexpected metrics, which is even worse for users, i.e. I didn't change
>>> anything, but something has been broken since the last update.
>>>
>>> For reverting, I am not sure how to fix the issue mentioned in
>>> FLINK-26126 after reverting all changes. Like Chesnay has already pointed
>>> out, with SinkV2 we have two different output lines - one with the external
>>> system and the other with the downstream operator. In this case,
>>> "numXXXSend" is rather a new metric than a replacement of "numXXXOut". The
>>> "numXXXOut" metric can still be used, depending on what the user wants to
>>> monitor.
>>>
>>>
>>> Best regards,
>>> Jing
>>>
>>> [1]
>>> https://github.com/apache/flink/blob/51fc20db30d001a95de95b3b9993eeb06f558f6c/flink-metrics/flink-metrics-core/src/main/java/org/apache/flink/metrics/groups/SinkWriterMetricGroup.java#L48
>>>
>>>
>>> On Wed, Oct 12, 2022 at 12:48 PM Qingsheng Ren <re...@apache.org> wrote:
>>>
>>>> As a supplement, considering it could be a big reconstruction
>>>> redefining internal and external traffic and touching metric names in
>>>> almost all operators, this requires a lot of discussions and we might
>>>> do it finally in Flink 2.0. I think compatibility is a bigger blocker
>>>> in front of us, as the output of sink is a metric that users care a
>>>> lot about.
>>>>
>>>> Thanks,
>>>> Qingsheng
>>>>
>>>> On Wed, Oct 12, 2022 at 6:20 PM Qingsheng Ren <re...@apache.org> wrote:
>>>> >
>>>> > Thanks Chesnay for the reply. +1 for making a unified and clearer
>>>> > metric definition distinguishing internal and external data transfers.
>>>> > As you described, having IO in operators is quite common such as
>>>> > dimension tables in Table/SQL API. This definitely deserves a FLIP and
>>>> > an overall design.
>>>> >
>>>> > However I think it's necessary to change the metric back to
>>>> > numRecordsOut instead of sticking with numRecordsSend in 1.15 and
>>>> > 1.16. The most important argument is for compatibility as I mentioned
>>>> > in my previous email, otherwise all users have to modify their configs
>>>> > of metric systems after upgrading to Flink 1.15+, and all custom
>>>> > connectors have to change their implementations to migrate to the new
>>>> > metric name. I believe other ones participating and approving this
>>>> > proposal share the same concern about compatibility too. Also
>>>> > considering this issue is blocking the release of 1.16, maybe we could
>>>> > fix this asap, and as for defining a new metric for internal data
>>>> > transfers we can have an in-depth discussion later. WDYT?
>>>> >
>>>> > Best,
>>>> > Qingsheng
>>>> >
>>>> > On Tue, Oct 11, 2022 at 6:06 PM Chesnay Schepler <ch...@apache.org>
>>>> wrote:
>>>> > >
>>>> > > Currently I think that would be a mistake.
>>>> > >
>>>> > > Ultimately what we have here is the culmination of us never really
>>>> considering how the numRecordsOut metric should behave for operators that
>>>> emit data to other operators _and_ external systems. This goes beyond sinks.
>>>> > > This even applies to numRecordsIn, for cases where functions
>>>> query/write data from/to the outside, (e.g., Async IO).
>>>> > >
>>>> > > Having 2 separate metrics for that, 1 exclusively for internal data
>>>> transfers, and 1 exclusively for external data transfers, is the only way
>>>> to get a consistent metric definition in the long-run.
>>>> > > We can jump back-and-forth now or just commit to it.
>>>> > >
>>>> > > I don't think we can really judge this based on FLIP-33. It was
>>>> IIRC written before the two phase sinks were added, which heavily blurred
>>>> the lines of what a sink even is. Because it definitely is _not_ the last
>>>> operator in a chain anymore.
>>>> > >
>>>> > > What I would suggest is to stick with what we got (although I
>>>> despise the name numRecordsSend), and alias the numRecordsOut metric for
>>>> all non-TwoPhaseCommittingSink.
>>>> > >
>>>> > > On 11/10/2022 05:54, Qingsheng Ren wrote:
>>>> > >
>>>> > > Thanks for the details Chesnay!
>>>> > >
>>>> > > By “alias” I mean to respect the original definition made in
>>>> FLIP-33 for numRecordsOut, which is the number of records written to the
>>>> external system, and keep numRecordsSend as the same value as numRecordsOut
>>>> for compatibility.
>>>> > >
>>>> > > I think keeping numRecordsOut for the output to the external system
>>>> is more intuitive to end users because in most cases the metric of data
>>>> flow output is more essential. I agree with you that a new metric is
>>>> required, but considering compatibility and users’ intuition I prefer to
>>>> keep the initial definition of numRecordsOut in FLIP-33 and name a new
>>>> metric for sink writer’s output to downstream operators. This might be
>>>> against consistency with metrics in other operators in Flink but maybe it’s
>>>> acceptable to have the sink as a special case.
>>>> > >
>>>> > > Best,
>>>> > > Qingsheng
>>>> > > On Oct 10, 2022, 19:13 +0800, Chesnay Schepler <ch...@apache.org>,
>>>> wrote:
>>>> > >
>>>> > > > I’m with Xintong’s idea to treat numXXXSend as an alias of
>>>> numXXXOut
>>>> > >
>>>> > > But that's not possible. If it were that simple there would have
>>>> never been a need to introduce another metric in the first place.
>>>> > >
>>>> > > It's a rather fundamental issue with how the new sinks work, in
>>>> that they emit data to the external system (usually considered as
>>>> "numRecordsOut" of sinks) while _also_ sending data to a downstream
>>>> operator (usually considered as "numRecordsOut" of tasks).
>>>> > > The original issue was that the numRecordsOut of the sink counted
>>>> both (which is completely wrong).
>>>> > >
>>>> > > A new metric was always required; otherwise you inevitably end up
>>>> breaking some semantic.
>>>> > > Adding a new metric for what the sink writes to the external system
>>>> is, for better or worse, more consistent with how these metrics usually
>>>> work in Flink.
>>>> > >
>>>> > > On 10/10/2022 12:45, Qingsheng Ren wrote:
>>>> > >
>>>> > > Thanks everyone for joining the discussion!
>>>> > >
>>>> > > > Do you have any idea what has happened in the process here?
>>>> > >
>>>> > > The discussion in this PR [1] shows some details and could be
>>>> helpful to understand the original motivation of the renaming. We do have a
>>>> test case for guarding metrics but unfortunaly the case was also modified
>>>> so the defense was broken.
>>>> > >
>>>> > > I think the reason why both the developer and the reviewer forgot
>>>> to trigger an discussion and gave a green pass on the change is that
>>>> metrics are quite “trivial” to be noticed as public APIs. As mentioned by
>>>> Martijn I couldn’t find a place noting that metrics are public APIs and
>>>> should be treated carefully while contributing and reviewing.
>>>> > >
>>>> > > IMHO three actions could be made to prevent this kind of changes in
>>>> the future:
>>>> > >
>>>> > > a. Add test case for metrics (which we already have in
>>>> SinkMetricsITCase)
>>>> > > b. We emphasize that any public-interface breaking changes should
>>>> be proposed by a FLIP or discussed in mailing list, and should be listed in
>>>> the release note.
>>>> > > c. We remind contributors and reviewers about what should be
>>>> considered as public API, and include metric names in it.
>>>> > >
>>>> > > For b and c these two pages [2][3] might be proper places.
>>>> > >
>>>> > > About the patch to revert this, it looks like we have a consensus
>>>> on 1.16. As of 1.15 I think it’s worthy to trigger a minor version. I
>>>> didn’t see complaints about this for now so it should be OK to save the
>>>> situation asap. I’m with Xintong’s idea to treat numXXXSend as an alias of
>>>> numXXXOut considering there could possibly some users have already adapted
>>>> their system to the new naming, and have another internal metric for
>>>> reflecting number of outgoing committable batches (actually the
>>>> numRecordsIn of sink committer operator should be carrying this info
>>>> already).
>>>> > >
>>>> > > [1] https://github.com/apache/flink/pull/18825
>>>> > > [2] https://flink.apache.org/contributing/contribute-code.html
>>>> > > [3] https://flink.apache.org/contributing/reviewing-prs.html
>>>> > >
>>>> > > Best,
>>>> > > Qingsheng
>>>> > > On Oct 10, 2022, 17:40 +0800, Xintong Song <to...@gmail.com>,
>>>> wrote:
>>>> > >
>>>> > > +1 for reverting these changes in Flink 1.16.
>>>> > >
>>>> > > For 1.15.3, can we make these metrics available via both names
>>>> (numXXXOut and numXXXSend)? In this way we don't break it for those who
>>>> already migrated to 1.15 and numXXXSend. That means we still need to change
>>>> SinkWriterOperator to use another metric name in 1.15.3, which IIUC is
>>>> internal to Flink sink.
>>>> > >
>>>> > > I'm overall +1 to change numXXXOut back to its original semantics.
>>>> AFAIK (from meetup / flink-forward questionaires), most users do not
>>>> migrate to a new Flink release immediately, until the next 1-2 major
>>>> releases are out.
>>>> > >
>>>> > > Best,
>>>> > >
>>>> > > Xintong
>>>> > >
>>>> > >
>>>> > >
>>>> > > On Mon, Oct 10, 2022 at 5:26 PM Martijn Visser <
>>>> martijnvisser@apache.org> wrote:
>>>> > >>
>>>> > >> Hi Qingsheng,
>>>> > >>
>>>> > >> Do you have any idea what has happened in the process here? Do we
>>>> know why
>>>> > >> they were changed? I was under the impression that these metric
>>>> names were
>>>> > >> newly introduced due to the new interfaces and because it still
>>>> depends on
>>>> > >> each connector implementing these.
>>>> > >>
>>>> > >> Sidenote: metric names are not mentioned in the FLIP process as a
>>>> public
>>>> > >> API. Might make sense to have a separate follow-up to add that to
>>>> the list
>>>> > >> (I do think we should list them there).
>>>> > >>
>>>> > >> +1 for reverting this and make this change in Flink 1.16
>>>> > >>
>>>> > >> I'm not in favour of releasing a Flink 1.15.3 with this change: I
>>>> think the
>>>> > >> impact is too big for a patch version, especially given how long
>>>> Flink 1.15
>>>> > >> is already out there.
>>>> > >>
>>>> > >> Best regards,
>>>> > >>
>>>> > >> Martijn
>>>> > >>
>>>> > >> On Mon, Oct 10, 2022 at 11:13 AM Leonard Xu <xb...@gmail.com>
>>>> wrote:
>>>> > >>
>>>> > >> > Thanks Qingsheng for starting this thread.
>>>> > >> >
>>>> > >> > +1 on reverting sink metric name and releasing 1.15.3 to fix this
>>>> > >> > inconsistent behavior.
>>>> > >> >
>>>> > >> >
>>>> > >> > Best,
>>>> > >> > Leonard
>>>> > >> >
>>>> > >> >
>>>> > >> >
>>>> > >> >
>>>> > >> >
>>>> > >> > 2022年10月10日 下午3:06,Jark Wu <im...@gmail.com> 写道:
>>>> > >> >
>>>> > >> > Thanks for discovering this problem, Qingsheng!
>>>> > >> >
>>>> > >> > I'm also +1 for reverting the breaking changes.
>>>> > >> >
>>>> > >> > IIUC, currently, the behavior of "numXXXOut" metrics of the new
>>>> and old
>>>> > >> > sink is inconsistent.
>>>> > >> > We have to break one of them to have consistent behavior. Sink
>>>> V2 is an
>>>> > >> > evolving API which is just introduced in 1.15.
>>>> > >> > I think it makes sense to break the unstable API instead of the
>>>> stable API
>>>> > >> > which many connectors and users depend on.
>>>> > >> >
>>>> > >> > Best,
>>>> > >> > Jark
>>>> > >> >
>>>> > >> >
>>>> > >> >
>>>> > >> > On Mon, 10 Oct 2022 at 11:36, Jingsong Li <
>>>> jingsonglee0@gmail.com> wrote:
>>>> > >> >
>>>> > >> >> Thanks for driving, Qingsheng.
>>>> > >> >>
>>>> > >> >> +1 for reverting sink metric name.
>>>> > >> >>
>>>> > >> >> We often forget that metric is also one of the important APIs.
>>>> > >> >>
>>>> > >> >> +1 for releasing 1.15.3 to fix this.
>>>> > >> >>
>>>> > >> >> Best,
>>>> > >> >> Jingsong
>>>> > >> >>
>>>> > >> >> On Sun, Oct 9, 2022 at 11:35 PM Becket Qin <
>>>> becket.qin@gmail.com> wrote:
>>>> > >> >> >
>>>> > >> >> > Thanks for raising the discussion, Qingsheng,
>>>> > >> >> >
>>>> > >> >> > +1 on reverting the breaking changes.
>>>> > >> >> >
>>>> > >> >> > In addition, we might want to release a 1.15.3 to fix this
>>>> and update
>>>> > >> >> the previous release docs with this known issue, so that users
>>>> can upgrade
>>>> > >> >> to 1.15.3 when they hit it. It would also be good to add some
>>>> backwards
>>>> > >> >> compatibility tests on metrics to avoid unintended breaking
>>>> changes like
>>>> > >> >> this in the future.
>>>> > >> >> >
>>>> > >> >> > Thanks,
>>>> > >> >> >
>>>> > >> >> > Jiangjie (Becket) Qin
>>>> > >> >> >
>>>> > >> >> > On Sun, Oct 9, 2022 at 10:35 AM Qingsheng Ren <
>>>> renqs@apache.org> wrote:
>>>> > >> >> >>
>>>> > >> >> >> Hi devs and users,
>>>> > >> >> >>
>>>> > >> >> >> I’d like to start a discussion about reverting a breaking
>>>> change about
>>>> > >> >> sink metrics made in 1.15 by FLINK-26126 [1] and FLINK-26492
>>>> [2].
>>>> > >> >> >>
>>>> > >> >> >> TL;DR
>>>> > >> >> >>
>>>> > >> >> >> All sink metrics with name “numXXXOut” defined in FLIP-33
>>>> are replace
>>>> > >> >> by “numXXXSend” in FLINK-26126 and FLINK-26492. Considering
>>>> metric names
>>>> > >> >> are public APIs, this is a breaking change to end users and not
>>>> backward
>>>> > >> >> compatible. Also unfortunately this breaking change was not
>>>> discussed in
>>>> > >> >> the mailing list before.
>>>> > >> >> >>
>>>> > >> >> >> Background
>>>> > >> >> >>
>>>> > >> >> >> As defined previously in FLIP-33 (the FLIP page has been
>>>> changed so
>>>> > >> >> please refer to the old version [3] ), metric “numRecordsOut”
>>>> is used for
>>>> > >> >> reporting the total number of output records since the sink
>>>> started (number
>>>> > >> >> of records written to the external system), and similarly for
>>>> > >> >> “numRecordsOutPerSecond”, “numBytesOut”, “numBytesOutPerSecond”
>>>> and
>>>> > >> >> “numRecordsOutError”. Most sinks are following this naming and
>>>> definition.
>>>> > >> >> However, these metrics are ambiguous in the new Sink API as
>>>> “numXXXOut”
>>>> > >> >> could be used by the output of SinkWriterOperator for reporting
>>>> number of
>>>> > >> >> Committables delivered to SinkCommitterOperator. In order to
>>>> resolve the
>>>> > >> >> conflict, FLINK-26126 and FLINK-26492 changed names of these
>>>> metrics with
>>>> > >> >> “numXXXSend”.
>>>> > >> >> >>
>>>> > >> >> >> Necessity of reverting this change
>>>> > >> >> >>
>>>> > >> >> >> - Metric names are actually public API, as end users need to
>>>> configure
>>>> > >> >> metric collecting and alerting system with metric names. Users
>>>> have to
>>>> > >> >> reset all configurations related to affected metrics.
>>>> > >> >> >> - This could also affect custom and external sinks not
>>>> maintained by
>>>> > >> >> Flink, which might have implemented with numXXXOut metrics.
>>>> > >> >> >> - The number of records sent to external system is way more
>>>> important
>>>> > >> >> than the number of Committables sent to SinkCommitterOperator,
>>>> as the
>>>> > >> >> latter one is just an internal implementation of sink. We could
>>>> have a new
>>>> > >> >> metric name for the latter one instead.
>>>> > >> >> >> - We could avoid splitting the project by version (like “plz
>>>> use
>>>> > >> >> numXXXOut before 1.15 and use numXXXSend after”) if we revert
>>>> it ASAP,
>>>> > >> >> cosidering 1.16 is still not released for now.
>>>> > >> >> >>
>>>> > >> >> >> As a consequence, I’d like to hear from devs and users about
>>>> your
>>>> > >> >> opinion on changing these metrics back to “numXXXOut”.
>>>> > >> >> >>
>>>> > >> >> >> Looking forward to your reply!
>>>> > >> >> >>
>>>> > >> >> >> [1] https://issues.apache.org/jira/browse/FLINK-26126
>>>> > >> >> >> [2] https://issues.apache.org/jira/browse/FLINK-26492
>>>> > >> >> >> [1] FLIP-33, version 18:
>>>> > >> >>
>>>> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=211883136
>>>> > >> >> >>
>>>> > >> >> >> Best,
>>>> > >> >> >> Qingsheng
>>>> > >> >>
>>>> > >> >
>>>> > >> >
>>>> > >
>>>> > >
>>>> > >
>>>>
>>>

Re: [DISCUSS] Reverting sink metric name changes made in 1.15

Posted by Jing Ge <ji...@ververica.com>.
Hi Qingsheng,

Thanks for the clarification. +1, I like the idea. Pointing both numXXXOut
and numXXXSend to the same external data transfer metric does not really
break the new SinkV2 design, since there was no requirement to monitor the
internal traffic. So, I think both developer and user can live with it. It
might not be the perfect solution but is indeed the currently best
trade-off solution after considering the backward compatibility.  I would
suggest firing a follow-up ticket after the PR to take care of the new
metric for the internal traffic in the future.

Best regards,
Jing


On Thu, Oct 13, 2022 at 3:08 PM Qingsheng Ren <re...@apache.org> wrote:

> Hi Jing,
>
> Thanks for the reply!
>
> Let me rephrase my proposal: we’d like to use numXXXOut registered on
> SinkWriterOperator to reflect the traffic to the external system for
> compatibility with old versions before 1.15, and make numXXXSend have the
> same value as numXXXOut for compatibility within 1.15. That means both
> numXXXOut and numXXXSend are used for external data transfers, which end
> users care more about. As for the internal traffic within the sink, we
> could name a new metric for it because this is a _new_ feature in the _new_
> sink, and end users usually don’t pay attention to internal implementation.
> The name of the new metric could be discussed later after 1.16 release.
>
> > but it might end up with monitoring unexpected metrics, which is even
> worse for users, i.e. I didn't change anything, but something has been
> broken since the last update.
>
> Yeah this is exactly what we are trying to fix with this proposal. I
> believe users are more concerned with the output to the external system
> than the internal data delivery in the sink, so I think we’ll have more
> cases reporting like “I set up a panel on numRecordsOut in sink to monitor
> the output of the job, but after upgrading to 1.15 this value is extremely
> low and I didn’t change anything” if we stick to the current situation. I
> think only a few end users care about the number of committables sending to
> downstream as most of them don’t care how the sink works.
>
> We do need a re-design to fully distinguish the internal and external
> traffic on metrics, not only in sink but in all operators as it’s quite
> common for operators to make IO. This needs time to design, discuss, adjust
> and vote, but considering this is blocking 1.16, maybe it’s better to
> rescue the compatibility for now, and leave the huge reconstruction to
> future versions (maybe 2.0).
>
> Best,
> Qingsheng
>
> On Wed, Oct 12, 2022 at 7:21 PM Jing Ge <ji...@ververica.com> wrote:
>
>> Hi Qingsheng,
>>
>> Just want to make sure we are on the same page. Are you suggesting
>> switching the naming between "numXXXSend" and "numXXXOut" or reverting all
>> the changes we did with FLINK-26126 and FLINK-26492?
>>
>> For the naming switch, please pay attention that the behaviour has been
>> changed since we introduced SinkV2[1]. So, please be aware of different
>> numbers(behaviour change) even with the same metrics name. Sticking with
>> the old name with the new behaviour (very bad idea, IMHO) might seem like
>> saving the effort in the first place, but it might end up with monitoring
>> unexpected metrics, which is even worse for users, i.e. I didn't change
>> anything, but something has been broken since the last update.
>>
>> For reverting, I am not sure how to fix the issue mentioned in
>> FLINK-26126 after reverting all changes. Like Chesnay has already pointed
>> out, with SinkV2 we have two different output lines - one with the external
>> system and the other with the downstream operator. In this case,
>> "numXXXSend" is rather a new metric than a replacement of "numXXXOut". The
>> "numXXXOut" metric can still be used, depending on what the user wants to
>> monitor.
>>
>>
>> Best regards,
>> Jing
>>
>> [1]
>> https://github.com/apache/flink/blob/51fc20db30d001a95de95b3b9993eeb06f558f6c/flink-metrics/flink-metrics-core/src/main/java/org/apache/flink/metrics/groups/SinkWriterMetricGroup.java#L48
>>
>>
>> On Wed, Oct 12, 2022 at 12:48 PM Qingsheng Ren <re...@apache.org> wrote:
>>
>>> As a supplement, considering it could be a big reconstruction
>>> redefining internal and external traffic and touching metric names in
>>> almost all operators, this requires a lot of discussions and we might
>>> do it finally in Flink 2.0. I think compatibility is a bigger blocker
>>> in front of us, as the output of sink is a metric that users care a
>>> lot about.
>>>
>>> Thanks,
>>> Qingsheng
>>>
>>> On Wed, Oct 12, 2022 at 6:20 PM Qingsheng Ren <re...@apache.org> wrote:
>>> >
>>> > Thanks Chesnay for the reply. +1 for making a unified and clearer
>>> > metric definition distinguishing internal and external data transfers.
>>> > As you described, having IO in operators is quite common such as
>>> > dimension tables in Table/SQL API. This definitely deserves a FLIP and
>>> > an overall design.
>>> >
>>> > However I think it's necessary to change the metric back to
>>> > numRecordsOut instead of sticking with numRecordsSend in 1.15 and
>>> > 1.16. The most important argument is for compatibility as I mentioned
>>> > in my previous email, otherwise all users have to modify their configs
>>> > of metric systems after upgrading to Flink 1.15+, and all custom
>>> > connectors have to change their implementations to migrate to the new
>>> > metric name. I believe other ones participating and approving this
>>> > proposal share the same concern about compatibility too. Also
>>> > considering this issue is blocking the release of 1.16, maybe we could
>>> > fix this asap, and as for defining a new metric for internal data
>>> > transfers we can have an in-depth discussion later. WDYT?
>>> >
>>> > Best,
>>> > Qingsheng
>>> >
>>> > On Tue, Oct 11, 2022 at 6:06 PM Chesnay Schepler <ch...@apache.org>
>>> wrote:
>>> > >
>>> > > Currently I think that would be a mistake.
>>> > >
>>> > > Ultimately what we have here is the culmination of us never really
>>> considering how the numRecordsOut metric should behave for operators that
>>> emit data to other operators _and_ external systems. This goes beyond sinks.
>>> > > This even applies to numRecordsIn, for cases where functions
>>> query/write data from/to the outside, (e.g., Async IO).
>>> > >
>>> > > Having 2 separate metrics for that, 1 exclusively for internal data
>>> transfers, and 1 exclusively for external data transfers, is the only way
>>> to get a consistent metric definition in the long-run.
>>> > > We can jump back-and-forth now or just commit to it.
>>> > >
>>> > > I don't think we can really judge this based on FLIP-33. It was IIRC
>>> written before the two phase sinks were added, which heavily blurred the
>>> lines of what a sink even is. Because it definitely is _not_ the last
>>> operator in a chain anymore.
>>> > >
>>> > > What I would suggest is to stick with what we got (although I
>>> despise the name numRecordsSend), and alias the numRecordsOut metric for
>>> all non-TwoPhaseCommittingSink.
>>> > >
>>> > > On 11/10/2022 05:54, Qingsheng Ren wrote:
>>> > >
>>> > > Thanks for the details Chesnay!
>>> > >
>>> > > By “alias” I mean to respect the original definition made in FLIP-33
>>> for numRecordsOut, which is the number of records written to the external
>>> system, and keep numRecordsSend as the same value as numRecordsOut for
>>> compatibility.
>>> > >
>>> > > I think keeping numRecordsOut for the output to the external system
>>> is more intuitive to end users because in most cases the metric of data
>>> flow output is more essential. I agree with you that a new metric is
>>> required, but considering compatibility and users’ intuition I prefer to
>>> keep the initial definition of numRecordsOut in FLIP-33 and name a new
>>> metric for sink writer’s output to downstream operators. This might be
>>> against consistency with metrics in other operators in Flink but maybe it’s
>>> acceptable to have the sink as a special case.
>>> > >
>>> > > Best,
>>> > > Qingsheng
>>> > > On Oct 10, 2022, 19:13 +0800, Chesnay Schepler <ch...@apache.org>,
>>> wrote:
>>> > >
>>> > > > I’m with Xintong’s idea to treat numXXXSend as an alias of
>>> numXXXOut
>>> > >
>>> > > But that's not possible. If it were that simple there would have
>>> never been a need to introduce another metric in the first place.
>>> > >
>>> > > It's a rather fundamental issue with how the new sinks work, in that
>>> they emit data to the external system (usually considered as
>>> "numRecordsOut" of sinks) while _also_ sending data to a downstream
>>> operator (usually considered as "numRecordsOut" of tasks).
>>> > > The original issue was that the numRecordsOut of the sink counted
>>> both (which is completely wrong).
>>> > >
>>> > > A new metric was always required; otherwise you inevitably end up
>>> breaking some semantic.
>>> > > Adding a new metric for what the sink writes to the external system
>>> is, for better or worse, more consistent with how these metrics usually
>>> work in Flink.
>>> > >
>>> > > On 10/10/2022 12:45, Qingsheng Ren wrote:
>>> > >
>>> > > Thanks everyone for joining the discussion!
>>> > >
>>> > > > Do you have any idea what has happened in the process here?
>>> > >
>>> > > The discussion in this PR [1] shows some details and could be
>>> helpful to understand the original motivation of the renaming. We do have a
>>> test case for guarding metrics but unfortunaly the case was also modified
>>> so the defense was broken.
>>> > >
>>> > > I think the reason why both the developer and the reviewer forgot to
>>> trigger an discussion and gave a green pass on the change is that metrics
>>> are quite “trivial” to be noticed as public APIs. As mentioned by Martijn I
>>> couldn’t find a place noting that metrics are public APIs and should be
>>> treated carefully while contributing and reviewing.
>>> > >
>>> > > IMHO three actions could be made to prevent this kind of changes in
>>> the future:
>>> > >
>>> > > a. Add test case for metrics (which we already have in
>>> SinkMetricsITCase)
>>> > > b. We emphasize that any public-interface breaking changes should be
>>> proposed by a FLIP or discussed in mailing list, and should be listed in
>>> the release note.
>>> > > c. We remind contributors and reviewers about what should be
>>> considered as public API, and include metric names in it.
>>> > >
>>> > > For b and c these two pages [2][3] might be proper places.
>>> > >
>>> > > About the patch to revert this, it looks like we have a consensus on
>>> 1.16. As of 1.15 I think it’s worthy to trigger a minor version. I didn’t
>>> see complaints about this for now so it should be OK to save the situation
>>> asap. I’m with Xintong’s idea to treat numXXXSend as an alias of numXXXOut
>>> considering there could possibly some users have already adapted their
>>> system to the new naming, and have another internal metric for reflecting
>>> number of outgoing committable batches (actually the numRecordsIn of sink
>>> committer operator should be carrying this info already).
>>> > >
>>> > > [1] https://github.com/apache/flink/pull/18825
>>> > > [2] https://flink.apache.org/contributing/contribute-code.html
>>> > > [3] https://flink.apache.org/contributing/reviewing-prs.html
>>> > >
>>> > > Best,
>>> > > Qingsheng
>>> > > On Oct 10, 2022, 17:40 +0800, Xintong Song <to...@gmail.com>,
>>> wrote:
>>> > >
>>> > > +1 for reverting these changes in Flink 1.16.
>>> > >
>>> > > For 1.15.3, can we make these metrics available via both names
>>> (numXXXOut and numXXXSend)? In this way we don't break it for those who
>>> already migrated to 1.15 and numXXXSend. That means we still need to change
>>> SinkWriterOperator to use another metric name in 1.15.3, which IIUC is
>>> internal to Flink sink.
>>> > >
>>> > > I'm overall +1 to change numXXXOut back to its original semantics.
>>> AFAIK (from meetup / flink-forward questionaires), most users do not
>>> migrate to a new Flink release immediately, until the next 1-2 major
>>> releases are out.
>>> > >
>>> > > Best,
>>> > >
>>> > > Xintong
>>> > >
>>> > >
>>> > >
>>> > > On Mon, Oct 10, 2022 at 5:26 PM Martijn Visser <
>>> martijnvisser@apache.org> wrote:
>>> > >>
>>> > >> Hi Qingsheng,
>>> > >>
>>> > >> Do you have any idea what has happened in the process here? Do we
>>> know why
>>> > >> they were changed? I was under the impression that these metric
>>> names were
>>> > >> newly introduced due to the new interfaces and because it still
>>> depends on
>>> > >> each connector implementing these.
>>> > >>
>>> > >> Sidenote: metric names are not mentioned in the FLIP process as a
>>> public
>>> > >> API. Might make sense to have a separate follow-up to add that to
>>> the list
>>> > >> (I do think we should list them there).
>>> > >>
>>> > >> +1 for reverting this and make this change in Flink 1.16
>>> > >>
>>> > >> I'm not in favour of releasing a Flink 1.15.3 with this change: I
>>> think the
>>> > >> impact is too big for a patch version, especially given how long
>>> Flink 1.15
>>> > >> is already out there.
>>> > >>
>>> > >> Best regards,
>>> > >>
>>> > >> Martijn
>>> > >>
>>> > >> On Mon, Oct 10, 2022 at 11:13 AM Leonard Xu <xb...@gmail.com>
>>> wrote:
>>> > >>
>>> > >> > Thanks Qingsheng for starting this thread.
>>> > >> >
>>> > >> > +1 on reverting sink metric name and releasing 1.15.3 to fix this
>>> > >> > inconsistent behavior.
>>> > >> >
>>> > >> >
>>> > >> > Best,
>>> > >> > Leonard
>>> > >> >
>>> > >> >
>>> > >> >
>>> > >> >
>>> > >> >
>>> > >> > 2022年10月10日 下午3:06,Jark Wu <im...@gmail.com> 写道:
>>> > >> >
>>> > >> > Thanks for discovering this problem, Qingsheng!
>>> > >> >
>>> > >> > I'm also +1 for reverting the breaking changes.
>>> > >> >
>>> > >> > IIUC, currently, the behavior of "numXXXOut" metrics of the new
>>> and old
>>> > >> > sink is inconsistent.
>>> > >> > We have to break one of them to have consistent behavior. Sink V2
>>> is an
>>> > >> > evolving API which is just introduced in 1.15.
>>> > >> > I think it makes sense to break the unstable API instead of the
>>> stable API
>>> > >> > which many connectors and users depend on.
>>> > >> >
>>> > >> > Best,
>>> > >> > Jark
>>> > >> >
>>> > >> >
>>> > >> >
>>> > >> > On Mon, 10 Oct 2022 at 11:36, Jingsong Li <ji...@gmail.com>
>>> wrote:
>>> > >> >
>>> > >> >> Thanks for driving, Qingsheng.
>>> > >> >>
>>> > >> >> +1 for reverting sink metric name.
>>> > >> >>
>>> > >> >> We often forget that metric is also one of the important APIs.
>>> > >> >>
>>> > >> >> +1 for releasing 1.15.3 to fix this.
>>> > >> >>
>>> > >> >> Best,
>>> > >> >> Jingsong
>>> > >> >>
>>> > >> >> On Sun, Oct 9, 2022 at 11:35 PM Becket Qin <be...@gmail.com>
>>> wrote:
>>> > >> >> >
>>> > >> >> > Thanks for raising the discussion, Qingsheng,
>>> > >> >> >
>>> > >> >> > +1 on reverting the breaking changes.
>>> > >> >> >
>>> > >> >> > In addition, we might want to release a 1.15.3 to fix this and
>>> update
>>> > >> >> the previous release docs with this known issue, so that users
>>> can upgrade
>>> > >> >> to 1.15.3 when they hit it. It would also be good to add some
>>> backwards
>>> > >> >> compatibility tests on metrics to avoid unintended breaking
>>> changes like
>>> > >> >> this in the future.
>>> > >> >> >
>>> > >> >> > Thanks,
>>> > >> >> >
>>> > >> >> > Jiangjie (Becket) Qin
>>> > >> >> >
>>> > >> >> > On Sun, Oct 9, 2022 at 10:35 AM Qingsheng Ren <
>>> renqs@apache.org> wrote:
>>> > >> >> >>
>>> > >> >> >> Hi devs and users,
>>> > >> >> >>
>>> > >> >> >> I’d like to start a discussion about reverting a breaking
>>> change about
>>> > >> >> sink metrics made in 1.15 by FLINK-26126 [1] and FLINK-26492 [2].
>>> > >> >> >>
>>> > >> >> >> TL;DR
>>> > >> >> >>
>>> > >> >> >> All sink metrics with name “numXXXOut” defined in FLIP-33 are
>>> replace
>>> > >> >> by “numXXXSend” in FLINK-26126 and FLINK-26492. Considering
>>> metric names
>>> > >> >> are public APIs, this is a breaking change to end users and not
>>> backward
>>> > >> >> compatible. Also unfortunately this breaking change was not
>>> discussed in
>>> > >> >> the mailing list before.
>>> > >> >> >>
>>> > >> >> >> Background
>>> > >> >> >>
>>> > >> >> >> As defined previously in FLIP-33 (the FLIP page has been
>>> changed so
>>> > >> >> please refer to the old version [3] ), metric “numRecordsOut” is
>>> used for
>>> > >> >> reporting the total number of output records since the sink
>>> started (number
>>> > >> >> of records written to the external system), and similarly for
>>> > >> >> “numRecordsOutPerSecond”, “numBytesOut”, “numBytesOutPerSecond”
>>> and
>>> > >> >> “numRecordsOutError”. Most sinks are following this naming and
>>> definition.
>>> > >> >> However, these metrics are ambiguous in the new Sink API as
>>> “numXXXOut”
>>> > >> >> could be used by the output of SinkWriterOperator for reporting
>>> number of
>>> > >> >> Committables delivered to SinkCommitterOperator. In order to
>>> resolve the
>>> > >> >> conflict, FLINK-26126 and FLINK-26492 changed names of these
>>> metrics with
>>> > >> >> “numXXXSend”.
>>> > >> >> >>
>>> > >> >> >> Necessity of reverting this change
>>> > >> >> >>
>>> > >> >> >> - Metric names are actually public API, as end users need to
>>> configure
>>> > >> >> metric collecting and alerting system with metric names. Users
>>> have to
>>> > >> >> reset all configurations related to affected metrics.
>>> > >> >> >> - This could also affect custom and external sinks not
>>> maintained by
>>> > >> >> Flink, which might have implemented with numXXXOut metrics.
>>> > >> >> >> - The number of records sent to external system is way more
>>> important
>>> > >> >> than the number of Committables sent to SinkCommitterOperator,
>>> as the
>>> > >> >> latter one is just an internal implementation of sink. We could
>>> have a new
>>> > >> >> metric name for the latter one instead.
>>> > >> >> >> - We could avoid splitting the project by version (like “plz
>>> use
>>> > >> >> numXXXOut before 1.15 and use numXXXSend after”) if we revert it
>>> ASAP,
>>> > >> >> cosidering 1.16 is still not released for now.
>>> > >> >> >>
>>> > >> >> >> As a consequence, I’d like to hear from devs and users about
>>> your
>>> > >> >> opinion on changing these metrics back to “numXXXOut”.
>>> > >> >> >>
>>> > >> >> >> Looking forward to your reply!
>>> > >> >> >>
>>> > >> >> >> [1] https://issues.apache.org/jira/browse/FLINK-26126
>>> > >> >> >> [2] https://issues.apache.org/jira/browse/FLINK-26492
>>> > >> >> >> [1] FLIP-33, version 18:
>>> > >> >>
>>> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=211883136
>>> > >> >> >>
>>> > >> >> >> Best,
>>> > >> >> >> Qingsheng
>>> > >> >>
>>> > >> >
>>> > >> >
>>> > >
>>> > >
>>> > >
>>>
>>

Re: [DISCUSS] Reverting sink metric name changes made in 1.15

Posted by Jing Ge <ji...@ververica.com>.
Hi Qingsheng,

Thanks for the clarification. +1, I like the idea. Pointing both numXXXOut
and numXXXSend to the same external data transfer metric does not really
break the new SinkV2 design, since there was no requirement to monitor the
internal traffic. So, I think both developer and user can live with it. It
might not be the perfect solution but is indeed the currently best
trade-off solution after considering the backward compatibility.  I would
suggest firing a follow-up ticket after the PR to take care of the new
metric for the internal traffic in the future.

Best regards,
Jing


On Thu, Oct 13, 2022 at 3:08 PM Qingsheng Ren <re...@apache.org> wrote:

> Hi Jing,
>
> Thanks for the reply!
>
> Let me rephrase my proposal: we’d like to use numXXXOut registered on
> SinkWriterOperator to reflect the traffic to the external system for
> compatibility with old versions before 1.15, and make numXXXSend have the
> same value as numXXXOut for compatibility within 1.15. That means both
> numXXXOut and numXXXSend are used for external data transfers, which end
> users care more about. As for the internal traffic within the sink, we
> could name a new metric for it because this is a _new_ feature in the _new_
> sink, and end users usually don’t pay attention to internal implementation.
> The name of the new metric could be discussed later after 1.16 release.
>
> > but it might end up with monitoring unexpected metrics, which is even
> worse for users, i.e. I didn't change anything, but something has been
> broken since the last update.
>
> Yeah this is exactly what we are trying to fix with this proposal. I
> believe users are more concerned with the output to the external system
> than the internal data delivery in the sink, so I think we’ll have more
> cases reporting like “I set up a panel on numRecordsOut in sink to monitor
> the output of the job, but after upgrading to 1.15 this value is extremely
> low and I didn’t change anything” if we stick to the current situation. I
> think only a few end users care about the number of committables sending to
> downstream as most of them don’t care how the sink works.
>
> We do need a re-design to fully distinguish the internal and external
> traffic on metrics, not only in sink but in all operators as it’s quite
> common for operators to make IO. This needs time to design, discuss, adjust
> and vote, but considering this is blocking 1.16, maybe it’s better to
> rescue the compatibility for now, and leave the huge reconstruction to
> future versions (maybe 2.0).
>
> Best,
> Qingsheng
>
> On Wed, Oct 12, 2022 at 7:21 PM Jing Ge <ji...@ververica.com> wrote:
>
>> Hi Qingsheng,
>>
>> Just want to make sure we are on the same page. Are you suggesting
>> switching the naming between "numXXXSend" and "numXXXOut" or reverting all
>> the changes we did with FLINK-26126 and FLINK-26492?
>>
>> For the naming switch, please pay attention that the behaviour has been
>> changed since we introduced SinkV2[1]. So, please be aware of different
>> numbers(behaviour change) even with the same metrics name. Sticking with
>> the old name with the new behaviour (very bad idea, IMHO) might seem like
>> saving the effort in the first place, but it might end up with monitoring
>> unexpected metrics, which is even worse for users, i.e. I didn't change
>> anything, but something has been broken since the last update.
>>
>> For reverting, I am not sure how to fix the issue mentioned in
>> FLINK-26126 after reverting all changes. Like Chesnay has already pointed
>> out, with SinkV2 we have two different output lines - one with the external
>> system and the other with the downstream operator. In this case,
>> "numXXXSend" is rather a new metric than a replacement of "numXXXOut". The
>> "numXXXOut" metric can still be used, depending on what the user wants to
>> monitor.
>>
>>
>> Best regards,
>> Jing
>>
>> [1]
>> https://github.com/apache/flink/blob/51fc20db30d001a95de95b3b9993eeb06f558f6c/flink-metrics/flink-metrics-core/src/main/java/org/apache/flink/metrics/groups/SinkWriterMetricGroup.java#L48
>>
>>
>> On Wed, Oct 12, 2022 at 12:48 PM Qingsheng Ren <re...@apache.org> wrote:
>>
>>> As a supplement, considering it could be a big reconstruction
>>> redefining internal and external traffic and touching metric names in
>>> almost all operators, this requires a lot of discussions and we might
>>> do it finally in Flink 2.0. I think compatibility is a bigger blocker
>>> in front of us, as the output of sink is a metric that users care a
>>> lot about.
>>>
>>> Thanks,
>>> Qingsheng
>>>
>>> On Wed, Oct 12, 2022 at 6:20 PM Qingsheng Ren <re...@apache.org> wrote:
>>> >
>>> > Thanks Chesnay for the reply. +1 for making a unified and clearer
>>> > metric definition distinguishing internal and external data transfers.
>>> > As you described, having IO in operators is quite common such as
>>> > dimension tables in Table/SQL API. This definitely deserves a FLIP and
>>> > an overall design.
>>> >
>>> > However I think it's necessary to change the metric back to
>>> > numRecordsOut instead of sticking with numRecordsSend in 1.15 and
>>> > 1.16. The most important argument is for compatibility as I mentioned
>>> > in my previous email, otherwise all users have to modify their configs
>>> > of metric systems after upgrading to Flink 1.15+, and all custom
>>> > connectors have to change their implementations to migrate to the new
>>> > metric name. I believe other ones participating and approving this
>>> > proposal share the same concern about compatibility too. Also
>>> > considering this issue is blocking the release of 1.16, maybe we could
>>> > fix this asap, and as for defining a new metric for internal data
>>> > transfers we can have an in-depth discussion later. WDYT?
>>> >
>>> > Best,
>>> > Qingsheng
>>> >
>>> > On Tue, Oct 11, 2022 at 6:06 PM Chesnay Schepler <ch...@apache.org>
>>> wrote:
>>> > >
>>> > > Currently I think that would be a mistake.
>>> > >
>>> > > Ultimately what we have here is the culmination of us never really
>>> considering how the numRecordsOut metric should behave for operators that
>>> emit data to other operators _and_ external systems. This goes beyond sinks.
>>> > > This even applies to numRecordsIn, for cases where functions
>>> query/write data from/to the outside, (e.g., Async IO).
>>> > >
>>> > > Having 2 separate metrics for that, 1 exclusively for internal data
>>> transfers, and 1 exclusively for external data transfers, is the only way
>>> to get a consistent metric definition in the long-run.
>>> > > We can jump back-and-forth now or just commit to it.
>>> > >
>>> > > I don't think we can really judge this based on FLIP-33. It was IIRC
>>> written before the two phase sinks were added, which heavily blurred the
>>> lines of what a sink even is. Because it definitely is _not_ the last
>>> operator in a chain anymore.
>>> > >
>>> > > What I would suggest is to stick with what we got (although I
>>> despise the name numRecordsSend), and alias the numRecordsOut metric for
>>> all non-TwoPhaseCommittingSink.
>>> > >
>>> > > On 11/10/2022 05:54, Qingsheng Ren wrote:
>>> > >
>>> > > Thanks for the details Chesnay!
>>> > >
>>> > > By “alias” I mean to respect the original definition made in FLIP-33
>>> for numRecordsOut, which is the number of records written to the external
>>> system, and keep numRecordsSend as the same value as numRecordsOut for
>>> compatibility.
>>> > >
>>> > > I think keeping numRecordsOut for the output to the external system
>>> is more intuitive to end users because in most cases the metric of data
>>> flow output is more essential. I agree with you that a new metric is
>>> required, but considering compatibility and users’ intuition I prefer to
>>> keep the initial definition of numRecordsOut in FLIP-33 and name a new
>>> metric for sink writer’s output to downstream operators. This might be
>>> against consistency with metrics in other operators in Flink but maybe it’s
>>> acceptable to have the sink as a special case.
>>> > >
>>> > > Best,
>>> > > Qingsheng
>>> > > On Oct 10, 2022, 19:13 +0800, Chesnay Schepler <ch...@apache.org>,
>>> wrote:
>>> > >
>>> > > > I’m with Xintong’s idea to treat numXXXSend as an alias of
>>> numXXXOut
>>> > >
>>> > > But that's not possible. If it were that simple there would have
>>> never been a need to introduce another metric in the first place.
>>> > >
>>> > > It's a rather fundamental issue with how the new sinks work, in that
>>> they emit data to the external system (usually considered as
>>> "numRecordsOut" of sinks) while _also_ sending data to a downstream
>>> operator (usually considered as "numRecordsOut" of tasks).
>>> > > The original issue was that the numRecordsOut of the sink counted
>>> both (which is completely wrong).
>>> > >
>>> > > A new metric was always required; otherwise you inevitably end up
>>> breaking some semantic.
>>> > > Adding a new metric for what the sink writes to the external system
>>> is, for better or worse, more consistent with how these metrics usually
>>> work in Flink.
>>> > >
>>> > > On 10/10/2022 12:45, Qingsheng Ren wrote:
>>> > >
>>> > > Thanks everyone for joining the discussion!
>>> > >
>>> > > > Do you have any idea what has happened in the process here?
>>> > >
>>> > > The discussion in this PR [1] shows some details and could be
>>> helpful to understand the original motivation of the renaming. We do have a
>>> test case for guarding metrics but unfortunaly the case was also modified
>>> so the defense was broken.
>>> > >
>>> > > I think the reason why both the developer and the reviewer forgot to
>>> trigger an discussion and gave a green pass on the change is that metrics
>>> are quite “trivial” to be noticed as public APIs. As mentioned by Martijn I
>>> couldn’t find a place noting that metrics are public APIs and should be
>>> treated carefully while contributing and reviewing.
>>> > >
>>> > > IMHO three actions could be made to prevent this kind of changes in
>>> the future:
>>> > >
>>> > > a. Add test case for metrics (which we already have in
>>> SinkMetricsITCase)
>>> > > b. We emphasize that any public-interface breaking changes should be
>>> proposed by a FLIP or discussed in mailing list, and should be listed in
>>> the release note.
>>> > > c. We remind contributors and reviewers about what should be
>>> considered as public API, and include metric names in it.
>>> > >
>>> > > For b and c these two pages [2][3] might be proper places.
>>> > >
>>> > > About the patch to revert this, it looks like we have a consensus on
>>> 1.16. As of 1.15 I think it’s worthy to trigger a minor version. I didn’t
>>> see complaints about this for now so it should be OK to save the situation
>>> asap. I’m with Xintong’s idea to treat numXXXSend as an alias of numXXXOut
>>> considering there could possibly some users have already adapted their
>>> system to the new naming, and have another internal metric for reflecting
>>> number of outgoing committable batches (actually the numRecordsIn of sink
>>> committer operator should be carrying this info already).
>>> > >
>>> > > [1] https://github.com/apache/flink/pull/18825
>>> > > [2] https://flink.apache.org/contributing/contribute-code.html
>>> > > [3] https://flink.apache.org/contributing/reviewing-prs.html
>>> > >
>>> > > Best,
>>> > > Qingsheng
>>> > > On Oct 10, 2022, 17:40 +0800, Xintong Song <to...@gmail.com>,
>>> wrote:
>>> > >
>>> > > +1 for reverting these changes in Flink 1.16.
>>> > >
>>> > > For 1.15.3, can we make these metrics available via both names
>>> (numXXXOut and numXXXSend)? In this way we don't break it for those who
>>> already migrated to 1.15 and numXXXSend. That means we still need to change
>>> SinkWriterOperator to use another metric name in 1.15.3, which IIUC is
>>> internal to Flink sink.
>>> > >
>>> > > I'm overall +1 to change numXXXOut back to its original semantics.
>>> AFAIK (from meetup / flink-forward questionaires), most users do not
>>> migrate to a new Flink release immediately, until the next 1-2 major
>>> releases are out.
>>> > >
>>> > > Best,
>>> > >
>>> > > Xintong
>>> > >
>>> > >
>>> > >
>>> > > On Mon, Oct 10, 2022 at 5:26 PM Martijn Visser <
>>> martijnvisser@apache.org> wrote:
>>> > >>
>>> > >> Hi Qingsheng,
>>> > >>
>>> > >> Do you have any idea what has happened in the process here? Do we
>>> know why
>>> > >> they were changed? I was under the impression that these metric
>>> names were
>>> > >> newly introduced due to the new interfaces and because it still
>>> depends on
>>> > >> each connector implementing these.
>>> > >>
>>> > >> Sidenote: metric names are not mentioned in the FLIP process as a
>>> public
>>> > >> API. Might make sense to have a separate follow-up to add that to
>>> the list
>>> > >> (I do think we should list them there).
>>> > >>
>>> > >> +1 for reverting this and make this change in Flink 1.16
>>> > >>
>>> > >> I'm not in favour of releasing a Flink 1.15.3 with this change: I
>>> think the
>>> > >> impact is too big for a patch version, especially given how long
>>> Flink 1.15
>>> > >> is already out there.
>>> > >>
>>> > >> Best regards,
>>> > >>
>>> > >> Martijn
>>> > >>
>>> > >> On Mon, Oct 10, 2022 at 11:13 AM Leonard Xu <xb...@gmail.com>
>>> wrote:
>>> > >>
>>> > >> > Thanks Qingsheng for starting this thread.
>>> > >> >
>>> > >> > +1 on reverting sink metric name and releasing 1.15.3 to fix this
>>> > >> > inconsistent behavior.
>>> > >> >
>>> > >> >
>>> > >> > Best,
>>> > >> > Leonard
>>> > >> >
>>> > >> >
>>> > >> >
>>> > >> >
>>> > >> >
>>> > >> > 2022年10月10日 下午3:06,Jark Wu <im...@gmail.com> 写道:
>>> > >> >
>>> > >> > Thanks for discovering this problem, Qingsheng!
>>> > >> >
>>> > >> > I'm also +1 for reverting the breaking changes.
>>> > >> >
>>> > >> > IIUC, currently, the behavior of "numXXXOut" metrics of the new
>>> and old
>>> > >> > sink is inconsistent.
>>> > >> > We have to break one of them to have consistent behavior. Sink V2
>>> is an
>>> > >> > evolving API which is just introduced in 1.15.
>>> > >> > I think it makes sense to break the unstable API instead of the
>>> stable API
>>> > >> > which many connectors and users depend on.
>>> > >> >
>>> > >> > Best,
>>> > >> > Jark
>>> > >> >
>>> > >> >
>>> > >> >
>>> > >> > On Mon, 10 Oct 2022 at 11:36, Jingsong Li <ji...@gmail.com>
>>> wrote:
>>> > >> >
>>> > >> >> Thanks for driving, Qingsheng.
>>> > >> >>
>>> > >> >> +1 for reverting sink metric name.
>>> > >> >>
>>> > >> >> We often forget that metric is also one of the important APIs.
>>> > >> >>
>>> > >> >> +1 for releasing 1.15.3 to fix this.
>>> > >> >>
>>> > >> >> Best,
>>> > >> >> Jingsong
>>> > >> >>
>>> > >> >> On Sun, Oct 9, 2022 at 11:35 PM Becket Qin <be...@gmail.com>
>>> wrote:
>>> > >> >> >
>>> > >> >> > Thanks for raising the discussion, Qingsheng,
>>> > >> >> >
>>> > >> >> > +1 on reverting the breaking changes.
>>> > >> >> >
>>> > >> >> > In addition, we might want to release a 1.15.3 to fix this and
>>> update
>>> > >> >> the previous release docs with this known issue, so that users
>>> can upgrade
>>> > >> >> to 1.15.3 when they hit it. It would also be good to add some
>>> backwards
>>> > >> >> compatibility tests on metrics to avoid unintended breaking
>>> changes like
>>> > >> >> this in the future.
>>> > >> >> >
>>> > >> >> > Thanks,
>>> > >> >> >
>>> > >> >> > Jiangjie (Becket) Qin
>>> > >> >> >
>>> > >> >> > On Sun, Oct 9, 2022 at 10:35 AM Qingsheng Ren <
>>> renqs@apache.org> wrote:
>>> > >> >> >>
>>> > >> >> >> Hi devs and users,
>>> > >> >> >>
>>> > >> >> >> I’d like to start a discussion about reverting a breaking
>>> change about
>>> > >> >> sink metrics made in 1.15 by FLINK-26126 [1] and FLINK-26492 [2].
>>> > >> >> >>
>>> > >> >> >> TL;DR
>>> > >> >> >>
>>> > >> >> >> All sink metrics with name “numXXXOut” defined in FLIP-33 are
>>> replace
>>> > >> >> by “numXXXSend” in FLINK-26126 and FLINK-26492. Considering
>>> metric names
>>> > >> >> are public APIs, this is a breaking change to end users and not
>>> backward
>>> > >> >> compatible. Also unfortunately this breaking change was not
>>> discussed in
>>> > >> >> the mailing list before.
>>> > >> >> >>
>>> > >> >> >> Background
>>> > >> >> >>
>>> > >> >> >> As defined previously in FLIP-33 (the FLIP page has been
>>> changed so
>>> > >> >> please refer to the old version [3] ), metric “numRecordsOut” is
>>> used for
>>> > >> >> reporting the total number of output records since the sink
>>> started (number
>>> > >> >> of records written to the external system), and similarly for
>>> > >> >> “numRecordsOutPerSecond”, “numBytesOut”, “numBytesOutPerSecond”
>>> and
>>> > >> >> “numRecordsOutError”. Most sinks are following this naming and
>>> definition.
>>> > >> >> However, these metrics are ambiguous in the new Sink API as
>>> “numXXXOut”
>>> > >> >> could be used by the output of SinkWriterOperator for reporting
>>> number of
>>> > >> >> Committables delivered to SinkCommitterOperator. In order to
>>> resolve the
>>> > >> >> conflict, FLINK-26126 and FLINK-26492 changed names of these
>>> metrics with
>>> > >> >> “numXXXSend”.
>>> > >> >> >>
>>> > >> >> >> Necessity of reverting this change
>>> > >> >> >>
>>> > >> >> >> - Metric names are actually public API, as end users need to
>>> configure
>>> > >> >> metric collecting and alerting system with metric names. Users
>>> have to
>>> > >> >> reset all configurations related to affected metrics.
>>> > >> >> >> - This could also affect custom and external sinks not
>>> maintained by
>>> > >> >> Flink, which might have implemented with numXXXOut metrics.
>>> > >> >> >> - The number of records sent to external system is way more
>>> important
>>> > >> >> than the number of Committables sent to SinkCommitterOperator,
>>> as the
>>> > >> >> latter one is just an internal implementation of sink. We could
>>> have a new
>>> > >> >> metric name for the latter one instead.
>>> > >> >> >> - We could avoid splitting the project by version (like “plz
>>> use
>>> > >> >> numXXXOut before 1.15 and use numXXXSend after”) if we revert it
>>> ASAP,
>>> > >> >> cosidering 1.16 is still not released for now.
>>> > >> >> >>
>>> > >> >> >> As a consequence, I’d like to hear from devs and users about
>>> your
>>> > >> >> opinion on changing these metrics back to “numXXXOut”.
>>> > >> >> >>
>>> > >> >> >> Looking forward to your reply!
>>> > >> >> >>
>>> > >> >> >> [1] https://issues.apache.org/jira/browse/FLINK-26126
>>> > >> >> >> [2] https://issues.apache.org/jira/browse/FLINK-26492
>>> > >> >> >> [1] FLIP-33, version 18:
>>> > >> >>
>>> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=211883136
>>> > >> >> >>
>>> > >> >> >> Best,
>>> > >> >> >> Qingsheng
>>> > >> >>
>>> > >> >
>>> > >> >
>>> > >
>>> > >
>>> > >
>>>
>>

Re: [DISCUSS] Reverting sink metric name changes made in 1.15

Posted by Qingsheng Ren <re...@apache.org>.
Hi Jing,

Thanks for the reply!

Let me rephrase my proposal: we’d like to use numXXXOut registered on
SinkWriterOperator to reflect the traffic to the external system for
compatibility with old versions before 1.15, and make numXXXSend have the
same value as numXXXOut for compatibility within 1.15. That means both
numXXXOut and numXXXSend are used for external data transfers, which end
users care more about. As for the internal traffic within the sink, we
could name a new metric for it because this is a _new_ feature in the _new_
sink, and end users usually don’t pay attention to internal implementation.
The name of the new metric could be discussed later after 1.16 release.

> but it might end up with monitoring unexpected metrics, which is even
worse for users, i.e. I didn't change anything, but something has been
broken since the last update.

Yeah this is exactly what we are trying to fix with this proposal. I
believe users are more concerned with the output to the external system
than the internal data delivery in the sink, so I think we’ll have more
cases reporting like “I set up a panel on numRecordsOut in sink to monitor
the output of the job, but after upgrading to 1.15 this value is extremely
low and I didn’t change anything” if we stick to the current situation. I
think only a few end users care about the number of committables sending to
downstream as most of them don’t care how the sink works.

We do need a re-design to fully distinguish the internal and external
traffic on metrics, not only in sink but in all operators as it’s quite
common for operators to make IO. This needs time to design, discuss, adjust
and vote, but considering this is blocking 1.16, maybe it’s better to
rescue the compatibility for now, and leave the huge reconstruction to
future versions (maybe 2.0).

Best,
Qingsheng

On Wed, Oct 12, 2022 at 7:21 PM Jing Ge <ji...@ververica.com> wrote:

> Hi Qingsheng,
>
> Just want to make sure we are on the same page. Are you suggesting
> switching the naming between "numXXXSend" and "numXXXOut" or reverting all
> the changes we did with FLINK-26126 and FLINK-26492?
>
> For the naming switch, please pay attention that the behaviour has been
> changed since we introduced SinkV2[1]. So, please be aware of different
> numbers(behaviour change) even with the same metrics name. Sticking with
> the old name with the new behaviour (very bad idea, IMHO) might seem like
> saving the effort in the first place, but it might end up with monitoring
> unexpected metrics, which is even worse for users, i.e. I didn't change
> anything, but something has been broken since the last update.
>
> For reverting, I am not sure how to fix the issue mentioned in FLINK-26126
> after reverting all changes. Like Chesnay has already pointed out, with
> SinkV2 we have two different output lines - one with the external system
> and the other with the downstream operator. In this case, "numXXXSend" is
> rather a new metric than a replacement of "numXXXOut". The "numXXXOut"
> metric can still be used, depending on what the user wants to monitor.
>
>
> Best regards,
> Jing
>
> [1]
> https://github.com/apache/flink/blob/51fc20db30d001a95de95b3b9993eeb06f558f6c/flink-metrics/flink-metrics-core/src/main/java/org/apache/flink/metrics/groups/SinkWriterMetricGroup.java#L48
>
>
> On Wed, Oct 12, 2022 at 12:48 PM Qingsheng Ren <re...@apache.org> wrote:
>
>> As a supplement, considering it could be a big reconstruction
>> redefining internal and external traffic and touching metric names in
>> almost all operators, this requires a lot of discussions and we might
>> do it finally in Flink 2.0. I think compatibility is a bigger blocker
>> in front of us, as the output of sink is a metric that users care a
>> lot about.
>>
>> Thanks,
>> Qingsheng
>>
>> On Wed, Oct 12, 2022 at 6:20 PM Qingsheng Ren <re...@apache.org> wrote:
>> >
>> > Thanks Chesnay for the reply. +1 for making a unified and clearer
>> > metric definition distinguishing internal and external data transfers.
>> > As you described, having IO in operators is quite common such as
>> > dimension tables in Table/SQL API. This definitely deserves a FLIP and
>> > an overall design.
>> >
>> > However I think it's necessary to change the metric back to
>> > numRecordsOut instead of sticking with numRecordsSend in 1.15 and
>> > 1.16. The most important argument is for compatibility as I mentioned
>> > in my previous email, otherwise all users have to modify their configs
>> > of metric systems after upgrading to Flink 1.15+, and all custom
>> > connectors have to change their implementations to migrate to the new
>> > metric name. I believe other ones participating and approving this
>> > proposal share the same concern about compatibility too. Also
>> > considering this issue is blocking the release of 1.16, maybe we could
>> > fix this asap, and as for defining a new metric for internal data
>> > transfers we can have an in-depth discussion later. WDYT?
>> >
>> > Best,
>> > Qingsheng
>> >
>> > On Tue, Oct 11, 2022 at 6:06 PM Chesnay Schepler <ch...@apache.org>
>> wrote:
>> > >
>> > > Currently I think that would be a mistake.
>> > >
>> > > Ultimately what we have here is the culmination of us never really
>> considering how the numRecordsOut metric should behave for operators that
>> emit data to other operators _and_ external systems. This goes beyond sinks.
>> > > This even applies to numRecordsIn, for cases where functions
>> query/write data from/to the outside, (e.g., Async IO).
>> > >
>> > > Having 2 separate metrics for that, 1 exclusively for internal data
>> transfers, and 1 exclusively for external data transfers, is the only way
>> to get a consistent metric definition in the long-run.
>> > > We can jump back-and-forth now or just commit to it.
>> > >
>> > > I don't think we can really judge this based on FLIP-33. It was IIRC
>> written before the two phase sinks were added, which heavily blurred the
>> lines of what a sink even is. Because it definitely is _not_ the last
>> operator in a chain anymore.
>> > >
>> > > What I would suggest is to stick with what we got (although I despise
>> the name numRecordsSend), and alias the numRecordsOut metric for all
>> non-TwoPhaseCommittingSink.
>> > >
>> > > On 11/10/2022 05:54, Qingsheng Ren wrote:
>> > >
>> > > Thanks for the details Chesnay!
>> > >
>> > > By “alias” I mean to respect the original definition made in FLIP-33
>> for numRecordsOut, which is the number of records written to the external
>> system, and keep numRecordsSend as the same value as numRecordsOut for
>> compatibility.
>> > >
>> > > I think keeping numRecordsOut for the output to the external system
>> is more intuitive to end users because in most cases the metric of data
>> flow output is more essential. I agree with you that a new metric is
>> required, but considering compatibility and users’ intuition I prefer to
>> keep the initial definition of numRecordsOut in FLIP-33 and name a new
>> metric for sink writer’s output to downstream operators. This might be
>> against consistency with metrics in other operators in Flink but maybe it’s
>> acceptable to have the sink as a special case.
>> > >
>> > > Best,
>> > > Qingsheng
>> > > On Oct 10, 2022, 19:13 +0800, Chesnay Schepler <ch...@apache.org>,
>> wrote:
>> > >
>> > > > I’m with Xintong’s idea to treat numXXXSend as an alias of numXXXOut
>> > >
>> > > But that's not possible. If it were that simple there would have
>> never been a need to introduce another metric in the first place.
>> > >
>> > > It's a rather fundamental issue with how the new sinks work, in that
>> they emit data to the external system (usually considered as
>> "numRecordsOut" of sinks) while _also_ sending data to a downstream
>> operator (usually considered as "numRecordsOut" of tasks).
>> > > The original issue was that the numRecordsOut of the sink counted
>> both (which is completely wrong).
>> > >
>> > > A new metric was always required; otherwise you inevitably end up
>> breaking some semantic.
>> > > Adding a new metric for what the sink writes to the external system
>> is, for better or worse, more consistent with how these metrics usually
>> work in Flink.
>> > >
>> > > On 10/10/2022 12:45, Qingsheng Ren wrote:
>> > >
>> > > Thanks everyone for joining the discussion!
>> > >
>> > > > Do you have any idea what has happened in the process here?
>> > >
>> > > The discussion in this PR [1] shows some details and could be helpful
>> to understand the original motivation of the renaming. We do have a test
>> case for guarding metrics but unfortunaly the case was also modified so the
>> defense was broken.
>> > >
>> > > I think the reason why both the developer and the reviewer forgot to
>> trigger an discussion and gave a green pass on the change is that metrics
>> are quite “trivial” to be noticed as public APIs. As mentioned by Martijn I
>> couldn’t find a place noting that metrics are public APIs and should be
>> treated carefully while contributing and reviewing.
>> > >
>> > > IMHO three actions could be made to prevent this kind of changes in
>> the future:
>> > >
>> > > a. Add test case for metrics (which we already have in
>> SinkMetricsITCase)
>> > > b. We emphasize that any public-interface breaking changes should be
>> proposed by a FLIP or discussed in mailing list, and should be listed in
>> the release note.
>> > > c. We remind contributors and reviewers about what should be
>> considered as public API, and include metric names in it.
>> > >
>> > > For b and c these two pages [2][3] might be proper places.
>> > >
>> > > About the patch to revert this, it looks like we have a consensus on
>> 1.16. As of 1.15 I think it’s worthy to trigger a minor version. I didn’t
>> see complaints about this for now so it should be OK to save the situation
>> asap. I’m with Xintong’s idea to treat numXXXSend as an alias of numXXXOut
>> considering there could possibly some users have already adapted their
>> system to the new naming, and have another internal metric for reflecting
>> number of outgoing committable batches (actually the numRecordsIn of sink
>> committer operator should be carrying this info already).
>> > >
>> > > [1] https://github.com/apache/flink/pull/18825
>> > > [2] https://flink.apache.org/contributing/contribute-code.html
>> > > [3] https://flink.apache.org/contributing/reviewing-prs.html
>> > >
>> > > Best,
>> > > Qingsheng
>> > > On Oct 10, 2022, 17:40 +0800, Xintong Song <to...@gmail.com>,
>> wrote:
>> > >
>> > > +1 for reverting these changes in Flink 1.16.
>> > >
>> > > For 1.15.3, can we make these metrics available via both names
>> (numXXXOut and numXXXSend)? In this way we don't break it for those who
>> already migrated to 1.15 and numXXXSend. That means we still need to change
>> SinkWriterOperator to use another metric name in 1.15.3, which IIUC is
>> internal to Flink sink.
>> > >
>> > > I'm overall +1 to change numXXXOut back to its original semantics.
>> AFAIK (from meetup / flink-forward questionaires), most users do not
>> migrate to a new Flink release immediately, until the next 1-2 major
>> releases are out.
>> > >
>> > > Best,
>> > >
>> > > Xintong
>> > >
>> > >
>> > >
>> > > On Mon, Oct 10, 2022 at 5:26 PM Martijn Visser <
>> martijnvisser@apache.org> wrote:
>> > >>
>> > >> Hi Qingsheng,
>> > >>
>> > >> Do you have any idea what has happened in the process here? Do we
>> know why
>> > >> they were changed? I was under the impression that these metric
>> names were
>> > >> newly introduced due to the new interfaces and because it still
>> depends on
>> > >> each connector implementing these.
>> > >>
>> > >> Sidenote: metric names are not mentioned in the FLIP process as a
>> public
>> > >> API. Might make sense to have a separate follow-up to add that to
>> the list
>> > >> (I do think we should list them there).
>> > >>
>> > >> +1 for reverting this and make this change in Flink 1.16
>> > >>
>> > >> I'm not in favour of releasing a Flink 1.15.3 with this change: I
>> think the
>> > >> impact is too big for a patch version, especially given how long
>> Flink 1.15
>> > >> is already out there.
>> > >>
>> > >> Best regards,
>> > >>
>> > >> Martijn
>> > >>
>> > >> On Mon, Oct 10, 2022 at 11:13 AM Leonard Xu <xb...@gmail.com>
>> wrote:
>> > >>
>> > >> > Thanks Qingsheng for starting this thread.
>> > >> >
>> > >> > +1 on reverting sink metric name and releasing 1.15.3 to fix this
>> > >> > inconsistent behavior.
>> > >> >
>> > >> >
>> > >> > Best,
>> > >> > Leonard
>> > >> >
>> > >> >
>> > >> >
>> > >> >
>> > >> >
>> > >> > 2022年10月10日 下午3:06,Jark Wu <im...@gmail.com> 写道:
>> > >> >
>> > >> > Thanks for discovering this problem, Qingsheng!
>> > >> >
>> > >> > I'm also +1 for reverting the breaking changes.
>> > >> >
>> > >> > IIUC, currently, the behavior of "numXXXOut" metrics of the new
>> and old
>> > >> > sink is inconsistent.
>> > >> > We have to break one of them to have consistent behavior. Sink V2
>> is an
>> > >> > evolving API which is just introduced in 1.15.
>> > >> > I think it makes sense to break the unstable API instead of the
>> stable API
>> > >> > which many connectors and users depend on.
>> > >> >
>> > >> > Best,
>> > >> > Jark
>> > >> >
>> > >> >
>> > >> >
>> > >> > On Mon, 10 Oct 2022 at 11:36, Jingsong Li <ji...@gmail.com>
>> wrote:
>> > >> >
>> > >> >> Thanks for driving, Qingsheng.
>> > >> >>
>> > >> >> +1 for reverting sink metric name.
>> > >> >>
>> > >> >> We often forget that metric is also one of the important APIs.
>> > >> >>
>> > >> >> +1 for releasing 1.15.3 to fix this.
>> > >> >>
>> > >> >> Best,
>> > >> >> Jingsong
>> > >> >>
>> > >> >> On Sun, Oct 9, 2022 at 11:35 PM Becket Qin <be...@gmail.com>
>> wrote:
>> > >> >> >
>> > >> >> > Thanks for raising the discussion, Qingsheng,
>> > >> >> >
>> > >> >> > +1 on reverting the breaking changes.
>> > >> >> >
>> > >> >> > In addition, we might want to release a 1.15.3 to fix this and
>> update
>> > >> >> the previous release docs with this known issue, so that users
>> can upgrade
>> > >> >> to 1.15.3 when they hit it. It would also be good to add some
>> backwards
>> > >> >> compatibility tests on metrics to avoid unintended breaking
>> changes like
>> > >> >> this in the future.
>> > >> >> >
>> > >> >> > Thanks,
>> > >> >> >
>> > >> >> > Jiangjie (Becket) Qin
>> > >> >> >
>> > >> >> > On Sun, Oct 9, 2022 at 10:35 AM Qingsheng Ren <re...@apache.org>
>> wrote:
>> > >> >> >>
>> > >> >> >> Hi devs and users,
>> > >> >> >>
>> > >> >> >> I’d like to start a discussion about reverting a breaking
>> change about
>> > >> >> sink metrics made in 1.15 by FLINK-26126 [1] and FLINK-26492 [2].
>> > >> >> >>
>> > >> >> >> TL;DR
>> > >> >> >>
>> > >> >> >> All sink metrics with name “numXXXOut” defined in FLIP-33 are
>> replace
>> > >> >> by “numXXXSend” in FLINK-26126 and FLINK-26492. Considering
>> metric names
>> > >> >> are public APIs, this is a breaking change to end users and not
>> backward
>> > >> >> compatible. Also unfortunately this breaking change was not
>> discussed in
>> > >> >> the mailing list before.
>> > >> >> >>
>> > >> >> >> Background
>> > >> >> >>
>> > >> >> >> As defined previously in FLIP-33 (the FLIP page has been
>> changed so
>> > >> >> please refer to the old version [3] ), metric “numRecordsOut” is
>> used for
>> > >> >> reporting the total number of output records since the sink
>> started (number
>> > >> >> of records written to the external system), and similarly for
>> > >> >> “numRecordsOutPerSecond”, “numBytesOut”, “numBytesOutPerSecond”
>> and
>> > >> >> “numRecordsOutError”. Most sinks are following this naming and
>> definition.
>> > >> >> However, these metrics are ambiguous in the new Sink API as
>> “numXXXOut”
>> > >> >> could be used by the output of SinkWriterOperator for reporting
>> number of
>> > >> >> Committables delivered to SinkCommitterOperator. In order to
>> resolve the
>> > >> >> conflict, FLINK-26126 and FLINK-26492 changed names of these
>> metrics with
>> > >> >> “numXXXSend”.
>> > >> >> >>
>> > >> >> >> Necessity of reverting this change
>> > >> >> >>
>> > >> >> >> - Metric names are actually public API, as end users need to
>> configure
>> > >> >> metric collecting and alerting system with metric names. Users
>> have to
>> > >> >> reset all configurations related to affected metrics.
>> > >> >> >> - This could also affect custom and external sinks not
>> maintained by
>> > >> >> Flink, which might have implemented with numXXXOut metrics.
>> > >> >> >> - The number of records sent to external system is way more
>> important
>> > >> >> than the number of Committables sent to SinkCommitterOperator, as
>> the
>> > >> >> latter one is just an internal implementation of sink. We could
>> have a new
>> > >> >> metric name for the latter one instead.
>> > >> >> >> - We could avoid splitting the project by version (like “plz
>> use
>> > >> >> numXXXOut before 1.15 and use numXXXSend after”) if we revert it
>> ASAP,
>> > >> >> cosidering 1.16 is still not released for now.
>> > >> >> >>
>> > >> >> >> As a consequence, I’d like to hear from devs and users about
>> your
>> > >> >> opinion on changing these metrics back to “numXXXOut”.
>> > >> >> >>
>> > >> >> >> Looking forward to your reply!
>> > >> >> >>
>> > >> >> >> [1] https://issues.apache.org/jira/browse/FLINK-26126
>> > >> >> >> [2] https://issues.apache.org/jira/browse/FLINK-26492
>> > >> >> >> [1] FLIP-33, version 18:
>> > >> >>
>> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=211883136
>> > >> >> >>
>> > >> >> >> Best,
>> > >> >> >> Qingsheng
>> > >> >>
>> > >> >
>> > >> >
>> > >
>> > >
>> > >
>>
>

Re: [DISCUSS] Reverting sink metric name changes made in 1.15

Posted by Qingsheng Ren <re...@apache.org>.
Hi Jing,

Thanks for the reply!

Let me rephrase my proposal: we’d like to use numXXXOut registered on
SinkWriterOperator to reflect the traffic to the external system for
compatibility with old versions before 1.15, and make numXXXSend have the
same value as numXXXOut for compatibility within 1.15. That means both
numXXXOut and numXXXSend are used for external data transfers, which end
users care more about. As for the internal traffic within the sink, we
could name a new metric for it because this is a _new_ feature in the _new_
sink, and end users usually don’t pay attention to internal implementation.
The name of the new metric could be discussed later after 1.16 release.

> but it might end up with monitoring unexpected metrics, which is even
worse for users, i.e. I didn't change anything, but something has been
broken since the last update.

Yeah this is exactly what we are trying to fix with this proposal. I
believe users are more concerned with the output to the external system
than the internal data delivery in the sink, so I think we’ll have more
cases reporting like “I set up a panel on numRecordsOut in sink to monitor
the output of the job, but after upgrading to 1.15 this value is extremely
low and I didn’t change anything” if we stick to the current situation. I
think only a few end users care about the number of committables sending to
downstream as most of them don’t care how the sink works.

We do need a re-design to fully distinguish the internal and external
traffic on metrics, not only in sink but in all operators as it’s quite
common for operators to make IO. This needs time to design, discuss, adjust
and vote, but considering this is blocking 1.16, maybe it’s better to
rescue the compatibility for now, and leave the huge reconstruction to
future versions (maybe 2.0).

Best,
Qingsheng

On Wed, Oct 12, 2022 at 7:21 PM Jing Ge <ji...@ververica.com> wrote:

> Hi Qingsheng,
>
> Just want to make sure we are on the same page. Are you suggesting
> switching the naming between "numXXXSend" and "numXXXOut" or reverting all
> the changes we did with FLINK-26126 and FLINK-26492?
>
> For the naming switch, please pay attention that the behaviour has been
> changed since we introduced SinkV2[1]. So, please be aware of different
> numbers(behaviour change) even with the same metrics name. Sticking with
> the old name with the new behaviour (very bad idea, IMHO) might seem like
> saving the effort in the first place, but it might end up with monitoring
> unexpected metrics, which is even worse for users, i.e. I didn't change
> anything, but something has been broken since the last update.
>
> For reverting, I am not sure how to fix the issue mentioned in FLINK-26126
> after reverting all changes. Like Chesnay has already pointed out, with
> SinkV2 we have two different output lines - one with the external system
> and the other with the downstream operator. In this case, "numXXXSend" is
> rather a new metric than a replacement of "numXXXOut". The "numXXXOut"
> metric can still be used, depending on what the user wants to monitor.
>
>
> Best regards,
> Jing
>
> [1]
> https://github.com/apache/flink/blob/51fc20db30d001a95de95b3b9993eeb06f558f6c/flink-metrics/flink-metrics-core/src/main/java/org/apache/flink/metrics/groups/SinkWriterMetricGroup.java#L48
>
>
> On Wed, Oct 12, 2022 at 12:48 PM Qingsheng Ren <re...@apache.org> wrote:
>
>> As a supplement, considering it could be a big reconstruction
>> redefining internal and external traffic and touching metric names in
>> almost all operators, this requires a lot of discussions and we might
>> do it finally in Flink 2.0. I think compatibility is a bigger blocker
>> in front of us, as the output of sink is a metric that users care a
>> lot about.
>>
>> Thanks,
>> Qingsheng
>>
>> On Wed, Oct 12, 2022 at 6:20 PM Qingsheng Ren <re...@apache.org> wrote:
>> >
>> > Thanks Chesnay for the reply. +1 for making a unified and clearer
>> > metric definition distinguishing internal and external data transfers.
>> > As you described, having IO in operators is quite common such as
>> > dimension tables in Table/SQL API. This definitely deserves a FLIP and
>> > an overall design.
>> >
>> > However I think it's necessary to change the metric back to
>> > numRecordsOut instead of sticking with numRecordsSend in 1.15 and
>> > 1.16. The most important argument is for compatibility as I mentioned
>> > in my previous email, otherwise all users have to modify their configs
>> > of metric systems after upgrading to Flink 1.15+, and all custom
>> > connectors have to change their implementations to migrate to the new
>> > metric name. I believe other ones participating and approving this
>> > proposal share the same concern about compatibility too. Also
>> > considering this issue is blocking the release of 1.16, maybe we could
>> > fix this asap, and as for defining a new metric for internal data
>> > transfers we can have an in-depth discussion later. WDYT?
>> >
>> > Best,
>> > Qingsheng
>> >
>> > On Tue, Oct 11, 2022 at 6:06 PM Chesnay Schepler <ch...@apache.org>
>> wrote:
>> > >
>> > > Currently I think that would be a mistake.
>> > >
>> > > Ultimately what we have here is the culmination of us never really
>> considering how the numRecordsOut metric should behave for operators that
>> emit data to other operators _and_ external systems. This goes beyond sinks.
>> > > This even applies to numRecordsIn, for cases where functions
>> query/write data from/to the outside, (e.g., Async IO).
>> > >
>> > > Having 2 separate metrics for that, 1 exclusively for internal data
>> transfers, and 1 exclusively for external data transfers, is the only way
>> to get a consistent metric definition in the long-run.
>> > > We can jump back-and-forth now or just commit to it.
>> > >
>> > > I don't think we can really judge this based on FLIP-33. It was IIRC
>> written before the two phase sinks were added, which heavily blurred the
>> lines of what a sink even is. Because it definitely is _not_ the last
>> operator in a chain anymore.
>> > >
>> > > What I would suggest is to stick with what we got (although I despise
>> the name numRecordsSend), and alias the numRecordsOut metric for all
>> non-TwoPhaseCommittingSink.
>> > >
>> > > On 11/10/2022 05:54, Qingsheng Ren wrote:
>> > >
>> > > Thanks for the details Chesnay!
>> > >
>> > > By “alias” I mean to respect the original definition made in FLIP-33
>> for numRecordsOut, which is the number of records written to the external
>> system, and keep numRecordsSend as the same value as numRecordsOut for
>> compatibility.
>> > >
>> > > I think keeping numRecordsOut for the output to the external system
>> is more intuitive to end users because in most cases the metric of data
>> flow output is more essential. I agree with you that a new metric is
>> required, but considering compatibility and users’ intuition I prefer to
>> keep the initial definition of numRecordsOut in FLIP-33 and name a new
>> metric for sink writer’s output to downstream operators. This might be
>> against consistency with metrics in other operators in Flink but maybe it’s
>> acceptable to have the sink as a special case.
>> > >
>> > > Best,
>> > > Qingsheng
>> > > On Oct 10, 2022, 19:13 +0800, Chesnay Schepler <ch...@apache.org>,
>> wrote:
>> > >
>> > > > I’m with Xintong’s idea to treat numXXXSend as an alias of numXXXOut
>> > >
>> > > But that's not possible. If it were that simple there would have
>> never been a need to introduce another metric in the first place.
>> > >
>> > > It's a rather fundamental issue with how the new sinks work, in that
>> they emit data to the external system (usually considered as
>> "numRecordsOut" of sinks) while _also_ sending data to a downstream
>> operator (usually considered as "numRecordsOut" of tasks).
>> > > The original issue was that the numRecordsOut of the sink counted
>> both (which is completely wrong).
>> > >
>> > > A new metric was always required; otherwise you inevitably end up
>> breaking some semantic.
>> > > Adding a new metric for what the sink writes to the external system
>> is, for better or worse, more consistent with how these metrics usually
>> work in Flink.
>> > >
>> > > On 10/10/2022 12:45, Qingsheng Ren wrote:
>> > >
>> > > Thanks everyone for joining the discussion!
>> > >
>> > > > Do you have any idea what has happened in the process here?
>> > >
>> > > The discussion in this PR [1] shows some details and could be helpful
>> to understand the original motivation of the renaming. We do have a test
>> case for guarding metrics but unfortunaly the case was also modified so the
>> defense was broken.
>> > >
>> > > I think the reason why both the developer and the reviewer forgot to
>> trigger an discussion and gave a green pass on the change is that metrics
>> are quite “trivial” to be noticed as public APIs. As mentioned by Martijn I
>> couldn’t find a place noting that metrics are public APIs and should be
>> treated carefully while contributing and reviewing.
>> > >
>> > > IMHO three actions could be made to prevent this kind of changes in
>> the future:
>> > >
>> > > a. Add test case for metrics (which we already have in
>> SinkMetricsITCase)
>> > > b. We emphasize that any public-interface breaking changes should be
>> proposed by a FLIP or discussed in mailing list, and should be listed in
>> the release note.
>> > > c. We remind contributors and reviewers about what should be
>> considered as public API, and include metric names in it.
>> > >
>> > > For b and c these two pages [2][3] might be proper places.
>> > >
>> > > About the patch to revert this, it looks like we have a consensus on
>> 1.16. As of 1.15 I think it’s worthy to trigger a minor version. I didn’t
>> see complaints about this for now so it should be OK to save the situation
>> asap. I’m with Xintong’s idea to treat numXXXSend as an alias of numXXXOut
>> considering there could possibly some users have already adapted their
>> system to the new naming, and have another internal metric for reflecting
>> number of outgoing committable batches (actually the numRecordsIn of sink
>> committer operator should be carrying this info already).
>> > >
>> > > [1] https://github.com/apache/flink/pull/18825
>> > > [2] https://flink.apache.org/contributing/contribute-code.html
>> > > [3] https://flink.apache.org/contributing/reviewing-prs.html
>> > >
>> > > Best,
>> > > Qingsheng
>> > > On Oct 10, 2022, 17:40 +0800, Xintong Song <to...@gmail.com>,
>> wrote:
>> > >
>> > > +1 for reverting these changes in Flink 1.16.
>> > >
>> > > For 1.15.3, can we make these metrics available via both names
>> (numXXXOut and numXXXSend)? In this way we don't break it for those who
>> already migrated to 1.15 and numXXXSend. That means we still need to change
>> SinkWriterOperator to use another metric name in 1.15.3, which IIUC is
>> internal to Flink sink.
>> > >
>> > > I'm overall +1 to change numXXXOut back to its original semantics.
>> AFAIK (from meetup / flink-forward questionaires), most users do not
>> migrate to a new Flink release immediately, until the next 1-2 major
>> releases are out.
>> > >
>> > > Best,
>> > >
>> > > Xintong
>> > >
>> > >
>> > >
>> > > On Mon, Oct 10, 2022 at 5:26 PM Martijn Visser <
>> martijnvisser@apache.org> wrote:
>> > >>
>> > >> Hi Qingsheng,
>> > >>
>> > >> Do you have any idea what has happened in the process here? Do we
>> know why
>> > >> they were changed? I was under the impression that these metric
>> names were
>> > >> newly introduced due to the new interfaces and because it still
>> depends on
>> > >> each connector implementing these.
>> > >>
>> > >> Sidenote: metric names are not mentioned in the FLIP process as a
>> public
>> > >> API. Might make sense to have a separate follow-up to add that to
>> the list
>> > >> (I do think we should list them there).
>> > >>
>> > >> +1 for reverting this and make this change in Flink 1.16
>> > >>
>> > >> I'm not in favour of releasing a Flink 1.15.3 with this change: I
>> think the
>> > >> impact is too big for a patch version, especially given how long
>> Flink 1.15
>> > >> is already out there.
>> > >>
>> > >> Best regards,
>> > >>
>> > >> Martijn
>> > >>
>> > >> On Mon, Oct 10, 2022 at 11:13 AM Leonard Xu <xb...@gmail.com>
>> wrote:
>> > >>
>> > >> > Thanks Qingsheng for starting this thread.
>> > >> >
>> > >> > +1 on reverting sink metric name and releasing 1.15.3 to fix this
>> > >> > inconsistent behavior.
>> > >> >
>> > >> >
>> > >> > Best,
>> > >> > Leonard
>> > >> >
>> > >> >
>> > >> >
>> > >> >
>> > >> >
>> > >> > 2022年10月10日 下午3:06,Jark Wu <im...@gmail.com> 写道:
>> > >> >
>> > >> > Thanks for discovering this problem, Qingsheng!
>> > >> >
>> > >> > I'm also +1 for reverting the breaking changes.
>> > >> >
>> > >> > IIUC, currently, the behavior of "numXXXOut" metrics of the new
>> and old
>> > >> > sink is inconsistent.
>> > >> > We have to break one of them to have consistent behavior. Sink V2
>> is an
>> > >> > evolving API which is just introduced in 1.15.
>> > >> > I think it makes sense to break the unstable API instead of the
>> stable API
>> > >> > which many connectors and users depend on.
>> > >> >
>> > >> > Best,
>> > >> > Jark
>> > >> >
>> > >> >
>> > >> >
>> > >> > On Mon, 10 Oct 2022 at 11:36, Jingsong Li <ji...@gmail.com>
>> wrote:
>> > >> >
>> > >> >> Thanks for driving, Qingsheng.
>> > >> >>
>> > >> >> +1 for reverting sink metric name.
>> > >> >>
>> > >> >> We often forget that metric is also one of the important APIs.
>> > >> >>
>> > >> >> +1 for releasing 1.15.3 to fix this.
>> > >> >>
>> > >> >> Best,
>> > >> >> Jingsong
>> > >> >>
>> > >> >> On Sun, Oct 9, 2022 at 11:35 PM Becket Qin <be...@gmail.com>
>> wrote:
>> > >> >> >
>> > >> >> > Thanks for raising the discussion, Qingsheng,
>> > >> >> >
>> > >> >> > +1 on reverting the breaking changes.
>> > >> >> >
>> > >> >> > In addition, we might want to release a 1.15.3 to fix this and
>> update
>> > >> >> the previous release docs with this known issue, so that users
>> can upgrade
>> > >> >> to 1.15.3 when they hit it. It would also be good to add some
>> backwards
>> > >> >> compatibility tests on metrics to avoid unintended breaking
>> changes like
>> > >> >> this in the future.
>> > >> >> >
>> > >> >> > Thanks,
>> > >> >> >
>> > >> >> > Jiangjie (Becket) Qin
>> > >> >> >
>> > >> >> > On Sun, Oct 9, 2022 at 10:35 AM Qingsheng Ren <re...@apache.org>
>> wrote:
>> > >> >> >>
>> > >> >> >> Hi devs and users,
>> > >> >> >>
>> > >> >> >> I’d like to start a discussion about reverting a breaking
>> change about
>> > >> >> sink metrics made in 1.15 by FLINK-26126 [1] and FLINK-26492 [2].
>> > >> >> >>
>> > >> >> >> TL;DR
>> > >> >> >>
>> > >> >> >> All sink metrics with name “numXXXOut” defined in FLIP-33 are
>> replace
>> > >> >> by “numXXXSend” in FLINK-26126 and FLINK-26492. Considering
>> metric names
>> > >> >> are public APIs, this is a breaking change to end users and not
>> backward
>> > >> >> compatible. Also unfortunately this breaking change was not
>> discussed in
>> > >> >> the mailing list before.
>> > >> >> >>
>> > >> >> >> Background
>> > >> >> >>
>> > >> >> >> As defined previously in FLIP-33 (the FLIP page has been
>> changed so
>> > >> >> please refer to the old version [3] ), metric “numRecordsOut” is
>> used for
>> > >> >> reporting the total number of output records since the sink
>> started (number
>> > >> >> of records written to the external system), and similarly for
>> > >> >> “numRecordsOutPerSecond”, “numBytesOut”, “numBytesOutPerSecond”
>> and
>> > >> >> “numRecordsOutError”. Most sinks are following this naming and
>> definition.
>> > >> >> However, these metrics are ambiguous in the new Sink API as
>> “numXXXOut”
>> > >> >> could be used by the output of SinkWriterOperator for reporting
>> number of
>> > >> >> Committables delivered to SinkCommitterOperator. In order to
>> resolve the
>> > >> >> conflict, FLINK-26126 and FLINK-26492 changed names of these
>> metrics with
>> > >> >> “numXXXSend”.
>> > >> >> >>
>> > >> >> >> Necessity of reverting this change
>> > >> >> >>
>> > >> >> >> - Metric names are actually public API, as end users need to
>> configure
>> > >> >> metric collecting and alerting system with metric names. Users
>> have to
>> > >> >> reset all configurations related to affected metrics.
>> > >> >> >> - This could also affect custom and external sinks not
>> maintained by
>> > >> >> Flink, which might have implemented with numXXXOut metrics.
>> > >> >> >> - The number of records sent to external system is way more
>> important
>> > >> >> than the number of Committables sent to SinkCommitterOperator, as
>> the
>> > >> >> latter one is just an internal implementation of sink. We could
>> have a new
>> > >> >> metric name for the latter one instead.
>> > >> >> >> - We could avoid splitting the project by version (like “plz
>> use
>> > >> >> numXXXOut before 1.15 and use numXXXSend after”) if we revert it
>> ASAP,
>> > >> >> cosidering 1.16 is still not released for now.
>> > >> >> >>
>> > >> >> >> As a consequence, I’d like to hear from devs and users about
>> your
>> > >> >> opinion on changing these metrics back to “numXXXOut”.
>> > >> >> >>
>> > >> >> >> Looking forward to your reply!
>> > >> >> >>
>> > >> >> >> [1] https://issues.apache.org/jira/browse/FLINK-26126
>> > >> >> >> [2] https://issues.apache.org/jira/browse/FLINK-26492
>> > >> >> >> [1] FLIP-33, version 18:
>> > >> >>
>> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=211883136
>> > >> >> >>
>> > >> >> >> Best,
>> > >> >> >> Qingsheng
>> > >> >>
>> > >> >
>> > >> >
>> > >
>> > >
>> > >
>>
>

Re: [DISCUSS] Reverting sink metric name changes made in 1.15

Posted by Jing Ge <ji...@ververica.com>.
Hi Qingsheng,

Just want to make sure we are on the same page. Are you suggesting
switching the naming between "numXXXSend" and "numXXXOut" or reverting all
the changes we did with FLINK-26126 and FLINK-26492?

For the naming switch, please pay attention that the behaviour has been
changed since we introduced SinkV2[1]. So, please be aware of different
numbers(behaviour change) even with the same metrics name. Sticking with
the old name with the new behaviour (very bad idea, IMHO) might seem like
saving the effort in the first place, but it might end up with monitoring
unexpected metrics, which is even worse for users, i.e. I didn't change
anything, but something has been broken since the last update.

For reverting, I am not sure how to fix the issue mentioned in FLINK-26126
after reverting all changes. Like Chesnay has already pointed out, with
SinkV2 we have two different output lines - one with the external system
and the other with the downstream operator. In this case, "numXXXSend" is
rather a new metric than a replacement of "numXXXOut". The "numXXXOut"
metric can still be used, depending on what the user wants to monitor.


Best regards,
Jing

[1]
https://github.com/apache/flink/blob/51fc20db30d001a95de95b3b9993eeb06f558f6c/flink-metrics/flink-metrics-core/src/main/java/org/apache/flink/metrics/groups/SinkWriterMetricGroup.java#L48


On Wed, Oct 12, 2022 at 12:48 PM Qingsheng Ren <re...@apache.org> wrote:

> As a supplement, considering it could be a big reconstruction
> redefining internal and external traffic and touching metric names in
> almost all operators, this requires a lot of discussions and we might
> do it finally in Flink 2.0. I think compatibility is a bigger blocker
> in front of us, as the output of sink is a metric that users care a
> lot about.
>
> Thanks,
> Qingsheng
>
> On Wed, Oct 12, 2022 at 6:20 PM Qingsheng Ren <re...@apache.org> wrote:
> >
> > Thanks Chesnay for the reply. +1 for making a unified and clearer
> > metric definition distinguishing internal and external data transfers.
> > As you described, having IO in operators is quite common such as
> > dimension tables in Table/SQL API. This definitely deserves a FLIP and
> > an overall design.
> >
> > However I think it's necessary to change the metric back to
> > numRecordsOut instead of sticking with numRecordsSend in 1.15 and
> > 1.16. The most important argument is for compatibility as I mentioned
> > in my previous email, otherwise all users have to modify their configs
> > of metric systems after upgrading to Flink 1.15+, and all custom
> > connectors have to change their implementations to migrate to the new
> > metric name. I believe other ones participating and approving this
> > proposal share the same concern about compatibility too. Also
> > considering this issue is blocking the release of 1.16, maybe we could
> > fix this asap, and as for defining a new metric for internal data
> > transfers we can have an in-depth discussion later. WDYT?
> >
> > Best,
> > Qingsheng
> >
> > On Tue, Oct 11, 2022 at 6:06 PM Chesnay Schepler <ch...@apache.org>
> wrote:
> > >
> > > Currently I think that would be a mistake.
> > >
> > > Ultimately what we have here is the culmination of us never really
> considering how the numRecordsOut metric should behave for operators that
> emit data to other operators _and_ external systems. This goes beyond sinks.
> > > This even applies to numRecordsIn, for cases where functions
> query/write data from/to the outside, (e.g., Async IO).
> > >
> > > Having 2 separate metrics for that, 1 exclusively for internal data
> transfers, and 1 exclusively for external data transfers, is the only way
> to get a consistent metric definition in the long-run.
> > > We can jump back-and-forth now or just commit to it.
> > >
> > > I don't think we can really judge this based on FLIP-33. It was IIRC
> written before the two phase sinks were added, which heavily blurred the
> lines of what a sink even is. Because it definitely is _not_ the last
> operator in a chain anymore.
> > >
> > > What I would suggest is to stick with what we got (although I despise
> the name numRecordsSend), and alias the numRecordsOut metric for all
> non-TwoPhaseCommittingSink.
> > >
> > > On 11/10/2022 05:54, Qingsheng Ren wrote:
> > >
> > > Thanks for the details Chesnay!
> > >
> > > By “alias” I mean to respect the original definition made in FLIP-33
> for numRecordsOut, which is the number of records written to the external
> system, and keep numRecordsSend as the same value as numRecordsOut for
> compatibility.
> > >
> > > I think keeping numRecordsOut for the output to the external system is
> more intuitive to end users because in most cases the metric of data flow
> output is more essential. I agree with you that a new metric is required,
> but considering compatibility and users’ intuition I prefer to keep the
> initial definition of numRecordsOut in FLIP-33 and name a new metric for
> sink writer’s output to downstream operators. This might be against
> consistency with metrics in other operators in Flink but maybe it’s
> acceptable to have the sink as a special case.
> > >
> > > Best,
> > > Qingsheng
> > > On Oct 10, 2022, 19:13 +0800, Chesnay Schepler <ch...@apache.org>,
> wrote:
> > >
> > > > I’m with Xintong’s idea to treat numXXXSend as an alias of numXXXOut
> > >
> > > But that's not possible. If it were that simple there would have never
> been a need to introduce another metric in the first place.
> > >
> > > It's a rather fundamental issue with how the new sinks work, in that
> they emit data to the external system (usually considered as
> "numRecordsOut" of sinks) while _also_ sending data to a downstream
> operator (usually considered as "numRecordsOut" of tasks).
> > > The original issue was that the numRecordsOut of the sink counted both
> (which is completely wrong).
> > >
> > > A new metric was always required; otherwise you inevitably end up
> breaking some semantic.
> > > Adding a new metric for what the sink writes to the external system
> is, for better or worse, more consistent with how these metrics usually
> work in Flink.
> > >
> > > On 10/10/2022 12:45, Qingsheng Ren wrote:
> > >
> > > Thanks everyone for joining the discussion!
> > >
> > > > Do you have any idea what has happened in the process here?
> > >
> > > The discussion in this PR [1] shows some details and could be helpful
> to understand the original motivation of the renaming. We do have a test
> case for guarding metrics but unfortunaly the case was also modified so the
> defense was broken.
> > >
> > > I think the reason why both the developer and the reviewer forgot to
> trigger an discussion and gave a green pass on the change is that metrics
> are quite “trivial” to be noticed as public APIs. As mentioned by Martijn I
> couldn’t find a place noting that metrics are public APIs and should be
> treated carefully while contributing and reviewing.
> > >
> > > IMHO three actions could be made to prevent this kind of changes in
> the future:
> > >
> > > a. Add test case for metrics (which we already have in
> SinkMetricsITCase)
> > > b. We emphasize that any public-interface breaking changes should be
> proposed by a FLIP or discussed in mailing list, and should be listed in
> the release note.
> > > c. We remind contributors and reviewers about what should be
> considered as public API, and include metric names in it.
> > >
> > > For b and c these two pages [2][3] might be proper places.
> > >
> > > About the patch to revert this, it looks like we have a consensus on
> 1.16. As of 1.15 I think it’s worthy to trigger a minor version. I didn’t
> see complaints about this for now so it should be OK to save the situation
> asap. I’m with Xintong’s idea to treat numXXXSend as an alias of numXXXOut
> considering there could possibly some users have already adapted their
> system to the new naming, and have another internal metric for reflecting
> number of outgoing committable batches (actually the numRecordsIn of sink
> committer operator should be carrying this info already).
> > >
> > > [1] https://github.com/apache/flink/pull/18825
> > > [2] https://flink.apache.org/contributing/contribute-code.html
> > > [3] https://flink.apache.org/contributing/reviewing-prs.html
> > >
> > > Best,
> > > Qingsheng
> > > On Oct 10, 2022, 17:40 +0800, Xintong Song <to...@gmail.com>,
> wrote:
> > >
> > > +1 for reverting these changes in Flink 1.16.
> > >
> > > For 1.15.3, can we make these metrics available via both names
> (numXXXOut and numXXXSend)? In this way we don't break it for those who
> already migrated to 1.15 and numXXXSend. That means we still need to change
> SinkWriterOperator to use another metric name in 1.15.3, which IIUC is
> internal to Flink sink.
> > >
> > > I'm overall +1 to change numXXXOut back to its original semantics.
> AFAIK (from meetup / flink-forward questionaires), most users do not
> migrate to a new Flink release immediately, until the next 1-2 major
> releases are out.
> > >
> > > Best,
> > >
> > > Xintong
> > >
> > >
> > >
> > > On Mon, Oct 10, 2022 at 5:26 PM Martijn Visser <
> martijnvisser@apache.org> wrote:
> > >>
> > >> Hi Qingsheng,
> > >>
> > >> Do you have any idea what has happened in the process here? Do we
> know why
> > >> they were changed? I was under the impression that these metric names
> were
> > >> newly introduced due to the new interfaces and because it still
> depends on
> > >> each connector implementing these.
> > >>
> > >> Sidenote: metric names are not mentioned in the FLIP process as a
> public
> > >> API. Might make sense to have a separate follow-up to add that to the
> list
> > >> (I do think we should list them there).
> > >>
> > >> +1 for reverting this and make this change in Flink 1.16
> > >>
> > >> I'm not in favour of releasing a Flink 1.15.3 with this change: I
> think the
> > >> impact is too big for a patch version, especially given how long
> Flink 1.15
> > >> is already out there.
> > >>
> > >> Best regards,
> > >>
> > >> Martijn
> > >>
> > >> On Mon, Oct 10, 2022 at 11:13 AM Leonard Xu <xb...@gmail.com>
> wrote:
> > >>
> > >> > Thanks Qingsheng for starting this thread.
> > >> >
> > >> > +1 on reverting sink metric name and releasing 1.15.3 to fix this
> > >> > inconsistent behavior.
> > >> >
> > >> >
> > >> > Best,
> > >> > Leonard
> > >> >
> > >> >
> > >> >
> > >> >
> > >> >
> > >> > 2022年10月10日 下午3:06,Jark Wu <im...@gmail.com> 写道:
> > >> >
> > >> > Thanks for discovering this problem, Qingsheng!
> > >> >
> > >> > I'm also +1 for reverting the breaking changes.
> > >> >
> > >> > IIUC, currently, the behavior of "numXXXOut" metrics of the new and
> old
> > >> > sink is inconsistent.
> > >> > We have to break one of them to have consistent behavior. Sink V2
> is an
> > >> > evolving API which is just introduced in 1.15.
> > >> > I think it makes sense to break the unstable API instead of the
> stable API
> > >> > which many connectors and users depend on.
> > >> >
> > >> > Best,
> > >> > Jark
> > >> >
> > >> >
> > >> >
> > >> > On Mon, 10 Oct 2022 at 11:36, Jingsong Li <ji...@gmail.com>
> wrote:
> > >> >
> > >> >> Thanks for driving, Qingsheng.
> > >> >>
> > >> >> +1 for reverting sink metric name.
> > >> >>
> > >> >> We often forget that metric is also one of the important APIs.
> > >> >>
> > >> >> +1 for releasing 1.15.3 to fix this.
> > >> >>
> > >> >> Best,
> > >> >> Jingsong
> > >> >>
> > >> >> On Sun, Oct 9, 2022 at 11:35 PM Becket Qin <be...@gmail.com>
> wrote:
> > >> >> >
> > >> >> > Thanks for raising the discussion, Qingsheng,
> > >> >> >
> > >> >> > +1 on reverting the breaking changes.
> > >> >> >
> > >> >> > In addition, we might want to release a 1.15.3 to fix this and
> update
> > >> >> the previous release docs with this known issue, so that users can
> upgrade
> > >> >> to 1.15.3 when they hit it. It would also be good to add some
> backwards
> > >> >> compatibility tests on metrics to avoid unintended breaking
> changes like
> > >> >> this in the future.
> > >> >> >
> > >> >> > Thanks,
> > >> >> >
> > >> >> > Jiangjie (Becket) Qin
> > >> >> >
> > >> >> > On Sun, Oct 9, 2022 at 10:35 AM Qingsheng Ren <re...@apache.org>
> wrote:
> > >> >> >>
> > >> >> >> Hi devs and users,
> > >> >> >>
> > >> >> >> I’d like to start a discussion about reverting a breaking
> change about
> > >> >> sink metrics made in 1.15 by FLINK-26126 [1] and FLINK-26492 [2].
> > >> >> >>
> > >> >> >> TL;DR
> > >> >> >>
> > >> >> >> All sink metrics with name “numXXXOut” defined in FLIP-33 are
> replace
> > >> >> by “numXXXSend” in FLINK-26126 and FLINK-26492. Considering metric
> names
> > >> >> are public APIs, this is a breaking change to end users and not
> backward
> > >> >> compatible. Also unfortunately this breaking change was not
> discussed in
> > >> >> the mailing list before.
> > >> >> >>
> > >> >> >> Background
> > >> >> >>
> > >> >> >> As defined previously in FLIP-33 (the FLIP page has been
> changed so
> > >> >> please refer to the old version [3] ), metric “numRecordsOut” is
> used for
> > >> >> reporting the total number of output records since the sink
> started (number
> > >> >> of records written to the external system), and similarly for
> > >> >> “numRecordsOutPerSecond”, “numBytesOut”, “numBytesOutPerSecond” and
> > >> >> “numRecordsOutError”. Most sinks are following this naming and
> definition.
> > >> >> However, these metrics are ambiguous in the new Sink API as
> “numXXXOut”
> > >> >> could be used by the output of SinkWriterOperator for reporting
> number of
> > >> >> Committables delivered to SinkCommitterOperator. In order to
> resolve the
> > >> >> conflict, FLINK-26126 and FLINK-26492 changed names of these
> metrics with
> > >> >> “numXXXSend”.
> > >> >> >>
> > >> >> >> Necessity of reverting this change
> > >> >> >>
> > >> >> >> - Metric names are actually public API, as end users need to
> configure
> > >> >> metric collecting and alerting system with metric names. Users
> have to
> > >> >> reset all configurations related to affected metrics.
> > >> >> >> - This could also affect custom and external sinks not
> maintained by
> > >> >> Flink, which might have implemented with numXXXOut metrics.
> > >> >> >> - The number of records sent to external system is way more
> important
> > >> >> than the number of Committables sent to SinkCommitterOperator, as
> the
> > >> >> latter one is just an internal implementation of sink. We could
> have a new
> > >> >> metric name for the latter one instead.
> > >> >> >> - We could avoid splitting the project by version (like “plz use
> > >> >> numXXXOut before 1.15 and use numXXXSend after”) if we revert it
> ASAP,
> > >> >> cosidering 1.16 is still not released for now.
> > >> >> >>
> > >> >> >> As a consequence, I’d like to hear from devs and users about
> your
> > >> >> opinion on changing these metrics back to “numXXXOut”.
> > >> >> >>
> > >> >> >> Looking forward to your reply!
> > >> >> >>
> > >> >> >> [1] https://issues.apache.org/jira/browse/FLINK-26126
> > >> >> >> [2] https://issues.apache.org/jira/browse/FLINK-26492
> > >> >> >> [1] FLIP-33, version 18:
> > >> >>
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=211883136
> > >> >> >>
> > >> >> >> Best,
> > >> >> >> Qingsheng
> > >> >>
> > >> >
> > >> >
> > >
> > >
> > >
>

Re: [DISCUSS] Reverting sink metric name changes made in 1.15

Posted by Jing Ge <ji...@ververica.com>.
Hi Qingsheng,

Just want to make sure we are on the same page. Are you suggesting
switching the naming between "numXXXSend" and "numXXXOut" or reverting all
the changes we did with FLINK-26126 and FLINK-26492?

For the naming switch, please pay attention that the behaviour has been
changed since we introduced SinkV2[1]. So, please be aware of different
numbers(behaviour change) even with the same metrics name. Sticking with
the old name with the new behaviour (very bad idea, IMHO) might seem like
saving the effort in the first place, but it might end up with monitoring
unexpected metrics, which is even worse for users, i.e. I didn't change
anything, but something has been broken since the last update.

For reverting, I am not sure how to fix the issue mentioned in FLINK-26126
after reverting all changes. Like Chesnay has already pointed out, with
SinkV2 we have two different output lines - one with the external system
and the other with the downstream operator. In this case, "numXXXSend" is
rather a new metric than a replacement of "numXXXOut". The "numXXXOut"
metric can still be used, depending on what the user wants to monitor.


Best regards,
Jing

[1]
https://github.com/apache/flink/blob/51fc20db30d001a95de95b3b9993eeb06f558f6c/flink-metrics/flink-metrics-core/src/main/java/org/apache/flink/metrics/groups/SinkWriterMetricGroup.java#L48


On Wed, Oct 12, 2022 at 12:48 PM Qingsheng Ren <re...@apache.org> wrote:

> As a supplement, considering it could be a big reconstruction
> redefining internal and external traffic and touching metric names in
> almost all operators, this requires a lot of discussions and we might
> do it finally in Flink 2.0. I think compatibility is a bigger blocker
> in front of us, as the output of sink is a metric that users care a
> lot about.
>
> Thanks,
> Qingsheng
>
> On Wed, Oct 12, 2022 at 6:20 PM Qingsheng Ren <re...@apache.org> wrote:
> >
> > Thanks Chesnay for the reply. +1 for making a unified and clearer
> > metric definition distinguishing internal and external data transfers.
> > As you described, having IO in operators is quite common such as
> > dimension tables in Table/SQL API. This definitely deserves a FLIP and
> > an overall design.
> >
> > However I think it's necessary to change the metric back to
> > numRecordsOut instead of sticking with numRecordsSend in 1.15 and
> > 1.16. The most important argument is for compatibility as I mentioned
> > in my previous email, otherwise all users have to modify their configs
> > of metric systems after upgrading to Flink 1.15+, and all custom
> > connectors have to change their implementations to migrate to the new
> > metric name. I believe other ones participating and approving this
> > proposal share the same concern about compatibility too. Also
> > considering this issue is blocking the release of 1.16, maybe we could
> > fix this asap, and as for defining a new metric for internal data
> > transfers we can have an in-depth discussion later. WDYT?
> >
> > Best,
> > Qingsheng
> >
> > On Tue, Oct 11, 2022 at 6:06 PM Chesnay Schepler <ch...@apache.org>
> wrote:
> > >
> > > Currently I think that would be a mistake.
> > >
> > > Ultimately what we have here is the culmination of us never really
> considering how the numRecordsOut metric should behave for operators that
> emit data to other operators _and_ external systems. This goes beyond sinks.
> > > This even applies to numRecordsIn, for cases where functions
> query/write data from/to the outside, (e.g., Async IO).
> > >
> > > Having 2 separate metrics for that, 1 exclusively for internal data
> transfers, and 1 exclusively for external data transfers, is the only way
> to get a consistent metric definition in the long-run.
> > > We can jump back-and-forth now or just commit to it.
> > >
> > > I don't think we can really judge this based on FLIP-33. It was IIRC
> written before the two phase sinks were added, which heavily blurred the
> lines of what a sink even is. Because it definitely is _not_ the last
> operator in a chain anymore.
> > >
> > > What I would suggest is to stick with what we got (although I despise
> the name numRecordsSend), and alias the numRecordsOut metric for all
> non-TwoPhaseCommittingSink.
> > >
> > > On 11/10/2022 05:54, Qingsheng Ren wrote:
> > >
> > > Thanks for the details Chesnay!
> > >
> > > By “alias” I mean to respect the original definition made in FLIP-33
> for numRecordsOut, which is the number of records written to the external
> system, and keep numRecordsSend as the same value as numRecordsOut for
> compatibility.
> > >
> > > I think keeping numRecordsOut for the output to the external system is
> more intuitive to end users because in most cases the metric of data flow
> output is more essential. I agree with you that a new metric is required,
> but considering compatibility and users’ intuition I prefer to keep the
> initial definition of numRecordsOut in FLIP-33 and name a new metric for
> sink writer’s output to downstream operators. This might be against
> consistency with metrics in other operators in Flink but maybe it’s
> acceptable to have the sink as a special case.
> > >
> > > Best,
> > > Qingsheng
> > > On Oct 10, 2022, 19:13 +0800, Chesnay Schepler <ch...@apache.org>,
> wrote:
> > >
> > > > I’m with Xintong’s idea to treat numXXXSend as an alias of numXXXOut
> > >
> > > But that's not possible. If it were that simple there would have never
> been a need to introduce another metric in the first place.
> > >
> > > It's a rather fundamental issue with how the new sinks work, in that
> they emit data to the external system (usually considered as
> "numRecordsOut" of sinks) while _also_ sending data to a downstream
> operator (usually considered as "numRecordsOut" of tasks).
> > > The original issue was that the numRecordsOut of the sink counted both
> (which is completely wrong).
> > >
> > > A new metric was always required; otherwise you inevitably end up
> breaking some semantic.
> > > Adding a new metric for what the sink writes to the external system
> is, for better or worse, more consistent with how these metrics usually
> work in Flink.
> > >
> > > On 10/10/2022 12:45, Qingsheng Ren wrote:
> > >
> > > Thanks everyone for joining the discussion!
> > >
> > > > Do you have any idea what has happened in the process here?
> > >
> > > The discussion in this PR [1] shows some details and could be helpful
> to understand the original motivation of the renaming. We do have a test
> case for guarding metrics but unfortunaly the case was also modified so the
> defense was broken.
> > >
> > > I think the reason why both the developer and the reviewer forgot to
> trigger an discussion and gave a green pass on the change is that metrics
> are quite “trivial” to be noticed as public APIs. As mentioned by Martijn I
> couldn’t find a place noting that metrics are public APIs and should be
> treated carefully while contributing and reviewing.
> > >
> > > IMHO three actions could be made to prevent this kind of changes in
> the future:
> > >
> > > a. Add test case for metrics (which we already have in
> SinkMetricsITCase)
> > > b. We emphasize that any public-interface breaking changes should be
> proposed by a FLIP or discussed in mailing list, and should be listed in
> the release note.
> > > c. We remind contributors and reviewers about what should be
> considered as public API, and include metric names in it.
> > >
> > > For b and c these two pages [2][3] might be proper places.
> > >
> > > About the patch to revert this, it looks like we have a consensus on
> 1.16. As of 1.15 I think it’s worthy to trigger a minor version. I didn’t
> see complaints about this for now so it should be OK to save the situation
> asap. I’m with Xintong’s idea to treat numXXXSend as an alias of numXXXOut
> considering there could possibly some users have already adapted their
> system to the new naming, and have another internal metric for reflecting
> number of outgoing committable batches (actually the numRecordsIn of sink
> committer operator should be carrying this info already).
> > >
> > > [1] https://github.com/apache/flink/pull/18825
> > > [2] https://flink.apache.org/contributing/contribute-code.html
> > > [3] https://flink.apache.org/contributing/reviewing-prs.html
> > >
> > > Best,
> > > Qingsheng
> > > On Oct 10, 2022, 17:40 +0800, Xintong Song <to...@gmail.com>,
> wrote:
> > >
> > > +1 for reverting these changes in Flink 1.16.
> > >
> > > For 1.15.3, can we make these metrics available via both names
> (numXXXOut and numXXXSend)? In this way we don't break it for those who
> already migrated to 1.15 and numXXXSend. That means we still need to change
> SinkWriterOperator to use another metric name in 1.15.3, which IIUC is
> internal to Flink sink.
> > >
> > > I'm overall +1 to change numXXXOut back to its original semantics.
> AFAIK (from meetup / flink-forward questionaires), most users do not
> migrate to a new Flink release immediately, until the next 1-2 major
> releases are out.
> > >
> > > Best,
> > >
> > > Xintong
> > >
> > >
> > >
> > > On Mon, Oct 10, 2022 at 5:26 PM Martijn Visser <
> martijnvisser@apache.org> wrote:
> > >>
> > >> Hi Qingsheng,
> > >>
> > >> Do you have any idea what has happened in the process here? Do we
> know why
> > >> they were changed? I was under the impression that these metric names
> were
> > >> newly introduced due to the new interfaces and because it still
> depends on
> > >> each connector implementing these.
> > >>
> > >> Sidenote: metric names are not mentioned in the FLIP process as a
> public
> > >> API. Might make sense to have a separate follow-up to add that to the
> list
> > >> (I do think we should list them there).
> > >>
> > >> +1 for reverting this and make this change in Flink 1.16
> > >>
> > >> I'm not in favour of releasing a Flink 1.15.3 with this change: I
> think the
> > >> impact is too big for a patch version, especially given how long
> Flink 1.15
> > >> is already out there.
> > >>
> > >> Best regards,
> > >>
> > >> Martijn
> > >>
> > >> On Mon, Oct 10, 2022 at 11:13 AM Leonard Xu <xb...@gmail.com>
> wrote:
> > >>
> > >> > Thanks Qingsheng for starting this thread.
> > >> >
> > >> > +1 on reverting sink metric name and releasing 1.15.3 to fix this
> > >> > inconsistent behavior.
> > >> >
> > >> >
> > >> > Best,
> > >> > Leonard
> > >> >
> > >> >
> > >> >
> > >> >
> > >> >
> > >> > 2022年10月10日 下午3:06,Jark Wu <im...@gmail.com> 写道:
> > >> >
> > >> > Thanks for discovering this problem, Qingsheng!
> > >> >
> > >> > I'm also +1 for reverting the breaking changes.
> > >> >
> > >> > IIUC, currently, the behavior of "numXXXOut" metrics of the new and
> old
> > >> > sink is inconsistent.
> > >> > We have to break one of them to have consistent behavior. Sink V2
> is an
> > >> > evolving API which is just introduced in 1.15.
> > >> > I think it makes sense to break the unstable API instead of the
> stable API
> > >> > which many connectors and users depend on.
> > >> >
> > >> > Best,
> > >> > Jark
> > >> >
> > >> >
> > >> >
> > >> > On Mon, 10 Oct 2022 at 11:36, Jingsong Li <ji...@gmail.com>
> wrote:
> > >> >
> > >> >> Thanks for driving, Qingsheng.
> > >> >>
> > >> >> +1 for reverting sink metric name.
> > >> >>
> > >> >> We often forget that metric is also one of the important APIs.
> > >> >>
> > >> >> +1 for releasing 1.15.3 to fix this.
> > >> >>
> > >> >> Best,
> > >> >> Jingsong
> > >> >>
> > >> >> On Sun, Oct 9, 2022 at 11:35 PM Becket Qin <be...@gmail.com>
> wrote:
> > >> >> >
> > >> >> > Thanks for raising the discussion, Qingsheng,
> > >> >> >
> > >> >> > +1 on reverting the breaking changes.
> > >> >> >
> > >> >> > In addition, we might want to release a 1.15.3 to fix this and
> update
> > >> >> the previous release docs with this known issue, so that users can
> upgrade
> > >> >> to 1.15.3 when they hit it. It would also be good to add some
> backwards
> > >> >> compatibility tests on metrics to avoid unintended breaking
> changes like
> > >> >> this in the future.
> > >> >> >
> > >> >> > Thanks,
> > >> >> >
> > >> >> > Jiangjie (Becket) Qin
> > >> >> >
> > >> >> > On Sun, Oct 9, 2022 at 10:35 AM Qingsheng Ren <re...@apache.org>
> wrote:
> > >> >> >>
> > >> >> >> Hi devs and users,
> > >> >> >>
> > >> >> >> I’d like to start a discussion about reverting a breaking
> change about
> > >> >> sink metrics made in 1.15 by FLINK-26126 [1] and FLINK-26492 [2].
> > >> >> >>
> > >> >> >> TL;DR
> > >> >> >>
> > >> >> >> All sink metrics with name “numXXXOut” defined in FLIP-33 are
> replace
> > >> >> by “numXXXSend” in FLINK-26126 and FLINK-26492. Considering metric
> names
> > >> >> are public APIs, this is a breaking change to end users and not
> backward
> > >> >> compatible. Also unfortunately this breaking change was not
> discussed in
> > >> >> the mailing list before.
> > >> >> >>
> > >> >> >> Background
> > >> >> >>
> > >> >> >> As defined previously in FLIP-33 (the FLIP page has been
> changed so
> > >> >> please refer to the old version [3] ), metric “numRecordsOut” is
> used for
> > >> >> reporting the total number of output records since the sink
> started (number
> > >> >> of records written to the external system), and similarly for
> > >> >> “numRecordsOutPerSecond”, “numBytesOut”, “numBytesOutPerSecond” and
> > >> >> “numRecordsOutError”. Most sinks are following this naming and
> definition.
> > >> >> However, these metrics are ambiguous in the new Sink API as
> “numXXXOut”
> > >> >> could be used by the output of SinkWriterOperator for reporting
> number of
> > >> >> Committables delivered to SinkCommitterOperator. In order to
> resolve the
> > >> >> conflict, FLINK-26126 and FLINK-26492 changed names of these
> metrics with
> > >> >> “numXXXSend”.
> > >> >> >>
> > >> >> >> Necessity of reverting this change
> > >> >> >>
> > >> >> >> - Metric names are actually public API, as end users need to
> configure
> > >> >> metric collecting and alerting system with metric names. Users
> have to
> > >> >> reset all configurations related to affected metrics.
> > >> >> >> - This could also affect custom and external sinks not
> maintained by
> > >> >> Flink, which might have implemented with numXXXOut metrics.
> > >> >> >> - The number of records sent to external system is way more
> important
> > >> >> than the number of Committables sent to SinkCommitterOperator, as
> the
> > >> >> latter one is just an internal implementation of sink. We could
> have a new
> > >> >> metric name for the latter one instead.
> > >> >> >> - We could avoid splitting the project by version (like “plz use
> > >> >> numXXXOut before 1.15 and use numXXXSend after”) if we revert it
> ASAP,
> > >> >> cosidering 1.16 is still not released for now.
> > >> >> >>
> > >> >> >> As a consequence, I’d like to hear from devs and users about
> your
> > >> >> opinion on changing these metrics back to “numXXXOut”.
> > >> >> >>
> > >> >> >> Looking forward to your reply!
> > >> >> >>
> > >> >> >> [1] https://issues.apache.org/jira/browse/FLINK-26126
> > >> >> >> [2] https://issues.apache.org/jira/browse/FLINK-26492
> > >> >> >> [1] FLIP-33, version 18:
> > >> >>
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=211883136
> > >> >> >>
> > >> >> >> Best,
> > >> >> >> Qingsheng
> > >> >>
> > >> >
> > >> >
> > >
> > >
> > >
>

Re: [DISCUSS] Reverting sink metric name changes made in 1.15

Posted by Qingsheng Ren <re...@apache.org>.
As a supplement, considering it could be a big reconstruction
redefining internal and external traffic and touching metric names in
almost all operators, this requires a lot of discussions and we might
do it finally in Flink 2.0. I think compatibility is a bigger blocker
in front of us, as the output of sink is a metric that users care a
lot about.

Thanks,
Qingsheng

On Wed, Oct 12, 2022 at 6:20 PM Qingsheng Ren <re...@apache.org> wrote:
>
> Thanks Chesnay for the reply. +1 for making a unified and clearer
> metric definition distinguishing internal and external data transfers.
> As you described, having IO in operators is quite common such as
> dimension tables in Table/SQL API. This definitely deserves a FLIP and
> an overall design.
>
> However I think it's necessary to change the metric back to
> numRecordsOut instead of sticking with numRecordsSend in 1.15 and
> 1.16. The most important argument is for compatibility as I mentioned
> in my previous email, otherwise all users have to modify their configs
> of metric systems after upgrading to Flink 1.15+, and all custom
> connectors have to change their implementations to migrate to the new
> metric name. I believe other ones participating and approving this
> proposal share the same concern about compatibility too. Also
> considering this issue is blocking the release of 1.16, maybe we could
> fix this asap, and as for defining a new metric for internal data
> transfers we can have an in-depth discussion later. WDYT?
>
> Best,
> Qingsheng
>
> On Tue, Oct 11, 2022 at 6:06 PM Chesnay Schepler <ch...@apache.org> wrote:
> >
> > Currently I think that would be a mistake.
> >
> > Ultimately what we have here is the culmination of us never really considering how the numRecordsOut metric should behave for operators that emit data to other operators _and_ external systems. This goes beyond sinks.
> > This even applies to numRecordsIn, for cases where functions query/write data from/to the outside, (e.g., Async IO).
> >
> > Having 2 separate metrics for that, 1 exclusively for internal data transfers, and 1 exclusively for external data transfers, is the only way to get a consistent metric definition in the long-run.
> > We can jump back-and-forth now or just commit to it.
> >
> > I don't think we can really judge this based on FLIP-33. It was IIRC written before the two phase sinks were added, which heavily blurred the lines of what a sink even is. Because it definitely is _not_ the last operator in a chain anymore.
> >
> > What I would suggest is to stick with what we got (although I despise the name numRecordsSend), and alias the numRecordsOut metric for all non-TwoPhaseCommittingSink.
> >
> > On 11/10/2022 05:54, Qingsheng Ren wrote:
> >
> > Thanks for the details Chesnay!
> >
> > By “alias” I mean to respect the original definition made in FLIP-33 for numRecordsOut, which is the number of records written to the external system, and keep numRecordsSend as the same value as numRecordsOut for compatibility.
> >
> > I think keeping numRecordsOut for the output to the external system is more intuitive to end users because in most cases the metric of data flow output is more essential. I agree with you that a new metric is required, but considering compatibility and users’ intuition I prefer to keep the initial definition of numRecordsOut in FLIP-33 and name a new metric for sink writer’s output to downstream operators. This might be against consistency with metrics in other operators in Flink but maybe it’s acceptable to have the sink as a special case.
> >
> > Best,
> > Qingsheng
> > On Oct 10, 2022, 19:13 +0800, Chesnay Schepler <ch...@apache.org>, wrote:
> >
> > > I’m with Xintong’s idea to treat numXXXSend as an alias of numXXXOut
> >
> > But that's not possible. If it were that simple there would have never been a need to introduce another metric in the first place.
> >
> > It's a rather fundamental issue with how the new sinks work, in that they emit data to the external system (usually considered as "numRecordsOut" of sinks) while _also_ sending data to a downstream operator (usually considered as "numRecordsOut" of tasks).
> > The original issue was that the numRecordsOut of the sink counted both (which is completely wrong).
> >
> > A new metric was always required; otherwise you inevitably end up breaking some semantic.
> > Adding a new metric for what the sink writes to the external system is, for better or worse, more consistent with how these metrics usually work in Flink.
> >
> > On 10/10/2022 12:45, Qingsheng Ren wrote:
> >
> > Thanks everyone for joining the discussion!
> >
> > > Do you have any idea what has happened in the process here?
> >
> > The discussion in this PR [1] shows some details and could be helpful to understand the original motivation of the renaming. We do have a test case for guarding metrics but unfortunaly the case was also modified so the defense was broken.
> >
> > I think the reason why both the developer and the reviewer forgot to trigger an discussion and gave a green pass on the change is that metrics are quite “trivial” to be noticed as public APIs. As mentioned by Martijn I couldn’t find a place noting that metrics are public APIs and should be treated carefully while contributing and reviewing.
> >
> > IMHO three actions could be made to prevent this kind of changes in the future:
> >
> > a. Add test case for metrics (which we already have in SinkMetricsITCase)
> > b. We emphasize that any public-interface breaking changes should be proposed by a FLIP or discussed in mailing list, and should be listed in the release note.
> > c. We remind contributors and reviewers about what should be considered as public API, and include metric names in it.
> >
> > For b and c these two pages [2][3] might be proper places.
> >
> > About the patch to revert this, it looks like we have a consensus on 1.16. As of 1.15 I think it’s worthy to trigger a minor version. I didn’t see complaints about this for now so it should be OK to save the situation asap. I’m with Xintong’s idea to treat numXXXSend as an alias of numXXXOut considering there could possibly some users have already adapted their system to the new naming, and have another internal metric for reflecting number of outgoing committable batches (actually the numRecordsIn of sink committer operator should be carrying this info already).
> >
> > [1] https://github.com/apache/flink/pull/18825
> > [2] https://flink.apache.org/contributing/contribute-code.html
> > [3] https://flink.apache.org/contributing/reviewing-prs.html
> >
> > Best,
> > Qingsheng
> > On Oct 10, 2022, 17:40 +0800, Xintong Song <to...@gmail.com>, wrote:
> >
> > +1 for reverting these changes in Flink 1.16.
> >
> > For 1.15.3, can we make these metrics available via both names (numXXXOut and numXXXSend)? In this way we don't break it for those who already migrated to 1.15 and numXXXSend. That means we still need to change SinkWriterOperator to use another metric name in 1.15.3, which IIUC is internal to Flink sink.
> >
> > I'm overall +1 to change numXXXOut back to its original semantics. AFAIK (from meetup / flink-forward questionaires), most users do not migrate to a new Flink release immediately, until the next 1-2 major releases are out.
> >
> > Best,
> >
> > Xintong
> >
> >
> >
> > On Mon, Oct 10, 2022 at 5:26 PM Martijn Visser <ma...@apache.org> wrote:
> >>
> >> Hi Qingsheng,
> >>
> >> Do you have any idea what has happened in the process here? Do we know why
> >> they were changed? I was under the impression that these metric names were
> >> newly introduced due to the new interfaces and because it still depends on
> >> each connector implementing these.
> >>
> >> Sidenote: metric names are not mentioned in the FLIP process as a public
> >> API. Might make sense to have a separate follow-up to add that to the list
> >> (I do think we should list them there).
> >>
> >> +1 for reverting this and make this change in Flink 1.16
> >>
> >> I'm not in favour of releasing a Flink 1.15.3 with this change: I think the
> >> impact is too big for a patch version, especially given how long Flink 1.15
> >> is already out there.
> >>
> >> Best regards,
> >>
> >> Martijn
> >>
> >> On Mon, Oct 10, 2022 at 11:13 AM Leonard Xu <xb...@gmail.com> wrote:
> >>
> >> > Thanks Qingsheng for starting this thread.
> >> >
> >> > +1 on reverting sink metric name and releasing 1.15.3 to fix this
> >> > inconsistent behavior.
> >> >
> >> >
> >> > Best,
> >> > Leonard
> >> >
> >> >
> >> >
> >> >
> >> >
> >> > 2022年10月10日 下午3:06,Jark Wu <im...@gmail.com> 写道:
> >> >
> >> > Thanks for discovering this problem, Qingsheng!
> >> >
> >> > I'm also +1 for reverting the breaking changes.
> >> >
> >> > IIUC, currently, the behavior of "numXXXOut" metrics of the new and old
> >> > sink is inconsistent.
> >> > We have to break one of them to have consistent behavior. Sink V2 is an
> >> > evolving API which is just introduced in 1.15.
> >> > I think it makes sense to break the unstable API instead of the stable API
> >> > which many connectors and users depend on.
> >> >
> >> > Best,
> >> > Jark
> >> >
> >> >
> >> >
> >> > On Mon, 10 Oct 2022 at 11:36, Jingsong Li <ji...@gmail.com> wrote:
> >> >
> >> >> Thanks for driving, Qingsheng.
> >> >>
> >> >> +1 for reverting sink metric name.
> >> >>
> >> >> We often forget that metric is also one of the important APIs.
> >> >>
> >> >> +1 for releasing 1.15.3 to fix this.
> >> >>
> >> >> Best,
> >> >> Jingsong
> >> >>
> >> >> On Sun, Oct 9, 2022 at 11:35 PM Becket Qin <be...@gmail.com> wrote:
> >> >> >
> >> >> > Thanks for raising the discussion, Qingsheng,
> >> >> >
> >> >> > +1 on reverting the breaking changes.
> >> >> >
> >> >> > In addition, we might want to release a 1.15.3 to fix this and update
> >> >> the previous release docs with this known issue, so that users can upgrade
> >> >> to 1.15.3 when they hit it. It would also be good to add some backwards
> >> >> compatibility tests on metrics to avoid unintended breaking changes like
> >> >> this in the future.
> >> >> >
> >> >> > Thanks,
> >> >> >
> >> >> > Jiangjie (Becket) Qin
> >> >> >
> >> >> > On Sun, Oct 9, 2022 at 10:35 AM Qingsheng Ren <re...@apache.org> wrote:
> >> >> >>
> >> >> >> Hi devs and users,
> >> >> >>
> >> >> >> I’d like to start a discussion about reverting a breaking change about
> >> >> sink metrics made in 1.15 by FLINK-26126 [1] and FLINK-26492 [2].
> >> >> >>
> >> >> >> TL;DR
> >> >> >>
> >> >> >> All sink metrics with name “numXXXOut” defined in FLIP-33 are replace
> >> >> by “numXXXSend” in FLINK-26126 and FLINK-26492. Considering metric names
> >> >> are public APIs, this is a breaking change to end users and not backward
> >> >> compatible. Also unfortunately this breaking change was not discussed in
> >> >> the mailing list before.
> >> >> >>
> >> >> >> Background
> >> >> >>
> >> >> >> As defined previously in FLIP-33 (the FLIP page has been changed so
> >> >> please refer to the old version [3] ), metric “numRecordsOut” is used for
> >> >> reporting the total number of output records since the sink started (number
> >> >> of records written to the external system), and similarly for
> >> >> “numRecordsOutPerSecond”, “numBytesOut”, “numBytesOutPerSecond” and
> >> >> “numRecordsOutError”. Most sinks are following this naming and definition.
> >> >> However, these metrics are ambiguous in the new Sink API as “numXXXOut”
> >> >> could be used by the output of SinkWriterOperator for reporting number of
> >> >> Committables delivered to SinkCommitterOperator. In order to resolve the
> >> >> conflict, FLINK-26126 and FLINK-26492 changed names of these metrics with
> >> >> “numXXXSend”.
> >> >> >>
> >> >> >> Necessity of reverting this change
> >> >> >>
> >> >> >> - Metric names are actually public API, as end users need to configure
> >> >> metric collecting and alerting system with metric names. Users have to
> >> >> reset all configurations related to affected metrics.
> >> >> >> - This could also affect custom and external sinks not maintained by
> >> >> Flink, which might have implemented with numXXXOut metrics.
> >> >> >> - The number of records sent to external system is way more important
> >> >> than the number of Committables sent to SinkCommitterOperator, as the
> >> >> latter one is just an internal implementation of sink. We could have a new
> >> >> metric name for the latter one instead.
> >> >> >> - We could avoid splitting the project by version (like “plz use
> >> >> numXXXOut before 1.15 and use numXXXSend after”) if we revert it ASAP,
> >> >> cosidering 1.16 is still not released for now.
> >> >> >>
> >> >> >> As a consequence, I’d like to hear from devs and users about your
> >> >> opinion on changing these metrics back to “numXXXOut”.
> >> >> >>
> >> >> >> Looking forward to your reply!
> >> >> >>
> >> >> >> [1] https://issues.apache.org/jira/browse/FLINK-26126
> >> >> >> [2] https://issues.apache.org/jira/browse/FLINK-26492
> >> >> >> [1] FLIP-33, version 18:
> >> >> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=211883136
> >> >> >>
> >> >> >> Best,
> >> >> >> Qingsheng
> >> >>
> >> >
> >> >
> >
> >
> >

Re: [DISCUSS] Reverting sink metric name changes made in 1.15

Posted by Qingsheng Ren <re...@apache.org>.
As a supplement, considering it could be a big reconstruction
redefining internal and external traffic and touching metric names in
almost all operators, this requires a lot of discussions and we might
do it finally in Flink 2.0. I think compatibility is a bigger blocker
in front of us, as the output of sink is a metric that users care a
lot about.

Thanks,
Qingsheng

On Wed, Oct 12, 2022 at 6:20 PM Qingsheng Ren <re...@apache.org> wrote:
>
> Thanks Chesnay for the reply. +1 for making a unified and clearer
> metric definition distinguishing internal and external data transfers.
> As you described, having IO in operators is quite common such as
> dimension tables in Table/SQL API. This definitely deserves a FLIP and
> an overall design.
>
> However I think it's necessary to change the metric back to
> numRecordsOut instead of sticking with numRecordsSend in 1.15 and
> 1.16. The most important argument is for compatibility as I mentioned
> in my previous email, otherwise all users have to modify their configs
> of metric systems after upgrading to Flink 1.15+, and all custom
> connectors have to change their implementations to migrate to the new
> metric name. I believe other ones participating and approving this
> proposal share the same concern about compatibility too. Also
> considering this issue is blocking the release of 1.16, maybe we could
> fix this asap, and as for defining a new metric for internal data
> transfers we can have an in-depth discussion later. WDYT?
>
> Best,
> Qingsheng
>
> On Tue, Oct 11, 2022 at 6:06 PM Chesnay Schepler <ch...@apache.org> wrote:
> >
> > Currently I think that would be a mistake.
> >
> > Ultimately what we have here is the culmination of us never really considering how the numRecordsOut metric should behave for operators that emit data to other operators _and_ external systems. This goes beyond sinks.
> > This even applies to numRecordsIn, for cases where functions query/write data from/to the outside, (e.g., Async IO).
> >
> > Having 2 separate metrics for that, 1 exclusively for internal data transfers, and 1 exclusively for external data transfers, is the only way to get a consistent metric definition in the long-run.
> > We can jump back-and-forth now or just commit to it.
> >
> > I don't think we can really judge this based on FLIP-33. It was IIRC written before the two phase sinks were added, which heavily blurred the lines of what a sink even is. Because it definitely is _not_ the last operator in a chain anymore.
> >
> > What I would suggest is to stick with what we got (although I despise the name numRecordsSend), and alias the numRecordsOut metric for all non-TwoPhaseCommittingSink.
> >
> > On 11/10/2022 05:54, Qingsheng Ren wrote:
> >
> > Thanks for the details Chesnay!
> >
> > By “alias” I mean to respect the original definition made in FLIP-33 for numRecordsOut, which is the number of records written to the external system, and keep numRecordsSend as the same value as numRecordsOut for compatibility.
> >
> > I think keeping numRecordsOut for the output to the external system is more intuitive to end users because in most cases the metric of data flow output is more essential. I agree with you that a new metric is required, but considering compatibility and users’ intuition I prefer to keep the initial definition of numRecordsOut in FLIP-33 and name a new metric for sink writer’s output to downstream operators. This might be against consistency with metrics in other operators in Flink but maybe it’s acceptable to have the sink as a special case.
> >
> > Best,
> > Qingsheng
> > On Oct 10, 2022, 19:13 +0800, Chesnay Schepler <ch...@apache.org>, wrote:
> >
> > > I’m with Xintong’s idea to treat numXXXSend as an alias of numXXXOut
> >
> > But that's not possible. If it were that simple there would have never been a need to introduce another metric in the first place.
> >
> > It's a rather fundamental issue with how the new sinks work, in that they emit data to the external system (usually considered as "numRecordsOut" of sinks) while _also_ sending data to a downstream operator (usually considered as "numRecordsOut" of tasks).
> > The original issue was that the numRecordsOut of the sink counted both (which is completely wrong).
> >
> > A new metric was always required; otherwise you inevitably end up breaking some semantic.
> > Adding a new metric for what the sink writes to the external system is, for better or worse, more consistent with how these metrics usually work in Flink.
> >
> > On 10/10/2022 12:45, Qingsheng Ren wrote:
> >
> > Thanks everyone for joining the discussion!
> >
> > > Do you have any idea what has happened in the process here?
> >
> > The discussion in this PR [1] shows some details and could be helpful to understand the original motivation of the renaming. We do have a test case for guarding metrics but unfortunaly the case was also modified so the defense was broken.
> >
> > I think the reason why both the developer and the reviewer forgot to trigger an discussion and gave a green pass on the change is that metrics are quite “trivial” to be noticed as public APIs. As mentioned by Martijn I couldn’t find a place noting that metrics are public APIs and should be treated carefully while contributing and reviewing.
> >
> > IMHO three actions could be made to prevent this kind of changes in the future:
> >
> > a. Add test case for metrics (which we already have in SinkMetricsITCase)
> > b. We emphasize that any public-interface breaking changes should be proposed by a FLIP or discussed in mailing list, and should be listed in the release note.
> > c. We remind contributors and reviewers about what should be considered as public API, and include metric names in it.
> >
> > For b and c these two pages [2][3] might be proper places.
> >
> > About the patch to revert this, it looks like we have a consensus on 1.16. As of 1.15 I think it’s worthy to trigger a minor version. I didn’t see complaints about this for now so it should be OK to save the situation asap. I’m with Xintong’s idea to treat numXXXSend as an alias of numXXXOut considering there could possibly some users have already adapted their system to the new naming, and have another internal metric for reflecting number of outgoing committable batches (actually the numRecordsIn of sink committer operator should be carrying this info already).
> >
> > [1] https://github.com/apache/flink/pull/18825
> > [2] https://flink.apache.org/contributing/contribute-code.html
> > [3] https://flink.apache.org/contributing/reviewing-prs.html
> >
> > Best,
> > Qingsheng
> > On Oct 10, 2022, 17:40 +0800, Xintong Song <to...@gmail.com>, wrote:
> >
> > +1 for reverting these changes in Flink 1.16.
> >
> > For 1.15.3, can we make these metrics available via both names (numXXXOut and numXXXSend)? In this way we don't break it for those who already migrated to 1.15 and numXXXSend. That means we still need to change SinkWriterOperator to use another metric name in 1.15.3, which IIUC is internal to Flink sink.
> >
> > I'm overall +1 to change numXXXOut back to its original semantics. AFAIK (from meetup / flink-forward questionaires), most users do not migrate to a new Flink release immediately, until the next 1-2 major releases are out.
> >
> > Best,
> >
> > Xintong
> >
> >
> >
> > On Mon, Oct 10, 2022 at 5:26 PM Martijn Visser <ma...@apache.org> wrote:
> >>
> >> Hi Qingsheng,
> >>
> >> Do you have any idea what has happened in the process here? Do we know why
> >> they were changed? I was under the impression that these metric names were
> >> newly introduced due to the new interfaces and because it still depends on
> >> each connector implementing these.
> >>
> >> Sidenote: metric names are not mentioned in the FLIP process as a public
> >> API. Might make sense to have a separate follow-up to add that to the list
> >> (I do think we should list them there).
> >>
> >> +1 for reverting this and make this change in Flink 1.16
> >>
> >> I'm not in favour of releasing a Flink 1.15.3 with this change: I think the
> >> impact is too big for a patch version, especially given how long Flink 1.15
> >> is already out there.
> >>
> >> Best regards,
> >>
> >> Martijn
> >>
> >> On Mon, Oct 10, 2022 at 11:13 AM Leonard Xu <xb...@gmail.com> wrote:
> >>
> >> > Thanks Qingsheng for starting this thread.
> >> >
> >> > +1 on reverting sink metric name and releasing 1.15.3 to fix this
> >> > inconsistent behavior.
> >> >
> >> >
> >> > Best,
> >> > Leonard
> >> >
> >> >
> >> >
> >> >
> >> >
> >> > 2022年10月10日 下午3:06,Jark Wu <im...@gmail.com> 写道:
> >> >
> >> > Thanks for discovering this problem, Qingsheng!
> >> >
> >> > I'm also +1 for reverting the breaking changes.
> >> >
> >> > IIUC, currently, the behavior of "numXXXOut" metrics of the new and old
> >> > sink is inconsistent.
> >> > We have to break one of them to have consistent behavior. Sink V2 is an
> >> > evolving API which is just introduced in 1.15.
> >> > I think it makes sense to break the unstable API instead of the stable API
> >> > which many connectors and users depend on.
> >> >
> >> > Best,
> >> > Jark
> >> >
> >> >
> >> >
> >> > On Mon, 10 Oct 2022 at 11:36, Jingsong Li <ji...@gmail.com> wrote:
> >> >
> >> >> Thanks for driving, Qingsheng.
> >> >>
> >> >> +1 for reverting sink metric name.
> >> >>
> >> >> We often forget that metric is also one of the important APIs.
> >> >>
> >> >> +1 for releasing 1.15.3 to fix this.
> >> >>
> >> >> Best,
> >> >> Jingsong
> >> >>
> >> >> On Sun, Oct 9, 2022 at 11:35 PM Becket Qin <be...@gmail.com> wrote:
> >> >> >
> >> >> > Thanks for raising the discussion, Qingsheng,
> >> >> >
> >> >> > +1 on reverting the breaking changes.
> >> >> >
> >> >> > In addition, we might want to release a 1.15.3 to fix this and update
> >> >> the previous release docs with this known issue, so that users can upgrade
> >> >> to 1.15.3 when they hit it. It would also be good to add some backwards
> >> >> compatibility tests on metrics to avoid unintended breaking changes like
> >> >> this in the future.
> >> >> >
> >> >> > Thanks,
> >> >> >
> >> >> > Jiangjie (Becket) Qin
> >> >> >
> >> >> > On Sun, Oct 9, 2022 at 10:35 AM Qingsheng Ren <re...@apache.org> wrote:
> >> >> >>
> >> >> >> Hi devs and users,
> >> >> >>
> >> >> >> I’d like to start a discussion about reverting a breaking change about
> >> >> sink metrics made in 1.15 by FLINK-26126 [1] and FLINK-26492 [2].
> >> >> >>
> >> >> >> TL;DR
> >> >> >>
> >> >> >> All sink metrics with name “numXXXOut” defined in FLIP-33 are replace
> >> >> by “numXXXSend” in FLINK-26126 and FLINK-26492. Considering metric names
> >> >> are public APIs, this is a breaking change to end users and not backward
> >> >> compatible. Also unfortunately this breaking change was not discussed in
> >> >> the mailing list before.
> >> >> >>
> >> >> >> Background
> >> >> >>
> >> >> >> As defined previously in FLIP-33 (the FLIP page has been changed so
> >> >> please refer to the old version [3] ), metric “numRecordsOut” is used for
> >> >> reporting the total number of output records since the sink started (number
> >> >> of records written to the external system), and similarly for
> >> >> “numRecordsOutPerSecond”, “numBytesOut”, “numBytesOutPerSecond” and
> >> >> “numRecordsOutError”. Most sinks are following this naming and definition.
> >> >> However, these metrics are ambiguous in the new Sink API as “numXXXOut”
> >> >> could be used by the output of SinkWriterOperator for reporting number of
> >> >> Committables delivered to SinkCommitterOperator. In order to resolve the
> >> >> conflict, FLINK-26126 and FLINK-26492 changed names of these metrics with
> >> >> “numXXXSend”.
> >> >> >>
> >> >> >> Necessity of reverting this change
> >> >> >>
> >> >> >> - Metric names are actually public API, as end users need to configure
> >> >> metric collecting and alerting system with metric names. Users have to
> >> >> reset all configurations related to affected metrics.
> >> >> >> - This could also affect custom and external sinks not maintained by
> >> >> Flink, which might have implemented with numXXXOut metrics.
> >> >> >> - The number of records sent to external system is way more important
> >> >> than the number of Committables sent to SinkCommitterOperator, as the
> >> >> latter one is just an internal implementation of sink. We could have a new
> >> >> metric name for the latter one instead.
> >> >> >> - We could avoid splitting the project by version (like “plz use
> >> >> numXXXOut before 1.15 and use numXXXSend after”) if we revert it ASAP,
> >> >> cosidering 1.16 is still not released for now.
> >> >> >>
> >> >> >> As a consequence, I’d like to hear from devs and users about your
> >> >> opinion on changing these metrics back to “numXXXOut”.
> >> >> >>
> >> >> >> Looking forward to your reply!
> >> >> >>
> >> >> >> [1] https://issues.apache.org/jira/browse/FLINK-26126
> >> >> >> [2] https://issues.apache.org/jira/browse/FLINK-26492
> >> >> >> [1] FLIP-33, version 18:
> >> >> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=211883136
> >> >> >>
> >> >> >> Best,
> >> >> >> Qingsheng
> >> >>
> >> >
> >> >
> >
> >
> >

Re: [DISCUSS] Reverting sink metric name changes made in 1.15

Posted by Qingsheng Ren <re...@apache.org>.
Thanks Chesnay for the reply. +1 for making a unified and clearer
metric definition distinguishing internal and external data transfers.
As you described, having IO in operators is quite common such as
dimension tables in Table/SQL API. This definitely deserves a FLIP and
an overall design.

However I think it's necessary to change the metric back to
numRecordsOut instead of sticking with numRecordsSend in 1.15 and
1.16. The most important argument is for compatibility as I mentioned
in my previous email, otherwise all users have to modify their configs
of metric systems after upgrading to Flink 1.15+, and all custom
connectors have to change their implementations to migrate to the new
metric name. I believe other ones participating and approving this
proposal share the same concern about compatibility too. Also
considering this issue is blocking the release of 1.16, maybe we could
fix this asap, and as for defining a new metric for internal data
transfers we can have an in-depth discussion later. WDYT?

Best,
Qingsheng

On Tue, Oct 11, 2022 at 6:06 PM Chesnay Schepler <ch...@apache.org> wrote:
>
> Currently I think that would be a mistake.
>
> Ultimately what we have here is the culmination of us never really considering how the numRecordsOut metric should behave for operators that emit data to other operators _and_ external systems. This goes beyond sinks.
> This even applies to numRecordsIn, for cases where functions query/write data from/to the outside, (e.g., Async IO).
>
> Having 2 separate metrics for that, 1 exclusively for internal data transfers, and 1 exclusively for external data transfers, is the only way to get a consistent metric definition in the long-run.
> We can jump back-and-forth now or just commit to it.
>
> I don't think we can really judge this based on FLIP-33. It was IIRC written before the two phase sinks were added, which heavily blurred the lines of what a sink even is. Because it definitely is _not_ the last operator in a chain anymore.
>
> What I would suggest is to stick with what we got (although I despise the name numRecordsSend), and alias the numRecordsOut metric for all non-TwoPhaseCommittingSink.
>
> On 11/10/2022 05:54, Qingsheng Ren wrote:
>
> Thanks for the details Chesnay!
>
> By “alias” I mean to respect the original definition made in FLIP-33 for numRecordsOut, which is the number of records written to the external system, and keep numRecordsSend as the same value as numRecordsOut for compatibility.
>
> I think keeping numRecordsOut for the output to the external system is more intuitive to end users because in most cases the metric of data flow output is more essential. I agree with you that a new metric is required, but considering compatibility and users’ intuition I prefer to keep the initial definition of numRecordsOut in FLIP-33 and name a new metric for sink writer’s output to downstream operators. This might be against consistency with metrics in other operators in Flink but maybe it’s acceptable to have the sink as a special case.
>
> Best,
> Qingsheng
> On Oct 10, 2022, 19:13 +0800, Chesnay Schepler <ch...@apache.org>, wrote:
>
> > I’m with Xintong’s idea to treat numXXXSend as an alias of numXXXOut
>
> But that's not possible. If it were that simple there would have never been a need to introduce another metric in the first place.
>
> It's a rather fundamental issue with how the new sinks work, in that they emit data to the external system (usually considered as "numRecordsOut" of sinks) while _also_ sending data to a downstream operator (usually considered as "numRecordsOut" of tasks).
> The original issue was that the numRecordsOut of the sink counted both (which is completely wrong).
>
> A new metric was always required; otherwise you inevitably end up breaking some semantic.
> Adding a new metric for what the sink writes to the external system is, for better or worse, more consistent with how these metrics usually work in Flink.
>
> On 10/10/2022 12:45, Qingsheng Ren wrote:
>
> Thanks everyone for joining the discussion!
>
> > Do you have any idea what has happened in the process here?
>
> The discussion in this PR [1] shows some details and could be helpful to understand the original motivation of the renaming. We do have a test case for guarding metrics but unfortunaly the case was also modified so the defense was broken.
>
> I think the reason why both the developer and the reviewer forgot to trigger an discussion and gave a green pass on the change is that metrics are quite “trivial” to be noticed as public APIs. As mentioned by Martijn I couldn’t find a place noting that metrics are public APIs and should be treated carefully while contributing and reviewing.
>
> IMHO three actions could be made to prevent this kind of changes in the future:
>
> a. Add test case for metrics (which we already have in SinkMetricsITCase)
> b. We emphasize that any public-interface breaking changes should be proposed by a FLIP or discussed in mailing list, and should be listed in the release note.
> c. We remind contributors and reviewers about what should be considered as public API, and include metric names in it.
>
> For b and c these two pages [2][3] might be proper places.
>
> About the patch to revert this, it looks like we have a consensus on 1.16. As of 1.15 I think it’s worthy to trigger a minor version. I didn’t see complaints about this for now so it should be OK to save the situation asap. I’m with Xintong’s idea to treat numXXXSend as an alias of numXXXOut considering there could possibly some users have already adapted their system to the new naming, and have another internal metric for reflecting number of outgoing committable batches (actually the numRecordsIn of sink committer operator should be carrying this info already).
>
> [1] https://github.com/apache/flink/pull/18825
> [2] https://flink.apache.org/contributing/contribute-code.html
> [3] https://flink.apache.org/contributing/reviewing-prs.html
>
> Best,
> Qingsheng
> On Oct 10, 2022, 17:40 +0800, Xintong Song <to...@gmail.com>, wrote:
>
> +1 for reverting these changes in Flink 1.16.
>
> For 1.15.3, can we make these metrics available via both names (numXXXOut and numXXXSend)? In this way we don't break it for those who already migrated to 1.15 and numXXXSend. That means we still need to change SinkWriterOperator to use another metric name in 1.15.3, which IIUC is internal to Flink sink.
>
> I'm overall +1 to change numXXXOut back to its original semantics. AFAIK (from meetup / flink-forward questionaires), most users do not migrate to a new Flink release immediately, until the next 1-2 major releases are out.
>
> Best,
>
> Xintong
>
>
>
> On Mon, Oct 10, 2022 at 5:26 PM Martijn Visser <ma...@apache.org> wrote:
>>
>> Hi Qingsheng,
>>
>> Do you have any idea what has happened in the process here? Do we know why
>> they were changed? I was under the impression that these metric names were
>> newly introduced due to the new interfaces and because it still depends on
>> each connector implementing these.
>>
>> Sidenote: metric names are not mentioned in the FLIP process as a public
>> API. Might make sense to have a separate follow-up to add that to the list
>> (I do think we should list them there).
>>
>> +1 for reverting this and make this change in Flink 1.16
>>
>> I'm not in favour of releasing a Flink 1.15.3 with this change: I think the
>> impact is too big for a patch version, especially given how long Flink 1.15
>> is already out there.
>>
>> Best regards,
>>
>> Martijn
>>
>> On Mon, Oct 10, 2022 at 11:13 AM Leonard Xu <xb...@gmail.com> wrote:
>>
>> > Thanks Qingsheng for starting this thread.
>> >
>> > +1 on reverting sink metric name and releasing 1.15.3 to fix this
>> > inconsistent behavior.
>> >
>> >
>> > Best,
>> > Leonard
>> >
>> >
>> >
>> >
>> >
>> > 2022年10月10日 下午3:06,Jark Wu <im...@gmail.com> 写道:
>> >
>> > Thanks for discovering this problem, Qingsheng!
>> >
>> > I'm also +1 for reverting the breaking changes.
>> >
>> > IIUC, currently, the behavior of "numXXXOut" metrics of the new and old
>> > sink is inconsistent.
>> > We have to break one of them to have consistent behavior. Sink V2 is an
>> > evolving API which is just introduced in 1.15.
>> > I think it makes sense to break the unstable API instead of the stable API
>> > which many connectors and users depend on.
>> >
>> > Best,
>> > Jark
>> >
>> >
>> >
>> > On Mon, 10 Oct 2022 at 11:36, Jingsong Li <ji...@gmail.com> wrote:
>> >
>> >> Thanks for driving, Qingsheng.
>> >>
>> >> +1 for reverting sink metric name.
>> >>
>> >> We often forget that metric is also one of the important APIs.
>> >>
>> >> +1 for releasing 1.15.3 to fix this.
>> >>
>> >> Best,
>> >> Jingsong
>> >>
>> >> On Sun, Oct 9, 2022 at 11:35 PM Becket Qin <be...@gmail.com> wrote:
>> >> >
>> >> > Thanks for raising the discussion, Qingsheng,
>> >> >
>> >> > +1 on reverting the breaking changes.
>> >> >
>> >> > In addition, we might want to release a 1.15.3 to fix this and update
>> >> the previous release docs with this known issue, so that users can upgrade
>> >> to 1.15.3 when they hit it. It would also be good to add some backwards
>> >> compatibility tests on metrics to avoid unintended breaking changes like
>> >> this in the future.
>> >> >
>> >> > Thanks,
>> >> >
>> >> > Jiangjie (Becket) Qin
>> >> >
>> >> > On Sun, Oct 9, 2022 at 10:35 AM Qingsheng Ren <re...@apache.org> wrote:
>> >> >>
>> >> >> Hi devs and users,
>> >> >>
>> >> >> I’d like to start a discussion about reverting a breaking change about
>> >> sink metrics made in 1.15 by FLINK-26126 [1] and FLINK-26492 [2].
>> >> >>
>> >> >> TL;DR
>> >> >>
>> >> >> All sink metrics with name “numXXXOut” defined in FLIP-33 are replace
>> >> by “numXXXSend” in FLINK-26126 and FLINK-26492. Considering metric names
>> >> are public APIs, this is a breaking change to end users and not backward
>> >> compatible. Also unfortunately this breaking change was not discussed in
>> >> the mailing list before.
>> >> >>
>> >> >> Background
>> >> >>
>> >> >> As defined previously in FLIP-33 (the FLIP page has been changed so
>> >> please refer to the old version [3] ), metric “numRecordsOut” is used for
>> >> reporting the total number of output records since the sink started (number
>> >> of records written to the external system), and similarly for
>> >> “numRecordsOutPerSecond”, “numBytesOut”, “numBytesOutPerSecond” and
>> >> “numRecordsOutError”. Most sinks are following this naming and definition.
>> >> However, these metrics are ambiguous in the new Sink API as “numXXXOut”
>> >> could be used by the output of SinkWriterOperator for reporting number of
>> >> Committables delivered to SinkCommitterOperator. In order to resolve the
>> >> conflict, FLINK-26126 and FLINK-26492 changed names of these metrics with
>> >> “numXXXSend”.
>> >> >>
>> >> >> Necessity of reverting this change
>> >> >>
>> >> >> - Metric names are actually public API, as end users need to configure
>> >> metric collecting and alerting system with metric names. Users have to
>> >> reset all configurations related to affected metrics.
>> >> >> - This could also affect custom and external sinks not maintained by
>> >> Flink, which might have implemented with numXXXOut metrics.
>> >> >> - The number of records sent to external system is way more important
>> >> than the number of Committables sent to SinkCommitterOperator, as the
>> >> latter one is just an internal implementation of sink. We could have a new
>> >> metric name for the latter one instead.
>> >> >> - We could avoid splitting the project by version (like “plz use
>> >> numXXXOut before 1.15 and use numXXXSend after”) if we revert it ASAP,
>> >> cosidering 1.16 is still not released for now.
>> >> >>
>> >> >> As a consequence, I’d like to hear from devs and users about your
>> >> opinion on changing these metrics back to “numXXXOut”.
>> >> >>
>> >> >> Looking forward to your reply!
>> >> >>
>> >> >> [1] https://issues.apache.org/jira/browse/FLINK-26126
>> >> >> [2] https://issues.apache.org/jira/browse/FLINK-26492
>> >> >> [1] FLIP-33, version 18:
>> >> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=211883136
>> >> >>
>> >> >> Best,
>> >> >> Qingsheng
>> >>
>> >
>> >
>
>
>

Re: [DISCUSS] Reverting sink metric name changes made in 1.15

Posted by Qingsheng Ren <re...@apache.org>.
Thanks Chesnay for the reply. +1 for making a unified and clearer
metric definition distinguishing internal and external data transfers.
As you described, having IO in operators is quite common such as
dimension tables in Table/SQL API. This definitely deserves a FLIP and
an overall design.

However I think it's necessary to change the metric back to
numRecordsOut instead of sticking with numRecordsSend in 1.15 and
1.16. The most important argument is for compatibility as I mentioned
in my previous email, otherwise all users have to modify their configs
of metric systems after upgrading to Flink 1.15+, and all custom
connectors have to change their implementations to migrate to the new
metric name. I believe other ones participating and approving this
proposal share the same concern about compatibility too. Also
considering this issue is blocking the release of 1.16, maybe we could
fix this asap, and as for defining a new metric for internal data
transfers we can have an in-depth discussion later. WDYT?

Best,
Qingsheng

On Tue, Oct 11, 2022 at 6:06 PM Chesnay Schepler <ch...@apache.org> wrote:
>
> Currently I think that would be a mistake.
>
> Ultimately what we have here is the culmination of us never really considering how the numRecordsOut metric should behave for operators that emit data to other operators _and_ external systems. This goes beyond sinks.
> This even applies to numRecordsIn, for cases where functions query/write data from/to the outside, (e.g., Async IO).
>
> Having 2 separate metrics for that, 1 exclusively for internal data transfers, and 1 exclusively for external data transfers, is the only way to get a consistent metric definition in the long-run.
> We can jump back-and-forth now or just commit to it.
>
> I don't think we can really judge this based on FLIP-33. It was IIRC written before the two phase sinks were added, which heavily blurred the lines of what a sink even is. Because it definitely is _not_ the last operator in a chain anymore.
>
> What I would suggest is to stick with what we got (although I despise the name numRecordsSend), and alias the numRecordsOut metric for all non-TwoPhaseCommittingSink.
>
> On 11/10/2022 05:54, Qingsheng Ren wrote:
>
> Thanks for the details Chesnay!
>
> By “alias” I mean to respect the original definition made in FLIP-33 for numRecordsOut, which is the number of records written to the external system, and keep numRecordsSend as the same value as numRecordsOut for compatibility.
>
> I think keeping numRecordsOut for the output to the external system is more intuitive to end users because in most cases the metric of data flow output is more essential. I agree with you that a new metric is required, but considering compatibility and users’ intuition I prefer to keep the initial definition of numRecordsOut in FLIP-33 and name a new metric for sink writer’s output to downstream operators. This might be against consistency with metrics in other operators in Flink but maybe it’s acceptable to have the sink as a special case.
>
> Best,
> Qingsheng
> On Oct 10, 2022, 19:13 +0800, Chesnay Schepler <ch...@apache.org>, wrote:
>
> > I’m with Xintong’s idea to treat numXXXSend as an alias of numXXXOut
>
> But that's not possible. If it were that simple there would have never been a need to introduce another metric in the first place.
>
> It's a rather fundamental issue with how the new sinks work, in that they emit data to the external system (usually considered as "numRecordsOut" of sinks) while _also_ sending data to a downstream operator (usually considered as "numRecordsOut" of tasks).
> The original issue was that the numRecordsOut of the sink counted both (which is completely wrong).
>
> A new metric was always required; otherwise you inevitably end up breaking some semantic.
> Adding a new metric for what the sink writes to the external system is, for better or worse, more consistent with how these metrics usually work in Flink.
>
> On 10/10/2022 12:45, Qingsheng Ren wrote:
>
> Thanks everyone for joining the discussion!
>
> > Do you have any idea what has happened in the process here?
>
> The discussion in this PR [1] shows some details and could be helpful to understand the original motivation of the renaming. We do have a test case for guarding metrics but unfortunaly the case was also modified so the defense was broken.
>
> I think the reason why both the developer and the reviewer forgot to trigger an discussion and gave a green pass on the change is that metrics are quite “trivial” to be noticed as public APIs. As mentioned by Martijn I couldn’t find a place noting that metrics are public APIs and should be treated carefully while contributing and reviewing.
>
> IMHO three actions could be made to prevent this kind of changes in the future:
>
> a. Add test case for metrics (which we already have in SinkMetricsITCase)
> b. We emphasize that any public-interface breaking changes should be proposed by a FLIP or discussed in mailing list, and should be listed in the release note.
> c. We remind contributors and reviewers about what should be considered as public API, and include metric names in it.
>
> For b and c these two pages [2][3] might be proper places.
>
> About the patch to revert this, it looks like we have a consensus on 1.16. As of 1.15 I think it’s worthy to trigger a minor version. I didn’t see complaints about this for now so it should be OK to save the situation asap. I’m with Xintong’s idea to treat numXXXSend as an alias of numXXXOut considering there could possibly some users have already adapted their system to the new naming, and have another internal metric for reflecting number of outgoing committable batches (actually the numRecordsIn of sink committer operator should be carrying this info already).
>
> [1] https://github.com/apache/flink/pull/18825
> [2] https://flink.apache.org/contributing/contribute-code.html
> [3] https://flink.apache.org/contributing/reviewing-prs.html
>
> Best,
> Qingsheng
> On Oct 10, 2022, 17:40 +0800, Xintong Song <to...@gmail.com>, wrote:
>
> +1 for reverting these changes in Flink 1.16.
>
> For 1.15.3, can we make these metrics available via both names (numXXXOut and numXXXSend)? In this way we don't break it for those who already migrated to 1.15 and numXXXSend. That means we still need to change SinkWriterOperator to use another metric name in 1.15.3, which IIUC is internal to Flink sink.
>
> I'm overall +1 to change numXXXOut back to its original semantics. AFAIK (from meetup / flink-forward questionaires), most users do not migrate to a new Flink release immediately, until the next 1-2 major releases are out.
>
> Best,
>
> Xintong
>
>
>
> On Mon, Oct 10, 2022 at 5:26 PM Martijn Visser <ma...@apache.org> wrote:
>>
>> Hi Qingsheng,
>>
>> Do you have any idea what has happened in the process here? Do we know why
>> they were changed? I was under the impression that these metric names were
>> newly introduced due to the new interfaces and because it still depends on
>> each connector implementing these.
>>
>> Sidenote: metric names are not mentioned in the FLIP process as a public
>> API. Might make sense to have a separate follow-up to add that to the list
>> (I do think we should list them there).
>>
>> +1 for reverting this and make this change in Flink 1.16
>>
>> I'm not in favour of releasing a Flink 1.15.3 with this change: I think the
>> impact is too big for a patch version, especially given how long Flink 1.15
>> is already out there.
>>
>> Best regards,
>>
>> Martijn
>>
>> On Mon, Oct 10, 2022 at 11:13 AM Leonard Xu <xb...@gmail.com> wrote:
>>
>> > Thanks Qingsheng for starting this thread.
>> >
>> > +1 on reverting sink metric name and releasing 1.15.3 to fix this
>> > inconsistent behavior.
>> >
>> >
>> > Best,
>> > Leonard
>> >
>> >
>> >
>> >
>> >
>> > 2022年10月10日 下午3:06,Jark Wu <im...@gmail.com> 写道:
>> >
>> > Thanks for discovering this problem, Qingsheng!
>> >
>> > I'm also +1 for reverting the breaking changes.
>> >
>> > IIUC, currently, the behavior of "numXXXOut" metrics of the new and old
>> > sink is inconsistent.
>> > We have to break one of them to have consistent behavior. Sink V2 is an
>> > evolving API which is just introduced in 1.15.
>> > I think it makes sense to break the unstable API instead of the stable API
>> > which many connectors and users depend on.
>> >
>> > Best,
>> > Jark
>> >
>> >
>> >
>> > On Mon, 10 Oct 2022 at 11:36, Jingsong Li <ji...@gmail.com> wrote:
>> >
>> >> Thanks for driving, Qingsheng.
>> >>
>> >> +1 for reverting sink metric name.
>> >>
>> >> We often forget that metric is also one of the important APIs.
>> >>
>> >> +1 for releasing 1.15.3 to fix this.
>> >>
>> >> Best,
>> >> Jingsong
>> >>
>> >> On Sun, Oct 9, 2022 at 11:35 PM Becket Qin <be...@gmail.com> wrote:
>> >> >
>> >> > Thanks for raising the discussion, Qingsheng,
>> >> >
>> >> > +1 on reverting the breaking changes.
>> >> >
>> >> > In addition, we might want to release a 1.15.3 to fix this and update
>> >> the previous release docs with this known issue, so that users can upgrade
>> >> to 1.15.3 when they hit it. It would also be good to add some backwards
>> >> compatibility tests on metrics to avoid unintended breaking changes like
>> >> this in the future.
>> >> >
>> >> > Thanks,
>> >> >
>> >> > Jiangjie (Becket) Qin
>> >> >
>> >> > On Sun, Oct 9, 2022 at 10:35 AM Qingsheng Ren <re...@apache.org> wrote:
>> >> >>
>> >> >> Hi devs and users,
>> >> >>
>> >> >> I’d like to start a discussion about reverting a breaking change about
>> >> sink metrics made in 1.15 by FLINK-26126 [1] and FLINK-26492 [2].
>> >> >>
>> >> >> TL;DR
>> >> >>
>> >> >> All sink metrics with name “numXXXOut” defined in FLIP-33 are replace
>> >> by “numXXXSend” in FLINK-26126 and FLINK-26492. Considering metric names
>> >> are public APIs, this is a breaking change to end users and not backward
>> >> compatible. Also unfortunately this breaking change was not discussed in
>> >> the mailing list before.
>> >> >>
>> >> >> Background
>> >> >>
>> >> >> As defined previously in FLIP-33 (the FLIP page has been changed so
>> >> please refer to the old version [3] ), metric “numRecordsOut” is used for
>> >> reporting the total number of output records since the sink started (number
>> >> of records written to the external system), and similarly for
>> >> “numRecordsOutPerSecond”, “numBytesOut”, “numBytesOutPerSecond” and
>> >> “numRecordsOutError”. Most sinks are following this naming and definition.
>> >> However, these metrics are ambiguous in the new Sink API as “numXXXOut”
>> >> could be used by the output of SinkWriterOperator for reporting number of
>> >> Committables delivered to SinkCommitterOperator. In order to resolve the
>> >> conflict, FLINK-26126 and FLINK-26492 changed names of these metrics with
>> >> “numXXXSend”.
>> >> >>
>> >> >> Necessity of reverting this change
>> >> >>
>> >> >> - Metric names are actually public API, as end users need to configure
>> >> metric collecting and alerting system with metric names. Users have to
>> >> reset all configurations related to affected metrics.
>> >> >> - This could also affect custom and external sinks not maintained by
>> >> Flink, which might have implemented with numXXXOut metrics.
>> >> >> - The number of records sent to external system is way more important
>> >> than the number of Committables sent to SinkCommitterOperator, as the
>> >> latter one is just an internal implementation of sink. We could have a new
>> >> metric name for the latter one instead.
>> >> >> - We could avoid splitting the project by version (like “plz use
>> >> numXXXOut before 1.15 and use numXXXSend after”) if we revert it ASAP,
>> >> cosidering 1.16 is still not released for now.
>> >> >>
>> >> >> As a consequence, I’d like to hear from devs and users about your
>> >> opinion on changing these metrics back to “numXXXOut”.
>> >> >>
>> >> >> Looking forward to your reply!
>> >> >>
>> >> >> [1] https://issues.apache.org/jira/browse/FLINK-26126
>> >> >> [2] https://issues.apache.org/jira/browse/FLINK-26492
>> >> >> [1] FLIP-33, version 18:
>> >> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=211883136
>> >> >>
>> >> >> Best,
>> >> >> Qingsheng
>> >>
>> >
>> >
>
>
>

Re: [DISCUSS] Reverting sink metric name changes made in 1.15

Posted by Chesnay Schepler <ch...@apache.org>.
Currently I think that would be a mistake.

Ultimately what we have here is the culmination of us never really 
considering how the numRecordsOut metric should behave for operators 
that emit data to other operators _and_ external systems. This goes 
beyond sinks.
This even applies to numRecordsIn, for cases where functions query/write 
data from/to the outside, (e.g., Async IO).

Having 2 separate metrics for that, 1 exclusively for internal data 
transfers, and 1 exclusively for external data transfers, is the only 
way to get a consistent metric definition in the long-run.
We can jump back-and-forth now or just commit to it.

I don't think we can really judge this based on FLIP-33. It was IIRC 
written before the two phase sinks were added, which heavily blurred the 
lines of what a sink even is. Because it definitely is _not_ the last 
operator in a chain anymore.

What I would suggest is to stick with what we got (although I despise 
the name numRecordsSend), and alias the numRecordsOut metric for all 
non-TwoPhaseCommittingSink.

On 11/10/2022 05:54, Qingsheng Ren wrote:
> Thanks for the details Chesnay!
>
> By “alias” I mean to respect the original definition made in FLIP-33 
> for numRecordsOut, which is the number of records written to the 
> external system, and keep numRecordsSend as the same value as 
> numRecordsOut for compatibility.
>
> I think keeping numRecordsOut for the output to the external system is 
> more intuitive to end users because in most cases the metric of data 
> flow output is more essential. I agree with you that a new metric is 
> required, but considering compatibility and users’ intuition I prefer 
> to keep the initial definition of numRecordsOut in FLIP-33 and name a 
> new metric for sink writer’s output to downstream operators. This 
> might be against consistency with metrics in other operators in Flink 
> but maybe it’s acceptable to have the sink as a special case.
>
> Best,
> Qingsheng
> On Oct 10, 2022, 19:13 +0800, Chesnay Schepler <ch...@apache.org>, 
> wrote:
>> > I’m with Xintong’s idea to treat numXXXSend as an alias of numXXXOut
>>
>> But that's not possible. If it were that simple there would have 
>> never been a need to introduce another metric in the first place.
>>
>> It's a rather fundamental issue with how the new sinks work, in that 
>> they emit data to the external system (usually considered as 
>> "numRecordsOut" of sinks) while _also_ sending data to a downstream 
>> operator (usually considered as "numRecordsOut" of tasks).
>> The original issue was that the numRecordsOut of the sink counted 
>> both (which is completely wrong).
>>
>> A new metric was always required; otherwise you inevitably end up 
>> breaking /some/ semantic.
>> Adding a new metric for what the sink writes to the external system 
>> is, for better or worse, more consistent with how these metrics 
>> usually work in Flink.
>>
>> On 10/10/2022 12:45, Qingsheng Ren wrote:
>>> Thanks everyone for joining the discussion!
>>>
>>> > Do you have any idea what has happened in the process here?
>>>
>>> The discussion in this PR [1] shows some details and could be 
>>> helpful to understand the original motivation of the renaming. We do 
>>> have a test case for guarding metrics but unfortunaly the case was 
>>> also modified so the defense was broken.
>>>
>>> I think the reason why both the developer and the reviewer forgot to 
>>> trigger an discussion and gave a green pass on the change is that 
>>> metrics are quite “trivial” to be noticed as public APIs. As 
>>> mentioned by Martijn I couldn’t find a place noting that metrics are 
>>> public APIs and should be treated carefully while contributing and 
>>> reviewing.
>>>
>>> IMHO three actions could be made to prevent this kind of changes in 
>>> the future:
>>>
>>> a. Add test case for metrics (which we already have in 
>>> SinkMetricsITCase)
>>> b. We emphasize that any public-interface breaking changes should be 
>>> proposed by a FLIP or discussed in mailing list, and should be 
>>> listed in the release note.
>>> c. We remind contributors and reviewers about what should be 
>>> considered as public API, and include metric names in it.
>>>
>>> For b and c these two pages [2][3] might be proper places.
>>>
>>> About the patch to revert this, it looks like we have a consensus on 
>>> 1.16. As of 1.15 I think it’s worthy to trigger a minor version. I 
>>> didn’t see complaints about this for now so it should be OK to save 
>>> the situation asap. I’m with Xintong’s idea to treat numXXXSend as 
>>> an alias of numXXXOut considering there could possibly some users 
>>> have already adapted their system to the new naming, and have 
>>> another internal metric for reflecting number of outgoing 
>>> committable batches (actually the numRecordsIn of sink committer 
>>> operator should be carrying this info already).
>>>
>>> [1] https://github.com/apache/flink/pull/18825
>>> [2] https://flink.apache.org/contributing/contribute-code.html
>>> [3] https://flink.apache.org/contributing/reviewing-prs.html
>>>
>>> Best,
>>> Qingsheng
>>> On Oct 10, 2022, 17:40 +0800, Xintong Song <to...@gmail.com>, 
>>> wrote:
>>>> +1 for reverting these changes in Flink 1.16.
>>>>
>>>> For 1.15.3, can we make these metrics available via both names 
>>>> (numXXXOut and numXXXSend)? In this way we don't break it for those 
>>>> who already migrated to 1.15 and numXXXSend. That means we still 
>>>> need to change SinkWriterOperator to use another metric name in 
>>>> 1.15.3, which IIUC is internal to Flink sink.
>>>>
>>>> I'm overall +1 to change numXXXOut back to its original semantics. 
>>>> AFAIK (from meetup / flink-forward questionaires), most users do 
>>>> not migrate to a new Flink release immediately, until the next 1-2 
>>>> major releases are out.
>>>>
>>>> Best,
>>>>
>>>> Xintong
>>>>
>>>>
>>>>
>>>> On Mon, Oct 10, 2022 at 5:26 PM Martijn Visser 
>>>> <ma...@apache.org> wrote:
>>>>
>>>>     Hi Qingsheng,
>>>>
>>>>     Do you have any idea what has happened in the process here? Do
>>>>     we know why
>>>>     they were changed? I was under the impression that these metric
>>>>     names were
>>>>     newly introduced due to the new interfaces and because it still
>>>>     depends on
>>>>     each connector implementing these.
>>>>
>>>>     Sidenote: metric names are not mentioned in the FLIP process as
>>>>     a public
>>>>     API. Might make sense to have a separate follow-up to add that
>>>>     to the list
>>>>     (I do think we should list them there).
>>>>
>>>>     +1 for reverting this and make this change in Flink 1.16
>>>>
>>>>     I'm not in favour of releasing a Flink 1.15.3 with this change:
>>>>     I think the
>>>>     impact is too big for a patch version, especially given how
>>>>     long Flink 1.15
>>>>     is already out there.
>>>>
>>>>     Best regards,
>>>>
>>>>     Martijn
>>>>
>>>>     On Mon, Oct 10, 2022 at 11:13 AM Leonard Xu <xb...@gmail.com>
>>>>     wrote:
>>>>
>>>>     > Thanks Qingsheng for starting this thread.
>>>>     >
>>>>     > +1 on reverting sink metric name and releasing 1.15.3 to fix this
>>>>     > inconsistent behavior.
>>>>     >
>>>>     >
>>>>     > Best,
>>>>     > Leonard
>>>>     >
>>>>     >
>>>>     >
>>>>     >
>>>>     >
>>>>     > 2022年10月10日 下午3:06,Jark Wu <im...@gmail.com> 写道:
>>>>     >
>>>>     > Thanks for discovering this problem, Qingsheng!
>>>>     >
>>>>     > I'm also +1 for reverting the breaking changes.
>>>>     >
>>>>     > IIUC, currently, the behavior of "numXXXOut" metrics of the
>>>>     new and old
>>>>     > sink is inconsistent.
>>>>     > We have to break one of them to have consistent behavior.
>>>>     Sink V2 is an
>>>>     > evolving API which is just introduced in 1.15.
>>>>     > I think it makes sense to break the unstable API instead of
>>>>     the stable API
>>>>     > which many connectors and users depend on.
>>>>     >
>>>>     > Best,
>>>>     > Jark
>>>>     >
>>>>     >
>>>>     >
>>>>     > On Mon, 10 Oct 2022 at 11:36, Jingsong Li
>>>>     <ji...@gmail.com> wrote:
>>>>     >
>>>>     >> Thanks for driving, Qingsheng.
>>>>     >>
>>>>     >> +1 for reverting sink metric name.
>>>>     >>
>>>>     >> We often forget that metric is also one of the important APIs.
>>>>     >>
>>>>     >> +1 for releasing 1.15.3 to fix this.
>>>>     >>
>>>>     >> Best,
>>>>     >> Jingsong
>>>>     >>
>>>>     >> On Sun, Oct 9, 2022 at 11:35 PM Becket Qin
>>>>     <be...@gmail.com> wrote:
>>>>     >> >
>>>>     >> > Thanks for raising the discussion, Qingsheng,
>>>>     >> >
>>>>     >> > +1 on reverting the breaking changes.
>>>>     >> >
>>>>     >> > In addition, we might want to release a 1.15.3 to fix this
>>>>     and update
>>>>     >> the previous release docs with this known issue, so that
>>>>     users can upgrade
>>>>     >> to 1.15.3 when they hit it. It would also be good to add
>>>>     some backwards
>>>>     >> compatibility tests on metrics to avoid unintended breaking
>>>>     changes like
>>>>     >> this in the future.
>>>>     >> >
>>>>     >> > Thanks,
>>>>     >> >
>>>>     >> > Jiangjie (Becket) Qin
>>>>     >> >
>>>>     >> > On Sun, Oct 9, 2022 at 10:35 AM Qingsheng Ren
>>>>     <re...@apache.org> wrote:
>>>>     >> >>
>>>>     >> >> Hi devs and users,
>>>>     >> >>
>>>>     >> >> I’d like to start a discussion about reverting a breaking
>>>>     change about
>>>>     >> sink metrics made in 1.15 by FLINK-26126 [1] and FLINK-26492
>>>>     [2].
>>>>     >> >>
>>>>     >> >> TL;DR
>>>>     >> >>
>>>>     >> >> All sink metrics with name “numXXXOut” defined in FLIP-33
>>>>     are replace
>>>>     >> by “numXXXSend” in FLINK-26126 and FLINK-26492. Considering
>>>>     metric names
>>>>     >> are public APIs, this is a breaking change to end users and
>>>>     not backward
>>>>     >> compatible. Also unfortunately this breaking change was not
>>>>     discussed in
>>>>     >> the mailing list before.
>>>>     >> >>
>>>>     >> >> Background
>>>>     >> >>
>>>>     >> >> As defined previously in FLIP-33 (the FLIP page has been
>>>>     changed so
>>>>     >> please refer to the old version [3] ), metric
>>>>     “numRecordsOut” is used for
>>>>     >> reporting the total number of output records since the sink
>>>>     started (number
>>>>     >> of records written to the external system), and similarly for
>>>>     >> “numRecordsOutPerSecond”, “numBytesOut”,
>>>>     “numBytesOutPerSecond” and
>>>>     >> “numRecordsOutError”. Most sinks are following this naming
>>>>     and definition.
>>>>     >> However, these metrics are ambiguous in the new Sink API as
>>>>     “numXXXOut”
>>>>     >> could be used by the output of SinkWriterOperator for
>>>>     reporting number of
>>>>     >> Committables delivered to SinkCommitterOperator. In order to
>>>>     resolve the
>>>>     >> conflict, FLINK-26126 and FLINK-26492 changed names of these
>>>>     metrics with
>>>>     >> “numXXXSend”.
>>>>     >> >>
>>>>     >> >> Necessity of reverting this change
>>>>     >> >>
>>>>     >> >> - Metric names are actually public API, as end users need
>>>>     to configure
>>>>     >> metric collecting and alerting system with metric names.
>>>>     Users have to
>>>>     >> reset all configurations related to affected metrics.
>>>>     >> >> - This could also affect custom and external sinks not
>>>>     maintained by
>>>>     >> Flink, which might have implemented with numXXXOut metrics.
>>>>     >> >> - The number of records sent to external system is way
>>>>     more important
>>>>     >> than the number of Committables sent to
>>>>     SinkCommitterOperator, as the
>>>>     >> latter one is just an internal implementation of sink. We
>>>>     could have a new
>>>>     >> metric name for the latter one instead.
>>>>     >> >> - We could avoid splitting the project by version (like
>>>>     “plz use
>>>>     >> numXXXOut before 1.15 and use numXXXSend after”) if we
>>>>     revert it ASAP,
>>>>     >> cosidering 1.16 is still not released for now.
>>>>     >> >>
>>>>     >> >> As a consequence, I’d like to hear from devs and users
>>>>     about your
>>>>     >> opinion on changing these metrics back to “numXXXOut”.
>>>>     >> >>
>>>>     >> >> Looking forward to your reply!
>>>>     >> >>
>>>>     >> >> [1] https://issues.apache.org/jira/browse/FLINK-26126
>>>>     >> >> [2] https://issues.apache.org/jira/browse/FLINK-26492
>>>>     >> >> [1] FLIP-33, version 18:
>>>>     >>
>>>>     https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=211883136
>>>>     >> >>
>>>>     >> >> Best,
>>>>     >> >> Qingsheng
>>>>     >>
>>>>     >
>>>>     >
>>>>
>>

Re: [DISCUSS] Reverting sink metric name changes made in 1.15

Posted by Chesnay Schepler <ch...@apache.org>.
Currently I think that would be a mistake.

Ultimately what we have here is the culmination of us never really 
considering how the numRecordsOut metric should behave for operators 
that emit data to other operators _and_ external systems. This goes 
beyond sinks.
This even applies to numRecordsIn, for cases where functions query/write 
data from/to the outside, (e.g., Async IO).

Having 2 separate metrics for that, 1 exclusively for internal data 
transfers, and 1 exclusively for external data transfers, is the only 
way to get a consistent metric definition in the long-run.
We can jump back-and-forth now or just commit to it.

I don't think we can really judge this based on FLIP-33. It was IIRC 
written before the two phase sinks were added, which heavily blurred the 
lines of what a sink even is. Because it definitely is _not_ the last 
operator in a chain anymore.

What I would suggest is to stick with what we got (although I despise 
the name numRecordsSend), and alias the numRecordsOut metric for all 
non-TwoPhaseCommittingSink.

On 11/10/2022 05:54, Qingsheng Ren wrote:
> Thanks for the details Chesnay!
>
> By “alias” I mean to respect the original definition made in FLIP-33 
> for numRecordsOut, which is the number of records written to the 
> external system, and keep numRecordsSend as the same value as 
> numRecordsOut for compatibility.
>
> I think keeping numRecordsOut for the output to the external system is 
> more intuitive to end users because in most cases the metric of data 
> flow output is more essential. I agree with you that a new metric is 
> required, but considering compatibility and users’ intuition I prefer 
> to keep the initial definition of numRecordsOut in FLIP-33 and name a 
> new metric for sink writer’s output to downstream operators. This 
> might be against consistency with metrics in other operators in Flink 
> but maybe it’s acceptable to have the sink as a special case.
>
> Best,
> Qingsheng
> On Oct 10, 2022, 19:13 +0800, Chesnay Schepler <ch...@apache.org>, 
> wrote:
>> > I’m with Xintong’s idea to treat numXXXSend as an alias of numXXXOut
>>
>> But that's not possible. If it were that simple there would have 
>> never been a need to introduce another metric in the first place.
>>
>> It's a rather fundamental issue with how the new sinks work, in that 
>> they emit data to the external system (usually considered as 
>> "numRecordsOut" of sinks) while _also_ sending data to a downstream 
>> operator (usually considered as "numRecordsOut" of tasks).
>> The original issue was that the numRecordsOut of the sink counted 
>> both (which is completely wrong).
>>
>> A new metric was always required; otherwise you inevitably end up 
>> breaking /some/ semantic.
>> Adding a new metric for what the sink writes to the external system 
>> is, for better or worse, more consistent with how these metrics 
>> usually work in Flink.
>>
>> On 10/10/2022 12:45, Qingsheng Ren wrote:
>>> Thanks everyone for joining the discussion!
>>>
>>> > Do you have any idea what has happened in the process here?
>>>
>>> The discussion in this PR [1] shows some details and could be 
>>> helpful to understand the original motivation of the renaming. We do 
>>> have a test case for guarding metrics but unfortunaly the case was 
>>> also modified so the defense was broken.
>>>
>>> I think the reason why both the developer and the reviewer forgot to 
>>> trigger an discussion and gave a green pass on the change is that 
>>> metrics are quite “trivial” to be noticed as public APIs. As 
>>> mentioned by Martijn I couldn’t find a place noting that metrics are 
>>> public APIs and should be treated carefully while contributing and 
>>> reviewing.
>>>
>>> IMHO three actions could be made to prevent this kind of changes in 
>>> the future:
>>>
>>> a. Add test case for metrics (which we already have in 
>>> SinkMetricsITCase)
>>> b. We emphasize that any public-interface breaking changes should be 
>>> proposed by a FLIP or discussed in mailing list, and should be 
>>> listed in the release note.
>>> c. We remind contributors and reviewers about what should be 
>>> considered as public API, and include metric names in it.
>>>
>>> For b and c these two pages [2][3] might be proper places.
>>>
>>> About the patch to revert this, it looks like we have a consensus on 
>>> 1.16. As of 1.15 I think it’s worthy to trigger a minor version. I 
>>> didn’t see complaints about this for now so it should be OK to save 
>>> the situation asap. I’m with Xintong’s idea to treat numXXXSend as 
>>> an alias of numXXXOut considering there could possibly some users 
>>> have already adapted their system to the new naming, and have 
>>> another internal metric for reflecting number of outgoing 
>>> committable batches (actually the numRecordsIn of sink committer 
>>> operator should be carrying this info already).
>>>
>>> [1] https://github.com/apache/flink/pull/18825
>>> [2] https://flink.apache.org/contributing/contribute-code.html
>>> [3] https://flink.apache.org/contributing/reviewing-prs.html
>>>
>>> Best,
>>> Qingsheng
>>> On Oct 10, 2022, 17:40 +0800, Xintong Song <to...@gmail.com>, 
>>> wrote:
>>>> +1 for reverting these changes in Flink 1.16.
>>>>
>>>> For 1.15.3, can we make these metrics available via both names 
>>>> (numXXXOut and numXXXSend)? In this way we don't break it for those 
>>>> who already migrated to 1.15 and numXXXSend. That means we still 
>>>> need to change SinkWriterOperator to use another metric name in 
>>>> 1.15.3, which IIUC is internal to Flink sink.
>>>>
>>>> I'm overall +1 to change numXXXOut back to its original semantics. 
>>>> AFAIK (from meetup / flink-forward questionaires), most users do 
>>>> not migrate to a new Flink release immediately, until the next 1-2 
>>>> major releases are out.
>>>>
>>>> Best,
>>>>
>>>> Xintong
>>>>
>>>>
>>>>
>>>> On Mon, Oct 10, 2022 at 5:26 PM Martijn Visser 
>>>> <ma...@apache.org> wrote:
>>>>
>>>>     Hi Qingsheng,
>>>>
>>>>     Do you have any idea what has happened in the process here? Do
>>>>     we know why
>>>>     they were changed? I was under the impression that these metric
>>>>     names were
>>>>     newly introduced due to the new interfaces and because it still
>>>>     depends on
>>>>     each connector implementing these.
>>>>
>>>>     Sidenote: metric names are not mentioned in the FLIP process as
>>>>     a public
>>>>     API. Might make sense to have a separate follow-up to add that
>>>>     to the list
>>>>     (I do think we should list them there).
>>>>
>>>>     +1 for reverting this and make this change in Flink 1.16
>>>>
>>>>     I'm not in favour of releasing a Flink 1.15.3 with this change:
>>>>     I think the
>>>>     impact is too big for a patch version, especially given how
>>>>     long Flink 1.15
>>>>     is already out there.
>>>>
>>>>     Best regards,
>>>>
>>>>     Martijn
>>>>
>>>>     On Mon, Oct 10, 2022 at 11:13 AM Leonard Xu <xb...@gmail.com>
>>>>     wrote:
>>>>
>>>>     > Thanks Qingsheng for starting this thread.
>>>>     >
>>>>     > +1 on reverting sink metric name and releasing 1.15.3 to fix this
>>>>     > inconsistent behavior.
>>>>     >
>>>>     >
>>>>     > Best,
>>>>     > Leonard
>>>>     >
>>>>     >
>>>>     >
>>>>     >
>>>>     >
>>>>     > 2022年10月10日 下午3:06,Jark Wu <im...@gmail.com> 写道:
>>>>     >
>>>>     > Thanks for discovering this problem, Qingsheng!
>>>>     >
>>>>     > I'm also +1 for reverting the breaking changes.
>>>>     >
>>>>     > IIUC, currently, the behavior of "numXXXOut" metrics of the
>>>>     new and old
>>>>     > sink is inconsistent.
>>>>     > We have to break one of them to have consistent behavior.
>>>>     Sink V2 is an
>>>>     > evolving API which is just introduced in 1.15.
>>>>     > I think it makes sense to break the unstable API instead of
>>>>     the stable API
>>>>     > which many connectors and users depend on.
>>>>     >
>>>>     > Best,
>>>>     > Jark
>>>>     >
>>>>     >
>>>>     >
>>>>     > On Mon, 10 Oct 2022 at 11:36, Jingsong Li
>>>>     <ji...@gmail.com> wrote:
>>>>     >
>>>>     >> Thanks for driving, Qingsheng.
>>>>     >>
>>>>     >> +1 for reverting sink metric name.
>>>>     >>
>>>>     >> We often forget that metric is also one of the important APIs.
>>>>     >>
>>>>     >> +1 for releasing 1.15.3 to fix this.
>>>>     >>
>>>>     >> Best,
>>>>     >> Jingsong
>>>>     >>
>>>>     >> On Sun, Oct 9, 2022 at 11:35 PM Becket Qin
>>>>     <be...@gmail.com> wrote:
>>>>     >> >
>>>>     >> > Thanks for raising the discussion, Qingsheng,
>>>>     >> >
>>>>     >> > +1 on reverting the breaking changes.
>>>>     >> >
>>>>     >> > In addition, we might want to release a 1.15.3 to fix this
>>>>     and update
>>>>     >> the previous release docs with this known issue, so that
>>>>     users can upgrade
>>>>     >> to 1.15.3 when they hit it. It would also be good to add
>>>>     some backwards
>>>>     >> compatibility tests on metrics to avoid unintended breaking
>>>>     changes like
>>>>     >> this in the future.
>>>>     >> >
>>>>     >> > Thanks,
>>>>     >> >
>>>>     >> > Jiangjie (Becket) Qin
>>>>     >> >
>>>>     >> > On Sun, Oct 9, 2022 at 10:35 AM Qingsheng Ren
>>>>     <re...@apache.org> wrote:
>>>>     >> >>
>>>>     >> >> Hi devs and users,
>>>>     >> >>
>>>>     >> >> I’d like to start a discussion about reverting a breaking
>>>>     change about
>>>>     >> sink metrics made in 1.15 by FLINK-26126 [1] and FLINK-26492
>>>>     [2].
>>>>     >> >>
>>>>     >> >> TL;DR
>>>>     >> >>
>>>>     >> >> All sink metrics with name “numXXXOut” defined in FLIP-33
>>>>     are replace
>>>>     >> by “numXXXSend” in FLINK-26126 and FLINK-26492. Considering
>>>>     metric names
>>>>     >> are public APIs, this is a breaking change to end users and
>>>>     not backward
>>>>     >> compatible. Also unfortunately this breaking change was not
>>>>     discussed in
>>>>     >> the mailing list before.
>>>>     >> >>
>>>>     >> >> Background
>>>>     >> >>
>>>>     >> >> As defined previously in FLIP-33 (the FLIP page has been
>>>>     changed so
>>>>     >> please refer to the old version [3] ), metric
>>>>     “numRecordsOut” is used for
>>>>     >> reporting the total number of output records since the sink
>>>>     started (number
>>>>     >> of records written to the external system), and similarly for
>>>>     >> “numRecordsOutPerSecond”, “numBytesOut”,
>>>>     “numBytesOutPerSecond” and
>>>>     >> “numRecordsOutError”. Most sinks are following this naming
>>>>     and definition.
>>>>     >> However, these metrics are ambiguous in the new Sink API as
>>>>     “numXXXOut”
>>>>     >> could be used by the output of SinkWriterOperator for
>>>>     reporting number of
>>>>     >> Committables delivered to SinkCommitterOperator. In order to
>>>>     resolve the
>>>>     >> conflict, FLINK-26126 and FLINK-26492 changed names of these
>>>>     metrics with
>>>>     >> “numXXXSend”.
>>>>     >> >>
>>>>     >> >> Necessity of reverting this change
>>>>     >> >>
>>>>     >> >> - Metric names are actually public API, as end users need
>>>>     to configure
>>>>     >> metric collecting and alerting system with metric names.
>>>>     Users have to
>>>>     >> reset all configurations related to affected metrics.
>>>>     >> >> - This could also affect custom and external sinks not
>>>>     maintained by
>>>>     >> Flink, which might have implemented with numXXXOut metrics.
>>>>     >> >> - The number of records sent to external system is way
>>>>     more important
>>>>     >> than the number of Committables sent to
>>>>     SinkCommitterOperator, as the
>>>>     >> latter one is just an internal implementation of sink. We
>>>>     could have a new
>>>>     >> metric name for the latter one instead.
>>>>     >> >> - We could avoid splitting the project by version (like
>>>>     “plz use
>>>>     >> numXXXOut before 1.15 and use numXXXSend after”) if we
>>>>     revert it ASAP,
>>>>     >> cosidering 1.16 is still not released for now.
>>>>     >> >>
>>>>     >> >> As a consequence, I’d like to hear from devs and users
>>>>     about your
>>>>     >> opinion on changing these metrics back to “numXXXOut”.
>>>>     >> >>
>>>>     >> >> Looking forward to your reply!
>>>>     >> >>
>>>>     >> >> [1] https://issues.apache.org/jira/browse/FLINK-26126
>>>>     >> >> [2] https://issues.apache.org/jira/browse/FLINK-26492
>>>>     >> >> [1] FLIP-33, version 18:
>>>>     >>
>>>>     https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=211883136
>>>>     >> >>
>>>>     >> >> Best,
>>>>     >> >> Qingsheng
>>>>     >>
>>>>     >
>>>>     >
>>>>
>>

Re: [DISCUSS] Reverting sink metric name changes made in 1.15

Posted by Qingsheng Ren <re...@apache.org>.
Thanks for the details Chesnay!

By “alias” I mean to respect the original definition made in FLIP-33 for numRecordsOut, which is the number of records written to the external system, and keep numRecordsSend as the same value as numRecordsOut for compatibility.

I think keeping numRecordsOut for the output to the external system is more intuitive to end users because in most cases the metric of data flow output is more essential. I agree with you that a new metric is required, but considering compatibility and users’ intuition I prefer to keep the initial definition of numRecordsOut in FLIP-33 and name a new metric for sink writer’s output to downstream operators. This might be against consistency with metrics in other operators in Flink but maybe it’s acceptable to have the sink as a special case.

Best,
Qingsheng
On Oct 10, 2022, 19:13 +0800, Chesnay Schepler <ch...@apache.org>, wrote:
> > I’m with Xintong’s idea to treat numXXXSend as an alias of numXXXOut
>
> But that's not possible. If it were that simple there would have never been a need to introduce another metric in the first place.
>
> It's a rather fundamental issue with how the new sinks work, in that they emit data to the external system (usually considered as "numRecordsOut" of sinks) while _also_ sending data to a downstream operator (usually considered as "numRecordsOut" of tasks).
> The original issue was that the numRecordsOut of the sink counted both (which is completely wrong).
>
> A new metric was always required; otherwise you inevitably end up breaking some semantic.
> Adding a new metric for what the sink writes to the external system is, for better or worse, more consistent with how these metrics usually work in Flink.
>
> On 10/10/2022 12:45, Qingsheng Ren wrote:
> > Thanks everyone for joining the discussion!
> >
> > > Do you have any idea what has happened in the process here?
> >
> > The discussion in this PR [1] shows some details and could be helpful to understand the original motivation of the renaming. We do have a test case for guarding metrics but unfortunaly the case was also modified so the defense was broken.
> >
> > I think the reason why both the developer and the reviewer forgot to trigger an discussion and gave a green pass on the change is that metrics are quite “trivial” to be noticed as public APIs. As mentioned by Martijn I couldn’t find a place noting that metrics are public APIs and should be treated carefully while contributing and reviewing.
> >
> > IMHO three actions could be made to prevent this kind of changes in the future:
> >
> > a. Add test case for metrics (which we already have in SinkMetricsITCase)
> > b. We emphasize that any public-interface breaking changes should be proposed by a FLIP or discussed in mailing list, and should be listed in the release note.
> > c. We remind contributors and reviewers about what should be considered as public API, and include metric names in it.
> >
> > For b and c these two pages [2][3] might be proper places.
> >
> > About the patch to revert this, it looks like we have a consensus on 1.16. As of 1.15 I think it’s worthy to trigger a minor version. I didn’t see complaints about this for now so it should be OK to save the situation asap. I’m with Xintong’s idea to treat numXXXSend as an alias of numXXXOut considering there could possibly some users have already adapted their system to the new naming, and have another internal metric for reflecting number of outgoing committable batches (actually the numRecordsIn of sink committer operator should be carrying this info already).
> >
> > [1] https://github.com/apache/flink/pull/18825
> > [2] https://flink.apache.org/contributing/contribute-code.html
> > [3] https://flink.apache.org/contributing/reviewing-prs.html
> >
> > Best,
> > Qingsheng
> > On Oct 10, 2022, 17:40 +0800, Xintong Song <to...@gmail.com>, wrote:
> > > +1 for reverting these changes in Flink 1.16.
> > >
> > > For 1.15.3, can we make these metrics available via both names (numXXXOut and numXXXSend)? In this way we don't break it for those who already migrated to 1.15 and numXXXSend. That means we still need to change SinkWriterOperator to use another metric name in 1.15.3, which IIUC is internal to Flink sink.
> > >
> > > I'm overall +1 to change numXXXOut back to its original semantics. AFAIK (from meetup / flink-forward questionaires), most users do not migrate to a new Flink release immediately, until the next 1-2 major releases are out.
> > >
> > > Best,
> > > Xintong
> > >
> > >
> > > > On Mon, Oct 10, 2022 at 5:26 PM Martijn Visser <ma...@apache.org> wrote:
> > > > > Hi Qingsheng,
> > > > >
> > > > > Do you have any idea what has happened in the process here? Do we know why
> > > > > they were changed? I was under the impression that these metric names were
> > > > > newly introduced due to the new interfaces and because it still depends on
> > > > > each connector implementing these.
> > > > >
> > > > > Sidenote: metric names are not mentioned in the FLIP process as a public
> > > > > API. Might make sense to have a separate follow-up to add that to the list
> > > > > (I do think we should list them there).
> > > > >
> > > > > +1 for reverting this and make this change in Flink 1.16
> > > > >
> > > > > I'm not in favour of releasing a Flink 1.15.3 with this change: I think the
> > > > > impact is too big for a patch version, especially given how long Flink 1.15
> > > > > is already out there.
> > > > >
> > > > > Best regards,
> > > > >
> > > > > Martijn
> > > > >
> > > > > On Mon, Oct 10, 2022 at 11:13 AM Leonard Xu <xb...@gmail.com> wrote:
> > > > >
> > > > > > Thanks Qingsheng for starting this thread.
> > > > > >
> > > > > > +1 on reverting sink metric name and releasing 1.15.3 to fix this
> > > > > > inconsistent behavior.
> > > > > >
> > > > > >
> > > > > > Best,
> > > > > > Leonard
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > 2022年10月10日 下午3:06,Jark Wu <im...@gmail.com> 写道:
> > > > > >
> > > > > > Thanks for discovering this problem, Qingsheng!
> > > > > >
> > > > > > I'm also +1 for reverting the breaking changes.
> > > > > >
> > > > > > IIUC, currently, the behavior of "numXXXOut" metrics of the new and old
> > > > > > sink is inconsistent.
> > > > > > We have to break one of them to have consistent behavior. Sink V2 is an
> > > > > > evolving API which is just introduced in 1.15.
> > > > > > I think it makes sense to break the unstable API instead of the stable API
> > > > > > which many connectors and users depend on.
> > > > > >
> > > > > > Best,
> > > > > > Jark
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Mon, 10 Oct 2022 at 11:36, Jingsong Li <ji...@gmail.com> wrote:
> > > > > >
> > > > > >> Thanks for driving, Qingsheng.
> > > > > >>
> > > > > >> +1 for reverting sink metric name.
> > > > > >>
> > > > > >> We often forget that metric is also one of the important APIs.
> > > > > >>
> > > > > >> +1 for releasing 1.15.3 to fix this.
> > > > > >>
> > > > > >> Best,
> > > > > >> Jingsong
> > > > > >>
> > > > > >> On Sun, Oct 9, 2022 at 11:35 PM Becket Qin <be...@gmail.com> wrote:
> > > > > >> >
> > > > > >> > Thanks for raising the discussion, Qingsheng,
> > > > > >> >
> > > > > >> > +1 on reverting the breaking changes.
> > > > > >> >
> > > > > >> > In addition, we might want to release a 1.15.3 to fix this and update
> > > > > >> the previous release docs with this known issue, so that users can upgrade
> > > > > >> to 1.15.3 when they hit it. It would also be good to add some backwards
> > > > > >> compatibility tests on metrics to avoid unintended breaking changes like
> > > > > >> this in the future.
> > > > > >> >
> > > > > >> > Thanks,
> > > > > >> >
> > > > > >> > Jiangjie (Becket) Qin
> > > > > >> >
> > > > > >> > On Sun, Oct 9, 2022 at 10:35 AM Qingsheng Ren <re...@apache.org> wrote:
> > > > > >> >>
> > > > > >> >> Hi devs and users,
> > > > > >> >>
> > > > > >> >> I’d like to start a discussion about reverting a breaking change about
> > > > > >> sink metrics made in 1.15 by FLINK-26126 [1] and FLINK-26492 [2].
> > > > > >> >>
> > > > > >> >> TL;DR
> > > > > >> >>
> > > > > >> >> All sink metrics with name “numXXXOut” defined in FLIP-33 are replace
> > > > > >> by “numXXXSend” in FLINK-26126 and FLINK-26492. Considering metric names
> > > > > >> are public APIs, this is a breaking change to end users and not backward
> > > > > >> compatible. Also unfortunately this breaking change was not discussed in
> > > > > >> the mailing list before.
> > > > > >> >>
> > > > > >> >> Background
> > > > > >> >>
> > > > > >> >> As defined previously in FLIP-33 (the FLIP page has been changed so
> > > > > >> please refer to the old version [3] ), metric “numRecordsOut” is used for
> > > > > >> reporting the total number of output records since the sink started (number
> > > > > >> of records written to the external system), and similarly for
> > > > > >> “numRecordsOutPerSecond”, “numBytesOut”, “numBytesOutPerSecond” and
> > > > > >> “numRecordsOutError”. Most sinks are following this naming and definition.
> > > > > >> However, these metrics are ambiguous in the new Sink API as “numXXXOut”
> > > > > >> could be used by the output of SinkWriterOperator for reporting number of
> > > > > >> Committables delivered to SinkCommitterOperator. In order to resolve the
> > > > > >> conflict, FLINK-26126 and FLINK-26492 changed names of these metrics with
> > > > > >> “numXXXSend”.
> > > > > >> >>
> > > > > >> >> Necessity of reverting this change
> > > > > >> >>
> > > > > >> >> - Metric names are actually public API, as end users need to configure
> > > > > >> metric collecting and alerting system with metric names. Users have to
> > > > > >> reset all configurations related to affected metrics.
> > > > > >> >> - This could also affect custom and external sinks not maintained by
> > > > > >> Flink, which might have implemented with numXXXOut metrics.
> > > > > >> >> - The number of records sent to external system is way more important
> > > > > >> than the number of Committables sent to SinkCommitterOperator, as the
> > > > > >> latter one is just an internal implementation of sink. We could have a new
> > > > > >> metric name for the latter one instead.
> > > > > >> >> - We could avoid splitting the project by version (like “plz use
> > > > > >> numXXXOut before 1.15 and use numXXXSend after”) if we revert it ASAP,
> > > > > >> cosidering 1.16 is still not released for now.
> > > > > >> >>
> > > > > >> >> As a consequence, I’d like to hear from devs and users about your
> > > > > >> opinion on changing these metrics back to “numXXXOut”.
> > > > > >> >>
> > > > > >> >> Looking forward to your reply!
> > > > > >> >>
> > > > > >> >> [1] https://issues.apache.org/jira/browse/FLINK-26126
> > > > > >> >> [2] https://issues.apache.org/jira/browse/FLINK-26492
> > > > > >> >> [1] FLIP-33, version 18:
> > > > > >> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=211883136
> > > > > >> >>
> > > > > >> >> Best,
> > > > > >> >> Qingsheng
> > > > > >>
> > > > > >
> > > > > >
>

Re: [DISCUSS] Reverting sink metric name changes made in 1.15

Posted by Xingbo Huang <hx...@apache.org>.
+1 for reverting these changes in Flink 1.16, so I will cancel 1.16.0-rc1.
+1 for `numXXXSend` as the alias of `numXXXOut` in 1.15.3.

Best,
Xingbo

Chesnay Schepler <ch...@apache.org> 于2022年10月10日周一 19:13写道:

> > I’m with Xintong’s idea to treat numXXXSend as an alias of numXXXOut
>
> But that's not possible. If it were that simple there would have never
> been a need to introduce another metric in the first place.
>
> It's a rather fundamental issue with how the new sinks work, in that they
> emit data to the external system (usually considered as "numRecordsOut" of
> sinks) while _also_ sending data to a downstream operator (usually
> considered as "numRecordsOut" of tasks).
> The original issue was that the numRecordsOut of the sink counted both
> (which is completely wrong).
>
> A new metric was always required; otherwise you inevitably end up breaking
> *some* semantic.
> Adding a new metric for what the sink writes to the external system is,
> for better or worse, more consistent with how these metrics usually work in
> Flink.
>
> On 10/10/2022 12:45, Qingsheng Ren wrote:
>
> Thanks everyone for joining the discussion!
>
> > Do you have any idea what has happened in the process here?
>
> The discussion in this PR [1] shows some details and could be helpful to
> understand the original motivation of the renaming. We do have a test case
> for guarding metrics but unfortunaly the case was also modified so the
> defense was broken.
>
> I think the reason why both the developer and the reviewer forgot to
> trigger an discussion and gave a green pass on the change is that metrics
> are quite “trivial” to be noticed as public APIs. As mentioned by Martijn I
> couldn’t find a place noting that metrics are public APIs and should be
> treated carefully while contributing and reviewing.
>
> IMHO three actions could be made to prevent this kind of changes in the
> future:
>
> a. Add test case for metrics (which we already have in SinkMetricsITCase)
> b. We emphasize that any public-interface breaking changes should be
> proposed by a FLIP or discussed in mailing list, and should be listed in
> the release note.
> c. We remind contributors and reviewers about what should be considered as
> public API, and include metric names in it.
>
> For b and c these two pages [2][3] might be proper places.
>
> About the patch to revert this, it looks like we have a consensus on 1.16.
> As of 1.15 I think it’s worthy to trigger a minor version. I didn’t see
> complaints about this for now so it should be OK to save the situation
> asap. I’m with Xintong’s idea to treat numXXXSend as an alias of numXXXOut
> considering there could possibly some users have already adapted their
> system to the new naming, and have another internal metric for reflecting
> number of outgoing committable batches (actually the numRecordsIn of sink
> committer operator should be carrying this info already).
>
> [1] https://github.com/apache/flink/pull/18825
> [2] https://flink.apache.org/contributing/contribute-code.html
> [3] https://flink.apache.org/contributing/reviewing-prs.html
>
> Best,
> Qingsheng
> On Oct 10, 2022, 17:40 +0800, Xintong Song <to...@gmail.com>
> <to...@gmail.com>, wrote:
>
> +1 for reverting these changes in Flink 1.16.
>
> For 1.15.3, can we make these metrics available via both names (numXXXOut
> and numXXXSend)? In this way we don't break it for those who already
> migrated to 1.15 and numXXXSend. That means we still need to change
> SinkWriterOperator to use another metric name in 1.15.3, which IIUC is
> internal to Flink sink.
>
> I'm overall +1 to change numXXXOut back to its original semantics. AFAIK
> (from meetup / flink-forward questionaires), most users do not migrate to a
> new Flink release immediately, until the next 1-2 major releases are out.
>
> Best,
>
> Xintong
>
>
> On Mon, Oct 10, 2022 at 5:26 PM Martijn Visser <ma...@apache.org>
> wrote:
>
>> Hi Qingsheng,
>>
>> Do you have any idea what has happened in the process here? Do we know why
>> they were changed? I was under the impression that these metric names were
>> newly introduced due to the new interfaces and because it still depends on
>> each connector implementing these.
>>
>> Sidenote: metric names are not mentioned in the FLIP process as a public
>> API. Might make sense to have a separate follow-up to add that to the list
>> (I do think we should list them there).
>>
>> +1 for reverting this and make this change in Flink 1.16
>>
>> I'm not in favour of releasing a Flink 1.15.3 with this change: I think
>> the
>> impact is too big for a patch version, especially given how long Flink
>> 1.15
>> is already out there.
>>
>> Best regards,
>>
>> Martijn
>>
>> On Mon, Oct 10, 2022 at 11:13 AM Leonard Xu <xb...@gmail.com> wrote:
>>
>> > Thanks Qingsheng for starting this thread.
>> >
>> > +1 on reverting sink metric name and releasing 1.15.3 to fix this
>> > inconsistent behavior.
>> >
>> >
>> > Best,
>> > Leonard
>> >
>> >
>> >
>> >
>> >
>> > 2022年10月10日 下午3:06,Jark Wu <im...@gmail.com> 写道:
>> >
>> > Thanks for discovering this problem, Qingsheng!
>> >
>> > I'm also +1 for reverting the breaking changes.
>> >
>> > IIUC, currently, the behavior of "numXXXOut" metrics of the new and old
>> > sink is inconsistent.
>> > We have to break one of them to have consistent behavior. Sink V2 is an
>> > evolving API which is just introduced in 1.15.
>> > I think it makes sense to break the unstable API instead of the stable
>> API
>> > which many connectors and users depend on.
>> >
>> > Best,
>> > Jark
>> >
>> >
>> >
>> > On Mon, 10 Oct 2022 at 11:36, Jingsong Li <ji...@gmail.com>
>> wrote:
>> >
>> >> Thanks for driving, Qingsheng.
>> >>
>> >> +1 for reverting sink metric name.
>> >>
>> >> We often forget that metric is also one of the important APIs.
>> >>
>> >> +1 for releasing 1.15.3 to fix this.
>> >>
>> >> Best,
>> >> Jingsong
>> >>
>> >> On Sun, Oct 9, 2022 at 11:35 PM Becket Qin <be...@gmail.com>
>> wrote:
>> >> >
>> >> > Thanks for raising the discussion, Qingsheng,
>> >> >
>> >> > +1 on reverting the breaking changes.
>> >> >
>> >> > In addition, we might want to release a 1.15.3 to fix this and update
>> >> the previous release docs with this known issue, so that users can
>> upgrade
>> >> to 1.15.3 when they hit it. It would also be good to add some backwards
>> >> compatibility tests on metrics to avoid unintended breaking changes
>> like
>> >> this in the future.
>> >> >
>> >> > Thanks,
>> >> >
>> >> > Jiangjie (Becket) Qin
>> >> >
>> >> > On Sun, Oct 9, 2022 at 10:35 AM Qingsheng Ren <re...@apache.org>
>> wrote:
>> >> >>
>> >> >> Hi devs and users,
>> >> >>
>> >> >> I’d like to start a discussion about reverting a breaking change
>> about
>> >> sink metrics made in 1.15 by FLINK-26126 [1] and FLINK-26492 [2].
>> >> >>
>> >> >> TL;DR
>> >> >>
>> >> >> All sink metrics with name “numXXXOut” defined in FLIP-33 are
>> replace
>> >> by “numXXXSend” in FLINK-26126 and FLINK-26492. Considering metric
>> names
>> >> are public APIs, this is a breaking change to end users and not
>> backward
>> >> compatible. Also unfortunately this breaking change was not discussed
>> in
>> >> the mailing list before.
>> >> >>
>> >> >> Background
>> >> >>
>> >> >> As defined previously in FLIP-33 (the FLIP page has been changed so
>> >> please refer to the old version [3] ), metric “numRecordsOut” is used
>> for
>> >> reporting the total number of output records since the sink started
>> (number
>> >> of records written to the external system), and similarly for
>> >> “numRecordsOutPerSecond”, “numBytesOut”, “numBytesOutPerSecond” and
>> >> “numRecordsOutError”. Most sinks are following this naming and
>> definition.
>> >> However, these metrics are ambiguous in the new Sink API as “numXXXOut”
>> >> could be used by the output of SinkWriterOperator for reporting number
>> of
>> >> Committables delivered to SinkCommitterOperator. In order to resolve
>> the
>> >> conflict, FLINK-26126 and FLINK-26492 changed names of these metrics
>> with
>> >> “numXXXSend”.
>> >> >>
>> >> >> Necessity of reverting this change
>> >> >>
>> >> >> - Metric names are actually public API, as end users need to
>> configure
>> >> metric collecting and alerting system with metric names. Users have to
>> >> reset all configurations related to affected metrics.
>> >> >> - This could also affect custom and external sinks not maintained by
>> >> Flink, which might have implemented with numXXXOut metrics.
>> >> >> - The number of records sent to external system is way more
>> important
>> >> than the number of Committables sent to SinkCommitterOperator, as the
>> >> latter one is just an internal implementation of sink. We could have a
>> new
>> >> metric name for the latter one instead.
>> >> >> - We could avoid splitting the project by version (like “plz use
>> >> numXXXOut before 1.15 and use numXXXSend after”) if we revert it ASAP,
>> >> cosidering 1.16 is still not released for now.
>> >> >>
>> >> >> As a consequence, I’d like to hear from devs and users about your
>> >> opinion on changing these metrics back to “numXXXOut”.
>> >> >>
>> >> >> Looking forward to your reply!
>> >> >>
>> >> >> [1] https://issues.apache.org/jira/browse/FLINK-26126
>> >> >> [2] https://issues.apache.org/jira/browse/FLINK-26492
>> >> >> [1] FLIP-33, version 18:
>> >>
>> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=211883136
>> >> >>
>> >> >> Best,
>> >> >> Qingsheng
>> >>
>> >
>> >
>>
>
>

Re: [DISCUSS] Reverting sink metric name changes made in 1.15

Posted by Qingsheng Ren <re...@apache.org>.
Thanks for the details Chesnay!

By “alias” I mean to respect the original definition made in FLIP-33 for numRecordsOut, which is the number of records written to the external system, and keep numRecordsSend as the same value as numRecordsOut for compatibility.

I think keeping numRecordsOut for the output to the external system is more intuitive to end users because in most cases the metric of data flow output is more essential. I agree with you that a new metric is required, but considering compatibility and users’ intuition I prefer to keep the initial definition of numRecordsOut in FLIP-33 and name a new metric for sink writer’s output to downstream operators. This might be against consistency with metrics in other operators in Flink but maybe it’s acceptable to have the sink as a special case.

Best,
Qingsheng
On Oct 10, 2022, 19:13 +0800, Chesnay Schepler <ch...@apache.org>, wrote:
> > I’m with Xintong’s idea to treat numXXXSend as an alias of numXXXOut
>
> But that's not possible. If it were that simple there would have never been a need to introduce another metric in the first place.
>
> It's a rather fundamental issue with how the new sinks work, in that they emit data to the external system (usually considered as "numRecordsOut" of sinks) while _also_ sending data to a downstream operator (usually considered as "numRecordsOut" of tasks).
> The original issue was that the numRecordsOut of the sink counted both (which is completely wrong).
>
> A new metric was always required; otherwise you inevitably end up breaking some semantic.
> Adding a new metric for what the sink writes to the external system is, for better or worse, more consistent with how these metrics usually work in Flink.
>
> On 10/10/2022 12:45, Qingsheng Ren wrote:
> > Thanks everyone for joining the discussion!
> >
> > > Do you have any idea what has happened in the process here?
> >
> > The discussion in this PR [1] shows some details and could be helpful to understand the original motivation of the renaming. We do have a test case for guarding metrics but unfortunaly the case was also modified so the defense was broken.
> >
> > I think the reason why both the developer and the reviewer forgot to trigger an discussion and gave a green pass on the change is that metrics are quite “trivial” to be noticed as public APIs. As mentioned by Martijn I couldn’t find a place noting that metrics are public APIs and should be treated carefully while contributing and reviewing.
> >
> > IMHO three actions could be made to prevent this kind of changes in the future:
> >
> > a. Add test case for metrics (which we already have in SinkMetricsITCase)
> > b. We emphasize that any public-interface breaking changes should be proposed by a FLIP or discussed in mailing list, and should be listed in the release note.
> > c. We remind contributors and reviewers about what should be considered as public API, and include metric names in it.
> >
> > For b and c these two pages [2][3] might be proper places.
> >
> > About the patch to revert this, it looks like we have a consensus on 1.16. As of 1.15 I think it’s worthy to trigger a minor version. I didn’t see complaints about this for now so it should be OK to save the situation asap. I’m with Xintong’s idea to treat numXXXSend as an alias of numXXXOut considering there could possibly some users have already adapted their system to the new naming, and have another internal metric for reflecting number of outgoing committable batches (actually the numRecordsIn of sink committer operator should be carrying this info already).
> >
> > [1] https://github.com/apache/flink/pull/18825
> > [2] https://flink.apache.org/contributing/contribute-code.html
> > [3] https://flink.apache.org/contributing/reviewing-prs.html
> >
> > Best,
> > Qingsheng
> > On Oct 10, 2022, 17:40 +0800, Xintong Song <to...@gmail.com>, wrote:
> > > +1 for reverting these changes in Flink 1.16.
> > >
> > > For 1.15.3, can we make these metrics available via both names (numXXXOut and numXXXSend)? In this way we don't break it for those who already migrated to 1.15 and numXXXSend. That means we still need to change SinkWriterOperator to use another metric name in 1.15.3, which IIUC is internal to Flink sink.
> > >
> > > I'm overall +1 to change numXXXOut back to its original semantics. AFAIK (from meetup / flink-forward questionaires), most users do not migrate to a new Flink release immediately, until the next 1-2 major releases are out.
> > >
> > > Best,
> > > Xintong
> > >
> > >
> > > > On Mon, Oct 10, 2022 at 5:26 PM Martijn Visser <ma...@apache.org> wrote:
> > > > > Hi Qingsheng,
> > > > >
> > > > > Do you have any idea what has happened in the process here? Do we know why
> > > > > they were changed? I was under the impression that these metric names were
> > > > > newly introduced due to the new interfaces and because it still depends on
> > > > > each connector implementing these.
> > > > >
> > > > > Sidenote: metric names are not mentioned in the FLIP process as a public
> > > > > API. Might make sense to have a separate follow-up to add that to the list
> > > > > (I do think we should list them there).
> > > > >
> > > > > +1 for reverting this and make this change in Flink 1.16
> > > > >
> > > > > I'm not in favour of releasing a Flink 1.15.3 with this change: I think the
> > > > > impact is too big for a patch version, especially given how long Flink 1.15
> > > > > is already out there.
> > > > >
> > > > > Best regards,
> > > > >
> > > > > Martijn
> > > > >
> > > > > On Mon, Oct 10, 2022 at 11:13 AM Leonard Xu <xb...@gmail.com> wrote:
> > > > >
> > > > > > Thanks Qingsheng for starting this thread.
> > > > > >
> > > > > > +1 on reverting sink metric name and releasing 1.15.3 to fix this
> > > > > > inconsistent behavior.
> > > > > >
> > > > > >
> > > > > > Best,
> > > > > > Leonard
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > 2022年10月10日 下午3:06,Jark Wu <im...@gmail.com> 写道:
> > > > > >
> > > > > > Thanks for discovering this problem, Qingsheng!
> > > > > >
> > > > > > I'm also +1 for reverting the breaking changes.
> > > > > >
> > > > > > IIUC, currently, the behavior of "numXXXOut" metrics of the new and old
> > > > > > sink is inconsistent.
> > > > > > We have to break one of them to have consistent behavior. Sink V2 is an
> > > > > > evolving API which is just introduced in 1.15.
> > > > > > I think it makes sense to break the unstable API instead of the stable API
> > > > > > which many connectors and users depend on.
> > > > > >
> > > > > > Best,
> > > > > > Jark
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Mon, 10 Oct 2022 at 11:36, Jingsong Li <ji...@gmail.com> wrote:
> > > > > >
> > > > > >> Thanks for driving, Qingsheng.
> > > > > >>
> > > > > >> +1 for reverting sink metric name.
> > > > > >>
> > > > > >> We often forget that metric is also one of the important APIs.
> > > > > >>
> > > > > >> +1 for releasing 1.15.3 to fix this.
> > > > > >>
> > > > > >> Best,
> > > > > >> Jingsong
> > > > > >>
> > > > > >> On Sun, Oct 9, 2022 at 11:35 PM Becket Qin <be...@gmail.com> wrote:
> > > > > >> >
> > > > > >> > Thanks for raising the discussion, Qingsheng,
> > > > > >> >
> > > > > >> > +1 on reverting the breaking changes.
> > > > > >> >
> > > > > >> > In addition, we might want to release a 1.15.3 to fix this and update
> > > > > >> the previous release docs with this known issue, so that users can upgrade
> > > > > >> to 1.15.3 when they hit it. It would also be good to add some backwards
> > > > > >> compatibility tests on metrics to avoid unintended breaking changes like
> > > > > >> this in the future.
> > > > > >> >
> > > > > >> > Thanks,
> > > > > >> >
> > > > > >> > Jiangjie (Becket) Qin
> > > > > >> >
> > > > > >> > On Sun, Oct 9, 2022 at 10:35 AM Qingsheng Ren <re...@apache.org> wrote:
> > > > > >> >>
> > > > > >> >> Hi devs and users,
> > > > > >> >>
> > > > > >> >> I’d like to start a discussion about reverting a breaking change about
> > > > > >> sink metrics made in 1.15 by FLINK-26126 [1] and FLINK-26492 [2].
> > > > > >> >>
> > > > > >> >> TL;DR
> > > > > >> >>
> > > > > >> >> All sink metrics with name “numXXXOut” defined in FLIP-33 are replace
> > > > > >> by “numXXXSend” in FLINK-26126 and FLINK-26492. Considering metric names
> > > > > >> are public APIs, this is a breaking change to end users and not backward
> > > > > >> compatible. Also unfortunately this breaking change was not discussed in
> > > > > >> the mailing list before.
> > > > > >> >>
> > > > > >> >> Background
> > > > > >> >>
> > > > > >> >> As defined previously in FLIP-33 (the FLIP page has been changed so
> > > > > >> please refer to the old version [3] ), metric “numRecordsOut” is used for
> > > > > >> reporting the total number of output records since the sink started (number
> > > > > >> of records written to the external system), and similarly for
> > > > > >> “numRecordsOutPerSecond”, “numBytesOut”, “numBytesOutPerSecond” and
> > > > > >> “numRecordsOutError”. Most sinks are following this naming and definition.
> > > > > >> However, these metrics are ambiguous in the new Sink API as “numXXXOut”
> > > > > >> could be used by the output of SinkWriterOperator for reporting number of
> > > > > >> Committables delivered to SinkCommitterOperator. In order to resolve the
> > > > > >> conflict, FLINK-26126 and FLINK-26492 changed names of these metrics with
> > > > > >> “numXXXSend”.
> > > > > >> >>
> > > > > >> >> Necessity of reverting this change
> > > > > >> >>
> > > > > >> >> - Metric names are actually public API, as end users need to configure
> > > > > >> metric collecting and alerting system with metric names. Users have to
> > > > > >> reset all configurations related to affected metrics.
> > > > > >> >> - This could also affect custom and external sinks not maintained by
> > > > > >> Flink, which might have implemented with numXXXOut metrics.
> > > > > >> >> - The number of records sent to external system is way more important
> > > > > >> than the number of Committables sent to SinkCommitterOperator, as the
> > > > > >> latter one is just an internal implementation of sink. We could have a new
> > > > > >> metric name for the latter one instead.
> > > > > >> >> - We could avoid splitting the project by version (like “plz use
> > > > > >> numXXXOut before 1.15 and use numXXXSend after”) if we revert it ASAP,
> > > > > >> cosidering 1.16 is still not released for now.
> > > > > >> >>
> > > > > >> >> As a consequence, I’d like to hear from devs and users about your
> > > > > >> opinion on changing these metrics back to “numXXXOut”.
> > > > > >> >>
> > > > > >> >> Looking forward to your reply!
> > > > > >> >>
> > > > > >> >> [1] https://issues.apache.org/jira/browse/FLINK-26126
> > > > > >> >> [2] https://issues.apache.org/jira/browse/FLINK-26492
> > > > > >> >> [1] FLIP-33, version 18:
> > > > > >> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=211883136
> > > > > >> >>
> > > > > >> >> Best,
> > > > > >> >> Qingsheng
> > > > > >>
> > > > > >
> > > > > >
>

Re: [DISCUSS] Reverting sink metric name changes made in 1.15

Posted by Chesnay Schepler <ch...@apache.org>.
 > I’m with Xintong’s idea to treat numXXXSend as an alias of numXXXOut

But that's not possible. If it were that simple there would have never 
been a need to introduce another metric in the first place.

It's a rather fundamental issue with how the new sinks work, in that 
they emit data to the external system (usually considered as 
"numRecordsOut" of sinks) while _also_ sending data to a downstream 
operator (usually considered as "numRecordsOut" of tasks).
The original issue was that the numRecordsOut of the sink counted both 
(which is completely wrong).

A new metric was always required; otherwise you inevitably end up 
breaking /some/ semantic.
Adding a new metric for what the sink writes to the external system is, 
for better or worse, more consistent with how these metrics usually work 
in Flink.

On 10/10/2022 12:45, Qingsheng Ren wrote:
> Thanks everyone for joining the discussion!
>
> > Do you have any idea what has happened in the process here?
>
> The discussion in this PR [1] shows some details and could be helpful 
> to understand the original motivation of the renaming. We do have a 
> test case for guarding metrics but unfortunaly the case was also 
> modified so the defense was broken.
>
> I think the reason why both the developer and the reviewer forgot to 
> trigger an discussion and gave a green pass on the change is that 
> metrics are quite “trivial” to be noticed as public APIs. As mentioned 
> by Martijn I couldn’t find a place noting that metrics are public APIs 
> and should be treated carefully while contributing and reviewing.
>
> IMHO three actions could be made to prevent this kind of changes in 
> the future:
>
> a. Add test case for metrics (which we already have in SinkMetricsITCase)
> b. We emphasize that any public-interface breaking changes should be 
> proposed by a FLIP or discussed in mailing list, and should be listed 
> in the release note.
> c. We remind contributors and reviewers about what should be 
> considered as public API, and include metric names in it.
>
> For b and c these two pages [2][3] might be proper places.
>
> About the patch to revert this, it looks like we have a consensus on 
> 1.16. As of 1.15 I think it’s worthy to trigger a minor version. I 
> didn’t see complaints about this for now so it should be OK to save 
> the situation asap. I’m with Xintong’s idea to treat numXXXSend as an 
> alias of numXXXOut considering there could possibly some users have 
> already adapted their system to the new naming, and have another 
> internal metric for reflecting number of outgoing committable batches 
> (actually the numRecordsIn of sink committer operator should be 
> carrying this info already).
>
> [1] https://github.com/apache/flink/pull/18825
> [2] https://flink.apache.org/contributing/contribute-code.html
> [3] https://flink.apache.org/contributing/reviewing-prs.html
>
> Best,
> Qingsheng
> On Oct 10, 2022, 17:40 +0800, Xintong Song <to...@gmail.com>, wrote:
>> +1 for reverting these changes in Flink 1.16.
>>
>> For 1.15.3, can we make these metrics available via both names 
>> (numXXXOut and numXXXSend)? In this way we don't break it for those 
>> who already migrated to 1.15 and numXXXSend. That means we still need 
>> to change SinkWriterOperator to use another metric name in 1.15.3, 
>> which IIUC is internal to Flink sink.
>>
>> I'm overall +1 to change numXXXOut back to its original semantics. 
>> AFAIK (from meetup / flink-forward questionaires), most users do not 
>> migrate to a new Flink release immediately, until the next 1-2 major 
>> releases are out.
>>
>> Best,
>>
>> Xintong
>>
>>
>>
>> On Mon, Oct 10, 2022 at 5:26 PM Martijn Visser 
>> <ma...@apache.org> wrote:
>>
>>     Hi Qingsheng,
>>
>>     Do you have any idea what has happened in the process here? Do we
>>     know why
>>     they were changed? I was under the impression that these metric
>>     names were
>>     newly introduced due to the new interfaces and because it still
>>     depends on
>>     each connector implementing these.
>>
>>     Sidenote: metric names are not mentioned in the FLIP process as a
>>     public
>>     API. Might make sense to have a separate follow-up to add that to
>>     the list
>>     (I do think we should list them there).
>>
>>     +1 for reverting this and make this change in Flink 1.16
>>
>>     I'm not in favour of releasing a Flink 1.15.3 with this change: I
>>     think the
>>     impact is too big for a patch version, especially given how long
>>     Flink 1.15
>>     is already out there.
>>
>>     Best regards,
>>
>>     Martijn
>>
>>     On Mon, Oct 10, 2022 at 11:13 AM Leonard Xu <xb...@gmail.com>
>>     wrote:
>>
>>     > Thanks Qingsheng for starting this thread.
>>     >
>>     > +1 on reverting sink metric name and releasing 1.15.3 to fix this
>>     > inconsistent behavior.
>>     >
>>     >
>>     > Best,
>>     > Leonard
>>     >
>>     >
>>     >
>>     >
>>     >
>>     > 2022年10月10日 下午3:06,Jark Wu <im...@gmail.com> 写道:
>>     >
>>     > Thanks for discovering this problem, Qingsheng!
>>     >
>>     > I'm also +1 for reverting the breaking changes.
>>     >
>>     > IIUC, currently, the behavior of "numXXXOut" metrics of the new
>>     and old
>>     > sink is inconsistent.
>>     > We have to break one of them to have consistent behavior. Sink
>>     V2 is an
>>     > evolving API which is just introduced in 1.15.
>>     > I think it makes sense to break the unstable API instead of the
>>     stable API
>>     > which many connectors and users depend on.
>>     >
>>     > Best,
>>     > Jark
>>     >
>>     >
>>     >
>>     > On Mon, 10 Oct 2022 at 11:36, Jingsong Li
>>     <ji...@gmail.com> wrote:
>>     >
>>     >> Thanks for driving, Qingsheng.
>>     >>
>>     >> +1 for reverting sink metric name.
>>     >>
>>     >> We often forget that metric is also one of the important APIs.
>>     >>
>>     >> +1 for releasing 1.15.3 to fix this.
>>     >>
>>     >> Best,
>>     >> Jingsong
>>     >>
>>     >> On Sun, Oct 9, 2022 at 11:35 PM Becket Qin
>>     <be...@gmail.com> wrote:
>>     >> >
>>     >> > Thanks for raising the discussion, Qingsheng,
>>     >> >
>>     >> > +1 on reverting the breaking changes.
>>     >> >
>>     >> > In addition, we might want to release a 1.15.3 to fix this
>>     and update
>>     >> the previous release docs with this known issue, so that users
>>     can upgrade
>>     >> to 1.15.3 when they hit it. It would also be good to add some
>>     backwards
>>     >> compatibility tests on metrics to avoid unintended breaking
>>     changes like
>>     >> this in the future.
>>     >> >
>>     >> > Thanks,
>>     >> >
>>     >> > Jiangjie (Becket) Qin
>>     >> >
>>     >> > On Sun, Oct 9, 2022 at 10:35 AM Qingsheng Ren
>>     <re...@apache.org> wrote:
>>     >> >>
>>     >> >> Hi devs and users,
>>     >> >>
>>     >> >> I’d like to start a discussion about reverting a breaking
>>     change about
>>     >> sink metrics made in 1.15 by FLINK-26126 [1] and FLINK-26492 [2].
>>     >> >>
>>     >> >> TL;DR
>>     >> >>
>>     >> >> All sink metrics with name “numXXXOut” defined in FLIP-33
>>     are replace
>>     >> by “numXXXSend” in FLINK-26126 and FLINK-26492. Considering
>>     metric names
>>     >> are public APIs, this is a breaking change to end users and
>>     not backward
>>     >> compatible. Also unfortunately this breaking change was not
>>     discussed in
>>     >> the mailing list before.
>>     >> >>
>>     >> >> Background
>>     >> >>
>>     >> >> As defined previously in FLIP-33 (the FLIP page has been
>>     changed so
>>     >> please refer to the old version [3] ), metric “numRecordsOut”
>>     is used for
>>     >> reporting the total number of output records since the sink
>>     started (number
>>     >> of records written to the external system), and similarly for
>>     >> “numRecordsOutPerSecond”, “numBytesOut”,
>>     “numBytesOutPerSecond” and
>>     >> “numRecordsOutError”. Most sinks are following this naming and
>>     definition.
>>     >> However, these metrics are ambiguous in the new Sink API as
>>     “numXXXOut”
>>     >> could be used by the output of SinkWriterOperator for
>>     reporting number of
>>     >> Committables delivered to SinkCommitterOperator. In order to
>>     resolve the
>>     >> conflict, FLINK-26126 and FLINK-26492 changed names of these
>>     metrics with
>>     >> “numXXXSend”.
>>     >> >>
>>     >> >> Necessity of reverting this change
>>     >> >>
>>     >> >> - Metric names are actually public API, as end users need
>>     to configure
>>     >> metric collecting and alerting system with metric names. Users
>>     have to
>>     >> reset all configurations related to affected metrics.
>>     >> >> - This could also affect custom and external sinks not
>>     maintained by
>>     >> Flink, which might have implemented with numXXXOut metrics.
>>     >> >> - The number of records sent to external system is way more
>>     important
>>     >> than the number of Committables sent to SinkCommitterOperator,
>>     as the
>>     >> latter one is just an internal implementation of sink. We
>>     could have a new
>>     >> metric name for the latter one instead.
>>     >> >> - We could avoid splitting the project by version (like
>>     “plz use
>>     >> numXXXOut before 1.15 and use numXXXSend after”) if we revert
>>     it ASAP,
>>     >> cosidering 1.16 is still not released for now.
>>     >> >>
>>     >> >> As a consequence, I’d like to hear from devs and users
>>     about your
>>     >> opinion on changing these metrics back to “numXXXOut”.
>>     >> >>
>>     >> >> Looking forward to your reply!
>>     >> >>
>>     >> >> [1] https://issues.apache.org/jira/browse/FLINK-26126
>>     >> >> [2] https://issues.apache.org/jira/browse/FLINK-26492
>>     >> >> [1] FLIP-33, version 18:
>>     >>
>>     https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=211883136
>>     >> >>
>>     >> >> Best,
>>     >> >> Qingsheng
>>     >>
>>     >
>>     >
>>

Re: [DISCUSS] Reverting sink metric name changes made in 1.15

Posted by Chesnay Schepler <ch...@apache.org>.
 > I’m with Xintong’s idea to treat numXXXSend as an alias of numXXXOut

But that's not possible. If it were that simple there would have never 
been a need to introduce another metric in the first place.

It's a rather fundamental issue with how the new sinks work, in that 
they emit data to the external system (usually considered as 
"numRecordsOut" of sinks) while _also_ sending data to a downstream 
operator (usually considered as "numRecordsOut" of tasks).
The original issue was that the numRecordsOut of the sink counted both 
(which is completely wrong).

A new metric was always required; otherwise you inevitably end up 
breaking /some/ semantic.
Adding a new metric for what the sink writes to the external system is, 
for better or worse, more consistent with how these metrics usually work 
in Flink.

On 10/10/2022 12:45, Qingsheng Ren wrote:
> Thanks everyone for joining the discussion!
>
> > Do you have any idea what has happened in the process here?
>
> The discussion in this PR [1] shows some details and could be helpful 
> to understand the original motivation of the renaming. We do have a 
> test case for guarding metrics but unfortunaly the case was also 
> modified so the defense was broken.
>
> I think the reason why both the developer and the reviewer forgot to 
> trigger an discussion and gave a green pass on the change is that 
> metrics are quite “trivial” to be noticed as public APIs. As mentioned 
> by Martijn I couldn’t find a place noting that metrics are public APIs 
> and should be treated carefully while contributing and reviewing.
>
> IMHO three actions could be made to prevent this kind of changes in 
> the future:
>
> a. Add test case for metrics (which we already have in SinkMetricsITCase)
> b. We emphasize that any public-interface breaking changes should be 
> proposed by a FLIP or discussed in mailing list, and should be listed 
> in the release note.
> c. We remind contributors and reviewers about what should be 
> considered as public API, and include metric names in it.
>
> For b and c these two pages [2][3] might be proper places.
>
> About the patch to revert this, it looks like we have a consensus on 
> 1.16. As of 1.15 I think it’s worthy to trigger a minor version. I 
> didn’t see complaints about this for now so it should be OK to save 
> the situation asap. I’m with Xintong’s idea to treat numXXXSend as an 
> alias of numXXXOut considering there could possibly some users have 
> already adapted their system to the new naming, and have another 
> internal metric for reflecting number of outgoing committable batches 
> (actually the numRecordsIn of sink committer operator should be 
> carrying this info already).
>
> [1] https://github.com/apache/flink/pull/18825
> [2] https://flink.apache.org/contributing/contribute-code.html
> [3] https://flink.apache.org/contributing/reviewing-prs.html
>
> Best,
> Qingsheng
> On Oct 10, 2022, 17:40 +0800, Xintong Song <to...@gmail.com>, wrote:
>> +1 for reverting these changes in Flink 1.16.
>>
>> For 1.15.3, can we make these metrics available via both names 
>> (numXXXOut and numXXXSend)? In this way we don't break it for those 
>> who already migrated to 1.15 and numXXXSend. That means we still need 
>> to change SinkWriterOperator to use another metric name in 1.15.3, 
>> which IIUC is internal to Flink sink.
>>
>> I'm overall +1 to change numXXXOut back to its original semantics. 
>> AFAIK (from meetup / flink-forward questionaires), most users do not 
>> migrate to a new Flink release immediately, until the next 1-2 major 
>> releases are out.
>>
>> Best,
>>
>> Xintong
>>
>>
>>
>> On Mon, Oct 10, 2022 at 5:26 PM Martijn Visser 
>> <ma...@apache.org> wrote:
>>
>>     Hi Qingsheng,
>>
>>     Do you have any idea what has happened in the process here? Do we
>>     know why
>>     they were changed? I was under the impression that these metric
>>     names were
>>     newly introduced due to the new interfaces and because it still
>>     depends on
>>     each connector implementing these.
>>
>>     Sidenote: metric names are not mentioned in the FLIP process as a
>>     public
>>     API. Might make sense to have a separate follow-up to add that to
>>     the list
>>     (I do think we should list them there).
>>
>>     +1 for reverting this and make this change in Flink 1.16
>>
>>     I'm not in favour of releasing a Flink 1.15.3 with this change: I
>>     think the
>>     impact is too big for a patch version, especially given how long
>>     Flink 1.15
>>     is already out there.
>>
>>     Best regards,
>>
>>     Martijn
>>
>>     On Mon, Oct 10, 2022 at 11:13 AM Leonard Xu <xb...@gmail.com>
>>     wrote:
>>
>>     > Thanks Qingsheng for starting this thread.
>>     >
>>     > +1 on reverting sink metric name and releasing 1.15.3 to fix this
>>     > inconsistent behavior.
>>     >
>>     >
>>     > Best,
>>     > Leonard
>>     >
>>     >
>>     >
>>     >
>>     >
>>     > 2022年10月10日 下午3:06,Jark Wu <im...@gmail.com> 写道:
>>     >
>>     > Thanks for discovering this problem, Qingsheng!
>>     >
>>     > I'm also +1 for reverting the breaking changes.
>>     >
>>     > IIUC, currently, the behavior of "numXXXOut" metrics of the new
>>     and old
>>     > sink is inconsistent.
>>     > We have to break one of them to have consistent behavior. Sink
>>     V2 is an
>>     > evolving API which is just introduced in 1.15.
>>     > I think it makes sense to break the unstable API instead of the
>>     stable API
>>     > which many connectors and users depend on.
>>     >
>>     > Best,
>>     > Jark
>>     >
>>     >
>>     >
>>     > On Mon, 10 Oct 2022 at 11:36, Jingsong Li
>>     <ji...@gmail.com> wrote:
>>     >
>>     >> Thanks for driving, Qingsheng.
>>     >>
>>     >> +1 for reverting sink metric name.
>>     >>
>>     >> We often forget that metric is also one of the important APIs.
>>     >>
>>     >> +1 for releasing 1.15.3 to fix this.
>>     >>
>>     >> Best,
>>     >> Jingsong
>>     >>
>>     >> On Sun, Oct 9, 2022 at 11:35 PM Becket Qin
>>     <be...@gmail.com> wrote:
>>     >> >
>>     >> > Thanks for raising the discussion, Qingsheng,
>>     >> >
>>     >> > +1 on reverting the breaking changes.
>>     >> >
>>     >> > In addition, we might want to release a 1.15.3 to fix this
>>     and update
>>     >> the previous release docs with this known issue, so that users
>>     can upgrade
>>     >> to 1.15.3 when they hit it. It would also be good to add some
>>     backwards
>>     >> compatibility tests on metrics to avoid unintended breaking
>>     changes like
>>     >> this in the future.
>>     >> >
>>     >> > Thanks,
>>     >> >
>>     >> > Jiangjie (Becket) Qin
>>     >> >
>>     >> > On Sun, Oct 9, 2022 at 10:35 AM Qingsheng Ren
>>     <re...@apache.org> wrote:
>>     >> >>
>>     >> >> Hi devs and users,
>>     >> >>
>>     >> >> I’d like to start a discussion about reverting a breaking
>>     change about
>>     >> sink metrics made in 1.15 by FLINK-26126 [1] and FLINK-26492 [2].
>>     >> >>
>>     >> >> TL;DR
>>     >> >>
>>     >> >> All sink metrics with name “numXXXOut” defined in FLIP-33
>>     are replace
>>     >> by “numXXXSend” in FLINK-26126 and FLINK-26492. Considering
>>     metric names
>>     >> are public APIs, this is a breaking change to end users and
>>     not backward
>>     >> compatible. Also unfortunately this breaking change was not
>>     discussed in
>>     >> the mailing list before.
>>     >> >>
>>     >> >> Background
>>     >> >>
>>     >> >> As defined previously in FLIP-33 (the FLIP page has been
>>     changed so
>>     >> please refer to the old version [3] ), metric “numRecordsOut”
>>     is used for
>>     >> reporting the total number of output records since the sink
>>     started (number
>>     >> of records written to the external system), and similarly for
>>     >> “numRecordsOutPerSecond”, “numBytesOut”,
>>     “numBytesOutPerSecond” and
>>     >> “numRecordsOutError”. Most sinks are following this naming and
>>     definition.
>>     >> However, these metrics are ambiguous in the new Sink API as
>>     “numXXXOut”
>>     >> could be used by the output of SinkWriterOperator for
>>     reporting number of
>>     >> Committables delivered to SinkCommitterOperator. In order to
>>     resolve the
>>     >> conflict, FLINK-26126 and FLINK-26492 changed names of these
>>     metrics with
>>     >> “numXXXSend”.
>>     >> >>
>>     >> >> Necessity of reverting this change
>>     >> >>
>>     >> >> - Metric names are actually public API, as end users need
>>     to configure
>>     >> metric collecting and alerting system with metric names. Users
>>     have to
>>     >> reset all configurations related to affected metrics.
>>     >> >> - This could also affect custom and external sinks not
>>     maintained by
>>     >> Flink, which might have implemented with numXXXOut metrics.
>>     >> >> - The number of records sent to external system is way more
>>     important
>>     >> than the number of Committables sent to SinkCommitterOperator,
>>     as the
>>     >> latter one is just an internal implementation of sink. We
>>     could have a new
>>     >> metric name for the latter one instead.
>>     >> >> - We could avoid splitting the project by version (like
>>     “plz use
>>     >> numXXXOut before 1.15 and use numXXXSend after”) if we revert
>>     it ASAP,
>>     >> cosidering 1.16 is still not released for now.
>>     >> >>
>>     >> >> As a consequence, I’d like to hear from devs and users
>>     about your
>>     >> opinion on changing these metrics back to “numXXXOut”.
>>     >> >>
>>     >> >> Looking forward to your reply!
>>     >> >>
>>     >> >> [1] https://issues.apache.org/jira/browse/FLINK-26126
>>     >> >> [2] https://issues.apache.org/jira/browse/FLINK-26492
>>     >> >> [1] FLIP-33, version 18:
>>     >>
>>     https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=211883136
>>     >> >>
>>     >> >> Best,
>>     >> >> Qingsheng
>>     >>
>>     >
>>     >
>>

Re: [DISCUSS] Reverting sink metric name changes made in 1.15

Posted by Qingsheng Ren <re...@apache.org>.
Thanks everyone for joining the discussion!

> Do you have any idea what has happened in the process here?

The discussion in this PR [1] shows some details and could be helpful to understand the original motivation of the renaming. We do have a test case for guarding metrics but unfortunaly the case was also modified so the defense was broken.

I think the reason why both the developer and the reviewer forgot to trigger an discussion and gave a green pass on the change is that metrics are quite “trivial” to be noticed as public APIs. As mentioned by Martijn I couldn’t find a place noting that metrics are public APIs and should be treated carefully while contributing and reviewing.

IMHO three actions could be made to prevent this kind of changes in the future:

a. Add test case for metrics (which we already have in SinkMetricsITCase)
b. We emphasize that any public-interface breaking changes should be proposed by a FLIP or discussed in mailing list, and should be listed in the release note.
c. We remind contributors and reviewers about what should be considered as public API, and include metric names in it.

For b and c these two pages [2][3] might be proper places.

About the patch to revert this, it looks like we have a consensus on 1.16. As of 1.15 I think it’s worthy to trigger a minor version. I didn’t see complaints about this for now so it should be OK to save the situation asap. I’m with Xintong’s idea to treat numXXXSend as an alias of numXXXOut considering there could possibly some users have already adapted their system to the new naming, and have another internal metric for reflecting number of outgoing committable batches (actually the numRecordsIn of sink committer operator should be carrying this info already).

[1] https://github.com/apache/flink/pull/18825
[2] https://flink.apache.org/contributing/contribute-code.html
[3] https://flink.apache.org/contributing/reviewing-prs.html

Best,
Qingsheng
On Oct 10, 2022, 17:40 +0800, Xintong Song <to...@gmail.com>, wrote:
> +1 for reverting these changes in Flink 1.16.
>
> For 1.15.3, can we make these metrics available via both names (numXXXOut and numXXXSend)? In this way we don't break it for those who already migrated to 1.15 and numXXXSend. That means we still need to change SinkWriterOperator to use another metric name in 1.15.3, which IIUC is internal to Flink sink.
>
> I'm overall +1 to change numXXXOut back to its original semantics. AFAIK (from meetup / flink-forward questionaires), most users do not migrate to a new Flink release immediately, until the next 1-2 major releases are out.
>
> Best,
> Xintong
>
>
> > On Mon, Oct 10, 2022 at 5:26 PM Martijn Visser <ma...@apache.org> wrote:
> > > Hi Qingsheng,
> > >
> > > Do you have any idea what has happened in the process here? Do we know why
> > > they were changed? I was under the impression that these metric names were
> > > newly introduced due to the new interfaces and because it still depends on
> > > each connector implementing these.
> > >
> > > Sidenote: metric names are not mentioned in the FLIP process as a public
> > > API. Might make sense to have a separate follow-up to add that to the list
> > > (I do think we should list them there).
> > >
> > > +1 for reverting this and make this change in Flink 1.16
> > >
> > > I'm not in favour of releasing a Flink 1.15.3 with this change: I think the
> > > impact is too big for a patch version, especially given how long Flink 1.15
> > > is already out there.
> > >
> > > Best regards,
> > >
> > > Martijn
> > >
> > > On Mon, Oct 10, 2022 at 11:13 AM Leonard Xu <xb...@gmail.com> wrote:
> > >
> > > > Thanks Qingsheng for starting this thread.
> > > >
> > > > +1 on reverting sink metric name and releasing 1.15.3 to fix this
> > > > inconsistent behavior.
> > > >
> > > >
> > > > Best,
> > > > Leonard
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > 2022年10月10日 下午3:06,Jark Wu <im...@gmail.com> 写道:
> > > >
> > > > Thanks for discovering this problem, Qingsheng!
> > > >
> > > > I'm also +1 for reverting the breaking changes.
> > > >
> > > > IIUC, currently, the behavior of "numXXXOut" metrics of the new and old
> > > > sink is inconsistent.
> > > > We have to break one of them to have consistent behavior. Sink V2 is an
> > > > evolving API which is just introduced in 1.15.
> > > > I think it makes sense to break the unstable API instead of the stable API
> > > > which many connectors and users depend on.
> > > >
> > > > Best,
> > > > Jark
> > > >
> > > >
> > > >
> > > > On Mon, 10 Oct 2022 at 11:36, Jingsong Li <ji...@gmail.com> wrote:
> > > >
> > > >> Thanks for driving, Qingsheng.
> > > >>
> > > >> +1 for reverting sink metric name.
> > > >>
> > > >> We often forget that metric is also one of the important APIs.
> > > >>
> > > >> +1 for releasing 1.15.3 to fix this.
> > > >>
> > > >> Best,
> > > >> Jingsong
> > > >>
> > > >> On Sun, Oct 9, 2022 at 11:35 PM Becket Qin <be...@gmail.com> wrote:
> > > >> >
> > > >> > Thanks for raising the discussion, Qingsheng,
> > > >> >
> > > >> > +1 on reverting the breaking changes.
> > > >> >
> > > >> > In addition, we might want to release a 1.15.3 to fix this and update
> > > >> the previous release docs with this known issue, so that users can upgrade
> > > >> to 1.15.3 when they hit it. It would also be good to add some backwards
> > > >> compatibility tests on metrics to avoid unintended breaking changes like
> > > >> this in the future.
> > > >> >
> > > >> > Thanks,
> > > >> >
> > > >> > Jiangjie (Becket) Qin
> > > >> >
> > > >> > On Sun, Oct 9, 2022 at 10:35 AM Qingsheng Ren <re...@apache.org> wrote:
> > > >> >>
> > > >> >> Hi devs and users,
> > > >> >>
> > > >> >> I’d like to start a discussion about reverting a breaking change about
> > > >> sink metrics made in 1.15 by FLINK-26126 [1] and FLINK-26492 [2].
> > > >> >>
> > > >> >> TL;DR
> > > >> >>
> > > >> >> All sink metrics with name “numXXXOut” defined in FLIP-33 are replace
> > > >> by “numXXXSend” in FLINK-26126 and FLINK-26492. Considering metric names
> > > >> are public APIs, this is a breaking change to end users and not backward
> > > >> compatible. Also unfortunately this breaking change was not discussed in
> > > >> the mailing list before.
> > > >> >>
> > > >> >> Background
> > > >> >>
> > > >> >> As defined previously in FLIP-33 (the FLIP page has been changed so
> > > >> please refer to the old version [3] ), metric “numRecordsOut” is used for
> > > >> reporting the total number of output records since the sink started (number
> > > >> of records written to the external system), and similarly for
> > > >> “numRecordsOutPerSecond”, “numBytesOut”, “numBytesOutPerSecond” and
> > > >> “numRecordsOutError”. Most sinks are following this naming and definition.
> > > >> However, these metrics are ambiguous in the new Sink API as “numXXXOut”
> > > >> could be used by the output of SinkWriterOperator for reporting number of
> > > >> Committables delivered to SinkCommitterOperator. In order to resolve the
> > > >> conflict, FLINK-26126 and FLINK-26492 changed names of these metrics with
> > > >> “numXXXSend”.
> > > >> >>
> > > >> >> Necessity of reverting this change
> > > >> >>
> > > >> >> - Metric names are actually public API, as end users need to configure
> > > >> metric collecting and alerting system with metric names. Users have to
> > > >> reset all configurations related to affected metrics.
> > > >> >> - This could also affect custom and external sinks not maintained by
> > > >> Flink, which might have implemented with numXXXOut metrics.
> > > >> >> - The number of records sent to external system is way more important
> > > >> than the number of Committables sent to SinkCommitterOperator, as the
> > > >> latter one is just an internal implementation of sink. We could have a new
> > > >> metric name for the latter one instead.
> > > >> >> - We could avoid splitting the project by version (like “plz use
> > > >> numXXXOut before 1.15 and use numXXXSend after”) if we revert it ASAP,
> > > >> cosidering 1.16 is still not released for now.
> > > >> >>
> > > >> >> As a consequence, I’d like to hear from devs and users about your
> > > >> opinion on changing these metrics back to “numXXXOut”.
> > > >> >>
> > > >> >> Looking forward to your reply!
> > > >> >>
> > > >> >> [1] https://issues.apache.org/jira/browse/FLINK-26126
> > > >> >> [2] https://issues.apache.org/jira/browse/FLINK-26492
> > > >> >> [1] FLIP-33, version 18:
> > > >> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=211883136
> > > >> >>
> > > >> >> Best,
> > > >> >> Qingsheng
> > > >>
> > > >
> > > >

Re: [DISCUSS] Reverting sink metric name changes made in 1.15

Posted by Qingsheng Ren <re...@apache.org>.
Thanks everyone for joining the discussion!

> Do you have any idea what has happened in the process here?

The discussion in this PR [1] shows some details and could be helpful to understand the original motivation of the renaming. We do have a test case for guarding metrics but unfortunaly the case was also modified so the defense was broken.

I think the reason why both the developer and the reviewer forgot to trigger an discussion and gave a green pass on the change is that metrics are quite “trivial” to be noticed as public APIs. As mentioned by Martijn I couldn’t find a place noting that metrics are public APIs and should be treated carefully while contributing and reviewing.

IMHO three actions could be made to prevent this kind of changes in the future:

a. Add test case for metrics (which we already have in SinkMetricsITCase)
b. We emphasize that any public-interface breaking changes should be proposed by a FLIP or discussed in mailing list, and should be listed in the release note.
c. We remind contributors and reviewers about what should be considered as public API, and include metric names in it.

For b and c these two pages [2][3] might be proper places.

About the patch to revert this, it looks like we have a consensus on 1.16. As of 1.15 I think it’s worthy to trigger a minor version. I didn’t see complaints about this for now so it should be OK to save the situation asap. I’m with Xintong’s idea to treat numXXXSend as an alias of numXXXOut considering there could possibly some users have already adapted their system to the new naming, and have another internal metric for reflecting number of outgoing committable batches (actually the numRecordsIn of sink committer operator should be carrying this info already).

[1] https://github.com/apache/flink/pull/18825
[2] https://flink.apache.org/contributing/contribute-code.html
[3] https://flink.apache.org/contributing/reviewing-prs.html

Best,
Qingsheng
On Oct 10, 2022, 17:40 +0800, Xintong Song <to...@gmail.com>, wrote:
> +1 for reverting these changes in Flink 1.16.
>
> For 1.15.3, can we make these metrics available via both names (numXXXOut and numXXXSend)? In this way we don't break it for those who already migrated to 1.15 and numXXXSend. That means we still need to change SinkWriterOperator to use another metric name in 1.15.3, which IIUC is internal to Flink sink.
>
> I'm overall +1 to change numXXXOut back to its original semantics. AFAIK (from meetup / flink-forward questionaires), most users do not migrate to a new Flink release immediately, until the next 1-2 major releases are out.
>
> Best,
> Xintong
>
>
> > On Mon, Oct 10, 2022 at 5:26 PM Martijn Visser <ma...@apache.org> wrote:
> > > Hi Qingsheng,
> > >
> > > Do you have any idea what has happened in the process here? Do we know why
> > > they were changed? I was under the impression that these metric names were
> > > newly introduced due to the new interfaces and because it still depends on
> > > each connector implementing these.
> > >
> > > Sidenote: metric names are not mentioned in the FLIP process as a public
> > > API. Might make sense to have a separate follow-up to add that to the list
> > > (I do think we should list them there).
> > >
> > > +1 for reverting this and make this change in Flink 1.16
> > >
> > > I'm not in favour of releasing a Flink 1.15.3 with this change: I think the
> > > impact is too big for a patch version, especially given how long Flink 1.15
> > > is already out there.
> > >
> > > Best regards,
> > >
> > > Martijn
> > >
> > > On Mon, Oct 10, 2022 at 11:13 AM Leonard Xu <xb...@gmail.com> wrote:
> > >
> > > > Thanks Qingsheng for starting this thread.
> > > >
> > > > +1 on reverting sink metric name and releasing 1.15.3 to fix this
> > > > inconsistent behavior.
> > > >
> > > >
> > > > Best,
> > > > Leonard
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > 2022年10月10日 下午3:06,Jark Wu <im...@gmail.com> 写道:
> > > >
> > > > Thanks for discovering this problem, Qingsheng!
> > > >
> > > > I'm also +1 for reverting the breaking changes.
> > > >
> > > > IIUC, currently, the behavior of "numXXXOut" metrics of the new and old
> > > > sink is inconsistent.
> > > > We have to break one of them to have consistent behavior. Sink V2 is an
> > > > evolving API which is just introduced in 1.15.
> > > > I think it makes sense to break the unstable API instead of the stable API
> > > > which many connectors and users depend on.
> > > >
> > > > Best,
> > > > Jark
> > > >
> > > >
> > > >
> > > > On Mon, 10 Oct 2022 at 11:36, Jingsong Li <ji...@gmail.com> wrote:
> > > >
> > > >> Thanks for driving, Qingsheng.
> > > >>
> > > >> +1 for reverting sink metric name.
> > > >>
> > > >> We often forget that metric is also one of the important APIs.
> > > >>
> > > >> +1 for releasing 1.15.3 to fix this.
> > > >>
> > > >> Best,
> > > >> Jingsong
> > > >>
> > > >> On Sun, Oct 9, 2022 at 11:35 PM Becket Qin <be...@gmail.com> wrote:
> > > >> >
> > > >> > Thanks for raising the discussion, Qingsheng,
> > > >> >
> > > >> > +1 on reverting the breaking changes.
> > > >> >
> > > >> > In addition, we might want to release a 1.15.3 to fix this and update
> > > >> the previous release docs with this known issue, so that users can upgrade
> > > >> to 1.15.3 when they hit it. It would also be good to add some backwards
> > > >> compatibility tests on metrics to avoid unintended breaking changes like
> > > >> this in the future.
> > > >> >
> > > >> > Thanks,
> > > >> >
> > > >> > Jiangjie (Becket) Qin
> > > >> >
> > > >> > On Sun, Oct 9, 2022 at 10:35 AM Qingsheng Ren <re...@apache.org> wrote:
> > > >> >>
> > > >> >> Hi devs and users,
> > > >> >>
> > > >> >> I’d like to start a discussion about reverting a breaking change about
> > > >> sink metrics made in 1.15 by FLINK-26126 [1] and FLINK-26492 [2].
> > > >> >>
> > > >> >> TL;DR
> > > >> >>
> > > >> >> All sink metrics with name “numXXXOut” defined in FLIP-33 are replace
> > > >> by “numXXXSend” in FLINK-26126 and FLINK-26492. Considering metric names
> > > >> are public APIs, this is a breaking change to end users and not backward
> > > >> compatible. Also unfortunately this breaking change was not discussed in
> > > >> the mailing list before.
> > > >> >>
> > > >> >> Background
> > > >> >>
> > > >> >> As defined previously in FLIP-33 (the FLIP page has been changed so
> > > >> please refer to the old version [3] ), metric “numRecordsOut” is used for
> > > >> reporting the total number of output records since the sink started (number
> > > >> of records written to the external system), and similarly for
> > > >> “numRecordsOutPerSecond”, “numBytesOut”, “numBytesOutPerSecond” and
> > > >> “numRecordsOutError”. Most sinks are following this naming and definition.
> > > >> However, these metrics are ambiguous in the new Sink API as “numXXXOut”
> > > >> could be used by the output of SinkWriterOperator for reporting number of
> > > >> Committables delivered to SinkCommitterOperator. In order to resolve the
> > > >> conflict, FLINK-26126 and FLINK-26492 changed names of these metrics with
> > > >> “numXXXSend”.
> > > >> >>
> > > >> >> Necessity of reverting this change
> > > >> >>
> > > >> >> - Metric names are actually public API, as end users need to configure
> > > >> metric collecting and alerting system with metric names. Users have to
> > > >> reset all configurations related to affected metrics.
> > > >> >> - This could also affect custom and external sinks not maintained by
> > > >> Flink, which might have implemented with numXXXOut metrics.
> > > >> >> - The number of records sent to external system is way more important
> > > >> than the number of Committables sent to SinkCommitterOperator, as the
> > > >> latter one is just an internal implementation of sink. We could have a new
> > > >> metric name for the latter one instead.
> > > >> >> - We could avoid splitting the project by version (like “plz use
> > > >> numXXXOut before 1.15 and use numXXXSend after”) if we revert it ASAP,
> > > >> cosidering 1.16 is still not released for now.
> > > >> >>
> > > >> >> As a consequence, I’d like to hear from devs and users about your
> > > >> opinion on changing these metrics back to “numXXXOut”.
> > > >> >>
> > > >> >> Looking forward to your reply!
> > > >> >>
> > > >> >> [1] https://issues.apache.org/jira/browse/FLINK-26126
> > > >> >> [2] https://issues.apache.org/jira/browse/FLINK-26492
> > > >> >> [1] FLIP-33, version 18:
> > > >> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=211883136
> > > >> >>
> > > >> >> Best,
> > > >> >> Qingsheng
> > > >>
> > > >
> > > >

Re: [DISCUSS] Reverting sink metric name changes made in 1.15

Posted by Xintong Song <to...@gmail.com>.
+1 for reverting these changes in Flink 1.16.

For 1.15.3, can we make these metrics available via both names (numXXXOut
and numXXXSend)? In this way we don't break it for those who already
migrated to 1.15 and numXXXSend. That means we still need to change
SinkWriterOperator to use another metric name in 1.15.3, which IIUC is
internal to Flink sink.

I'm overall +1 to change numXXXOut back to its original semantics. AFAIK
(from meetup / flink-forward questionaires), most users do not migrate to a
new Flink release immediately, until the next 1-2 major releases are out.

Best,

Xintong



On Mon, Oct 10, 2022 at 5:26 PM Martijn Visser <ma...@apache.org>
wrote:

> Hi Qingsheng,
>
> Do you have any idea what has happened in the process here? Do we know why
> they were changed? I was under the impression that these metric names were
> newly introduced due to the new interfaces and because it still depends on
> each connector implementing these.
>
> Sidenote: metric names are not mentioned in the FLIP process as a public
> API. Might make sense to have a separate follow-up to add that to the list
> (I do think we should list them there).
>
> +1 for reverting this and make this change in Flink 1.16
>
> I'm not in favour of releasing a Flink 1.15.3 with this change: I think the
> impact is too big for a patch version, especially given how long Flink 1.15
> is already out there.
>
> Best regards,
>
> Martijn
>
> On Mon, Oct 10, 2022 at 11:13 AM Leonard Xu <xb...@gmail.com> wrote:
>
> > Thanks Qingsheng for starting this thread.
> >
> > +1 on reverting sink metric name and releasing 1.15.3 to fix this
> > inconsistent behavior.
> >
> >
> > Best,
> > Leonard
> >
> >
> >
> >
> >
> > 2022年10月10日 下午3:06,Jark Wu <im...@gmail.com> 写道:
> >
> > Thanks for discovering this problem, Qingsheng!
> >
> > I'm also +1 for reverting the breaking changes.
> >
> > IIUC, currently, the behavior of "numXXXOut" metrics of the new and old
> > sink is inconsistent.
> > We have to break one of them to have consistent behavior. Sink V2 is an
> > evolving API which is just introduced in 1.15.
> > I think it makes sense to break the unstable API instead of the stable
> API
> > which many connectors and users depend on.
> >
> > Best,
> > Jark
> >
> >
> >
> > On Mon, 10 Oct 2022 at 11:36, Jingsong Li <ji...@gmail.com>
> wrote:
> >
> >> Thanks for driving, Qingsheng.
> >>
> >> +1 for reverting sink metric name.
> >>
> >> We often forget that metric is also one of the important APIs.
> >>
> >> +1 for releasing 1.15.3 to fix this.
> >>
> >> Best,
> >> Jingsong
> >>
> >> On Sun, Oct 9, 2022 at 11:35 PM Becket Qin <be...@gmail.com>
> wrote:
> >> >
> >> > Thanks for raising the discussion, Qingsheng,
> >> >
> >> > +1 on reverting the breaking changes.
> >> >
> >> > In addition, we might want to release a 1.15.3 to fix this and update
> >> the previous release docs with this known issue, so that users can
> upgrade
> >> to 1.15.3 when they hit it. It would also be good to add some backwards
> >> compatibility tests on metrics to avoid unintended breaking changes like
> >> this in the future.
> >> >
> >> > Thanks,
> >> >
> >> > Jiangjie (Becket) Qin
> >> >
> >> > On Sun, Oct 9, 2022 at 10:35 AM Qingsheng Ren <re...@apache.org>
> wrote:
> >> >>
> >> >> Hi devs and users,
> >> >>
> >> >> I’d like to start a discussion about reverting a breaking change
> about
> >> sink metrics made in 1.15 by FLINK-26126 [1] and FLINK-26492 [2].
> >> >>
> >> >> TL;DR
> >> >>
> >> >> All sink metrics with name “numXXXOut” defined in FLIP-33 are replace
> >> by “numXXXSend” in FLINK-26126 and FLINK-26492. Considering metric names
> >> are public APIs, this is a breaking change to end users and not backward
> >> compatible. Also unfortunately this breaking change was not discussed in
> >> the mailing list before.
> >> >>
> >> >> Background
> >> >>
> >> >> As defined previously in FLIP-33 (the FLIP page has been changed so
> >> please refer to the old version [3] ), metric “numRecordsOut” is used
> for
> >> reporting the total number of output records since the sink started
> (number
> >> of records written to the external system), and similarly for
> >> “numRecordsOutPerSecond”, “numBytesOut”, “numBytesOutPerSecond” and
> >> “numRecordsOutError”. Most sinks are following this naming and
> definition.
> >> However, these metrics are ambiguous in the new Sink API as “numXXXOut”
> >> could be used by the output of SinkWriterOperator for reporting number
> of
> >> Committables delivered to SinkCommitterOperator. In order to resolve the
> >> conflict, FLINK-26126 and FLINK-26492 changed names of these metrics
> with
> >> “numXXXSend”.
> >> >>
> >> >> Necessity of reverting this change
> >> >>
> >> >> - Metric names are actually public API, as end users need to
> configure
> >> metric collecting and alerting system with metric names. Users have to
> >> reset all configurations related to affected metrics.
> >> >> - This could also affect custom and external sinks not maintained by
> >> Flink, which might have implemented with numXXXOut metrics.
> >> >> - The number of records sent to external system is way more important
> >> than the number of Committables sent to SinkCommitterOperator, as the
> >> latter one is just an internal implementation of sink. We could have a
> new
> >> metric name for the latter one instead.
> >> >> - We could avoid splitting the project by version (like “plz use
> >> numXXXOut before 1.15 and use numXXXSend after”) if we revert it ASAP,
> >> cosidering 1.16 is still not released for now.
> >> >>
> >> >> As a consequence, I’d like to hear from devs and users about your
> >> opinion on changing these metrics back to “numXXXOut”.
> >> >>
> >> >> Looking forward to your reply!
> >> >>
> >> >> [1] https://issues.apache.org/jira/browse/FLINK-26126
> >> >> [2] https://issues.apache.org/jira/browse/FLINK-26492
> >> >> [1] FLIP-33, version 18:
> >>
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=211883136
> >> >>
> >> >> Best,
> >> >> Qingsheng
> >>
> >
> >
>

Re: [DISCUSS] Reverting sink metric name changes made in 1.15

Posted by Xintong Song <to...@gmail.com>.
+1 for reverting these changes in Flink 1.16.

For 1.15.3, can we make these metrics available via both names (numXXXOut
and numXXXSend)? In this way we don't break it for those who already
migrated to 1.15 and numXXXSend. That means we still need to change
SinkWriterOperator to use another metric name in 1.15.3, which IIUC is
internal to Flink sink.

I'm overall +1 to change numXXXOut back to its original semantics. AFAIK
(from meetup / flink-forward questionaires), most users do not migrate to a
new Flink release immediately, until the next 1-2 major releases are out.

Best,

Xintong



On Mon, Oct 10, 2022 at 5:26 PM Martijn Visser <ma...@apache.org>
wrote:

> Hi Qingsheng,
>
> Do you have any idea what has happened in the process here? Do we know why
> they were changed? I was under the impression that these metric names were
> newly introduced due to the new interfaces and because it still depends on
> each connector implementing these.
>
> Sidenote: metric names are not mentioned in the FLIP process as a public
> API. Might make sense to have a separate follow-up to add that to the list
> (I do think we should list them there).
>
> +1 for reverting this and make this change in Flink 1.16
>
> I'm not in favour of releasing a Flink 1.15.3 with this change: I think the
> impact is too big for a patch version, especially given how long Flink 1.15
> is already out there.
>
> Best regards,
>
> Martijn
>
> On Mon, Oct 10, 2022 at 11:13 AM Leonard Xu <xb...@gmail.com> wrote:
>
> > Thanks Qingsheng for starting this thread.
> >
> > +1 on reverting sink metric name and releasing 1.15.3 to fix this
> > inconsistent behavior.
> >
> >
> > Best,
> > Leonard
> >
> >
> >
> >
> >
> > 2022年10月10日 下午3:06,Jark Wu <im...@gmail.com> 写道:
> >
> > Thanks for discovering this problem, Qingsheng!
> >
> > I'm also +1 for reverting the breaking changes.
> >
> > IIUC, currently, the behavior of "numXXXOut" metrics of the new and old
> > sink is inconsistent.
> > We have to break one of them to have consistent behavior. Sink V2 is an
> > evolving API which is just introduced in 1.15.
> > I think it makes sense to break the unstable API instead of the stable
> API
> > which many connectors and users depend on.
> >
> > Best,
> > Jark
> >
> >
> >
> > On Mon, 10 Oct 2022 at 11:36, Jingsong Li <ji...@gmail.com>
> wrote:
> >
> >> Thanks for driving, Qingsheng.
> >>
> >> +1 for reverting sink metric name.
> >>
> >> We often forget that metric is also one of the important APIs.
> >>
> >> +1 for releasing 1.15.3 to fix this.
> >>
> >> Best,
> >> Jingsong
> >>
> >> On Sun, Oct 9, 2022 at 11:35 PM Becket Qin <be...@gmail.com>
> wrote:
> >> >
> >> > Thanks for raising the discussion, Qingsheng,
> >> >
> >> > +1 on reverting the breaking changes.
> >> >
> >> > In addition, we might want to release a 1.15.3 to fix this and update
> >> the previous release docs with this known issue, so that users can
> upgrade
> >> to 1.15.3 when they hit it. It would also be good to add some backwards
> >> compatibility tests on metrics to avoid unintended breaking changes like
> >> this in the future.
> >> >
> >> > Thanks,
> >> >
> >> > Jiangjie (Becket) Qin
> >> >
> >> > On Sun, Oct 9, 2022 at 10:35 AM Qingsheng Ren <re...@apache.org>
> wrote:
> >> >>
> >> >> Hi devs and users,
> >> >>
> >> >> I’d like to start a discussion about reverting a breaking change
> about
> >> sink metrics made in 1.15 by FLINK-26126 [1] and FLINK-26492 [2].
> >> >>
> >> >> TL;DR
> >> >>
> >> >> All sink metrics with name “numXXXOut” defined in FLIP-33 are replace
> >> by “numXXXSend” in FLINK-26126 and FLINK-26492. Considering metric names
> >> are public APIs, this is a breaking change to end users and not backward
> >> compatible. Also unfortunately this breaking change was not discussed in
> >> the mailing list before.
> >> >>
> >> >> Background
> >> >>
> >> >> As defined previously in FLIP-33 (the FLIP page has been changed so
> >> please refer to the old version [3] ), metric “numRecordsOut” is used
> for
> >> reporting the total number of output records since the sink started
> (number
> >> of records written to the external system), and similarly for
> >> “numRecordsOutPerSecond”, “numBytesOut”, “numBytesOutPerSecond” and
> >> “numRecordsOutError”. Most sinks are following this naming and
> definition.
> >> However, these metrics are ambiguous in the new Sink API as “numXXXOut”
> >> could be used by the output of SinkWriterOperator for reporting number
> of
> >> Committables delivered to SinkCommitterOperator. In order to resolve the
> >> conflict, FLINK-26126 and FLINK-26492 changed names of these metrics
> with
> >> “numXXXSend”.
> >> >>
> >> >> Necessity of reverting this change
> >> >>
> >> >> - Metric names are actually public API, as end users need to
> configure
> >> metric collecting and alerting system with metric names. Users have to
> >> reset all configurations related to affected metrics.
> >> >> - This could also affect custom and external sinks not maintained by
> >> Flink, which might have implemented with numXXXOut metrics.
> >> >> - The number of records sent to external system is way more important
> >> than the number of Committables sent to SinkCommitterOperator, as the
> >> latter one is just an internal implementation of sink. We could have a
> new
> >> metric name for the latter one instead.
> >> >> - We could avoid splitting the project by version (like “plz use
> >> numXXXOut before 1.15 and use numXXXSend after”) if we revert it ASAP,
> >> cosidering 1.16 is still not released for now.
> >> >>
> >> >> As a consequence, I’d like to hear from devs and users about your
> >> opinion on changing these metrics back to “numXXXOut”.
> >> >>
> >> >> Looking forward to your reply!
> >> >>
> >> >> [1] https://issues.apache.org/jira/browse/FLINK-26126
> >> >> [2] https://issues.apache.org/jira/browse/FLINK-26492
> >> >> [1] FLIP-33, version 18:
> >>
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=211883136
> >> >>
> >> >> Best,
> >> >> Qingsheng
> >>
> >
> >
>

Re: [DISCUSS] Reverting sink metric name changes made in 1.15

Posted by Chesnay Schepler <ch...@apache.org>.
On 10/10/2022 11:24, Martijn Visser wrote:
> Sidenote: metric names are not mentioned in the FLIP process as a 
> public API. Might make sense to have a separate follow-up to add that 
> to the list (I do think we should list them there).

That's a general issue we have. There's a lot of things we _ usually_ 
treat as a public API without having written it down; including

  * config options (I mean _keys_, not ConfigOption members)
  * CLI
  * REST API
  * metric names
  * scripts in distribution bin/ directory

Re: [DISCUSS] Reverting sink metric name changes made in 1.15

Posted by Chesnay Schepler <ch...@apache.org>.
On 10/10/2022 11:24, Martijn Visser wrote:
> Sidenote: metric names are not mentioned in the FLIP process as a 
> public API. Might make sense to have a separate follow-up to add that 
> to the list (I do think we should list them there).

That's a general issue we have. There's a lot of things we _ usually_ 
treat as a public API without having written it down; including

  * config options (I mean _keys_, not ConfigOption members)
  * CLI
  * REST API
  * metric names
  * scripts in distribution bin/ directory

Re: [DISCUSS] Reverting sink metric name changes made in 1.15

Posted by Martijn Visser <ma...@apache.org>.
Hi Qingsheng,

Do you have any idea what has happened in the process here? Do we know why
they were changed? I was under the impression that these metric names were
newly introduced due to the new interfaces and because it still depends on
each connector implementing these.

Sidenote: metric names are not mentioned in the FLIP process as a public
API. Might make sense to have a separate follow-up to add that to the list
(I do think we should list them there).

+1 for reverting this and make this change in Flink 1.16

I'm not in favour of releasing a Flink 1.15.3 with this change: I think the
impact is too big for a patch version, especially given how long Flink 1.15
is already out there.

Best regards,

Martijn

On Mon, Oct 10, 2022 at 11:13 AM Leonard Xu <xb...@gmail.com> wrote:

> Thanks Qingsheng for starting this thread.
>
> +1 on reverting sink metric name and releasing 1.15.3 to fix this
> inconsistent behavior.
>
>
> Best,
> Leonard
>
>
>
>
>
> 2022年10月10日 下午3:06,Jark Wu <im...@gmail.com> 写道:
>
> Thanks for discovering this problem, Qingsheng!
>
> I'm also +1 for reverting the breaking changes.
>
> IIUC, currently, the behavior of "numXXXOut" metrics of the new and old
> sink is inconsistent.
> We have to break one of them to have consistent behavior. Sink V2 is an
> evolving API which is just introduced in 1.15.
> I think it makes sense to break the unstable API instead of the stable API
> which many connectors and users depend on.
>
> Best,
> Jark
>
>
>
> On Mon, 10 Oct 2022 at 11:36, Jingsong Li <ji...@gmail.com> wrote:
>
>> Thanks for driving, Qingsheng.
>>
>> +1 for reverting sink metric name.
>>
>> We often forget that metric is also one of the important APIs.
>>
>> +1 for releasing 1.15.3 to fix this.
>>
>> Best,
>> Jingsong
>>
>> On Sun, Oct 9, 2022 at 11:35 PM Becket Qin <be...@gmail.com> wrote:
>> >
>> > Thanks for raising the discussion, Qingsheng,
>> >
>> > +1 on reverting the breaking changes.
>> >
>> > In addition, we might want to release a 1.15.3 to fix this and update
>> the previous release docs with this known issue, so that users can upgrade
>> to 1.15.3 when they hit it. It would also be good to add some backwards
>> compatibility tests on metrics to avoid unintended breaking changes like
>> this in the future.
>> >
>> > Thanks,
>> >
>> > Jiangjie (Becket) Qin
>> >
>> > On Sun, Oct 9, 2022 at 10:35 AM Qingsheng Ren <re...@apache.org> wrote:
>> >>
>> >> Hi devs and users,
>> >>
>> >> I’d like to start a discussion about reverting a breaking change about
>> sink metrics made in 1.15 by FLINK-26126 [1] and FLINK-26492 [2].
>> >>
>> >> TL;DR
>> >>
>> >> All sink metrics with name “numXXXOut” defined in FLIP-33 are replace
>> by “numXXXSend” in FLINK-26126 and FLINK-26492. Considering metric names
>> are public APIs, this is a breaking change to end users and not backward
>> compatible. Also unfortunately this breaking change was not discussed in
>> the mailing list before.
>> >>
>> >> Background
>> >>
>> >> As defined previously in FLIP-33 (the FLIP page has been changed so
>> please refer to the old version [3] ), metric “numRecordsOut” is used for
>> reporting the total number of output records since the sink started (number
>> of records written to the external system), and similarly for
>> “numRecordsOutPerSecond”, “numBytesOut”, “numBytesOutPerSecond” and
>> “numRecordsOutError”. Most sinks are following this naming and definition.
>> However, these metrics are ambiguous in the new Sink API as “numXXXOut”
>> could be used by the output of SinkWriterOperator for reporting number of
>> Committables delivered to SinkCommitterOperator. In order to resolve the
>> conflict, FLINK-26126 and FLINK-26492 changed names of these metrics with
>> “numXXXSend”.
>> >>
>> >> Necessity of reverting this change
>> >>
>> >> - Metric names are actually public API, as end users need to configure
>> metric collecting and alerting system with metric names. Users have to
>> reset all configurations related to affected metrics.
>> >> - This could also affect custom and external sinks not maintained by
>> Flink, which might have implemented with numXXXOut metrics.
>> >> - The number of records sent to external system is way more important
>> than the number of Committables sent to SinkCommitterOperator, as the
>> latter one is just an internal implementation of sink. We could have a new
>> metric name for the latter one instead.
>> >> - We could avoid splitting the project by version (like “plz use
>> numXXXOut before 1.15 and use numXXXSend after”) if we revert it ASAP,
>> cosidering 1.16 is still not released for now.
>> >>
>> >> As a consequence, I’d like to hear from devs and users about your
>> opinion on changing these metrics back to “numXXXOut”.
>> >>
>> >> Looking forward to your reply!
>> >>
>> >> [1] https://issues.apache.org/jira/browse/FLINK-26126
>> >> [2] https://issues.apache.org/jira/browse/FLINK-26492
>> >> [1] FLIP-33, version 18:
>> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=211883136
>> >>
>> >> Best,
>> >> Qingsheng
>>
>
>

Re: [DISCUSS] Reverting sink metric name changes made in 1.15

Posted by Martijn Visser <ma...@apache.org>.
Hi Qingsheng,

Do you have any idea what has happened in the process here? Do we know why
they were changed? I was under the impression that these metric names were
newly introduced due to the new interfaces and because it still depends on
each connector implementing these.

Sidenote: metric names are not mentioned in the FLIP process as a public
API. Might make sense to have a separate follow-up to add that to the list
(I do think we should list them there).

+1 for reverting this and make this change in Flink 1.16

I'm not in favour of releasing a Flink 1.15.3 with this change: I think the
impact is too big for a patch version, especially given how long Flink 1.15
is already out there.

Best regards,

Martijn

On Mon, Oct 10, 2022 at 11:13 AM Leonard Xu <xb...@gmail.com> wrote:

> Thanks Qingsheng for starting this thread.
>
> +1 on reverting sink metric name and releasing 1.15.3 to fix this
> inconsistent behavior.
>
>
> Best,
> Leonard
>
>
>
>
>
> 2022年10月10日 下午3:06,Jark Wu <im...@gmail.com> 写道:
>
> Thanks for discovering this problem, Qingsheng!
>
> I'm also +1 for reverting the breaking changes.
>
> IIUC, currently, the behavior of "numXXXOut" metrics of the new and old
> sink is inconsistent.
> We have to break one of them to have consistent behavior. Sink V2 is an
> evolving API which is just introduced in 1.15.
> I think it makes sense to break the unstable API instead of the stable API
> which many connectors and users depend on.
>
> Best,
> Jark
>
>
>
> On Mon, 10 Oct 2022 at 11:36, Jingsong Li <ji...@gmail.com> wrote:
>
>> Thanks for driving, Qingsheng.
>>
>> +1 for reverting sink metric name.
>>
>> We often forget that metric is also one of the important APIs.
>>
>> +1 for releasing 1.15.3 to fix this.
>>
>> Best,
>> Jingsong
>>
>> On Sun, Oct 9, 2022 at 11:35 PM Becket Qin <be...@gmail.com> wrote:
>> >
>> > Thanks for raising the discussion, Qingsheng,
>> >
>> > +1 on reverting the breaking changes.
>> >
>> > In addition, we might want to release a 1.15.3 to fix this and update
>> the previous release docs with this known issue, so that users can upgrade
>> to 1.15.3 when they hit it. It would also be good to add some backwards
>> compatibility tests on metrics to avoid unintended breaking changes like
>> this in the future.
>> >
>> > Thanks,
>> >
>> > Jiangjie (Becket) Qin
>> >
>> > On Sun, Oct 9, 2022 at 10:35 AM Qingsheng Ren <re...@apache.org> wrote:
>> >>
>> >> Hi devs and users,
>> >>
>> >> I’d like to start a discussion about reverting a breaking change about
>> sink metrics made in 1.15 by FLINK-26126 [1] and FLINK-26492 [2].
>> >>
>> >> TL;DR
>> >>
>> >> All sink metrics with name “numXXXOut” defined in FLIP-33 are replace
>> by “numXXXSend” in FLINK-26126 and FLINK-26492. Considering metric names
>> are public APIs, this is a breaking change to end users and not backward
>> compatible. Also unfortunately this breaking change was not discussed in
>> the mailing list before.
>> >>
>> >> Background
>> >>
>> >> As defined previously in FLIP-33 (the FLIP page has been changed so
>> please refer to the old version [3] ), metric “numRecordsOut” is used for
>> reporting the total number of output records since the sink started (number
>> of records written to the external system), and similarly for
>> “numRecordsOutPerSecond”, “numBytesOut”, “numBytesOutPerSecond” and
>> “numRecordsOutError”. Most sinks are following this naming and definition.
>> However, these metrics are ambiguous in the new Sink API as “numXXXOut”
>> could be used by the output of SinkWriterOperator for reporting number of
>> Committables delivered to SinkCommitterOperator. In order to resolve the
>> conflict, FLINK-26126 and FLINK-26492 changed names of these metrics with
>> “numXXXSend”.
>> >>
>> >> Necessity of reverting this change
>> >>
>> >> - Metric names are actually public API, as end users need to configure
>> metric collecting and alerting system with metric names. Users have to
>> reset all configurations related to affected metrics.
>> >> - This could also affect custom and external sinks not maintained by
>> Flink, which might have implemented with numXXXOut metrics.
>> >> - The number of records sent to external system is way more important
>> than the number of Committables sent to SinkCommitterOperator, as the
>> latter one is just an internal implementation of sink. We could have a new
>> metric name for the latter one instead.
>> >> - We could avoid splitting the project by version (like “plz use
>> numXXXOut before 1.15 and use numXXXSend after”) if we revert it ASAP,
>> cosidering 1.16 is still not released for now.
>> >>
>> >> As a consequence, I’d like to hear from devs and users about your
>> opinion on changing these metrics back to “numXXXOut”.
>> >>
>> >> Looking forward to your reply!
>> >>
>> >> [1] https://issues.apache.org/jira/browse/FLINK-26126
>> >> [2] https://issues.apache.org/jira/browse/FLINK-26492
>> >> [1] FLIP-33, version 18:
>> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=211883136
>> >>
>> >> Best,
>> >> Qingsheng
>>
>
>

Re: [DISCUSS] Reverting sink metric name changes made in 1.15

Posted by Leonard Xu <xb...@gmail.com>.
Thanks Qingsheng for starting this thread.

+1 on reverting sink metric name and releasing 1.15.3 to fix this inconsistent behavior.


Best,
Leonard





> 2022年10月10日 下午3:06,Jark Wu <im...@gmail.com> 写道:
> 
> Thanks for discovering this problem, Qingsheng!
> 
> I'm also +1 for reverting the breaking changes. 
> 
> IIUC, currently, the behavior of "numXXXOut" metrics of the new and old sink is inconsistent. 
> We have to break one of them to have consistent behavior. Sink V2 is an evolving API which is just introduced in 1.15. 
> I think it makes sense to break the unstable API instead of the stable API which many connectors and users depend on.
> 
> Best,
> Jark
> 
> 
> 
> On Mon, 10 Oct 2022 at 11:36, Jingsong Li <jingsonglee0@gmail.com <ma...@gmail.com>> wrote:
> Thanks for driving, Qingsheng.
> 
> +1 for reverting sink metric name.
> 
> We often forget that metric is also one of the important APIs.
> 
> +1 for releasing 1.15.3 to fix this.
> 
> Best,
> Jingsong
> 
> On Sun, Oct 9, 2022 at 11:35 PM Becket Qin <becket.qin@gmail.com <ma...@gmail.com>> wrote:
> >
> > Thanks for raising the discussion, Qingsheng,
> >
> > +1 on reverting the breaking changes.
> >
> > In addition, we might want to release a 1.15.3 to fix this and update the previous release docs with this known issue, so that users can upgrade to 1.15.3 when they hit it. It would also be good to add some backwards compatibility tests on metrics to avoid unintended breaking changes like this in the future.
> >
> > Thanks,
> >
> > Jiangjie (Becket) Qin
> >
> > On Sun, Oct 9, 2022 at 10:35 AM Qingsheng Ren <renqs@apache.org <ma...@apache.org>> wrote:
> >>
> >> Hi devs and users,
> >>
> >> I’d like to start a discussion about reverting a breaking change about sink metrics made in 1.15 by FLINK-26126 [1] and FLINK-26492 [2].
> >>
> >> TL;DR
> >>
> >> All sink metrics with name “numXXXOut” defined in FLIP-33 are replace by “numXXXSend” in FLINK-26126 and FLINK-26492. Considering metric names are public APIs, this is a breaking change to end users and not backward compatible. Also unfortunately this breaking change was not discussed in the mailing list before.
> >>
> >> Background
> >>
> >> As defined previously in FLIP-33 (the FLIP page has been changed so please refer to the old version [3] ), metric “numRecordsOut” is used for reporting the total number of output records since the sink started (number of records written to the external system), and similarly for “numRecordsOutPerSecond”, “numBytesOut”, “numBytesOutPerSecond” and “numRecordsOutError”. Most sinks are following this naming and definition. However, these metrics are ambiguous in the new Sink API as “numXXXOut” could be used by the output of SinkWriterOperator for reporting number of Committables delivered to SinkCommitterOperator. In order to resolve the conflict, FLINK-26126 and FLINK-26492 changed names of these metrics with “numXXXSend”.
> >>
> >> Necessity of reverting this change
> >>
> >> - Metric names are actually public API, as end users need to configure metric collecting and alerting system with metric names. Users have to reset all configurations related to affected metrics.
> >> - This could also affect custom and external sinks not maintained by Flink, which might have implemented with numXXXOut metrics.
> >> - The number of records sent to external system is way more important than the number of Committables sent to SinkCommitterOperator, as the latter one is just an internal implementation of sink. We could have a new metric name for the latter one instead.
> >> - We could avoid splitting the project by version (like “plz use numXXXOut before 1.15 and use numXXXSend after”) if we revert it ASAP, cosidering 1.16 is still not released for now.
> >>
> >> As a consequence, I’d like to hear from devs and users about your opinion on changing these metrics back to “numXXXOut”.
> >>
> >> Looking forward to your reply!
> >>
> >> [1] https://issues.apache.org/jira/browse/FLINK-26126 <https://issues.apache.org/jira/browse/FLINK-26126>
> >> [2] https://issues.apache.org/jira/browse/FLINK-26492 <https://issues.apache.org/jira/browse/FLINK-26492>
> >> [1] FLIP-33, version 18: https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=211883136 <https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=211883136>
> >>
> >> Best,
> >> Qingsheng


Re: [DISCUSS] Reverting sink metric name changes made in 1.15

Posted by Leonard Xu <xb...@gmail.com>.
Thanks Qingsheng for starting this thread.

+1 on reverting sink metric name and releasing 1.15.3 to fix this inconsistent behavior.


Best,
Leonard





> 2022年10月10日 下午3:06,Jark Wu <im...@gmail.com> 写道:
> 
> Thanks for discovering this problem, Qingsheng!
> 
> I'm also +1 for reverting the breaking changes. 
> 
> IIUC, currently, the behavior of "numXXXOut" metrics of the new and old sink is inconsistent. 
> We have to break one of them to have consistent behavior. Sink V2 is an evolving API which is just introduced in 1.15. 
> I think it makes sense to break the unstable API instead of the stable API which many connectors and users depend on.
> 
> Best,
> Jark
> 
> 
> 
> On Mon, 10 Oct 2022 at 11:36, Jingsong Li <jingsonglee0@gmail.com <ma...@gmail.com>> wrote:
> Thanks for driving, Qingsheng.
> 
> +1 for reverting sink metric name.
> 
> We often forget that metric is also one of the important APIs.
> 
> +1 for releasing 1.15.3 to fix this.
> 
> Best,
> Jingsong
> 
> On Sun, Oct 9, 2022 at 11:35 PM Becket Qin <becket.qin@gmail.com <ma...@gmail.com>> wrote:
> >
> > Thanks for raising the discussion, Qingsheng,
> >
> > +1 on reverting the breaking changes.
> >
> > In addition, we might want to release a 1.15.3 to fix this and update the previous release docs with this known issue, so that users can upgrade to 1.15.3 when they hit it. It would also be good to add some backwards compatibility tests on metrics to avoid unintended breaking changes like this in the future.
> >
> > Thanks,
> >
> > Jiangjie (Becket) Qin
> >
> > On Sun, Oct 9, 2022 at 10:35 AM Qingsheng Ren <renqs@apache.org <ma...@apache.org>> wrote:
> >>
> >> Hi devs and users,
> >>
> >> I’d like to start a discussion about reverting a breaking change about sink metrics made in 1.15 by FLINK-26126 [1] and FLINK-26492 [2].
> >>
> >> TL;DR
> >>
> >> All sink metrics with name “numXXXOut” defined in FLIP-33 are replace by “numXXXSend” in FLINK-26126 and FLINK-26492. Considering metric names are public APIs, this is a breaking change to end users and not backward compatible. Also unfortunately this breaking change was not discussed in the mailing list before.
> >>
> >> Background
> >>
> >> As defined previously in FLIP-33 (the FLIP page has been changed so please refer to the old version [3] ), metric “numRecordsOut” is used for reporting the total number of output records since the sink started (number of records written to the external system), and similarly for “numRecordsOutPerSecond”, “numBytesOut”, “numBytesOutPerSecond” and “numRecordsOutError”. Most sinks are following this naming and definition. However, these metrics are ambiguous in the new Sink API as “numXXXOut” could be used by the output of SinkWriterOperator for reporting number of Committables delivered to SinkCommitterOperator. In order to resolve the conflict, FLINK-26126 and FLINK-26492 changed names of these metrics with “numXXXSend”.
> >>
> >> Necessity of reverting this change
> >>
> >> - Metric names are actually public API, as end users need to configure metric collecting and alerting system with metric names. Users have to reset all configurations related to affected metrics.
> >> - This could also affect custom and external sinks not maintained by Flink, which might have implemented with numXXXOut metrics.
> >> - The number of records sent to external system is way more important than the number of Committables sent to SinkCommitterOperator, as the latter one is just an internal implementation of sink. We could have a new metric name for the latter one instead.
> >> - We could avoid splitting the project by version (like “plz use numXXXOut before 1.15 and use numXXXSend after”) if we revert it ASAP, cosidering 1.16 is still not released for now.
> >>
> >> As a consequence, I’d like to hear from devs and users about your opinion on changing these metrics back to “numXXXOut”.
> >>
> >> Looking forward to your reply!
> >>
> >> [1] https://issues.apache.org/jira/browse/FLINK-26126 <https://issues.apache.org/jira/browse/FLINK-26126>
> >> [2] https://issues.apache.org/jira/browse/FLINK-26492 <https://issues.apache.org/jira/browse/FLINK-26492>
> >> [1] FLIP-33, version 18: https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=211883136 <https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=211883136>
> >>
> >> Best,
> >> Qingsheng


Re: [DISCUSS] Reverting sink metric name changes made in 1.15

Posted by Jark Wu <im...@gmail.com>.
Thanks for discovering this problem, Qingsheng!

I'm also +1 for reverting the breaking changes.

IIUC, currently, the behavior of "numXXXOut" metrics of the new and old
sink is inconsistent.
We have to break one of them to have consistent behavior. Sink V2 is an
evolving API which is just introduced in 1.15.
I think it makes sense to break the unstable API instead of the stable API
which many connectors and users depend on.

Best,
Jark



On Mon, 10 Oct 2022 at 11:36, Jingsong Li <ji...@gmail.com> wrote:

> Thanks for driving, Qingsheng.
>
> +1 for reverting sink metric name.
>
> We often forget that metric is also one of the important APIs.
>
> +1 for releasing 1.15.3 to fix this.
>
> Best,
> Jingsong
>
> On Sun, Oct 9, 2022 at 11:35 PM Becket Qin <be...@gmail.com> wrote:
> >
> > Thanks for raising the discussion, Qingsheng,
> >
> > +1 on reverting the breaking changes.
> >
> > In addition, we might want to release a 1.15.3 to fix this and update
> the previous release docs with this known issue, so that users can upgrade
> to 1.15.3 when they hit it. It would also be good to add some backwards
> compatibility tests on metrics to avoid unintended breaking changes like
> this in the future.
> >
> > Thanks,
> >
> > Jiangjie (Becket) Qin
> >
> > On Sun, Oct 9, 2022 at 10:35 AM Qingsheng Ren <re...@apache.org> wrote:
> >>
> >> Hi devs and users,
> >>
> >> I’d like to start a discussion about reverting a breaking change about
> sink metrics made in 1.15 by FLINK-26126 [1] and FLINK-26492 [2].
> >>
> >> TL;DR
> >>
> >> All sink metrics with name “numXXXOut” defined in FLIP-33 are replace
> by “numXXXSend” in FLINK-26126 and FLINK-26492. Considering metric names
> are public APIs, this is a breaking change to end users and not backward
> compatible. Also unfortunately this breaking change was not discussed in
> the mailing list before.
> >>
> >> Background
> >>
> >> As defined previously in FLIP-33 (the FLIP page has been changed so
> please refer to the old version [3] ), metric “numRecordsOut” is used for
> reporting the total number of output records since the sink started (number
> of records written to the external system), and similarly for
> “numRecordsOutPerSecond”, “numBytesOut”, “numBytesOutPerSecond” and
> “numRecordsOutError”. Most sinks are following this naming and definition.
> However, these metrics are ambiguous in the new Sink API as “numXXXOut”
> could be used by the output of SinkWriterOperator for reporting number of
> Committables delivered to SinkCommitterOperator. In order to resolve the
> conflict, FLINK-26126 and FLINK-26492 changed names of these metrics with
> “numXXXSend”.
> >>
> >> Necessity of reverting this change
> >>
> >> - Metric names are actually public API, as end users need to configure
> metric collecting and alerting system with metric names. Users have to
> reset all configurations related to affected metrics.
> >> - This could also affect custom and external sinks not maintained by
> Flink, which might have implemented with numXXXOut metrics.
> >> - The number of records sent to external system is way more important
> than the number of Committables sent to SinkCommitterOperator, as the
> latter one is just an internal implementation of sink. We could have a new
> metric name for the latter one instead.
> >> - We could avoid splitting the project by version (like “plz use
> numXXXOut before 1.15 and use numXXXSend after”) if we revert it ASAP,
> cosidering 1.16 is still not released for now.
> >>
> >> As a consequence, I’d like to hear from devs and users about your
> opinion on changing these metrics back to “numXXXOut”.
> >>
> >> Looking forward to your reply!
> >>
> >> [1] https://issues.apache.org/jira/browse/FLINK-26126
> >> [2] https://issues.apache.org/jira/browse/FLINK-26492
> >> [1] FLIP-33, version 18:
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=211883136
> >>
> >> Best,
> >> Qingsheng
>

Re: [DISCUSS] Reverting sink metric name changes made in 1.15

Posted by Jark Wu <im...@gmail.com>.
Thanks for discovering this problem, Qingsheng!

I'm also +1 for reverting the breaking changes.

IIUC, currently, the behavior of "numXXXOut" metrics of the new and old
sink is inconsistent.
We have to break one of them to have consistent behavior. Sink V2 is an
evolving API which is just introduced in 1.15.
I think it makes sense to break the unstable API instead of the stable API
which many connectors and users depend on.

Best,
Jark



On Mon, 10 Oct 2022 at 11:36, Jingsong Li <ji...@gmail.com> wrote:

> Thanks for driving, Qingsheng.
>
> +1 for reverting sink metric name.
>
> We often forget that metric is also one of the important APIs.
>
> +1 for releasing 1.15.3 to fix this.
>
> Best,
> Jingsong
>
> On Sun, Oct 9, 2022 at 11:35 PM Becket Qin <be...@gmail.com> wrote:
> >
> > Thanks for raising the discussion, Qingsheng,
> >
> > +1 on reverting the breaking changes.
> >
> > In addition, we might want to release a 1.15.3 to fix this and update
> the previous release docs with this known issue, so that users can upgrade
> to 1.15.3 when they hit it. It would also be good to add some backwards
> compatibility tests on metrics to avoid unintended breaking changes like
> this in the future.
> >
> > Thanks,
> >
> > Jiangjie (Becket) Qin
> >
> > On Sun, Oct 9, 2022 at 10:35 AM Qingsheng Ren <re...@apache.org> wrote:
> >>
> >> Hi devs and users,
> >>
> >> I’d like to start a discussion about reverting a breaking change about
> sink metrics made in 1.15 by FLINK-26126 [1] and FLINK-26492 [2].
> >>
> >> TL;DR
> >>
> >> All sink metrics with name “numXXXOut” defined in FLIP-33 are replace
> by “numXXXSend” in FLINK-26126 and FLINK-26492. Considering metric names
> are public APIs, this is a breaking change to end users and not backward
> compatible. Also unfortunately this breaking change was not discussed in
> the mailing list before.
> >>
> >> Background
> >>
> >> As defined previously in FLIP-33 (the FLIP page has been changed so
> please refer to the old version [3] ), metric “numRecordsOut” is used for
> reporting the total number of output records since the sink started (number
> of records written to the external system), and similarly for
> “numRecordsOutPerSecond”, “numBytesOut”, “numBytesOutPerSecond” and
> “numRecordsOutError”. Most sinks are following this naming and definition.
> However, these metrics are ambiguous in the new Sink API as “numXXXOut”
> could be used by the output of SinkWriterOperator for reporting number of
> Committables delivered to SinkCommitterOperator. In order to resolve the
> conflict, FLINK-26126 and FLINK-26492 changed names of these metrics with
> “numXXXSend”.
> >>
> >> Necessity of reverting this change
> >>
> >> - Metric names are actually public API, as end users need to configure
> metric collecting and alerting system with metric names. Users have to
> reset all configurations related to affected metrics.
> >> - This could also affect custom and external sinks not maintained by
> Flink, which might have implemented with numXXXOut metrics.
> >> - The number of records sent to external system is way more important
> than the number of Committables sent to SinkCommitterOperator, as the
> latter one is just an internal implementation of sink. We could have a new
> metric name for the latter one instead.
> >> - We could avoid splitting the project by version (like “plz use
> numXXXOut before 1.15 and use numXXXSend after”) if we revert it ASAP,
> cosidering 1.16 is still not released for now.
> >>
> >> As a consequence, I’d like to hear from devs and users about your
> opinion on changing these metrics back to “numXXXOut”.
> >>
> >> Looking forward to your reply!
> >>
> >> [1] https://issues.apache.org/jira/browse/FLINK-26126
> >> [2] https://issues.apache.org/jira/browse/FLINK-26492
> >> [1] FLIP-33, version 18:
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=211883136
> >>
> >> Best,
> >> Qingsheng
>

Re: [DISCUSS] Reverting sink metric name changes made in 1.15

Posted by Jingsong Li <ji...@gmail.com>.
Thanks for driving, Qingsheng.

+1 for reverting sink metric name.

We often forget that metric is also one of the important APIs.

+1 for releasing 1.15.3 to fix this.

Best,
Jingsong

On Sun, Oct 9, 2022 at 11:35 PM Becket Qin <be...@gmail.com> wrote:
>
> Thanks for raising the discussion, Qingsheng,
>
> +1 on reverting the breaking changes.
>
> In addition, we might want to release a 1.15.3 to fix this and update the previous release docs with this known issue, so that users can upgrade to 1.15.3 when they hit it. It would also be good to add some backwards compatibility tests on metrics to avoid unintended breaking changes like this in the future.
>
> Thanks,
>
> Jiangjie (Becket) Qin
>
> On Sun, Oct 9, 2022 at 10:35 AM Qingsheng Ren <re...@apache.org> wrote:
>>
>> Hi devs and users,
>>
>> I’d like to start a discussion about reverting a breaking change about sink metrics made in 1.15 by FLINK-26126 [1] and FLINK-26492 [2].
>>
>> TL;DR
>>
>> All sink metrics with name “numXXXOut” defined in FLIP-33 are replace by “numXXXSend” in FLINK-26126 and FLINK-26492. Considering metric names are public APIs, this is a breaking change to end users and not backward compatible. Also unfortunately this breaking change was not discussed in the mailing list before.
>>
>> Background
>>
>> As defined previously in FLIP-33 (the FLIP page has been changed so please refer to the old version [3] ), metric “numRecordsOut” is used for reporting the total number of output records since the sink started (number of records written to the external system), and similarly for “numRecordsOutPerSecond”, “numBytesOut”, “numBytesOutPerSecond” and “numRecordsOutError”. Most sinks are following this naming and definition. However, these metrics are ambiguous in the new Sink API as “numXXXOut” could be used by the output of SinkWriterOperator for reporting number of Committables delivered to SinkCommitterOperator. In order to resolve the conflict, FLINK-26126 and FLINK-26492 changed names of these metrics with “numXXXSend”.
>>
>> Necessity of reverting this change
>>
>> - Metric names are actually public API, as end users need to configure metric collecting and alerting system with metric names. Users have to reset all configurations related to affected metrics.
>> - This could also affect custom and external sinks not maintained by Flink, which might have implemented with numXXXOut metrics.
>> - The number of records sent to external system is way more important than the number of Committables sent to SinkCommitterOperator, as the latter one is just an internal implementation of sink. We could have a new metric name for the latter one instead.
>> - We could avoid splitting the project by version (like “plz use numXXXOut before 1.15 and use numXXXSend after”) if we revert it ASAP, cosidering 1.16 is still not released for now.
>>
>> As a consequence, I’d like to hear from devs and users about your opinion on changing these metrics back to “numXXXOut”.
>>
>> Looking forward to your reply!
>>
>> [1] https://issues.apache.org/jira/browse/FLINK-26126
>> [2] https://issues.apache.org/jira/browse/FLINK-26492
>> [1] FLIP-33, version 18: https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=211883136
>>
>> Best,
>> Qingsheng

Re: [DISCUSS] Reverting sink metric name changes made in 1.15

Posted by Jingsong Li <ji...@gmail.com>.
Thanks for driving, Qingsheng.

+1 for reverting sink metric name.

We often forget that metric is also one of the important APIs.

+1 for releasing 1.15.3 to fix this.

Best,
Jingsong

On Sun, Oct 9, 2022 at 11:35 PM Becket Qin <be...@gmail.com> wrote:
>
> Thanks for raising the discussion, Qingsheng,
>
> +1 on reverting the breaking changes.
>
> In addition, we might want to release a 1.15.3 to fix this and update the previous release docs with this known issue, so that users can upgrade to 1.15.3 when they hit it. It would also be good to add some backwards compatibility tests on metrics to avoid unintended breaking changes like this in the future.
>
> Thanks,
>
> Jiangjie (Becket) Qin
>
> On Sun, Oct 9, 2022 at 10:35 AM Qingsheng Ren <re...@apache.org> wrote:
>>
>> Hi devs and users,
>>
>> I’d like to start a discussion about reverting a breaking change about sink metrics made in 1.15 by FLINK-26126 [1] and FLINK-26492 [2].
>>
>> TL;DR
>>
>> All sink metrics with name “numXXXOut” defined in FLIP-33 are replace by “numXXXSend” in FLINK-26126 and FLINK-26492. Considering metric names are public APIs, this is a breaking change to end users and not backward compatible. Also unfortunately this breaking change was not discussed in the mailing list before.
>>
>> Background
>>
>> As defined previously in FLIP-33 (the FLIP page has been changed so please refer to the old version [3] ), metric “numRecordsOut” is used for reporting the total number of output records since the sink started (number of records written to the external system), and similarly for “numRecordsOutPerSecond”, “numBytesOut”, “numBytesOutPerSecond” and “numRecordsOutError”. Most sinks are following this naming and definition. However, these metrics are ambiguous in the new Sink API as “numXXXOut” could be used by the output of SinkWriterOperator for reporting number of Committables delivered to SinkCommitterOperator. In order to resolve the conflict, FLINK-26126 and FLINK-26492 changed names of these metrics with “numXXXSend”.
>>
>> Necessity of reverting this change
>>
>> - Metric names are actually public API, as end users need to configure metric collecting and alerting system with metric names. Users have to reset all configurations related to affected metrics.
>> - This could also affect custom and external sinks not maintained by Flink, which might have implemented with numXXXOut metrics.
>> - The number of records sent to external system is way more important than the number of Committables sent to SinkCommitterOperator, as the latter one is just an internal implementation of sink. We could have a new metric name for the latter one instead.
>> - We could avoid splitting the project by version (like “plz use numXXXOut before 1.15 and use numXXXSend after”) if we revert it ASAP, cosidering 1.16 is still not released for now.
>>
>> As a consequence, I’d like to hear from devs and users about your opinion on changing these metrics back to “numXXXOut”.
>>
>> Looking forward to your reply!
>>
>> [1] https://issues.apache.org/jira/browse/FLINK-26126
>> [2] https://issues.apache.org/jira/browse/FLINK-26492
>> [1] FLIP-33, version 18: https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=211883136
>>
>> Best,
>> Qingsheng

Re: [DISCUSS] Reverting sink metric name changes made in 1.15

Posted by Becket Qin <be...@gmail.com>.
Thanks for raising the discussion, Qingsheng,

+1 on reverting the breaking changes.

In addition, we might want to release a 1.15.3 to fix this and update the
previous release docs with this known issue, so that users can upgrade to
1.15.3 when they hit it. It would also be good to add some backwards
compatibility tests on metrics to avoid unintended breaking changes like
this in the future.

Thanks,

Jiangjie (Becket) Qin

On Sun, Oct 9, 2022 at 10:35 AM Qingsheng Ren <re...@apache.org> wrote:

> Hi devs and users,
>
> I’d like to start a discussion about reverting a breaking change about sink metrics made in 1.15 by FLINK-26126
> [1] and FLINK-26492 [2].
>
> TL;DR
>
>
> All sink metrics with name “numXXXOut” defined in FLIP-33 are replace by “numXXXSend” in FLINK-26126 and FLINK-26492. Considering metric names are public APIs, this is a breaking change to end users and not backward compatible. Also unfortunately this breaking change was not discussed in the mailing list before.
>
> Background
>
>
> As defined previously in FLIP-33 (the FLIP page has been changed so please refer to the old version [3] ), metric “numRecordsOut” is used for reporting the total number of output records since the sink started (number of records written to the external system), and similarly for “numRecordsOutPerSecond”, “numBytesOut”, “numBytesOutPerSecond” and “numRecordsOutError”. Most sinks are following this naming and definition. However, these metrics are ambiguous in the new Sink API as “numXXXOut” could be used by the output of SinkWriterOperator for reporting number of Committables delivered to SinkCommitterOperator. In order to resolve the conflict, FLINK-26126 and FLINK-26492 changed names of these metrics with “numXXXSend”.
>
> Necessity of reverting this change
>
>
> - Metric names are actually public API, as end users need to configure metric collecting and alerting system with metric names. Users have to reset all configurations related to affected metrics.
>
> - This could also affect custom and external sinks not maintained by Flink, which might have implemented with numXXXOut metrics.
>
> - The number of records sent to external system is way more important than the number of Committables sent to SinkCommitterOperator, as the latter one is just an internal implementation of sink. We could have a new metric name for the latter one instead.
>
> - We could avoid splitting the project by version (like “plz use numXXXOut before 1.15 and use numXXXSend after”) if we revert it ASAP, cosidering 1.16 is still not released for now.
>
>
> As a consequence, I’d like to hear from devs and users about your opinion on changing these metrics back to “numXXXOut”.
>
> Looking forward to your reply!
>
> [1] https://issues.apache.org/jira/browse/FLINK-26126
> [2] https://issues.apache.org/jira/browse/FLINK-26492
> [1] FLIP-33, version 18:
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=211883136
>
> Best,
> Qingsheng
>

Re: [DISCUSS] Reverting sink metric name changes made in 1.15

Posted by Becket Qin <be...@gmail.com>.
Thanks for raising the discussion, Qingsheng,

+1 on reverting the breaking changes.

In addition, we might want to release a 1.15.3 to fix this and update the
previous release docs with this known issue, so that users can upgrade to
1.15.3 when they hit it. It would also be good to add some backwards
compatibility tests on metrics to avoid unintended breaking changes like
this in the future.

Thanks,

Jiangjie (Becket) Qin

On Sun, Oct 9, 2022 at 10:35 AM Qingsheng Ren <re...@apache.org> wrote:

> Hi devs and users,
>
> I’d like to start a discussion about reverting a breaking change about sink metrics made in 1.15 by FLINK-26126
> [1] and FLINK-26492 [2].
>
> TL;DR
>
>
> All sink metrics with name “numXXXOut” defined in FLIP-33 are replace by “numXXXSend” in FLINK-26126 and FLINK-26492. Considering metric names are public APIs, this is a breaking change to end users and not backward compatible. Also unfortunately this breaking change was not discussed in the mailing list before.
>
> Background
>
>
> As defined previously in FLIP-33 (the FLIP page has been changed so please refer to the old version [3] ), metric “numRecordsOut” is used for reporting the total number of output records since the sink started (number of records written to the external system), and similarly for “numRecordsOutPerSecond”, “numBytesOut”, “numBytesOutPerSecond” and “numRecordsOutError”. Most sinks are following this naming and definition. However, these metrics are ambiguous in the new Sink API as “numXXXOut” could be used by the output of SinkWriterOperator for reporting number of Committables delivered to SinkCommitterOperator. In order to resolve the conflict, FLINK-26126 and FLINK-26492 changed names of these metrics with “numXXXSend”.
>
> Necessity of reverting this change
>
>
> - Metric names are actually public API, as end users need to configure metric collecting and alerting system with metric names. Users have to reset all configurations related to affected metrics.
>
> - This could also affect custom and external sinks not maintained by Flink, which might have implemented with numXXXOut metrics.
>
> - The number of records sent to external system is way more important than the number of Committables sent to SinkCommitterOperator, as the latter one is just an internal implementation of sink. We could have a new metric name for the latter one instead.
>
> - We could avoid splitting the project by version (like “plz use numXXXOut before 1.15 and use numXXXSend after”) if we revert it ASAP, cosidering 1.16 is still not released for now.
>
>
> As a consequence, I’d like to hear from devs and users about your opinion on changing these metrics back to “numXXXOut”.
>
> Looking forward to your reply!
>
> [1] https://issues.apache.org/jira/browse/FLINK-26126
> [2] https://issues.apache.org/jira/browse/FLINK-26492
> [1] FLIP-33, version 18:
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=211883136
>
> Best,
> Qingsheng
>