You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@flink.apache.org by Hang Ruan <ru...@gmail.com> on 2022/12/08 07:28:00 UTC

[DISCUSS] FLIP-274 : Introduce metric group for OperatorCoordinator

Hi all,

MengYue and I created FLIP-274[1] Introduce metric group for
OperatorCoordinator. OperatorCoordinator is the coordinator for runtime
operators and running on Job Manager. The coordination mechanism is
operator events between OperatorCoordinator and its all operators, the
coordination is more and more using in Flink, for example many Sources and
Sinks depend on the mechanism to assign splits and coordinate commits to
external systems. The OperatorCoordinator is widely using in flink kafka
connector, flink pulsar connector, flink cdc connector, flink hudi
connector and so on.

But there is not a suitable metric group scope for the OperatorCoordinator
and not an implementation for the interface OperatorCoordinatorMetricGroup.
These metrics in OperatorCoordinator could be how many splits/partitions
have been assigned to source readers, how many files have been written out
by sink writers, these metrics not only help users to know the job progress
but also make big job maintaining easier. Thus we propose the FLIP-274 to
introduce a new metric group scope for OperatorCoordinator and provide an
internal implementation for OperatorCoordinatorMetricGroup.

Could you help review this FLIP when you get time? Any feedback is
appreciated!

Best,
Hang

[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-274%3A+Introduce+metric+group+for+OperatorCoordinator

Re: [DISCUSS] FLIP-274 : Introduce metric group for OperatorCoordinator

Posted by Hang Ruan <ru...@gmail.com>.

Hi everyone,

All issues about FLIP-274 have been resolved and consensus has been
reached. If there are no further problems, we plan to start the voting
thread.

Thanks a lot for participating in the vote discussion. Special thanks
to @Chesnay
Schepler <ch...@apache.org> and @Jark Wu <im...@gmail.com>, their
suggestions and discussion make this FLIP clearer and more reasonable.

Best,
Hang

Chesnay Schepler <ch...@apache.org> 于2023年2月2日周四 18:44写道：

> Looks good from my side.
>
> On 01/02/2023 13:27, Hang Ruan wrote:
> > Hi all,
> >
> > I make a mistake about the default value of the  scope format
> > JobManagerOperatorScopeFormat. It should be
> > <host>.jobmanager.<job_name>.<operator_name>.
> > I have fixed it in the FLIP.
> > And the variables (task id and task name) have been added to the list.
> >
> > Best,
> > Hang
> >
> >
> > Hang Ruan <ru...@gmail.com> 于2023年2月1日周三 14:11写道：
> >
> >> Hi all,
> >>
> >>> IIUC, there is only an OperatorCoordinatorMetricGroup interface without
> >> implementations (it never worked). There were no coordinator metrics
> >> implemented and registered. @Hang Ruan please correct me if I'm wrong.
> >> Yes,  we shouldn't have to worry about backwards-compatibility as
> Chesnay
> >> said.
> >>
> >> After the discussion, I think the FLIP should be modified as follow.
> >> 1.  Add new metric group JobManagerOperatorMetricGroup, whose scope
> >> format is JobManagerOperatorScopeFormat. The config key of
> JobManagerOperatorScopeFormat
> >> is metrics.scope.jm-operator, and the default value is
> >> <host>.jobmanager.<job_name>.<task_name>.<operator_name>.
> >> 2. JobManagerOperatorMetricGroup will be sub components of
> >> JobManagerJobMetricGroup.
> >> 3. InternalOperatorCoordinatorMetricGroup will extend ProxyMetricGroup,
> >> which registered under JobManagerOperatorMetricGroup by
> >> addGroup("coordinator").
> >>
> >> I have updated the FLIP-274. @Chesnay Schepler <ch...@apache.org>
> @Jark
> >> Wu <im...@gmail.com> , please help to take a look at the modified FLIP
> >> and make sure that I do not miss  something.
> >>
> >> If there are no more problems, we plan to start the voting thread later.
> >>
> >> Best,
> >> Hang
> >>
> >> Chesnay Schepler <ch...@apache.org> 于2023年1月31日周二 20:09写道：
> >>
> >>>> IIUC, there is only an OperatorCoordinatorMetricGroup interface
> without
> >>> implementations (it never worked). There were no coordinator metrics
> >>> implemented and registered. @Hang Ruan <ru...@gmail.com> please
> >>> correct me if I'm wrong.
> >>>
> >>> Indeed. There are interfaces, and theoretically a metric group is
> exposed
> >>> to user-code via the SplitEnumeratorContext, but the metric group is
> always
> >>> null in the SourceCoordinatorContext. So we shouldn't have to worry
> about
> >>> backwards-compatibility.
> >>>
> >>>> However, the config key "metrics.scope.jm.job" is deprecated and is
> >>> replaced by "metrics.scope.jm-job". Similarly, "metrics.scope.tm.job"
> is
> >>> replaced with "metrics.scope.tm-job". That's why I though
> >>> "metrics.scope.jm-operator" might be more consistent.
> >>>
> >>> HA! Good catch. I only looked at the 1.16 docs where they weren't
> >>> deprecated yet.
> >>>
> >>> Then it should be "jm-operator" indeed.
> >>>
> >>> On 31/01/2023 10:51, Jark Wu wrote:
> >>>
> >>> Hi Chesnay,
> >>>
> >>>> The new option must default to the configured value of
> >>> metrics.scope.jm.job because otherwise it isn't backwards-compatible
> with
> >>> existing coordinator metrics.
> >>> IIUC, there is only an OperatorCoordinatorMetricGroup interface without
> >>> implementations (it never worked). There were no coordinator metrics
> >>> implemented and registered. @Hang Ruan <ru...@gmail.com> please
> >>> correct me if I'm wrong.
> >>>
> >>>> Intuitively I think the config key should be
> >>> metrics.scope.jm.operator., similar to metrics.scope.jm.job.
> >>> However, the config key "metrics.scope.jm.job" is deprecated and is
> >>> replaced by "metrics.scope.jm-job". Similarly, "metrics.scope.tm.job"
> is
> >>> replaced with "metrics.scope.tm-job". That's why I though
> >>> "metrics.scope.jm-operator" might be more consistent.
> >>>
> >>>> We would only support operator_name, operator_id, task_name and
> task_id
> >>> (+ everything else from the jm/job groups of course).
> >>> +1
> >>>
> >>> Best,
> >>> Jark
> >>>
> >>>
> >>>
> >>> On Mon, 30 Jan 2023 at 20:09, Chesnay Schepler <ch...@apache.org>
> >>> wrote:
> >>>
> >>>> The new option must default to the configured value of
> >>>> metrics.scope.jm.job because otherwise it isn't backwards-compatible
> with
> >>>> existing coordinator metrics.
> >>>> According to some earlier response in this thread there are already
> >>>> coordinator metrics.
> >>>>
> >>>> If that weren't the case you'd be correct, and we'd keep the default
> >>>> consistent with the default for tm operator metrics.
> >>>>
> >>>> Intuitively I think the config key should be
> metrics.scope.jm.operator.,
> >>>> similar to metrics.scope.jm.job.
> >>>> For symmetry we should also add new metrics.scope.tm.operator/task and
> >>>> eventually phase out metrics.scope.operator.
> >>>>
> >>>>> Regarding "coordinator metrics don't *belong to* any tasks/vertexes",
> >>>> I mean the coordinator doesn't run on specific tasks, so it doesn't
> have
> >>>> runtime information of tasks.
> >>>>
> >>>> That is true. I intentionally only referred to task id/name, as things
> >>>> like subtask index or attempt ids don't make sense for coordinators
> as you
> >>>> said.
> >>>> We would only support operator_name, operator_id, task_name and
> task_id
> >>>> (+ everything else from the jm/job groups of course).
> >>>> (Maybe in the future attempt_num could be interesting though)
> >>>>
> >>>> On 30/01/2023 10:39, Jark Wu wrote:
> >>>>
> >>>> Hi Chesnay,
> >>>>
> >>>> IIUC, our only disagreement is whether to support task info in the
> >>>> metric format. But we all agree with the default metric format
> >>>>
> "metrics.scope.jm-operator=<host>.jobmanager.<job_name>.<operator_name>"
> >>>> which doesn't include task info, right?
> >>>>
> >>>> Regarding "coordinator metrics don't *belong to* any tasks/vertexes",
> I
> >>>> mean the coordinator doesn't run on specific tasks, so it doesn't have
> >>>> runtime information of tasks. My concern about adding task info to the
> >>>> coordinator metric is that
> >>>> 1) if we register coordinator metrics for every subtask, the number of
> >>>> coordinator metrics will be exploded when task parallelism is large.
> This
> >>>> is an overhead to JM & metric systems, and the duplicated metrics are
> >>>> useless for users.
> >>>> 2) Correct me if I'm wrong, MetricGroup#counter only supports
> >>>> registering one metric. It seems it can't register a metric reporting
> for
> >>>> multiple metric names (e.g. multiple subtasks).
> >>>>
> >>>> Best,
> >>>> Jark
> >>>>
> >>>>
> >>>> On Mon, 30 Jan 2023 at 16:11, Chesnay Schepler <ch...@apache.org>
> >>>> wrote:
> >>>>
> >>>>> I hadn't looked in detail in to how the task info can be introduced,
> >>>>> but
> >>>>> given that the coordinates are created by the execution graph, where
> we
> >>>>> only work on tasks, it should be possible and rather simple.
> >>>>>
> >>>>> The OperatorCoordinatorHolder gets access to the ExecutionJobVertex
> >>>>> from
> >>>>> which it can extract everything and create the task/operator groups.
> >>>>> Exposure to the coordinator could happen via the coordinator context
> as
> >>>>> originally planned.
> >>>>>
> >>>>>   > As we know, coordinator metrics are just like JMJobMetricGroup,
> >>>>> which
> >>>>> don't belong to any tasks/vertexes.
> >>>>>
> >>>>> I don't think this is true, every coordinator is scoped to a specific
> >>>>> operator, and every operator is associated with a particular vertex.
> >>>>>
> >>>>>   From the javadocs:
> >>>>>
> >>>>> " A coordinator for runtime operators. The OperatorCoordinator runs
> on
> >>>>> the master, associated with
> >>>>>    the job vertex of the operator. It communicates with operators via
> >>>>> sending operator events."
> >>>>>
> >>>>> On 28/01/2023 09:15, Jark Wu wrote:
> >>>>>> Hi all,
> >>>>>>
> >>>>>> IIUC, Chesnay means we should have a more general metric group for
> >>>>>> **operator**
> >>>>>> not for **coordinator** in JM, this would be useful to extend
> >>>>>> other operator-specific
> >>>>>> metrics in the future. That means the new scope format should be
> >>>>> designed
> >>>>>> for
> >>>>>> the operator,
> >>>>>> e.g.,
> >>>>>
> metrics.scope.jm-operator=<host>.jobmanager.<job_name>.<operator_name>
> >>>>>> The coordinator metric is a subgroup (a constant "coordinator"
> >>>>> suffix) of
> >>>>>> the JMOperatorMG.
> >>>>>>
> >>>>>> I think this is a nice design. However, I have a question about
> >>>>> adding task
> >>>>>> id/name to this list.
> >>>>>> How to get the task id/name when reporting a coordinator metric? As
> we
> >>>>>> know, coordinator metrics
> >>>>>> are just like JMJobMetricGroup, which don't belong to any
> >>>>> tasks/vertexes.
> >>>>>> Do you mean reporting
> >>>>>> coordinator metrics for every task id under the operator?
> >>>>>>
> >>>>>> Best,
> >>>>>> Jark
> >>>>>>
> >>>>>>
> >>>>>> On Thu, 19 Jan 2023 at 17:33, Chesnay Schepler <ch...@apache.org>
> >>>>> wrote:
> >>>>>>>    > First, I do not understand why users have to configure the new
> >>>>> scope
> >>>>>>> format, which has a default value.
> >>>>>>>
> >>>>>>> If you don't use scope formats, sure. If you do use scope formats,
> >>>>> e.g.
> >>>>>>> to add a common prefix (which is the case for datadog users for
> >>>>>>> example), then the current default in the FLIP is insufficient and
> >>>>>>> requires the user to update the configuration.
> >>>>>>>
> >>>>>>>    > I usually do not need to change these configuration of scope
> >>>>> formats
> >>>>>>> when submitting the flink job.
> >>>>>>>
> >>>>>>> Scope formats as a whole are quite a power-user feature, but that
> >>>>>>> doesn't mean we should ignore it.
> >>>>>>>
> >>>>>>>    > I try to let it extend the ComponentMetricGroup
> >>>>>>>
> >>>>>>> This isn't inherently required just because it is a "component".
> >>>>>>> Component metric groups should _only_ be used for cases where such
> a
> >>>>>>> component provides several bits of metadata.
> >>>>>>> For example, tasks provide vertex ids, task names, attempted IDs
> >>>>> etc.,
> >>>>>>> and we'd like users to have the option on how this metadata ends up
> >>>>>>> being used (via scope formats). This currently can't be built with
> >>>>> the
> >>>>>>> addGroup() methods, hence why the component groups exist.
> >>>>>>> However, the operator coordinator _itself_does not provide several
> >>>>> bits
> >>>>>>> of metadata. Logically, the _operator_ does, not the coordinator.
> >>>>>>>
> >>>>>>>    > This make me decide to add a new scope format.
> >>>>>>>
> >>>>>>> And this is fine; just don't add a new scope format for the
> >>>>> coordinator
> >>>>>>> but logically for operators on the JM side, such that we can extend
> >>>>> the
> >>>>>>> set of operator-specific metrics in the future without having to
> add
> >>>>> yet
> >>>>>>> another scope format.
> >>>>>>>
> >>>>>>>    > The relationship between the Operator and OperatorCoordinator
> is
> >>>>>>> maintained in the OperatorCoordinatorHolder. In the POC, the
> operator
> >>>>>>> name/id could be found and the OperatorCoordinatorMetricGroup will
> be
> >>>>>>> created here.
> >>>>>>>
> >>>>>>> That's all irrelevant.
> >>>>>>>
> >>>>>>>    > The registered metrics of OperatorCoordinator always are those
> >>>>> which
> >>>>>>> can not be aggregated from the tasks, like the number of unassigned
> >>>>>>> splits in the SourceCoordinator. Actually we have not encountered
> the
> >>>>>>> scenario you mentioned. I think the OperatorCoordinatorMetricGroup
> >>>>>>> should only contain the metrics for itself instead of its subtasks.
> >>>>>>>
> >>>>>>> You are misunderstanding the problem.
> >>>>>>>
> >>>>>>> This isn't about aggregating metrics on our side or exposing
> metrics
> >>>>>>> from TMs on the JM side or anything like that.
> >>>>>>>
> >>>>>>> It's purely about the available metadata for operator metrics on
> the
> >>>>> JM
> >>>>>>> side. Currently you suggest operator name / id; what I'm proposing
> >>>>> is to
> >>>>>>> add task(==vertex!) id / name to this list.
> >>>>>>>
> >>>>>>> The benefit here is simple. I can look up all metrics where
> >>>>> task_id==XXX
> >>>>>>> and get _everything_ related to that vertex, including all metrics
> >>>>>>> associated with operators that are part of that vertex, including
> the
> >>>>>>> coordinator.
> >>>>>>>
> >>>>>>>    > Using metrics.scope.jm.job is not enough to distinguish the
> >>>>> different
> >>>>>>> coordinators.
> >>>>>>>
> >>>>>>> I did not suggest just using metrics.scope.jm.job. I suggested that
> >>>>> as
> >>>>>>> the default for the jm operator scope format, that is used for the
> >>>>>>> proposed JobManagerOperatorMetricGroup, which will have some plain
> >>>>>>> metric groups as children for coordinators.
> >>>>>>> Aka, you have the JMOperatorMG, then you call
> >>>>> addGroup("coordinators")
> >>>>>>> and then addGroup(coordinator_name) for each coordinator.
> >>>>>>>
> >>>>>>>    > So there are two choice for the
> >>>>>>> InternalOperatorCoordinatorMetricGroup: 1. add and improve the new
> >>>>> scope
> >>>>>>> format; 2. use the metrics.scope.jm.job and ProxyMetricGroup. Which
> >>>>> one
> >>>>>>> is better?
> >>>>>>>
> >>>>>>> There are more options than that.
> >>>>>>> I feel like there are a lot of misunderstandings in this
> discussion.
> >>>>>>> Please let me know if I could clear things up. If not I can also
> >>>>> provide
> >>>>>>> a PoC based on your PoC if that can speed things up. It's not that
> >>>>>>> different anyway.
> >>>>>>>
> >>>>>>> On 19/01/2023 07:39, Hang Ruan wrote:
> >>>>>>>> Hi, chesnay,
> >>>>>>>>
> >>>>>>>> Thanks for your reply. I still have some doubts about the
> questions
> >>>>>>>> you raised.
> >>>>>>>>
> >>>>>>>>       > Extending the set of ScopeFormats is problematic because
> it
> >>>>> in
> >>>>>>>>       practice
> >>>>>>>>       it breaks the config if users actively rely on it, since
> >>>>> there's now
> >>>>>>>>       another key that they _must_ set for it to be
> >>>>>>>>       consistent/compatible with
> >>>>>>>>       their existing setup.
> >>>>>>>>       Unfortunately due to how powerful scope formats are we can't
> >>>>> derive a
> >>>>>>>>       default value that matches their existing setup.
> >>>>>>>>       Hence we should try to do this as rarely as possible.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>       > This FLIP does not adhere to that since it proposes a
> >>>>> dedicated
> >>>>>>>>       format
> >>>>>>>>       for coordinators; next time we want to expose
> operator-specific
> >>>>>>>>       metrics
> >>>>>>>>       (e.g., in the scheduler) we'd have to add another one to
> >>>>> support it.
> >>>>>>>>
> >>>>>>>> First, I do not understand why users have to configure the new
> scope
> >>>>>>>> format, which has a default value. I usually do not need to change
> >>>>>>>> these configuration of scope formats when submitting the flink
> job.
> >>>>>>>> IMO, the OperatorCoordinator is a component running at the
> >>>>> JobMaster.
> >>>>>>>> I try to let it extend the ComponentMetricGroup, but I cannot find
> >>>>>>>> a suitable scope. This make me decide to add a new scope format.
> >>>>>>>>
> >>>>>>>>       > Additionally, the configurable variables (operator
> name/id)
> >>>>> are
> >>>>>>>>       logically not attached to the coordinator, but operators, so
> >>>>> to me it
> >>>>>>>>       just doesn't make sense to structure it like this.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> The relationship between the Operator and OperatorCoordinator is
> >>>>>>>> maintained in the OperatorCoordinatorHolder. In the POC, the
> >>>>> operator
> >>>>>>>> name/id could be found and the OperatorCoordinatorMetricGroup will
> >>>>> be
> >>>>>>>> created here.
> >>>>>>>>
> >>>>>>>>       > Another thing I'm concerned about is that, because we
> don't
> >>>>> include
> >>>>>>>>       tasks in the hierarchy, users wishing to collect all metrics
> >>>>> for a
> >>>>>>>>       particular task (in this case ==vertex) now have to go
> >>>>>>>>       significantly out
> >>>>>>>>       of their way to get them, since they can no longer just
> filter
> >>>>> by the
> >>>>>>>>       task ID but have to be filter for _all_ operators that are
> >>>>> part of
> >>>>>>>>       the task.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> The registered metrics of OperatorCoordinator always are those
> which
> >>>>>>>> can not be aggregated from the tasks, like the number of
> unassigned
> >>>>>>>> splits in the SourceCoordinator. Actually we have not encountered
> >>>>> the
> >>>>>>>> scenario you mentioned. I think the OperatorCoordinatorMetricGroup
> >>>>>>>> should only contain the metrics for itself instead of its
> subtasks.
> >>>>>>>>
> >>>>>>>> Using metrics.scope.jm.job is not enough to distinguish the
> >>>>> different
> >>>>>>>> coordinators. The operator id/name is necessary. Then
> >>>>>>>> the implementation should be like this. But users can not change
> the
> >>>>>>>> scope in this way. This is also acceptable.
> >>>>>>>> @Internal public class InternalOperatorCoordinatorMetricGroup
> >>>>>>>>            extends ProxyMetricGroup<MetricGroup>
> >>>>>>>>            implements OperatorCoordinatorMetricGroup {
> >>>>>>>>        public InternalOperatorCoordinatorMetricGroup(
> >>>>>>>>                JobManagerJobMetricGroup parent, OperatorID
> >>>>> operatorID,
> >>>>>>> String operatorName) {
> >>>>>>>>            super(parent.addGroup(operatorID +
> >>>>>>> operatorName).addGroup("coordinator")); }
> >>>>>>>> }
> >>>>>>>> So there are two choice for the
> >>>>>>>> InternalOperatorCoordinatorMetricGroup: 1. add and improve the new
> >>>>>>>> scope format; 2. use the metrics.scope.jm.job and
> ProxyMetricGroup.
> >>>>>>>> Which one is better?
> >>>>>>>>
> >>>>>>>> Best,
> >>>>>>>> Hang
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> Chesnay Schepler <ch...@apache.org> 于2023年1月18日周三 17:03写道：
> >>>>>>>>
> >>>>>>>>       You're misunderstanding the problem.
> >>>>>>>>
> >>>>>>>>       Metric groups form a tree with each group providing certain
> >>>>> metadata.
> >>>>>>>>       E.g., on the taskmanager we have a TM metric group that
> >>>>> provides info
> >>>>>>>>       about the TM, that has child task metric groups, that have
> >>>>> child
> >>>>>>>>       operator metric groups etc. The operator metric group again
> has
> >>>>>>>>       children, where we are now mostly in the user/connector
> land,
> >>>>> like a
> >>>>>>>>       kafka metric group providing partition info or something.
> >>>>>>>>
> >>>>>>>>       What is being proposed is to create a coordinator metric
> group
> >>>>> on
> >>>>>>>>       the JM
> >>>>>>>>       side that provides operator metadata.
> >>>>>>>>
> >>>>>>>>       A more appropriate structure is to create an operator group
> on
> >>>>> the JM
> >>>>>>>>       side that provides this info, with the coordinator metric
> group
> >>>>>>>>       being a
> >>>>>>>>       child.
> >>>>>>>>
> >>>>>>>>       On 17/01/2023 19:56, Steven Wu wrote:
> >>>>>>>>       >> Additionally, the configurable variables (operator
> name/id)
> >>>>> are
> >>>>>>>>       > logically not attached to the coordinator, but operators,
> so
> >>>>> to
> >>>>>>>>       me it
> >>>>>>>>       > just doesn't make sense to structure it like this.
> >>>>>>>>       >
> >>>>>>>>       > Chesnay, maybe we should clarify the terminology. To me,
> >>>>>>>>       pperators (like
> >>>>>>>>       > FLIP-27 source) can have two parts (coordinator and
> >>>>>>>>       reader/subtask). I
> >>>>>>>>       > think it is fine to include operator name/id for
> coordinator
> >>>>>>>>       metrics.
> >>>>>>>>       >
> >>>>>>>>       > On Mon, Jan 16, 2023 at 2:13 AM Chesnay Schepler
> >>>>>>>>       <ch...@apache.org> wrote:
> >>>>>>>>       >
> >>>>>>>>       >> Slight correction: Using metrics.scope.jm.job as the
> default
> >>>>>>>>       should be
> >>>>>>>>       >> safe.
> >>>>>>>>       >>
> >>>>>>>>       >> On 16/01/2023 10:18, Chesnay Schepler wrote:
> >>>>>>>>       >>> The proposed ScopeFormat is still problematic for a few
> >>>>> reasons.
> >>>>>>>>       >>>
> >>>>>>>>       >>> Extending the set of ScopeFormats is problematic because
> >>>>> it in
> >>>>>>>>       >>> practice it breaks the config if users actively rely on
> >>>>> it, since
> >>>>>>>>       >>> there's now another key that they _must_ set for it to
> be
> >>>>>>>>       >>> consistent/compatible with their existing setup.
> >>>>>>>>       >>> Unfortunately due to how powerful scope formats are we
> >>>>> can't
> >>>>>>>>       derive a
> >>>>>>>>       >>> default value that matches their existing setup.
> >>>>>>>>       >>> Hence we should try to do this as rarely as possible.
> >>>>>>>>       >>>
> >>>>>>>>       >>> This FLIP does not adhere to that since it proposes a
> >>>>>>>>       dedicated format
> >>>>>>>>       >>> for coordinators; next time we want to expose
> >>>>> operator-specific
> >>>>>>>>       >>> metrics (e.g., in the scheduler) we'd have to add
> another
> >>>>> one to
> >>>>>>>>       >>> support it.
> >>>>>>>>       >>>
> >>>>>>>>       >>> Additionally, the configurable variables (operator
> >>>>> name/id) are
> >>>>>>>>       >>> logically not attached to the coordinator, but
> operators,
> >>>>> so
> >>>>>>>>       to me it
> >>>>>>>>       >>> just doesn't make sense to structure it like this.
> >>>>>>>>       >>>
> >>>>>>>>       >>> Another thing I'm concerned about is that, because we
> don't
> >>>>>>>>       include
> >>>>>>>>       >>> tasks in the hierarchy, users wishing to collect all
> >>>>> metrics for
> >>>>>>> a
> >>>>>>>>       >>> particular task (in this case ==vertex) now have to go
> >>>>>>>>       significantly
> >>>>>>>>       >>> out of their way to get them, since they can no longer
> just
> >>>>>>>>       filter by
> >>>>>>>>       >>> the task ID but have to be filter for _all_ operators
> that
> >>>>> are
> >>>>>>>>       part of
> >>>>>>>>       >>> the task.
> >>>>>>>>       >>>
> >>>>>>>>       >>> On 16/01/2023 03:09, Hang Ruan wrote:
> >>>>>>>>       >>>> Hi, @chesnay@apache.org <ch...@apache.org> ,
> >>>>>>>>       >>>>
> >>>>>>>>       >>>> Do you have time to help to review this FLIP again
> after
> >>>>> the
> >>>>>>>>       >>>> modification?
> >>>>>>>>       >>>> Looking forward to your reply.
> >>>>>>>>       >>>> This FLIP will add a new configuration for the
> >>>>>>>>       >>>> OperatorCoordinatorMetricGroup scope format. It
> provides
> >>>>> an
> >>>>>>>>       internal
> >>>>>>>>       >>>> implementation and is added as a component to the
> >>>>>>>>       >>>> JobManagerJobMetricGroup.
> >>>>>>>>       >>>> If something doesn't make sense, could you provide some
> >>>>>>>>       advice? It
> >>>>>>>>       >>>> will be
> >>>>>>>>       >>>> very helpful. Thanks a lot for your help.
> >>>>>>>>       >>>>
> >>>>>>>>       >>>> Best,
> >>>>>>>>       >>>> Hang
> >>>>>>>>       >>>>
> >>>>>>>>       >>>> Martijn Visser <ma...@apache.org>
> 于2023年1月11日周三
> >>>>>>>>       16:34写道：
> >>>>>>>>       >>>>
> >>>>>>>>       >>>>
> >>>>>>>>       >>>>> Hi Hang,
> >>>>>>>>       >>>>>
> >>>>>>>>       >>>>> I'm a bit surprised that this has gone to a vote,
> given
> >>>>> that
> >>>>>>>>       Chesnay
> >>>>>>>>       >>>>> deliberately mentioned that he would vote against it
> >>>>> as-is.
> >>>>>>>>       I would
> >>>>>>>>       >>>>> expect
> >>>>>>>>       >>>>> that before going to a vote, he has had the
> opportunity
> >>>>> to
> >>>>>>>>       >>>>> participate in
> >>>>>>>>       >>>>> this discussion.
> >>>>>>>>       >>>>>
> >>>>>>>>       >>>>> Best regards,
> >>>>>>>>       >>>>>
> >>>>>>>>       >>>>> Martijn
> >>>>>>>>       >>>>>
> >>>>>>>>       >>>>> On Tue, Jan 3, 2023 at 12:53 PM Jark Wu <
> >>>>> imjark@gmail.com>
> >>>>>>>>       wrote:
> >>>>>>>>       >>>>>
> >>>>>>>>       >>>>>> Hi Dong,
> >>>>>>>>       >>>>>>
> >>>>>>>>       >>>>>> Regarding “SplitEnumeratorContext#metricGroup���, my
> only
> >>>>>>>>       concern is
> >>>>>>>>       >>>>>> that
> >>>>>>>>       >>>>>> this is a core interface for the FLIP. It’s hard to
> >>>>> tell how
> >>>>>>>>       >>>>>> sources use
> >>>>>>>>       >>>>>> metric group without mentioning this interface. Even
> if
> >>>>>>>>       this is an
> >>>>>>>>       >>>>> existing
> >>>>>>>>       >>>>>> API, I think it’s worth introducing the interface
> again
> >>>>> and
> >>>>>>>>       declaring
> >>>>>>>>       >>>>> that
> >>>>>>>>       >>>>>> we will implement the interface instead of a no-op
> >>>>> method
> >>>>>>>>       in this
> >>>>>>>>       >>>>>> FLIP.
> >>>>>>>>       >>>>>>
> >>>>>>>>       >>>>>> Anyway, this is a minor problem and shouldn’t block
> this
> >>>>>>>>       FLIP. I’m
> >>>>>>>>       >>>>>> +1 to
> >>>>>>>>       >>>>>> start a vote.
> >>>>>>>>       >>>>>>
> >>>>>>>>       >>>>>> Best,
> >>>>>>>>       >>>>>> Jark
> >>>>>>>>       >>>>>>
> >>>>>>>>       >>>>>>
> >>>>>>>>       >>>>>>> 2023年1月3日 10:03，Hang Ruan <ru...@gmail.com>
> 写道：
> >>>>>>>>       >>>>>>>
> >>>>>>>>       >>>>>>> Hi, Jark and Dong,
> >>>>>>>>       >>>>>>>
> >>>>>>>>       >>>>>>> Thanks for your comments. Sorry for my late reply.
> >>>>>>>>       >>>>>>>
> >>>>>>>>       >>>>>>> For suggestion 1, I plan to implement the
> >>>>>>>>       >>>>>>> SplitEnumeratorMetricGroup in
> >>>>>>>>       >>>>>>> another issue, and it is not contained in this
> FLIP. I
> >>>>>>>>       will add some
> >>>>>>>>       >>>>>>> description about this part.
> >>>>>>>>       >>>>>>> For suggestion 2, changes about
> >>>>>>>>       OperatorCoordinator#metricGroup has
> >>>>>>>>       >>>>>> already
> >>>>>>>>       >>>>>>> been documented in the proposed change section.
> >>>>>>>>       >>>>>>>
> >>>>>>>>       >>>>>>> Best,
> >>>>>>>>       >>>>>>> Hang
> >>>>>>>>       >>>>>>>
> >>>>>>>>       >>>>>>> Dong Lin <li...@gmail.com> 于2023年1月1日周日
> 09:45写道：
> >>>>>>>>       >>>>>>>
> >>>>>>>>       >>>>>>>> Let me chime-in and add comments regarding the
> public
> >>>>>>>>       interface
> >>>>>>>>       >>>>> section.
> >>>>>>>>       >>>>>>>> Please see my comments inline.
> >>>>>>>>       >>>>>>>>
> >>>>>>>>       >>>>>>>> On Thu, Dec 29, 2022 at 6:08 PM Jark Wu
> >>>>>>>>       <im...@gmail.com> wrote:
> >>>>>>>>       >>>>>>>>
> >>>>>>>>       >>>>>>>>> Hi Hang,
> >>>>>>>>       >>>>>>>>>
> >>>>>>>>       >>>>>>>>> Thanks for driving this discussion. I think this
> is a
> >>>>>>>>       very useful
> >>>>>>>>       >>>>>> feature
> >>>>>>>>       >>>>>>>>> for connectors.
> >>>>>>>>       >>>>>>>>>
> >>>>>>>>       >>>>>>>>> The FLIP looks quite good to me, and I just have
> two
> >>>>>>>>       suggestions.
> >>>>>>>>       >>>>>>>>>
> >>>>>>>>       >>>>>>>>> 1. In the "Public Interface" section, mention that
> >>>>> the
> >>>>>>>>       >>>>>>>>> implementation
> >>>>>>>>       >>>>>>>>> behavior of "SplitEnumeratorContext#metricGroup"
> is
> >>>>>>>>       changed from
> >>>>>>>>       >>>>>>>> returning
> >>>>>>>>       >>>>>>>>> null to returning a concrete
> >>>>> SplitEnumeratorMetricGroup
> >>>>>>>>       instance.
> >>>>>>>>       >>>>> Even
> >>>>>>>>       >>>>>>>>> though the API is already there, the behavior
> change
> >>>>> can
> >>>>>>>>       also be
> >>>>>>>>       >>>>>>>> considered
> >>>>>>>>       >>>>>>>>> a public change.
> >>>>>>>>       >>>>>>>>>
> >>>>>>>>       >>>>>>>> SplitEnumeratorContext#metricGroup is an interface
> and
> >>>>>>>>       this FLIP
> >>>>>>>>       >>>>>>>> does
> >>>>>>>>       >>>>>> not
> >>>>>>>>       >>>>>>>> seem to change its semantics/behavior. The FLIP
> does
> >>>>>>>>       change the
> >>>>>>>>       >>>>>>>> implementation/behavior of
> >>>>>>>>       SourceCoordinatorContext#metricGroup,
> >>>>>>>>       >>>>>>>> which
> >>>>>>>>       >>>>>> is
> >>>>>>>>       >>>>>>>> marked @Internal.
> >>>>>>>>       >>>>>>>>
> >>>>>>>>       >>>>>>>> Thus it might seem a bit weird to add in the public
> >>>>>>> interface
> >>>>>>>>       >>>>>>>> section
> >>>>>>>>       >>>>>>>> saying that we change the interface
> >>>>>>>>       >>>>>>>> SplitEnumeratorContext#metricGroup
> >>>>>>>>       >>>>>> from
> >>>>>>>>       >>>>>>>> returning null to non-null object.
> >>>>>>>>       >>>>>>>>
> >>>>>>>>       >>>>>>>> 2. Mention the newly added interface of
> >>>>>>>>       >>>>>> "OperatorCoordinator#metricGroup"
> >>>>>>>>       >>>>>>>>> in the "Proposed Changes" section or "Public
> >>>>> Interface"
> >>>>>>>>       section. As
> >>>>>>>>       >>>>> the
> >>>>>>>>       >>>>>>>>> FLIP said, OperatorCoordinator is widely used in
> many
> >>>>>>>>       connectors.
> >>>>>>>>       >>>>>> Though
> >>>>>>>>       >>>>>>>> it
> >>>>>>>>       >>>>>>>>> is still an @Internal API, I think it is worth
> >>>>>>>>       mentioning the
> >>>>>>>>       >>>>>>>>> change
> >>>>>>>>       >>>>> in
> >>>>>>>>       >>>>>>>> the
> >>>>>>>>       >>>>>>>>> FLIP.
> >>>>>>>>       >>>>>>>> Since OperatorCoordinator is an internal API, it
> seems
> >>>>>>>>       reasonable to
> >>>>>>>>       >>>>>>>> explain it in the proposed change section. The FLIP
> >>>>> seems
> >>>>>>>>       to have
> >>>>>>>>       >>>>>>>> documented this in point 5 of the proposed change
> >>>>> section.
> >>>>>>>>       >>>>>>>>
> >>>>>>>>       >>>>>>>> BTW, if we think there are @internal classes that
> are
> >>>>>>>>       important
> >>>>>>>>       >>>>>>>> enough
> >>>>>>>>       >>>>>> to
> >>>>>>>>       >>>>>>>> be added in the public interface section, it might
> be
> >>>>>>>>       useful to
> >>>>>>>>       >>>>>> explicitly
> >>>>>>>>       >>>>>>>> discuss this topic and document it in the "*What
> are
> >>>>> the
> >>>>>>>>       "public
> >>>>>>>>       >>>>>>>> interfaces" of the project*" in this
> >>>>>>>>       >>>>>>>> <
> >>>>>>>>       >>>>>>>>
> >>>>>>>>       >>
> >>>>>>>>
> >>>>>
> https://cwiki.apache.org/confluence/display/FLINK/Flink+Improvement+Proposals
> >>>>>>>>       >>>>>>>> wiki.
> >>>>>>>>       >>>>>>>>
> >>>>>>>>       >>>>>>>>
> >>>>>>>>       >>>>>>>>> Best,
> >>>>>>>>       >>>>>>>>> Jark
> >>>>>>>>       >>>>>>>>>
> >>>>>>>>       >>>>>>>>>
> >>>>>>>>       >>>>>>>>> On Mon, 26 Dec 2022 at 18:06, Hang Ruan
> >>>>>>>>       <ru...@gmail.com>
> >>>>>>>>       >>>>>> wrote:
> >>>>>>>>       >>>>>>>>>> Hi, thanks for the feedback, Zhu Zhu and
> Qingsheng.
> >>>>>>>>       >>>>>>>>>> After combining everyone's comments, the main
> >>>>> concerns and
> >>>>>>>>       >>>>>>>> corresponding
> >>>>>>>>       >>>>>>>>>> adjustments are as follows.
> >>>>>>>>       >>>>>>>>>>
> >>>>>>>>       >>>>>>>>>> Q1: Common metrics are not quite useful.
> >>>>>>>>       >>>>>>>>>> numEventsIn and numEventsOut counters will be
> >>>>> removed
> >>>>>>>>       from the
> >>>>>>>>       >>>>>>>>>> OperatorCoordinatorMetricGroup. These common
> >>>>> metrics do
> >>>>>>> not
> >>>>>>>>       >>>>>>>>>> provide
> >>>>>>>>       >>>>>>>>> enough
> >>>>>>>>       >>>>>>>>>> information for users. The users are more
> willing to
> >>>>>>>>       get the
> >>>>>>>>       >>>>>>>>>> number
> >>>>>>>>       >>>>> of
> >>>>>>>>       >>>>>>>>>> events of the specified type instead of the total
> >>>>>>>>       number. And this
> >>>>>>>>       >>>>>>>> metric
> >>>>>>>>       >>>>>>>>>> is calculated differently. The implementation
> could
> >>>>>>>>       register the
> >>>>>>>>       >>>>>> metric
> >>>>>>>>       >>>>>>>>> by
> >>>>>>>>       >>>>>>>>>> themselves.
> >>>>>>>>       >>>>>>>>>>
> >>>>>>>>       >>>>>>>>>> Q2: This FLIP is overly complicated.
> >>>>>>>>       >>>>>>>>>> This FLIP will become concise after these
> >>>>> modifications.
> >>>>>>>>       >>>>>>>>>> OperatorCoordinatorMetricGroup has already been
> >>>>>>>>       introduced into
> >>>>>>>>       >>>>> Flink
> >>>>>>>>       >>>>>>>> by
> >>>>>>>>       >>>>>>>>>> FLIP-179<
> >>>>>>>>       >>>>>>>>>>
> >>>>>>>>       >>>>>>>>>>
> >>>>>>>>       >>
> >>>>>>>>
> >>>>>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-179%3A+Expose+Standardized+Operator+Metrics
> >>>>>>>>       >>>>>>>>>>> .
> >>>>>>>>       >>>>>>>>>> And
> >>>>>>>>       >>>>>>>>>> this FLIP will not change it. This FLIP only
> >>>>> provides a
> >>>>>>>>       new metric
> >>>>>>>>       >>>>>>>> option
> >>>>>>>>       >>>>>>>>>> and a new metric group scope. The changes in
> >>>>> proposed
> >>>>>>>>       changes
> >>>>>>>>       >>>>> provide
> >>>>>>>>       >>>>>>>> the
> >>>>>>>>       >>>>>>>>>> details about the modifications for the internal
> >>>>>>>>       classes, which
> >>>>>>>>       >>>>> might
> >>>>>>>>       >>>>>>>>> make
> >>>>>>>>       >>>>>>>>>> it look complicated.
> >>>>>>>>       >>>>>>>>>>
> >>>>>>>>       >>>>>>>>>> Thanks for all the comments again. If there are
> no
> >>>>> further
> >>>>>>>>       >>>>>>>>>> comments,
> >>>>>>>>       >>>>>> we
> >>>>>>>>       >>>>>>>>>> plan to start the voting thread this week.
> >>>>>>>>       >>>>>>>>>>
> >>>>>>>>       >>>>>>>>>> Best,
> >>>>>>>>       >>>>>>>>>> Hang
> >>>>>>>>       >>>>>>>>>>
> >>>>>>>>       >>>>>>>>>> Qingsheng Ren <re...@gmail.com>
> 于2022年12月26日周一
> >>>>>>>>       16:48写道：
> >>>>>>>>       >>>>>>>>>>
> >>>>>>>>       >>>>>>>>>>
> >>>>>>>>       >>>>>>>>>>> Thanks for the FLIP, Hang!
> >>>>>>>>       >>>>>>>>>>>
> >>>>>>>>       >>>>>>>>>>> This FLIP overall looks good to me. Actually I
> >>>>> share
> >>>>>>>>       the same
> >>>>>>>>       >>>>> concern
> >>>>>>>>       >>>>>>>>>> with
> >>>>>>>>       >>>>>>>>>>> Zhu that numEventsIn and numEventsOut counters
> are
> >>>>> not
> >>>>>>>>       quite
> >>>>>>>>       >>>>>>>>>>> useful
> >>>>>>>>       >>>>>>>> to
> >>>>>>>>       >>>>>>>>>> end
> >>>>>>>>       >>>>>>>>>>> users. OperatorEvent is a quite low-level
> >>>>> abstraction,
> >>>>>>>>       which
> >>>>>>>>       >>>>> requires
> >>>>>>>>       >>>>>>>>>>> instantialization in order to be practical to
> >>>>> users and
> >>>>>>>>       >>>>>>>>>>> developers,
> >>>>>>>>       >>>>>>>> so
> >>>>>>>>       >>>>>>>>>>> maybe it's better to exclude them from the FLIP.
> >>>>>>>>       >>>>>>>>>>>
> >>>>>>>>       >>>>>>>>>>> Best,
> >>>>>>>>       >>>>>>>>>>> Qingsheng
> >>>>>>>>       >>>>>>>>>>> Ververica (Alibaba)
> >>>>>>>>       >>>>>>>>>>>
> >>>>>>>>       >>>>>>>>>>> On Mon, Dec 26, 2022 at 12:08 PM Zhu Zhu
> >>>>>>>>       <re...@gmail.com>
> >>>>>>>>       >>>>> wrote:
> >>>>>>>>       >>>>>>>>>>>> Hi Hang,
> >>>>>>>>       >>>>>>>>>>>> I still see no strong reason why we need
> >>>>>>>>       >>>>>>>>>>>> numEventsIn/numEventsOut
> >>>>>>>>       >>>>>>>>>>> metrics.
> >>>>>>>>       >>>>>>>>>>>> In the discussion in FLINK-29801, I can see the
> >>>>> same
> >>>>>>>>       concern
> >>>>>>>>       >>>>>>>>>>>> from
> >>>>>>>>       >>>>>>>>>> others.
> >>>>>>>>       >>>>>>>>>>>> So I prefer to exclude them from this FLIP to
> >>>>> avoid
> >>>>>>>>       >>>>>>>>>>>> over-extending
> >>>>>>>>       >>>>>>>>> the
> >>>>>>>>       >>>>>>>>>>>> scope.
> >>>>>>>>       >>>>>>>>>>>>
> >>>>>>>>       >>>>>>>>>>>> Thanks,
> >>>>>>>>       >>>>>>>>>>>> Zhu
> >>>>>>>>       >>>>>>>>>>>>
> >>>>>>>>       >>>>>>>>>>>> Hang Ruan <ru...@gmail.com>
> 于2022年12月23日周五
> >>>>>>>>       >>>>>>>>>>>> 15:21写道：
> >>>>>>>>       >>>>>>>>>>>>> Hi everyone,
> >>>>>>>>       >>>>>>>>>>>>>
> >>>>>>>>       >>>>>>>>>>>>> As for the Zhu Zhu's problem, I think we
> should
> >>>>> keep
> >>>>>>>>       the common
> >>>>>>>>       >>>>>>>>>>> metrics,
> >>>>>>>>       >>>>>>>>>>>> which will help to observe incoming and
> outgoing
> >>>>>>>>       events. What do
> >>>>>>>>       >>>>>>>> you
> >>>>>>>>       >>>>>>>>>>> think,
> >>>>>>>>       >>>>>>>>>>>> @Zhu Zhu ?
> >>>>>>>>       >>>>>>>>>>>>> And @Chesnay, are there any other issues you
> are
> >>>>>>>>       more concerned
> >>>>>>>>       >>>>>>>>>> about?
> >>>>>>>>       >>>>>>>>>>>> Looking forward to your reply.
> >>>>>>>>       >>>>>>>>>>>>> Thanks for all the comments. If there are no
> >>>>> further
> >>>>>>>>       >>>>>>>>>>>>> comments, we
> >>>>>>>>       >>>>>>>>>> plan
> >>>>>>>>       >>>>>>>>>>>> to start the voting thread next week.
> >>>>>>>>       >>>>>>>>>>>>> Best,
> >>>>>>>>       >>>>>>>>>>>>> Hang
> >>>>>>>>       >>>>>>>>>>>>>
> >>>>>>>>       >>>>>>>>>>>>> Hang Ruan <ru...@gmail.com>
> >>>>> 于2022年12月15日周四
> >>>>>>>>       >>>>>>>>>>>>> 16:49写道：
> >>>>>>>>       >>>>>>>>>>>>>> Hi, Zhu Zhu,
> >>>>>>>>       >>>>>>>>>>>>>>
> >>>>>>>>       >>>>>>>>>>>>>> Thanks for your feedback!
> >>>>>>>>       >>>>>>>>>>>>>>
> >>>>>>>>       >>>>>>>>>>>>>> The OperatorCoordinator implementations are
> >>>>>>>>       different. And
> >>>>>>>>       >>>>>>>>>>>>>> their
> >>>>>>>>       >>>>>>>>>>>> metrics are much different too. We try to find
> the
> >>>>>>> common
> >>>>>>>>       >>>>>>>>>>>> metrics
> >>>>>>>>       >>>>>>>> and
> >>>>>>>>       >>>>>>>>>> put
> >>>>>>>>       >>>>>>>>>>>> them in the OperatorCoordinatorMetricGroup. If
> >>>>> most
> >>>>>>>>       developers
> >>>>>>>>       >>>>>>>> think
> >>>>>>>>       >>>>>>>>> we
> >>>>>>>>       >>>>>>>>>>> do
> >>>>>>>>       >>>>>>>>>>>> not need these common metrics, removing the
> common
> >>>>>>>>       metrics is
> >>>>>>>>       >>>>>>>>>>>> also
> >>>>>>>>       >>>>>>>>>>>> acceptable.
> >>>>>>>>       >>>>>>>>>>>>>> Best,
> >>>>>>>>       >>>>>>>>>>>>>> Hang
> >>>>>>>>       >>>>>>>>>>>>>>
> >>>>>>>>       >>>>>>>>>>>>>> Zhu Zhu <re...@gmail.com> 于2022年12月14日周三
> >>>>>>>>       22:09写道：
> >>>>>>>>       >>>>>>>>>>>>>>> Hi Hang & MengYue,
> >>>>>>>>       >>>>>>>>>>>>>>>
> >>>>>>>>       >>>>>>>>>>>>>>> Thanks for creating this FLIP!
> >>>>>>>>       >>>>>>>>>>>>>>>
> >>>>>>>>       >>>>>>>>>>>>>>> I think it is very useful, mainly in two
> >>>>> aspects:
> >>>>>>>>       >>>>>>>>>>>>>>> 1. Enables OperatorCoordinators to register
> >>>>> metrics.
> >>>>>>>>       >>>>>>>>>>>>>>> Currently
> >>>>>>>>       >>>>>>>>>>>>>>> the coordinators has no way to do this. And
> >>>>> operator
> >>>>>>>>       >>>>>>>> coordinator
> >>>>>>>>       >>>>>>>>>>>>>>> metric group further enables the
> >>>>> SplitEnumerator
> >>>>>>>>       to have
> >>>>>>>>       >>>>>>>>>>>>>>> access
> >>>>>>>>       >>>>>>>>>>>>>>> to a registered metric group (via the
> existing
> >>>>> public
> >>>>>>>>       >>>>>>>>>>>>>>> interface
> >>>>>>>>       >>>>>>>>>>>>>>> SplitEnumeratorContext#metricGroup()),
> which is
> >>>>>>>>       null at the
> >>>>>>>>       >>>>>>>>> moment.
> >>>>>>>>       >>>>>>>>>>>>>>> 2. Defines the scope of operator coordinator
> >>>>>>>>       metrics. A clear
> >>>>>>>>       >>>>>>>>>>>> definition
> >>>>>>>>       >>>>>>>>>>>>>>> makes it easy for users to find their wanted
> >>>>>>>>       metrics. The
> >>>>>>>>       >>>>>>>>>> definition
> >>>>>>>>       >>>>>>>>>>>>>>> also helps to avoid conflicts of metrics
> from
> >>>>>>> multiple
> >>>>>>>>       >>>>>>>>>>>> OperatorCoordinators
> >>>>>>>>       >>>>>>>>>>>>>>> of the same kind. E.g. each
> SourceCoordinator
> >>>>> may
> >>>>>>>>       have its
> >>>>>>>>       >>>>>>>>>>>>>>> own
> >>>>>>>>       >>>>>>>>>>>>>>> numSourceSplits metric, these metrics should
> >>>>> not
> >>>>>>>>       be directly
> >>>>>>>>       >>>>>>>>>>> registered
> >>>>>>>>       >>>>>>>>>>>>>>> to the job metric group.
> >>>>>>>>       >>>>>>>>>>>>>>>
> >>>>>>>>       >>>>>>>>>>>>>>> What I'm a bit concerned is the necessity of
> >>>>> the
> >>>>>>>>       introduced
> >>>>>>>>       >>>>>>>>> common
> >>>>>>>>       >>>>>>>>>>>> metrics
> >>>>>>>>       >>>>>>>>>>>>>>> numEventsInCounter & numEventsOutCounter. If
> >>>>> there
> >>>>>>>>       any case
> >>>>>>>>       >>>>>>>> which
> >>>>>>>>       >>>>>>>>>>>> strongly
> >>>>>>>>       >>>>>>>>>>>>>>> requires them?
> >>>>>>>>       >>>>>>>>>>>>>>>
> >>>>>>>>       >>>>>>>>>>>>>>> Regarding the concerns of Chesnay,
> >>>>>>>>       >>>>>>>>>>>>>>>> A dedicated coordinator MG implementation
> is
> >>>>>>> overkill
> >>>>>>>>       >>>>>>>>>>>>>>> Directly using the job metric group can
> result
> >>>>> in
> >>>>>>>>       metric
> >>>>>>>>       >>>>>>>>> conflicts,
> >>>>>>>>       >>>>>>>>>>> as
> >>>>>>>>       >>>>>>>>>>>> mentioned
> >>>>>>>>       >>>>>>>>>>>>>>> in above #2.
> >>>>>>>>       >>>>>>>>>>>>>>>
> >>>>>>>>       >>>>>>>>>>>>>>> Thanks,
> >>>>>>>>       >>>>>>>>>>>>>>> Zhu
> >>>>>>>>       >>>>>>>>>>>>>>>
> >>>>>>>>       >>>>>>>>>>>>>>> Dong Lin <li...@gmail.com>
> 于2022年12月10日周六
> >>>>>>>>       >>>>>>>>>>>>>>> 14:16写道：
> >>>>>>>>       >>>>>>>>>>>>>>>
> >>>>>>>>       >>>>>>>>>>>>>>>> Hi Chesney,
> >>>>>>>>       >>>>>>>>>>>>>>>>
> >>>>>>>>       >>>>>>>>>>>>>>>> Just to double check with you,
> >>>>>>>>       >>>>>>>>>>>>>>>> OperatorCoordinatorMetricGroup
> >>>>>>>>       >>>>>>>>>>>> (annotated as
> >>>>>>>>       >>>>>>>>>>>>>>>> @PublicEvolving) has already been
> introduced
> >>>>> into
> >>>>>>>>       Flink by
> >>>>>>>>       >>>>>>>>>> FLIP-179
> >>>>>>>>       >>>>>>>>>>>>>>>> <
> >>>>>>>>       >>
> >>>>>>>>
> >>>>>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-179%3A+Expose+Standardized+Operator+Metrics
> >>>>>>>>       >>>>>>>>>>>>> .
> >>>>>>>>       >>>>>>>>>>>>>>>> And that FLIP has got you +1.. Do you mean
> we
> >>>>>>>>       should remove
> >>>>>>>>       >>>>>>>>> this
> >>>>>>>>       >>>>>>>>>>>>>>>> OperatorCoordinatorMetricGroup?
> >>>>>>>>       >>>>>>>>>>>>>>>>
> >>>>>>>>       >>>>>>>>>>>>>>>> Regards,
> >>>>>>>>       >>>>>>>>>>>>>>>> Dong
> >>>>>>>>       >>>>>>>>>>>>>>>>
> >>>>>>>>       >>>>>>>>>>>>>>>> On Sat, Dec 10, 2022 at 1:33 AM Chesnay
> >>>>> Schepler <
> >>>>>>>>       >>>>>>>>>>> chesnay@apache.org>
> >>>>>>>>       >>>>>>>>>>>> wrote:
> >>>>>>>>       >>>>>>>>>>>>>>>>> As a whole I feel like this FLIP is overly
> >>>>>>>>       complicated. A
> >>>>>>>>       >>>>>>>>>>> dedicated
> >>>>>>>>       >>>>>>>>>>>>>>>>> coordinator MG implementation is
> overkill; it
> >>>>>>>>       could just
> >>>>>>>>       >>>>>>>>> re-use
> >>>>>>>>       >>>>>>>>>>> the
> >>>>>>>>       >>>>>>>>>>>>>>>>> existing Task/OperatorMGs to create the
> same
> >>>>>>>>       structure we
> >>>>>>>>       >>>>>>>>> have
> >>>>>>>>       >>>>>>>>>> on
> >>>>>>>>       >>>>>>>>>>>> TMs,
> >>>>>>>>       >>>>>>>>>>>>>>>>> similar to what we did with the Job MG.
> >>>>>>>>       >>>>>>>>>>>>>>>>>
> >>>>>>>>       >>>>>>>>>>>>>>>>> However, I'm not convinced that this is
> >>>>> required
> >>>>>>>>       anyway,
> >>>>>>>>       >>>>>>>>>> because
> >>>>>>>>       >>>>>>>>>>>> all the
> >>>>>>>>       >>>>>>>>>>>>>>>>> example metrics you listed can be
> >>>>> implemented on
> >>>>>>>>       the TM
> >>>>>>>>       >>>>>>>> side
> >>>>>>>>       >>>>>>>>> +
> >>>>>>>>       >>>>>>>>>>>>>>>>> aggregating them in the external metrics
> >>>>> backend.
> >>>>>>>>       >>>>>>>>>>>>>>>>>
> >>>>>>>>       >>>>>>>>>>>>>>>>> Since I'm on holidays soon, just so no one
> >>>>> tries
> >>>>>>>>       to pull a
> >>>>>>>>       >>>>>>>>> fast
> >>>>>>>>       >>>>>>>>>>>> one on
> >>>>>>>>       >>>>>>>>>>>>>>>>> me, if this were to go to a vote as-is
> I'd be
> >>>>>>>>       against it.
> >>>>>>>>       >>>>>>>>>>>>>>>>>
> >>>>>>>>       >>>>>>>>>>>>>>>>>
> >>>>>>>>       >>>>>>>>>>>>>>>>> On 09/12/2022 15:30, Dong Lin wrote:
> >>>>>>>>       >>>>>>>>>>>>>>>>>> Hi Hang,
> >>>>>>>>       >>>>>>>>>>>>>>>>>>
> >>>>>>>>       >>>>>>>>>>>>>>>>>> Thanks for the FLIP! The FLIP looks good
> >>>>> and it
> >>>>>>>>       is pretty
> >>>>>>>>       >>>>>>>>>>>> informative.
> >>>>>>>>       >>>>>>>>>>>>>>>>>> I have just two minor comments regarding
> >>>>> names:
> >>>>>>>>       >>>>>>>>>>>>>>>>>> - Would it be useful to rename the config
> >>>>> key as
> >>>>>>>>       >>>>>>>>>>>>>>>>>>
> *metrics.scope.jm.job.operator-coordinator*
> >>>>> for
> >>>>>>>>       >>>>>>>> consistency
> >>>>>>>>       >>>>>>>>>>> with
> >>>>>>>>       >>>>>>>>>>>>>>>>>> *metrics.scope.jm.job
> >>>>>>>>       >>>>>>>>>>>>>>>>>> *(which is not named as *jm-job)?
> >>>>>>>>       >>>>>>>>>>>>>>>>>> - Maybe rename the variable as
> >>>>>>>>       >>>>>>>>>>> SCOPE_NAMING_OPERATOR_COORDINATOR
> >>>>>>>>       >>>>>>>>>>>> for
> >>>>>>>>       >>>>>>>>>>>>>>>>>> simplicity and consistency with
> >>>>>>>>       SCOPE_NAMING_OPERATOR
> >>>>>>>>       >>>>>>>>> (which
> >>>>>>>>       >>>>>>>>>> is
> >>>>>>>>       >>>>>>>>>>>> not named
> >>>>>>>>       >>>>>>>>>>>>>>>>>> as SCOPE_NAMING_TM_JOB_OPERATOR)?
> >>>>>>>>       >>>>>>>>>>>>>>>>>>
> >>>>>>>>       >>>>>>>>>>>>>>>>>> Cheers,
> >>>>>>>>       >>>>>>>>>>>>>>>>>> Dong
> >>>>>>>>       >>>>>>>>>>>>>>>>>>
> >>>>>>>>       >>>>>>>>>>>>>>>>>>
> >>>>>>>>       >>>>>>>>>>>>>>>>>>
> >>>>>>>>       >>>>>>>>>>>>>>>>>> On Thu, Dec 8, 2022 at 3:28 PM Hang Ruan
> <
> >>>>>>>>       >>>>>>>>>>> ruanhang1993@gmail.com>
> >>>>>>>>       >>>>>>>>>>>> wrote:
> >>>>>>>>       >>>>>>>>>>>>>>>>>>> Hi all,
> >>>>>>>>       >>>>>>>>>>>>>>>>>>>
> >>>>>>>>       >>>>>>>>>>>>>>>>>>> MengYue and I created FLIP-274[1]
> Introduce
> >>>>>>>>       metric group
> >>>>>>>>       >>>>>>>>> for
> >>>>>>>>       >>>>>>>>>>>>>>>>>>> OperatorCoordinator. OperatorCoordinator
> >>>>> is the
> >>>>>>>>       >>>>>>>>> coordinator
> >>>>>>>>       >>>>>>>>>>> for
> >>>>>>>>       >>>>>>>>>>>> runtime
> >>>>>>>>       >>>>>>>>>>>>>>>>>>> operators and running on Job Manager.
> The
> >>>>>>>>       coordination
> >>>>>>>>       >>>>>>>>>>>> mechanism is
> >>>>>>>>       >>>>>>>>>>>>>>>>>>> operator events between
> OperatorCoordinator
> >>>>>>>>       and its all
> >>>>>>>>       >>>>>>>>>>>> operators, the
> >>>>>>>>       >>>>>>>>>>>>>>>>>>> coordination is more and more using in
> >>>>> Flink, for
> >>>>>>>>       >>>>>>>> example
> >>>>>>>>       >>>>>>>>>> many
> >>>>>>>>       >>>>>>>>>>>> Sources
> >>>>>>>>       >>>>>>>>>>>>>>>>> and
> >>>>>>>>       >>>>>>>>>>>>>>>>>>> Sinks depend on the mechanism to assign
> >>>>> splits
> >>>>>>> and
> >>>>>>>>       >>>>>>>>>> coordinate
> >>>>>>>>       >>>>>>>>>>>> commits to
> >>>>>>>>       >>>>>>>>>>>>>>>>>>> external systems. The
> OperatorCoordinator
> >>>>> is
> >>>>>>>>       widely
> >>>>>>>>       >>>>>>>> using
> >>>>>>>>       >>>>>>>>> in
> >>>>>>>>       >>>>>>>>>>>> flink kafka
> >>>>>>>>       >>>>>>>>>>>>>>>>>>> connector, flink pulsar connector, flink
> >>>>> cdc
> >>>>>>>>       connector,
> >>>>>>>>       >>>>>>>>>> flink
> >>>>>>>>       >>>>>>>>>>>> hudi
> >>>>>>>>       >>>>>>>>>>>>>>>>>>> connector and so on.
> >>>>>>>>       >>>>>>>>>>>>>>>>>>>
> >>>>>>>>       >>>>>>>>>>>>>>>>>>> But there is not a suitable metric group
> >>>>> scope
> >>>>>>>>       for the
> >>>>>>>>       >>>>>>>>>>>>>>>>> OperatorCoordinator
> >>>>>>>>       >>>>>>>>>>>>>>>>>>> and not an implementation for the
> interface
> >>>>>>>>       >>>>>>>>>>>>>>>>> OperatorCoordinatorMetricGroup.
> >>>>>>>>       >>>>>>>>>>>>>>>>>>> These metrics in OperatorCoordinator
> could
> >>>>> be
> >>>>>>>>       how many
> >>>>>>>>       >>>>>>>>>>>> splits/partitions
> >>>>>>>>       >>>>>>>>>>>>>>>>>>> have been assigned to source readers,
> how
> >>>>> many
> >>>>>>>>       files
> >>>>>>>>       >>>>>>>> have
> >>>>>>>>       >>>>>>>>>> been
> >>>>>>>>       >>>>>>>>>>>> written
> >>>>>>>>       >>>>>>>>>>>>>>>>> out
> >>>>>>>>       >>>>>>>>>>>>>>>>>>> by sink writers, these metrics not only
> >>>>> help
> >>>>>>>>       users to
> >>>>>>>>       >>>>>>>> know
> >>>>>>>>       >>>>>>>>>> the
> >>>>>>>>       >>>>>>>>>>>> job
> >>>>>>>>       >>>>>>>>>>>>>>>>> progress
> >>>>>>>>       >>>>>>>>>>>>>>>>>>> but also make big job maintaining
> easier.
> >>>>> Thus we
> >>>>>>>>       >>>>>>>> propose
> >>>>>>>>       >>>>>>>>>> the
> >>>>>>>>       >>>>>>>>>>>> FLIP-274
> >>>>>>>>       >>>>>>>>>>>>>>>>> to
> >>>>>>>>       >>>>>>>>>>>>>>>>>>> introduce a new metric group scope for
> >>>>>>>>       >>>>>>>> OperatorCoordinator
> >>>>>>>>       >>>>>>>>>> and
> >>>>>>>>       >>>>>>>>>>>> provide
> >>>>>>>>       >>>>>>>>>>>>>>>>> an
> >>>>>>>>       >>>>>>>>>>>>>>>>>>> internal implementation for
> >>>>>>>>       >>>>>>>>> OperatorCoordinatorMetricGroup.
> >>>>>>>>       >>>>>>>>>>>>>>>>>>> Could you help review this FLIP when you
> >>>>> get
> >>>>>>>>       time? Any
> >>>>>>>>       >>>>>>>>>>> feedback
> >>>>>>>>       >>>>>>>>>>>> is
> >>>>>>>>       >>>>>>>>>>>>>>>>>>> appreciated!
> >>>>>>>>       >>>>>>>>>>>>>>>>>>>
> >>>>>>>>       >>>>>>>>>>>>>>>>>>> Best,
> >>>>>>>>       >>>>>>>>>>>>>>>>>>> Hang
> >>>>>>>>       >>>>>>>>>>>>>>>>>>>
> >>>>>>>>       >>>>>>>>>>>>>>>>>>> [1]
> >>>>>>>>       >>>>>>>>>>>>>>>>>>>
> >>>>>>>>       >>>>>>>>>>>>>>>>>>>
> >>>>>>>>       >>
> >>>>>>>>
> >>>>>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-274%3A+Introduce+metric+group+for+OperatorCoordinator
> >>>>>>>>       >>
> >>>>>>>>
> >>>>>
>
>

Re: [DISCUSS] FLIP-274 : Introduce metric group for OperatorCoordinator

Posted by Chesnay Schepler <ch...@apache.org>.

Looks good from my side.

On 01/02/2023 13:27, Hang Ruan wrote:
> Hi all,
>
> I make a mistake about the default value of the  scope format
> JobManagerOperatorScopeFormat. It should be
> <host>.jobmanager.<job_name>.<operator_name>.
> I have fixed it in the FLIP.
> And the variables (task id and task name) have been added to the list.
>
> Best,
> Hang
>
>
> Hang Ruan <ru...@gmail.com> 于2023年2月1日周三 14:11写道：
>
>> Hi all,
>>
>>> IIUC, there is only an OperatorCoordinatorMetricGroup interface without
>> implementations (it never worked). There were no coordinator metrics
>> implemented and registered. @Hang Ruan please correct me if I'm wrong.
>> Yes,  we shouldn't have to worry about backwards-compatibility as Chesnay
>> said.
>>
>> After the discussion, I think the FLIP should be modified as follow.
>> 1.  Add new metric group JobManagerOperatorMetricGroup, whose scope
>> format is JobManagerOperatorScopeFormat. The config key of JobManagerOperatorScopeFormat
>> is metrics.scope.jm-operator, and the default value is
>> <host>.jobmanager.<job_name>.<task_name>.<operator_name>.
>> 2. JobManagerOperatorMetricGroup will be sub components of
>> JobManagerJobMetricGroup.
>> 3. InternalOperatorCoordinatorMetricGroup will extend ProxyMetricGroup,
>> which registered under JobManagerOperatorMetricGroup by
>> addGroup("coordinator").
>>
>> I have updated the FLIP-274. @Chesnay Schepler <ch...@apache.org> @Jark
>> Wu <im...@gmail.com> , please help to take a look at the modified FLIP
>> and make sure that I do not miss  something.
>>
>> If there are no more problems, we plan to start the voting thread later.
>>
>> Best,
>> Hang
>>
>> Chesnay Schepler <ch...@apache.org> 于2023年1月31日周二 20:09写道：
>>
>>>> IIUC, there is only an OperatorCoordinatorMetricGroup interface without
>>> implementations (it never worked). There were no coordinator metrics
>>> implemented and registered. @Hang Ruan <ru...@gmail.com> please
>>> correct me if I'm wrong.
>>>
>>> Indeed. There are interfaces, and theoretically a metric group is exposed
>>> to user-code via the SplitEnumeratorContext, but the metric group is always
>>> null in the SourceCoordinatorContext. So we shouldn't have to worry about
>>> backwards-compatibility.
>>>
>>>> However, the config key "metrics.scope.jm.job" is deprecated and is
>>> replaced by "metrics.scope.jm-job". Similarly, "metrics.scope.tm.job" is
>>> replaced with "metrics.scope.tm-job". That's why I though
>>> "metrics.scope.jm-operator" might be more consistent.
>>>
>>> HA! Good catch. I only looked at the 1.16 docs where they weren't
>>> deprecated yet.
>>>
>>> Then it should be "jm-operator" indeed.
>>>
>>> On 31/01/2023 10:51, Jark Wu wrote:
>>>
>>> Hi Chesnay,
>>>
>>>> The new option must default to the configured value of
>>> metrics.scope.jm.job because otherwise it isn't backwards-compatible with
>>> existing coordinator metrics.
>>> IIUC, there is only an OperatorCoordinatorMetricGroup interface without
>>> implementations (it never worked). There were no coordinator metrics
>>> implemented and registered. @Hang Ruan <ru...@gmail.com> please
>>> correct me if I'm wrong.
>>>
>>>> Intuitively I think the config key should be
>>> metrics.scope.jm.operator., similar to metrics.scope.jm.job.
>>> However, the config key "metrics.scope.jm.job" is deprecated and is
>>> replaced by "metrics.scope.jm-job". Similarly, "metrics.scope.tm.job" is
>>> replaced with "metrics.scope.tm-job". That's why I though
>>> "metrics.scope.jm-operator" might be more consistent.
>>>
>>>> We would only support operator_name, operator_id, task_name and task_id
>>> (+ everything else from the jm/job groups of course).
>>> +1
>>>
>>> Best,
>>> Jark
>>>
>>>
>>>
>>> On Mon, 30 Jan 2023 at 20:09, Chesnay Schepler <ch...@apache.org>
>>> wrote:
>>>
>>>> The new option must default to the configured value of
>>>> metrics.scope.jm.job because otherwise it isn't backwards-compatible with
>>>> existing coordinator metrics.
>>>> According to some earlier response in this thread there are already
>>>> coordinator metrics.
>>>>
>>>> If that weren't the case you'd be correct, and we'd keep the default
>>>> consistent with the default for tm operator metrics.
>>>>
>>>> Intuitively I think the config key should be metrics.scope.jm.operator.,
>>>> similar to metrics.scope.jm.job.
>>>> For symmetry we should also add new metrics.scope.tm.operator/task and
>>>> eventually phase out metrics.scope.operator.
>>>>
>>>>> Regarding "coordinator metrics don't *belong to* any tasks/vertexes",
>>>> I mean the coordinator doesn't run on specific tasks, so it doesn't have
>>>> runtime information of tasks.
>>>>
>>>> That is true. I intentionally only referred to task id/name, as things
>>>> like subtask index or attempt ids don't make sense for coordinators as you
>>>> said.
>>>> We would only support operator_name, operator_id, task_name and task_id
>>>> (+ everything else from the jm/job groups of course).
>>>> (Maybe in the future attempt_num could be interesting though)
>>>>
>>>> On 30/01/2023 10:39, Jark Wu wrote:
>>>>
>>>> Hi Chesnay,
>>>>
>>>> IIUC, our only disagreement is whether to support task info in the
>>>> metric format. But we all agree with the default metric format
>>>> "metrics.scope.jm-operator=<host>.jobmanager.<job_name>.<operator_name>"
>>>> which doesn't include task info, right?
>>>>
>>>> Regarding "coordinator metrics don't *belong to* any tasks/vertexes", I
>>>> mean the coordinator doesn't run on specific tasks, so it doesn't have
>>>> runtime information of tasks. My concern about adding task info to the
>>>> coordinator metric is that
>>>> 1) if we register coordinator metrics for every subtask, the number of
>>>> coordinator metrics will be exploded when task parallelism is large. This
>>>> is an overhead to JM & metric systems, and the duplicated metrics are
>>>> useless for users.
>>>> 2) Correct me if I'm wrong, MetricGroup#counter only supports
>>>> registering one metric. It seems it can't register a metric reporting for
>>>> multiple metric names (e.g. multiple subtasks).
>>>>
>>>> Best,
>>>> Jark
>>>>
>>>>
>>>> On Mon, 30 Jan 2023 at 16:11, Chesnay Schepler <ch...@apache.org>
>>>> wrote:
>>>>
>>>>> I hadn't looked in detail in to how the task info can be introduced,
>>>>> but
>>>>> given that the coordinates are created by the execution graph, where we
>>>>> only work on tasks, it should be possible and rather simple.
>>>>>
>>>>> The OperatorCoordinatorHolder gets access to the ExecutionJobVertex
>>>>> from
>>>>> which it can extract everything and create the task/operator groups.
>>>>> Exposure to the coordinator could happen via the coordinator context as
>>>>> originally planned.
>>>>>
>>>>>   > As we know, coordinator metrics are just like JMJobMetricGroup,
>>>>> which
>>>>> don't belong to any tasks/vertexes.
>>>>>
>>>>> I don't think this is true, every coordinator is scoped to a specific
>>>>> operator, and every operator is associated with a particular vertex.
>>>>>
>>>>>   From the javadocs:
>>>>>
>>>>> " A coordinator for runtime operators. The OperatorCoordinator runs on
>>>>> the master, associated with
>>>>>    the job vertex of the operator. It communicates with operators via
>>>>> sending operator events."
>>>>>
>>>>> On 28/01/2023 09:15, Jark Wu wrote:
>>>>>> Hi all,
>>>>>>
>>>>>> IIUC, Chesnay means we should have a more general metric group for
>>>>>> **operator**
>>>>>> not for **coordinator** in JM, this would be useful to extend
>>>>>> other operator-specific
>>>>>> metrics in the future. That means the new scope format should be
>>>>> designed
>>>>>> for
>>>>>> the operator,
>>>>>> e.g.,
>>>>> metrics.scope.jm-operator=<host>.jobmanager.<job_name>.<operator_name>
>>>>>> The coordinator metric is a subgroup (a constant "coordinator"
>>>>> suffix) of
>>>>>> the JMOperatorMG.
>>>>>>
>>>>>> I think this is a nice design. However, I have a question about
>>>>> adding task
>>>>>> id/name to this list.
>>>>>> How to get the task id/name when reporting a coordinator metric? As we
>>>>>> know, coordinator metrics
>>>>>> are just like JMJobMetricGroup, which don't belong to any
>>>>> tasks/vertexes.
>>>>>> Do you mean reporting
>>>>>> coordinator metrics for every task id under the operator?
>>>>>>
>>>>>> Best,
>>>>>> Jark
>>>>>>
>>>>>>
>>>>>> On Thu, 19 Jan 2023 at 17:33, Chesnay Schepler <ch...@apache.org>
>>>>> wrote:
>>>>>>>    > First, I do not understand why users have to configure the new
>>>>> scope
>>>>>>> format, which has a default value.
>>>>>>>
>>>>>>> If you don't use scope formats, sure. If you do use scope formats,
>>>>> e.g.
>>>>>>> to add a common prefix (which is the case for datadog users for
>>>>>>> example), then the current default in the FLIP is insufficient and
>>>>>>> requires the user to update the configuration.
>>>>>>>
>>>>>>>    > I usually do not need to change these configuration of scope
>>>>> formats
>>>>>>> when submitting the flink job.
>>>>>>>
>>>>>>> Scope formats as a whole are quite a power-user feature, but that
>>>>>>> doesn't mean we should ignore it.
>>>>>>>
>>>>>>>    > I try to let it extend the ComponentMetricGroup
>>>>>>>
>>>>>>> This isn't inherently required just because it is a "component".
>>>>>>> Component metric groups should _only_ be used for cases where such a
>>>>>>> component provides several bits of metadata.
>>>>>>> For example, tasks provide vertex ids, task names, attempted IDs
>>>>> etc.,
>>>>>>> and we'd like users to have the option on how this metadata ends up
>>>>>>> being used (via scope formats). This currently can't be built with
>>>>> the
>>>>>>> addGroup() methods, hence why the component groups exist.
>>>>>>> However, the operator coordinator _itself_does not provide several
>>>>> bits
>>>>>>> of metadata. Logically, the _operator_ does, not the coordinator.
>>>>>>>
>>>>>>>    > This make me decide to add a new scope format.
>>>>>>>
>>>>>>> And this is fine; just don't add a new scope format for the
>>>>> coordinator
>>>>>>> but logically for operators on the JM side, such that we can extend
>>>>> the
>>>>>>> set of operator-specific metrics in the future without having to add
>>>>> yet
>>>>>>> another scope format.
>>>>>>>
>>>>>>>    > The relationship between the Operator and OperatorCoordinator is
>>>>>>> maintained in the OperatorCoordinatorHolder. In the POC, the operator
>>>>>>> name/id could be found and the OperatorCoordinatorMetricGroup will be
>>>>>>> created here.
>>>>>>>
>>>>>>> That's all irrelevant.
>>>>>>>
>>>>>>>    > The registered metrics of OperatorCoordinator always are those
>>>>> which
>>>>>>> can not be aggregated from the tasks, like the number of unassigned
>>>>>>> splits in the SourceCoordinator. Actually we have not encountered the
>>>>>>> scenario you mentioned. I think the OperatorCoordinatorMetricGroup
>>>>>>> should only contain the metrics for itself instead of its subtasks.
>>>>>>>
>>>>>>> You are misunderstanding the problem.
>>>>>>>
>>>>>>> This isn't about aggregating metrics on our side or exposing metrics
>>>>>>> from TMs on the JM side or anything like that.
>>>>>>>
>>>>>>> It's purely about the available metadata for operator metrics on the
>>>>> JM
>>>>>>> side. Currently you suggest operator name / id; what I'm proposing
>>>>> is to
>>>>>>> add task(==vertex!) id / name to this list.
>>>>>>>
>>>>>>> The benefit here is simple. I can look up all metrics where
>>>>> task_id==XXX
>>>>>>> and get _everything_ related to that vertex, including all metrics
>>>>>>> associated with operators that are part of that vertex, including the
>>>>>>> coordinator.
>>>>>>>
>>>>>>>    > Using metrics.scope.jm.job is not enough to distinguish the
>>>>> different
>>>>>>> coordinators.
>>>>>>>
>>>>>>> I did not suggest just using metrics.scope.jm.job. I suggested that
>>>>> as
>>>>>>> the default for the jm operator scope format, that is used for the
>>>>>>> proposed JobManagerOperatorMetricGroup, which will have some plain
>>>>>>> metric groups as children for coordinators.
>>>>>>> Aka, you have the JMOperatorMG, then you call
>>>>> addGroup("coordinators")
>>>>>>> and then addGroup(coordinator_name) for each coordinator.
>>>>>>>
>>>>>>>    > So there are two choice for the
>>>>>>> InternalOperatorCoordinatorMetricGroup: 1. add and improve the new
>>>>> scope
>>>>>>> format; 2. use the metrics.scope.jm.job and ProxyMetricGroup. Which
>>>>> one
>>>>>>> is better?
>>>>>>>
>>>>>>> There are more options than that.
>>>>>>> I feel like there are a lot of misunderstandings in this discussion.
>>>>>>> Please let me know if I could clear things up. If not I can also
>>>>> provide
>>>>>>> a PoC based on your PoC if that can speed things up. It's not that
>>>>>>> different anyway.
>>>>>>>
>>>>>>> On 19/01/2023 07:39, Hang Ruan wrote:
>>>>>>>> Hi, chesnay,
>>>>>>>>
>>>>>>>> Thanks for your reply. I still have some doubts about the questions
>>>>>>>> you raised.
>>>>>>>>
>>>>>>>>       > Extending the set of ScopeFormats is problematic because it
>>>>> in
>>>>>>>>       practice
>>>>>>>>       it breaks the config if users actively rely on it, since
>>>>> there's now
>>>>>>>>       another key that they _must_ set for it to be
>>>>>>>>       consistent/compatible with
>>>>>>>>       their existing setup.
>>>>>>>>       Unfortunately due to how powerful scope formats are we can't
>>>>> derive a
>>>>>>>>       default value that matches their existing setup.
>>>>>>>>       Hence we should try to do this as rarely as possible.
>>>>>>>>
>>>>>>>>
>>>>>>>>       > This FLIP does not adhere to that since it proposes a
>>>>> dedicated
>>>>>>>>       format
>>>>>>>>       for coordinators; next time we want to expose operator-specific
>>>>>>>>       metrics
>>>>>>>>       (e.g., in the scheduler) we'd have to add another one to
>>>>> support it.
>>>>>>>>
>>>>>>>> First, I do not understand why users have to configure the new scope
>>>>>>>> format, which has a default value. I usually do not need to change
>>>>>>>> these configuration of scope formats when submitting the flink job.
>>>>>>>> IMO, the OperatorCoordinator is a component running at the
>>>>> JobMaster.
>>>>>>>> I try to let it extend the ComponentMetricGroup, but I cannot find
>>>>>>>> a suitable scope. This make me decide to add a new scope format.
>>>>>>>>
>>>>>>>>       > Additionally, the configurable variables (operator name/id)
>>>>> are
>>>>>>>>       logically not attached to the coordinator, but operators, so
>>>>> to me it
>>>>>>>>       just doesn't make sense to structure it like this.
>>>>>>>>
>>>>>>>>
>>>>>>>> The relationship between the Operator and OperatorCoordinator is
>>>>>>>> maintained in the OperatorCoordinatorHolder. In the POC, the
>>>>> operator
>>>>>>>> name/id could be found and the OperatorCoordinatorMetricGroup will
>>>>> be
>>>>>>>> created here.
>>>>>>>>
>>>>>>>>       > Another thing I'm concerned about is that, because we don't
>>>>> include
>>>>>>>>       tasks in the hierarchy, users wishing to collect all metrics
>>>>> for a
>>>>>>>>       particular task (in this case ==vertex) now have to go
>>>>>>>>       significantly out
>>>>>>>>       of their way to get them, since they can no longer just filter
>>>>> by the
>>>>>>>>       task ID but have to be filter for _all_ operators that are
>>>>> part of
>>>>>>>>       the task.
>>>>>>>>
>>>>>>>>
>>>>>>>> The registered metrics of OperatorCoordinator always are those which
>>>>>>>> can not be aggregated from the tasks, like the number of unassigned
>>>>>>>> splits in the SourceCoordinator. Actually we have not encountered
>>>>> the
>>>>>>>> scenario you mentioned. I think the OperatorCoordinatorMetricGroup
>>>>>>>> should only contain the metrics for itself instead of its subtasks.
>>>>>>>>
>>>>>>>> Using metrics.scope.jm.job is not enough to distinguish the
>>>>> different
>>>>>>>> coordinators. The operator id/name is necessary. Then
>>>>>>>> the implementation should be like this. But users can not change the
>>>>>>>> scope in this way. This is also acceptable.
>>>>>>>> @Internal public class InternalOperatorCoordinatorMetricGroup
>>>>>>>>            extends ProxyMetricGroup<MetricGroup>
>>>>>>>>            implements OperatorCoordinatorMetricGroup {
>>>>>>>>        public InternalOperatorCoordinatorMetricGroup(
>>>>>>>>                JobManagerJobMetricGroup parent, OperatorID
>>>>> operatorID,
>>>>>>> String operatorName) {
>>>>>>>>            super(parent.addGroup(operatorID +
>>>>>>> operatorName).addGroup("coordinator")); }
>>>>>>>> }
>>>>>>>> So there are two choice for the
>>>>>>>> InternalOperatorCoordinatorMetricGroup: 1. add and improve the new
>>>>>>>> scope format; 2. use the metrics.scope.jm.job and ProxyMetricGroup.
>>>>>>>> Which one is better?
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Hang
>>>>>>>>
>>>>>>>>
>>>>>>>> Chesnay Schepler <ch...@apache.org> 于2023年1月18日周三 17:03写道：
>>>>>>>>
>>>>>>>>       You're misunderstanding the problem.
>>>>>>>>
>>>>>>>>       Metric groups form a tree with each group providing certain
>>>>> metadata.
>>>>>>>>       E.g., on the taskmanager we have a TM metric group that
>>>>> provides info
>>>>>>>>       about the TM, that has child task metric groups, that have
>>>>> child
>>>>>>>>       operator metric groups etc. The operator metric group again has
>>>>>>>>       children, where we are now mostly in the user/connector land,
>>>>> like a
>>>>>>>>       kafka metric group providing partition info or something.
>>>>>>>>
>>>>>>>>       What is being proposed is to create a coordinator metric group
>>>>> on
>>>>>>>>       the JM
>>>>>>>>       side that provides operator metadata.
>>>>>>>>
>>>>>>>>       A more appropriate structure is to create an operator group on
>>>>> the JM
>>>>>>>>       side that provides this info, with the coordinator metric group
>>>>>>>>       being a
>>>>>>>>       child.
>>>>>>>>
>>>>>>>>       On 17/01/2023 19:56, Steven Wu wrote:
>>>>>>>>       >> Additionally, the configurable variables (operator name/id)
>>>>> are
>>>>>>>>       > logically not attached to the coordinator, but operators, so
>>>>> to
>>>>>>>>       me it
>>>>>>>>       > just doesn't make sense to structure it like this.
>>>>>>>>       >
>>>>>>>>       > Chesnay, maybe we should clarify the terminology. To me,
>>>>>>>>       pperators (like
>>>>>>>>       > FLIP-27 source) can have two parts (coordinator and
>>>>>>>>       reader/subtask). I
>>>>>>>>       > think it is fine to include operator name/id for coordinator
>>>>>>>>       metrics.
>>>>>>>>       >
>>>>>>>>       > On Mon, Jan 16, 2023 at 2:13 AM Chesnay Schepler
>>>>>>>>       <ch...@apache.org> wrote:
>>>>>>>>       >
>>>>>>>>       >> Slight correction: Using metrics.scope.jm.job as the default
>>>>>>>>       should be
>>>>>>>>       >> safe.
>>>>>>>>       >>
>>>>>>>>       >> On 16/01/2023 10:18, Chesnay Schepler wrote:
>>>>>>>>       >>> The proposed ScopeFormat is still problematic for a few
>>>>> reasons.
>>>>>>>>       >>>
>>>>>>>>       >>> Extending the set of ScopeFormats is problematic because
>>>>> it in
>>>>>>>>       >>> practice it breaks the config if users actively rely on
>>>>> it, since
>>>>>>>>       >>> there's now another key that they _must_ set for it to be
>>>>>>>>       >>> consistent/compatible with their existing setup.
>>>>>>>>       >>> Unfortunately due to how powerful scope formats are we
>>>>> can't
>>>>>>>>       derive a
>>>>>>>>       >>> default value that matches their existing setup.
>>>>>>>>       >>> Hence we should try to do this as rarely as possible.
>>>>>>>>       >>>
>>>>>>>>       >>> This FLIP does not adhere to that since it proposes a
>>>>>>>>       dedicated format
>>>>>>>>       >>> for coordinators; next time we want to expose
>>>>> operator-specific
>>>>>>>>       >>> metrics (e.g., in the scheduler) we'd have to add another
>>>>> one to
>>>>>>>>       >>> support it.
>>>>>>>>       >>>
>>>>>>>>       >>> Additionally, the configurable variables (operator
>>>>> name/id) are
>>>>>>>>       >>> logically not attached to the coordinator, but operators,
>>>>> so
>>>>>>>>       to me it
>>>>>>>>       >>> just doesn't make sense to structure it like this.
>>>>>>>>       >>>
>>>>>>>>       >>> Another thing I'm concerned about is that, because we don't
>>>>>>>>       include
>>>>>>>>       >>> tasks in the hierarchy, users wishing to collect all
>>>>> metrics for
>>>>>>> a
>>>>>>>>       >>> particular task (in this case ==vertex) now have to go
>>>>>>>>       significantly
>>>>>>>>       >>> out of their way to get them, since they can no longer just
>>>>>>>>       filter by
>>>>>>>>       >>> the task ID but have to be filter for _all_ operators that
>>>>> are
>>>>>>>>       part of
>>>>>>>>       >>> the task.
>>>>>>>>       >>>
>>>>>>>>       >>> On 16/01/2023 03:09, Hang Ruan wrote:
>>>>>>>>       >>>> Hi, @chesnay@apache.org <ch...@apache.org> ,
>>>>>>>>       >>>>
>>>>>>>>       >>>> Do you have time to help to review this FLIP again after
>>>>> the
>>>>>>>>       >>>> modification?
>>>>>>>>       >>>> Looking forward to your reply.
>>>>>>>>       >>>> This FLIP will add a new configuration for the
>>>>>>>>       >>>> OperatorCoordinatorMetricGroup scope format. It provides
>>>>> an
>>>>>>>>       internal
>>>>>>>>       >>>> implementation and is added as a component to the
>>>>>>>>       >>>> JobManagerJobMetricGroup.
>>>>>>>>       >>>> If something doesn't make sense, could you provide some
>>>>>>>>       advice? It
>>>>>>>>       >>>> will be
>>>>>>>>       >>>> very helpful. Thanks a lot for your help.
>>>>>>>>       >>>>
>>>>>>>>       >>>> Best,
>>>>>>>>       >>>> Hang
>>>>>>>>       >>>>
>>>>>>>>       >>>> Martijn Visser <ma...@apache.org> 于2023年1月11日周三
>>>>>>>>       16:34写道：
>>>>>>>>       >>>>
>>>>>>>>       >>>>
>>>>>>>>       >>>>> Hi Hang,
>>>>>>>>       >>>>>
>>>>>>>>       >>>>> I'm a bit surprised that this has gone to a vote, given
>>>>> that
>>>>>>>>       Chesnay
>>>>>>>>       >>>>> deliberately mentioned that he would vote against it
>>>>> as-is.
>>>>>>>>       I would
>>>>>>>>       >>>>> expect
>>>>>>>>       >>>>> that before going to a vote, he has had the opportunity
>>>>> to
>>>>>>>>       >>>>> participate in
>>>>>>>>       >>>>> this discussion.
>>>>>>>>       >>>>>
>>>>>>>>       >>>>> Best regards,
>>>>>>>>       >>>>>
>>>>>>>>       >>>>> Martijn
>>>>>>>>       >>>>>
>>>>>>>>       >>>>> On Tue, Jan 3, 2023 at 12:53 PM Jark Wu <
>>>>> imjark@gmail.com>
>>>>>>>>       wrote:
>>>>>>>>       >>>>>
>>>>>>>>       >>>>>> Hi Dong,
>>>>>>>>       >>>>>>
>>>>>>>>       >>>>>> Regarding “SplitEnumeratorContext#metricGroup���, my only
>>>>>>>>       concern is
>>>>>>>>       >>>>>> that
>>>>>>>>       >>>>>> this is a core interface for the FLIP. It’s hard to
>>>>> tell how
>>>>>>>>       >>>>>> sources use
>>>>>>>>       >>>>>> metric group without mentioning this interface. Even if
>>>>>>>>       this is an
>>>>>>>>       >>>>> existing
>>>>>>>>       >>>>>> API, I think it’s worth introducing the interface again
>>>>> and
>>>>>>>>       declaring
>>>>>>>>       >>>>> that
>>>>>>>>       >>>>>> we will implement the interface instead of a no-op
>>>>> method
>>>>>>>>       in this
>>>>>>>>       >>>>>> FLIP.
>>>>>>>>       >>>>>>
>>>>>>>>       >>>>>> Anyway, this is a minor problem and shouldn’t block this
>>>>>>>>       FLIP. I’m
>>>>>>>>       >>>>>> +1 to
>>>>>>>>       >>>>>> start a vote.
>>>>>>>>       >>>>>>
>>>>>>>>       >>>>>> Best,
>>>>>>>>       >>>>>> Jark
>>>>>>>>       >>>>>>
>>>>>>>>       >>>>>>
>>>>>>>>       >>>>>>> 2023年1月3日 10:03，Hang Ruan <ru...@gmail.com> 写道：
>>>>>>>>       >>>>>>>
>>>>>>>>       >>>>>>> Hi, Jark and Dong,
>>>>>>>>       >>>>>>>
>>>>>>>>       >>>>>>> Thanks for your comments. Sorry for my late reply.
>>>>>>>>       >>>>>>>
>>>>>>>>       >>>>>>> For suggestion 1, I plan to implement the
>>>>>>>>       >>>>>>> SplitEnumeratorMetricGroup in
>>>>>>>>       >>>>>>> another issue, and it is not contained in this FLIP. I
>>>>>>>>       will add some
>>>>>>>>       >>>>>>> description about this part.
>>>>>>>>       >>>>>>> For suggestion 2, changes about
>>>>>>>>       OperatorCoordinator#metricGroup has
>>>>>>>>       >>>>>> already
>>>>>>>>       >>>>>>> been documented in the proposed change section.
>>>>>>>>       >>>>>>>
>>>>>>>>       >>>>>>> Best,
>>>>>>>>       >>>>>>> Hang
>>>>>>>>       >>>>>>>
>>>>>>>>       >>>>>>> Dong Lin <li...@gmail.com> 于2023年1月1日周日 09:45写道：
>>>>>>>>       >>>>>>>
>>>>>>>>       >>>>>>>> Let me chime-in and add comments regarding the public
>>>>>>>>       interface
>>>>>>>>       >>>>> section.
>>>>>>>>       >>>>>>>> Please see my comments inline.
>>>>>>>>       >>>>>>>>
>>>>>>>>       >>>>>>>> On Thu, Dec 29, 2022 at 6:08 PM Jark Wu
>>>>>>>>       <im...@gmail.com> wrote:
>>>>>>>>       >>>>>>>>
>>>>>>>>       >>>>>>>>> Hi Hang,
>>>>>>>>       >>>>>>>>>
>>>>>>>>       >>>>>>>>> Thanks for driving this discussion. I think this is a
>>>>>>>>       very useful
>>>>>>>>       >>>>>> feature
>>>>>>>>       >>>>>>>>> for connectors.
>>>>>>>>       >>>>>>>>>
>>>>>>>>       >>>>>>>>> The FLIP looks quite good to me, and I just have two
>>>>>>>>       suggestions.
>>>>>>>>       >>>>>>>>>
>>>>>>>>       >>>>>>>>> 1. In the "Public Interface" section, mention that
>>>>> the
>>>>>>>>       >>>>>>>>> implementation
>>>>>>>>       >>>>>>>>> behavior of "SplitEnumeratorContext#metricGroup" is
>>>>>>>>       changed from
>>>>>>>>       >>>>>>>> returning
>>>>>>>>       >>>>>>>>> null to returning a concrete
>>>>> SplitEnumeratorMetricGroup
>>>>>>>>       instance.
>>>>>>>>       >>>>> Even
>>>>>>>>       >>>>>>>>> though the API is already there, the behavior change
>>>>> can
>>>>>>>>       also be
>>>>>>>>       >>>>>>>> considered
>>>>>>>>       >>>>>>>>> a public change.
>>>>>>>>       >>>>>>>>>
>>>>>>>>       >>>>>>>> SplitEnumeratorContext#metricGroup is an interface and
>>>>>>>>       this FLIP
>>>>>>>>       >>>>>>>> does
>>>>>>>>       >>>>>> not
>>>>>>>>       >>>>>>>> seem to change its semantics/behavior. The FLIP does
>>>>>>>>       change the
>>>>>>>>       >>>>>>>> implementation/behavior of
>>>>>>>>       SourceCoordinatorContext#metricGroup,
>>>>>>>>       >>>>>>>> which
>>>>>>>>       >>>>>> is
>>>>>>>>       >>>>>>>> marked @Internal.
>>>>>>>>       >>>>>>>>
>>>>>>>>       >>>>>>>> Thus it might seem a bit weird to add in the public
>>>>>>> interface
>>>>>>>>       >>>>>>>> section
>>>>>>>>       >>>>>>>> saying that we change the interface
>>>>>>>>       >>>>>>>> SplitEnumeratorContext#metricGroup
>>>>>>>>       >>>>>> from
>>>>>>>>       >>>>>>>> returning null to non-null object.
>>>>>>>>       >>>>>>>>
>>>>>>>>       >>>>>>>> 2. Mention the newly added interface of
>>>>>>>>       >>>>>> "OperatorCoordinator#metricGroup"
>>>>>>>>       >>>>>>>>> in the "Proposed Changes" section or "Public
>>>>> Interface"
>>>>>>>>       section. As
>>>>>>>>       >>>>> the
>>>>>>>>       >>>>>>>>> FLIP said, OperatorCoordinator is widely used in many
>>>>>>>>       connectors.
>>>>>>>>       >>>>>> Though
>>>>>>>>       >>>>>>>> it
>>>>>>>>       >>>>>>>>> is still an @Internal API, I think it is worth
>>>>>>>>       mentioning the
>>>>>>>>       >>>>>>>>> change
>>>>>>>>       >>>>> in
>>>>>>>>       >>>>>>>> the
>>>>>>>>       >>>>>>>>> FLIP.
>>>>>>>>       >>>>>>>> Since OperatorCoordinator is an internal API, it seems
>>>>>>>>       reasonable to
>>>>>>>>       >>>>>>>> explain it in the proposed change section. The FLIP
>>>>> seems
>>>>>>>>       to have
>>>>>>>>       >>>>>>>> documented this in point 5 of the proposed change
>>>>> section.
>>>>>>>>       >>>>>>>>
>>>>>>>>       >>>>>>>> BTW, if we think there are @internal classes that are
>>>>>>>>       important
>>>>>>>>       >>>>>>>> enough
>>>>>>>>       >>>>>> to
>>>>>>>>       >>>>>>>> be added in the public interface section, it might be
>>>>>>>>       useful to
>>>>>>>>       >>>>>> explicitly
>>>>>>>>       >>>>>>>> discuss this topic and document it in the "*What are
>>>>> the
>>>>>>>>       "public
>>>>>>>>       >>>>>>>> interfaces" of the project*" in this
>>>>>>>>       >>>>>>>> <
>>>>>>>>       >>>>>>>>
>>>>>>>>       >>
>>>>>>>>
>>>>> https://cwiki.apache.org/confluence/display/FLINK/Flink+Improvement+Proposals
>>>>>>>>       >>>>>>>> wiki.
>>>>>>>>       >>>>>>>>
>>>>>>>>       >>>>>>>>
>>>>>>>>       >>>>>>>>> Best,
>>>>>>>>       >>>>>>>>> Jark
>>>>>>>>       >>>>>>>>>
>>>>>>>>       >>>>>>>>>
>>>>>>>>       >>>>>>>>> On Mon, 26 Dec 2022 at 18:06, Hang Ruan
>>>>>>>>       <ru...@gmail.com>
>>>>>>>>       >>>>>> wrote:
>>>>>>>>       >>>>>>>>>> Hi, thanks for the feedback, Zhu Zhu and Qingsheng.
>>>>>>>>       >>>>>>>>>> After combining everyone's comments, the main
>>>>> concerns and
>>>>>>>>       >>>>>>>> corresponding
>>>>>>>>       >>>>>>>>>> adjustments are as follows.
>>>>>>>>       >>>>>>>>>>
>>>>>>>>       >>>>>>>>>> Q1: Common metrics are not quite useful.
>>>>>>>>       >>>>>>>>>> numEventsIn and numEventsOut counters will be
>>>>> removed
>>>>>>>>       from the
>>>>>>>>       >>>>>>>>>> OperatorCoordinatorMetricGroup. These common
>>>>> metrics do
>>>>>>> not
>>>>>>>>       >>>>>>>>>> provide
>>>>>>>>       >>>>>>>>> enough
>>>>>>>>       >>>>>>>>>> information for users. The users are more willing to
>>>>>>>>       get the
>>>>>>>>       >>>>>>>>>> number
>>>>>>>>       >>>>> of
>>>>>>>>       >>>>>>>>>> events of the specified type instead of the total
>>>>>>>>       number. And this
>>>>>>>>       >>>>>>>> metric
>>>>>>>>       >>>>>>>>>> is calculated differently. The implementation could
>>>>>>>>       register the
>>>>>>>>       >>>>>> metric
>>>>>>>>       >>>>>>>>> by
>>>>>>>>       >>>>>>>>>> themselves.
>>>>>>>>       >>>>>>>>>>
>>>>>>>>       >>>>>>>>>> Q2: This FLIP is overly complicated.
>>>>>>>>       >>>>>>>>>> This FLIP will become concise after these
>>>>> modifications.
>>>>>>>>       >>>>>>>>>> OperatorCoordinatorMetricGroup has already been
>>>>>>>>       introduced into
>>>>>>>>       >>>>> Flink
>>>>>>>>       >>>>>>>> by
>>>>>>>>       >>>>>>>>>> FLIP-179<
>>>>>>>>       >>>>>>>>>>
>>>>>>>>       >>>>>>>>>>
>>>>>>>>       >>
>>>>>>>>
>>>>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-179%3A+Expose+Standardized+Operator+Metrics
>>>>>>>>       >>>>>>>>>>> .
>>>>>>>>       >>>>>>>>>> And
>>>>>>>>       >>>>>>>>>> this FLIP will not change it. This FLIP only
>>>>> provides a
>>>>>>>>       new metric
>>>>>>>>       >>>>>>>> option
>>>>>>>>       >>>>>>>>>> and a new metric group scope. The changes in
>>>>> proposed
>>>>>>>>       changes
>>>>>>>>       >>>>> provide
>>>>>>>>       >>>>>>>> the
>>>>>>>>       >>>>>>>>>> details about the modifications for the internal
>>>>>>>>       classes, which
>>>>>>>>       >>>>> might
>>>>>>>>       >>>>>>>>> make
>>>>>>>>       >>>>>>>>>> it look complicated.
>>>>>>>>       >>>>>>>>>>
>>>>>>>>       >>>>>>>>>> Thanks for all the comments again. If there are no
>>>>> further
>>>>>>>>       >>>>>>>>>> comments,
>>>>>>>>       >>>>>> we
>>>>>>>>       >>>>>>>>>> plan to start the voting thread this week.
>>>>>>>>       >>>>>>>>>>
>>>>>>>>       >>>>>>>>>> Best,
>>>>>>>>       >>>>>>>>>> Hang
>>>>>>>>       >>>>>>>>>>
>>>>>>>>       >>>>>>>>>> Qingsheng Ren <re...@gmail.com> 于2022年12月26日周一
>>>>>>>>       16:48写道：
>>>>>>>>       >>>>>>>>>>
>>>>>>>>       >>>>>>>>>>
>>>>>>>>       >>>>>>>>>>> Thanks for the FLIP, Hang!
>>>>>>>>       >>>>>>>>>>>
>>>>>>>>       >>>>>>>>>>> This FLIP overall looks good to me. Actually I
>>>>> share
>>>>>>>>       the same
>>>>>>>>       >>>>> concern
>>>>>>>>       >>>>>>>>>> with
>>>>>>>>       >>>>>>>>>>> Zhu that numEventsIn and numEventsOut counters are
>>>>> not
>>>>>>>>       quite
>>>>>>>>       >>>>>>>>>>> useful
>>>>>>>>       >>>>>>>> to
>>>>>>>>       >>>>>>>>>> end
>>>>>>>>       >>>>>>>>>>> users. OperatorEvent is a quite low-level
>>>>> abstraction,
>>>>>>>>       which
>>>>>>>>       >>>>> requires
>>>>>>>>       >>>>>>>>>>> instantialization in order to be practical to
>>>>> users and
>>>>>>>>       >>>>>>>>>>> developers,
>>>>>>>>       >>>>>>>> so
>>>>>>>>       >>>>>>>>>>> maybe it's better to exclude them from the FLIP.
>>>>>>>>       >>>>>>>>>>>
>>>>>>>>       >>>>>>>>>>> Best,
>>>>>>>>       >>>>>>>>>>> Qingsheng
>>>>>>>>       >>>>>>>>>>> Ververica (Alibaba)
>>>>>>>>       >>>>>>>>>>>
>>>>>>>>       >>>>>>>>>>> On Mon, Dec 26, 2022 at 12:08 PM Zhu Zhu
>>>>>>>>       <re...@gmail.com>
>>>>>>>>       >>>>> wrote:
>>>>>>>>       >>>>>>>>>>>> Hi Hang,
>>>>>>>>       >>>>>>>>>>>> I still see no strong reason why we need
>>>>>>>>       >>>>>>>>>>>> numEventsIn/numEventsOut
>>>>>>>>       >>>>>>>>>>> metrics.
>>>>>>>>       >>>>>>>>>>>> In the discussion in FLINK-29801, I can see the
>>>>> same
>>>>>>>>       concern
>>>>>>>>       >>>>>>>>>>>> from
>>>>>>>>       >>>>>>>>>> others.
>>>>>>>>       >>>>>>>>>>>> So I prefer to exclude them from this FLIP to
>>>>> avoid
>>>>>>>>       >>>>>>>>>>>> over-extending
>>>>>>>>       >>>>>>>>> the
>>>>>>>>       >>>>>>>>>>>> scope.
>>>>>>>>       >>>>>>>>>>>>
>>>>>>>>       >>>>>>>>>>>> Thanks,
>>>>>>>>       >>>>>>>>>>>> Zhu
>>>>>>>>       >>>>>>>>>>>>
>>>>>>>>       >>>>>>>>>>>> Hang Ruan <ru...@gmail.com> 于2022年12月23日周五
>>>>>>>>       >>>>>>>>>>>> 15:21写道：
>>>>>>>>       >>>>>>>>>>>>> Hi everyone,
>>>>>>>>       >>>>>>>>>>>>>
>>>>>>>>       >>>>>>>>>>>>> As for the Zhu Zhu's problem, I think we should
>>>>> keep
>>>>>>>>       the common
>>>>>>>>       >>>>>>>>>>> metrics,
>>>>>>>>       >>>>>>>>>>>> which will help to observe incoming and outgoing
>>>>>>>>       events. What do
>>>>>>>>       >>>>>>>> you
>>>>>>>>       >>>>>>>>>>> think,
>>>>>>>>       >>>>>>>>>>>> @Zhu Zhu ?
>>>>>>>>       >>>>>>>>>>>>> And @Chesnay, are there any other issues you are
>>>>>>>>       more concerned
>>>>>>>>       >>>>>>>>>> about?
>>>>>>>>       >>>>>>>>>>>> Looking forward to your reply.
>>>>>>>>       >>>>>>>>>>>>> Thanks for all the comments. If there are no
>>>>> further
>>>>>>>>       >>>>>>>>>>>>> comments, we
>>>>>>>>       >>>>>>>>>> plan
>>>>>>>>       >>>>>>>>>>>> to start the voting thread next week.
>>>>>>>>       >>>>>>>>>>>>> Best,
>>>>>>>>       >>>>>>>>>>>>> Hang
>>>>>>>>       >>>>>>>>>>>>>
>>>>>>>>       >>>>>>>>>>>>> Hang Ruan <ru...@gmail.com>
>>>>> 于2022年12月15日周四
>>>>>>>>       >>>>>>>>>>>>> 16:49写道：
>>>>>>>>       >>>>>>>>>>>>>> Hi, Zhu Zhu,
>>>>>>>>       >>>>>>>>>>>>>>
>>>>>>>>       >>>>>>>>>>>>>> Thanks for your feedback!
>>>>>>>>       >>>>>>>>>>>>>>
>>>>>>>>       >>>>>>>>>>>>>> The OperatorCoordinator implementations are
>>>>>>>>       different. And
>>>>>>>>       >>>>>>>>>>>>>> their
>>>>>>>>       >>>>>>>>>>>> metrics are much different too. We try to find the
>>>>>>> common
>>>>>>>>       >>>>>>>>>>>> metrics
>>>>>>>>       >>>>>>>> and
>>>>>>>>       >>>>>>>>>> put
>>>>>>>>       >>>>>>>>>>>> them in the OperatorCoordinatorMetricGroup. If
>>>>> most
>>>>>>>>       developers
>>>>>>>>       >>>>>>>> think
>>>>>>>>       >>>>>>>>> we
>>>>>>>>       >>>>>>>>>>> do
>>>>>>>>       >>>>>>>>>>>> not need these common metrics, removing the common
>>>>>>>>       metrics is
>>>>>>>>       >>>>>>>>>>>> also
>>>>>>>>       >>>>>>>>>>>> acceptable.
>>>>>>>>       >>>>>>>>>>>>>> Best,
>>>>>>>>       >>>>>>>>>>>>>> Hang
>>>>>>>>       >>>>>>>>>>>>>>
>>>>>>>>       >>>>>>>>>>>>>> Zhu Zhu <re...@gmail.com> 于2022年12月14日周三
>>>>>>>>       22:09写道：
>>>>>>>>       >>>>>>>>>>>>>>> Hi Hang & MengYue,
>>>>>>>>       >>>>>>>>>>>>>>>
>>>>>>>>       >>>>>>>>>>>>>>> Thanks for creating this FLIP!
>>>>>>>>       >>>>>>>>>>>>>>>
>>>>>>>>       >>>>>>>>>>>>>>> I think it is very useful, mainly in two
>>>>> aspects:
>>>>>>>>       >>>>>>>>>>>>>>> 1. Enables OperatorCoordinators to register
>>>>> metrics.
>>>>>>>>       >>>>>>>>>>>>>>> Currently
>>>>>>>>       >>>>>>>>>>>>>>> the coordinators has no way to do this. And
>>>>> operator
>>>>>>>>       >>>>>>>> coordinator
>>>>>>>>       >>>>>>>>>>>>>>> metric group further enables the
>>>>> SplitEnumerator
>>>>>>>>       to have
>>>>>>>>       >>>>>>>>>>>>>>> access
>>>>>>>>       >>>>>>>>>>>>>>> to a registered metric group (via the existing
>>>>> public
>>>>>>>>       >>>>>>>>>>>>>>> interface
>>>>>>>>       >>>>>>>>>>>>>>> SplitEnumeratorContext#metricGroup()), which is
>>>>>>>>       null at the
>>>>>>>>       >>>>>>>>> moment.
>>>>>>>>       >>>>>>>>>>>>>>> 2. Defines the scope of operator coordinator
>>>>>>>>       metrics. A clear
>>>>>>>>       >>>>>>>>>>>> definition
>>>>>>>>       >>>>>>>>>>>>>>> makes it easy for users to find their wanted
>>>>>>>>       metrics. The
>>>>>>>>       >>>>>>>>>> definition
>>>>>>>>       >>>>>>>>>>>>>>> also helps to avoid conflicts of metrics from
>>>>>>> multiple
>>>>>>>>       >>>>>>>>>>>> OperatorCoordinators
>>>>>>>>       >>>>>>>>>>>>>>> of the same kind. E.g. each SourceCoordinator
>>>>> may
>>>>>>>>       have its
>>>>>>>>       >>>>>>>>>>>>>>> own
>>>>>>>>       >>>>>>>>>>>>>>> numSourceSplits metric, these metrics should
>>>>> not
>>>>>>>>       be directly
>>>>>>>>       >>>>>>>>>>> registered
>>>>>>>>       >>>>>>>>>>>>>>> to the job metric group.
>>>>>>>>       >>>>>>>>>>>>>>>
>>>>>>>>       >>>>>>>>>>>>>>> What I'm a bit concerned is the necessity of
>>>>> the
>>>>>>>>       introduced
>>>>>>>>       >>>>>>>>> common
>>>>>>>>       >>>>>>>>>>>> metrics
>>>>>>>>       >>>>>>>>>>>>>>> numEventsInCounter & numEventsOutCounter. If
>>>>> there
>>>>>>>>       any case
>>>>>>>>       >>>>>>>> which
>>>>>>>>       >>>>>>>>>>>> strongly
>>>>>>>>       >>>>>>>>>>>>>>> requires them?
>>>>>>>>       >>>>>>>>>>>>>>>
>>>>>>>>       >>>>>>>>>>>>>>> Regarding the concerns of Chesnay,
>>>>>>>>       >>>>>>>>>>>>>>>> A dedicated coordinator MG implementation is
>>>>>>> overkill
>>>>>>>>       >>>>>>>>>>>>>>> Directly using the job metric group can result
>>>>> in
>>>>>>>>       metric
>>>>>>>>       >>>>>>>>> conflicts,
>>>>>>>>       >>>>>>>>>>> as
>>>>>>>>       >>>>>>>>>>>> mentioned
>>>>>>>>       >>>>>>>>>>>>>>> in above #2.
>>>>>>>>       >>>>>>>>>>>>>>>
>>>>>>>>       >>>>>>>>>>>>>>> Thanks,
>>>>>>>>       >>>>>>>>>>>>>>> Zhu
>>>>>>>>       >>>>>>>>>>>>>>>
>>>>>>>>       >>>>>>>>>>>>>>> Dong Lin <li...@gmail.com> 于2022年12月10日周六
>>>>>>>>       >>>>>>>>>>>>>>> 14:16写道：
>>>>>>>>       >>>>>>>>>>>>>>>
>>>>>>>>       >>>>>>>>>>>>>>>> Hi Chesney,
>>>>>>>>       >>>>>>>>>>>>>>>>
>>>>>>>>       >>>>>>>>>>>>>>>> Just to double check with you,
>>>>>>>>       >>>>>>>>>>>>>>>> OperatorCoordinatorMetricGroup
>>>>>>>>       >>>>>>>>>>>> (annotated as
>>>>>>>>       >>>>>>>>>>>>>>>> @PublicEvolving) has already been introduced
>>>>> into
>>>>>>>>       Flink by
>>>>>>>>       >>>>>>>>>> FLIP-179
>>>>>>>>       >>>>>>>>>>>>>>>> <
>>>>>>>>       >>
>>>>>>>>
>>>>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-179%3A+Expose+Standardized+Operator+Metrics
>>>>>>>>       >>>>>>>>>>>>> .
>>>>>>>>       >>>>>>>>>>>>>>>> And that FLIP has got you +1.. Do you mean we
>>>>>>>>       should remove
>>>>>>>>       >>>>>>>>> this
>>>>>>>>       >>>>>>>>>>>>>>>> OperatorCoordinatorMetricGroup?
>>>>>>>>       >>>>>>>>>>>>>>>>
>>>>>>>>       >>>>>>>>>>>>>>>> Regards,
>>>>>>>>       >>>>>>>>>>>>>>>> Dong
>>>>>>>>       >>>>>>>>>>>>>>>>
>>>>>>>>       >>>>>>>>>>>>>>>> On Sat, Dec 10, 2022 at 1:33 AM Chesnay
>>>>> Schepler <
>>>>>>>>       >>>>>>>>>>> chesnay@apache.org>
>>>>>>>>       >>>>>>>>>>>> wrote:
>>>>>>>>       >>>>>>>>>>>>>>>>> As a whole I feel like this FLIP is overly
>>>>>>>>       complicated. A
>>>>>>>>       >>>>>>>>>>> dedicated
>>>>>>>>       >>>>>>>>>>>>>>>>> coordinator MG implementation is overkill; it
>>>>>>>>       could just
>>>>>>>>       >>>>>>>>> re-use
>>>>>>>>       >>>>>>>>>>> the
>>>>>>>>       >>>>>>>>>>>>>>>>> existing Task/OperatorMGs to create the same
>>>>>>>>       structure we
>>>>>>>>       >>>>>>>>> have
>>>>>>>>       >>>>>>>>>> on
>>>>>>>>       >>>>>>>>>>>> TMs,
>>>>>>>>       >>>>>>>>>>>>>>>>> similar to what we did with the Job MG.
>>>>>>>>       >>>>>>>>>>>>>>>>>
>>>>>>>>       >>>>>>>>>>>>>>>>> However, I'm not convinced that this is
>>>>> required
>>>>>>>>       anyway,
>>>>>>>>       >>>>>>>>>> because
>>>>>>>>       >>>>>>>>>>>> all the
>>>>>>>>       >>>>>>>>>>>>>>>>> example metrics you listed can be
>>>>> implemented on
>>>>>>>>       the TM
>>>>>>>>       >>>>>>>> side
>>>>>>>>       >>>>>>>>> +
>>>>>>>>       >>>>>>>>>>>>>>>>> aggregating them in the external metrics
>>>>> backend.
>>>>>>>>       >>>>>>>>>>>>>>>>>
>>>>>>>>       >>>>>>>>>>>>>>>>> Since I'm on holidays soon, just so no one
>>>>> tries
>>>>>>>>       to pull a
>>>>>>>>       >>>>>>>>> fast
>>>>>>>>       >>>>>>>>>>>> one on
>>>>>>>>       >>>>>>>>>>>>>>>>> me, if this were to go to a vote as-is I'd be
>>>>>>>>       against it.
>>>>>>>>       >>>>>>>>>>>>>>>>>
>>>>>>>>       >>>>>>>>>>>>>>>>>
>>>>>>>>       >>>>>>>>>>>>>>>>> On 09/12/2022 15:30, Dong Lin wrote:
>>>>>>>>       >>>>>>>>>>>>>>>>>> Hi Hang,
>>>>>>>>       >>>>>>>>>>>>>>>>>>
>>>>>>>>       >>>>>>>>>>>>>>>>>> Thanks for the FLIP! The FLIP looks good
>>>>> and it
>>>>>>>>       is pretty
>>>>>>>>       >>>>>>>>>>>> informative.
>>>>>>>>       >>>>>>>>>>>>>>>>>> I have just two minor comments regarding
>>>>> names:
>>>>>>>>       >>>>>>>>>>>>>>>>>> - Would it be useful to rename the config
>>>>> key as
>>>>>>>>       >>>>>>>>>>>>>>>>>> *metrics.scope.jm.job.operator-coordinator*
>>>>> for
>>>>>>>>       >>>>>>>> consistency
>>>>>>>>       >>>>>>>>>>> with
>>>>>>>>       >>>>>>>>>>>>>>>>>> *metrics.scope.jm.job
>>>>>>>>       >>>>>>>>>>>>>>>>>> *(which is not named as *jm-job)?
>>>>>>>>       >>>>>>>>>>>>>>>>>> - Maybe rename the variable as
>>>>>>>>       >>>>>>>>>>> SCOPE_NAMING_OPERATOR_COORDINATOR
>>>>>>>>       >>>>>>>>>>>> for
>>>>>>>>       >>>>>>>>>>>>>>>>>> simplicity and consistency with
>>>>>>>>       SCOPE_NAMING_OPERATOR
>>>>>>>>       >>>>>>>>> (which
>>>>>>>>       >>>>>>>>>> is
>>>>>>>>       >>>>>>>>>>>> not named
>>>>>>>>       >>>>>>>>>>>>>>>>>> as SCOPE_NAMING_TM_JOB_OPERATOR)?
>>>>>>>>       >>>>>>>>>>>>>>>>>>
>>>>>>>>       >>>>>>>>>>>>>>>>>> Cheers,
>>>>>>>>       >>>>>>>>>>>>>>>>>> Dong
>>>>>>>>       >>>>>>>>>>>>>>>>>>
>>>>>>>>       >>>>>>>>>>>>>>>>>>
>>>>>>>>       >>>>>>>>>>>>>>>>>>
>>>>>>>>       >>>>>>>>>>>>>>>>>> On Thu, Dec 8, 2022 at 3:28 PM Hang Ruan <
>>>>>>>>       >>>>>>>>>>> ruanhang1993@gmail.com>
>>>>>>>>       >>>>>>>>>>>> wrote:
>>>>>>>>       >>>>>>>>>>>>>>>>>>> Hi all,
>>>>>>>>       >>>>>>>>>>>>>>>>>>>
>>>>>>>>       >>>>>>>>>>>>>>>>>>> MengYue and I created FLIP-274[1] Introduce
>>>>>>>>       metric group
>>>>>>>>       >>>>>>>>> for
>>>>>>>>       >>>>>>>>>>>>>>>>>>> OperatorCoordinator. OperatorCoordinator
>>>>> is the
>>>>>>>>       >>>>>>>>> coordinator
>>>>>>>>       >>>>>>>>>>> for
>>>>>>>>       >>>>>>>>>>>> runtime
>>>>>>>>       >>>>>>>>>>>>>>>>>>> operators and running on Job Manager. The
>>>>>>>>       coordination
>>>>>>>>       >>>>>>>>>>>> mechanism is
>>>>>>>>       >>>>>>>>>>>>>>>>>>> operator events between OperatorCoordinator
>>>>>>>>       and its all
>>>>>>>>       >>>>>>>>>>>> operators, the
>>>>>>>>       >>>>>>>>>>>>>>>>>>> coordination is more and more using in
>>>>> Flink, for
>>>>>>>>       >>>>>>>> example
>>>>>>>>       >>>>>>>>>> many
>>>>>>>>       >>>>>>>>>>>> Sources
>>>>>>>>       >>>>>>>>>>>>>>>>> and
>>>>>>>>       >>>>>>>>>>>>>>>>>>> Sinks depend on the mechanism to assign
>>>>> splits
>>>>>>> and
>>>>>>>>       >>>>>>>>>> coordinate
>>>>>>>>       >>>>>>>>>>>> commits to
>>>>>>>>       >>>>>>>>>>>>>>>>>>> external systems. The OperatorCoordinator
>>>>> is
>>>>>>>>       widely
>>>>>>>>       >>>>>>>> using
>>>>>>>>       >>>>>>>>> in
>>>>>>>>       >>>>>>>>>>>> flink kafka
>>>>>>>>       >>>>>>>>>>>>>>>>>>> connector, flink pulsar connector, flink
>>>>> cdc
>>>>>>>>       connector,
>>>>>>>>       >>>>>>>>>> flink
>>>>>>>>       >>>>>>>>>>>> hudi
>>>>>>>>       >>>>>>>>>>>>>>>>>>> connector and so on.
>>>>>>>>       >>>>>>>>>>>>>>>>>>>
>>>>>>>>       >>>>>>>>>>>>>>>>>>> But there is not a suitable metric group
>>>>> scope
>>>>>>>>       for the
>>>>>>>>       >>>>>>>>>>>>>>>>> OperatorCoordinator
>>>>>>>>       >>>>>>>>>>>>>>>>>>> and not an implementation for the interface
>>>>>>>>       >>>>>>>>>>>>>>>>> OperatorCoordinatorMetricGroup.
>>>>>>>>       >>>>>>>>>>>>>>>>>>> These metrics in OperatorCoordinator could
>>>>> be
>>>>>>>>       how many
>>>>>>>>       >>>>>>>>>>>> splits/partitions
>>>>>>>>       >>>>>>>>>>>>>>>>>>> have been assigned to source readers, how
>>>>> many
>>>>>>>>       files
>>>>>>>>       >>>>>>>> have
>>>>>>>>       >>>>>>>>>> been
>>>>>>>>       >>>>>>>>>>>> written
>>>>>>>>       >>>>>>>>>>>>>>>>> out
>>>>>>>>       >>>>>>>>>>>>>>>>>>> by sink writers, these metrics not only
>>>>> help
>>>>>>>>       users to
>>>>>>>>       >>>>>>>> know
>>>>>>>>       >>>>>>>>>> the
>>>>>>>>       >>>>>>>>>>>> job
>>>>>>>>       >>>>>>>>>>>>>>>>> progress
>>>>>>>>       >>>>>>>>>>>>>>>>>>> but also make big job maintaining easier.
>>>>> Thus we
>>>>>>>>       >>>>>>>> propose
>>>>>>>>       >>>>>>>>>> the
>>>>>>>>       >>>>>>>>>>>> FLIP-274
>>>>>>>>       >>>>>>>>>>>>>>>>> to
>>>>>>>>       >>>>>>>>>>>>>>>>>>> introduce a new metric group scope for
>>>>>>>>       >>>>>>>> OperatorCoordinator
>>>>>>>>       >>>>>>>>>> and
>>>>>>>>       >>>>>>>>>>>> provide
>>>>>>>>       >>>>>>>>>>>>>>>>> an
>>>>>>>>       >>>>>>>>>>>>>>>>>>> internal implementation for
>>>>>>>>       >>>>>>>>> OperatorCoordinatorMetricGroup.
>>>>>>>>       >>>>>>>>>>>>>>>>>>> Could you help review this FLIP when you
>>>>> get
>>>>>>>>       time? Any
>>>>>>>>       >>>>>>>>>>> feedback
>>>>>>>>       >>>>>>>>>>>> is
>>>>>>>>       >>>>>>>>>>>>>>>>>>> appreciated!
>>>>>>>>       >>>>>>>>>>>>>>>>>>>
>>>>>>>>       >>>>>>>>>>>>>>>>>>> Best,
>>>>>>>>       >>>>>>>>>>>>>>>>>>> Hang
>>>>>>>>       >>>>>>>>>>>>>>>>>>>
>>>>>>>>       >>>>>>>>>>>>>>>>>>> [1]
>>>>>>>>       >>>>>>>>>>>>>>>>>>>
>>>>>>>>       >>>>>>>>>>>>>>>>>>>
>>>>>>>>       >>
>>>>>>>>
>>>>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-274%3A+Introduce+metric+group+for+OperatorCoordinator
>>>>>>>>       >>
>>>>>>>>
>>>>>

Re: [DISCUSS] FLIP-274 : Introduce metric group for OperatorCoordinator

Posted by Hang Ruan <ru...@gmail.com>.

Hi all,

I make a mistake about the default value of the  scope format
JobManagerOperatorScopeFormat. It should be
<host>.jobmanager.<job_name>.<operator_name>.
I have fixed it in the FLIP.
And the variables (task id and task name) have been added to the list.

Best,
Hang


Hang Ruan <ru...@gmail.com> 于2023年2月1日周三 14:11写道：

> Hi all,
>
> > IIUC, there is only an OperatorCoordinatorMetricGroup interface without
> implementations (it never worked). There were no coordinator metrics
> implemented and registered. @Hang Ruan please correct me if I'm wrong.
> Yes,  we shouldn't have to worry about backwards-compatibility as Chesnay
> said.
>
> After the discussion, I think the FLIP should be modified as follow.
> 1.  Add new metric group JobManagerOperatorMetricGroup, whose scope
> format is JobManagerOperatorScopeFormat. The config key of JobManagerOperatorScopeFormat
> is metrics.scope.jm-operator, and the default value is
> <host>.jobmanager.<job_name>.<task_name>.<operator_name>.
> 2. JobManagerOperatorMetricGroup will be sub components of
> JobManagerJobMetricGroup.
> 3. InternalOperatorCoordinatorMetricGroup will extend ProxyMetricGroup,
> which registered under JobManagerOperatorMetricGroup by
> addGroup("coordinator").
>
> I have updated the FLIP-274. @Chesnay Schepler <ch...@apache.org> @Jark
> Wu <im...@gmail.com> , please help to take a look at the modified FLIP
> and make sure that I do not miss  something.
>
> If there are no more problems, we plan to start the voting thread later.
>
> Best,
> Hang
>
> Chesnay Schepler <ch...@apache.org> 于2023年1月31日周二 20:09写道：
>
>> > IIUC, there is only an OperatorCoordinatorMetricGroup interface without
>> implementations (it never worked). There were no coordinator metrics
>> implemented and registered. @Hang Ruan <ru...@gmail.com> please
>> correct me if I'm wrong.
>>
>> Indeed. There are interfaces, and theoretically a metric group is exposed
>> to user-code via the SplitEnumeratorContext, but the metric group is always
>> null in the SourceCoordinatorContext. So we shouldn't have to worry about
>> backwards-compatibility.
>>
>> > However, the config key "metrics.scope.jm.job" is deprecated and is
>> replaced by "metrics.scope.jm-job". Similarly, "metrics.scope.tm.job" is
>> replaced with "metrics.scope.tm-job". That's why I though
>> "metrics.scope.jm-operator" might be more consistent.
>>
>> HA! Good catch. I only looked at the 1.16 docs where they weren't
>> deprecated yet.
>>
>> Then it should be "jm-operator" indeed.
>>
>> On 31/01/2023 10:51, Jark Wu wrote:
>>
>> Hi Chesnay,
>>
>> > The new option must default to the configured value of
>> metrics.scope.jm.job because otherwise it isn't backwards-compatible with
>> existing coordinator metrics.
>> IIUC, there is only an OperatorCoordinatorMetricGroup interface without
>> implementations (it never worked). There were no coordinator metrics
>> implemented and registered. @Hang Ruan <ru...@gmail.com> please
>> correct me if I'm wrong.
>>
>> > Intuitively I think the config key should be
>> metrics.scope.jm.operator., similar to metrics.scope.jm.job.
>> However, the config key "metrics.scope.jm.job" is deprecated and is
>> replaced by "metrics.scope.jm-job". Similarly, "metrics.scope.tm.job" is
>> replaced with "metrics.scope.tm-job". That's why I though
>> "metrics.scope.jm-operator" might be more consistent.
>>
>> > We would only support operator_name, operator_id, task_name and task_id
>> (+ everything else from the jm/job groups of course).
>> +1
>>
>> Best,
>> Jark
>>
>>
>>
>> On Mon, 30 Jan 2023 at 20:09, Chesnay Schepler <ch...@apache.org>
>> wrote:
>>
>>> The new option must default to the configured value of
>>> metrics.scope.jm.job because otherwise it isn't backwards-compatible with
>>> existing coordinator metrics.
>>> According to some earlier response in this thread there are already
>>> coordinator metrics.
>>>
>>> If that weren't the case you'd be correct, and we'd keep the default
>>> consistent with the default for tm operator metrics.
>>>
>>> Intuitively I think the config key should be metrics.scope.jm.operator.,
>>> similar to metrics.scope.jm.job.
>>> For symmetry we should also add new metrics.scope.tm.operator/task and
>>> eventually phase out metrics.scope.operator.
>>>
>>> > Regarding "coordinator metrics don't *belong to* any tasks/vertexes",
>>> I mean the coordinator doesn't run on specific tasks, so it doesn't have
>>> runtime information of tasks.
>>>
>>> That is true. I intentionally only referred to task id/name, as things
>>> like subtask index or attempt ids don't make sense for coordinators as you
>>> said.
>>> We would only support operator_name, operator_id, task_name and task_id
>>> (+ everything else from the jm/job groups of course).
>>> (Maybe in the future attempt_num could be interesting though)
>>>
>>> On 30/01/2023 10:39, Jark Wu wrote:
>>>
>>> Hi Chesnay,
>>>
>>> IIUC, our only disagreement is whether to support task info in the
>>> metric format. But we all agree with the default metric format
>>> "metrics.scope.jm-operator=<host>.jobmanager.<job_name>.<operator_name>"
>>> which doesn't include task info, right?
>>>
>>> Regarding "coordinator metrics don't *belong to* any tasks/vertexes", I
>>> mean the coordinator doesn't run on specific tasks, so it doesn't have
>>> runtime information of tasks. My concern about adding task info to the
>>> coordinator metric is that
>>> 1) if we register coordinator metrics for every subtask, the number of
>>> coordinator metrics will be exploded when task parallelism is large. This
>>> is an overhead to JM & metric systems, and the duplicated metrics are
>>> useless for users.
>>> 2) Correct me if I'm wrong, MetricGroup#counter only supports
>>> registering one metric. It seems it can't register a metric reporting for
>>> multiple metric names (e.g. multiple subtasks).
>>>
>>> Best,
>>> Jark
>>>
>>>
>>> On Mon, 30 Jan 2023 at 16:11, Chesnay Schepler <ch...@apache.org>
>>> wrote:
>>>
>>>> I hadn't looked in detail in to how the task info can be introduced,
>>>> but
>>>> given that the coordinates are created by the execution graph, where we
>>>> only work on tasks, it should be possible and rather simple.
>>>>
>>>> The OperatorCoordinatorHolder gets access to the ExecutionJobVertex
>>>> from
>>>> which it can extract everything and create the task/operator groups.
>>>> Exposure to the coordinator could happen via the coordinator context as
>>>> originally planned.
>>>>
>>>>  > As we know, coordinator metrics are just like JMJobMetricGroup,
>>>> which
>>>> don't belong to any tasks/vertexes.
>>>>
>>>> I don't think this is true, every coordinator is scoped to a specific
>>>> operator, and every operator is associated with a particular vertex.
>>>>
>>>>  From the javadocs:
>>>>
>>>> " A coordinator for runtime operators. The OperatorCoordinator runs on
>>>> the master, associated with
>>>>   the job vertex of the operator. It communicates with operators via
>>>> sending operator events."
>>>>
>>>> On 28/01/2023 09:15, Jark Wu wrote:
>>>> > Hi all,
>>>> >
>>>> > IIUC, Chesnay means we should have a more general metric group for
>>>> > **operator**
>>>> > not for **coordinator** in JM, this would be useful to extend
>>>> > other operator-specific
>>>> > metrics in the future. That means the new scope format should be
>>>> designed
>>>> > for
>>>> > the operator,
>>>> > e.g.,
>>>> metrics.scope.jm-operator=<host>.jobmanager.<job_name>.<operator_name>
>>>> > The coordinator metric is a subgroup (a constant "coordinator"
>>>> suffix) of
>>>> > the JMOperatorMG.
>>>> >
>>>> > I think this is a nice design. However, I have a question about
>>>> adding task
>>>> > id/name to this list.
>>>> > How to get the task id/name when reporting a coordinator metric? As we
>>>> > know, coordinator metrics
>>>> > are just like JMJobMetricGroup, which don't belong to any
>>>> tasks/vertexes.
>>>> > Do you mean reporting
>>>> > coordinator metrics for every task id under the operator?
>>>> >
>>>> > Best,
>>>> > Jark
>>>> >
>>>> >
>>>> > On Thu, 19 Jan 2023 at 17:33, Chesnay Schepler <ch...@apache.org>
>>>> wrote:
>>>> >
>>>> >>   > First, I do not understand why users have to configure the new
>>>> scope
>>>> >> format, which has a default value.
>>>> >>
>>>> >> If you don't use scope formats, sure. If you do use scope formats,
>>>> e.g.
>>>> >> to add a common prefix (which is the case for datadog users for
>>>> >> example), then the current default in the FLIP is insufficient and
>>>> >> requires the user to update the configuration.
>>>> >>
>>>> >>   > I usually do not need to change these configuration of scope
>>>> formats
>>>> >> when submitting the flink job.
>>>> >>
>>>> >> Scope formats as a whole are quite a power-user feature, but that
>>>> >> doesn't mean we should ignore it.
>>>> >>
>>>> >>   > I try to let it extend the ComponentMetricGroup
>>>> >>
>>>> >> This isn't inherently required just because it is a "component".
>>>> >> Component metric groups should _only_ be used for cases where such a
>>>> >> component provides several bits of metadata.
>>>> >> For example, tasks provide vertex ids, task names, attempted IDs
>>>> etc.,
>>>> >> and we'd like users to have the option on how this metadata ends up
>>>> >> being used (via scope formats). This currently can't be built with
>>>> the
>>>> >> addGroup() methods, hence why the component groups exist.
>>>> >> However, the operator coordinator _itself_does not provide several
>>>> bits
>>>> >> of metadata. Logically, the _operator_ does, not the coordinator.
>>>> >>
>>>> >>   > This make me decide to add a new scope format.
>>>> >>
>>>> >> And this is fine; just don't add a new scope format for the
>>>> coordinator
>>>> >> but logically for operators on the JM side, such that we can extend
>>>> the
>>>> >> set of operator-specific metrics in the future without having to add
>>>> yet
>>>> >> another scope format.
>>>> >>
>>>> >>   > The relationship between the Operator and OperatorCoordinator is
>>>> >> maintained in the OperatorCoordinatorHolder. In the POC, the operator
>>>> >> name/id could be found and the OperatorCoordinatorMetricGroup will be
>>>> >> created here.
>>>> >>
>>>> >> That's all irrelevant.
>>>> >>
>>>> >>   > The registered metrics of OperatorCoordinator always are those
>>>> which
>>>> >> can not be aggregated from the tasks, like the number of unassigned
>>>> >> splits in the SourceCoordinator. Actually we have not encountered the
>>>> >> scenario you mentioned. I think the OperatorCoordinatorMetricGroup
>>>> >> should only contain the metrics for itself instead of its subtasks.
>>>> >>
>>>> >> You are misunderstanding the problem.
>>>> >>
>>>> >> This isn't about aggregating metrics on our side or exposing metrics
>>>> >> from TMs on the JM side or anything like that.
>>>> >>
>>>> >> It's purely about the available metadata for operator metrics on the
>>>> JM
>>>> >> side. Currently you suggest operator name / id; what I'm proposing
>>>> is to
>>>> >> add task(==vertex!) id / name to this list.
>>>> >>
>>>> >> The benefit here is simple. I can look up all metrics where
>>>> task_id==XXX
>>>> >> and get _everything_ related to that vertex, including all metrics
>>>> >> associated with operators that are part of that vertex, including the
>>>> >> coordinator.
>>>> >>
>>>> >>   > Using metrics.scope.jm.job is not enough to distinguish the
>>>> different
>>>> >> coordinators.
>>>> >>
>>>> >> I did not suggest just using metrics.scope.jm.job. I suggested that
>>>> as
>>>> >> the default for the jm operator scope format, that is used for the
>>>> >> proposed JobManagerOperatorMetricGroup, which will have some plain
>>>> >> metric groups as children for coordinators.
>>>> >> Aka, you have the JMOperatorMG, then you call
>>>> addGroup("coordinators")
>>>> >> and then addGroup(coordinator_name) for each coordinator.
>>>> >>
>>>> >>   > So there are two choice for the
>>>> >> InternalOperatorCoordinatorMetricGroup: 1. add and improve the new
>>>> scope
>>>> >> format; 2. use the metrics.scope.jm.job and ProxyMetricGroup. Which
>>>> one
>>>> >> is better?
>>>> >>
>>>> >> There are more options than that.
>>>> >> I feel like there are a lot of misunderstandings in this discussion.
>>>> >> Please let me know if I could clear things up. If not I can also
>>>> provide
>>>> >> a PoC based on your PoC if that can speed things up. It's not that
>>>> >> different anyway.
>>>> >>
>>>> >> On 19/01/2023 07:39, Hang Ruan wrote:
>>>> >>> Hi, chesnay,
>>>> >>>
>>>> >>> Thanks for your reply. I still have some doubts about the questions
>>>> >>> you raised.
>>>> >>>
>>>> >>>      > Extending the set of ScopeFormats is problematic because it
>>>> in
>>>> >>>      practice
>>>> >>>      it breaks the config if users actively rely on it, since
>>>> there's now
>>>> >>>      another key that they _must_ set for it to be
>>>> >>>      consistent/compatible with
>>>> >>>      their existing setup.
>>>> >>>      Unfortunately due to how powerful scope formats are we can't
>>>> derive a
>>>> >>>      default value that matches their existing setup.
>>>> >>>      Hence we should try to do this as rarely as possible.
>>>> >>>
>>>> >>>
>>>> >>>      > This FLIP does not adhere to that since it proposes a
>>>> dedicated
>>>> >>>      format
>>>> >>>      for coordinators; next time we want to expose operator-specific
>>>> >>>      metrics
>>>> >>>      (e.g., in the scheduler) we'd have to add another one to
>>>> support it.
>>>> >>>
>>>> >>>
>>>> >>> First, I do not understand why users have to configure the new scope
>>>> >>> format, which has a default value. I usually do not need to change
>>>> >>> these configuration of scope formats when submitting the flink job.
>>>> >>> IMO, the OperatorCoordinator is a component running at the
>>>> JobMaster.
>>>> >>> I try to let it extend the ComponentMetricGroup, but I cannot find
>>>> >>> a suitable scope. This make me decide to add a new scope format.
>>>> >>>
>>>> >>>      > Additionally, the configurable variables (operator name/id)
>>>> are
>>>> >>>      logically not attached to the coordinator, but operators, so
>>>> to me it
>>>> >>>      just doesn't make sense to structure it like this.
>>>> >>>
>>>> >>>
>>>> >>> The relationship between the Operator and OperatorCoordinator is
>>>> >>> maintained in the OperatorCoordinatorHolder. In the POC, the
>>>> operator
>>>> >>> name/id could be found and the OperatorCoordinatorMetricGroup will
>>>> be
>>>> >>> created here.
>>>> >>>
>>>> >>>      > Another thing I'm concerned about is that, because we don't
>>>> include
>>>> >>>      tasks in the hierarchy, users wishing to collect all metrics
>>>> for a
>>>> >>>      particular task (in this case ==vertex) now have to go
>>>> >>>      significantly out
>>>> >>>      of their way to get them, since they can no longer just filter
>>>> by the
>>>> >>>      task ID but have to be filter for _all_ operators that are
>>>> part of
>>>> >>>      the task.
>>>> >>>
>>>> >>>
>>>> >>> The registered metrics of OperatorCoordinator always are those which
>>>> >>> can not be aggregated from the tasks, like the number of unassigned
>>>> >>> splits in the SourceCoordinator. Actually we have not encountered
>>>> the
>>>> >>> scenario you mentioned. I think the OperatorCoordinatorMetricGroup
>>>> >>> should only contain the metrics for itself instead of its subtasks.
>>>> >>>
>>>> >>> Using metrics.scope.jm.job is not enough to distinguish the
>>>> different
>>>> >>> coordinators. The operator id/name is necessary. Then
>>>> >>> the implementation should be like this. But users can not change the
>>>> >>> scope in this way. This is also acceptable.
>>>> >>> @Internal public class InternalOperatorCoordinatorMetricGroup
>>>> >>>           extends ProxyMetricGroup<MetricGroup>
>>>> >>>           implements OperatorCoordinatorMetricGroup {
>>>> >>>       public InternalOperatorCoordinatorMetricGroup(
>>>> >>>               JobManagerJobMetricGroup parent, OperatorID
>>>> operatorID,
>>>> >> String operatorName) {
>>>> >>>           super(parent.addGroup(operatorID +
>>>> >> operatorName).addGroup("coordinator")); }
>>>> >>> }
>>>> >>> So there are two choice for the
>>>> >>> InternalOperatorCoordinatorMetricGroup: 1. add and improve the new
>>>> >>> scope format; 2. use the metrics.scope.jm.job and ProxyMetricGroup.
>>>> >>> Which one is better?
>>>> >>>
>>>> >>> Best,
>>>> >>> Hang
>>>> >>>
>>>> >>>
>>>> >>> Chesnay Schepler <ch...@apache.org> 于2023年1月18日周三 17:03写道：
>>>> >>>
>>>> >>>      You're misunderstanding the problem.
>>>> >>>
>>>> >>>      Metric groups form a tree with each group providing certain
>>>> metadata.
>>>> >>>
>>>> >>>      E.g., on the taskmanager we have a TM metric group that
>>>> provides info
>>>> >>>      about the TM, that has child task metric groups, that have
>>>> child
>>>> >>>      operator metric groups etc. The operator metric group again has
>>>> >>>      children, where we are now mostly in the user/connector land,
>>>> like a
>>>> >>>      kafka metric group providing partition info or something.
>>>> >>>
>>>> >>>      What is being proposed is to create a coordinator metric group
>>>> on
>>>> >>>      the JM
>>>> >>>      side that provides operator metadata.
>>>> >>>
>>>> >>>      A more appropriate structure is to create an operator group on
>>>> the JM
>>>> >>>      side that provides this info, with the coordinator metric group
>>>> >>>      being a
>>>> >>>      child.
>>>> >>>
>>>> >>>      On 17/01/2023 19:56, Steven Wu wrote:
>>>> >>>      >> Additionally, the configurable variables (operator name/id)
>>>> are
>>>> >>>      > logically not attached to the coordinator, but operators, so
>>>> to
>>>> >>>      me it
>>>> >>>      > just doesn't make sense to structure it like this.
>>>> >>>      >
>>>> >>>      > Chesnay, maybe we should clarify the terminology. To me,
>>>> >>>      pperators (like
>>>> >>>      > FLIP-27 source) can have two parts (coordinator and
>>>> >>>      reader/subtask). I
>>>> >>>      > think it is fine to include operator name/id for coordinator
>>>> >>>      metrics.
>>>> >>>      >
>>>> >>>      > On Mon, Jan 16, 2023 at 2:13 AM Chesnay Schepler
>>>> >>>      <ch...@apache.org> wrote:
>>>> >>>      >
>>>> >>>      >> Slight correction: Using metrics.scope.jm.job as the default
>>>> >>>      should be
>>>> >>>      >> safe.
>>>> >>>      >>
>>>> >>>      >> On 16/01/2023 10:18, Chesnay Schepler wrote:
>>>> >>>      >>> The proposed ScopeFormat is still problematic for a few
>>>> reasons.
>>>> >>>      >>>
>>>> >>>      >>> Extending the set of ScopeFormats is problematic because
>>>> it in
>>>> >>>      >>> practice it breaks the config if users actively rely on
>>>> it, since
>>>> >>>      >>> there's now another key that they _must_ set for it to be
>>>> >>>      >>> consistent/compatible with their existing setup.
>>>> >>>      >>> Unfortunately due to how powerful scope formats are we
>>>> can't
>>>> >>>      derive a
>>>> >>>      >>> default value that matches their existing setup.
>>>> >>>      >>> Hence we should try to do this as rarely as possible.
>>>> >>>      >>>
>>>> >>>      >>> This FLIP does not adhere to that since it proposes a
>>>> >>>      dedicated format
>>>> >>>      >>> for coordinators; next time we want to expose
>>>> operator-specific
>>>> >>>      >>> metrics (e.g., in the scheduler) we'd have to add another
>>>> one to
>>>> >>>      >>> support it.
>>>> >>>      >>>
>>>> >>>      >>> Additionally, the configurable variables (operator
>>>> name/id) are
>>>> >>>      >>> logically not attached to the coordinator, but operators,
>>>> so
>>>> >>>      to me it
>>>> >>>      >>> just doesn't make sense to structure it like this.
>>>> >>>      >>>
>>>> >>>      >>> Another thing I'm concerned about is that, because we don't
>>>> >>>      include
>>>> >>>      >>> tasks in the hierarchy, users wishing to collect all
>>>> metrics for
>>>> >> a
>>>> >>>      >>> particular task (in this case ==vertex) now have to go
>>>> >>>      significantly
>>>> >>>      >>> out of their way to get them, since they can no longer just
>>>> >>>      filter by
>>>> >>>      >>> the task ID but have to be filter for _all_ operators that
>>>> are
>>>> >>>      part of
>>>> >>>      >>> the task.
>>>> >>>      >>>
>>>> >>>      >>> On 16/01/2023 03:09, Hang Ruan wrote:
>>>> >>>      >>>> Hi, @chesnay@apache.org <ch...@apache.org> ,
>>>> >>>      >>>>
>>>> >>>      >>>> Do you have time to help to review this FLIP again after
>>>> the
>>>> >>>      >>>> modification?
>>>> >>>      >>>> Looking forward to your reply.
>>>> >>>      >>>> This FLIP will add a new configuration for the
>>>> >>>      >>>> OperatorCoordinatorMetricGroup scope format. It provides
>>>> an
>>>> >>>      internal
>>>> >>>      >>>> implementation and is added as a component to the
>>>> >>>      >>>> JobManagerJobMetricGroup.
>>>> >>>      >>>> If something doesn't make sense, could you provide some
>>>> >>>      advice? It
>>>> >>>      >>>> will be
>>>> >>>      >>>> very helpful. Thanks a lot for your help.
>>>> >>>      >>>>
>>>> >>>      >>>> Best,
>>>> >>>      >>>> Hang
>>>> >>>      >>>>
>>>> >>>      >>>> Martijn Visser <ma...@apache.org> 于2023年1月11日周三
>>>> >>>      16:34写道：
>>>> >>>      >>>>
>>>> >>>      >>>>
>>>> >>>      >>>>> Hi Hang,
>>>> >>>      >>>>>
>>>> >>>      >>>>> I'm a bit surprised that this has gone to a vote, given
>>>> that
>>>> >>>      Chesnay
>>>> >>>      >>>>> deliberately mentioned that he would vote against it
>>>> as-is.
>>>> >>>      I would
>>>> >>>      >>>>> expect
>>>> >>>      >>>>> that before going to a vote, he has had the opportunity
>>>> to
>>>> >>>      >>>>> participate in
>>>> >>>      >>>>> this discussion.
>>>> >>>      >>>>>
>>>> >>>      >>>>> Best regards,
>>>> >>>      >>>>>
>>>> >>>      >>>>> Martijn
>>>> >>>      >>>>>
>>>> >>>      >>>>> On Tue, Jan 3, 2023 at 12:53 PM Jark Wu <
>>>> imjark@gmail.com>
>>>> >>>      wrote:
>>>> >>>      >>>>>
>>>> >>>      >>>>>> Hi Dong,
>>>> >>>      >>>>>>
>>>> >>>      >>>>>> Regarding “SplitEnumeratorContext#metricGroup”, my only
>>>> >>>      concern is
>>>> >>>      >>>>>> that
>>>> >>>      >>>>>> this is a core interface for the FLIP. It’s hard to
>>>> tell how
>>>> >>>      >>>>>> sources use
>>>> >>>      >>>>>> metric group without mentioning this interface. Even if
>>>> >>>      this is an
>>>> >>>      >>>>> existing
>>>> >>>      >>>>>> API, I think it’s worth introducing the interface again
>>>> and
>>>> >>>      declaring
>>>> >>>      >>>>> that
>>>> >>>      >>>>>> we will implement the interface instead of a no-op
>>>> method
>>>> >>>      in this
>>>> >>>      >>>>>> FLIP.
>>>> >>>      >>>>>>
>>>> >>>      >>>>>> Anyway, this is a minor problem and shouldn’t block this
>>>> >>>      FLIP. I’m
>>>> >>>      >>>>>> +1 to
>>>> >>>      >>>>>> start a vote.
>>>> >>>      >>>>>>
>>>> >>>      >>>>>> Best,
>>>> >>>      >>>>>> Jark
>>>> >>>      >>>>>>
>>>> >>>      >>>>>>
>>>> >>>      >>>>>>> 2023年1月3日 10:03，Hang Ruan <ru...@gmail.com> 写道：
>>>> >>>      >>>>>>>
>>>> >>>      >>>>>>> Hi, Jark and Dong,
>>>> >>>      >>>>>>>
>>>> >>>      >>>>>>> Thanks for your comments. Sorry for my late reply.
>>>> >>>      >>>>>>>
>>>> >>>      >>>>>>> For suggestion 1, I plan to implement the
>>>> >>>      >>>>>>> SplitEnumeratorMetricGroup in
>>>> >>>      >>>>>>> another issue, and it is not contained in this FLIP. I
>>>> >>>      will add some
>>>> >>>      >>>>>>> description about this part.
>>>> >>>      >>>>>>> For suggestion 2, changes about
>>>> >>>      OperatorCoordinator#metricGroup has
>>>> >>>      >>>>>> already
>>>> >>>      >>>>>>> been documented in the proposed change section.
>>>> >>>      >>>>>>>
>>>> >>>      >>>>>>> Best,
>>>> >>>      >>>>>>> Hang
>>>> >>>      >>>>>>>
>>>> >>>      >>>>>>> Dong Lin <li...@gmail.com> 于2023年1月1日周日 09:45写道：
>>>> >>>      >>>>>>>
>>>> >>>      >>>>>>>> Let me chime-in and add comments regarding the public
>>>> >>>      interface
>>>> >>>      >>>>> section.
>>>> >>>      >>>>>>>> Please see my comments inline.
>>>> >>>      >>>>>>>>
>>>> >>>      >>>>>>>> On Thu, Dec 29, 2022 at 6:08 PM Jark Wu
>>>> >>>      <im...@gmail.com> wrote:
>>>> >>>      >>>>>>>>
>>>> >>>      >>>>>>>>> Hi Hang,
>>>> >>>      >>>>>>>>>
>>>> >>>      >>>>>>>>> Thanks for driving this discussion. I think this is a
>>>> >>>      very useful
>>>> >>>      >>>>>> feature
>>>> >>>      >>>>>>>>> for connectors.
>>>> >>>      >>>>>>>>>
>>>> >>>      >>>>>>>>> The FLIP looks quite good to me, and I just have two
>>>> >>>      suggestions.
>>>> >>>      >>>>>>>>>
>>>> >>>      >>>>>>>>> 1. In the "Public Interface" section, mention that
>>>> the
>>>> >>>      >>>>>>>>> implementation
>>>> >>>      >>>>>>>>> behavior of "SplitEnumeratorContext#metricGroup" is
>>>> >>>      changed from
>>>> >>>      >>>>>>>> returning
>>>> >>>      >>>>>>>>> null to returning a concrete
>>>> SplitEnumeratorMetricGroup
>>>> >>>      instance.
>>>> >>>      >>>>> Even
>>>> >>>      >>>>>>>>> though the API is already there, the behavior change
>>>> can
>>>> >>>      also be
>>>> >>>      >>>>>>>> considered
>>>> >>>      >>>>>>>>> a public change.
>>>> >>>      >>>>>>>>>
>>>> >>>      >>>>>>>> SplitEnumeratorContext#metricGroup is an interface and
>>>> >>>      this FLIP
>>>> >>>      >>>>>>>> does
>>>> >>>      >>>>>> not
>>>> >>>      >>>>>>>> seem to change its semantics/behavior. The FLIP does
>>>> >>>      change the
>>>> >>>      >>>>>>>> implementation/behavior of
>>>> >>>      SourceCoordinatorContext#metricGroup,
>>>> >>>      >>>>>>>> which
>>>> >>>      >>>>>> is
>>>> >>>      >>>>>>>> marked @Internal.
>>>> >>>      >>>>>>>>
>>>> >>>      >>>>>>>> Thus it might seem a bit weird to add in the public
>>>> >> interface
>>>> >>>      >>>>>>>> section
>>>> >>>      >>>>>>>> saying that we change the interface
>>>> >>>      >>>>>>>> SplitEnumeratorContext#metricGroup
>>>> >>>      >>>>>> from
>>>> >>>      >>>>>>>> returning null to non-null object.
>>>> >>>      >>>>>>>>
>>>> >>>      >>>>>>>> 2. Mention the newly added interface of
>>>> >>>      >>>>>> "OperatorCoordinator#metricGroup"
>>>> >>>      >>>>>>>>> in the "Proposed Changes" section or "Public
>>>> Interface"
>>>> >>>      section. As
>>>> >>>      >>>>> the
>>>> >>>      >>>>>>>>> FLIP said, OperatorCoordinator is widely used in many
>>>> >>>      connectors.
>>>> >>>      >>>>>> Though
>>>> >>>      >>>>>>>> it
>>>> >>>      >>>>>>>>> is still an @Internal API, I think it is worth
>>>> >>>      mentioning the
>>>> >>>      >>>>>>>>> change
>>>> >>>      >>>>> in
>>>> >>>      >>>>>>>> the
>>>> >>>      >>>>>>>>> FLIP.
>>>> >>>      >>>>>>>> Since OperatorCoordinator is an internal API, it seems
>>>> >>>      reasonable to
>>>> >>>      >>>>>>>> explain it in the proposed change section. The FLIP
>>>> seems
>>>> >>>      to have
>>>> >>>      >>>>>>>> documented this in point 5 of the proposed change
>>>> section.
>>>> >>>      >>>>>>>>
>>>> >>>      >>>>>>>> BTW, if we think there are @internal classes that are
>>>> >>>      important
>>>> >>>      >>>>>>>> enough
>>>> >>>      >>>>>> to
>>>> >>>      >>>>>>>> be added in the public interface section, it might be
>>>> >>>      useful to
>>>> >>>      >>>>>> explicitly
>>>> >>>      >>>>>>>> discuss this topic and document it in the "*What are
>>>> the
>>>> >>>      "public
>>>> >>>      >>>>>>>> interfaces" of the project*" in this
>>>> >>>      >>>>>>>> <
>>>> >>>      >>>>>>>>
>>>> >>>      >>
>>>> >>>
>>>> >>
>>>> https://cwiki.apache.org/confluence/display/FLINK/Flink+Improvement+Proposals
>>>> >>>      >>>>>>>> wiki.
>>>> >>>      >>>>>>>>
>>>> >>>      >>>>>>>>
>>>> >>>      >>>>>>>>> Best,
>>>> >>>      >>>>>>>>> Jark
>>>> >>>      >>>>>>>>>
>>>> >>>      >>>>>>>>>
>>>> >>>      >>>>>>>>> On Mon, 26 Dec 2022 at 18:06, Hang Ruan
>>>> >>>      <ru...@gmail.com>
>>>> >>>      >>>>>> wrote:
>>>> >>>      >>>>>>>>>> Hi, thanks for the feedback, Zhu Zhu and Qingsheng.
>>>> >>>      >>>>>>>>>> After combining everyone's comments, the main
>>>> concerns and
>>>> >>>      >>>>>>>> corresponding
>>>> >>>      >>>>>>>>>> adjustments are as follows.
>>>> >>>      >>>>>>>>>>
>>>> >>>      >>>>>>>>>> Q1: Common metrics are not quite useful.
>>>> >>>      >>>>>>>>>> numEventsIn and numEventsOut counters will be
>>>> removed
>>>> >>>      from the
>>>> >>>      >>>>>>>>>> OperatorCoordinatorMetricGroup. These common
>>>> metrics do
>>>> >> not
>>>> >>>      >>>>>>>>>> provide
>>>> >>>      >>>>>>>>> enough
>>>> >>>      >>>>>>>>>> information for users. The users are more willing to
>>>> >>>      get the
>>>> >>>      >>>>>>>>>> number
>>>> >>>      >>>>> of
>>>> >>>      >>>>>>>>>> events of the specified type instead of the total
>>>> >>>      number. And this
>>>> >>>      >>>>>>>> metric
>>>> >>>      >>>>>>>>>> is calculated differently. The implementation could
>>>> >>>      register the
>>>> >>>      >>>>>> metric
>>>> >>>      >>>>>>>>> by
>>>> >>>      >>>>>>>>>> themselves.
>>>> >>>      >>>>>>>>>>
>>>> >>>      >>>>>>>>>> Q2: This FLIP is overly complicated.
>>>> >>>      >>>>>>>>>> This FLIP will become concise after these
>>>> modifications.
>>>> >>>      >>>>>>>>>> OperatorCoordinatorMetricGroup has already been
>>>> >>>      introduced into
>>>> >>>      >>>>> Flink
>>>> >>>      >>>>>>>> by
>>>> >>>      >>>>>>>>>> FLIP-179<
>>>> >>>      >>>>>>>>>>
>>>> >>>      >>>>>>>>>>
>>>> >>>      >>
>>>> >>>
>>>> >>
>>>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-179%3A+Expose+Standardized+Operator+Metrics
>>>> >>>      >>>>>>>>>>> .
>>>> >>>      >>>>>>>>>> And
>>>> >>>      >>>>>>>>>> this FLIP will not change it. This FLIP only
>>>> provides a
>>>> >>>      new metric
>>>> >>>      >>>>>>>> option
>>>> >>>      >>>>>>>>>> and a new metric group scope. The changes in
>>>> proposed
>>>> >>>      changes
>>>> >>>      >>>>> provide
>>>> >>>      >>>>>>>> the
>>>> >>>      >>>>>>>>>> details about the modifications for the internal
>>>> >>>      classes, which
>>>> >>>      >>>>> might
>>>> >>>      >>>>>>>>> make
>>>> >>>      >>>>>>>>>> it look complicated.
>>>> >>>      >>>>>>>>>>
>>>> >>>      >>>>>>>>>> Thanks for all the comments again. If there are no
>>>> further
>>>> >>>      >>>>>>>>>> comments,
>>>> >>>      >>>>>> we
>>>> >>>      >>>>>>>>>> plan to start the voting thread this week.
>>>> >>>      >>>>>>>>>>
>>>> >>>      >>>>>>>>>> Best,
>>>> >>>      >>>>>>>>>> Hang
>>>> >>>      >>>>>>>>>>
>>>> >>>      >>>>>>>>>> Qingsheng Ren <re...@gmail.com> 于2022年12月26日周一
>>>> >>>      16:48写道：
>>>> >>>      >>>>>>>>>>
>>>> >>>      >>>>>>>>>>
>>>> >>>      >>>>>>>>>>> Thanks for the FLIP, Hang!
>>>> >>>      >>>>>>>>>>>
>>>> >>>      >>>>>>>>>>> This FLIP overall looks good to me. Actually I
>>>> share
>>>> >>>      the same
>>>> >>>      >>>>> concern
>>>> >>>      >>>>>>>>>> with
>>>> >>>      >>>>>>>>>>> Zhu that numEventsIn and numEventsOut counters are
>>>> not
>>>> >>>      quite
>>>> >>>      >>>>>>>>>>> useful
>>>> >>>      >>>>>>>> to
>>>> >>>      >>>>>>>>>> end
>>>> >>>      >>>>>>>>>>> users. OperatorEvent is a quite low-level
>>>> abstraction,
>>>> >>>      which
>>>> >>>      >>>>> requires
>>>> >>>      >>>>>>>>>>> instantialization in order to be practical to
>>>> users and
>>>> >>>      >>>>>>>>>>> developers,
>>>> >>>      >>>>>>>> so
>>>> >>>      >>>>>>>>>>> maybe it's better to exclude them from the FLIP.
>>>> >>>      >>>>>>>>>>>
>>>> >>>      >>>>>>>>>>> Best,
>>>> >>>      >>>>>>>>>>> Qingsheng
>>>> >>>      >>>>>>>>>>> Ververica (Alibaba)
>>>> >>>      >>>>>>>>>>>
>>>> >>>      >>>>>>>>>>> On Mon, Dec 26, 2022 at 12:08 PM Zhu Zhu
>>>> >>>      <re...@gmail.com>
>>>> >>>      >>>>> wrote:
>>>> >>>      >>>>>>>>>>>> Hi Hang,
>>>> >>>      >>>>>>>>>>>> I still see no strong reason why we need
>>>> >>>      >>>>>>>>>>>> numEventsIn/numEventsOut
>>>> >>>      >>>>>>>>>>> metrics.
>>>> >>>      >>>>>>>>>>>> In the discussion in FLINK-29801, I can see the
>>>> same
>>>> >>>      concern
>>>> >>>      >>>>>>>>>>>> from
>>>> >>>      >>>>>>>>>> others.
>>>> >>>      >>>>>>>>>>>> So I prefer to exclude them from this FLIP to
>>>> avoid
>>>> >>>      >>>>>>>>>>>> over-extending
>>>> >>>      >>>>>>>>> the
>>>> >>>      >>>>>>>>>>>> scope.
>>>> >>>      >>>>>>>>>>>>
>>>> >>>      >>>>>>>>>>>> Thanks,
>>>> >>>      >>>>>>>>>>>> Zhu
>>>> >>>      >>>>>>>>>>>>
>>>> >>>      >>>>>>>>>>>> Hang Ruan <ru...@gmail.com> 于2022年12月23日周五
>>>> >>>      >>>>>>>>>>>> 15:21写道：
>>>> >>>      >>>>>>>>>>>>> Hi everyone,
>>>> >>>      >>>>>>>>>>>>>
>>>> >>>      >>>>>>>>>>>>> As for the Zhu Zhu's problem, I think we should
>>>> keep
>>>> >>>      the common
>>>> >>>      >>>>>>>>>>> metrics,
>>>> >>>      >>>>>>>>>>>> which will help to observe incoming and outgoing
>>>> >>>      events. What do
>>>> >>>      >>>>>>>> you
>>>> >>>      >>>>>>>>>>> think,
>>>> >>>      >>>>>>>>>>>> @Zhu Zhu ?
>>>> >>>      >>>>>>>>>>>>> And @Chesnay, are there any other issues you are
>>>> >>>      more concerned
>>>> >>>      >>>>>>>>>> about?
>>>> >>>      >>>>>>>>>>>> Looking forward to your reply.
>>>> >>>      >>>>>>>>>>>>> Thanks for all the comments. If there are no
>>>> further
>>>> >>>      >>>>>>>>>>>>> comments, we
>>>> >>>      >>>>>>>>>> plan
>>>> >>>      >>>>>>>>>>>> to start the voting thread next week.
>>>> >>>      >>>>>>>>>>>>> Best,
>>>> >>>      >>>>>>>>>>>>> Hang
>>>> >>>      >>>>>>>>>>>>>
>>>> >>>      >>>>>>>>>>>>> Hang Ruan <ru...@gmail.com>
>>>> 于2022年12月15日周四
>>>> >>>      >>>>>>>>>>>>> 16:49写道：
>>>> >>>      >>>>>>>>>>>>>> Hi, Zhu Zhu,
>>>> >>>      >>>>>>>>>>>>>>
>>>> >>>      >>>>>>>>>>>>>> Thanks for your feedback!
>>>> >>>      >>>>>>>>>>>>>>
>>>> >>>      >>>>>>>>>>>>>> The OperatorCoordinator implementations are
>>>> >>>      different. And
>>>> >>>      >>>>>>>>>>>>>> their
>>>> >>>      >>>>>>>>>>>> metrics are much different too. We try to find the
>>>> >> common
>>>> >>>      >>>>>>>>>>>> metrics
>>>> >>>      >>>>>>>> and
>>>> >>>      >>>>>>>>>> put
>>>> >>>      >>>>>>>>>>>> them in the OperatorCoordinatorMetricGroup. If
>>>> most
>>>> >>>      developers
>>>> >>>      >>>>>>>> think
>>>> >>>      >>>>>>>>> we
>>>> >>>      >>>>>>>>>>> do
>>>> >>>      >>>>>>>>>>>> not need these common metrics, removing the common
>>>> >>>      metrics is
>>>> >>>      >>>>>>>>>>>> also
>>>> >>>      >>>>>>>>>>>> acceptable.
>>>> >>>      >>>>>>>>>>>>>> Best,
>>>> >>>      >>>>>>>>>>>>>> Hang
>>>> >>>      >>>>>>>>>>>>>>
>>>> >>>      >>>>>>>>>>>>>> Zhu Zhu <re...@gmail.com> 于2022年12月14日周三
>>>> >>>      22:09写道：
>>>> >>>      >>>>>>>>>>>>>>> Hi Hang & MengYue,
>>>> >>>      >>>>>>>>>>>>>>>
>>>> >>>      >>>>>>>>>>>>>>> Thanks for creating this FLIP!
>>>> >>>      >>>>>>>>>>>>>>>
>>>> >>>      >>>>>>>>>>>>>>> I think it is very useful, mainly in two
>>>> aspects:
>>>> >>>      >>>>>>>>>>>>>>> 1. Enables OperatorCoordinators to register
>>>> metrics.
>>>> >>>      >>>>>>>>>>>>>>> Currently
>>>> >>>      >>>>>>>>>>>>>>> the coordinators has no way to do this. And
>>>> operator
>>>> >>>      >>>>>>>> coordinator
>>>> >>>      >>>>>>>>>>>>>>> metric group further enables the
>>>> SplitEnumerator
>>>> >>>      to have
>>>> >>>      >>>>>>>>>>>>>>> access
>>>> >>>      >>>>>>>>>>>>>>> to a registered metric group (via the existing
>>>> public
>>>> >>>      >>>>>>>>>>>>>>> interface
>>>> >>>      >>>>>>>>>>>>>>> SplitEnumeratorContext#metricGroup()), which is
>>>> >>>      null at the
>>>> >>>      >>>>>>>>> moment.
>>>> >>>      >>>>>>>>>>>>>>> 2. Defines the scope of operator coordinator
>>>> >>>      metrics. A clear
>>>> >>>      >>>>>>>>>>>> definition
>>>> >>>      >>>>>>>>>>>>>>> makes it easy for users to find their wanted
>>>> >>>      metrics. The
>>>> >>>      >>>>>>>>>> definition
>>>> >>>      >>>>>>>>>>>>>>> also helps to avoid conflicts of metrics from
>>>> >> multiple
>>>> >>>      >>>>>>>>>>>> OperatorCoordinators
>>>> >>>      >>>>>>>>>>>>>>> of the same kind. E.g. each SourceCoordinator
>>>> may
>>>> >>>      have its
>>>> >>>      >>>>>>>>>>>>>>> own
>>>> >>>      >>>>>>>>>>>>>>> numSourceSplits metric, these metrics should
>>>> not
>>>> >>>      be directly
>>>> >>>      >>>>>>>>>>> registered
>>>> >>>      >>>>>>>>>>>>>>> to the job metric group.
>>>> >>>      >>>>>>>>>>>>>>>
>>>> >>>      >>>>>>>>>>>>>>> What I'm a bit concerned is the necessity of
>>>> the
>>>> >>>      introduced
>>>> >>>      >>>>>>>>> common
>>>> >>>      >>>>>>>>>>>> metrics
>>>> >>>      >>>>>>>>>>>>>>> numEventsInCounter & numEventsOutCounter. If
>>>> there
>>>> >>>      any case
>>>> >>>      >>>>>>>> which
>>>> >>>      >>>>>>>>>>>> strongly
>>>> >>>      >>>>>>>>>>>>>>> requires them?
>>>> >>>      >>>>>>>>>>>>>>>
>>>> >>>      >>>>>>>>>>>>>>> Regarding the concerns of Chesnay,
>>>> >>>      >>>>>>>>>>>>>>>> A dedicated coordinator MG implementation is
>>>> >> overkill
>>>> >>>      >>>>>>>>>>>>>>> Directly using the job metric group can result
>>>> in
>>>> >>>      metric
>>>> >>>      >>>>>>>>> conflicts,
>>>> >>>      >>>>>>>>>>> as
>>>> >>>      >>>>>>>>>>>> mentioned
>>>> >>>      >>>>>>>>>>>>>>> in above #2.
>>>> >>>      >>>>>>>>>>>>>>>
>>>> >>>      >>>>>>>>>>>>>>> Thanks,
>>>> >>>      >>>>>>>>>>>>>>> Zhu
>>>> >>>      >>>>>>>>>>>>>>>
>>>> >>>      >>>>>>>>>>>>>>> Dong Lin <li...@gmail.com> 于2022年12月10日周六
>>>> >>>      >>>>>>>>>>>>>>> 14:16写道：
>>>> >>>      >>>>>>>>>>>>>>>
>>>> >>>      >>>>>>>>>>>>>>>> Hi Chesney,
>>>> >>>      >>>>>>>>>>>>>>>>
>>>> >>>      >>>>>>>>>>>>>>>> Just to double check with you,
>>>> >>>      >>>>>>>>>>>>>>>> OperatorCoordinatorMetricGroup
>>>> >>>      >>>>>>>>>>>> (annotated as
>>>> >>>      >>>>>>>>>>>>>>>> @PublicEvolving) has already been introduced
>>>> into
>>>> >>>      Flink by
>>>> >>>      >>>>>>>>>> FLIP-179
>>>> >>>      >>>>>>>>>>>>>>>> <
>>>> >>>      >>
>>>> >>>
>>>> >>
>>>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-179%3A+Expose+Standardized+Operator+Metrics
>>>> >>>      >>>>>>>>>>>>> .
>>>> >>>      >>>>>>>>>>>>>>>> And that FLIP has got you +1.. Do you mean we
>>>> >>>      should remove
>>>> >>>      >>>>>>>>> this
>>>> >>>      >>>>>>>>>>>>>>>> OperatorCoordinatorMetricGroup?
>>>> >>>      >>>>>>>>>>>>>>>>
>>>> >>>      >>>>>>>>>>>>>>>> Regards,
>>>> >>>      >>>>>>>>>>>>>>>> Dong
>>>> >>>      >>>>>>>>>>>>>>>>
>>>> >>>      >>>>>>>>>>>>>>>> On Sat, Dec 10, 2022 at 1:33 AM Chesnay
>>>> Schepler <
>>>> >>>      >>>>>>>>>>> chesnay@apache.org>
>>>> >>>      >>>>>>>>>>>> wrote:
>>>> >>>      >>>>>>>>>>>>>>>>> As a whole I feel like this FLIP is overly
>>>> >>>      complicated. A
>>>> >>>      >>>>>>>>>>> dedicated
>>>> >>>      >>>>>>>>>>>>>>>>> coordinator MG implementation is overkill; it
>>>> >>>      could just
>>>> >>>      >>>>>>>>> re-use
>>>> >>>      >>>>>>>>>>> the
>>>> >>>      >>>>>>>>>>>>>>>>> existing Task/OperatorMGs to create the same
>>>> >>>      structure we
>>>> >>>      >>>>>>>>> have
>>>> >>>      >>>>>>>>>> on
>>>> >>>      >>>>>>>>>>>> TMs,
>>>> >>>      >>>>>>>>>>>>>>>>> similar to what we did with the Job MG.
>>>> >>>      >>>>>>>>>>>>>>>>>
>>>> >>>      >>>>>>>>>>>>>>>>> However, I'm not convinced that this is
>>>> required
>>>> >>>      anyway,
>>>> >>>      >>>>>>>>>> because
>>>> >>>      >>>>>>>>>>>> all the
>>>> >>>      >>>>>>>>>>>>>>>>> example metrics you listed can be
>>>> implemented on
>>>> >>>      the TM
>>>> >>>      >>>>>>>> side
>>>> >>>      >>>>>>>>> +
>>>> >>>      >>>>>>>>>>>>>>>>> aggregating them in the external metrics
>>>> backend.
>>>> >>>      >>>>>>>>>>>>>>>>>
>>>> >>>      >>>>>>>>>>>>>>>>> Since I'm on holidays soon, just so no one
>>>> tries
>>>> >>>      to pull a
>>>> >>>      >>>>>>>>> fast
>>>> >>>      >>>>>>>>>>>> one on
>>>> >>>      >>>>>>>>>>>>>>>>> me, if this were to go to a vote as-is I'd be
>>>> >>>      against it.
>>>> >>>      >>>>>>>>>>>>>>>>>
>>>> >>>      >>>>>>>>>>>>>>>>>
>>>> >>>      >>>>>>>>>>>>>>>>> On 09/12/2022 15:30, Dong Lin wrote:
>>>> >>>      >>>>>>>>>>>>>>>>>> Hi Hang,
>>>> >>>      >>>>>>>>>>>>>>>>>>
>>>> >>>      >>>>>>>>>>>>>>>>>> Thanks for the FLIP! The FLIP looks good
>>>> and it
>>>> >>>      is pretty
>>>> >>>      >>>>>>>>>>>> informative.
>>>> >>>      >>>>>>>>>>>>>>>>>> I have just two minor comments regarding
>>>> names:
>>>> >>>      >>>>>>>>>>>>>>>>>> - Would it be useful to rename the config
>>>> key as
>>>> >>>      >>>>>>>>>>>>>>>>>> *metrics.scope.jm.job.operator-coordinator*
>>>> for
>>>> >>>      >>>>>>>> consistency
>>>> >>>      >>>>>>>>>>> with
>>>> >>>      >>>>>>>>>>>>>>>>>> *metrics.scope.jm.job
>>>> >>>      >>>>>>>>>>>>>>>>>> *(which is not named as *jm-job)?
>>>> >>>      >>>>>>>>>>>>>>>>>> - Maybe rename the variable as
>>>> >>>      >>>>>>>>>>> SCOPE_NAMING_OPERATOR_COORDINATOR
>>>> >>>      >>>>>>>>>>>> for
>>>> >>>      >>>>>>>>>>>>>>>>>> simplicity and consistency with
>>>> >>>      SCOPE_NAMING_OPERATOR
>>>> >>>      >>>>>>>>> (which
>>>> >>>      >>>>>>>>>> is
>>>> >>>      >>>>>>>>>>>> not named
>>>> >>>      >>>>>>>>>>>>>>>>>> as SCOPE_NAMING_TM_JOB_OPERATOR)?
>>>> >>>      >>>>>>>>>>>>>>>>>>
>>>> >>>      >>>>>>>>>>>>>>>>>> Cheers,
>>>> >>>      >>>>>>>>>>>>>>>>>> Dong
>>>> >>>      >>>>>>>>>>>>>>>>>>
>>>> >>>      >>>>>>>>>>>>>>>>>>
>>>> >>>      >>>>>>>>>>>>>>>>>>
>>>> >>>      >>>>>>>>>>>>>>>>>> On Thu, Dec 8, 2022 at 3:28 PM Hang Ruan <
>>>> >>>      >>>>>>>>>>> ruanhang1993@gmail.com>
>>>> >>>      >>>>>>>>>>>> wrote:
>>>> >>>      >>>>>>>>>>>>>>>>>>> Hi all,
>>>> >>>      >>>>>>>>>>>>>>>>>>>
>>>> >>>      >>>>>>>>>>>>>>>>>>> MengYue and I created FLIP-274[1] Introduce
>>>> >>>      metric group
>>>> >>>      >>>>>>>>> for
>>>> >>>      >>>>>>>>>>>>>>>>>>> OperatorCoordinator. OperatorCoordinator
>>>> is the
>>>> >>>      >>>>>>>>> coordinator
>>>> >>>      >>>>>>>>>>> for
>>>> >>>      >>>>>>>>>>>> runtime
>>>> >>>      >>>>>>>>>>>>>>>>>>> operators and running on Job Manager. The
>>>> >>>      coordination
>>>> >>>      >>>>>>>>>>>> mechanism is
>>>> >>>      >>>>>>>>>>>>>>>>>>> operator events between OperatorCoordinator
>>>> >>>      and its all
>>>> >>>      >>>>>>>>>>>> operators, the
>>>> >>>      >>>>>>>>>>>>>>>>>>> coordination is more and more using in
>>>> Flink, for
>>>> >>>      >>>>>>>> example
>>>> >>>      >>>>>>>>>> many
>>>> >>>      >>>>>>>>>>>> Sources
>>>> >>>      >>>>>>>>>>>>>>>>> and
>>>> >>>      >>>>>>>>>>>>>>>>>>> Sinks depend on the mechanism to assign
>>>> splits
>>>> >> and
>>>> >>>      >>>>>>>>>> coordinate
>>>> >>>      >>>>>>>>>>>> commits to
>>>> >>>      >>>>>>>>>>>>>>>>>>> external systems. The OperatorCoordinator
>>>> is
>>>> >>>      widely
>>>> >>>      >>>>>>>> using
>>>> >>>      >>>>>>>>> in
>>>> >>>      >>>>>>>>>>>> flink kafka
>>>> >>>      >>>>>>>>>>>>>>>>>>> connector, flink pulsar connector, flink
>>>> cdc
>>>> >>>      connector,
>>>> >>>      >>>>>>>>>> flink
>>>> >>>      >>>>>>>>>>>> hudi
>>>> >>>      >>>>>>>>>>>>>>>>>>> connector and so on.
>>>> >>>      >>>>>>>>>>>>>>>>>>>
>>>> >>>      >>>>>>>>>>>>>>>>>>> But there is not a suitable metric group
>>>> scope
>>>> >>>      for the
>>>> >>>      >>>>>>>>>>>>>>>>> OperatorCoordinator
>>>> >>>      >>>>>>>>>>>>>>>>>>> and not an implementation for the interface
>>>> >>>      >>>>>>>>>>>>>>>>> OperatorCoordinatorMetricGroup.
>>>> >>>      >>>>>>>>>>>>>>>>>>> These metrics in OperatorCoordinator could
>>>> be
>>>> >>>      how many
>>>> >>>      >>>>>>>>>>>> splits/partitions
>>>> >>>      >>>>>>>>>>>>>>>>>>> have been assigned to source readers, how
>>>> many
>>>> >>>      files
>>>> >>>      >>>>>>>> have
>>>> >>>      >>>>>>>>>> been
>>>> >>>      >>>>>>>>>>>> written
>>>> >>>      >>>>>>>>>>>>>>>>> out
>>>> >>>      >>>>>>>>>>>>>>>>>>> by sink writers, these metrics not only
>>>> help
>>>> >>>      users to
>>>> >>>      >>>>>>>> know
>>>> >>>      >>>>>>>>>> the
>>>> >>>      >>>>>>>>>>>> job
>>>> >>>      >>>>>>>>>>>>>>>>> progress
>>>> >>>      >>>>>>>>>>>>>>>>>>> but also make big job maintaining easier.
>>>> Thus we
>>>> >>>      >>>>>>>> propose
>>>> >>>      >>>>>>>>>> the
>>>> >>>      >>>>>>>>>>>> FLIP-274
>>>> >>>      >>>>>>>>>>>>>>>>> to
>>>> >>>      >>>>>>>>>>>>>>>>>>> introduce a new metric group scope for
>>>> >>>      >>>>>>>> OperatorCoordinator
>>>> >>>      >>>>>>>>>> and
>>>> >>>      >>>>>>>>>>>> provide
>>>> >>>      >>>>>>>>>>>>>>>>> an
>>>> >>>      >>>>>>>>>>>>>>>>>>> internal implementation for
>>>> >>>      >>>>>>>>> OperatorCoordinatorMetricGroup.
>>>> >>>      >>>>>>>>>>>>>>>>>>> Could you help review this FLIP when you
>>>> get
>>>> >>>      time? Any
>>>> >>>      >>>>>>>>>>> feedback
>>>> >>>      >>>>>>>>>>>> is
>>>> >>>      >>>>>>>>>>>>>>>>>>> appreciated!
>>>> >>>      >>>>>>>>>>>>>>>>>>>
>>>> >>>      >>>>>>>>>>>>>>>>>>> Best,
>>>> >>>      >>>>>>>>>>>>>>>>>>> Hang
>>>> >>>      >>>>>>>>>>>>>>>>>>>
>>>> >>>      >>>>>>>>>>>>>>>>>>> [1]
>>>> >>>      >>>>>>>>>>>>>>>>>>>
>>>> >>>      >>>>>>>>>>>>>>>>>>>
>>>> >>>      >>
>>>> >>>
>>>> >>
>>>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-274%3A+Introduce+metric+group+for+OperatorCoordinator
>>>> >>>      >>
>>>> >>>
>>>>
>>>>
>>>
>>

Re: [DISCUSS] FLIP-274 : Introduce metric group for OperatorCoordinator

Posted by Hang Ruan <ru...@gmail.com>.

Hi all,

> IIUC, there is only an OperatorCoordinatorMetricGroup interface without
implementations (it never worked). There were no coordinator metrics
implemented and registered. @Hang Ruan please correct me if I'm wrong.
Yes,  we shouldn't have to worry about backwards-compatibility as Chesnay
said.

After the discussion, I think the FLIP should be modified as follow.
1.  Add new metric group JobManagerOperatorMetricGroup, whose scope format
is JobManagerOperatorScopeFormat. The config key of
JobManagerOperatorScopeFormat
is metrics.scope.jm-operator, and the default value is
<host>.jobmanager.<job_name>.<task_name>.<operator_name>.
2. JobManagerOperatorMetricGroup will be sub components of
JobManagerJobMetricGroup.
3. InternalOperatorCoordinatorMetricGroup will extend ProxyMetricGroup,
which registered under JobManagerOperatorMetricGroup by
addGroup("coordinator").

I have updated the FLIP-274. @Chesnay Schepler <ch...@apache.org> @Jark Wu
<im...@gmail.com> , please help to take a look at the modified FLIP and
make sure that I do not miss  something.

If there are no more problems, we plan to start the voting thread later.

Best,
Hang

Chesnay Schepler <ch...@apache.org> 于2023年1月31日周二 20:09写道：

> > IIUC, there is only an OperatorCoordinatorMetricGroup interface without
> implementations (it never worked). There were no coordinator metrics
> implemented and registered. @Hang Ruan <ru...@gmail.com> please
> correct me if I'm wrong.
>
> Indeed. There are interfaces, and theoretically a metric group is exposed
> to user-code via the SplitEnumeratorContext, but the metric group is always
> null in the SourceCoordinatorContext. So we shouldn't have to worry about
> backwards-compatibility.
>
> > However, the config key "metrics.scope.jm.job" is deprecated and is
> replaced by "metrics.scope.jm-job". Similarly, "metrics.scope.tm.job" is
> replaced with "metrics.scope.tm-job". That's why I though
> "metrics.scope.jm-operator" might be more consistent.
>
> HA! Good catch. I only looked at the 1.16 docs where they weren't
> deprecated yet.
>
> Then it should be "jm-operator" indeed.
>
> On 31/01/2023 10:51, Jark Wu wrote:
>
> Hi Chesnay,
>
> > The new option must default to the configured value of
> metrics.scope.jm.job because otherwise it isn't backwards-compatible with
> existing coordinator metrics.
> IIUC, there is only an OperatorCoordinatorMetricGroup interface without
> implementations (it never worked). There were no coordinator metrics
> implemented and registered. @Hang Ruan <ru...@gmail.com> please
> correct me if I'm wrong.
>
> > Intuitively I think the config key should be metrics.scope.jm.operator.,
> similar to metrics.scope.jm.job.
> However, the config key "metrics.scope.jm.job" is deprecated and is
> replaced by "metrics.scope.jm-job". Similarly, "metrics.scope.tm.job" is
> replaced with "metrics.scope.tm-job". That's why I though
> "metrics.scope.jm-operator" might be more consistent.
>
> > We would only support operator_name, operator_id, task_name and task_id
> (+ everything else from the jm/job groups of course).
> +1
>
> Best,
> Jark
>
>
>
> On Mon, 30 Jan 2023 at 20:09, Chesnay Schepler <ch...@apache.org> wrote:
>
>> The new option must default to the configured value of
>> metrics.scope.jm.job because otherwise it isn't backwards-compatible with
>> existing coordinator metrics.
>> According to some earlier response in this thread there are already
>> coordinator metrics.
>>
>> If that weren't the case you'd be correct, and we'd keep the default
>> consistent with the default for tm operator metrics.
>>
>> Intuitively I think the config key should be metrics.scope.jm.operator.,
>> similar to metrics.scope.jm.job.
>> For symmetry we should also add new metrics.scope.tm.operator/task and
>> eventually phase out metrics.scope.operator.
>>
>> > Regarding "coordinator metrics don't *belong to* any tasks/vertexes",
>> I mean the coordinator doesn't run on specific tasks, so it doesn't have
>> runtime information of tasks.
>>
>> That is true. I intentionally only referred to task id/name, as things
>> like subtask index or attempt ids don't make sense for coordinators as you
>> said.
>> We would only support operator_name, operator_id, task_name and task_id
>> (+ everything else from the jm/job groups of course).
>> (Maybe in the future attempt_num could be interesting though)
>>
>> On 30/01/2023 10:39, Jark Wu wrote:
>>
>> Hi Chesnay,
>>
>> IIUC, our only disagreement is whether to support task info in the metric
>> format. But we all agree with the default metric format
>> "metrics.scope.jm-operator=<host>.jobmanager.<job_name>.<operator_name>"
>> which doesn't include task info, right?
>>
>> Regarding "coordinator metrics don't *belong to* any tasks/vertexes", I
>> mean the coordinator doesn't run on specific tasks, so it doesn't have
>> runtime information of tasks. My concern about adding task info to the
>> coordinator metric is that
>> 1) if we register coordinator metrics for every subtask, the number of
>> coordinator metrics will be exploded when task parallelism is large. This
>> is an overhead to JM & metric systems, and the duplicated metrics are
>> useless for users.
>> 2) Correct me if I'm wrong, MetricGroup#counter only supports registering
>> one metric. It seems it can't register a metric reporting for multiple
>> metric names (e.g. multiple subtasks).
>>
>> Best,
>> Jark
>>
>>
>> On Mon, 30 Jan 2023 at 16:11, Chesnay Schepler <ch...@apache.org>
>> wrote:
>>
>>> I hadn't looked in detail in to how the task info can be introduced, but
>>> given that the coordinates are created by the execution graph, where we
>>> only work on tasks, it should be possible and rather simple.
>>>
>>> The OperatorCoordinatorHolder gets access to the ExecutionJobVertex from
>>> which it can extract everything and create the task/operator groups.
>>> Exposure to the coordinator could happen via the coordinator context as
>>> originally planned.
>>>
>>>  > As we know, coordinator metrics are just like JMJobMetricGroup, which
>>> don't belong to any tasks/vertexes.
>>>
>>> I don't think this is true, every coordinator is scoped to a specific
>>> operator, and every operator is associated with a particular vertex.
>>>
>>>  From the javadocs:
>>>
>>> " A coordinator for runtime operators. The OperatorCoordinator runs on
>>> the master, associated with
>>>   the job vertex of the operator. It communicates with operators via
>>> sending operator events."
>>>
>>> On 28/01/2023 09:15, Jark Wu wrote:
>>> > Hi all,
>>> >
>>> > IIUC, Chesnay means we should have a more general metric group for
>>> > **operator**
>>> > not for **coordinator** in JM, this would be useful to extend
>>> > other operator-specific
>>> > metrics in the future. That means the new scope format should be
>>> designed
>>> > for
>>> > the operator,
>>> > e.g.,
>>> metrics.scope.jm-operator=<host>.jobmanager.<job_name>.<operator_name>
>>> > The coordinator metric is a subgroup (a constant "coordinator" suffix)
>>> of
>>> > the JMOperatorMG.
>>> >
>>> > I think this is a nice design. However, I have a question about adding
>>> task
>>> > id/name to this list.
>>> > How to get the task id/name when reporting a coordinator metric? As we
>>> > know, coordinator metrics
>>> > are just like JMJobMetricGroup, which don't belong to any
>>> tasks/vertexes.
>>> > Do you mean reporting
>>> > coordinator metrics for every task id under the operator?
>>> >
>>> > Best,
>>> > Jark
>>> >
>>> >
>>> > On Thu, 19 Jan 2023 at 17:33, Chesnay Schepler <ch...@apache.org>
>>> wrote:
>>> >
>>> >>   > First, I do not understand why users have to configure the new
>>> scope
>>> >> format, which has a default value.
>>> >>
>>> >> If you don't use scope formats, sure. If you do use scope formats,
>>> e.g.
>>> >> to add a common prefix (which is the case for datadog users for
>>> >> example), then the current default in the FLIP is insufficient and
>>> >> requires the user to update the configuration.
>>> >>
>>> >>   > I usually do not need to change these configuration of scope
>>> formats
>>> >> when submitting the flink job.
>>> >>
>>> >> Scope formats as a whole are quite a power-user feature, but that
>>> >> doesn't mean we should ignore it.
>>> >>
>>> >>   > I try to let it extend the ComponentMetricGroup
>>> >>
>>> >> This isn't inherently required just because it is a "component".
>>> >> Component metric groups should _only_ be used for cases where such a
>>> >> component provides several bits of metadata.
>>> >> For example, tasks provide vertex ids, task names, attempted IDs etc.,
>>> >> and we'd like users to have the option on how this metadata ends up
>>> >> being used (via scope formats). This currently can't be built with the
>>> >> addGroup() methods, hence why the component groups exist.
>>> >> However, the operator coordinator _itself_does not provide several
>>> bits
>>> >> of metadata. Logically, the _operator_ does, not the coordinator.
>>> >>
>>> >>   > This make me decide to add a new scope format.
>>> >>
>>> >> And this is fine; just don't add a new scope format for the
>>> coordinator
>>> >> but logically for operators on the JM side, such that we can extend
>>> the
>>> >> set of operator-specific metrics in the future without having to add
>>> yet
>>> >> another scope format.
>>> >>
>>> >>   > The relationship between the Operator and OperatorCoordinator is
>>> >> maintained in the OperatorCoordinatorHolder. In the POC, the operator
>>> >> name/id could be found and the OperatorCoordinatorMetricGroup will be
>>> >> created here.
>>> >>
>>> >> That's all irrelevant.
>>> >>
>>> >>   > The registered metrics of OperatorCoordinator always are those
>>> which
>>> >> can not be aggregated from the tasks, like the number of unassigned
>>> >> splits in the SourceCoordinator. Actually we have not encountered the
>>> >> scenario you mentioned. I think the OperatorCoordinatorMetricGroup
>>> >> should only contain the metrics for itself instead of its subtasks.
>>> >>
>>> >> You are misunderstanding the problem.
>>> >>
>>> >> This isn't about aggregating metrics on our side or exposing metrics
>>> >> from TMs on the JM side or anything like that.
>>> >>
>>> >> It's purely about the available metadata for operator metrics on the
>>> JM
>>> >> side. Currently you suggest operator name / id; what I'm proposing is
>>> to
>>> >> add task(==vertex!) id / name to this list.
>>> >>
>>> >> The benefit here is simple. I can look up all metrics where
>>> task_id==XXX
>>> >> and get _everything_ related to that vertex, including all metrics
>>> >> associated with operators that are part of that vertex, including the
>>> >> coordinator.
>>> >>
>>> >>   > Using metrics.scope.jm.job is not enough to distinguish the
>>> different
>>> >> coordinators.
>>> >>
>>> >> I did not suggest just using metrics.scope.jm.job. I suggested that as
>>> >> the default for the jm operator scope format, that is used for the
>>> >> proposed JobManagerOperatorMetricGroup, which will have some plain
>>> >> metric groups as children for coordinators.
>>> >> Aka, you have the JMOperatorMG, then you call addGroup("coordinators")
>>> >> and then addGroup(coordinator_name) for each coordinator.
>>> >>
>>> >>   > So there are two choice for the
>>> >> InternalOperatorCoordinatorMetricGroup: 1. add and improve the new
>>> scope
>>> >> format; 2. use the metrics.scope.jm.job and ProxyMetricGroup. Which
>>> one
>>> >> is better?
>>> >>
>>> >> There are more options than that.
>>> >> I feel like there are a lot of misunderstandings in this discussion.
>>> >> Please let me know if I could clear things up. If not I can also
>>> provide
>>> >> a PoC based on your PoC if that can speed things up. It's not that
>>> >> different anyway.
>>> >>
>>> >> On 19/01/2023 07:39, Hang Ruan wrote:
>>> >>> Hi, chesnay,
>>> >>>
>>> >>> Thanks for your reply. I still have some doubts about the questions
>>> >>> you raised.
>>> >>>
>>> >>>      > Extending the set of ScopeFormats is problematic because it in
>>> >>>      practice
>>> >>>      it breaks the config if users actively rely on it, since
>>> there's now
>>> >>>      another key that they _must_ set for it to be
>>> >>>      consistent/compatible with
>>> >>>      their existing setup.
>>> >>>      Unfortunately due to how powerful scope formats are we can't
>>> derive a
>>> >>>      default value that matches their existing setup.
>>> >>>      Hence we should try to do this as rarely as possible.
>>> >>>
>>> >>>
>>> >>>      > This FLIP does not adhere to that since it proposes a
>>> dedicated
>>> >>>      format
>>> >>>      for coordinators; next time we want to expose operator-specific
>>> >>>      metrics
>>> >>>      (e.g., in the scheduler) we'd have to add another one to
>>> support it.
>>> >>>
>>> >>>
>>> >>> First, I do not understand why users have to configure the new scope
>>> >>> format, which has a default value. I usually do not need to change
>>> >>> these configuration of scope formats when submitting the flink job.
>>> >>> IMO, the OperatorCoordinator is a component running at the JobMaster.
>>> >>> I try to let it extend the ComponentMetricGroup, but I cannot find
>>> >>> a suitable scope. This make me decide to add a new scope format.
>>> >>>
>>> >>>      > Additionally, the configurable variables (operator name/id)
>>> are
>>> >>>      logically not attached to the coordinator, but operators, so to
>>> me it
>>> >>>      just doesn't make sense to structure it like this.
>>> >>>
>>> >>>
>>> >>> The relationship between the Operator and OperatorCoordinator is
>>> >>> maintained in the OperatorCoordinatorHolder. In the POC, the operator
>>> >>> name/id could be found and the OperatorCoordinatorMetricGroup will be
>>> >>> created here.
>>> >>>
>>> >>>      > Another thing I'm concerned about is that, because we don't
>>> include
>>> >>>      tasks in the hierarchy, users wishing to collect all metrics
>>> for a
>>> >>>      particular task (in this case ==vertex) now have to go
>>> >>>      significantly out
>>> >>>      of their way to get them, since they can no longer just filter
>>> by the
>>> >>>      task ID but have to be filter for _all_ operators that are part
>>> of
>>> >>>      the task.
>>> >>>
>>> >>>
>>> >>> The registered metrics of OperatorCoordinator always are those which
>>> >>> can not be aggregated from the tasks, like the number of unassigned
>>> >>> splits in the SourceCoordinator. Actually we have not encountered the
>>> >>> scenario you mentioned. I think the OperatorCoordinatorMetricGroup
>>> >>> should only contain the metrics for itself instead of its subtasks.
>>> >>>
>>> >>> Using metrics.scope.jm.job is not enough to distinguish the different
>>> >>> coordinators. The operator id/name is necessary. Then
>>> >>> the implementation should be like this. But users can not change the
>>> >>> scope in this way. This is also acceptable.
>>> >>> @Internal public class InternalOperatorCoordinatorMetricGroup
>>> >>>           extends ProxyMetricGroup<MetricGroup>
>>> >>>           implements OperatorCoordinatorMetricGroup {
>>> >>>       public InternalOperatorCoordinatorMetricGroup(
>>> >>>               JobManagerJobMetricGroup parent, OperatorID operatorID,
>>> >> String operatorName) {
>>> >>>           super(parent.addGroup(operatorID +
>>> >> operatorName).addGroup("coordinator")); }
>>> >>> }
>>> >>> So there are two choice for the
>>> >>> InternalOperatorCoordinatorMetricGroup: 1. add and improve the new
>>> >>> scope format; 2. use the metrics.scope.jm.job and ProxyMetricGroup.
>>> >>> Which one is better?
>>> >>>
>>> >>> Best,
>>> >>> Hang
>>> >>>
>>> >>>
>>> >>> Chesnay Schepler <ch...@apache.org> 于2023年1月18日周三 17:03写道：
>>> >>>
>>> >>>      You're misunderstanding the problem.
>>> >>>
>>> >>>      Metric groups form a tree with each group providing certain
>>> metadata.
>>> >>>
>>> >>>      E.g., on the taskmanager we have a TM metric group that
>>> provides info
>>> >>>      about the TM, that has child task metric groups, that have child
>>> >>>      operator metric groups etc. The operator metric group again has
>>> >>>      children, where we are now mostly in the user/connector land,
>>> like a
>>> >>>      kafka metric group providing partition info or something.
>>> >>>
>>> >>>      What is being proposed is to create a coordinator metric group
>>> on
>>> >>>      the JM
>>> >>>      side that provides operator metadata.
>>> >>>
>>> >>>      A more appropriate structure is to create an operator group on
>>> the JM
>>> >>>      side that provides this info, with the coordinator metric group
>>> >>>      being a
>>> >>>      child.
>>> >>>
>>> >>>      On 17/01/2023 19:56, Steven Wu wrote:
>>> >>>      >> Additionally, the configurable variables (operator name/id)
>>> are
>>> >>>      > logically not attached to the coordinator, but operators, so
>>> to
>>> >>>      me it
>>> >>>      > just doesn't make sense to structure it like this.
>>> >>>      >
>>> >>>      > Chesnay, maybe we should clarify the terminology. To me,
>>> >>>      pperators (like
>>> >>>      > FLIP-27 source) can have two parts (coordinator and
>>> >>>      reader/subtask). I
>>> >>>      > think it is fine to include operator name/id for coordinator
>>> >>>      metrics.
>>> >>>      >
>>> >>>      > On Mon, Jan 16, 2023 at 2:13 AM Chesnay Schepler
>>> >>>      <ch...@apache.org> wrote:
>>> >>>      >
>>> >>>      >> Slight correction: Using metrics.scope.jm.job as the default
>>> >>>      should be
>>> >>>      >> safe.
>>> >>>      >>
>>> >>>      >> On 16/01/2023 10:18, Chesnay Schepler wrote:
>>> >>>      >>> The proposed ScopeFormat is still problematic for a few
>>> reasons.
>>> >>>      >>>
>>> >>>      >>> Extending the set of ScopeFormats is problematic because it
>>> in
>>> >>>      >>> practice it breaks the config if users actively rely on it,
>>> since
>>> >>>      >>> there's now another key that they _must_ set for it to be
>>> >>>      >>> consistent/compatible with their existing setup.
>>> >>>      >>> Unfortunately due to how powerful scope formats are we can't
>>> >>>      derive a
>>> >>>      >>> default value that matches their existing setup.
>>> >>>      >>> Hence we should try to do this as rarely as possible.
>>> >>>      >>>
>>> >>>      >>> This FLIP does not adhere to that since it proposes a
>>> >>>      dedicated format
>>> >>>      >>> for coordinators; next time we want to expose
>>> operator-specific
>>> >>>      >>> metrics (e.g., in the scheduler) we'd have to add another
>>> one to
>>> >>>      >>> support it.
>>> >>>      >>>
>>> >>>      >>> Additionally, the configurable variables (operator name/id)
>>> are
>>> >>>      >>> logically not attached to the coordinator, but operators, so
>>> >>>      to me it
>>> >>>      >>> just doesn't make sense to structure it like this.
>>> >>>      >>>
>>> >>>      >>> Another thing I'm concerned about is that, because we don't
>>> >>>      include
>>> >>>      >>> tasks in the hierarchy, users wishing to collect all
>>> metrics for
>>> >> a
>>> >>>      >>> particular task (in this case ==vertex) now have to go
>>> >>>      significantly
>>> >>>      >>> out of their way to get them, since they can no longer just
>>> >>>      filter by
>>> >>>      >>> the task ID but have to be filter for _all_ operators that
>>> are
>>> >>>      part of
>>> >>>      >>> the task.
>>> >>>      >>>
>>> >>>      >>> On 16/01/2023 03:09, Hang Ruan wrote:
>>> >>>      >>>> Hi, @chesnay@apache.org <ch...@apache.org> ,
>>> >>>      >>>>
>>> >>>      >>>> Do you have time to help to review this FLIP again after
>>> the
>>> >>>      >>>> modification?
>>> >>>      >>>> Looking forward to your reply.
>>> >>>      >>>> This FLIP will add a new configuration for the
>>> >>>      >>>> OperatorCoordinatorMetricGroup scope format. It provides an
>>> >>>      internal
>>> >>>      >>>> implementation and is added as a component to the
>>> >>>      >>>> JobManagerJobMetricGroup.
>>> >>>      >>>> If something doesn't make sense, could you provide some
>>> >>>      advice? It
>>> >>>      >>>> will be
>>> >>>      >>>> very helpful. Thanks a lot for your help.
>>> >>>      >>>>
>>> >>>      >>>> Best,
>>> >>>      >>>> Hang
>>> >>>      >>>>
>>> >>>      >>>> Martijn Visser <ma...@apache.org> 于2023年1月11日周三
>>> >>>      16:34写道：
>>> >>>      >>>>
>>> >>>      >>>>
>>> >>>      >>>>> Hi Hang,
>>> >>>      >>>>>
>>> >>>      >>>>> I'm a bit surprised that this has gone to a vote, given
>>> that
>>> >>>      Chesnay
>>> >>>      >>>>> deliberately mentioned that he would vote against it
>>> as-is.
>>> >>>      I would
>>> >>>      >>>>> expect
>>> >>>      >>>>> that before going to a vote, he has had the opportunity to
>>> >>>      >>>>> participate in
>>> >>>      >>>>> this discussion.
>>> >>>      >>>>>
>>> >>>      >>>>> Best regards,
>>> >>>      >>>>>
>>> >>>      >>>>> Martijn
>>> >>>      >>>>>
>>> >>>      >>>>> On Tue, Jan 3, 2023 at 12:53 PM Jark Wu <imjark@gmail.com
>>> >
>>> >>>      wrote:
>>> >>>      >>>>>
>>> >>>      >>>>>> Hi Dong,
>>> >>>      >>>>>>
>>> >>>      >>>>>> Regarding “SplitEnumeratorContext#metricGroup”, my only
>>> >>>      concern is
>>> >>>      >>>>>> that
>>> >>>      >>>>>> this is a core interface for the FLIP. It’s hard to tell
>>> how
>>> >>>      >>>>>> sources use
>>> >>>      >>>>>> metric group without mentioning this interface. Even if
>>> >>>      this is an
>>> >>>      >>>>> existing
>>> >>>      >>>>>> API, I think it’s worth introducing the interface again
>>> and
>>> >>>      declaring
>>> >>>      >>>>> that
>>> >>>      >>>>>> we will implement the interface instead of a no-op method
>>> >>>      in this
>>> >>>      >>>>>> FLIP.
>>> >>>      >>>>>>
>>> >>>      >>>>>> Anyway, this is a minor problem and shouldn’t block this
>>> >>>      FLIP. I’m
>>> >>>      >>>>>> +1 to
>>> >>>      >>>>>> start a vote.
>>> >>>      >>>>>>
>>> >>>      >>>>>> Best,
>>> >>>      >>>>>> Jark
>>> >>>      >>>>>>
>>> >>>      >>>>>>
>>> >>>      >>>>>>> 2023年1月3日 10:03，Hang Ruan <ru...@gmail.com> 写道：
>>> >>>      >>>>>>>
>>> >>>      >>>>>>> Hi, Jark and Dong,
>>> >>>      >>>>>>>
>>> >>>      >>>>>>> Thanks for your comments. Sorry for my late reply.
>>> >>>      >>>>>>>
>>> >>>      >>>>>>> For suggestion 1, I plan to implement the
>>> >>>      >>>>>>> SplitEnumeratorMetricGroup in
>>> >>>      >>>>>>> another issue, and it is not contained in this FLIP. I
>>> >>>      will add some
>>> >>>      >>>>>>> description about this part.
>>> >>>      >>>>>>> For suggestion 2, changes about
>>> >>>      OperatorCoordinator#metricGroup has
>>> >>>      >>>>>> already
>>> >>>      >>>>>>> been documented in the proposed change section.
>>> >>>      >>>>>>>
>>> >>>      >>>>>>> Best,
>>> >>>      >>>>>>> Hang
>>> >>>      >>>>>>>
>>> >>>      >>>>>>> Dong Lin <li...@gmail.com> 于2023年1月1日周日 09:45写道：
>>> >>>      >>>>>>>
>>> >>>      >>>>>>>> Let me chime-in and add comments regarding the public
>>> >>>      interface
>>> >>>      >>>>> section.
>>> >>>      >>>>>>>> Please see my comments inline.
>>> >>>      >>>>>>>>
>>> >>>      >>>>>>>> On Thu, Dec 29, 2022 at 6:08 PM Jark Wu
>>> >>>      <im...@gmail.com> wrote:
>>> >>>      >>>>>>>>
>>> >>>      >>>>>>>>> Hi Hang,
>>> >>>      >>>>>>>>>
>>> >>>      >>>>>>>>> Thanks for driving this discussion. I think this is a
>>> >>>      very useful
>>> >>>      >>>>>> feature
>>> >>>      >>>>>>>>> for connectors.
>>> >>>      >>>>>>>>>
>>> >>>      >>>>>>>>> The FLIP looks quite good to me, and I just have two
>>> >>>      suggestions.
>>> >>>      >>>>>>>>>
>>> >>>      >>>>>>>>> 1. In the "Public Interface" section, mention that the
>>> >>>      >>>>>>>>> implementation
>>> >>>      >>>>>>>>> behavior of "SplitEnumeratorContext#metricGroup" is
>>> >>>      changed from
>>> >>>      >>>>>>>> returning
>>> >>>      >>>>>>>>> null to returning a concrete
>>> SplitEnumeratorMetricGroup
>>> >>>      instance.
>>> >>>      >>>>> Even
>>> >>>      >>>>>>>>> though the API is already there, the behavior change
>>> can
>>> >>>      also be
>>> >>>      >>>>>>>> considered
>>> >>>      >>>>>>>>> a public change.
>>> >>>      >>>>>>>>>
>>> >>>      >>>>>>>> SplitEnumeratorContext#metricGroup is an interface and
>>> >>>      this FLIP
>>> >>>      >>>>>>>> does
>>> >>>      >>>>>> not
>>> >>>      >>>>>>>> seem to change its semantics/behavior. The FLIP does
>>> >>>      change the
>>> >>>      >>>>>>>> implementation/behavior of
>>> >>>      SourceCoordinatorContext#metricGroup,
>>> >>>      >>>>>>>> which
>>> >>>      >>>>>> is
>>> >>>      >>>>>>>> marked @Internal.
>>> >>>      >>>>>>>>
>>> >>>      >>>>>>>> Thus it might seem a bit weird to add in the public
>>> >> interface
>>> >>>      >>>>>>>> section
>>> >>>      >>>>>>>> saying that we change the interface
>>> >>>      >>>>>>>> SplitEnumeratorContext#metricGroup
>>> >>>      >>>>>> from
>>> >>>      >>>>>>>> returning null to non-null object.
>>> >>>      >>>>>>>>
>>> >>>      >>>>>>>> 2. Mention the newly added interface of
>>> >>>      >>>>>> "OperatorCoordinator#metricGroup"
>>> >>>      >>>>>>>>> in the "Proposed Changes" section or "Public
>>> Interface"
>>> >>>      section. As
>>> >>>      >>>>> the
>>> >>>      >>>>>>>>> FLIP said, OperatorCoordinator is widely used in many
>>> >>>      connectors.
>>> >>>      >>>>>> Though
>>> >>>      >>>>>>>> it
>>> >>>      >>>>>>>>> is still an @Internal API, I think it is worth
>>> >>>      mentioning the
>>> >>>      >>>>>>>>> change
>>> >>>      >>>>> in
>>> >>>      >>>>>>>> the
>>> >>>      >>>>>>>>> FLIP.
>>> >>>      >>>>>>>> Since OperatorCoordinator is an internal API, it seems
>>> >>>      reasonable to
>>> >>>      >>>>>>>> explain it in the proposed change section. The FLIP
>>> seems
>>> >>>      to have
>>> >>>      >>>>>>>> documented this in point 5 of the proposed change
>>> section.
>>> >>>      >>>>>>>>
>>> >>>      >>>>>>>> BTW, if we think there are @internal classes that are
>>> >>>      important
>>> >>>      >>>>>>>> enough
>>> >>>      >>>>>> to
>>> >>>      >>>>>>>> be added in the public interface section, it might be
>>> >>>      useful to
>>> >>>      >>>>>> explicitly
>>> >>>      >>>>>>>> discuss this topic and document it in the "*What are
>>> the
>>> >>>      "public
>>> >>>      >>>>>>>> interfaces" of the project*" in this
>>> >>>      >>>>>>>> <
>>> >>>      >>>>>>>>
>>> >>>      >>
>>> >>>
>>> >>
>>> https://cwiki.apache.org/confluence/display/FLINK/Flink+Improvement+Proposals
>>> >>>      >>>>>>>> wiki.
>>> >>>      >>>>>>>>
>>> >>>      >>>>>>>>
>>> >>>      >>>>>>>>> Best,
>>> >>>      >>>>>>>>> Jark
>>> >>>      >>>>>>>>>
>>> >>>      >>>>>>>>>
>>> >>>      >>>>>>>>> On Mon, 26 Dec 2022 at 18:06, Hang Ruan
>>> >>>      <ru...@gmail.com>
>>> >>>      >>>>>> wrote:
>>> >>>      >>>>>>>>>> Hi, thanks for the feedback, Zhu Zhu and Qingsheng.
>>> >>>      >>>>>>>>>> After combining everyone's comments, the main
>>> concerns and
>>> >>>      >>>>>>>> corresponding
>>> >>>      >>>>>>>>>> adjustments are as follows.
>>> >>>      >>>>>>>>>>
>>> >>>      >>>>>>>>>> Q1: Common metrics are not quite useful.
>>> >>>      >>>>>>>>>> numEventsIn and numEventsOut counters will be removed
>>> >>>      from the
>>> >>>      >>>>>>>>>> OperatorCoordinatorMetricGroup. These common metrics
>>> do
>>> >> not
>>> >>>      >>>>>>>>>> provide
>>> >>>      >>>>>>>>> enough
>>> >>>      >>>>>>>>>> information for users. The users are more willing to
>>> >>>      get the
>>> >>>      >>>>>>>>>> number
>>> >>>      >>>>> of
>>> >>>      >>>>>>>>>> events of the specified type instead of the total
>>> >>>      number. And this
>>> >>>      >>>>>>>> metric
>>> >>>      >>>>>>>>>> is calculated differently. The implementation could
>>> >>>      register the
>>> >>>      >>>>>> metric
>>> >>>      >>>>>>>>> by
>>> >>>      >>>>>>>>>> themselves.
>>> >>>      >>>>>>>>>>
>>> >>>      >>>>>>>>>> Q2: This FLIP is overly complicated.
>>> >>>      >>>>>>>>>> This FLIP will become concise after these
>>> modifications.
>>> >>>      >>>>>>>>>> OperatorCoordinatorMetricGroup has already been
>>> >>>      introduced into
>>> >>>      >>>>> Flink
>>> >>>      >>>>>>>> by
>>> >>>      >>>>>>>>>> FLIP-179<
>>> >>>      >>>>>>>>>>
>>> >>>      >>>>>>>>>>
>>> >>>      >>
>>> >>>
>>> >>
>>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-179%3A+Expose+Standardized+Operator+Metrics
>>> >>>      >>>>>>>>>>> .
>>> >>>      >>>>>>>>>> And
>>> >>>      >>>>>>>>>> this FLIP will not change it. This FLIP only
>>> provides a
>>> >>>      new metric
>>> >>>      >>>>>>>> option
>>> >>>      >>>>>>>>>> and a new metric group scope. The changes in proposed
>>> >>>      changes
>>> >>>      >>>>> provide
>>> >>>      >>>>>>>> the
>>> >>>      >>>>>>>>>> details about the modifications for the internal
>>> >>>      classes, which
>>> >>>      >>>>> might
>>> >>>      >>>>>>>>> make
>>> >>>      >>>>>>>>>> it look complicated.
>>> >>>      >>>>>>>>>>
>>> >>>      >>>>>>>>>> Thanks for all the comments again. If there are no
>>> further
>>> >>>      >>>>>>>>>> comments,
>>> >>>      >>>>>> we
>>> >>>      >>>>>>>>>> plan to start the voting thread this week.
>>> >>>      >>>>>>>>>>
>>> >>>      >>>>>>>>>> Best,
>>> >>>      >>>>>>>>>> Hang
>>> >>>      >>>>>>>>>>
>>> >>>      >>>>>>>>>> Qingsheng Ren <re...@gmail.com> 于2022年12月26日周一
>>> >>>      16:48写道：
>>> >>>      >>>>>>>>>>
>>> >>>      >>>>>>>>>>
>>> >>>      >>>>>>>>>>> Thanks for the FLIP, Hang!
>>> >>>      >>>>>>>>>>>
>>> >>>      >>>>>>>>>>> This FLIP overall looks good to me. Actually I share
>>> >>>      the same
>>> >>>      >>>>> concern
>>> >>>      >>>>>>>>>> with
>>> >>>      >>>>>>>>>>> Zhu that numEventsIn and numEventsOut counters are
>>> not
>>> >>>      quite
>>> >>>      >>>>>>>>>>> useful
>>> >>>      >>>>>>>> to
>>> >>>      >>>>>>>>>> end
>>> >>>      >>>>>>>>>>> users. OperatorEvent is a quite low-level
>>> abstraction,
>>> >>>      which
>>> >>>      >>>>> requires
>>> >>>      >>>>>>>>>>> instantialization in order to be practical to users
>>> and
>>> >>>      >>>>>>>>>>> developers,
>>> >>>      >>>>>>>> so
>>> >>>      >>>>>>>>>>> maybe it's better to exclude them from the FLIP.
>>> >>>      >>>>>>>>>>>
>>> >>>      >>>>>>>>>>> Best,
>>> >>>      >>>>>>>>>>> Qingsheng
>>> >>>      >>>>>>>>>>> Ververica (Alibaba)
>>> >>>      >>>>>>>>>>>
>>> >>>      >>>>>>>>>>> On Mon, Dec 26, 2022 at 12:08 PM Zhu Zhu
>>> >>>      <re...@gmail.com>
>>> >>>      >>>>> wrote:
>>> >>>      >>>>>>>>>>>> Hi Hang,
>>> >>>      >>>>>>>>>>>> I still see no strong reason why we need
>>> >>>      >>>>>>>>>>>> numEventsIn/numEventsOut
>>> >>>      >>>>>>>>>>> metrics.
>>> >>>      >>>>>>>>>>>> In the discussion in FLINK-29801, I can see the
>>> same
>>> >>>      concern
>>> >>>      >>>>>>>>>>>> from
>>> >>>      >>>>>>>>>> others.
>>> >>>      >>>>>>>>>>>> So I prefer to exclude them from this FLIP to avoid
>>> >>>      >>>>>>>>>>>> over-extending
>>> >>>      >>>>>>>>> the
>>> >>>      >>>>>>>>>>>> scope.
>>> >>>      >>>>>>>>>>>>
>>> >>>      >>>>>>>>>>>> Thanks,
>>> >>>      >>>>>>>>>>>> Zhu
>>> >>>      >>>>>>>>>>>>
>>> >>>      >>>>>>>>>>>> Hang Ruan <ru...@gmail.com> 于2022年12月23日周五
>>> >>>      >>>>>>>>>>>> 15:21写道：
>>> >>>      >>>>>>>>>>>>> Hi everyone,
>>> >>>      >>>>>>>>>>>>>
>>> >>>      >>>>>>>>>>>>> As for the Zhu Zhu's problem, I think we should
>>> keep
>>> >>>      the common
>>> >>>      >>>>>>>>>>> metrics,
>>> >>>      >>>>>>>>>>>> which will help to observe incoming and outgoing
>>> >>>      events. What do
>>> >>>      >>>>>>>> you
>>> >>>      >>>>>>>>>>> think,
>>> >>>      >>>>>>>>>>>> @Zhu Zhu ?
>>> >>>      >>>>>>>>>>>>> And @Chesnay, are there any other issues you are
>>> >>>      more concerned
>>> >>>      >>>>>>>>>> about?
>>> >>>      >>>>>>>>>>>> Looking forward to your reply.
>>> >>>      >>>>>>>>>>>>> Thanks for all the comments. If there are no
>>> further
>>> >>>      >>>>>>>>>>>>> comments, we
>>> >>>      >>>>>>>>>> plan
>>> >>>      >>>>>>>>>>>> to start the voting thread next week.
>>> >>>      >>>>>>>>>>>>> Best,
>>> >>>      >>>>>>>>>>>>> Hang
>>> >>>      >>>>>>>>>>>>>
>>> >>>      >>>>>>>>>>>>> Hang Ruan <ru...@gmail.com> 于2022年12月15日周四
>>> >>>      >>>>>>>>>>>>> 16:49写道：
>>> >>>      >>>>>>>>>>>>>> Hi, Zhu Zhu,
>>> >>>      >>>>>>>>>>>>>>
>>> >>>      >>>>>>>>>>>>>> Thanks for your feedback!
>>> >>>      >>>>>>>>>>>>>>
>>> >>>      >>>>>>>>>>>>>> The OperatorCoordinator implementations are
>>> >>>      different. And
>>> >>>      >>>>>>>>>>>>>> their
>>> >>>      >>>>>>>>>>>> metrics are much different too. We try to find the
>>> >> common
>>> >>>      >>>>>>>>>>>> metrics
>>> >>>      >>>>>>>> and
>>> >>>      >>>>>>>>>> put
>>> >>>      >>>>>>>>>>>> them in the OperatorCoordinatorMetricGroup. If most
>>> >>>      developers
>>> >>>      >>>>>>>> think
>>> >>>      >>>>>>>>> we
>>> >>>      >>>>>>>>>>> do
>>> >>>      >>>>>>>>>>>> not need these common metrics, removing the common
>>> >>>      metrics is
>>> >>>      >>>>>>>>>>>> also
>>> >>>      >>>>>>>>>>>> acceptable.
>>> >>>      >>>>>>>>>>>>>> Best,
>>> >>>      >>>>>>>>>>>>>> Hang
>>> >>>      >>>>>>>>>>>>>>
>>> >>>      >>>>>>>>>>>>>> Zhu Zhu <re...@gmail.com> 于2022年12月14日周三
>>> >>>      22:09写道：
>>> >>>      >>>>>>>>>>>>>>> Hi Hang & MengYue,
>>> >>>      >>>>>>>>>>>>>>>
>>> >>>      >>>>>>>>>>>>>>> Thanks for creating this FLIP!
>>> >>>      >>>>>>>>>>>>>>>
>>> >>>      >>>>>>>>>>>>>>> I think it is very useful, mainly in two
>>> aspects:
>>> >>>      >>>>>>>>>>>>>>> 1. Enables OperatorCoordinators to register
>>> metrics.
>>> >>>      >>>>>>>>>>>>>>> Currently
>>> >>>      >>>>>>>>>>>>>>> the coordinators has no way to do this. And
>>> operator
>>> >>>      >>>>>>>> coordinator
>>> >>>      >>>>>>>>>>>>>>> metric group further enables the SplitEnumerator
>>> >>>      to have
>>> >>>      >>>>>>>>>>>>>>> access
>>> >>>      >>>>>>>>>>>>>>> to a registered metric group (via the existing
>>> public
>>> >>>      >>>>>>>>>>>>>>> interface
>>> >>>      >>>>>>>>>>>>>>> SplitEnumeratorContext#metricGroup()), which is
>>> >>>      null at the
>>> >>>      >>>>>>>>> moment.
>>> >>>      >>>>>>>>>>>>>>> 2. Defines the scope of operator coordinator
>>> >>>      metrics. A clear
>>> >>>      >>>>>>>>>>>> definition
>>> >>>      >>>>>>>>>>>>>>> makes it easy for users to find their wanted
>>> >>>      metrics. The
>>> >>>      >>>>>>>>>> definition
>>> >>>      >>>>>>>>>>>>>>> also helps to avoid conflicts of metrics from
>>> >> multiple
>>> >>>      >>>>>>>>>>>> OperatorCoordinators
>>> >>>      >>>>>>>>>>>>>>> of the same kind. E.g. each SourceCoordinator
>>> may
>>> >>>      have its
>>> >>>      >>>>>>>>>>>>>>> own
>>> >>>      >>>>>>>>>>>>>>> numSourceSplits metric, these metrics should not
>>> >>>      be directly
>>> >>>      >>>>>>>>>>> registered
>>> >>>      >>>>>>>>>>>>>>> to the job metric group.
>>> >>>      >>>>>>>>>>>>>>>
>>> >>>      >>>>>>>>>>>>>>> What I'm a bit concerned is the necessity of the
>>> >>>      introduced
>>> >>>      >>>>>>>>> common
>>> >>>      >>>>>>>>>>>> metrics
>>> >>>      >>>>>>>>>>>>>>> numEventsInCounter & numEventsOutCounter. If
>>> there
>>> >>>      any case
>>> >>>      >>>>>>>> which
>>> >>>      >>>>>>>>>>>> strongly
>>> >>>      >>>>>>>>>>>>>>> requires them?
>>> >>>      >>>>>>>>>>>>>>>
>>> >>>      >>>>>>>>>>>>>>> Regarding the concerns of Chesnay,
>>> >>>      >>>>>>>>>>>>>>>> A dedicated coordinator MG implementation is
>>> >> overkill
>>> >>>      >>>>>>>>>>>>>>> Directly using the job metric group can result
>>> in
>>> >>>      metric
>>> >>>      >>>>>>>>> conflicts,
>>> >>>      >>>>>>>>>>> as
>>> >>>      >>>>>>>>>>>> mentioned
>>> >>>      >>>>>>>>>>>>>>> in above #2.
>>> >>>      >>>>>>>>>>>>>>>
>>> >>>      >>>>>>>>>>>>>>> Thanks,
>>> >>>      >>>>>>>>>>>>>>> Zhu
>>> >>>      >>>>>>>>>>>>>>>
>>> >>>      >>>>>>>>>>>>>>> Dong Lin <li...@gmail.com> 于2022年12月10日周六
>>> >>>      >>>>>>>>>>>>>>> 14:16写道：
>>> >>>      >>>>>>>>>>>>>>>
>>> >>>      >>>>>>>>>>>>>>>> Hi Chesney,
>>> >>>      >>>>>>>>>>>>>>>>
>>> >>>      >>>>>>>>>>>>>>>> Just to double check with you,
>>> >>>      >>>>>>>>>>>>>>>> OperatorCoordinatorMetricGroup
>>> >>>      >>>>>>>>>>>> (annotated as
>>> >>>      >>>>>>>>>>>>>>>> @PublicEvolving) has already been introduced
>>> into
>>> >>>      Flink by
>>> >>>      >>>>>>>>>> FLIP-179
>>> >>>      >>>>>>>>>>>>>>>> <
>>> >>>      >>
>>> >>>
>>> >>
>>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-179%3A+Expose+Standardized+Operator+Metrics
>>> >>>      >>>>>>>>>>>>> .
>>> >>>      >>>>>>>>>>>>>>>> And that FLIP has got you +1.. Do you mean we
>>> >>>      should remove
>>> >>>      >>>>>>>>> this
>>> >>>      >>>>>>>>>>>>>>>> OperatorCoordinatorMetricGroup?
>>> >>>      >>>>>>>>>>>>>>>>
>>> >>>      >>>>>>>>>>>>>>>> Regards,
>>> >>>      >>>>>>>>>>>>>>>> Dong
>>> >>>      >>>>>>>>>>>>>>>>
>>> >>>      >>>>>>>>>>>>>>>> On Sat, Dec 10, 2022 at 1:33 AM Chesnay
>>> Schepler <
>>> >>>      >>>>>>>>>>> chesnay@apache.org>
>>> >>>      >>>>>>>>>>>> wrote:
>>> >>>      >>>>>>>>>>>>>>>>> As a whole I feel like this FLIP is overly
>>> >>>      complicated. A
>>> >>>      >>>>>>>>>>> dedicated
>>> >>>      >>>>>>>>>>>>>>>>> coordinator MG implementation is overkill; it
>>> >>>      could just
>>> >>>      >>>>>>>>> re-use
>>> >>>      >>>>>>>>>>> the
>>> >>>      >>>>>>>>>>>>>>>>> existing Task/OperatorMGs to create the same
>>> >>>      structure we
>>> >>>      >>>>>>>>> have
>>> >>>      >>>>>>>>>> on
>>> >>>      >>>>>>>>>>>> TMs,
>>> >>>      >>>>>>>>>>>>>>>>> similar to what we did with the Job MG.
>>> >>>      >>>>>>>>>>>>>>>>>
>>> >>>      >>>>>>>>>>>>>>>>> However, I'm not convinced that this is
>>> required
>>> >>>      anyway,
>>> >>>      >>>>>>>>>> because
>>> >>>      >>>>>>>>>>>> all the
>>> >>>      >>>>>>>>>>>>>>>>> example metrics you listed can be implemented
>>> on
>>> >>>      the TM
>>> >>>      >>>>>>>> side
>>> >>>      >>>>>>>>> +
>>> >>>      >>>>>>>>>>>>>>>>> aggregating them in the external metrics
>>> backend.
>>> >>>      >>>>>>>>>>>>>>>>>
>>> >>>      >>>>>>>>>>>>>>>>> Since I'm on holidays soon, just so no one
>>> tries
>>> >>>      to pull a
>>> >>>      >>>>>>>>> fast
>>> >>>      >>>>>>>>>>>> one on
>>> >>>      >>>>>>>>>>>>>>>>> me, if this were to go to a vote as-is I'd be
>>> >>>      against it.
>>> >>>      >>>>>>>>>>>>>>>>>
>>> >>>      >>>>>>>>>>>>>>>>>
>>> >>>      >>>>>>>>>>>>>>>>> On 09/12/2022 15:30, Dong Lin wrote:
>>> >>>      >>>>>>>>>>>>>>>>>> Hi Hang,
>>> >>>      >>>>>>>>>>>>>>>>>>
>>> >>>      >>>>>>>>>>>>>>>>>> Thanks for the FLIP! The FLIP looks good and
>>> it
>>> >>>      is pretty
>>> >>>      >>>>>>>>>>>> informative.
>>> >>>      >>>>>>>>>>>>>>>>>> I have just two minor comments regarding
>>> names:
>>> >>>      >>>>>>>>>>>>>>>>>> - Would it be useful to rename the config
>>> key as
>>> >>>      >>>>>>>>>>>>>>>>>> *metrics.scope.jm.job.operator-coordinator*
>>> for
>>> >>>      >>>>>>>> consistency
>>> >>>      >>>>>>>>>>> with
>>> >>>      >>>>>>>>>>>>>>>>>> *metrics.scope.jm.job
>>> >>>      >>>>>>>>>>>>>>>>>> *(which is not named as *jm-job)?
>>> >>>      >>>>>>>>>>>>>>>>>> - Maybe rename the variable as
>>> >>>      >>>>>>>>>>> SCOPE_NAMING_OPERATOR_COORDINATOR
>>> >>>      >>>>>>>>>>>> for
>>> >>>      >>>>>>>>>>>>>>>>>> simplicity and consistency with
>>> >>>      SCOPE_NAMING_OPERATOR
>>> >>>      >>>>>>>>> (which
>>> >>>      >>>>>>>>>> is
>>> >>>      >>>>>>>>>>>> not named
>>> >>>      >>>>>>>>>>>>>>>>>> as SCOPE_NAMING_TM_JOB_OPERATOR)?
>>> >>>      >>>>>>>>>>>>>>>>>>
>>> >>>      >>>>>>>>>>>>>>>>>> Cheers,
>>> >>>      >>>>>>>>>>>>>>>>>> Dong
>>> >>>      >>>>>>>>>>>>>>>>>>
>>> >>>      >>>>>>>>>>>>>>>>>>
>>> >>>      >>>>>>>>>>>>>>>>>>
>>> >>>      >>>>>>>>>>>>>>>>>> On Thu, Dec 8, 2022 at 3:28 PM Hang Ruan <
>>> >>>      >>>>>>>>>>> ruanhang1993@gmail.com>
>>> >>>      >>>>>>>>>>>> wrote:
>>> >>>      >>>>>>>>>>>>>>>>>>> Hi all,
>>> >>>      >>>>>>>>>>>>>>>>>>>
>>> >>>      >>>>>>>>>>>>>>>>>>> MengYue and I created FLIP-274[1] Introduce
>>> >>>      metric group
>>> >>>      >>>>>>>>> for
>>> >>>      >>>>>>>>>>>>>>>>>>> OperatorCoordinator. OperatorCoordinator is
>>> the
>>> >>>      >>>>>>>>> coordinator
>>> >>>      >>>>>>>>>>> for
>>> >>>      >>>>>>>>>>>> runtime
>>> >>>      >>>>>>>>>>>>>>>>>>> operators and running on Job Manager. The
>>> >>>      coordination
>>> >>>      >>>>>>>>>>>> mechanism is
>>> >>>      >>>>>>>>>>>>>>>>>>> operator events between OperatorCoordinator
>>> >>>      and its all
>>> >>>      >>>>>>>>>>>> operators, the
>>> >>>      >>>>>>>>>>>>>>>>>>> coordination is more and more using in
>>> Flink, for
>>> >>>      >>>>>>>> example
>>> >>>      >>>>>>>>>> many
>>> >>>      >>>>>>>>>>>> Sources
>>> >>>      >>>>>>>>>>>>>>>>> and
>>> >>>      >>>>>>>>>>>>>>>>>>> Sinks depend on the mechanism to assign
>>> splits
>>> >> and
>>> >>>      >>>>>>>>>> coordinate
>>> >>>      >>>>>>>>>>>> commits to
>>> >>>      >>>>>>>>>>>>>>>>>>> external systems. The OperatorCoordinator is
>>> >>>      widely
>>> >>>      >>>>>>>> using
>>> >>>      >>>>>>>>> in
>>> >>>      >>>>>>>>>>>> flink kafka
>>> >>>      >>>>>>>>>>>>>>>>>>> connector, flink pulsar connector, flink cdc
>>> >>>      connector,
>>> >>>      >>>>>>>>>> flink
>>> >>>      >>>>>>>>>>>> hudi
>>> >>>      >>>>>>>>>>>>>>>>>>> connector and so on.
>>> >>>      >>>>>>>>>>>>>>>>>>>
>>> >>>      >>>>>>>>>>>>>>>>>>> But there is not a suitable metric group
>>> scope
>>> >>>      for the
>>> >>>      >>>>>>>>>>>>>>>>> OperatorCoordinator
>>> >>>      >>>>>>>>>>>>>>>>>>> and not an implementation for the interface
>>> >>>      >>>>>>>>>>>>>>>>> OperatorCoordinatorMetricGroup.
>>> >>>      >>>>>>>>>>>>>>>>>>> These metrics in OperatorCoordinator could
>>> be
>>> >>>      how many
>>> >>>      >>>>>>>>>>>> splits/partitions
>>> >>>      >>>>>>>>>>>>>>>>>>> have been assigned to source readers, how
>>> many
>>> >>>      files
>>> >>>      >>>>>>>> have
>>> >>>      >>>>>>>>>> been
>>> >>>      >>>>>>>>>>>> written
>>> >>>      >>>>>>>>>>>>>>>>> out
>>> >>>      >>>>>>>>>>>>>>>>>>> by sink writers, these metrics not only help
>>> >>>      users to
>>> >>>      >>>>>>>> know
>>> >>>      >>>>>>>>>> the
>>> >>>      >>>>>>>>>>>> job
>>> >>>      >>>>>>>>>>>>>>>>> progress
>>> >>>      >>>>>>>>>>>>>>>>>>> but also make big job maintaining easier.
>>> Thus we
>>> >>>      >>>>>>>> propose
>>> >>>      >>>>>>>>>> the
>>> >>>      >>>>>>>>>>>> FLIP-274
>>> >>>      >>>>>>>>>>>>>>>>> to
>>> >>>      >>>>>>>>>>>>>>>>>>> introduce a new metric group scope for
>>> >>>      >>>>>>>> OperatorCoordinator
>>> >>>      >>>>>>>>>> and
>>> >>>      >>>>>>>>>>>> provide
>>> >>>      >>>>>>>>>>>>>>>>> an
>>> >>>      >>>>>>>>>>>>>>>>>>> internal implementation for
>>> >>>      >>>>>>>>> OperatorCoordinatorMetricGroup.
>>> >>>      >>>>>>>>>>>>>>>>>>> Could you help review this FLIP when you get
>>> >>>      time? Any
>>> >>>      >>>>>>>>>>> feedback
>>> >>>      >>>>>>>>>>>> is
>>> >>>      >>>>>>>>>>>>>>>>>>> appreciated!
>>> >>>      >>>>>>>>>>>>>>>>>>>
>>> >>>      >>>>>>>>>>>>>>>>>>> Best,
>>> >>>      >>>>>>>>>>>>>>>>>>> Hang
>>> >>>      >>>>>>>>>>>>>>>>>>>
>>> >>>      >>>>>>>>>>>>>>>>>>> [1]
>>> >>>      >>>>>>>>>>>>>>>>>>>
>>> >>>      >>>>>>>>>>>>>>>>>>>
>>> >>>      >>
>>> >>>
>>> >>
>>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-274%3A+Introduce+metric+group+for+OperatorCoordinator
>>> >>>      >>
>>> >>>
>>>
>>>
>>
>

Re: [DISCUSS] FLIP-274 : Introduce metric group for OperatorCoordinator

Posted by Chesnay Schepler <ch...@apache.org>.

 > IIUC, there is only an OperatorCoordinatorMetricGroup interface 
without implementations (it never worked). There were no coordinator 
metrics implemented and registered. @Hang Ruan 
<ma...@gmail.com> please correct me if I'm wrong.

Indeed. There are interfaces, and theoretically a metric group is 
exposed to user-code via the SplitEnumeratorContext, but the metric 
group is always null in the SourceCoordinatorContext. So we shouldn't 
have to worry about backwards-compatibility.

 > However, the config key "metrics.scope.jm.job" is deprecated and is 
replaced by "metrics.scope.jm-job". Similarly, "metrics.scope.tm.job" is 
replaced with "metrics.scope.tm-job". That's why I though 
"metrics.scope.jm-operator" might be more consistent.

HA! Good catch. I only looked at the 1.16 docs where they weren't 
deprecated yet.

Then it should be "jm-operator" indeed.

On 31/01/2023 10:51, Jark Wu wrote:
> Hi Chesnay,
>
> > The new option must default to the configured value of 
> metrics.scope.jm.job because otherwise it isn't backwards-compatible 
> with existing coordinator metrics.
> IIUC, there is only an OperatorCoordinatorMetricGroup interface 
> without implementations (it never worked). There were no coordinator 
> metrics implemented and registered. @Hang Ruan 
> <ma...@gmail.com> please correct me if I'm wrong.
>
> > Intuitively I think the config key should be 
> metrics.scope.jm.operator., similar to metrics.scope.jm.job.
> However, the config key "metrics.scope.jm.job" is deprecated and is 
> replaced by "metrics.scope.jm-job". Similarly, "metrics.scope.tm.job" 
> is replaced with "metrics.scope.tm-job". That's why I though 
> "metrics.scope.jm-operator" might be more consistent.
>
> > We would only support operator_name, operator_id, task_name and 
> task_id (+ everything else from the jm/job groups of course).
> +1
>
> Best,
> Jark
>
>
>
> On Mon, 30 Jan 2023 at 20:09, Chesnay Schepler <ch...@apache.org> wrote:
>
>     The new option must default to the configured value of
>     metrics.scope.jm.job because otherwise it isn't
>     backwards-compatible with existing coordinator metrics.
>     According to some earlier response in this thread there are
>     already coordinator metrics.
>
>     If that weren't the case you'd be correct, and we'd keep the
>     default consistent with the default for tm operator metrics.
>
>     Intuitively I think the config key should be
>     metrics.scope.jm.operator., similar to metrics.scope.jm.job.
>     For symmetry we should also add new metrics.scope.tm.operator/task
>     and eventually phase out metrics.scope.operator.
>
>     > Regarding "coordinator metrics don't *belong to* any
>     tasks/vertexes", I mean the coordinator doesn't run on specific
>     tasks, so it doesn't have runtime information of tasks.
>
>     That is true. I intentionally only referred to task id/name, as
>     things like subtask index or attempt ids don't make sense for
>     coordinators as you said.
>     We would only support operator_name, operator_id, task_name and
>     task_id (+ everything else from the jm/job groups of course).
>     (Maybe in the future attempt_num could be interesting though)
>
>     On 30/01/2023 10:39, Jark Wu wrote:
>>     Hi Chesnay,
>>
>>     IIUC, our only disagreement is whether to support task info in
>>     the metric format. But we all agree with the default metric
>>     format
>>     "metrics.scope.jm-operator=<host>.jobmanager.<job_name>.<operator_name>"
>>     which doesn't include task info, right?
>>
>>     Regarding "coordinator metrics don't *belong to* any
>>     tasks/vertexes", I mean the coordinator doesn't run on specific
>>     tasks, so it doesn't have runtime information of tasks. My
>>     concern about adding task info to the coordinator metric is that
>>     1) if we register coordinator metrics for every subtask, the
>>     number of coordinator metrics will be exploded when task
>>     parallelism is large. This is an overhead to JM & metric systems,
>>     and the duplicated metrics are useless for users.
>>     2) Correct me if I'm wrong, MetricGroup#counter only supports
>>     registering one metric. It seems it can't register a metric
>>     reporting for multiple metric names (e.g. multiple subtasks).
>>
>>     Best,
>>     Jark
>>
>>
>>     On Mon, 30 Jan 2023 at 16:11, Chesnay Schepler
>>     <ch...@apache.org> wrote:
>>
>>         I hadn't looked in detail in to how the task info can be
>>         introduced, but
>>         given that the coordinates are created by the execution
>>         graph, where we
>>         only work on tasks, it should be possible and rather simple.
>>
>>         The OperatorCoordinatorHolder gets access to the
>>         ExecutionJobVertex from
>>         which it can extract everything and create the task/operator
>>         groups.
>>         Exposure to the coordinator could happen via the coordinator
>>         context as
>>         originally planned.
>>
>>          > As we know, coordinator metrics are just like
>>         JMJobMetricGroup, which
>>         don't belong to any tasks/vertexes.
>>
>>         I don't think this is true, every coordinator is scoped to a
>>         specific
>>         operator, and every operator is associated with a particular
>>         vertex.
>>
>>          From the javadocs:
>>
>>         " A coordinator for runtime operators. The
>>         OperatorCoordinator runs on
>>         the master, associated with
>>           the job vertex of the operator. It communicates with
>>         operators via
>>         sending operator events."
>>
>>         On 28/01/2023 09:15, Jark Wu wrote:
>>         > Hi all,
>>         >
>>         > IIUC, Chesnay means we should have a more general metric
>>         group for
>>         > **operator**
>>         > not for **coordinator** in JM, this would be useful to extend
>>         > other operator-specific
>>         > metrics in the future. That means the new scope format
>>         should be designed
>>         > for
>>         > the operator,
>>         > e.g.,
>>         metrics.scope.jm-operator=<host>.jobmanager.<job_name>.<operator_name>
>>         > The coordinator metric is a subgroup (a constant
>>         "coordinator" suffix) of
>>         > the JMOperatorMG.
>>         >
>>         > I think this is a nice design. However, I have a question
>>         about adding task
>>         > id/name to this list.
>>         > How to get the task id/name when reporting a coordinator
>>         metric? As we
>>         > know, coordinator metrics
>>         > are just like JMJobMetricGroup, which don't belong to any
>>         tasks/vertexes.
>>         > Do you mean reporting
>>         > coordinator metrics for every task id under the operator?
>>         >
>>         > Best,
>>         > Jark
>>         >
>>         >
>>         > On Thu, 19 Jan 2023 at 17:33, Chesnay Schepler
>>         <ch...@apache.org> wrote:
>>         >
>>         >>   > First, I do not understand why users have to configure
>>         the new scope
>>         >> format, which has a default value.
>>         >>
>>         >> If you don't use scope formats, sure. If you do use scope
>>         formats, e.g.
>>         >> to add a common prefix (which is the case for datadog
>>         users for
>>         >> example), then the current default in the FLIP is
>>         insufficient and
>>         >> requires the user to update the configuration.
>>         >>
>>         >>   > I usually do not need to change these configuration of
>>         scope formats
>>         >> when submitting the flink job.
>>         >>
>>         >> Scope formats as a whole are quite a power-user feature,
>>         but that
>>         >> doesn't mean we should ignore it.
>>         >>
>>         >>   > I try to let it extend the ComponentMetricGroup
>>         >>
>>         >> This isn't inherently required just because it is a
>>         "component".
>>         >> Component metric groups should _only_ be used for cases
>>         where such a
>>         >> component provides several bits of metadata.
>>         >> For example, tasks provide vertex ids, task names,
>>         attempted IDs etc.,
>>         >> and we'd like users to have the option on how this
>>         metadata ends up
>>         >> being used (via scope formats). This currently can't be
>>         built with the
>>         >> addGroup() methods, hence why the component groups exist.
>>         >> However, the operator coordinator _itself_does not provide
>>         several bits
>>         >> of metadata. Logically, the _operator_ does, not the
>>         coordinator.
>>         >>
>>         >>   > This make me decide to add a new scope format.
>>         >>
>>         >> And this is fine; just don't add a new scope format for
>>         the coordinator
>>         >> but logically for operators on the JM side, such that we
>>         can extend the
>>         >> set of operator-specific metrics in the future without
>>         having to add yet
>>         >> another scope format.
>>         >>
>>         >>   > The relationship between the Operator and
>>         OperatorCoordinator is
>>         >> maintained in the OperatorCoordinatorHolder. In the POC,
>>         the operator
>>         >> name/id could be found and the
>>         OperatorCoordinatorMetricGroup will be
>>         >> created here.
>>         >>
>>         >> That's all irrelevant.
>>         >>
>>         >>   > The registered metrics of OperatorCoordinator always
>>         are those which
>>         >> can not be aggregated from the tasks, like the number of
>>         unassigned
>>         >> splits in the SourceCoordinator. Actually we have not
>>         encountered the
>>         >> scenario you mentioned. I think the
>>         OperatorCoordinatorMetricGroup
>>         >> should only contain the metrics for itself instead of its
>>         subtasks.
>>         >>
>>         >> You are misunderstanding the problem.
>>         >>
>>         >> This isn't about aggregating metrics on our side or
>>         exposing metrics
>>         >> from TMs on the JM side or anything like that.
>>         >>
>>         >> It's purely about the available metadata for operator
>>         metrics on the JM
>>         >> side. Currently you suggest operator name / id; what I'm
>>         proposing is to
>>         >> add task(==vertex!) id / name to this list.
>>         >>
>>         >> The benefit here is simple. I can look up all metrics
>>         where task_id==XXX
>>         >> and get _everything_ related to that vertex, including all
>>         metrics
>>         >> associated with operators that are part of that vertex,
>>         including the
>>         >> coordinator.
>>         >>
>>         >>   > Using metrics.scope.jm.job is not enough to
>>         distinguish the different
>>         >> coordinators.
>>         >>
>>         >> I did not suggest just using metrics.scope.jm.job. I
>>         suggested that as
>>         >> the default for the jm operator scope format, that is used
>>         for the
>>         >> proposed JobManagerOperatorMetricGroup, which will have
>>         some plain
>>         >> metric groups as children for coordinators.
>>         >> Aka, you have the JMOperatorMG, then you call
>>         addGroup("coordinators")
>>         >> and then addGroup(coordinator_name) for each coordinator.
>>         >>
>>         >>   > So there are two choice for the
>>         >> InternalOperatorCoordinatorMetricGroup: 1. add and improve
>>         the new scope
>>         >> format; 2. use the metrics.scope.jm.job and
>>         ProxyMetricGroup. Which one
>>         >> is better?
>>         >>
>>         >> There are more options than that.
>>         >> I feel like there are a lot of misunderstandings in this
>>         discussion.
>>         >> Please let me know if I could clear things up. If not I
>>         can also provide
>>         >> a PoC based on your PoC if that can speed things up. It's
>>         not that
>>         >> different anyway.
>>         >>
>>         >> On 19/01/2023 07:39, Hang Ruan wrote:
>>         >>> Hi, chesnay,
>>         >>>
>>         >>> Thanks for your reply. I still have some doubts about the
>>         questions
>>         >>> you raised.
>>         >>>
>>         >>>      > Extending the set of ScopeFormats is problematic
>>         because it in
>>         >>>      practice
>>         >>>      it breaks the config if users actively rely on it,
>>         since there's now
>>         >>>      another key that they _must_ set for it to be
>>         >>>      consistent/compatible with
>>         >>>      their existing setup.
>>         >>>      Unfortunately due to how powerful scope formats are
>>         we can't derive a
>>         >>>      default value that matches their existing setup.
>>         >>>      Hence we should try to do this as rarely as possible.
>>         >>>
>>         >>>
>>         >>>      > This FLIP does not adhere to that since it
>>         proposes a dedicated
>>         >>>      format
>>         >>>      for coordinators; next time we want to expose
>>         operator-specific
>>         >>>      metrics
>>         >>>      (e.g., in the scheduler) we'd have to add another
>>         one to support it.
>>         >>>
>>         >>>
>>         >>> First, I do not understand why users have to configure
>>         the new scope
>>         >>> format, which has a default value. I usually do not need
>>         to change
>>         >>> these configuration of scope formats when submitting the
>>         flink job.
>>         >>> IMO, the OperatorCoordinator is a component running at
>>         the JobMaster.
>>         >>> I try to let it extend the ComponentMetricGroup, but I
>>         cannot find
>>         >>> a suitable scope. This make me decide to add a new scope
>>         format.
>>         >>>
>>         >>>      > Additionally, the configurable variables (operator
>>         name/id) are
>>         >>>      logically not attached to the coordinator, but
>>         operators, so to me it
>>         >>>      just doesn't make sense to structure it like this.
>>         >>>
>>         >>>
>>         >>> The relationship between the Operator and
>>         OperatorCoordinator is
>>         >>> maintained in the OperatorCoordinatorHolder. In the POC,
>>         the operator
>>         >>> name/id could be found and the
>>         OperatorCoordinatorMetricGroup will be
>>         >>> created here.
>>         >>>
>>         >>>      > Another thing I'm concerned about is that, because
>>         we don't include
>>         >>>      tasks in the hierarchy, users wishing to collect all
>>         metrics for a
>>         >>>      particular task (in this case ==vertex) now have to go
>>         >>>      significantly out
>>         >>>      of their way to get them, since they can no longer
>>         just filter by the
>>         >>>      task ID but have to be filter for _all_ operators
>>         that are part of
>>         >>>      the task.
>>         >>>
>>         >>>
>>         >>> The registered metrics of OperatorCoordinator always are
>>         those which
>>         >>> can not be aggregated from the tasks, like the number of
>>         unassigned
>>         >>> splits in the SourceCoordinator. Actually we have not
>>         encountered the
>>         >>> scenario you mentioned. I think the
>>         OperatorCoordinatorMetricGroup
>>         >>> should only contain the metrics for itself instead of its
>>         subtasks.
>>         >>>
>>         >>> Using metrics.scope.jm.job is not enough to distinguish
>>         the different
>>         >>> coordinators. The operator id/name is necessary. Then
>>         >>> the implementation should be like this. But users can not
>>         change the
>>         >>> scope in this way. This is also acceptable.
>>         >>> @Internal public class InternalOperatorCoordinatorMetricGroup
>>         >>>           extends ProxyMetricGroup<MetricGroup>
>>         >>>           implements OperatorCoordinatorMetricGroup {
>>         >>>       public InternalOperatorCoordinatorMetricGroup(
>>         >>>               JobManagerJobMetricGroup parent, OperatorID
>>         operatorID,
>>         >> String operatorName) {
>>         >>>  super(parent.addGroup(operatorID +
>>         >> operatorName).addGroup("coordinator")); }
>>         >>> }
>>         >>> So there are two choice for the
>>         >>> InternalOperatorCoordinatorMetricGroup: 1. add and
>>         improve the new
>>         >>> scope format; 2. use the metrics.scope.jm.job and
>>         ProxyMetricGroup.
>>         >>> Which one is better?
>>         >>>
>>         >>> Best,
>>         >>> Hang
>>         >>>
>>         >>>
>>         >>> Chesnay Schepler <ch...@apache.org> 于2023年1月18日周三
>>         17:03写道：
>>         >>>
>>         >>>      You're misunderstanding the problem.
>>         >>>
>>         >>>      Metric groups form a tree with each group providing
>>         certain metadata.
>>         >>>
>>         >>>      E.g., on the taskmanager we have a TM metric group
>>         that provides info
>>         >>>      about the TM, that has child task metric groups,
>>         that have child
>>         >>>      operator metric groups etc. The operator metric
>>         group again has
>>         >>>      children, where we are now mostly in the
>>         user/connector land, like a
>>         >>>      kafka metric group providing partition info or
>>         something.
>>         >>>
>>         >>>      What is being proposed is to create a coordinator
>>         metric group on
>>         >>>      the JM
>>         >>>      side that provides operator metadata.
>>         >>>
>>         >>>      A more appropriate structure is to create an
>>         operator group on the JM
>>         >>>      side that provides this info, with the coordinator
>>         metric group
>>         >>>      being a
>>         >>>      child.
>>         >>>
>>         >>>      On 17/01/2023 19:56, Steven Wu wrote:
>>         >>>      >> Additionally, the configurable variables
>>         (operator name/id) are
>>         >>>      > logically not attached to the coordinator, but
>>         operators, so to
>>         >>>      me it
>>         >>>      > just doesn't make sense to structure it like this.
>>         >>>      >
>>         >>>      > Chesnay, maybe we should clarify the terminology.
>>         To me,
>>         >>>      pperators (like
>>         >>>      > FLIP-27 source) can have two parts (coordinator and
>>         >>>      reader/subtask). I
>>         >>>      > think it is fine to include operator name/id for
>>         coordinator
>>         >>>      metrics.
>>         >>>      >
>>         >>>      > On Mon, Jan 16, 2023 at 2:13 AM Chesnay Schepler
>>         >>>      <ch...@apache.org> wrote:
>>         >>>      >
>>         >>>      >> Slight correction: Using metrics.scope.jm.job as
>>         the default
>>         >>>      should be
>>         >>>      >> safe.
>>         >>>      >>
>>         >>>      >> On 16/01/2023 10:18, Chesnay Schepler wrote:
>>         >>>      >>> The proposed ScopeFormat is still problematic
>>         for a few reasons.
>>         >>>      >>>
>>         >>>      >>> Extending the set of ScopeFormats is problematic
>>         because it in
>>         >>>      >>> practice it breaks the config if users actively
>>         rely on it, since
>>         >>>      >>> there's now another key that they _must_ set for
>>         it to be
>>         >>>      >>> consistent/compatible with their existing setup.
>>         >>>      >>> Unfortunately due to how powerful scope formats
>>         are we can't
>>         >>>      derive a
>>         >>>      >>> default value that matches their existing setup.
>>         >>>      >>> Hence we should try to do this as rarely as
>>         possible.
>>         >>>      >>>
>>         >>>      >>> This FLIP does not adhere to that since it
>>         proposes a
>>         >>>      dedicated format
>>         >>>      >>> for coordinators; next time we want to expose
>>         operator-specific
>>         >>>      >>> metrics (e.g., in the scheduler) we'd have to
>>         add another one to
>>         >>>      >>> support it.
>>         >>>      >>>
>>         >>>      >>> Additionally, the configurable variables
>>         (operator name/id) are
>>         >>>      >>> logically not attached to the coordinator, but
>>         operators, so
>>         >>>      to me it
>>         >>>      >>> just doesn't make sense to structure it like this.
>>         >>>      >>>
>>         >>>      >>> Another thing I'm concerned about is that,
>>         because we don't
>>         >>>      include
>>         >>>      >>> tasks in the hierarchy, users wishing to collect
>>         all metrics for
>>         >> a
>>         >>>      >>> particular task (in this case ==vertex) now have
>>         to go
>>         >>>      significantly
>>         >>>      >>> out of their way to get them, since they can no
>>         longer just
>>         >>>      filter by
>>         >>>      >>> the task ID but have to be filter for _all_
>>         operators that are
>>         >>>      part of
>>         >>>      >>> the task.
>>         >>>      >>>
>>         >>>      >>> On 16/01/2023 03:09, Hang Ruan wrote:
>>         >>>      >>>> Hi, @chesnay@apache.org <ch...@apache.org> ,
>>         >>>      >>>>
>>         >>>      >>>> Do you have time to help to review this FLIP
>>         again after the
>>         >>>      >>>> modification?
>>         >>>      >>>> Looking forward to your reply.
>>         >>>      >>>> This FLIP will add a new configuration for the
>>         >>>      >>>> OperatorCoordinatorMetricGroup scope format. It
>>         provides an
>>         >>>      internal
>>         >>>      >>>> implementation and is added as a component to the
>>         >>>      >>>> JobManagerJobMetricGroup.
>>         >>>      >>>> If something doesn't make sense, could you
>>         provide some
>>         >>>      advice? It
>>         >>>      >>>> will be
>>         >>>      >>>> very helpful. Thanks a lot for your help.
>>         >>>      >>>>
>>         >>>      >>>> Best,
>>         >>>      >>>> Hang
>>         >>>      >>>>
>>         >>>      >>>> Martijn Visser <ma...@apache.org>
>>         于2023年1月11日周三
>>         >>>      16:34写道：
>>         >>>      >>>>
>>         >>>      >>>>
>>         >>>      >>>>> Hi Hang,
>>         >>>      >>>>>
>>         >>>      >>>>> I'm a bit surprised that this has gone to a
>>         vote, given that
>>         >>>      Chesnay
>>         >>>      >>>>> deliberately mentioned that he would vote
>>         against it as-is.
>>         >>>      I would
>>         >>>      >>>>> expect
>>         >>>      >>>>> that before going to a vote, he has had the
>>         opportunity to
>>         >>>      >>>>> participate in
>>         >>>      >>>>> this discussion.
>>         >>>      >>>>>
>>         >>>      >>>>> Best regards,
>>         >>>      >>>>>
>>         >>>      >>>>> Martijn
>>         >>>      >>>>>
>>         >>>      >>>>> On Tue, Jan 3, 2023 at 12:53 PM Jark Wu
>>         <im...@gmail.com>
>>         >>>      wrote:
>>         >>>      >>>>>
>>         >>>      >>>>>> Hi Dong,
>>         >>>      >>>>>>
>>         >>>      >>>>>> Regarding
>>         “SplitEnumeratorContext#metricGroup”, my only
>>         >>>      concern is
>>         >>>      >>>>>> that
>>         >>>      >>>>>> this is a core interface for the FLIP. It’s
>>         hard to tell how
>>         >>>      >>>>>> sources use
>>         >>>      >>>>>> metric group without mentioning this
>>         interface. Even if
>>         >>>      this is an
>>         >>>      >>>>> existing
>>         >>>      >>>>>> API, I think it’s worth introducing the
>>         interface again and
>>         >>>      declaring
>>         >>>      >>>>> that
>>         >>>      >>>>>> we will implement the interface instead of a
>>         no-op method
>>         >>>      in this
>>         >>>      >>>>>> FLIP.
>>         >>>      >>>>>>
>>         >>>      >>>>>> Anyway, this is a minor problem and shouldn’t
>>         block this
>>         >>>      FLIP. I’m
>>         >>>      >>>>>> +1 to
>>         >>>      >>>>>> start a vote.
>>         >>>      >>>>>>
>>         >>>      >>>>>> Best,
>>         >>>      >>>>>> Jark
>>         >>>      >>>>>>
>>         >>>      >>>>>>
>>         >>>      >>>>>>> 2023年1月3日 10:03，Hang Ruan
>>         <ru...@gmail.com> 写道：
>>         >>>      >>>>>>>
>>         >>>      >>>>>>> Hi, Jark and Dong,
>>         >>>      >>>>>>>
>>         >>>      >>>>>>> Thanks for your comments. Sorry for my late
>>         reply.
>>         >>>      >>>>>>>
>>         >>>      >>>>>>> For suggestion 1, I plan to implement the
>>         >>>      >>>>>>> SplitEnumeratorMetricGroup in
>>         >>>      >>>>>>> another issue, and it is not contained in
>>         this FLIP. I
>>         >>>      will add some
>>         >>>      >>>>>>> description about this part.
>>         >>>      >>>>>>> For suggestion 2, changes about
>>         >>>      OperatorCoordinator#metricGroup has
>>         >>>      >>>>>> already
>>         >>>      >>>>>>> been documented in the proposed change section.
>>         >>>      >>>>>>>
>>         >>>      >>>>>>> Best,
>>         >>>      >>>>>>> Hang
>>         >>>      >>>>>>>
>>         >>>      >>>>>>> Dong Lin <li...@gmail.com>
>>         于2023年1月1日周日 09:45写道：
>>         >>>      >>>>>>>
>>         >>>      >>>>>>>> Let me chime-in and add comments regarding
>>         the public
>>         >>>      interface
>>         >>>      >>>>> section.
>>         >>>      >>>>>>>> Please see my comments inline.
>>         >>>      >>>>>>>>
>>         >>>      >>>>>>>> On Thu, Dec 29, 2022 at 6:08 PM Jark Wu
>>         >>>      <im...@gmail.com> wrote:
>>         >>>      >>>>>>>>
>>         >>>      >>>>>>>>> Hi Hang,
>>         >>>      >>>>>>>>>
>>         >>>      >>>>>>>>> Thanks for driving this discussion. I
>>         think this is a
>>         >>>      very useful
>>         >>>      >>>>>> feature
>>         >>>      >>>>>>>>> for connectors.
>>         >>>      >>>>>>>>>
>>         >>>      >>>>>>>>> The FLIP looks quite good to me, and I
>>         just have two
>>         >>>      suggestions.
>>         >>>      >>>>>>>>>
>>         >>>      >>>>>>>>> 1. In the "Public Interface" section,
>>         mention that the
>>         >>>      >>>>>>>>> implementation
>>         >>>      >>>>>>>>> behavior of
>>         "SplitEnumeratorContext#metricGroup" is
>>         >>>      changed from
>>         >>>      >>>>>>>> returning
>>         >>>      >>>>>>>>> null to returning a concrete
>>         SplitEnumeratorMetricGroup
>>         >>>      instance.
>>         >>>      >>>>> Even
>>         >>>      >>>>>>>>> though the API is already there, the
>>         behavior change can
>>         >>>      also be
>>         >>>      >>>>>>>> considered
>>         >>>      >>>>>>>>> a public change.
>>         >>>      >>>>>>>>>
>>         >>>      >>>>>>>> SplitEnumeratorContext#metricGroup is an
>>         interface and
>>         >>>      this FLIP
>>         >>>      >>>>>>>> does
>>         >>>      >>>>>> not
>>         >>>      >>>>>>>> seem to change its semantics/behavior. The
>>         FLIP does
>>         >>>      change the
>>         >>>      >>>>>>>> implementation/behavior of
>>         >>> SourceCoordinatorContext#metricGroup,
>>         >>>      >>>>>>>> which
>>         >>>      >>>>>> is
>>         >>>      >>>>>>>> marked @Internal.
>>         >>>      >>>>>>>>
>>         >>>      >>>>>>>> Thus it might seem a bit weird to add in
>>         the public
>>         >> interface
>>         >>>      >>>>>>>> section
>>         >>>      >>>>>>>> saying that we change the interface
>>         >>>      >>>>>>>> SplitEnumeratorContext#metricGroup
>>         >>>      >>>>>> from
>>         >>>      >>>>>>>> returning null to non-null object.
>>         >>>      >>>>>>>>
>>         >>>      >>>>>>>> 2. Mention the newly added interface of
>>         >>>      >>>>>> "OperatorCoordinator#metricGroup"
>>         >>>      >>>>>>>>> in the "Proposed Changes" section or
>>         "Public Interface"
>>         >>>      section. As
>>         >>>      >>>>> the
>>         >>>      >>>>>>>>> FLIP said, OperatorCoordinator is widely
>>         used in many
>>         >>>      connectors.
>>         >>>      >>>>>> Though
>>         >>>      >>>>>>>> it
>>         >>>      >>>>>>>>> is still an @Internal API, I think it is worth
>>         >>>      mentioning the
>>         >>>      >>>>>>>>> change
>>         >>>      >>>>> in
>>         >>>      >>>>>>>> the
>>         >>>      >>>>>>>>> FLIP.
>>         >>>      >>>>>>>> Since OperatorCoordinator is an internal
>>         API, it seems
>>         >>>      reasonable to
>>         >>>      >>>>>>>> explain it in the proposed change section.
>>         The FLIP seems
>>         >>>      to have
>>         >>>      >>>>>>>> documented this in point 5 of the proposed
>>         change section.
>>         >>>      >>>>>>>>
>>         >>>      >>>>>>>> BTW, if we think there are @internal
>>         classes that are
>>         >>>      important
>>         >>>      >>>>>>>> enough
>>         >>>      >>>>>> to
>>         >>>      >>>>>>>> be added in the public interface section,
>>         it might be
>>         >>>      useful to
>>         >>>      >>>>>> explicitly
>>         >>>      >>>>>>>> discuss this topic and document it in the
>>         "*What are the
>>         >>>      "public
>>         >>>      >>>>>>>> interfaces" of the project*" in this
>>         >>>      >>>>>>>> <
>>         >>>      >>>>>>>>
>>         >>>      >>
>>         >>>
>>         >>
>>         https://cwiki.apache.org/confluence/display/FLINK/Flink+Improvement+Proposals
>>         >>>      >>>>>>>> wiki.
>>         >>>      >>>>>>>>
>>         >>>      >>>>>>>>
>>         >>>      >>>>>>>>> Best,
>>         >>>      >>>>>>>>> Jark
>>         >>>      >>>>>>>>>
>>         >>>      >>>>>>>>>
>>         >>>      >>>>>>>>> On Mon, 26 Dec 2022 at 18:06, Hang Ruan
>>         >>>      <ru...@gmail.com>
>>         >>>      >>>>>> wrote:
>>         >>> >>>>>>>>>> Hi, thanks for the feedback, Zhu Zhu and
>>         Qingsheng.
>>         >>> >>>>>>>>>> After combining everyone's comments, the main
>>         concerns and
>>         >>>      >>>>>>>> corresponding
>>         >>> >>>>>>>>>> adjustments are as follows.
>>         >>> >>>>>>>>>>
>>         >>> >>>>>>>>>> Q1: Common metrics are not quite useful.
>>         >>> >>>>>>>>>> numEventsIn and numEventsOut counters will be
>>         removed
>>         >>>      from the
>>         >>> >>>>>>>>>> OperatorCoordinatorMetricGroup. These common
>>         metrics do
>>         >> not
>>         >>> >>>>>>>>>> provide
>>         >>>      >>>>>>>>> enough
>>         >>> >>>>>>>>>> information for users. The users are more
>>         willing to
>>         >>>      get the
>>         >>> >>>>>>>>>> number
>>         >>>      >>>>> of
>>         >>> >>>>>>>>>> events of the specified type instead of the total
>>         >>>      number. And this
>>         >>>      >>>>>>>> metric
>>         >>> >>>>>>>>>> is calculated differently. The implementation
>>         could
>>         >>>      register the
>>         >>>      >>>>>> metric
>>         >>>      >>>>>>>>> by
>>         >>> >>>>>>>>>> themselves.
>>         >>> >>>>>>>>>>
>>         >>> >>>>>>>>>> Q2: This FLIP is overly complicated.
>>         >>> >>>>>>>>>> This FLIP will become concise after these
>>         modifications.
>>         >>> >>>>>>>>>> OperatorCoordinatorMetricGroup has already been
>>         >>>      introduced into
>>         >>>      >>>>> Flink
>>         >>>      >>>>>>>> by
>>         >>> >>>>>>>>>> FLIP-179<
>>         >>> >>>>>>>>>>
>>         >>> >>>>>>>>>>
>>         >>>      >>
>>         >>>
>>         >>
>>         https://cwiki.apache.org/confluence/display/FLINK/FLIP-179%3A+Expose+Standardized+Operator+Metrics
>>         >>> >>>>>>>>>>> .
>>         >>> >>>>>>>>>> And
>>         >>> >>>>>>>>>> this FLIP will not change it. This FLIP only
>>         provides a
>>         >>>      new metric
>>         >>>      >>>>>>>> option
>>         >>> >>>>>>>>>> and a new metric group scope. The changes in
>>         proposed
>>         >>>      changes
>>         >>>      >>>>> provide
>>         >>>      >>>>>>>> the
>>         >>> >>>>>>>>>> details about the modifications for the internal
>>         >>>      classes, which
>>         >>>      >>>>> might
>>         >>>      >>>>>>>>> make
>>         >>> >>>>>>>>>> it look complicated.
>>         >>> >>>>>>>>>>
>>         >>> >>>>>>>>>> Thanks for all the comments again. If there
>>         are no further
>>         >>> >>>>>>>>>> comments,
>>         >>>      >>>>>> we
>>         >>> >>>>>>>>>> plan to start the voting thread this week.
>>         >>> >>>>>>>>>>
>>         >>> >>>>>>>>>> Best,
>>         >>> >>>>>>>>>> Hang
>>         >>> >>>>>>>>>>
>>         >>> >>>>>>>>>> Qingsheng Ren <re...@gmail.com>
>>         于2022年12月26日周一
>>         >>>      16:48写道：
>>         >>> >>>>>>>>>>
>>         >>> >>>>>>>>>>
>>         >>> >>>>>>>>>>> Thanks for the FLIP, Hang!
>>         >>> >>>>>>>>>>>
>>         >>> >>>>>>>>>>> This FLIP overall looks good to me. Actually
>>         I share
>>         >>>      the same
>>         >>>      >>>>> concern
>>         >>> >>>>>>>>>> with
>>         >>> >>>>>>>>>>> Zhu that numEventsIn and numEventsOut
>>         counters are not
>>         >>>      quite
>>         >>> >>>>>>>>>>> useful
>>         >>>      >>>>>>>> to
>>         >>> >>>>>>>>>> end
>>         >>> >>>>>>>>>>> users. OperatorEvent is a quite low-level
>>         abstraction,
>>         >>>      which
>>         >>>      >>>>> requires
>>         >>> >>>>>>>>>>> instantialization in order to be practical to
>>         users and
>>         >>> >>>>>>>>>>> developers,
>>         >>>      >>>>>>>> so
>>         >>> >>>>>>>>>>> maybe it's better to exclude them from the FLIP.
>>         >>> >>>>>>>>>>>
>>         >>> >>>>>>>>>>> Best,
>>         >>> >>>>>>>>>>> Qingsheng
>>         >>> >>>>>>>>>>> Ververica (Alibaba)
>>         >>> >>>>>>>>>>>
>>         >>> >>>>>>>>>>> On Mon, Dec 26, 2022 at 12:08 PM Zhu Zhu
>>         >>>      <re...@gmail.com>
>>         >>>      >>>>> wrote:
>>         >>> >>>>>>>>>>>> Hi Hang,
>>         >>> >>>>>>>>>>>> I still see no strong reason why we need
>>         >>> >>>>>>>>>>>> numEventsIn/numEventsOut
>>         >>> >>>>>>>>>>> metrics.
>>         >>> >>>>>>>>>>>> In the discussion in FLINK-29801, I can see
>>         the same
>>         >>>      concern
>>         >>> >>>>>>>>>>>> from
>>         >>> >>>>>>>>>> others.
>>         >>> >>>>>>>>>>>> So I prefer to exclude them from this FLIP
>>         to avoid
>>         >>> >>>>>>>>>>>> over-extending
>>         >>>      >>>>>>>>> the
>>         >>> >>>>>>>>>>>> scope.
>>         >>> >>>>>>>>>>>>
>>         >>> >>>>>>>>>>>> Thanks,
>>         >>> >>>>>>>>>>>> Zhu
>>         >>> >>>>>>>>>>>>
>>         >>> >>>>>>>>>>>> Hang Ruan <ru...@gmail.com>
>>         于2022年12月23日周五
>>         >>> >>>>>>>>>>>> 15:21写道：
>>         >>> >>>>>>>>>>>>> Hi everyone,
>>         >>> >>>>>>>>>>>>>
>>         >>> >>>>>>>>>>>>> As for the Zhu Zhu's problem, I think we
>>         should keep
>>         >>>      the common
>>         >>> >>>>>>>>>>> metrics,
>>         >>> >>>>>>>>>>>> which will help to observe incoming and outgoing
>>         >>>      events. What do
>>         >>>      >>>>>>>> you
>>         >>> >>>>>>>>>>> think,
>>         >>> >>>>>>>>>>>> @Zhu Zhu ?
>>         >>> >>>>>>>>>>>>> And @Chesnay, are there any other issues
>>         you are
>>         >>>      more concerned
>>         >>> >>>>>>>>>> about?
>>         >>> >>>>>>>>>>>> Looking forward to your reply.
>>         >>> >>>>>>>>>>>>> Thanks for all the comments. If there are
>>         no further
>>         >>> >>>>>>>>>>>>> comments, we
>>         >>> >>>>>>>>>> plan
>>         >>> >>>>>>>>>>>> to start the voting thread next week.
>>         >>> >>>>>>>>>>>>> Best,
>>         >>> >>>>>>>>>>>>> Hang
>>         >>> >>>>>>>>>>>>>
>>         >>> >>>>>>>>>>>>> Hang Ruan <ru...@gmail.com>
>>         于2022年12月15日周四
>>         >>> >>>>>>>>>>>>> 16:49写道：
>>         >>> >>>>>>>>>>>>>> Hi, Zhu Zhu,
>>         >>> >>>>>>>>>>>>>>
>>         >>> >>>>>>>>>>>>>> Thanks for your feedback!
>>         >>> >>>>>>>>>>>>>>
>>         >>> >>>>>>>>>>>>>> The OperatorCoordinator implementations are
>>         >>>      different. And
>>         >>> >>>>>>>>>>>>>> their
>>         >>> >>>>>>>>>>>> metrics are much different too. We try to
>>         find the
>>         >> common
>>         >>> >>>>>>>>>>>> metrics
>>         >>>      >>>>>>>> and
>>         >>> >>>>>>>>>> put
>>         >>> >>>>>>>>>>>> them in the OperatorCoordinatorMetricGroup.
>>         If most
>>         >>>      developers
>>         >>>      >>>>>>>> think
>>         >>>      >>>>>>>>> we
>>         >>> >>>>>>>>>>> do
>>         >>> >>>>>>>>>>>> not need these common metrics, removing the
>>         common
>>         >>>      metrics is
>>         >>> >>>>>>>>>>>> also
>>         >>> >>>>>>>>>>>> acceptable.
>>         >>> >>>>>>>>>>>>>> Best,
>>         >>> >>>>>>>>>>>>>> Hang
>>         >>> >>>>>>>>>>>>>>
>>         >>> >>>>>>>>>>>>>> Zhu Zhu <re...@gmail.com>
>>         于2022年12月14日周三
>>         >>>      22:09写道：
>>         >>> >>>>>>>>>>>>>>> Hi Hang & MengYue,
>>         >>> >>>>>>>>>>>>>>>
>>         >>> >>>>>>>>>>>>>>> Thanks for creating this FLIP!
>>         >>> >>>>>>>>>>>>>>>
>>         >>> >>>>>>>>>>>>>>> I think it is very useful, mainly in two
>>         aspects:
>>         >>> >>>>>>>>>>>>>>> 1. Enables OperatorCoordinators to
>>         register metrics.
>>         >>> >>>>>>>>>>>>>>> Currently
>>         >>> >>>>>>>>>>>>>>> the coordinators has no way to do this.
>>         And operator
>>         >>>      >>>>>>>> coordinator
>>         >>> >>>>>>>>>>>>>>> metric group further enables the
>>         SplitEnumerator
>>         >>>      to have
>>         >>> >>>>>>>>>>>>>>> access
>>         >>> >>>>>>>>>>>>>>> to a registered metric group (via the
>>         existing public
>>         >>> >>>>>>>>>>>>>>> interface
>>         >>> >>>>>>>>>>>>>>> SplitEnumeratorContext#metricGroup()),
>>         which is
>>         >>>      null at the
>>         >>>      >>>>>>>>> moment.
>>         >>> >>>>>>>>>>>>>>> 2. Defines the scope of operator coordinator
>>         >>>      metrics. A clear
>>         >>> >>>>>>>>>>>> definition
>>         >>> >>>>>>>>>>>>>>> makes it easy for users to find their wanted
>>         >>>      metrics. The
>>         >>> >>>>>>>>>> definition
>>         >>> >>>>>>>>>>>>>>> also helps to avoid conflicts of metrics from
>>         >> multiple
>>         >>> >>>>>>>>>>>> OperatorCoordinators
>>         >>> >>>>>>>>>>>>>>> of the same kind. E.g. each
>>         SourceCoordinator may
>>         >>>      have its
>>         >>> >>>>>>>>>>>>>>> own
>>         >>> >>>>>>>>>>>>>>> numSourceSplits metric, these metrics
>>         should not
>>         >>>      be directly
>>         >>> >>>>>>>>>>> registered
>>         >>> >>>>>>>>>>>>>>> to the job metric group.
>>         >>> >>>>>>>>>>>>>>>
>>         >>> >>>>>>>>>>>>>>> What I'm a bit concerned is the necessity
>>         of the
>>         >>>      introduced
>>         >>>      >>>>>>>>> common
>>         >>> >>>>>>>>>>>> metrics
>>         >>> >>>>>>>>>>>>>>> numEventsInCounter & numEventsOutCounter.
>>         If there
>>         >>>      any case
>>         >>>      >>>>>>>> which
>>         >>> >>>>>>>>>>>> strongly
>>         >>> >>>>>>>>>>>>>>> requires them?
>>         >>> >>>>>>>>>>>>>>>
>>         >>> >>>>>>>>>>>>>>> Regarding the concerns of Chesnay,
>>         >>> >>>>>>>>>>>>>>>> A dedicated coordinator MG implementation is
>>         >> overkill
>>         >>> >>>>>>>>>>>>>>> Directly using the job metric group can
>>         result in
>>         >>>      metric
>>         >>>      >>>>>>>>> conflicts,
>>         >>> >>>>>>>>>>> as
>>         >>> >>>>>>>>>>>> mentioned
>>         >>> >>>>>>>>>>>>>>> in above #2.
>>         >>> >>>>>>>>>>>>>>>
>>         >>> >>>>>>>>>>>>>>> Thanks,
>>         >>> >>>>>>>>>>>>>>> Zhu
>>         >>> >>>>>>>>>>>>>>>
>>         >>> >>>>>>>>>>>>>>> Dong Lin <li...@gmail.com>
>>         于2022年12月10日周六
>>         >>> >>>>>>>>>>>>>>> 14:16写道：
>>         >>> >>>>>>>>>>>>>>>
>>         >>> >>>>>>>>>>>>>>>> Hi Chesney,
>>         >>> >>>>>>>>>>>>>>>>
>>         >>> >>>>>>>>>>>>>>>> Just to double check with you,
>>         >>> >>>>>>>>>>>>>>>> OperatorCoordinatorMetricGroup
>>         >>> >>>>>>>>>>>> (annotated as
>>         >>> >>>>>>>>>>>>>>>> @PublicEvolving) has already been
>>         introduced into
>>         >>>      Flink by
>>         >>> >>>>>>>>>> FLIP-179
>>         >>> >>>>>>>>>>>>>>>> <
>>         >>>      >>
>>         >>>
>>         >>
>>         https://cwiki.apache.org/confluence/display/FLINK/FLIP-179%3A+Expose+Standardized+Operator+Metrics
>>         >>> >>>>>>>>>>>>> .
>>         >>> >>>>>>>>>>>>>>>> And that FLIP has got you +1.. Do you
>>         mean we
>>         >>>      should remove
>>         >>>      >>>>>>>>> this
>>         >>> >>>>>>>>>>>>>>>> OperatorCoordinatorMetricGroup?
>>         >>> >>>>>>>>>>>>>>>>
>>         >>> >>>>>>>>>>>>>>>> Regards,
>>         >>> >>>>>>>>>>>>>>>> Dong
>>         >>> >>>>>>>>>>>>>>>>
>>         >>> >>>>>>>>>>>>>>>> On Sat, Dec 10, 2022 at 1:33 AM Chesnay
>>         Schepler <
>>         >>> >>>>>>>>>>> chesnay@apache.org>
>>         >>> >>>>>>>>>>>> wrote:
>>         >>> >>>>>>>>>>>>>>>>> As a whole I feel like this FLIP is overly
>>         >>>      complicated. A
>>         >>> >>>>>>>>>>> dedicated
>>         >>> >>>>>>>>>>>>>>>>> coordinator MG implementation is
>>         overkill; it
>>         >>>      could just
>>         >>>      >>>>>>>>> re-use
>>         >>> >>>>>>>>>>> the
>>         >>> >>>>>>>>>>>>>>>>> existing Task/OperatorMGs to create the
>>         same
>>         >>>      structure we
>>         >>>      >>>>>>>>> have
>>         >>> >>>>>>>>>> on
>>         >>> >>>>>>>>>>>> TMs,
>>         >>> >>>>>>>>>>>>>>>>> similar to what we did with the Job MG.
>>         >>> >>>>>>>>>>>>>>>>>
>>         >>> >>>>>>>>>>>>>>>>> However, I'm not convinced that this is
>>         required
>>         >>>      anyway,
>>         >>> >>>>>>>>>> because
>>         >>> >>>>>>>>>>>> all the
>>         >>> >>>>>>>>>>>>>>>>> example metrics you listed can be
>>         implemented on
>>         >>>      the TM
>>         >>>      >>>>>>>> side
>>         >>>      >>>>>>>>> +
>>         >>> >>>>>>>>>>>>>>>>> aggregating them in the external
>>         metrics backend.
>>         >>> >>>>>>>>>>>>>>>>>
>>         >>> >>>>>>>>>>>>>>>>> Since I'm on holidays soon, just so no
>>         one tries
>>         >>>      to pull a
>>         >>>      >>>>>>>>> fast
>>         >>> >>>>>>>>>>>> one on
>>         >>> >>>>>>>>>>>>>>>>> me, if this were to go to a vote as-is
>>         I'd be
>>         >>>      against it.
>>         >>> >>>>>>>>>>>>>>>>>
>>         >>> >>>>>>>>>>>>>>>>>
>>         >>> >>>>>>>>>>>>>>>>> On 09/12/2022 15:30, Dong Lin wrote:
>>         >>> >>>>>>>>>>>>>>>>>> Hi Hang,
>>         >>> >>>>>>>>>>>>>>>>>>
>>         >>> >>>>>>>>>>>>>>>>>> Thanks for the FLIP! The FLIP looks
>>         good and it
>>         >>>      is pretty
>>         >>> >>>>>>>>>>>> informative.
>>         >>> >>>>>>>>>>>>>>>>>> I have just two minor comments
>>         regarding names:
>>         >>> >>>>>>>>>>>>>>>>>> - Would it be useful to rename the
>>         config key as
>>         >>> >>>>>>>>>>>>>>>>>>
>>         *metrics.scope.jm.job.operator-coordinator* for
>>         >>>      >>>>>>>> consistency
>>         >>> >>>>>>>>>>> with
>>         >>> >>>>>>>>>>>>>>>>>> *metrics.scope.jm.job
>>         >>> >>>>>>>>>>>>>>>>>> *(which is not named as *jm-job)?
>>         >>> >>>>>>>>>>>>>>>>>> - Maybe rename the variable as
>>         >>> >>>>>>>>>>> SCOPE_NAMING_OPERATOR_COORDINATOR
>>         >>> >>>>>>>>>>>> for
>>         >>> >>>>>>>>>>>>>>>>>> simplicity and consistency with
>>         >>>      SCOPE_NAMING_OPERATOR
>>         >>>      >>>>>>>>> (which
>>         >>> >>>>>>>>>> is
>>         >>> >>>>>>>>>>>> not named
>>         >>> >>>>>>>>>>>>>>>>>> as SCOPE_NAMING_TM_JOB_OPERATOR)?
>>         >>> >>>>>>>>>>>>>>>>>>
>>         >>> >>>>>>>>>>>>>>>>>> Cheers,
>>         >>> >>>>>>>>>>>>>>>>>> Dong
>>         >>> >>>>>>>>>>>>>>>>>>
>>         >>> >>>>>>>>>>>>>>>>>>
>>         >>> >>>>>>>>>>>>>>>>>>
>>         >>> >>>>>>>>>>>>>>>>>> On Thu, Dec 8, 2022 at 3:28 PM Hang Ruan <
>>         >>> >>>>>>>>>>> ruanhang1993@gmail.com>
>>         >>> >>>>>>>>>>>> wrote:
>>         >>> >>>>>>>>>>>>>>>>>>> Hi all,
>>         >>> >>>>>>>>>>>>>>>>>>>
>>         >>> >>>>>>>>>>>>>>>>>>> MengYue and I created FLIP-274[1]
>>         Introduce
>>         >>>      metric group
>>         >>>      >>>>>>>>> for
>>         >>> >>>>>>>>>>>>>>>>>>> OperatorCoordinator.
>>         OperatorCoordinator is the
>>         >>>      >>>>>>>>> coordinator
>>         >>> >>>>>>>>>>> for
>>         >>> >>>>>>>>>>>> runtime
>>         >>> >>>>>>>>>>>>>>>>>>> operators and running on Job Manager. The
>>         >>>      coordination
>>         >>> >>>>>>>>>>>> mechanism is
>>         >>> >>>>>>>>>>>>>>>>>>> operator events between
>>         OperatorCoordinator
>>         >>>      and its all
>>         >>> >>>>>>>>>>>> operators, the
>>         >>> >>>>>>>>>>>>>>>>>>> coordination is more and more using
>>         in Flink, for
>>         >>>      >>>>>>>> example
>>         >>> >>>>>>>>>> many
>>         >>> >>>>>>>>>>>> Sources
>>         >>> >>>>>>>>>>>>>>>>> and
>>         >>> >>>>>>>>>>>>>>>>>>> Sinks depend on the mechanism to
>>         assign splits
>>         >> and
>>         >>> >>>>>>>>>> coordinate
>>         >>> >>>>>>>>>>>> commits to
>>         >>> >>>>>>>>>>>>>>>>>>> external systems. The
>>         OperatorCoordinator is
>>         >>>      widely
>>         >>>      >>>>>>>> using
>>         >>>      >>>>>>>>> in
>>         >>> >>>>>>>>>>>> flink kafka
>>         >>> >>>>>>>>>>>>>>>>>>> connector, flink pulsar connector,
>>         flink cdc
>>         >>>      connector,
>>         >>> >>>>>>>>>> flink
>>         >>> >>>>>>>>>>>> hudi
>>         >>> >>>>>>>>>>>>>>>>>>> connector and so on.
>>         >>> >>>>>>>>>>>>>>>>>>>
>>         >>> >>>>>>>>>>>>>>>>>>> But there is not a suitable metric
>>         group scope
>>         >>>      for the
>>         >>> >>>>>>>>>>>>>>>>> OperatorCoordinator
>>         >>> >>>>>>>>>>>>>>>>>>> and not an implementation for the
>>         interface
>>         >>> >>>>>>>>>>>>>>>>> OperatorCoordinatorMetricGroup.
>>         >>> >>>>>>>>>>>>>>>>>>> These metrics in OperatorCoordinator
>>         could be
>>         >>>      how many
>>         >>> >>>>>>>>>>>> splits/partitions
>>         >>> >>>>>>>>>>>>>>>>>>> have been assigned to source readers,
>>         how many
>>         >>>      files
>>         >>>      >>>>>>>> have
>>         >>> >>>>>>>>>> been
>>         >>> >>>>>>>>>>>> written
>>         >>> >>>>>>>>>>>>>>>>> out
>>         >>> >>>>>>>>>>>>>>>>>>> by sink writers, these metrics not
>>         only help
>>         >>>      users to
>>         >>>      >>>>>>>> know
>>         >>> >>>>>>>>>> the
>>         >>> >>>>>>>>>>>> job
>>         >>> >>>>>>>>>>>>>>>>> progress
>>         >>> >>>>>>>>>>>>>>>>>>> but also make big job maintaining
>>         easier. Thus we
>>         >>>      >>>>>>>> propose
>>         >>> >>>>>>>>>> the
>>         >>> >>>>>>>>>>>> FLIP-274
>>         >>> >>>>>>>>>>>>>>>>> to
>>         >>> >>>>>>>>>>>>>>>>>>> introduce a new metric group scope for
>>         >>>      >>>>>>>> OperatorCoordinator
>>         >>> >>>>>>>>>> and
>>         >>> >>>>>>>>>>>> provide
>>         >>> >>>>>>>>>>>>>>>>> an
>>         >>> >>>>>>>>>>>>>>>>>>> internal implementation for
>>         >>>      >>>>>>>>> OperatorCoordinatorMetricGroup.
>>         >>> >>>>>>>>>>>>>>>>>>> Could you help review this FLIP when
>>         you get
>>         >>>      time? Any
>>         >>> >>>>>>>>>>> feedback
>>         >>> >>>>>>>>>>>> is
>>         >>> >>>>>>>>>>>>>>>>>>> appreciated!
>>         >>> >>>>>>>>>>>>>>>>>>>
>>         >>> >>>>>>>>>>>>>>>>>>> Best,
>>         >>> >>>>>>>>>>>>>>>>>>> Hang
>>         >>> >>>>>>>>>>>>>>>>>>>
>>         >>> >>>>>>>>>>>>>>>>>>> [1]
>>         >>> >>>>>>>>>>>>>>>>>>>
>>         >>> >>>>>>>>>>>>>>>>>>>
>>         >>>      >>
>>         >>>
>>         >>
>>         https://cwiki.apache.org/confluence/display/FLINK/FLIP-274%3A+Introduce+metric+group+for+OperatorCoordinator
>>         >>>      >>
>>         >>>
>>
>

Re: [DISCUSS] FLIP-274 : Introduce metric group for OperatorCoordinator

Posted by Jark Wu <im...@gmail.com>.

Hi Chesnay,

> The new option must default to the configured value of
metrics.scope.jm.job because otherwise it isn't backwards-compatible with
existing coordinator metrics.
IIUC, there is only an OperatorCoordinatorMetricGroup interface without
implementations (it never worked). There were no coordinator metrics
implemented and registered. @Hang Ruan <ru...@gmail.com> please
correct me if I'm wrong.

> Intuitively I think the config key should be metrics.scope.jm.operator.,
similar to metrics.scope.jm.job.
However, the config key "metrics.scope.jm.job" is deprecated and is
replaced by "metrics.scope.jm-job". Similarly, "metrics.scope.tm.job" is
replaced with "metrics.scope.tm-job". That's why I though
"metrics.scope.jm-operator" might be more consistent.

> We would only support operator_name, operator_id, task_name and task_id
(+ everything else from the jm/job groups of course).
+1

Best,
Jark



On Mon, 30 Jan 2023 at 20:09, Chesnay Schepler <ch...@apache.org> wrote:

> The new option must default to the configured value of
> metrics.scope.jm.job because otherwise it isn't backwards-compatible with
> existing coordinator metrics.
> According to some earlier response in this thread there are already
> coordinator metrics.
>
> If that weren't the case you'd be correct, and we'd keep the default
> consistent with the default for tm operator metrics.
>
> Intuitively I think the config key should be metrics.scope.jm.operator.,
> similar to metrics.scope.jm.job.
> For symmetry we should also add new metrics.scope.tm.operator/task and
> eventually phase out metrics.scope.operator.
>
> > Regarding "coordinator metrics don't *belong to* any tasks/vertexes", I
> mean the coordinator doesn't run on specific tasks, so it doesn't have
> runtime information of tasks.
>
> That is true. I intentionally only referred to task id/name, as things
> like subtask index or attempt ids don't make sense for coordinators as you
> said.
> We would only support operator_name, operator_id, task_name and task_id (+
> everything else from the jm/job groups of course).
> (Maybe in the future attempt_num could be interesting though)
>
> On 30/01/2023 10:39, Jark Wu wrote:
>
> Hi Chesnay,
>
> IIUC, our only disagreement is whether to support task info in the metric
> format. But we all agree with the default metric format
> "metrics.scope.jm-operator=<host>.jobmanager.<job_name>.<operator_name>"
> which doesn't include task info, right?
>
> Regarding "coordinator metrics don't *belong to* any tasks/vertexes", I
> mean the coordinator doesn't run on specific tasks, so it doesn't have
> runtime information of tasks. My concern about adding task info to the
> coordinator metric is that
> 1) if we register coordinator metrics for every subtask, the number of
> coordinator metrics will be exploded when task parallelism is large. This
> is an overhead to JM & metric systems, and the duplicated metrics are
> useless for users.
> 2) Correct me if I'm wrong, MetricGroup#counter only supports registering
> one metric. It seems it can't register a metric reporting for multiple
> metric names (e.g. multiple subtasks).
>
> Best,
> Jark
>
>
> On Mon, 30 Jan 2023 at 16:11, Chesnay Schepler <ch...@apache.org> wrote:
>
>> I hadn't looked in detail in to how the task info can be introduced, but
>> given that the coordinates are created by the execution graph, where we
>> only work on tasks, it should be possible and rather simple.
>>
>> The OperatorCoordinatorHolder gets access to the ExecutionJobVertex from
>> which it can extract everything and create the task/operator groups.
>> Exposure to the coordinator could happen via the coordinator context as
>> originally planned.
>>
>>  > As we know, coordinator metrics are just like JMJobMetricGroup, which
>> don't belong to any tasks/vertexes.
>>
>> I don't think this is true, every coordinator is scoped to a specific
>> operator, and every operator is associated with a particular vertex.
>>
>>  From the javadocs:
>>
>> " A coordinator for runtime operators. The OperatorCoordinator runs on
>> the master, associated with
>>   the job vertex of the operator. It communicates with operators via
>> sending operator events."
>>
>> On 28/01/2023 09:15, Jark Wu wrote:
>> > Hi all,
>> >
>> > IIUC, Chesnay means we should have a more general metric group for
>> > **operator**
>> > not for **coordinator** in JM, this would be useful to extend
>> > other operator-specific
>> > metrics in the future. That means the new scope format should be
>> designed
>> > for
>> > the operator,
>> > e.g.,
>> metrics.scope.jm-operator=<host>.jobmanager.<job_name>.<operator_name>
>> > The coordinator metric is a subgroup (a constant "coordinator" suffix)
>> of
>> > the JMOperatorMG.
>> >
>> > I think this is a nice design. However, I have a question about adding
>> task
>> > id/name to this list.
>> > How to get the task id/name when reporting a coordinator metric? As we
>> > know, coordinator metrics
>> > are just like JMJobMetricGroup, which don't belong to any
>> tasks/vertexes.
>> > Do you mean reporting
>> > coordinator metrics for every task id under the operator?
>> >
>> > Best,
>> > Jark
>> >
>> >
>> > On Thu, 19 Jan 2023 at 17:33, Chesnay Schepler <ch...@apache.org>
>> wrote:
>> >
>> >>   > First, I do not understand why users have to configure the new
>> scope
>> >> format, which has a default value.
>> >>
>> >> If you don't use scope formats, sure. If you do use scope formats, e.g.
>> >> to add a common prefix (which is the case for datadog users for
>> >> example), then the current default in the FLIP is insufficient and
>> >> requires the user to update the configuration.
>> >>
>> >>   > I usually do not need to change these configuration of scope
>> formats
>> >> when submitting the flink job.
>> >>
>> >> Scope formats as a whole are quite a power-user feature, but that
>> >> doesn't mean we should ignore it.
>> >>
>> >>   > I try to let it extend the ComponentMetricGroup
>> >>
>> >> This isn't inherently required just because it is a "component".
>> >> Component metric groups should _only_ be used for cases where such a
>> >> component provides several bits of metadata.
>> >> For example, tasks provide vertex ids, task names, attempted IDs etc.,
>> >> and we'd like users to have the option on how this metadata ends up
>> >> being used (via scope formats). This currently can't be built with the
>> >> addGroup() methods, hence why the component groups exist.
>> >> However, the operator coordinator _itself_does not provide several bits
>> >> of metadata. Logically, the _operator_ does, not the coordinator.
>> >>
>> >>   > This make me decide to add a new scope format.
>> >>
>> >> And this is fine; just don't add a new scope format for the coordinator
>> >> but logically for operators on the JM side, such that we can extend the
>> >> set of operator-specific metrics in the future without having to add
>> yet
>> >> another scope format.
>> >>
>> >>   > The relationship between the Operator and OperatorCoordinator is
>> >> maintained in the OperatorCoordinatorHolder. In the POC, the operator
>> >> name/id could be found and the OperatorCoordinatorMetricGroup will be
>> >> created here.
>> >>
>> >> That's all irrelevant.
>> >>
>> >>   > The registered metrics of OperatorCoordinator always are those
>> which
>> >> can not be aggregated from the tasks, like the number of unassigned
>> >> splits in the SourceCoordinator. Actually we have not encountered the
>> >> scenario you mentioned. I think the OperatorCoordinatorMetricGroup
>> >> should only contain the metrics for itself instead of its subtasks.
>> >>
>> >> You are misunderstanding the problem.
>> >>
>> >> This isn't about aggregating metrics on our side or exposing metrics
>> >> from TMs on the JM side or anything like that.
>> >>
>> >> It's purely about the available metadata for operator metrics on the JM
>> >> side. Currently you suggest operator name / id; what I'm proposing is
>> to
>> >> add task(==vertex!) id / name to this list.
>> >>
>> >> The benefit here is simple. I can look up all metrics where
>> task_id==XXX
>> >> and get _everything_ related to that vertex, including all metrics
>> >> associated with operators that are part of that vertex, including the
>> >> coordinator.
>> >>
>> >>   > Using metrics.scope.jm.job is not enough to distinguish the
>> different
>> >> coordinators.
>> >>
>> >> I did not suggest just using metrics.scope.jm.job. I suggested that as
>> >> the default for the jm operator scope format, that is used for the
>> >> proposed JobManagerOperatorMetricGroup, which will have some plain
>> >> metric groups as children for coordinators.
>> >> Aka, you have the JMOperatorMG, then you call addGroup("coordinators")
>> >> and then addGroup(coordinator_name) for each coordinator.
>> >>
>> >>   > So there are two choice for the
>> >> InternalOperatorCoordinatorMetricGroup: 1. add and improve the new
>> scope
>> >> format; 2. use the metrics.scope.jm.job and ProxyMetricGroup. Which one
>> >> is better?
>> >>
>> >> There are more options than that.
>> >> I feel like there are a lot of misunderstandings in this discussion.
>> >> Please let me know if I could clear things up. If not I can also
>> provide
>> >> a PoC based on your PoC if that can speed things up. It's not that
>> >> different anyway.
>> >>
>> >> On 19/01/2023 07:39, Hang Ruan wrote:
>> >>> Hi, chesnay,
>> >>>
>> >>> Thanks for your reply. I still have some doubts about the questions
>> >>> you raised.
>> >>>
>> >>>      > Extending the set of ScopeFormats is problematic because it in
>> >>>      practice
>> >>>      it breaks the config if users actively rely on it, since there's
>> now
>> >>>      another key that they _must_ set for it to be
>> >>>      consistent/compatible with
>> >>>      their existing setup.
>> >>>      Unfortunately due to how powerful scope formats are we can't
>> derive a
>> >>>      default value that matches their existing setup.
>> >>>      Hence we should try to do this as rarely as possible.
>> >>>
>> >>>
>> >>>      > This FLIP does not adhere to that since it proposes a dedicated
>> >>>      format
>> >>>      for coordinators; next time we want to expose operator-specific
>> >>>      metrics
>> >>>      (e.g., in the scheduler) we'd have to add another one to support
>> it.
>> >>>
>> >>>
>> >>> First, I do not understand why users have to configure the new scope
>> >>> format, which has a default value. I usually do not need to change
>> >>> these configuration of scope formats when submitting the flink job.
>> >>> IMO, the OperatorCoordinator is a component running at the JobMaster.
>> >>> I try to let it extend the ComponentMetricGroup, but I cannot find
>> >>> a suitable scope. This make me decide to add a new scope format.
>> >>>
>> >>>      > Additionally, the configurable variables (operator name/id) are
>> >>>      logically not attached to the coordinator, but operators, so to
>> me it
>> >>>      just doesn't make sense to structure it like this.
>> >>>
>> >>>
>> >>> The relationship between the Operator and OperatorCoordinator is
>> >>> maintained in the OperatorCoordinatorHolder. In the POC, the operator
>> >>> name/id could be found and the OperatorCoordinatorMetricGroup will be
>> >>> created here.
>> >>>
>> >>>      > Another thing I'm concerned about is that, because we don't
>> include
>> >>>      tasks in the hierarchy, users wishing to collect all metrics for
>> a
>> >>>      particular task (in this case ==vertex) now have to go
>> >>>      significantly out
>> >>>      of their way to get them, since they can no longer just filter
>> by the
>> >>>      task ID but have to be filter for _all_ operators that are part
>> of
>> >>>      the task.
>> >>>
>> >>>
>> >>> The registered metrics of OperatorCoordinator always are those which
>> >>> can not be aggregated from the tasks, like the number of unassigned
>> >>> splits in the SourceCoordinator. Actually we have not encountered the
>> >>> scenario you mentioned. I think the OperatorCoordinatorMetricGroup
>> >>> should only contain the metrics for itself instead of its subtasks.
>> >>>
>> >>> Using metrics.scope.jm.job is not enough to distinguish the different
>> >>> coordinators. The operator id/name is necessary. Then
>> >>> the implementation should be like this. But users can not change the
>> >>> scope in this way. This is also acceptable.
>> >>> @Internal public class InternalOperatorCoordinatorMetricGroup
>> >>>           extends ProxyMetricGroup<MetricGroup>
>> >>>           implements OperatorCoordinatorMetricGroup {
>> >>>       public InternalOperatorCoordinatorMetricGroup(
>> >>>               JobManagerJobMetricGroup parent, OperatorID operatorID,
>> >> String operatorName) {
>> >>>           super(parent.addGroup(operatorID +
>> >> operatorName).addGroup("coordinator")); }
>> >>> }
>> >>> So there are two choice for the
>> >>> InternalOperatorCoordinatorMetricGroup: 1. add and improve the new
>> >>> scope format; 2. use the metrics.scope.jm.job and ProxyMetricGroup.
>> >>> Which one is better?
>> >>>
>> >>> Best,
>> >>> Hang
>> >>>
>> >>>
>> >>> Chesnay Schepler <ch...@apache.org> 于2023年1月18日周三 17:03写道：
>> >>>
>> >>>      You're misunderstanding the problem.
>> >>>
>> >>>      Metric groups form a tree with each group providing certain
>> metadata.
>> >>>
>> >>>      E.g., on the taskmanager we have a TM metric group that provides
>> info
>> >>>      about the TM, that has child task metric groups, that have child
>> >>>      operator metric groups etc. The operator metric group again has
>> >>>      children, where we are now mostly in the user/connector land,
>> like a
>> >>>      kafka metric group providing partition info or something.
>> >>>
>> >>>      What is being proposed is to create a coordinator metric group on
>> >>>      the JM
>> >>>      side that provides operator metadata.
>> >>>
>> >>>      A more appropriate structure is to create an operator group on
>> the JM
>> >>>      side that provides this info, with the coordinator metric group
>> >>>      being a
>> >>>      child.
>> >>>
>> >>>      On 17/01/2023 19:56, Steven Wu wrote:
>> >>>      >> Additionally, the configurable variables (operator name/id)
>> are
>> >>>      > logically not attached to the coordinator, but operators, so to
>> >>>      me it
>> >>>      > just doesn't make sense to structure it like this.
>> >>>      >
>> >>>      > Chesnay, maybe we should clarify the terminology. To me,
>> >>>      pperators (like
>> >>>      > FLIP-27 source) can have two parts (coordinator and
>> >>>      reader/subtask). I
>> >>>      > think it is fine to include operator name/id for coordinator
>> >>>      metrics.
>> >>>      >
>> >>>      > On Mon, Jan 16, 2023 at 2:13 AM Chesnay Schepler
>> >>>      <ch...@apache.org> wrote:
>> >>>      >
>> >>>      >> Slight correction: Using metrics.scope.jm.job as the default
>> >>>      should be
>> >>>      >> safe.
>> >>>      >>
>> >>>      >> On 16/01/2023 10:18, Chesnay Schepler wrote:
>> >>>      >>> The proposed ScopeFormat is still problematic for a few
>> reasons.
>> >>>      >>>
>> >>>      >>> Extending the set of ScopeFormats is problematic because it
>> in
>> >>>      >>> practice it breaks the config if users actively rely on it,
>> since
>> >>>      >>> there's now another key that they _must_ set for it to be
>> >>>      >>> consistent/compatible with their existing setup.
>> >>>      >>> Unfortunately due to how powerful scope formats are we can't
>> >>>      derive a
>> >>>      >>> default value that matches their existing setup.
>> >>>      >>> Hence we should try to do this as rarely as possible.
>> >>>      >>>
>> >>>      >>> This FLIP does not adhere to that since it proposes a
>> >>>      dedicated format
>> >>>      >>> for coordinators; next time we want to expose
>> operator-specific
>> >>>      >>> metrics (e.g., in the scheduler) we'd have to add another
>> one to
>> >>>      >>> support it.
>> >>>      >>>
>> >>>      >>> Additionally, the configurable variables (operator name/id)
>> are
>> >>>      >>> logically not attached to the coordinator, but operators, so
>> >>>      to me it
>> >>>      >>> just doesn't make sense to structure it like this.
>> >>>      >>>
>> >>>      >>> Another thing I'm concerned about is that, because we don't
>> >>>      include
>> >>>      >>> tasks in the hierarchy, users wishing to collect all metrics
>> for
>> >> a
>> >>>      >>> particular task (in this case ==vertex) now have to go
>> >>>      significantly
>> >>>      >>> out of their way to get them, since they can no longer just
>> >>>      filter by
>> >>>      >>> the task ID but have to be filter for _all_ operators that
>> are
>> >>>      part of
>> >>>      >>> the task.
>> >>>      >>>
>> >>>      >>> On 16/01/2023 03:09, Hang Ruan wrote:
>> >>>      >>>> Hi, @chesnay@apache.org <ch...@apache.org> ,
>> >>>      >>>>
>> >>>      >>>> Do you have time to help to review this FLIP again after the
>> >>>      >>>> modification?
>> >>>      >>>> Looking forward to your reply.
>> >>>      >>>> This FLIP will add a new configuration for the
>> >>>      >>>> OperatorCoordinatorMetricGroup scope format. It provides an
>> >>>      internal
>> >>>      >>>> implementation and is added as a component to the
>> >>>      >>>> JobManagerJobMetricGroup.
>> >>>      >>>> If something doesn't make sense, could you provide some
>> >>>      advice? It
>> >>>      >>>> will be
>> >>>      >>>> very helpful. Thanks a lot for your help.
>> >>>      >>>>
>> >>>      >>>> Best,
>> >>>      >>>> Hang
>> >>>      >>>>
>> >>>      >>>> Martijn Visser <ma...@apache.org> 于2023年1月11日周三
>> >>>      16:34写道：
>> >>>      >>>>
>> >>>      >>>>
>> >>>      >>>>> Hi Hang,
>> >>>      >>>>>
>> >>>      >>>>> I'm a bit surprised that this has gone to a vote, given
>> that
>> >>>      Chesnay
>> >>>      >>>>> deliberately mentioned that he would vote against it as-is.
>> >>>      I would
>> >>>      >>>>> expect
>> >>>      >>>>> that before going to a vote, he has had the opportunity to
>> >>>      >>>>> participate in
>> >>>      >>>>> this discussion.
>> >>>      >>>>>
>> >>>      >>>>> Best regards,
>> >>>      >>>>>
>> >>>      >>>>> Martijn
>> >>>      >>>>>
>> >>>      >>>>> On Tue, Jan 3, 2023 at 12:53 PM Jark Wu <im...@gmail.com>
>> >>>      wrote:
>> >>>      >>>>>
>> >>>      >>>>>> Hi Dong,
>> >>>      >>>>>>
>> >>>      >>>>>> Regarding “SplitEnumeratorContext#metricGroup”, my only
>> >>>      concern is
>> >>>      >>>>>> that
>> >>>      >>>>>> this is a core interface for the FLIP. It’s hard to tell
>> how
>> >>>      >>>>>> sources use
>> >>>      >>>>>> metric group without mentioning this interface. Even if
>> >>>      this is an
>> >>>      >>>>> existing
>> >>>      >>>>>> API, I think it’s worth introducing the interface again
>> and
>> >>>      declaring
>> >>>      >>>>> that
>> >>>      >>>>>> we will implement the interface instead of a no-op method
>> >>>      in this
>> >>>      >>>>>> FLIP.
>> >>>      >>>>>>
>> >>>      >>>>>> Anyway, this is a minor problem and shouldn’t block this
>> >>>      FLIP. I’m
>> >>>      >>>>>> +1 to
>> >>>      >>>>>> start a vote.
>> >>>      >>>>>>
>> >>>      >>>>>> Best,
>> >>>      >>>>>> Jark
>> >>>      >>>>>>
>> >>>      >>>>>>
>> >>>      >>>>>>> 2023年1月3日 10:03，Hang Ruan <ru...@gmail.com> 写道：
>> >>>      >>>>>>>
>> >>>      >>>>>>> Hi, Jark and Dong,
>> >>>      >>>>>>>
>> >>>      >>>>>>> Thanks for your comments. Sorry for my late reply.
>> >>>      >>>>>>>
>> >>>      >>>>>>> For suggestion 1, I plan to implement the
>> >>>      >>>>>>> SplitEnumeratorMetricGroup in
>> >>>      >>>>>>> another issue, and it is not contained in this FLIP. I
>> >>>      will add some
>> >>>      >>>>>>> description about this part.
>> >>>      >>>>>>> For suggestion 2, changes about
>> >>>      OperatorCoordinator#metricGroup has
>> >>>      >>>>>> already
>> >>>      >>>>>>> been documented in the proposed change section.
>> >>>      >>>>>>>
>> >>>      >>>>>>> Best,
>> >>>      >>>>>>> Hang
>> >>>      >>>>>>>
>> >>>      >>>>>>> Dong Lin <li...@gmail.com> 于2023年1月1日周日 09:45写道：
>> >>>      >>>>>>>
>> >>>      >>>>>>>> Let me chime-in and add comments regarding the public
>> >>>      interface
>> >>>      >>>>> section.
>> >>>      >>>>>>>> Please see my comments inline.
>> >>>      >>>>>>>>
>> >>>      >>>>>>>> On Thu, Dec 29, 2022 at 6:08 PM Jark Wu
>> >>>      <im...@gmail.com> wrote:
>> >>>      >>>>>>>>
>> >>>      >>>>>>>>> Hi Hang,
>> >>>      >>>>>>>>>
>> >>>      >>>>>>>>> Thanks for driving this discussion. I think this is a
>> >>>      very useful
>> >>>      >>>>>> feature
>> >>>      >>>>>>>>> for connectors.
>> >>>      >>>>>>>>>
>> >>>      >>>>>>>>> The FLIP looks quite good to me, and I just have two
>> >>>      suggestions.
>> >>>      >>>>>>>>>
>> >>>      >>>>>>>>> 1. In the "Public Interface" section, mention that the
>> >>>      >>>>>>>>> implementation
>> >>>      >>>>>>>>> behavior of "SplitEnumeratorContext#metricGroup" is
>> >>>      changed from
>> >>>      >>>>>>>> returning
>> >>>      >>>>>>>>> null to returning a concrete SplitEnumeratorMetricGroup
>> >>>      instance.
>> >>>      >>>>> Even
>> >>>      >>>>>>>>> though the API is already there, the behavior change
>> can
>> >>>      also be
>> >>>      >>>>>>>> considered
>> >>>      >>>>>>>>> a public change.
>> >>>      >>>>>>>>>
>> >>>      >>>>>>>> SplitEnumeratorContext#metricGroup is an interface and
>> >>>      this FLIP
>> >>>      >>>>>>>> does
>> >>>      >>>>>> not
>> >>>      >>>>>>>> seem to change its semantics/behavior. The FLIP does
>> >>>      change the
>> >>>      >>>>>>>> implementation/behavior of
>> >>>      SourceCoordinatorContext#metricGroup,
>> >>>      >>>>>>>> which
>> >>>      >>>>>> is
>> >>>      >>>>>>>> marked @Internal.
>> >>>      >>>>>>>>
>> >>>      >>>>>>>> Thus it might seem a bit weird to add in the public
>> >> interface
>> >>>      >>>>>>>> section
>> >>>      >>>>>>>> saying that we change the interface
>> >>>      >>>>>>>> SplitEnumeratorContext#metricGroup
>> >>>      >>>>>> from
>> >>>      >>>>>>>> returning null to non-null object.
>> >>>      >>>>>>>>
>> >>>      >>>>>>>> 2. Mention the newly added interface of
>> >>>      >>>>>> "OperatorCoordinator#metricGroup"
>> >>>      >>>>>>>>> in the "Proposed Changes" section or "Public Interface"
>> >>>      section. As
>> >>>      >>>>> the
>> >>>      >>>>>>>>> FLIP said, OperatorCoordinator is widely used in many
>> >>>      connectors.
>> >>>      >>>>>> Though
>> >>>      >>>>>>>> it
>> >>>      >>>>>>>>> is still an @Internal API, I think it is worth
>> >>>      mentioning the
>> >>>      >>>>>>>>> change
>> >>>      >>>>> in
>> >>>      >>>>>>>> the
>> >>>      >>>>>>>>> FLIP.
>> >>>      >>>>>>>> Since OperatorCoordinator is an internal API, it seems
>> >>>      reasonable to
>> >>>      >>>>>>>> explain it in the proposed change section. The FLIP
>> seems
>> >>>      to have
>> >>>      >>>>>>>> documented this in point 5 of the proposed change
>> section.
>> >>>      >>>>>>>>
>> >>>      >>>>>>>> BTW, if we think there are @internal classes that are
>> >>>      important
>> >>>      >>>>>>>> enough
>> >>>      >>>>>> to
>> >>>      >>>>>>>> be added in the public interface section, it might be
>> >>>      useful to
>> >>>      >>>>>> explicitly
>> >>>      >>>>>>>> discuss this topic and document it in the "*What are the
>> >>>      "public
>> >>>      >>>>>>>> interfaces" of the project*" in this
>> >>>      >>>>>>>> <
>> >>>      >>>>>>>>
>> >>>      >>
>> >>>
>> >>
>> https://cwiki.apache.org/confluence/display/FLINK/Flink+Improvement+Proposals
>> >>>      >>>>>>>> wiki.
>> >>>      >>>>>>>>
>> >>>      >>>>>>>>
>> >>>      >>>>>>>>> Best,
>> >>>      >>>>>>>>> Jark
>> >>>      >>>>>>>>>
>> >>>      >>>>>>>>>
>> >>>      >>>>>>>>> On Mon, 26 Dec 2022 at 18:06, Hang Ruan
>> >>>      <ru...@gmail.com>
>> >>>      >>>>>> wrote:
>> >>>      >>>>>>>>>> Hi, thanks for the feedback, Zhu Zhu and Qingsheng.
>> >>>      >>>>>>>>>> After combining everyone's comments, the main
>> concerns and
>> >>>      >>>>>>>> corresponding
>> >>>      >>>>>>>>>> adjustments are as follows.
>> >>>      >>>>>>>>>>
>> >>>      >>>>>>>>>> Q1: Common metrics are not quite useful.
>> >>>      >>>>>>>>>> numEventsIn and numEventsOut counters will be removed
>> >>>      from the
>> >>>      >>>>>>>>>> OperatorCoordinatorMetricGroup. These common metrics
>> do
>> >> not
>> >>>      >>>>>>>>>> provide
>> >>>      >>>>>>>>> enough
>> >>>      >>>>>>>>>> information for users. The users are more willing to
>> >>>      get the
>> >>>      >>>>>>>>>> number
>> >>>      >>>>> of
>> >>>      >>>>>>>>>> events of the specified type instead of the total
>> >>>      number. And this
>> >>>      >>>>>>>> metric
>> >>>      >>>>>>>>>> is calculated differently. The implementation could
>> >>>      register the
>> >>>      >>>>>> metric
>> >>>      >>>>>>>>> by
>> >>>      >>>>>>>>>> themselves.
>> >>>      >>>>>>>>>>
>> >>>      >>>>>>>>>> Q2: This FLIP is overly complicated.
>> >>>      >>>>>>>>>> This FLIP will become concise after these
>> modifications.
>> >>>      >>>>>>>>>> OperatorCoordinatorMetricGroup has already been
>> >>>      introduced into
>> >>>      >>>>> Flink
>> >>>      >>>>>>>> by
>> >>>      >>>>>>>>>> FLIP-179<
>> >>>      >>>>>>>>>>
>> >>>      >>>>>>>>>>
>> >>>      >>
>> >>>
>> >>
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-179%3A+Expose+Standardized+Operator+Metrics
>> >>>      >>>>>>>>>>> .
>> >>>      >>>>>>>>>> And
>> >>>      >>>>>>>>>> this FLIP will not change it. This FLIP only provides
>> a
>> >>>      new metric
>> >>>      >>>>>>>> option
>> >>>      >>>>>>>>>> and a new metric group scope. The changes in proposed
>> >>>      changes
>> >>>      >>>>> provide
>> >>>      >>>>>>>> the
>> >>>      >>>>>>>>>> details about the modifications for the internal
>> >>>      classes, which
>> >>>      >>>>> might
>> >>>      >>>>>>>>> make
>> >>>      >>>>>>>>>> it look complicated.
>> >>>      >>>>>>>>>>
>> >>>      >>>>>>>>>> Thanks for all the comments again. If there are no
>> further
>> >>>      >>>>>>>>>> comments,
>> >>>      >>>>>> we
>> >>>      >>>>>>>>>> plan to start the voting thread this week.
>> >>>      >>>>>>>>>>
>> >>>      >>>>>>>>>> Best,
>> >>>      >>>>>>>>>> Hang
>> >>>      >>>>>>>>>>
>> >>>      >>>>>>>>>> Qingsheng Ren <re...@gmail.com> 于2022年12月26日周一
>> >>>      16:48写道：
>> >>>      >>>>>>>>>>
>> >>>      >>>>>>>>>>
>> >>>      >>>>>>>>>>> Thanks for the FLIP, Hang!
>> >>>      >>>>>>>>>>>
>> >>>      >>>>>>>>>>> This FLIP overall looks good to me. Actually I share
>> >>>      the same
>> >>>      >>>>> concern
>> >>>      >>>>>>>>>> with
>> >>>      >>>>>>>>>>> Zhu that numEventsIn and numEventsOut counters are
>> not
>> >>>      quite
>> >>>      >>>>>>>>>>> useful
>> >>>      >>>>>>>> to
>> >>>      >>>>>>>>>> end
>> >>>      >>>>>>>>>>> users. OperatorEvent is a quite low-level
>> abstraction,
>> >>>      which
>> >>>      >>>>> requires
>> >>>      >>>>>>>>>>> instantialization in order to be practical to users
>> and
>> >>>      >>>>>>>>>>> developers,
>> >>>      >>>>>>>> so
>> >>>      >>>>>>>>>>> maybe it's better to exclude them from the FLIP.
>> >>>      >>>>>>>>>>>
>> >>>      >>>>>>>>>>> Best,
>> >>>      >>>>>>>>>>> Qingsheng
>> >>>      >>>>>>>>>>> Ververica (Alibaba)
>> >>>      >>>>>>>>>>>
>> >>>      >>>>>>>>>>> On Mon, Dec 26, 2022 at 12:08 PM Zhu Zhu
>> >>>      <re...@gmail.com>
>> >>>      >>>>> wrote:
>> >>>      >>>>>>>>>>>> Hi Hang,
>> >>>      >>>>>>>>>>>> I still see no strong reason why we need
>> >>>      >>>>>>>>>>>> numEventsIn/numEventsOut
>> >>>      >>>>>>>>>>> metrics.
>> >>>      >>>>>>>>>>>> In the discussion in FLINK-29801, I can see the same
>> >>>      concern
>> >>>      >>>>>>>>>>>> from
>> >>>      >>>>>>>>>> others.
>> >>>      >>>>>>>>>>>> So I prefer to exclude them from this FLIP to avoid
>> >>>      >>>>>>>>>>>> over-extending
>> >>>      >>>>>>>>> the
>> >>>      >>>>>>>>>>>> scope.
>> >>>      >>>>>>>>>>>>
>> >>>      >>>>>>>>>>>> Thanks,
>> >>>      >>>>>>>>>>>> Zhu
>> >>>      >>>>>>>>>>>>
>> >>>      >>>>>>>>>>>> Hang Ruan <ru...@gmail.com> 于2022年12月23日周五
>> >>>      >>>>>>>>>>>> 15:21写道：
>> >>>      >>>>>>>>>>>>> Hi everyone,
>> >>>      >>>>>>>>>>>>>
>> >>>      >>>>>>>>>>>>> As for the Zhu Zhu's problem, I think we should
>> keep
>> >>>      the common
>> >>>      >>>>>>>>>>> metrics,
>> >>>      >>>>>>>>>>>> which will help to observe incoming and outgoing
>> >>>      events. What do
>> >>>      >>>>>>>> you
>> >>>      >>>>>>>>>>> think,
>> >>>      >>>>>>>>>>>> @Zhu Zhu ?
>> >>>      >>>>>>>>>>>>> And @Chesnay, are there any other issues you are
>> >>>      more concerned
>> >>>      >>>>>>>>>> about?
>> >>>      >>>>>>>>>>>> Looking forward to your reply.
>> >>>      >>>>>>>>>>>>> Thanks for all the comments. If there are no
>> further
>> >>>      >>>>>>>>>>>>> comments, we
>> >>>      >>>>>>>>>> plan
>> >>>      >>>>>>>>>>>> to start the voting thread next week.
>> >>>      >>>>>>>>>>>>> Best,
>> >>>      >>>>>>>>>>>>> Hang
>> >>>      >>>>>>>>>>>>>
>> >>>      >>>>>>>>>>>>> Hang Ruan <ru...@gmail.com> 于2022年12月15日周四
>> >>>      >>>>>>>>>>>>> 16:49写道：
>> >>>      >>>>>>>>>>>>>> Hi, Zhu Zhu,
>> >>>      >>>>>>>>>>>>>>
>> >>>      >>>>>>>>>>>>>> Thanks for your feedback!
>> >>>      >>>>>>>>>>>>>>
>> >>>      >>>>>>>>>>>>>> The OperatorCoordinator implementations are
>> >>>      different. And
>> >>>      >>>>>>>>>>>>>> their
>> >>>      >>>>>>>>>>>> metrics are much different too. We try to find the
>> >> common
>> >>>      >>>>>>>>>>>> metrics
>> >>>      >>>>>>>> and
>> >>>      >>>>>>>>>> put
>> >>>      >>>>>>>>>>>> them in the OperatorCoordinatorMetricGroup. If most
>> >>>      developers
>> >>>      >>>>>>>> think
>> >>>      >>>>>>>>> we
>> >>>      >>>>>>>>>>> do
>> >>>      >>>>>>>>>>>> not need these common metrics, removing the common
>> >>>      metrics is
>> >>>      >>>>>>>>>>>> also
>> >>>      >>>>>>>>>>>> acceptable.
>> >>>      >>>>>>>>>>>>>> Best,
>> >>>      >>>>>>>>>>>>>> Hang
>> >>>      >>>>>>>>>>>>>>
>> >>>      >>>>>>>>>>>>>> Zhu Zhu <re...@gmail.com> 于2022年12月14日周三
>> >>>      22:09写道：
>> >>>      >>>>>>>>>>>>>>> Hi Hang & MengYue,
>> >>>      >>>>>>>>>>>>>>>
>> >>>      >>>>>>>>>>>>>>> Thanks for creating this FLIP!
>> >>>      >>>>>>>>>>>>>>>
>> >>>      >>>>>>>>>>>>>>> I think it is very useful, mainly in two aspects:
>> >>>      >>>>>>>>>>>>>>> 1. Enables OperatorCoordinators to register
>> metrics.
>> >>>      >>>>>>>>>>>>>>> Currently
>> >>>      >>>>>>>>>>>>>>> the coordinators has no way to do this. And
>> operator
>> >>>      >>>>>>>> coordinator
>> >>>      >>>>>>>>>>>>>>> metric group further enables the SplitEnumerator
>> >>>      to have
>> >>>      >>>>>>>>>>>>>>> access
>> >>>      >>>>>>>>>>>>>>> to a registered metric group (via the existing
>> public
>> >>>      >>>>>>>>>>>>>>> interface
>> >>>      >>>>>>>>>>>>>>> SplitEnumeratorContext#metricGroup()), which is
>> >>>      null at the
>> >>>      >>>>>>>>> moment.
>> >>>      >>>>>>>>>>>>>>> 2. Defines the scope of operator coordinator
>> >>>      metrics. A clear
>> >>>      >>>>>>>>>>>> definition
>> >>>      >>>>>>>>>>>>>>> makes it easy for users to find their wanted
>> >>>      metrics. The
>> >>>      >>>>>>>>>> definition
>> >>>      >>>>>>>>>>>>>>> also helps to avoid conflicts of metrics from
>> >> multiple
>> >>>      >>>>>>>>>>>> OperatorCoordinators
>> >>>      >>>>>>>>>>>>>>> of the same kind. E.g. each SourceCoordinator may
>> >>>      have its
>> >>>      >>>>>>>>>>>>>>> own
>> >>>      >>>>>>>>>>>>>>> numSourceSplits metric, these metrics should not
>> >>>      be directly
>> >>>      >>>>>>>>>>> registered
>> >>>      >>>>>>>>>>>>>>> to the job metric group.
>> >>>      >>>>>>>>>>>>>>>
>> >>>      >>>>>>>>>>>>>>> What I'm a bit concerned is the necessity of the
>> >>>      introduced
>> >>>      >>>>>>>>> common
>> >>>      >>>>>>>>>>>> metrics
>> >>>      >>>>>>>>>>>>>>> numEventsInCounter & numEventsOutCounter. If
>> there
>> >>>      any case
>> >>>      >>>>>>>> which
>> >>>      >>>>>>>>>>>> strongly
>> >>>      >>>>>>>>>>>>>>> requires them?
>> >>>      >>>>>>>>>>>>>>>
>> >>>      >>>>>>>>>>>>>>> Regarding the concerns of Chesnay,
>> >>>      >>>>>>>>>>>>>>>> A dedicated coordinator MG implementation is
>> >> overkill
>> >>>      >>>>>>>>>>>>>>> Directly using the job metric group can result in
>> >>>      metric
>> >>>      >>>>>>>>> conflicts,
>> >>>      >>>>>>>>>>> as
>> >>>      >>>>>>>>>>>> mentioned
>> >>>      >>>>>>>>>>>>>>> in above #2.
>> >>>      >>>>>>>>>>>>>>>
>> >>>      >>>>>>>>>>>>>>> Thanks,
>> >>>      >>>>>>>>>>>>>>> Zhu
>> >>>      >>>>>>>>>>>>>>>
>> >>>      >>>>>>>>>>>>>>> Dong Lin <li...@gmail.com> 于2022年12月10日周六
>> >>>      >>>>>>>>>>>>>>> 14:16写道：
>> >>>      >>>>>>>>>>>>>>>
>> >>>      >>>>>>>>>>>>>>>> Hi Chesney,
>> >>>      >>>>>>>>>>>>>>>>
>> >>>      >>>>>>>>>>>>>>>> Just to double check with you,
>> >>>      >>>>>>>>>>>>>>>> OperatorCoordinatorMetricGroup
>> >>>      >>>>>>>>>>>> (annotated as
>> >>>      >>>>>>>>>>>>>>>> @PublicEvolving) has already been introduced
>> into
>> >>>      Flink by
>> >>>      >>>>>>>>>> FLIP-179
>> >>>      >>>>>>>>>>>>>>>> <
>> >>>      >>
>> >>>
>> >>
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-179%3A+Expose+Standardized+Operator+Metrics
>> >>>      >>>>>>>>>>>>> .
>> >>>      >>>>>>>>>>>>>>>> And that FLIP has got you +1.. Do you mean we
>> >>>      should remove
>> >>>      >>>>>>>>> this
>> >>>      >>>>>>>>>>>>>>>> OperatorCoordinatorMetricGroup?
>> >>>      >>>>>>>>>>>>>>>>
>> >>>      >>>>>>>>>>>>>>>> Regards,
>> >>>      >>>>>>>>>>>>>>>> Dong
>> >>>      >>>>>>>>>>>>>>>>
>> >>>      >>>>>>>>>>>>>>>> On Sat, Dec 10, 2022 at 1:33 AM Chesnay
>> Schepler <
>> >>>      >>>>>>>>>>> chesnay@apache.org>
>> >>>      >>>>>>>>>>>> wrote:
>> >>>      >>>>>>>>>>>>>>>>> As a whole I feel like this FLIP is overly
>> >>>      complicated. A
>> >>>      >>>>>>>>>>> dedicated
>> >>>      >>>>>>>>>>>>>>>>> coordinator MG implementation is overkill; it
>> >>>      could just
>> >>>      >>>>>>>>> re-use
>> >>>      >>>>>>>>>>> the
>> >>>      >>>>>>>>>>>>>>>>> existing Task/OperatorMGs to create the same
>> >>>      structure we
>> >>>      >>>>>>>>> have
>> >>>      >>>>>>>>>> on
>> >>>      >>>>>>>>>>>> TMs,
>> >>>      >>>>>>>>>>>>>>>>> similar to what we did with the Job MG.
>> >>>      >>>>>>>>>>>>>>>>>
>> >>>      >>>>>>>>>>>>>>>>> However, I'm not convinced that this is
>> required
>> >>>      anyway,
>> >>>      >>>>>>>>>> because
>> >>>      >>>>>>>>>>>> all the
>> >>>      >>>>>>>>>>>>>>>>> example metrics you listed can be implemented
>> on
>> >>>      the TM
>> >>>      >>>>>>>> side
>> >>>      >>>>>>>>> +
>> >>>      >>>>>>>>>>>>>>>>> aggregating them in the external metrics
>> backend.
>> >>>      >>>>>>>>>>>>>>>>>
>> >>>      >>>>>>>>>>>>>>>>> Since I'm on holidays soon, just so no one
>> tries
>> >>>      to pull a
>> >>>      >>>>>>>>> fast
>> >>>      >>>>>>>>>>>> one on
>> >>>      >>>>>>>>>>>>>>>>> me, if this were to go to a vote as-is I'd be
>> >>>      against it.
>> >>>      >>>>>>>>>>>>>>>>>
>> >>>      >>>>>>>>>>>>>>>>>
>> >>>      >>>>>>>>>>>>>>>>> On 09/12/2022 15:30, Dong Lin wrote:
>> >>>      >>>>>>>>>>>>>>>>>> Hi Hang,
>> >>>      >>>>>>>>>>>>>>>>>>
>> >>>      >>>>>>>>>>>>>>>>>> Thanks for the FLIP! The FLIP looks good and
>> it
>> >>>      is pretty
>> >>>      >>>>>>>>>>>> informative.
>> >>>      >>>>>>>>>>>>>>>>>> I have just two minor comments regarding
>> names:
>> >>>      >>>>>>>>>>>>>>>>>> - Would it be useful to rename the config key
>> as
>> >>>      >>>>>>>>>>>>>>>>>> *metrics.scope.jm.job.operator-coordinator*
>> for
>> >>>      >>>>>>>> consistency
>> >>>      >>>>>>>>>>> with
>> >>>      >>>>>>>>>>>>>>>>>> *metrics.scope.jm.job
>> >>>      >>>>>>>>>>>>>>>>>> *(which is not named as *jm-job)?
>> >>>      >>>>>>>>>>>>>>>>>> - Maybe rename the variable as
>> >>>      >>>>>>>>>>> SCOPE_NAMING_OPERATOR_COORDINATOR
>> >>>      >>>>>>>>>>>> for
>> >>>      >>>>>>>>>>>>>>>>>> simplicity and consistency with
>> >>>      SCOPE_NAMING_OPERATOR
>> >>>      >>>>>>>>> (which
>> >>>      >>>>>>>>>> is
>> >>>      >>>>>>>>>>>> not named
>> >>>      >>>>>>>>>>>>>>>>>> as SCOPE_NAMING_TM_JOB_OPERATOR)?
>> >>>      >>>>>>>>>>>>>>>>>>
>> >>>      >>>>>>>>>>>>>>>>>> Cheers,
>> >>>      >>>>>>>>>>>>>>>>>> Dong
>> >>>      >>>>>>>>>>>>>>>>>>
>> >>>      >>>>>>>>>>>>>>>>>>
>> >>>      >>>>>>>>>>>>>>>>>>
>> >>>      >>>>>>>>>>>>>>>>>> On Thu, Dec 8, 2022 at 3:28 PM Hang Ruan <
>> >>>      >>>>>>>>>>> ruanhang1993@gmail.com>
>> >>>      >>>>>>>>>>>> wrote:
>> >>>      >>>>>>>>>>>>>>>>>>> Hi all,
>> >>>      >>>>>>>>>>>>>>>>>>>
>> >>>      >>>>>>>>>>>>>>>>>>> MengYue and I created FLIP-274[1] Introduce
>> >>>      metric group
>> >>>      >>>>>>>>> for
>> >>>      >>>>>>>>>>>>>>>>>>> OperatorCoordinator. OperatorCoordinator is
>> the
>> >>>      >>>>>>>>> coordinator
>> >>>      >>>>>>>>>>> for
>> >>>      >>>>>>>>>>>> runtime
>> >>>      >>>>>>>>>>>>>>>>>>> operators and running on Job Manager. The
>> >>>      coordination
>> >>>      >>>>>>>>>>>> mechanism is
>> >>>      >>>>>>>>>>>>>>>>>>> operator events between OperatorCoordinator
>> >>>      and its all
>> >>>      >>>>>>>>>>>> operators, the
>> >>>      >>>>>>>>>>>>>>>>>>> coordination is more and more using in
>> Flink, for
>> >>>      >>>>>>>> example
>> >>>      >>>>>>>>>> many
>> >>>      >>>>>>>>>>>> Sources
>> >>>      >>>>>>>>>>>>>>>>> and
>> >>>      >>>>>>>>>>>>>>>>>>> Sinks depend on the mechanism to assign
>> splits
>> >> and
>> >>>      >>>>>>>>>> coordinate
>> >>>      >>>>>>>>>>>> commits to
>> >>>      >>>>>>>>>>>>>>>>>>> external systems. The OperatorCoordinator is
>> >>>      widely
>> >>>      >>>>>>>> using
>> >>>      >>>>>>>>> in
>> >>>      >>>>>>>>>>>> flink kafka
>> >>>      >>>>>>>>>>>>>>>>>>> connector, flink pulsar connector, flink cdc
>> >>>      connector,
>> >>>      >>>>>>>>>> flink
>> >>>      >>>>>>>>>>>> hudi
>> >>>      >>>>>>>>>>>>>>>>>>> connector and so on.
>> >>>      >>>>>>>>>>>>>>>>>>>
>> >>>      >>>>>>>>>>>>>>>>>>> But there is not a suitable metric group
>> scope
>> >>>      for the
>> >>>      >>>>>>>>>>>>>>>>> OperatorCoordinator
>> >>>      >>>>>>>>>>>>>>>>>>> and not an implementation for the interface
>> >>>      >>>>>>>>>>>>>>>>> OperatorCoordinatorMetricGroup.
>> >>>      >>>>>>>>>>>>>>>>>>> These metrics in OperatorCoordinator could be
>> >>>      how many
>> >>>      >>>>>>>>>>>> splits/partitions
>> >>>      >>>>>>>>>>>>>>>>>>> have been assigned to source readers, how
>> many
>> >>>      files
>> >>>      >>>>>>>> have
>> >>>      >>>>>>>>>> been
>> >>>      >>>>>>>>>>>> written
>> >>>      >>>>>>>>>>>>>>>>> out
>> >>>      >>>>>>>>>>>>>>>>>>> by sink writers, these metrics not only help
>> >>>      users to
>> >>>      >>>>>>>> know
>> >>>      >>>>>>>>>> the
>> >>>      >>>>>>>>>>>> job
>> >>>      >>>>>>>>>>>>>>>>> progress
>> >>>      >>>>>>>>>>>>>>>>>>> but also make big job maintaining easier.
>> Thus we
>> >>>      >>>>>>>> propose
>> >>>      >>>>>>>>>> the
>> >>>      >>>>>>>>>>>> FLIP-274
>> >>>      >>>>>>>>>>>>>>>>> to
>> >>>      >>>>>>>>>>>>>>>>>>> introduce a new metric group scope for
>> >>>      >>>>>>>> OperatorCoordinator
>> >>>      >>>>>>>>>> and
>> >>>      >>>>>>>>>>>> provide
>> >>>      >>>>>>>>>>>>>>>>> an
>> >>>      >>>>>>>>>>>>>>>>>>> internal implementation for
>> >>>      >>>>>>>>> OperatorCoordinatorMetricGroup.
>> >>>      >>>>>>>>>>>>>>>>>>> Could you help review this FLIP when you get
>> >>>      time? Any
>> >>>      >>>>>>>>>>> feedback
>> >>>      >>>>>>>>>>>> is
>> >>>      >>>>>>>>>>>>>>>>>>> appreciated!
>> >>>      >>>>>>>>>>>>>>>>>>>
>> >>>      >>>>>>>>>>>>>>>>>>> Best,
>> >>>      >>>>>>>>>>>>>>>>>>> Hang
>> >>>      >>>>>>>>>>>>>>>>>>>
>> >>>      >>>>>>>>>>>>>>>>>>> [1]
>> >>>      >>>>>>>>>>>>>>>>>>>
>> >>>      >>>>>>>>>>>>>>>>>>>
>> >>>      >>
>> >>>
>> >>
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-274%3A+Introduce+metric+group+for+OperatorCoordinator
>> >>>      >>
>> >>>
>>
>>
>

Re: [DISCUSS] FLIP-274 : Introduce metric group for OperatorCoordinator

Posted by Chesnay Schepler <ch...@apache.org>.

The new option must default to the configured value of 
metrics.scope.jm.job because otherwise it isn't backwards-compatible 
with existing coordinator metrics.
According to some earlier response in this thread there are already 
coordinator metrics.

If that weren't the case you'd be correct, and we'd keep the default 
consistent with the default for tm operator metrics.

Intuitively I think the config key should be metrics.scope.jm.operator., 
similar to metrics.scope.jm.job.
For symmetry we should also add new metrics.scope.tm.operator/task and 
eventually phase out metrics.scope.operator.

 > Regarding "coordinator metrics don't *belong to* any tasks/vertexes", 
I mean the coordinator doesn't run on specific tasks, so it doesn't have 
runtime information of tasks.

That is true. I intentionally only referred to task id/name, as things 
like subtask index or attempt ids don't make sense for coordinators as 
you said.
We would only support operator_name, operator_id, task_name and task_id 
(+ everything else from the jm/job groups of course).
(Maybe in the future attempt_num could be interesting though)

On 30/01/2023 10:39, Jark Wu wrote:
> Hi Chesnay,
>
> IIUC, our only disagreement is whether to support task info in the 
> metric format. But we all agree with the default metric format 
> "metrics.scope.jm-operator=<host>.jobmanager.<job_name>.<operator_name>" 
> which doesn't include task info, right?
>
> Regarding "coordinator metrics don't *belong to* any tasks/vertexes", 
> I mean the coordinator doesn't run on specific tasks, so it doesn't 
> have runtime information of tasks. My concern about adding task info 
> to the coordinator metric is that
> 1) if we register coordinator metrics for every subtask, the number of 
> coordinator metrics will be exploded when task parallelism is large. 
> This is an overhead to JM & metric systems, and the duplicated metrics 
> are useless for users.
> 2) Correct me if I'm wrong, MetricGroup#counter only supports 
> registering one metric. It seems it can't register a metric reporting 
> for multiple metric names (e.g. multiple subtasks).
>
> Best,
> Jark
>
>
> On Mon, 30 Jan 2023 at 16:11, Chesnay Schepler <ch...@apache.org> wrote:
>
>     I hadn't looked in detail in to how the task info can be
>     introduced, but
>     given that the coordinates are created by the execution graph,
>     where we
>     only work on tasks, it should be possible and rather simple.
>
>     The OperatorCoordinatorHolder gets access to the
>     ExecutionJobVertex from
>     which it can extract everything and create the task/operator groups.
>     Exposure to the coordinator could happen via the coordinator
>     context as
>     originally planned.
>
>      > As we know, coordinator metrics are just like JMJobMetricGroup,
>     which
>     don't belong to any tasks/vertexes.
>
>     I don't think this is true, every coordinator is scoped to a specific
>     operator, and every operator is associated with a particular vertex.
>
>      From the javadocs:
>
>     " A coordinator for runtime operators. The OperatorCoordinator
>     runs on
>     the master, associated with
>       the job vertex of the operator. It communicates with operators via
>     sending operator events."
>
>     On 28/01/2023 09:15, Jark Wu wrote:
>     > Hi all,
>     >
>     > IIUC, Chesnay means we should have a more general metric group for
>     > **operator**
>     > not for **coordinator** in JM, this would be useful to extend
>     > other operator-specific
>     > metrics in the future. That means the new scope format should be
>     designed
>     > for
>     > the operator,
>     > e.g.,
>     metrics.scope.jm-operator=<host>.jobmanager.<job_name>.<operator_name>
>     > The coordinator metric is a subgroup (a constant "coordinator"
>     suffix) of
>     > the JMOperatorMG.
>     >
>     > I think this is a nice design. However, I have a question about
>     adding task
>     > id/name to this list.
>     > How to get the task id/name when reporting a coordinator metric?
>     As we
>     > know, coordinator metrics
>     > are just like JMJobMetricGroup, which don't belong to any
>     tasks/vertexes.
>     > Do you mean reporting
>     > coordinator metrics for every task id under the operator?
>     >
>     > Best,
>     > Jark
>     >
>     >
>     > On Thu, 19 Jan 2023 at 17:33, Chesnay Schepler
>     <ch...@apache.org> wrote:
>     >
>     >>   > First, I do not understand why users have to configure the
>     new scope
>     >> format, which has a default value.
>     >>
>     >> If you don't use scope formats, sure. If you do use scope
>     formats, e.g.
>     >> to add a common prefix (which is the case for datadog users for
>     >> example), then the current default in the FLIP is insufficient and
>     >> requires the user to update the configuration.
>     >>
>     >>   > I usually do not need to change these configuration of
>     scope formats
>     >> when submitting the flink job.
>     >>
>     >> Scope formats as a whole are quite a power-user feature, but that
>     >> doesn't mean we should ignore it.
>     >>
>     >>   > I try to let it extend the ComponentMetricGroup
>     >>
>     >> This isn't inherently required just because it is a "component".
>     >> Component metric groups should _only_ be used for cases where
>     such a
>     >> component provides several bits of metadata.
>     >> For example, tasks provide vertex ids, task names, attempted
>     IDs etc.,
>     >> and we'd like users to have the option on how this metadata ends up
>     >> being used (via scope formats). This currently can't be built
>     with the
>     >> addGroup() methods, hence why the component groups exist.
>     >> However, the operator coordinator _itself_does not provide
>     several bits
>     >> of metadata. Logically, the _operator_ does, not the coordinator.
>     >>
>     >>   > This make me decide to add a new scope format.
>     >>
>     >> And this is fine; just don't add a new scope format for the
>     coordinator
>     >> but logically for operators on the JM side, such that we can
>     extend the
>     >> set of operator-specific metrics in the future without having
>     to add yet
>     >> another scope format.
>     >>
>     >>   > The relationship between the Operator and
>     OperatorCoordinator is
>     >> maintained in the OperatorCoordinatorHolder. In the POC, the
>     operator
>     >> name/id could be found and the OperatorCoordinatorMetricGroup
>     will be
>     >> created here.
>     >>
>     >> That's all irrelevant.
>     >>
>     >>   > The registered metrics of OperatorCoordinator always are
>     those which
>     >> can not be aggregated from the tasks, like the number of unassigned
>     >> splits in the SourceCoordinator. Actually we have not
>     encountered the
>     >> scenario you mentioned. I think the OperatorCoordinatorMetricGroup
>     >> should only contain the metrics for itself instead of its subtasks.
>     >>
>     >> You are misunderstanding the problem.
>     >>
>     >> This isn't about aggregating metrics on our side or exposing
>     metrics
>     >> from TMs on the JM side or anything like that.
>     >>
>     >> It's purely about the available metadata for operator metrics
>     on the JM
>     >> side. Currently you suggest operator name / id; what I'm
>     proposing is to
>     >> add task(==vertex!) id / name to this list.
>     >>
>     >> The benefit here is simple. I can look up all metrics where
>     task_id==XXX
>     >> and get _everything_ related to that vertex, including all metrics
>     >> associated with operators that are part of that vertex,
>     including the
>     >> coordinator.
>     >>
>     >>   > Using metrics.scope.jm.job is not enough to distinguish the
>     different
>     >> coordinators.
>     >>
>     >> I did not suggest just using metrics.scope.jm.job. I suggested
>     that as
>     >> the default for the jm operator scope format, that is used for the
>     >> proposed JobManagerOperatorMetricGroup, which will have some plain
>     >> metric groups as children for coordinators.
>     >> Aka, you have the JMOperatorMG, then you call
>     addGroup("coordinators")
>     >> and then addGroup(coordinator_name) for each coordinator.
>     >>
>     >>   > So there are two choice for the
>     >> InternalOperatorCoordinatorMetricGroup: 1. add and improve the
>     new scope
>     >> format; 2. use the metrics.scope.jm.job and ProxyMetricGroup.
>     Which one
>     >> is better?
>     >>
>     >> There are more options than that.
>     >> I feel like there are a lot of misunderstandings in this
>     discussion.
>     >> Please let me know if I could clear things up. If not I can
>     also provide
>     >> a PoC based on your PoC if that can speed things up. It's not that
>     >> different anyway.
>     >>
>     >> On 19/01/2023 07:39, Hang Ruan wrote:
>     >>> Hi, chesnay,
>     >>>
>     >>> Thanks for your reply. I still have some doubts about the
>     questions
>     >>> you raised.
>     >>>
>     >>>      > Extending the set of ScopeFormats is problematic
>     because it in
>     >>>      practice
>     >>>      it breaks the config if users actively rely on it, since
>     there's now
>     >>>      another key that they _must_ set for it to be
>     >>>      consistent/compatible with
>     >>>      their existing setup.
>     >>>      Unfortunately due to how powerful scope formats are we
>     can't derive a
>     >>>      default value that matches their existing setup.
>     >>>      Hence we should try to do this as rarely as possible.
>     >>>
>     >>>
>     >>>      > This FLIP does not adhere to that since it proposes a
>     dedicated
>     >>>      format
>     >>>      for coordinators; next time we want to expose
>     operator-specific
>     >>>      metrics
>     >>>      (e.g., in the scheduler) we'd have to add another one to
>     support it.
>     >>>
>     >>>
>     >>> First, I do not understand why users have to configure the new
>     scope
>     >>> format, which has a default value. I usually do not need to change
>     >>> these configuration of scope formats when submitting the flink
>     job.
>     >>> IMO, the OperatorCoordinator is a component running at the
>     JobMaster.
>     >>> I try to let it extend the ComponentMetricGroup, but I cannot find
>     >>> a suitable scope. This make me decide to add a new scope format.
>     >>>
>     >>>      > Additionally, the configurable variables (operator
>     name/id) are
>     >>>      logically not attached to the coordinator, but operators,
>     so to me it
>     >>>      just doesn't make sense to structure it like this.
>     >>>
>     >>>
>     >>> The relationship between the Operator and OperatorCoordinator is
>     >>> maintained in the OperatorCoordinatorHolder. In the POC, the
>     operator
>     >>> name/id could be found and the OperatorCoordinatorMetricGroup
>     will be
>     >>> created here.
>     >>>
>     >>>      > Another thing I'm concerned about is that, because we
>     don't include
>     >>>      tasks in the hierarchy, users wishing to collect all
>     metrics for a
>     >>>      particular task (in this case ==vertex) now have to go
>     >>>      significantly out
>     >>>      of their way to get them, since they can no longer just
>     filter by the
>     >>>      task ID but have to be filter for _all_ operators that
>     are part of
>     >>>      the task.
>     >>>
>     >>>
>     >>> The registered metrics of OperatorCoordinator always are those
>     which
>     >>> can not be aggregated from the tasks, like the number of
>     unassigned
>     >>> splits in the SourceCoordinator. Actually we have not
>     encountered the
>     >>> scenario you mentioned. I think the OperatorCoordinatorMetricGroup
>     >>> should only contain the metrics for itself instead of its
>     subtasks.
>     >>>
>     >>> Using metrics.scope.jm.job is not enough to distinguish the
>     different
>     >>> coordinators. The operator id/name is necessary. Then
>     >>> the implementation should be like this. But users can not
>     change the
>     >>> scope in this way. This is also acceptable.
>     >>> @Internal public class InternalOperatorCoordinatorMetricGroup
>     >>>           extends ProxyMetricGroup<MetricGroup>
>     >>>           implements OperatorCoordinatorMetricGroup {
>     >>>       public InternalOperatorCoordinatorMetricGroup(
>     >>>               JobManagerJobMetricGroup parent, OperatorID
>     operatorID,
>     >> String operatorName) {
>     >>>           super(parent.addGroup(operatorID +
>     >> operatorName).addGroup("coordinator")); }
>     >>> }
>     >>> So there are two choice for the
>     >>> InternalOperatorCoordinatorMetricGroup: 1. add and improve the new
>     >>> scope format; 2. use the metrics.scope.jm.job and
>     ProxyMetricGroup.
>     >>> Which one is better?
>     >>>
>     >>> Best,
>     >>> Hang
>     >>>
>     >>>
>     >>> Chesnay Schepler <ch...@apache.org> 于2023年1月18日周三
>     17:03写道：
>     >>>
>     >>>      You're misunderstanding the problem.
>     >>>
>     >>>      Metric groups form a tree with each group providing
>     certain metadata.
>     >>>
>     >>>      E.g., on the taskmanager we have a TM metric group that
>     provides info
>     >>>      about the TM, that has child task metric groups, that
>     have child
>     >>>      operator metric groups etc. The operator metric group
>     again has
>     >>>      children, where we are now mostly in the user/connector
>     land, like a
>     >>>      kafka metric group providing partition info or something.
>     >>>
>     >>>      What is being proposed is to create a coordinator metric
>     group on
>     >>>      the JM
>     >>>      side that provides operator metadata.
>     >>>
>     >>>      A more appropriate structure is to create an operator
>     group on the JM
>     >>>      side that provides this info, with the coordinator metric
>     group
>     >>>      being a
>     >>>      child.
>     >>>
>     >>>      On 17/01/2023 19:56, Steven Wu wrote:
>     >>>      >> Additionally, the configurable variables (operator
>     name/id) are
>     >>>      > logically not attached to the coordinator, but
>     operators, so to
>     >>>      me it
>     >>>      > just doesn't make sense to structure it like this.
>     >>>      >
>     >>>      > Chesnay, maybe we should clarify the terminology. To me,
>     >>>      pperators (like
>     >>>      > FLIP-27 source) can have two parts (coordinator and
>     >>>      reader/subtask). I
>     >>>      > think it is fine to include operator name/id for
>     coordinator
>     >>>      metrics.
>     >>>      >
>     >>>      > On Mon, Jan 16, 2023 at 2:13 AM Chesnay Schepler
>     >>>      <ch...@apache.org> wrote:
>     >>>      >
>     >>>      >> Slight correction: Using metrics.scope.jm.job as the
>     default
>     >>>      should be
>     >>>      >> safe.
>     >>>      >>
>     >>>      >> On 16/01/2023 10:18, Chesnay Schepler wrote:
>     >>>      >>> The proposed ScopeFormat is still problematic for a
>     few reasons.
>     >>>      >>>
>     >>>      >>> Extending the set of ScopeFormats is problematic
>     because it in
>     >>>      >>> practice it breaks the config if users actively rely
>     on it, since
>     >>>      >>> there's now another key that they _must_ set for it to be
>     >>>      >>> consistent/compatible with their existing setup.
>     >>>      >>> Unfortunately due to how powerful scope formats are
>     we can't
>     >>>      derive a
>     >>>      >>> default value that matches their existing setup.
>     >>>      >>> Hence we should try to do this as rarely as possible.
>     >>>      >>>
>     >>>      >>> This FLIP does not adhere to that since it proposes a
>     >>>      dedicated format
>     >>>      >>> for coordinators; next time we want to expose
>     operator-specific
>     >>>      >>> metrics (e.g., in the scheduler) we'd have to add
>     another one to
>     >>>      >>> support it.
>     >>>      >>>
>     >>>      >>> Additionally, the configurable variables (operator
>     name/id) are
>     >>>      >>> logically not attached to the coordinator, but
>     operators, so
>     >>>      to me it
>     >>>      >>> just doesn't make sense to structure it like this.
>     >>>      >>>
>     >>>      >>> Another thing I'm concerned about is that, because we
>     don't
>     >>>      include
>     >>>      >>> tasks in the hierarchy, users wishing to collect all
>     metrics for
>     >> a
>     >>>      >>> particular task (in this case ==vertex) now have to go
>     >>>      significantly
>     >>>      >>> out of their way to get them, since they can no
>     longer just
>     >>>      filter by
>     >>>      >>> the task ID but have to be filter for _all_ operators
>     that are
>     >>>      part of
>     >>>      >>> the task.
>     >>>      >>>
>     >>>      >>> On 16/01/2023 03:09, Hang Ruan wrote:
>     >>>      >>>> Hi, @chesnay@apache.org <ch...@apache.org> ,
>     >>>      >>>>
>     >>>      >>>> Do you have time to help to review this FLIP again
>     after the
>     >>>      >>>> modification?
>     >>>      >>>> Looking forward to your reply.
>     >>>      >>>> This FLIP will add a new configuration for the
>     >>>      >>>> OperatorCoordinatorMetricGroup scope format. It
>     provides an
>     >>>      internal
>     >>>      >>>> implementation and is added as a component to the
>     >>>      >>>> JobManagerJobMetricGroup.
>     >>>      >>>> If something doesn't make sense, could you provide some
>     >>>      advice? It
>     >>>      >>>> will be
>     >>>      >>>> very helpful. Thanks a lot for your help.
>     >>>      >>>>
>     >>>      >>>> Best,
>     >>>      >>>> Hang
>     >>>      >>>>
>     >>>      >>>> Martijn Visser <ma...@apache.org>
>     于2023年1月11日周三
>     >>>      16:34写道：
>     >>>      >>>>
>     >>>      >>>>
>     >>>      >>>>> Hi Hang,
>     >>>      >>>>>
>     >>>      >>>>> I'm a bit surprised that this has gone to a vote,
>     given that
>     >>>      Chesnay
>     >>>      >>>>> deliberately mentioned that he would vote against
>     it as-is.
>     >>>      I would
>     >>>      >>>>> expect
>     >>>      >>>>> that before going to a vote, he has had the
>     opportunity to
>     >>>      >>>>> participate in
>     >>>      >>>>> this discussion.
>     >>>      >>>>>
>     >>>      >>>>> Best regards,
>     >>>      >>>>>
>     >>>      >>>>> Martijn
>     >>>      >>>>>
>     >>>      >>>>> On Tue, Jan 3, 2023 at 12:53 PM Jark Wu
>     <im...@gmail.com>
>     >>>      wrote:
>     >>>      >>>>>
>     >>>      >>>>>> Hi Dong,
>     >>>      >>>>>>
>     >>>      >>>>>> Regarding “SplitEnumeratorContext#metricGroup”, my
>     only
>     >>>      concern is
>     >>>      >>>>>> that
>     >>>      >>>>>> this is a core interface for the FLIP. It’s hard
>     to tell how
>     >>>      >>>>>> sources use
>     >>>      >>>>>> metric group without mentioning this interface.
>     Even if
>     >>>      this is an
>     >>>      >>>>> existing
>     >>>      >>>>>> API, I think it’s worth introducing the interface
>     again and
>     >>>      declaring
>     >>>      >>>>> that
>     >>>      >>>>>> we will implement the interface instead of a no-op
>     method
>     >>>      in this
>     >>>      >>>>>> FLIP.
>     >>>      >>>>>>
>     >>>      >>>>>> Anyway, this is a minor problem and shouldn’t
>     block this
>     >>>      FLIP. I’m
>     >>>      >>>>>> +1 to
>     >>>      >>>>>> start a vote.
>     >>>      >>>>>>
>     >>>      >>>>>> Best,
>     >>>      >>>>>> Jark
>     >>>      >>>>>>
>     >>>      >>>>>>
>     >>>      >>>>>>> 2023年1月3日 10:03，Hang Ruan
>     <ru...@gmail.com> 写道：
>     >>>      >>>>>>>
>     >>>      >>>>>>> Hi, Jark and Dong,
>     >>>      >>>>>>>
>     >>>      >>>>>>> Thanks for your comments. Sorry for my late reply.
>     >>>      >>>>>>>
>     >>>      >>>>>>> For suggestion 1, I plan to implement the
>     >>>      >>>>>>> SplitEnumeratorMetricGroup in
>     >>>      >>>>>>> another issue, and it is not contained in this
>     FLIP. I
>     >>>      will add some
>     >>>      >>>>>>> description about this part.
>     >>>      >>>>>>> For suggestion 2, changes about
>     >>>      OperatorCoordinator#metricGroup has
>     >>>      >>>>>> already
>     >>>      >>>>>>> been documented in the proposed change section.
>     >>>      >>>>>>>
>     >>>      >>>>>>> Best,
>     >>>      >>>>>>> Hang
>     >>>      >>>>>>>
>     >>>      >>>>>>> Dong Lin <li...@gmail.com> 于2023年1月1日周日
>     09:45写道：
>     >>>      >>>>>>>
>     >>>      >>>>>>>> Let me chime-in and add comments regarding the
>     public
>     >>>      interface
>     >>>      >>>>> section.
>     >>>      >>>>>>>> Please see my comments inline.
>     >>>      >>>>>>>>
>     >>>      >>>>>>>> On Thu, Dec 29, 2022 at 6:08 PM Jark Wu
>     >>>      <im...@gmail.com> wrote:
>     >>>      >>>>>>>>
>     >>>      >>>>>>>>> Hi Hang,
>     >>>      >>>>>>>>>
>     >>>      >>>>>>>>> Thanks for driving this discussion. I think
>     this is a
>     >>>      very useful
>     >>>      >>>>>> feature
>     >>>      >>>>>>>>> for connectors.
>     >>>      >>>>>>>>>
>     >>>      >>>>>>>>> The FLIP looks quite good to me, and I just
>     have two
>     >>>      suggestions.
>     >>>      >>>>>>>>>
>     >>>      >>>>>>>>> 1. In the "Public Interface" section, mention
>     that the
>     >>>      >>>>>>>>> implementation
>     >>>      >>>>>>>>> behavior of "SplitEnumeratorContext#metricGroup" is
>     >>>      changed from
>     >>>      >>>>>>>> returning
>     >>>      >>>>>>>>> null to returning a concrete
>     SplitEnumeratorMetricGroup
>     >>>      instance.
>     >>>      >>>>> Even
>     >>>      >>>>>>>>> though the API is already there, the behavior
>     change can
>     >>>      also be
>     >>>      >>>>>>>> considered
>     >>>      >>>>>>>>> a public change.
>     >>>      >>>>>>>>>
>     >>>      >>>>>>>> SplitEnumeratorContext#metricGroup is an
>     interface and
>     >>>      this FLIP
>     >>>      >>>>>>>> does
>     >>>      >>>>>> not
>     >>>      >>>>>>>> seem to change its semantics/behavior. The FLIP does
>     >>>      change the
>     >>>      >>>>>>>> implementation/behavior of
>     >>>      SourceCoordinatorContext#metricGroup,
>     >>>      >>>>>>>> which
>     >>>      >>>>>> is
>     >>>      >>>>>>>> marked @Internal.
>     >>>      >>>>>>>>
>     >>>      >>>>>>>> Thus it might seem a bit weird to add in the public
>     >> interface
>     >>>      >>>>>>>> section
>     >>>      >>>>>>>> saying that we change the interface
>     >>>      >>>>>>>> SplitEnumeratorContext#metricGroup
>     >>>      >>>>>> from
>     >>>      >>>>>>>> returning null to non-null object.
>     >>>      >>>>>>>>
>     >>>      >>>>>>>> 2. Mention the newly added interface of
>     >>>      >>>>>> "OperatorCoordinator#metricGroup"
>     >>>      >>>>>>>>> in the "Proposed Changes" section or "Public
>     Interface"
>     >>>      section. As
>     >>>      >>>>> the
>     >>>      >>>>>>>>> FLIP said, OperatorCoordinator is widely used
>     in many
>     >>>      connectors.
>     >>>      >>>>>> Though
>     >>>      >>>>>>>> it
>     >>>      >>>>>>>>> is still an @Internal API, I think it is worth
>     >>>      mentioning the
>     >>>      >>>>>>>>> change
>     >>>      >>>>> in
>     >>>      >>>>>>>> the
>     >>>      >>>>>>>>> FLIP.
>     >>>      >>>>>>>> Since OperatorCoordinator is an internal API, it
>     seems
>     >>>      reasonable to
>     >>>      >>>>>>>> explain it in the proposed change section. The
>     FLIP seems
>     >>>      to have
>     >>>      >>>>>>>> documented this in point 5 of the proposed
>     change section.
>     >>>      >>>>>>>>
>     >>>      >>>>>>>> BTW, if we think there are @internal classes
>     that are
>     >>>      important
>     >>>      >>>>>>>> enough
>     >>>      >>>>>> to
>     >>>      >>>>>>>> be added in the public interface section, it
>     might be
>     >>>      useful to
>     >>>      >>>>>> explicitly
>     >>>      >>>>>>>> discuss this topic and document it in the "*What
>     are the
>     >>>      "public
>     >>>      >>>>>>>> interfaces" of the project*" in this
>     >>>      >>>>>>>> <
>     >>>      >>>>>>>>
>     >>>      >>
>     >>>
>     >>
>     https://cwiki.apache.org/confluence/display/FLINK/Flink+Improvement+Proposals
>     >>>      >>>>>>>> wiki.
>     >>>      >>>>>>>>
>     >>>      >>>>>>>>
>     >>>      >>>>>>>>> Best,
>     >>>      >>>>>>>>> Jark
>     >>>      >>>>>>>>>
>     >>>      >>>>>>>>>
>     >>>      >>>>>>>>> On Mon, 26 Dec 2022 at 18:06, Hang Ruan
>     >>>      <ru...@gmail.com>
>     >>>      >>>>>> wrote:
>     >>>      >>>>>>>>>> Hi, thanks for the feedback, Zhu Zhu and
>     Qingsheng.
>     >>>      >>>>>>>>>> After combining everyone's comments, the main
>     concerns and
>     >>>      >>>>>>>> corresponding
>     >>>      >>>>>>>>>> adjustments are as follows.
>     >>>      >>>>>>>>>>
>     >>>      >>>>>>>>>> Q1: Common metrics are not quite useful.
>     >>>      >>>>>>>>>> numEventsIn and numEventsOut counters will be
>     removed
>     >>>      from the
>     >>>      >>>>>>>>>> OperatorCoordinatorMetricGroup. These common
>     metrics do
>     >> not
>     >>>      >>>>>>>>>> provide
>     >>>      >>>>>>>>> enough
>     >>>      >>>>>>>>>> information for users. The users are more
>     willing to
>     >>>      get the
>     >>>      >>>>>>>>>> number
>     >>>      >>>>> of
>     >>>      >>>>>>>>>> events of the specified type instead of the total
>     >>>      number. And this
>     >>>      >>>>>>>> metric
>     >>>      >>>>>>>>>> is calculated differently. The implementation
>     could
>     >>>      register the
>     >>>      >>>>>> metric
>     >>>      >>>>>>>>> by
>     >>>      >>>>>>>>>> themselves.
>     >>>      >>>>>>>>>>
>     >>>      >>>>>>>>>> Q2: This FLIP is overly complicated.
>     >>>      >>>>>>>>>> This FLIP will become concise after these
>     modifications.
>     >>>      >>>>>>>>>> OperatorCoordinatorMetricGroup has already been
>     >>>      introduced into
>     >>>      >>>>> Flink
>     >>>      >>>>>>>> by
>     >>>      >>>>>>>>>> FLIP-179<
>     >>>      >>>>>>>>>>
>     >>>      >>>>>>>>>>
>     >>>      >>
>     >>>
>     >>
>     https://cwiki.apache.org/confluence/display/FLINK/FLIP-179%3A+Expose+Standardized+Operator+Metrics
>     >>>      >>>>>>>>>>> .
>     >>>      >>>>>>>>>> And
>     >>>      >>>>>>>>>> this FLIP will not change it. This FLIP only
>     provides a
>     >>>      new metric
>     >>>      >>>>>>>> option
>     >>>      >>>>>>>>>> and a new metric group scope. The changes in
>     proposed
>     >>>      changes
>     >>>      >>>>> provide
>     >>>      >>>>>>>> the
>     >>>      >>>>>>>>>> details about the modifications for the internal
>     >>>      classes, which
>     >>>      >>>>> might
>     >>>      >>>>>>>>> make
>     >>>      >>>>>>>>>> it look complicated.
>     >>>      >>>>>>>>>>
>     >>>      >>>>>>>>>> Thanks for all the comments again. If there
>     are no further
>     >>>      >>>>>>>>>> comments,
>     >>>      >>>>>> we
>     >>>      >>>>>>>>>> plan to start the voting thread this week.
>     >>>      >>>>>>>>>>
>     >>>      >>>>>>>>>> Best,
>     >>>      >>>>>>>>>> Hang
>     >>>      >>>>>>>>>>
>     >>>      >>>>>>>>>> Qingsheng Ren <re...@gmail.com>
>     于2022年12月26日周一
>     >>>      16:48写道：
>     >>>      >>>>>>>>>>
>     >>>      >>>>>>>>>>
>     >>>      >>>>>>>>>>> Thanks for the FLIP, Hang!
>     >>>      >>>>>>>>>>>
>     >>>      >>>>>>>>>>> This FLIP overall looks good to me. Actually
>     I share
>     >>>      the same
>     >>>      >>>>> concern
>     >>>      >>>>>>>>>> with
>     >>>      >>>>>>>>>>> Zhu that numEventsIn and numEventsOut
>     counters are not
>     >>>      quite
>     >>>      >>>>>>>>>>> useful
>     >>>      >>>>>>>> to
>     >>>      >>>>>>>>>> end
>     >>>      >>>>>>>>>>> users. OperatorEvent is a quite low-level
>     abstraction,
>     >>>      which
>     >>>      >>>>> requires
>     >>>      >>>>>>>>>>> instantialization in order to be practical to
>     users and
>     >>>      >>>>>>>>>>> developers,
>     >>>      >>>>>>>> so
>     >>>      >>>>>>>>>>> maybe it's better to exclude them from the FLIP.
>     >>>      >>>>>>>>>>>
>     >>>      >>>>>>>>>>> Best,
>     >>>      >>>>>>>>>>> Qingsheng
>     >>>      >>>>>>>>>>> Ververica (Alibaba)
>     >>>      >>>>>>>>>>>
>     >>>      >>>>>>>>>>> On Mon, Dec 26, 2022 at 12:08 PM Zhu Zhu
>     >>>      <re...@gmail.com>
>     >>>      >>>>> wrote:
>     >>> >>>>>>>>>>>> Hi Hang,
>     >>> >>>>>>>>>>>> I still see no strong reason why we need
>     >>> >>>>>>>>>>>> numEventsIn/numEventsOut
>     >>>      >>>>>>>>>>> metrics.
>     >>> >>>>>>>>>>>> In the discussion in FLINK-29801, I can see the same
>     >>>      concern
>     >>> >>>>>>>>>>>> from
>     >>>      >>>>>>>>>> others.
>     >>> >>>>>>>>>>>> So I prefer to exclude them from this FLIP to avoid
>     >>> >>>>>>>>>>>> over-extending
>     >>>      >>>>>>>>> the
>     >>> >>>>>>>>>>>> scope.
>     >>> >>>>>>>>>>>>
>     >>> >>>>>>>>>>>> Thanks,
>     >>> >>>>>>>>>>>> Zhu
>     >>> >>>>>>>>>>>>
>     >>> >>>>>>>>>>>> Hang Ruan <ru...@gmail.com>
>     于2022年12月23日周五
>     >>> >>>>>>>>>>>> 15:21写道：
>     >>> >>>>>>>>>>>>> Hi everyone,
>     >>> >>>>>>>>>>>>>
>     >>> >>>>>>>>>>>>> As for the Zhu Zhu's problem, I think we should keep
>     >>>      the common
>     >>>      >>>>>>>>>>> metrics,
>     >>> >>>>>>>>>>>> which will help to observe incoming and outgoing
>     >>>      events. What do
>     >>>      >>>>>>>> you
>     >>>      >>>>>>>>>>> think,
>     >>> >>>>>>>>>>>> @Zhu Zhu ?
>     >>> >>>>>>>>>>>>> And @Chesnay, are there any other issues you are
>     >>>      more concerned
>     >>>      >>>>>>>>>> about?
>     >>> >>>>>>>>>>>> Looking forward to your reply.
>     >>> >>>>>>>>>>>>> Thanks for all the comments. If there are no further
>     >>> >>>>>>>>>>>>> comments, we
>     >>>      >>>>>>>>>> plan
>     >>> >>>>>>>>>>>> to start the voting thread next week.
>     >>> >>>>>>>>>>>>> Best,
>     >>> >>>>>>>>>>>>> Hang
>     >>> >>>>>>>>>>>>>
>     >>> >>>>>>>>>>>>> Hang Ruan <ru...@gmail.com>
>     于2022年12月15日周四
>     >>> >>>>>>>>>>>>> 16:49写道：
>     >>> >>>>>>>>>>>>>> Hi, Zhu Zhu,
>     >>> >>>>>>>>>>>>>>
>     >>> >>>>>>>>>>>>>> Thanks for your feedback!
>     >>> >>>>>>>>>>>>>>
>     >>> >>>>>>>>>>>>>> The OperatorCoordinator implementations are
>     >>>      different. And
>     >>> >>>>>>>>>>>>>> their
>     >>> >>>>>>>>>>>> metrics are much different too. We try to find the
>     >> common
>     >>> >>>>>>>>>>>> metrics
>     >>>      >>>>>>>> and
>     >>>      >>>>>>>>>> put
>     >>> >>>>>>>>>>>> them in the OperatorCoordinatorMetricGroup. If most
>     >>>      developers
>     >>>      >>>>>>>> think
>     >>>      >>>>>>>>> we
>     >>>      >>>>>>>>>>> do
>     >>> >>>>>>>>>>>> not need these common metrics, removing the common
>     >>>      metrics is
>     >>> >>>>>>>>>>>> also
>     >>> >>>>>>>>>>>> acceptable.
>     >>> >>>>>>>>>>>>>> Best,
>     >>> >>>>>>>>>>>>>> Hang
>     >>> >>>>>>>>>>>>>>
>     >>> >>>>>>>>>>>>>> Zhu Zhu <re...@gmail.com> 于2022年12月14日周三
>     >>>      22:09写道：
>     >>> >>>>>>>>>>>>>>> Hi Hang & MengYue,
>     >>> >>>>>>>>>>>>>>>
>     >>> >>>>>>>>>>>>>>> Thanks for creating this FLIP!
>     >>> >>>>>>>>>>>>>>>
>     >>> >>>>>>>>>>>>>>> I think it is very useful, mainly in two aspects:
>     >>> >>>>>>>>>>>>>>> 1. Enables OperatorCoordinators to register
>     metrics.
>     >>> >>>>>>>>>>>>>>> Currently
>     >>> >>>>>>>>>>>>>>> the coordinators has no way to do this. And
>     operator
>     >>>      >>>>>>>> coordinator
>     >>> >>>>>>>>>>>>>>> metric group further enables the SplitEnumerator
>     >>>      to have
>     >>> >>>>>>>>>>>>>>> access
>     >>> >>>>>>>>>>>>>>> to a registered metric group (via the existing
>     public
>     >>> >>>>>>>>>>>>>>> interface
>     >>> >>>>>>>>>>>>>>> SplitEnumeratorContext#metricGroup()), which is
>     >>>      null at the
>     >>>      >>>>>>>>> moment.
>     >>> >>>>>>>>>>>>>>> 2. Defines the scope of operator coordinator
>     >>>      metrics. A clear
>     >>> >>>>>>>>>>>> definition
>     >>> >>>>>>>>>>>>>>> makes it easy for users to find their wanted
>     >>>      metrics. The
>     >>>      >>>>>>>>>> definition
>     >>> >>>>>>>>>>>>>>> also helps to avoid conflicts of metrics from
>     >> multiple
>     >>> >>>>>>>>>>>> OperatorCoordinators
>     >>> >>>>>>>>>>>>>>> of the same kind. E.g. each SourceCoordinator may
>     >>>      have its
>     >>> >>>>>>>>>>>>>>> own
>     >>> >>>>>>>>>>>>>>> numSourceSplits metric, these metrics should not
>     >>>      be directly
>     >>>      >>>>>>>>>>> registered
>     >>> >>>>>>>>>>>>>>> to the job metric group.
>     >>> >>>>>>>>>>>>>>>
>     >>> >>>>>>>>>>>>>>> What I'm a bit concerned is the necessity of the
>     >>>      introduced
>     >>>      >>>>>>>>> common
>     >>> >>>>>>>>>>>> metrics
>     >>> >>>>>>>>>>>>>>> numEventsInCounter & numEventsOutCounter. If there
>     >>>      any case
>     >>>      >>>>>>>> which
>     >>> >>>>>>>>>>>> strongly
>     >>> >>>>>>>>>>>>>>> requires them?
>     >>> >>>>>>>>>>>>>>>
>     >>> >>>>>>>>>>>>>>> Regarding the concerns of Chesnay,
>     >>> >>>>>>>>>>>>>>>> A dedicated coordinator MG implementation is
>     >> overkill
>     >>> >>>>>>>>>>>>>>> Directly using the job metric group can result in
>     >>>      metric
>     >>>      >>>>>>>>> conflicts,
>     >>>      >>>>>>>>>>> as
>     >>> >>>>>>>>>>>> mentioned
>     >>> >>>>>>>>>>>>>>> in above #2.
>     >>> >>>>>>>>>>>>>>>
>     >>> >>>>>>>>>>>>>>> Thanks,
>     >>> >>>>>>>>>>>>>>> Zhu
>     >>> >>>>>>>>>>>>>>>
>     >>> >>>>>>>>>>>>>>> Dong Lin <li...@gmail.com>
>     于2022年12月10日周六
>     >>> >>>>>>>>>>>>>>> 14:16写道：
>     >>> >>>>>>>>>>>>>>>
>     >>> >>>>>>>>>>>>>>>> Hi Chesney,
>     >>> >>>>>>>>>>>>>>>>
>     >>> >>>>>>>>>>>>>>>> Just to double check with you,
>     >>> >>>>>>>>>>>>>>>> OperatorCoordinatorMetricGroup
>     >>> >>>>>>>>>>>> (annotated as
>     >>> >>>>>>>>>>>>>>>> @PublicEvolving) has already been introduced into
>     >>>      Flink by
>     >>>      >>>>>>>>>> FLIP-179
>     >>> >>>>>>>>>>>>>>>> <
>     >>>      >>
>     >>>
>     >>
>     https://cwiki.apache.org/confluence/display/FLINK/FLIP-179%3A+Expose+Standardized+Operator+Metrics
>     >>> >>>>>>>>>>>>> .
>     >>> >>>>>>>>>>>>>>>> And that FLIP has got you +1.. Do you mean we
>     >>>      should remove
>     >>>      >>>>>>>>> this
>     >>> >>>>>>>>>>>>>>>> OperatorCoordinatorMetricGroup?
>     >>> >>>>>>>>>>>>>>>>
>     >>> >>>>>>>>>>>>>>>> Regards,
>     >>> >>>>>>>>>>>>>>>> Dong
>     >>> >>>>>>>>>>>>>>>>
>     >>> >>>>>>>>>>>>>>>> On Sat, Dec 10, 2022 at 1:33 AM Chesnay
>     Schepler <
>     >>>      >>>>>>>>>>> chesnay@apache.org>
>     >>> >>>>>>>>>>>> wrote:
>     >>> >>>>>>>>>>>>>>>>> As a whole I feel like this FLIP is overly
>     >>>      complicated. A
>     >>>      >>>>>>>>>>> dedicated
>     >>> >>>>>>>>>>>>>>>>> coordinator MG implementation is overkill; it
>     >>>      could just
>     >>>      >>>>>>>>> re-use
>     >>>      >>>>>>>>>>> the
>     >>> >>>>>>>>>>>>>>>>> existing Task/OperatorMGs to create the same
>     >>>      structure we
>     >>>      >>>>>>>>> have
>     >>>      >>>>>>>>>> on
>     >>> >>>>>>>>>>>> TMs,
>     >>> >>>>>>>>>>>>>>>>> similar to what we did with the Job MG.
>     >>> >>>>>>>>>>>>>>>>>
>     >>> >>>>>>>>>>>>>>>>> However, I'm not convinced that this is required
>     >>>      anyway,
>     >>>      >>>>>>>>>> because
>     >>> >>>>>>>>>>>> all the
>     >>> >>>>>>>>>>>>>>>>> example metrics you listed can be implemented on
>     >>>      the TM
>     >>>      >>>>>>>> side
>     >>>      >>>>>>>>> +
>     >>> >>>>>>>>>>>>>>>>> aggregating them in the external metrics
>     backend.
>     >>> >>>>>>>>>>>>>>>>>
>     >>> >>>>>>>>>>>>>>>>> Since I'm on holidays soon, just so no one tries
>     >>>      to pull a
>     >>>      >>>>>>>>> fast
>     >>> >>>>>>>>>>>> one on
>     >>> >>>>>>>>>>>>>>>>> me, if this were to go to a vote as-is I'd be
>     >>>      against it.
>     >>> >>>>>>>>>>>>>>>>>
>     >>> >>>>>>>>>>>>>>>>>
>     >>> >>>>>>>>>>>>>>>>> On 09/12/2022 15:30, Dong Lin wrote:
>     >>> >>>>>>>>>>>>>>>>>> Hi Hang,
>     >>> >>>>>>>>>>>>>>>>>>
>     >>> >>>>>>>>>>>>>>>>>> Thanks for the FLIP! The FLIP looks good and it
>     >>>      is pretty
>     >>> >>>>>>>>>>>> informative.
>     >>> >>>>>>>>>>>>>>>>>> I have just two minor comments regarding names:
>     >>> >>>>>>>>>>>>>>>>>> - Would it be useful to rename the config
>     key as
>     >>> >>>>>>>>>>>>>>>>>> *metrics.scope.jm.job.operator-coordinator* for
>     >>>      >>>>>>>> consistency
>     >>>      >>>>>>>>>>> with
>     >>> >>>>>>>>>>>>>>>>>> *metrics.scope.jm.job
>     >>> >>>>>>>>>>>>>>>>>> *(which is not named as *jm-job)?
>     >>> >>>>>>>>>>>>>>>>>> - Maybe rename the variable as
>     >>>      >>>>>>>>>>> SCOPE_NAMING_OPERATOR_COORDINATOR
>     >>> >>>>>>>>>>>> for
>     >>> >>>>>>>>>>>>>>>>>> simplicity and consistency with
>     >>>      SCOPE_NAMING_OPERATOR
>     >>>      >>>>>>>>> (which
>     >>>      >>>>>>>>>> is
>     >>> >>>>>>>>>>>> not named
>     >>> >>>>>>>>>>>>>>>>>> as SCOPE_NAMING_TM_JOB_OPERATOR)?
>     >>> >>>>>>>>>>>>>>>>>>
>     >>> >>>>>>>>>>>>>>>>>> Cheers,
>     >>> >>>>>>>>>>>>>>>>>> Dong
>     >>> >>>>>>>>>>>>>>>>>>
>     >>> >>>>>>>>>>>>>>>>>>
>     >>> >>>>>>>>>>>>>>>>>>
>     >>> >>>>>>>>>>>>>>>>>> On Thu, Dec 8, 2022 at 3:28 PM Hang Ruan <
>     >>>      >>>>>>>>>>> ruanhang1993@gmail.com>
>     >>> >>>>>>>>>>>> wrote:
>     >>> >>>>>>>>>>>>>>>>>>> Hi all,
>     >>> >>>>>>>>>>>>>>>>>>>
>     >>> >>>>>>>>>>>>>>>>>>> MengYue and I created FLIP-274[1] Introduce
>     >>>      metric group
>     >>>      >>>>>>>>> for
>     >>> >>>>>>>>>>>>>>>>>>> OperatorCoordinator. OperatorCoordinator
>     is the
>     >>>      >>>>>>>>> coordinator
>     >>>      >>>>>>>>>>> for
>     >>> >>>>>>>>>>>> runtime
>     >>> >>>>>>>>>>>>>>>>>>> operators and running on Job Manager. The
>     >>>      coordination
>     >>> >>>>>>>>>>>> mechanism is
>     >>> >>>>>>>>>>>>>>>>>>> operator events between OperatorCoordinator
>     >>>      and its all
>     >>> >>>>>>>>>>>> operators, the
>     >>> >>>>>>>>>>>>>>>>>>> coordination is more and more using in
>     Flink, for
>     >>>      >>>>>>>> example
>     >>>      >>>>>>>>>> many
>     >>> >>>>>>>>>>>> Sources
>     >>> >>>>>>>>>>>>>>>>> and
>     >>> >>>>>>>>>>>>>>>>>>> Sinks depend on the mechanism to assign splits
>     >> and
>     >>>      >>>>>>>>>> coordinate
>     >>> >>>>>>>>>>>> commits to
>     >>> >>>>>>>>>>>>>>>>>>> external systems. The OperatorCoordinator is
>     >>>      widely
>     >>>      >>>>>>>> using
>     >>>      >>>>>>>>> in
>     >>> >>>>>>>>>>>> flink kafka
>     >>> >>>>>>>>>>>>>>>>>>> connector, flink pulsar connector, flink cdc
>     >>>      connector,
>     >>>      >>>>>>>>>> flink
>     >>> >>>>>>>>>>>> hudi
>     >>> >>>>>>>>>>>>>>>>>>> connector and so on.
>     >>> >>>>>>>>>>>>>>>>>>>
>     >>> >>>>>>>>>>>>>>>>>>> But there is not a suitable metric group scope
>     >>>      for the
>     >>> >>>>>>>>>>>>>>>>> OperatorCoordinator
>     >>> >>>>>>>>>>>>>>>>>>> and not an implementation for the interface
>     >>> >>>>>>>>>>>>>>>>> OperatorCoordinatorMetricGroup.
>     >>> >>>>>>>>>>>>>>>>>>> These metrics in OperatorCoordinator could be
>     >>>      how many
>     >>> >>>>>>>>>>>> splits/partitions
>     >>> >>>>>>>>>>>>>>>>>>> have been assigned to source readers, how many
>     >>>      files
>     >>>      >>>>>>>> have
>     >>>      >>>>>>>>>> been
>     >>> >>>>>>>>>>>> written
>     >>> >>>>>>>>>>>>>>>>> out
>     >>> >>>>>>>>>>>>>>>>>>> by sink writers, these metrics not only help
>     >>>      users to
>     >>>      >>>>>>>> know
>     >>>      >>>>>>>>>> the
>     >>> >>>>>>>>>>>> job
>     >>> >>>>>>>>>>>>>>>>> progress
>     >>> >>>>>>>>>>>>>>>>>>> but also make big job maintaining easier.
>     Thus we
>     >>>      >>>>>>>> propose
>     >>>      >>>>>>>>>> the
>     >>> >>>>>>>>>>>> FLIP-274
>     >>> >>>>>>>>>>>>>>>>> to
>     >>> >>>>>>>>>>>>>>>>>>> introduce a new metric group scope for
>     >>>      >>>>>>>> OperatorCoordinator
>     >>>      >>>>>>>>>> and
>     >>> >>>>>>>>>>>> provide
>     >>> >>>>>>>>>>>>>>>>> an
>     >>> >>>>>>>>>>>>>>>>>>> internal implementation for
>     >>>      >>>>>>>>> OperatorCoordinatorMetricGroup.
>     >>> >>>>>>>>>>>>>>>>>>> Could you help review this FLIP when you get
>     >>>      time? Any
>     >>>      >>>>>>>>>>> feedback
>     >>> >>>>>>>>>>>> is
>     >>> >>>>>>>>>>>>>>>>>>> appreciated!
>     >>> >>>>>>>>>>>>>>>>>>>
>     >>> >>>>>>>>>>>>>>>>>>> Best,
>     >>> >>>>>>>>>>>>>>>>>>> Hang
>     >>> >>>>>>>>>>>>>>>>>>>
>     >>> >>>>>>>>>>>>>>>>>>> [1]
>     >>> >>>>>>>>>>>>>>>>>>>
>     >>> >>>>>>>>>>>>>>>>>>>
>     >>>      >>
>     >>>
>     >>
>     https://cwiki.apache.org/confluence/display/FLINK/FLIP-274%3A+Introduce+metric+group+for+OperatorCoordinator
>     >>>      >>
>     >>>
>

Re: [DISCUSS] FLIP-274 : Introduce metric group for OperatorCoordinator

Posted by Jark Wu <im...@gmail.com>.

Hi Chesnay,

IIUC, our only disagreement is whether to support task info in the metric
format. But we all agree with the default metric format
"metrics.scope.jm-operator=<host>.jobmanager.<job_name>.<operator_name>"
which doesn't include task info, right?

Regarding "coordinator metrics don't *belong to* any tasks/vertexes", I
mean the coordinator doesn't run on specific tasks, so it doesn't have
runtime information of tasks. My concern about adding task info to the
coordinator metric is that
1) if we register coordinator metrics for every subtask, the number of
coordinator metrics will be exploded when task parallelism is large. This
is an overhead to JM & metric systems, and the duplicated metrics are
useless for users.
2) Correct me if I'm wrong, MetricGroup#counter only supports registering
one metric. It seems it can't register a metric reporting for multiple
metric names (e.g. multiple subtasks).

Best,
Jark


On Mon, 30 Jan 2023 at 16:11, Chesnay Schepler <ch...@apache.org> wrote:

> I hadn't looked in detail in to how the task info can be introduced, but
> given that the coordinates are created by the execution graph, where we
> only work on tasks, it should be possible and rather simple.
>
> The OperatorCoordinatorHolder gets access to the ExecutionJobVertex from
> which it can extract everything and create the task/operator groups.
> Exposure to the coordinator could happen via the coordinator context as
> originally planned.
>
>  > As we know, coordinator metrics are just like JMJobMetricGroup, which
> don't belong to any tasks/vertexes.
>
> I don't think this is true, every coordinator is scoped to a specific
> operator, and every operator is associated with a particular vertex.
>
>  From the javadocs:
>
> " A coordinator for runtime operators. The OperatorCoordinator runs on
> the master, associated with
>   the job vertex of the operator. It communicates with operators via
> sending operator events."
>
> On 28/01/2023 09:15, Jark Wu wrote:
> > Hi all,
> >
> > IIUC, Chesnay means we should have a more general metric group for
> > **operator**
> > not for **coordinator** in JM, this would be useful to extend
> > other operator-specific
> > metrics in the future. That means the new scope format should be designed
> > for
> > the operator,
> > e.g.,
> metrics.scope.jm-operator=<host>.jobmanager.<job_name>.<operator_name>
> > The coordinator metric is a subgroup (a constant "coordinator" suffix) of
> > the JMOperatorMG.
> >
> > I think this is a nice design. However, I have a question about adding
> task
> > id/name to this list.
> > How to get the task id/name when reporting a coordinator metric? As we
> > know, coordinator metrics
> > are just like JMJobMetricGroup, which don't belong to any tasks/vertexes.
> > Do you mean reporting
> > coordinator metrics for every task id under the operator?
> >
> > Best,
> > Jark
> >
> >
> > On Thu, 19 Jan 2023 at 17:33, Chesnay Schepler <ch...@apache.org>
> wrote:
> >
> >>   > First, I do not understand why users have to configure the new scope
> >> format, which has a default value.
> >>
> >> If you don't use scope formats, sure. If you do use scope formats, e.g.
> >> to add a common prefix (which is the case for datadog users for
> >> example), then the current default in the FLIP is insufficient and
> >> requires the user to update the configuration.
> >>
> >>   > I usually do not need to change these configuration of scope formats
> >> when submitting the flink job.
> >>
> >> Scope formats as a whole are quite a power-user feature, but that
> >> doesn't mean we should ignore it.
> >>
> >>   > I try to let it extend the ComponentMetricGroup
> >>
> >> This isn't inherently required just because it is a "component".
> >> Component metric groups should _only_ be used for cases where such a
> >> component provides several bits of metadata.
> >> For example, tasks provide vertex ids, task names, attempted IDs etc.,
> >> and we'd like users to have the option on how this metadata ends up
> >> being used (via scope formats). This currently can't be built with the
> >> addGroup() methods, hence why the component groups exist.
> >> However, the operator coordinator _itself_does not provide several bits
> >> of metadata. Logically, the _operator_ does, not the coordinator.
> >>
> >>   > This make me decide to add a new scope format.
> >>
> >> And this is fine; just don't add a new scope format for the coordinator
> >> but logically for operators on the JM side, such that we can extend the
> >> set of operator-specific metrics in the future without having to add yet
> >> another scope format.
> >>
> >>   > The relationship between the Operator and OperatorCoordinator is
> >> maintained in the OperatorCoordinatorHolder. In the POC, the operator
> >> name/id could be found and the OperatorCoordinatorMetricGroup will be
> >> created here.
> >>
> >> That's all irrelevant.
> >>
> >>   > The registered metrics of OperatorCoordinator always are those which
> >> can not be aggregated from the tasks, like the number of unassigned
> >> splits in the SourceCoordinator. Actually we have not encountered the
> >> scenario you mentioned. I think the OperatorCoordinatorMetricGroup
> >> should only contain the metrics for itself instead of its subtasks.
> >>
> >> You are misunderstanding the problem.
> >>
> >> This isn't about aggregating metrics on our side or exposing metrics
> >> from TMs on the JM side or anything like that.
> >>
> >> It's purely about the available metadata for operator metrics on the JM
> >> side. Currently you suggest operator name / id; what I'm proposing is to
> >> add task(==vertex!) id / name to this list.
> >>
> >> The benefit here is simple. I can look up all metrics where task_id==XXX
> >> and get _everything_ related to that vertex, including all metrics
> >> associated with operators that are part of that vertex, including the
> >> coordinator.
> >>
> >>   > Using metrics.scope.jm.job is not enough to distinguish the
> different
> >> coordinators.
> >>
> >> I did not suggest just using metrics.scope.jm.job. I suggested that as
> >> the default for the jm operator scope format, that is used for the
> >> proposed JobManagerOperatorMetricGroup, which will have some plain
> >> metric groups as children for coordinators.
> >> Aka, you have the JMOperatorMG, then you call addGroup("coordinators")
> >> and then addGroup(coordinator_name) for each coordinator.
> >>
> >>   > So there are two choice for the
> >> InternalOperatorCoordinatorMetricGroup: 1. add and improve the new scope
> >> format; 2. use the metrics.scope.jm.job and ProxyMetricGroup. Which one
> >> is better?
> >>
> >> There are more options than that.
> >> I feel like there are a lot of misunderstandings in this discussion.
> >> Please let me know if I could clear things up. If not I can also provide
> >> a PoC based on your PoC if that can speed things up. It's not that
> >> different anyway.
> >>
> >> On 19/01/2023 07:39, Hang Ruan wrote:
> >>> Hi, chesnay,
> >>>
> >>> Thanks for your reply. I still have some doubts about the questions
> >>> you raised.
> >>>
> >>>      > Extending the set of ScopeFormats is problematic because it in
> >>>      practice
> >>>      it breaks the config if users actively rely on it, since there's
> now
> >>>      another key that they _must_ set for it to be
> >>>      consistent/compatible with
> >>>      their existing setup.
> >>>      Unfortunately due to how powerful scope formats are we can't
> derive a
> >>>      default value that matches their existing setup.
> >>>      Hence we should try to do this as rarely as possible.
> >>>
> >>>
> >>>      > This FLIP does not adhere to that since it proposes a dedicated
> >>>      format
> >>>      for coordinators; next time we want to expose operator-specific
> >>>      metrics
> >>>      (e.g., in the scheduler) we'd have to add another one to support
> it.
> >>>
> >>>
> >>> First, I do not understand why users have to configure the new scope
> >>> format, which has a default value. I usually do not need to change
> >>> these configuration of scope formats when submitting the flink job.
> >>> IMO, the OperatorCoordinator is a component running at the JobMaster.
> >>> I try to let it extend the ComponentMetricGroup, but I cannot find
> >>> a suitable scope. This make me decide to add a new scope format.
> >>>
> >>>      > Additionally, the configurable variables (operator name/id) are
> >>>      logically not attached to the coordinator, but operators, so to
> me it
> >>>      just doesn't make sense to structure it like this.
> >>>
> >>>
> >>> The relationship between the Operator and OperatorCoordinator is
> >>> maintained in the OperatorCoordinatorHolder. In the POC, the operator
> >>> name/id could be found and the OperatorCoordinatorMetricGroup will be
> >>> created here.
> >>>
> >>>      > Another thing I'm concerned about is that, because we don't
> include
> >>>      tasks in the hierarchy, users wishing to collect all metrics for a
> >>>      particular task (in this case ==vertex) now have to go
> >>>      significantly out
> >>>      of their way to get them, since they can no longer just filter by
> the
> >>>      task ID but have to be filter for _all_ operators that are part of
> >>>      the task.
> >>>
> >>>
> >>> The registered metrics of OperatorCoordinator always are those which
> >>> can not be aggregated from the tasks, like the number of unassigned
> >>> splits in the SourceCoordinator. Actually we have not encountered the
> >>> scenario you mentioned. I think the OperatorCoordinatorMetricGroup
> >>> should only contain the metrics for itself instead of its subtasks.
> >>>
> >>> Using metrics.scope.jm.job is not enough to distinguish the different
> >>> coordinators. The operator id/name is necessary. Then
> >>> the implementation should be like this. But users can not change the
> >>> scope in this way. This is also acceptable.
> >>> @Internal public class InternalOperatorCoordinatorMetricGroup
> >>>           extends ProxyMetricGroup<MetricGroup>
> >>>           implements OperatorCoordinatorMetricGroup {
> >>>       public InternalOperatorCoordinatorMetricGroup(
> >>>               JobManagerJobMetricGroup parent, OperatorID operatorID,
> >> String operatorName) {
> >>>           super(parent.addGroup(operatorID +
> >> operatorName).addGroup("coordinator")); }
> >>> }
> >>> So there are two choice for the
> >>> InternalOperatorCoordinatorMetricGroup: 1. add and improve the new
> >>> scope format; 2. use the metrics.scope.jm.job and ProxyMetricGroup.
> >>> Which one is better?
> >>>
> >>> Best,
> >>> Hang
> >>>
> >>>
> >>> Chesnay Schepler <ch...@apache.org> 于2023年1月18日周三 17:03写道：
> >>>
> >>>      You're misunderstanding the problem.
> >>>
> >>>      Metric groups form a tree with each group providing certain
> metadata.
> >>>
> >>>      E.g., on the taskmanager we have a TM metric group that provides
> info
> >>>      about the TM, that has child task metric groups, that have child
> >>>      operator metric groups etc. The operator metric group again has
> >>>      children, where we are now mostly in the user/connector land,
> like a
> >>>      kafka metric group providing partition info or something.
> >>>
> >>>      What is being proposed is to create a coordinator metric group on
> >>>      the JM
> >>>      side that provides operator metadata.
> >>>
> >>>      A more appropriate structure is to create an operator group on
> the JM
> >>>      side that provides this info, with the coordinator metric group
> >>>      being a
> >>>      child.
> >>>
> >>>      On 17/01/2023 19:56, Steven Wu wrote:
> >>>      >> Additionally, the configurable variables (operator name/id) are
> >>>      > logically not attached to the coordinator, but operators, so to
> >>>      me it
> >>>      > just doesn't make sense to structure it like this.
> >>>      >
> >>>      > Chesnay, maybe we should clarify the terminology. To me,
> >>>      pperators (like
> >>>      > FLIP-27 source) can have two parts (coordinator and
> >>>      reader/subtask). I
> >>>      > think it is fine to include operator name/id for coordinator
> >>>      metrics.
> >>>      >
> >>>      > On Mon, Jan 16, 2023 at 2:13 AM Chesnay Schepler
> >>>      <ch...@apache.org> wrote:
> >>>      >
> >>>      >> Slight correction: Using metrics.scope.jm.job as the default
> >>>      should be
> >>>      >> safe.
> >>>      >>
> >>>      >> On 16/01/2023 10:18, Chesnay Schepler wrote:
> >>>      >>> The proposed ScopeFormat is still problematic for a few
> reasons.
> >>>      >>>
> >>>      >>> Extending the set of ScopeFormats is problematic because it in
> >>>      >>> practice it breaks the config if users actively rely on it,
> since
> >>>      >>> there's now another key that they _must_ set for it to be
> >>>      >>> consistent/compatible with their existing setup.
> >>>      >>> Unfortunately due to how powerful scope formats are we can't
> >>>      derive a
> >>>      >>> default value that matches their existing setup.
> >>>      >>> Hence we should try to do this as rarely as possible.
> >>>      >>>
> >>>      >>> This FLIP does not adhere to that since it proposes a
> >>>      dedicated format
> >>>      >>> for coordinators; next time we want to expose
> operator-specific
> >>>      >>> metrics (e.g., in the scheduler) we'd have to add another one
> to
> >>>      >>> support it.
> >>>      >>>
> >>>      >>> Additionally, the configurable variables (operator name/id)
> are
> >>>      >>> logically not attached to the coordinator, but operators, so
> >>>      to me it
> >>>      >>> just doesn't make sense to structure it like this.
> >>>      >>>
> >>>      >>> Another thing I'm concerned about is that, because we don't
> >>>      include
> >>>      >>> tasks in the hierarchy, users wishing to collect all metrics
> for
> >> a
> >>>      >>> particular task (in this case ==vertex) now have to go
> >>>      significantly
> >>>      >>> out of their way to get them, since they can no longer just
> >>>      filter by
> >>>      >>> the task ID but have to be filter for _all_ operators that are
> >>>      part of
> >>>      >>> the task.
> >>>      >>>
> >>>      >>> On 16/01/2023 03:09, Hang Ruan wrote:
> >>>      >>>> Hi, @chesnay@apache.org <ch...@apache.org> ,
> >>>      >>>>
> >>>      >>>> Do you have time to help to review this FLIP again after the
> >>>      >>>> modification?
> >>>      >>>> Looking forward to your reply.
> >>>      >>>> This FLIP will add a new configuration for the
> >>>      >>>> OperatorCoordinatorMetricGroup scope format. It provides an
> >>>      internal
> >>>      >>>> implementation and is added as a component to the
> >>>      >>>> JobManagerJobMetricGroup.
> >>>      >>>> If something doesn't make sense, could you provide some
> >>>      advice? It
> >>>      >>>> will be
> >>>      >>>> very helpful. Thanks a lot for your help.
> >>>      >>>>
> >>>      >>>> Best,
> >>>      >>>> Hang
> >>>      >>>>
> >>>      >>>> Martijn Visser <ma...@apache.org> 于2023年1月11日周三
> >>>      16:34写道：
> >>>      >>>>
> >>>      >>>>
> >>>      >>>>> Hi Hang,
> >>>      >>>>>
> >>>      >>>>> I'm a bit surprised that this has gone to a vote, given that
> >>>      Chesnay
> >>>      >>>>> deliberately mentioned that he would vote against it as-is.
> >>>      I would
> >>>      >>>>> expect
> >>>      >>>>> that before going to a vote, he has had the opportunity to
> >>>      >>>>> participate in
> >>>      >>>>> this discussion.
> >>>      >>>>>
> >>>      >>>>> Best regards,
> >>>      >>>>>
> >>>      >>>>> Martijn
> >>>      >>>>>
> >>>      >>>>> On Tue, Jan 3, 2023 at 12:53 PM Jark Wu <im...@gmail.com>
> >>>      wrote:
> >>>      >>>>>
> >>>      >>>>>> Hi Dong,
> >>>      >>>>>>
> >>>      >>>>>> Regarding “SplitEnumeratorContext#metricGroup”, my only
> >>>      concern is
> >>>      >>>>>> that
> >>>      >>>>>> this is a core interface for the FLIP. It’s hard to tell
> how
> >>>      >>>>>> sources use
> >>>      >>>>>> metric group without mentioning this interface. Even if
> >>>      this is an
> >>>      >>>>> existing
> >>>      >>>>>> API, I think it’s worth introducing the interface again and
> >>>      declaring
> >>>      >>>>> that
> >>>      >>>>>> we will implement the interface instead of a no-op method
> >>>      in this
> >>>      >>>>>> FLIP.
> >>>      >>>>>>
> >>>      >>>>>> Anyway, this is a minor problem and shouldn’t block this
> >>>      FLIP. I’m
> >>>      >>>>>> +1 to
> >>>      >>>>>> start a vote.
> >>>      >>>>>>
> >>>      >>>>>> Best,
> >>>      >>>>>> Jark
> >>>      >>>>>>
> >>>      >>>>>>
> >>>      >>>>>>> 2023年1月3日 10:03，Hang Ruan <ru...@gmail.com> 写道：
> >>>      >>>>>>>
> >>>      >>>>>>> Hi, Jark and Dong,
> >>>      >>>>>>>
> >>>      >>>>>>> Thanks for your comments. Sorry for my late reply.
> >>>      >>>>>>>
> >>>      >>>>>>> For suggestion 1, I plan to implement the
> >>>      >>>>>>> SplitEnumeratorMetricGroup in
> >>>      >>>>>>> another issue, and it is not contained in this FLIP. I
> >>>      will add some
> >>>      >>>>>>> description about this part.
> >>>      >>>>>>> For suggestion 2, changes about
> >>>      OperatorCoordinator#metricGroup has
> >>>      >>>>>> already
> >>>      >>>>>>> been documented in the proposed change section.
> >>>      >>>>>>>
> >>>      >>>>>>> Best,
> >>>      >>>>>>> Hang
> >>>      >>>>>>>
> >>>      >>>>>>> Dong Lin <li...@gmail.com> 于2023年1月1日周日 09:45写道：
> >>>      >>>>>>>
> >>>      >>>>>>>> Let me chime-in and add comments regarding the public
> >>>      interface
> >>>      >>>>> section.
> >>>      >>>>>>>> Please see my comments inline.
> >>>      >>>>>>>>
> >>>      >>>>>>>> On Thu, Dec 29, 2022 at 6:08 PM Jark Wu
> >>>      <im...@gmail.com> wrote:
> >>>      >>>>>>>>
> >>>      >>>>>>>>> Hi Hang,
> >>>      >>>>>>>>>
> >>>      >>>>>>>>> Thanks for driving this discussion. I think this is a
> >>>      very useful
> >>>      >>>>>> feature
> >>>      >>>>>>>>> for connectors.
> >>>      >>>>>>>>>
> >>>      >>>>>>>>> The FLIP looks quite good to me, and I just have two
> >>>      suggestions.
> >>>      >>>>>>>>>
> >>>      >>>>>>>>> 1. In the "Public Interface" section, mention that the
> >>>      >>>>>>>>> implementation
> >>>      >>>>>>>>> behavior of "SplitEnumeratorContext#metricGroup" is
> >>>      changed from
> >>>      >>>>>>>> returning
> >>>      >>>>>>>>> null to returning a concrete SplitEnumeratorMetricGroup
> >>>      instance.
> >>>      >>>>> Even
> >>>      >>>>>>>>> though the API is already there, the behavior change can
> >>>      also be
> >>>      >>>>>>>> considered
> >>>      >>>>>>>>> a public change.
> >>>      >>>>>>>>>
> >>>      >>>>>>>> SplitEnumeratorContext#metricGroup is an interface and
> >>>      this FLIP
> >>>      >>>>>>>> does
> >>>      >>>>>> not
> >>>      >>>>>>>> seem to change its semantics/behavior. The FLIP does
> >>>      change the
> >>>      >>>>>>>> implementation/behavior of
> >>>      SourceCoordinatorContext#metricGroup,
> >>>      >>>>>>>> which
> >>>      >>>>>> is
> >>>      >>>>>>>> marked @Internal.
> >>>      >>>>>>>>
> >>>      >>>>>>>> Thus it might seem a bit weird to add in the public
> >> interface
> >>>      >>>>>>>> section
> >>>      >>>>>>>> saying that we change the interface
> >>>      >>>>>>>> SplitEnumeratorContext#metricGroup
> >>>      >>>>>> from
> >>>      >>>>>>>> returning null to non-null object.
> >>>      >>>>>>>>
> >>>      >>>>>>>> 2. Mention the newly added interface of
> >>>      >>>>>> "OperatorCoordinator#metricGroup"
> >>>      >>>>>>>>> in the "Proposed Changes" section or "Public Interface"
> >>>      section. As
> >>>      >>>>> the
> >>>      >>>>>>>>> FLIP said, OperatorCoordinator is widely used in many
> >>>      connectors.
> >>>      >>>>>> Though
> >>>      >>>>>>>> it
> >>>      >>>>>>>>> is still an @Internal API, I think it is worth
> >>>      mentioning the
> >>>      >>>>>>>>> change
> >>>      >>>>> in
> >>>      >>>>>>>> the
> >>>      >>>>>>>>> FLIP.
> >>>      >>>>>>>> Since OperatorCoordinator is an internal API, it seems
> >>>      reasonable to
> >>>      >>>>>>>> explain it in the proposed change section. The FLIP seems
> >>>      to have
> >>>      >>>>>>>> documented this in point 5 of the proposed change
> section.
> >>>      >>>>>>>>
> >>>      >>>>>>>> BTW, if we think there are @internal classes that are
> >>>      important
> >>>      >>>>>>>> enough
> >>>      >>>>>> to
> >>>      >>>>>>>> be added in the public interface section, it might be
> >>>      useful to
> >>>      >>>>>> explicitly
> >>>      >>>>>>>> discuss this topic and document it in the "*What are the
> >>>      "public
> >>>      >>>>>>>> interfaces" of the project*" in this
> >>>      >>>>>>>> <
> >>>      >>>>>>>>
> >>>      >>
> >>>
> >>
> https://cwiki.apache.org/confluence/display/FLINK/Flink+Improvement+Proposals
> >>>      >>>>>>>> wiki.
> >>>      >>>>>>>>
> >>>      >>>>>>>>
> >>>      >>>>>>>>> Best,
> >>>      >>>>>>>>> Jark
> >>>      >>>>>>>>>
> >>>      >>>>>>>>>
> >>>      >>>>>>>>> On Mon, 26 Dec 2022 at 18:06, Hang Ruan
> >>>      <ru...@gmail.com>
> >>>      >>>>>> wrote:
> >>>      >>>>>>>>>> Hi, thanks for the feedback, Zhu Zhu and Qingsheng.
> >>>      >>>>>>>>>> After combining everyone's comments, the main concerns
> and
> >>>      >>>>>>>> corresponding
> >>>      >>>>>>>>>> adjustments are as follows.
> >>>      >>>>>>>>>>
> >>>      >>>>>>>>>> Q1: Common metrics are not quite useful.
> >>>      >>>>>>>>>> numEventsIn and numEventsOut counters will be removed
> >>>      from the
> >>>      >>>>>>>>>> OperatorCoordinatorMetricGroup. These common metrics do
> >> not
> >>>      >>>>>>>>>> provide
> >>>      >>>>>>>>> enough
> >>>      >>>>>>>>>> information for users. The users are more willing to
> >>>      get the
> >>>      >>>>>>>>>> number
> >>>      >>>>> of
> >>>      >>>>>>>>>> events of the specified type instead of the total
> >>>      number. And this
> >>>      >>>>>>>> metric
> >>>      >>>>>>>>>> is calculated differently. The implementation could
> >>>      register the
> >>>      >>>>>> metric
> >>>      >>>>>>>>> by
> >>>      >>>>>>>>>> themselves.
> >>>      >>>>>>>>>>
> >>>      >>>>>>>>>> Q2: This FLIP is overly complicated.
> >>>      >>>>>>>>>> This FLIP will become concise after these
> modifications.
> >>>      >>>>>>>>>> OperatorCoordinatorMetricGroup has already been
> >>>      introduced into
> >>>      >>>>> Flink
> >>>      >>>>>>>> by
> >>>      >>>>>>>>>> FLIP-179<
> >>>      >>>>>>>>>>
> >>>      >>>>>>>>>>
> >>>      >>
> >>>
> >>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-179%3A+Expose+Standardized+Operator+Metrics
> >>>      >>>>>>>>>>> .
> >>>      >>>>>>>>>> And
> >>>      >>>>>>>>>> this FLIP will not change it. This FLIP only provides a
> >>>      new metric
> >>>      >>>>>>>> option
> >>>      >>>>>>>>>> and a new metric group scope. The changes in proposed
> >>>      changes
> >>>      >>>>> provide
> >>>      >>>>>>>> the
> >>>      >>>>>>>>>> details about the modifications for the internal
> >>>      classes, which
> >>>      >>>>> might
> >>>      >>>>>>>>> make
> >>>      >>>>>>>>>> it look complicated.
> >>>      >>>>>>>>>>
> >>>      >>>>>>>>>> Thanks for all the comments again. If there are no
> further
> >>>      >>>>>>>>>> comments,
> >>>      >>>>>> we
> >>>      >>>>>>>>>> plan to start the voting thread this week.
> >>>      >>>>>>>>>>
> >>>      >>>>>>>>>> Best,
> >>>      >>>>>>>>>> Hang
> >>>      >>>>>>>>>>
> >>>      >>>>>>>>>> Qingsheng Ren <re...@gmail.com> 于2022年12月26日周一
> >>>      16:48写道：
> >>>      >>>>>>>>>>
> >>>      >>>>>>>>>>
> >>>      >>>>>>>>>>> Thanks for the FLIP, Hang!
> >>>      >>>>>>>>>>>
> >>>      >>>>>>>>>>> This FLIP overall looks good to me. Actually I share
> >>>      the same
> >>>      >>>>> concern
> >>>      >>>>>>>>>> with
> >>>      >>>>>>>>>>> Zhu that numEventsIn and numEventsOut counters are not
> >>>      quite
> >>>      >>>>>>>>>>> useful
> >>>      >>>>>>>> to
> >>>      >>>>>>>>>> end
> >>>      >>>>>>>>>>> users. OperatorEvent is a quite low-level abstraction,
> >>>      which
> >>>      >>>>> requires
> >>>      >>>>>>>>>>> instantialization in order to be practical to users
> and
> >>>      >>>>>>>>>>> developers,
> >>>      >>>>>>>> so
> >>>      >>>>>>>>>>> maybe it's better to exclude them from the FLIP.
> >>>      >>>>>>>>>>>
> >>>      >>>>>>>>>>> Best,
> >>>      >>>>>>>>>>> Qingsheng
> >>>      >>>>>>>>>>> Ververica (Alibaba)
> >>>      >>>>>>>>>>>
> >>>      >>>>>>>>>>> On Mon, Dec 26, 2022 at 12:08 PM Zhu Zhu
> >>>      <re...@gmail.com>
> >>>      >>>>> wrote:
> >>>      >>>>>>>>>>>> Hi Hang,
> >>>      >>>>>>>>>>>> I still see no strong reason why we need
> >>>      >>>>>>>>>>>> numEventsIn/numEventsOut
> >>>      >>>>>>>>>>> metrics.
> >>>      >>>>>>>>>>>> In the discussion in FLINK-29801, I can see the same
> >>>      concern
> >>>      >>>>>>>>>>>> from
> >>>      >>>>>>>>>> others.
> >>>      >>>>>>>>>>>> So I prefer to exclude them from this FLIP to avoid
> >>>      >>>>>>>>>>>> over-extending
> >>>      >>>>>>>>> the
> >>>      >>>>>>>>>>>> scope.
> >>>      >>>>>>>>>>>>
> >>>      >>>>>>>>>>>> Thanks,
> >>>      >>>>>>>>>>>> Zhu
> >>>      >>>>>>>>>>>>
> >>>      >>>>>>>>>>>> Hang Ruan <ru...@gmail.com> 于2022年12月23日周五
> >>>      >>>>>>>>>>>> 15:21写道：
> >>>      >>>>>>>>>>>>> Hi everyone,
> >>>      >>>>>>>>>>>>>
> >>>      >>>>>>>>>>>>> As for the Zhu Zhu's problem, I think we should keep
> >>>      the common
> >>>      >>>>>>>>>>> metrics,
> >>>      >>>>>>>>>>>> which will help to observe incoming and outgoing
> >>>      events. What do
> >>>      >>>>>>>> you
> >>>      >>>>>>>>>>> think,
> >>>      >>>>>>>>>>>> @Zhu Zhu ?
> >>>      >>>>>>>>>>>>> And @Chesnay, are there any other issues you are
> >>>      more concerned
> >>>      >>>>>>>>>> about?
> >>>      >>>>>>>>>>>> Looking forward to your reply.
> >>>      >>>>>>>>>>>>> Thanks for all the comments. If there are no further
> >>>      >>>>>>>>>>>>> comments, we
> >>>      >>>>>>>>>> plan
> >>>      >>>>>>>>>>>> to start the voting thread next week.
> >>>      >>>>>>>>>>>>> Best,
> >>>      >>>>>>>>>>>>> Hang
> >>>      >>>>>>>>>>>>>
> >>>      >>>>>>>>>>>>> Hang Ruan <ru...@gmail.com> 于2022年12月15日周四
> >>>      >>>>>>>>>>>>> 16:49写道：
> >>>      >>>>>>>>>>>>>> Hi, Zhu Zhu,
> >>>      >>>>>>>>>>>>>>
> >>>      >>>>>>>>>>>>>> Thanks for your feedback!
> >>>      >>>>>>>>>>>>>>
> >>>      >>>>>>>>>>>>>> The OperatorCoordinator implementations are
> >>>      different. And
> >>>      >>>>>>>>>>>>>> their
> >>>      >>>>>>>>>>>> metrics are much different too. We try to find the
> >> common
> >>>      >>>>>>>>>>>> metrics
> >>>      >>>>>>>> and
> >>>      >>>>>>>>>> put
> >>>      >>>>>>>>>>>> them in the OperatorCoordinatorMetricGroup. If most
> >>>      developers
> >>>      >>>>>>>> think
> >>>      >>>>>>>>> we
> >>>      >>>>>>>>>>> do
> >>>      >>>>>>>>>>>> not need these common metrics, removing the common
> >>>      metrics is
> >>>      >>>>>>>>>>>> also
> >>>      >>>>>>>>>>>> acceptable.
> >>>      >>>>>>>>>>>>>> Best,
> >>>      >>>>>>>>>>>>>> Hang
> >>>      >>>>>>>>>>>>>>
> >>>      >>>>>>>>>>>>>> Zhu Zhu <re...@gmail.com> 于2022年12月14日周三
> >>>      22:09写道：
> >>>      >>>>>>>>>>>>>>> Hi Hang & MengYue,
> >>>      >>>>>>>>>>>>>>>
> >>>      >>>>>>>>>>>>>>> Thanks for creating this FLIP!
> >>>      >>>>>>>>>>>>>>>
> >>>      >>>>>>>>>>>>>>> I think it is very useful, mainly in two aspects:
> >>>      >>>>>>>>>>>>>>> 1. Enables OperatorCoordinators to register
> metrics.
> >>>      >>>>>>>>>>>>>>> Currently
> >>>      >>>>>>>>>>>>>>> the coordinators has no way to do this. And
> operator
> >>>      >>>>>>>> coordinator
> >>>      >>>>>>>>>>>>>>> metric group further enables the SplitEnumerator
> >>>      to have
> >>>      >>>>>>>>>>>>>>> access
> >>>      >>>>>>>>>>>>>>> to a registered metric group (via the existing
> public
> >>>      >>>>>>>>>>>>>>> interface
> >>>      >>>>>>>>>>>>>>> SplitEnumeratorContext#metricGroup()), which is
> >>>      null at the
> >>>      >>>>>>>>> moment.
> >>>      >>>>>>>>>>>>>>> 2. Defines the scope of operator coordinator
> >>>      metrics. A clear
> >>>      >>>>>>>>>>>> definition
> >>>      >>>>>>>>>>>>>>> makes it easy for users to find their wanted
> >>>      metrics. The
> >>>      >>>>>>>>>> definition
> >>>      >>>>>>>>>>>>>>> also helps to avoid conflicts of metrics from
> >> multiple
> >>>      >>>>>>>>>>>> OperatorCoordinators
> >>>      >>>>>>>>>>>>>>> of the same kind. E.g. each SourceCoordinator may
> >>>      have its
> >>>      >>>>>>>>>>>>>>> own
> >>>      >>>>>>>>>>>>>>> numSourceSplits metric, these metrics should not
> >>>      be directly
> >>>      >>>>>>>>>>> registered
> >>>      >>>>>>>>>>>>>>> to the job metric group.
> >>>      >>>>>>>>>>>>>>>
> >>>      >>>>>>>>>>>>>>> What I'm a bit concerned is the necessity of the
> >>>      introduced
> >>>      >>>>>>>>> common
> >>>      >>>>>>>>>>>> metrics
> >>>      >>>>>>>>>>>>>>> numEventsInCounter & numEventsOutCounter. If there
> >>>      any case
> >>>      >>>>>>>> which
> >>>      >>>>>>>>>>>> strongly
> >>>      >>>>>>>>>>>>>>> requires them?
> >>>      >>>>>>>>>>>>>>>
> >>>      >>>>>>>>>>>>>>> Regarding the concerns of Chesnay,
> >>>      >>>>>>>>>>>>>>>> A dedicated coordinator MG implementation is
> >> overkill
> >>>      >>>>>>>>>>>>>>> Directly using the job metric group can result in
> >>>      metric
> >>>      >>>>>>>>> conflicts,
> >>>      >>>>>>>>>>> as
> >>>      >>>>>>>>>>>> mentioned
> >>>      >>>>>>>>>>>>>>> in above #2.
> >>>      >>>>>>>>>>>>>>>
> >>>      >>>>>>>>>>>>>>> Thanks,
> >>>      >>>>>>>>>>>>>>> Zhu
> >>>      >>>>>>>>>>>>>>>
> >>>      >>>>>>>>>>>>>>> Dong Lin <li...@gmail.com> 于2022年12月10日周六
> >>>      >>>>>>>>>>>>>>> 14:16写道：
> >>>      >>>>>>>>>>>>>>>
> >>>      >>>>>>>>>>>>>>>> Hi Chesney,
> >>>      >>>>>>>>>>>>>>>>
> >>>      >>>>>>>>>>>>>>>> Just to double check with you,
> >>>      >>>>>>>>>>>>>>>> OperatorCoordinatorMetricGroup
> >>>      >>>>>>>>>>>> (annotated as
> >>>      >>>>>>>>>>>>>>>> @PublicEvolving) has already been introduced into
> >>>      Flink by
> >>>      >>>>>>>>>> FLIP-179
> >>>      >>>>>>>>>>>>>>>> <
> >>>      >>
> >>>
> >>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-179%3A+Expose+Standardized+Operator+Metrics
> >>>      >>>>>>>>>>>>> .
> >>>      >>>>>>>>>>>>>>>> And that FLIP has got you +1.. Do you mean we
> >>>      should remove
> >>>      >>>>>>>>> this
> >>>      >>>>>>>>>>>>>>>> OperatorCoordinatorMetricGroup?
> >>>      >>>>>>>>>>>>>>>>
> >>>      >>>>>>>>>>>>>>>> Regards,
> >>>      >>>>>>>>>>>>>>>> Dong
> >>>      >>>>>>>>>>>>>>>>
> >>>      >>>>>>>>>>>>>>>> On Sat, Dec 10, 2022 at 1:33 AM Chesnay Schepler
> <
> >>>      >>>>>>>>>>> chesnay@apache.org>
> >>>      >>>>>>>>>>>> wrote:
> >>>      >>>>>>>>>>>>>>>>> As a whole I feel like this FLIP is overly
> >>>      complicated. A
> >>>      >>>>>>>>>>> dedicated
> >>>      >>>>>>>>>>>>>>>>> coordinator MG implementation is overkill; it
> >>>      could just
> >>>      >>>>>>>>> re-use
> >>>      >>>>>>>>>>> the
> >>>      >>>>>>>>>>>>>>>>> existing Task/OperatorMGs to create the same
> >>>      structure we
> >>>      >>>>>>>>> have
> >>>      >>>>>>>>>> on
> >>>      >>>>>>>>>>>> TMs,
> >>>      >>>>>>>>>>>>>>>>> similar to what we did with the Job MG.
> >>>      >>>>>>>>>>>>>>>>>
> >>>      >>>>>>>>>>>>>>>>> However, I'm not convinced that this is required
> >>>      anyway,
> >>>      >>>>>>>>>> because
> >>>      >>>>>>>>>>>> all the
> >>>      >>>>>>>>>>>>>>>>> example metrics you listed can be implemented on
> >>>      the TM
> >>>      >>>>>>>> side
> >>>      >>>>>>>>> +
> >>>      >>>>>>>>>>>>>>>>> aggregating them in the external metrics
> backend.
> >>>      >>>>>>>>>>>>>>>>>
> >>>      >>>>>>>>>>>>>>>>> Since I'm on holidays soon, just so no one tries
> >>>      to pull a
> >>>      >>>>>>>>> fast
> >>>      >>>>>>>>>>>> one on
> >>>      >>>>>>>>>>>>>>>>> me, if this were to go to a vote as-is I'd be
> >>>      against it.
> >>>      >>>>>>>>>>>>>>>>>
> >>>      >>>>>>>>>>>>>>>>>
> >>>      >>>>>>>>>>>>>>>>> On 09/12/2022 15:30, Dong Lin wrote:
> >>>      >>>>>>>>>>>>>>>>>> Hi Hang,
> >>>      >>>>>>>>>>>>>>>>>>
> >>>      >>>>>>>>>>>>>>>>>> Thanks for the FLIP! The FLIP looks good and it
> >>>      is pretty
> >>>      >>>>>>>>>>>> informative.
> >>>      >>>>>>>>>>>>>>>>>> I have just two minor comments regarding names:
> >>>      >>>>>>>>>>>>>>>>>> - Would it be useful to rename the config key
> as
> >>>      >>>>>>>>>>>>>>>>>> *metrics.scope.jm.job.operator-coordinator* for
> >>>      >>>>>>>> consistency
> >>>      >>>>>>>>>>> with
> >>>      >>>>>>>>>>>>>>>>>> *metrics.scope.jm.job
> >>>      >>>>>>>>>>>>>>>>>> *(which is not named as *jm-job)?
> >>>      >>>>>>>>>>>>>>>>>> - Maybe rename the variable as
> >>>      >>>>>>>>>>> SCOPE_NAMING_OPERATOR_COORDINATOR
> >>>      >>>>>>>>>>>> for
> >>>      >>>>>>>>>>>>>>>>>> simplicity and consistency with
> >>>      SCOPE_NAMING_OPERATOR
> >>>      >>>>>>>>> (which
> >>>      >>>>>>>>>> is
> >>>      >>>>>>>>>>>> not named
> >>>      >>>>>>>>>>>>>>>>>> as SCOPE_NAMING_TM_JOB_OPERATOR)?
> >>>      >>>>>>>>>>>>>>>>>>
> >>>      >>>>>>>>>>>>>>>>>> Cheers,
> >>>      >>>>>>>>>>>>>>>>>> Dong
> >>>      >>>>>>>>>>>>>>>>>>
> >>>      >>>>>>>>>>>>>>>>>>
> >>>      >>>>>>>>>>>>>>>>>>
> >>>      >>>>>>>>>>>>>>>>>> On Thu, Dec 8, 2022 at 3:28 PM Hang Ruan <
> >>>      >>>>>>>>>>> ruanhang1993@gmail.com>
> >>>      >>>>>>>>>>>> wrote:
> >>>      >>>>>>>>>>>>>>>>>>> Hi all,
> >>>      >>>>>>>>>>>>>>>>>>>
> >>>      >>>>>>>>>>>>>>>>>>> MengYue and I created FLIP-274[1] Introduce
> >>>      metric group
> >>>      >>>>>>>>> for
> >>>      >>>>>>>>>>>>>>>>>>> OperatorCoordinator. OperatorCoordinator is
> the
> >>>      >>>>>>>>> coordinator
> >>>      >>>>>>>>>>> for
> >>>      >>>>>>>>>>>> runtime
> >>>      >>>>>>>>>>>>>>>>>>> operators and running on Job Manager. The
> >>>      coordination
> >>>      >>>>>>>>>>>> mechanism is
> >>>      >>>>>>>>>>>>>>>>>>> operator events between OperatorCoordinator
> >>>      and its all
> >>>      >>>>>>>>>>>> operators, the
> >>>      >>>>>>>>>>>>>>>>>>> coordination is more and more using in Flink,
> for
> >>>      >>>>>>>> example
> >>>      >>>>>>>>>> many
> >>>      >>>>>>>>>>>> Sources
> >>>      >>>>>>>>>>>>>>>>> and
> >>>      >>>>>>>>>>>>>>>>>>> Sinks depend on the mechanism to assign splits
> >> and
> >>>      >>>>>>>>>> coordinate
> >>>      >>>>>>>>>>>> commits to
> >>>      >>>>>>>>>>>>>>>>>>> external systems. The OperatorCoordinator is
> >>>      widely
> >>>      >>>>>>>> using
> >>>      >>>>>>>>> in
> >>>      >>>>>>>>>>>> flink kafka
> >>>      >>>>>>>>>>>>>>>>>>> connector, flink pulsar connector, flink cdc
> >>>      connector,
> >>>      >>>>>>>>>> flink
> >>>      >>>>>>>>>>>> hudi
> >>>      >>>>>>>>>>>>>>>>>>> connector and so on.
> >>>      >>>>>>>>>>>>>>>>>>>
> >>>      >>>>>>>>>>>>>>>>>>> But there is not a suitable metric group scope
> >>>      for the
> >>>      >>>>>>>>>>>>>>>>> OperatorCoordinator
> >>>      >>>>>>>>>>>>>>>>>>> and not an implementation for the interface
> >>>      >>>>>>>>>>>>>>>>> OperatorCoordinatorMetricGroup.
> >>>      >>>>>>>>>>>>>>>>>>> These metrics in OperatorCoordinator could be
> >>>      how many
> >>>      >>>>>>>>>>>> splits/partitions
> >>>      >>>>>>>>>>>>>>>>>>> have been assigned to source readers, how many
> >>>      files
> >>>      >>>>>>>> have
> >>>      >>>>>>>>>> been
> >>>      >>>>>>>>>>>> written
> >>>      >>>>>>>>>>>>>>>>> out
> >>>      >>>>>>>>>>>>>>>>>>> by sink writers, these metrics not only help
> >>>      users to
> >>>      >>>>>>>> know
> >>>      >>>>>>>>>> the
> >>>      >>>>>>>>>>>> job
> >>>      >>>>>>>>>>>>>>>>> progress
> >>>      >>>>>>>>>>>>>>>>>>> but also make big job maintaining easier.
> Thus we
> >>>      >>>>>>>> propose
> >>>      >>>>>>>>>> the
> >>>      >>>>>>>>>>>> FLIP-274
> >>>      >>>>>>>>>>>>>>>>> to
> >>>      >>>>>>>>>>>>>>>>>>> introduce a new metric group scope for
> >>>      >>>>>>>> OperatorCoordinator
> >>>      >>>>>>>>>> and
> >>>      >>>>>>>>>>>> provide
> >>>      >>>>>>>>>>>>>>>>> an
> >>>      >>>>>>>>>>>>>>>>>>> internal implementation for
> >>>      >>>>>>>>> OperatorCoordinatorMetricGroup.
> >>>      >>>>>>>>>>>>>>>>>>> Could you help review this FLIP when you get
> >>>      time? Any
> >>>      >>>>>>>>>>> feedback
> >>>      >>>>>>>>>>>> is
> >>>      >>>>>>>>>>>>>>>>>>> appreciated!
> >>>      >>>>>>>>>>>>>>>>>>>
> >>>      >>>>>>>>>>>>>>>>>>> Best,
> >>>      >>>>>>>>>>>>>>>>>>> Hang
> >>>      >>>>>>>>>>>>>>>>>>>
> >>>      >>>>>>>>>>>>>>>>>>> [1]
> >>>      >>>>>>>>>>>>>>>>>>>
> >>>      >>>>>>>>>>>>>>>>>>>
> >>>      >>
> >>>
> >>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-274%3A+Introduce+metric+group+for+OperatorCoordinator
> >>>      >>
> >>>
>
>

Re: [DISCUSS] FLIP-274 : Introduce metric group for OperatorCoordinator

Posted by Chesnay Schepler <ch...@apache.org>.

I hadn't looked in detail in to how the task info can be introduced, but 
given that the coordinates are created by the execution graph, where we 
only work on tasks, it should be possible and rather simple.

The OperatorCoordinatorHolder gets access to the ExecutionJobVertex from 
which it can extract everything and create the task/operator groups. 
Exposure to the coordinator could happen via the coordinator context as 
originally planned.

 > As we know, coordinator metrics are just like JMJobMetricGroup, which 
don't belong to any tasks/vertexes.

I don't think this is true, every coordinator is scoped to a specific 
operator, and every operator is associated with a particular vertex.

 From the javadocs:

" A coordinator for runtime operators. The OperatorCoordinator runs on 
the master, associated with
  the job vertex of the operator. It communicates with operators via 
sending operator events."

On 28/01/2023 09:15, Jark Wu wrote:
> Hi all,
>
> IIUC, Chesnay means we should have a more general metric group for
> **operator**
> not for **coordinator** in JM, this would be useful to extend
> other operator-specific
> metrics in the future. That means the new scope format should be designed
> for
> the operator,
> e.g., metrics.scope.jm-operator=<host>.jobmanager.<job_name>.<operator_name>
> The coordinator metric is a subgroup (a constant "coordinator" suffix) of
> the JMOperatorMG.
>
> I think this is a nice design. However, I have a question about adding task
> id/name to this list.
> How to get the task id/name when reporting a coordinator metric? As we
> know, coordinator metrics
> are just like JMJobMetricGroup, which don't belong to any tasks/vertexes.
> Do you mean reporting
> coordinator metrics for every task id under the operator?
>
> Best,
> Jark
>
>
> On Thu, 19 Jan 2023 at 17:33, Chesnay Schepler <ch...@apache.org> wrote:
>
>>   > First, I do not understand why users have to configure the new scope
>> format, which has a default value.
>>
>> If you don't use scope formats, sure. If you do use scope formats, e.g.
>> to add a common prefix (which is the case for datadog users for
>> example), then the current default in the FLIP is insufficient and
>> requires the user to update the configuration.
>>
>>   > I usually do not need to change these configuration of scope formats
>> when submitting the flink job.
>>
>> Scope formats as a whole are quite a power-user feature, but that
>> doesn't mean we should ignore it.
>>
>>   > I try to let it extend the ComponentMetricGroup
>>
>> This isn't inherently required just because it is a "component".
>> Component metric groups should _only_ be used for cases where such a
>> component provides several bits of metadata.
>> For example, tasks provide vertex ids, task names, attempted IDs etc.,
>> and we'd like users to have the option on how this metadata ends up
>> being used (via scope formats). This currently can't be built with the
>> addGroup() methods, hence why the component groups exist.
>> However, the operator coordinator _itself_does not provide several bits
>> of metadata. Logically, the _operator_ does, not the coordinator.
>>
>>   > This make me decide to add a new scope format.
>>
>> And this is fine; just don't add a new scope format for the coordinator
>> but logically for operators on the JM side, such that we can extend the
>> set of operator-specific metrics in the future without having to add yet
>> another scope format.
>>
>>   > The relationship between the Operator and OperatorCoordinator is
>> maintained in the OperatorCoordinatorHolder. In the POC, the operator
>> name/id could be found and the OperatorCoordinatorMetricGroup will be
>> created here.
>>
>> That's all irrelevant.
>>
>>   > The registered metrics of OperatorCoordinator always are those which
>> can not be aggregated from the tasks, like the number of unassigned
>> splits in the SourceCoordinator. Actually we have not encountered the
>> scenario you mentioned. I think the OperatorCoordinatorMetricGroup
>> should only contain the metrics for itself instead of its subtasks.
>>
>> You are misunderstanding the problem.
>>
>> This isn't about aggregating metrics on our side or exposing metrics
>> from TMs on the JM side or anything like that.
>>
>> It's purely about the available metadata for operator metrics on the JM
>> side. Currently you suggest operator name / id; what I'm proposing is to
>> add task(==vertex!) id / name to this list.
>>
>> The benefit here is simple. I can look up all metrics where task_id==XXX
>> and get _everything_ related to that vertex, including all metrics
>> associated with operators that are part of that vertex, including the
>> coordinator.
>>
>>   > Using metrics.scope.jm.job is not enough to distinguish the different
>> coordinators.
>>
>> I did not suggest just using metrics.scope.jm.job. I suggested that as
>> the default for the jm operator scope format, that is used for the
>> proposed JobManagerOperatorMetricGroup, which will have some plain
>> metric groups as children for coordinators.
>> Aka, you have the JMOperatorMG, then you call addGroup("coordinators")
>> and then addGroup(coordinator_name) for each coordinator.
>>
>>   > So there are two choice for the
>> InternalOperatorCoordinatorMetricGroup: 1. add and improve the new scope
>> format; 2. use the metrics.scope.jm.job and ProxyMetricGroup. Which one
>> is better?
>>
>> There are more options than that.
>> I feel like there are a lot of misunderstandings in this discussion.
>> Please let me know if I could clear things up. If not I can also provide
>> a PoC based on your PoC if that can speed things up. It's not that
>> different anyway.
>>
>> On 19/01/2023 07:39, Hang Ruan wrote:
>>> Hi, chesnay,
>>>
>>> Thanks for your reply. I still have some doubts about the questions
>>> you raised.
>>>
>>>      > Extending the set of ScopeFormats is problematic because it in
>>>      practice
>>>      it breaks the config if users actively rely on it, since there's now
>>>      another key that they _must_ set for it to be
>>>      consistent/compatible with
>>>      their existing setup.
>>>      Unfortunately due to how powerful scope formats are we can't derive a
>>>      default value that matches their existing setup.
>>>      Hence we should try to do this as rarely as possible.
>>>
>>>
>>>      > This FLIP does not adhere to that since it proposes a dedicated
>>>      format
>>>      for coordinators; next time we want to expose operator-specific
>>>      metrics
>>>      (e.g., in the scheduler) we'd have to add another one to support it.
>>>
>>>
>>> First, I do not understand why users have to configure the new scope
>>> format, which has a default value. I usually do not need to change
>>> these configuration of scope formats when submitting the flink job.
>>> IMO, the OperatorCoordinator is a component running at the JobMaster.
>>> I try to let it extend the ComponentMetricGroup, but I cannot find
>>> a suitable scope. This make me decide to add a new scope format.
>>>
>>>      > Additionally, the configurable variables (operator name/id) are
>>>      logically not attached to the coordinator, but operators, so to me it
>>>      just doesn't make sense to structure it like this.
>>>
>>>
>>> The relationship between the Operator and OperatorCoordinator is
>>> maintained in the OperatorCoordinatorHolder. In the POC, the operator
>>> name/id could be found and the OperatorCoordinatorMetricGroup will be
>>> created here.
>>>
>>>      > Another thing I'm concerned about is that, because we don't include
>>>      tasks in the hierarchy, users wishing to collect all metrics for a
>>>      particular task (in this case ==vertex) now have to go
>>>      significantly out
>>>      of their way to get them, since they can no longer just filter by the
>>>      task ID but have to be filter for _all_ operators that are part of
>>>      the task.
>>>
>>>
>>> The registered metrics of OperatorCoordinator always are those which
>>> can not be aggregated from the tasks, like the number of unassigned
>>> splits in the SourceCoordinator. Actually we have not encountered the
>>> scenario you mentioned. I think the OperatorCoordinatorMetricGroup
>>> should only contain the metrics for itself instead of its subtasks.
>>>
>>> Using metrics.scope.jm.job is not enough to distinguish the different
>>> coordinators. The operator id/name is necessary. Then
>>> the implementation should be like this. But users can not change the
>>> scope in this way. This is also acceptable.
>>> @Internal public class InternalOperatorCoordinatorMetricGroup
>>>           extends ProxyMetricGroup<MetricGroup>
>>>           implements OperatorCoordinatorMetricGroup {
>>>       public InternalOperatorCoordinatorMetricGroup(
>>>               JobManagerJobMetricGroup parent, OperatorID operatorID,
>> String operatorName) {
>>>           super(parent.addGroup(operatorID +
>> operatorName).addGroup("coordinator")); }
>>> }
>>> So there are two choice for the
>>> InternalOperatorCoordinatorMetricGroup: 1. add and improve the new
>>> scope format; 2. use the metrics.scope.jm.job and ProxyMetricGroup.
>>> Which one is better?
>>>
>>> Best,
>>> Hang
>>>
>>>
>>> Chesnay Schepler <ch...@apache.org> 于2023年1月18日周三 17:03写道：
>>>
>>>      You're misunderstanding the problem.
>>>
>>>      Metric groups form a tree with each group providing certain metadata.
>>>
>>>      E.g., on the taskmanager we have a TM metric group that provides info
>>>      about the TM, that has child task metric groups, that have child
>>>      operator metric groups etc. The operator metric group again has
>>>      children, where we are now mostly in the user/connector land, like a
>>>      kafka metric group providing partition info or something.
>>>
>>>      What is being proposed is to create a coordinator metric group on
>>>      the JM
>>>      side that provides operator metadata.
>>>
>>>      A more appropriate structure is to create an operator group on the JM
>>>      side that provides this info, with the coordinator metric group
>>>      being a
>>>      child.
>>>
>>>      On 17/01/2023 19:56, Steven Wu wrote:
>>>      >> Additionally, the configurable variables (operator name/id) are
>>>      > logically not attached to the coordinator, but operators, so to
>>>      me it
>>>      > just doesn't make sense to structure it like this.
>>>      >
>>>      > Chesnay, maybe we should clarify the terminology. To me,
>>>      pperators (like
>>>      > FLIP-27 source) can have two parts (coordinator and
>>>      reader/subtask). I
>>>      > think it is fine to include operator name/id for coordinator
>>>      metrics.
>>>      >
>>>      > On Mon, Jan 16, 2023 at 2:13 AM Chesnay Schepler
>>>      <ch...@apache.org> wrote:
>>>      >
>>>      >> Slight correction: Using metrics.scope.jm.job as the default
>>>      should be
>>>      >> safe.
>>>      >>
>>>      >> On 16/01/2023 10:18, Chesnay Schepler wrote:
>>>      >>> The proposed ScopeFormat is still problematic for a few reasons.
>>>      >>>
>>>      >>> Extending the set of ScopeFormats is problematic because it in
>>>      >>> practice it breaks the config if users actively rely on it, since
>>>      >>> there's now another key that they _must_ set for it to be
>>>      >>> consistent/compatible with their existing setup.
>>>      >>> Unfortunately due to how powerful scope formats are we can't
>>>      derive a
>>>      >>> default value that matches their existing setup.
>>>      >>> Hence we should try to do this as rarely as possible.
>>>      >>>
>>>      >>> This FLIP does not adhere to that since it proposes a
>>>      dedicated format
>>>      >>> for coordinators; next time we want to expose operator-specific
>>>      >>> metrics (e.g., in the scheduler) we'd have to add another one to
>>>      >>> support it.
>>>      >>>
>>>      >>> Additionally, the configurable variables (operator name/id) are
>>>      >>> logically not attached to the coordinator, but operators, so
>>>      to me it
>>>      >>> just doesn't make sense to structure it like this.
>>>      >>>
>>>      >>> Another thing I'm concerned about is that, because we don't
>>>      include
>>>      >>> tasks in the hierarchy, users wishing to collect all metrics for
>> a
>>>      >>> particular task (in this case ==vertex) now have to go
>>>      significantly
>>>      >>> out of their way to get them, since they can no longer just
>>>      filter by
>>>      >>> the task ID but have to be filter for _all_ operators that are
>>>      part of
>>>      >>> the task.
>>>      >>>
>>>      >>> On 16/01/2023 03:09, Hang Ruan wrote:
>>>      >>>> Hi, @chesnay@apache.org <ch...@apache.org> ,
>>>      >>>>
>>>      >>>> Do you have time to help to review this FLIP again after the
>>>      >>>> modification?
>>>      >>>> Looking forward to your reply.
>>>      >>>> This FLIP will add a new configuration for the
>>>      >>>> OperatorCoordinatorMetricGroup scope format. It provides an
>>>      internal
>>>      >>>> implementation and is added as a component to the
>>>      >>>> JobManagerJobMetricGroup.
>>>      >>>> If something doesn't make sense, could you provide some
>>>      advice? It
>>>      >>>> will be
>>>      >>>> very helpful. Thanks a lot for your help.
>>>      >>>>
>>>      >>>> Best,
>>>      >>>> Hang
>>>      >>>>
>>>      >>>> Martijn Visser <ma...@apache.org> 于2023年1月11日周三
>>>      16:34写道：
>>>      >>>>
>>>      >>>>
>>>      >>>>> Hi Hang,
>>>      >>>>>
>>>      >>>>> I'm a bit surprised that this has gone to a vote, given that
>>>      Chesnay
>>>      >>>>> deliberately mentioned that he would vote against it as-is.
>>>      I would
>>>      >>>>> expect
>>>      >>>>> that before going to a vote, he has had the opportunity to
>>>      >>>>> participate in
>>>      >>>>> this discussion.
>>>      >>>>>
>>>      >>>>> Best regards,
>>>      >>>>>
>>>      >>>>> Martijn
>>>      >>>>>
>>>      >>>>> On Tue, Jan 3, 2023 at 12:53 PM Jark Wu <im...@gmail.com>
>>>      wrote:
>>>      >>>>>
>>>      >>>>>> Hi Dong,
>>>      >>>>>>
>>>      >>>>>> Regarding “SplitEnumeratorContext#metricGroup”, my only
>>>      concern is
>>>      >>>>>> that
>>>      >>>>>> this is a core interface for the FLIP. It’s hard to tell how
>>>      >>>>>> sources use
>>>      >>>>>> metric group without mentioning this interface. Even if
>>>      this is an
>>>      >>>>> existing
>>>      >>>>>> API, I think it’s worth introducing the interface again and
>>>      declaring
>>>      >>>>> that
>>>      >>>>>> we will implement the interface instead of a no-op method
>>>      in this
>>>      >>>>>> FLIP.
>>>      >>>>>>
>>>      >>>>>> Anyway, this is a minor problem and shouldn’t block this
>>>      FLIP. I’m
>>>      >>>>>> +1 to
>>>      >>>>>> start a vote.
>>>      >>>>>>
>>>      >>>>>> Best,
>>>      >>>>>> Jark
>>>      >>>>>>
>>>      >>>>>>
>>>      >>>>>>> 2023年1月3日 10:03，Hang Ruan <ru...@gmail.com> 写道：
>>>      >>>>>>>
>>>      >>>>>>> Hi, Jark and Dong,
>>>      >>>>>>>
>>>      >>>>>>> Thanks for your comments. Sorry for my late reply.
>>>      >>>>>>>
>>>      >>>>>>> For suggestion 1, I plan to implement the
>>>      >>>>>>> SplitEnumeratorMetricGroup in
>>>      >>>>>>> another issue, and it is not contained in this FLIP. I
>>>      will add some
>>>      >>>>>>> description about this part.
>>>      >>>>>>> For suggestion 2, changes about
>>>      OperatorCoordinator#metricGroup has
>>>      >>>>>> already
>>>      >>>>>>> been documented in the proposed change section.
>>>      >>>>>>>
>>>      >>>>>>> Best,
>>>      >>>>>>> Hang
>>>      >>>>>>>
>>>      >>>>>>> Dong Lin <li...@gmail.com> 于2023年1月1日周日 09:45写道：
>>>      >>>>>>>
>>>      >>>>>>>> Let me chime-in and add comments regarding the public
>>>      interface
>>>      >>>>> section.
>>>      >>>>>>>> Please see my comments inline.
>>>      >>>>>>>>
>>>      >>>>>>>> On Thu, Dec 29, 2022 at 6:08 PM Jark Wu
>>>      <im...@gmail.com> wrote:
>>>      >>>>>>>>
>>>      >>>>>>>>> Hi Hang,
>>>      >>>>>>>>>
>>>      >>>>>>>>> Thanks for driving this discussion. I think this is a
>>>      very useful
>>>      >>>>>> feature
>>>      >>>>>>>>> for connectors.
>>>      >>>>>>>>>
>>>      >>>>>>>>> The FLIP looks quite good to me, and I just have two
>>>      suggestions.
>>>      >>>>>>>>>
>>>      >>>>>>>>> 1. In the "Public Interface" section, mention that the
>>>      >>>>>>>>> implementation
>>>      >>>>>>>>> behavior of "SplitEnumeratorContext#metricGroup" is
>>>      changed from
>>>      >>>>>>>> returning
>>>      >>>>>>>>> null to returning a concrete SplitEnumeratorMetricGroup
>>>      instance.
>>>      >>>>> Even
>>>      >>>>>>>>> though the API is already there, the behavior change can
>>>      also be
>>>      >>>>>>>> considered
>>>      >>>>>>>>> a public change.
>>>      >>>>>>>>>
>>>      >>>>>>>> SplitEnumeratorContext#metricGroup is an interface and
>>>      this FLIP
>>>      >>>>>>>> does
>>>      >>>>>> not
>>>      >>>>>>>> seem to change its semantics/behavior. The FLIP does
>>>      change the
>>>      >>>>>>>> implementation/behavior of
>>>      SourceCoordinatorContext#metricGroup,
>>>      >>>>>>>> which
>>>      >>>>>> is
>>>      >>>>>>>> marked @Internal.
>>>      >>>>>>>>
>>>      >>>>>>>> Thus it might seem a bit weird to add in the public
>> interface
>>>      >>>>>>>> section
>>>      >>>>>>>> saying that we change the interface
>>>      >>>>>>>> SplitEnumeratorContext#metricGroup
>>>      >>>>>> from
>>>      >>>>>>>> returning null to non-null object.
>>>      >>>>>>>>
>>>      >>>>>>>> 2. Mention the newly added interface of
>>>      >>>>>> "OperatorCoordinator#metricGroup"
>>>      >>>>>>>>> in the "Proposed Changes" section or "Public Interface"
>>>      section. As
>>>      >>>>> the
>>>      >>>>>>>>> FLIP said, OperatorCoordinator is widely used in many
>>>      connectors.
>>>      >>>>>> Though
>>>      >>>>>>>> it
>>>      >>>>>>>>> is still an @Internal API, I think it is worth
>>>      mentioning the
>>>      >>>>>>>>> change
>>>      >>>>> in
>>>      >>>>>>>> the
>>>      >>>>>>>>> FLIP.
>>>      >>>>>>>> Since OperatorCoordinator is an internal API, it seems
>>>      reasonable to
>>>      >>>>>>>> explain it in the proposed change section. The FLIP seems
>>>      to have
>>>      >>>>>>>> documented this in point 5 of the proposed change section.
>>>      >>>>>>>>
>>>      >>>>>>>> BTW, if we think there are @internal classes that are
>>>      important
>>>      >>>>>>>> enough
>>>      >>>>>> to
>>>      >>>>>>>> be added in the public interface section, it might be
>>>      useful to
>>>      >>>>>> explicitly
>>>      >>>>>>>> discuss this topic and document it in the "*What are the
>>>      "public
>>>      >>>>>>>> interfaces" of the project*" in this
>>>      >>>>>>>> <
>>>      >>>>>>>>
>>>      >>
>>>
>> https://cwiki.apache.org/confluence/display/FLINK/Flink+Improvement+Proposals
>>>      >>>>>>>> wiki.
>>>      >>>>>>>>
>>>      >>>>>>>>
>>>      >>>>>>>>> Best,
>>>      >>>>>>>>> Jark
>>>      >>>>>>>>>
>>>      >>>>>>>>>
>>>      >>>>>>>>> On Mon, 26 Dec 2022 at 18:06, Hang Ruan
>>>      <ru...@gmail.com>
>>>      >>>>>> wrote:
>>>      >>>>>>>>>> Hi, thanks for the feedback, Zhu Zhu and Qingsheng.
>>>      >>>>>>>>>> After combining everyone's comments, the main concerns and
>>>      >>>>>>>> corresponding
>>>      >>>>>>>>>> adjustments are as follows.
>>>      >>>>>>>>>>
>>>      >>>>>>>>>> Q1: Common metrics are not quite useful.
>>>      >>>>>>>>>> numEventsIn and numEventsOut counters will be removed
>>>      from the
>>>      >>>>>>>>>> OperatorCoordinatorMetricGroup. These common metrics do
>> not
>>>      >>>>>>>>>> provide
>>>      >>>>>>>>> enough
>>>      >>>>>>>>>> information for users. The users are more willing to
>>>      get the
>>>      >>>>>>>>>> number
>>>      >>>>> of
>>>      >>>>>>>>>> events of the specified type instead of the total
>>>      number. And this
>>>      >>>>>>>> metric
>>>      >>>>>>>>>> is calculated differently. The implementation could
>>>      register the
>>>      >>>>>> metric
>>>      >>>>>>>>> by
>>>      >>>>>>>>>> themselves.
>>>      >>>>>>>>>>
>>>      >>>>>>>>>> Q2: This FLIP is overly complicated.
>>>      >>>>>>>>>> This FLIP will become concise after these modifications.
>>>      >>>>>>>>>> OperatorCoordinatorMetricGroup has already been
>>>      introduced into
>>>      >>>>> Flink
>>>      >>>>>>>> by
>>>      >>>>>>>>>> FLIP-179<
>>>      >>>>>>>>>>
>>>      >>>>>>>>>>
>>>      >>
>>>
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-179%3A+Expose+Standardized+Operator+Metrics
>>>      >>>>>>>>>>> .
>>>      >>>>>>>>>> And
>>>      >>>>>>>>>> this FLIP will not change it. This FLIP only provides a
>>>      new metric
>>>      >>>>>>>> option
>>>      >>>>>>>>>> and a new metric group scope. The changes in proposed
>>>      changes
>>>      >>>>> provide
>>>      >>>>>>>> the
>>>      >>>>>>>>>> details about the modifications for the internal
>>>      classes, which
>>>      >>>>> might
>>>      >>>>>>>>> make
>>>      >>>>>>>>>> it look complicated.
>>>      >>>>>>>>>>
>>>      >>>>>>>>>> Thanks for all the comments again. If there are no further
>>>      >>>>>>>>>> comments,
>>>      >>>>>> we
>>>      >>>>>>>>>> plan to start the voting thread this week.
>>>      >>>>>>>>>>
>>>      >>>>>>>>>> Best,
>>>      >>>>>>>>>> Hang
>>>      >>>>>>>>>>
>>>      >>>>>>>>>> Qingsheng Ren <re...@gmail.com> 于2022年12月26日周一
>>>      16:48写道：
>>>      >>>>>>>>>>
>>>      >>>>>>>>>>
>>>      >>>>>>>>>>> Thanks for the FLIP, Hang!
>>>      >>>>>>>>>>>
>>>      >>>>>>>>>>> This FLIP overall looks good to me. Actually I share
>>>      the same
>>>      >>>>> concern
>>>      >>>>>>>>>> with
>>>      >>>>>>>>>>> Zhu that numEventsIn and numEventsOut counters are not
>>>      quite
>>>      >>>>>>>>>>> useful
>>>      >>>>>>>> to
>>>      >>>>>>>>>> end
>>>      >>>>>>>>>>> users. OperatorEvent is a quite low-level abstraction,
>>>      which
>>>      >>>>> requires
>>>      >>>>>>>>>>> instantialization in order to be practical to users and
>>>      >>>>>>>>>>> developers,
>>>      >>>>>>>> so
>>>      >>>>>>>>>>> maybe it's better to exclude them from the FLIP.
>>>      >>>>>>>>>>>
>>>      >>>>>>>>>>> Best,
>>>      >>>>>>>>>>> Qingsheng
>>>      >>>>>>>>>>> Ververica (Alibaba)
>>>      >>>>>>>>>>>
>>>      >>>>>>>>>>> On Mon, Dec 26, 2022 at 12:08 PM Zhu Zhu
>>>      <re...@gmail.com>
>>>      >>>>> wrote:
>>>      >>>>>>>>>>>> Hi Hang,
>>>      >>>>>>>>>>>> I still see no strong reason why we need
>>>      >>>>>>>>>>>> numEventsIn/numEventsOut
>>>      >>>>>>>>>>> metrics.
>>>      >>>>>>>>>>>> In the discussion in FLINK-29801, I can see the same
>>>      concern
>>>      >>>>>>>>>>>> from
>>>      >>>>>>>>>> others.
>>>      >>>>>>>>>>>> So I prefer to exclude them from this FLIP to avoid
>>>      >>>>>>>>>>>> over-extending
>>>      >>>>>>>>> the
>>>      >>>>>>>>>>>> scope.
>>>      >>>>>>>>>>>>
>>>      >>>>>>>>>>>> Thanks,
>>>      >>>>>>>>>>>> Zhu
>>>      >>>>>>>>>>>>
>>>      >>>>>>>>>>>> Hang Ruan <ru...@gmail.com> 于2022年12月23日周五
>>>      >>>>>>>>>>>> 15:21写道：
>>>      >>>>>>>>>>>>> Hi everyone,
>>>      >>>>>>>>>>>>>
>>>      >>>>>>>>>>>>> As for the Zhu Zhu's problem, I think we should keep
>>>      the common
>>>      >>>>>>>>>>> metrics,
>>>      >>>>>>>>>>>> which will help to observe incoming and outgoing
>>>      events. What do
>>>      >>>>>>>> you
>>>      >>>>>>>>>>> think,
>>>      >>>>>>>>>>>> @Zhu Zhu ?
>>>      >>>>>>>>>>>>> And @Chesnay, are there any other issues you are
>>>      more concerned
>>>      >>>>>>>>>> about?
>>>      >>>>>>>>>>>> Looking forward to your reply.
>>>      >>>>>>>>>>>>> Thanks for all the comments. If there are no further
>>>      >>>>>>>>>>>>> comments, we
>>>      >>>>>>>>>> plan
>>>      >>>>>>>>>>>> to start the voting thread next week.
>>>      >>>>>>>>>>>>> Best,
>>>      >>>>>>>>>>>>> Hang
>>>      >>>>>>>>>>>>>
>>>      >>>>>>>>>>>>> Hang Ruan <ru...@gmail.com> 于2022年12月15日周四
>>>      >>>>>>>>>>>>> 16:49写道：
>>>      >>>>>>>>>>>>>> Hi, Zhu Zhu,
>>>      >>>>>>>>>>>>>>
>>>      >>>>>>>>>>>>>> Thanks for your feedback!
>>>      >>>>>>>>>>>>>>
>>>      >>>>>>>>>>>>>> The OperatorCoordinator implementations are
>>>      different. And
>>>      >>>>>>>>>>>>>> their
>>>      >>>>>>>>>>>> metrics are much different too. We try to find the
>> common
>>>      >>>>>>>>>>>> metrics
>>>      >>>>>>>> and
>>>      >>>>>>>>>> put
>>>      >>>>>>>>>>>> them in the OperatorCoordinatorMetricGroup. If most
>>>      developers
>>>      >>>>>>>> think
>>>      >>>>>>>>> we
>>>      >>>>>>>>>>> do
>>>      >>>>>>>>>>>> not need these common metrics, removing the common
>>>      metrics is
>>>      >>>>>>>>>>>> also
>>>      >>>>>>>>>>>> acceptable.
>>>      >>>>>>>>>>>>>> Best,
>>>      >>>>>>>>>>>>>> Hang
>>>      >>>>>>>>>>>>>>
>>>      >>>>>>>>>>>>>> Zhu Zhu <re...@gmail.com> 于2022年12月14日周三
>>>      22:09写道：
>>>      >>>>>>>>>>>>>>> Hi Hang & MengYue,
>>>      >>>>>>>>>>>>>>>
>>>      >>>>>>>>>>>>>>> Thanks for creating this FLIP!
>>>      >>>>>>>>>>>>>>>
>>>      >>>>>>>>>>>>>>> I think it is very useful, mainly in two aspects:
>>>      >>>>>>>>>>>>>>> 1. Enables OperatorCoordinators to register metrics.
>>>      >>>>>>>>>>>>>>> Currently
>>>      >>>>>>>>>>>>>>> the coordinators has no way to do this. And operator
>>>      >>>>>>>> coordinator
>>>      >>>>>>>>>>>>>>> metric group further enables the SplitEnumerator
>>>      to have
>>>      >>>>>>>>>>>>>>> access
>>>      >>>>>>>>>>>>>>> to a registered metric group (via the existing public
>>>      >>>>>>>>>>>>>>> interface
>>>      >>>>>>>>>>>>>>> SplitEnumeratorContext#metricGroup()), which is
>>>      null at the
>>>      >>>>>>>>> moment.
>>>      >>>>>>>>>>>>>>> 2. Defines the scope of operator coordinator
>>>      metrics. A clear
>>>      >>>>>>>>>>>> definition
>>>      >>>>>>>>>>>>>>> makes it easy for users to find their wanted
>>>      metrics. The
>>>      >>>>>>>>>> definition
>>>      >>>>>>>>>>>>>>> also helps to avoid conflicts of metrics from
>> multiple
>>>      >>>>>>>>>>>> OperatorCoordinators
>>>      >>>>>>>>>>>>>>> of the same kind. E.g. each SourceCoordinator may
>>>      have its
>>>      >>>>>>>>>>>>>>> own
>>>      >>>>>>>>>>>>>>> numSourceSplits metric, these metrics should not
>>>      be directly
>>>      >>>>>>>>>>> registered
>>>      >>>>>>>>>>>>>>> to the job metric group.
>>>      >>>>>>>>>>>>>>>
>>>      >>>>>>>>>>>>>>> What I'm a bit concerned is the necessity of the
>>>      introduced
>>>      >>>>>>>>> common
>>>      >>>>>>>>>>>> metrics
>>>      >>>>>>>>>>>>>>> numEventsInCounter & numEventsOutCounter. If there
>>>      any case
>>>      >>>>>>>> which
>>>      >>>>>>>>>>>> strongly
>>>      >>>>>>>>>>>>>>> requires them?
>>>      >>>>>>>>>>>>>>>
>>>      >>>>>>>>>>>>>>> Regarding the concerns of Chesnay,
>>>      >>>>>>>>>>>>>>>> A dedicated coordinator MG implementation is
>> overkill
>>>      >>>>>>>>>>>>>>> Directly using the job metric group can result in
>>>      metric
>>>      >>>>>>>>> conflicts,
>>>      >>>>>>>>>>> as
>>>      >>>>>>>>>>>> mentioned
>>>      >>>>>>>>>>>>>>> in above #2.
>>>      >>>>>>>>>>>>>>>
>>>      >>>>>>>>>>>>>>> Thanks,
>>>      >>>>>>>>>>>>>>> Zhu
>>>      >>>>>>>>>>>>>>>
>>>      >>>>>>>>>>>>>>> Dong Lin <li...@gmail.com> 于2022年12月10日周六
>>>      >>>>>>>>>>>>>>> 14:16写道：
>>>      >>>>>>>>>>>>>>>
>>>      >>>>>>>>>>>>>>>> Hi Chesney,
>>>      >>>>>>>>>>>>>>>>
>>>      >>>>>>>>>>>>>>>> Just to double check with you,
>>>      >>>>>>>>>>>>>>>> OperatorCoordinatorMetricGroup
>>>      >>>>>>>>>>>> (annotated as
>>>      >>>>>>>>>>>>>>>> @PublicEvolving) has already been introduced into
>>>      Flink by
>>>      >>>>>>>>>> FLIP-179
>>>      >>>>>>>>>>>>>>>> <
>>>      >>
>>>
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-179%3A+Expose+Standardized+Operator+Metrics
>>>      >>>>>>>>>>>>> .
>>>      >>>>>>>>>>>>>>>> And that FLIP has got you +1.. Do you mean we
>>>      should remove
>>>      >>>>>>>>> this
>>>      >>>>>>>>>>>>>>>> OperatorCoordinatorMetricGroup?
>>>      >>>>>>>>>>>>>>>>
>>>      >>>>>>>>>>>>>>>> Regards,
>>>      >>>>>>>>>>>>>>>> Dong
>>>      >>>>>>>>>>>>>>>>
>>>      >>>>>>>>>>>>>>>> On Sat, Dec 10, 2022 at 1:33 AM Chesnay Schepler <
>>>      >>>>>>>>>>> chesnay@apache.org>
>>>      >>>>>>>>>>>> wrote:
>>>      >>>>>>>>>>>>>>>>> As a whole I feel like this FLIP is overly
>>>      complicated. A
>>>      >>>>>>>>>>> dedicated
>>>      >>>>>>>>>>>>>>>>> coordinator MG implementation is overkill; it
>>>      could just
>>>      >>>>>>>>> re-use
>>>      >>>>>>>>>>> the
>>>      >>>>>>>>>>>>>>>>> existing Task/OperatorMGs to create the same
>>>      structure we
>>>      >>>>>>>>> have
>>>      >>>>>>>>>> on
>>>      >>>>>>>>>>>> TMs,
>>>      >>>>>>>>>>>>>>>>> similar to what we did with the Job MG.
>>>      >>>>>>>>>>>>>>>>>
>>>      >>>>>>>>>>>>>>>>> However, I'm not convinced that this is required
>>>      anyway,
>>>      >>>>>>>>>> because
>>>      >>>>>>>>>>>> all the
>>>      >>>>>>>>>>>>>>>>> example metrics you listed can be implemented on
>>>      the TM
>>>      >>>>>>>> side
>>>      >>>>>>>>> +
>>>      >>>>>>>>>>>>>>>>> aggregating them in the external metrics backend.
>>>      >>>>>>>>>>>>>>>>>
>>>      >>>>>>>>>>>>>>>>> Since I'm on holidays soon, just so no one tries
>>>      to pull a
>>>      >>>>>>>>> fast
>>>      >>>>>>>>>>>> one on
>>>      >>>>>>>>>>>>>>>>> me, if this were to go to a vote as-is I'd be
>>>      against it.
>>>      >>>>>>>>>>>>>>>>>
>>>      >>>>>>>>>>>>>>>>>
>>>      >>>>>>>>>>>>>>>>> On 09/12/2022 15:30, Dong Lin wrote:
>>>      >>>>>>>>>>>>>>>>>> Hi Hang,
>>>      >>>>>>>>>>>>>>>>>>
>>>      >>>>>>>>>>>>>>>>>> Thanks for the FLIP! The FLIP looks good and it
>>>      is pretty
>>>      >>>>>>>>>>>> informative.
>>>      >>>>>>>>>>>>>>>>>> I have just two minor comments regarding names:
>>>      >>>>>>>>>>>>>>>>>> - Would it be useful to rename the config key as
>>>      >>>>>>>>>>>>>>>>>> *metrics.scope.jm.job.operator-coordinator* for
>>>      >>>>>>>> consistency
>>>      >>>>>>>>>>> with
>>>      >>>>>>>>>>>>>>>>>> *metrics.scope.jm.job
>>>      >>>>>>>>>>>>>>>>>> *(which is not named as *jm-job)?
>>>      >>>>>>>>>>>>>>>>>> - Maybe rename the variable as
>>>      >>>>>>>>>>> SCOPE_NAMING_OPERATOR_COORDINATOR
>>>      >>>>>>>>>>>> for
>>>      >>>>>>>>>>>>>>>>>> simplicity and consistency with
>>>      SCOPE_NAMING_OPERATOR
>>>      >>>>>>>>> (which
>>>      >>>>>>>>>> is
>>>      >>>>>>>>>>>> not named
>>>      >>>>>>>>>>>>>>>>>> as SCOPE_NAMING_TM_JOB_OPERATOR)?
>>>      >>>>>>>>>>>>>>>>>>
>>>      >>>>>>>>>>>>>>>>>> Cheers,
>>>      >>>>>>>>>>>>>>>>>> Dong
>>>      >>>>>>>>>>>>>>>>>>
>>>      >>>>>>>>>>>>>>>>>>
>>>      >>>>>>>>>>>>>>>>>>
>>>      >>>>>>>>>>>>>>>>>> On Thu, Dec 8, 2022 at 3:28 PM Hang Ruan <
>>>      >>>>>>>>>>> ruanhang1993@gmail.com>
>>>      >>>>>>>>>>>> wrote:
>>>      >>>>>>>>>>>>>>>>>>> Hi all,
>>>      >>>>>>>>>>>>>>>>>>>
>>>      >>>>>>>>>>>>>>>>>>> MengYue and I created FLIP-274[1] Introduce
>>>      metric group
>>>      >>>>>>>>> for
>>>      >>>>>>>>>>>>>>>>>>> OperatorCoordinator. OperatorCoordinator is the
>>>      >>>>>>>>> coordinator
>>>      >>>>>>>>>>> for
>>>      >>>>>>>>>>>> runtime
>>>      >>>>>>>>>>>>>>>>>>> operators and running on Job Manager. The
>>>      coordination
>>>      >>>>>>>>>>>> mechanism is
>>>      >>>>>>>>>>>>>>>>>>> operator events between OperatorCoordinator
>>>      and its all
>>>      >>>>>>>>>>>> operators, the
>>>      >>>>>>>>>>>>>>>>>>> coordination is more and more using in Flink, for
>>>      >>>>>>>> example
>>>      >>>>>>>>>> many
>>>      >>>>>>>>>>>> Sources
>>>      >>>>>>>>>>>>>>>>> and
>>>      >>>>>>>>>>>>>>>>>>> Sinks depend on the mechanism to assign splits
>> and
>>>      >>>>>>>>>> coordinate
>>>      >>>>>>>>>>>> commits to
>>>      >>>>>>>>>>>>>>>>>>> external systems. The OperatorCoordinator is
>>>      widely
>>>      >>>>>>>> using
>>>      >>>>>>>>> in
>>>      >>>>>>>>>>>> flink kafka
>>>      >>>>>>>>>>>>>>>>>>> connector, flink pulsar connector, flink cdc
>>>      connector,
>>>      >>>>>>>>>> flink
>>>      >>>>>>>>>>>> hudi
>>>      >>>>>>>>>>>>>>>>>>> connector and so on.
>>>      >>>>>>>>>>>>>>>>>>>
>>>      >>>>>>>>>>>>>>>>>>> But there is not a suitable metric group scope
>>>      for the
>>>      >>>>>>>>>>>>>>>>> OperatorCoordinator
>>>      >>>>>>>>>>>>>>>>>>> and not an implementation for the interface
>>>      >>>>>>>>>>>>>>>>> OperatorCoordinatorMetricGroup.
>>>      >>>>>>>>>>>>>>>>>>> These metrics in OperatorCoordinator could be
>>>      how many
>>>      >>>>>>>>>>>> splits/partitions
>>>      >>>>>>>>>>>>>>>>>>> have been assigned to source readers, how many
>>>      files
>>>      >>>>>>>> have
>>>      >>>>>>>>>> been
>>>      >>>>>>>>>>>> written
>>>      >>>>>>>>>>>>>>>>> out
>>>      >>>>>>>>>>>>>>>>>>> by sink writers, these metrics not only help
>>>      users to
>>>      >>>>>>>> know
>>>      >>>>>>>>>> the
>>>      >>>>>>>>>>>> job
>>>      >>>>>>>>>>>>>>>>> progress
>>>      >>>>>>>>>>>>>>>>>>> but also make big job maintaining easier. Thus we
>>>      >>>>>>>> propose
>>>      >>>>>>>>>> the
>>>      >>>>>>>>>>>> FLIP-274
>>>      >>>>>>>>>>>>>>>>> to
>>>      >>>>>>>>>>>>>>>>>>> introduce a new metric group scope for
>>>      >>>>>>>> OperatorCoordinator
>>>      >>>>>>>>>> and
>>>      >>>>>>>>>>>> provide
>>>      >>>>>>>>>>>>>>>>> an
>>>      >>>>>>>>>>>>>>>>>>> internal implementation for
>>>      >>>>>>>>> OperatorCoordinatorMetricGroup.
>>>      >>>>>>>>>>>>>>>>>>> Could you help review this FLIP when you get
>>>      time? Any
>>>      >>>>>>>>>>> feedback
>>>      >>>>>>>>>>>> is
>>>      >>>>>>>>>>>>>>>>>>> appreciated!
>>>      >>>>>>>>>>>>>>>>>>>
>>>      >>>>>>>>>>>>>>>>>>> Best,
>>>      >>>>>>>>>>>>>>>>>>> Hang
>>>      >>>>>>>>>>>>>>>>>>>
>>>      >>>>>>>>>>>>>>>>>>> [1]
>>>      >>>>>>>>>>>>>>>>>>>
>>>      >>>>>>>>>>>>>>>>>>>
>>>      >>
>>>
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-274%3A+Introduce+metric+group+for+OperatorCoordinator
>>>      >>
>>>

Re: [DISCUSS] FLIP-274 : Introduce metric group for OperatorCoordinator

Posted by Hang Ruan <ru...@gmail.com>.

Hi all,

Thanks for all the comments. Sorry for my late reply.

I think I understand most of the suggestions. I still have the same
question about the adding task id/name like Jark.
And I have refactored the POC like this (
https://github.com/ruanhang1993/flink/commit/321192d17afbb9d00285a78a121676ae5e371292),
whose scope has the task name. When we initialize the coordinator, it is
hard to get the task name like we do in the TaskMetricGroup.
I also found a weird thing
that ExecutionJobVertex#getTaskInformationOrBlobKey will use
the jobVertex.getName() as the task name.
But ExecutionJobVertex#createOperatorCoordinatorHolder also will use
the jobVertex.getName() as the operatorName
for LazyInitializedCoordinatorContext. Maybe the task name will
be duplicated to the operator name in the metric name.

Best,
Hang

Jark Wu <im...@gmail.com> 于2023年1月28日周六 16:15写道：

> Hi all,
>
> IIUC, Chesnay means we should have a more general metric group for
> **operator**
> not for **coordinator** in JM, this would be useful to extend
> other operator-specific
> metrics in the future. That means the new scope format should be designed
> for
> the operator,
> e.g., metrics.scope.jm-operator=<host>.jobmanager.<job_name>.<operator_name>
> The coordinator metric is a subgroup (a constant "coordinator" suffix) of
> the JMOperatorMG.
>
> I think this is a nice design. However, I have a question about adding
> task id/name to this list.
> How to get the task id/name when reporting a coordinator metric? As we
> know, coordinator metrics
> are just like JMJobMetricGroup, which don't belong to any tasks/vertexes.
> Do you mean reporting
> coordinator metrics for every task id under the operator?
>
> Best,
> Jark
>
>
> On Thu, 19 Jan 2023 at 17:33, Chesnay Schepler <ch...@apache.org> wrote:
>
>>  > First, I do not understand why users have to configure the new scope
>> format, which has a default value.
>>
>> If you don't use scope formats, sure. If you do use scope formats, e.g.
>> to add a common prefix (which is the case for datadog users for
>> example), then the current default in the FLIP is insufficient and
>> requires the user to update the configuration.
>>
>>  > I usually do not need to change these configuration of scope formats
>> when submitting the flink job.
>>
>> Scope formats as a whole are quite a power-user feature, but that
>> doesn't mean we should ignore it.
>>
>>  > I try to let it extend the ComponentMetricGroup
>>
>> This isn't inherently required just because it is a "component".
>> Component metric groups should _only_ be used for cases where such a
>> component provides several bits of metadata.
>> For example, tasks provide vertex ids, task names, attempted IDs etc.,
>> and we'd like users to have the option on how this metadata ends up
>> being used (via scope formats). This currently can't be built with the
>> addGroup() methods, hence why the component groups exist.
>> However, the operator coordinator _itself_does not provide several bits
>> of metadata. Logically, the _operator_ does, not the coordinator.
>>
>>  > This make me decide to add a new scope format.
>>
>> And this is fine; just don't add a new scope format for the coordinator
>> but logically for operators on the JM side, such that we can extend the
>> set of operator-specific metrics in the future without having to add yet
>> another scope format.
>>
>>  > The relationship between the Operator and OperatorCoordinator is
>> maintained in the OperatorCoordinatorHolder. In the POC, the operator
>> name/id could be found and the OperatorCoordinatorMetricGroup will be
>> created here.
>>
>> That's all irrelevant.
>>
>>  > The registered metrics of OperatorCoordinator always are those which
>> can not be aggregated from the tasks, like the number of unassigned
>> splits in the SourceCoordinator. Actually we have not encountered the
>> scenario you mentioned. I think the OperatorCoordinatorMetricGroup
>> should only contain the metrics for itself instead of its subtasks.
>>
>> You are misunderstanding the problem.
>>
>> This isn't about aggregating metrics on our side or exposing metrics
>> from TMs on the JM side or anything like that.
>>
>> It's purely about the available metadata for operator metrics on the JM
>> side. Currently you suggest operator name / id; what I'm proposing is to
>> add task(==vertex!) id / name to this list.
>>
>> The benefit here is simple. I can look up all metrics where task_id==XXX
>> and get _everything_ related to that vertex, including all metrics
>> associated with operators that are part of that vertex, including the
>> coordinator.
>>
>>  > Using metrics.scope.jm.job is not enough to distinguish the different
>> coordinators.
>>
>> I did not suggest just using metrics.scope.jm.job. I suggested that as
>> the default for the jm operator scope format, that is used for the
>> proposed JobManagerOperatorMetricGroup, which will have some plain
>> metric groups as children for coordinators.
>> Aka, you have the JMOperatorMG, then you call addGroup("coordinators")
>> and then addGroup(coordinator_name) for each coordinator.
>>
>>  > So there are two choice for the
>> InternalOperatorCoordinatorMetricGroup: 1. add and improve the new scope
>> format; 2. use the metrics.scope.jm.job and ProxyMetricGroup. Which one
>> is better?
>>
>> There are more options than that.
>> I feel like there are a lot of misunderstandings in this discussion.
>> Please let me know if I could clear things up. If not I can also provide
>> a PoC based on your PoC if that can speed things up. It's not that
>> different anyway.
>>
>> On 19/01/2023 07:39, Hang Ruan wrote:
>> > Hi, chesnay,
>> >
>> > Thanks for your reply. I still have some doubts about the questions
>> > you raised.
>> >
>> >     > Extending the set of ScopeFormats is problematic because it in
>> >     practice
>> >     it breaks the config if users actively rely on it, since there's now
>> >     another key that they _must_ set for it to be
>> >     consistent/compatible with
>> >     their existing setup.
>> >     Unfortunately due to how powerful scope formats are we can't derive
>> a
>> >     default value that matches their existing setup.
>> >     Hence we should try to do this as rarely as possible.
>> >
>> >
>> >     > This FLIP does not adhere to that since it proposes a dedicated
>> >     format
>> >     for coordinators; next time we want to expose operator-specific
>> >     metrics
>> >     (e.g., in the scheduler) we'd have to add another one to support it.
>> >
>> >
>> > First, I do not understand why users have to configure the new scope
>> > format, which has a default value. I usually do not need to change
>> > these configuration of scope formats when submitting the flink job.
>> > IMO, the OperatorCoordinator is a component running at the JobMaster.
>> > I try to let it extend the ComponentMetricGroup, but I cannot find
>> > a suitable scope. This make me decide to add a new scope format.
>> >
>> >     > Additionally, the configurable variables (operator name/id) are
>> >     logically not attached to the coordinator, but operators, so to me
>> it
>> >     just doesn't make sense to structure it like this.
>> >
>> >
>> > The relationship between the Operator and OperatorCoordinator is
>> > maintained in the OperatorCoordinatorHolder. In the POC, the operator
>> > name/id could be found and the OperatorCoordinatorMetricGroup will be
>> > created here.
>> >
>> >     > Another thing I'm concerned about is that, because we don't
>> include
>> >     tasks in the hierarchy, users wishing to collect all metrics for a
>> >     particular task (in this case ==vertex) now have to go
>> >     significantly out
>> >     of their way to get them, since they can no longer just filter by
>> the
>> >     task ID but have to be filter for _all_ operators that are part of
>> >     the task.
>> >
>> >
>> > The registered metrics of OperatorCoordinator always are those which
>> > can not be aggregated from the tasks, like the number of unassigned
>> > splits in the SourceCoordinator. Actually we have not encountered the
>> > scenario you mentioned. I think the OperatorCoordinatorMetricGroup
>> > should only contain the metrics for itself instead of its subtasks.
>> >
>> > Using metrics.scope.jm.job is not enough to distinguish the different
>> > coordinators. The operator id/name is necessary. Then
>> > the implementation should be like this. But users can not change the
>> > scope in this way. This is also acceptable.
>> > @Internal public class InternalOperatorCoordinatorMetricGroup
>> >          extends ProxyMetricGroup<MetricGroup>
>> >          implements OperatorCoordinatorMetricGroup {
>> >      public InternalOperatorCoordinatorMetricGroup(
>> >              JobManagerJobMetricGroup parent, OperatorID operatorID,
>> String operatorName) {
>> >          super(parent.addGroup(operatorID +
>> operatorName).addGroup("coordinator")); }
>> > }
>> > So there are two choice for the
>> > InternalOperatorCoordinatorMetricGroup: 1. add and improve the new
>> > scope format; 2. use the metrics.scope.jm.job and ProxyMetricGroup.
>> > Which one is better?
>> >
>> > Best,
>> > Hang
>> >
>> >
>> > Chesnay Schepler <ch...@apache.org> 于2023年1月18日周三 17:03写道：
>> >
>> >     You're misunderstanding the problem.
>> >
>> >     Metric groups form a tree with each group providing certain
>> metadata.
>> >
>> >     E.g., on the taskmanager we have a TM metric group that provides
>> info
>> >     about the TM, that has child task metric groups, that have child
>> >     operator metric groups etc. The operator metric group again has
>> >     children, where we are now mostly in the user/connector land, like a
>> >     kafka metric group providing partition info or something.
>> >
>> >     What is being proposed is to create a coordinator metric group on
>> >     the JM
>> >     side that provides operator metadata.
>> >
>> >     A more appropriate structure is to create an operator group on the
>> JM
>> >     side that provides this info, with the coordinator metric group
>> >     being a
>> >     child.
>> >
>> >     On 17/01/2023 19:56, Steven Wu wrote:
>> >     >> Additionally, the configurable variables (operator name/id) are
>> >     > logically not attached to the coordinator, but operators, so to
>> >     me it
>> >     > just doesn't make sense to structure it like this.
>> >     >
>> >     > Chesnay, maybe we should clarify the terminology. To me,
>> >     pperators (like
>> >     > FLIP-27 source) can have two parts (coordinator and
>> >     reader/subtask). I
>> >     > think it is fine to include operator name/id for coordinator
>> >     metrics.
>> >     >
>> >     > On Mon, Jan 16, 2023 at 2:13 AM Chesnay Schepler
>> >     <ch...@apache.org> wrote:
>> >     >
>> >     >> Slight correction: Using metrics.scope.jm.job as the default
>> >     should be
>> >     >> safe.
>> >     >>
>> >     >> On 16/01/2023 10:18, Chesnay Schepler wrote:
>> >     >>> The proposed ScopeFormat is still problematic for a few reasons.
>> >     >>>
>> >     >>> Extending the set of ScopeFormats is problematic because it in
>> >     >>> practice it breaks the config if users actively rely on it,
>> since
>> >     >>> there's now another key that they _must_ set for it to be
>> >     >>> consistent/compatible with their existing setup.
>> >     >>> Unfortunately due to how powerful scope formats are we can't
>> >     derive a
>> >     >>> default value that matches their existing setup.
>> >     >>> Hence we should try to do this as rarely as possible.
>> >     >>>
>> >     >>> This FLIP does not adhere to that since it proposes a
>> >     dedicated format
>> >     >>> for coordinators; next time we want to expose operator-specific
>> >     >>> metrics (e.g., in the scheduler) we'd have to add another one to
>> >     >>> support it.
>> >     >>>
>> >     >>> Additionally, the configurable variables (operator name/id) are
>> >     >>> logically not attached to the coordinator, but operators, so
>> >     to me it
>> >     >>> just doesn't make sense to structure it like this.
>> >     >>>
>> >     >>> Another thing I'm concerned about is that, because we don't
>> >     include
>> >     >>> tasks in the hierarchy, users wishing to collect all metrics
>> for a
>> >     >>> particular task (in this case ==vertex) now have to go
>> >     significantly
>> >     >>> out of their way to get them, since they can no longer just
>> >     filter by
>> >     >>> the task ID but have to be filter for _all_ operators that are
>> >     part of
>> >     >>> the task.
>> >     >>>
>> >     >>> On 16/01/2023 03:09, Hang Ruan wrote:
>> >     >>>> Hi, @chesnay@apache.org <ch...@apache.org> ,
>> >     >>>>
>> >     >>>> Do you have time to help to review this FLIP again after the
>> >     >>>> modification?
>> >     >>>> Looking forward to your reply.
>> >     >>>> This FLIP will add a new configuration for the
>> >     >>>> OperatorCoordinatorMetricGroup scope format. It provides an
>> >     internal
>> >     >>>> implementation and is added as a component to the
>> >     >>>> JobManagerJobMetricGroup.
>> >     >>>> If something doesn't make sense, could you provide some
>> >     advice? It
>> >     >>>> will be
>> >     >>>> very helpful. Thanks a lot for your help.
>> >     >>>>
>> >     >>>> Best,
>> >     >>>> Hang
>> >     >>>>
>> >     >>>> Martijn Visser <ma...@apache.org> 于2023年1月11日周三
>> >     16:34写道：
>> >     >>>>
>> >     >>>>
>> >     >>>>> Hi Hang,
>> >     >>>>>
>> >     >>>>> I'm a bit surprised that this has gone to a vote, given that
>> >     Chesnay
>> >     >>>>> deliberately mentioned that he would vote against it as-is.
>> >     I would
>> >     >>>>> expect
>> >     >>>>> that before going to a vote, he has had the opportunity to
>> >     >>>>> participate in
>> >     >>>>> this discussion.
>> >     >>>>>
>> >     >>>>> Best regards,
>> >     >>>>>
>> >     >>>>> Martijn
>> >     >>>>>
>> >     >>>>> On Tue, Jan 3, 2023 at 12:53 PM Jark Wu <im...@gmail.com>
>> >     wrote:
>> >     >>>>>
>> >     >>>>>> Hi Dong,
>> >     >>>>>>
>> >     >>>>>> Regarding “SplitEnumeratorContext#metricGroup”, my only
>> >     concern is
>> >     >>>>>> that
>> >     >>>>>> this is a core interface for the FLIP. It’s hard to tell how
>> >     >>>>>> sources use
>> >     >>>>>> metric group without mentioning this interface. Even if
>> >     this is an
>> >     >>>>> existing
>> >     >>>>>> API, I think it’s worth introducing the interface again and
>> >     declaring
>> >     >>>>> that
>> >     >>>>>> we will implement the interface instead of a no-op method
>> >     in this
>> >     >>>>>> FLIP.
>> >     >>>>>>
>> >     >>>>>> Anyway, this is a minor problem and shouldn’t block this
>> >     FLIP. I’m
>> >     >>>>>> +1 to
>> >     >>>>>> start a vote.
>> >     >>>>>>
>> >     >>>>>> Best,
>> >     >>>>>> Jark
>> >     >>>>>>
>> >     >>>>>>
>> >     >>>>>>> 2023年1月3日 10:03，Hang Ruan <ru...@gmail.com> 写道：
>> >     >>>>>>>
>> >     >>>>>>> Hi, Jark and Dong,
>> >     >>>>>>>
>> >     >>>>>>> Thanks for your comments. Sorry for my late reply.
>> >     >>>>>>>
>> >     >>>>>>> For suggestion 1, I plan to implement the
>> >     >>>>>>> SplitEnumeratorMetricGroup in
>> >     >>>>>>> another issue, and it is not contained in this FLIP. I
>> >     will add some
>> >     >>>>>>> description about this part.
>> >     >>>>>>> For suggestion 2, changes about
>> >     OperatorCoordinator#metricGroup has
>> >     >>>>>> already
>> >     >>>>>>> been documented in the proposed change section.
>> >     >>>>>>>
>> >     >>>>>>> Best,
>> >     >>>>>>> Hang
>> >     >>>>>>>
>> >     >>>>>>> Dong Lin <li...@gmail.com> 于2023年1月1日周日 09:45写道：
>> >     >>>>>>>
>> >     >>>>>>>> Let me chime-in and add comments regarding the public
>> >     interface
>> >     >>>>> section.
>> >     >>>>>>>> Please see my comments inline.
>> >     >>>>>>>>
>> >     >>>>>>>> On Thu, Dec 29, 2022 at 6:08 PM Jark Wu
>> >     <im...@gmail.com> wrote:
>> >     >>>>>>>>
>> >     >>>>>>>>> Hi Hang,
>> >     >>>>>>>>>
>> >     >>>>>>>>> Thanks for driving this discussion. I think this is a
>> >     very useful
>> >     >>>>>> feature
>> >     >>>>>>>>> for connectors.
>> >     >>>>>>>>>
>> >     >>>>>>>>> The FLIP looks quite good to me, and I just have two
>> >     suggestions.
>> >     >>>>>>>>>
>> >     >>>>>>>>> 1. In the "Public Interface" section, mention that the
>> >     >>>>>>>>> implementation
>> >     >>>>>>>>> behavior of "SplitEnumeratorContext#metricGroup" is
>> >     changed from
>> >     >>>>>>>> returning
>> >     >>>>>>>>> null to returning a concrete SplitEnumeratorMetricGroup
>> >     instance.
>> >     >>>>> Even
>> >     >>>>>>>>> though the API is already there, the behavior change can
>> >     also be
>> >     >>>>>>>> considered
>> >     >>>>>>>>> a public change.
>> >     >>>>>>>>>
>> >     >>>>>>>> SplitEnumeratorContext#metricGroup is an interface and
>> >     this FLIP
>> >     >>>>>>>> does
>> >     >>>>>> not
>> >     >>>>>>>> seem to change its semantics/behavior. The FLIP does
>> >     change the
>> >     >>>>>>>> implementation/behavior of
>> >     SourceCoordinatorContext#metricGroup,
>> >     >>>>>>>> which
>> >     >>>>>> is
>> >     >>>>>>>> marked @Internal.
>> >     >>>>>>>>
>> >     >>>>>>>> Thus it might seem a bit weird to add in the public
>> interface
>> >     >>>>>>>> section
>> >     >>>>>>>> saying that we change the interface
>> >     >>>>>>>> SplitEnumeratorContext#metricGroup
>> >     >>>>>> from
>> >     >>>>>>>> returning null to non-null object.
>> >     >>>>>>>>
>> >     >>>>>>>> 2. Mention the newly added interface of
>> >     >>>>>> "OperatorCoordinator#metricGroup"
>> >     >>>>>>>>> in the "Proposed Changes" section or "Public Interface"
>> >     section. As
>> >     >>>>> the
>> >     >>>>>>>>> FLIP said, OperatorCoordinator is widely used in many
>> >     connectors.
>> >     >>>>>> Though
>> >     >>>>>>>> it
>> >     >>>>>>>>> is still an @Internal API, I think it is worth
>> >     mentioning the
>> >     >>>>>>>>> change
>> >     >>>>> in
>> >     >>>>>>>> the
>> >     >>>>>>>>> FLIP.
>> >     >>>>>>>> Since OperatorCoordinator is an internal API, it seems
>> >     reasonable to
>> >     >>>>>>>> explain it in the proposed change section. The FLIP seems
>> >     to have
>> >     >>>>>>>> documented this in point 5 of the proposed change section.
>> >     >>>>>>>>
>> >     >>>>>>>> BTW, if we think there are @internal classes that are
>> >     important
>> >     >>>>>>>> enough
>> >     >>>>>> to
>> >     >>>>>>>> be added in the public interface section, it might be
>> >     useful to
>> >     >>>>>> explicitly
>> >     >>>>>>>> discuss this topic and document it in the "*What are the
>> >     "public
>> >     >>>>>>>> interfaces" of the project*" in this
>> >     >>>>>>>> <
>> >     >>>>>>>>
>> >     >>
>> >
>> https://cwiki.apache.org/confluence/display/FLINK/Flink+Improvement+Proposals
>> >     >>>>>>>> wiki.
>> >     >>>>>>>>
>> >     >>>>>>>>
>> >     >>>>>>>>> Best,
>> >     >>>>>>>>> Jark
>> >     >>>>>>>>>
>> >     >>>>>>>>>
>> >     >>>>>>>>> On Mon, 26 Dec 2022 at 18:06, Hang Ruan
>> >     <ru...@gmail.com>
>> >     >>>>>> wrote:
>> >     >>>>>>>>>> Hi, thanks for the feedback, Zhu Zhu and Qingsheng.
>> >     >>>>>>>>>> After combining everyone's comments, the main concerns
>> and
>> >     >>>>>>>> corresponding
>> >     >>>>>>>>>> adjustments are as follows.
>> >     >>>>>>>>>>
>> >     >>>>>>>>>> Q1: Common metrics are not quite useful.
>> >     >>>>>>>>>> numEventsIn and numEventsOut counters will be removed
>> >     from the
>> >     >>>>>>>>>> OperatorCoordinatorMetricGroup. These common metrics do
>> not
>> >     >>>>>>>>>> provide
>> >     >>>>>>>>> enough
>> >     >>>>>>>>>> information for users. The users are more willing to
>> >     get the
>> >     >>>>>>>>>> number
>> >     >>>>> of
>> >     >>>>>>>>>> events of the specified type instead of the total
>> >     number. And this
>> >     >>>>>>>> metric
>> >     >>>>>>>>>> is calculated differently. The implementation could
>> >     register the
>> >     >>>>>> metric
>> >     >>>>>>>>> by
>> >     >>>>>>>>>> themselves.
>> >     >>>>>>>>>>
>> >     >>>>>>>>>> Q2: This FLIP is overly complicated.
>> >     >>>>>>>>>> This FLIP will become concise after these modifications.
>> >     >>>>>>>>>> OperatorCoordinatorMetricGroup has already been
>> >     introduced into
>> >     >>>>> Flink
>> >     >>>>>>>> by
>> >     >>>>>>>>>> FLIP-179<
>> >     >>>>>>>>>>
>> >     >>>>>>>>>>
>> >     >>
>> >
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-179%3A+Expose+Standardized+Operator+Metrics
>> >     >>>>>>>>>>> .
>> >     >>>>>>>>>> And
>> >     >>>>>>>>>> this FLIP will not change it. This FLIP only provides a
>> >     new metric
>> >     >>>>>>>> option
>> >     >>>>>>>>>> and a new metric group scope. The changes in proposed
>> >     changes
>> >     >>>>> provide
>> >     >>>>>>>> the
>> >     >>>>>>>>>> details about the modifications for the internal
>> >     classes, which
>> >     >>>>> might
>> >     >>>>>>>>> make
>> >     >>>>>>>>>> it look complicated.
>> >     >>>>>>>>>>
>> >     >>>>>>>>>> Thanks for all the comments again. If there are no
>> further
>> >     >>>>>>>>>> comments,
>> >     >>>>>> we
>> >     >>>>>>>>>> plan to start the voting thread this week.
>> >     >>>>>>>>>>
>> >     >>>>>>>>>> Best,
>> >     >>>>>>>>>> Hang
>> >     >>>>>>>>>>
>> >     >>>>>>>>>> Qingsheng Ren <re...@gmail.com> 于2022年12月26日周一
>> >     16:48写道：
>> >     >>>>>>>>>>
>> >     >>>>>>>>>>
>> >     >>>>>>>>>>> Thanks for the FLIP, Hang!
>> >     >>>>>>>>>>>
>> >     >>>>>>>>>>> This FLIP overall looks good to me. Actually I share
>> >     the same
>> >     >>>>> concern
>> >     >>>>>>>>>> with
>> >     >>>>>>>>>>> Zhu that numEventsIn and numEventsOut counters are not
>> >     quite
>> >     >>>>>>>>>>> useful
>> >     >>>>>>>> to
>> >     >>>>>>>>>> end
>> >     >>>>>>>>>>> users. OperatorEvent is a quite low-level abstraction,
>> >     which
>> >     >>>>> requires
>> >     >>>>>>>>>>> instantialization in order to be practical to users and
>> >     >>>>>>>>>>> developers,
>> >     >>>>>>>> so
>> >     >>>>>>>>>>> maybe it's better to exclude them from the FLIP.
>> >     >>>>>>>>>>>
>> >     >>>>>>>>>>> Best,
>> >     >>>>>>>>>>> Qingsheng
>> >     >>>>>>>>>>> Ververica (Alibaba)
>> >     >>>>>>>>>>>
>> >     >>>>>>>>>>> On Mon, Dec 26, 2022 at 12:08 PM Zhu Zhu
>> >     <re...@gmail.com>
>> >     >>>>> wrote:
>> >     >>>>>>>>>>>> Hi Hang,
>> >     >>>>>>>>>>>> I still see no strong reason why we need
>> >     >>>>>>>>>>>> numEventsIn/numEventsOut
>> >     >>>>>>>>>>> metrics.
>> >     >>>>>>>>>>>> In the discussion in FLINK-29801, I can see the same
>> >     concern
>> >     >>>>>>>>>>>> from
>> >     >>>>>>>>>> others.
>> >     >>>>>>>>>>>> So I prefer to exclude them from this FLIP to avoid
>> >     >>>>>>>>>>>> over-extending
>> >     >>>>>>>>> the
>> >     >>>>>>>>>>>> scope.
>> >     >>>>>>>>>>>>
>> >     >>>>>>>>>>>> Thanks,
>> >     >>>>>>>>>>>> Zhu
>> >     >>>>>>>>>>>>
>> >     >>>>>>>>>>>> Hang Ruan <ru...@gmail.com> 于2022年12月23日周五
>> >     >>>>>>>>>>>> 15:21写道：
>> >     >>>>>>>>>>>>> Hi everyone,
>> >     >>>>>>>>>>>>>
>> >     >>>>>>>>>>>>> As for the Zhu Zhu's problem, I think we should keep
>> >     the common
>> >     >>>>>>>>>>> metrics,
>> >     >>>>>>>>>>>> which will help to observe incoming and outgoing
>> >     events. What do
>> >     >>>>>>>> you
>> >     >>>>>>>>>>> think,
>> >     >>>>>>>>>>>> @Zhu Zhu ?
>> >     >>>>>>>>>>>>> And @Chesnay, are there any other issues you are
>> >     more concerned
>> >     >>>>>>>>>> about?
>> >     >>>>>>>>>>>> Looking forward to your reply.
>> >     >>>>>>>>>>>>> Thanks for all the comments. If there are no further
>> >     >>>>>>>>>>>>> comments, we
>> >     >>>>>>>>>> plan
>> >     >>>>>>>>>>>> to start the voting thread next week.
>> >     >>>>>>>>>>>>> Best,
>> >     >>>>>>>>>>>>> Hang
>> >     >>>>>>>>>>>>>
>> >     >>>>>>>>>>>>> Hang Ruan <ru...@gmail.com> 于2022年12月15日周四
>> >     >>>>>>>>>>>>> 16:49写道：
>> >     >>>>>>>>>>>>>> Hi, Zhu Zhu,
>> >     >>>>>>>>>>>>>>
>> >     >>>>>>>>>>>>>> Thanks for your feedback!
>> >     >>>>>>>>>>>>>>
>> >     >>>>>>>>>>>>>> The OperatorCoordinator implementations are
>> >     different. And
>> >     >>>>>>>>>>>>>> their
>> >     >>>>>>>>>>>> metrics are much different too. We try to find the
>> common
>> >     >>>>>>>>>>>> metrics
>> >     >>>>>>>> and
>> >     >>>>>>>>>> put
>> >     >>>>>>>>>>>> them in the OperatorCoordinatorMetricGroup. If most
>> >     developers
>> >     >>>>>>>> think
>> >     >>>>>>>>> we
>> >     >>>>>>>>>>> do
>> >     >>>>>>>>>>>> not need these common metrics, removing the common
>> >     metrics is
>> >     >>>>>>>>>>>> also
>> >     >>>>>>>>>>>> acceptable.
>> >     >>>>>>>>>>>>>> Best,
>> >     >>>>>>>>>>>>>> Hang
>> >     >>>>>>>>>>>>>>
>> >     >>>>>>>>>>>>>> Zhu Zhu <re...@gmail.com> 于2022年12月14日周三
>> >     22:09写道：
>> >     >>>>>>>>>>>>>>> Hi Hang & MengYue,
>> >     >>>>>>>>>>>>>>>
>> >     >>>>>>>>>>>>>>> Thanks for creating this FLIP!
>> >     >>>>>>>>>>>>>>>
>> >     >>>>>>>>>>>>>>> I think it is very useful, mainly in two aspects:
>> >     >>>>>>>>>>>>>>> 1. Enables OperatorCoordinators to register metrics.
>> >     >>>>>>>>>>>>>>> Currently
>> >     >>>>>>>>>>>>>>> the coordinators has no way to do this. And operator
>> >     >>>>>>>> coordinator
>> >     >>>>>>>>>>>>>>> metric group further enables the SplitEnumerator
>> >     to have
>> >     >>>>>>>>>>>>>>> access
>> >     >>>>>>>>>>>>>>> to a registered metric group (via the existing
>> public
>> >     >>>>>>>>>>>>>>> interface
>> >     >>>>>>>>>>>>>>> SplitEnumeratorContext#metricGroup()), which is
>> >     null at the
>> >     >>>>>>>>> moment.
>> >     >>>>>>>>>>>>>>> 2. Defines the scope of operator coordinator
>> >     metrics. A clear
>> >     >>>>>>>>>>>> definition
>> >     >>>>>>>>>>>>>>> makes it easy for users to find their wanted
>> >     metrics. The
>> >     >>>>>>>>>> definition
>> >     >>>>>>>>>>>>>>> also helps to avoid conflicts of metrics from
>> multiple
>> >     >>>>>>>>>>>> OperatorCoordinators
>> >     >>>>>>>>>>>>>>> of the same kind. E.g. each SourceCoordinator may
>> >     have its
>> >     >>>>>>>>>>>>>>> own
>> >     >>>>>>>>>>>>>>> numSourceSplits metric, these metrics should not
>> >     be directly
>> >     >>>>>>>>>>> registered
>> >     >>>>>>>>>>>>>>> to the job metric group.
>> >     >>>>>>>>>>>>>>>
>> >     >>>>>>>>>>>>>>> What I'm a bit concerned is the necessity of the
>> >     introduced
>> >     >>>>>>>>> common
>> >     >>>>>>>>>>>> metrics
>> >     >>>>>>>>>>>>>>> numEventsInCounter & numEventsOutCounter. If there
>> >     any case
>> >     >>>>>>>> which
>> >     >>>>>>>>>>>> strongly
>> >     >>>>>>>>>>>>>>> requires them?
>> >     >>>>>>>>>>>>>>>
>> >     >>>>>>>>>>>>>>> Regarding the concerns of Chesnay,
>> >     >>>>>>>>>>>>>>>> A dedicated coordinator MG implementation is
>> overkill
>> >     >>>>>>>>>>>>>>> Directly using the job metric group can result in
>> >     metric
>> >     >>>>>>>>> conflicts,
>> >     >>>>>>>>>>> as
>> >     >>>>>>>>>>>> mentioned
>> >     >>>>>>>>>>>>>>> in above #2.
>> >     >>>>>>>>>>>>>>>
>> >     >>>>>>>>>>>>>>> Thanks,
>> >     >>>>>>>>>>>>>>> Zhu
>> >     >>>>>>>>>>>>>>>
>> >     >>>>>>>>>>>>>>> Dong Lin <li...@gmail.com> 于2022年12月10日周六
>> >     >>>>>>>>>>>>>>> 14:16写道：
>> >     >>>>>>>>>>>>>>>
>> >     >>>>>>>>>>>>>>>> Hi Chesney,
>> >     >>>>>>>>>>>>>>>>
>> >     >>>>>>>>>>>>>>>> Just to double check with you,
>> >     >>>>>>>>>>>>>>>> OperatorCoordinatorMetricGroup
>> >     >>>>>>>>>>>> (annotated as
>> >     >>>>>>>>>>>>>>>> @PublicEvolving) has already been introduced into
>> >     Flink by
>> >     >>>>>>>>>> FLIP-179
>> >     >>>>>>>>>>>>>>>> <
>> >     >>
>> >
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-179%3A+Expose+Standardized+Operator+Metrics
>> >     >>>>>>>>>>>>> .
>> >     >>>>>>>>>>>>>>>> And that FLIP has got you +1.. Do you mean we
>> >     should remove
>> >     >>>>>>>>> this
>> >     >>>>>>>>>>>>>>>> OperatorCoordinatorMetricGroup?
>> >     >>>>>>>>>>>>>>>>
>> >     >>>>>>>>>>>>>>>> Regards,
>> >     >>>>>>>>>>>>>>>> Dong
>> >     >>>>>>>>>>>>>>>>
>> >     >>>>>>>>>>>>>>>> On Sat, Dec 10, 2022 at 1:33 AM Chesnay Schepler <
>> >     >>>>>>>>>>> chesnay@apache.org>
>> >     >>>>>>>>>>>> wrote:
>> >     >>>>>>>>>>>>>>>>> As a whole I feel like this FLIP is overly
>> >     complicated. A
>> >     >>>>>>>>>>> dedicated
>> >     >>>>>>>>>>>>>>>>> coordinator MG implementation is overkill; it
>> >     could just
>> >     >>>>>>>>> re-use
>> >     >>>>>>>>>>> the
>> >     >>>>>>>>>>>>>>>>> existing Task/OperatorMGs to create the same
>> >     structure we
>> >     >>>>>>>>> have
>> >     >>>>>>>>>> on
>> >     >>>>>>>>>>>> TMs,
>> >     >>>>>>>>>>>>>>>>> similar to what we did with the Job MG.
>> >     >>>>>>>>>>>>>>>>>
>> >     >>>>>>>>>>>>>>>>> However, I'm not convinced that this is required
>> >     anyway,
>> >     >>>>>>>>>> because
>> >     >>>>>>>>>>>> all the
>> >     >>>>>>>>>>>>>>>>> example metrics you listed can be implemented on
>> >     the TM
>> >     >>>>>>>> side
>> >     >>>>>>>>> +
>> >     >>>>>>>>>>>>>>>>> aggregating them in the external metrics backend.
>> >     >>>>>>>>>>>>>>>>>
>> >     >>>>>>>>>>>>>>>>> Since I'm on holidays soon, just so no one tries
>> >     to pull a
>> >     >>>>>>>>> fast
>> >     >>>>>>>>>>>> one on
>> >     >>>>>>>>>>>>>>>>> me, if this were to go to a vote as-is I'd be
>> >     against it.
>> >     >>>>>>>>>>>>>>>>>
>> >     >>>>>>>>>>>>>>>>>
>> >     >>>>>>>>>>>>>>>>> On 09/12/2022 15:30, Dong Lin wrote:
>> >     >>>>>>>>>>>>>>>>>> Hi Hang,
>> >     >>>>>>>>>>>>>>>>>>
>> >     >>>>>>>>>>>>>>>>>> Thanks for the FLIP! The FLIP looks good and it
>> >     is pretty
>> >     >>>>>>>>>>>> informative.
>> >     >>>>>>>>>>>>>>>>>> I have just two minor comments regarding names:
>> >     >>>>>>>>>>>>>>>>>> - Would it be useful to rename the config key as
>> >     >>>>>>>>>>>>>>>>>> *metrics.scope.jm.job.operator-coordinator* for
>> >     >>>>>>>> consistency
>> >     >>>>>>>>>>> with
>> >     >>>>>>>>>>>>>>>>>> *metrics.scope.jm.job
>> >     >>>>>>>>>>>>>>>>>> *(which is not named as *jm-job)?
>> >     >>>>>>>>>>>>>>>>>> - Maybe rename the variable as
>> >     >>>>>>>>>>> SCOPE_NAMING_OPERATOR_COORDINATOR
>> >     >>>>>>>>>>>> for
>> >     >>>>>>>>>>>>>>>>>> simplicity and consistency with
>> >     SCOPE_NAMING_OPERATOR
>> >     >>>>>>>>> (which
>> >     >>>>>>>>>> is
>> >     >>>>>>>>>>>> not named
>> >     >>>>>>>>>>>>>>>>>> as SCOPE_NAMING_TM_JOB_OPERATOR)?
>> >     >>>>>>>>>>>>>>>>>>
>> >     >>>>>>>>>>>>>>>>>> Cheers,
>> >     >>>>>>>>>>>>>>>>>> Dong
>> >     >>>>>>>>>>>>>>>>>>
>> >     >>>>>>>>>>>>>>>>>>
>> >     >>>>>>>>>>>>>>>>>>
>> >     >>>>>>>>>>>>>>>>>> On Thu, Dec 8, 2022 at 3:28 PM Hang Ruan <
>> >     >>>>>>>>>>> ruanhang1993@gmail.com>
>> >     >>>>>>>>>>>> wrote:
>> >     >>>>>>>>>>>>>>>>>>> Hi all,
>> >     >>>>>>>>>>>>>>>>>>>
>> >     >>>>>>>>>>>>>>>>>>> MengYue and I created FLIP-274[1] Introduce
>> >     metric group
>> >     >>>>>>>>> for
>> >     >>>>>>>>>>>>>>>>>>> OperatorCoordinator. OperatorCoordinator is the
>> >     >>>>>>>>> coordinator
>> >     >>>>>>>>>>> for
>> >     >>>>>>>>>>>> runtime
>> >     >>>>>>>>>>>>>>>>>>> operators and running on Job Manager. The
>> >     coordination
>> >     >>>>>>>>>>>> mechanism is
>> >     >>>>>>>>>>>>>>>>>>> operator events between OperatorCoordinator
>> >     and its all
>> >     >>>>>>>>>>>> operators, the
>> >     >>>>>>>>>>>>>>>>>>> coordination is more and more using in Flink,
>> for
>> >     >>>>>>>> example
>> >     >>>>>>>>>> many
>> >     >>>>>>>>>>>> Sources
>> >     >>>>>>>>>>>>>>>>> and
>> >     >>>>>>>>>>>>>>>>>>> Sinks depend on the mechanism to assign splits
>> and
>> >     >>>>>>>>>> coordinate
>> >     >>>>>>>>>>>> commits to
>> >     >>>>>>>>>>>>>>>>>>> external systems. The OperatorCoordinator is
>> >     widely
>> >     >>>>>>>> using
>> >     >>>>>>>>> in
>> >     >>>>>>>>>>>> flink kafka
>> >     >>>>>>>>>>>>>>>>>>> connector, flink pulsar connector, flink cdc
>> >     connector,
>> >     >>>>>>>>>> flink
>> >     >>>>>>>>>>>> hudi
>> >     >>>>>>>>>>>>>>>>>>> connector and so on.
>> >     >>>>>>>>>>>>>>>>>>>
>> >     >>>>>>>>>>>>>>>>>>> But there is not a suitable metric group scope
>> >     for the
>> >     >>>>>>>>>>>>>>>>> OperatorCoordinator
>> >     >>>>>>>>>>>>>>>>>>> and not an implementation for the interface
>> >     >>>>>>>>>>>>>>>>> OperatorCoordinatorMetricGroup.
>> >     >>>>>>>>>>>>>>>>>>> These metrics in OperatorCoordinator could be
>> >     how many
>> >     >>>>>>>>>>>> splits/partitions
>> >     >>>>>>>>>>>>>>>>>>> have been assigned to source readers, how many
>> >     files
>> >     >>>>>>>> have
>> >     >>>>>>>>>> been
>> >     >>>>>>>>>>>> written
>> >     >>>>>>>>>>>>>>>>> out
>> >     >>>>>>>>>>>>>>>>>>> by sink writers, these metrics not only help
>> >     users to
>> >     >>>>>>>> know
>> >     >>>>>>>>>> the
>> >     >>>>>>>>>>>> job
>> >     >>>>>>>>>>>>>>>>> progress
>> >     >>>>>>>>>>>>>>>>>>> but also make big job maintaining easier. Thus
>> we
>> >     >>>>>>>> propose
>> >     >>>>>>>>>> the
>> >     >>>>>>>>>>>> FLIP-274
>> >     >>>>>>>>>>>>>>>>> to
>> >     >>>>>>>>>>>>>>>>>>> introduce a new metric group scope for
>> >     >>>>>>>> OperatorCoordinator
>> >     >>>>>>>>>> and
>> >     >>>>>>>>>>>> provide
>> >     >>>>>>>>>>>>>>>>> an
>> >     >>>>>>>>>>>>>>>>>>> internal implementation for
>> >     >>>>>>>>> OperatorCoordinatorMetricGroup.
>> >     >>>>>>>>>>>>>>>>>>> Could you help review this FLIP when you get
>> >     time? Any
>> >     >>>>>>>>>>> feedback
>> >     >>>>>>>>>>>> is
>> >     >>>>>>>>>>>>>>>>>>> appreciated!
>> >     >>>>>>>>>>>>>>>>>>>
>> >     >>>>>>>>>>>>>>>>>>> Best,
>> >     >>>>>>>>>>>>>>>>>>> Hang
>> >     >>>>>>>>>>>>>>>>>>>
>> >     >>>>>>>>>>>>>>>>>>> [1]
>> >     >>>>>>>>>>>>>>>>>>>
>> >     >>>>>>>>>>>>>>>>>>>
>> >     >>
>> >
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-274%3A+Introduce+metric+group+for+OperatorCoordinator
>> >     >>
>> >
>>
>

Re: [DISCUSS] FLIP-274 : Introduce metric group for OperatorCoordinator

Posted by Jark Wu <im...@gmail.com>.

Hi all,

IIUC, Chesnay means we should have a more general metric group for
**operator**
not for **coordinator** in JM, this would be useful to extend
other operator-specific
metrics in the future. That means the new scope format should be designed
for
the operator,
e.g., metrics.scope.jm-operator=<host>.jobmanager.<job_name>.<operator_name>
The coordinator metric is a subgroup (a constant "coordinator" suffix) of
the JMOperatorMG.

I think this is a nice design. However, I have a question about adding task
id/name to this list.
How to get the task id/name when reporting a coordinator metric? As we
know, coordinator metrics
are just like JMJobMetricGroup, which don't belong to any tasks/vertexes.
Do you mean reporting
coordinator metrics for every task id under the operator?

Best,
Jark


On Thu, 19 Jan 2023 at 17:33, Chesnay Schepler <ch...@apache.org> wrote:

>  > First, I do not understand why users have to configure the new scope
> format, which has a default value.
>
> If you don't use scope formats, sure. If you do use scope formats, e.g.
> to add a common prefix (which is the case for datadog users for
> example), then the current default in the FLIP is insufficient and
> requires the user to update the configuration.
>
>  > I usually do not need to change these configuration of scope formats
> when submitting the flink job.
>
> Scope formats as a whole are quite a power-user feature, but that
> doesn't mean we should ignore it.
>
>  > I try to let it extend the ComponentMetricGroup
>
> This isn't inherently required just because it is a "component".
> Component metric groups should _only_ be used for cases where such a
> component provides several bits of metadata.
> For example, tasks provide vertex ids, task names, attempted IDs etc.,
> and we'd like users to have the option on how this metadata ends up
> being used (via scope formats). This currently can't be built with the
> addGroup() methods, hence why the component groups exist.
> However, the operator coordinator _itself_does not provide several bits
> of metadata. Logically, the _operator_ does, not the coordinator.
>
>  > This make me decide to add a new scope format.
>
> And this is fine; just don't add a new scope format for the coordinator
> but logically for operators on the JM side, such that we can extend the
> set of operator-specific metrics in the future without having to add yet
> another scope format.
>
>  > The relationship between the Operator and OperatorCoordinator is
> maintained in the OperatorCoordinatorHolder. In the POC, the operator
> name/id could be found and the OperatorCoordinatorMetricGroup will be
> created here.
>
> That's all irrelevant.
>
>  > The registered metrics of OperatorCoordinator always are those which
> can not be aggregated from the tasks, like the number of unassigned
> splits in the SourceCoordinator. Actually we have not encountered the
> scenario you mentioned. I think the OperatorCoordinatorMetricGroup
> should only contain the metrics for itself instead of its subtasks.
>
> You are misunderstanding the problem.
>
> This isn't about aggregating metrics on our side or exposing metrics
> from TMs on the JM side or anything like that.
>
> It's purely about the available metadata for operator metrics on the JM
> side. Currently you suggest operator name / id; what I'm proposing is to
> add task(==vertex!) id / name to this list.
>
> The benefit here is simple. I can look up all metrics where task_id==XXX
> and get _everything_ related to that vertex, including all metrics
> associated with operators that are part of that vertex, including the
> coordinator.
>
>  > Using metrics.scope.jm.job is not enough to distinguish the different
> coordinators.
>
> I did not suggest just using metrics.scope.jm.job. I suggested that as
> the default for the jm operator scope format, that is used for the
> proposed JobManagerOperatorMetricGroup, which will have some plain
> metric groups as children for coordinators.
> Aka, you have the JMOperatorMG, then you call addGroup("coordinators")
> and then addGroup(coordinator_name) for each coordinator.
>
>  > So there are two choice for the
> InternalOperatorCoordinatorMetricGroup: 1. add and improve the new scope
> format; 2. use the metrics.scope.jm.job and ProxyMetricGroup. Which one
> is better?
>
> There are more options than that.
> I feel like there are a lot of misunderstandings in this discussion.
> Please let me know if I could clear things up. If not I can also provide
> a PoC based on your PoC if that can speed things up. It's not that
> different anyway.
>
> On 19/01/2023 07:39, Hang Ruan wrote:
> > Hi, chesnay,
> >
> > Thanks for your reply. I still have some doubts about the questions
> > you raised.
> >
> >     > Extending the set of ScopeFormats is problematic because it in
> >     practice
> >     it breaks the config if users actively rely on it, since there's now
> >     another key that they _must_ set for it to be
> >     consistent/compatible with
> >     their existing setup.
> >     Unfortunately due to how powerful scope formats are we can't derive a
> >     default value that matches their existing setup.
> >     Hence we should try to do this as rarely as possible.
> >
> >
> >     > This FLIP does not adhere to that since it proposes a dedicated
> >     format
> >     for coordinators; next time we want to expose operator-specific
> >     metrics
> >     (e.g., in the scheduler) we'd have to add another one to support it.
> >
> >
> > First, I do not understand why users have to configure the new scope
> > format, which has a default value. I usually do not need to change
> > these configuration of scope formats when submitting the flink job.
> > IMO, the OperatorCoordinator is a component running at the JobMaster.
> > I try to let it extend the ComponentMetricGroup, but I cannot find
> > a suitable scope. This make me decide to add a new scope format.
> >
> >     > Additionally, the configurable variables (operator name/id) are
> >     logically not attached to the coordinator, but operators, so to me it
> >     just doesn't make sense to structure it like this.
> >
> >
> > The relationship between the Operator and OperatorCoordinator is
> > maintained in the OperatorCoordinatorHolder. In the POC, the operator
> > name/id could be found and the OperatorCoordinatorMetricGroup will be
> > created here.
> >
> >     > Another thing I'm concerned about is that, because we don't include
> >     tasks in the hierarchy, users wishing to collect all metrics for a
> >     particular task (in this case ==vertex) now have to go
> >     significantly out
> >     of their way to get them, since they can no longer just filter by the
> >     task ID but have to be filter for _all_ operators that are part of
> >     the task.
> >
> >
> > The registered metrics of OperatorCoordinator always are those which
> > can not be aggregated from the tasks, like the number of unassigned
> > splits in the SourceCoordinator. Actually we have not encountered the
> > scenario you mentioned. I think the OperatorCoordinatorMetricGroup
> > should only contain the metrics for itself instead of its subtasks.
> >
> > Using metrics.scope.jm.job is not enough to distinguish the different
> > coordinators. The operator id/name is necessary. Then
> > the implementation should be like this. But users can not change the
> > scope in this way. This is also acceptable.
> > @Internal public class InternalOperatorCoordinatorMetricGroup
> >          extends ProxyMetricGroup<MetricGroup>
> >          implements OperatorCoordinatorMetricGroup {
> >      public InternalOperatorCoordinatorMetricGroup(
> >              JobManagerJobMetricGroup parent, OperatorID operatorID,
> String operatorName) {
> >          super(parent.addGroup(operatorID +
> operatorName).addGroup("coordinator")); }
> > }
> > So there are two choice for the
> > InternalOperatorCoordinatorMetricGroup: 1. add and improve the new
> > scope format; 2. use the metrics.scope.jm.job and ProxyMetricGroup.
> > Which one is better?
> >
> > Best,
> > Hang
> >
> >
> > Chesnay Schepler <ch...@apache.org> 于2023年1月18日周三 17:03写道：
> >
> >     You're misunderstanding the problem.
> >
> >     Metric groups form a tree with each group providing certain metadata.
> >
> >     E.g., on the taskmanager we have a TM metric group that provides info
> >     about the TM, that has child task metric groups, that have child
> >     operator metric groups etc. The operator metric group again has
> >     children, where we are now mostly in the user/connector land, like a
> >     kafka metric group providing partition info or something.
> >
> >     What is being proposed is to create a coordinator metric group on
> >     the JM
> >     side that provides operator metadata.
> >
> >     A more appropriate structure is to create an operator group on the JM
> >     side that provides this info, with the coordinator metric group
> >     being a
> >     child.
> >
> >     On 17/01/2023 19:56, Steven Wu wrote:
> >     >> Additionally, the configurable variables (operator name/id) are
> >     > logically not attached to the coordinator, but operators, so to
> >     me it
> >     > just doesn't make sense to structure it like this.
> >     >
> >     > Chesnay, maybe we should clarify the terminology. To me,
> >     pperators (like
> >     > FLIP-27 source) can have two parts (coordinator and
> >     reader/subtask). I
> >     > think it is fine to include operator name/id for coordinator
> >     metrics.
> >     >
> >     > On Mon, Jan 16, 2023 at 2:13 AM Chesnay Schepler
> >     <ch...@apache.org> wrote:
> >     >
> >     >> Slight correction: Using metrics.scope.jm.job as the default
> >     should be
> >     >> safe.
> >     >>
> >     >> On 16/01/2023 10:18, Chesnay Schepler wrote:
> >     >>> The proposed ScopeFormat is still problematic for a few reasons.
> >     >>>
> >     >>> Extending the set of ScopeFormats is problematic because it in
> >     >>> practice it breaks the config if users actively rely on it, since
> >     >>> there's now another key that they _must_ set for it to be
> >     >>> consistent/compatible with their existing setup.
> >     >>> Unfortunately due to how powerful scope formats are we can't
> >     derive a
> >     >>> default value that matches their existing setup.
> >     >>> Hence we should try to do this as rarely as possible.
> >     >>>
> >     >>> This FLIP does not adhere to that since it proposes a
> >     dedicated format
> >     >>> for coordinators; next time we want to expose operator-specific
> >     >>> metrics (e.g., in the scheduler) we'd have to add another one to
> >     >>> support it.
> >     >>>
> >     >>> Additionally, the configurable variables (operator name/id) are
> >     >>> logically not attached to the coordinator, but operators, so
> >     to me it
> >     >>> just doesn't make sense to structure it like this.
> >     >>>
> >     >>> Another thing I'm concerned about is that, because we don't
> >     include
> >     >>> tasks in the hierarchy, users wishing to collect all metrics for
> a
> >     >>> particular task (in this case ==vertex) now have to go
> >     significantly
> >     >>> out of their way to get them, since they can no longer just
> >     filter by
> >     >>> the task ID but have to be filter for _all_ operators that are
> >     part of
> >     >>> the task.
> >     >>>
> >     >>> On 16/01/2023 03:09, Hang Ruan wrote:
> >     >>>> Hi, @chesnay@apache.org <ch...@apache.org> ,
> >     >>>>
> >     >>>> Do you have time to help to review this FLIP again after the
> >     >>>> modification?
> >     >>>> Looking forward to your reply.
> >     >>>> This FLIP will add a new configuration for the
> >     >>>> OperatorCoordinatorMetricGroup scope format. It provides an
> >     internal
> >     >>>> implementation and is added as a component to the
> >     >>>> JobManagerJobMetricGroup.
> >     >>>> If something doesn't make sense, could you provide some
> >     advice? It
> >     >>>> will be
> >     >>>> very helpful. Thanks a lot for your help.
> >     >>>>
> >     >>>> Best,
> >     >>>> Hang
> >     >>>>
> >     >>>> Martijn Visser <ma...@apache.org> 于2023年1月11日周三
> >     16:34写道：
> >     >>>>
> >     >>>>
> >     >>>>> Hi Hang,
> >     >>>>>
> >     >>>>> I'm a bit surprised that this has gone to a vote, given that
> >     Chesnay
> >     >>>>> deliberately mentioned that he would vote against it as-is.
> >     I would
> >     >>>>> expect
> >     >>>>> that before going to a vote, he has had the opportunity to
> >     >>>>> participate in
> >     >>>>> this discussion.
> >     >>>>>
> >     >>>>> Best regards,
> >     >>>>>
> >     >>>>> Martijn
> >     >>>>>
> >     >>>>> On Tue, Jan 3, 2023 at 12:53 PM Jark Wu <im...@gmail.com>
> >     wrote:
> >     >>>>>
> >     >>>>>> Hi Dong,
> >     >>>>>>
> >     >>>>>> Regarding “SplitEnumeratorContext#metricGroup”, my only
> >     concern is
> >     >>>>>> that
> >     >>>>>> this is a core interface for the FLIP. It’s hard to tell how
> >     >>>>>> sources use
> >     >>>>>> metric group without mentioning this interface. Even if
> >     this is an
> >     >>>>> existing
> >     >>>>>> API, I think it’s worth introducing the interface again and
> >     declaring
> >     >>>>> that
> >     >>>>>> we will implement the interface instead of a no-op method
> >     in this
> >     >>>>>> FLIP.
> >     >>>>>>
> >     >>>>>> Anyway, this is a minor problem and shouldn’t block this
> >     FLIP. I’m
> >     >>>>>> +1 to
> >     >>>>>> start a vote.
> >     >>>>>>
> >     >>>>>> Best,
> >     >>>>>> Jark
> >     >>>>>>
> >     >>>>>>
> >     >>>>>>> 2023年1月3日 10:03，Hang Ruan <ru...@gmail.com> 写道：
> >     >>>>>>>
> >     >>>>>>> Hi, Jark and Dong,
> >     >>>>>>>
> >     >>>>>>> Thanks for your comments. Sorry for my late reply.
> >     >>>>>>>
> >     >>>>>>> For suggestion 1, I plan to implement the
> >     >>>>>>> SplitEnumeratorMetricGroup in
> >     >>>>>>> another issue, and it is not contained in this FLIP. I
> >     will add some
> >     >>>>>>> description about this part.
> >     >>>>>>> For suggestion 2, changes about
> >     OperatorCoordinator#metricGroup has
> >     >>>>>> already
> >     >>>>>>> been documented in the proposed change section.
> >     >>>>>>>
> >     >>>>>>> Best,
> >     >>>>>>> Hang
> >     >>>>>>>
> >     >>>>>>> Dong Lin <li...@gmail.com> 于2023年1月1日周日 09:45写道：
> >     >>>>>>>
> >     >>>>>>>> Let me chime-in and add comments regarding the public
> >     interface
> >     >>>>> section.
> >     >>>>>>>> Please see my comments inline.
> >     >>>>>>>>
> >     >>>>>>>> On Thu, Dec 29, 2022 at 6:08 PM Jark Wu
> >     <im...@gmail.com> wrote:
> >     >>>>>>>>
> >     >>>>>>>>> Hi Hang,
> >     >>>>>>>>>
> >     >>>>>>>>> Thanks for driving this discussion. I think this is a
> >     very useful
> >     >>>>>> feature
> >     >>>>>>>>> for connectors.
> >     >>>>>>>>>
> >     >>>>>>>>> The FLIP looks quite good to me, and I just have two
> >     suggestions.
> >     >>>>>>>>>
> >     >>>>>>>>> 1. In the "Public Interface" section, mention that the
> >     >>>>>>>>> implementation
> >     >>>>>>>>> behavior of "SplitEnumeratorContext#metricGroup" is
> >     changed from
> >     >>>>>>>> returning
> >     >>>>>>>>> null to returning a concrete SplitEnumeratorMetricGroup
> >     instance.
> >     >>>>> Even
> >     >>>>>>>>> though the API is already there, the behavior change can
> >     also be
> >     >>>>>>>> considered
> >     >>>>>>>>> a public change.
> >     >>>>>>>>>
> >     >>>>>>>> SplitEnumeratorContext#metricGroup is an interface and
> >     this FLIP
> >     >>>>>>>> does
> >     >>>>>> not
> >     >>>>>>>> seem to change its semantics/behavior. The FLIP does
> >     change the
> >     >>>>>>>> implementation/behavior of
> >     SourceCoordinatorContext#metricGroup,
> >     >>>>>>>> which
> >     >>>>>> is
> >     >>>>>>>> marked @Internal.
> >     >>>>>>>>
> >     >>>>>>>> Thus it might seem a bit weird to add in the public
> interface
> >     >>>>>>>> section
> >     >>>>>>>> saying that we change the interface
> >     >>>>>>>> SplitEnumeratorContext#metricGroup
> >     >>>>>> from
> >     >>>>>>>> returning null to non-null object.
> >     >>>>>>>>
> >     >>>>>>>> 2. Mention the newly added interface of
> >     >>>>>> "OperatorCoordinator#metricGroup"
> >     >>>>>>>>> in the "Proposed Changes" section or "Public Interface"
> >     section. As
> >     >>>>> the
> >     >>>>>>>>> FLIP said, OperatorCoordinator is widely used in many
> >     connectors.
> >     >>>>>> Though
> >     >>>>>>>> it
> >     >>>>>>>>> is still an @Internal API, I think it is worth
> >     mentioning the
> >     >>>>>>>>> change
> >     >>>>> in
> >     >>>>>>>> the
> >     >>>>>>>>> FLIP.
> >     >>>>>>>> Since OperatorCoordinator is an internal API, it seems
> >     reasonable to
> >     >>>>>>>> explain it in the proposed change section. The FLIP seems
> >     to have
> >     >>>>>>>> documented this in point 5 of the proposed change section.
> >     >>>>>>>>
> >     >>>>>>>> BTW, if we think there are @internal classes that are
> >     important
> >     >>>>>>>> enough
> >     >>>>>> to
> >     >>>>>>>> be added in the public interface section, it might be
> >     useful to
> >     >>>>>> explicitly
> >     >>>>>>>> discuss this topic and document it in the "*What are the
> >     "public
> >     >>>>>>>> interfaces" of the project*" in this
> >     >>>>>>>> <
> >     >>>>>>>>
> >     >>
> >
> https://cwiki.apache.org/confluence/display/FLINK/Flink+Improvement+Proposals
> >     >>>>>>>> wiki.
> >     >>>>>>>>
> >     >>>>>>>>
> >     >>>>>>>>> Best,
> >     >>>>>>>>> Jark
> >     >>>>>>>>>
> >     >>>>>>>>>
> >     >>>>>>>>> On Mon, 26 Dec 2022 at 18:06, Hang Ruan
> >     <ru...@gmail.com>
> >     >>>>>> wrote:
> >     >>>>>>>>>> Hi, thanks for the feedback, Zhu Zhu and Qingsheng.
> >     >>>>>>>>>> After combining everyone's comments, the main concerns and
> >     >>>>>>>> corresponding
> >     >>>>>>>>>> adjustments are as follows.
> >     >>>>>>>>>>
> >     >>>>>>>>>> Q1: Common metrics are not quite useful.
> >     >>>>>>>>>> numEventsIn and numEventsOut counters will be removed
> >     from the
> >     >>>>>>>>>> OperatorCoordinatorMetricGroup. These common metrics do
> not
> >     >>>>>>>>>> provide
> >     >>>>>>>>> enough
> >     >>>>>>>>>> information for users. The users are more willing to
> >     get the
> >     >>>>>>>>>> number
> >     >>>>> of
> >     >>>>>>>>>> events of the specified type instead of the total
> >     number. And this
> >     >>>>>>>> metric
> >     >>>>>>>>>> is calculated differently. The implementation could
> >     register the
> >     >>>>>> metric
> >     >>>>>>>>> by
> >     >>>>>>>>>> themselves.
> >     >>>>>>>>>>
> >     >>>>>>>>>> Q2: This FLIP is overly complicated.
> >     >>>>>>>>>> This FLIP will become concise after these modifications.
> >     >>>>>>>>>> OperatorCoordinatorMetricGroup has already been
> >     introduced into
> >     >>>>> Flink
> >     >>>>>>>> by
> >     >>>>>>>>>> FLIP-179<
> >     >>>>>>>>>>
> >     >>>>>>>>>>
> >     >>
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-179%3A+Expose+Standardized+Operator+Metrics
> >     >>>>>>>>>>> .
> >     >>>>>>>>>> And
> >     >>>>>>>>>> this FLIP will not change it. This FLIP only provides a
> >     new metric
> >     >>>>>>>> option
> >     >>>>>>>>>> and a new metric group scope. The changes in proposed
> >     changes
> >     >>>>> provide
> >     >>>>>>>> the
> >     >>>>>>>>>> details about the modifications for the internal
> >     classes, which
> >     >>>>> might
> >     >>>>>>>>> make
> >     >>>>>>>>>> it look complicated.
> >     >>>>>>>>>>
> >     >>>>>>>>>> Thanks for all the comments again. If there are no further
> >     >>>>>>>>>> comments,
> >     >>>>>> we
> >     >>>>>>>>>> plan to start the voting thread this week.
> >     >>>>>>>>>>
> >     >>>>>>>>>> Best,
> >     >>>>>>>>>> Hang
> >     >>>>>>>>>>
> >     >>>>>>>>>> Qingsheng Ren <re...@gmail.com> 于2022年12月26日周一
> >     16:48写道：
> >     >>>>>>>>>>
> >     >>>>>>>>>>
> >     >>>>>>>>>>> Thanks for the FLIP, Hang!
> >     >>>>>>>>>>>
> >     >>>>>>>>>>> This FLIP overall looks good to me. Actually I share
> >     the same
> >     >>>>> concern
> >     >>>>>>>>>> with
> >     >>>>>>>>>>> Zhu that numEventsIn and numEventsOut counters are not
> >     quite
> >     >>>>>>>>>>> useful
> >     >>>>>>>> to
> >     >>>>>>>>>> end
> >     >>>>>>>>>>> users. OperatorEvent is a quite low-level abstraction,
> >     which
> >     >>>>> requires
> >     >>>>>>>>>>> instantialization in order to be practical to users and
> >     >>>>>>>>>>> developers,
> >     >>>>>>>> so
> >     >>>>>>>>>>> maybe it's better to exclude them from the FLIP.
> >     >>>>>>>>>>>
> >     >>>>>>>>>>> Best,
> >     >>>>>>>>>>> Qingsheng
> >     >>>>>>>>>>> Ververica (Alibaba)
> >     >>>>>>>>>>>
> >     >>>>>>>>>>> On Mon, Dec 26, 2022 at 12:08 PM Zhu Zhu
> >     <re...@gmail.com>
> >     >>>>> wrote:
> >     >>>>>>>>>>>> Hi Hang,
> >     >>>>>>>>>>>> I still see no strong reason why we need
> >     >>>>>>>>>>>> numEventsIn/numEventsOut
> >     >>>>>>>>>>> metrics.
> >     >>>>>>>>>>>> In the discussion in FLINK-29801, I can see the same
> >     concern
> >     >>>>>>>>>>>> from
> >     >>>>>>>>>> others.
> >     >>>>>>>>>>>> So I prefer to exclude them from this FLIP to avoid
> >     >>>>>>>>>>>> over-extending
> >     >>>>>>>>> the
> >     >>>>>>>>>>>> scope.
> >     >>>>>>>>>>>>
> >     >>>>>>>>>>>> Thanks,
> >     >>>>>>>>>>>> Zhu
> >     >>>>>>>>>>>>
> >     >>>>>>>>>>>> Hang Ruan <ru...@gmail.com> 于2022年12月23日周五
> >     >>>>>>>>>>>> 15:21写道：
> >     >>>>>>>>>>>>> Hi everyone,
> >     >>>>>>>>>>>>>
> >     >>>>>>>>>>>>> As for the Zhu Zhu's problem, I think we should keep
> >     the common
> >     >>>>>>>>>>> metrics,
> >     >>>>>>>>>>>> which will help to observe incoming and outgoing
> >     events. What do
> >     >>>>>>>> you
> >     >>>>>>>>>>> think,
> >     >>>>>>>>>>>> @Zhu Zhu ?
> >     >>>>>>>>>>>>> And @Chesnay, are there any other issues you are
> >     more concerned
> >     >>>>>>>>>> about?
> >     >>>>>>>>>>>> Looking forward to your reply.
> >     >>>>>>>>>>>>> Thanks for all the comments. If there are no further
> >     >>>>>>>>>>>>> comments, we
> >     >>>>>>>>>> plan
> >     >>>>>>>>>>>> to start the voting thread next week.
> >     >>>>>>>>>>>>> Best,
> >     >>>>>>>>>>>>> Hang
> >     >>>>>>>>>>>>>
> >     >>>>>>>>>>>>> Hang Ruan <ru...@gmail.com> 于2022年12月15日周四
> >     >>>>>>>>>>>>> 16:49写道：
> >     >>>>>>>>>>>>>> Hi, Zhu Zhu,
> >     >>>>>>>>>>>>>>
> >     >>>>>>>>>>>>>> Thanks for your feedback!
> >     >>>>>>>>>>>>>>
> >     >>>>>>>>>>>>>> The OperatorCoordinator implementations are
> >     different. And
> >     >>>>>>>>>>>>>> their
> >     >>>>>>>>>>>> metrics are much different too. We try to find the
> common
> >     >>>>>>>>>>>> metrics
> >     >>>>>>>> and
> >     >>>>>>>>>> put
> >     >>>>>>>>>>>> them in the OperatorCoordinatorMetricGroup. If most
> >     developers
> >     >>>>>>>> think
> >     >>>>>>>>> we
> >     >>>>>>>>>>> do
> >     >>>>>>>>>>>> not need these common metrics, removing the common
> >     metrics is
> >     >>>>>>>>>>>> also
> >     >>>>>>>>>>>> acceptable.
> >     >>>>>>>>>>>>>> Best,
> >     >>>>>>>>>>>>>> Hang
> >     >>>>>>>>>>>>>>
> >     >>>>>>>>>>>>>> Zhu Zhu <re...@gmail.com> 于2022年12月14日周三
> >     22:09写道：
> >     >>>>>>>>>>>>>>> Hi Hang & MengYue,
> >     >>>>>>>>>>>>>>>
> >     >>>>>>>>>>>>>>> Thanks for creating this FLIP!
> >     >>>>>>>>>>>>>>>
> >     >>>>>>>>>>>>>>> I think it is very useful, mainly in two aspects:
> >     >>>>>>>>>>>>>>> 1. Enables OperatorCoordinators to register metrics.
> >     >>>>>>>>>>>>>>> Currently
> >     >>>>>>>>>>>>>>> the coordinators has no way to do this. And operator
> >     >>>>>>>> coordinator
> >     >>>>>>>>>>>>>>> metric group further enables the SplitEnumerator
> >     to have
> >     >>>>>>>>>>>>>>> access
> >     >>>>>>>>>>>>>>> to a registered metric group (via the existing public
> >     >>>>>>>>>>>>>>> interface
> >     >>>>>>>>>>>>>>> SplitEnumeratorContext#metricGroup()), which is
> >     null at the
> >     >>>>>>>>> moment.
> >     >>>>>>>>>>>>>>> 2. Defines the scope of operator coordinator
> >     metrics. A clear
> >     >>>>>>>>>>>> definition
> >     >>>>>>>>>>>>>>> makes it easy for users to find their wanted
> >     metrics. The
> >     >>>>>>>>>> definition
> >     >>>>>>>>>>>>>>> also helps to avoid conflicts of metrics from
> multiple
> >     >>>>>>>>>>>> OperatorCoordinators
> >     >>>>>>>>>>>>>>> of the same kind. E.g. each SourceCoordinator may
> >     have its
> >     >>>>>>>>>>>>>>> own
> >     >>>>>>>>>>>>>>> numSourceSplits metric, these metrics should not
> >     be directly
> >     >>>>>>>>>>> registered
> >     >>>>>>>>>>>>>>> to the job metric group.
> >     >>>>>>>>>>>>>>>
> >     >>>>>>>>>>>>>>> What I'm a bit concerned is the necessity of the
> >     introduced
> >     >>>>>>>>> common
> >     >>>>>>>>>>>> metrics
> >     >>>>>>>>>>>>>>> numEventsInCounter & numEventsOutCounter. If there
> >     any case
> >     >>>>>>>> which
> >     >>>>>>>>>>>> strongly
> >     >>>>>>>>>>>>>>> requires them?
> >     >>>>>>>>>>>>>>>
> >     >>>>>>>>>>>>>>> Regarding the concerns of Chesnay,
> >     >>>>>>>>>>>>>>>> A dedicated coordinator MG implementation is
> overkill
> >     >>>>>>>>>>>>>>> Directly using the job metric group can result in
> >     metric
> >     >>>>>>>>> conflicts,
> >     >>>>>>>>>>> as
> >     >>>>>>>>>>>> mentioned
> >     >>>>>>>>>>>>>>> in above #2.
> >     >>>>>>>>>>>>>>>
> >     >>>>>>>>>>>>>>> Thanks,
> >     >>>>>>>>>>>>>>> Zhu
> >     >>>>>>>>>>>>>>>
> >     >>>>>>>>>>>>>>> Dong Lin <li...@gmail.com> 于2022年12月10日周六
> >     >>>>>>>>>>>>>>> 14:16写道：
> >     >>>>>>>>>>>>>>>
> >     >>>>>>>>>>>>>>>> Hi Chesney,
> >     >>>>>>>>>>>>>>>>
> >     >>>>>>>>>>>>>>>> Just to double check with you,
> >     >>>>>>>>>>>>>>>> OperatorCoordinatorMetricGroup
> >     >>>>>>>>>>>> (annotated as
> >     >>>>>>>>>>>>>>>> @PublicEvolving) has already been introduced into
> >     Flink by
> >     >>>>>>>>>> FLIP-179
> >     >>>>>>>>>>>>>>>> <
> >     >>
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-179%3A+Expose+Standardized+Operator+Metrics
> >     >>>>>>>>>>>>> .
> >     >>>>>>>>>>>>>>>> And that FLIP has got you +1.. Do you mean we
> >     should remove
> >     >>>>>>>>> this
> >     >>>>>>>>>>>>>>>> OperatorCoordinatorMetricGroup?
> >     >>>>>>>>>>>>>>>>
> >     >>>>>>>>>>>>>>>> Regards,
> >     >>>>>>>>>>>>>>>> Dong
> >     >>>>>>>>>>>>>>>>
> >     >>>>>>>>>>>>>>>> On Sat, Dec 10, 2022 at 1:33 AM Chesnay Schepler <
> >     >>>>>>>>>>> chesnay@apache.org>
> >     >>>>>>>>>>>> wrote:
> >     >>>>>>>>>>>>>>>>> As a whole I feel like this FLIP is overly
> >     complicated. A
> >     >>>>>>>>>>> dedicated
> >     >>>>>>>>>>>>>>>>> coordinator MG implementation is overkill; it
> >     could just
> >     >>>>>>>>> re-use
> >     >>>>>>>>>>> the
> >     >>>>>>>>>>>>>>>>> existing Task/OperatorMGs to create the same
> >     structure we
> >     >>>>>>>>> have
> >     >>>>>>>>>> on
> >     >>>>>>>>>>>> TMs,
> >     >>>>>>>>>>>>>>>>> similar to what we did with the Job MG.
> >     >>>>>>>>>>>>>>>>>
> >     >>>>>>>>>>>>>>>>> However, I'm not convinced that this is required
> >     anyway,
> >     >>>>>>>>>> because
> >     >>>>>>>>>>>> all the
> >     >>>>>>>>>>>>>>>>> example metrics you listed can be implemented on
> >     the TM
> >     >>>>>>>> side
> >     >>>>>>>>> +
> >     >>>>>>>>>>>>>>>>> aggregating them in the external metrics backend.
> >     >>>>>>>>>>>>>>>>>
> >     >>>>>>>>>>>>>>>>> Since I'm on holidays soon, just so no one tries
> >     to pull a
> >     >>>>>>>>> fast
> >     >>>>>>>>>>>> one on
> >     >>>>>>>>>>>>>>>>> me, if this were to go to a vote as-is I'd be
> >     against it.
> >     >>>>>>>>>>>>>>>>>
> >     >>>>>>>>>>>>>>>>>
> >     >>>>>>>>>>>>>>>>> On 09/12/2022 15:30, Dong Lin wrote:
> >     >>>>>>>>>>>>>>>>>> Hi Hang,
> >     >>>>>>>>>>>>>>>>>>
> >     >>>>>>>>>>>>>>>>>> Thanks for the FLIP! The FLIP looks good and it
> >     is pretty
> >     >>>>>>>>>>>> informative.
> >     >>>>>>>>>>>>>>>>>> I have just two minor comments regarding names:
> >     >>>>>>>>>>>>>>>>>> - Would it be useful to rename the config key as
> >     >>>>>>>>>>>>>>>>>> *metrics.scope.jm.job.operator-coordinator* for
> >     >>>>>>>> consistency
> >     >>>>>>>>>>> with
> >     >>>>>>>>>>>>>>>>>> *metrics.scope.jm.job
> >     >>>>>>>>>>>>>>>>>> *(which is not named as *jm-job)?
> >     >>>>>>>>>>>>>>>>>> - Maybe rename the variable as
> >     >>>>>>>>>>> SCOPE_NAMING_OPERATOR_COORDINATOR
> >     >>>>>>>>>>>> for
> >     >>>>>>>>>>>>>>>>>> simplicity and consistency with
> >     SCOPE_NAMING_OPERATOR
> >     >>>>>>>>> (which
> >     >>>>>>>>>> is
> >     >>>>>>>>>>>> not named
> >     >>>>>>>>>>>>>>>>>> as SCOPE_NAMING_TM_JOB_OPERATOR)?
> >     >>>>>>>>>>>>>>>>>>
> >     >>>>>>>>>>>>>>>>>> Cheers,
> >     >>>>>>>>>>>>>>>>>> Dong
> >     >>>>>>>>>>>>>>>>>>
> >     >>>>>>>>>>>>>>>>>>
> >     >>>>>>>>>>>>>>>>>>
> >     >>>>>>>>>>>>>>>>>> On Thu, Dec 8, 2022 at 3:28 PM Hang Ruan <
> >     >>>>>>>>>>> ruanhang1993@gmail.com>
> >     >>>>>>>>>>>> wrote:
> >     >>>>>>>>>>>>>>>>>>> Hi all,
> >     >>>>>>>>>>>>>>>>>>>
> >     >>>>>>>>>>>>>>>>>>> MengYue and I created FLIP-274[1] Introduce
> >     metric group
> >     >>>>>>>>> for
> >     >>>>>>>>>>>>>>>>>>> OperatorCoordinator. OperatorCoordinator is the
> >     >>>>>>>>> coordinator
> >     >>>>>>>>>>> for
> >     >>>>>>>>>>>> runtime
> >     >>>>>>>>>>>>>>>>>>> operators and running on Job Manager. The
> >     coordination
> >     >>>>>>>>>>>> mechanism is
> >     >>>>>>>>>>>>>>>>>>> operator events between OperatorCoordinator
> >     and its all
> >     >>>>>>>>>>>> operators, the
> >     >>>>>>>>>>>>>>>>>>> coordination is more and more using in Flink, for
> >     >>>>>>>> example
> >     >>>>>>>>>> many
> >     >>>>>>>>>>>> Sources
> >     >>>>>>>>>>>>>>>>> and
> >     >>>>>>>>>>>>>>>>>>> Sinks depend on the mechanism to assign splits
> and
> >     >>>>>>>>>> coordinate
> >     >>>>>>>>>>>> commits to
> >     >>>>>>>>>>>>>>>>>>> external systems. The OperatorCoordinator is
> >     widely
> >     >>>>>>>> using
> >     >>>>>>>>> in
> >     >>>>>>>>>>>> flink kafka
> >     >>>>>>>>>>>>>>>>>>> connector, flink pulsar connector, flink cdc
> >     connector,
> >     >>>>>>>>>> flink
> >     >>>>>>>>>>>> hudi
> >     >>>>>>>>>>>>>>>>>>> connector and so on.
> >     >>>>>>>>>>>>>>>>>>>
> >     >>>>>>>>>>>>>>>>>>> But there is not a suitable metric group scope
> >     for the
> >     >>>>>>>>>>>>>>>>> OperatorCoordinator
> >     >>>>>>>>>>>>>>>>>>> and not an implementation for the interface
> >     >>>>>>>>>>>>>>>>> OperatorCoordinatorMetricGroup.
> >     >>>>>>>>>>>>>>>>>>> These metrics in OperatorCoordinator could be
> >     how many
> >     >>>>>>>>>>>> splits/partitions
> >     >>>>>>>>>>>>>>>>>>> have been assigned to source readers, how many
> >     files
> >     >>>>>>>> have
> >     >>>>>>>>>> been
> >     >>>>>>>>>>>> written
> >     >>>>>>>>>>>>>>>>> out
> >     >>>>>>>>>>>>>>>>>>> by sink writers, these metrics not only help
> >     users to
> >     >>>>>>>> know
> >     >>>>>>>>>> the
> >     >>>>>>>>>>>> job
> >     >>>>>>>>>>>>>>>>> progress
> >     >>>>>>>>>>>>>>>>>>> but also make big job maintaining easier. Thus we
> >     >>>>>>>> propose
> >     >>>>>>>>>> the
> >     >>>>>>>>>>>> FLIP-274
> >     >>>>>>>>>>>>>>>>> to
> >     >>>>>>>>>>>>>>>>>>> introduce a new metric group scope for
> >     >>>>>>>> OperatorCoordinator
> >     >>>>>>>>>> and
> >     >>>>>>>>>>>> provide
> >     >>>>>>>>>>>>>>>>> an
> >     >>>>>>>>>>>>>>>>>>> internal implementation for
> >     >>>>>>>>> OperatorCoordinatorMetricGroup.
> >     >>>>>>>>>>>>>>>>>>> Could you help review this FLIP when you get
> >     time? Any
> >     >>>>>>>>>>> feedback
> >     >>>>>>>>>>>> is
> >     >>>>>>>>>>>>>>>>>>> appreciated!
> >     >>>>>>>>>>>>>>>>>>>
> >     >>>>>>>>>>>>>>>>>>> Best,
> >     >>>>>>>>>>>>>>>>>>> Hang
> >     >>>>>>>>>>>>>>>>>>>
> >     >>>>>>>>>>>>>>>>>>> [1]
> >     >>>>>>>>>>>>>>>>>>>
> >     >>>>>>>>>>>>>>>>>>>
> >     >>
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-274%3A+Introduce+metric+group+for+OperatorCoordinator
> >     >>
> >
>

Re: [DISCUSS] FLIP-274 : Introduce metric group for OperatorCoordinator

Posted by Chesnay Schepler <ch...@apache.org>.

 > First, I do not understand why users have to configure the new scope 
format, which has a default value.

If you don't use scope formats, sure. If you do use scope formats, e.g. 
to add a common prefix (which is the case for datadog users for 
example), then the current default in the FLIP is insufficient and 
requires the user to update the configuration.

 > I usually do not need to change these configuration of scope formats 
when submitting the flink job.

Scope formats as a whole are quite a power-user feature, but that 
doesn't mean we should ignore it.

 > I try to let it extend the ComponentMetricGroup

This isn't inherently required just because it is a "component".
Component metric groups should _only_ be used for cases where such a 
component provides several bits of metadata.
For example, tasks provide vertex ids, task names, attempted IDs etc., 
and we'd like users to have the option on how this metadata ends up 
being used (via scope formats). This currently can't be built with the 
addGroup() methods, hence why the component groups exist.
However, the operator coordinator _itself_does not provide several bits 
of metadata. Logically, the _operator_ does, not the coordinator.

 > This make me decide to add a new scope format.

And this is fine; just don't add a new scope format for the coordinator 
but logically for operators on the JM side, such that we can extend the 
set of operator-specific metrics in the future without having to add yet 
another scope format.

 > The relationship between the Operator and OperatorCoordinator is 
maintained in the OperatorCoordinatorHolder. In the POC, the operator 
name/id could be found and the OperatorCoordinatorMetricGroup will be 
created here.

That's all irrelevant.

 > The registered metrics of OperatorCoordinator always are those which 
can not be aggregated from the tasks, like the number of unassigned 
splits in the SourceCoordinator. Actually we have not encountered the 
scenario you mentioned. I think the OperatorCoordinatorMetricGroup 
should only contain the metrics for itself instead of its subtasks.

You are misunderstanding the problem.

This isn't about aggregating metrics on our side or exposing metrics 
from TMs on the JM side or anything like that.

It's purely about the available metadata for operator metrics on the JM 
side. Currently you suggest operator name / id; what I'm proposing is to 
add task(==vertex!) id / name to this list.

The benefit here is simple. I can look up all metrics where task_id==XXX 
and get _everything_ related to that vertex, including all metrics 
associated with operators that are part of that vertex, including the 
coordinator.

 > Using metrics.scope.jm.job is not enough to distinguish the different 
coordinators.

I did not suggest just using metrics.scope.jm.job. I suggested that as 
the default for the jm operator scope format, that is used for the 
proposed JobManagerOperatorMetricGroup, which will have some plain 
metric groups as children for coordinators.
Aka, you have the JMOperatorMG, then you call addGroup("coordinators") 
and then addGroup(coordinator_name) for each coordinator.

 > So there are two choice for the 
InternalOperatorCoordinatorMetricGroup: 1. add and improve the new scope 
format; 2. use the metrics.scope.jm.job and ProxyMetricGroup. Which one 
is better?

There are more options than that.
I feel like there are a lot of misunderstandings in this discussion. 
Please let me know if I could clear things up. If not I can also provide 
a PoC based on your PoC if that can speed things up. It's not that 
different anyway.

On 19/01/2023 07:39, Hang Ruan wrote:
> Hi, chesnay,
>
> Thanks for your reply. I still have some doubts about the questions 
> you raised.
>
>     > Extending the set of ScopeFormats is problematic because it in
>     practice
>     it breaks the config if users actively rely on it, since there's now
>     another key that they _must_ set for it to be
>     consistent/compatible with
>     their existing setup.
>     Unfortunately due to how powerful scope formats are we can't derive a
>     default value that matches their existing setup.
>     Hence we should try to do this as rarely as possible.
>
>
>     > This FLIP does not adhere to that since it proposes a dedicated
>     format
>     for coordinators; next time we want to expose operator-specific
>     metrics
>     (e.g., in the scheduler) we'd have to add another one to support it.
>
>
> First, I do not understand why users have to configure the new scope 
> format, which has a default value. I usually do not need to change 
> these configuration of scope formats when submitting the flink job.
> IMO, the OperatorCoordinator is a component running at the JobMaster. 
> I try to let it extend the ComponentMetricGroup, but I cannot find 
> a suitable scope. This make me decide to add a new scope format.
>
>     > Additionally, the configurable variables (operator name/id) are
>     logically not attached to the coordinator, but operators, so to me it
>     just doesn't make sense to structure it like this.
>
>
> The relationship between the Operator and OperatorCoordinator is 
> maintained in the OperatorCoordinatorHolder. In the POC, the operator 
> name/id could be found and the OperatorCoordinatorMetricGroup will be 
> created here.
>
>     > Another thing I'm concerned about is that, because we don't include
>     tasks in the hierarchy, users wishing to collect all metrics for a
>     particular task (in this case ==vertex) now have to go
>     significantly out
>     of their way to get them, since they can no longer just filter by the
>     task ID but have to be filter for _all_ operators that are part of
>     the task.
>
>
> The registered metrics of OperatorCoordinator always are those which 
> can not be aggregated from the tasks, like the number of unassigned 
> splits in the SourceCoordinator. Actually we have not encountered the 
> scenario you mentioned. I think the OperatorCoordinatorMetricGroup 
> should only contain the metrics for itself instead of its subtasks.
>
> Using metrics.scope.jm.job is not enough to distinguish the different 
> coordinators. The operator id/name is necessary. Then 
> the implementation should be like this. But users can not change the 
> scope in this way. This is also acceptable.
> @Internal public class InternalOperatorCoordinatorMetricGroup
>          extends ProxyMetricGroup<MetricGroup>
>          implements OperatorCoordinatorMetricGroup {
>      public InternalOperatorCoordinatorMetricGroup(
>              JobManagerJobMetricGroup parent, OperatorID operatorID, String operatorName) {
>          super(parent.addGroup(operatorID + operatorName).addGroup("coordinator")); }
> }
> So there are two choice for the 
> InternalOperatorCoordinatorMetricGroup: 1. add and improve the new 
> scope format; 2. use the metrics.scope.jm.job and ProxyMetricGroup. 
> Which one is better?
>
> Best,
> Hang
>
>
> Chesnay Schepler <ch...@apache.org> 于2023年1月18日周三 17:03写道：
>
>     You're misunderstanding the problem.
>
>     Metric groups form a tree with each group providing certain metadata.
>
>     E.g., on the taskmanager we have a TM metric group that provides info
>     about the TM, that has child task metric groups, that have child
>     operator metric groups etc. The operator metric group again has
>     children, where we are now mostly in the user/connector land, like a
>     kafka metric group providing partition info or something.
>
>     What is being proposed is to create a coordinator metric group on
>     the JM
>     side that provides operator metadata.
>
>     A more appropriate structure is to create an operator group on the JM
>     side that provides this info, with the coordinator metric group
>     being a
>     child.
>
>     On 17/01/2023 19:56, Steven Wu wrote:
>     >> Additionally, the configurable variables (operator name/id) are
>     > logically not attached to the coordinator, but operators, so to
>     me it
>     > just doesn't make sense to structure it like this.
>     >
>     > Chesnay, maybe we should clarify the terminology. To me,
>     pperators (like
>     > FLIP-27 source) can have two parts (coordinator and
>     reader/subtask). I
>     > think it is fine to include operator name/id for coordinator
>     metrics.
>     >
>     > On Mon, Jan 16, 2023 at 2:13 AM Chesnay Schepler
>     <ch...@apache.org> wrote:
>     >
>     >> Slight correction: Using metrics.scope.jm.job as the default
>     should be
>     >> safe.
>     >>
>     >> On 16/01/2023 10:18, Chesnay Schepler wrote:
>     >>> The proposed ScopeFormat is still problematic for a few reasons.
>     >>>
>     >>> Extending the set of ScopeFormats is problematic because it in
>     >>> practice it breaks the config if users actively rely on it, since
>     >>> there's now another key that they _must_ set for it to be
>     >>> consistent/compatible with their existing setup.
>     >>> Unfortunately due to how powerful scope formats are we can't
>     derive a
>     >>> default value that matches their existing setup.
>     >>> Hence we should try to do this as rarely as possible.
>     >>>
>     >>> This FLIP does not adhere to that since it proposes a
>     dedicated format
>     >>> for coordinators; next time we want to expose operator-specific
>     >>> metrics (e.g., in the scheduler) we'd have to add another one to
>     >>> support it.
>     >>>
>     >>> Additionally, the configurable variables (operator name/id) are
>     >>> logically not attached to the coordinator, but operators, so
>     to me it
>     >>> just doesn't make sense to structure it like this.
>     >>>
>     >>> Another thing I'm concerned about is that, because we don't
>     include
>     >>> tasks in the hierarchy, users wishing to collect all metrics for a
>     >>> particular task (in this case ==vertex) now have to go
>     significantly
>     >>> out of their way to get them, since they can no longer just
>     filter by
>     >>> the task ID but have to be filter for _all_ operators that are
>     part of
>     >>> the task.
>     >>>
>     >>> On 16/01/2023 03:09, Hang Ruan wrote:
>     >>>> Hi, @chesnay@apache.org <ch...@apache.org> ,
>     >>>>
>     >>>> Do you have time to help to review this FLIP again after the
>     >>>> modification?
>     >>>> Looking forward to your reply.
>     >>>> This FLIP will add a new configuration for the
>     >>>> OperatorCoordinatorMetricGroup scope format. It provides an
>     internal
>     >>>> implementation and is added as a component to the
>     >>>> JobManagerJobMetricGroup.
>     >>>> If something doesn't make sense, could you provide some
>     advice? It
>     >>>> will be
>     >>>> very helpful. Thanks a lot for your help.
>     >>>>
>     >>>> Best,
>     >>>> Hang
>     >>>>
>     >>>> Martijn Visser <ma...@apache.org> 于2023年1月11日周三
>     16:34写道：
>     >>>>
>     >>>>
>     >>>>> Hi Hang,
>     >>>>>
>     >>>>> I'm a bit surprised that this has gone to a vote, given that
>     Chesnay
>     >>>>> deliberately mentioned that he would vote against it as-is.
>     I would
>     >>>>> expect
>     >>>>> that before going to a vote, he has had the opportunity to
>     >>>>> participate in
>     >>>>> this discussion.
>     >>>>>
>     >>>>> Best regards,
>     >>>>>
>     >>>>> Martijn
>     >>>>>
>     >>>>> On Tue, Jan 3, 2023 at 12:53 PM Jark Wu <im...@gmail.com>
>     wrote:
>     >>>>>
>     >>>>>> Hi Dong,
>     >>>>>>
>     >>>>>> Regarding “SplitEnumeratorContext#metricGroup”, my only
>     concern is
>     >>>>>> that
>     >>>>>> this is a core interface for the FLIP. It’s hard to tell how
>     >>>>>> sources use
>     >>>>>> metric group without mentioning this interface. Even if
>     this is an
>     >>>>> existing
>     >>>>>> API, I think it’s worth introducing the interface again and
>     declaring
>     >>>>> that
>     >>>>>> we will implement the interface instead of a no-op method
>     in this
>     >>>>>> FLIP.
>     >>>>>>
>     >>>>>> Anyway, this is a minor problem and shouldn’t block this
>     FLIP. I’m
>     >>>>>> +1 to
>     >>>>>> start a vote.
>     >>>>>>
>     >>>>>> Best,
>     >>>>>> Jark
>     >>>>>>
>     >>>>>>
>     >>>>>>> 2023年1月3日 10:03，Hang Ruan <ru...@gmail.com> 写道：
>     >>>>>>>
>     >>>>>>> Hi, Jark and Dong,
>     >>>>>>>
>     >>>>>>> Thanks for your comments. Sorry for my late reply.
>     >>>>>>>
>     >>>>>>> For suggestion 1, I plan to implement the
>     >>>>>>> SplitEnumeratorMetricGroup in
>     >>>>>>> another issue, and it is not contained in this FLIP. I
>     will add some
>     >>>>>>> description about this part.
>     >>>>>>> For suggestion 2, changes about
>     OperatorCoordinator#metricGroup has
>     >>>>>> already
>     >>>>>>> been documented in the proposed change section.
>     >>>>>>>
>     >>>>>>> Best,
>     >>>>>>> Hang
>     >>>>>>>
>     >>>>>>> Dong Lin <li...@gmail.com> 于2023年1月1日周日 09:45写道：
>     >>>>>>>
>     >>>>>>>> Let me chime-in and add comments regarding the public
>     interface
>     >>>>> section.
>     >>>>>>>> Please see my comments inline.
>     >>>>>>>>
>     >>>>>>>> On Thu, Dec 29, 2022 at 6:08 PM Jark Wu
>     <im...@gmail.com> wrote:
>     >>>>>>>>
>     >>>>>>>>> Hi Hang,
>     >>>>>>>>>
>     >>>>>>>>> Thanks for driving this discussion. I think this is a
>     very useful
>     >>>>>> feature
>     >>>>>>>>> for connectors.
>     >>>>>>>>>
>     >>>>>>>>> The FLIP looks quite good to me, and I just have two
>     suggestions.
>     >>>>>>>>>
>     >>>>>>>>> 1. In the "Public Interface" section, mention that the
>     >>>>>>>>> implementation
>     >>>>>>>>> behavior of "SplitEnumeratorContext#metricGroup" is
>     changed from
>     >>>>>>>> returning
>     >>>>>>>>> null to returning a concrete SplitEnumeratorMetricGroup
>     instance.
>     >>>>> Even
>     >>>>>>>>> though the API is already there, the behavior change can
>     also be
>     >>>>>>>> considered
>     >>>>>>>>> a public change.
>     >>>>>>>>>
>     >>>>>>>> SplitEnumeratorContext#metricGroup is an interface and
>     this FLIP
>     >>>>>>>> does
>     >>>>>> not
>     >>>>>>>> seem to change its semantics/behavior. The FLIP does
>     change the
>     >>>>>>>> implementation/behavior of
>     SourceCoordinatorContext#metricGroup,
>     >>>>>>>> which
>     >>>>>> is
>     >>>>>>>> marked @Internal.
>     >>>>>>>>
>     >>>>>>>> Thus it might seem a bit weird to add in the public interface
>     >>>>>>>> section
>     >>>>>>>> saying that we change the interface
>     >>>>>>>> SplitEnumeratorContext#metricGroup
>     >>>>>> from
>     >>>>>>>> returning null to non-null object.
>     >>>>>>>>
>     >>>>>>>> 2. Mention the newly added interface of
>     >>>>>> "OperatorCoordinator#metricGroup"
>     >>>>>>>>> in the "Proposed Changes" section or "Public Interface"
>     section. As
>     >>>>> the
>     >>>>>>>>> FLIP said, OperatorCoordinator is widely used in many
>     connectors.
>     >>>>>> Though
>     >>>>>>>> it
>     >>>>>>>>> is still an @Internal API, I think it is worth
>     mentioning the
>     >>>>>>>>> change
>     >>>>> in
>     >>>>>>>> the
>     >>>>>>>>> FLIP.
>     >>>>>>>> Since OperatorCoordinator is an internal API, it seems
>     reasonable to
>     >>>>>>>> explain it in the proposed change section. The FLIP seems
>     to have
>     >>>>>>>> documented this in point 5 of the proposed change section.
>     >>>>>>>>
>     >>>>>>>> BTW, if we think there are @internal classes that are
>     important
>     >>>>>>>> enough
>     >>>>>> to
>     >>>>>>>> be added in the public interface section, it might be
>     useful to
>     >>>>>> explicitly
>     >>>>>>>> discuss this topic and document it in the "*What are the
>     "public
>     >>>>>>>> interfaces" of the project*" in this
>     >>>>>>>> <
>     >>>>>>>>
>     >>
>     https://cwiki.apache.org/confluence/display/FLINK/Flink+Improvement+Proposals
>     >>>>>>>> wiki.
>     >>>>>>>>
>     >>>>>>>>
>     >>>>>>>>> Best,
>     >>>>>>>>> Jark
>     >>>>>>>>>
>     >>>>>>>>>
>     >>>>>>>>> On Mon, 26 Dec 2022 at 18:06, Hang Ruan
>     <ru...@gmail.com>
>     >>>>>> wrote:
>     >>>>>>>>>> Hi, thanks for the feedback, Zhu Zhu and Qingsheng.
>     >>>>>>>>>> After combining everyone's comments, the main concerns and
>     >>>>>>>> corresponding
>     >>>>>>>>>> adjustments are as follows.
>     >>>>>>>>>>
>     >>>>>>>>>> Q1: Common metrics are not quite useful.
>     >>>>>>>>>> numEventsIn and numEventsOut counters will be removed
>     from the
>     >>>>>>>>>> OperatorCoordinatorMetricGroup. These common metrics do not
>     >>>>>>>>>> provide
>     >>>>>>>>> enough
>     >>>>>>>>>> information for users. The users are more willing to
>     get the
>     >>>>>>>>>> number
>     >>>>> of
>     >>>>>>>>>> events of the specified type instead of the total
>     number. And this
>     >>>>>>>> metric
>     >>>>>>>>>> is calculated differently. The implementation could
>     register the
>     >>>>>> metric
>     >>>>>>>>> by
>     >>>>>>>>>> themselves.
>     >>>>>>>>>>
>     >>>>>>>>>> Q2: This FLIP is overly complicated.
>     >>>>>>>>>> This FLIP will become concise after these modifications.
>     >>>>>>>>>> OperatorCoordinatorMetricGroup has already been
>     introduced into
>     >>>>> Flink
>     >>>>>>>> by
>     >>>>>>>>>> FLIP-179<
>     >>>>>>>>>>
>     >>>>>>>>>>
>     >>
>     https://cwiki.apache.org/confluence/display/FLINK/FLIP-179%3A+Expose+Standardized+Operator+Metrics
>     >>>>>>>>>>> .
>     >>>>>>>>>> And
>     >>>>>>>>>> this FLIP will not change it. This FLIP only provides a
>     new metric
>     >>>>>>>> option
>     >>>>>>>>>> and a new metric group scope. The changes in proposed
>     changes
>     >>>>> provide
>     >>>>>>>> the
>     >>>>>>>>>> details about the modifications for the internal
>     classes, which
>     >>>>> might
>     >>>>>>>>> make
>     >>>>>>>>>> it look complicated.
>     >>>>>>>>>>
>     >>>>>>>>>> Thanks for all the comments again. If there are no further
>     >>>>>>>>>> comments,
>     >>>>>> we
>     >>>>>>>>>> plan to start the voting thread this week.
>     >>>>>>>>>>
>     >>>>>>>>>> Best,
>     >>>>>>>>>> Hang
>     >>>>>>>>>>
>     >>>>>>>>>> Qingsheng Ren <re...@gmail.com> 于2022年12月26日周一
>     16:48写道：
>     >>>>>>>>>>
>     >>>>>>>>>>
>     >>>>>>>>>>> Thanks for the FLIP, Hang!
>     >>>>>>>>>>>
>     >>>>>>>>>>> This FLIP overall looks good to me. Actually I share
>     the same
>     >>>>> concern
>     >>>>>>>>>> with
>     >>>>>>>>>>> Zhu that numEventsIn and numEventsOut counters are not
>     quite
>     >>>>>>>>>>> useful
>     >>>>>>>> to
>     >>>>>>>>>> end
>     >>>>>>>>>>> users. OperatorEvent is a quite low-level abstraction,
>     which
>     >>>>> requires
>     >>>>>>>>>>> instantialization in order to be practical to users and
>     >>>>>>>>>>> developers,
>     >>>>>>>> so
>     >>>>>>>>>>> maybe it's better to exclude them from the FLIP.
>     >>>>>>>>>>>
>     >>>>>>>>>>> Best,
>     >>>>>>>>>>> Qingsheng
>     >>>>>>>>>>> Ververica (Alibaba)
>     >>>>>>>>>>>
>     >>>>>>>>>>> On Mon, Dec 26, 2022 at 12:08 PM Zhu Zhu
>     <re...@gmail.com>
>     >>>>> wrote:
>     >>>>>>>>>>>> Hi Hang,
>     >>>>>>>>>>>> I still see no strong reason why we need
>     >>>>>>>>>>>> numEventsIn/numEventsOut
>     >>>>>>>>>>> metrics.
>     >>>>>>>>>>>> In the discussion in FLINK-29801, I can see the same
>     concern
>     >>>>>>>>>>>> from
>     >>>>>>>>>> others.
>     >>>>>>>>>>>> So I prefer to exclude them from this FLIP to avoid
>     >>>>>>>>>>>> over-extending
>     >>>>>>>>> the
>     >>>>>>>>>>>> scope.
>     >>>>>>>>>>>>
>     >>>>>>>>>>>> Thanks,
>     >>>>>>>>>>>> Zhu
>     >>>>>>>>>>>>
>     >>>>>>>>>>>> Hang Ruan <ru...@gmail.com> 于2022年12月23日周五
>     >>>>>>>>>>>> 15:21写道：
>     >>>>>>>>>>>>> Hi everyone,
>     >>>>>>>>>>>>>
>     >>>>>>>>>>>>> As for the Zhu Zhu's problem, I think we should keep
>     the common
>     >>>>>>>>>>> metrics,
>     >>>>>>>>>>>> which will help to observe incoming and outgoing
>     events. What do
>     >>>>>>>> you
>     >>>>>>>>>>> think,
>     >>>>>>>>>>>> @Zhu Zhu ?
>     >>>>>>>>>>>>> And @Chesnay, are there any other issues you are
>     more concerned
>     >>>>>>>>>> about?
>     >>>>>>>>>>>> Looking forward to your reply.
>     >>>>>>>>>>>>> Thanks for all the comments. If there are no further
>     >>>>>>>>>>>>> comments, we
>     >>>>>>>>>> plan
>     >>>>>>>>>>>> to start the voting thread next week.
>     >>>>>>>>>>>>> Best,
>     >>>>>>>>>>>>> Hang
>     >>>>>>>>>>>>>
>     >>>>>>>>>>>>> Hang Ruan <ru...@gmail.com> 于2022年12月15日周四
>     >>>>>>>>>>>>> 16:49写道：
>     >>>>>>>>>>>>>> Hi, Zhu Zhu,
>     >>>>>>>>>>>>>>
>     >>>>>>>>>>>>>> Thanks for your feedback!
>     >>>>>>>>>>>>>>
>     >>>>>>>>>>>>>> The OperatorCoordinator implementations are
>     different. And
>     >>>>>>>>>>>>>> their
>     >>>>>>>>>>>> metrics are much different too. We try to find the common
>     >>>>>>>>>>>> metrics
>     >>>>>>>> and
>     >>>>>>>>>> put
>     >>>>>>>>>>>> them in the OperatorCoordinatorMetricGroup. If most
>     developers
>     >>>>>>>> think
>     >>>>>>>>> we
>     >>>>>>>>>>> do
>     >>>>>>>>>>>> not need these common metrics, removing the common
>     metrics is
>     >>>>>>>>>>>> also
>     >>>>>>>>>>>> acceptable.
>     >>>>>>>>>>>>>> Best,
>     >>>>>>>>>>>>>> Hang
>     >>>>>>>>>>>>>>
>     >>>>>>>>>>>>>> Zhu Zhu <re...@gmail.com> 于2022年12月14日周三
>     22:09写道：
>     >>>>>>>>>>>>>>> Hi Hang & MengYue,
>     >>>>>>>>>>>>>>>
>     >>>>>>>>>>>>>>> Thanks for creating this FLIP!
>     >>>>>>>>>>>>>>>
>     >>>>>>>>>>>>>>> I think it is very useful, mainly in two aspects:
>     >>>>>>>>>>>>>>> 1. Enables OperatorCoordinators to register metrics.
>     >>>>>>>>>>>>>>> Currently
>     >>>>>>>>>>>>>>> the coordinators has no way to do this. And operator
>     >>>>>>>> coordinator
>     >>>>>>>>>>>>>>> metric group further enables the SplitEnumerator
>     to have
>     >>>>>>>>>>>>>>> access
>     >>>>>>>>>>>>>>> to a registered metric group (via the existing public
>     >>>>>>>>>>>>>>> interface
>     >>>>>>>>>>>>>>> SplitEnumeratorContext#metricGroup()), which is
>     null at the
>     >>>>>>>>> moment.
>     >>>>>>>>>>>>>>> 2. Defines the scope of operator coordinator
>     metrics. A clear
>     >>>>>>>>>>>> definition
>     >>>>>>>>>>>>>>> makes it easy for users to find their wanted
>     metrics. The
>     >>>>>>>>>> definition
>     >>>>>>>>>>>>>>> also helps to avoid conflicts of metrics from multiple
>     >>>>>>>>>>>> OperatorCoordinators
>     >>>>>>>>>>>>>>> of the same kind. E.g. each SourceCoordinator may
>     have its
>     >>>>>>>>>>>>>>> own
>     >>>>>>>>>>>>>>> numSourceSplits metric, these metrics should not
>     be directly
>     >>>>>>>>>>> registered
>     >>>>>>>>>>>>>>> to the job metric group.
>     >>>>>>>>>>>>>>>
>     >>>>>>>>>>>>>>> What I'm a bit concerned is the necessity of the
>     introduced
>     >>>>>>>>> common
>     >>>>>>>>>>>> metrics
>     >>>>>>>>>>>>>>> numEventsInCounter & numEventsOutCounter. If there
>     any case
>     >>>>>>>> which
>     >>>>>>>>>>>> strongly
>     >>>>>>>>>>>>>>> requires them?
>     >>>>>>>>>>>>>>>
>     >>>>>>>>>>>>>>> Regarding the concerns of Chesnay,
>     >>>>>>>>>>>>>>>> A dedicated coordinator MG implementation is overkill
>     >>>>>>>>>>>>>>> Directly using the job metric group can result in
>     metric
>     >>>>>>>>> conflicts,
>     >>>>>>>>>>> as
>     >>>>>>>>>>>> mentioned
>     >>>>>>>>>>>>>>> in above #2.
>     >>>>>>>>>>>>>>>
>     >>>>>>>>>>>>>>> Thanks,
>     >>>>>>>>>>>>>>> Zhu
>     >>>>>>>>>>>>>>>
>     >>>>>>>>>>>>>>> Dong Lin <li...@gmail.com> 于2022年12月10日周六
>     >>>>>>>>>>>>>>> 14:16写道：
>     >>>>>>>>>>>>>>>
>     >>>>>>>>>>>>>>>> Hi Chesney,
>     >>>>>>>>>>>>>>>>
>     >>>>>>>>>>>>>>>> Just to double check with you,
>     >>>>>>>>>>>>>>>> OperatorCoordinatorMetricGroup
>     >>>>>>>>>>>> (annotated as
>     >>>>>>>>>>>>>>>> @PublicEvolving) has already been introduced into
>     Flink by
>     >>>>>>>>>> FLIP-179
>     >>>>>>>>>>>>>>>> <
>     >>
>     https://cwiki.apache.org/confluence/display/FLINK/FLIP-179%3A+Expose+Standardized+Operator+Metrics
>     >>>>>>>>>>>>> .
>     >>>>>>>>>>>>>>>> And that FLIP has got you +1.. Do you mean we
>     should remove
>     >>>>>>>>> this
>     >>>>>>>>>>>>>>>> OperatorCoordinatorMetricGroup?
>     >>>>>>>>>>>>>>>>
>     >>>>>>>>>>>>>>>> Regards,
>     >>>>>>>>>>>>>>>> Dong
>     >>>>>>>>>>>>>>>>
>     >>>>>>>>>>>>>>>> On Sat, Dec 10, 2022 at 1:33 AM Chesnay Schepler <
>     >>>>>>>>>>> chesnay@apache.org>
>     >>>>>>>>>>>> wrote:
>     >>>>>>>>>>>>>>>>> As a whole I feel like this FLIP is overly
>     complicated. A
>     >>>>>>>>>>> dedicated
>     >>>>>>>>>>>>>>>>> coordinator MG implementation is overkill; it
>     could just
>     >>>>>>>>> re-use
>     >>>>>>>>>>> the
>     >>>>>>>>>>>>>>>>> existing Task/OperatorMGs to create the same
>     structure we
>     >>>>>>>>> have
>     >>>>>>>>>> on
>     >>>>>>>>>>>> TMs,
>     >>>>>>>>>>>>>>>>> similar to what we did with the Job MG.
>     >>>>>>>>>>>>>>>>>
>     >>>>>>>>>>>>>>>>> However, I'm not convinced that this is required
>     anyway,
>     >>>>>>>>>> because
>     >>>>>>>>>>>> all the
>     >>>>>>>>>>>>>>>>> example metrics you listed can be implemented on
>     the TM
>     >>>>>>>> side
>     >>>>>>>>> +
>     >>>>>>>>>>>>>>>>> aggregating them in the external metrics backend.
>     >>>>>>>>>>>>>>>>>
>     >>>>>>>>>>>>>>>>> Since I'm on holidays soon, just so no one tries
>     to pull a
>     >>>>>>>>> fast
>     >>>>>>>>>>>> one on
>     >>>>>>>>>>>>>>>>> me, if this were to go to a vote as-is I'd be
>     against it.
>     >>>>>>>>>>>>>>>>>
>     >>>>>>>>>>>>>>>>>
>     >>>>>>>>>>>>>>>>> On 09/12/2022 15:30, Dong Lin wrote:
>     >>>>>>>>>>>>>>>>>> Hi Hang,
>     >>>>>>>>>>>>>>>>>>
>     >>>>>>>>>>>>>>>>>> Thanks for the FLIP! The FLIP looks good and it
>     is pretty
>     >>>>>>>>>>>> informative.
>     >>>>>>>>>>>>>>>>>> I have just two minor comments regarding names:
>     >>>>>>>>>>>>>>>>>> - Would it be useful to rename the config key as
>     >>>>>>>>>>>>>>>>>> *metrics.scope.jm.job.operator-coordinator* for
>     >>>>>>>> consistency
>     >>>>>>>>>>> with
>     >>>>>>>>>>>>>>>>>> *metrics.scope.jm.job
>     >>>>>>>>>>>>>>>>>> *(which is not named as *jm-job)?
>     >>>>>>>>>>>>>>>>>> - Maybe rename the variable as
>     >>>>>>>>>>> SCOPE_NAMING_OPERATOR_COORDINATOR
>     >>>>>>>>>>>> for
>     >>>>>>>>>>>>>>>>>> simplicity and consistency with
>     SCOPE_NAMING_OPERATOR
>     >>>>>>>>> (which
>     >>>>>>>>>> is
>     >>>>>>>>>>>> not named
>     >>>>>>>>>>>>>>>>>> as SCOPE_NAMING_TM_JOB_OPERATOR)?
>     >>>>>>>>>>>>>>>>>>
>     >>>>>>>>>>>>>>>>>> Cheers,
>     >>>>>>>>>>>>>>>>>> Dong
>     >>>>>>>>>>>>>>>>>>
>     >>>>>>>>>>>>>>>>>>
>     >>>>>>>>>>>>>>>>>>
>     >>>>>>>>>>>>>>>>>> On Thu, Dec 8, 2022 at 3:28 PM Hang Ruan <
>     >>>>>>>>>>> ruanhang1993@gmail.com>
>     >>>>>>>>>>>> wrote:
>     >>>>>>>>>>>>>>>>>>> Hi all,
>     >>>>>>>>>>>>>>>>>>>
>     >>>>>>>>>>>>>>>>>>> MengYue and I created FLIP-274[1] Introduce
>     metric group
>     >>>>>>>>> for
>     >>>>>>>>>>>>>>>>>>> OperatorCoordinator. OperatorCoordinator is the
>     >>>>>>>>> coordinator
>     >>>>>>>>>>> for
>     >>>>>>>>>>>> runtime
>     >>>>>>>>>>>>>>>>>>> operators and running on Job Manager. The
>     coordination
>     >>>>>>>>>>>> mechanism is
>     >>>>>>>>>>>>>>>>>>> operator events between OperatorCoordinator
>     and its all
>     >>>>>>>>>>>> operators, the
>     >>>>>>>>>>>>>>>>>>> coordination is more and more using in Flink, for
>     >>>>>>>> example
>     >>>>>>>>>> many
>     >>>>>>>>>>>> Sources
>     >>>>>>>>>>>>>>>>> and
>     >>>>>>>>>>>>>>>>>>> Sinks depend on the mechanism to assign splits and
>     >>>>>>>>>> coordinate
>     >>>>>>>>>>>> commits to
>     >>>>>>>>>>>>>>>>>>> external systems. The OperatorCoordinator is
>     widely
>     >>>>>>>> using
>     >>>>>>>>> in
>     >>>>>>>>>>>> flink kafka
>     >>>>>>>>>>>>>>>>>>> connector, flink pulsar connector, flink cdc
>     connector,
>     >>>>>>>>>> flink
>     >>>>>>>>>>>> hudi
>     >>>>>>>>>>>>>>>>>>> connector and so on.
>     >>>>>>>>>>>>>>>>>>>
>     >>>>>>>>>>>>>>>>>>> But there is not a suitable metric group scope
>     for the
>     >>>>>>>>>>>>>>>>> OperatorCoordinator
>     >>>>>>>>>>>>>>>>>>> and not an implementation for the interface
>     >>>>>>>>>>>>>>>>> OperatorCoordinatorMetricGroup.
>     >>>>>>>>>>>>>>>>>>> These metrics in OperatorCoordinator could be
>     how many
>     >>>>>>>>>>>> splits/partitions
>     >>>>>>>>>>>>>>>>>>> have been assigned to source readers, how many
>     files
>     >>>>>>>> have
>     >>>>>>>>>> been
>     >>>>>>>>>>>> written
>     >>>>>>>>>>>>>>>>> out
>     >>>>>>>>>>>>>>>>>>> by sink writers, these metrics not only help
>     users to
>     >>>>>>>> know
>     >>>>>>>>>> the
>     >>>>>>>>>>>> job
>     >>>>>>>>>>>>>>>>> progress
>     >>>>>>>>>>>>>>>>>>> but also make big job maintaining easier. Thus we
>     >>>>>>>> propose
>     >>>>>>>>>> the
>     >>>>>>>>>>>> FLIP-274
>     >>>>>>>>>>>>>>>>> to
>     >>>>>>>>>>>>>>>>>>> introduce a new metric group scope for
>     >>>>>>>> OperatorCoordinator
>     >>>>>>>>>> and
>     >>>>>>>>>>>> provide
>     >>>>>>>>>>>>>>>>> an
>     >>>>>>>>>>>>>>>>>>> internal implementation for
>     >>>>>>>>> OperatorCoordinatorMetricGroup.
>     >>>>>>>>>>>>>>>>>>> Could you help review this FLIP when you get
>     time? Any
>     >>>>>>>>>>> feedback
>     >>>>>>>>>>>> is
>     >>>>>>>>>>>>>>>>>>> appreciated!
>     >>>>>>>>>>>>>>>>>>>
>     >>>>>>>>>>>>>>>>>>> Best,
>     >>>>>>>>>>>>>>>>>>> Hang
>     >>>>>>>>>>>>>>>>>>>
>     >>>>>>>>>>>>>>>>>>> [1]
>     >>>>>>>>>>>>>>>>>>>
>     >>>>>>>>>>>>>>>>>>>
>     >>
>     https://cwiki.apache.org/confluence/display/FLINK/FLIP-274%3A+Introduce+metric+group+for+OperatorCoordinator
>     >>
>

Re: [DISCUSS] FLIP-274 : Introduce metric group for OperatorCoordinator

Posted by Mason Chen <ma...@gmail.com>.

Hi all,

I'm also interested in this FLIP! A metric group that extends from the
jobmanager group resonates well with me too.

I prefer choice 1 since that's consistent with how the other metric groups
are designed. In addition, in your example, I think you mean to write

`parent.addGroup("operator_name", operatorName)`?

From the FLIP:

> For example,KafkaSourceEnumerator could register a metric on how many
> partitions has been assigned to source readers.

 BTW, I would advise against creating redundant metrics Flink can already
obtain. (The number of assigned partitions can be computed by the Kafka
client metrics exported by the source).

Best,
Mason

On Wed, Jan 18, 2023 at 10:39 PM Hang Ruan <ru...@gmail.com> wrote:

> Hi, chesnay,
>
> Thanks for your reply. I still have some doubts about the questions you
> raised.
>
> > Extending the set of ScopeFormats is problematic because it in practice
> > it breaks the config if users actively rely on it, since there's now
> > another key that they _must_ set for it to be consistent/compatible with
> > their existing setup.
> > Unfortunately due to how powerful scope formats are we can't derive a
> > default value that matches their existing setup.
> > Hence we should try to do this as rarely as possible.
>
>
> > > This FLIP does not adhere to that since it proposes a dedicated format
> > for coordinators; next time we want to expose operator-specific metrics
> > (e.g., in the scheduler) we'd have to add another one to support it.
>
>
> First, I do not understand why users have to configure the new scope
> format, which has a default value. I usually do not need to change these
> configuration of scope formats when submitting the flink job.
> IMO, the OperatorCoordinator is a component running at the JobMaster. I try
> to let it extend the ComponentMetricGroup, but I cannot find a suitable
> scope. This make me decide to add a new scope format.
>
> > Additionally, the configurable variables (operator name/id) are
> > logically not attached to the coordinator, but operators, so to me it
> > just doesn't make sense to structure it like this.
>
>
> The relationship between the Operator and OperatorCoordinator is maintained
> in the OperatorCoordinatorHolder. In the POC, the operator name/id could be
> found and the OperatorCoordinatorMetricGroup will be created here.
>
> > Another thing I'm concerned about is that, because we don't include
> > tasks in the hierarchy, users wishing to collect all metrics for a
> > particular task (in this case ==vertex) now have to go significantly out
> > of their way to get them, since they can no longer just filter by the
> > task ID but have to be filter for _all_ operators that are part of the
> > task.
>
>
> The registered metrics of OperatorCoordinator always are those which can
> not be aggregated from the tasks, like the number of unassigned splits in
> the SourceCoordinator. Actually we have not encountered the scenario you
> mentioned. I think the OperatorCoordinatorMetricGroup should only contain
> the metrics for itself instead of its subtasks.
>
> Using metrics.scope.jm.job is not enough to distinguish the different
> coordinators. The operator id/name is necessary. Then the implementation
> should be like this. But users can not change the scope in this way. This
> is also acceptable.
>
> @Internal
> public class InternalOperatorCoordinatorMetricGroup
>         extends ProxyMetricGroup<MetricGroup>
>         implements OperatorCoordinatorMetricGroup {
>     public InternalOperatorCoordinatorMetricGroup(
>             JobManagerJobMetricGroup parent,
>             OperatorID operatorID,
>             String operatorName) {
>         super(parent.addGroup(operatorID +
> operatorName).addGroup("coordinator"));
>     }
> }
>
> So there are two choice for the InternalOperatorCoordinatorMetricGroup: 1.
> add and improve the new scope format; 2. use the metrics.scope.jm.job and
> ProxyMetricGroup. Which one is better?
>
> Best,
> Hang
>
>
> Chesnay Schepler <ch...@apache.org> 于2023年1月18日周三 17:03写道：
>
> > You're misunderstanding the problem.
> >
> > Metric groups form a tree with each group providing certain metadata.
> >
> > E.g., on the taskmanager we have a TM metric group that provides info
> > about the TM, that has child task metric groups, that have child
> > operator metric groups etc. The operator metric group again has
> > children, where we are now mostly in the user/connector land, like a
> > kafka metric group providing partition info or something.
> >
> > What is being proposed is to create a coordinator metric group on the JM
> > side that provides operator metadata.
> >
> > A more appropriate structure is to create an operator group on the JM
> > side that provides this info, with the coordinator metric group being a
> > child.
> >
> > On 17/01/2023 19:56, Steven Wu wrote:
> > >> Additionally, the configurable variables (operator name/id) are
> > > logically not attached to the coordinator, but operators, so to me it
> > > just doesn't make sense to structure it like this.
> > >
> > > Chesnay, maybe we should clarify the terminology. To me, pperators
> (like
> > > FLIP-27 source) can have two parts (coordinator and reader/subtask). I
> > > think it is fine to include operator name/id for coordinator metrics.
> > >
> > > On Mon, Jan 16, 2023 at 2:13 AM Chesnay Schepler <ch...@apache.org>
> > wrote:
> > >
> > >> Slight correction: Using metrics.scope.jm.job as the default should be
> > >> safe.
> > >>
> > >> On 16/01/2023 10:18, Chesnay Schepler wrote:
> > >>> The proposed ScopeFormat is still problematic for a few reasons.
> > >>>
> > >>> Extending the set of ScopeFormats is problematic because it in
> > >>> practice it breaks the config if users actively rely on it, since
> > >>> there's now another key that they _must_ set for it to be
> > >>> consistent/compatible with their existing setup.
> > >>> Unfortunately due to how powerful scope formats are we can't derive a
> > >>> default value that matches their existing setup.
> > >>> Hence we should try to do this as rarely as possible.
> > >>>
> > >>> This FLIP does not adhere to that since it proposes a dedicated
> format
> > >>> for coordinators; next time we want to expose operator-specific
> > >>> metrics (e.g., in the scheduler) we'd have to add another one to
> > >>> support it.
> > >>>
> > >>> Additionally, the configurable variables (operator name/id) are
> > >>> logically not attached to the coordinator, but operators, so to me it
> > >>> just doesn't make sense to structure it like this.
> > >>>
> > >>> Another thing I'm concerned about is that, because we don't include
> > >>> tasks in the hierarchy, users wishing to collect all metrics for a
> > >>> particular task (in this case ==vertex) now have to go significantly
> > >>> out of their way to get them, since they can no longer just filter by
> > >>> the task ID but have to be filter for _all_ operators that are part
> of
> > >>> the task.
> > >>>
> > >>> On 16/01/2023 03:09, Hang Ruan wrote:
> > >>>> Hi, @chesnay@apache.org <ch...@apache.org> ,
> > >>>>
> > >>>> Do you have time to help to review this FLIP again after the
> > >>>> modification?
> > >>>> Looking forward to your reply.
> > >>>> This FLIP will add a new configuration for the
> > >>>> OperatorCoordinatorMetricGroup scope format. It provides an internal
> > >>>> implementation and is added as a component to the
> > >>>> JobManagerJobMetricGroup.
> > >>>> If something doesn't make sense, could you provide some advice? It
> > >>>> will be
> > >>>> very helpful. Thanks a lot for your help.
> > >>>>
> > >>>> Best,
> > >>>> Hang
> > >>>>
> > >>>> Martijn Visser <ma...@apache.org> 于2023年1月11日周三 16:34写道：
> > >>>>
> > >>>>
> > >>>>> Hi Hang,
> > >>>>>
> > >>>>> I'm a bit surprised that this has gone to a vote, given that
> Chesnay
> > >>>>> deliberately mentioned that he would vote against it as-is. I would
> > >>>>> expect
> > >>>>> that before going to a vote, he has had the opportunity to
> > >>>>> participate in
> > >>>>> this discussion.
> > >>>>>
> > >>>>> Best regards,
> > >>>>>
> > >>>>> Martijn
> > >>>>>
> > >>>>> On Tue, Jan 3, 2023 at 12:53 PM Jark Wu <im...@gmail.com> wrote:
> > >>>>>
> > >>>>>> Hi Dong,
> > >>>>>>
> > >>>>>> Regarding “SplitEnumeratorContext#metricGroup”, my only concern is
> > >>>>>> that
> > >>>>>> this is a core interface for the FLIP. It’s hard to tell how
> > >>>>>> sources use
> > >>>>>> metric group without mentioning this interface. Even if this is an
> > >>>>> existing
> > >>>>>> API, I think it’s worth introducing the interface again and
> > declaring
> > >>>>> that
> > >>>>>> we will implement the interface instead of a no-op method in this
> > >>>>>> FLIP.
> > >>>>>>
> > >>>>>> Anyway, this is a minor problem and shouldn’t block this FLIP. I’m
> > >>>>>> +1 to
> > >>>>>> start a vote.
> > >>>>>>
> > >>>>>> Best,
> > >>>>>> Jark
> > >>>>>>
> > >>>>>>
> > >>>>>>> 2023年1月3日 10:03，Hang Ruan <ru...@gmail.com> 写道：
> > >>>>>>>
> > >>>>>>> Hi, Jark and Dong,
> > >>>>>>>
> > >>>>>>> Thanks for your comments. Sorry for my late reply.
> > >>>>>>>
> > >>>>>>> For suggestion 1, I plan to implement the
> > >>>>>>> SplitEnumeratorMetricGroup in
> > >>>>>>> another issue, and it is not contained in this FLIP. I will add
> > some
> > >>>>>>> description about this part.
> > >>>>>>> For suggestion 2, changes about OperatorCoordinator#metricGroup
> has
> > >>>>>> already
> > >>>>>>> been documented in the proposed change section.
> > >>>>>>>
> > >>>>>>> Best,
> > >>>>>>> Hang
> > >>>>>>>
> > >>>>>>> Dong Lin <li...@gmail.com> 于2023年1月1日周日 09:45写道：
> > >>>>>>>
> > >>>>>>>> Let me chime-in and add comments regarding the public interface
> > >>>>> section.
> > >>>>>>>> Please see my comments inline.
> > >>>>>>>>
> > >>>>>>>> On Thu, Dec 29, 2022 at 6:08 PM Jark Wu <im...@gmail.com>
> wrote:
> > >>>>>>>>
> > >>>>>>>>> Hi Hang,
> > >>>>>>>>>
> > >>>>>>>>> Thanks for driving this discussion. I think this is a very
> useful
> > >>>>>> feature
> > >>>>>>>>> for connectors.
> > >>>>>>>>>
> > >>>>>>>>> The FLIP looks quite good to me, and I just have two
> suggestions.
> > >>>>>>>>>
> > >>>>>>>>> 1. In the "Public Interface" section, mention that the
> > >>>>>>>>> implementation
> > >>>>>>>>> behavior of "SplitEnumeratorContext#metricGroup" is changed
> from
> > >>>>>>>> returning
> > >>>>>>>>> null to returning a concrete SplitEnumeratorMetricGroup
> instance.
> > >>>>> Even
> > >>>>>>>>> though the API is already there, the behavior change can also
> be
> > >>>>>>>> considered
> > >>>>>>>>> a public change.
> > >>>>>>>>>
> > >>>>>>>> SplitEnumeratorContext#metricGroup is an interface and this FLIP
> > >>>>>>>> does
> > >>>>>> not
> > >>>>>>>> seem to change its semantics/behavior. The FLIP does change the
> > >>>>>>>> implementation/behavior of SourceCoordinatorContext#metricGroup,
> > >>>>>>>> which
> > >>>>>> is
> > >>>>>>>> marked @Internal.
> > >>>>>>>>
> > >>>>>>>> Thus it might seem a bit weird to add in the public interface
> > >>>>>>>> section
> > >>>>>>>> saying that we change the interface
> > >>>>>>>> SplitEnumeratorContext#metricGroup
> > >>>>>> from
> > >>>>>>>> returning null to non-null object.
> > >>>>>>>>
> > >>>>>>>> 2. Mention the newly added interface of
> > >>>>>> "OperatorCoordinator#metricGroup"
> > >>>>>>>>> in the "Proposed Changes" section or "Public Interface"
> section.
> > As
> > >>>>> the
> > >>>>>>>>> FLIP said, OperatorCoordinator is widely used in many
> connectors.
> > >>>>>> Though
> > >>>>>>>> it
> > >>>>>>>>> is still an @Internal API, I think it is worth mentioning the
> > >>>>>>>>> change
> > >>>>> in
> > >>>>>>>> the
> > >>>>>>>>> FLIP.
> > >>>>>>>> Since OperatorCoordinator is an internal API, it seems
> reasonable
> > to
> > >>>>>>>> explain it in the proposed change section. The FLIP seems to
> have
> > >>>>>>>> documented this in point 5 of the proposed change section.
> > >>>>>>>>
> > >>>>>>>> BTW, if we think there are @internal classes that are important
> > >>>>>>>> enough
> > >>>>>> to
> > >>>>>>>> be added in the public interface section, it might be useful to
> > >>>>>> explicitly
> > >>>>>>>> discuss this topic and document it in the "*What are the "public
> > >>>>>>>> interfaces" of the project*" in this
> > >>>>>>>> <
> > >>>>>>>>
> > >>
> >
> https://cwiki.apache.org/confluence/display/FLINK/Flink+Improvement+Proposals
> > >>>>>>>> wiki.
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>>> Best,
> > >>>>>>>>> Jark
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>> On Mon, 26 Dec 2022 at 18:06, Hang Ruan <
> ruanhang1993@gmail.com>
> > >>>>>> wrote:
> > >>>>>>>>>> Hi, thanks for the feedback, Zhu Zhu and Qingsheng.
> > >>>>>>>>>> After combining everyone's comments, the main concerns and
> > >>>>>>>> corresponding
> > >>>>>>>>>> adjustments are as follows.
> > >>>>>>>>>>
> > >>>>>>>>>> Q1: Common metrics are not quite useful.
> > >>>>>>>>>> numEventsIn and numEventsOut counters will be removed from the
> > >>>>>>>>>> OperatorCoordinatorMetricGroup. These common metrics do not
> > >>>>>>>>>> provide
> > >>>>>>>>> enough
> > >>>>>>>>>> information for users. The users are more willing to get the
> > >>>>>>>>>> number
> > >>>>> of
> > >>>>>>>>>> events of the specified type instead of the total number. And
> > this
> > >>>>>>>> metric
> > >>>>>>>>>> is calculated differently. The implementation could register
> the
> > >>>>>> metric
> > >>>>>>>>> by
> > >>>>>>>>>> themselves.
> > >>>>>>>>>>
> > >>>>>>>>>> Q2: This FLIP is overly complicated.
> > >>>>>>>>>> This FLIP will become concise after these modifications.
> > >>>>>>>>>> OperatorCoordinatorMetricGroup has already been introduced
> into
> > >>>>> Flink
> > >>>>>>>> by
> > >>>>>>>>>> FLIP-179<
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-179%3A+Expose+Standardized+Operator+Metrics
> > >>>>>>>>>>> .
> > >>>>>>>>>> And
> > >>>>>>>>>> this FLIP will not change it. This FLIP only provides a new
> > metric
> > >>>>>>>> option
> > >>>>>>>>>> and a new metric group scope. The changes in proposed changes
> > >>>>> provide
> > >>>>>>>> the
> > >>>>>>>>>> details about the modifications for the internal classes,
> which
> > >>>>> might
> > >>>>>>>>> make
> > >>>>>>>>>> it look complicated.
> > >>>>>>>>>>
> > >>>>>>>>>> Thanks for all the comments again. If there are no further
> > >>>>>>>>>> comments,
> > >>>>>> we
> > >>>>>>>>>> plan to start the voting thread this week.
> > >>>>>>>>>>
> > >>>>>>>>>> Best,
> > >>>>>>>>>> Hang
> > >>>>>>>>>>
> > >>>>>>>>>> Qingsheng Ren <re...@gmail.com> 于2022年12月26日周一 16:48写道：
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>>> Thanks for the FLIP, Hang!
> > >>>>>>>>>>>
> > >>>>>>>>>>> This FLIP overall looks good to me. Actually I share the same
> > >>>>> concern
> > >>>>>>>>>> with
> > >>>>>>>>>>> Zhu that numEventsIn and numEventsOut counters are not quite
> > >>>>>>>>>>> useful
> > >>>>>>>> to
> > >>>>>>>>>> end
> > >>>>>>>>>>> users. OperatorEvent is a quite low-level abstraction, which
> > >>>>> requires
> > >>>>>>>>>>> instantialization in order to be practical to users and
> > >>>>>>>>>>> developers,
> > >>>>>>>> so
> > >>>>>>>>>>> maybe it's better to exclude them from the FLIP.
> > >>>>>>>>>>>
> > >>>>>>>>>>> Best,
> > >>>>>>>>>>> Qingsheng
> > >>>>>>>>>>> Ververica (Alibaba)
> > >>>>>>>>>>>
> > >>>>>>>>>>> On Mon, Dec 26, 2022 at 12:08 PM Zhu Zhu <re...@gmail.com>
> > >>>>> wrote:
> > >>>>>>>>>>>> Hi Hang,
> > >>>>>>>>>>>> I still see no strong reason why we need
> > >>>>>>>>>>>> numEventsIn/numEventsOut
> > >>>>>>>>>>> metrics.
> > >>>>>>>>>>>> In the discussion in FLINK-29801, I can see the same concern
> > >>>>>>>>>>>> from
> > >>>>>>>>>> others.
> > >>>>>>>>>>>> So I prefer to exclude them from this FLIP to avoid
> > >>>>>>>>>>>> over-extending
> > >>>>>>>>> the
> > >>>>>>>>>>>> scope.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Thanks,
> > >>>>>>>>>>>> Zhu
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Hang Ruan <ru...@gmail.com> 于2022年12月23日周五
> > >>>>>>>>>>>> 15:21写道：
> > >>>>>>>>>>>>> Hi everyone,
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> As for the Zhu Zhu's problem, I think we should keep the
> > common
> > >>>>>>>>>>> metrics,
> > >>>>>>>>>>>> which will help to observe incoming and outgoing events.
> What
> > do
> > >>>>>>>> you
> > >>>>>>>>>>> think,
> > >>>>>>>>>>>> @Zhu Zhu ?
> > >>>>>>>>>>>>> And @Chesnay, are there any other issues you are more
> > concerned
> > >>>>>>>>>> about?
> > >>>>>>>>>>>> Looking forward to your reply.
> > >>>>>>>>>>>>> Thanks for all the comments. If there are no further
> > >>>>>>>>>>>>> comments, we
> > >>>>>>>>>> plan
> > >>>>>>>>>>>> to start the voting thread next week.
> > >>>>>>>>>>>>> Best,
> > >>>>>>>>>>>>> Hang
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> Hang Ruan <ru...@gmail.com> 于2022年12月15日周四
> > >>>>>>>>>>>>> 16:49写道：
> > >>>>>>>>>>>>>> Hi, Zhu Zhu,
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Thanks for your feedback!
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> The OperatorCoordinator implementations are different. And
> > >>>>>>>>>>>>>> their
> > >>>>>>>>>>>> metrics are much different too. We try to find the common
> > >>>>>>>>>>>> metrics
> > >>>>>>>> and
> > >>>>>>>>>> put
> > >>>>>>>>>>>> them in the OperatorCoordinatorMetricGroup. If most
> developers
> > >>>>>>>> think
> > >>>>>>>>> we
> > >>>>>>>>>>> do
> > >>>>>>>>>>>> not need these common metrics, removing the common metrics
> is
> > >>>>>>>>>>>> also
> > >>>>>>>>>>>> acceptable.
> > >>>>>>>>>>>>>> Best,
> > >>>>>>>>>>>>>> Hang
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Zhu Zhu <re...@gmail.com> 于2022年12月14日周三 22:09写道：
> > >>>>>>>>>>>>>>> Hi Hang & MengYue,
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> Thanks for creating this FLIP!
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> I think it is very useful, mainly in two aspects:
> > >>>>>>>>>>>>>>> 1. Enables OperatorCoordinators to register metrics.
> > >>>>>>>>>>>>>>> Currently
> > >>>>>>>>>>>>>>> the coordinators has no way to do this. And operator
> > >>>>>>>> coordinator
> > >>>>>>>>>>>>>>> metric group further enables the SplitEnumerator to have
> > >>>>>>>>>>>>>>> access
> > >>>>>>>>>>>>>>> to a registered metric group (via the existing public
> > >>>>>>>>>>>>>>> interface
> > >>>>>>>>>>>>>>> SplitEnumeratorContext#metricGroup()), which is null at
> the
> > >>>>>>>>> moment.
> > >>>>>>>>>>>>>>> 2. Defines the scope of operator coordinator metrics. A
> > clear
> > >>>>>>>>>>>> definition
> > >>>>>>>>>>>>>>> makes it easy for users to find their wanted metrics. The
> > >>>>>>>>>> definition
> > >>>>>>>>>>>>>>> also helps to avoid conflicts of metrics from multiple
> > >>>>>>>>>>>> OperatorCoordinators
> > >>>>>>>>>>>>>>> of the same kind. E.g. each SourceCoordinator may have
> its
> > >>>>>>>>>>>>>>> own
> > >>>>>>>>>>>>>>> numSourceSplits metric, these metrics should not be
> > directly
> > >>>>>>>>>>> registered
> > >>>>>>>>>>>>>>> to the job metric group.
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> What I'm a bit concerned is the necessity of the
> introduced
> > >>>>>>>>> common
> > >>>>>>>>>>>> metrics
> > >>>>>>>>>>>>>>> numEventsInCounter & numEventsOutCounter. If there any
> case
> > >>>>>>>> which
> > >>>>>>>>>>>> strongly
> > >>>>>>>>>>>>>>> requires them?
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> Regarding the concerns of Chesnay,
> > >>>>>>>>>>>>>>>> A dedicated coordinator MG implementation is overkill
> > >>>>>>>>>>>>>>> Directly using the job metric group can result in metric
> > >>>>>>>>> conflicts,
> > >>>>>>>>>>> as
> > >>>>>>>>>>>> mentioned
> > >>>>>>>>>>>>>>> in above #2.
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> Thanks,
> > >>>>>>>>>>>>>>> Zhu
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> Dong Lin <li...@gmail.com> 于2022年12月10日周六
> > >>>>>>>>>>>>>>> 14:16写道：
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> Hi Chesney,
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> Just to double check with you,
> > >>>>>>>>>>>>>>>> OperatorCoordinatorMetricGroup
> > >>>>>>>>>>>> (annotated as
> > >>>>>>>>>>>>>>>> @PublicEvolving) has already been introduced into Flink
> by
> > >>>>>>>>>> FLIP-179
> > >>>>>>>>>>>>>>>> <
> > >>
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-179%3A+Expose+Standardized+Operator+Metrics
> > >>>>>>>>>>>>> .
> > >>>>>>>>>>>>>>>> And that FLIP has got you +1.. Do you mean we should
> > remove
> > >>>>>>>>> this
> > >>>>>>>>>>>>>>>> OperatorCoordinatorMetricGroup?
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> Regards,
> > >>>>>>>>>>>>>>>> Dong
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> On Sat, Dec 10, 2022 at 1:33 AM Chesnay Schepler <
> > >>>>>>>>>>> chesnay@apache.org>
> > >>>>>>>>>>>> wrote:
> > >>>>>>>>>>>>>>>>> As a whole I feel like this FLIP is overly
> complicated. A
> > >>>>>>>>>>> dedicated
> > >>>>>>>>>>>>>>>>> coordinator MG implementation is overkill; it could
> just
> > >>>>>>>>> re-use
> > >>>>>>>>>>> the
> > >>>>>>>>>>>>>>>>> existing Task/OperatorMGs to create the same structure
> we
> > >>>>>>>>> have
> > >>>>>>>>>> on
> > >>>>>>>>>>>> TMs,
> > >>>>>>>>>>>>>>>>> similar to what we did with the Job MG.
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> However, I'm not convinced that this is required
> anyway,
> > >>>>>>>>>> because
> > >>>>>>>>>>>> all the
> > >>>>>>>>>>>>>>>>> example metrics you listed can be implemented on the TM
> > >>>>>>>> side
> > >>>>>>>>> +
> > >>>>>>>>>>>>>>>>> aggregating them in the external metrics backend.
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> Since I'm on holidays soon, just so no one tries to
> pull
> > a
> > >>>>>>>>> fast
> > >>>>>>>>>>>> one on
> > >>>>>>>>>>>>>>>>> me, if this were to go to a vote as-is I'd be against
> it.
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> On 09/12/2022 15:30, Dong Lin wrote:
> > >>>>>>>>>>>>>>>>>> Hi Hang,
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> Thanks for the FLIP! The FLIP looks good and it is
> > pretty
> > >>>>>>>>>>>> informative.
> > >>>>>>>>>>>>>>>>>> I have just two minor comments regarding names:
> > >>>>>>>>>>>>>>>>>> - Would it be useful to rename the config key as
> > >>>>>>>>>>>>>>>>>> *metrics.scope.jm.job.operator-coordinator* for
> > >>>>>>>> consistency
> > >>>>>>>>>>> with
> > >>>>>>>>>>>>>>>>>> *metrics.scope.jm.job
> > >>>>>>>>>>>>>>>>>> *(which is not named as *jm-job)?
> > >>>>>>>>>>>>>>>>>> - Maybe rename the variable as
> > >>>>>>>>>>> SCOPE_NAMING_OPERATOR_COORDINATOR
> > >>>>>>>>>>>> for
> > >>>>>>>>>>>>>>>>>> simplicity and consistency with SCOPE_NAMING_OPERATOR
> > >>>>>>>>> (which
> > >>>>>>>>>> is
> > >>>>>>>>>>>> not named
> > >>>>>>>>>>>>>>>>>> as SCOPE_NAMING_TM_JOB_OPERATOR)?
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> Cheers,
> > >>>>>>>>>>>>>>>>>> Dong
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> On Thu, Dec 8, 2022 at 3:28 PM Hang Ruan <
> > >>>>>>>>>>> ruanhang1993@gmail.com>
> > >>>>>>>>>>>> wrote:
> > >>>>>>>>>>>>>>>>>>> Hi all,
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> MengYue and I created FLIP-274[1] Introduce metric
> > group
> > >>>>>>>>> for
> > >>>>>>>>>>>>>>>>>>> OperatorCoordinator. OperatorCoordinator is the
> > >>>>>>>>> coordinator
> > >>>>>>>>>>> for
> > >>>>>>>>>>>> runtime
> > >>>>>>>>>>>>>>>>>>> operators and running on Job Manager. The
> coordination
> > >>>>>>>>>>>> mechanism is
> > >>>>>>>>>>>>>>>>>>> operator events between OperatorCoordinator and its
> all
> > >>>>>>>>>>>> operators, the
> > >>>>>>>>>>>>>>>>>>> coordination is more and more using in Flink, for
> > >>>>>>>> example
> > >>>>>>>>>> many
> > >>>>>>>>>>>> Sources
> > >>>>>>>>>>>>>>>>> and
> > >>>>>>>>>>>>>>>>>>> Sinks depend on the mechanism to assign splits and
> > >>>>>>>>>> coordinate
> > >>>>>>>>>>>> commits to
> > >>>>>>>>>>>>>>>>>>> external systems. The OperatorCoordinator is widely
> > >>>>>>>> using
> > >>>>>>>>> in
> > >>>>>>>>>>>> flink kafka
> > >>>>>>>>>>>>>>>>>>> connector, flink pulsar connector, flink cdc
> connector,
> > >>>>>>>>>> flink
> > >>>>>>>>>>>> hudi
> > >>>>>>>>>>>>>>>>>>> connector and so on.
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> But there is not a suitable metric group scope for
> the
> > >>>>>>>>>>>>>>>>> OperatorCoordinator
> > >>>>>>>>>>>>>>>>>>> and not an implementation for the interface
> > >>>>>>>>>>>>>>>>> OperatorCoordinatorMetricGroup.
> > >>>>>>>>>>>>>>>>>>> These metrics in OperatorCoordinator could be how
> many
> > >>>>>>>>>>>> splits/partitions
> > >>>>>>>>>>>>>>>>>>> have been assigned to source readers, how many files
> > >>>>>>>> have
> > >>>>>>>>>> been
> > >>>>>>>>>>>> written
> > >>>>>>>>>>>>>>>>> out
> > >>>>>>>>>>>>>>>>>>> by sink writers, these metrics not only help users to
> > >>>>>>>> know
> > >>>>>>>>>> the
> > >>>>>>>>>>>> job
> > >>>>>>>>>>>>>>>>> progress
> > >>>>>>>>>>>>>>>>>>> but also make big job maintaining easier. Thus we
> > >>>>>>>> propose
> > >>>>>>>>>> the
> > >>>>>>>>>>>> FLIP-274
> > >>>>>>>>>>>>>>>>> to
> > >>>>>>>>>>>>>>>>>>> introduce a new metric group scope for
> > >>>>>>>> OperatorCoordinator
> > >>>>>>>>>> and
> > >>>>>>>>>>>> provide
> > >>>>>>>>>>>>>>>>> an
> > >>>>>>>>>>>>>>>>>>> internal implementation for
> > >>>>>>>>> OperatorCoordinatorMetricGroup.
> > >>>>>>>>>>>>>>>>>>> Could you help review this FLIP when you get time?
> Any
> > >>>>>>>>>>> feedback
> > >>>>>>>>>>>> is
> > >>>>>>>>>>>>>>>>>>> appreciated!
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> Best,
> > >>>>>>>>>>>>>>>>>>> Hang
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> [1]
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>
> > >>
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-274%3A+Introduce+metric+group+for+OperatorCoordinator
> > >>
> >
> >
>

Re: [DISCUSS] FLIP-274 : Introduce metric group for OperatorCoordinator

Posted by Hang Ruan <ru...@gmail.com>.

Hi, chesnay,

Thanks for your reply. I still have some doubts about the questions you
raised.

> Extending the set of ScopeFormats is problematic because it in practice
> it breaks the config if users actively rely on it, since there's now
> another key that they _must_ set for it to be consistent/compatible with
> their existing setup.
> Unfortunately due to how powerful scope formats are we can't derive a
> default value that matches their existing setup.
> Hence we should try to do this as rarely as possible.


> > This FLIP does not adhere to that since it proposes a dedicated format
> for coordinators; next time we want to expose operator-specific metrics
> (e.g., in the scheduler) we'd have to add another one to support it.


First, I do not understand why users have to configure the new scope
format, which has a default value. I usually do not need to change these
configuration of scope formats when submitting the flink job.
IMO, the OperatorCoordinator is a component running at the JobMaster. I try
to let it extend the ComponentMetricGroup, but I cannot find a suitable
scope. This make me decide to add a new scope format.

> Additionally, the configurable variables (operator name/id) are
> logically not attached to the coordinator, but operators, so to me it
> just doesn't make sense to structure it like this.


The relationship between the Operator and OperatorCoordinator is maintained
in the OperatorCoordinatorHolder. In the POC, the operator name/id could be
found and the OperatorCoordinatorMetricGroup will be created here.

> Another thing I'm concerned about is that, because we don't include
> tasks in the hierarchy, users wishing to collect all metrics for a
> particular task (in this case ==vertex) now have to go significantly out
> of their way to get them, since they can no longer just filter by the
> task ID but have to be filter for _all_ operators that are part of the
> task.


The registered metrics of OperatorCoordinator always are those which can
not be aggregated from the tasks, like the number of unassigned splits in
the SourceCoordinator. Actually we have not encountered the scenario you
mentioned. I think the OperatorCoordinatorMetricGroup should only contain
the metrics for itself instead of its subtasks.

Using metrics.scope.jm.job is not enough to distinguish the different
coordinators. The operator id/name is necessary. Then the implementation
should be like this. But users can not change the scope in this way. This
is also acceptable.

@Internal
public class InternalOperatorCoordinatorMetricGroup
        extends ProxyMetricGroup<MetricGroup>
        implements OperatorCoordinatorMetricGroup {
    public InternalOperatorCoordinatorMetricGroup(
            JobManagerJobMetricGroup parent,
            OperatorID operatorID,
            String operatorName) {
        super(parent.addGroup(operatorID +
operatorName).addGroup("coordinator"));
    }
}

So there are two choice for the InternalOperatorCoordinatorMetricGroup: 1.
add and improve the new scope format; 2. use the metrics.scope.jm.job and
ProxyMetricGroup. Which one is better?

Best,
Hang


Chesnay Schepler <ch...@apache.org> 于2023年1月18日周三 17:03写道：

> You're misunderstanding the problem.
>
> Metric groups form a tree with each group providing certain metadata.
>
> E.g., on the taskmanager we have a TM metric group that provides info
> about the TM, that has child task metric groups, that have child
> operator metric groups etc. The operator metric group again has
> children, where we are now mostly in the user/connector land, like a
> kafka metric group providing partition info or something.
>
> What is being proposed is to create a coordinator metric group on the JM
> side that provides operator metadata.
>
> A more appropriate structure is to create an operator group on the JM
> side that provides this info, with the coordinator metric group being a
> child.
>
> On 17/01/2023 19:56, Steven Wu wrote:
> >> Additionally, the configurable variables (operator name/id) are
> > logically not attached to the coordinator, but operators, so to me it
> > just doesn't make sense to structure it like this.
> >
> > Chesnay, maybe we should clarify the terminology. To me, pperators (like
> > FLIP-27 source) can have two parts (coordinator and reader/subtask). I
> > think it is fine to include operator name/id for coordinator metrics.
> >
> > On Mon, Jan 16, 2023 at 2:13 AM Chesnay Schepler <ch...@apache.org>
> wrote:
> >
> >> Slight correction: Using metrics.scope.jm.job as the default should be
> >> safe.
> >>
> >> On 16/01/2023 10:18, Chesnay Schepler wrote:
> >>> The proposed ScopeFormat is still problematic for a few reasons.
> >>>
> >>> Extending the set of ScopeFormats is problematic because it in
> >>> practice it breaks the config if users actively rely on it, since
> >>> there's now another key that they _must_ set for it to be
> >>> consistent/compatible with their existing setup.
> >>> Unfortunately due to how powerful scope formats are we can't derive a
> >>> default value that matches their existing setup.
> >>> Hence we should try to do this as rarely as possible.
> >>>
> >>> This FLIP does not adhere to that since it proposes a dedicated format
> >>> for coordinators; next time we want to expose operator-specific
> >>> metrics (e.g., in the scheduler) we'd have to add another one to
> >>> support it.
> >>>
> >>> Additionally, the configurable variables (operator name/id) are
> >>> logically not attached to the coordinator, but operators, so to me it
> >>> just doesn't make sense to structure it like this.
> >>>
> >>> Another thing I'm concerned about is that, because we don't include
> >>> tasks in the hierarchy, users wishing to collect all metrics for a
> >>> particular task (in this case ==vertex) now have to go significantly
> >>> out of their way to get them, since they can no longer just filter by
> >>> the task ID but have to be filter for _all_ operators that are part of
> >>> the task.
> >>>
> >>> On 16/01/2023 03:09, Hang Ruan wrote:
> >>>> Hi, @chesnay@apache.org <ch...@apache.org> ,
> >>>>
> >>>> Do you have time to help to review this FLIP again after the
> >>>> modification?
> >>>> Looking forward to your reply.
> >>>> This FLIP will add a new configuration for the
> >>>> OperatorCoordinatorMetricGroup scope format. It provides an internal
> >>>> implementation and is added as a component to the
> >>>> JobManagerJobMetricGroup.
> >>>> If something doesn't make sense, could you provide some advice? It
> >>>> will be
> >>>> very helpful. Thanks a lot for your help.
> >>>>
> >>>> Best,
> >>>> Hang
> >>>>
> >>>> Martijn Visser <ma...@apache.org> 于2023年1月11日周三 16:34写道：
> >>>>
> >>>>
> >>>>> Hi Hang,
> >>>>>
> >>>>> I'm a bit surprised that this has gone to a vote, given that Chesnay
> >>>>> deliberately mentioned that he would vote against it as-is. I would
> >>>>> expect
> >>>>> that before going to a vote, he has had the opportunity to
> >>>>> participate in
> >>>>> this discussion.
> >>>>>
> >>>>> Best regards,
> >>>>>
> >>>>> Martijn
> >>>>>
> >>>>> On Tue, Jan 3, 2023 at 12:53 PM Jark Wu <im...@gmail.com> wrote:
> >>>>>
> >>>>>> Hi Dong,
> >>>>>>
> >>>>>> Regarding “SplitEnumeratorContext#metricGroup”, my only concern is
> >>>>>> that
> >>>>>> this is a core interface for the FLIP. It’s hard to tell how
> >>>>>> sources use
> >>>>>> metric group without mentioning this interface. Even if this is an
> >>>>> existing
> >>>>>> API, I think it’s worth introducing the interface again and
> declaring
> >>>>> that
> >>>>>> we will implement the interface instead of a no-op method in this
> >>>>>> FLIP.
> >>>>>>
> >>>>>> Anyway, this is a minor problem and shouldn’t block this FLIP. I’m
> >>>>>> +1 to
> >>>>>> start a vote.
> >>>>>>
> >>>>>> Best,
> >>>>>> Jark
> >>>>>>
> >>>>>>
> >>>>>>> 2023年1月3日 10:03，Hang Ruan <ru...@gmail.com> 写道：
> >>>>>>>
> >>>>>>> Hi, Jark and Dong,
> >>>>>>>
> >>>>>>> Thanks for your comments. Sorry for my late reply.
> >>>>>>>
> >>>>>>> For suggestion 1, I plan to implement the
> >>>>>>> SplitEnumeratorMetricGroup in
> >>>>>>> another issue, and it is not contained in this FLIP. I will add
> some
> >>>>>>> description about this part.
> >>>>>>> For suggestion 2, changes about OperatorCoordinator#metricGroup has
> >>>>>> already
> >>>>>>> been documented in the proposed change section.
> >>>>>>>
> >>>>>>> Best,
> >>>>>>> Hang
> >>>>>>>
> >>>>>>> Dong Lin <li...@gmail.com> 于2023年1月1日周日 09:45写道：
> >>>>>>>
> >>>>>>>> Let me chime-in and add comments regarding the public interface
> >>>>> section.
> >>>>>>>> Please see my comments inline.
> >>>>>>>>
> >>>>>>>> On Thu, Dec 29, 2022 at 6:08 PM Jark Wu <im...@gmail.com> wrote:
> >>>>>>>>
> >>>>>>>>> Hi Hang,
> >>>>>>>>>
> >>>>>>>>> Thanks for driving this discussion. I think this is a very useful
> >>>>>> feature
> >>>>>>>>> for connectors.
> >>>>>>>>>
> >>>>>>>>> The FLIP looks quite good to me, and I just have two suggestions.
> >>>>>>>>>
> >>>>>>>>> 1. In the "Public Interface" section, mention that the
> >>>>>>>>> implementation
> >>>>>>>>> behavior of "SplitEnumeratorContext#metricGroup" is changed from
> >>>>>>>> returning
> >>>>>>>>> null to returning a concrete SplitEnumeratorMetricGroup instance.
> >>>>> Even
> >>>>>>>>> though the API is already there, the behavior change can also be
> >>>>>>>> considered
> >>>>>>>>> a public change.
> >>>>>>>>>
> >>>>>>>> SplitEnumeratorContext#metricGroup is an interface and this FLIP
> >>>>>>>> does
> >>>>>> not
> >>>>>>>> seem to change its semantics/behavior. The FLIP does change the
> >>>>>>>> implementation/behavior of SourceCoordinatorContext#metricGroup,
> >>>>>>>> which
> >>>>>> is
> >>>>>>>> marked @Internal.
> >>>>>>>>
> >>>>>>>> Thus it might seem a bit weird to add in the public interface
> >>>>>>>> section
> >>>>>>>> saying that we change the interface
> >>>>>>>> SplitEnumeratorContext#metricGroup
> >>>>>> from
> >>>>>>>> returning null to non-null object.
> >>>>>>>>
> >>>>>>>> 2. Mention the newly added interface of
> >>>>>> "OperatorCoordinator#metricGroup"
> >>>>>>>>> in the "Proposed Changes" section or "Public Interface" section.
> As
> >>>>> the
> >>>>>>>>> FLIP said, OperatorCoordinator is widely used in many connectors.
> >>>>>> Though
> >>>>>>>> it
> >>>>>>>>> is still an @Internal API, I think it is worth mentioning the
> >>>>>>>>> change
> >>>>> in
> >>>>>>>> the
> >>>>>>>>> FLIP.
> >>>>>>>> Since OperatorCoordinator is an internal API, it seems reasonable
> to
> >>>>>>>> explain it in the proposed change section. The FLIP seems to have
> >>>>>>>> documented this in point 5 of the proposed change section.
> >>>>>>>>
> >>>>>>>> BTW, if we think there are @internal classes that are important
> >>>>>>>> enough
> >>>>>> to
> >>>>>>>> be added in the public interface section, it might be useful to
> >>>>>> explicitly
> >>>>>>>> discuss this topic and document it in the "*What are the "public
> >>>>>>>> interfaces" of the project*" in this
> >>>>>>>> <
> >>>>>>>>
> >>
> https://cwiki.apache.org/confluence/display/FLINK/Flink+Improvement+Proposals
> >>>>>>>> wiki.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>> Best,
> >>>>>>>>> Jark
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> On Mon, 26 Dec 2022 at 18:06, Hang Ruan <ru...@gmail.com>
> >>>>>> wrote:
> >>>>>>>>>> Hi, thanks for the feedback, Zhu Zhu and Qingsheng.
> >>>>>>>>>> After combining everyone's comments, the main concerns and
> >>>>>>>> corresponding
> >>>>>>>>>> adjustments are as follows.
> >>>>>>>>>>
> >>>>>>>>>> Q1: Common metrics are not quite useful.
> >>>>>>>>>> numEventsIn and numEventsOut counters will be removed from the
> >>>>>>>>>> OperatorCoordinatorMetricGroup. These common metrics do not
> >>>>>>>>>> provide
> >>>>>>>>> enough
> >>>>>>>>>> information for users. The users are more willing to get the
> >>>>>>>>>> number
> >>>>> of
> >>>>>>>>>> events of the specified type instead of the total number. And
> this
> >>>>>>>> metric
> >>>>>>>>>> is calculated differently. The implementation could register the
> >>>>>> metric
> >>>>>>>>> by
> >>>>>>>>>> themselves.
> >>>>>>>>>>
> >>>>>>>>>> Q2: This FLIP is overly complicated.
> >>>>>>>>>> This FLIP will become concise after these modifications.
> >>>>>>>>>> OperatorCoordinatorMetricGroup has already been introduced into
> >>>>> Flink
> >>>>>>>> by
> >>>>>>>>>> FLIP-179<
> >>>>>>>>>>
> >>>>>>>>>>
> >>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-179%3A+Expose+Standardized+Operator+Metrics
> >>>>>>>>>>> .
> >>>>>>>>>> And
> >>>>>>>>>> this FLIP will not change it. This FLIP only provides a new
> metric
> >>>>>>>> option
> >>>>>>>>>> and a new metric group scope. The changes in proposed changes
> >>>>> provide
> >>>>>>>> the
> >>>>>>>>>> details about the modifications for the internal classes, which
> >>>>> might
> >>>>>>>>> make
> >>>>>>>>>> it look complicated.
> >>>>>>>>>>
> >>>>>>>>>> Thanks for all the comments again. If there are no further
> >>>>>>>>>> comments,
> >>>>>> we
> >>>>>>>>>> plan to start the voting thread this week.
> >>>>>>>>>>
> >>>>>>>>>> Best,
> >>>>>>>>>> Hang
> >>>>>>>>>>
> >>>>>>>>>> Qingsheng Ren <re...@gmail.com> 于2022年12月26日周一 16:48写道：
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>> Thanks for the FLIP, Hang!
> >>>>>>>>>>>
> >>>>>>>>>>> This FLIP overall looks good to me. Actually I share the same
> >>>>> concern
> >>>>>>>>>> with
> >>>>>>>>>>> Zhu that numEventsIn and numEventsOut counters are not quite
> >>>>>>>>>>> useful
> >>>>>>>> to
> >>>>>>>>>> end
> >>>>>>>>>>> users. OperatorEvent is a quite low-level abstraction, which
> >>>>> requires
> >>>>>>>>>>> instantialization in order to be practical to users and
> >>>>>>>>>>> developers,
> >>>>>>>> so
> >>>>>>>>>>> maybe it's better to exclude them from the FLIP.
> >>>>>>>>>>>
> >>>>>>>>>>> Best,
> >>>>>>>>>>> Qingsheng
> >>>>>>>>>>> Ververica (Alibaba)
> >>>>>>>>>>>
> >>>>>>>>>>> On Mon, Dec 26, 2022 at 12:08 PM Zhu Zhu <re...@gmail.com>
> >>>>> wrote:
> >>>>>>>>>>>> Hi Hang,
> >>>>>>>>>>>> I still see no strong reason why we need
> >>>>>>>>>>>> numEventsIn/numEventsOut
> >>>>>>>>>>> metrics.
> >>>>>>>>>>>> In the discussion in FLINK-29801, I can see the same concern
> >>>>>>>>>>>> from
> >>>>>>>>>> others.
> >>>>>>>>>>>> So I prefer to exclude them from this FLIP to avoid
> >>>>>>>>>>>> over-extending
> >>>>>>>>> the
> >>>>>>>>>>>> scope.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Thanks,
> >>>>>>>>>>>> Zhu
> >>>>>>>>>>>>
> >>>>>>>>>>>> Hang Ruan <ru...@gmail.com> 于2022年12月23日周五
> >>>>>>>>>>>> 15:21写道：
> >>>>>>>>>>>>> Hi everyone,
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> As for the Zhu Zhu's problem, I think we should keep the
> common
> >>>>>>>>>>> metrics,
> >>>>>>>>>>>> which will help to observe incoming and outgoing events. What
> do
> >>>>>>>> you
> >>>>>>>>>>> think,
> >>>>>>>>>>>> @Zhu Zhu ?
> >>>>>>>>>>>>> And @Chesnay, are there any other issues you are more
> concerned
> >>>>>>>>>> about?
> >>>>>>>>>>>> Looking forward to your reply.
> >>>>>>>>>>>>> Thanks for all the comments. If there are no further
> >>>>>>>>>>>>> comments, we
> >>>>>>>>>> plan
> >>>>>>>>>>>> to start the voting thread next week.
> >>>>>>>>>>>>> Best,
> >>>>>>>>>>>>> Hang
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Hang Ruan <ru...@gmail.com> 于2022年12月15日周四
> >>>>>>>>>>>>> 16:49写道：
> >>>>>>>>>>>>>> Hi, Zhu Zhu,
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Thanks for your feedback!
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> The OperatorCoordinator implementations are different. And
> >>>>>>>>>>>>>> their
> >>>>>>>>>>>> metrics are much different too. We try to find the common
> >>>>>>>>>>>> metrics
> >>>>>>>> and
> >>>>>>>>>> put
> >>>>>>>>>>>> them in the OperatorCoordinatorMetricGroup. If most developers
> >>>>>>>> think
> >>>>>>>>> we
> >>>>>>>>>>> do
> >>>>>>>>>>>> not need these common metrics, removing the common metrics is
> >>>>>>>>>>>> also
> >>>>>>>>>>>> acceptable.
> >>>>>>>>>>>>>> Best,
> >>>>>>>>>>>>>> Hang
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Zhu Zhu <re...@gmail.com> 于2022年12月14日周三 22:09写道：
> >>>>>>>>>>>>>>> Hi Hang & MengYue,
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Thanks for creating this FLIP!
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> I think it is very useful, mainly in two aspects:
> >>>>>>>>>>>>>>> 1. Enables OperatorCoordinators to register metrics.
> >>>>>>>>>>>>>>> Currently
> >>>>>>>>>>>>>>> the coordinators has no way to do this. And operator
> >>>>>>>> coordinator
> >>>>>>>>>>>>>>> metric group further enables the SplitEnumerator to have
> >>>>>>>>>>>>>>> access
> >>>>>>>>>>>>>>> to a registered metric group (via the existing public
> >>>>>>>>>>>>>>> interface
> >>>>>>>>>>>>>>> SplitEnumeratorContext#metricGroup()), which is null at the
> >>>>>>>>> moment.
> >>>>>>>>>>>>>>> 2. Defines the scope of operator coordinator metrics. A
> clear
> >>>>>>>>>>>> definition
> >>>>>>>>>>>>>>> makes it easy for users to find their wanted metrics. The
> >>>>>>>>>> definition
> >>>>>>>>>>>>>>> also helps to avoid conflicts of metrics from multiple
> >>>>>>>>>>>> OperatorCoordinators
> >>>>>>>>>>>>>>> of the same kind. E.g. each SourceCoordinator may have its
> >>>>>>>>>>>>>>> own
> >>>>>>>>>>>>>>> numSourceSplits metric, these metrics should not be
> directly
> >>>>>>>>>>> registered
> >>>>>>>>>>>>>>> to the job metric group.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> What I'm a bit concerned is the necessity of the introduced
> >>>>>>>>> common
> >>>>>>>>>>>> metrics
> >>>>>>>>>>>>>>> numEventsInCounter & numEventsOutCounter. If there any case
> >>>>>>>> which
> >>>>>>>>>>>> strongly
> >>>>>>>>>>>>>>> requires them?
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Regarding the concerns of Chesnay,
> >>>>>>>>>>>>>>>> A dedicated coordinator MG implementation is overkill
> >>>>>>>>>>>>>>> Directly using the job metric group can result in metric
> >>>>>>>>> conflicts,
> >>>>>>>>>>> as
> >>>>>>>>>>>> mentioned
> >>>>>>>>>>>>>>> in above #2.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>>> Zhu
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Dong Lin <li...@gmail.com> 于2022年12月10日周六
> >>>>>>>>>>>>>>> 14:16写道：
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Hi Chesney,
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Just to double check with you,
> >>>>>>>>>>>>>>>> OperatorCoordinatorMetricGroup
> >>>>>>>>>>>> (annotated as
> >>>>>>>>>>>>>>>> @PublicEvolving) has already been introduced into Flink by
> >>>>>>>>>> FLIP-179
> >>>>>>>>>>>>>>>> <
> >>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-179%3A+Expose+Standardized+Operator+Metrics
> >>>>>>>>>>>>> .
> >>>>>>>>>>>>>>>> And that FLIP has got you +1.. Do you mean we should
> remove
> >>>>>>>>> this
> >>>>>>>>>>>>>>>> OperatorCoordinatorMetricGroup?
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Regards,
> >>>>>>>>>>>>>>>> Dong
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> On Sat, Dec 10, 2022 at 1:33 AM Chesnay Schepler <
> >>>>>>>>>>> chesnay@apache.org>
> >>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>> As a whole I feel like this FLIP is overly complicated. A
> >>>>>>>>>>> dedicated
> >>>>>>>>>>>>>>>>> coordinator MG implementation is overkill; it could just
> >>>>>>>>> re-use
> >>>>>>>>>>> the
> >>>>>>>>>>>>>>>>> existing Task/OperatorMGs to create the same structure we
> >>>>>>>>> have
> >>>>>>>>>> on
> >>>>>>>>>>>> TMs,
> >>>>>>>>>>>>>>>>> similar to what we did with the Job MG.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> However, I'm not convinced that this is required anyway,
> >>>>>>>>>> because
> >>>>>>>>>>>> all the
> >>>>>>>>>>>>>>>>> example metrics you listed can be implemented on the TM
> >>>>>>>> side
> >>>>>>>>> +
> >>>>>>>>>>>>>>>>> aggregating them in the external metrics backend.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Since I'm on holidays soon, just so no one tries to pull
> a
> >>>>>>>>> fast
> >>>>>>>>>>>> one on
> >>>>>>>>>>>>>>>>> me, if this were to go to a vote as-is I'd be against it.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> On 09/12/2022 15:30, Dong Lin wrote:
> >>>>>>>>>>>>>>>>>> Hi Hang,
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Thanks for the FLIP! The FLIP looks good and it is
> pretty
> >>>>>>>>>>>> informative.
> >>>>>>>>>>>>>>>>>> I have just two minor comments regarding names:
> >>>>>>>>>>>>>>>>>> - Would it be useful to rename the config key as
> >>>>>>>>>>>>>>>>>> *metrics.scope.jm.job.operator-coordinator* for
> >>>>>>>> consistency
> >>>>>>>>>>> with
> >>>>>>>>>>>>>>>>>> *metrics.scope.jm.job
> >>>>>>>>>>>>>>>>>> *(which is not named as *jm-job)?
> >>>>>>>>>>>>>>>>>> - Maybe rename the variable as
> >>>>>>>>>>> SCOPE_NAMING_OPERATOR_COORDINATOR
> >>>>>>>>>>>> for
> >>>>>>>>>>>>>>>>>> simplicity and consistency with SCOPE_NAMING_OPERATOR
> >>>>>>>>> (which
> >>>>>>>>>> is
> >>>>>>>>>>>> not named
> >>>>>>>>>>>>>>>>>> as SCOPE_NAMING_TM_JOB_OPERATOR)?
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Cheers,
> >>>>>>>>>>>>>>>>>> Dong
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> On Thu, Dec 8, 2022 at 3:28 PM Hang Ruan <
> >>>>>>>>>>> ruanhang1993@gmail.com>
> >>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>> Hi all,
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> MengYue and I created FLIP-274[1] Introduce metric
> group
> >>>>>>>>> for
> >>>>>>>>>>>>>>>>>>> OperatorCoordinator. OperatorCoordinator is the
> >>>>>>>>> coordinator
> >>>>>>>>>>> for
> >>>>>>>>>>>> runtime
> >>>>>>>>>>>>>>>>>>> operators and running on Job Manager. The coordination
> >>>>>>>>>>>> mechanism is
> >>>>>>>>>>>>>>>>>>> operator events between OperatorCoordinator and its all
> >>>>>>>>>>>> operators, the
> >>>>>>>>>>>>>>>>>>> coordination is more and more using in Flink, for
> >>>>>>>> example
> >>>>>>>>>> many
> >>>>>>>>>>>> Sources
> >>>>>>>>>>>>>>>>> and
> >>>>>>>>>>>>>>>>>>> Sinks depend on the mechanism to assign splits and
> >>>>>>>>>> coordinate
> >>>>>>>>>>>> commits to
> >>>>>>>>>>>>>>>>>>> external systems. The OperatorCoordinator is widely
> >>>>>>>> using
> >>>>>>>>> in
> >>>>>>>>>>>> flink kafka
> >>>>>>>>>>>>>>>>>>> connector, flink pulsar connector, flink cdc connector,
> >>>>>>>>>> flink
> >>>>>>>>>>>> hudi
> >>>>>>>>>>>>>>>>>>> connector and so on.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> But there is not a suitable metric group scope for the
> >>>>>>>>>>>>>>>>> OperatorCoordinator
> >>>>>>>>>>>>>>>>>>> and not an implementation for the interface
> >>>>>>>>>>>>>>>>> OperatorCoordinatorMetricGroup.
> >>>>>>>>>>>>>>>>>>> These metrics in OperatorCoordinator could be how many
> >>>>>>>>>>>> splits/partitions
> >>>>>>>>>>>>>>>>>>> have been assigned to source readers, how many files
> >>>>>>>> have
> >>>>>>>>>> been
> >>>>>>>>>>>> written
> >>>>>>>>>>>>>>>>> out
> >>>>>>>>>>>>>>>>>>> by sink writers, these metrics not only help users to
> >>>>>>>> know
> >>>>>>>>>> the
> >>>>>>>>>>>> job
> >>>>>>>>>>>>>>>>> progress
> >>>>>>>>>>>>>>>>>>> but also make big job maintaining easier. Thus we
> >>>>>>>> propose
> >>>>>>>>>> the
> >>>>>>>>>>>> FLIP-274
> >>>>>>>>>>>>>>>>> to
> >>>>>>>>>>>>>>>>>>> introduce a new metric group scope for
> >>>>>>>> OperatorCoordinator
> >>>>>>>>>> and
> >>>>>>>>>>>> provide
> >>>>>>>>>>>>>>>>> an
> >>>>>>>>>>>>>>>>>>> internal implementation for
> >>>>>>>>> OperatorCoordinatorMetricGroup.
> >>>>>>>>>>>>>>>>>>> Could you help review this FLIP when you get time? Any
> >>>>>>>>>>> feedback
> >>>>>>>>>>>> is
> >>>>>>>>>>>>>>>>>>> appreciated!
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Best,
> >>>>>>>>>>>>>>>>>>> Hang
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> [1]
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>
> >>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-274%3A+Introduce+metric+group+for+OperatorCoordinator
> >>
>
>

Re: [DISCUSS] FLIP-274 : Introduce metric group for OperatorCoordinator

Posted by Chesnay Schepler <ch...@apache.org>.

You're misunderstanding the problem.

Metric groups form a tree with each group providing certain metadata.

E.g., on the taskmanager we have a TM metric group that provides info 
about the TM, that has child task metric groups, that have child 
operator metric groups etc. The operator metric group again has 
children, where we are now mostly in the user/connector land, like a 
kafka metric group providing partition info or something.

What is being proposed is to create a coordinator metric group on the JM 
side that provides operator metadata.

A more appropriate structure is to create an operator group on the JM 
side that provides this info, with the coordinator metric group being a 
child.

On 17/01/2023 19:56, Steven Wu wrote:
>> Additionally, the configurable variables (operator name/id) are
> logically not attached to the coordinator, but operators, so to me it
> just doesn't make sense to structure it like this.
>
> Chesnay, maybe we should clarify the terminology. To me, pperators (like
> FLIP-27 source) can have two parts (coordinator and reader/subtask). I
> think it is fine to include operator name/id for coordinator metrics.
>
> On Mon, Jan 16, 2023 at 2:13 AM Chesnay Schepler <ch...@apache.org> wrote:
>
>> Slight correction: Using metrics.scope.jm.job as the default should be
>> safe.
>>
>> On 16/01/2023 10:18, Chesnay Schepler wrote:
>>> The proposed ScopeFormat is still problematic for a few reasons.
>>>
>>> Extending the set of ScopeFormats is problematic because it in
>>> practice it breaks the config if users actively rely on it, since
>>> there's now another key that they _must_ set for it to be
>>> consistent/compatible with their existing setup.
>>> Unfortunately due to how powerful scope formats are we can't derive a
>>> default value that matches their existing setup.
>>> Hence we should try to do this as rarely as possible.
>>>
>>> This FLIP does not adhere to that since it proposes a dedicated format
>>> for coordinators; next time we want to expose operator-specific
>>> metrics (e.g., in the scheduler) we'd have to add another one to
>>> support it.
>>>
>>> Additionally, the configurable variables (operator name/id) are
>>> logically not attached to the coordinator, but operators, so to me it
>>> just doesn't make sense to structure it like this.
>>>
>>> Another thing I'm concerned about is that, because we don't include
>>> tasks in the hierarchy, users wishing to collect all metrics for a
>>> particular task (in this case ==vertex) now have to go significantly
>>> out of their way to get them, since they can no longer just filter by
>>> the task ID but have to be filter for _all_ operators that are part of
>>> the task.
>>>
>>> On 16/01/2023 03:09, Hang Ruan wrote:
>>>> Hi, @chesnay@apache.org <ch...@apache.org> ,
>>>>
>>>> Do you have time to help to review this FLIP again after the
>>>> modification?
>>>> Looking forward to your reply.
>>>> This FLIP will add a new configuration for the
>>>> OperatorCoordinatorMetricGroup scope format. It provides an internal
>>>> implementation and is added as a component to the
>>>> JobManagerJobMetricGroup.
>>>> If something doesn't make sense, could you provide some advice? It
>>>> will be
>>>> very helpful. Thanks a lot for your help.
>>>>
>>>> Best,
>>>> Hang
>>>>
>>>> Martijn Visser <ma...@apache.org> 于2023年1月11日周三 16:34写道：
>>>>
>>>>
>>>>> Hi Hang,
>>>>>
>>>>> I'm a bit surprised that this has gone to a vote, given that Chesnay
>>>>> deliberately mentioned that he would vote against it as-is. I would
>>>>> expect
>>>>> that before going to a vote, he has had the opportunity to
>>>>> participate in
>>>>> this discussion.
>>>>>
>>>>> Best regards,
>>>>>
>>>>> Martijn
>>>>>
>>>>> On Tue, Jan 3, 2023 at 12:53 PM Jark Wu <im...@gmail.com> wrote:
>>>>>
>>>>>> Hi Dong,
>>>>>>
>>>>>> Regarding “SplitEnumeratorContext#metricGroup”, my only concern is
>>>>>> that
>>>>>> this is a core interface for the FLIP. It’s hard to tell how
>>>>>> sources use
>>>>>> metric group without mentioning this interface. Even if this is an
>>>>> existing
>>>>>> API, I think it’s worth introducing the interface again and declaring
>>>>> that
>>>>>> we will implement the interface instead of a no-op method in this
>>>>>> FLIP.
>>>>>>
>>>>>> Anyway, this is a minor problem and shouldn’t block this FLIP. I’m
>>>>>> +1 to
>>>>>> start a vote.
>>>>>>
>>>>>> Best,
>>>>>> Jark
>>>>>>
>>>>>>
>>>>>>> 2023年1月3日 10:03，Hang Ruan <ru...@gmail.com> 写道：
>>>>>>>
>>>>>>> Hi, Jark and Dong,
>>>>>>>
>>>>>>> Thanks for your comments. Sorry for my late reply.
>>>>>>>
>>>>>>> For suggestion 1, I plan to implement the
>>>>>>> SplitEnumeratorMetricGroup in
>>>>>>> another issue, and it is not contained in this FLIP. I will add some
>>>>>>> description about this part.
>>>>>>> For suggestion 2, changes about OperatorCoordinator#metricGroup has
>>>>>> already
>>>>>>> been documented in the proposed change section.
>>>>>>>
>>>>>>> Best,
>>>>>>> Hang
>>>>>>>
>>>>>>> Dong Lin <li...@gmail.com> 于2023年1月1日周日 09:45写道：
>>>>>>>
>>>>>>>> Let me chime-in and add comments regarding the public interface
>>>>> section.
>>>>>>>> Please see my comments inline.
>>>>>>>>
>>>>>>>> On Thu, Dec 29, 2022 at 6:08 PM Jark Wu <im...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi Hang,
>>>>>>>>>
>>>>>>>>> Thanks for driving this discussion. I think this is a very useful
>>>>>> feature
>>>>>>>>> for connectors.
>>>>>>>>>
>>>>>>>>> The FLIP looks quite good to me, and I just have two suggestions.
>>>>>>>>>
>>>>>>>>> 1. In the "Public Interface" section, mention that the
>>>>>>>>> implementation
>>>>>>>>> behavior of "SplitEnumeratorContext#metricGroup" is changed from
>>>>>>>> returning
>>>>>>>>> null to returning a concrete SplitEnumeratorMetricGroup instance.
>>>>> Even
>>>>>>>>> though the API is already there, the behavior change can also be
>>>>>>>> considered
>>>>>>>>> a public change.
>>>>>>>>>
>>>>>>>> SplitEnumeratorContext#metricGroup is an interface and this FLIP
>>>>>>>> does
>>>>>> not
>>>>>>>> seem to change its semantics/behavior. The FLIP does change the
>>>>>>>> implementation/behavior of SourceCoordinatorContext#metricGroup,
>>>>>>>> which
>>>>>> is
>>>>>>>> marked @Internal.
>>>>>>>>
>>>>>>>> Thus it might seem a bit weird to add in the public interface
>>>>>>>> section
>>>>>>>> saying that we change the interface
>>>>>>>> SplitEnumeratorContext#metricGroup
>>>>>> from
>>>>>>>> returning null to non-null object.
>>>>>>>>
>>>>>>>> 2. Mention the newly added interface of
>>>>>> "OperatorCoordinator#metricGroup"
>>>>>>>>> in the "Proposed Changes" section or "Public Interface" section. As
>>>>> the
>>>>>>>>> FLIP said, OperatorCoordinator is widely used in many connectors.
>>>>>> Though
>>>>>>>> it
>>>>>>>>> is still an @Internal API, I think it is worth mentioning the
>>>>>>>>> change
>>>>> in
>>>>>>>> the
>>>>>>>>> FLIP.
>>>>>>>> Since OperatorCoordinator is an internal API, it seems reasonable to
>>>>>>>> explain it in the proposed change section. The FLIP seems to have
>>>>>>>> documented this in point 5 of the proposed change section.
>>>>>>>>
>>>>>>>> BTW, if we think there are @internal classes that are important
>>>>>>>> enough
>>>>>> to
>>>>>>>> be added in the public interface section, it might be useful to
>>>>>> explicitly
>>>>>>>> discuss this topic and document it in the "*What are the "public
>>>>>>>> interfaces" of the project*" in this
>>>>>>>> <
>>>>>>>>
>> https://cwiki.apache.org/confluence/display/FLINK/Flink+Improvement+Proposals
>>>>>>>> wiki.
>>>>>>>>
>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>> Jark
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Mon, 26 Dec 2022 at 18:06, Hang Ruan <ru...@gmail.com>
>>>>>> wrote:
>>>>>>>>>> Hi, thanks for the feedback, Zhu Zhu and Qingsheng.
>>>>>>>>>> After combining everyone's comments, the main concerns and
>>>>>>>> corresponding
>>>>>>>>>> adjustments are as follows.
>>>>>>>>>>
>>>>>>>>>> Q1: Common metrics are not quite useful.
>>>>>>>>>> numEventsIn and numEventsOut counters will be removed from the
>>>>>>>>>> OperatorCoordinatorMetricGroup. These common metrics do not
>>>>>>>>>> provide
>>>>>>>>> enough
>>>>>>>>>> information for users. The users are more willing to get the
>>>>>>>>>> number
>>>>> of
>>>>>>>>>> events of the specified type instead of the total number. And this
>>>>>>>> metric
>>>>>>>>>> is calculated differently. The implementation could register the
>>>>>> metric
>>>>>>>>> by
>>>>>>>>>> themselves.
>>>>>>>>>>
>>>>>>>>>> Q2: This FLIP is overly complicated.
>>>>>>>>>> This FLIP will become concise after these modifications.
>>>>>>>>>> OperatorCoordinatorMetricGroup has already been introduced into
>>>>> Flink
>>>>>>>> by
>>>>>>>>>> FLIP-179<
>>>>>>>>>>
>>>>>>>>>>
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-179%3A+Expose+Standardized+Operator+Metrics
>>>>>>>>>>> .
>>>>>>>>>> And
>>>>>>>>>> this FLIP will not change it. This FLIP only provides a new metric
>>>>>>>> option
>>>>>>>>>> and a new metric group scope. The changes in proposed changes
>>>>> provide
>>>>>>>> the
>>>>>>>>>> details about the modifications for the internal classes, which
>>>>> might
>>>>>>>>> make
>>>>>>>>>> it look complicated.
>>>>>>>>>>
>>>>>>>>>> Thanks for all the comments again. If there are no further
>>>>>>>>>> comments,
>>>>>> we
>>>>>>>>>> plan to start the voting thread this week.
>>>>>>>>>>
>>>>>>>>>> Best,
>>>>>>>>>> Hang
>>>>>>>>>>
>>>>>>>>>> Qingsheng Ren <re...@gmail.com> 于2022年12月26日周一 16:48写道：
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> Thanks for the FLIP, Hang!
>>>>>>>>>>>
>>>>>>>>>>> This FLIP overall looks good to me. Actually I share the same
>>>>> concern
>>>>>>>>>> with
>>>>>>>>>>> Zhu that numEventsIn and numEventsOut counters are not quite
>>>>>>>>>>> useful
>>>>>>>> to
>>>>>>>>>> end
>>>>>>>>>>> users. OperatorEvent is a quite low-level abstraction, which
>>>>> requires
>>>>>>>>>>> instantialization in order to be practical to users and
>>>>>>>>>>> developers,
>>>>>>>> so
>>>>>>>>>>> maybe it's better to exclude them from the FLIP.
>>>>>>>>>>>
>>>>>>>>>>> Best,
>>>>>>>>>>> Qingsheng
>>>>>>>>>>> Ververica (Alibaba)
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Dec 26, 2022 at 12:08 PM Zhu Zhu <re...@gmail.com>
>>>>> wrote:
>>>>>>>>>>>> Hi Hang,
>>>>>>>>>>>> I still see no strong reason why we need
>>>>>>>>>>>> numEventsIn/numEventsOut
>>>>>>>>>>> metrics.
>>>>>>>>>>>> In the discussion in FLINK-29801, I can see the same concern
>>>>>>>>>>>> from
>>>>>>>>>> others.
>>>>>>>>>>>> So I prefer to exclude them from this FLIP to avoid
>>>>>>>>>>>> over-extending
>>>>>>>>> the
>>>>>>>>>>>> scope.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Zhu
>>>>>>>>>>>>
>>>>>>>>>>>> Hang Ruan <ru...@gmail.com> 于2022年12月23日周五
>>>>>>>>>>>> 15:21写道：
>>>>>>>>>>>>> Hi everyone,
>>>>>>>>>>>>>
>>>>>>>>>>>>> As for the Zhu Zhu's problem, I think we should keep the common
>>>>>>>>>>> metrics,
>>>>>>>>>>>> which will help to observe incoming and outgoing events. What do
>>>>>>>> you
>>>>>>>>>>> think,
>>>>>>>>>>>> @Zhu Zhu ?
>>>>>>>>>>>>> And @Chesnay, are there any other issues you are more concerned
>>>>>>>>>> about?
>>>>>>>>>>>> Looking forward to your reply.
>>>>>>>>>>>>> Thanks for all the comments. If there are no further
>>>>>>>>>>>>> comments, we
>>>>>>>>>> plan
>>>>>>>>>>>> to start the voting thread next week.
>>>>>>>>>>>>> Best,
>>>>>>>>>>>>> Hang
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hang Ruan <ru...@gmail.com> 于2022年12月15日周四
>>>>>>>>>>>>> 16:49写道：
>>>>>>>>>>>>>> Hi, Zhu Zhu,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks for your feedback!
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The OperatorCoordinator implementations are different. And
>>>>>>>>>>>>>> their
>>>>>>>>>>>> metrics are much different too. We try to find the common
>>>>>>>>>>>> metrics
>>>>>>>> and
>>>>>>>>>> put
>>>>>>>>>>>> them in the OperatorCoordinatorMetricGroup. If most developers
>>>>>>>> think
>>>>>>>>> we
>>>>>>>>>>> do
>>>>>>>>>>>> not need these common metrics, removing the common metrics is
>>>>>>>>>>>> also
>>>>>>>>>>>> acceptable.
>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>> Hang
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Zhu Zhu <re...@gmail.com> 于2022年12月14日周三 22:09写道：
>>>>>>>>>>>>>>> Hi Hang & MengYue,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks for creating this FLIP!
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I think it is very useful, mainly in two aspects:
>>>>>>>>>>>>>>> 1. Enables OperatorCoordinators to register metrics.
>>>>>>>>>>>>>>> Currently
>>>>>>>>>>>>>>> the coordinators has no way to do this. And operator
>>>>>>>> coordinator
>>>>>>>>>>>>>>> metric group further enables the SplitEnumerator to have
>>>>>>>>>>>>>>> access
>>>>>>>>>>>>>>> to a registered metric group (via the existing public
>>>>>>>>>>>>>>> interface
>>>>>>>>>>>>>>> SplitEnumeratorContext#metricGroup()), which is null at the
>>>>>>>>> moment.
>>>>>>>>>>>>>>> 2. Defines the scope of operator coordinator metrics. A clear
>>>>>>>>>>>> definition
>>>>>>>>>>>>>>> makes it easy for users to find their wanted metrics. The
>>>>>>>>>> definition
>>>>>>>>>>>>>>> also helps to avoid conflicts of metrics from multiple
>>>>>>>>>>>> OperatorCoordinators
>>>>>>>>>>>>>>> of the same kind. E.g. each SourceCoordinator may have its
>>>>>>>>>>>>>>> own
>>>>>>>>>>>>>>> numSourceSplits metric, these metrics should not be directly
>>>>>>>>>>> registered
>>>>>>>>>>>>>>> to the job metric group.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> What I'm a bit concerned is the necessity of the introduced
>>>>>>>>> common
>>>>>>>>>>>> metrics
>>>>>>>>>>>>>>> numEventsInCounter & numEventsOutCounter. If there any case
>>>>>>>> which
>>>>>>>>>>>> strongly
>>>>>>>>>>>>>>> requires them?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Regarding the concerns of Chesnay,
>>>>>>>>>>>>>>>> A dedicated coordinator MG implementation is overkill
>>>>>>>>>>>>>>> Directly using the job metric group can result in metric
>>>>>>>>> conflicts,
>>>>>>>>>>> as
>>>>>>>>>>>> mentioned
>>>>>>>>>>>>>>> in above #2.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>> Zhu
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Dong Lin <li...@gmail.com> 于2022年12月10日周六
>>>>>>>>>>>>>>> 14:16写道：
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi Chesney,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Just to double check with you,
>>>>>>>>>>>>>>>> OperatorCoordinatorMetricGroup
>>>>>>>>>>>> (annotated as
>>>>>>>>>>>>>>>> @PublicEvolving) has already been introduced into Flink by
>>>>>>>>>> FLIP-179
>>>>>>>>>>>>>>>> <
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-179%3A+Expose+Standardized+Operator+Metrics
>>>>>>>>>>>>> .
>>>>>>>>>>>>>>>> And that FLIP has got you +1.. Do you mean we should remove
>>>>>>>>> this
>>>>>>>>>>>>>>>> OperatorCoordinatorMetricGroup?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>> Dong
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Sat, Dec 10, 2022 at 1:33 AM Chesnay Schepler <
>>>>>>>>>>> chesnay@apache.org>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>> As a whole I feel like this FLIP is overly complicated. A
>>>>>>>>>>> dedicated
>>>>>>>>>>>>>>>>> coordinator MG implementation is overkill; it could just
>>>>>>>>> re-use
>>>>>>>>>>> the
>>>>>>>>>>>>>>>>> existing Task/OperatorMGs to create the same structure we
>>>>>>>>> have
>>>>>>>>>> on
>>>>>>>>>>>> TMs,
>>>>>>>>>>>>>>>>> similar to what we did with the Job MG.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> However, I'm not convinced that this is required anyway,
>>>>>>>>>> because
>>>>>>>>>>>> all the
>>>>>>>>>>>>>>>>> example metrics you listed can be implemented on the TM
>>>>>>>> side
>>>>>>>>> +
>>>>>>>>>>>>>>>>> aggregating them in the external metrics backend.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Since I'm on holidays soon, just so no one tries to pull a
>>>>>>>>> fast
>>>>>>>>>>>> one on
>>>>>>>>>>>>>>>>> me, if this were to go to a vote as-is I'd be against it.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On 09/12/2022 15:30, Dong Lin wrote:
>>>>>>>>>>>>>>>>>> Hi Hang,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thanks for the FLIP! The FLIP looks good and it is pretty
>>>>>>>>>>>> informative.
>>>>>>>>>>>>>>>>>> I have just two minor comments regarding names:
>>>>>>>>>>>>>>>>>> - Would it be useful to rename the config key as
>>>>>>>>>>>>>>>>>> *metrics.scope.jm.job.operator-coordinator* for
>>>>>>>> consistency
>>>>>>>>>>> with
>>>>>>>>>>>>>>>>>> *metrics.scope.jm.job
>>>>>>>>>>>>>>>>>> *(which is not named as *jm-job)?
>>>>>>>>>>>>>>>>>> - Maybe rename the variable as
>>>>>>>>>>> SCOPE_NAMING_OPERATOR_COORDINATOR
>>>>>>>>>>>> for
>>>>>>>>>>>>>>>>>> simplicity and consistency with SCOPE_NAMING_OPERATOR
>>>>>>>>> (which
>>>>>>>>>> is
>>>>>>>>>>>> not named
>>>>>>>>>>>>>>>>>> as SCOPE_NAMING_TM_JOB_OPERATOR)?
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>>>>> Dong
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Thu, Dec 8, 2022 at 3:28 PM Hang Ruan <
>>>>>>>>>>> ruanhang1993@gmail.com>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> MengYue and I created FLIP-274[1] Introduce metric group
>>>>>>>>> for
>>>>>>>>>>>>>>>>>>> OperatorCoordinator. OperatorCoordinator is the
>>>>>>>>> coordinator
>>>>>>>>>>> for
>>>>>>>>>>>> runtime
>>>>>>>>>>>>>>>>>>> operators and running on Job Manager. The coordination
>>>>>>>>>>>> mechanism is
>>>>>>>>>>>>>>>>>>> operator events between OperatorCoordinator and its all
>>>>>>>>>>>> operators, the
>>>>>>>>>>>>>>>>>>> coordination is more and more using in Flink, for
>>>>>>>> example
>>>>>>>>>> many
>>>>>>>>>>>> Sources
>>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>> Sinks depend on the mechanism to assign splits and
>>>>>>>>>> coordinate
>>>>>>>>>>>> commits to
>>>>>>>>>>>>>>>>>>> external systems. The OperatorCoordinator is widely
>>>>>>>> using
>>>>>>>>> in
>>>>>>>>>>>> flink kafka
>>>>>>>>>>>>>>>>>>> connector, flink pulsar connector, flink cdc connector,
>>>>>>>>>> flink
>>>>>>>>>>>> hudi
>>>>>>>>>>>>>>>>>>> connector and so on.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> But there is not a suitable metric group scope for the
>>>>>>>>>>>>>>>>> OperatorCoordinator
>>>>>>>>>>>>>>>>>>> and not an implementation for the interface
>>>>>>>>>>>>>>>>> OperatorCoordinatorMetricGroup.
>>>>>>>>>>>>>>>>>>> These metrics in OperatorCoordinator could be how many
>>>>>>>>>>>> splits/partitions
>>>>>>>>>>>>>>>>>>> have been assigned to source readers, how many files
>>>>>>>> have
>>>>>>>>>> been
>>>>>>>>>>>> written
>>>>>>>>>>>>>>>>> out
>>>>>>>>>>>>>>>>>>> by sink writers, these metrics not only help users to
>>>>>>>> know
>>>>>>>>>> the
>>>>>>>>>>>> job
>>>>>>>>>>>>>>>>> progress
>>>>>>>>>>>>>>>>>>> but also make big job maintaining easier. Thus we
>>>>>>>> propose
>>>>>>>>>> the
>>>>>>>>>>>> FLIP-274
>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>> introduce a new metric group scope for
>>>>>>>> OperatorCoordinator
>>>>>>>>>> and
>>>>>>>>>>>> provide
>>>>>>>>>>>>>>>>> an
>>>>>>>>>>>>>>>>>>> internal implementation for
>>>>>>>>> OperatorCoordinatorMetricGroup.
>>>>>>>>>>>>>>>>>>> Could you help review this FLIP when you get time? Any
>>>>>>>>>>> feedback
>>>>>>>>>>>> is
>>>>>>>>>>>>>>>>>>> appreciated!
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>>>>> Hang
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> [1]
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-274%3A+Introduce+metric+group+for+OperatorCoordinator
>>

Re: [DISCUSS] FLIP-274 : Introduce metric group for OperatorCoordinator

Posted by Steven Wu <st...@gmail.com>.

> Additionally, the configurable variables (operator name/id) are
logically not attached to the coordinator, but operators, so to me it
just doesn't make sense to structure it like this.

Chesnay, maybe we should clarify the terminology. To me, pperators (like
FLIP-27 source) can have two parts (coordinator and reader/subtask). I
think it is fine to include operator name/id for coordinator metrics.

On Mon, Jan 16, 2023 at 2:13 AM Chesnay Schepler <ch...@apache.org> wrote:

> Slight correction: Using metrics.scope.jm.job as the default should be
> safe.
>
> On 16/01/2023 10:18, Chesnay Schepler wrote:
> > The proposed ScopeFormat is still problematic for a few reasons.
> >
> > Extending the set of ScopeFormats is problematic because it in
> > practice it breaks the config if users actively rely on it, since
> > there's now another key that they _must_ set for it to be
> > consistent/compatible with their existing setup.
> > Unfortunately due to how powerful scope formats are we can't derive a
> > default value that matches their existing setup.
> > Hence we should try to do this as rarely as possible.
> >
> > This FLIP does not adhere to that since it proposes a dedicated format
> > for coordinators; next time we want to expose operator-specific
> > metrics (e.g., in the scheduler) we'd have to add another one to
> > support it.
> >
> > Additionally, the configurable variables (operator name/id) are
> > logically not attached to the coordinator, but operators, so to me it
> > just doesn't make sense to structure it like this.
> >
> > Another thing I'm concerned about is that, because we don't include
> > tasks in the hierarchy, users wishing to collect all metrics for a
> > particular task (in this case ==vertex) now have to go significantly
> > out of their way to get them, since they can no longer just filter by
> > the task ID but have to be filter for _all_ operators that are part of
> > the task.
> >
> > On 16/01/2023 03:09, Hang Ruan wrote:
> >> Hi, @chesnay@apache.org <ch...@apache.org> ,
> >>
> >> Do you have time to help to review this FLIP again after the
> >> modification?
> >> Looking forward to your reply.
> >> This FLIP will add a new configuration for the
> >> OperatorCoordinatorMetricGroup scope format. It provides an internal
> >> implementation and is added as a component to the
> >> JobManagerJobMetricGroup.
> >> If something doesn't make sense, could you provide some advice? It
> >> will be
> >> very helpful. Thanks a lot for your help.
> >>
> >> Best,
> >> Hang
> >>
> >> Martijn Visser <ma...@apache.org> 于2023年1月11日周三 16:34写道：
> >>
> >>
> >>> Hi Hang,
> >>>
> >>> I'm a bit surprised that this has gone to a vote, given that Chesnay
> >>> deliberately mentioned that he would vote against it as-is. I would
> >>> expect
> >>> that before going to a vote, he has had the opportunity to
> >>> participate in
> >>> this discussion.
> >>>
> >>> Best regards,
> >>>
> >>> Martijn
> >>>
> >>> On Tue, Jan 3, 2023 at 12:53 PM Jark Wu <im...@gmail.com> wrote:
> >>>
> >>>> Hi Dong,
> >>>>
> >>>> Regarding “SplitEnumeratorContext#metricGroup”, my only concern is
> >>>> that
> >>>> this is a core interface for the FLIP. It’s hard to tell how
> >>>> sources use
> >>>> metric group without mentioning this interface. Even if this is an
> >>> existing
> >>>> API, I think it’s worth introducing the interface again and declaring
> >>> that
> >>>> we will implement the interface instead of a no-op method in this
> >>>> FLIP.
> >>>>
> >>>> Anyway, this is a minor problem and shouldn’t block this FLIP. I’m
> >>>> +1 to
> >>>> start a vote.
> >>>>
> >>>> Best,
> >>>> Jark
> >>>>
> >>>>
> >>>>> 2023年1月3日 10:03，Hang Ruan <ru...@gmail.com> 写道：
> >>>>>
> >>>>> Hi, Jark and Dong,
> >>>>>
> >>>>> Thanks for your comments. Sorry for my late reply.
> >>>>>
> >>>>> For suggestion 1, I plan to implement the
> >>>>> SplitEnumeratorMetricGroup in
> >>>>> another issue, and it is not contained in this FLIP. I will add some
> >>>>> description about this part.
> >>>>> For suggestion 2, changes about OperatorCoordinator#metricGroup has
> >>>> already
> >>>>> been documented in the proposed change section.
> >>>>>
> >>>>> Best,
> >>>>> Hang
> >>>>>
> >>>>> Dong Lin <li...@gmail.com> 于2023年1月1日周日 09:45写道：
> >>>>>
> >>>>>> Let me chime-in and add comments regarding the public interface
> >>> section.
> >>>>>> Please see my comments inline.
> >>>>>>
> >>>>>> On Thu, Dec 29, 2022 at 6:08 PM Jark Wu <im...@gmail.com> wrote:
> >>>>>>
> >>>>>>> Hi Hang,
> >>>>>>>
> >>>>>>> Thanks for driving this discussion. I think this is a very useful
> >>>> feature
> >>>>>>> for connectors.
> >>>>>>>
> >>>>>>> The FLIP looks quite good to me, and I just have two suggestions.
> >>>>>>>
> >>>>>>> 1. In the "Public Interface" section, mention that the
> >>>>>>> implementation
> >>>>>>> behavior of "SplitEnumeratorContext#metricGroup" is changed from
> >>>>>> returning
> >>>>>>> null to returning a concrete SplitEnumeratorMetricGroup instance.
> >>> Even
> >>>>>>> though the API is already there, the behavior change can also be
> >>>>>> considered
> >>>>>>> a public change.
> >>>>>>>
> >>>>>> SplitEnumeratorContext#metricGroup is an interface and this FLIP
> >>>>>> does
> >>>> not
> >>>>>> seem to change its semantics/behavior. The FLIP does change the
> >>>>>> implementation/behavior of SourceCoordinatorContext#metricGroup,
> >>>>>> which
> >>>> is
> >>>>>> marked @Internal.
> >>>>>>
> >>>>>> Thus it might seem a bit weird to add in the public interface
> >>>>>> section
> >>>>>> saying that we change the interface
> >>>>>> SplitEnumeratorContext#metricGroup
> >>>> from
> >>>>>> returning null to non-null object.
> >>>>>>
> >>>>>> 2. Mention the newly added interface of
> >>>> "OperatorCoordinator#metricGroup"
> >>>>>>> in the "Proposed Changes" section or "Public Interface" section. As
> >>> the
> >>>>>>> FLIP said, OperatorCoordinator is widely used in many connectors.
> >>>> Though
> >>>>>> it
> >>>>>>> is still an @Internal API, I think it is worth mentioning the
> >>>>>>> change
> >>> in
> >>>>>> the
> >>>>>>> FLIP.
> >>>>>>
> >>>>>> Since OperatorCoordinator is an internal API, it seems reasonable to
> >>>>>> explain it in the proposed change section. The FLIP seems to have
> >>>>>> documented this in point 5 of the proposed change section.
> >>>>>>
> >>>>>> BTW, if we think there are @internal classes that are important
> >>>>>> enough
> >>>> to
> >>>>>> be added in the public interface section, it might be useful to
> >>>> explicitly
> >>>>>> discuss this topic and document it in the "*What are the "public
> >>>>>> interfaces" of the project*" in this
> >>>>>> <
> >>>>>>
> >>>
> https://cwiki.apache.org/confluence/display/FLINK/Flink+Improvement+Proposals
> >>>
> >>>>>> wiki.
> >>>>>>
> >>>>>>
> >>>>>>> Best,
> >>>>>>> Jark
> >>>>>>>
> >>>>>>>
> >>>>>>> On Mon, 26 Dec 2022 at 18:06, Hang Ruan <ru...@gmail.com>
> >>>> wrote:
> >>>>>>>> Hi, thanks for the feedback, Zhu Zhu and Qingsheng.
> >>>>>>>> After combining everyone's comments, the main concerns and
> >>>>>> corresponding
> >>>>>>>> adjustments are as follows.
> >>>>>>>>
> >>>>>>>> Q1: Common metrics are not quite useful.
> >>>>>>>> numEventsIn and numEventsOut counters will be removed from the
> >>>>>>>> OperatorCoordinatorMetricGroup. These common metrics do not
> >>>>>>>> provide
> >>>>>>> enough
> >>>>>>>> information for users. The users are more willing to get the
> >>>>>>>> number
> >>> of
> >>>>>>>> events of the specified type instead of the total number. And this
> >>>>>> metric
> >>>>>>>> is calculated differently. The implementation could register the
> >>>> metric
> >>>>>>> by
> >>>>>>>> themselves.
> >>>>>>>>
> >>>>>>>> Q2: This FLIP is overly complicated.
> >>>>>>>> This FLIP will become concise after these modifications.
> >>>>>>>> OperatorCoordinatorMetricGroup has already been introduced into
> >>> Flink
> >>>>>> by
> >>>>>>>> FLIP-179<
> >>>>>>>>
> >>>>>>>>
> >>>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-179%3A+Expose+Standardized+Operator+Metrics
> >>>
> >>>>>>>>> .
> >>>>>>>> And
> >>>>>>>> this FLIP will not change it. This FLIP only provides a new metric
> >>>>>> option
> >>>>>>>> and a new metric group scope. The changes in proposed changes
> >>> provide
> >>>>>> the
> >>>>>>>> details about the modifications for the internal classes, which
> >>> might
> >>>>>>> make
> >>>>>>>> it look complicated.
> >>>>>>>>
> >>>>>>>> Thanks for all the comments again. If there are no further
> >>>>>>>> comments,
> >>>> we
> >>>>>>>> plan to start the voting thread this week.
> >>>>>>>>
> >>>>>>>> Best,
> >>>>>>>> Hang
> >>>>>>>>
> >>>>>>>> Qingsheng Ren <re...@gmail.com> 于2022年12月26日周一 16:48写道：
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>> Thanks for the FLIP, Hang!
> >>>>>>>>>
> >>>>>>>>> This FLIP overall looks good to me. Actually I share the same
> >>> concern
> >>>>>>>> with
> >>>>>>>>> Zhu that numEventsIn and numEventsOut counters are not quite
> >>>>>>>>> useful
> >>>>>> to
> >>>>>>>> end
> >>>>>>>>> users. OperatorEvent is a quite low-level abstraction, which
> >>> requires
> >>>>>>>>> instantialization in order to be practical to users and
> >>>>>>>>> developers,
> >>>>>> so
> >>>>>>>>> maybe it's better to exclude them from the FLIP.
> >>>>>>>>>
> >>>>>>>>> Best,
> >>>>>>>>> Qingsheng
> >>>>>>>>> Ververica (Alibaba)
> >>>>>>>>>
> >>>>>>>>> On Mon, Dec 26, 2022 at 12:08 PM Zhu Zhu <re...@gmail.com>
> >>> wrote:
> >>>>>>>>>> Hi Hang,
> >>>>>>>>>> I still see no strong reason why we need
> >>>>>>>>>> numEventsIn/numEventsOut
> >>>>>>>>> metrics.
> >>>>>>>>>> In the discussion in FLINK-29801, I can see the same concern
> >>>>>>>>>> from
> >>>>>>>> others.
> >>>>>>>>>> So I prefer to exclude them from this FLIP to avoid
> >>>>>>>>>> over-extending
> >>>>>>> the
> >>>>>>>>>> scope.
> >>>>>>>>>>
> >>>>>>>>>> Thanks,
> >>>>>>>>>> Zhu
> >>>>>>>>>>
> >>>>>>>>>> Hang Ruan <ru...@gmail.com> 于2022年12月23日周五
> >>>>>>>>>> 15:21写道：
> >>>>>>>>>>> Hi everyone,
> >>>>>>>>>>>
> >>>>>>>>>>> As for the Zhu Zhu's problem, I think we should keep the common
> >>>>>>>>> metrics,
> >>>>>>>>>> which will help to observe incoming and outgoing events. What do
> >>>>>> you
> >>>>>>>>> think,
> >>>>>>>>>> @Zhu Zhu ?
> >>>>>>>>>>> And @Chesnay, are there any other issues you are more concerned
> >>>>>>>> about?
> >>>>>>>>>> Looking forward to your reply.
> >>>>>>>>>>> Thanks for all the comments. If there are no further
> >>>>>>>>>>> comments, we
> >>>>>>>> plan
> >>>>>>>>>> to start the voting thread next week.
> >>>>>>>>>>> Best,
> >>>>>>>>>>> Hang
> >>>>>>>>>>>
> >>>>>>>>>>> Hang Ruan <ru...@gmail.com> 于2022年12月15日周四
> >>>>>>>>>>> 16:49写道：
> >>>>>>>>>>>> Hi, Zhu Zhu,
> >>>>>>>>>>>>
> >>>>>>>>>>>> Thanks for your feedback!
> >>>>>>>>>>>>
> >>>>>>>>>>>> The OperatorCoordinator implementations are different. And
> >>>>>>>>>>>> their
> >>>>>>>>>> metrics are much different too. We try to find the common
> >>>>>>>>>> metrics
> >>>>>> and
> >>>>>>>> put
> >>>>>>>>>> them in the OperatorCoordinatorMetricGroup. If most developers
> >>>>>> think
> >>>>>>> we
> >>>>>>>>> do
> >>>>>>>>>> not need these common metrics, removing the common metrics is
> >>>>>>>>>> also
> >>>>>>>>>> acceptable.
> >>>>>>>>>>>> Best,
> >>>>>>>>>>>> Hang
> >>>>>>>>>>>>
> >>>>>>>>>>>> Zhu Zhu <re...@gmail.com> 于2022年12月14日周三 22:09写道：
> >>>>>>>>>>>>> Hi Hang & MengYue,
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Thanks for creating this FLIP!
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> I think it is very useful, mainly in two aspects:
> >>>>>>>>>>>>> 1. Enables OperatorCoordinators to register metrics.
> >>>>>>>>>>>>> Currently
> >>>>>>>>>>>>> the coordinators has no way to do this. And operator
> >>>>>> coordinator
> >>>>>>>>>>>>> metric group further enables the SplitEnumerator to have
> >>>>>>>>>>>>> access
> >>>>>>>>>>>>> to a registered metric group (via the existing public
> >>>>>>>>>>>>> interface
> >>>>>>>>>>>>> SplitEnumeratorContext#metricGroup()), which is null at the
> >>>>>>> moment.
> >>>>>>>>>>>>> 2. Defines the scope of operator coordinator metrics. A clear
> >>>>>>>>>> definition
> >>>>>>>>>>>>> makes it easy for users to find their wanted metrics. The
> >>>>>>>> definition
> >>>>>>>>>>>>> also helps to avoid conflicts of metrics from multiple
> >>>>>>>>>> OperatorCoordinators
> >>>>>>>>>>>>> of the same kind. E.g. each SourceCoordinator may have its
> >>>>>>>>>>>>> own
> >>>>>>>>>>>>> numSourceSplits metric, these metrics should not be directly
> >>>>>>>>> registered
> >>>>>>>>>>>>> to the job metric group.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> What I'm a bit concerned is the necessity of the introduced
> >>>>>>> common
> >>>>>>>>>> metrics
> >>>>>>>>>>>>> numEventsInCounter & numEventsOutCounter. If there any case
> >>>>>> which
> >>>>>>>>>> strongly
> >>>>>>>>>>>>> requires them?
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Regarding the concerns of Chesnay,
> >>>>>>>>>>>>>> A dedicated coordinator MG implementation is overkill
> >>>>>>>>>>>>> Directly using the job metric group can result in metric
> >>>>>>> conflicts,
> >>>>>>>>> as
> >>>>>>>>>> mentioned
> >>>>>>>>>>>>> in above #2.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>> Zhu
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Dong Lin <li...@gmail.com> 于2022年12月10日周六
> >>>>>>>>>>>>> 14:16写道：
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> Hi Chesney,
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Just to double check with you,
> >>>>>>>>>>>>>> OperatorCoordinatorMetricGroup
> >>>>>>>>>> (annotated as
> >>>>>>>>>>>>>> @PublicEvolving) has already been introduced into Flink by
> >>>>>>>> FLIP-179
> >>>>>>>>>>>>>> <
> >>>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-179%3A+Expose+Standardized+Operator+Metrics
> >>>
> >>>>>>>>>>> .
> >>>>>>>>>>>>>> And that FLIP has got you +1.. Do you mean we should remove
> >>>>>>> this
> >>>>>>>>>>>>>> OperatorCoordinatorMetricGroup?
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Regards,
> >>>>>>>>>>>>>> Dong
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> On Sat, Dec 10, 2022 at 1:33 AM Chesnay Schepler <
> >>>>>>>>> chesnay@apache.org>
> >>>>>>>>>> wrote:
> >>>>>>>>>>>>>>> As a whole I feel like this FLIP is overly complicated. A
> >>>>>>>>> dedicated
> >>>>>>>>>>>>>>> coordinator MG implementation is overkill; it could just
> >>>>>>> re-use
> >>>>>>>>> the
> >>>>>>>>>>>>>>> existing Task/OperatorMGs to create the same structure we
> >>>>>>> have
> >>>>>>>> on
> >>>>>>>>>> TMs,
> >>>>>>>>>>>>>>> similar to what we did with the Job MG.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> However, I'm not convinced that this is required anyway,
> >>>>>>>> because
> >>>>>>>>>> all the
> >>>>>>>>>>>>>>> example metrics you listed can be implemented on the TM
> >>>>>> side
> >>>>>>> +
> >>>>>>>>>>>>>>> aggregating them in the external metrics backend.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Since I'm on holidays soon, just so no one tries to pull a
> >>>>>>> fast
> >>>>>>>>>> one on
> >>>>>>>>>>>>>>> me, if this were to go to a vote as-is I'd be against it.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> On 09/12/2022 15:30, Dong Lin wrote:
> >>>>>>>>>>>>>>>> Hi Hang,
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Thanks for the FLIP! The FLIP looks good and it is pretty
> >>>>>>>>>> informative.
> >>>>>>>>>>>>>>>> I have just two minor comments regarding names:
> >>>>>>>>>>>>>>>> - Would it be useful to rename the config key as
> >>>>>>>>>>>>>>>> *metrics.scope.jm.job.operator-coordinator* for
> >>>>>> consistency
> >>>>>>>>> with
> >>>>>>>>>>>>>>>> *metrics.scope.jm.job
> >>>>>>>>>>>>>>>> *(which is not named as *jm-job)?
> >>>>>>>>>>>>>>>> - Maybe rename the variable as
> >>>>>>>>> SCOPE_NAMING_OPERATOR_COORDINATOR
> >>>>>>>>>> for
> >>>>>>>>>>>>>>>> simplicity and consistency with SCOPE_NAMING_OPERATOR
> >>>>>>> (which
> >>>>>>>> is
> >>>>>>>>>> not named
> >>>>>>>>>>>>>>>> as SCOPE_NAMING_TM_JOB_OPERATOR)?
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Cheers,
> >>>>>>>>>>>>>>>> Dong
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> On Thu, Dec 8, 2022 at 3:28 PM Hang Ruan <
> >>>>>>>>> ruanhang1993@gmail.com>
> >>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>> Hi all,
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> MengYue and I created FLIP-274[1] Introduce metric group
> >>>>>>> for
> >>>>>>>>>>>>>>>>> OperatorCoordinator. OperatorCoordinator is the
> >>>>>>> coordinator
> >>>>>>>>> for
> >>>>>>>>>> runtime
> >>>>>>>>>>>>>>>>> operators and running on Job Manager. The coordination
> >>>>>>>>>> mechanism is
> >>>>>>>>>>>>>>>>> operator events between OperatorCoordinator and its all
> >>>>>>>>>> operators, the
> >>>>>>>>>>>>>>>>> coordination is more and more using in Flink, for
> >>>>>> example
> >>>>>>>> many
> >>>>>>>>>> Sources
> >>>>>>>>>>>>>>> and
> >>>>>>>>>>>>>>>>> Sinks depend on the mechanism to assign splits and
> >>>>>>>> coordinate
> >>>>>>>>>> commits to
> >>>>>>>>>>>>>>>>> external systems. The OperatorCoordinator is widely
> >>>>>> using
> >>>>>>> in
> >>>>>>>>>> flink kafka
> >>>>>>>>>>>>>>>>> connector, flink pulsar connector, flink cdc connector,
> >>>>>>>> flink
> >>>>>>>>>> hudi
> >>>>>>>>>>>>>>>>> connector and so on.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> But there is not a suitable metric group scope for the
> >>>>>>>>>>>>>>> OperatorCoordinator
> >>>>>>>>>>>>>>>>> and not an implementation for the interface
> >>>>>>>>>>>>>>> OperatorCoordinatorMetricGroup.
> >>>>>>>>>>>>>>>>> These metrics in OperatorCoordinator could be how many
> >>>>>>>>>> splits/partitions
> >>>>>>>>>>>>>>>>> have been assigned to source readers, how many files
> >>>>>> have
> >>>>>>>> been
> >>>>>>>>>> written
> >>>>>>>>>>>>>>> out
> >>>>>>>>>>>>>>>>> by sink writers, these metrics not only help users to
> >>>>>> know
> >>>>>>>> the
> >>>>>>>>>> job
> >>>>>>>>>>>>>>> progress
> >>>>>>>>>>>>>>>>> but also make big job maintaining easier. Thus we
> >>>>>> propose
> >>>>>>>> the
> >>>>>>>>>> FLIP-274
> >>>>>>>>>>>>>>> to
> >>>>>>>>>>>>>>>>> introduce a new metric group scope for
> >>>>>> OperatorCoordinator
> >>>>>>>> and
> >>>>>>>>>> provide
> >>>>>>>>>>>>>>> an
> >>>>>>>>>>>>>>>>> internal implementation for
> >>>>>>> OperatorCoordinatorMetricGroup.
> >>>>>>>>>>>>>>>>> Could you help review this FLIP when you get time? Any
> >>>>>>>>> feedback
> >>>>>>>>>> is
> >>>>>>>>>>>>>>>>> appreciated!
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Best,
> >>>>>>>>>>>>>>>>> Hang
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> [1]
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>
> >>>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-274%3A+Introduce+metric+group+for+OperatorCoordinator
> >>>
> >>>>>>>>>>>>>>>
> >>>>
> >
>
>

Re: [DISCUSS] FLIP-274 : Introduce metric group for OperatorCoordinator

Posted by Chesnay Schepler <ch...@apache.org>.

Slight correction: Using metrics.scope.jm.job as the default should be safe.

On 16/01/2023 10:18, Chesnay Schepler wrote:
> The proposed ScopeFormat is still problematic for a few reasons.
>
> Extending the set of ScopeFormats is problematic because it in 
> practice it breaks the config if users actively rely on it, since 
> there's now another key that they _must_ set for it to be 
> consistent/compatible with their existing setup.
> Unfortunately due to how powerful scope formats are we can't derive a 
> default value that matches their existing setup.
> Hence we should try to do this as rarely as possible.
>
> This FLIP does not adhere to that since it proposes a dedicated format 
> for coordinators; next time we want to expose operator-specific 
> metrics (e.g., in the scheduler) we'd have to add another one to 
> support it.
>
> Additionally, the configurable variables (operator name/id) are 
> logically not attached to the coordinator, but operators, so to me it 
> just doesn't make sense to structure it like this.
>
> Another thing I'm concerned about is that, because we don't include 
> tasks in the hierarchy, users wishing to collect all metrics for a 
> particular task (in this case ==vertex) now have to go significantly 
> out of their way to get them, since they can no longer just filter by 
> the task ID but have to be filter for _all_ operators that are part of 
> the task.
>
> On 16/01/2023 03:09, Hang Ruan wrote:
>> Hi, @chesnay@apache.org <ch...@apache.org> ,
>>
>> Do you have time to help to review this FLIP again after the 
>> modification?
>> Looking forward to your reply.
>> This FLIP will add a new configuration for the
>> OperatorCoordinatorMetricGroup scope format. It provides an internal
>> implementation and is added as a component to the 
>> JobManagerJobMetricGroup.
>> If something doesn't make sense, could you provide some advice? It 
>> will be
>> very helpful. Thanks a lot for your help.
>>
>> Best,
>> Hang
>>
>> Martijn Visser <ma...@apache.org> 于2023年1月11日周三 16:34写道： 
>>
>>
>>> Hi Hang,
>>>
>>> I'm a bit surprised that this has gone to a vote, given that Chesnay
>>> deliberately mentioned that he would vote against it as-is. I would 
>>> expect
>>> that before going to a vote, he has had the opportunity to 
>>> participate in
>>> this discussion.
>>>
>>> Best regards,
>>>
>>> Martijn
>>>
>>> On Tue, Jan 3, 2023 at 12:53 PM Jark Wu <im...@gmail.com> wrote:
>>>
>>>> Hi Dong,
>>>>
>>>> Regarding “SplitEnumeratorContext#metricGroup”, my only concern is 
>>>> that
>>>> this is a core interface for the FLIP. It’s hard to tell how 
>>>> sources use
>>>> metric group without mentioning this interface. Even if this is an
>>> existing
>>>> API, I think it’s worth introducing the interface again and declaring
>>> that
>>>> we will implement the interface instead of a no-op method in this 
>>>> FLIP.
>>>>
>>>> Anyway, this is a minor problem and shouldn’t block this FLIP. I’m 
>>>> +1 to
>>>> start a vote.
>>>>
>>>> Best,
>>>> Jark
>>>>
>>>>
>>>>> 2023年1月3日 10:03，Hang Ruan <ru...@gmail.com> 写道：
>>>>>
>>>>> Hi, Jark and Dong,
>>>>>
>>>>> Thanks for your comments. Sorry for my late reply.
>>>>>
>>>>> For suggestion 1, I plan to implement the 
>>>>> SplitEnumeratorMetricGroup in
>>>>> another issue, and it is not contained in this FLIP. I will add some
>>>>> description about this part.
>>>>> For suggestion 2, changes about OperatorCoordinator#metricGroup has
>>>> already
>>>>> been documented in the proposed change section.
>>>>>
>>>>> Best,
>>>>> Hang
>>>>>
>>>>> Dong Lin <li...@gmail.com> 于2023年1月1日周日 09:45写道：
>>>>>
>>>>>> Let me chime-in and add comments regarding the public interface
>>> section.
>>>>>> Please see my comments inline.
>>>>>>
>>>>>> On Thu, Dec 29, 2022 at 6:08 PM Jark Wu <im...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Hang,
>>>>>>>
>>>>>>> Thanks for driving this discussion. I think this is a very useful
>>>> feature
>>>>>>> for connectors.
>>>>>>>
>>>>>>> The FLIP looks quite good to me, and I just have two suggestions.
>>>>>>>
>>>>>>> 1. In the "Public Interface" section, mention that the 
>>>>>>> implementation
>>>>>>> behavior of "SplitEnumeratorContext#metricGroup" is changed from
>>>>>> returning
>>>>>>> null to returning a concrete SplitEnumeratorMetricGroup instance.
>>> Even
>>>>>>> though the API is already there, the behavior change can also be
>>>>>> considered
>>>>>>> a public change.
>>>>>>>
>>>>>> SplitEnumeratorContext#metricGroup is an interface and this FLIP 
>>>>>> does
>>>> not
>>>>>> seem to change its semantics/behavior. The FLIP does change the
>>>>>> implementation/behavior of SourceCoordinatorContext#metricGroup, 
>>>>>> which
>>>> is
>>>>>> marked @Internal.
>>>>>>
>>>>>> Thus it might seem a bit weird to add in the public interface 
>>>>>> section
>>>>>> saying that we change the interface 
>>>>>> SplitEnumeratorContext#metricGroup
>>>> from
>>>>>> returning null to non-null object.
>>>>>>
>>>>>> 2. Mention the newly added interface of
>>>> "OperatorCoordinator#metricGroup"
>>>>>>> in the "Proposed Changes" section or "Public Interface" section. As
>>> the
>>>>>>> FLIP said, OperatorCoordinator is widely used in many connectors.
>>>> Though
>>>>>> it
>>>>>>> is still an @Internal API, I think it is worth mentioning the 
>>>>>>> change
>>> in
>>>>>> the
>>>>>>> FLIP.
>>>>>>
>>>>>> Since OperatorCoordinator is an internal API, it seems reasonable to
>>>>>> explain it in the proposed change section. The FLIP seems to have
>>>>>> documented this in point 5 of the proposed change section.
>>>>>>
>>>>>> BTW, if we think there are @internal classes that are important 
>>>>>> enough
>>>> to
>>>>>> be added in the public interface section, it might be useful to
>>>> explicitly
>>>>>> discuss this topic and document it in the "*What are the "public
>>>>>> interfaces" of the project*" in this
>>>>>> <
>>>>>>
>>> https://cwiki.apache.org/confluence/display/FLINK/Flink+Improvement+Proposals 
>>>
>>>>>> wiki.
>>>>>>
>>>>>>
>>>>>>> Best,
>>>>>>> Jark
>>>>>>>
>>>>>>>
>>>>>>> On Mon, 26 Dec 2022 at 18:06, Hang Ruan <ru...@gmail.com>
>>>> wrote:
>>>>>>>> Hi, thanks for the feedback, Zhu Zhu and Qingsheng.
>>>>>>>> After combining everyone's comments, the main concerns and
>>>>>> corresponding
>>>>>>>> adjustments are as follows.
>>>>>>>>
>>>>>>>> Q1: Common metrics are not quite useful.
>>>>>>>> numEventsIn and numEventsOut counters will be removed from the
>>>>>>>> OperatorCoordinatorMetricGroup. These common metrics do not 
>>>>>>>> provide
>>>>>>> enough
>>>>>>>> information for users. The users are more willing to get the 
>>>>>>>> number
>>> of
>>>>>>>> events of the specified type instead of the total number. And this
>>>>>> metric
>>>>>>>> is calculated differently. The implementation could register the
>>>> metric
>>>>>>> by
>>>>>>>> themselves.
>>>>>>>>
>>>>>>>> Q2: This FLIP is overly complicated.
>>>>>>>> This FLIP will become concise after these modifications.
>>>>>>>> OperatorCoordinatorMetricGroup has already been introduced into
>>> Flink
>>>>>> by
>>>>>>>> FLIP-179<
>>>>>>>>
>>>>>>>>
>>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-179%3A+Expose+Standardized+Operator+Metrics 
>>>
>>>>>>>>> .
>>>>>>>> And
>>>>>>>> this FLIP will not change it. This FLIP only provides a new metric
>>>>>> option
>>>>>>>> and a new metric group scope. The changes in proposed changes
>>> provide
>>>>>> the
>>>>>>>> details about the modifications for the internal classes, which
>>> might
>>>>>>> make
>>>>>>>> it look complicated.
>>>>>>>>
>>>>>>>> Thanks for all the comments again. If there are no further 
>>>>>>>> comments,
>>>> we
>>>>>>>> plan to start the voting thread this week.
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Hang
>>>>>>>>
>>>>>>>> Qingsheng Ren <re...@gmail.com> 于2022年12月26日周一 16:48写道： 
>>>>>>>>
>>>>>>>>
>>>>>>>>> Thanks for the FLIP, Hang!
>>>>>>>>>
>>>>>>>>> This FLIP overall looks good to me. Actually I share the same
>>> concern
>>>>>>>> with
>>>>>>>>> Zhu that numEventsIn and numEventsOut counters are not quite 
>>>>>>>>> useful
>>>>>> to
>>>>>>>> end
>>>>>>>>> users. OperatorEvent is a quite low-level abstraction, which
>>> requires
>>>>>>>>> instantialization in order to be practical to users and 
>>>>>>>>> developers,
>>>>>> so
>>>>>>>>> maybe it's better to exclude them from the FLIP.
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>> Qingsheng
>>>>>>>>> Ververica (Alibaba)
>>>>>>>>>
>>>>>>>>> On Mon, Dec 26, 2022 at 12:08 PM Zhu Zhu <re...@gmail.com>
>>> wrote:
>>>>>>>>>> Hi Hang,
>>>>>>>>>> I still see no strong reason why we need 
>>>>>>>>>> numEventsIn/numEventsOut
>>>>>>>>> metrics.
>>>>>>>>>> In the discussion in FLINK-29801, I can see the same concern 
>>>>>>>>>> from
>>>>>>>> others.
>>>>>>>>>> So I prefer to exclude them from this FLIP to avoid 
>>>>>>>>>> over-extending
>>>>>>> the
>>>>>>>>>> scope.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Zhu
>>>>>>>>>>
>>>>>>>>>> Hang Ruan <ru...@gmail.com> 于2022年12月23日周五 
>>>>>>>>>> 15:21写道：
>>>>>>>>>>> Hi everyone,
>>>>>>>>>>>
>>>>>>>>>>> As for the Zhu Zhu's problem, I think we should keep the common
>>>>>>>>> metrics,
>>>>>>>>>> which will help to observe incoming and outgoing events. What do
>>>>>> you
>>>>>>>>> think,
>>>>>>>>>> @Zhu Zhu ?
>>>>>>>>>>> And @Chesnay, are there any other issues you are more concerned
>>>>>>>> about?
>>>>>>>>>> Looking forward to your reply.
>>>>>>>>>>> Thanks for all the comments. If there are no further 
>>>>>>>>>>> comments, we
>>>>>>>> plan
>>>>>>>>>> to start the voting thread next week.
>>>>>>>>>>> Best,
>>>>>>>>>>> Hang
>>>>>>>>>>>
>>>>>>>>>>> Hang Ruan <ru...@gmail.com> 于2022年12月15日周四 
>>>>>>>>>>> 16:49写道：
>>>>>>>>>>>> Hi, Zhu Zhu,
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks for your feedback!
>>>>>>>>>>>>
>>>>>>>>>>>> The OperatorCoordinator implementations are different. And 
>>>>>>>>>>>> their
>>>>>>>>>> metrics are much different too. We try to find the common 
>>>>>>>>>> metrics
>>>>>> and
>>>>>>>> put
>>>>>>>>>> them in the OperatorCoordinatorMetricGroup. If most developers
>>>>>> think
>>>>>>> we
>>>>>>>>> do
>>>>>>>>>> not need these common metrics, removing the common metrics is 
>>>>>>>>>> also
>>>>>>>>>> acceptable.
>>>>>>>>>>>> Best,
>>>>>>>>>>>> Hang
>>>>>>>>>>>>
>>>>>>>>>>>> Zhu Zhu <re...@gmail.com> 于2022年12月14日周三 22:09写道：
>>>>>>>>>>>>> Hi Hang & MengYue,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks for creating this FLIP!
>>>>>>>>>>>>>
>>>>>>>>>>>>> I think it is very useful, mainly in two aspects:
>>>>>>>>>>>>> 1. Enables OperatorCoordinators to register metrics. 
>>>>>>>>>>>>> Currently
>>>>>>>>>>>>> the coordinators has no way to do this. And operator
>>>>>> coordinator
>>>>>>>>>>>>> metric group further enables the SplitEnumerator to have 
>>>>>>>>>>>>> access
>>>>>>>>>>>>> to a registered metric group (via the existing public 
>>>>>>>>>>>>> interface
>>>>>>>>>>>>> SplitEnumeratorContext#metricGroup()), which is null at the
>>>>>>> moment.
>>>>>>>>>>>>> 2. Defines the scope of operator coordinator metrics. A clear
>>>>>>>>>> definition
>>>>>>>>>>>>> makes it easy for users to find their wanted metrics. The
>>>>>>>> definition
>>>>>>>>>>>>> also helps to avoid conflicts of metrics from multiple
>>>>>>>>>> OperatorCoordinators
>>>>>>>>>>>>> of the same kind. E.g. each SourceCoordinator may have its 
>>>>>>>>>>>>> own
>>>>>>>>>>>>> numSourceSplits metric, these metrics should not be directly
>>>>>>>>> registered
>>>>>>>>>>>>> to the job metric group.
>>>>>>>>>>>>>
>>>>>>>>>>>>> What I'm a bit concerned is the necessity of the introduced
>>>>>>> common
>>>>>>>>>> metrics
>>>>>>>>>>>>> numEventsInCounter & numEventsOutCounter. If there any case
>>>>>> which
>>>>>>>>>> strongly
>>>>>>>>>>>>> requires them?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Regarding the concerns of Chesnay,
>>>>>>>>>>>>>> A dedicated coordinator MG implementation is overkill
>>>>>>>>>>>>> Directly using the job metric group can result in metric
>>>>>>> conflicts,
>>>>>>>>> as
>>>>>>>>>> mentioned
>>>>>>>>>>>>> in above #2.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Zhu
>>>>>>>>>>>>>
>>>>>>>>>>>>> Dong Lin <li...@gmail.com> 于2022年12月10日周六 
>>>>>>>>>>>>> 14:16写道：
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Chesney,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Just to double check with you, 
>>>>>>>>>>>>>> OperatorCoordinatorMetricGroup
>>>>>>>>>> (annotated as
>>>>>>>>>>>>>> @PublicEvolving) has already been introduced into Flink by
>>>>>>>> FLIP-179
>>>>>>>>>>>>>> <
>>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-179%3A+Expose+Standardized+Operator+Metrics 
>>>
>>>>>>>>>>> .
>>>>>>>>>>>>>> And that FLIP has got you +1.. Do you mean we should remove
>>>>>>> this
>>>>>>>>>>>>>> OperatorCoordinatorMetricGroup?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>> Dong
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Sat, Dec 10, 2022 at 1:33 AM Chesnay Schepler <
>>>>>>>>> chesnay@apache.org>
>>>>>>>>>> wrote:
>>>>>>>>>>>>>>> As a whole I feel like this FLIP is overly complicated. A
>>>>>>>>> dedicated
>>>>>>>>>>>>>>> coordinator MG implementation is overkill; it could just
>>>>>>> re-use
>>>>>>>>> the
>>>>>>>>>>>>>>> existing Task/OperatorMGs to create the same structure we
>>>>>>> have
>>>>>>>> on
>>>>>>>>>> TMs,
>>>>>>>>>>>>>>> similar to what we did with the Job MG.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> However, I'm not convinced that this is required anyway,
>>>>>>>> because
>>>>>>>>>> all the
>>>>>>>>>>>>>>> example metrics you listed can be implemented on the TM
>>>>>> side
>>>>>>> +
>>>>>>>>>>>>>>> aggregating them in the external metrics backend.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Since I'm on holidays soon, just so no one tries to pull a
>>>>>>> fast
>>>>>>>>>> one on
>>>>>>>>>>>>>>> me, if this were to go to a vote as-is I'd be against it.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 09/12/2022 15:30, Dong Lin wrote:
>>>>>>>>>>>>>>>> Hi Hang,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks for the FLIP! The FLIP looks good and it is pretty
>>>>>>>>>> informative.
>>>>>>>>>>>>>>>> I have just two minor comments regarding names:
>>>>>>>>>>>>>>>> - Would it be useful to rename the config key as
>>>>>>>>>>>>>>>> *metrics.scope.jm.job.operator-coordinator* for
>>>>>> consistency
>>>>>>>>> with
>>>>>>>>>>>>>>>> *metrics.scope.jm.job
>>>>>>>>>>>>>>>> *(which is not named as *jm-job)?
>>>>>>>>>>>>>>>> - Maybe rename the variable as
>>>>>>>>> SCOPE_NAMING_OPERATOR_COORDINATOR
>>>>>>>>>> for
>>>>>>>>>>>>>>>> simplicity and consistency with SCOPE_NAMING_OPERATOR
>>>>>>> (which
>>>>>>>> is
>>>>>>>>>> not named
>>>>>>>>>>>>>>>> as SCOPE_NAMING_TM_JOB_OPERATOR)?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>>> Dong
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Thu, Dec 8, 2022 at 3:28 PM Hang Ruan <
>>>>>>>>> ruanhang1993@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> MengYue and I created FLIP-274[1] Introduce metric group
>>>>>>> for
>>>>>>>>>>>>>>>>> OperatorCoordinator. OperatorCoordinator is the
>>>>>>> coordinator
>>>>>>>>> for
>>>>>>>>>> runtime
>>>>>>>>>>>>>>>>> operators and running on Job Manager. The coordination
>>>>>>>>>> mechanism is
>>>>>>>>>>>>>>>>> operator events between OperatorCoordinator and its all
>>>>>>>>>> operators, the
>>>>>>>>>>>>>>>>> coordination is more and more using in Flink, for
>>>>>> example
>>>>>>>> many
>>>>>>>>>> Sources
>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>> Sinks depend on the mechanism to assign splits and
>>>>>>>> coordinate
>>>>>>>>>> commits to
>>>>>>>>>>>>>>>>> external systems. The OperatorCoordinator is widely
>>>>>> using
>>>>>>> in
>>>>>>>>>> flink kafka
>>>>>>>>>>>>>>>>> connector, flink pulsar connector, flink cdc connector,
>>>>>>>> flink
>>>>>>>>>> hudi
>>>>>>>>>>>>>>>>> connector and so on.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> But there is not a suitable metric group scope for the
>>>>>>>>>>>>>>> OperatorCoordinator
>>>>>>>>>>>>>>>>> and not an implementation for the interface
>>>>>>>>>>>>>>> OperatorCoordinatorMetricGroup.
>>>>>>>>>>>>>>>>> These metrics in OperatorCoordinator could be how many
>>>>>>>>>> splits/partitions
>>>>>>>>>>>>>>>>> have been assigned to source readers, how many files
>>>>>> have
>>>>>>>> been
>>>>>>>>>> written
>>>>>>>>>>>>>>> out
>>>>>>>>>>>>>>>>> by sink writers, these metrics not only help users to
>>>>>> know
>>>>>>>> the
>>>>>>>>>> job
>>>>>>>>>>>>>>> progress
>>>>>>>>>>>>>>>>> but also make big job maintaining easier. Thus we
>>>>>> propose
>>>>>>>> the
>>>>>>>>>> FLIP-274
>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>> introduce a new metric group scope for
>>>>>> OperatorCoordinator
>>>>>>>> and
>>>>>>>>>> provide
>>>>>>>>>>>>>>> an
>>>>>>>>>>>>>>>>> internal implementation for
>>>>>>> OperatorCoordinatorMetricGroup.
>>>>>>>>>>>>>>>>> Could you help review this FLIP when you get time? Any
>>>>>>>>> feedback
>>>>>>>>>> is
>>>>>>>>>>>>>>>>> appreciated!
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>>> Hang
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> [1]
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-274%3A+Introduce+metric+group+for+OperatorCoordinator 
>>>
>>>>>>>>>>>>>>>
>>>>
>

Re: [DISCUSS] FLIP-274 : Introduce metric group for OperatorCoordinator

Posted by Chesnay Schepler <ch...@apache.org>.

The proposed ScopeFormat is still problematic for a few reasons.

Extending the set of ScopeFormats is problematic because it in practice 
it breaks the config if users actively rely on it, since there's now 
another key that they _must_ set for it to be consistent/compatible with 
their existing setup.
Unfortunately due to how powerful scope formats are we can't derive a 
default value that matches their existing setup.
Hence we should try to do this as rarely as possible.

This FLIP does not adhere to that since it proposes a dedicated format 
for coordinators; next time we want to expose operator-specific metrics 
(e.g., in the scheduler) we'd have to add another one to support it.

Additionally, the configurable variables (operator name/id) are 
logically not attached to the coordinator, but operators, so to me it 
just doesn't make sense to structure it like this.

Another thing I'm concerned about is that, because we don't include 
tasks in the hierarchy, users wishing to collect all metrics for a 
particular task (in this case ==vertex) now have to go significantly out 
of their way to get them, since they can no longer just filter by the 
task ID but have to be filter for _all_ operators that are part of the task.

On 16/01/2023 03:09, Hang Ruan wrote:
> Hi, @chesnay@apache.org <ch...@apache.org> ,
>
> Do you have time to help to review this FLIP again after the modification?
> Looking forward to your reply.
> This FLIP will add a new configuration for the
> OperatorCoordinatorMetricGroup scope format. It provides an internal
> implementation and is added as a component to the JobManagerJobMetricGroup.
> If something doesn't make sense, could you provide some advice? It will be
> very helpful. Thanks a lot for your help.
>
> Best,
> Hang
>
> Martijn Visser <ma...@apache.org> 于2023年1月11日周三 16:34写道：
>
>> Hi Hang,
>>
>> I'm a bit surprised that this has gone to a vote, given that Chesnay
>> deliberately mentioned that he would vote against it as-is. I would expect
>> that before going to a vote, he has had the opportunity to participate in
>> this discussion.
>>
>> Best regards,
>>
>> Martijn
>>
>> On Tue, Jan 3, 2023 at 12:53 PM Jark Wu <im...@gmail.com> wrote:
>>
>>> Hi Dong,
>>>
>>> Regarding “SplitEnumeratorContext#metricGroup”, my only concern is that
>>> this is a core interface for the FLIP. It’s hard to tell how sources use
>>> metric group without mentioning this interface. Even if this is an
>> existing
>>> API, I think it’s worth introducing the interface again and declaring
>> that
>>> we will implement the interface instead of a no-op method in this FLIP.
>>>
>>> Anyway, this is a minor problem and shouldn’t block this FLIP. I’m +1 to
>>> start a vote.
>>>
>>> Best,
>>> Jark
>>>
>>>
>>>> 2023年1月3日 10:03，Hang Ruan <ru...@gmail.com> 写道：
>>>>
>>>> Hi, Jark and Dong,
>>>>
>>>> Thanks for your comments. Sorry for my late reply.
>>>>
>>>> For suggestion 1, I plan to implement the SplitEnumeratorMetricGroup in
>>>> another issue, and it is not contained in this FLIP. I will add some
>>>> description about this part.
>>>> For suggestion 2, changes about OperatorCoordinator#metricGroup has
>>> already
>>>> been documented in the proposed change section.
>>>>
>>>> Best,
>>>> Hang
>>>>
>>>> Dong Lin <li...@gmail.com> 于2023年1月1日周日 09:45写道：
>>>>
>>>>> Let me chime-in and add comments regarding the public interface
>> section.
>>>>> Please see my comments inline.
>>>>>
>>>>> On Thu, Dec 29, 2022 at 6:08 PM Jark Wu <im...@gmail.com> wrote:
>>>>>
>>>>>> Hi Hang,
>>>>>>
>>>>>> Thanks for driving this discussion. I think this is a very useful
>>> feature
>>>>>> for connectors.
>>>>>>
>>>>>> The FLIP looks quite good to me, and I just have two suggestions.
>>>>>>
>>>>>> 1. In the "Public Interface" section, mention that the implementation
>>>>>> behavior of "SplitEnumeratorContext#metricGroup" is changed from
>>>>> returning
>>>>>> null to returning a concrete SplitEnumeratorMetricGroup instance.
>> Even
>>>>>> though the API is already there, the behavior change can also be
>>>>> considered
>>>>>> a public change.
>>>>>>
>>>>> SplitEnumeratorContext#metricGroup is an interface and this FLIP does
>>> not
>>>>> seem to change its semantics/behavior. The FLIP does change the
>>>>> implementation/behavior of SourceCoordinatorContext#metricGroup, which
>>> is
>>>>> marked @Internal.
>>>>>
>>>>> Thus it might seem a bit weird to add in the public interface section
>>>>> saying that we change the interface SplitEnumeratorContext#metricGroup
>>> from
>>>>> returning null to non-null object.
>>>>>
>>>>> 2. Mention the newly added interface of
>>> "OperatorCoordinator#metricGroup"
>>>>>> in the "Proposed Changes" section or "Public Interface" section. As
>> the
>>>>>> FLIP said, OperatorCoordinator is widely used in many connectors.
>>> Though
>>>>> it
>>>>>> is still an @Internal API, I think it is worth mentioning the change
>> in
>>>>> the
>>>>>> FLIP.
>>>>>
>>>>> Since OperatorCoordinator is an internal API, it seems reasonable to
>>>>> explain it in the proposed change section. The FLIP seems to have
>>>>> documented this in point 5 of the proposed change section.
>>>>>
>>>>> BTW, if we think there are @internal classes that are important enough
>>> to
>>>>> be added in the public interface section, it might be useful to
>>> explicitly
>>>>> discuss this topic and document it in the "*What are the "public
>>>>> interfaces" of the project*" in this
>>>>> <
>>>>>
>> https://cwiki.apache.org/confluence/display/FLINK/Flink+Improvement+Proposals
>>>>> wiki.
>>>>>
>>>>>
>>>>>> Best,
>>>>>> Jark
>>>>>>
>>>>>>
>>>>>> On Mon, 26 Dec 2022 at 18:06, Hang Ruan <ru...@gmail.com>
>>> wrote:
>>>>>>> Hi, thanks for the feedback, Zhu Zhu and Qingsheng.
>>>>>>> After combining everyone's comments, the main concerns and
>>>>> corresponding
>>>>>>> adjustments are as follows.
>>>>>>>
>>>>>>> Q1: Common metrics are not quite useful.
>>>>>>> numEventsIn and numEventsOut counters will be removed from the
>>>>>>> OperatorCoordinatorMetricGroup. These common metrics do not provide
>>>>>> enough
>>>>>>> information for users. The users are more willing to get the number
>> of
>>>>>>> events of the specified type instead of the total number. And this
>>>>> metric
>>>>>>> is calculated differently. The implementation could register the
>>> metric
>>>>>> by
>>>>>>> themselves.
>>>>>>>
>>>>>>> Q2: This FLIP is overly complicated.
>>>>>>> This FLIP will become concise after these modifications.
>>>>>>> OperatorCoordinatorMetricGroup has already been introduced into
>> Flink
>>>>> by
>>>>>>> FLIP-179<
>>>>>>>
>>>>>>>
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-179%3A+Expose+Standardized+Operator+Metrics
>>>>>>>> .
>>>>>>> And
>>>>>>> this FLIP will not change it. This FLIP only provides a new metric
>>>>> option
>>>>>>> and a new metric group scope. The changes in proposed changes
>> provide
>>>>> the
>>>>>>> details about the modifications for the internal classes, which
>> might
>>>>>> make
>>>>>>> it look complicated.
>>>>>>>
>>>>>>> Thanks for all the comments again. If there are no further comments,
>>> we
>>>>>>> plan to start the voting thread this week.
>>>>>>>
>>>>>>> Best,
>>>>>>> Hang
>>>>>>>
>>>>>>> Qingsheng Ren <re...@gmail.com> 于2022年12月26日周一 16:48写道：
>>>>>>>
>>>>>>>> Thanks for the FLIP, Hang!
>>>>>>>>
>>>>>>>> This FLIP overall looks good to me. Actually I share the same
>> concern
>>>>>>> with
>>>>>>>> Zhu that numEventsIn and numEventsOut counters are not quite useful
>>>>> to
>>>>>>> end
>>>>>>>> users. OperatorEvent is a quite low-level abstraction, which
>> requires
>>>>>>>> instantialization in order to be practical to users and developers,
>>>>> so
>>>>>>>> maybe it's better to exclude them from the FLIP.
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Qingsheng
>>>>>>>> Ververica (Alibaba)
>>>>>>>>
>>>>>>>> On Mon, Dec 26, 2022 at 12:08 PM Zhu Zhu <re...@gmail.com>
>> wrote:
>>>>>>>>> Hi Hang,
>>>>>>>>> I still see no strong reason why we need numEventsIn/numEventsOut
>>>>>>>> metrics.
>>>>>>>>> In the discussion in FLINK-29801, I can see the same concern from
>>>>>>> others.
>>>>>>>>> So I prefer to exclude them from this FLIP to avoid over-extending
>>>>>> the
>>>>>>>>> scope.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Zhu
>>>>>>>>>
>>>>>>>>> Hang Ruan <ru...@gmail.com> 于2022年12月23日周五 15:21写道：
>>>>>>>>>> Hi everyone,
>>>>>>>>>>
>>>>>>>>>> As for the Zhu Zhu's problem, I think we should keep the common
>>>>>>>> metrics,
>>>>>>>>> which will help to observe incoming and outgoing events. What do
>>>>> you
>>>>>>>> think,
>>>>>>>>> @Zhu Zhu ?
>>>>>>>>>> And @Chesnay, are there any other issues you are more concerned
>>>>>>> about?
>>>>>>>>> Looking forward to your reply.
>>>>>>>>>> Thanks for all the comments. If there are no further comments, we
>>>>>>> plan
>>>>>>>>> to start the voting thread next week.
>>>>>>>>>> Best,
>>>>>>>>>> Hang
>>>>>>>>>>
>>>>>>>>>> Hang Ruan <ru...@gmail.com> 于2022年12月15日周四 16:49写道：
>>>>>>>>>>> Hi, Zhu Zhu,
>>>>>>>>>>>
>>>>>>>>>>> Thanks for your feedback!
>>>>>>>>>>>
>>>>>>>>>>> The OperatorCoordinator implementations are different. And their
>>>>>>>>> metrics are much different too. We try to find the common metrics
>>>>> and
>>>>>>> put
>>>>>>>>> them in the OperatorCoordinatorMetricGroup. If most developers
>>>>> think
>>>>>> we
>>>>>>>> do
>>>>>>>>> not need these common metrics, removing the common metrics is also
>>>>>>>>> acceptable.
>>>>>>>>>>> Best,
>>>>>>>>>>> Hang
>>>>>>>>>>>
>>>>>>>>>>> Zhu Zhu <re...@gmail.com> 于2022年12月14日周三 22:09写道：
>>>>>>>>>>>> Hi Hang & MengYue,
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks for creating this FLIP!
>>>>>>>>>>>>
>>>>>>>>>>>> I think it is very useful, mainly in two aspects:
>>>>>>>>>>>> 1. Enables OperatorCoordinators to register metrics. Currently
>>>>>>>>>>>> the coordinators has no way to do this. And operator
>>>>> coordinator
>>>>>>>>>>>> metric group further enables the SplitEnumerator to have access
>>>>>>>>>>>> to a registered metric group (via the existing public interface
>>>>>>>>>>>> SplitEnumeratorContext#metricGroup()), which is null at the
>>>>>> moment.
>>>>>>>>>>>> 2. Defines the scope of operator coordinator metrics. A clear
>>>>>>>>> definition
>>>>>>>>>>>> makes it easy for users to find their wanted metrics. The
>>>>>>> definition
>>>>>>>>>>>> also helps to avoid conflicts of metrics from multiple
>>>>>>>>> OperatorCoordinators
>>>>>>>>>>>> of the same kind. E.g. each SourceCoordinator may have its own
>>>>>>>>>>>> numSourceSplits metric, these metrics should not be directly
>>>>>>>> registered
>>>>>>>>>>>> to the job metric group.
>>>>>>>>>>>>
>>>>>>>>>>>> What I'm a bit concerned is the necessity of the introduced
>>>>>> common
>>>>>>>>> metrics
>>>>>>>>>>>> numEventsInCounter & numEventsOutCounter. If there any case
>>>>> which
>>>>>>>>> strongly
>>>>>>>>>>>> requires them?
>>>>>>>>>>>>
>>>>>>>>>>>> Regarding the concerns of Chesnay,
>>>>>>>>>>>>> A dedicated coordinator MG implementation is overkill
>>>>>>>>>>>> Directly using the job metric group can result in metric
>>>>>> conflicts,
>>>>>>>> as
>>>>>>>>> mentioned
>>>>>>>>>>>> in above #2.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Zhu
>>>>>>>>>>>>
>>>>>>>>>>>> Dong Lin <li...@gmail.com> 于2022年12月10日周六 14:16写道：
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Chesney,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Just to double check with you, OperatorCoordinatorMetricGroup
>>>>>>>>> (annotated as
>>>>>>>>>>>>> @PublicEvolving) has already been introduced into Flink by
>>>>>>> FLIP-179
>>>>>>>>>>>>> <
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-179%3A+Expose+Standardized+Operator+Metrics
>>>>>>>>>> .
>>>>>>>>>>>>> And that FLIP has got you +1.. Do you mean we should remove
>>>>>> this
>>>>>>>>>>>>> OperatorCoordinatorMetricGroup?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>> Dong
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Sat, Dec 10, 2022 at 1:33 AM Chesnay Schepler <
>>>>>>>> chesnay@apache.org>
>>>>>>>>> wrote:
>>>>>>>>>>>>>> As a whole I feel like this FLIP is overly complicated. A
>>>>>>>> dedicated
>>>>>>>>>>>>>> coordinator MG implementation is overkill; it could just
>>>>>> re-use
>>>>>>>> the
>>>>>>>>>>>>>> existing Task/OperatorMGs to create the same structure we
>>>>>> have
>>>>>>> on
>>>>>>>>> TMs,
>>>>>>>>>>>>>> similar to what we did with the Job MG.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> However, I'm not convinced that this is required anyway,
>>>>>>> because
>>>>>>>>> all the
>>>>>>>>>>>>>> example metrics you listed can be implemented on the TM
>>>>> side
>>>>>> +
>>>>>>>>>>>>>> aggregating them in the external metrics backend.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Since I'm on holidays soon, just so no one tries to pull a
>>>>>> fast
>>>>>>>>> one on
>>>>>>>>>>>>>> me, if this were to go to a vote as-is I'd be against it.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 09/12/2022 15:30, Dong Lin wrote:
>>>>>>>>>>>>>>> Hi Hang,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks for the FLIP! The FLIP looks good and it is pretty
>>>>>>>>> informative.
>>>>>>>>>>>>>>> I have just two minor comments regarding names:
>>>>>>>>>>>>>>> - Would it be useful to rename the config key as
>>>>>>>>>>>>>>> *metrics.scope.jm.job.operator-coordinator* for
>>>>> consistency
>>>>>>>> with
>>>>>>>>>>>>>>> *metrics.scope.jm.job
>>>>>>>>>>>>>>> *(which is not named as *jm-job)?
>>>>>>>>>>>>>>> - Maybe rename the variable as
>>>>>>>> SCOPE_NAMING_OPERATOR_COORDINATOR
>>>>>>>>> for
>>>>>>>>>>>>>>> simplicity and consistency with SCOPE_NAMING_OPERATOR
>>>>>> (which
>>>>>>> is
>>>>>>>>> not named
>>>>>>>>>>>>>>> as SCOPE_NAMING_TM_JOB_OPERATOR)?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>> Dong
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Thu, Dec 8, 2022 at 3:28 PM Hang Ruan <
>>>>>>>> ruanhang1993@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> MengYue and I created FLIP-274[1] Introduce metric group
>>>>>> for
>>>>>>>>>>>>>>>> OperatorCoordinator. OperatorCoordinator is the
>>>>>> coordinator
>>>>>>>> for
>>>>>>>>> runtime
>>>>>>>>>>>>>>>> operators and running on Job Manager. The coordination
>>>>>>>>> mechanism is
>>>>>>>>>>>>>>>> operator events between OperatorCoordinator and its all
>>>>>>>>> operators, the
>>>>>>>>>>>>>>>> coordination is more and more using in Flink, for
>>>>> example
>>>>>>> many
>>>>>>>>> Sources
>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>> Sinks depend on the mechanism to assign splits and
>>>>>>> coordinate
>>>>>>>>> commits to
>>>>>>>>>>>>>>>> external systems. The OperatorCoordinator is widely
>>>>> using
>>>>>> in
>>>>>>>>> flink kafka
>>>>>>>>>>>>>>>> connector, flink pulsar connector, flink cdc connector,
>>>>>>> flink
>>>>>>>>> hudi
>>>>>>>>>>>>>>>> connector and so on.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> But there is not a suitable metric group scope for the
>>>>>>>>>>>>>> OperatorCoordinator
>>>>>>>>>>>>>>>> and not an implementation for the interface
>>>>>>>>>>>>>> OperatorCoordinatorMetricGroup.
>>>>>>>>>>>>>>>> These metrics in OperatorCoordinator could be how many
>>>>>>>>> splits/partitions
>>>>>>>>>>>>>>>> have been assigned to source readers, how many files
>>>>> have
>>>>>>> been
>>>>>>>>> written
>>>>>>>>>>>>>> out
>>>>>>>>>>>>>>>> by sink writers, these metrics not only help users to
>>>>> know
>>>>>>> the
>>>>>>>>> job
>>>>>>>>>>>>>> progress
>>>>>>>>>>>>>>>> but also make big job maintaining easier. Thus we
>>>>> propose
>>>>>>> the
>>>>>>>>> FLIP-274
>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>> introduce a new metric group scope for
>>>>> OperatorCoordinator
>>>>>>> and
>>>>>>>>> provide
>>>>>>>>>>>>>> an
>>>>>>>>>>>>>>>> internal implementation for
>>>>>> OperatorCoordinatorMetricGroup.
>>>>>>>>>>>>>>>> Could you help review this FLIP when you get time? Any
>>>>>>>> feedback
>>>>>>>>> is
>>>>>>>>>>>>>>>> appreciated!
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>> Hang
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> [1]
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-274%3A+Introduce+metric+group+for+OperatorCoordinator
>>>>>>>>>>>>>>
>>>

Re: [DISCUSS] FLIP-274 : Introduce metric group for OperatorCoordinator

Posted by Hang Ruan <ru...@gmail.com>.

Hi, @chesnay@apache.org <ch...@apache.org> ,

Do you have time to help to review this FLIP again after the modification?
Looking forward to your reply.
This FLIP will add a new configuration for the
OperatorCoordinatorMetricGroup scope format. It provides an internal
implementation and is added as a component to the JobManagerJobMetricGroup.
If something doesn't make sense, could you provide some advice? It will be
very helpful. Thanks a lot for your help.

Best,
Hang

Martijn Visser <ma...@apache.org> 于2023年1月11日周三 16:34写道：

> Hi Hang,
>
> I'm a bit surprised that this has gone to a vote, given that Chesnay
> deliberately mentioned that he would vote against it as-is. I would expect
> that before going to a vote, he has had the opportunity to participate in
> this discussion.
>
> Best regards,
>
> Martijn
>
> On Tue, Jan 3, 2023 at 12:53 PM Jark Wu <im...@gmail.com> wrote:
>
> > Hi Dong,
> >
> > Regarding “SplitEnumeratorContext#metricGroup”, my only concern is that
> > this is a core interface for the FLIP. It’s hard to tell how sources use
> > metric group without mentioning this interface. Even if this is an
> existing
> > API, I think it’s worth introducing the interface again and declaring
> that
> > we will implement the interface instead of a no-op method in this FLIP.
> >
> > Anyway, this is a minor problem and shouldn’t block this FLIP. I’m +1 to
> > start a vote.
> >
> > Best,
> > Jark
> >
> >
> > > 2023年1月3日 10:03，Hang Ruan <ru...@gmail.com> 写道：
> > >
> > > Hi, Jark and Dong,
> > >
> > > Thanks for your comments. Sorry for my late reply.
> > >
> > > For suggestion 1, I plan to implement the SplitEnumeratorMetricGroup in
> > > another issue, and it is not contained in this FLIP. I will add some
> > > description about this part.
> > > For suggestion 2, changes about OperatorCoordinator#metricGroup has
> > already
> > > been documented in the proposed change section.
> > >
> > > Best,
> > > Hang
> > >
> > > Dong Lin <li...@gmail.com> 于2023年1月1日周日 09:45写道：
> > >
> > >> Let me chime-in and add comments regarding the public interface
> section.
> > >> Please see my comments inline.
> > >>
> > >> On Thu, Dec 29, 2022 at 6:08 PM Jark Wu <im...@gmail.com> wrote:
> > >>
> > >>> Hi Hang,
> > >>>
> > >>> Thanks for driving this discussion. I think this is a very useful
> > feature
> > >>> for connectors.
> > >>>
> > >>> The FLIP looks quite good to me, and I just have two suggestions.
> > >>>
> > >>> 1. In the "Public Interface" section, mention that the implementation
> > >>> behavior of "SplitEnumeratorContext#metricGroup" is changed from
> > >> returning
> > >>> null to returning a concrete SplitEnumeratorMetricGroup instance.
> Even
> > >>> though the API is already there, the behavior change can also be
> > >> considered
> > >>> a public change.
> > >>>
> > >>
> > >> SplitEnumeratorContext#metricGroup is an interface and this FLIP does
> > not
> > >> seem to change its semantics/behavior. The FLIP does change the
> > >> implementation/behavior of SourceCoordinatorContext#metricGroup, which
> > is
> > >> marked @Internal.
> > >>
> > >> Thus it might seem a bit weird to add in the public interface section
> > >> saying that we change the interface SplitEnumeratorContext#metricGroup
> > from
> > >> returning null to non-null object.
> > >>
> > >> 2. Mention the newly added interface of
> > "OperatorCoordinator#metricGroup"
> > >>> in the "Proposed Changes" section or "Public Interface" section. As
> the
> > >>> FLIP said, OperatorCoordinator is widely used in many connectors.
> > Though
> > >> it
> > >>> is still an @Internal API, I think it is worth mentioning the change
> in
> > >> the
> > >>> FLIP.
> > >>
> > >>
> > >> Since OperatorCoordinator is an internal API, it seems reasonable to
> > >> explain it in the proposed change section. The FLIP seems to have
> > >> documented this in point 5 of the proposed change section.
> > >>
> > >> BTW, if we think there are @internal classes that are important enough
> > to
> > >> be added in the public interface section, it might be useful to
> > explicitly
> > >> discuss this topic and document it in the "*What are the "public
> > >> interfaces" of the project*" in this
> > >> <
> > >>
> >
> https://cwiki.apache.org/confluence/display/FLINK/Flink+Improvement+Proposals
> > >>>
> > >> wiki.
> > >>
> > >>
> > >>> Best,
> > >>> Jark
> > >>>
> > >>>
> > >>> On Mon, 26 Dec 2022 at 18:06, Hang Ruan <ru...@gmail.com>
> > wrote:
> > >>>
> > >>>> Hi, thanks for the feedback, Zhu Zhu and Qingsheng.
> > >>>> After combining everyone's comments, the main concerns and
> > >> corresponding
> > >>>> adjustments are as follows.
> > >>>>
> > >>>> Q1: Common metrics are not quite useful.
> > >>>> numEventsIn and numEventsOut counters will be removed from the
> > >>>> OperatorCoordinatorMetricGroup. These common metrics do not provide
> > >>> enough
> > >>>> information for users. The users are more willing to get the number
> of
> > >>>> events of the specified type instead of the total number. And this
> > >> metric
> > >>>> is calculated differently. The implementation could register the
> > metric
> > >>> by
> > >>>> themselves.
> > >>>>
> > >>>> Q2: This FLIP is overly complicated.
> > >>>> This FLIP will become concise after these modifications.
> > >>>> OperatorCoordinatorMetricGroup has already been introduced into
> Flink
> > >> by
> > >>>> FLIP-179<
> > >>>>
> > >>>>
> > >>>
> > >>
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-179%3A+Expose+Standardized+Operator+Metrics
> > >>>>> .
> > >>>> And
> > >>>> this FLIP will not change it. This FLIP only provides a new metric
> > >> option
> > >>>> and a new metric group scope. The changes in proposed changes
> provide
> > >> the
> > >>>> details about the modifications for the internal classes, which
> might
> > >>> make
> > >>>> it look complicated.
> > >>>>
> > >>>> Thanks for all the comments again. If there are no further comments,
> > we
> > >>>> plan to start the voting thread this week.
> > >>>>
> > >>>> Best,
> > >>>> Hang
> > >>>>
> > >>>> Qingsheng Ren <re...@gmail.com> 于2022年12月26日周一 16:48写道：
> > >>>>
> > >>>>> Thanks for the FLIP, Hang!
> > >>>>>
> > >>>>> This FLIP overall looks good to me. Actually I share the same
> concern
> > >>>> with
> > >>>>> Zhu that numEventsIn and numEventsOut counters are not quite useful
> > >> to
> > >>>> end
> > >>>>> users. OperatorEvent is a quite low-level abstraction, which
> requires
> > >>>>> instantialization in order to be practical to users and developers,
> > >> so
> > >>>>> maybe it's better to exclude them from the FLIP.
> > >>>>>
> > >>>>> Best,
> > >>>>> Qingsheng
> > >>>>> Ververica (Alibaba)
> > >>>>>
> > >>>>> On Mon, Dec 26, 2022 at 12:08 PM Zhu Zhu <re...@gmail.com>
> wrote:
> > >>>>>
> > >>>>>> Hi Hang,
> > >>>>>> I still see no strong reason why we need numEventsIn/numEventsOut
> > >>>>> metrics.
> > >>>>>> In the discussion in FLINK-29801, I can see the same concern from
> > >>>> others.
> > >>>>>> So I prefer to exclude them from this FLIP to avoid over-extending
> > >>> the
> > >>>>>> scope.
> > >>>>>>
> > >>>>>> Thanks,
> > >>>>>> Zhu
> > >>>>>>
> > >>>>>> Hang Ruan <ru...@gmail.com> 于2022年12月23日周五 15:21写道：
> > >>>>>>>
> > >>>>>>> Hi everyone,
> > >>>>>>>
> > >>>>>>> As for the Zhu Zhu's problem, I think we should keep the common
> > >>>>> metrics,
> > >>>>>> which will help to observe incoming and outgoing events. What do
> > >> you
> > >>>>> think,
> > >>>>>> @Zhu Zhu ?
> > >>>>>>> And @Chesnay, are there any other issues you are more concerned
> > >>>> about?
> > >>>>>> Looking forward to your reply.
> > >>>>>>>
> > >>>>>>> Thanks for all the comments. If there are no further comments, we
> > >>>> plan
> > >>>>>> to start the voting thread next week.
> > >>>>>>>
> > >>>>>>> Best,
> > >>>>>>> Hang
> > >>>>>>>
> > >>>>>>> Hang Ruan <ru...@gmail.com> 于2022年12月15日周四 16:49写道：
> > >>>>>>>>
> > >>>>>>>> Hi, Zhu Zhu,
> > >>>>>>>>
> > >>>>>>>> Thanks for your feedback!
> > >>>>>>>>
> > >>>>>>>> The OperatorCoordinator implementations are different. And their
> > >>>>>> metrics are much different too. We try to find the common metrics
> > >> and
> > >>>> put
> > >>>>>> them in the OperatorCoordinatorMetricGroup. If most developers
> > >> think
> > >>> we
> > >>>>> do
> > >>>>>> not need these common metrics, removing the common metrics is also
> > >>>>>> acceptable.
> > >>>>>>>>
> > >>>>>>>> Best,
> > >>>>>>>> Hang
> > >>>>>>>>
> > >>>>>>>> Zhu Zhu <re...@gmail.com> 于2022年12月14日周三 22:09写道：
> > >>>>>>>>>
> > >>>>>>>>> Hi Hang & MengYue,
> > >>>>>>>>>
> > >>>>>>>>> Thanks for creating this FLIP!
> > >>>>>>>>>
> > >>>>>>>>> I think it is very useful, mainly in two aspects:
> > >>>>>>>>> 1. Enables OperatorCoordinators to register metrics. Currently
> > >>>>>>>>> the coordinators has no way to do this. And operator
> > >> coordinator
> > >>>>>>>>> metric group further enables the SplitEnumerator to have access
> > >>>>>>>>> to a registered metric group (via the existing public interface
> > >>>>>>>>> SplitEnumeratorContext#metricGroup()), which is null at the
> > >>> moment.
> > >>>>>>>>>
> > >>>>>>>>> 2. Defines the scope of operator coordinator metrics. A clear
> > >>>>>> definition
> > >>>>>>>>> makes it easy for users to find their wanted metrics. The
> > >>>> definition
> > >>>>>>>>> also helps to avoid conflicts of metrics from multiple
> > >>>>>> OperatorCoordinators
> > >>>>>>>>> of the same kind. E.g. each SourceCoordinator may have its own
> > >>>>>>>>> numSourceSplits metric, these metrics should not be directly
> > >>>>> registered
> > >>>>>>>>> to the job metric group.
> > >>>>>>>>>
> > >>>>>>>>> What I'm a bit concerned is the necessity of the introduced
> > >>> common
> > >>>>>> metrics
> > >>>>>>>>> numEventsInCounter & numEventsOutCounter. If there any case
> > >> which
> > >>>>>> strongly
> > >>>>>>>>> requires them?
> > >>>>>>>>>
> > >>>>>>>>> Regarding the concerns of Chesnay,
> > >>>>>>>>>> A dedicated coordinator MG implementation is overkill
> > >>>>>>>>> Directly using the job metric group can result in metric
> > >>> conflicts,
> > >>>>> as
> > >>>>>> mentioned
> > >>>>>>>>> in above #2.
> > >>>>>>>>>
> > >>>>>>>>> Thanks,
> > >>>>>>>>> Zhu
> > >>>>>>>>>
> > >>>>>>>>> Dong Lin <li...@gmail.com> 于2022年12月10日周六 14:16写道：
> > >>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>> Hi Chesney,
> > >>>>>>>>>>
> > >>>>>>>>>> Just to double check with you, OperatorCoordinatorMetricGroup
> > >>>>>> (annotated as
> > >>>>>>>>>> @PublicEvolving) has already been introduced into Flink by
> > >>>> FLIP-179
> > >>>>>>>>>> <
> > >>>>>>
> > >>>>>
> > >>>>
> > >>>
> > >>
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-179%3A+Expose+Standardized+Operator+Metrics
> > >>>>>>> .
> > >>>>>>>>>> And that FLIP has got you +1.. Do you mean we should remove
> > >>> this
> > >>>>>>>>>> OperatorCoordinatorMetricGroup?
> > >>>>>>>>>>
> > >>>>>>>>>> Regards,
> > >>>>>>>>>> Dong
> > >>>>>>>>>>
> > >>>>>>>>>> On Sat, Dec 10, 2022 at 1:33 AM Chesnay Schepler <
> > >>>>> chesnay@apache.org>
> > >>>>>> wrote:
> > >>>>>>>>>>
> > >>>>>>>>>>> As a whole I feel like this FLIP is overly complicated. A
> > >>>>> dedicated
> > >>>>>>>>>>> coordinator MG implementation is overkill; it could just
> > >>> re-use
> > >>>>> the
> > >>>>>>>>>>> existing Task/OperatorMGs to create the same structure we
> > >>> have
> > >>>> on
> > >>>>>> TMs,
> > >>>>>>>>>>> similar to what we did with the Job MG.
> > >>>>>>>>>>>
> > >>>>>>>>>>> However, I'm not convinced that this is required anyway,
> > >>>> because
> > >>>>>> all the
> > >>>>>>>>>>> example metrics you listed can be implemented on the TM
> > >> side
> > >>> +
> > >>>>>>>>>>> aggregating them in the external metrics backend.
> > >>>>>>>>>>>
> > >>>>>>>>>>> Since I'm on holidays soon, just so no one tries to pull a
> > >>> fast
> > >>>>>> one on
> > >>>>>>>>>>> me, if this were to go to a vote as-is I'd be against it.
> > >>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>> On 09/12/2022 15:30, Dong Lin wrote:
> > >>>>>>>>>>>> Hi Hang,
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Thanks for the FLIP! The FLIP looks good and it is pretty
> > >>>>>> informative.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> I have just two minor comments regarding names:
> > >>>>>>>>>>>> - Would it be useful to rename the config key as
> > >>>>>>>>>>>> *metrics.scope.jm.job.operator-coordinator* for
> > >> consistency
> > >>>>> with
> > >>>>>>>>>>>> *metrics.scope.jm.job
> > >>>>>>>>>>>> *(which is not named as *jm-job)?
> > >>>>>>>>>>>> - Maybe rename the variable as
> > >>>>> SCOPE_NAMING_OPERATOR_COORDINATOR
> > >>>>>> for
> > >>>>>>>>>>>> simplicity and consistency with SCOPE_NAMING_OPERATOR
> > >>> (which
> > >>>> is
> > >>>>>> not named
> > >>>>>>>>>>>> as SCOPE_NAMING_TM_JOB_OPERATOR)?
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Cheers,
> > >>>>>>>>>>>> Dong
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> On Thu, Dec 8, 2022 at 3:28 PM Hang Ruan <
> > >>>>> ruanhang1993@gmail.com>
> > >>>>>> wrote:
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>> Hi all,
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> MengYue and I created FLIP-274[1] Introduce metric group
> > >>> for
> > >>>>>>>>>>>>> OperatorCoordinator. OperatorCoordinator is the
> > >>> coordinator
> > >>>>> for
> > >>>>>> runtime
> > >>>>>>>>>>>>> operators and running on Job Manager. The coordination
> > >>>>>> mechanism is
> > >>>>>>>>>>>>> operator events between OperatorCoordinator and its all
> > >>>>>> operators, the
> > >>>>>>>>>>>>> coordination is more and more using in Flink, for
> > >> example
> > >>>> many
> > >>>>>> Sources
> > >>>>>>>>>>> and
> > >>>>>>>>>>>>> Sinks depend on the mechanism to assign splits and
> > >>>> coordinate
> > >>>>>> commits to
> > >>>>>>>>>>>>> external systems. The OperatorCoordinator is widely
> > >> using
> > >>> in
> > >>>>>> flink kafka
> > >>>>>>>>>>>>> connector, flink pulsar connector, flink cdc connector,
> > >>>> flink
> > >>>>>> hudi
> > >>>>>>>>>>>>> connector and so on.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> But there is not a suitable metric group scope for the
> > >>>>>>>>>>> OperatorCoordinator
> > >>>>>>>>>>>>> and not an implementation for the interface
> > >>>>>>>>>>> OperatorCoordinatorMetricGroup.
> > >>>>>>>>>>>>> These metrics in OperatorCoordinator could be how many
> > >>>>>> splits/partitions
> > >>>>>>>>>>>>> have been assigned to source readers, how many files
> > >> have
> > >>>> been
> > >>>>>> written
> > >>>>>>>>>>> out
> > >>>>>>>>>>>>> by sink writers, these metrics not only help users to
> > >> know
> > >>>> the
> > >>>>>> job
> > >>>>>>>>>>> progress
> > >>>>>>>>>>>>> but also make big job maintaining easier. Thus we
> > >> propose
> > >>>> the
> > >>>>>> FLIP-274
> > >>>>>>>>>>> to
> > >>>>>>>>>>>>> introduce a new metric group scope for
> > >> OperatorCoordinator
> > >>>> and
> > >>>>>> provide
> > >>>>>>>>>>> an
> > >>>>>>>>>>>>> internal implementation for
> > >>> OperatorCoordinatorMetricGroup.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> Could you help review this FLIP when you get time? Any
> > >>>>> feedback
> > >>>>>> is
> > >>>>>>>>>>>>> appreciated!
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> Best,
> > >>>>>>>>>>>>> Hang
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> [1]
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>
> > >>>>>
> > >>>>
> > >>>
> > >>
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-274%3A+Introduce+metric+group+for+OperatorCoordinator
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>
> > >>>>>
> > >>>>
> > >>>
> > >>
> >
> >
>

Re: [DISCUSS] FLIP-274 : Introduce metric group for OperatorCoordinator

Posted by Martijn Visser <ma...@apache.org>.

Hi Hang,

I'm a bit surprised that this has gone to a vote, given that Chesnay
deliberately mentioned that he would vote against it as-is. I would expect
that before going to a vote, he has had the opportunity to participate in
this discussion.

Best regards,

Martijn

On Tue, Jan 3, 2023 at 12:53 PM Jark Wu <im...@gmail.com> wrote:

> Hi Dong,
>
> Regarding “SplitEnumeratorContext#metricGroup”, my only concern is that
> this is a core interface for the FLIP. It’s hard to tell how sources use
> metric group without mentioning this interface. Even if this is an existing
> API, I think it’s worth introducing the interface again and declaring that
> we will implement the interface instead of a no-op method in this FLIP.
>
> Anyway, this is a minor problem and shouldn’t block this FLIP. I’m +1 to
> start a vote.
>
> Best,
> Jark
>
>
> > 2023年1月3日 10:03，Hang Ruan <ru...@gmail.com> 写道：
> >
> > Hi, Jark and Dong,
> >
> > Thanks for your comments. Sorry for my late reply.
> >
> > For suggestion 1, I plan to implement the SplitEnumeratorMetricGroup in
> > another issue, and it is not contained in this FLIP. I will add some
> > description about this part.
> > For suggestion 2, changes about OperatorCoordinator#metricGroup has
> already
> > been documented in the proposed change section.
> >
> > Best,
> > Hang
> >
> > Dong Lin <li...@gmail.com> 于2023年1月1日周日 09:45写道：
> >
> >> Let me chime-in and add comments regarding the public interface section.
> >> Please see my comments inline.
> >>
> >> On Thu, Dec 29, 2022 at 6:08 PM Jark Wu <im...@gmail.com> wrote:
> >>
> >>> Hi Hang,
> >>>
> >>> Thanks for driving this discussion. I think this is a very useful
> feature
> >>> for connectors.
> >>>
> >>> The FLIP looks quite good to me, and I just have two suggestions.
> >>>
> >>> 1. In the "Public Interface" section, mention that the implementation
> >>> behavior of "SplitEnumeratorContext#metricGroup" is changed from
> >> returning
> >>> null to returning a concrete SplitEnumeratorMetricGroup instance. Even
> >>> though the API is already there, the behavior change can also be
> >> considered
> >>> a public change.
> >>>
> >>
> >> SplitEnumeratorContext#metricGroup is an interface and this FLIP does
> not
> >> seem to change its semantics/behavior. The FLIP does change the
> >> implementation/behavior of SourceCoordinatorContext#metricGroup, which
> is
> >> marked @Internal.
> >>
> >> Thus it might seem a bit weird to add in the public interface section
> >> saying that we change the interface SplitEnumeratorContext#metricGroup
> from
> >> returning null to non-null object.
> >>
> >> 2. Mention the newly added interface of
> "OperatorCoordinator#metricGroup"
> >>> in the "Proposed Changes" section or "Public Interface" section. As the
> >>> FLIP said, OperatorCoordinator is widely used in many connectors.
> Though
> >> it
> >>> is still an @Internal API, I think it is worth mentioning the change in
> >> the
> >>> FLIP.
> >>
> >>
> >> Since OperatorCoordinator is an internal API, it seems reasonable to
> >> explain it in the proposed change section. The FLIP seems to have
> >> documented this in point 5 of the proposed change section.
> >>
> >> BTW, if we think there are @internal classes that are important enough
> to
> >> be added in the public interface section, it might be useful to
> explicitly
> >> discuss this topic and document it in the "*What are the "public
> >> interfaces" of the project*" in this
> >> <
> >>
> https://cwiki.apache.org/confluence/display/FLINK/Flink+Improvement+Proposals
> >>>
> >> wiki.
> >>
> >>
> >>> Best,
> >>> Jark
> >>>
> >>>
> >>> On Mon, 26 Dec 2022 at 18:06, Hang Ruan <ru...@gmail.com>
> wrote:
> >>>
> >>>> Hi, thanks for the feedback, Zhu Zhu and Qingsheng.
> >>>> After combining everyone's comments, the main concerns and
> >> corresponding
> >>>> adjustments are as follows.
> >>>>
> >>>> Q1: Common metrics are not quite useful.
> >>>> numEventsIn and numEventsOut counters will be removed from the
> >>>> OperatorCoordinatorMetricGroup. These common metrics do not provide
> >>> enough
> >>>> information for users. The users are more willing to get the number of
> >>>> events of the specified type instead of the total number. And this
> >> metric
> >>>> is calculated differently. The implementation could register the
> metric
> >>> by
> >>>> themselves.
> >>>>
> >>>> Q2: This FLIP is overly complicated.
> >>>> This FLIP will become concise after these modifications.
> >>>> OperatorCoordinatorMetricGroup has already been introduced into Flink
> >> by
> >>>> FLIP-179<
> >>>>
> >>>>
> >>>
> >>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-179%3A+Expose+Standardized+Operator+Metrics
> >>>>> .
> >>>> And
> >>>> this FLIP will not change it. This FLIP only provides a new metric
> >> option
> >>>> and a new metric group scope. The changes in proposed changes provide
> >> the
> >>>> details about the modifications for the internal classes, which might
> >>> make
> >>>> it look complicated.
> >>>>
> >>>> Thanks for all the comments again. If there are no further comments,
> we
> >>>> plan to start the voting thread this week.
> >>>>
> >>>> Best,
> >>>> Hang
> >>>>
> >>>> Qingsheng Ren <re...@gmail.com> 于2022年12月26日周一 16:48写道：
> >>>>
> >>>>> Thanks for the FLIP, Hang!
> >>>>>
> >>>>> This FLIP overall looks good to me. Actually I share the same concern
> >>>> with
> >>>>> Zhu that numEventsIn and numEventsOut counters are not quite useful
> >> to
> >>>> end
> >>>>> users. OperatorEvent is a quite low-level abstraction, which requires
> >>>>> instantialization in order to be practical to users and developers,
> >> so
> >>>>> maybe it's better to exclude them from the FLIP.
> >>>>>
> >>>>> Best,
> >>>>> Qingsheng
> >>>>> Ververica (Alibaba)
> >>>>>
> >>>>> On Mon, Dec 26, 2022 at 12:08 PM Zhu Zhu <re...@gmail.com> wrote:
> >>>>>
> >>>>>> Hi Hang,
> >>>>>> I still see no strong reason why we need numEventsIn/numEventsOut
> >>>>> metrics.
> >>>>>> In the discussion in FLINK-29801, I can see the same concern from
> >>>> others.
> >>>>>> So I prefer to exclude them from this FLIP to avoid over-extending
> >>> the
> >>>>>> scope.
> >>>>>>
> >>>>>> Thanks,
> >>>>>> Zhu
> >>>>>>
> >>>>>> Hang Ruan <ru...@gmail.com> 于2022年12月23日周五 15:21写道：
> >>>>>>>
> >>>>>>> Hi everyone,
> >>>>>>>
> >>>>>>> As for the Zhu Zhu's problem, I think we should keep the common
> >>>>> metrics,
> >>>>>> which will help to observe incoming and outgoing events. What do
> >> you
> >>>>> think,
> >>>>>> @Zhu Zhu ?
> >>>>>>> And @Chesnay, are there any other issues you are more concerned
> >>>> about?
> >>>>>> Looking forward to your reply.
> >>>>>>>
> >>>>>>> Thanks for all the comments. If there are no further comments, we
> >>>> plan
> >>>>>> to start the voting thread next week.
> >>>>>>>
> >>>>>>> Best,
> >>>>>>> Hang
> >>>>>>>
> >>>>>>> Hang Ruan <ru...@gmail.com> 于2022年12月15日周四 16:49写道：
> >>>>>>>>
> >>>>>>>> Hi, Zhu Zhu,
> >>>>>>>>
> >>>>>>>> Thanks for your feedback!
> >>>>>>>>
> >>>>>>>> The OperatorCoordinator implementations are different. And their
> >>>>>> metrics are much different too. We try to find the common metrics
> >> and
> >>>> put
> >>>>>> them in the OperatorCoordinatorMetricGroup. If most developers
> >> think
> >>> we
> >>>>> do
> >>>>>> not need these common metrics, removing the common metrics is also
> >>>>>> acceptable.
> >>>>>>>>
> >>>>>>>> Best,
> >>>>>>>> Hang
> >>>>>>>>
> >>>>>>>> Zhu Zhu <re...@gmail.com> 于2022年12月14日周三 22:09写道：
> >>>>>>>>>
> >>>>>>>>> Hi Hang & MengYue,
> >>>>>>>>>
> >>>>>>>>> Thanks for creating this FLIP!
> >>>>>>>>>
> >>>>>>>>> I think it is very useful, mainly in two aspects:
> >>>>>>>>> 1. Enables OperatorCoordinators to register metrics. Currently
> >>>>>>>>> the coordinators has no way to do this. And operator
> >> coordinator
> >>>>>>>>> metric group further enables the SplitEnumerator to have access
> >>>>>>>>> to a registered metric group (via the existing public interface
> >>>>>>>>> SplitEnumeratorContext#metricGroup()), which is null at the
> >>> moment.
> >>>>>>>>>
> >>>>>>>>> 2. Defines the scope of operator coordinator metrics. A clear
> >>>>>> definition
> >>>>>>>>> makes it easy for users to find their wanted metrics. The
> >>>> definition
> >>>>>>>>> also helps to avoid conflicts of metrics from multiple
> >>>>>> OperatorCoordinators
> >>>>>>>>> of the same kind. E.g. each SourceCoordinator may have its own
> >>>>>>>>> numSourceSplits metric, these metrics should not be directly
> >>>>> registered
> >>>>>>>>> to the job metric group.
> >>>>>>>>>
> >>>>>>>>> What I'm a bit concerned is the necessity of the introduced
> >>> common
> >>>>>> metrics
> >>>>>>>>> numEventsInCounter & numEventsOutCounter. If there any case
> >> which
> >>>>>> strongly
> >>>>>>>>> requires them?
> >>>>>>>>>
> >>>>>>>>> Regarding the concerns of Chesnay,
> >>>>>>>>>> A dedicated coordinator MG implementation is overkill
> >>>>>>>>> Directly using the job metric group can result in metric
> >>> conflicts,
> >>>>> as
> >>>>>> mentioned
> >>>>>>>>> in above #2.
> >>>>>>>>>
> >>>>>>>>> Thanks,
> >>>>>>>>> Zhu
> >>>>>>>>>
> >>>>>>>>> Dong Lin <li...@gmail.com> 于2022年12月10日周六 14:16写道：
> >>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> Hi Chesney,
> >>>>>>>>>>
> >>>>>>>>>> Just to double check with you, OperatorCoordinatorMetricGroup
> >>>>>> (annotated as
> >>>>>>>>>> @PublicEvolving) has already been introduced into Flink by
> >>>> FLIP-179
> >>>>>>>>>> <
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-179%3A+Expose+Standardized+Operator+Metrics
> >>>>>>> .
> >>>>>>>>>> And that FLIP has got you +1.. Do you mean we should remove
> >>> this
> >>>>>>>>>> OperatorCoordinatorMetricGroup?
> >>>>>>>>>>
> >>>>>>>>>> Regards,
> >>>>>>>>>> Dong
> >>>>>>>>>>
> >>>>>>>>>> On Sat, Dec 10, 2022 at 1:33 AM Chesnay Schepler <
> >>>>> chesnay@apache.org>
> >>>>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> As a whole I feel like this FLIP is overly complicated. A
> >>>>> dedicated
> >>>>>>>>>>> coordinator MG implementation is overkill; it could just
> >>> re-use
> >>>>> the
> >>>>>>>>>>> existing Task/OperatorMGs to create the same structure we
> >>> have
> >>>> on
> >>>>>> TMs,
> >>>>>>>>>>> similar to what we did with the Job MG.
> >>>>>>>>>>>
> >>>>>>>>>>> However, I'm not convinced that this is required anyway,
> >>>> because
> >>>>>> all the
> >>>>>>>>>>> example metrics you listed can be implemented on the TM
> >> side
> >>> +
> >>>>>>>>>>> aggregating them in the external metrics backend.
> >>>>>>>>>>>
> >>>>>>>>>>> Since I'm on holidays soon, just so no one tries to pull a
> >>> fast
> >>>>>> one on
> >>>>>>>>>>> me, if this were to go to a vote as-is I'd be against it.
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> On 09/12/2022 15:30, Dong Lin wrote:
> >>>>>>>>>>>> Hi Hang,
> >>>>>>>>>>>>
> >>>>>>>>>>>> Thanks for the FLIP! The FLIP looks good and it is pretty
> >>>>>> informative.
> >>>>>>>>>>>>
> >>>>>>>>>>>> I have just two minor comments regarding names:
> >>>>>>>>>>>> - Would it be useful to rename the config key as
> >>>>>>>>>>>> *metrics.scope.jm.job.operator-coordinator* for
> >> consistency
> >>>>> with
> >>>>>>>>>>>> *metrics.scope.jm.job
> >>>>>>>>>>>> *(which is not named as *jm-job)?
> >>>>>>>>>>>> - Maybe rename the variable as
> >>>>> SCOPE_NAMING_OPERATOR_COORDINATOR
> >>>>>> for
> >>>>>>>>>>>> simplicity and consistency with SCOPE_NAMING_OPERATOR
> >>> (which
> >>>> is
> >>>>>> not named
> >>>>>>>>>>>> as SCOPE_NAMING_TM_JOB_OPERATOR)?
> >>>>>>>>>>>>
> >>>>>>>>>>>> Cheers,
> >>>>>>>>>>>> Dong
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Thu, Dec 8, 2022 at 3:28 PM Hang Ruan <
> >>>>> ruanhang1993@gmail.com>
> >>>>>> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>>> Hi all,
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> MengYue and I created FLIP-274[1] Introduce metric group
> >>> for
> >>>>>>>>>>>>> OperatorCoordinator. OperatorCoordinator is the
> >>> coordinator
> >>>>> for
> >>>>>> runtime
> >>>>>>>>>>>>> operators and running on Job Manager. The coordination
> >>>>>> mechanism is
> >>>>>>>>>>>>> operator events between OperatorCoordinator and its all
> >>>>>> operators, the
> >>>>>>>>>>>>> coordination is more and more using in Flink, for
> >> example
> >>>> many
> >>>>>> Sources
> >>>>>>>>>>> and
> >>>>>>>>>>>>> Sinks depend on the mechanism to assign splits and
> >>>> coordinate
> >>>>>> commits to
> >>>>>>>>>>>>> external systems. The OperatorCoordinator is widely
> >> using
> >>> in
> >>>>>> flink kafka
> >>>>>>>>>>>>> connector, flink pulsar connector, flink cdc connector,
> >>>> flink
> >>>>>> hudi
> >>>>>>>>>>>>> connector and so on.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> But there is not a suitable metric group scope for the
> >>>>>>>>>>> OperatorCoordinator
> >>>>>>>>>>>>> and not an implementation for the interface
> >>>>>>>>>>> OperatorCoordinatorMetricGroup.
> >>>>>>>>>>>>> These metrics in OperatorCoordinator could be how many
> >>>>>> splits/partitions
> >>>>>>>>>>>>> have been assigned to source readers, how many files
> >> have
> >>>> been
> >>>>>> written
> >>>>>>>>>>> out
> >>>>>>>>>>>>> by sink writers, these metrics not only help users to
> >> know
> >>>> the
> >>>>>> job
> >>>>>>>>>>> progress
> >>>>>>>>>>>>> but also make big job maintaining easier. Thus we
> >> propose
> >>>> the
> >>>>>> FLIP-274
> >>>>>>>>>>> to
> >>>>>>>>>>>>> introduce a new metric group scope for
> >> OperatorCoordinator
> >>>> and
> >>>>>> provide
> >>>>>>>>>>> an
> >>>>>>>>>>>>> internal implementation for
> >>> OperatorCoordinatorMetricGroup.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Could you help review this FLIP when you get time? Any
> >>>>> feedback
> >>>>>> is
> >>>>>>>>>>>>> appreciated!
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Best,
> >>>>>>>>>>>>> Hang
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> [1]
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-274%3A+Introduce+metric+group+for+OperatorCoordinator
> >>>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
>
>

Re: [DISCUSS] FLIP-274 : Introduce metric group for OperatorCoordinator

Posted by Jark Wu <im...@gmail.com>.

Hi Dong,

Regarding “SplitEnumeratorContext#metricGroup”, my only concern is that this is a core interface for the FLIP. It’s hard to tell how sources use metric group without mentioning this interface. Even if this is an existing API, I think it’s worth introducing the interface again and declaring that we will implement the interface instead of a no-op method in this FLIP. 

Anyway, this is a minor problem and shouldn’t block this FLIP. I’m +1 to start a vote. 

Best,
Jark


> 2023年1月3日 10:03，Hang Ruan <ru...@gmail.com> 写道：
> 
> Hi, Jark and Dong,
> 
> Thanks for your comments. Sorry for my late reply.
> 
> For suggestion 1, I plan to implement the SplitEnumeratorMetricGroup in
> another issue, and it is not contained in this FLIP. I will add some
> description about this part.
> For suggestion 2, changes about OperatorCoordinator#metricGroup has already
> been documented in the proposed change section.
> 
> Best,
> Hang
> 
> Dong Lin <li...@gmail.com> 于2023年1月1日周日 09:45写道：
> 
>> Let me chime-in and add comments regarding the public interface section.
>> Please see my comments inline.
>> 
>> On Thu, Dec 29, 2022 at 6:08 PM Jark Wu <im...@gmail.com> wrote:
>> 
>>> Hi Hang,
>>> 
>>> Thanks for driving this discussion. I think this is a very useful feature
>>> for connectors.
>>> 
>>> The FLIP looks quite good to me, and I just have two suggestions.
>>> 
>>> 1. In the "Public Interface" section, mention that the implementation
>>> behavior of "SplitEnumeratorContext#metricGroup" is changed from
>> returning
>>> null to returning a concrete SplitEnumeratorMetricGroup instance. Even
>>> though the API is already there, the behavior change can also be
>> considered
>>> a public change.
>>> 
>> 
>> SplitEnumeratorContext#metricGroup is an interface and this FLIP does not
>> seem to change its semantics/behavior. The FLIP does change the
>> implementation/behavior of SourceCoordinatorContext#metricGroup, which is
>> marked @Internal.
>> 
>> Thus it might seem a bit weird to add in the public interface section
>> saying that we change the interface SplitEnumeratorContext#metricGroup from
>> returning null to non-null object.
>> 
>> 2. Mention the newly added interface of "OperatorCoordinator#metricGroup"
>>> in the "Proposed Changes" section or "Public Interface" section. As the
>>> FLIP said, OperatorCoordinator is widely used in many connectors. Though
>> it
>>> is still an @Internal API, I think it is worth mentioning the change in
>> the
>>> FLIP.
>> 
>> 
>> Since OperatorCoordinator is an internal API, it seems reasonable to
>> explain it in the proposed change section. The FLIP seems to have
>> documented this in point 5 of the proposed change section.
>> 
>> BTW, if we think there are @internal classes that are important enough to
>> be added in the public interface section, it might be useful to explicitly
>> discuss this topic and document it in the "*What are the "public
>> interfaces" of the project*" in this
>> <
>> https://cwiki.apache.org/confluence/display/FLINK/Flink+Improvement+Proposals
>>> 
>> wiki.
>> 
>> 
>>> Best,
>>> Jark
>>> 
>>> 
>>> On Mon, 26 Dec 2022 at 18:06, Hang Ruan <ru...@gmail.com> wrote:
>>> 
>>>> Hi, thanks for the feedback, Zhu Zhu and Qingsheng.
>>>> After combining everyone's comments, the main concerns and
>> corresponding
>>>> adjustments are as follows.
>>>> 
>>>> Q1: Common metrics are not quite useful.
>>>> numEventsIn and numEventsOut counters will be removed from the
>>>> OperatorCoordinatorMetricGroup. These common metrics do not provide
>>> enough
>>>> information for users. The users are more willing to get the number of
>>>> events of the specified type instead of the total number. And this
>> metric
>>>> is calculated differently. The implementation could register the metric
>>> by
>>>> themselves.
>>>> 
>>>> Q2: This FLIP is overly complicated.
>>>> This FLIP will become concise after these modifications.
>>>> OperatorCoordinatorMetricGroup has already been introduced into Flink
>> by
>>>> FLIP-179<
>>>> 
>>>> 
>>> 
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-179%3A+Expose+Standardized+Operator+Metrics
>>>>> .
>>>> And
>>>> this FLIP will not change it. This FLIP only provides a new metric
>> option
>>>> and a new metric group scope. The changes in proposed changes provide
>> the
>>>> details about the modifications for the internal classes, which might
>>> make
>>>> it look complicated.
>>>> 
>>>> Thanks for all the comments again. If there are no further comments, we
>>>> plan to start the voting thread this week.
>>>> 
>>>> Best,
>>>> Hang
>>>> 
>>>> Qingsheng Ren <re...@gmail.com> 于2022年12月26日周一 16:48写道：
>>>> 
>>>>> Thanks for the FLIP, Hang!
>>>>> 
>>>>> This FLIP overall looks good to me. Actually I share the same concern
>>>> with
>>>>> Zhu that numEventsIn and numEventsOut counters are not quite useful
>> to
>>>> end
>>>>> users. OperatorEvent is a quite low-level abstraction, which requires
>>>>> instantialization in order to be practical to users and developers,
>> so
>>>>> maybe it's better to exclude them from the FLIP.
>>>>> 
>>>>> Best,
>>>>> Qingsheng
>>>>> Ververica (Alibaba)
>>>>> 
>>>>> On Mon, Dec 26, 2022 at 12:08 PM Zhu Zhu <re...@gmail.com> wrote:
>>>>> 
>>>>>> Hi Hang,
>>>>>> I still see no strong reason why we need numEventsIn/numEventsOut
>>>>> metrics.
>>>>>> In the discussion in FLINK-29801, I can see the same concern from
>>>> others.
>>>>>> So I prefer to exclude them from this FLIP to avoid over-extending
>>> the
>>>>>> scope.
>>>>>> 
>>>>>> Thanks,
>>>>>> Zhu
>>>>>> 
>>>>>> Hang Ruan <ru...@gmail.com> 于2022年12月23日周五 15:21写道：
>>>>>>> 
>>>>>>> Hi everyone,
>>>>>>> 
>>>>>>> As for the Zhu Zhu's problem, I think we should keep the common
>>>>> metrics,
>>>>>> which will help to observe incoming and outgoing events. What do
>> you
>>>>> think,
>>>>>> @Zhu Zhu ?
>>>>>>> And @Chesnay, are there any other issues you are more concerned
>>>> about?
>>>>>> Looking forward to your reply.
>>>>>>> 
>>>>>>> Thanks for all the comments. If there are no further comments, we
>>>> plan
>>>>>> to start the voting thread next week.
>>>>>>> 
>>>>>>> Best,
>>>>>>> Hang
>>>>>>> 
>>>>>>> Hang Ruan <ru...@gmail.com> 于2022年12月15日周四 16:49写道：
>>>>>>>> 
>>>>>>>> Hi, Zhu Zhu,
>>>>>>>> 
>>>>>>>> Thanks for your feedback!
>>>>>>>> 
>>>>>>>> The OperatorCoordinator implementations are different. And their
>>>>>> metrics are much different too. We try to find the common metrics
>> and
>>>> put
>>>>>> them in the OperatorCoordinatorMetricGroup. If most developers
>> think
>>> we
>>>>> do
>>>>>> not need these common metrics, removing the common metrics is also
>>>>>> acceptable.
>>>>>>>> 
>>>>>>>> Best,
>>>>>>>> Hang
>>>>>>>> 
>>>>>>>> Zhu Zhu <re...@gmail.com> 于2022年12月14日周三 22:09写道：
>>>>>>>>> 
>>>>>>>>> Hi Hang & MengYue,
>>>>>>>>> 
>>>>>>>>> Thanks for creating this FLIP!
>>>>>>>>> 
>>>>>>>>> I think it is very useful, mainly in two aspects:
>>>>>>>>> 1. Enables OperatorCoordinators to register metrics. Currently
>>>>>>>>> the coordinators has no way to do this. And operator
>> coordinator
>>>>>>>>> metric group further enables the SplitEnumerator to have access
>>>>>>>>> to a registered metric group (via the existing public interface
>>>>>>>>> SplitEnumeratorContext#metricGroup()), which is null at the
>>> moment.
>>>>>>>>> 
>>>>>>>>> 2. Defines the scope of operator coordinator metrics. A clear
>>>>>> definition
>>>>>>>>> makes it easy for users to find their wanted metrics. The
>>>> definition
>>>>>>>>> also helps to avoid conflicts of metrics from multiple
>>>>>> OperatorCoordinators
>>>>>>>>> of the same kind. E.g. each SourceCoordinator may have its own
>>>>>>>>> numSourceSplits metric, these metrics should not be directly
>>>>> registered
>>>>>>>>> to the job metric group.
>>>>>>>>> 
>>>>>>>>> What I'm a bit concerned is the necessity of the introduced
>>> common
>>>>>> metrics
>>>>>>>>> numEventsInCounter & numEventsOutCounter. If there any case
>> which
>>>>>> strongly
>>>>>>>>> requires them?
>>>>>>>>> 
>>>>>>>>> Regarding the concerns of Chesnay,
>>>>>>>>>> A dedicated coordinator MG implementation is overkill
>>>>>>>>> Directly using the job metric group can result in metric
>>> conflicts,
>>>>> as
>>>>>> mentioned
>>>>>>>>> in above #2.
>>>>>>>>> 
>>>>>>>>> Thanks,
>>>>>>>>> Zhu
>>>>>>>>> 
>>>>>>>>> Dong Lin <li...@gmail.com> 于2022年12月10日周六 14:16写道：
>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> Hi Chesney,
>>>>>>>>>> 
>>>>>>>>>> Just to double check with you, OperatorCoordinatorMetricGroup
>>>>>> (annotated as
>>>>>>>>>> @PublicEvolving) has already been introduced into Flink by
>>>> FLIP-179
>>>>>>>>>> <
>>>>>> 
>>>>> 
>>>> 
>>> 
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-179%3A+Expose+Standardized+Operator+Metrics
>>>>>>> .
>>>>>>>>>> And that FLIP has got you +1.. Do you mean we should remove
>>> this
>>>>>>>>>> OperatorCoordinatorMetricGroup?
>>>>>>>>>> 
>>>>>>>>>> Regards,
>>>>>>>>>> Dong
>>>>>>>>>> 
>>>>>>>>>> On Sat, Dec 10, 2022 at 1:33 AM Chesnay Schepler <
>>>>> chesnay@apache.org>
>>>>>> wrote:
>>>>>>>>>> 
>>>>>>>>>>> As a whole I feel like this FLIP is overly complicated. A
>>>>> dedicated
>>>>>>>>>>> coordinator MG implementation is overkill; it could just
>>> re-use
>>>>> the
>>>>>>>>>>> existing Task/OperatorMGs to create the same structure we
>>> have
>>>> on
>>>>>> TMs,
>>>>>>>>>>> similar to what we did with the Job MG.
>>>>>>>>>>> 
>>>>>>>>>>> However, I'm not convinced that this is required anyway,
>>>> because
>>>>>> all the
>>>>>>>>>>> example metrics you listed can be implemented on the TM
>> side
>>> +
>>>>>>>>>>> aggregating them in the external metrics backend.
>>>>>>>>>>> 
>>>>>>>>>>> Since I'm on holidays soon, just so no one tries to pull a
>>> fast
>>>>>> one on
>>>>>>>>>>> me, if this were to go to a vote as-is I'd be against it.
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> On 09/12/2022 15:30, Dong Lin wrote:
>>>>>>>>>>>> Hi Hang,
>>>>>>>>>>>> 
>>>>>>>>>>>> Thanks for the FLIP! The FLIP looks good and it is pretty
>>>>>> informative.
>>>>>>>>>>>> 
>>>>>>>>>>>> I have just two minor comments regarding names:
>>>>>>>>>>>> - Would it be useful to rename the config key as
>>>>>>>>>>>> *metrics.scope.jm.job.operator-coordinator* for
>> consistency
>>>>> with
>>>>>>>>>>>> *metrics.scope.jm.job
>>>>>>>>>>>> *(which is not named as *jm-job)?
>>>>>>>>>>>> - Maybe rename the variable as
>>>>> SCOPE_NAMING_OPERATOR_COORDINATOR
>>>>>> for
>>>>>>>>>>>> simplicity and consistency with SCOPE_NAMING_OPERATOR
>>> (which
>>>> is
>>>>>> not named
>>>>>>>>>>>> as SCOPE_NAMING_TM_JOB_OPERATOR)?
>>>>>>>>>>>> 
>>>>>>>>>>>> Cheers,
>>>>>>>>>>>> Dong
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> On Thu, Dec 8, 2022 at 3:28 PM Hang Ruan <
>>>>> ruanhang1993@gmail.com>
>>>>>> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>> 
>>>>>>>>>>>>> MengYue and I created FLIP-274[1] Introduce metric group
>>> for
>>>>>>>>>>>>> OperatorCoordinator. OperatorCoordinator is the
>>> coordinator
>>>>> for
>>>>>> runtime
>>>>>>>>>>>>> operators and running on Job Manager. The coordination
>>>>>> mechanism is
>>>>>>>>>>>>> operator events between OperatorCoordinator and its all
>>>>>> operators, the
>>>>>>>>>>>>> coordination is more and more using in Flink, for
>> example
>>>> many
>>>>>> Sources
>>>>>>>>>>> and
>>>>>>>>>>>>> Sinks depend on the mechanism to assign splits and
>>>> coordinate
>>>>>> commits to
>>>>>>>>>>>>> external systems. The OperatorCoordinator is widely
>> using
>>> in
>>>>>> flink kafka
>>>>>>>>>>>>> connector, flink pulsar connector, flink cdc connector,
>>>> flink
>>>>>> hudi
>>>>>>>>>>>>> connector and so on.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> But there is not a suitable metric group scope for the
>>>>>>>>>>> OperatorCoordinator
>>>>>>>>>>>>> and not an implementation for the interface
>>>>>>>>>>> OperatorCoordinatorMetricGroup.
>>>>>>>>>>>>> These metrics in OperatorCoordinator could be how many
>>>>>> splits/partitions
>>>>>>>>>>>>> have been assigned to source readers, how many files
>> have
>>>> been
>>>>>> written
>>>>>>>>>>> out
>>>>>>>>>>>>> by sink writers, these metrics not only help users to
>> know
>>>> the
>>>>>> job
>>>>>>>>>>> progress
>>>>>>>>>>>>> but also make big job maintaining easier. Thus we
>> propose
>>>> the
>>>>>> FLIP-274
>>>>>>>>>>> to
>>>>>>>>>>>>> introduce a new metric group scope for
>> OperatorCoordinator
>>>> and
>>>>>> provide
>>>>>>>>>>> an
>>>>>>>>>>>>> internal implementation for
>>> OperatorCoordinatorMetricGroup.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Could you help review this FLIP when you get time? Any
>>>>> feedback
>>>>>> is
>>>>>>>>>>>>> appreciated!
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Best,
>>>>>>>>>>>>> Hang
>>>>>>>>>>>>> 
>>>>>>>>>>>>> [1]
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>> 
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-274%3A+Introduce+metric+group+for+OperatorCoordinator
>>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>> 
>>

Re: [DISCUSS] FLIP-274 : Introduce metric group for OperatorCoordinator

Posted by Hang Ruan <ru...@gmail.com>.

Hi, Jark and Dong,

Thanks for your comments. Sorry for my late reply.

For suggestion 1, I plan to implement the SplitEnumeratorMetricGroup in
another issue, and it is not contained in this FLIP. I will add some
description about this part.
For suggestion 2, changes about OperatorCoordinator#metricGroup has already
been documented in the proposed change section.

Best,
Hang

Dong Lin <li...@gmail.com> 于2023年1月1日周日 09:45写道：

> Let me chime-in and add comments regarding the public interface section.
> Please see my comments inline.
>
> On Thu, Dec 29, 2022 at 6:08 PM Jark Wu <im...@gmail.com> wrote:
>
> > Hi Hang,
> >
> > Thanks for driving this discussion. I think this is a very useful feature
> > for connectors.
> >
> > The FLIP looks quite good to me, and I just have two suggestions.
> >
> > 1. In the "Public Interface" section, mention that the implementation
> > behavior of "SplitEnumeratorContext#metricGroup" is changed from
> returning
> > null to returning a concrete SplitEnumeratorMetricGroup instance. Even
> > though the API is already there, the behavior change can also be
> considered
> > a public change.
> >
>
> SplitEnumeratorContext#metricGroup is an interface and this FLIP does not
> seem to change its semantics/behavior. The FLIP does change the
> implementation/behavior of SourceCoordinatorContext#metricGroup, which is
> marked @Internal.
>
> Thus it might seem a bit weird to add in the public interface section
> saying that we change the interface SplitEnumeratorContext#metricGroup from
> returning null to non-null object.
>
> 2. Mention the newly added interface of "OperatorCoordinator#metricGroup"
> > in the "Proposed Changes" section or "Public Interface" section. As the
> > FLIP said, OperatorCoordinator is widely used in many connectors. Though
> it
> > is still an @Internal API, I think it is worth mentioning the change in
> the
> > FLIP.
>
>
> Since OperatorCoordinator is an internal API, it seems reasonable to
> explain it in the proposed change section. The FLIP seems to have
> documented this in point 5 of the proposed change section.
>
> BTW, if we think there are @internal classes that are important enough to
> be added in the public interface section, it might be useful to explicitly
> discuss this topic and document it in the "*What are the "public
> interfaces" of the project*" in this
> <
> https://cwiki.apache.org/confluence/display/FLINK/Flink+Improvement+Proposals
> >
> wiki.
>
>
> > Best,
> > Jark
> >
> >
> > On Mon, 26 Dec 2022 at 18:06, Hang Ruan <ru...@gmail.com> wrote:
> >
> > > Hi, thanks for the feedback, Zhu Zhu and Qingsheng.
> > > After combining everyone's comments, the main concerns and
> corresponding
> > > adjustments are as follows.
> > >
> > > Q1: Common metrics are not quite useful.
> > > numEventsIn and numEventsOut counters will be removed from the
> > > OperatorCoordinatorMetricGroup. These common metrics do not provide
> > enough
> > > information for users. The users are more willing to get the number of
> > > events of the specified type instead of the total number. And this
> metric
> > > is calculated differently. The implementation could register the metric
> > by
> > > themselves.
> > >
> > > Q2: This FLIP is overly complicated.
> > > This FLIP will become concise after these modifications.
> > > OperatorCoordinatorMetricGroup has already been introduced into Flink
> by
> > > FLIP-179<
> > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-179%3A+Expose+Standardized+Operator+Metrics
> > > >.
> > > And
> > > this FLIP will not change it. This FLIP only provides a new metric
> option
> > > and a new metric group scope. The changes in proposed changes provide
> the
> > > details about the modifications for the internal classes, which might
> > make
> > > it look complicated.
> > >
> > > Thanks for all the comments again. If there are no further comments, we
> > > plan to start the voting thread this week.
> > >
> > > Best,
> > > Hang
> > >
> > > Qingsheng Ren <re...@gmail.com> 于2022年12月26日周一 16:48写道：
> > >
> > > > Thanks for the FLIP, Hang!
> > > >
> > > > This FLIP overall looks good to me. Actually I share the same concern
> > > with
> > > > Zhu that numEventsIn and numEventsOut counters are not quite useful
> to
> > > end
> > > > users. OperatorEvent is a quite low-level abstraction, which requires
> > > > instantialization in order to be practical to users and developers,
> so
> > > > maybe it's better to exclude them from the FLIP.
> > > >
> > > > Best,
> > > > Qingsheng
> > > > Ververica (Alibaba)
> > > >
> > > > On Mon, Dec 26, 2022 at 12:08 PM Zhu Zhu <re...@gmail.com> wrote:
> > > >
> > > > > Hi Hang,
> > > > > I still see no strong reason why we need numEventsIn/numEventsOut
> > > > metrics.
> > > > > In the discussion in FLINK-29801, I can see the same concern from
> > > others.
> > > > > So I prefer to exclude them from this FLIP to avoid over-extending
> > the
> > > > > scope.
> > > > >
> > > > > Thanks,
> > > > > Zhu
> > > > >
> > > > > Hang Ruan <ru...@gmail.com> 于2022年12月23日周五 15:21写道：
> > > > > >
> > > > > > Hi everyone,
> > > > > >
> > > > > > As for the Zhu Zhu's problem, I think we should keep the common
> > > > metrics,
> > > > > which will help to observe incoming and outgoing events. What do
> you
> > > > think,
> > > > > @Zhu Zhu ?
> > > > > > And @Chesnay, are there any other issues you are more concerned
> > > about?
> > > > > Looking forward to your reply.
> > > > > >
> > > > > > Thanks for all the comments. If there are no further comments, we
> > > plan
> > > > > to start the voting thread next week.
> > > > > >
> > > > > > Best,
> > > > > > Hang
> > > > > >
> > > > > > Hang Ruan <ru...@gmail.com> 于2022年12月15日周四 16:49写道：
> > > > > >>
> > > > > >> Hi, Zhu Zhu,
> > > > > >>
> > > > > >> Thanks for your feedback!
> > > > > >>
> > > > > >> The OperatorCoordinator implementations are different. And their
> > > > > metrics are much different too. We try to find the common metrics
> and
> > > put
> > > > > them in the OperatorCoordinatorMetricGroup. If most developers
> think
> > we
> > > > do
> > > > > not need these common metrics, removing the common metrics is also
> > > > > acceptable.
> > > > > >>
> > > > > >> Best,
> > > > > >> Hang
> > > > > >>
> > > > > >> Zhu Zhu <re...@gmail.com> 于2022年12月14日周三 22:09写道：
> > > > > >>>
> > > > > >>> Hi Hang & MengYue,
> > > > > >>>
> > > > > >>> Thanks for creating this FLIP!
> > > > > >>>
> > > > > >>> I think it is very useful, mainly in two aspects:
> > > > > >>> 1. Enables OperatorCoordinators to register metrics. Currently
> > > > > >>> the coordinators has no way to do this. And operator
> coordinator
> > > > > >>> metric group further enables the SplitEnumerator to have access
> > > > > >>> to a registered metric group (via the existing public interface
> > > > > >>> SplitEnumeratorContext#metricGroup()), which is null at the
> > moment.
> > > > > >>>
> > > > > >>> 2. Defines the scope of operator coordinator metrics. A clear
> > > > > definition
> > > > > >>> makes it easy for users to find their wanted metrics. The
> > > definition
> > > > > >>> also helps to avoid conflicts of metrics from multiple
> > > > > OperatorCoordinators
> > > > > >>> of the same kind. E.g. each SourceCoordinator may have its own
> > > > > >>> numSourceSplits metric, these metrics should not be directly
> > > > registered
> > > > > >>> to the job metric group.
> > > > > >>>
> > > > > >>> What I'm a bit concerned is the necessity of the introduced
> > common
> > > > > metrics
> > > > > >>> numEventsInCounter & numEventsOutCounter. If there any case
> which
> > > > > strongly
> > > > > >>> requires them?
> > > > > >>>
> > > > > >>> Regarding the concerns of Chesnay,
> > > > > >>> > A dedicated coordinator MG implementation is overkill
> > > > > >>> Directly using the job metric group can result in metric
> > conflicts,
> > > > as
> > > > > mentioned
> > > > > >>> in above #2.
> > > > > >>>
> > > > > >>> Thanks,
> > > > > >>> Zhu
> > > > > >>>
> > > > > >>> Dong Lin <li...@gmail.com> 于2022年12月10日周六 14:16写道：
> > > > > >>>
> > > > > >>> >
> > > > > >>> > Hi Chesney,
> > > > > >>> >
> > > > > >>> > Just to double check with you, OperatorCoordinatorMetricGroup
> > > > > (annotated as
> > > > > >>> > @PublicEvolving) has already been introduced into Flink by
> > > FLIP-179
> > > > > >>> > <
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-179%3A+Expose+Standardized+Operator+Metrics
> > > > > >.
> > > > > >>> > And that FLIP has got you +1.. Do you mean we should remove
> > this
> > > > > >>> > OperatorCoordinatorMetricGroup?
> > > > > >>> >
> > > > > >>> > Regards,
> > > > > >>> > Dong
> > > > > >>> >
> > > > > >>> > On Sat, Dec 10, 2022 at 1:33 AM Chesnay Schepler <
> > > > chesnay@apache.org>
> > > > > wrote:
> > > > > >>> >
> > > > > >>> > > As a whole I feel like this FLIP is overly complicated. A
> > > > dedicated
> > > > > >>> > > coordinator MG implementation is overkill; it could just
> > re-use
> > > > the
> > > > > >>> > > existing Task/OperatorMGs to create the same structure we
> > have
> > > on
> > > > > TMs,
> > > > > >>> > > similar to what we did with the Job MG.
> > > > > >>> > >
> > > > > >>> > > However, I'm not convinced that this is required anyway,
> > > because
> > > > > all the
> > > > > >>> > > example metrics you listed can be implemented on the TM
> side
> > +
> > > > > >>> > > aggregating them in the external metrics backend.
> > > > > >>> > >
> > > > > >>> > > Since I'm on holidays soon, just so no one tries to pull a
> > fast
> > > > > one on
> > > > > >>> > > me, if this were to go to a vote as-is I'd be against it.
> > > > > >>> > >
> > > > > >>> > >
> > > > > >>> > > On 09/12/2022 15:30, Dong Lin wrote:
> > > > > >>> > > > Hi Hang,
> > > > > >>> > > >
> > > > > >>> > > > Thanks for the FLIP! The FLIP looks good and it is pretty
> > > > > informative.
> > > > > >>> > > >
> > > > > >>> > > > I have just two minor comments regarding names:
> > > > > >>> > > > - Would it be useful to rename the config key as
> > > > > >>> > > > *metrics.scope.jm.job.operator-coordinator* for
> consistency
> > > > with
> > > > > >>> > > > *metrics.scope.jm.job
> > > > > >>> > > > *(which is not named as *jm-job)?
> > > > > >>> > > > - Maybe rename the variable as
> > > > SCOPE_NAMING_OPERATOR_COORDINATOR
> > > > > for
> > > > > >>> > > > simplicity and consistency with SCOPE_NAMING_OPERATOR
> > (which
> > > is
> > > > > not named
> > > > > >>> > > > as SCOPE_NAMING_TM_JOB_OPERATOR)?
> > > > > >>> > > >
> > > > > >>> > > > Cheers,
> > > > > >>> > > > Dong
> > > > > >>> > > >
> > > > > >>> > > >
> > > > > >>> > > >
> > > > > >>> > > > On Thu, Dec 8, 2022 at 3:28 PM Hang Ruan <
> > > > ruanhang1993@gmail.com>
> > > > > wrote:
> > > > > >>> > > >
> > > > > >>> > > >> Hi all,
> > > > > >>> > > >>
> > > > > >>> > > >> MengYue and I created FLIP-274[1] Introduce metric group
> > for
> > > > > >>> > > >> OperatorCoordinator. OperatorCoordinator is the
> > coordinator
> > > > for
> > > > > runtime
> > > > > >>> > > >> operators and running on Job Manager. The coordination
> > > > > mechanism is
> > > > > >>> > > >> operator events between OperatorCoordinator and its all
> > > > > operators, the
> > > > > >>> > > >> coordination is more and more using in Flink, for
> example
> > > many
> > > > > Sources
> > > > > >>> > > and
> > > > > >>> > > >> Sinks depend on the mechanism to assign splits and
> > > coordinate
> > > > > commits to
> > > > > >>> > > >> external systems. The OperatorCoordinator is widely
> using
> > in
> > > > > flink kafka
> > > > > >>> > > >> connector, flink pulsar connector, flink cdc connector,
> > > flink
> > > > > hudi
> > > > > >>> > > >> connector and so on.
> > > > > >>> > > >>
> > > > > >>> > > >> But there is not a suitable metric group scope for the
> > > > > >>> > > OperatorCoordinator
> > > > > >>> > > >> and not an implementation for the interface
> > > > > >>> > > OperatorCoordinatorMetricGroup.
> > > > > >>> > > >> These metrics in OperatorCoordinator could be how many
> > > > > splits/partitions
> > > > > >>> > > >> have been assigned to source readers, how many files
> have
> > > been
> > > > > written
> > > > > >>> > > out
> > > > > >>> > > >> by sink writers, these metrics not only help users to
> know
> > > the
> > > > > job
> > > > > >>> > > progress
> > > > > >>> > > >> but also make big job maintaining easier. Thus we
> propose
> > > the
> > > > > FLIP-274
> > > > > >>> > > to
> > > > > >>> > > >> introduce a new metric group scope for
> OperatorCoordinator
> > > and
> > > > > provide
> > > > > >>> > > an
> > > > > >>> > > >> internal implementation for
> > OperatorCoordinatorMetricGroup.
> > > > > >>> > > >>
> > > > > >>> > > >> Could you help review this FLIP when you get time? Any
> > > > feedback
> > > > > is
> > > > > >>> > > >> appreciated!
> > > > > >>> > > >>
> > > > > >>> > > >> Best,
> > > > > >>> > > >> Hang
> > > > > >>> > > >>
> > > > > >>> > > >> [1]
> > > > > >>> > > >>
> > > > > >>> > > >>
> > > > > >>> > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-274%3A+Introduce+metric+group+for+OperatorCoordinator
> > > > > >>> > > >>
> > > > > >>> > >
> > > > > >>> > >
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] FLIP-274 : Introduce metric group for OperatorCoordinator

Posted by Dong Lin <li...@gmail.com>.

Let me chime-in and add comments regarding the public interface section.
Please see my comments inline.

On Thu, Dec 29, 2022 at 6:08 PM Jark Wu <im...@gmail.com> wrote:

> Hi Hang,
>
> Thanks for driving this discussion. I think this is a very useful feature
> for connectors.
>
> The FLIP looks quite good to me, and I just have two suggestions.
>
> 1. In the "Public Interface" section, mention that the implementation
> behavior of "SplitEnumeratorContext#metricGroup" is changed from returning
> null to returning a concrete SplitEnumeratorMetricGroup instance. Even
> though the API is already there, the behavior change can also be considered
> a public change.
>

SplitEnumeratorContext#metricGroup is an interface and this FLIP does not
seem to change its semantics/behavior. The FLIP does change the
implementation/behavior of SourceCoordinatorContext#metricGroup, which is
marked @Internal.

Thus it might seem a bit weird to add in the public interface section
saying that we change the interface SplitEnumeratorContext#metricGroup from
returning null to non-null object.

2. Mention the newly added interface of "OperatorCoordinator#metricGroup"
> in the "Proposed Changes" section or "Public Interface" section. As the
> FLIP said, OperatorCoordinator is widely used in many connectors. Though it
> is still an @Internal API, I think it is worth mentioning the change in the
> FLIP.


Since OperatorCoordinator is an internal API, it seems reasonable to
explain it in the proposed change section. The FLIP seems to have
documented this in point 5 of the proposed change section.

BTW, if we think there are @internal classes that are important enough to
be added in the public interface section, it might be useful to explicitly
discuss this topic and document it in the "*What are the "public
interfaces" of the project*" in this
<https://cwiki.apache.org/confluence/display/FLINK/Flink+Improvement+Proposals>
wiki.


> Best,
> Jark
>
>
> On Mon, 26 Dec 2022 at 18:06, Hang Ruan <ru...@gmail.com> wrote:
>
> > Hi, thanks for the feedback, Zhu Zhu and Qingsheng.
> > After combining everyone's comments, the main concerns and corresponding
> > adjustments are as follows.
> >
> > Q1: Common metrics are not quite useful.
> > numEventsIn and numEventsOut counters will be removed from the
> > OperatorCoordinatorMetricGroup. These common metrics do not provide
> enough
> > information for users. The users are more willing to get the number of
> > events of the specified type instead of the total number. And this metric
> > is calculated differently. The implementation could register the metric
> by
> > themselves.
> >
> > Q2: This FLIP is overly complicated.
> > This FLIP will become concise after these modifications.
> > OperatorCoordinatorMetricGroup has already been introduced into Flink by
> > FLIP-179<
> >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-179%3A+Expose+Standardized+Operator+Metrics
> > >.
> > And
> > this FLIP will not change it. This FLIP only provides a new metric option
> > and a new metric group scope. The changes in proposed changes provide the
> > details about the modifications for the internal classes, which might
> make
> > it look complicated.
> >
> > Thanks for all the comments again. If there are no further comments, we
> > plan to start the voting thread this week.
> >
> > Best,
> > Hang
> >
> > Qingsheng Ren <re...@gmail.com> 于2022年12月26日周一 16:48写道：
> >
> > > Thanks for the FLIP, Hang!
> > >
> > > This FLIP overall looks good to me. Actually I share the same concern
> > with
> > > Zhu that numEventsIn and numEventsOut counters are not quite useful to
> > end
> > > users. OperatorEvent is a quite low-level abstraction, which requires
> > > instantialization in order to be practical to users and developers, so
> > > maybe it's better to exclude them from the FLIP.
> > >
> > > Best,
> > > Qingsheng
> > > Ververica (Alibaba)
> > >
> > > On Mon, Dec 26, 2022 at 12:08 PM Zhu Zhu <re...@gmail.com> wrote:
> > >
> > > > Hi Hang,
> > > > I still see no strong reason why we need numEventsIn/numEventsOut
> > > metrics.
> > > > In the discussion in FLINK-29801, I can see the same concern from
> > others.
> > > > So I prefer to exclude them from this FLIP to avoid over-extending
> the
> > > > scope.
> > > >
> > > > Thanks,
> > > > Zhu
> > > >
> > > > Hang Ruan <ru...@gmail.com> 于2022年12月23日周五 15:21写道：
> > > > >
> > > > > Hi everyone,
> > > > >
> > > > > As for the Zhu Zhu's problem, I think we should keep the common
> > > metrics,
> > > > which will help to observe incoming and outgoing events. What do you
> > > think,
> > > > @Zhu Zhu ?
> > > > > And @Chesnay, are there any other issues you are more concerned
> > about?
> > > > Looking forward to your reply.
> > > > >
> > > > > Thanks for all the comments. If there are no further comments, we
> > plan
> > > > to start the voting thread next week.
> > > > >
> > > > > Best,
> > > > > Hang
> > > > >
> > > > > Hang Ruan <ru...@gmail.com> 于2022年12月15日周四 16:49写道：
> > > > >>
> > > > >> Hi, Zhu Zhu,
> > > > >>
> > > > >> Thanks for your feedback!
> > > > >>
> > > > >> The OperatorCoordinator implementations are different. And their
> > > > metrics are much different too. We try to find the common metrics and
> > put
> > > > them in the OperatorCoordinatorMetricGroup. If most developers think
> we
> > > do
> > > > not need these common metrics, removing the common metrics is also
> > > > acceptable.
> > > > >>
> > > > >> Best,
> > > > >> Hang
> > > > >>
> > > > >> Zhu Zhu <re...@gmail.com> 于2022年12月14日周三 22:09写道：
> > > > >>>
> > > > >>> Hi Hang & MengYue,
> > > > >>>
> > > > >>> Thanks for creating this FLIP!
> > > > >>>
> > > > >>> I think it is very useful, mainly in two aspects:
> > > > >>> 1. Enables OperatorCoordinators to register metrics. Currently
> > > > >>> the coordinators has no way to do this. And operator coordinator
> > > > >>> metric group further enables the SplitEnumerator to have access
> > > > >>> to a registered metric group (via the existing public interface
> > > > >>> SplitEnumeratorContext#metricGroup()), which is null at the
> moment.
> > > > >>>
> > > > >>> 2. Defines the scope of operator coordinator metrics. A clear
> > > > definition
> > > > >>> makes it easy for users to find their wanted metrics. The
> > definition
> > > > >>> also helps to avoid conflicts of metrics from multiple
> > > > OperatorCoordinators
> > > > >>> of the same kind. E.g. each SourceCoordinator may have its own
> > > > >>> numSourceSplits metric, these metrics should not be directly
> > > registered
> > > > >>> to the job metric group.
> > > > >>>
> > > > >>> What I'm a bit concerned is the necessity of the introduced
> common
> > > > metrics
> > > > >>> numEventsInCounter & numEventsOutCounter. If there any case which
> > > > strongly
> > > > >>> requires them?
> > > > >>>
> > > > >>> Regarding the concerns of Chesnay,
> > > > >>> > A dedicated coordinator MG implementation is overkill
> > > > >>> Directly using the job metric group can result in metric
> conflicts,
> > > as
> > > > mentioned
> > > > >>> in above #2.
> > > > >>>
> > > > >>> Thanks,
> > > > >>> Zhu
> > > > >>>
> > > > >>> Dong Lin <li...@gmail.com> 于2022年12月10日周六 14:16写道：
> > > > >>>
> > > > >>> >
> > > > >>> > Hi Chesney,
> > > > >>> >
> > > > >>> > Just to double check with you, OperatorCoordinatorMetricGroup
> > > > (annotated as
> > > > >>> > @PublicEvolving) has already been introduced into Flink by
> > FLIP-179
> > > > >>> > <
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-179%3A+Expose+Standardized+Operator+Metrics
> > > > >.
> > > > >>> > And that FLIP has got you +1.. Do you mean we should remove
> this
> > > > >>> > OperatorCoordinatorMetricGroup?
> > > > >>> >
> > > > >>> > Regards,
> > > > >>> > Dong
> > > > >>> >
> > > > >>> > On Sat, Dec 10, 2022 at 1:33 AM Chesnay Schepler <
> > > chesnay@apache.org>
> > > > wrote:
> > > > >>> >
> > > > >>> > > As a whole I feel like this FLIP is overly complicated. A
> > > dedicated
> > > > >>> > > coordinator MG implementation is overkill; it could just
> re-use
> > > the
> > > > >>> > > existing Task/OperatorMGs to create the same structure we
> have
> > on
> > > > TMs,
> > > > >>> > > similar to what we did with the Job MG.
> > > > >>> > >
> > > > >>> > > However, I'm not convinced that this is required anyway,
> > because
> > > > all the
> > > > >>> > > example metrics you listed can be implemented on the TM side
> +
> > > > >>> > > aggregating them in the external metrics backend.
> > > > >>> > >
> > > > >>> > > Since I'm on holidays soon, just so no one tries to pull a
> fast
> > > > one on
> > > > >>> > > me, if this were to go to a vote as-is I'd be against it.
> > > > >>> > >
> > > > >>> > >
> > > > >>> > > On 09/12/2022 15:30, Dong Lin wrote:
> > > > >>> > > > Hi Hang,
> > > > >>> > > >
> > > > >>> > > > Thanks for the FLIP! The FLIP looks good and it is pretty
> > > > informative.
> > > > >>> > > >
> > > > >>> > > > I have just two minor comments regarding names:
> > > > >>> > > > - Would it be useful to rename the config key as
> > > > >>> > > > *metrics.scope.jm.job.operator-coordinator* for consistency
> > > with
> > > > >>> > > > *metrics.scope.jm.job
> > > > >>> > > > *(which is not named as *jm-job)?
> > > > >>> > > > - Maybe rename the variable as
> > > SCOPE_NAMING_OPERATOR_COORDINATOR
> > > > for
> > > > >>> > > > simplicity and consistency with SCOPE_NAMING_OPERATOR
> (which
> > is
> > > > not named
> > > > >>> > > > as SCOPE_NAMING_TM_JOB_OPERATOR)?
> > > > >>> > > >
> > > > >>> > > > Cheers,
> > > > >>> > > > Dong
> > > > >>> > > >
> > > > >>> > > >
> > > > >>> > > >
> > > > >>> > > > On Thu, Dec 8, 2022 at 3:28 PM Hang Ruan <
> > > ruanhang1993@gmail.com>
> > > > wrote:
> > > > >>> > > >
> > > > >>> > > >> Hi all,
> > > > >>> > > >>
> > > > >>> > > >> MengYue and I created FLIP-274[1] Introduce metric group
> for
> > > > >>> > > >> OperatorCoordinator. OperatorCoordinator is the
> coordinator
> > > for
> > > > runtime
> > > > >>> > > >> operators and running on Job Manager. The coordination
> > > > mechanism is
> > > > >>> > > >> operator events between OperatorCoordinator and its all
> > > > operators, the
> > > > >>> > > >> coordination is more and more using in Flink, for example
> > many
> > > > Sources
> > > > >>> > > and
> > > > >>> > > >> Sinks depend on the mechanism to assign splits and
> > coordinate
> > > > commits to
> > > > >>> > > >> external systems. The OperatorCoordinator is widely using
> in
> > > > flink kafka
> > > > >>> > > >> connector, flink pulsar connector, flink cdc connector,
> > flink
> > > > hudi
> > > > >>> > > >> connector and so on.
> > > > >>> > > >>
> > > > >>> > > >> But there is not a suitable metric group scope for the
> > > > >>> > > OperatorCoordinator
> > > > >>> > > >> and not an implementation for the interface
> > > > >>> > > OperatorCoordinatorMetricGroup.
> > > > >>> > > >> These metrics in OperatorCoordinator could be how many
> > > > splits/partitions
> > > > >>> > > >> have been assigned to source readers, how many files have
> > been
> > > > written
> > > > >>> > > out
> > > > >>> > > >> by sink writers, these metrics not only help users to know
> > the
> > > > job
> > > > >>> > > progress
> > > > >>> > > >> but also make big job maintaining easier. Thus we propose
> > the
> > > > FLIP-274
> > > > >>> > > to
> > > > >>> > > >> introduce a new metric group scope for OperatorCoordinator
> > and
> > > > provide
> > > > >>> > > an
> > > > >>> > > >> internal implementation for
> OperatorCoordinatorMetricGroup.
> > > > >>> > > >>
> > > > >>> > > >> Could you help review this FLIP when you get time? Any
> > > feedback
> > > > is
> > > > >>> > > >> appreciated!
> > > > >>> > > >>
> > > > >>> > > >> Best,
> > > > >>> > > >> Hang
> > > > >>> > > >>
> > > > >>> > > >> [1]
> > > > >>> > > >>
> > > > >>> > > >>
> > > > >>> > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-274%3A+Introduce+metric+group+for+OperatorCoordinator
> > > > >>> > > >>
> > > > >>> > >
> > > > >>> > >
> > > >
> > >
> >
>

Re: [DISCUSS] FLIP-274 : Introduce metric group for OperatorCoordinator

Posted by Jark Wu <im...@gmail.com>.

Hi Hang,

Thanks for driving this discussion. I think this is a very useful feature
for connectors.

The FLIP looks quite good to me, and I just have two suggestions.

1. In the "Public Interface" section, mention that the implementation
behavior of "SplitEnumeratorContext#metricGroup" is changed from returning
null to returning a concrete SplitEnumeratorMetricGroup instance. Even
though the API is already there, the behavior change can also be considered
a public change.
2. Mention the newly added interface of "OperatorCoordinator#metricGroup"
in the "Proposed Changes" section or "Public Interface" section. As the
FLIP said, OperatorCoordinator is widely used in many connectors. Though it
is still an @Internal API, I think it is worth mentioning the change in the
FLIP.

Best,
Jark


On Mon, 26 Dec 2022 at 18:06, Hang Ruan <ru...@gmail.com> wrote:

> Hi, thanks for the feedback, Zhu Zhu and Qingsheng.
> After combining everyone's comments, the main concerns and corresponding
> adjustments are as follows.
>
> Q1: Common metrics are not quite useful.
> numEventsIn and numEventsOut counters will be removed from the
> OperatorCoordinatorMetricGroup. These common metrics do not provide enough
> information for users. The users are more willing to get the number of
> events of the specified type instead of the total number. And this metric
> is calculated differently. The implementation could register the metric by
> themselves.
>
> Q2: This FLIP is overly complicated.
> This FLIP will become concise after these modifications.
> OperatorCoordinatorMetricGroup has already been introduced into Flink by
> FLIP-179<
>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-179%3A+Expose+Standardized+Operator+Metrics
> >.
> And
> this FLIP will not change it. This FLIP only provides a new metric option
> and a new metric group scope. The changes in proposed changes provide the
> details about the modifications for the internal classes, which might make
> it look complicated.
>
> Thanks for all the comments again. If there are no further comments, we
> plan to start the voting thread this week.
>
> Best,
> Hang
>
> Qingsheng Ren <re...@gmail.com> 于2022年12月26日周一 16:48写道：
>
> > Thanks for the FLIP, Hang!
> >
> > This FLIP overall looks good to me. Actually I share the same concern
> with
> > Zhu that numEventsIn and numEventsOut counters are not quite useful to
> end
> > users. OperatorEvent is a quite low-level abstraction, which requires
> > instantialization in order to be practical to users and developers, so
> > maybe it's better to exclude them from the FLIP.
> >
> > Best,
> > Qingsheng
> > Ververica (Alibaba)
> >
> > On Mon, Dec 26, 2022 at 12:08 PM Zhu Zhu <re...@gmail.com> wrote:
> >
> > > Hi Hang,
> > > I still see no strong reason why we need numEventsIn/numEventsOut
> > metrics.
> > > In the discussion in FLINK-29801, I can see the same concern from
> others.
> > > So I prefer to exclude them from this FLIP to avoid over-extending the
> > > scope.
> > >
> > > Thanks,
> > > Zhu
> > >
> > > Hang Ruan <ru...@gmail.com> 于2022年12月23日周五 15:21写道：
> > > >
> > > > Hi everyone,
> > > >
> > > > As for the Zhu Zhu's problem, I think we should keep the common
> > metrics,
> > > which will help to observe incoming and outgoing events. What do you
> > think,
> > > @Zhu Zhu ?
> > > > And @Chesnay, are there any other issues you are more concerned
> about?
> > > Looking forward to your reply.
> > > >
> > > > Thanks for all the comments. If there are no further comments, we
> plan
> > > to start the voting thread next week.
> > > >
> > > > Best,
> > > > Hang
> > > >
> > > > Hang Ruan <ru...@gmail.com> 于2022年12月15日周四 16:49写道：
> > > >>
> > > >> Hi, Zhu Zhu,
> > > >>
> > > >> Thanks for your feedback!
> > > >>
> > > >> The OperatorCoordinator implementations are different. And their
> > > metrics are much different too. We try to find the common metrics and
> put
> > > them in the OperatorCoordinatorMetricGroup. If most developers think we
> > do
> > > not need these common metrics, removing the common metrics is also
> > > acceptable.
> > > >>
> > > >> Best,
> > > >> Hang
> > > >>
> > > >> Zhu Zhu <re...@gmail.com> 于2022年12月14日周三 22:09写道：
> > > >>>
> > > >>> Hi Hang & MengYue,
> > > >>>
> > > >>> Thanks for creating this FLIP!
> > > >>>
> > > >>> I think it is very useful, mainly in two aspects:
> > > >>> 1. Enables OperatorCoordinators to register metrics. Currently
> > > >>> the coordinators has no way to do this. And operator coordinator
> > > >>> metric group further enables the SplitEnumerator to have access
> > > >>> to a registered metric group (via the existing public interface
> > > >>> SplitEnumeratorContext#metricGroup()), which is null at the moment.
> > > >>>
> > > >>> 2. Defines the scope of operator coordinator metrics. A clear
> > > definition
> > > >>> makes it easy for users to find their wanted metrics. The
> definition
> > > >>> also helps to avoid conflicts of metrics from multiple
> > > OperatorCoordinators
> > > >>> of the same kind. E.g. each SourceCoordinator may have its own
> > > >>> numSourceSplits metric, these metrics should not be directly
> > registered
> > > >>> to the job metric group.
> > > >>>
> > > >>> What I'm a bit concerned is the necessity of the introduced common
> > > metrics
> > > >>> numEventsInCounter & numEventsOutCounter. If there any case which
> > > strongly
> > > >>> requires them?
> > > >>>
> > > >>> Regarding the concerns of Chesnay,
> > > >>> > A dedicated coordinator MG implementation is overkill
> > > >>> Directly using the job metric group can result in metric conflicts,
> > as
> > > mentioned
> > > >>> in above #2.
> > > >>>
> > > >>> Thanks,
> > > >>> Zhu
> > > >>>
> > > >>> Dong Lin <li...@gmail.com> 于2022年12月10日周六 14:16写道：
> > > >>>
> > > >>> >
> > > >>> > Hi Chesney,
> > > >>> >
> > > >>> > Just to double check with you, OperatorCoordinatorMetricGroup
> > > (annotated as
> > > >>> > @PublicEvolving) has already been introduced into Flink by
> FLIP-179
> > > >>> > <
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-179%3A+Expose+Standardized+Operator+Metrics
> > > >.
> > > >>> > And that FLIP has got you +1.. Do you mean we should remove this
> > > >>> > OperatorCoordinatorMetricGroup?
> > > >>> >
> > > >>> > Regards,
> > > >>> > Dong
> > > >>> >
> > > >>> > On Sat, Dec 10, 2022 at 1:33 AM Chesnay Schepler <
> > chesnay@apache.org>
> > > wrote:
> > > >>> >
> > > >>> > > As a whole I feel like this FLIP is overly complicated. A
> > dedicated
> > > >>> > > coordinator MG implementation is overkill; it could just re-use
> > the
> > > >>> > > existing Task/OperatorMGs to create the same structure we have
> on
> > > TMs,
> > > >>> > > similar to what we did with the Job MG.
> > > >>> > >
> > > >>> > > However, I'm not convinced that this is required anyway,
> because
> > > all the
> > > >>> > > example metrics you listed can be implemented on the TM side +
> > > >>> > > aggregating them in the external metrics backend.
> > > >>> > >
> > > >>> > > Since I'm on holidays soon, just so no one tries to pull a fast
> > > one on
> > > >>> > > me, if this were to go to a vote as-is I'd be against it.
> > > >>> > >
> > > >>> > >
> > > >>> > > On 09/12/2022 15:30, Dong Lin wrote:
> > > >>> > > > Hi Hang,
> > > >>> > > >
> > > >>> > > > Thanks for the FLIP! The FLIP looks good and it is pretty
> > > informative.
> > > >>> > > >
> > > >>> > > > I have just two minor comments regarding names:
> > > >>> > > > - Would it be useful to rename the config key as
> > > >>> > > > *metrics.scope.jm.job.operator-coordinator* for consistency
> > with
> > > >>> > > > *metrics.scope.jm.job
> > > >>> > > > *(which is not named as *jm-job)?
> > > >>> > > > - Maybe rename the variable as
> > SCOPE_NAMING_OPERATOR_COORDINATOR
> > > for
> > > >>> > > > simplicity and consistency with SCOPE_NAMING_OPERATOR (which
> is
> > > not named
> > > >>> > > > as SCOPE_NAMING_TM_JOB_OPERATOR)?
> > > >>> > > >
> > > >>> > > > Cheers,
> > > >>> > > > Dong
> > > >>> > > >
> > > >>> > > >
> > > >>> > > >
> > > >>> > > > On Thu, Dec 8, 2022 at 3:28 PM Hang Ruan <
> > ruanhang1993@gmail.com>
> > > wrote:
> > > >>> > > >
> > > >>> > > >> Hi all,
> > > >>> > > >>
> > > >>> > > >> MengYue and I created FLIP-274[1] Introduce metric group for
> > > >>> > > >> OperatorCoordinator. OperatorCoordinator is the coordinator
> > for
> > > runtime
> > > >>> > > >> operators and running on Job Manager. The coordination
> > > mechanism is
> > > >>> > > >> operator events between OperatorCoordinator and its all
> > > operators, the
> > > >>> > > >> coordination is more and more using in Flink, for example
> many
> > > Sources
> > > >>> > > and
> > > >>> > > >> Sinks depend on the mechanism to assign splits and
> coordinate
> > > commits to
> > > >>> > > >> external systems. The OperatorCoordinator is widely using in
> > > flink kafka
> > > >>> > > >> connector, flink pulsar connector, flink cdc connector,
> flink
> > > hudi
> > > >>> > > >> connector and so on.
> > > >>> > > >>
> > > >>> > > >> But there is not a suitable metric group scope for the
> > > >>> > > OperatorCoordinator
> > > >>> > > >> and not an implementation for the interface
> > > >>> > > OperatorCoordinatorMetricGroup.
> > > >>> > > >> These metrics in OperatorCoordinator could be how many
> > > splits/partitions
> > > >>> > > >> have been assigned to source readers, how many files have
> been
> > > written
> > > >>> > > out
> > > >>> > > >> by sink writers, these metrics not only help users to know
> the
> > > job
> > > >>> > > progress
> > > >>> > > >> but also make big job maintaining easier. Thus we propose
> the
> > > FLIP-274
> > > >>> > > to
> > > >>> > > >> introduce a new metric group scope for OperatorCoordinator
> and
> > > provide
> > > >>> > > an
> > > >>> > > >> internal implementation for OperatorCoordinatorMetricGroup.
> > > >>> > > >>
> > > >>> > > >> Could you help review this FLIP when you get time? Any
> > feedback
> > > is
> > > >>> > > >> appreciated!
> > > >>> > > >>
> > > >>> > > >> Best,
> > > >>> > > >> Hang
> > > >>> > > >>
> > > >>> > > >> [1]
> > > >>> > > >>
> > > >>> > > >>
> > > >>> > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-274%3A+Introduce+metric+group+for+OperatorCoordinator
> > > >>> > > >>
> > > >>> > >
> > > >>> > >
> > >
> >
>

Re: [DISCUSS] FLIP-274 : Introduce metric group for OperatorCoordinator

Posted by Hang Ruan <ru...@gmail.com>.

Hi, thanks for the feedback, Zhu Zhu and Qingsheng.
After combining everyone's comments, the main concerns and corresponding
adjustments are as follows.

Q1: Common metrics are not quite useful.
numEventsIn and numEventsOut counters will be removed from the
OperatorCoordinatorMetricGroup. These common metrics do not provide enough
information for users. The users are more willing to get the number of
events of the specified type instead of the total number. And this metric
is calculated differently. The implementation could register the metric by
themselves.

Q2: This FLIP is overly complicated.
This FLIP will become concise after these modifications.
OperatorCoordinatorMetricGroup has already been introduced into Flink by
FLIP-179<
https://cwiki.apache.org/confluence/display/FLINK/FLIP-179%3A+Expose+Standardized+Operator+Metrics>.
And
this FLIP will not change it. This FLIP only provides a new metric option
and a new metric group scope. The changes in proposed changes provide the
details about the modifications for the internal classes, which might make
it look complicated.

Thanks for all the comments again. If there are no further comments, we
plan to start the voting thread this week.

Best,
Hang

Qingsheng Ren <re...@gmail.com> 于2022年12月26日周一 16:48写道：

> Thanks for the FLIP, Hang!
>
> This FLIP overall looks good to me. Actually I share the same concern with
> Zhu that numEventsIn and numEventsOut counters are not quite useful to end
> users. OperatorEvent is a quite low-level abstraction, which requires
> instantialization in order to be practical to users and developers, so
> maybe it's better to exclude them from the FLIP.
>
> Best,
> Qingsheng
> Ververica (Alibaba)
>
> On Mon, Dec 26, 2022 at 12:08 PM Zhu Zhu <re...@gmail.com> wrote:
>
> > Hi Hang,
> > I still see no strong reason why we need numEventsIn/numEventsOut
> metrics.
> > In the discussion in FLINK-29801, I can see the same concern from others.
> > So I prefer to exclude them from this FLIP to avoid over-extending the
> > scope.
> >
> > Thanks,
> > Zhu
> >
> > Hang Ruan <ru...@gmail.com> 于2022年12月23日周五 15:21写道：
> > >
> > > Hi everyone,
> > >
> > > As for the Zhu Zhu's problem, I think we should keep the common
> metrics,
> > which will help to observe incoming and outgoing events. What do you
> think,
> > @Zhu Zhu ?
> > > And @Chesnay, are there any other issues you are more concerned about?
> > Looking forward to your reply.
> > >
> > > Thanks for all the comments. If there are no further comments, we plan
> > to start the voting thread next week.
> > >
> > > Best,
> > > Hang
> > >
> > > Hang Ruan <ru...@gmail.com> 于2022年12月15日周四 16:49写道：
> > >>
> > >> Hi, Zhu Zhu,
> > >>
> > >> Thanks for your feedback!
> > >>
> > >> The OperatorCoordinator implementations are different. And their
> > metrics are much different too. We try to find the common metrics and put
> > them in the OperatorCoordinatorMetricGroup. If most developers think we
> do
> > not need these common metrics, removing the common metrics is also
> > acceptable.
> > >>
> > >> Best,
> > >> Hang
> > >>
> > >> Zhu Zhu <re...@gmail.com> 于2022年12月14日周三 22:09写道：
> > >>>
> > >>> Hi Hang & MengYue,
> > >>>
> > >>> Thanks for creating this FLIP!
> > >>>
> > >>> I think it is very useful, mainly in two aspects:
> > >>> 1. Enables OperatorCoordinators to register metrics. Currently
> > >>> the coordinators has no way to do this. And operator coordinator
> > >>> metric group further enables the SplitEnumerator to have access
> > >>> to a registered metric group (via the existing public interface
> > >>> SplitEnumeratorContext#metricGroup()), which is null at the moment.
> > >>>
> > >>> 2. Defines the scope of operator coordinator metrics. A clear
> > definition
> > >>> makes it easy for users to find their wanted metrics. The definition
> > >>> also helps to avoid conflicts of metrics from multiple
> > OperatorCoordinators
> > >>> of the same kind. E.g. each SourceCoordinator may have its own
> > >>> numSourceSplits metric, these metrics should not be directly
> registered
> > >>> to the job metric group.
> > >>>
> > >>> What I'm a bit concerned is the necessity of the introduced common
> > metrics
> > >>> numEventsInCounter & numEventsOutCounter. If there any case which
> > strongly
> > >>> requires them?
> > >>>
> > >>> Regarding the concerns of Chesnay,
> > >>> > A dedicated coordinator MG implementation is overkill
> > >>> Directly using the job metric group can result in metric conflicts,
> as
> > mentioned
> > >>> in above #2.
> > >>>
> > >>> Thanks,
> > >>> Zhu
> > >>>
> > >>> Dong Lin <li...@gmail.com> 于2022年12月10日周六 14:16写道：
> > >>>
> > >>> >
> > >>> > Hi Chesney,
> > >>> >
> > >>> > Just to double check with you, OperatorCoordinatorMetricGroup
> > (annotated as
> > >>> > @PublicEvolving) has already been introduced into Flink by FLIP-179
> > >>> > <
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-179%3A+Expose+Standardized+Operator+Metrics
> > >.
> > >>> > And that FLIP has got you +1.. Do you mean we should remove this
> > >>> > OperatorCoordinatorMetricGroup?
> > >>> >
> > >>> > Regards,
> > >>> > Dong
> > >>> >
> > >>> > On Sat, Dec 10, 2022 at 1:33 AM Chesnay Schepler <
> chesnay@apache.org>
> > wrote:
> > >>> >
> > >>> > > As a whole I feel like this FLIP is overly complicated. A
> dedicated
> > >>> > > coordinator MG implementation is overkill; it could just re-use
> the
> > >>> > > existing Task/OperatorMGs to create the same structure we have on
> > TMs,
> > >>> > > similar to what we did with the Job MG.
> > >>> > >
> > >>> > > However, I'm not convinced that this is required anyway, because
> > all the
> > >>> > > example metrics you listed can be implemented on the TM side +
> > >>> > > aggregating them in the external metrics backend.
> > >>> > >
> > >>> > > Since I'm on holidays soon, just so no one tries to pull a fast
> > one on
> > >>> > > me, if this were to go to a vote as-is I'd be against it.
> > >>> > >
> > >>> > >
> > >>> > > On 09/12/2022 15:30, Dong Lin wrote:
> > >>> > > > Hi Hang,
> > >>> > > >
> > >>> > > > Thanks for the FLIP! The FLIP looks good and it is pretty
> > informative.
> > >>> > > >
> > >>> > > > I have just two minor comments regarding names:
> > >>> > > > - Would it be useful to rename the config key as
> > >>> > > > *metrics.scope.jm.job.operator-coordinator* for consistency
> with
> > >>> > > > *metrics.scope.jm.job
> > >>> > > > *(which is not named as *jm-job)?
> > >>> > > > - Maybe rename the variable as
> SCOPE_NAMING_OPERATOR_COORDINATOR
> > for
> > >>> > > > simplicity and consistency with SCOPE_NAMING_OPERATOR (which is
> > not named
> > >>> > > > as SCOPE_NAMING_TM_JOB_OPERATOR)?
> > >>> > > >
> > >>> > > > Cheers,
> > >>> > > > Dong
> > >>> > > >
> > >>> > > >
> > >>> > > >
> > >>> > > > On Thu, Dec 8, 2022 at 3:28 PM Hang Ruan <
> ruanhang1993@gmail.com>
> > wrote:
> > >>> > > >
> > >>> > > >> Hi all,
> > >>> > > >>
> > >>> > > >> MengYue and I created FLIP-274[1] Introduce metric group for
> > >>> > > >> OperatorCoordinator. OperatorCoordinator is the coordinator
> for
> > runtime
> > >>> > > >> operators and running on Job Manager. The coordination
> > mechanism is
> > >>> > > >> operator events between OperatorCoordinator and its all
> > operators, the
> > >>> > > >> coordination is more and more using in Flink, for example many
> > Sources
> > >>> > > and
> > >>> > > >> Sinks depend on the mechanism to assign splits and coordinate
> > commits to
> > >>> > > >> external systems. The OperatorCoordinator is widely using in
> > flink kafka
> > >>> > > >> connector, flink pulsar connector, flink cdc connector, flink
> > hudi
> > >>> > > >> connector and so on.
> > >>> > > >>
> > >>> > > >> But there is not a suitable metric group scope for the
> > >>> > > OperatorCoordinator
> > >>> > > >> and not an implementation for the interface
> > >>> > > OperatorCoordinatorMetricGroup.
> > >>> > > >> These metrics in OperatorCoordinator could be how many
> > splits/partitions
> > >>> > > >> have been assigned to source readers, how many files have been
> > written
> > >>> > > out
> > >>> > > >> by sink writers, these metrics not only help users to know the
> > job
> > >>> > > progress
> > >>> > > >> but also make big job maintaining easier. Thus we propose the
> > FLIP-274
> > >>> > > to
> > >>> > > >> introduce a new metric group scope for OperatorCoordinator and
> > provide
> > >>> > > an
> > >>> > > >> internal implementation for OperatorCoordinatorMetricGroup.
> > >>> > > >>
> > >>> > > >> Could you help review this FLIP when you get time? Any
> feedback
> > is
> > >>> > > >> appreciated!
> > >>> > > >>
> > >>> > > >> Best,
> > >>> > > >> Hang
> > >>> > > >>
> > >>> > > >> [1]
> > >>> > > >>
> > >>> > > >>
> > >>> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-274%3A+Introduce+metric+group+for+OperatorCoordinator
> > >>> > > >>
> > >>> > >
> > >>> > >
> >
>

Re: [DISCUSS] FLIP-274 : Introduce metric group for OperatorCoordinator

Posted by Qingsheng Ren <re...@gmail.com>.

Thanks for the FLIP, Hang!

This FLIP overall looks good to me. Actually I share the same concern with
Zhu that numEventsIn and numEventsOut counters are not quite useful to end
users. OperatorEvent is a quite low-level abstraction, which requires
instantialization in order to be practical to users and developers, so
maybe it's better to exclude them from the FLIP.

Best,
Qingsheng
Ververica (Alibaba)

On Mon, Dec 26, 2022 at 12:08 PM Zhu Zhu <re...@gmail.com> wrote:

> Hi Hang,
> I still see no strong reason why we need numEventsIn/numEventsOut metrics.
> In the discussion in FLINK-29801, I can see the same concern from others.
> So I prefer to exclude them from this FLIP to avoid over-extending the
> scope.
>
> Thanks,
> Zhu
>
> Hang Ruan <ru...@gmail.com> 于2022年12月23日周五 15:21写道：
> >
> > Hi everyone,
> >
> > As for the Zhu Zhu's problem, I think we should keep the common metrics,
> which will help to observe incoming and outgoing events. What do you think,
> @Zhu Zhu ?
> > And @Chesnay, are there any other issues you are more concerned about?
> Looking forward to your reply.
> >
> > Thanks for all the comments. If there are no further comments, we plan
> to start the voting thread next week.
> >
> > Best,
> > Hang
> >
> > Hang Ruan <ru...@gmail.com> 于2022年12月15日周四 16:49写道：
> >>
> >> Hi, Zhu Zhu,
> >>
> >> Thanks for your feedback!
> >>
> >> The OperatorCoordinator implementations are different. And their
> metrics are much different too. We try to find the common metrics and put
> them in the OperatorCoordinatorMetricGroup. If most developers think we do
> not need these common metrics, removing the common metrics is also
> acceptable.
> >>
> >> Best,
> >> Hang
> >>
> >> Zhu Zhu <re...@gmail.com> 于2022年12月14日周三 22:09写道：
> >>>
> >>> Hi Hang & MengYue,
> >>>
> >>> Thanks for creating this FLIP!
> >>>
> >>> I think it is very useful, mainly in two aspects:
> >>> 1. Enables OperatorCoordinators to register metrics. Currently
> >>> the coordinators has no way to do this. And operator coordinator
> >>> metric group further enables the SplitEnumerator to have access
> >>> to a registered metric group (via the existing public interface
> >>> SplitEnumeratorContext#metricGroup()), which is null at the moment.
> >>>
> >>> 2. Defines the scope of operator coordinator metrics. A clear
> definition
> >>> makes it easy for users to find their wanted metrics. The definition
> >>> also helps to avoid conflicts of metrics from multiple
> OperatorCoordinators
> >>> of the same kind. E.g. each SourceCoordinator may have its own
> >>> numSourceSplits metric, these metrics should not be directly registered
> >>> to the job metric group.
> >>>
> >>> What I'm a bit concerned is the necessity of the introduced common
> metrics
> >>> numEventsInCounter & numEventsOutCounter. If there any case which
> strongly
> >>> requires them?
> >>>
> >>> Regarding the concerns of Chesnay,
> >>> > A dedicated coordinator MG implementation is overkill
> >>> Directly using the job metric group can result in metric conflicts, as
> mentioned
> >>> in above #2.
> >>>
> >>> Thanks,
> >>> Zhu
> >>>
> >>> Dong Lin <li...@gmail.com> 于2022年12月10日周六 14:16写道：
> >>>
> >>> >
> >>> > Hi Chesney,
> >>> >
> >>> > Just to double check with you, OperatorCoordinatorMetricGroup
> (annotated as
> >>> > @PublicEvolving) has already been introduced into Flink by FLIP-179
> >>> > <
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-179%3A+Expose+Standardized+Operator+Metrics
> >.
> >>> > And that FLIP has got you +1.. Do you mean we should remove this
> >>> > OperatorCoordinatorMetricGroup?
> >>> >
> >>> > Regards,
> >>> > Dong
> >>> >
> >>> > On Sat, Dec 10, 2022 at 1:33 AM Chesnay Schepler <ch...@apache.org>
> wrote:
> >>> >
> >>> > > As a whole I feel like this FLIP is overly complicated. A dedicated
> >>> > > coordinator MG implementation is overkill; it could just re-use the
> >>> > > existing Task/OperatorMGs to create the same structure we have on
> TMs,
> >>> > > similar to what we did with the Job MG.
> >>> > >
> >>> > > However, I'm not convinced that this is required anyway, because
> all the
> >>> > > example metrics you listed can be implemented on the TM side +
> >>> > > aggregating them in the external metrics backend.
> >>> > >
> >>> > > Since I'm on holidays soon, just so no one tries to pull a fast
> one on
> >>> > > me, if this were to go to a vote as-is I'd be against it.
> >>> > >
> >>> > >
> >>> > > On 09/12/2022 15:30, Dong Lin wrote:
> >>> > > > Hi Hang,
> >>> > > >
> >>> > > > Thanks for the FLIP! The FLIP looks good and it is pretty
> informative.
> >>> > > >
> >>> > > > I have just two minor comments regarding names:
> >>> > > > - Would it be useful to rename the config key as
> >>> > > > *metrics.scope.jm.job.operator-coordinator* for consistency with
> >>> > > > *metrics.scope.jm.job
> >>> > > > *(which is not named as *jm-job)?
> >>> > > > - Maybe rename the variable as SCOPE_NAMING_OPERATOR_COORDINATOR
> for
> >>> > > > simplicity and consistency with SCOPE_NAMING_OPERATOR (which is
> not named
> >>> > > > as SCOPE_NAMING_TM_JOB_OPERATOR)?
> >>> > > >
> >>> > > > Cheers,
> >>> > > > Dong
> >>> > > >
> >>> > > >
> >>> > > >
> >>> > > > On Thu, Dec 8, 2022 at 3:28 PM Hang Ruan <ru...@gmail.com>
> wrote:
> >>> > > >
> >>> > > >> Hi all,
> >>> > > >>
> >>> > > >> MengYue and I created FLIP-274[1] Introduce metric group for
> >>> > > >> OperatorCoordinator. OperatorCoordinator is the coordinator for
> runtime
> >>> > > >> operators and running on Job Manager. The coordination
> mechanism is
> >>> > > >> operator events between OperatorCoordinator and its all
> operators, the
> >>> > > >> coordination is more and more using in Flink, for example many
> Sources
> >>> > > and
> >>> > > >> Sinks depend on the mechanism to assign splits and coordinate
> commits to
> >>> > > >> external systems. The OperatorCoordinator is widely using in
> flink kafka
> >>> > > >> connector, flink pulsar connector, flink cdc connector, flink
> hudi
> >>> > > >> connector and so on.
> >>> > > >>
> >>> > > >> But there is not a suitable metric group scope for the
> >>> > > OperatorCoordinator
> >>> > > >> and not an implementation for the interface
> >>> > > OperatorCoordinatorMetricGroup.
> >>> > > >> These metrics in OperatorCoordinator could be how many
> splits/partitions
> >>> > > >> have been assigned to source readers, how many files have been
> written
> >>> > > out
> >>> > > >> by sink writers, these metrics not only help users to know the
> job
> >>> > > progress
> >>> > > >> but also make big job maintaining easier. Thus we propose the
> FLIP-274
> >>> > > to
> >>> > > >> introduce a new metric group scope for OperatorCoordinator and
> provide
> >>> > > an
> >>> > > >> internal implementation for OperatorCoordinatorMetricGroup.
> >>> > > >>
> >>> > > >> Could you help review this FLIP when you get time? Any feedback
> is
> >>> > > >> appreciated!
> >>> > > >>
> >>> > > >> Best,
> >>> > > >> Hang
> >>> > > >>
> >>> > > >> [1]
> >>> > > >>
> >>> > > >>
> >>> > >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-274%3A+Introduce+metric+group+for+OperatorCoordinator
> >>> > > >>
> >>> > >
> >>> > >
>

Re: [DISCUSS] FLIP-274 : Introduce metric group for OperatorCoordinator

Posted by Zhu Zhu <re...@gmail.com>.

Hi Hang,
I still see no strong reason why we need numEventsIn/numEventsOut metrics.
In the discussion in FLINK-29801, I can see the same concern from others.
So I prefer to exclude them from this FLIP to avoid over-extending the scope.

Thanks,
Zhu

Hang Ruan <ru...@gmail.com> 于2022年12月23日周五 15:21写道：
>
> Hi everyone,
>
> As for the Zhu Zhu's problem, I think we should keep the common metrics, which will help to observe incoming and outgoing events. What do you think, @Zhu Zhu ?
> And @Chesnay, are there any other issues you are more concerned about? Looking forward to your reply.
>
> Thanks for all the comments. If there are no further comments, we plan to start the voting thread next week.
>
> Best,
> Hang
>
> Hang Ruan <ru...@gmail.com> 于2022年12月15日周四 16:49写道：
>>
>> Hi, Zhu Zhu,
>>
>> Thanks for your feedback!
>>
>> The OperatorCoordinator implementations are different. And their metrics are much different too. We try to find the common metrics and put them in the OperatorCoordinatorMetricGroup. If most developers think we do not need these common metrics, removing the common metrics is also acceptable.
>>
>> Best,
>> Hang
>>
>> Zhu Zhu <re...@gmail.com> 于2022年12月14日周三 22:09写道：
>>>
>>> Hi Hang & MengYue,
>>>
>>> Thanks for creating this FLIP!
>>>
>>> I think it is very useful, mainly in two aspects:
>>> 1. Enables OperatorCoordinators to register metrics. Currently
>>> the coordinators has no way to do this. And operator coordinator
>>> metric group further enables the SplitEnumerator to have access
>>> to a registered metric group (via the existing public interface
>>> SplitEnumeratorContext#metricGroup()), which is null at the moment.
>>>
>>> 2. Defines the scope of operator coordinator metrics. A clear definition
>>> makes it easy for users to find their wanted metrics. The definition
>>> also helps to avoid conflicts of metrics from multiple OperatorCoordinators
>>> of the same kind. E.g. each SourceCoordinator may have its own
>>> numSourceSplits metric, these metrics should not be directly registered
>>> to the job metric group.
>>>
>>> What I'm a bit concerned is the necessity of the introduced common metrics
>>> numEventsInCounter & numEventsOutCounter. If there any case which strongly
>>> requires them?
>>>
>>> Regarding the concerns of Chesnay,
>>> > A dedicated coordinator MG implementation is overkill
>>> Directly using the job metric group can result in metric conflicts, as mentioned
>>> in above #2.
>>>
>>> Thanks,
>>> Zhu
>>>
>>> Dong Lin <li...@gmail.com> 于2022年12月10日周六 14:16写道：
>>>
>>> >
>>> > Hi Chesney,
>>> >
>>> > Just to double check with you, OperatorCoordinatorMetricGroup (annotated as
>>> > @PublicEvolving) has already been introduced into Flink by FLIP-179
>>> > <https://cwiki.apache.org/confluence/display/FLINK/FLIP-179%3A+Expose+Standardized+Operator+Metrics>.
>>> > And that FLIP has got you +1.. Do you mean we should remove this
>>> > OperatorCoordinatorMetricGroup?
>>> >
>>> > Regards,
>>> > Dong
>>> >
>>> > On Sat, Dec 10, 2022 at 1:33 AM Chesnay Schepler <ch...@apache.org> wrote:
>>> >
>>> > > As a whole I feel like this FLIP is overly complicated. A dedicated
>>> > > coordinator MG implementation is overkill; it could just re-use the
>>> > > existing Task/OperatorMGs to create the same structure we have on TMs,
>>> > > similar to what we did with the Job MG.
>>> > >
>>> > > However, I'm not convinced that this is required anyway, because all the
>>> > > example metrics you listed can be implemented on the TM side +
>>> > > aggregating them in the external metrics backend.
>>> > >
>>> > > Since I'm on holidays soon, just so no one tries to pull a fast one on
>>> > > me, if this were to go to a vote as-is I'd be against it.
>>> > >
>>> > >
>>> > > On 09/12/2022 15:30, Dong Lin wrote:
>>> > > > Hi Hang,
>>> > > >
>>> > > > Thanks for the FLIP! The FLIP looks good and it is pretty informative.
>>> > > >
>>> > > > I have just two minor comments regarding names:
>>> > > > - Would it be useful to rename the config key as
>>> > > > *metrics.scope.jm.job.operator-coordinator* for consistency with
>>> > > > *metrics.scope.jm.job
>>> > > > *(which is not named as *jm-job)?
>>> > > > - Maybe rename the variable as SCOPE_NAMING_OPERATOR_COORDINATOR for
>>> > > > simplicity and consistency with SCOPE_NAMING_OPERATOR (which is not named
>>> > > > as SCOPE_NAMING_TM_JOB_OPERATOR)?
>>> > > >
>>> > > > Cheers,
>>> > > > Dong
>>> > > >
>>> > > >
>>> > > >
>>> > > > On Thu, Dec 8, 2022 at 3:28 PM Hang Ruan <ru...@gmail.com> wrote:
>>> > > >
>>> > > >> Hi all,
>>> > > >>
>>> > > >> MengYue and I created FLIP-274[1] Introduce metric group for
>>> > > >> OperatorCoordinator. OperatorCoordinator is the coordinator for runtime
>>> > > >> operators and running on Job Manager. The coordination mechanism is
>>> > > >> operator events between OperatorCoordinator and its all operators, the
>>> > > >> coordination is more and more using in Flink, for example many Sources
>>> > > and
>>> > > >> Sinks depend on the mechanism to assign splits and coordinate commits to
>>> > > >> external systems. The OperatorCoordinator is widely using in flink kafka
>>> > > >> connector, flink pulsar connector, flink cdc connector, flink hudi
>>> > > >> connector and so on.
>>> > > >>
>>> > > >> But there is not a suitable metric group scope for the
>>> > > OperatorCoordinator
>>> > > >> and not an implementation for the interface
>>> > > OperatorCoordinatorMetricGroup.
>>> > > >> These metrics in OperatorCoordinator could be how many splits/partitions
>>> > > >> have been assigned to source readers, how many files have been written
>>> > > out
>>> > > >> by sink writers, these metrics not only help users to know the job
>>> > > progress
>>> > > >> but also make big job maintaining easier. Thus we propose the FLIP-274
>>> > > to
>>> > > >> introduce a new metric group scope for OperatorCoordinator and provide
>>> > > an
>>> > > >> internal implementation for OperatorCoordinatorMetricGroup.
>>> > > >>
>>> > > >> Could you help review this FLIP when you get time? Any feedback is
>>> > > >> appreciated!
>>> > > >>
>>> > > >> Best,
>>> > > >> Hang
>>> > > >>
>>> > > >> [1]
>>> > > >>
>>> > > >>
>>> > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-274%3A+Introduce+metric+group+for+OperatorCoordinator
>>> > > >>
>>> > >
>>> > >

Re: [DISCUSS] FLIP-274 : Introduce metric group for OperatorCoordinator

Posted by Hang Ruan <ru...@gmail.com>.

Hi everyone,

As for the Zhu Zhu's problem, I think we should keep the common metrics,
which will help to observe incoming and outgoing events. What do you
think, @Zhu
Zhu <re...@gmail.com> ?
And @Chesnay, are there any other issues you are more concerned about?
Looking forward to your reply.

Thanks for all the comments. If there are no further comments, we plan to
start the voting thread next week.

Best,
Hang

Hang Ruan <ru...@gmail.com> 于2022年12月15日周四 16:49写道：

> Hi, Zhu Zhu,
>
> Thanks for your feedback!
>
> The OperatorCoordinator implementations are different. And their metrics
> are much different too. We try to find the common metrics and put them in
> the OperatorCoordinatorMetricGroup. If most developers think we do not need
> these common metrics, removing the common metrics is also acceptable.
>
> Best,
> Hang
>
> Zhu Zhu <re...@gmail.com> 于2022年12月14日周三 22:09写道：
>
>> Hi Hang & MengYue,
>>
>> Thanks for creating this FLIP!
>>
>> I think it is very useful, mainly in two aspects:
>> 1. Enables OperatorCoordinators to register metrics. Currently
>> the coordinators has no way to do this. And operator coordinator
>> metric group further enables the SplitEnumerator to have access
>> to a registered metric group (via the existing public interface
>> SplitEnumeratorContext#metricGroup()), which is null at the moment.
>>
>> 2. Defines the scope of operator coordinator metrics. A clear definition
>> makes it easy for users to find their wanted metrics. The definition
>> also helps to avoid conflicts of metrics from multiple
>> OperatorCoordinators
>> of the same kind. E.g. each SourceCoordinator may have its own
>> numSourceSplits metric, these metrics should not be directly registered
>> to the job metric group.
>>
>> What I'm a bit concerned is the necessity of the introduced common metrics
>> numEventsInCounter & numEventsOutCounter. If there any case which strongly
>> requires them?
>>
>> Regarding the concerns of Chesnay,
>> > A dedicated coordinator MG implementation is overkill
>> Directly using the job metric group can result in metric conflicts, as
>> mentioned
>> in above #2.
>>
>> Thanks,
>> Zhu
>>
>> Dong Lin <li...@gmail.com> 于2022年12月10日周六 14:16写道：
>>
>> >
>> > Hi Chesney,
>> >
>> > Just to double check with you, OperatorCoordinatorMetricGroup
>> (annotated as
>> > @PublicEvolving) has already been introduced into Flink by FLIP-179
>> > <
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-179%3A+Expose+Standardized+Operator+Metrics
>> >.
>> > And that FLIP has got you +1.. Do you mean we should remove this
>> > OperatorCoordinatorMetricGroup?
>> >
>> > Regards,
>> > Dong
>> >
>> > On Sat, Dec 10, 2022 at 1:33 AM Chesnay Schepler <ch...@apache.org>
>> wrote:
>> >
>> > > As a whole I feel like this FLIP is overly complicated. A dedicated
>> > > coordinator MG implementation is overkill; it could just re-use the
>> > > existing Task/OperatorMGs to create the same structure we have on TMs,
>> > > similar to what we did with the Job MG.
>> > >
>> > > However, I'm not convinced that this is required anyway, because all
>> the
>> > > example metrics you listed can be implemented on the TM side +
>> > > aggregating them in the external metrics backend.
>> > >
>> > > Since I'm on holidays soon, just so no one tries to pull a fast one on
>> > > me, if this were to go to a vote as-is I'd be against it.
>> > >
>> > >
>> > > On 09/12/2022 15:30, Dong Lin wrote:
>> > > > Hi Hang,
>> > > >
>> > > > Thanks for the FLIP! The FLIP looks good and it is pretty
>> informative.
>> > > >
>> > > > I have just two minor comments regarding names:
>> > > > - Would it be useful to rename the config key as
>> > > > *metrics.scope.jm.job.operator-coordinator* for consistency with
>> > > > *metrics.scope.jm.job
>> > > > *(which is not named as *jm-job)?
>> > > > - Maybe rename the variable as SCOPE_NAMING_OPERATOR_COORDINATOR for
>> > > > simplicity and consistency with SCOPE_NAMING_OPERATOR (which is not
>> named
>> > > > as SCOPE_NAMING_TM_JOB_OPERATOR)?
>> > > >
>> > > > Cheers,
>> > > > Dong
>> > > >
>> > > >
>> > > >
>> > > > On Thu, Dec 8, 2022 at 3:28 PM Hang Ruan <ru...@gmail.com>
>> wrote:
>> > > >
>> > > >> Hi all,
>> > > >>
>> > > >> MengYue and I created FLIP-274[1] Introduce metric group for
>> > > >> OperatorCoordinator. OperatorCoordinator is the coordinator for
>> runtime
>> > > >> operators and running on Job Manager. The coordination mechanism is
>> > > >> operator events between OperatorCoordinator and its all operators,
>> the
>> > > >> coordination is more and more using in Flink, for example many
>> Sources
>> > > and
>> > > >> Sinks depend on the mechanism to assign splits and coordinate
>> commits to
>> > > >> external systems. The OperatorCoordinator is widely using in flink
>> kafka
>> > > >> connector, flink pulsar connector, flink cdc connector, flink hudi
>> > > >> connector and so on.
>> > > >>
>> > > >> But there is not a suitable metric group scope for the
>> > > OperatorCoordinator
>> > > >> and not an implementation for the interface
>> > > OperatorCoordinatorMetricGroup.
>> > > >> These metrics in OperatorCoordinator could be how many
>> splits/partitions
>> > > >> have been assigned to source readers, how many files have been
>> written
>> > > out
>> > > >> by sink writers, these metrics not only help users to know the job
>> > > progress
>> > > >> but also make big job maintaining easier. Thus we propose the
>> FLIP-274
>> > > to
>> > > >> introduce a new metric group scope for OperatorCoordinator and
>> provide
>> > > an
>> > > >> internal implementation for OperatorCoordinatorMetricGroup.
>> > > >>
>> > > >> Could you help review this FLIP when you get time? Any feedback is
>> > > >> appreciated!
>> > > >>
>> > > >> Best,
>> > > >> Hang
>> > > >>
>> > > >> [1]
>> > > >>
>> > > >>
>> > >
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-274%3A+Introduce+metric+group+for+OperatorCoordinator
>> > > >>
>> > >
>> > >
>>
>

Re: [DISCUSS] FLIP-274 : Introduce metric group for OperatorCoordinator

Posted by Hang Ruan <ru...@gmail.com>.

Hi, Zhu Zhu,

Thanks for your feedback!

The OperatorCoordinator implementations are different. And their metrics
are much different too. We try to find the common metrics and put them in
the OperatorCoordinatorMetricGroup. If most developers think we do not need
these common metrics, removing the common metrics is also acceptable.

Best,
Hang

Zhu Zhu <re...@gmail.com> 于2022年12月14日周三 22:09写道：

> Hi Hang & MengYue,
>
> Thanks for creating this FLIP!
>
> I think it is very useful, mainly in two aspects:
> 1. Enables OperatorCoordinators to register metrics. Currently
> the coordinators has no way to do this. And operator coordinator
> metric group further enables the SplitEnumerator to have access
> to a registered metric group (via the existing public interface
> SplitEnumeratorContext#metricGroup()), which is null at the moment.
>
> 2. Defines the scope of operator coordinator metrics. A clear definition
> makes it easy for users to find their wanted metrics. The definition
> also helps to avoid conflicts of metrics from multiple OperatorCoordinators
> of the same kind. E.g. each SourceCoordinator may have its own
> numSourceSplits metric, these metrics should not be directly registered
> to the job metric group.
>
> What I'm a bit concerned is the necessity of the introduced common metrics
> numEventsInCounter & numEventsOutCounter. If there any case which strongly
> requires them?
>
> Regarding the concerns of Chesnay,
> > A dedicated coordinator MG implementation is overkill
> Directly using the job metric group can result in metric conflicts, as
> mentioned
> in above #2.
>
> Thanks,
> Zhu
>
> Dong Lin <li...@gmail.com> 于2022年12月10日周六 14:16写道：
>
> >
> > Hi Chesney,
> >
> > Just to double check with you, OperatorCoordinatorMetricGroup (annotated
> as
> > @PublicEvolving) has already been introduced into Flink by FLIP-179
> > <
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-179%3A+Expose+Standardized+Operator+Metrics
> >.
> > And that FLIP has got you +1.. Do you mean we should remove this
> > OperatorCoordinatorMetricGroup?
> >
> > Regards,
> > Dong
> >
> > On Sat, Dec 10, 2022 at 1:33 AM Chesnay Schepler <ch...@apache.org>
> wrote:
> >
> > > As a whole I feel like this FLIP is overly complicated. A dedicated
> > > coordinator MG implementation is overkill; it could just re-use the
> > > existing Task/OperatorMGs to create the same structure we have on TMs,
> > > similar to what we did with the Job MG.
> > >
> > > However, I'm not convinced that this is required anyway, because all
> the
> > > example metrics you listed can be implemented on the TM side +
> > > aggregating them in the external metrics backend.
> > >
> > > Since I'm on holidays soon, just so no one tries to pull a fast one on
> > > me, if this were to go to a vote as-is I'd be against it.
> > >
> > >
> > > On 09/12/2022 15:30, Dong Lin wrote:
> > > > Hi Hang,
> > > >
> > > > Thanks for the FLIP! The FLIP looks good and it is pretty
> informative.
> > > >
> > > > I have just two minor comments regarding names:
> > > > - Would it be useful to rename the config key as
> > > > *metrics.scope.jm.job.operator-coordinator* for consistency with
> > > > *metrics.scope.jm.job
> > > > *(which is not named as *jm-job)?
> > > > - Maybe rename the variable as SCOPE_NAMING_OPERATOR_COORDINATOR for
> > > > simplicity and consistency with SCOPE_NAMING_OPERATOR (which is not
> named
> > > > as SCOPE_NAMING_TM_JOB_OPERATOR)?
> > > >
> > > > Cheers,
> > > > Dong
> > > >
> > > >
> > > >
> > > > On Thu, Dec 8, 2022 at 3:28 PM Hang Ruan <ru...@gmail.com>
> wrote:
> > > >
> > > >> Hi all,
> > > >>
> > > >> MengYue and I created FLIP-274[1] Introduce metric group for
> > > >> OperatorCoordinator. OperatorCoordinator is the coordinator for
> runtime
> > > >> operators and running on Job Manager. The coordination mechanism is
> > > >> operator events between OperatorCoordinator and its all operators,
> the
> > > >> coordination is more and more using in Flink, for example many
> Sources
> > > and
> > > >> Sinks depend on the mechanism to assign splits and coordinate
> commits to
> > > >> external systems. The OperatorCoordinator is widely using in flink
> kafka
> > > >> connector, flink pulsar connector, flink cdc connector, flink hudi
> > > >> connector and so on.
> > > >>
> > > >> But there is not a suitable metric group scope for the
> > > OperatorCoordinator
> > > >> and not an implementation for the interface
> > > OperatorCoordinatorMetricGroup.
> > > >> These metrics in OperatorCoordinator could be how many
> splits/partitions
> > > >> have been assigned to source readers, how many files have been
> written
> > > out
> > > >> by sink writers, these metrics not only help users to know the job
> > > progress
> > > >> but also make big job maintaining easier. Thus we propose the
> FLIP-274
> > > to
> > > >> introduce a new metric group scope for OperatorCoordinator and
> provide
> > > an
> > > >> internal implementation for OperatorCoordinatorMetricGroup.
> > > >>
> > > >> Could you help review this FLIP when you get time? Any feedback is
> > > >> appreciated!
> > > >>
> > > >> Best,
> > > >> Hang
> > > >>
> > > >> [1]
> > > >>
> > > >>
> > >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-274%3A+Introduce+metric+group+for+OperatorCoordinator
> > > >>
> > >
> > >
>

Re: [DISCUSS] FLIP-274 : Introduce metric group for OperatorCoordinator

Posted by Zhu Zhu <re...@gmail.com>.

Hi Hang & MengYue,

Thanks for creating this FLIP!

I think it is very useful, mainly in two aspects:
1. Enables OperatorCoordinators to register metrics. Currently
the coordinators has no way to do this. And operator coordinator
metric group further enables the SplitEnumerator to have access
to a registered metric group (via the existing public interface
SplitEnumeratorContext#metricGroup()), which is null at the moment.

2. Defines the scope of operator coordinator metrics. A clear definition
makes it easy for users to find their wanted metrics. The definition
also helps to avoid conflicts of metrics from multiple OperatorCoordinators
of the same kind. E.g. each SourceCoordinator may have its own
numSourceSplits metric, these metrics should not be directly registered
to the job metric group.

What I'm a bit concerned is the necessity of the introduced common metrics
numEventsInCounter & numEventsOutCounter. If there any case which strongly
requires them?

Regarding the concerns of Chesnay,
> A dedicated coordinator MG implementation is overkill
Directly using the job metric group can result in metric conflicts, as mentioned
in above #2.

Thanks,
Zhu

Dong Lin <li...@gmail.com> 于2022年12月10日周六 14:16写道：

>
> Hi Chesney,
>
> Just to double check with you, OperatorCoordinatorMetricGroup (annotated as
> @PublicEvolving) has already been introduced into Flink by FLIP-179
> <https://cwiki.apache.org/confluence/display/FLINK/FLIP-179%3A+Expose+Standardized+Operator+Metrics>.
> And that FLIP has got you +1.. Do you mean we should remove this
> OperatorCoordinatorMetricGroup?
>
> Regards,
> Dong
>
> On Sat, Dec 10, 2022 at 1:33 AM Chesnay Schepler <ch...@apache.org> wrote:
>
> > As a whole I feel like this FLIP is overly complicated. A dedicated
> > coordinator MG implementation is overkill; it could just re-use the
> > existing Task/OperatorMGs to create the same structure we have on TMs,
> > similar to what we did with the Job MG.
> >
> > However, I'm not convinced that this is required anyway, because all the
> > example metrics you listed can be implemented on the TM side +
> > aggregating them in the external metrics backend.
> >
> > Since I'm on holidays soon, just so no one tries to pull a fast one on
> > me, if this were to go to a vote as-is I'd be against it.
> >
> >
> > On 09/12/2022 15:30, Dong Lin wrote:
> > > Hi Hang,
> > >
> > > Thanks for the FLIP! The FLIP looks good and it is pretty informative.
> > >
> > > I have just two minor comments regarding names:
> > > - Would it be useful to rename the config key as
> > > *metrics.scope.jm.job.operator-coordinator* for consistency with
> > > *metrics.scope.jm.job
> > > *(which is not named as *jm-job)?
> > > - Maybe rename the variable as SCOPE_NAMING_OPERATOR_COORDINATOR for
> > > simplicity and consistency with SCOPE_NAMING_OPERATOR (which is not named
> > > as SCOPE_NAMING_TM_JOB_OPERATOR)?
> > >
> > > Cheers,
> > > Dong
> > >
> > >
> > >
> > > On Thu, Dec 8, 2022 at 3:28 PM Hang Ruan <ru...@gmail.com> wrote:
> > >
> > >> Hi all,
> > >>
> > >> MengYue and I created FLIP-274[1] Introduce metric group for
> > >> OperatorCoordinator. OperatorCoordinator is the coordinator for runtime
> > >> operators and running on Job Manager. The coordination mechanism is
> > >> operator events between OperatorCoordinator and its all operators, the
> > >> coordination is more and more using in Flink, for example many Sources
> > and
> > >> Sinks depend on the mechanism to assign splits and coordinate commits to
> > >> external systems. The OperatorCoordinator is widely using in flink kafka
> > >> connector, flink pulsar connector, flink cdc connector, flink hudi
> > >> connector and so on.
> > >>
> > >> But there is not a suitable metric group scope for the
> > OperatorCoordinator
> > >> and not an implementation for the interface
> > OperatorCoordinatorMetricGroup.
> > >> These metrics in OperatorCoordinator could be how many splits/partitions
> > >> have been assigned to source readers, how many files have been written
> > out
> > >> by sink writers, these metrics not only help users to know the job
> > progress
> > >> but also make big job maintaining easier. Thus we propose the FLIP-274
> > to
> > >> introduce a new metric group scope for OperatorCoordinator and provide
> > an
> > >> internal implementation for OperatorCoordinatorMetricGroup.
> > >>
> > >> Could you help review this FLIP when you get time? Any feedback is
> > >> appreciated!
> > >>
> > >> Best,
> > >> Hang
> > >>
> > >> [1]
> > >>
> > >>
> > https://cwiki.apache.org/confluence/display/FLINK/FLIP-274%3A+Introduce+metric+group+for+OperatorCoordinator
> > >>
> >
> >

Re: [DISCUSS] FLIP-274 : Introduce metric group for OperatorCoordinator

Posted by Dong Lin <li...@gmail.com>.

Hi Chesney,

Just to double check with you, OperatorCoordinatorMetricGroup (annotated as
@PublicEvolving) has already been introduced into Flink by FLIP-179
<https://cwiki.apache.org/confluence/display/FLINK/FLIP-179%3A+Expose+Standardized+Operator+Metrics>.
And that FLIP has got you +1.. Do you mean we should remove this
OperatorCoordinatorMetricGroup?

Regards,
Dong

On Sat, Dec 10, 2022 at 1:33 AM Chesnay Schepler <ch...@apache.org> wrote:

> As a whole I feel like this FLIP is overly complicated. A dedicated
> coordinator MG implementation is overkill; it could just re-use the
> existing Task/OperatorMGs to create the same structure we have on TMs,
> similar to what we did with the Job MG.
>
> However, I'm not convinced that this is required anyway, because all the
> example metrics you listed can be implemented on the TM side +
> aggregating them in the external metrics backend.
>
> Since I'm on holidays soon, just so no one tries to pull a fast one on
> me, if this were to go to a vote as-is I'd be against it.
>
>
> On 09/12/2022 15:30, Dong Lin wrote:
> > Hi Hang,
> >
> > Thanks for the FLIP! The FLIP looks good and it is pretty informative.
> >
> > I have just two minor comments regarding names:
> > - Would it be useful to rename the config key as
> > *metrics.scope.jm.job.operator-coordinator* for consistency with
> > *metrics.scope.jm.job
> > *(which is not named as *jm-job)?
> > - Maybe rename the variable as SCOPE_NAMING_OPERATOR_COORDINATOR for
> > simplicity and consistency with SCOPE_NAMING_OPERATOR (which is not named
> > as SCOPE_NAMING_TM_JOB_OPERATOR)?
> >
> > Cheers,
> > Dong
> >
> >
> >
> > On Thu, Dec 8, 2022 at 3:28 PM Hang Ruan <ru...@gmail.com> wrote:
> >
> >> Hi all,
> >>
> >> MengYue and I created FLIP-274[1] Introduce metric group for
> >> OperatorCoordinator. OperatorCoordinator is the coordinator for runtime
> >> operators and running on Job Manager. The coordination mechanism is
> >> operator events between OperatorCoordinator and its all operators, the
> >> coordination is more and more using in Flink, for example many Sources
> and
> >> Sinks depend on the mechanism to assign splits and coordinate commits to
> >> external systems. The OperatorCoordinator is widely using in flink kafka
> >> connector, flink pulsar connector, flink cdc connector, flink hudi
> >> connector and so on.
> >>
> >> But there is not a suitable metric group scope for the
> OperatorCoordinator
> >> and not an implementation for the interface
> OperatorCoordinatorMetricGroup.
> >> These metrics in OperatorCoordinator could be how many splits/partitions
> >> have been assigned to source readers, how many files have been written
> out
> >> by sink writers, these metrics not only help users to know the job
> progress
> >> but also make big job maintaining easier. Thus we propose the FLIP-274
> to
> >> introduce a new metric group scope for OperatorCoordinator and provide
> an
> >> internal implementation for OperatorCoordinatorMetricGroup.
> >>
> >> Could you help review this FLIP when you get time? Any feedback is
> >> appreciated!
> >>
> >> Best,
> >> Hang
> >>
> >> [1]
> >>
> >>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-274%3A+Introduce+metric+group+for+OperatorCoordinator
> >>
>
>

Re: [DISCUSS] FLIP-274 : Introduce metric group for OperatorCoordinator

Posted by Hang Ruan <ru...@gmail.com>.

Hi, Chesnay,

Thanks for your reply.

Actually we can not reuse the Task/OperatorMG for the OperatorCoordinator.
There are mainly two reasons.
First of all, the scopes of these metric groups are not suitable for the
OperatorCoordinator. It should be
"<host>.jobmanager.<job_name>.<operator_name>.coordinator".
Secondly, there are some metrics that we cannot compute from the subtasks.
For example, in flink cdc connectors, we could report how many tables are
pending in the OperatorCoordinator. But this information is not available
in its subtasks.

We try to add some common metrics to the OperatorCoordinatorMetricGroup for
all OperatorCoordinator implementations. In fact, it should be discussed
whether these common metrics are necessary.

Best,
Hang

Chesnay Schepler <ch...@apache.org> 于2022年12月10日周六 01:33写道：

> As a whole I feel like this FLIP is overly complicated. A dedicated
> coordinator MG implementation is overkill; it could just re-use the
> existing Task/OperatorMGs to create the same structure we have on TMs,
> similar to what we did with the Job MG.
>
> However, I'm not convinced that this is required anyway, because all the
> example metrics you listed can be implemented on the TM side +
> aggregating them in the external metrics backend.
>
> Since I'm on holidays soon, just so no one tries to pull a fast one on
> me, if this were to go to a vote as-is I'd be against it.
>
>
> On 09/12/2022 15:30, Dong Lin wrote:
> > Hi Hang,
> >
> > Thanks for the FLIP! The FLIP looks good and it is pretty informative.
> >
> > I have just two minor comments regarding names:
> > - Would it be useful to rename the config key as
> > *metrics.scope.jm.job.operator-coordinator* for consistency with
> > *metrics.scope.jm.job
> > *(which is not named as *jm-job)?
> > - Maybe rename the variable as SCOPE_NAMING_OPERATOR_COORDINATOR for
> > simplicity and consistency with SCOPE_NAMING_OPERATOR (which is not named
> > as SCOPE_NAMING_TM_JOB_OPERATOR)?
> >
> > Cheers,
> > Dong
> >
> >
> >
> > On Thu, Dec 8, 2022 at 3:28 PM Hang Ruan <ru...@gmail.com> wrote:
> >
> >> Hi all,
> >>
> >> MengYue and I created FLIP-274[1] Introduce metric group for
> >> OperatorCoordinator. OperatorCoordinator is the coordinator for runtime
> >> operators and running on Job Manager. The coordination mechanism is
> >> operator events between OperatorCoordinator and its all operators, the
> >> coordination is more and more using in Flink, for example many Sources
> and
> >> Sinks depend on the mechanism to assign splits and coordinate commits to
> >> external systems. The OperatorCoordinator is widely using in flink kafka
> >> connector, flink pulsar connector, flink cdc connector, flink hudi
> >> connector and so on.
> >>
> >> But there is not a suitable metric group scope for the
> OperatorCoordinator
> >> and not an implementation for the interface
> OperatorCoordinatorMetricGroup.
> >> These metrics in OperatorCoordinator could be how many splits/partitions
> >> have been assigned to source readers, how many files have been written
> out
> >> by sink writers, these metrics not only help users to know the job
> progress
> >> but also make big job maintaining easier. Thus we propose the FLIP-274
> to
> >> introduce a new metric group scope for OperatorCoordinator and provide
> an
> >> internal implementation for OperatorCoordinatorMetricGroup.
> >>
> >> Could you help review this FLIP when you get time? Any feedback is
> >> appreciated!
> >>
> >> Best,
> >> Hang
> >>
> >> [1]
> >>
> >>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-274%3A+Introduce+metric+group+for+OperatorCoordinator
> >>
>
>

Re: [DISCUSS] FLIP-274 : Introduce metric group for OperatorCoordinator

Posted by Chesnay Schepler <ch...@apache.org>.

As a whole I feel like this FLIP is overly complicated. A dedicated 
coordinator MG implementation is overkill; it could just re-use the 
existing Task/OperatorMGs to create the same structure we have on TMs, 
similar to what we did with the Job MG.

However, I'm not convinced that this is required anyway, because all the 
example metrics you listed can be implemented on the TM side + 
aggregating them in the external metrics backend.

Since I'm on holidays soon, just so no one tries to pull a fast one on 
me, if this were to go to a vote as-is I'd be against it.


On 09/12/2022 15:30, Dong Lin wrote:
> Hi Hang,
>
> Thanks for the FLIP! The FLIP looks good and it is pretty informative.
>
> I have just two minor comments regarding names:
> - Would it be useful to rename the config key as
> *metrics.scope.jm.job.operator-coordinator* for consistency with
> *metrics.scope.jm.job
> *(which is not named as *jm-job)?
> - Maybe rename the variable as SCOPE_NAMING_OPERATOR_COORDINATOR for
> simplicity and consistency with SCOPE_NAMING_OPERATOR (which is not named
> as SCOPE_NAMING_TM_JOB_OPERATOR)?
>
> Cheers,
> Dong
>
>
>
> On Thu, Dec 8, 2022 at 3:28 PM Hang Ruan <ru...@gmail.com> wrote:
>
>> Hi all,
>>
>> MengYue and I created FLIP-274[1] Introduce metric group for
>> OperatorCoordinator. OperatorCoordinator is the coordinator for runtime
>> operators and running on Job Manager. The coordination mechanism is
>> operator events between OperatorCoordinator and its all operators, the
>> coordination is more and more using in Flink, for example many Sources and
>> Sinks depend on the mechanism to assign splits and coordinate commits to
>> external systems. The OperatorCoordinator is widely using in flink kafka
>> connector, flink pulsar connector, flink cdc connector, flink hudi
>> connector and so on.
>>
>> But there is not a suitable metric group scope for the OperatorCoordinator
>> and not an implementation for the interface OperatorCoordinatorMetricGroup.
>> These metrics in OperatorCoordinator could be how many splits/partitions
>> have been assigned to source readers, how many files have been written out
>> by sink writers, these metrics not only help users to know the job progress
>> but also make big job maintaining easier. Thus we propose the FLIP-274 to
>> introduce a new metric group scope for OperatorCoordinator and provide an
>> internal implementation for OperatorCoordinatorMetricGroup.
>>
>> Could you help review this FLIP when you get time? Any feedback is
>> appreciated!
>>
>> Best,
>> Hang
>>
>> [1]
>>
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-274%3A+Introduce+metric+group+for+OperatorCoordinator
>>

Re: [DISCUSS] FLIP-274 : Introduce metric group for OperatorCoordinator

Posted by Hang Ruan <ru...@gmail.com>.

Hi, Dong,

Thanks for your suggestion. I plan to rename this scope like this :

public static final ConfigOption<String> SCOPE_NAMING_OPERATOR_COORDINATOR =
        key("metrics.scope.operator-coordinator")
                .stringType()

.defaultValue("<host>.jobmanager.<job_name>.<operator_name>.coordinator")
                .withDescription(
                        "Defines the scope format string that is
applied to all metrics scoped to a job on a JobManager.");

Best,
Hang

Dong Lin <li...@gmail.com> 于2022年12月9日周五 22:34写道：

> Hi Hang,
>
> Thanks for the FLIP! The FLIP looks good and it is pretty informative.
>
> I have just two minor comments regarding names:
> - Would it be useful to rename the config key as
> *metrics.scope.jm.job.operator-coordinator* for consistency with
> *metrics.scope.jm.job
> *(which is not named as *jm-job)?
> - Maybe rename the variable as SCOPE_NAMING_OPERATOR_COORDINATOR for
> simplicity and consistency with SCOPE_NAMING_OPERATOR (which is not named
> as SCOPE_NAMING_TM_JOB_OPERATOR)?
>
> Cheers,
> Dong
>
>
>
> On Thu, Dec 8, 2022 at 3:28 PM Hang Ruan <ru...@gmail.com> wrote:
>
> > Hi all,
> >
> > MengYue and I created FLIP-274[1] Introduce metric group for
> > OperatorCoordinator. OperatorCoordinator is the coordinator for runtime
> > operators and running on Job Manager. The coordination mechanism is
> > operator events between OperatorCoordinator and its all operators, the
> > coordination is more and more using in Flink, for example many Sources
> and
> > Sinks depend on the mechanism to assign splits and coordinate commits to
> > external systems. The OperatorCoordinator is widely using in flink kafka
> > connector, flink pulsar connector, flink cdc connector, flink hudi
> > connector and so on.
> >
> > But there is not a suitable metric group scope for the
> OperatorCoordinator
> > and not an implementation for the interface
> OperatorCoordinatorMetricGroup.
> > These metrics in OperatorCoordinator could be how many splits/partitions
> > have been assigned to source readers, how many files have been written
> out
> > by sink writers, these metrics not only help users to know the job
> progress
> > but also make big job maintaining easier. Thus we propose the FLIP-274 to
> > introduce a new metric group scope for OperatorCoordinator and provide an
> > internal implementation for OperatorCoordinatorMetricGroup.
> >
> > Could you help review this FLIP when you get time? Any feedback is
> > appreciated!
> >
> > Best,
> > Hang
> >
> > [1]
> >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-274%3A+Introduce+metric+group+for+OperatorCoordinator
> >
>

Re: [DISCUSS] FLIP-274 : Introduce metric group for OperatorCoordinator

Posted by Dong Lin <li...@gmail.com>.

Hi Hang,

Thanks for the FLIP! The FLIP looks good and it is pretty informative.

I have just two minor comments regarding names:
- Would it be useful to rename the config key as
*metrics.scope.jm.job.operator-coordinator* for consistency with
*metrics.scope.jm.job
*(which is not named as *jm-job)?
- Maybe rename the variable as SCOPE_NAMING_OPERATOR_COORDINATOR for
simplicity and consistency with SCOPE_NAMING_OPERATOR (which is not named
as SCOPE_NAMING_TM_JOB_OPERATOR)?

Cheers,
Dong



On Thu, Dec 8, 2022 at 3:28 PM Hang Ruan <ru...@gmail.com> wrote:

> Hi all,
>
> MengYue and I created FLIP-274[1] Introduce metric group for
> OperatorCoordinator. OperatorCoordinator is the coordinator for runtime
> operators and running on Job Manager. The coordination mechanism is
> operator events between OperatorCoordinator and its all operators, the
> coordination is more and more using in Flink, for example many Sources and
> Sinks depend on the mechanism to assign splits and coordinate commits to
> external systems. The OperatorCoordinator is widely using in flink kafka
> connector, flink pulsar connector, flink cdc connector, flink hudi
> connector and so on.
>
> But there is not a suitable metric group scope for the OperatorCoordinator
> and not an implementation for the interface OperatorCoordinatorMetricGroup.
> These metrics in OperatorCoordinator could be how many splits/partitions
> have been assigned to source readers, how many files have been written out
> by sink writers, these metrics not only help users to know the job progress
> but also make big job maintaining easier. Thus we propose the FLIP-274 to
> introduce a new metric group scope for OperatorCoordinator and provide an
> internal implementation for OperatorCoordinatorMetricGroup.
>
> Could you help review this FLIP when you get time? Any feedback is
> appreciated!
>
> Best,
> Hang
>
> [1]
>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-274%3A+Introduce+metric+group+for+OperatorCoordinator
>