You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@flink.apache.org by Weihua Hu <hu...@gmail.com> on 2023/02/27 12:40:58 UTC

[DISCUSS] FLIP-298: Unifying the Implementation of SlotManager

Hi everyone,

I would like to begin a discussion on FLIP-298: Unifying the Implementation
of SlotManager[1]. There are currently two types of SlotManager in Flink:
DeclarativeSlotManager and FineGrainedSlotManager. FineGrainedSlotManager
should behave as DeclarativeSlotManager if the user does not configure the
slot request profile.

Therefore, this FLIP aims to unify the implementation of SlotManager in
order to reduce maintenance costs.

Looking forward to hearing from you.

[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-298%3A+Unifying+the+Implementation+of+SlotManager

Best,
Weihua

Re: [DISCUSS] FLIP-298: Unifying the Implementation of SlotManager

Posted by Weihua Hu <hu...@gmail.com>.
Hi, Max

Thanks for feedback.

SlotManager only releases task managers with no task running on it after an
idle timeout.
IMO, this behavior could support scale down naturally.

There are two ways to trigger scale down:
1. Using adaptive scheduler and release task managers by resource
provider(k8s/yarn).
Then scheduler will scale down the parallelism to match the less resources.

2. After FLIP-291[1], we could set the parallelism of jobs, this may scale
down jobs. In this case,
We could add some strategy in SlotPool to release slots. Once SlotPool
releases all slots
of a task manager, the SlotManager will release this task manager after the
idle timeout exceeds.

[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-291%3A+Externalized+Declarative+Resource+Management

Best,
Weihua


On Fri, Mar 10, 2023 at 10:11 PM Maximilian Michels <mx...@apache.org> wrote:

> +1 on the proposal.
>
> I'm wondering about the scale down behavior of the unified
> implementation. Does the new unified implementation prioritize
> releasing entire task managers in favor of evenly spreading out task
> managers? Consider a scale down from parallelism 15 to parallelism 10
> where each task manager has 5 slots. We could either spread out 10
> slots among the 3 task managers, or fit the 10 slots in 2 task
> managers and surrender the task manager (at least on k8s). From a cost
> saving perspective, the latter would be preferable.
>
> -Max
>
> On Thu, Mar 9, 2023 at 4:17 PM Matthias Pohl
> <ma...@aiven.io.invalid> wrote:
> >
> > Thanks for your clarification. I have nothing else to add to the
> > discussion. +1 from my side to proceed
> >
> > On Wed, Mar 8, 2023 at 4:16 AM Weihua Hu <hu...@gmail.com> wrote:
> >
> > > Thanks Yangze for your attention, this would be a great help.
> > >
> > > And thanks Matthias too.
> > >
> > > FLIP-156 [1] mentions some incompatibility between fine-grained
> resource
> > > > management and reactive mode. I assume that this is independent of
> the
> > > > SlotManager and replacing the DSM with the FGSM wouldn't affect
> reactive
> > > > mode?
> > >
> > > Yes. This incompatibility is independent of SlotManager. That means the
> > > AdpativeScheduler will always ignore the resource requirement set by
> > > slotSharingGroup and declare Unknown ResourceProfile to SlotManager.
> > > So, using FGSM as default will not affect reactive mode.
> > >
> > > About the heterogeneous TaskManager: This is a feature that's also not
> > > > supported in the DSM right now, is it? We should state that fact in
> the
> > > > FLIP if we mentioned that we don't want to implement it for the FSGM.
> > >
> > > Yes, both DSM and FGSM do not support request heterogeneous
> > > TaskManager right now. Heterogeneous will make the resource allocation
> > > logic more complicated, such as the resource deadlock if request A
> > > allocated
> > > the bigger slot B and then request B could not allocate the small slot
> A.
> > > We
> > > need to think more before starting to support the heterogeneous task
> > > manager.
> > > So, we don't want to implement heterogeneity in this FLIP.
> > >
> > > Best,
> > > Weihua
> > >
> > >
> > > On Wed, Mar 8, 2023 at 12:44 AM Matthias Pohl
> > > <ma...@aiven.io.invalid> wrote:
> > >
> > > > Thanks for updating the FLIP and adding more context to it.
> Additionally,
> > > > thanks to Xintong and Yangze for offering your expertise here as
> > > > contributors to the initial FineGrainedSlotManager implementation.
> > > >
> > > > The remark on cutting out functionality was only based on some
> > > superficial
> > > > initial code reading. I cannot come up with a better code structure
> > > myself.
> > > > Therefore, I'm fine with not refactoring the code as part of this
> FLIP.
> > > >
> > > > The strategies that were proposed around making sure that the
> refactoring
> > > > is properly backed by tests sound reasonable. My initial concern was
> > > based
> > > > on the fact that we might have unit test scenarios for the DSM that
> are
> > > not
> > > > covered in the unit tests of the FSGM. In that case, swapping the DSM
> > > with
> > > > the FSGM might not be good enough. Going over the GSM tests to make
> sure
> > > > that we're not accidentally deleting test scenarios sounds good to
> me.
> > > > Thanks, Weihua.
> > > >
> > > > FLIP-156 [1] mentions some incompatibility between fine-grained
> resource
> > > > management and reactive mode. I assume that this is independent of
> the
> > > > SlotManager and replacing the DSM with the FGSM wouldn't affect
> reactive
> > > > mode?
> > > >
> > > > About the heterogeneous TaskManager: This is a feature that's also
> not
> > > > supported in the DSM right now, is it? We should state that fact in
> the
> > > > FLIP if we mentioned that we don't want to implement it for the FSGM.
> > > >
> > > > Best,
> > > > Matthias
> > > >
> > > > [1]
> > > >
> > > >
> > >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-156%3A+Runtime+Interfaces+for+Fine-Grained+Resource+Requirements
> > > >
> > > > On Tue, Mar 7, 2023 at 8:58 AM Yangze Guo <ka...@gmail.com>
> wrote:
> > > >
> > > > > Hi Weihua,
> > > > >
> > > > > Thanks for driving this. As Xintong mentioned, this was a technical
> > > > > debt from FLIP-56.
> > > > >
> > > > > The latest version of FLIP sounds good, +1 from my side. As a
> > > > > contributor to this component, I'm willing to assist with the
> review
> > > > > process. Feel free to reach me if you need help.
> > > > >
> > > > > Best,
> > > > > Yangze Guo
> > > > >
> > > > > On Tue, Mar 7, 2023 at 1:47 PM Weihua Hu <hu...@gmail.com>
> > > wrote:
> > > > > >
> > > > > > Hi,
> > > > > >
> > > > > > @David @Matthias
> > > > > > There are a few days after hearing your thoughts. I would like to
> > > know
> > > > if
> > > > > > there are any other concerns about this FLIP.
> > > > > >
> > > > > >
> > > > > > Best,
> > > > > > Weihua
> > > > > >
> > > > > >
> > > > > > On Mon, Mar 6, 2023 at 7:53 PM Weihua Hu <huweihua.ckl@gmail.com
> >
> > > > wrote:
> > > > > >
> > > > > > >
> > > > > > > Thanks Shammon,
> > > > > > >
> > > > > > > I've updated FLIP to add this redundant Task Manager
> limitation.
> > > > > > >
> > > > > > >
> > > > > > > Best,
> > > > > > > Weihua
> > > > > > >
> > > > > > >
> > > > > > > On Mon, Mar 6, 2023 at 5:07 PM Shammon FY <zj...@gmail.com>
> > > wrote:
> > > > > > >
> > > > > > >> Hi weihua
> > > > > > >>
> > > > > > >> Can you add content related to `heterogeneous resources` to
> this
> > > > > FLIP? We
> > > > > > >> can record it and consider it in the future. It may be useful
> for
> > > > some
> > > > > > >> scenarios, such as the combination of streaming and ML.
> > > > > > >>
> > > > > > >> Best,
> > > > > > >> Shammon
> > > > > > >>
> > > > > > >>
> > > > > > >> On Mon, Mar 6, 2023 at 1:39 PM weijie guo <
> > > > guoweijiereswqa@gmail.com>
> > > > > > >> wrote:
> > > > > > >>
> > > > > > >> > Hi Weihua,
> > > > > > >> >
> > > > > > >> > Thanks for your clarification, SGTM.
> > > > > > >> >
> > > > > > >> > Best regards,
> > > > > > >> >
> > > > > > >> > Weijie
> > > > > > >> >
> > > > > > >> >
> > > > > > >> > Weihua Hu <hu...@gmail.com> 于2023年3月6日周一 11:43写道:
> > > > > > >> >
> > > > > > >> > > Thanks Weijie.
> > > > > > >> > >
> > > > > > >> > > Heterogeneous task managers will not be considered in this
> > > FLIP
> > > > > since
> > > > > > >> > > it does not request heterogeneous resources as you said.
> > > > > > >> > >
> > > > > > >> > > My first thought is we can adjust the meaning of redundant
> > > > > > >> configuration
> > > > > > >> > > to redundant number of per resource type. These can be
> > > > considered
> > > > > in
> > > > > > >> > > detail when we decide to support heterogeneous task
> managers.
> > > > > > >> > >
> > > > > > >> > > Best,
> > > > > > >> > > Weihua
> > > > > > >> > >
> > > > > > >> > >
> > > > > > >> > > On Sat, Mar 4, 2023 at 1:13 AM weijie guo <
> > > > > guoweijiereswqa@gmail.com>
> > > > > > >> > > wrote:
> > > > > > >> > >
> > > > > > >> > > > Thanks Weihua for preparing this FLIP.
> > > > > > >> > > >
> > > > > > >> > > > This FLIP overall looks reasonable to me after updating
> as
> > > > > > >> suggested by
> > > > > > >> > > > Matthias.
> > > > > > >> > > >
> > > > > > >> > > > I only have one small question about keeping some
> redundant
> > > > task
> > > > > > >> > > managers:
> > > > > > >> > > > In the fine-grained resource management, theoretically,
> it
> > > can
> > > > > > >> support
> > > > > > >> > > > heterogeneous taskmanagers. When we complete the missing
> > > > > features
> > > > > > >> for
> > > > > > >> > > FGSM,
> > > > > > >> > > > do we plan to take this into account?
> > > > > > >> > > > Of course, if I remember correctly, FGSM will not
> request
> > > > > > >> heterogeneous
> > > > > > >> > > > resources at present, so it is also acceptable to me if
> > > there
> > > > > is no
> > > > > > >> > > special
> > > > > > >> > > > treatment now.
> > > > > > >> > > >
> > > > > > >> > > > +1 for this changes if we can ensure the test coverage.
> > > > > > >> > > >
> > > > > > >> > > > Best regards,
> > > > > > >> > > >
> > > > > > >> > > > Weijie
> > > > > > >> > > >
> > > > > > >> > > >
> > > > > > >> > > > John Roesler <vv...@apache.org> 于2023年3月2日周四
> 12:53写道:
> > > > > > >> > > >
> > > > > > >> > > > > Thanks for the test plan, Weihua!
> > > > > > >> > > > >
> > > > > > >> > > > > Yes, it addresses my concerns.
> > > > > > >> > > > >
> > > > > > >> > > > > Thanks,
> > > > > > >> > > > > John
> > > > > > >> > > > >
> > > > > > >> > > > > On Wed, Mar 1, 2023, at 22:38, Weihua Hu wrote:
> > > > > > >> > > > > > Hi, everyone,
> > > > > > >> > > > > > Thanks for your suggestions and ideas.
> > > > > > >> > > > > > Thanks Xintong for sharing the detailed backgrounds
> of
> > > > > > >> SlotManager.
> > > > > > >> > > > > >
> > > > > > >> > > > > > *@Matthias
> > > > > > >> > > > > >
> > > > > > >> > > > > > 1. Did you do a proper test coverage analysis?
> > > > > > >> > > > > >
> > > > > > >> > > > > >
> > > > > > >> > > > > > Just as Xintong said, we already have a CI stage for
> > > fine
> > > > > > >> grained
> > > > > > >> > > > > resource
> > > > > > >> > > > > > managers.
> > > > > > >> > > > > > And I will make sure FineGrainedSlotManager as the
> > > default
> > > > > > >> > > SlotManager
> > > > > > >> > > > > can
> > > > > > >> > > > > > pass all the tests of CI.
> > > > > > >> > > > > > In addition, I will review all unit tests of
> > > > > > >> > > > DeclarativeSlotManager(DSM)
> > > > > > >> > > > > to
> > > > > > >> > > > > > ensure that there are no gaps in the
> > > > > > >> > > > > > coverage provided by the FineGrainedSlotManager.
> > > > > > >> > > > > > I also added the 'Test Plan' part to the FLIP.
> > > > > > >> > > > > > @Matthias @John @Shammon Does this test plan address
> > > your
> > > > > > >> concerns?
> > > > > > >> > > > > >
> > > > > > >> > > > > > 2.  DeclarativeSlotManager and
> FineGrainedSlotManager
> > > feel
> > > > > quite
> > > > > > >> > big
> > > > > > >> > > in
> > > > > > >> > > > > >
> > > > > > >> > > > > > terms of lines of code....
> > > > > > >> > > > > >
> > > > > > >> > > > > >
> > > > > > >> > > > > > IMO, the refactoring of SlotManager does not belong
> to
> > > > this
> > > > > FLIP
> > > > > > >> > > since
> > > > > > >> > > > it
> > > > > > >> > > > > > may lead to some unstable risks. For
> > > > > > >> > > > > > FineGrainedSlotManager(FGSM), we already split some
> > > > > reasonable
> > > > > > >> > > > > components.
> > > > > > >> > > > > > They are:
> > > > > > >> > > > > > * TaskManagerTracker: Track task managers and their
> > > > > resources.
> > > > > > >> > > > > > * ResourceTracker: track requirements of jobs
> > > > > > >> > > > > > * ResourceAllocationStrategy: Try to fulfill the
> > > resource
> > > > > > >> > > requirements
> > > > > > >> > > > > with
> > > > > > >> > > > > > available/pending resources.
> > > > > > >> > > > > > * SlotStatusSyncer: communicate with TaskManager,
> for
> > > > > > >> > > > allocating/freeing
> > > > > > >> > > > > > slot and reconciling the slot status
> > > > > > >> > > > > > Maybe we can start a discussion about refactoring
> > > > > SlotManager in
> > > > > > >> > > > another
> > > > > > >> > > > > > FLIP if there are some good suggestions.
> > > > > > >> > > > > > WDYT
> > > > > > >> > > > > >
> > > > > > >> > > > > > 3. For me personally, having a more detailed summary
> > > > > comparing
> > > > > > >> the
> > > > > > >> > > > > >> subcomponents of both SlotManager implementations
> with
> > > > > where
> > > > > > >> > > > > >> their functionality matches and where they differ
> might
> > > > > help
> > > > > > >> > > > understand
> > > > > > >> > > > > the
> > > > > > >> > > > > >> consequences of the changes proposed in FLIP-298
> > > > > > >> > > > > >
> > > > > > >> > > > > > Good suggestion, I have updated the comparison in
> this
> > > > FLIP.
> > > > > > >> > Looking
> > > > > > >> > > > > > forward to any suggestions/thoughts
> > > > > > >> > > > > > if they are not described clearly.
> > > > > > >> > > > > >
> > > > > > >> > > > > > *@John
> > > > > > >> > > > > >
> > > > > > >> > > > > > 4. In addition to changing the default, would it
> make
> > > > sense
> > > > > to
> > > > > > >> log
> > > > > > >> > a
> > > > > > >> > > > > >> deprecation warning on initialization
> > > > > > >> > > > > >
> > > > > > >> > > > > > if the DeclarativeSlotManager is used?
> > > > > > >> > > > > >>
> > > > > > >> > > > > > SGTM, We should add Deprecated annotations to DSM
> for
> > > > devs.
> > > > > And
> > > > > > >> > log a
> > > > > > >> > > > > > deprecation warning for users.
> > > > > > >> > > > > >
> > > > > > >> > > > > > *@Shammon
> > > > > > >> > > > > >
> > > > > > >> > > > > > 1. For their functional differences, can you give
> some
> > > > > detailed
> > > > > > >> > tests
> > > > > > >> > > > to
> > > > > > >> > > > > >> verify that the new FineGrainedSlotManager has
> these
> > > > > > >> capabilities?
> > > > > > >> > > > This
> > > > > > >> > > > > can
> > > > > > >> > > > > >> effectively verify the new functions
> > > > > > >> > > > > >>
> > > > > > >> > > > > > As just maintained, there is already a CI stage of
> FGSM,
> > > > > and I
> > > > > > >> will
> > > > > > >> > > do
> > > > > > >> > > > > more
> > > > > > >> > > > > > review of unit tests for DSM.
> > > > > > >> > > > > >
> > > > > > >> > > > > >  2. I'm worried that many functions are not
> independent
> > > > and
> > > > > it
> > > > > > >> is
> > > > > > >> > > > > difficult
> > > > > > >> > > > > >> to migrate step-by-step. You can list the
> relationship
> > > > > between
> > > > > > >> > them
> > > > > > >> > > in
> > > > > > >> > > > > >> detail.
> > > > > > >> > > > > >
> > > > > > >> > > > > >  As Xintong saied the DSM is a subset of FGSM by
> design.
> > > > > But as
> > > > > > >> > time
> > > > > > >> > > > goes
> > > > > > >> > > > > > on, FGSM has some lacking
> > > > > > >> > > > > > functions as I listed in this FLIP. And I have
> added the
> > > > > > >> comparison
> > > > > > >> > > > > between
> > > > > > >> > > > > > DSM and FGSM in this FLIP.
> > > > > > >> > > > > >
> > > > > > >> > > > > >
> > > > > > >> > > > > > Thanks again for all your thoughts. Any feedback is
> > > > > appreciated!
> > > > > > >> > > > > >
> > > > > > >> > > > > > Best,
> > > > > > >> > > > > > Weihua
> > > > > > >> > > > > >
> > > > > > >> > > > > >
> > > > > > >> > > > > > On Wed, Mar 1, 2023 at 2:17 PM Xintong Song <
> > > > > > >> tonysong820@gmail.com
> > > > > > >> > >
> > > > > > >> > > > > wrote:
> > > > > > >> > > > > >
> > > > > > >> > > > > >> Thanks Weihua for preparing this FLIP. +1 for the
> > > > proposal.
> > > > > > >> > > > > >>
> > > > > > >> > > > > >>
> > > > > > >> > > > > >> As one of the contributors of the fine-grained slot
> > > > > manager,
> > > > > > >> I'd
> > > > > > >> > > like
> > > > > > >> > > > to
> > > > > > >> > > > > >> share some backgrounds here.
> > > > > > >> > > > > >>
> > > > > > >> > > > > >> - There used to be a defaut slot manager
> > > implementation,
> > > > > which
> > > > > > >> is
> > > > > > >> > > > > >> non-declarative and has been removed now. The two
> > > > features,
> > > > > > >> > > > declarative
> > > > > > >> > > > > /
> > > > > > >> > > > > >> reactive resource management and fine-grained
> resource
> > > > > > >> management,
> > > > > > >> > > > were
> > > > > > >> > > > > >> proposed at about the same time. We were aware
> that by
> > > > > design
> > > > > > >> the
> > > > > > >> > > > > >> declarative slot manager is a subset of
> fine-grained
> > > slot
> > > > > > >> manager
> > > > > > >> > at
> > > > > > >> > > > > that
> > > > > > >> > > > > >> time, but still decided to implement two slot
> managers
> > > > for
> > > > > the
> > > > > > >> > > purpose
> > > > > > >> > > > > of
> > > > > > >> > > > > >> decoupling efforts and reducing cross-team
> > > > synchronization
> > > > > > >> > overhead.
> > > > > > >> > > > > >> Merging the two slot managers once they are proved
> > > stable
> > > > > is
> > > > > > >> IMO a
> > > > > > >> > > > > >> technical debt that was planned at the very
> beginning.
> > > > > > >> > > > > >>
> > > > > > >> > > > > >> - The FineGrainedSlotManager has been verified in
> > > > Alibaba's
> > > > > > >> > internal
> > > > > > >> > > > > >> production as well as Alibaba Cloud services as the
> > > > default
> > > > > > >> slot
> > > > > > >> > > > manager
> > > > > > >> > > > > >> for about 2 years.
> > > > > > >> > > > > >>
> > > > > > >> > > > > >>
> > > > > > >> > > > > >> Concerning test cases, we currently have a ci
> stage for
> > > > > fine
> > > > > > >> > grained
> > > > > > >> > > > > >> resource management. To avoid adding too much
> burden,
> > > the
> > > > > stage
> > > > > > >> > only
> > > > > > >> > > > > >> includes tests from flink-runtime and flink-test
> > > > modules. I
> > > > > > >> think
> > > > > > >> > > > > switching
> > > > > > >> > > > > >> the default slot manager and applying the whole
> set of
> > > > > tests on
> > > > > > >> > the
> > > > > > >> > > > > >> fine-grained slot manager would help us to be more
> > > > > confident
> > > > > > >> about
> > > > > > >> > > it.
> > > > > > >> > > > > >>
> > > > > > >> > > > > >>
> > > > > > >> > > > > >> Concerning cutting out functionalities out of slot
> > > > > manager, I
> > > > > > >> > think
> > > > > > >> > > > > Yangze
> > > > > > >> > > > > >> and I have tried our best to shape the
> > > > > FineGrainedSlotManager
> > > > > > >> into
> > > > > > >> > > > > >> reasonable components. I personally don't have
> other
> > > > ideas
> > > > > to
> > > > > > >> > > further
> > > > > > >> > > > > >> disassemble the component, but I'm open to such
> > > > > suggestions.
> > > > > > >> > > However,
> > > > > > >> > > > > from
> > > > > > >> > > > > >> the stability perspective, I'd be in favor of not
> > > > > introducing
> > > > > > >> > > > > significant
> > > > > > >> > > > > >> changes to the FineGrainedSlotManager while
> switching
> > > it
> > > > > to the
> > > > > > >> > > > default.
> > > > > > >> > > > > >> Because the current implementation has already been
> > > > > verified
> > > > > > >> (or
> > > > > > >> > at
> > > > > > >> > > > > least
> > > > > > >> > > > > >> partially verified because Alibaba does not cover
> all
> > > the
> > > > > Flink
> > > > > > >> > use
> > > > > > >> > > > > cases),
> > > > > > >> > > > > >> and introducing more changes also means more
> chances of
> > > > > > >> breaking
> > > > > > >> > > > things.
> > > > > > >> > > > > >>
> > > > > > >> > > > > >>
> > > > > > >> > > > > >> Best,
> > > > > > >> > > > > >>
> > > > > > >> > > > > >> Xintong
> > > > > > >> > > > > >>
> > > > > > >> > > > > >>
> > > > > > >> > > > > >>
> > > > > > >> > > > > >> On Wed, Mar 1, 2023 at 11:12 AM Shammon FY <
> > > > > zjureel@gmail.com>
> > > > > > >> > > wrote:
> > > > > > >> > > > > >>
> > > > > > >> > > > > >> > Hi
> > > > > > >> > > > > >> >
> > > > > > >> > > > > >> > Thanks for starting this work weihua, I think
> > > unifying
> > > > > > >> > > > > >> > DeclarativeSlotManager and
> FineGrainedSlotManager is
> > > > > > >> valuable.
> > > > > > >> > > > > >> >
> > > > > > >> > > > > >> > I agree with @Matthias and @John that we need a
> way
> > > to
> > > > > ensure
> > > > > > >> > that
> > > > > > >> > > > > >> > DeclarativeSlotManager's capabilities are fully
> > > covered
> > > > > by
> > > > > > >> > > > > >> > FineGrainedSlotManager
> > > > > > >> > > > > >> >
> > > > > > >> > > > > >> > 1. For their functional differences, can you give
> > > some
> > > > > > >> detailed
> > > > > > >> > > > tests
> > > > > > >> > > > > to
> > > > > > >> > > > > >> > verify that the new FineGrainedSlotManager has
> these
> > > > > > >> > capabilities?
> > > > > > >> > > > > This
> > > > > > >> > > > > >> can
> > > > > > >> > > > > >> > effectively verify the new functions
> > > > > > >> > > > > >> >
> > > > > > >> > > > > >> > 2. I'm worried that many functions are not
> > > independent
> > > > > and
> > > > > > >> it is
> > > > > > >> > > > > >> difficult
> > > > > > >> > > > > >> > to migrate step-by-step. You can list the
> > > relationship
> > > > > > >> between
> > > > > > >> > > them
> > > > > > >> > > > in
> > > > > > >> > > > > >> > detail.
> > > > > > >> > > > > >> >
> > > > > > >> > > > > >> > 3. As John mentioned, give a smoke test for
> > > > > > >> > FineGrainedSlotManager
> > > > > > >> > > > is
> > > > > > >> > > > > a
> > > > > > >> > > > > >> > good idea. Or you can add some test information
> to
> > > the
> > > > > > >> > > > > >> > DeclarativeSlotManager to determine how many
> tests
> > > have
> > > > > used
> > > > > > >> it.
> > > > > > >> > > In
> > > > > > >> > > > > this
> > > > > > >> > > > > >> > way, we can gradually construct test cases for
> > > > > > >> > > > FineGrainedSlotManager
> > > > > > >> > > > > >> > during the development process.
> > > > > > >> > > > > >> >
> > > > > > >> > > > > >> >
> > > > > > >> > > > > >> > Best,
> > > > > > >> > > > > >> > Shammon
> > > > > > >> > > > > >> >
> > > > > > >> > > > > >> >
> > > > > > >> > > > > >> > On Tue, Feb 28, 2023 at 10:22 PM John Roesler <
> > > > > > >> > > vvcephei@apache.org>
> > > > > > >> > > > > >> wrote:
> > > > > > >> > > > > >> >
> > > > > > >> > > > > >> > > Thanks for the FLIP, Weihua!
> > > > > > >> > > > > >> > >
> > > > > > >> > > > > >> > > I’ve read the FLIP, and it sounds good to me.
> We
> > > need
> > > > > to
> > > > > > >> avoid
> > > > > > >> > > > > >> > > proliferating alternative implementations
> wherever
> > > > > > >> possible. I
> > > > > > >> > > > have
> > > > > > >> > > > > >> just
> > > > > > >> > > > > >> > a
> > > > > > >> > > > > >> > > couple of comments:
> > > > > > >> > > > > >> > >
> > > > > > >> > > > > >> > > 1. I share Matthias’s concern about ensuring
> the
> > > > > behavior
> > > > > > >> is
> > > > > > >> > > > really
> > > > > > >> > > > > the
> > > > > > >> > > > > >> > > same. One suggestion I’ve used for this kind of
> > > thing
> > > > > is,
> > > > > > >> as a
> > > > > > >> > > > smoke
> > > > > > >> > > > > >> > test,
> > > > > > >> > > > > >> > > to update the DeclarativeSlotManager to just
> > > delegate
> > > > > to
> > > > > > >> the
> > > > > > >> > > > > >> > > FineGrainedSlotManager. If the full test suite
> > > still
> > > > > > >> passes,
> > > > > > >> > you
> > > > > > >> > > > > can be
> > > > > > >> > > > > >> > > pretty sure the new default is really ok. It
> would
> > > > not
> > > > > be a
> > > > > > >> > good
> > > > > > >> > > > > idea
> > > > > > >> > > > > >> to
> > > > > > >> > > > > >> > > actually keep that in for the release, since it
> > > would
> > > > > > >> remove
> > > > > > >> > the
> > > > > > >> > > > > option
> > > > > > >> > > > > >> > to
> > > > > > >> > > > > >> > > fall back in case of bugs. Either way, we need
> to
> > > > make
> > > > > sure
> > > > > > >> > all
> > > > > > >> > > > test
> > > > > > >> > > > > >> > > scenarios are present for the FGSM.
> > > > > > >> > > > > >> > >
> > > > > > >> > > > > >> > > 4. In addition to changing the default, would
> it
> > > make
> > > > > > >> sense to
> > > > > > >> > > > log a
> > > > > > >> > > > > >> > > deprecation warning on initialization if the
> > > > > > >> > > > DeclarativeSlotManager
> > > > > > >> > > > > is
> > > > > > >> > > > > >> > used?
> > > > > > >> > > > > >> > >
> > > > > > >> > > > > >> > > Thanks again,
> > > > > > >> > > > > >> > > John
> > > > > > >> > > > > >> > >
> > > > > > >> > > > > >> > > On Tue, Feb 28, 2023, at 07:20, Matthias Pohl
> > > wrote:
> > > > > > >> > > > > >> > > > Hi Weihua,
> > > > > > >> > > > > >> > > > Thanks for your proposal. From a conceptual
> > > point:
> > > > > AFAIU,
> > > > > > >> > the
> > > > > > >> > > > > >> > > > DeclarativeSlotManager covers a subset (i.e.
> only
> > > > > evenly
> > > > > > >> > sized
> > > > > > >> > > > > slots)
> > > > > > >> > > > > >> > of
> > > > > > >> > > > > >> > > > what the FineGrainedSlotManager should be
> able to
> > > > > achieve
> > > > > > >> > > > > (variable
> > > > > > >> > > > > >> > slot
> > > > > > >> > > > > >> > > > size per task manager). Is this the right
> > > > > > >> > > > > assumption/understanding?
> > > > > > >> > > > > >> In
> > > > > > >> > > > > >> > > this
> > > > > > >> > > > > >> > > > sense, merging both implementations into a
> single
> > > > one
> > > > > > >> sounds
> > > > > > >> > > > > good. A
> > > > > > >> > > > > >> > few
> > > > > > >> > > > > >> > > > more general comments, though:
> > > > > > >> > > > > >> > > >
> > > > > > >> > > > > >> > > > 1. Did you do a proper test coverage
> analysis?
> > > > > That's not
> > > > > > >> > > > > mentioned
> > > > > > >> > > > > >> in
> > > > > > >> > > > > >> > > the
> > > > > > >> > > > > >> > > > current version of the FLIP. I'm bringing
> this up
> > > > > > >> because we
> > > > > > >> > > ran
> > > > > > >> > > > > into
> > > > > > >> > > > > >> > the
> > > > > > >> > > > > >> > > > same issue when fixing the flaws that popped
> up
> > > > after
> > > > > > >> > > > introducing
> > > > > > >> > > > > the
> > > > > > >> > > > > >> > > > multi-component leader election (see FLIP-285
> > > [1]).
> > > > > There
> > > > > > >> > is a
> > > > > > >> > > > > risk
> > > > > > >> > > > > >> > that
> > > > > > >> > > > > >> > > by
> > > > > > >> > > > > >> > > > removing the legacy code we decrease test
> > > coverage
> > > > > > >> because
> > > > > > >> > > > certain
> > > > > > >> > > > > >> > > > test cases that were covered for the legacy
> > > classes
> > > > > might
> > > > > > >> > not
> > > > > > >> > > be
> > > > > > >> > > > > >> > > > necessarily covered in the new
> implementation,
> > > yet
> > > > > (see
> > > > > > >> > > > > FLINK-30338
> > > > > > >> > > > > >> [2]
> > > > > > >> > > > > >> > > > which covers this issue for the leader
> election
> > > > > case).
> > > > > > >> > > Ideally,
> > > > > > >> > > > we
> > > > > > >> > > > > >> > don't
> > > > > > >> > > > > >> > > > want to remove test cases accidentally
> because
> > > they
> > > > > were
> > > > > > >> > only
> > > > > > >> > > > > >> > implemented
> > > > > > >> > > > > >> > > > for the DeclarativeSlotManager but missed
> for the
> > > > > > >> > > > > >> > FineGrainedSlotManager.
> > > > > > >> > > > > >> > > >
> > > > > > >> > > > > >> > > > 2. DeclarativeSlotManager and
> > > > FineGrainedSlotManager
> > > > > feel
> > > > > > >> > > quite
> > > > > > >> > > > > big
> > > > > > >> > > > > >> in
> > > > > > >> > > > > >> > > > terms of lines of code. Without knowing
> whether
> > > > it's
> > > > > > >> > actually
> > > > > > >> > > a
> > > > > > >> > > > > >> > > reasonable
> > > > > > >> > > > > >> > > > thing to do: Instead of just adding more
> features
> > > > to
> > > > > the
> > > > > > >> > > > > >> > > > FineGrainedSlotManager, have you thought of
> > > cutting
> > > > > out
> > > > > > >> > > > > functionality
> > > > > > >> > > > > >> > > into
> > > > > > >> > > > > >> > > > smaller sub-components along this
> refactoring?
> > > > Such a
> > > > > > >> > > > step-by-step
> > > > > > >> > > > > >> > > approach
> > > > > > >> > > > > >> > > > might improve the overall codebase and might
> make
> > > > > > >> reviewing
> > > > > > >> > > the
> > > > > > >> > > > > >> > > refactoring
> > > > > > >> > > > > >> > > > easier. I did a first pass over the code and
> > > > > struggled to
> > > > > > >> > > > identify
> > > > > > >> > > > > >> code
> > > > > > >> > > > > >> > > > blocks that could be moved out of the
> SlotManager
> > > > > > >> > > > > implementation(s).
> > > > > > >> > > > > >> > > > Therefore, I might be wrong with this
> proposal. I
> > > > > haven't
> > > > > > >> > > worked
> > > > > > >> > > > > on
> > > > > > >> > > > > >> > this
> > > > > > >> > > > > >> > > > codebase in that detail that it would allow
> me to
> > > > > come up
> > > > > > >> > > with a
> > > > > > >> > > > > >> > > judgement
> > > > > > >> > > > > >> > > > call. I wanted to bring it up, anyway,
> because
> > > I'm
> > > > > > >> curious
> > > > > > >> > > > whether
> > > > > > >> > > > > >> that
> > > > > > >> > > > > >> > > > could be an option. There's a comment
> created by
> > > > > Chesnay
> > > > > > >> > > (CC'd)
> > > > > > >> > > > in
> > > > > > >> > > > > >> the
> > > > > > >> > > > > >> > > > JavaDoc of TaskExecutorManager [3] indicating
> > > > > something
> > > > > > >> > > similar.
> > > > > > >> > > > > I'm
> > > > > > >> > > > > >> > > > wondering whether he can add some insights
> here.
> > > > > > >> > > > > >> > > >
> > > > > > >> > > > > >> > > > 3. For me personally, having a more detailed
> > > > summary
> > > > > > >> > comparing
> > > > > > >> > > > the
> > > > > > >> > > > > >> > > > subcomponents of both SlotManager
> implementations
> > > > > with
> > > > > > >> where
> > > > > > >> > > > > >> > > > their functionality matches and where they
> differ
> > > > > might
> > > > > > >> help
> > > > > > >> > > > > >> understand
> > > > > > >> > > > > >> > > the
> > > > > > >> > > > > >> > > > consequences of the changes proposed in
> FLIP-298.
> > > > > > >> > > > > >> > > >
> > > > > > >> > > > > >> > > > Best,
> > > > > > >> > > > > >> > > > Matthias
> > > > > > >> > > > > >> > > >
> > > > > > >> > > > > >> > > > [1]
> > > > > > >> > > > > >> > > >
> > > > > > >> > > > > >> > >
> > > > > > >> > > > > >> >
> > > > > > >> > > > > >>
> > > > > > >> > > > >
> > > > > > >> > > >
> > > > > > >> > >
> > > > > > >> >
> > > > > > >>
> > > > >
> > > >
> > >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-285%3A+Refactoring+LeaderElection+to+make+Flink+support+multi-component+leader+election+out-of-the-box
> > > > > > >> > > > > >> > > > [2]
> > > > > https://issues.apache.org/jira/browse/FLINK-30338
> > > > > > >> > > > > >> > > > [3]
> > > > > > >> > > > > >> > > >
> > > > > > >> > > > > >> > >
> > > > > > >> > > > > >> >
> > > > > > >> > > > > >>
> > > > > > >> > > > >
> > > > > > >> > > >
> > > > > > >> > >
> > > > > > >> >
> > > > > > >>
> > > > >
> > > >
> > >
> https://github.com/apache/flink/blob/f611ea8cb5deddb42429df2c99f0c68d7382e9bd/flink-runtime/src/main/java/org/apache/flink/runtime/resourcemanager/slotmanager/TaskExecutorManager.java#L66-L68
> > > > > > >> > > > > >> > > >
> > > > > > >> > > > > >> > > > On Tue, Feb 28, 2023 at 6:14 AM Matt Wang <
> > > > > > >> wangm92@163.com>
> > > > > > >> > > > > wrote:
> > > > > > >> > > > > >> > > >
> > > > > > >> > > > > >> > > >> This is a good proposal for me, it will
> make the
> > > > > code of
> > > > > > >> > the
> > > > > > >> > > > > >> > SlotManager
> > > > > > >> > > > > >> > > >> more clear.
> > > > > > >> > > > > >> > > >>
> > > > > > >> > > > > >> > > >>
> > > > > > >> > > > > >> > > >>
> > > > > > >> > > > > >> > > >> --
> > > > > > >> > > > > >> > > >>
> > > > > > >> > > > > >> > > >> Best,
> > > > > > >> > > > > >> > > >> Matt Wang
> > > > > > >> > > > > >> > > >>
> > > > > > >> > > > > >> > > >>
> > > > > > >> > > > > >> > > >> ---- Replied Message ----
> > > > > > >> > > > > >> > > >> | From | David Morávek<dm...@apache.org> |
> > > > > > >> > > > > >> > > >> | Date | 02/27/2023 22:45 |
> > > > > > >> > > > > >> > > >> | To | <de...@flink.apache.org> |
> > > > > > >> > > > > >> > > >> | Subject | Re: [DISCUSS] FLIP-298:
> Unifying the
> > > > > > >> > > Implementation
> > > > > > >> > > > > of
> > > > > > >> > > > > >> > > >> SlotManager |
> > > > > > >> > > > > >> > > >> Hi Weihua, I still need to dig into the
> details,
> > > > > but the
> > > > > > >> > > > overall
> > > > > > >> > > > > >> > > sentiment
> > > > > > >> > > > > >> > > >> of this change sounds reasonable.
> > > > > > >> > > > > >> > > >>
> > > > > > >> > > > > >> > > >> Best,
> > > > > > >> > > > > >> > > >> D.
> > > > > > >> > > > > >> > > >>
> > > > > > >> > > > > >> > > >> On Mon, Feb 27, 2023 at 2:26 PM Zhanghao
> Chen <
> > > > > > >> > > > > >> > > zhanghao.chen@outlook.com>
> > > > > > >> > > > > >> > > >> wrote:
> > > > > > >> > > > > >> > > >>
> > > > > > >> > > > > >> > > >> Thanks for driving this topic. I think this
> FLIP
> > > > > could
> > > > > > >> help
> > > > > > >> > > > > clean up
> > > > > > >> > > > > >> > the
> > > > > > >> > > > > >> > > >> codebase to make it easier to maintain. +1
> on
> > > it.
> > > > > > >> > > > > >> > > >>
> > > > > > >> > > > > >> > > >> Best,
> > > > > > >> > > > > >> > > >> Zhanghao Chen
> > > > > > >> > > > > >> > > >> ________________________________
> > > > > > >> > > > > >> > > >> From: Weihua Hu <hu...@gmail.com>
> > > > > > >> > > > > >> > > >> Sent: Monday, February 27, 2023 20:40
> > > > > > >> > > > > >> > > >> To: dev <de...@flink.apache.org>
> > > > > > >> > > > > >> > > >> Subject: [DISCUSS] FLIP-298: Unifying the
> > > > > > >> Implementation of
> > > > > > >> > > > > >> > SlotManager
> > > > > > >> > > > > >> > > >>
> > > > > > >> > > > > >> > > >> Hi everyone,
> > > > > > >> > > > > >> > > >>
> > > > > > >> > > > > >> > > >> I would like to begin a discussion on
> FLIP-298:
> > > > > Unifying
> > > > > > >> > the
> > > > > > >> > > > > >> > > Implementation
> > > > > > >> > > > > >> > > >> of SlotManager[1]. There are currently two
> types
> > > > of
> > > > > > >> > > SlotManager
> > > > > > >> > > > > in
> > > > > > >> > > > > >> > > Flink:
> > > > > > >> > > > > >> > > >> DeclarativeSlotManager and
> > > FineGrainedSlotManager.
> > > > > > >> > > > > >> > > FineGrainedSlotManager
> > > > > > >> > > > > >> > > >> should behave as DeclarativeSlotManager if
> the
> > > > user
> > > > > does
> > > > > > >> > not
> > > > > > >> > > > > >> configure
> > > > > > >> > > > > >> > > the
> > > > > > >> > > > > >> > > >> slot request profile.
> > > > > > >> > > > > >> > > >>
> > > > > > >> > > > > >> > > >> Therefore, this FLIP aims to unify the
> > > > > implementation of
> > > > > > >> > > > > SlotManager
> > > > > > >> > > > > >> > in
> > > > > > >> > > > > >> > > >> order to reduce maintenance costs.
> > > > > > >> > > > > >> > > >>
> > > > > > >> > > > > >> > > >> Looking forward to hearing from you.
> > > > > > >> > > > > >> > > >>
> > > > > > >> > > > > >> > > >> [1]
> > > > > > >> > > > > >> > > >>
> > > > > > >> > > > > >> > > >>
> > > > > > >> > > > > >> > > >>
> > > > > > >> > > > > >> > >
> > > > > > >> > > > > >> >
> > > > > > >> > > > > >>
> > > > > > >> > > > >
> > > > > > >> > > >
> > > > > > >> > >
> > > > > > >> >
> > > > > > >>
> > > > >
> > > >
> > >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-298%3A+Unifying+the+Implementation+of+SlotManager
> > > > > > >> > > > > >> > > >>
> > > > > > >> > > > > >> > > >> Best,
> > > > > > >> > > > > >> > > >> Weihua
> > > > > > >> > > > > >> > > >>
> > > > > > >> > > > > >> > > >>
> > > > > > >> > > > > >> > >
> > > > > > >> > > > > >> >
> > > > > > >> > > > > >>
> > > > > > >> > > > >
> > > > > > >> > > >
> > > > > > >> > >
> > > > > > >> >
> > > > > > >>
> > > > > > >
> > > > >
> > > >
> > >
>

Re: [DISCUSS] FLIP-298: Unifying the Implementation of SlotManager

Posted by Maximilian Michels <mx...@apache.org>.
+1 on the proposal.

I'm wondering about the scale down behavior of the unified
implementation. Does the new unified implementation prioritize
releasing entire task managers in favor of evenly spreading out task
managers? Consider a scale down from parallelism 15 to parallelism 10
where each task manager has 5 slots. We could either spread out 10
slots among the 3 task managers, or fit the 10 slots in 2 task
managers and surrender the task manager (at least on k8s). From a cost
saving perspective, the latter would be preferable.

-Max

On Thu, Mar 9, 2023 at 4:17 PM Matthias Pohl
<ma...@aiven.io.invalid> wrote:
>
> Thanks for your clarification. I have nothing else to add to the
> discussion. +1 from my side to proceed
>
> On Wed, Mar 8, 2023 at 4:16 AM Weihua Hu <hu...@gmail.com> wrote:
>
> > Thanks Yangze for your attention, this would be a great help.
> >
> > And thanks Matthias too.
> >
> > FLIP-156 [1] mentions some incompatibility between fine-grained resource
> > > management and reactive mode. I assume that this is independent of the
> > > SlotManager and replacing the DSM with the FGSM wouldn't affect reactive
> > > mode?
> >
> > Yes. This incompatibility is independent of SlotManager. That means the
> > AdpativeScheduler will always ignore the resource requirement set by
> > slotSharingGroup and declare Unknown ResourceProfile to SlotManager.
> > So, using FGSM as default will not affect reactive mode.
> >
> > About the heterogeneous TaskManager: This is a feature that's also not
> > > supported in the DSM right now, is it? We should state that fact in the
> > > FLIP if we mentioned that we don't want to implement it for the FSGM.
> >
> > Yes, both DSM and FGSM do not support request heterogeneous
> > TaskManager right now. Heterogeneous will make the resource allocation
> > logic more complicated, such as the resource deadlock if request A
> > allocated
> > the bigger slot B and then request B could not allocate the small slot A.
> > We
> > need to think more before starting to support the heterogeneous task
> > manager.
> > So, we don't want to implement heterogeneity in this FLIP.
> >
> > Best,
> > Weihua
> >
> >
> > On Wed, Mar 8, 2023 at 12:44 AM Matthias Pohl
> > <ma...@aiven.io.invalid> wrote:
> >
> > > Thanks for updating the FLIP and adding more context to it. Additionally,
> > > thanks to Xintong and Yangze for offering your expertise here as
> > > contributors to the initial FineGrainedSlotManager implementation.
> > >
> > > The remark on cutting out functionality was only based on some
> > superficial
> > > initial code reading. I cannot come up with a better code structure
> > myself.
> > > Therefore, I'm fine with not refactoring the code as part of this FLIP.
> > >
> > > The strategies that were proposed around making sure that the refactoring
> > > is properly backed by tests sound reasonable. My initial concern was
> > based
> > > on the fact that we might have unit test scenarios for the DSM that are
> > not
> > > covered in the unit tests of the FSGM. In that case, swapping the DSM
> > with
> > > the FSGM might not be good enough. Going over the GSM tests to make sure
> > > that we're not accidentally deleting test scenarios sounds good to me.
> > > Thanks, Weihua.
> > >
> > > FLIP-156 [1] mentions some incompatibility between fine-grained resource
> > > management and reactive mode. I assume that this is independent of the
> > > SlotManager and replacing the DSM with the FGSM wouldn't affect reactive
> > > mode?
> > >
> > > About the heterogeneous TaskManager: This is a feature that's also not
> > > supported in the DSM right now, is it? We should state that fact in the
> > > FLIP if we mentioned that we don't want to implement it for the FSGM.
> > >
> > > Best,
> > > Matthias
> > >
> > > [1]
> > >
> > >
> > https://cwiki.apache.org/confluence/display/FLINK/FLIP-156%3A+Runtime+Interfaces+for+Fine-Grained+Resource+Requirements
> > >
> > > On Tue, Mar 7, 2023 at 8:58 AM Yangze Guo <ka...@gmail.com> wrote:
> > >
> > > > Hi Weihua,
> > > >
> > > > Thanks for driving this. As Xintong mentioned, this was a technical
> > > > debt from FLIP-56.
> > > >
> > > > The latest version of FLIP sounds good, +1 from my side. As a
> > > > contributor to this component, I'm willing to assist with the review
> > > > process. Feel free to reach me if you need help.
> > > >
> > > > Best,
> > > > Yangze Guo
> > > >
> > > > On Tue, Mar 7, 2023 at 1:47 PM Weihua Hu <hu...@gmail.com>
> > wrote:
> > > > >
> > > > > Hi,
> > > > >
> > > > > @David @Matthias
> > > > > There are a few days after hearing your thoughts. I would like to
> > know
> > > if
> > > > > there are any other concerns about this FLIP.
> > > > >
> > > > >
> > > > > Best,
> > > > > Weihua
> > > > >
> > > > >
> > > > > On Mon, Mar 6, 2023 at 7:53 PM Weihua Hu <hu...@gmail.com>
> > > wrote:
> > > > >
> > > > > >
> > > > > > Thanks Shammon,
> > > > > >
> > > > > > I've updated FLIP to add this redundant Task Manager limitation.
> > > > > >
> > > > > >
> > > > > > Best,
> > > > > > Weihua
> > > > > >
> > > > > >
> > > > > > On Mon, Mar 6, 2023 at 5:07 PM Shammon FY <zj...@gmail.com>
> > wrote:
> > > > > >
> > > > > >> Hi weihua
> > > > > >>
> > > > > >> Can you add content related to `heterogeneous resources` to this
> > > > FLIP? We
> > > > > >> can record it and consider it in the future. It may be useful for
> > > some
> > > > > >> scenarios, such as the combination of streaming and ML.
> > > > > >>
> > > > > >> Best,
> > > > > >> Shammon
> > > > > >>
> > > > > >>
> > > > > >> On Mon, Mar 6, 2023 at 1:39 PM weijie guo <
> > > guoweijiereswqa@gmail.com>
> > > > > >> wrote:
> > > > > >>
> > > > > >> > Hi Weihua,
> > > > > >> >
> > > > > >> > Thanks for your clarification, SGTM.
> > > > > >> >
> > > > > >> > Best regards,
> > > > > >> >
> > > > > >> > Weijie
> > > > > >> >
> > > > > >> >
> > > > > >> > Weihua Hu <hu...@gmail.com> 于2023年3月6日周一 11:43写道:
> > > > > >> >
> > > > > >> > > Thanks Weijie.
> > > > > >> > >
> > > > > >> > > Heterogeneous task managers will not be considered in this
> > FLIP
> > > > since
> > > > > >> > > it does not request heterogeneous resources as you said.
> > > > > >> > >
> > > > > >> > > My first thought is we can adjust the meaning of redundant
> > > > > >> configuration
> > > > > >> > > to redundant number of per resource type. These can be
> > > considered
> > > > in
> > > > > >> > > detail when we decide to support heterogeneous task managers.
> > > > > >> > >
> > > > > >> > > Best,
> > > > > >> > > Weihua
> > > > > >> > >
> > > > > >> > >
> > > > > >> > > On Sat, Mar 4, 2023 at 1:13 AM weijie guo <
> > > > guoweijiereswqa@gmail.com>
> > > > > >> > > wrote:
> > > > > >> > >
> > > > > >> > > > Thanks Weihua for preparing this FLIP.
> > > > > >> > > >
> > > > > >> > > > This FLIP overall looks reasonable to me after updating as
> > > > > >> suggested by
> > > > > >> > > > Matthias.
> > > > > >> > > >
> > > > > >> > > > I only have one small question about keeping some redundant
> > > task
> > > > > >> > > managers:
> > > > > >> > > > In the fine-grained resource management, theoretically, it
> > can
> > > > > >> support
> > > > > >> > > > heterogeneous taskmanagers. When we complete the missing
> > > > features
> > > > > >> for
> > > > > >> > > FGSM,
> > > > > >> > > > do we plan to take this into account?
> > > > > >> > > > Of course, if I remember correctly, FGSM will not request
> > > > > >> heterogeneous
> > > > > >> > > > resources at present, so it is also acceptable to me if
> > there
> > > > is no
> > > > > >> > > special
> > > > > >> > > > treatment now.
> > > > > >> > > >
> > > > > >> > > > +1 for this changes if we can ensure the test coverage.
> > > > > >> > > >
> > > > > >> > > > Best regards,
> > > > > >> > > >
> > > > > >> > > > Weijie
> > > > > >> > > >
> > > > > >> > > >
> > > > > >> > > > John Roesler <vv...@apache.org> 于2023年3月2日周四 12:53写道:
> > > > > >> > > >
> > > > > >> > > > > Thanks for the test plan, Weihua!
> > > > > >> > > > >
> > > > > >> > > > > Yes, it addresses my concerns.
> > > > > >> > > > >
> > > > > >> > > > > Thanks,
> > > > > >> > > > > John
> > > > > >> > > > >
> > > > > >> > > > > On Wed, Mar 1, 2023, at 22:38, Weihua Hu wrote:
> > > > > >> > > > > > Hi, everyone,
> > > > > >> > > > > > Thanks for your suggestions and ideas.
> > > > > >> > > > > > Thanks Xintong for sharing the detailed backgrounds of
> > > > > >> SlotManager.
> > > > > >> > > > > >
> > > > > >> > > > > > *@Matthias
> > > > > >> > > > > >
> > > > > >> > > > > > 1. Did you do a proper test coverage analysis?
> > > > > >> > > > > >
> > > > > >> > > > > >
> > > > > >> > > > > > Just as Xintong said, we already have a CI stage for
> > fine
> > > > > >> grained
> > > > > >> > > > > resource
> > > > > >> > > > > > managers.
> > > > > >> > > > > > And I will make sure FineGrainedSlotManager as the
> > default
> > > > > >> > > SlotManager
> > > > > >> > > > > can
> > > > > >> > > > > > pass all the tests of CI.
> > > > > >> > > > > > In addition, I will review all unit tests of
> > > > > >> > > > DeclarativeSlotManager(DSM)
> > > > > >> > > > > to
> > > > > >> > > > > > ensure that there are no gaps in the
> > > > > >> > > > > > coverage provided by the FineGrainedSlotManager.
> > > > > >> > > > > > I also added the 'Test Plan' part to the FLIP.
> > > > > >> > > > > > @Matthias @John @Shammon Does this test plan address
> > your
> > > > > >> concerns?
> > > > > >> > > > > >
> > > > > >> > > > > > 2.  DeclarativeSlotManager and FineGrainedSlotManager
> > feel
> > > > quite
> > > > > >> > big
> > > > > >> > > in
> > > > > >> > > > > >
> > > > > >> > > > > > terms of lines of code....
> > > > > >> > > > > >
> > > > > >> > > > > >
> > > > > >> > > > > > IMO, the refactoring of SlotManager does not belong to
> > > this
> > > > FLIP
> > > > > >> > > since
> > > > > >> > > > it
> > > > > >> > > > > > may lead to some unstable risks. For
> > > > > >> > > > > > FineGrainedSlotManager(FGSM), we already split some
> > > > reasonable
> > > > > >> > > > > components.
> > > > > >> > > > > > They are:
> > > > > >> > > > > > * TaskManagerTracker: Track task managers and their
> > > > resources.
> > > > > >> > > > > > * ResourceTracker: track requirements of jobs
> > > > > >> > > > > > * ResourceAllocationStrategy: Try to fulfill the
> > resource
> > > > > >> > > requirements
> > > > > >> > > > > with
> > > > > >> > > > > > available/pending resources.
> > > > > >> > > > > > * SlotStatusSyncer: communicate with TaskManager, for
> > > > > >> > > > allocating/freeing
> > > > > >> > > > > > slot and reconciling the slot status
> > > > > >> > > > > > Maybe we can start a discussion about refactoring
> > > > SlotManager in
> > > > > >> > > > another
> > > > > >> > > > > > FLIP if there are some good suggestions.
> > > > > >> > > > > > WDYT
> > > > > >> > > > > >
> > > > > >> > > > > > 3. For me personally, having a more detailed summary
> > > > comparing
> > > > > >> the
> > > > > >> > > > > >> subcomponents of both SlotManager implementations with
> > > > where
> > > > > >> > > > > >> their functionality matches and where they differ might
> > > > help
> > > > > >> > > > understand
> > > > > >> > > > > the
> > > > > >> > > > > >> consequences of the changes proposed in FLIP-298
> > > > > >> > > > > >
> > > > > >> > > > > > Good suggestion, I have updated the comparison in this
> > > FLIP.
> > > > > >> > Looking
> > > > > >> > > > > > forward to any suggestions/thoughts
> > > > > >> > > > > > if they are not described clearly.
> > > > > >> > > > > >
> > > > > >> > > > > > *@John
> > > > > >> > > > > >
> > > > > >> > > > > > 4. In addition to changing the default, would it make
> > > sense
> > > > to
> > > > > >> log
> > > > > >> > a
> > > > > >> > > > > >> deprecation warning on initialization
> > > > > >> > > > > >
> > > > > >> > > > > > if the DeclarativeSlotManager is used?
> > > > > >> > > > > >>
> > > > > >> > > > > > SGTM, We should add Deprecated annotations to DSM for
> > > devs.
> > > > And
> > > > > >> > log a
> > > > > >> > > > > > deprecation warning for users.
> > > > > >> > > > > >
> > > > > >> > > > > > *@Shammon
> > > > > >> > > > > >
> > > > > >> > > > > > 1. For their functional differences, can you give some
> > > > detailed
> > > > > >> > tests
> > > > > >> > > > to
> > > > > >> > > > > >> verify that the new FineGrainedSlotManager has these
> > > > > >> capabilities?
> > > > > >> > > > This
> > > > > >> > > > > can
> > > > > >> > > > > >> effectively verify the new functions
> > > > > >> > > > > >>
> > > > > >> > > > > > As just maintained, there is already a CI stage of FGSM,
> > > > and I
> > > > > >> will
> > > > > >> > > do
> > > > > >> > > > > more
> > > > > >> > > > > > review of unit tests for DSM.
> > > > > >> > > > > >
> > > > > >> > > > > >  2. I'm worried that many functions are not independent
> > > and
> > > > it
> > > > > >> is
> > > > > >> > > > > difficult
> > > > > >> > > > > >> to migrate step-by-step. You can list the relationship
> > > > between
> > > > > >> > them
> > > > > >> > > in
> > > > > >> > > > > >> detail.
> > > > > >> > > > > >
> > > > > >> > > > > >  As Xintong saied the DSM is a subset of FGSM by design.
> > > > But as
> > > > > >> > time
> > > > > >> > > > goes
> > > > > >> > > > > > on, FGSM has some lacking
> > > > > >> > > > > > functions as I listed in this FLIP. And I have added the
> > > > > >> comparison
> > > > > >> > > > > between
> > > > > >> > > > > > DSM and FGSM in this FLIP.
> > > > > >> > > > > >
> > > > > >> > > > > >
> > > > > >> > > > > > Thanks again for all your thoughts. Any feedback is
> > > > appreciated!
> > > > > >> > > > > >
> > > > > >> > > > > > Best,
> > > > > >> > > > > > Weihua
> > > > > >> > > > > >
> > > > > >> > > > > >
> > > > > >> > > > > > On Wed, Mar 1, 2023 at 2:17 PM Xintong Song <
> > > > > >> tonysong820@gmail.com
> > > > > >> > >
> > > > > >> > > > > wrote:
> > > > > >> > > > > >
> > > > > >> > > > > >> Thanks Weihua for preparing this FLIP. +1 for the
> > > proposal.
> > > > > >> > > > > >>
> > > > > >> > > > > >>
> > > > > >> > > > > >> As one of the contributors of the fine-grained slot
> > > > manager,
> > > > > >> I'd
> > > > > >> > > like
> > > > > >> > > > to
> > > > > >> > > > > >> share some backgrounds here.
> > > > > >> > > > > >>
> > > > > >> > > > > >> - There used to be a defaut slot manager
> > implementation,
> > > > which
> > > > > >> is
> > > > > >> > > > > >> non-declarative and has been removed now. The two
> > > features,
> > > > > >> > > > declarative
> > > > > >> > > > > /
> > > > > >> > > > > >> reactive resource management and fine-grained resource
> > > > > >> management,
> > > > > >> > > > were
> > > > > >> > > > > >> proposed at about the same time. We were aware that by
> > > > design
> > > > > >> the
> > > > > >> > > > > >> declarative slot manager is a subset of fine-grained
> > slot
> > > > > >> manager
> > > > > >> > at
> > > > > >> > > > > that
> > > > > >> > > > > >> time, but still decided to implement two slot managers
> > > for
> > > > the
> > > > > >> > > purpose
> > > > > >> > > > > of
> > > > > >> > > > > >> decoupling efforts and reducing cross-team
> > > synchronization
> > > > > >> > overhead.
> > > > > >> > > > > >> Merging the two slot managers once they are proved
> > stable
> > > > is
> > > > > >> IMO a
> > > > > >> > > > > >> technical debt that was planned at the very beginning.
> > > > > >> > > > > >>
> > > > > >> > > > > >> - The FineGrainedSlotManager has been verified in
> > > Alibaba's
> > > > > >> > internal
> > > > > >> > > > > >> production as well as Alibaba Cloud services as the
> > > default
> > > > > >> slot
> > > > > >> > > > manager
> > > > > >> > > > > >> for about 2 years.
> > > > > >> > > > > >>
> > > > > >> > > > > >>
> > > > > >> > > > > >> Concerning test cases, we currently have a ci stage for
> > > > fine
> > > > > >> > grained
> > > > > >> > > > > >> resource management. To avoid adding too much burden,
> > the
> > > > stage
> > > > > >> > only
> > > > > >> > > > > >> includes tests from flink-runtime and flink-test
> > > modules. I
> > > > > >> think
> > > > > >> > > > > switching
> > > > > >> > > > > >> the default slot manager and applying the whole set of
> > > > tests on
> > > > > >> > the
> > > > > >> > > > > >> fine-grained slot manager would help us to be more
> > > > confident
> > > > > >> about
> > > > > >> > > it.
> > > > > >> > > > > >>
> > > > > >> > > > > >>
> > > > > >> > > > > >> Concerning cutting out functionalities out of slot
> > > > manager, I
> > > > > >> > think
> > > > > >> > > > > Yangze
> > > > > >> > > > > >> and I have tried our best to shape the
> > > > FineGrainedSlotManager
> > > > > >> into
> > > > > >> > > > > >> reasonable components. I personally don't have other
> > > ideas
> > > > to
> > > > > >> > > further
> > > > > >> > > > > >> disassemble the component, but I'm open to such
> > > > suggestions.
> > > > > >> > > However,
> > > > > >> > > > > from
> > > > > >> > > > > >> the stability perspective, I'd be in favor of not
> > > > introducing
> > > > > >> > > > > significant
> > > > > >> > > > > >> changes to the FineGrainedSlotManager while switching
> > it
> > > > to the
> > > > > >> > > > default.
> > > > > >> > > > > >> Because the current implementation has already been
> > > > verified
> > > > > >> (or
> > > > > >> > at
> > > > > >> > > > > least
> > > > > >> > > > > >> partially verified because Alibaba does not cover all
> > the
> > > > Flink
> > > > > >> > use
> > > > > >> > > > > cases),
> > > > > >> > > > > >> and introducing more changes also means more chances of
> > > > > >> breaking
> > > > > >> > > > things.
> > > > > >> > > > > >>
> > > > > >> > > > > >>
> > > > > >> > > > > >> Best,
> > > > > >> > > > > >>
> > > > > >> > > > > >> Xintong
> > > > > >> > > > > >>
> > > > > >> > > > > >>
> > > > > >> > > > > >>
> > > > > >> > > > > >> On Wed, Mar 1, 2023 at 11:12 AM Shammon FY <
> > > > zjureel@gmail.com>
> > > > > >> > > wrote:
> > > > > >> > > > > >>
> > > > > >> > > > > >> > Hi
> > > > > >> > > > > >> >
> > > > > >> > > > > >> > Thanks for starting this work weihua, I think
> > unifying
> > > > > >> > > > > >> > DeclarativeSlotManager and FineGrainedSlotManager is
> > > > > >> valuable.
> > > > > >> > > > > >> >
> > > > > >> > > > > >> > I agree with @Matthias and @John that we need a way
> > to
> > > > ensure
> > > > > >> > that
> > > > > >> > > > > >> > DeclarativeSlotManager's capabilities are fully
> > covered
> > > > by
> > > > > >> > > > > >> > FineGrainedSlotManager
> > > > > >> > > > > >> >
> > > > > >> > > > > >> > 1. For their functional differences, can you give
> > some
> > > > > >> detailed
> > > > > >> > > > tests
> > > > > >> > > > > to
> > > > > >> > > > > >> > verify that the new FineGrainedSlotManager has these
> > > > > >> > capabilities?
> > > > > >> > > > > This
> > > > > >> > > > > >> can
> > > > > >> > > > > >> > effectively verify the new functions
> > > > > >> > > > > >> >
> > > > > >> > > > > >> > 2. I'm worried that many functions are not
> > independent
> > > > and
> > > > > >> it is
> > > > > >> > > > > >> difficult
> > > > > >> > > > > >> > to migrate step-by-step. You can list the
> > relationship
> > > > > >> between
> > > > > >> > > them
> > > > > >> > > > in
> > > > > >> > > > > >> > detail.
> > > > > >> > > > > >> >
> > > > > >> > > > > >> > 3. As John mentioned, give a smoke test for
> > > > > >> > FineGrainedSlotManager
> > > > > >> > > > is
> > > > > >> > > > > a
> > > > > >> > > > > >> > good idea. Or you can add some test information to
> > the
> > > > > >> > > > > >> > DeclarativeSlotManager to determine how many tests
> > have
> > > > used
> > > > > >> it.
> > > > > >> > > In
> > > > > >> > > > > this
> > > > > >> > > > > >> > way, we can gradually construct test cases for
> > > > > >> > > > FineGrainedSlotManager
> > > > > >> > > > > >> > during the development process.
> > > > > >> > > > > >> >
> > > > > >> > > > > >> >
> > > > > >> > > > > >> > Best,
> > > > > >> > > > > >> > Shammon
> > > > > >> > > > > >> >
> > > > > >> > > > > >> >
> > > > > >> > > > > >> > On Tue, Feb 28, 2023 at 10:22 PM John Roesler <
> > > > > >> > > vvcephei@apache.org>
> > > > > >> > > > > >> wrote:
> > > > > >> > > > > >> >
> > > > > >> > > > > >> > > Thanks for the FLIP, Weihua!
> > > > > >> > > > > >> > >
> > > > > >> > > > > >> > > I’ve read the FLIP, and it sounds good to me. We
> > need
> > > > to
> > > > > >> avoid
> > > > > >> > > > > >> > > proliferating alternative implementations wherever
> > > > > >> possible. I
> > > > > >> > > > have
> > > > > >> > > > > >> just
> > > > > >> > > > > >> > a
> > > > > >> > > > > >> > > couple of comments:
> > > > > >> > > > > >> > >
> > > > > >> > > > > >> > > 1. I share Matthias’s concern about ensuring the
> > > > behavior
> > > > > >> is
> > > > > >> > > > really
> > > > > >> > > > > the
> > > > > >> > > > > >> > > same. One suggestion I’ve used for this kind of
> > thing
> > > > is,
> > > > > >> as a
> > > > > >> > > > smoke
> > > > > >> > > > > >> > test,
> > > > > >> > > > > >> > > to update the DeclarativeSlotManager to just
> > delegate
> > > > to
> > > > > >> the
> > > > > >> > > > > >> > > FineGrainedSlotManager. If the full test suite
> > still
> > > > > >> passes,
> > > > > >> > you
> > > > > >> > > > > can be
> > > > > >> > > > > >> > > pretty sure the new default is really ok. It would
> > > not
> > > > be a
> > > > > >> > good
> > > > > >> > > > > idea
> > > > > >> > > > > >> to
> > > > > >> > > > > >> > > actually keep that in for the release, since it
> > would
> > > > > >> remove
> > > > > >> > the
> > > > > >> > > > > option
> > > > > >> > > > > >> > to
> > > > > >> > > > > >> > > fall back in case of bugs. Either way, we need to
> > > make
> > > > sure
> > > > > >> > all
> > > > > >> > > > test
> > > > > >> > > > > >> > > scenarios are present for the FGSM.
> > > > > >> > > > > >> > >
> > > > > >> > > > > >> > > 4. In addition to changing the default, would it
> > make
> > > > > >> sense to
> > > > > >> > > > log a
> > > > > >> > > > > >> > > deprecation warning on initialization if the
> > > > > >> > > > DeclarativeSlotManager
> > > > > >> > > > > is
> > > > > >> > > > > >> > used?
> > > > > >> > > > > >> > >
> > > > > >> > > > > >> > > Thanks again,
> > > > > >> > > > > >> > > John
> > > > > >> > > > > >> > >
> > > > > >> > > > > >> > > On Tue, Feb 28, 2023, at 07:20, Matthias Pohl
> > wrote:
> > > > > >> > > > > >> > > > Hi Weihua,
> > > > > >> > > > > >> > > > Thanks for your proposal. From a conceptual
> > point:
> > > > AFAIU,
> > > > > >> > the
> > > > > >> > > > > >> > > > DeclarativeSlotManager covers a subset (i.e. only
> > > > evenly
> > > > > >> > sized
> > > > > >> > > > > slots)
> > > > > >> > > > > >> > of
> > > > > >> > > > > >> > > > what the FineGrainedSlotManager should be able to
> > > > achieve
> > > > > >> > > > > (variable
> > > > > >> > > > > >> > slot
> > > > > >> > > > > >> > > > size per task manager). Is this the right
> > > > > >> > > > > assumption/understanding?
> > > > > >> > > > > >> In
> > > > > >> > > > > >> > > this
> > > > > >> > > > > >> > > > sense, merging both implementations into a single
> > > one
> > > > > >> sounds
> > > > > >> > > > > good. A
> > > > > >> > > > > >> > few
> > > > > >> > > > > >> > > > more general comments, though:
> > > > > >> > > > > >> > > >
> > > > > >> > > > > >> > > > 1. Did you do a proper test coverage analysis?
> > > > That's not
> > > > > >> > > > > mentioned
> > > > > >> > > > > >> in
> > > > > >> > > > > >> > > the
> > > > > >> > > > > >> > > > current version of the FLIP. I'm bringing this up
> > > > > >> because we
> > > > > >> > > ran
> > > > > >> > > > > into
> > > > > >> > > > > >> > the
> > > > > >> > > > > >> > > > same issue when fixing the flaws that popped up
> > > after
> > > > > >> > > > introducing
> > > > > >> > > > > the
> > > > > >> > > > > >> > > > multi-component leader election (see FLIP-285
> > [1]).
> > > > There
> > > > > >> > is a
> > > > > >> > > > > risk
> > > > > >> > > > > >> > that
> > > > > >> > > > > >> > > by
> > > > > >> > > > > >> > > > removing the legacy code we decrease test
> > coverage
> > > > > >> because
> > > > > >> > > > certain
> > > > > >> > > > > >> > > > test cases that were covered for the legacy
> > classes
> > > > might
> > > > > >> > not
> > > > > >> > > be
> > > > > >> > > > > >> > > > necessarily covered in the new implementation,
> > yet
> > > > (see
> > > > > >> > > > > FLINK-30338
> > > > > >> > > > > >> [2]
> > > > > >> > > > > >> > > > which covers this issue for the leader election
> > > > case).
> > > > > >> > > Ideally,
> > > > > >> > > > we
> > > > > >> > > > > >> > don't
> > > > > >> > > > > >> > > > want to remove test cases accidentally because
> > they
> > > > were
> > > > > >> > only
> > > > > >> > > > > >> > implemented
> > > > > >> > > > > >> > > > for the DeclarativeSlotManager but missed for the
> > > > > >> > > > > >> > FineGrainedSlotManager.
> > > > > >> > > > > >> > > >
> > > > > >> > > > > >> > > > 2. DeclarativeSlotManager and
> > > FineGrainedSlotManager
> > > > feel
> > > > > >> > > quite
> > > > > >> > > > > big
> > > > > >> > > > > >> in
> > > > > >> > > > > >> > > > terms of lines of code. Without knowing whether
> > > it's
> > > > > >> > actually
> > > > > >> > > a
> > > > > >> > > > > >> > > reasonable
> > > > > >> > > > > >> > > > thing to do: Instead of just adding more features
> > > to
> > > > the
> > > > > >> > > > > >> > > > FineGrainedSlotManager, have you thought of
> > cutting
> > > > out
> > > > > >> > > > > functionality
> > > > > >> > > > > >> > > into
> > > > > >> > > > > >> > > > smaller sub-components along this refactoring?
> > > Such a
> > > > > >> > > > step-by-step
> > > > > >> > > > > >> > > approach
> > > > > >> > > > > >> > > > might improve the overall codebase and might make
> > > > > >> reviewing
> > > > > >> > > the
> > > > > >> > > > > >> > > refactoring
> > > > > >> > > > > >> > > > easier. I did a first pass over the code and
> > > > struggled to
> > > > > >> > > > identify
> > > > > >> > > > > >> code
> > > > > >> > > > > >> > > > blocks that could be moved out of the SlotManager
> > > > > >> > > > > implementation(s).
> > > > > >> > > > > >> > > > Therefore, I might be wrong with this proposal. I
> > > > haven't
> > > > > >> > > worked
> > > > > >> > > > > on
> > > > > >> > > > > >> > this
> > > > > >> > > > > >> > > > codebase in that detail that it would allow me to
> > > > come up
> > > > > >> > > with a
> > > > > >> > > > > >> > > judgement
> > > > > >> > > > > >> > > > call. I wanted to bring it up, anyway, because
> > I'm
> > > > > >> curious
> > > > > >> > > > whether
> > > > > >> > > > > >> that
> > > > > >> > > > > >> > > > could be an option. There's a comment created by
> > > > Chesnay
> > > > > >> > > (CC'd)
> > > > > >> > > > in
> > > > > >> > > > > >> the
> > > > > >> > > > > >> > > > JavaDoc of TaskExecutorManager [3] indicating
> > > > something
> > > > > >> > > similar.
> > > > > >> > > > > I'm
> > > > > >> > > > > >> > > > wondering whether he can add some insights here.
> > > > > >> > > > > >> > > >
> > > > > >> > > > > >> > > > 3. For me personally, having a more detailed
> > > summary
> > > > > >> > comparing
> > > > > >> > > > the
> > > > > >> > > > > >> > > > subcomponents of both SlotManager implementations
> > > > with
> > > > > >> where
> > > > > >> > > > > >> > > > their functionality matches and where they differ
> > > > might
> > > > > >> help
> > > > > >> > > > > >> understand
> > > > > >> > > > > >> > > the
> > > > > >> > > > > >> > > > consequences of the changes proposed in FLIP-298.
> > > > > >> > > > > >> > > >
> > > > > >> > > > > >> > > > Best,
> > > > > >> > > > > >> > > > Matthias
> > > > > >> > > > > >> > > >
> > > > > >> > > > > >> > > > [1]
> > > > > >> > > > > >> > > >
> > > > > >> > > > > >> > >
> > > > > >> > > > > >> >
> > > > > >> > > > > >>
> > > > > >> > > > >
> > > > > >> > > >
> > > > > >> > >
> > > > > >> >
> > > > > >>
> > > >
> > >
> > https://cwiki.apache.org/confluence/display/FLINK/FLIP-285%3A+Refactoring+LeaderElection+to+make+Flink+support+multi-component+leader+election+out-of-the-box
> > > > > >> > > > > >> > > > [2]
> > > > https://issues.apache.org/jira/browse/FLINK-30338
> > > > > >> > > > > >> > > > [3]
> > > > > >> > > > > >> > > >
> > > > > >> > > > > >> > >
> > > > > >> > > > > >> >
> > > > > >> > > > > >>
> > > > > >> > > > >
> > > > > >> > > >
> > > > > >> > >
> > > > > >> >
> > > > > >>
> > > >
> > >
> > https://github.com/apache/flink/blob/f611ea8cb5deddb42429df2c99f0c68d7382e9bd/flink-runtime/src/main/java/org/apache/flink/runtime/resourcemanager/slotmanager/TaskExecutorManager.java#L66-L68
> > > > > >> > > > > >> > > >
> > > > > >> > > > > >> > > > On Tue, Feb 28, 2023 at 6:14 AM Matt Wang <
> > > > > >> wangm92@163.com>
> > > > > >> > > > > wrote:
> > > > > >> > > > > >> > > >
> > > > > >> > > > > >> > > >> This is a good proposal for me, it will make the
> > > > code of
> > > > > >> > the
> > > > > >> > > > > >> > SlotManager
> > > > > >> > > > > >> > > >> more clear.
> > > > > >> > > > > >> > > >>
> > > > > >> > > > > >> > > >>
> > > > > >> > > > > >> > > >>
> > > > > >> > > > > >> > > >> --
> > > > > >> > > > > >> > > >>
> > > > > >> > > > > >> > > >> Best,
> > > > > >> > > > > >> > > >> Matt Wang
> > > > > >> > > > > >> > > >>
> > > > > >> > > > > >> > > >>
> > > > > >> > > > > >> > > >> ---- Replied Message ----
> > > > > >> > > > > >> > > >> | From | David Morávek<dm...@apache.org> |
> > > > > >> > > > > >> > > >> | Date | 02/27/2023 22:45 |
> > > > > >> > > > > >> > > >> | To | <de...@flink.apache.org> |
> > > > > >> > > > > >> > > >> | Subject | Re: [DISCUSS] FLIP-298: Unifying the
> > > > > >> > > Implementation
> > > > > >> > > > > of
> > > > > >> > > > > >> > > >> SlotManager |
> > > > > >> > > > > >> > > >> Hi Weihua, I still need to dig into the details,
> > > > but the
> > > > > >> > > > overall
> > > > > >> > > > > >> > > sentiment
> > > > > >> > > > > >> > > >> of this change sounds reasonable.
> > > > > >> > > > > >> > > >>
> > > > > >> > > > > >> > > >> Best,
> > > > > >> > > > > >> > > >> D.
> > > > > >> > > > > >> > > >>
> > > > > >> > > > > >> > > >> On Mon, Feb 27, 2023 at 2:26 PM Zhanghao Chen <
> > > > > >> > > > > >> > > zhanghao.chen@outlook.com>
> > > > > >> > > > > >> > > >> wrote:
> > > > > >> > > > > >> > > >>
> > > > > >> > > > > >> > > >> Thanks for driving this topic. I think this FLIP
> > > > could
> > > > > >> help
> > > > > >> > > > > clean up
> > > > > >> > > > > >> > the
> > > > > >> > > > > >> > > >> codebase to make it easier to maintain. +1 on
> > it.
> > > > > >> > > > > >> > > >>
> > > > > >> > > > > >> > > >> Best,
> > > > > >> > > > > >> > > >> Zhanghao Chen
> > > > > >> > > > > >> > > >> ________________________________
> > > > > >> > > > > >> > > >> From: Weihua Hu <hu...@gmail.com>
> > > > > >> > > > > >> > > >> Sent: Monday, February 27, 2023 20:40
> > > > > >> > > > > >> > > >> To: dev <de...@flink.apache.org>
> > > > > >> > > > > >> > > >> Subject: [DISCUSS] FLIP-298: Unifying the
> > > > > >> Implementation of
> > > > > >> > > > > >> > SlotManager
> > > > > >> > > > > >> > > >>
> > > > > >> > > > > >> > > >> Hi everyone,
> > > > > >> > > > > >> > > >>
> > > > > >> > > > > >> > > >> I would like to begin a discussion on FLIP-298:
> > > > Unifying
> > > > > >> > the
> > > > > >> > > > > >> > > Implementation
> > > > > >> > > > > >> > > >> of SlotManager[1]. There are currently two types
> > > of
> > > > > >> > > SlotManager
> > > > > >> > > > > in
> > > > > >> > > > > >> > > Flink:
> > > > > >> > > > > >> > > >> DeclarativeSlotManager and
> > FineGrainedSlotManager.
> > > > > >> > > > > >> > > FineGrainedSlotManager
> > > > > >> > > > > >> > > >> should behave as DeclarativeSlotManager if the
> > > user
> > > > does
> > > > > >> > not
> > > > > >> > > > > >> configure
> > > > > >> > > > > >> > > the
> > > > > >> > > > > >> > > >> slot request profile.
> > > > > >> > > > > >> > > >>
> > > > > >> > > > > >> > > >> Therefore, this FLIP aims to unify the
> > > > implementation of
> > > > > >> > > > > SlotManager
> > > > > >> > > > > >> > in
> > > > > >> > > > > >> > > >> order to reduce maintenance costs.
> > > > > >> > > > > >> > > >>
> > > > > >> > > > > >> > > >> Looking forward to hearing from you.
> > > > > >> > > > > >> > > >>
> > > > > >> > > > > >> > > >> [1]
> > > > > >> > > > > >> > > >>
> > > > > >> > > > > >> > > >>
> > > > > >> > > > > >> > > >>
> > > > > >> > > > > >> > >
> > > > > >> > > > > >> >
> > > > > >> > > > > >>
> > > > > >> > > > >
> > > > > >> > > >
> > > > > >> > >
> > > > > >> >
> > > > > >>
> > > >
> > >
> > https://cwiki.apache.org/confluence/display/FLINK/FLIP-298%3A+Unifying+the+Implementation+of+SlotManager
> > > > > >> > > > > >> > > >>
> > > > > >> > > > > >> > > >> Best,
> > > > > >> > > > > >> > > >> Weihua
> > > > > >> > > > > >> > > >>
> > > > > >> > > > > >> > > >>
> > > > > >> > > > > >> > >
> > > > > >> > > > > >> >
> > > > > >> > > > > >>
> > > > > >> > > > >
> > > > > >> > > >
> > > > > >> > >
> > > > > >> >
> > > > > >>
> > > > > >
> > > >
> > >
> >

Re: [DISCUSS] FLIP-298: Unifying the Implementation of SlotManager

Posted by Matthias Pohl <ma...@aiven.io.INVALID>.
Thanks for your clarification. I have nothing else to add to the
discussion. +1 from my side to proceed

On Wed, Mar 8, 2023 at 4:16 AM Weihua Hu <hu...@gmail.com> wrote:

> Thanks Yangze for your attention, this would be a great help.
>
> And thanks Matthias too.
>
> FLIP-156 [1] mentions some incompatibility between fine-grained resource
> > management and reactive mode. I assume that this is independent of the
> > SlotManager and replacing the DSM with the FGSM wouldn't affect reactive
> > mode?
>
> Yes. This incompatibility is independent of SlotManager. That means the
> AdpativeScheduler will always ignore the resource requirement set by
> slotSharingGroup and declare Unknown ResourceProfile to SlotManager.
> So, using FGSM as default will not affect reactive mode.
>
> About the heterogeneous TaskManager: This is a feature that's also not
> > supported in the DSM right now, is it? We should state that fact in the
> > FLIP if we mentioned that we don't want to implement it for the FSGM.
>
> Yes, both DSM and FGSM do not support request heterogeneous
> TaskManager right now. Heterogeneous will make the resource allocation
> logic more complicated, such as the resource deadlock if request A
> allocated
> the bigger slot B and then request B could not allocate the small slot A.
> We
> need to think more before starting to support the heterogeneous task
> manager.
> So, we don't want to implement heterogeneity in this FLIP.
>
> Best,
> Weihua
>
>
> On Wed, Mar 8, 2023 at 12:44 AM Matthias Pohl
> <ma...@aiven.io.invalid> wrote:
>
> > Thanks for updating the FLIP and adding more context to it. Additionally,
> > thanks to Xintong and Yangze for offering your expertise here as
> > contributors to the initial FineGrainedSlotManager implementation.
> >
> > The remark on cutting out functionality was only based on some
> superficial
> > initial code reading. I cannot come up with a better code structure
> myself.
> > Therefore, I'm fine with not refactoring the code as part of this FLIP.
> >
> > The strategies that were proposed around making sure that the refactoring
> > is properly backed by tests sound reasonable. My initial concern was
> based
> > on the fact that we might have unit test scenarios for the DSM that are
> not
> > covered in the unit tests of the FSGM. In that case, swapping the DSM
> with
> > the FSGM might not be good enough. Going over the GSM tests to make sure
> > that we're not accidentally deleting test scenarios sounds good to me.
> > Thanks, Weihua.
> >
> > FLIP-156 [1] mentions some incompatibility between fine-grained resource
> > management and reactive mode. I assume that this is independent of the
> > SlotManager and replacing the DSM with the FGSM wouldn't affect reactive
> > mode?
> >
> > About the heterogeneous TaskManager: This is a feature that's also not
> > supported in the DSM right now, is it? We should state that fact in the
> > FLIP if we mentioned that we don't want to implement it for the FSGM.
> >
> > Best,
> > Matthias
> >
> > [1]
> >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-156%3A+Runtime+Interfaces+for+Fine-Grained+Resource+Requirements
> >
> > On Tue, Mar 7, 2023 at 8:58 AM Yangze Guo <ka...@gmail.com> wrote:
> >
> > > Hi Weihua,
> > >
> > > Thanks for driving this. As Xintong mentioned, this was a technical
> > > debt from FLIP-56.
> > >
> > > The latest version of FLIP sounds good, +1 from my side. As a
> > > contributor to this component, I'm willing to assist with the review
> > > process. Feel free to reach me if you need help.
> > >
> > > Best,
> > > Yangze Guo
> > >
> > > On Tue, Mar 7, 2023 at 1:47 PM Weihua Hu <hu...@gmail.com>
> wrote:
> > > >
> > > > Hi,
> > > >
> > > > @David @Matthias
> > > > There are a few days after hearing your thoughts. I would like to
> know
> > if
> > > > there are any other concerns about this FLIP.
> > > >
> > > >
> > > > Best,
> > > > Weihua
> > > >
> > > >
> > > > On Mon, Mar 6, 2023 at 7:53 PM Weihua Hu <hu...@gmail.com>
> > wrote:
> > > >
> > > > >
> > > > > Thanks Shammon,
> > > > >
> > > > > I've updated FLIP to add this redundant Task Manager limitation.
> > > > >
> > > > >
> > > > > Best,
> > > > > Weihua
> > > > >
> > > > >
> > > > > On Mon, Mar 6, 2023 at 5:07 PM Shammon FY <zj...@gmail.com>
> wrote:
> > > > >
> > > > >> Hi weihua
> > > > >>
> > > > >> Can you add content related to `heterogeneous resources` to this
> > > FLIP? We
> > > > >> can record it and consider it in the future. It may be useful for
> > some
> > > > >> scenarios, such as the combination of streaming and ML.
> > > > >>
> > > > >> Best,
> > > > >> Shammon
> > > > >>
> > > > >>
> > > > >> On Mon, Mar 6, 2023 at 1:39 PM weijie guo <
> > guoweijiereswqa@gmail.com>
> > > > >> wrote:
> > > > >>
> > > > >> > Hi Weihua,
> > > > >> >
> > > > >> > Thanks for your clarification, SGTM.
> > > > >> >
> > > > >> > Best regards,
> > > > >> >
> > > > >> > Weijie
> > > > >> >
> > > > >> >
> > > > >> > Weihua Hu <hu...@gmail.com> 于2023年3月6日周一 11:43写道:
> > > > >> >
> > > > >> > > Thanks Weijie.
> > > > >> > >
> > > > >> > > Heterogeneous task managers will not be considered in this
> FLIP
> > > since
> > > > >> > > it does not request heterogeneous resources as you said.
> > > > >> > >
> > > > >> > > My first thought is we can adjust the meaning of redundant
> > > > >> configuration
> > > > >> > > to redundant number of per resource type. These can be
> > considered
> > > in
> > > > >> > > detail when we decide to support heterogeneous task managers.
> > > > >> > >
> > > > >> > > Best,
> > > > >> > > Weihua
> > > > >> > >
> > > > >> > >
> > > > >> > > On Sat, Mar 4, 2023 at 1:13 AM weijie guo <
> > > guoweijiereswqa@gmail.com>
> > > > >> > > wrote:
> > > > >> > >
> > > > >> > > > Thanks Weihua for preparing this FLIP.
> > > > >> > > >
> > > > >> > > > This FLIP overall looks reasonable to me after updating as
> > > > >> suggested by
> > > > >> > > > Matthias.
> > > > >> > > >
> > > > >> > > > I only have one small question about keeping some redundant
> > task
> > > > >> > > managers:
> > > > >> > > > In the fine-grained resource management, theoretically, it
> can
> > > > >> support
> > > > >> > > > heterogeneous taskmanagers. When we complete the missing
> > > features
> > > > >> for
> > > > >> > > FGSM,
> > > > >> > > > do we plan to take this into account?
> > > > >> > > > Of course, if I remember correctly, FGSM will not request
> > > > >> heterogeneous
> > > > >> > > > resources at present, so it is also acceptable to me if
> there
> > > is no
> > > > >> > > special
> > > > >> > > > treatment now.
> > > > >> > > >
> > > > >> > > > +1 for this changes if we can ensure the test coverage.
> > > > >> > > >
> > > > >> > > > Best regards,
> > > > >> > > >
> > > > >> > > > Weijie
> > > > >> > > >
> > > > >> > > >
> > > > >> > > > John Roesler <vv...@apache.org> 于2023年3月2日周四 12:53写道:
> > > > >> > > >
> > > > >> > > > > Thanks for the test plan, Weihua!
> > > > >> > > > >
> > > > >> > > > > Yes, it addresses my concerns.
> > > > >> > > > >
> > > > >> > > > > Thanks,
> > > > >> > > > > John
> > > > >> > > > >
> > > > >> > > > > On Wed, Mar 1, 2023, at 22:38, Weihua Hu wrote:
> > > > >> > > > > > Hi, everyone,
> > > > >> > > > > > Thanks for your suggestions and ideas.
> > > > >> > > > > > Thanks Xintong for sharing the detailed backgrounds of
> > > > >> SlotManager.
> > > > >> > > > > >
> > > > >> > > > > > *@Matthias
> > > > >> > > > > >
> > > > >> > > > > > 1. Did you do a proper test coverage analysis?
> > > > >> > > > > >
> > > > >> > > > > >
> > > > >> > > > > > Just as Xintong said, we already have a CI stage for
> fine
> > > > >> grained
> > > > >> > > > > resource
> > > > >> > > > > > managers.
> > > > >> > > > > > And I will make sure FineGrainedSlotManager as the
> default
> > > > >> > > SlotManager
> > > > >> > > > > can
> > > > >> > > > > > pass all the tests of CI.
> > > > >> > > > > > In addition, I will review all unit tests of
> > > > >> > > > DeclarativeSlotManager(DSM)
> > > > >> > > > > to
> > > > >> > > > > > ensure that there are no gaps in the
> > > > >> > > > > > coverage provided by the FineGrainedSlotManager.
> > > > >> > > > > > I also added the 'Test Plan' part to the FLIP.
> > > > >> > > > > > @Matthias @John @Shammon Does this test plan address
> your
> > > > >> concerns?
> > > > >> > > > > >
> > > > >> > > > > > 2.  DeclarativeSlotManager and FineGrainedSlotManager
> feel
> > > quite
> > > > >> > big
> > > > >> > > in
> > > > >> > > > > >
> > > > >> > > > > > terms of lines of code....
> > > > >> > > > > >
> > > > >> > > > > >
> > > > >> > > > > > IMO, the refactoring of SlotManager does not belong to
> > this
> > > FLIP
> > > > >> > > since
> > > > >> > > > it
> > > > >> > > > > > may lead to some unstable risks. For
> > > > >> > > > > > FineGrainedSlotManager(FGSM), we already split some
> > > reasonable
> > > > >> > > > > components.
> > > > >> > > > > > They are:
> > > > >> > > > > > * TaskManagerTracker: Track task managers and their
> > > resources.
> > > > >> > > > > > * ResourceTracker: track requirements of jobs
> > > > >> > > > > > * ResourceAllocationStrategy: Try to fulfill the
> resource
> > > > >> > > requirements
> > > > >> > > > > with
> > > > >> > > > > > available/pending resources.
> > > > >> > > > > > * SlotStatusSyncer: communicate with TaskManager, for
> > > > >> > > > allocating/freeing
> > > > >> > > > > > slot and reconciling the slot status
> > > > >> > > > > > Maybe we can start a discussion about refactoring
> > > SlotManager in
> > > > >> > > > another
> > > > >> > > > > > FLIP if there are some good suggestions.
> > > > >> > > > > > WDYT
> > > > >> > > > > >
> > > > >> > > > > > 3. For me personally, having a more detailed summary
> > > comparing
> > > > >> the
> > > > >> > > > > >> subcomponents of both SlotManager implementations with
> > > where
> > > > >> > > > > >> their functionality matches and where they differ might
> > > help
> > > > >> > > > understand
> > > > >> > > > > the
> > > > >> > > > > >> consequences of the changes proposed in FLIP-298
> > > > >> > > > > >
> > > > >> > > > > > Good suggestion, I have updated the comparison in this
> > FLIP.
> > > > >> > Looking
> > > > >> > > > > > forward to any suggestions/thoughts
> > > > >> > > > > > if they are not described clearly.
> > > > >> > > > > >
> > > > >> > > > > > *@John
> > > > >> > > > > >
> > > > >> > > > > > 4. In addition to changing the default, would it make
> > sense
> > > to
> > > > >> log
> > > > >> > a
> > > > >> > > > > >> deprecation warning on initialization
> > > > >> > > > > >
> > > > >> > > > > > if the DeclarativeSlotManager is used?
> > > > >> > > > > >>
> > > > >> > > > > > SGTM, We should add Deprecated annotations to DSM for
> > devs.
> > > And
> > > > >> > log a
> > > > >> > > > > > deprecation warning for users.
> > > > >> > > > > >
> > > > >> > > > > > *@Shammon
> > > > >> > > > > >
> > > > >> > > > > > 1. For their functional differences, can you give some
> > > detailed
> > > > >> > tests
> > > > >> > > > to
> > > > >> > > > > >> verify that the new FineGrainedSlotManager has these
> > > > >> capabilities?
> > > > >> > > > This
> > > > >> > > > > can
> > > > >> > > > > >> effectively verify the new functions
> > > > >> > > > > >>
> > > > >> > > > > > As just maintained, there is already a CI stage of FGSM,
> > > and I
> > > > >> will
> > > > >> > > do
> > > > >> > > > > more
> > > > >> > > > > > review of unit tests for DSM.
> > > > >> > > > > >
> > > > >> > > > > >  2. I'm worried that many functions are not independent
> > and
> > > it
> > > > >> is
> > > > >> > > > > difficult
> > > > >> > > > > >> to migrate step-by-step. You can list the relationship
> > > between
> > > > >> > them
> > > > >> > > in
> > > > >> > > > > >> detail.
> > > > >> > > > > >
> > > > >> > > > > >  As Xintong saied the DSM is a subset of FGSM by design.
> > > But as
> > > > >> > time
> > > > >> > > > goes
> > > > >> > > > > > on, FGSM has some lacking
> > > > >> > > > > > functions as I listed in this FLIP. And I have added the
> > > > >> comparison
> > > > >> > > > > between
> > > > >> > > > > > DSM and FGSM in this FLIP.
> > > > >> > > > > >
> > > > >> > > > > >
> > > > >> > > > > > Thanks again for all your thoughts. Any feedback is
> > > appreciated!
> > > > >> > > > > >
> > > > >> > > > > > Best,
> > > > >> > > > > > Weihua
> > > > >> > > > > >
> > > > >> > > > > >
> > > > >> > > > > > On Wed, Mar 1, 2023 at 2:17 PM Xintong Song <
> > > > >> tonysong820@gmail.com
> > > > >> > >
> > > > >> > > > > wrote:
> > > > >> > > > > >
> > > > >> > > > > >> Thanks Weihua for preparing this FLIP. +1 for the
> > proposal.
> > > > >> > > > > >>
> > > > >> > > > > >>
> > > > >> > > > > >> As one of the contributors of the fine-grained slot
> > > manager,
> > > > >> I'd
> > > > >> > > like
> > > > >> > > > to
> > > > >> > > > > >> share some backgrounds here.
> > > > >> > > > > >>
> > > > >> > > > > >> - There used to be a defaut slot manager
> implementation,
> > > which
> > > > >> is
> > > > >> > > > > >> non-declarative and has been removed now. The two
> > features,
> > > > >> > > > declarative
> > > > >> > > > > /
> > > > >> > > > > >> reactive resource management and fine-grained resource
> > > > >> management,
> > > > >> > > > were
> > > > >> > > > > >> proposed at about the same time. We were aware that by
> > > design
> > > > >> the
> > > > >> > > > > >> declarative slot manager is a subset of fine-grained
> slot
> > > > >> manager
> > > > >> > at
> > > > >> > > > > that
> > > > >> > > > > >> time, but still decided to implement two slot managers
> > for
> > > the
> > > > >> > > purpose
> > > > >> > > > > of
> > > > >> > > > > >> decoupling efforts and reducing cross-team
> > synchronization
> > > > >> > overhead.
> > > > >> > > > > >> Merging the two slot managers once they are proved
> stable
> > > is
> > > > >> IMO a
> > > > >> > > > > >> technical debt that was planned at the very beginning.
> > > > >> > > > > >>
> > > > >> > > > > >> - The FineGrainedSlotManager has been verified in
> > Alibaba's
> > > > >> > internal
> > > > >> > > > > >> production as well as Alibaba Cloud services as the
> > default
> > > > >> slot
> > > > >> > > > manager
> > > > >> > > > > >> for about 2 years.
> > > > >> > > > > >>
> > > > >> > > > > >>
> > > > >> > > > > >> Concerning test cases, we currently have a ci stage for
> > > fine
> > > > >> > grained
> > > > >> > > > > >> resource management. To avoid adding too much burden,
> the
> > > stage
> > > > >> > only
> > > > >> > > > > >> includes tests from flink-runtime and flink-test
> > modules. I
> > > > >> think
> > > > >> > > > > switching
> > > > >> > > > > >> the default slot manager and applying the whole set of
> > > tests on
> > > > >> > the
> > > > >> > > > > >> fine-grained slot manager would help us to be more
> > > confident
> > > > >> about
> > > > >> > > it.
> > > > >> > > > > >>
> > > > >> > > > > >>
> > > > >> > > > > >> Concerning cutting out functionalities out of slot
> > > manager, I
> > > > >> > think
> > > > >> > > > > Yangze
> > > > >> > > > > >> and I have tried our best to shape the
> > > FineGrainedSlotManager
> > > > >> into
> > > > >> > > > > >> reasonable components. I personally don't have other
> > ideas
> > > to
> > > > >> > > further
> > > > >> > > > > >> disassemble the component, but I'm open to such
> > > suggestions.
> > > > >> > > However,
> > > > >> > > > > from
> > > > >> > > > > >> the stability perspective, I'd be in favor of not
> > > introducing
> > > > >> > > > > significant
> > > > >> > > > > >> changes to the FineGrainedSlotManager while switching
> it
> > > to the
> > > > >> > > > default.
> > > > >> > > > > >> Because the current implementation has already been
> > > verified
> > > > >> (or
> > > > >> > at
> > > > >> > > > > least
> > > > >> > > > > >> partially verified because Alibaba does not cover all
> the
> > > Flink
> > > > >> > use
> > > > >> > > > > cases),
> > > > >> > > > > >> and introducing more changes also means more chances of
> > > > >> breaking
> > > > >> > > > things.
> > > > >> > > > > >>
> > > > >> > > > > >>
> > > > >> > > > > >> Best,
> > > > >> > > > > >>
> > > > >> > > > > >> Xintong
> > > > >> > > > > >>
> > > > >> > > > > >>
> > > > >> > > > > >>
> > > > >> > > > > >> On Wed, Mar 1, 2023 at 11:12 AM Shammon FY <
> > > zjureel@gmail.com>
> > > > >> > > wrote:
> > > > >> > > > > >>
> > > > >> > > > > >> > Hi
> > > > >> > > > > >> >
> > > > >> > > > > >> > Thanks for starting this work weihua, I think
> unifying
> > > > >> > > > > >> > DeclarativeSlotManager and FineGrainedSlotManager is
> > > > >> valuable.
> > > > >> > > > > >> >
> > > > >> > > > > >> > I agree with @Matthias and @John that we need a way
> to
> > > ensure
> > > > >> > that
> > > > >> > > > > >> > DeclarativeSlotManager's capabilities are fully
> covered
> > > by
> > > > >> > > > > >> > FineGrainedSlotManager
> > > > >> > > > > >> >
> > > > >> > > > > >> > 1. For their functional differences, can you give
> some
> > > > >> detailed
> > > > >> > > > tests
> > > > >> > > > > to
> > > > >> > > > > >> > verify that the new FineGrainedSlotManager has these
> > > > >> > capabilities?
> > > > >> > > > > This
> > > > >> > > > > >> can
> > > > >> > > > > >> > effectively verify the new functions
> > > > >> > > > > >> >
> > > > >> > > > > >> > 2. I'm worried that many functions are not
> independent
> > > and
> > > > >> it is
> > > > >> > > > > >> difficult
> > > > >> > > > > >> > to migrate step-by-step. You can list the
> relationship
> > > > >> between
> > > > >> > > them
> > > > >> > > > in
> > > > >> > > > > >> > detail.
> > > > >> > > > > >> >
> > > > >> > > > > >> > 3. As John mentioned, give a smoke test for
> > > > >> > FineGrainedSlotManager
> > > > >> > > > is
> > > > >> > > > > a
> > > > >> > > > > >> > good idea. Or you can add some test information to
> the
> > > > >> > > > > >> > DeclarativeSlotManager to determine how many tests
> have
> > > used
> > > > >> it.
> > > > >> > > In
> > > > >> > > > > this
> > > > >> > > > > >> > way, we can gradually construct test cases for
> > > > >> > > > FineGrainedSlotManager
> > > > >> > > > > >> > during the development process.
> > > > >> > > > > >> >
> > > > >> > > > > >> >
> > > > >> > > > > >> > Best,
> > > > >> > > > > >> > Shammon
> > > > >> > > > > >> >
> > > > >> > > > > >> >
> > > > >> > > > > >> > On Tue, Feb 28, 2023 at 10:22 PM John Roesler <
> > > > >> > > vvcephei@apache.org>
> > > > >> > > > > >> wrote:
> > > > >> > > > > >> >
> > > > >> > > > > >> > > Thanks for the FLIP, Weihua!
> > > > >> > > > > >> > >
> > > > >> > > > > >> > > I’ve read the FLIP, and it sounds good to me. We
> need
> > > to
> > > > >> avoid
> > > > >> > > > > >> > > proliferating alternative implementations wherever
> > > > >> possible. I
> > > > >> > > > have
> > > > >> > > > > >> just
> > > > >> > > > > >> > a
> > > > >> > > > > >> > > couple of comments:
> > > > >> > > > > >> > >
> > > > >> > > > > >> > > 1. I share Matthias’s concern about ensuring the
> > > behavior
> > > > >> is
> > > > >> > > > really
> > > > >> > > > > the
> > > > >> > > > > >> > > same. One suggestion I’ve used for this kind of
> thing
> > > is,
> > > > >> as a
> > > > >> > > > smoke
> > > > >> > > > > >> > test,
> > > > >> > > > > >> > > to update the DeclarativeSlotManager to just
> delegate
> > > to
> > > > >> the
> > > > >> > > > > >> > > FineGrainedSlotManager. If the full test suite
> still
> > > > >> passes,
> > > > >> > you
> > > > >> > > > > can be
> > > > >> > > > > >> > > pretty sure the new default is really ok. It would
> > not
> > > be a
> > > > >> > good
> > > > >> > > > > idea
> > > > >> > > > > >> to
> > > > >> > > > > >> > > actually keep that in for the release, since it
> would
> > > > >> remove
> > > > >> > the
> > > > >> > > > > option
> > > > >> > > > > >> > to
> > > > >> > > > > >> > > fall back in case of bugs. Either way, we need to
> > make
> > > sure
> > > > >> > all
> > > > >> > > > test
> > > > >> > > > > >> > > scenarios are present for the FGSM.
> > > > >> > > > > >> > >
> > > > >> > > > > >> > > 4. In addition to changing the default, would it
> make
> > > > >> sense to
> > > > >> > > > log a
> > > > >> > > > > >> > > deprecation warning on initialization if the
> > > > >> > > > DeclarativeSlotManager
> > > > >> > > > > is
> > > > >> > > > > >> > used?
> > > > >> > > > > >> > >
> > > > >> > > > > >> > > Thanks again,
> > > > >> > > > > >> > > John
> > > > >> > > > > >> > >
> > > > >> > > > > >> > > On Tue, Feb 28, 2023, at 07:20, Matthias Pohl
> wrote:
> > > > >> > > > > >> > > > Hi Weihua,
> > > > >> > > > > >> > > > Thanks for your proposal. From a conceptual
> point:
> > > AFAIU,
> > > > >> > the
> > > > >> > > > > >> > > > DeclarativeSlotManager covers a subset (i.e. only
> > > evenly
> > > > >> > sized
> > > > >> > > > > slots)
> > > > >> > > > > >> > of
> > > > >> > > > > >> > > > what the FineGrainedSlotManager should be able to
> > > achieve
> > > > >> > > > > (variable
> > > > >> > > > > >> > slot
> > > > >> > > > > >> > > > size per task manager). Is this the right
> > > > >> > > > > assumption/understanding?
> > > > >> > > > > >> In
> > > > >> > > > > >> > > this
> > > > >> > > > > >> > > > sense, merging both implementations into a single
> > one
> > > > >> sounds
> > > > >> > > > > good. A
> > > > >> > > > > >> > few
> > > > >> > > > > >> > > > more general comments, though:
> > > > >> > > > > >> > > >
> > > > >> > > > > >> > > > 1. Did you do a proper test coverage analysis?
> > > That's not
> > > > >> > > > > mentioned
> > > > >> > > > > >> in
> > > > >> > > > > >> > > the
> > > > >> > > > > >> > > > current version of the FLIP. I'm bringing this up
> > > > >> because we
> > > > >> > > ran
> > > > >> > > > > into
> > > > >> > > > > >> > the
> > > > >> > > > > >> > > > same issue when fixing the flaws that popped up
> > after
> > > > >> > > > introducing
> > > > >> > > > > the
> > > > >> > > > > >> > > > multi-component leader election (see FLIP-285
> [1]).
> > > There
> > > > >> > is a
> > > > >> > > > > risk
> > > > >> > > > > >> > that
> > > > >> > > > > >> > > by
> > > > >> > > > > >> > > > removing the legacy code we decrease test
> coverage
> > > > >> because
> > > > >> > > > certain
> > > > >> > > > > >> > > > test cases that were covered for the legacy
> classes
> > > might
> > > > >> > not
> > > > >> > > be
> > > > >> > > > > >> > > > necessarily covered in the new implementation,
> yet
> > > (see
> > > > >> > > > > FLINK-30338
> > > > >> > > > > >> [2]
> > > > >> > > > > >> > > > which covers this issue for the leader election
> > > case).
> > > > >> > > Ideally,
> > > > >> > > > we
> > > > >> > > > > >> > don't
> > > > >> > > > > >> > > > want to remove test cases accidentally because
> they
> > > were
> > > > >> > only
> > > > >> > > > > >> > implemented
> > > > >> > > > > >> > > > for the DeclarativeSlotManager but missed for the
> > > > >> > > > > >> > FineGrainedSlotManager.
> > > > >> > > > > >> > > >
> > > > >> > > > > >> > > > 2. DeclarativeSlotManager and
> > FineGrainedSlotManager
> > > feel
> > > > >> > > quite
> > > > >> > > > > big
> > > > >> > > > > >> in
> > > > >> > > > > >> > > > terms of lines of code. Without knowing whether
> > it's
> > > > >> > actually
> > > > >> > > a
> > > > >> > > > > >> > > reasonable
> > > > >> > > > > >> > > > thing to do: Instead of just adding more features
> > to
> > > the
> > > > >> > > > > >> > > > FineGrainedSlotManager, have you thought of
> cutting
> > > out
> > > > >> > > > > functionality
> > > > >> > > > > >> > > into
> > > > >> > > > > >> > > > smaller sub-components along this refactoring?
> > Such a
> > > > >> > > > step-by-step
> > > > >> > > > > >> > > approach
> > > > >> > > > > >> > > > might improve the overall codebase and might make
> > > > >> reviewing
> > > > >> > > the
> > > > >> > > > > >> > > refactoring
> > > > >> > > > > >> > > > easier. I did a first pass over the code and
> > > struggled to
> > > > >> > > > identify
> > > > >> > > > > >> code
> > > > >> > > > > >> > > > blocks that could be moved out of the SlotManager
> > > > >> > > > > implementation(s).
> > > > >> > > > > >> > > > Therefore, I might be wrong with this proposal. I
> > > haven't
> > > > >> > > worked
> > > > >> > > > > on
> > > > >> > > > > >> > this
> > > > >> > > > > >> > > > codebase in that detail that it would allow me to
> > > come up
> > > > >> > > with a
> > > > >> > > > > >> > > judgement
> > > > >> > > > > >> > > > call. I wanted to bring it up, anyway, because
> I'm
> > > > >> curious
> > > > >> > > > whether
> > > > >> > > > > >> that
> > > > >> > > > > >> > > > could be an option. There's a comment created by
> > > Chesnay
> > > > >> > > (CC'd)
> > > > >> > > > in
> > > > >> > > > > >> the
> > > > >> > > > > >> > > > JavaDoc of TaskExecutorManager [3] indicating
> > > something
> > > > >> > > similar.
> > > > >> > > > > I'm
> > > > >> > > > > >> > > > wondering whether he can add some insights here.
> > > > >> > > > > >> > > >
> > > > >> > > > > >> > > > 3. For me personally, having a more detailed
> > summary
> > > > >> > comparing
> > > > >> > > > the
> > > > >> > > > > >> > > > subcomponents of both SlotManager implementations
> > > with
> > > > >> where
> > > > >> > > > > >> > > > their functionality matches and where they differ
> > > might
> > > > >> help
> > > > >> > > > > >> understand
> > > > >> > > > > >> > > the
> > > > >> > > > > >> > > > consequences of the changes proposed in FLIP-298.
> > > > >> > > > > >> > > >
> > > > >> > > > > >> > > > Best,
> > > > >> > > > > >> > > > Matthias
> > > > >> > > > > >> > > >
> > > > >> > > > > >> > > > [1]
> > > > >> > > > > >> > > >
> > > > >> > > > > >> > >
> > > > >> > > > > >> >
> > > > >> > > > > >>
> > > > >> > > > >
> > > > >> > > >
> > > > >> > >
> > > > >> >
> > > > >>
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-285%3A+Refactoring+LeaderElection+to+make+Flink+support+multi-component+leader+election+out-of-the-box
> > > > >> > > > > >> > > > [2]
> > > https://issues.apache.org/jira/browse/FLINK-30338
> > > > >> > > > > >> > > > [3]
> > > > >> > > > > >> > > >
> > > > >> > > > > >> > >
> > > > >> > > > > >> >
> > > > >> > > > > >>
> > > > >> > > > >
> > > > >> > > >
> > > > >> > >
> > > > >> >
> > > > >>
> > >
> >
> https://github.com/apache/flink/blob/f611ea8cb5deddb42429df2c99f0c68d7382e9bd/flink-runtime/src/main/java/org/apache/flink/runtime/resourcemanager/slotmanager/TaskExecutorManager.java#L66-L68
> > > > >> > > > > >> > > >
> > > > >> > > > > >> > > > On Tue, Feb 28, 2023 at 6:14 AM Matt Wang <
> > > > >> wangm92@163.com>
> > > > >> > > > > wrote:
> > > > >> > > > > >> > > >
> > > > >> > > > > >> > > >> This is a good proposal for me, it will make the
> > > code of
> > > > >> > the
> > > > >> > > > > >> > SlotManager
> > > > >> > > > > >> > > >> more clear.
> > > > >> > > > > >> > > >>
> > > > >> > > > > >> > > >>
> > > > >> > > > > >> > > >>
> > > > >> > > > > >> > > >> --
> > > > >> > > > > >> > > >>
> > > > >> > > > > >> > > >> Best,
> > > > >> > > > > >> > > >> Matt Wang
> > > > >> > > > > >> > > >>
> > > > >> > > > > >> > > >>
> > > > >> > > > > >> > > >> ---- Replied Message ----
> > > > >> > > > > >> > > >> | From | David Morávek<dm...@apache.org> |
> > > > >> > > > > >> > > >> | Date | 02/27/2023 22:45 |
> > > > >> > > > > >> > > >> | To | <de...@flink.apache.org> |
> > > > >> > > > > >> > > >> | Subject | Re: [DISCUSS] FLIP-298: Unifying the
> > > > >> > > Implementation
> > > > >> > > > > of
> > > > >> > > > > >> > > >> SlotManager |
> > > > >> > > > > >> > > >> Hi Weihua, I still need to dig into the details,
> > > but the
> > > > >> > > > overall
> > > > >> > > > > >> > > sentiment
> > > > >> > > > > >> > > >> of this change sounds reasonable.
> > > > >> > > > > >> > > >>
> > > > >> > > > > >> > > >> Best,
> > > > >> > > > > >> > > >> D.
> > > > >> > > > > >> > > >>
> > > > >> > > > > >> > > >> On Mon, Feb 27, 2023 at 2:26 PM Zhanghao Chen <
> > > > >> > > > > >> > > zhanghao.chen@outlook.com>
> > > > >> > > > > >> > > >> wrote:
> > > > >> > > > > >> > > >>
> > > > >> > > > > >> > > >> Thanks for driving this topic. I think this FLIP
> > > could
> > > > >> help
> > > > >> > > > > clean up
> > > > >> > > > > >> > the
> > > > >> > > > > >> > > >> codebase to make it easier to maintain. +1 on
> it.
> > > > >> > > > > >> > > >>
> > > > >> > > > > >> > > >> Best,
> > > > >> > > > > >> > > >> Zhanghao Chen
> > > > >> > > > > >> > > >> ________________________________
> > > > >> > > > > >> > > >> From: Weihua Hu <hu...@gmail.com>
> > > > >> > > > > >> > > >> Sent: Monday, February 27, 2023 20:40
> > > > >> > > > > >> > > >> To: dev <de...@flink.apache.org>
> > > > >> > > > > >> > > >> Subject: [DISCUSS] FLIP-298: Unifying the
> > > > >> Implementation of
> > > > >> > > > > >> > SlotManager
> > > > >> > > > > >> > > >>
> > > > >> > > > > >> > > >> Hi everyone,
> > > > >> > > > > >> > > >>
> > > > >> > > > > >> > > >> I would like to begin a discussion on FLIP-298:
> > > Unifying
> > > > >> > the
> > > > >> > > > > >> > > Implementation
> > > > >> > > > > >> > > >> of SlotManager[1]. There are currently two types
> > of
> > > > >> > > SlotManager
> > > > >> > > > > in
> > > > >> > > > > >> > > Flink:
> > > > >> > > > > >> > > >> DeclarativeSlotManager and
> FineGrainedSlotManager.
> > > > >> > > > > >> > > FineGrainedSlotManager
> > > > >> > > > > >> > > >> should behave as DeclarativeSlotManager if the
> > user
> > > does
> > > > >> > not
> > > > >> > > > > >> configure
> > > > >> > > > > >> > > the
> > > > >> > > > > >> > > >> slot request profile.
> > > > >> > > > > >> > > >>
> > > > >> > > > > >> > > >> Therefore, this FLIP aims to unify the
> > > implementation of
> > > > >> > > > > SlotManager
> > > > >> > > > > >> > in
> > > > >> > > > > >> > > >> order to reduce maintenance costs.
> > > > >> > > > > >> > > >>
> > > > >> > > > > >> > > >> Looking forward to hearing from you.
> > > > >> > > > > >> > > >>
> > > > >> > > > > >> > > >> [1]
> > > > >> > > > > >> > > >>
> > > > >> > > > > >> > > >>
> > > > >> > > > > >> > > >>
> > > > >> > > > > >> > >
> > > > >> > > > > >> >
> > > > >> > > > > >>
> > > > >> > > > >
> > > > >> > > >
> > > > >> > >
> > > > >> >
> > > > >>
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-298%3A+Unifying+the+Implementation+of+SlotManager
> > > > >> > > > > >> > > >>
> > > > >> > > > > >> > > >> Best,
> > > > >> > > > > >> > > >> Weihua
> > > > >> > > > > >> > > >>
> > > > >> > > > > >> > > >>
> > > > >> > > > > >> > >
> > > > >> > > > > >> >
> > > > >> > > > > >>
> > > > >> > > > >
> > > > >> > > >
> > > > >> > >
> > > > >> >
> > > > >>
> > > > >
> > >
> >
>

Re: [DISCUSS] FLIP-298: Unifying the Implementation of SlotManager

Posted by Weihua Hu <hu...@gmail.com>.
Thanks Yangze for your attention, this would be a great help.

And thanks Matthias too.

FLIP-156 [1] mentions some incompatibility between fine-grained resource
> management and reactive mode. I assume that this is independent of the
> SlotManager and replacing the DSM with the FGSM wouldn't affect reactive
> mode?

Yes. This incompatibility is independent of SlotManager. That means the
AdpativeScheduler will always ignore the resource requirement set by
slotSharingGroup and declare Unknown ResourceProfile to SlotManager.
So, using FGSM as default will not affect reactive mode.

About the heterogeneous TaskManager: This is a feature that's also not
> supported in the DSM right now, is it? We should state that fact in the
> FLIP if we mentioned that we don't want to implement it for the FSGM.

Yes, both DSM and FGSM do not support request heterogeneous
TaskManager right now. Heterogeneous will make the resource allocation
logic more complicated, such as the resource deadlock if request A
allocated
the bigger slot B and then request B could not allocate the small slot A.
We
need to think more before starting to support the heterogeneous task
manager.
So, we don't want to implement heterogeneity in this FLIP.

Best,
Weihua


On Wed, Mar 8, 2023 at 12:44 AM Matthias Pohl
<ma...@aiven.io.invalid> wrote:

> Thanks for updating the FLIP and adding more context to it. Additionally,
> thanks to Xintong and Yangze for offering your expertise here as
> contributors to the initial FineGrainedSlotManager implementation.
>
> The remark on cutting out functionality was only based on some superficial
> initial code reading. I cannot come up with a better code structure myself.
> Therefore, I'm fine with not refactoring the code as part of this FLIP.
>
> The strategies that were proposed around making sure that the refactoring
> is properly backed by tests sound reasonable. My initial concern was based
> on the fact that we might have unit test scenarios for the DSM that are not
> covered in the unit tests of the FSGM. In that case, swapping the DSM with
> the FSGM might not be good enough. Going over the GSM tests to make sure
> that we're not accidentally deleting test scenarios sounds good to me.
> Thanks, Weihua.
>
> FLIP-156 [1] mentions some incompatibility between fine-grained resource
> management and reactive mode. I assume that this is independent of the
> SlotManager and replacing the DSM with the FGSM wouldn't affect reactive
> mode?
>
> About the heterogeneous TaskManager: This is a feature that's also not
> supported in the DSM right now, is it? We should state that fact in the
> FLIP if we mentioned that we don't want to implement it for the FSGM.
>
> Best,
> Matthias
>
> [1]
>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-156%3A+Runtime+Interfaces+for+Fine-Grained+Resource+Requirements
>
> On Tue, Mar 7, 2023 at 8:58 AM Yangze Guo <ka...@gmail.com> wrote:
>
> > Hi Weihua,
> >
> > Thanks for driving this. As Xintong mentioned, this was a technical
> > debt from FLIP-56.
> >
> > The latest version of FLIP sounds good, +1 from my side. As a
> > contributor to this component, I'm willing to assist with the review
> > process. Feel free to reach me if you need help.
> >
> > Best,
> > Yangze Guo
> >
> > On Tue, Mar 7, 2023 at 1:47 PM Weihua Hu <hu...@gmail.com> wrote:
> > >
> > > Hi,
> > >
> > > @David @Matthias
> > > There are a few days after hearing your thoughts. I would like to know
> if
> > > there are any other concerns about this FLIP.
> > >
> > >
> > > Best,
> > > Weihua
> > >
> > >
> > > On Mon, Mar 6, 2023 at 7:53 PM Weihua Hu <hu...@gmail.com>
> wrote:
> > >
> > > >
> > > > Thanks Shammon,
> > > >
> > > > I've updated FLIP to add this redundant Task Manager limitation.
> > > >
> > > >
> > > > Best,
> > > > Weihua
> > > >
> > > >
> > > > On Mon, Mar 6, 2023 at 5:07 PM Shammon FY <zj...@gmail.com> wrote:
> > > >
> > > >> Hi weihua
> > > >>
> > > >> Can you add content related to `heterogeneous resources` to this
> > FLIP? We
> > > >> can record it and consider it in the future. It may be useful for
> some
> > > >> scenarios, such as the combination of streaming and ML.
> > > >>
> > > >> Best,
> > > >> Shammon
> > > >>
> > > >>
> > > >> On Mon, Mar 6, 2023 at 1:39 PM weijie guo <
> guoweijiereswqa@gmail.com>
> > > >> wrote:
> > > >>
> > > >> > Hi Weihua,
> > > >> >
> > > >> > Thanks for your clarification, SGTM.
> > > >> >
> > > >> > Best regards,
> > > >> >
> > > >> > Weijie
> > > >> >
> > > >> >
> > > >> > Weihua Hu <hu...@gmail.com> 于2023年3月6日周一 11:43写道:
> > > >> >
> > > >> > > Thanks Weijie.
> > > >> > >
> > > >> > > Heterogeneous task managers will not be considered in this FLIP
> > since
> > > >> > > it does not request heterogeneous resources as you said.
> > > >> > >
> > > >> > > My first thought is we can adjust the meaning of redundant
> > > >> configuration
> > > >> > > to redundant number of per resource type. These can be
> considered
> > in
> > > >> > > detail when we decide to support heterogeneous task managers.
> > > >> > >
> > > >> > > Best,
> > > >> > > Weihua
> > > >> > >
> > > >> > >
> > > >> > > On Sat, Mar 4, 2023 at 1:13 AM weijie guo <
> > guoweijiereswqa@gmail.com>
> > > >> > > wrote:
> > > >> > >
> > > >> > > > Thanks Weihua for preparing this FLIP.
> > > >> > > >
> > > >> > > > This FLIP overall looks reasonable to me after updating as
> > > >> suggested by
> > > >> > > > Matthias.
> > > >> > > >
> > > >> > > > I only have one small question about keeping some redundant
> task
> > > >> > > managers:
> > > >> > > > In the fine-grained resource management, theoretically, it can
> > > >> support
> > > >> > > > heterogeneous taskmanagers. When we complete the missing
> > features
> > > >> for
> > > >> > > FGSM,
> > > >> > > > do we plan to take this into account?
> > > >> > > > Of course, if I remember correctly, FGSM will not request
> > > >> heterogeneous
> > > >> > > > resources at present, so it is also acceptable to me if there
> > is no
> > > >> > > special
> > > >> > > > treatment now.
> > > >> > > >
> > > >> > > > +1 for this changes if we can ensure the test coverage.
> > > >> > > >
> > > >> > > > Best regards,
> > > >> > > >
> > > >> > > > Weijie
> > > >> > > >
> > > >> > > >
> > > >> > > > John Roesler <vv...@apache.org> 于2023年3月2日周四 12:53写道:
> > > >> > > >
> > > >> > > > > Thanks for the test plan, Weihua!
> > > >> > > > >
> > > >> > > > > Yes, it addresses my concerns.
> > > >> > > > >
> > > >> > > > > Thanks,
> > > >> > > > > John
> > > >> > > > >
> > > >> > > > > On Wed, Mar 1, 2023, at 22:38, Weihua Hu wrote:
> > > >> > > > > > Hi, everyone,
> > > >> > > > > > Thanks for your suggestions and ideas.
> > > >> > > > > > Thanks Xintong for sharing the detailed backgrounds of
> > > >> SlotManager.
> > > >> > > > > >
> > > >> > > > > > *@Matthias
> > > >> > > > > >
> > > >> > > > > > 1. Did you do a proper test coverage analysis?
> > > >> > > > > >
> > > >> > > > > >
> > > >> > > > > > Just as Xintong said, we already have a CI stage for fine
> > > >> grained
> > > >> > > > > resource
> > > >> > > > > > managers.
> > > >> > > > > > And I will make sure FineGrainedSlotManager as the default
> > > >> > > SlotManager
> > > >> > > > > can
> > > >> > > > > > pass all the tests of CI.
> > > >> > > > > > In addition, I will review all unit tests of
> > > >> > > > DeclarativeSlotManager(DSM)
> > > >> > > > > to
> > > >> > > > > > ensure that there are no gaps in the
> > > >> > > > > > coverage provided by the FineGrainedSlotManager.
> > > >> > > > > > I also added the 'Test Plan' part to the FLIP.
> > > >> > > > > > @Matthias @John @Shammon Does this test plan address your
> > > >> concerns?
> > > >> > > > > >
> > > >> > > > > > 2.  DeclarativeSlotManager and FineGrainedSlotManager feel
> > quite
> > > >> > big
> > > >> > > in
> > > >> > > > > >
> > > >> > > > > > terms of lines of code....
> > > >> > > > > >
> > > >> > > > > >
> > > >> > > > > > IMO, the refactoring of SlotManager does not belong to
> this
> > FLIP
> > > >> > > since
> > > >> > > > it
> > > >> > > > > > may lead to some unstable risks. For
> > > >> > > > > > FineGrainedSlotManager(FGSM), we already split some
> > reasonable
> > > >> > > > > components.
> > > >> > > > > > They are:
> > > >> > > > > > * TaskManagerTracker: Track task managers and their
> > resources.
> > > >> > > > > > * ResourceTracker: track requirements of jobs
> > > >> > > > > > * ResourceAllocationStrategy: Try to fulfill the resource
> > > >> > > requirements
> > > >> > > > > with
> > > >> > > > > > available/pending resources.
> > > >> > > > > > * SlotStatusSyncer: communicate with TaskManager, for
> > > >> > > > allocating/freeing
> > > >> > > > > > slot and reconciling the slot status
> > > >> > > > > > Maybe we can start a discussion about refactoring
> > SlotManager in
> > > >> > > > another
> > > >> > > > > > FLIP if there are some good suggestions.
> > > >> > > > > > WDYT
> > > >> > > > > >
> > > >> > > > > > 3. For me personally, having a more detailed summary
> > comparing
> > > >> the
> > > >> > > > > >> subcomponents of both SlotManager implementations with
> > where
> > > >> > > > > >> their functionality matches and where they differ might
> > help
> > > >> > > > understand
> > > >> > > > > the
> > > >> > > > > >> consequences of the changes proposed in FLIP-298
> > > >> > > > > >
> > > >> > > > > > Good suggestion, I have updated the comparison in this
> FLIP.
> > > >> > Looking
> > > >> > > > > > forward to any suggestions/thoughts
> > > >> > > > > > if they are not described clearly.
> > > >> > > > > >
> > > >> > > > > > *@John
> > > >> > > > > >
> > > >> > > > > > 4. In addition to changing the default, would it make
> sense
> > to
> > > >> log
> > > >> > a
> > > >> > > > > >> deprecation warning on initialization
> > > >> > > > > >
> > > >> > > > > > if the DeclarativeSlotManager is used?
> > > >> > > > > >>
> > > >> > > > > > SGTM, We should add Deprecated annotations to DSM for
> devs.
> > And
> > > >> > log a
> > > >> > > > > > deprecation warning for users.
> > > >> > > > > >
> > > >> > > > > > *@Shammon
> > > >> > > > > >
> > > >> > > > > > 1. For their functional differences, can you give some
> > detailed
> > > >> > tests
> > > >> > > > to
> > > >> > > > > >> verify that the new FineGrainedSlotManager has these
> > > >> capabilities?
> > > >> > > > This
> > > >> > > > > can
> > > >> > > > > >> effectively verify the new functions
> > > >> > > > > >>
> > > >> > > > > > As just maintained, there is already a CI stage of FGSM,
> > and I
> > > >> will
> > > >> > > do
> > > >> > > > > more
> > > >> > > > > > review of unit tests for DSM.
> > > >> > > > > >
> > > >> > > > > >  2. I'm worried that many functions are not independent
> and
> > it
> > > >> is
> > > >> > > > > difficult
> > > >> > > > > >> to migrate step-by-step. You can list the relationship
> > between
> > > >> > them
> > > >> > > in
> > > >> > > > > >> detail.
> > > >> > > > > >
> > > >> > > > > >  As Xintong saied the DSM is a subset of FGSM by design.
> > But as
> > > >> > time
> > > >> > > > goes
> > > >> > > > > > on, FGSM has some lacking
> > > >> > > > > > functions as I listed in this FLIP. And I have added the
> > > >> comparison
> > > >> > > > > between
> > > >> > > > > > DSM and FGSM in this FLIP.
> > > >> > > > > >
> > > >> > > > > >
> > > >> > > > > > Thanks again for all your thoughts. Any feedback is
> > appreciated!
> > > >> > > > > >
> > > >> > > > > > Best,
> > > >> > > > > > Weihua
> > > >> > > > > >
> > > >> > > > > >
> > > >> > > > > > On Wed, Mar 1, 2023 at 2:17 PM Xintong Song <
> > > >> tonysong820@gmail.com
> > > >> > >
> > > >> > > > > wrote:
> > > >> > > > > >
> > > >> > > > > >> Thanks Weihua for preparing this FLIP. +1 for the
> proposal.
> > > >> > > > > >>
> > > >> > > > > >>
> > > >> > > > > >> As one of the contributors of the fine-grained slot
> > manager,
> > > >> I'd
> > > >> > > like
> > > >> > > > to
> > > >> > > > > >> share some backgrounds here.
> > > >> > > > > >>
> > > >> > > > > >> - There used to be a defaut slot manager implementation,
> > which
> > > >> is
> > > >> > > > > >> non-declarative and has been removed now. The two
> features,
> > > >> > > > declarative
> > > >> > > > > /
> > > >> > > > > >> reactive resource management and fine-grained resource
> > > >> management,
> > > >> > > > were
> > > >> > > > > >> proposed at about the same time. We were aware that by
> > design
> > > >> the
> > > >> > > > > >> declarative slot manager is a subset of fine-grained slot
> > > >> manager
> > > >> > at
> > > >> > > > > that
> > > >> > > > > >> time, but still decided to implement two slot managers
> for
> > the
> > > >> > > purpose
> > > >> > > > > of
> > > >> > > > > >> decoupling efforts and reducing cross-team
> synchronization
> > > >> > overhead.
> > > >> > > > > >> Merging the two slot managers once they are proved stable
> > is
> > > >> IMO a
> > > >> > > > > >> technical debt that was planned at the very beginning.
> > > >> > > > > >>
> > > >> > > > > >> - The FineGrainedSlotManager has been verified in
> Alibaba's
> > > >> > internal
> > > >> > > > > >> production as well as Alibaba Cloud services as the
> default
> > > >> slot
> > > >> > > > manager
> > > >> > > > > >> for about 2 years.
> > > >> > > > > >>
> > > >> > > > > >>
> > > >> > > > > >> Concerning test cases, we currently have a ci stage for
> > fine
> > > >> > grained
> > > >> > > > > >> resource management. To avoid adding too much burden, the
> > stage
> > > >> > only
> > > >> > > > > >> includes tests from flink-runtime and flink-test
> modules. I
> > > >> think
> > > >> > > > > switching
> > > >> > > > > >> the default slot manager and applying the whole set of
> > tests on
> > > >> > the
> > > >> > > > > >> fine-grained slot manager would help us to be more
> > confident
> > > >> about
> > > >> > > it.
> > > >> > > > > >>
> > > >> > > > > >>
> > > >> > > > > >> Concerning cutting out functionalities out of slot
> > manager, I
> > > >> > think
> > > >> > > > > Yangze
> > > >> > > > > >> and I have tried our best to shape the
> > FineGrainedSlotManager
> > > >> into
> > > >> > > > > >> reasonable components. I personally don't have other
> ideas
> > to
> > > >> > > further
> > > >> > > > > >> disassemble the component, but I'm open to such
> > suggestions.
> > > >> > > However,
> > > >> > > > > from
> > > >> > > > > >> the stability perspective, I'd be in favor of not
> > introducing
> > > >> > > > > significant
> > > >> > > > > >> changes to the FineGrainedSlotManager while switching it
> > to the
> > > >> > > > default.
> > > >> > > > > >> Because the current implementation has already been
> > verified
> > > >> (or
> > > >> > at
> > > >> > > > > least
> > > >> > > > > >> partially verified because Alibaba does not cover all the
> > Flink
> > > >> > use
> > > >> > > > > cases),
> > > >> > > > > >> and introducing more changes also means more chances of
> > > >> breaking
> > > >> > > > things.
> > > >> > > > > >>
> > > >> > > > > >>
> > > >> > > > > >> Best,
> > > >> > > > > >>
> > > >> > > > > >> Xintong
> > > >> > > > > >>
> > > >> > > > > >>
> > > >> > > > > >>
> > > >> > > > > >> On Wed, Mar 1, 2023 at 11:12 AM Shammon FY <
> > zjureel@gmail.com>
> > > >> > > wrote:
> > > >> > > > > >>
> > > >> > > > > >> > Hi
> > > >> > > > > >> >
> > > >> > > > > >> > Thanks for starting this work weihua, I think unifying
> > > >> > > > > >> > DeclarativeSlotManager and FineGrainedSlotManager is
> > > >> valuable.
> > > >> > > > > >> >
> > > >> > > > > >> > I agree with @Matthias and @John that we need a way to
> > ensure
> > > >> > that
> > > >> > > > > >> > DeclarativeSlotManager's capabilities are fully covered
> > by
> > > >> > > > > >> > FineGrainedSlotManager
> > > >> > > > > >> >
> > > >> > > > > >> > 1. For their functional differences, can you give some
> > > >> detailed
> > > >> > > > tests
> > > >> > > > > to
> > > >> > > > > >> > verify that the new FineGrainedSlotManager has these
> > > >> > capabilities?
> > > >> > > > > This
> > > >> > > > > >> can
> > > >> > > > > >> > effectively verify the new functions
> > > >> > > > > >> >
> > > >> > > > > >> > 2. I'm worried that many functions are not independent
> > and
> > > >> it is
> > > >> > > > > >> difficult
> > > >> > > > > >> > to migrate step-by-step. You can list the relationship
> > > >> between
> > > >> > > them
> > > >> > > > in
> > > >> > > > > >> > detail.
> > > >> > > > > >> >
> > > >> > > > > >> > 3. As John mentioned, give a smoke test for
> > > >> > FineGrainedSlotManager
> > > >> > > > is
> > > >> > > > > a
> > > >> > > > > >> > good idea. Or you can add some test information to the
> > > >> > > > > >> > DeclarativeSlotManager to determine how many tests have
> > used
> > > >> it.
> > > >> > > In
> > > >> > > > > this
> > > >> > > > > >> > way, we can gradually construct test cases for
> > > >> > > > FineGrainedSlotManager
> > > >> > > > > >> > during the development process.
> > > >> > > > > >> >
> > > >> > > > > >> >
> > > >> > > > > >> > Best,
> > > >> > > > > >> > Shammon
> > > >> > > > > >> >
> > > >> > > > > >> >
> > > >> > > > > >> > On Tue, Feb 28, 2023 at 10:22 PM John Roesler <
> > > >> > > vvcephei@apache.org>
> > > >> > > > > >> wrote:
> > > >> > > > > >> >
> > > >> > > > > >> > > Thanks for the FLIP, Weihua!
> > > >> > > > > >> > >
> > > >> > > > > >> > > I’ve read the FLIP, and it sounds good to me. We need
> > to
> > > >> avoid
> > > >> > > > > >> > > proliferating alternative implementations wherever
> > > >> possible. I
> > > >> > > > have
> > > >> > > > > >> just
> > > >> > > > > >> > a
> > > >> > > > > >> > > couple of comments:
> > > >> > > > > >> > >
> > > >> > > > > >> > > 1. I share Matthias’s concern about ensuring the
> > behavior
> > > >> is
> > > >> > > > really
> > > >> > > > > the
> > > >> > > > > >> > > same. One suggestion I’ve used for this kind of thing
> > is,
> > > >> as a
> > > >> > > > smoke
> > > >> > > > > >> > test,
> > > >> > > > > >> > > to update the DeclarativeSlotManager to just delegate
> > to
> > > >> the
> > > >> > > > > >> > > FineGrainedSlotManager. If the full test suite still
> > > >> passes,
> > > >> > you
> > > >> > > > > can be
> > > >> > > > > >> > > pretty sure the new default is really ok. It would
> not
> > be a
> > > >> > good
> > > >> > > > > idea
> > > >> > > > > >> to
> > > >> > > > > >> > > actually keep that in for the release, since it would
> > > >> remove
> > > >> > the
> > > >> > > > > option
> > > >> > > > > >> > to
> > > >> > > > > >> > > fall back in case of bugs. Either way, we need to
> make
> > sure
> > > >> > all
> > > >> > > > test
> > > >> > > > > >> > > scenarios are present for the FGSM.
> > > >> > > > > >> > >
> > > >> > > > > >> > > 4. In addition to changing the default, would it make
> > > >> sense to
> > > >> > > > log a
> > > >> > > > > >> > > deprecation warning on initialization if the
> > > >> > > > DeclarativeSlotManager
> > > >> > > > > is
> > > >> > > > > >> > used?
> > > >> > > > > >> > >
> > > >> > > > > >> > > Thanks again,
> > > >> > > > > >> > > John
> > > >> > > > > >> > >
> > > >> > > > > >> > > On Tue, Feb 28, 2023, at 07:20, Matthias Pohl wrote:
> > > >> > > > > >> > > > Hi Weihua,
> > > >> > > > > >> > > > Thanks for your proposal. From a conceptual point:
> > AFAIU,
> > > >> > the
> > > >> > > > > >> > > > DeclarativeSlotManager covers a subset (i.e. only
> > evenly
> > > >> > sized
> > > >> > > > > slots)
> > > >> > > > > >> > of
> > > >> > > > > >> > > > what the FineGrainedSlotManager should be able to
> > achieve
> > > >> > > > > (variable
> > > >> > > > > >> > slot
> > > >> > > > > >> > > > size per task manager). Is this the right
> > > >> > > > > assumption/understanding?
> > > >> > > > > >> In
> > > >> > > > > >> > > this
> > > >> > > > > >> > > > sense, merging both implementations into a single
> one
> > > >> sounds
> > > >> > > > > good. A
> > > >> > > > > >> > few
> > > >> > > > > >> > > > more general comments, though:
> > > >> > > > > >> > > >
> > > >> > > > > >> > > > 1. Did you do a proper test coverage analysis?
> > That's not
> > > >> > > > > mentioned
> > > >> > > > > >> in
> > > >> > > > > >> > > the
> > > >> > > > > >> > > > current version of the FLIP. I'm bringing this up
> > > >> because we
> > > >> > > ran
> > > >> > > > > into
> > > >> > > > > >> > the
> > > >> > > > > >> > > > same issue when fixing the flaws that popped up
> after
> > > >> > > > introducing
> > > >> > > > > the
> > > >> > > > > >> > > > multi-component leader election (see FLIP-285 [1]).
> > There
> > > >> > is a
> > > >> > > > > risk
> > > >> > > > > >> > that
> > > >> > > > > >> > > by
> > > >> > > > > >> > > > removing the legacy code we decrease test coverage
> > > >> because
> > > >> > > > certain
> > > >> > > > > >> > > > test cases that were covered for the legacy classes
> > might
> > > >> > not
> > > >> > > be
> > > >> > > > > >> > > > necessarily covered in the new implementation, yet
> > (see
> > > >> > > > > FLINK-30338
> > > >> > > > > >> [2]
> > > >> > > > > >> > > > which covers this issue for the leader election
> > case).
> > > >> > > Ideally,
> > > >> > > > we
> > > >> > > > > >> > don't
> > > >> > > > > >> > > > want to remove test cases accidentally because they
> > were
> > > >> > only
> > > >> > > > > >> > implemented
> > > >> > > > > >> > > > for the DeclarativeSlotManager but missed for the
> > > >> > > > > >> > FineGrainedSlotManager.
> > > >> > > > > >> > > >
> > > >> > > > > >> > > > 2. DeclarativeSlotManager and
> FineGrainedSlotManager
> > feel
> > > >> > > quite
> > > >> > > > > big
> > > >> > > > > >> in
> > > >> > > > > >> > > > terms of lines of code. Without knowing whether
> it's
> > > >> > actually
> > > >> > > a
> > > >> > > > > >> > > reasonable
> > > >> > > > > >> > > > thing to do: Instead of just adding more features
> to
> > the
> > > >> > > > > >> > > > FineGrainedSlotManager, have you thought of cutting
> > out
> > > >> > > > > functionality
> > > >> > > > > >> > > into
> > > >> > > > > >> > > > smaller sub-components along this refactoring?
> Such a
> > > >> > > > step-by-step
> > > >> > > > > >> > > approach
> > > >> > > > > >> > > > might improve the overall codebase and might make
> > > >> reviewing
> > > >> > > the
> > > >> > > > > >> > > refactoring
> > > >> > > > > >> > > > easier. I did a first pass over the code and
> > struggled to
> > > >> > > > identify
> > > >> > > > > >> code
> > > >> > > > > >> > > > blocks that could be moved out of the SlotManager
> > > >> > > > > implementation(s).
> > > >> > > > > >> > > > Therefore, I might be wrong with this proposal. I
> > haven't
> > > >> > > worked
> > > >> > > > > on
> > > >> > > > > >> > this
> > > >> > > > > >> > > > codebase in that detail that it would allow me to
> > come up
> > > >> > > with a
> > > >> > > > > >> > > judgement
> > > >> > > > > >> > > > call. I wanted to bring it up, anyway, because I'm
> > > >> curious
> > > >> > > > whether
> > > >> > > > > >> that
> > > >> > > > > >> > > > could be an option. There's a comment created by
> > Chesnay
> > > >> > > (CC'd)
> > > >> > > > in
> > > >> > > > > >> the
> > > >> > > > > >> > > > JavaDoc of TaskExecutorManager [3] indicating
> > something
> > > >> > > similar.
> > > >> > > > > I'm
> > > >> > > > > >> > > > wondering whether he can add some insights here.
> > > >> > > > > >> > > >
> > > >> > > > > >> > > > 3. For me personally, having a more detailed
> summary
> > > >> > comparing
> > > >> > > > the
> > > >> > > > > >> > > > subcomponents of both SlotManager implementations
> > with
> > > >> where
> > > >> > > > > >> > > > their functionality matches and where they differ
> > might
> > > >> help
> > > >> > > > > >> understand
> > > >> > > > > >> > > the
> > > >> > > > > >> > > > consequences of the changes proposed in FLIP-298.
> > > >> > > > > >> > > >
> > > >> > > > > >> > > > Best,
> > > >> > > > > >> > > > Matthias
> > > >> > > > > >> > > >
> > > >> > > > > >> > > > [1]
> > > >> > > > > >> > > >
> > > >> > > > > >> > >
> > > >> > > > > >> >
> > > >> > > > > >>
> > > >> > > > >
> > > >> > > >
> > > >> > >
> > > >> >
> > > >>
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-285%3A+Refactoring+LeaderElection+to+make+Flink+support+multi-component+leader+election+out-of-the-box
> > > >> > > > > >> > > > [2]
> > https://issues.apache.org/jira/browse/FLINK-30338
> > > >> > > > > >> > > > [3]
> > > >> > > > > >> > > >
> > > >> > > > > >> > >
> > > >> > > > > >> >
> > > >> > > > > >>
> > > >> > > > >
> > > >> > > >
> > > >> > >
> > > >> >
> > > >>
> >
> https://github.com/apache/flink/blob/f611ea8cb5deddb42429df2c99f0c68d7382e9bd/flink-runtime/src/main/java/org/apache/flink/runtime/resourcemanager/slotmanager/TaskExecutorManager.java#L66-L68
> > > >> > > > > >> > > >
> > > >> > > > > >> > > > On Tue, Feb 28, 2023 at 6:14 AM Matt Wang <
> > > >> wangm92@163.com>
> > > >> > > > > wrote:
> > > >> > > > > >> > > >
> > > >> > > > > >> > > >> This is a good proposal for me, it will make the
> > code of
> > > >> > the
> > > >> > > > > >> > SlotManager
> > > >> > > > > >> > > >> more clear.
> > > >> > > > > >> > > >>
> > > >> > > > > >> > > >>
> > > >> > > > > >> > > >>
> > > >> > > > > >> > > >> --
> > > >> > > > > >> > > >>
> > > >> > > > > >> > > >> Best,
> > > >> > > > > >> > > >> Matt Wang
> > > >> > > > > >> > > >>
> > > >> > > > > >> > > >>
> > > >> > > > > >> > > >> ---- Replied Message ----
> > > >> > > > > >> > > >> | From | David Morávek<dm...@apache.org> |
> > > >> > > > > >> > > >> | Date | 02/27/2023 22:45 |
> > > >> > > > > >> > > >> | To | <de...@flink.apache.org> |
> > > >> > > > > >> > > >> | Subject | Re: [DISCUSS] FLIP-298: Unifying the
> > > >> > > Implementation
> > > >> > > > > of
> > > >> > > > > >> > > >> SlotManager |
> > > >> > > > > >> > > >> Hi Weihua, I still need to dig into the details,
> > but the
> > > >> > > > overall
> > > >> > > > > >> > > sentiment
> > > >> > > > > >> > > >> of this change sounds reasonable.
> > > >> > > > > >> > > >>
> > > >> > > > > >> > > >> Best,
> > > >> > > > > >> > > >> D.
> > > >> > > > > >> > > >>
> > > >> > > > > >> > > >> On Mon, Feb 27, 2023 at 2:26 PM Zhanghao Chen <
> > > >> > > > > >> > > zhanghao.chen@outlook.com>
> > > >> > > > > >> > > >> wrote:
> > > >> > > > > >> > > >>
> > > >> > > > > >> > > >> Thanks for driving this topic. I think this FLIP
> > could
> > > >> help
> > > >> > > > > clean up
> > > >> > > > > >> > the
> > > >> > > > > >> > > >> codebase to make it easier to maintain. +1 on it.
> > > >> > > > > >> > > >>
> > > >> > > > > >> > > >> Best,
> > > >> > > > > >> > > >> Zhanghao Chen
> > > >> > > > > >> > > >> ________________________________
> > > >> > > > > >> > > >> From: Weihua Hu <hu...@gmail.com>
> > > >> > > > > >> > > >> Sent: Monday, February 27, 2023 20:40
> > > >> > > > > >> > > >> To: dev <de...@flink.apache.org>
> > > >> > > > > >> > > >> Subject: [DISCUSS] FLIP-298: Unifying the
> > > >> Implementation of
> > > >> > > > > >> > SlotManager
> > > >> > > > > >> > > >>
> > > >> > > > > >> > > >> Hi everyone,
> > > >> > > > > >> > > >>
> > > >> > > > > >> > > >> I would like to begin a discussion on FLIP-298:
> > Unifying
> > > >> > the
> > > >> > > > > >> > > Implementation
> > > >> > > > > >> > > >> of SlotManager[1]. There are currently two types
> of
> > > >> > > SlotManager
> > > >> > > > > in
> > > >> > > > > >> > > Flink:
> > > >> > > > > >> > > >> DeclarativeSlotManager and FineGrainedSlotManager.
> > > >> > > > > >> > > FineGrainedSlotManager
> > > >> > > > > >> > > >> should behave as DeclarativeSlotManager if the
> user
> > does
> > > >> > not
> > > >> > > > > >> configure
> > > >> > > > > >> > > the
> > > >> > > > > >> > > >> slot request profile.
> > > >> > > > > >> > > >>
> > > >> > > > > >> > > >> Therefore, this FLIP aims to unify the
> > implementation of
> > > >> > > > > SlotManager
> > > >> > > > > >> > in
> > > >> > > > > >> > > >> order to reduce maintenance costs.
> > > >> > > > > >> > > >>
> > > >> > > > > >> > > >> Looking forward to hearing from you.
> > > >> > > > > >> > > >>
> > > >> > > > > >> > > >> [1]
> > > >> > > > > >> > > >>
> > > >> > > > > >> > > >>
> > > >> > > > > >> > > >>
> > > >> > > > > >> > >
> > > >> > > > > >> >
> > > >> > > > > >>
> > > >> > > > >
> > > >> > > >
> > > >> > >
> > > >> >
> > > >>
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-298%3A+Unifying+the+Implementation+of+SlotManager
> > > >> > > > > >> > > >>
> > > >> > > > > >> > > >> Best,
> > > >> > > > > >> > > >> Weihua
> > > >> > > > > >> > > >>
> > > >> > > > > >> > > >>
> > > >> > > > > >> > >
> > > >> > > > > >> >
> > > >> > > > > >>
> > > >> > > > >
> > > >> > > >
> > > >> > >
> > > >> >
> > > >>
> > > >
> >
>

Re: [DISCUSS] FLIP-298: Unifying the Implementation of SlotManager

Posted by Matthias Pohl <ma...@aiven.io.INVALID>.
Thanks for updating the FLIP and adding more context to it. Additionally,
thanks to Xintong and Yangze for offering your expertise here as
contributors to the initial FineGrainedSlotManager implementation.

The remark on cutting out functionality was only based on some superficial
initial code reading. I cannot come up with a better code structure myself.
Therefore, I'm fine with not refactoring the code as part of this FLIP.

The strategies that were proposed around making sure that the refactoring
is properly backed by tests sound reasonable. My initial concern was based
on the fact that we might have unit test scenarios for the DSM that are not
covered in the unit tests of the FSGM. In that case, swapping the DSM with
the FSGM might not be good enough. Going over the GSM tests to make sure
that we're not accidentally deleting test scenarios sounds good to me.
Thanks, Weihua.

FLIP-156 [1] mentions some incompatibility between fine-grained resource
management and reactive mode. I assume that this is independent of the
SlotManager and replacing the DSM with the FGSM wouldn't affect reactive
mode?

About the heterogeneous TaskManager: This is a feature that's also not
supported in the DSM right now, is it? We should state that fact in the
FLIP if we mentioned that we don't want to implement it for the FSGM.

Best,
Matthias

[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-156%3A+Runtime+Interfaces+for+Fine-Grained+Resource+Requirements

On Tue, Mar 7, 2023 at 8:58 AM Yangze Guo <ka...@gmail.com> wrote:

> Hi Weihua,
>
> Thanks for driving this. As Xintong mentioned, this was a technical
> debt from FLIP-56.
>
> The latest version of FLIP sounds good, +1 from my side. As a
> contributor to this component, I'm willing to assist with the review
> process. Feel free to reach me if you need help.
>
> Best,
> Yangze Guo
>
> On Tue, Mar 7, 2023 at 1:47 PM Weihua Hu <hu...@gmail.com> wrote:
> >
> > Hi,
> >
> > @David @Matthias
> > There are a few days after hearing your thoughts. I would like to know if
> > there are any other concerns about this FLIP.
> >
> >
> > Best,
> > Weihua
> >
> >
> > On Mon, Mar 6, 2023 at 7:53 PM Weihua Hu <hu...@gmail.com> wrote:
> >
> > >
> > > Thanks Shammon,
> > >
> > > I've updated FLIP to add this redundant Task Manager limitation.
> > >
> > >
> > > Best,
> > > Weihua
> > >
> > >
> > > On Mon, Mar 6, 2023 at 5:07 PM Shammon FY <zj...@gmail.com> wrote:
> > >
> > >> Hi weihua
> > >>
> > >> Can you add content related to `heterogeneous resources` to this
> FLIP? We
> > >> can record it and consider it in the future. It may be useful for some
> > >> scenarios, such as the combination of streaming and ML.
> > >>
> > >> Best,
> > >> Shammon
> > >>
> > >>
> > >> On Mon, Mar 6, 2023 at 1:39 PM weijie guo <gu...@gmail.com>
> > >> wrote:
> > >>
> > >> > Hi Weihua,
> > >> >
> > >> > Thanks for your clarification, SGTM.
> > >> >
> > >> > Best regards,
> > >> >
> > >> > Weijie
> > >> >
> > >> >
> > >> > Weihua Hu <hu...@gmail.com> 于2023年3月6日周一 11:43写道:
> > >> >
> > >> > > Thanks Weijie.
> > >> > >
> > >> > > Heterogeneous task managers will not be considered in this FLIP
> since
> > >> > > it does not request heterogeneous resources as you said.
> > >> > >
> > >> > > My first thought is we can adjust the meaning of redundant
> > >> configuration
> > >> > > to redundant number of per resource type. These can be considered
> in
> > >> > > detail when we decide to support heterogeneous task managers.
> > >> > >
> > >> > > Best,
> > >> > > Weihua
> > >> > >
> > >> > >
> > >> > > On Sat, Mar 4, 2023 at 1:13 AM weijie guo <
> guoweijiereswqa@gmail.com>
> > >> > > wrote:
> > >> > >
> > >> > > > Thanks Weihua for preparing this FLIP.
> > >> > > >
> > >> > > > This FLIP overall looks reasonable to me after updating as
> > >> suggested by
> > >> > > > Matthias.
> > >> > > >
> > >> > > > I only have one small question about keeping some redundant task
> > >> > > managers:
> > >> > > > In the fine-grained resource management, theoretically, it can
> > >> support
> > >> > > > heterogeneous taskmanagers. When we complete the missing
> features
> > >> for
> > >> > > FGSM,
> > >> > > > do we plan to take this into account?
> > >> > > > Of course, if I remember correctly, FGSM will not request
> > >> heterogeneous
> > >> > > > resources at present, so it is also acceptable to me if there
> is no
> > >> > > special
> > >> > > > treatment now.
> > >> > > >
> > >> > > > +1 for this changes if we can ensure the test coverage.
> > >> > > >
> > >> > > > Best regards,
> > >> > > >
> > >> > > > Weijie
> > >> > > >
> > >> > > >
> > >> > > > John Roesler <vv...@apache.org> 于2023年3月2日周四 12:53写道:
> > >> > > >
> > >> > > > > Thanks for the test plan, Weihua!
> > >> > > > >
> > >> > > > > Yes, it addresses my concerns.
> > >> > > > >
> > >> > > > > Thanks,
> > >> > > > > John
> > >> > > > >
> > >> > > > > On Wed, Mar 1, 2023, at 22:38, Weihua Hu wrote:
> > >> > > > > > Hi, everyone,
> > >> > > > > > Thanks for your suggestions and ideas.
> > >> > > > > > Thanks Xintong for sharing the detailed backgrounds of
> > >> SlotManager.
> > >> > > > > >
> > >> > > > > > *@Matthias
> > >> > > > > >
> > >> > > > > > 1. Did you do a proper test coverage analysis?
> > >> > > > > >
> > >> > > > > >
> > >> > > > > > Just as Xintong said, we already have a CI stage for fine
> > >> grained
> > >> > > > > resource
> > >> > > > > > managers.
> > >> > > > > > And I will make sure FineGrainedSlotManager as the default
> > >> > > SlotManager
> > >> > > > > can
> > >> > > > > > pass all the tests of CI.
> > >> > > > > > In addition, I will review all unit tests of
> > >> > > > DeclarativeSlotManager(DSM)
> > >> > > > > to
> > >> > > > > > ensure that there are no gaps in the
> > >> > > > > > coverage provided by the FineGrainedSlotManager.
> > >> > > > > > I also added the 'Test Plan' part to the FLIP.
> > >> > > > > > @Matthias @John @Shammon Does this test plan address your
> > >> concerns?
> > >> > > > > >
> > >> > > > > > 2.  DeclarativeSlotManager and FineGrainedSlotManager feel
> quite
> > >> > big
> > >> > > in
> > >> > > > > >
> > >> > > > > > terms of lines of code....
> > >> > > > > >
> > >> > > > > >
> > >> > > > > > IMO, the refactoring of SlotManager does not belong to this
> FLIP
> > >> > > since
> > >> > > > it
> > >> > > > > > may lead to some unstable risks. For
> > >> > > > > > FineGrainedSlotManager(FGSM), we already split some
> reasonable
> > >> > > > > components.
> > >> > > > > > They are:
> > >> > > > > > * TaskManagerTracker: Track task managers and their
> resources.
> > >> > > > > > * ResourceTracker: track requirements of jobs
> > >> > > > > > * ResourceAllocationStrategy: Try to fulfill the resource
> > >> > > requirements
> > >> > > > > with
> > >> > > > > > available/pending resources.
> > >> > > > > > * SlotStatusSyncer: communicate with TaskManager, for
> > >> > > > allocating/freeing
> > >> > > > > > slot and reconciling the slot status
> > >> > > > > > Maybe we can start a discussion about refactoring
> SlotManager in
> > >> > > > another
> > >> > > > > > FLIP if there are some good suggestions.
> > >> > > > > > WDYT
> > >> > > > > >
> > >> > > > > > 3. For me personally, having a more detailed summary
> comparing
> > >> the
> > >> > > > > >> subcomponents of both SlotManager implementations with
> where
> > >> > > > > >> their functionality matches and where they differ might
> help
> > >> > > > understand
> > >> > > > > the
> > >> > > > > >> consequences of the changes proposed in FLIP-298
> > >> > > > > >
> > >> > > > > > Good suggestion, I have updated the comparison in this FLIP.
> > >> > Looking
> > >> > > > > > forward to any suggestions/thoughts
> > >> > > > > > if they are not described clearly.
> > >> > > > > >
> > >> > > > > > *@John
> > >> > > > > >
> > >> > > > > > 4. In addition to changing the default, would it make sense
> to
> > >> log
> > >> > a
> > >> > > > > >> deprecation warning on initialization
> > >> > > > > >
> > >> > > > > > if the DeclarativeSlotManager is used?
> > >> > > > > >>
> > >> > > > > > SGTM, We should add Deprecated annotations to DSM for devs.
> And
> > >> > log a
> > >> > > > > > deprecation warning for users.
> > >> > > > > >
> > >> > > > > > *@Shammon
> > >> > > > > >
> > >> > > > > > 1. For their functional differences, can you give some
> detailed
> > >> > tests
> > >> > > > to
> > >> > > > > >> verify that the new FineGrainedSlotManager has these
> > >> capabilities?
> > >> > > > This
> > >> > > > > can
> > >> > > > > >> effectively verify the new functions
> > >> > > > > >>
> > >> > > > > > As just maintained, there is already a CI stage of FGSM,
> and I
> > >> will
> > >> > > do
> > >> > > > > more
> > >> > > > > > review of unit tests for DSM.
> > >> > > > > >
> > >> > > > > >  2. I'm worried that many functions are not independent and
> it
> > >> is
> > >> > > > > difficult
> > >> > > > > >> to migrate step-by-step. You can list the relationship
> between
> > >> > them
> > >> > > in
> > >> > > > > >> detail.
> > >> > > > > >
> > >> > > > > >  As Xintong saied the DSM is a subset of FGSM by design.
> But as
> > >> > time
> > >> > > > goes
> > >> > > > > > on, FGSM has some lacking
> > >> > > > > > functions as I listed in this FLIP. And I have added the
> > >> comparison
> > >> > > > > between
> > >> > > > > > DSM and FGSM in this FLIP.
> > >> > > > > >
> > >> > > > > >
> > >> > > > > > Thanks again for all your thoughts. Any feedback is
> appreciated!
> > >> > > > > >
> > >> > > > > > Best,
> > >> > > > > > Weihua
> > >> > > > > >
> > >> > > > > >
> > >> > > > > > On Wed, Mar 1, 2023 at 2:17 PM Xintong Song <
> > >> tonysong820@gmail.com
> > >> > >
> > >> > > > > wrote:
> > >> > > > > >
> > >> > > > > >> Thanks Weihua for preparing this FLIP. +1 for the proposal.
> > >> > > > > >>
> > >> > > > > >>
> > >> > > > > >> As one of the contributors of the fine-grained slot
> manager,
> > >> I'd
> > >> > > like
> > >> > > > to
> > >> > > > > >> share some backgrounds here.
> > >> > > > > >>
> > >> > > > > >> - There used to be a defaut slot manager implementation,
> which
> > >> is
> > >> > > > > >> non-declarative and has been removed now. The two features,
> > >> > > > declarative
> > >> > > > > /
> > >> > > > > >> reactive resource management and fine-grained resource
> > >> management,
> > >> > > > were
> > >> > > > > >> proposed at about the same time. We were aware that by
> design
> > >> the
> > >> > > > > >> declarative slot manager is a subset of fine-grained slot
> > >> manager
> > >> > at
> > >> > > > > that
> > >> > > > > >> time, but still decided to implement two slot managers for
> the
> > >> > > purpose
> > >> > > > > of
> > >> > > > > >> decoupling efforts and reducing cross-team synchronization
> > >> > overhead.
> > >> > > > > >> Merging the two slot managers once they are proved stable
> is
> > >> IMO a
> > >> > > > > >> technical debt that was planned at the very beginning.
> > >> > > > > >>
> > >> > > > > >> - The FineGrainedSlotManager has been verified in Alibaba's
> > >> > internal
> > >> > > > > >> production as well as Alibaba Cloud services as the default
> > >> slot
> > >> > > > manager
> > >> > > > > >> for about 2 years.
> > >> > > > > >>
> > >> > > > > >>
> > >> > > > > >> Concerning test cases, we currently have a ci stage for
> fine
> > >> > grained
> > >> > > > > >> resource management. To avoid adding too much burden, the
> stage
> > >> > only
> > >> > > > > >> includes tests from flink-runtime and flink-test modules. I
> > >> think
> > >> > > > > switching
> > >> > > > > >> the default slot manager and applying the whole set of
> tests on
> > >> > the
> > >> > > > > >> fine-grained slot manager would help us to be more
> confident
> > >> about
> > >> > > it.
> > >> > > > > >>
> > >> > > > > >>
> > >> > > > > >> Concerning cutting out functionalities out of slot
> manager, I
> > >> > think
> > >> > > > > Yangze
> > >> > > > > >> and I have tried our best to shape the
> FineGrainedSlotManager
> > >> into
> > >> > > > > >> reasonable components. I personally don't have other ideas
> to
> > >> > > further
> > >> > > > > >> disassemble the component, but I'm open to such
> suggestions.
> > >> > > However,
> > >> > > > > from
> > >> > > > > >> the stability perspective, I'd be in favor of not
> introducing
> > >> > > > > significant
> > >> > > > > >> changes to the FineGrainedSlotManager while switching it
> to the
> > >> > > > default.
> > >> > > > > >> Because the current implementation has already been
> verified
> > >> (or
> > >> > at
> > >> > > > > least
> > >> > > > > >> partially verified because Alibaba does not cover all the
> Flink
> > >> > use
> > >> > > > > cases),
> > >> > > > > >> and introducing more changes also means more chances of
> > >> breaking
> > >> > > > things.
> > >> > > > > >>
> > >> > > > > >>
> > >> > > > > >> Best,
> > >> > > > > >>
> > >> > > > > >> Xintong
> > >> > > > > >>
> > >> > > > > >>
> > >> > > > > >>
> > >> > > > > >> On Wed, Mar 1, 2023 at 11:12 AM Shammon FY <
> zjureel@gmail.com>
> > >> > > wrote:
> > >> > > > > >>
> > >> > > > > >> > Hi
> > >> > > > > >> >
> > >> > > > > >> > Thanks for starting this work weihua, I think unifying
> > >> > > > > >> > DeclarativeSlotManager and FineGrainedSlotManager is
> > >> valuable.
> > >> > > > > >> >
> > >> > > > > >> > I agree with @Matthias and @John that we need a way to
> ensure
> > >> > that
> > >> > > > > >> > DeclarativeSlotManager's capabilities are fully covered
> by
> > >> > > > > >> > FineGrainedSlotManager
> > >> > > > > >> >
> > >> > > > > >> > 1. For their functional differences, can you give some
> > >> detailed
> > >> > > > tests
> > >> > > > > to
> > >> > > > > >> > verify that the new FineGrainedSlotManager has these
> > >> > capabilities?
> > >> > > > > This
> > >> > > > > >> can
> > >> > > > > >> > effectively verify the new functions
> > >> > > > > >> >
> > >> > > > > >> > 2. I'm worried that many functions are not independent
> and
> > >> it is
> > >> > > > > >> difficult
> > >> > > > > >> > to migrate step-by-step. You can list the relationship
> > >> between
> > >> > > them
> > >> > > > in
> > >> > > > > >> > detail.
> > >> > > > > >> >
> > >> > > > > >> > 3. As John mentioned, give a smoke test for
> > >> > FineGrainedSlotManager
> > >> > > > is
> > >> > > > > a
> > >> > > > > >> > good idea. Or you can add some test information to the
> > >> > > > > >> > DeclarativeSlotManager to determine how many tests have
> used
> > >> it.
> > >> > > In
> > >> > > > > this
> > >> > > > > >> > way, we can gradually construct test cases for
> > >> > > > FineGrainedSlotManager
> > >> > > > > >> > during the development process.
> > >> > > > > >> >
> > >> > > > > >> >
> > >> > > > > >> > Best,
> > >> > > > > >> > Shammon
> > >> > > > > >> >
> > >> > > > > >> >
> > >> > > > > >> > On Tue, Feb 28, 2023 at 10:22 PM John Roesler <
> > >> > > vvcephei@apache.org>
> > >> > > > > >> wrote:
> > >> > > > > >> >
> > >> > > > > >> > > Thanks for the FLIP, Weihua!
> > >> > > > > >> > >
> > >> > > > > >> > > I’ve read the FLIP, and it sounds good to me. We need
> to
> > >> avoid
> > >> > > > > >> > > proliferating alternative implementations wherever
> > >> possible. I
> > >> > > > have
> > >> > > > > >> just
> > >> > > > > >> > a
> > >> > > > > >> > > couple of comments:
> > >> > > > > >> > >
> > >> > > > > >> > > 1. I share Matthias’s concern about ensuring the
> behavior
> > >> is
> > >> > > > really
> > >> > > > > the
> > >> > > > > >> > > same. One suggestion I’ve used for this kind of thing
> is,
> > >> as a
> > >> > > > smoke
> > >> > > > > >> > test,
> > >> > > > > >> > > to update the DeclarativeSlotManager to just delegate
> to
> > >> the
> > >> > > > > >> > > FineGrainedSlotManager. If the full test suite still
> > >> passes,
> > >> > you
> > >> > > > > can be
> > >> > > > > >> > > pretty sure the new default is really ok. It would not
> be a
> > >> > good
> > >> > > > > idea
> > >> > > > > >> to
> > >> > > > > >> > > actually keep that in for the release, since it would
> > >> remove
> > >> > the
> > >> > > > > option
> > >> > > > > >> > to
> > >> > > > > >> > > fall back in case of bugs. Either way, we need to make
> sure
> > >> > all
> > >> > > > test
> > >> > > > > >> > > scenarios are present for the FGSM.
> > >> > > > > >> > >
> > >> > > > > >> > > 4. In addition to changing the default, would it make
> > >> sense to
> > >> > > > log a
> > >> > > > > >> > > deprecation warning on initialization if the
> > >> > > > DeclarativeSlotManager
> > >> > > > > is
> > >> > > > > >> > used?
> > >> > > > > >> > >
> > >> > > > > >> > > Thanks again,
> > >> > > > > >> > > John
> > >> > > > > >> > >
> > >> > > > > >> > > On Tue, Feb 28, 2023, at 07:20, Matthias Pohl wrote:
> > >> > > > > >> > > > Hi Weihua,
> > >> > > > > >> > > > Thanks for your proposal. From a conceptual point:
> AFAIU,
> > >> > the
> > >> > > > > >> > > > DeclarativeSlotManager covers a subset (i.e. only
> evenly
> > >> > sized
> > >> > > > > slots)
> > >> > > > > >> > of
> > >> > > > > >> > > > what the FineGrainedSlotManager should be able to
> achieve
> > >> > > > > (variable
> > >> > > > > >> > slot
> > >> > > > > >> > > > size per task manager). Is this the right
> > >> > > > > assumption/understanding?
> > >> > > > > >> In
> > >> > > > > >> > > this
> > >> > > > > >> > > > sense, merging both implementations into a single one
> > >> sounds
> > >> > > > > good. A
> > >> > > > > >> > few
> > >> > > > > >> > > > more general comments, though:
> > >> > > > > >> > > >
> > >> > > > > >> > > > 1. Did you do a proper test coverage analysis?
> That's not
> > >> > > > > mentioned
> > >> > > > > >> in
> > >> > > > > >> > > the
> > >> > > > > >> > > > current version of the FLIP. I'm bringing this up
> > >> because we
> > >> > > ran
> > >> > > > > into
> > >> > > > > >> > the
> > >> > > > > >> > > > same issue when fixing the flaws that popped up after
> > >> > > > introducing
> > >> > > > > the
> > >> > > > > >> > > > multi-component leader election (see FLIP-285 [1]).
> There
> > >> > is a
> > >> > > > > risk
> > >> > > > > >> > that
> > >> > > > > >> > > by
> > >> > > > > >> > > > removing the legacy code we decrease test coverage
> > >> because
> > >> > > > certain
> > >> > > > > >> > > > test cases that were covered for the legacy classes
> might
> > >> > not
> > >> > > be
> > >> > > > > >> > > > necessarily covered in the new implementation, yet
> (see
> > >> > > > > FLINK-30338
> > >> > > > > >> [2]
> > >> > > > > >> > > > which covers this issue for the leader election
> case).
> > >> > > Ideally,
> > >> > > > we
> > >> > > > > >> > don't
> > >> > > > > >> > > > want to remove test cases accidentally because they
> were
> > >> > only
> > >> > > > > >> > implemented
> > >> > > > > >> > > > for the DeclarativeSlotManager but missed for the
> > >> > > > > >> > FineGrainedSlotManager.
> > >> > > > > >> > > >
> > >> > > > > >> > > > 2. DeclarativeSlotManager and FineGrainedSlotManager
> feel
> > >> > > quite
> > >> > > > > big
> > >> > > > > >> in
> > >> > > > > >> > > > terms of lines of code. Without knowing whether it's
> > >> > actually
> > >> > > a
> > >> > > > > >> > > reasonable
> > >> > > > > >> > > > thing to do: Instead of just adding more features to
> the
> > >> > > > > >> > > > FineGrainedSlotManager, have you thought of cutting
> out
> > >> > > > > functionality
> > >> > > > > >> > > into
> > >> > > > > >> > > > smaller sub-components along this refactoring? Such a
> > >> > > > step-by-step
> > >> > > > > >> > > approach
> > >> > > > > >> > > > might improve the overall codebase and might make
> > >> reviewing
> > >> > > the
> > >> > > > > >> > > refactoring
> > >> > > > > >> > > > easier. I did a first pass over the code and
> struggled to
> > >> > > > identify
> > >> > > > > >> code
> > >> > > > > >> > > > blocks that could be moved out of the SlotManager
> > >> > > > > implementation(s).
> > >> > > > > >> > > > Therefore, I might be wrong with this proposal. I
> haven't
> > >> > > worked
> > >> > > > > on
> > >> > > > > >> > this
> > >> > > > > >> > > > codebase in that detail that it would allow me to
> come up
> > >> > > with a
> > >> > > > > >> > > judgement
> > >> > > > > >> > > > call. I wanted to bring it up, anyway, because I'm
> > >> curious
> > >> > > > whether
> > >> > > > > >> that
> > >> > > > > >> > > > could be an option. There's a comment created by
> Chesnay
> > >> > > (CC'd)
> > >> > > > in
> > >> > > > > >> the
> > >> > > > > >> > > > JavaDoc of TaskExecutorManager [3] indicating
> something
> > >> > > similar.
> > >> > > > > I'm
> > >> > > > > >> > > > wondering whether he can add some insights here.
> > >> > > > > >> > > >
> > >> > > > > >> > > > 3. For me personally, having a more detailed summary
> > >> > comparing
> > >> > > > the
> > >> > > > > >> > > > subcomponents of both SlotManager implementations
> with
> > >> where
> > >> > > > > >> > > > their functionality matches and where they differ
> might
> > >> help
> > >> > > > > >> understand
> > >> > > > > >> > > the
> > >> > > > > >> > > > consequences of the changes proposed in FLIP-298.
> > >> > > > > >> > > >
> > >> > > > > >> > > > Best,
> > >> > > > > >> > > > Matthias
> > >> > > > > >> > > >
> > >> > > > > >> > > > [1]
> > >> > > > > >> > > >
> > >> > > > > >> > >
> > >> > > > > >> >
> > >> > > > > >>
> > >> > > > >
> > >> > > >
> > >> > >
> > >> >
> > >>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-285%3A+Refactoring+LeaderElection+to+make+Flink+support+multi-component+leader+election+out-of-the-box
> > >> > > > > >> > > > [2]
> https://issues.apache.org/jira/browse/FLINK-30338
> > >> > > > > >> > > > [3]
> > >> > > > > >> > > >
> > >> > > > > >> > >
> > >> > > > > >> >
> > >> > > > > >>
> > >> > > > >
> > >> > > >
> > >> > >
> > >> >
> > >>
> https://github.com/apache/flink/blob/f611ea8cb5deddb42429df2c99f0c68d7382e9bd/flink-runtime/src/main/java/org/apache/flink/runtime/resourcemanager/slotmanager/TaskExecutorManager.java#L66-L68
> > >> > > > > >> > > >
> > >> > > > > >> > > > On Tue, Feb 28, 2023 at 6:14 AM Matt Wang <
> > >> wangm92@163.com>
> > >> > > > > wrote:
> > >> > > > > >> > > >
> > >> > > > > >> > > >> This is a good proposal for me, it will make the
> code of
> > >> > the
> > >> > > > > >> > SlotManager
> > >> > > > > >> > > >> more clear.
> > >> > > > > >> > > >>
> > >> > > > > >> > > >>
> > >> > > > > >> > > >>
> > >> > > > > >> > > >> --
> > >> > > > > >> > > >>
> > >> > > > > >> > > >> Best,
> > >> > > > > >> > > >> Matt Wang
> > >> > > > > >> > > >>
> > >> > > > > >> > > >>
> > >> > > > > >> > > >> ---- Replied Message ----
> > >> > > > > >> > > >> | From | David Morávek<dm...@apache.org> |
> > >> > > > > >> > > >> | Date | 02/27/2023 22:45 |
> > >> > > > > >> > > >> | To | <de...@flink.apache.org> |
> > >> > > > > >> > > >> | Subject | Re: [DISCUSS] FLIP-298: Unifying the
> > >> > > Implementation
> > >> > > > > of
> > >> > > > > >> > > >> SlotManager |
> > >> > > > > >> > > >> Hi Weihua, I still need to dig into the details,
> but the
> > >> > > > overall
> > >> > > > > >> > > sentiment
> > >> > > > > >> > > >> of this change sounds reasonable.
> > >> > > > > >> > > >>
> > >> > > > > >> > > >> Best,
> > >> > > > > >> > > >> D.
> > >> > > > > >> > > >>
> > >> > > > > >> > > >> On Mon, Feb 27, 2023 at 2:26 PM Zhanghao Chen <
> > >> > > > > >> > > zhanghao.chen@outlook.com>
> > >> > > > > >> > > >> wrote:
> > >> > > > > >> > > >>
> > >> > > > > >> > > >> Thanks for driving this topic. I think this FLIP
> could
> > >> help
> > >> > > > > clean up
> > >> > > > > >> > the
> > >> > > > > >> > > >> codebase to make it easier to maintain. +1 on it.
> > >> > > > > >> > > >>
> > >> > > > > >> > > >> Best,
> > >> > > > > >> > > >> Zhanghao Chen
> > >> > > > > >> > > >> ________________________________
> > >> > > > > >> > > >> From: Weihua Hu <hu...@gmail.com>
> > >> > > > > >> > > >> Sent: Monday, February 27, 2023 20:40
> > >> > > > > >> > > >> To: dev <de...@flink.apache.org>
> > >> > > > > >> > > >> Subject: [DISCUSS] FLIP-298: Unifying the
> > >> Implementation of
> > >> > > > > >> > SlotManager
> > >> > > > > >> > > >>
> > >> > > > > >> > > >> Hi everyone,
> > >> > > > > >> > > >>
> > >> > > > > >> > > >> I would like to begin a discussion on FLIP-298:
> Unifying
> > >> > the
> > >> > > > > >> > > Implementation
> > >> > > > > >> > > >> of SlotManager[1]. There are currently two types of
> > >> > > SlotManager
> > >> > > > > in
> > >> > > > > >> > > Flink:
> > >> > > > > >> > > >> DeclarativeSlotManager and FineGrainedSlotManager.
> > >> > > > > >> > > FineGrainedSlotManager
> > >> > > > > >> > > >> should behave as DeclarativeSlotManager if the user
> does
> > >> > not
> > >> > > > > >> configure
> > >> > > > > >> > > the
> > >> > > > > >> > > >> slot request profile.
> > >> > > > > >> > > >>
> > >> > > > > >> > > >> Therefore, this FLIP aims to unify the
> implementation of
> > >> > > > > SlotManager
> > >> > > > > >> > in
> > >> > > > > >> > > >> order to reduce maintenance costs.
> > >> > > > > >> > > >>
> > >> > > > > >> > > >> Looking forward to hearing from you.
> > >> > > > > >> > > >>
> > >> > > > > >> > > >> [1]
> > >> > > > > >> > > >>
> > >> > > > > >> > > >>
> > >> > > > > >> > > >>
> > >> > > > > >> > >
> > >> > > > > >> >
> > >> > > > > >>
> > >> > > > >
> > >> > > >
> > >> > >
> > >> >
> > >>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-298%3A+Unifying+the+Implementation+of+SlotManager
> > >> > > > > >> > > >>
> > >> > > > > >> > > >> Best,
> > >> > > > > >> > > >> Weihua
> > >> > > > > >> > > >>
> > >> > > > > >> > > >>
> > >> > > > > >> > >
> > >> > > > > >> >
> > >> > > > > >>
> > >> > > > >
> > >> > > >
> > >> > >
> > >> >
> > >>
> > >
>

Re: [DISCUSS] FLIP-298: Unifying the Implementation of SlotManager

Posted by Yangze Guo <ka...@gmail.com>.
Hi Weihua,

Thanks for driving this. As Xintong mentioned, this was a technical
debt from FLIP-56.

The latest version of FLIP sounds good, +1 from my side. As a
contributor to this component, I'm willing to assist with the review
process. Feel free to reach me if you need help.

Best,
Yangze Guo

On Tue, Mar 7, 2023 at 1:47 PM Weihua Hu <hu...@gmail.com> wrote:
>
> Hi,
>
> @David @Matthias
> There are a few days after hearing your thoughts. I would like to know if
> there are any other concerns about this FLIP.
>
>
> Best,
> Weihua
>
>
> On Mon, Mar 6, 2023 at 7:53 PM Weihua Hu <hu...@gmail.com> wrote:
>
> >
> > Thanks Shammon,
> >
> > I've updated FLIP to add this redundant Task Manager limitation.
> >
> >
> > Best,
> > Weihua
> >
> >
> > On Mon, Mar 6, 2023 at 5:07 PM Shammon FY <zj...@gmail.com> wrote:
> >
> >> Hi weihua
> >>
> >> Can you add content related to `heterogeneous resources` to this FLIP? We
> >> can record it and consider it in the future. It may be useful for some
> >> scenarios, such as the combination of streaming and ML.
> >>
> >> Best,
> >> Shammon
> >>
> >>
> >> On Mon, Mar 6, 2023 at 1:39 PM weijie guo <gu...@gmail.com>
> >> wrote:
> >>
> >> > Hi Weihua,
> >> >
> >> > Thanks for your clarification, SGTM.
> >> >
> >> > Best regards,
> >> >
> >> > Weijie
> >> >
> >> >
> >> > Weihua Hu <hu...@gmail.com> 于2023年3月6日周一 11:43写道:
> >> >
> >> > > Thanks Weijie.
> >> > >
> >> > > Heterogeneous task managers will not be considered in this FLIP since
> >> > > it does not request heterogeneous resources as you said.
> >> > >
> >> > > My first thought is we can adjust the meaning of redundant
> >> configuration
> >> > > to redundant number of per resource type. These can be considered in
> >> > > detail when we decide to support heterogeneous task managers.
> >> > >
> >> > > Best,
> >> > > Weihua
> >> > >
> >> > >
> >> > > On Sat, Mar 4, 2023 at 1:13 AM weijie guo <gu...@gmail.com>
> >> > > wrote:
> >> > >
> >> > > > Thanks Weihua for preparing this FLIP.
> >> > > >
> >> > > > This FLIP overall looks reasonable to me after updating as
> >> suggested by
> >> > > > Matthias.
> >> > > >
> >> > > > I only have one small question about keeping some redundant task
> >> > > managers:
> >> > > > In the fine-grained resource management, theoretically, it can
> >> support
> >> > > > heterogeneous taskmanagers. When we complete the missing features
> >> for
> >> > > FGSM,
> >> > > > do we plan to take this into account?
> >> > > > Of course, if I remember correctly, FGSM will not request
> >> heterogeneous
> >> > > > resources at present, so it is also acceptable to me if there is no
> >> > > special
> >> > > > treatment now.
> >> > > >
> >> > > > +1 for this changes if we can ensure the test coverage.
> >> > > >
> >> > > > Best regards,
> >> > > >
> >> > > > Weijie
> >> > > >
> >> > > >
> >> > > > John Roesler <vv...@apache.org> 于2023年3月2日周四 12:53写道:
> >> > > >
> >> > > > > Thanks for the test plan, Weihua!
> >> > > > >
> >> > > > > Yes, it addresses my concerns.
> >> > > > >
> >> > > > > Thanks,
> >> > > > > John
> >> > > > >
> >> > > > > On Wed, Mar 1, 2023, at 22:38, Weihua Hu wrote:
> >> > > > > > Hi, everyone,
> >> > > > > > Thanks for your suggestions and ideas.
> >> > > > > > Thanks Xintong for sharing the detailed backgrounds of
> >> SlotManager.
> >> > > > > >
> >> > > > > > *@Matthias
> >> > > > > >
> >> > > > > > 1. Did you do a proper test coverage analysis?
> >> > > > > >
> >> > > > > >
> >> > > > > > Just as Xintong said, we already have a CI stage for fine
> >> grained
> >> > > > > resource
> >> > > > > > managers.
> >> > > > > > And I will make sure FineGrainedSlotManager as the default
> >> > > SlotManager
> >> > > > > can
> >> > > > > > pass all the tests of CI.
> >> > > > > > In addition, I will review all unit tests of
> >> > > > DeclarativeSlotManager(DSM)
> >> > > > > to
> >> > > > > > ensure that there are no gaps in the
> >> > > > > > coverage provided by the FineGrainedSlotManager.
> >> > > > > > I also added the 'Test Plan' part to the FLIP.
> >> > > > > > @Matthias @John @Shammon Does this test plan address your
> >> concerns?
> >> > > > > >
> >> > > > > > 2.  DeclarativeSlotManager and FineGrainedSlotManager feel quite
> >> > big
> >> > > in
> >> > > > > >
> >> > > > > > terms of lines of code....
> >> > > > > >
> >> > > > > >
> >> > > > > > IMO, the refactoring of SlotManager does not belong to this FLIP
> >> > > since
> >> > > > it
> >> > > > > > may lead to some unstable risks. For
> >> > > > > > FineGrainedSlotManager(FGSM), we already split some reasonable
> >> > > > > components.
> >> > > > > > They are:
> >> > > > > > * TaskManagerTracker: Track task managers and their resources.
> >> > > > > > * ResourceTracker: track requirements of jobs
> >> > > > > > * ResourceAllocationStrategy: Try to fulfill the resource
> >> > > requirements
> >> > > > > with
> >> > > > > > available/pending resources.
> >> > > > > > * SlotStatusSyncer: communicate with TaskManager, for
> >> > > > allocating/freeing
> >> > > > > > slot and reconciling the slot status
> >> > > > > > Maybe we can start a discussion about refactoring SlotManager in
> >> > > > another
> >> > > > > > FLIP if there are some good suggestions.
> >> > > > > > WDYT
> >> > > > > >
> >> > > > > > 3. For me personally, having a more detailed summary comparing
> >> the
> >> > > > > >> subcomponents of both SlotManager implementations with where
> >> > > > > >> their functionality matches and where they differ might help
> >> > > > understand
> >> > > > > the
> >> > > > > >> consequences of the changes proposed in FLIP-298
> >> > > > > >
> >> > > > > > Good suggestion, I have updated the comparison in this FLIP.
> >> > Looking
> >> > > > > > forward to any suggestions/thoughts
> >> > > > > > if they are not described clearly.
> >> > > > > >
> >> > > > > > *@John
> >> > > > > >
> >> > > > > > 4. In addition to changing the default, would it make sense to
> >> log
> >> > a
> >> > > > > >> deprecation warning on initialization
> >> > > > > >
> >> > > > > > if the DeclarativeSlotManager is used?
> >> > > > > >>
> >> > > > > > SGTM, We should add Deprecated annotations to DSM for devs. And
> >> > log a
> >> > > > > > deprecation warning for users.
> >> > > > > >
> >> > > > > > *@Shammon
> >> > > > > >
> >> > > > > > 1. For their functional differences, can you give some detailed
> >> > tests
> >> > > > to
> >> > > > > >> verify that the new FineGrainedSlotManager has these
> >> capabilities?
> >> > > > This
> >> > > > > can
> >> > > > > >> effectively verify the new functions
> >> > > > > >>
> >> > > > > > As just maintained, there is already a CI stage of FGSM, and I
> >> will
> >> > > do
> >> > > > > more
> >> > > > > > review of unit tests for DSM.
> >> > > > > >
> >> > > > > >  2. I'm worried that many functions are not independent and it
> >> is
> >> > > > > difficult
> >> > > > > >> to migrate step-by-step. You can list the relationship between
> >> > them
> >> > > in
> >> > > > > >> detail.
> >> > > > > >
> >> > > > > >  As Xintong saied the DSM is a subset of FGSM by design. But as
> >> > time
> >> > > > goes
> >> > > > > > on, FGSM has some lacking
> >> > > > > > functions as I listed in this FLIP. And I have added the
> >> comparison
> >> > > > > between
> >> > > > > > DSM and FGSM in this FLIP.
> >> > > > > >
> >> > > > > >
> >> > > > > > Thanks again for all your thoughts. Any feedback is appreciated!
> >> > > > > >
> >> > > > > > Best,
> >> > > > > > Weihua
> >> > > > > >
> >> > > > > >
> >> > > > > > On Wed, Mar 1, 2023 at 2:17 PM Xintong Song <
> >> tonysong820@gmail.com
> >> > >
> >> > > > > wrote:
> >> > > > > >
> >> > > > > >> Thanks Weihua for preparing this FLIP. +1 for the proposal.
> >> > > > > >>
> >> > > > > >>
> >> > > > > >> As one of the contributors of the fine-grained slot manager,
> >> I'd
> >> > > like
> >> > > > to
> >> > > > > >> share some backgrounds here.
> >> > > > > >>
> >> > > > > >> - There used to be a defaut slot manager implementation, which
> >> is
> >> > > > > >> non-declarative and has been removed now. The two features,
> >> > > > declarative
> >> > > > > /
> >> > > > > >> reactive resource management and fine-grained resource
> >> management,
> >> > > > were
> >> > > > > >> proposed at about the same time. We were aware that by design
> >> the
> >> > > > > >> declarative slot manager is a subset of fine-grained slot
> >> manager
> >> > at
> >> > > > > that
> >> > > > > >> time, but still decided to implement two slot managers for the
> >> > > purpose
> >> > > > > of
> >> > > > > >> decoupling efforts and reducing cross-team synchronization
> >> > overhead.
> >> > > > > >> Merging the two slot managers once they are proved stable is
> >> IMO a
> >> > > > > >> technical debt that was planned at the very beginning.
> >> > > > > >>
> >> > > > > >> - The FineGrainedSlotManager has been verified in Alibaba's
> >> > internal
> >> > > > > >> production as well as Alibaba Cloud services as the default
> >> slot
> >> > > > manager
> >> > > > > >> for about 2 years.
> >> > > > > >>
> >> > > > > >>
> >> > > > > >> Concerning test cases, we currently have a ci stage for fine
> >> > grained
> >> > > > > >> resource management. To avoid adding too much burden, the stage
> >> > only
> >> > > > > >> includes tests from flink-runtime and flink-test modules. I
> >> think
> >> > > > > switching
> >> > > > > >> the default slot manager and applying the whole set of tests on
> >> > the
> >> > > > > >> fine-grained slot manager would help us to be more confident
> >> about
> >> > > it.
> >> > > > > >>
> >> > > > > >>
> >> > > > > >> Concerning cutting out functionalities out of slot manager, I
> >> > think
> >> > > > > Yangze
> >> > > > > >> and I have tried our best to shape the FineGrainedSlotManager
> >> into
> >> > > > > >> reasonable components. I personally don't have other ideas to
> >> > > further
> >> > > > > >> disassemble the component, but I'm open to such suggestions.
> >> > > However,
> >> > > > > from
> >> > > > > >> the stability perspective, I'd be in favor of not introducing
> >> > > > > significant
> >> > > > > >> changes to the FineGrainedSlotManager while switching it to the
> >> > > > default.
> >> > > > > >> Because the current implementation has already been verified
> >> (or
> >> > at
> >> > > > > least
> >> > > > > >> partially verified because Alibaba does not cover all the Flink
> >> > use
> >> > > > > cases),
> >> > > > > >> and introducing more changes also means more chances of
> >> breaking
> >> > > > things.
> >> > > > > >>
> >> > > > > >>
> >> > > > > >> Best,
> >> > > > > >>
> >> > > > > >> Xintong
> >> > > > > >>
> >> > > > > >>
> >> > > > > >>
> >> > > > > >> On Wed, Mar 1, 2023 at 11:12 AM Shammon FY <zj...@gmail.com>
> >> > > wrote:
> >> > > > > >>
> >> > > > > >> > Hi
> >> > > > > >> >
> >> > > > > >> > Thanks for starting this work weihua, I think unifying
> >> > > > > >> > DeclarativeSlotManager and FineGrainedSlotManager is
> >> valuable.
> >> > > > > >> >
> >> > > > > >> > I agree with @Matthias and @John that we need a way to ensure
> >> > that
> >> > > > > >> > DeclarativeSlotManager's capabilities are fully covered by
> >> > > > > >> > FineGrainedSlotManager
> >> > > > > >> >
> >> > > > > >> > 1. For their functional differences, can you give some
> >> detailed
> >> > > > tests
> >> > > > > to
> >> > > > > >> > verify that the new FineGrainedSlotManager has these
> >> > capabilities?
> >> > > > > This
> >> > > > > >> can
> >> > > > > >> > effectively verify the new functions
> >> > > > > >> >
> >> > > > > >> > 2. I'm worried that many functions are not independent and
> >> it is
> >> > > > > >> difficult
> >> > > > > >> > to migrate step-by-step. You can list the relationship
> >> between
> >> > > them
> >> > > > in
> >> > > > > >> > detail.
> >> > > > > >> >
> >> > > > > >> > 3. As John mentioned, give a smoke test for
> >> > FineGrainedSlotManager
> >> > > > is
> >> > > > > a
> >> > > > > >> > good idea. Or you can add some test information to the
> >> > > > > >> > DeclarativeSlotManager to determine how many tests have used
> >> it.
> >> > > In
> >> > > > > this
> >> > > > > >> > way, we can gradually construct test cases for
> >> > > > FineGrainedSlotManager
> >> > > > > >> > during the development process.
> >> > > > > >> >
> >> > > > > >> >
> >> > > > > >> > Best,
> >> > > > > >> > Shammon
> >> > > > > >> >
> >> > > > > >> >
> >> > > > > >> > On Tue, Feb 28, 2023 at 10:22 PM John Roesler <
> >> > > vvcephei@apache.org>
> >> > > > > >> wrote:
> >> > > > > >> >
> >> > > > > >> > > Thanks for the FLIP, Weihua!
> >> > > > > >> > >
> >> > > > > >> > > I’ve read the FLIP, and it sounds good to me. We need to
> >> avoid
> >> > > > > >> > > proliferating alternative implementations wherever
> >> possible. I
> >> > > > have
> >> > > > > >> just
> >> > > > > >> > a
> >> > > > > >> > > couple of comments:
> >> > > > > >> > >
> >> > > > > >> > > 1. I share Matthias’s concern about ensuring the behavior
> >> is
> >> > > > really
> >> > > > > the
> >> > > > > >> > > same. One suggestion I’ve used for this kind of thing is,
> >> as a
> >> > > > smoke
> >> > > > > >> > test,
> >> > > > > >> > > to update the DeclarativeSlotManager to just delegate to
> >> the
> >> > > > > >> > > FineGrainedSlotManager. If the full test suite still
> >> passes,
> >> > you
> >> > > > > can be
> >> > > > > >> > > pretty sure the new default is really ok. It would not be a
> >> > good
> >> > > > > idea
> >> > > > > >> to
> >> > > > > >> > > actually keep that in for the release, since it would
> >> remove
> >> > the
> >> > > > > option
> >> > > > > >> > to
> >> > > > > >> > > fall back in case of bugs. Either way, we need to make sure
> >> > all
> >> > > > test
> >> > > > > >> > > scenarios are present for the FGSM.
> >> > > > > >> > >
> >> > > > > >> > > 4. In addition to changing the default, would it make
> >> sense to
> >> > > > log a
> >> > > > > >> > > deprecation warning on initialization if the
> >> > > > DeclarativeSlotManager
> >> > > > > is
> >> > > > > >> > used?
> >> > > > > >> > >
> >> > > > > >> > > Thanks again,
> >> > > > > >> > > John
> >> > > > > >> > >
> >> > > > > >> > > On Tue, Feb 28, 2023, at 07:20, Matthias Pohl wrote:
> >> > > > > >> > > > Hi Weihua,
> >> > > > > >> > > > Thanks for your proposal. From a conceptual point: AFAIU,
> >> > the
> >> > > > > >> > > > DeclarativeSlotManager covers a subset (i.e. only evenly
> >> > sized
> >> > > > > slots)
> >> > > > > >> > of
> >> > > > > >> > > > what the FineGrainedSlotManager should be able to achieve
> >> > > > > (variable
> >> > > > > >> > slot
> >> > > > > >> > > > size per task manager). Is this the right
> >> > > > > assumption/understanding?
> >> > > > > >> In
> >> > > > > >> > > this
> >> > > > > >> > > > sense, merging both implementations into a single one
> >> sounds
> >> > > > > good. A
> >> > > > > >> > few
> >> > > > > >> > > > more general comments, though:
> >> > > > > >> > > >
> >> > > > > >> > > > 1. Did you do a proper test coverage analysis? That's not
> >> > > > > mentioned
> >> > > > > >> in
> >> > > > > >> > > the
> >> > > > > >> > > > current version of the FLIP. I'm bringing this up
> >> because we
> >> > > ran
> >> > > > > into
> >> > > > > >> > the
> >> > > > > >> > > > same issue when fixing the flaws that popped up after
> >> > > > introducing
> >> > > > > the
> >> > > > > >> > > > multi-component leader election (see FLIP-285 [1]). There
> >> > is a
> >> > > > > risk
> >> > > > > >> > that
> >> > > > > >> > > by
> >> > > > > >> > > > removing the legacy code we decrease test coverage
> >> because
> >> > > > certain
> >> > > > > >> > > > test cases that were covered for the legacy classes might
> >> > not
> >> > > be
> >> > > > > >> > > > necessarily covered in the new implementation, yet (see
> >> > > > > FLINK-30338
> >> > > > > >> [2]
> >> > > > > >> > > > which covers this issue for the leader election case).
> >> > > Ideally,
> >> > > > we
> >> > > > > >> > don't
> >> > > > > >> > > > want to remove test cases accidentally because they were
> >> > only
> >> > > > > >> > implemented
> >> > > > > >> > > > for the DeclarativeSlotManager but missed for the
> >> > > > > >> > FineGrainedSlotManager.
> >> > > > > >> > > >
> >> > > > > >> > > > 2. DeclarativeSlotManager and FineGrainedSlotManager feel
> >> > > quite
> >> > > > > big
> >> > > > > >> in
> >> > > > > >> > > > terms of lines of code. Without knowing whether it's
> >> > actually
> >> > > a
> >> > > > > >> > > reasonable
> >> > > > > >> > > > thing to do: Instead of just adding more features to the
> >> > > > > >> > > > FineGrainedSlotManager, have you thought of cutting out
> >> > > > > functionality
> >> > > > > >> > > into
> >> > > > > >> > > > smaller sub-components along this refactoring? Such a
> >> > > > step-by-step
> >> > > > > >> > > approach
> >> > > > > >> > > > might improve the overall codebase and might make
> >> reviewing
> >> > > the
> >> > > > > >> > > refactoring
> >> > > > > >> > > > easier. I did a first pass over the code and struggled to
> >> > > > identify
> >> > > > > >> code
> >> > > > > >> > > > blocks that could be moved out of the SlotManager
> >> > > > > implementation(s).
> >> > > > > >> > > > Therefore, I might be wrong with this proposal. I haven't
> >> > > worked
> >> > > > > on
> >> > > > > >> > this
> >> > > > > >> > > > codebase in that detail that it would allow me to come up
> >> > > with a
> >> > > > > >> > > judgement
> >> > > > > >> > > > call. I wanted to bring it up, anyway, because I'm
> >> curious
> >> > > > whether
> >> > > > > >> that
> >> > > > > >> > > > could be an option. There's a comment created by Chesnay
> >> > > (CC'd)
> >> > > > in
> >> > > > > >> the
> >> > > > > >> > > > JavaDoc of TaskExecutorManager [3] indicating something
> >> > > similar.
> >> > > > > I'm
> >> > > > > >> > > > wondering whether he can add some insights here.
> >> > > > > >> > > >
> >> > > > > >> > > > 3. For me personally, having a more detailed summary
> >> > comparing
> >> > > > the
> >> > > > > >> > > > subcomponents of both SlotManager implementations with
> >> where
> >> > > > > >> > > > their functionality matches and where they differ might
> >> help
> >> > > > > >> understand
> >> > > > > >> > > the
> >> > > > > >> > > > consequences of the changes proposed in FLIP-298.
> >> > > > > >> > > >
> >> > > > > >> > > > Best,
> >> > > > > >> > > > Matthias
> >> > > > > >> > > >
> >> > > > > >> > > > [1]
> >> > > > > >> > > >
> >> > > > > >> > >
> >> > > > > >> >
> >> > > > > >>
> >> > > > >
> >> > > >
> >> > >
> >> >
> >> https://cwiki.apache.org/confluence/display/FLINK/FLIP-285%3A+Refactoring+LeaderElection+to+make+Flink+support+multi-component+leader+election+out-of-the-box
> >> > > > > >> > > > [2] https://issues.apache.org/jira/browse/FLINK-30338
> >> > > > > >> > > > [3]
> >> > > > > >> > > >
> >> > > > > >> > >
> >> > > > > >> >
> >> > > > > >>
> >> > > > >
> >> > > >
> >> > >
> >> >
> >> https://github.com/apache/flink/blob/f611ea8cb5deddb42429df2c99f0c68d7382e9bd/flink-runtime/src/main/java/org/apache/flink/runtime/resourcemanager/slotmanager/TaskExecutorManager.java#L66-L68
> >> > > > > >> > > >
> >> > > > > >> > > > On Tue, Feb 28, 2023 at 6:14 AM Matt Wang <
> >> wangm92@163.com>
> >> > > > > wrote:
> >> > > > > >> > > >
> >> > > > > >> > > >> This is a good proposal for me, it will make the code of
> >> > the
> >> > > > > >> > SlotManager
> >> > > > > >> > > >> more clear.
> >> > > > > >> > > >>
> >> > > > > >> > > >>
> >> > > > > >> > > >>
> >> > > > > >> > > >> --
> >> > > > > >> > > >>
> >> > > > > >> > > >> Best,
> >> > > > > >> > > >> Matt Wang
> >> > > > > >> > > >>
> >> > > > > >> > > >>
> >> > > > > >> > > >> ---- Replied Message ----
> >> > > > > >> > > >> | From | David Morávek<dm...@apache.org> |
> >> > > > > >> > > >> | Date | 02/27/2023 22:45 |
> >> > > > > >> > > >> | To | <de...@flink.apache.org> |
> >> > > > > >> > > >> | Subject | Re: [DISCUSS] FLIP-298: Unifying the
> >> > > Implementation
> >> > > > > of
> >> > > > > >> > > >> SlotManager |
> >> > > > > >> > > >> Hi Weihua, I still need to dig into the details, but the
> >> > > > overall
> >> > > > > >> > > sentiment
> >> > > > > >> > > >> of this change sounds reasonable.
> >> > > > > >> > > >>
> >> > > > > >> > > >> Best,
> >> > > > > >> > > >> D.
> >> > > > > >> > > >>
> >> > > > > >> > > >> On Mon, Feb 27, 2023 at 2:26 PM Zhanghao Chen <
> >> > > > > >> > > zhanghao.chen@outlook.com>
> >> > > > > >> > > >> wrote:
> >> > > > > >> > > >>
> >> > > > > >> > > >> Thanks for driving this topic. I think this FLIP could
> >> help
> >> > > > > clean up
> >> > > > > >> > the
> >> > > > > >> > > >> codebase to make it easier to maintain. +1 on it.
> >> > > > > >> > > >>
> >> > > > > >> > > >> Best,
> >> > > > > >> > > >> Zhanghao Chen
> >> > > > > >> > > >> ________________________________
> >> > > > > >> > > >> From: Weihua Hu <hu...@gmail.com>
> >> > > > > >> > > >> Sent: Monday, February 27, 2023 20:40
> >> > > > > >> > > >> To: dev <de...@flink.apache.org>
> >> > > > > >> > > >> Subject: [DISCUSS] FLIP-298: Unifying the
> >> Implementation of
> >> > > > > >> > SlotManager
> >> > > > > >> > > >>
> >> > > > > >> > > >> Hi everyone,
> >> > > > > >> > > >>
> >> > > > > >> > > >> I would like to begin a discussion on FLIP-298: Unifying
> >> > the
> >> > > > > >> > > Implementation
> >> > > > > >> > > >> of SlotManager[1]. There are currently two types of
> >> > > SlotManager
> >> > > > > in
> >> > > > > >> > > Flink:
> >> > > > > >> > > >> DeclarativeSlotManager and FineGrainedSlotManager.
> >> > > > > >> > > FineGrainedSlotManager
> >> > > > > >> > > >> should behave as DeclarativeSlotManager if the user does
> >> > not
> >> > > > > >> configure
> >> > > > > >> > > the
> >> > > > > >> > > >> slot request profile.
> >> > > > > >> > > >>
> >> > > > > >> > > >> Therefore, this FLIP aims to unify the implementation of
> >> > > > > SlotManager
> >> > > > > >> > in
> >> > > > > >> > > >> order to reduce maintenance costs.
> >> > > > > >> > > >>
> >> > > > > >> > > >> Looking forward to hearing from you.
> >> > > > > >> > > >>
> >> > > > > >> > > >> [1]
> >> > > > > >> > > >>
> >> > > > > >> > > >>
> >> > > > > >> > > >>
> >> > > > > >> > >
> >> > > > > >> >
> >> > > > > >>
> >> > > > >
> >> > > >
> >> > >
> >> >
> >> https://cwiki.apache.org/confluence/display/FLINK/FLIP-298%3A+Unifying+the+Implementation+of+SlotManager
> >> > > > > >> > > >>
> >> > > > > >> > > >> Best,
> >> > > > > >> > > >> Weihua
> >> > > > > >> > > >>
> >> > > > > >> > > >>
> >> > > > > >> > >
> >> > > > > >> >
> >> > > > > >>
> >> > > > >
> >> > > >
> >> > >
> >> >
> >>
> >

Re: [DISCUSS] FLIP-298: Unifying the Implementation of SlotManager

Posted by Weihua Hu <hu...@gmail.com>.
Hi,

@David @Matthias
There are a few days after hearing your thoughts. I would like to know if
there are any other concerns about this FLIP.


Best,
Weihua


On Mon, Mar 6, 2023 at 7:53 PM Weihua Hu <hu...@gmail.com> wrote:

>
> Thanks Shammon,
>
> I've updated FLIP to add this redundant Task Manager limitation.
>
>
> Best,
> Weihua
>
>
> On Mon, Mar 6, 2023 at 5:07 PM Shammon FY <zj...@gmail.com> wrote:
>
>> Hi weihua
>>
>> Can you add content related to `heterogeneous resources` to this FLIP? We
>> can record it and consider it in the future. It may be useful for some
>> scenarios, such as the combination of streaming and ML.
>>
>> Best,
>> Shammon
>>
>>
>> On Mon, Mar 6, 2023 at 1:39 PM weijie guo <gu...@gmail.com>
>> wrote:
>>
>> > Hi Weihua,
>> >
>> > Thanks for your clarification, SGTM.
>> >
>> > Best regards,
>> >
>> > Weijie
>> >
>> >
>> > Weihua Hu <hu...@gmail.com> 于2023年3月6日周一 11:43写道:
>> >
>> > > Thanks Weijie.
>> > >
>> > > Heterogeneous task managers will not be considered in this FLIP since
>> > > it does not request heterogeneous resources as you said.
>> > >
>> > > My first thought is we can adjust the meaning of redundant
>> configuration
>> > > to redundant number of per resource type. These can be considered in
>> > > detail when we decide to support heterogeneous task managers.
>> > >
>> > > Best,
>> > > Weihua
>> > >
>> > >
>> > > On Sat, Mar 4, 2023 at 1:13 AM weijie guo <gu...@gmail.com>
>> > > wrote:
>> > >
>> > > > Thanks Weihua for preparing this FLIP.
>> > > >
>> > > > This FLIP overall looks reasonable to me after updating as
>> suggested by
>> > > > Matthias.
>> > > >
>> > > > I only have one small question about keeping some redundant task
>> > > managers:
>> > > > In the fine-grained resource management, theoretically, it can
>> support
>> > > > heterogeneous taskmanagers. When we complete the missing features
>> for
>> > > FGSM,
>> > > > do we plan to take this into account?
>> > > > Of course, if I remember correctly, FGSM will not request
>> heterogeneous
>> > > > resources at present, so it is also acceptable to me if there is no
>> > > special
>> > > > treatment now.
>> > > >
>> > > > +1 for this changes if we can ensure the test coverage.
>> > > >
>> > > > Best regards,
>> > > >
>> > > > Weijie
>> > > >
>> > > >
>> > > > John Roesler <vv...@apache.org> 于2023年3月2日周四 12:53写道:
>> > > >
>> > > > > Thanks for the test plan, Weihua!
>> > > > >
>> > > > > Yes, it addresses my concerns.
>> > > > >
>> > > > > Thanks,
>> > > > > John
>> > > > >
>> > > > > On Wed, Mar 1, 2023, at 22:38, Weihua Hu wrote:
>> > > > > > Hi, everyone,
>> > > > > > Thanks for your suggestions and ideas.
>> > > > > > Thanks Xintong for sharing the detailed backgrounds of
>> SlotManager.
>> > > > > >
>> > > > > > *@Matthias
>> > > > > >
>> > > > > > 1. Did you do a proper test coverage analysis?
>> > > > > >
>> > > > > >
>> > > > > > Just as Xintong said, we already have a CI stage for fine
>> grained
>> > > > > resource
>> > > > > > managers.
>> > > > > > And I will make sure FineGrainedSlotManager as the default
>> > > SlotManager
>> > > > > can
>> > > > > > pass all the tests of CI.
>> > > > > > In addition, I will review all unit tests of
>> > > > DeclarativeSlotManager(DSM)
>> > > > > to
>> > > > > > ensure that there are no gaps in the
>> > > > > > coverage provided by the FineGrainedSlotManager.
>> > > > > > I also added the 'Test Plan' part to the FLIP.
>> > > > > > @Matthias @John @Shammon Does this test plan address your
>> concerns?
>> > > > > >
>> > > > > > 2.  DeclarativeSlotManager and FineGrainedSlotManager feel quite
>> > big
>> > > in
>> > > > > >
>> > > > > > terms of lines of code....
>> > > > > >
>> > > > > >
>> > > > > > IMO, the refactoring of SlotManager does not belong to this FLIP
>> > > since
>> > > > it
>> > > > > > may lead to some unstable risks. For
>> > > > > > FineGrainedSlotManager(FGSM), we already split some reasonable
>> > > > > components.
>> > > > > > They are:
>> > > > > > * TaskManagerTracker: Track task managers and their resources.
>> > > > > > * ResourceTracker: track requirements of jobs
>> > > > > > * ResourceAllocationStrategy: Try to fulfill the resource
>> > > requirements
>> > > > > with
>> > > > > > available/pending resources.
>> > > > > > * SlotStatusSyncer: communicate with TaskManager, for
>> > > > allocating/freeing
>> > > > > > slot and reconciling the slot status
>> > > > > > Maybe we can start a discussion about refactoring SlotManager in
>> > > > another
>> > > > > > FLIP if there are some good suggestions.
>> > > > > > WDYT
>> > > > > >
>> > > > > > 3. For me personally, having a more detailed summary comparing
>> the
>> > > > > >> subcomponents of both SlotManager implementations with where
>> > > > > >> their functionality matches and where they differ might help
>> > > > understand
>> > > > > the
>> > > > > >> consequences of the changes proposed in FLIP-298
>> > > > > >
>> > > > > > Good suggestion, I have updated the comparison in this FLIP.
>> > Looking
>> > > > > > forward to any suggestions/thoughts
>> > > > > > if they are not described clearly.
>> > > > > >
>> > > > > > *@John
>> > > > > >
>> > > > > > 4. In addition to changing the default, would it make sense to
>> log
>> > a
>> > > > > >> deprecation warning on initialization
>> > > > > >
>> > > > > > if the DeclarativeSlotManager is used?
>> > > > > >>
>> > > > > > SGTM, We should add Deprecated annotations to DSM for devs. And
>> > log a
>> > > > > > deprecation warning for users.
>> > > > > >
>> > > > > > *@Shammon
>> > > > > >
>> > > > > > 1. For their functional differences, can you give some detailed
>> > tests
>> > > > to
>> > > > > >> verify that the new FineGrainedSlotManager has these
>> capabilities?
>> > > > This
>> > > > > can
>> > > > > >> effectively verify the new functions
>> > > > > >>
>> > > > > > As just maintained, there is already a CI stage of FGSM, and I
>> will
>> > > do
>> > > > > more
>> > > > > > review of unit tests for DSM.
>> > > > > >
>> > > > > >  2. I'm worried that many functions are not independent and it
>> is
>> > > > > difficult
>> > > > > >> to migrate step-by-step. You can list the relationship between
>> > them
>> > > in
>> > > > > >> detail.
>> > > > > >
>> > > > > >  As Xintong saied the DSM is a subset of FGSM by design. But as
>> > time
>> > > > goes
>> > > > > > on, FGSM has some lacking
>> > > > > > functions as I listed in this FLIP. And I have added the
>> comparison
>> > > > > between
>> > > > > > DSM and FGSM in this FLIP.
>> > > > > >
>> > > > > >
>> > > > > > Thanks again for all your thoughts. Any feedback is appreciated!
>> > > > > >
>> > > > > > Best,
>> > > > > > Weihua
>> > > > > >
>> > > > > >
>> > > > > > On Wed, Mar 1, 2023 at 2:17 PM Xintong Song <
>> tonysong820@gmail.com
>> > >
>> > > > > wrote:
>> > > > > >
>> > > > > >> Thanks Weihua for preparing this FLIP. +1 for the proposal.
>> > > > > >>
>> > > > > >>
>> > > > > >> As one of the contributors of the fine-grained slot manager,
>> I'd
>> > > like
>> > > > to
>> > > > > >> share some backgrounds here.
>> > > > > >>
>> > > > > >> - There used to be a defaut slot manager implementation, which
>> is
>> > > > > >> non-declarative and has been removed now. The two features,
>> > > > declarative
>> > > > > /
>> > > > > >> reactive resource management and fine-grained resource
>> management,
>> > > > were
>> > > > > >> proposed at about the same time. We were aware that by design
>> the
>> > > > > >> declarative slot manager is a subset of fine-grained slot
>> manager
>> > at
>> > > > > that
>> > > > > >> time, but still decided to implement two slot managers for the
>> > > purpose
>> > > > > of
>> > > > > >> decoupling efforts and reducing cross-team synchronization
>> > overhead.
>> > > > > >> Merging the two slot managers once they are proved stable is
>> IMO a
>> > > > > >> technical debt that was planned at the very beginning.
>> > > > > >>
>> > > > > >> - The FineGrainedSlotManager has been verified in Alibaba's
>> > internal
>> > > > > >> production as well as Alibaba Cloud services as the default
>> slot
>> > > > manager
>> > > > > >> for about 2 years.
>> > > > > >>
>> > > > > >>
>> > > > > >> Concerning test cases, we currently have a ci stage for fine
>> > grained
>> > > > > >> resource management. To avoid adding too much burden, the stage
>> > only
>> > > > > >> includes tests from flink-runtime and flink-test modules. I
>> think
>> > > > > switching
>> > > > > >> the default slot manager and applying the whole set of tests on
>> > the
>> > > > > >> fine-grained slot manager would help us to be more confident
>> about
>> > > it.
>> > > > > >>
>> > > > > >>
>> > > > > >> Concerning cutting out functionalities out of slot manager, I
>> > think
>> > > > > Yangze
>> > > > > >> and I have tried our best to shape the FineGrainedSlotManager
>> into
>> > > > > >> reasonable components. I personally don't have other ideas to
>> > > further
>> > > > > >> disassemble the component, but I'm open to such suggestions.
>> > > However,
>> > > > > from
>> > > > > >> the stability perspective, I'd be in favor of not introducing
>> > > > > significant
>> > > > > >> changes to the FineGrainedSlotManager while switching it to the
>> > > > default.
>> > > > > >> Because the current implementation has already been verified
>> (or
>> > at
>> > > > > least
>> > > > > >> partially verified because Alibaba does not cover all the Flink
>> > use
>> > > > > cases),
>> > > > > >> and introducing more changes also means more chances of
>> breaking
>> > > > things.
>> > > > > >>
>> > > > > >>
>> > > > > >> Best,
>> > > > > >>
>> > > > > >> Xintong
>> > > > > >>
>> > > > > >>
>> > > > > >>
>> > > > > >> On Wed, Mar 1, 2023 at 11:12 AM Shammon FY <zj...@gmail.com>
>> > > wrote:
>> > > > > >>
>> > > > > >> > Hi
>> > > > > >> >
>> > > > > >> > Thanks for starting this work weihua, I think unifying
>> > > > > >> > DeclarativeSlotManager and FineGrainedSlotManager is
>> valuable.
>> > > > > >> >
>> > > > > >> > I agree with @Matthias and @John that we need a way to ensure
>> > that
>> > > > > >> > DeclarativeSlotManager's capabilities are fully covered by
>> > > > > >> > FineGrainedSlotManager
>> > > > > >> >
>> > > > > >> > 1. For their functional differences, can you give some
>> detailed
>> > > > tests
>> > > > > to
>> > > > > >> > verify that the new FineGrainedSlotManager has these
>> > capabilities?
>> > > > > This
>> > > > > >> can
>> > > > > >> > effectively verify the new functions
>> > > > > >> >
>> > > > > >> > 2. I'm worried that many functions are not independent and
>> it is
>> > > > > >> difficult
>> > > > > >> > to migrate step-by-step. You can list the relationship
>> between
>> > > them
>> > > > in
>> > > > > >> > detail.
>> > > > > >> >
>> > > > > >> > 3. As John mentioned, give a smoke test for
>> > FineGrainedSlotManager
>> > > > is
>> > > > > a
>> > > > > >> > good idea. Or you can add some test information to the
>> > > > > >> > DeclarativeSlotManager to determine how many tests have used
>> it.
>> > > In
>> > > > > this
>> > > > > >> > way, we can gradually construct test cases for
>> > > > FineGrainedSlotManager
>> > > > > >> > during the development process.
>> > > > > >> >
>> > > > > >> >
>> > > > > >> > Best,
>> > > > > >> > Shammon
>> > > > > >> >
>> > > > > >> >
>> > > > > >> > On Tue, Feb 28, 2023 at 10:22 PM John Roesler <
>> > > vvcephei@apache.org>
>> > > > > >> wrote:
>> > > > > >> >
>> > > > > >> > > Thanks for the FLIP, Weihua!
>> > > > > >> > >
>> > > > > >> > > I’ve read the FLIP, and it sounds good to me. We need to
>> avoid
>> > > > > >> > > proliferating alternative implementations wherever
>> possible. I
>> > > > have
>> > > > > >> just
>> > > > > >> > a
>> > > > > >> > > couple of comments:
>> > > > > >> > >
>> > > > > >> > > 1. I share Matthias’s concern about ensuring the behavior
>> is
>> > > > really
>> > > > > the
>> > > > > >> > > same. One suggestion I’ve used for this kind of thing is,
>> as a
>> > > > smoke
>> > > > > >> > test,
>> > > > > >> > > to update the DeclarativeSlotManager to just delegate to
>> the
>> > > > > >> > > FineGrainedSlotManager. If the full test suite still
>> passes,
>> > you
>> > > > > can be
>> > > > > >> > > pretty sure the new default is really ok. It would not be a
>> > good
>> > > > > idea
>> > > > > >> to
>> > > > > >> > > actually keep that in for the release, since it would
>> remove
>> > the
>> > > > > option
>> > > > > >> > to
>> > > > > >> > > fall back in case of bugs. Either way, we need to make sure
>> > all
>> > > > test
>> > > > > >> > > scenarios are present for the FGSM.
>> > > > > >> > >
>> > > > > >> > > 4. In addition to changing the default, would it make
>> sense to
>> > > > log a
>> > > > > >> > > deprecation warning on initialization if the
>> > > > DeclarativeSlotManager
>> > > > > is
>> > > > > >> > used?
>> > > > > >> > >
>> > > > > >> > > Thanks again,
>> > > > > >> > > John
>> > > > > >> > >
>> > > > > >> > > On Tue, Feb 28, 2023, at 07:20, Matthias Pohl wrote:
>> > > > > >> > > > Hi Weihua,
>> > > > > >> > > > Thanks for your proposal. From a conceptual point: AFAIU,
>> > the
>> > > > > >> > > > DeclarativeSlotManager covers a subset (i.e. only evenly
>> > sized
>> > > > > slots)
>> > > > > >> > of
>> > > > > >> > > > what the FineGrainedSlotManager should be able to achieve
>> > > > > (variable
>> > > > > >> > slot
>> > > > > >> > > > size per task manager). Is this the right
>> > > > > assumption/understanding?
>> > > > > >> In
>> > > > > >> > > this
>> > > > > >> > > > sense, merging both implementations into a single one
>> sounds
>> > > > > good. A
>> > > > > >> > few
>> > > > > >> > > > more general comments, though:
>> > > > > >> > > >
>> > > > > >> > > > 1. Did you do a proper test coverage analysis? That's not
>> > > > > mentioned
>> > > > > >> in
>> > > > > >> > > the
>> > > > > >> > > > current version of the FLIP. I'm bringing this up
>> because we
>> > > ran
>> > > > > into
>> > > > > >> > the
>> > > > > >> > > > same issue when fixing the flaws that popped up after
>> > > > introducing
>> > > > > the
>> > > > > >> > > > multi-component leader election (see FLIP-285 [1]). There
>> > is a
>> > > > > risk
>> > > > > >> > that
>> > > > > >> > > by
>> > > > > >> > > > removing the legacy code we decrease test coverage
>> because
>> > > > certain
>> > > > > >> > > > test cases that were covered for the legacy classes might
>> > not
>> > > be
>> > > > > >> > > > necessarily covered in the new implementation, yet (see
>> > > > > FLINK-30338
>> > > > > >> [2]
>> > > > > >> > > > which covers this issue for the leader election case).
>> > > Ideally,
>> > > > we
>> > > > > >> > don't
>> > > > > >> > > > want to remove test cases accidentally because they were
>> > only
>> > > > > >> > implemented
>> > > > > >> > > > for the DeclarativeSlotManager but missed for the
>> > > > > >> > FineGrainedSlotManager.
>> > > > > >> > > >
>> > > > > >> > > > 2. DeclarativeSlotManager and FineGrainedSlotManager feel
>> > > quite
>> > > > > big
>> > > > > >> in
>> > > > > >> > > > terms of lines of code. Without knowing whether it's
>> > actually
>> > > a
>> > > > > >> > > reasonable
>> > > > > >> > > > thing to do: Instead of just adding more features to the
>> > > > > >> > > > FineGrainedSlotManager, have you thought of cutting out
>> > > > > functionality
>> > > > > >> > > into
>> > > > > >> > > > smaller sub-components along this refactoring? Such a
>> > > > step-by-step
>> > > > > >> > > approach
>> > > > > >> > > > might improve the overall codebase and might make
>> reviewing
>> > > the
>> > > > > >> > > refactoring
>> > > > > >> > > > easier. I did a first pass over the code and struggled to
>> > > > identify
>> > > > > >> code
>> > > > > >> > > > blocks that could be moved out of the SlotManager
>> > > > > implementation(s).
>> > > > > >> > > > Therefore, I might be wrong with this proposal. I haven't
>> > > worked
>> > > > > on
>> > > > > >> > this
>> > > > > >> > > > codebase in that detail that it would allow me to come up
>> > > with a
>> > > > > >> > > judgement
>> > > > > >> > > > call. I wanted to bring it up, anyway, because I'm
>> curious
>> > > > whether
>> > > > > >> that
>> > > > > >> > > > could be an option. There's a comment created by Chesnay
>> > > (CC'd)
>> > > > in
>> > > > > >> the
>> > > > > >> > > > JavaDoc of TaskExecutorManager [3] indicating something
>> > > similar.
>> > > > > I'm
>> > > > > >> > > > wondering whether he can add some insights here.
>> > > > > >> > > >
>> > > > > >> > > > 3. For me personally, having a more detailed summary
>> > comparing
>> > > > the
>> > > > > >> > > > subcomponents of both SlotManager implementations with
>> where
>> > > > > >> > > > their functionality matches and where they differ might
>> help
>> > > > > >> understand
>> > > > > >> > > the
>> > > > > >> > > > consequences of the changes proposed in FLIP-298.
>> > > > > >> > > >
>> > > > > >> > > > Best,
>> > > > > >> > > > Matthias
>> > > > > >> > > >
>> > > > > >> > > > [1]
>> > > > > >> > > >
>> > > > > >> > >
>> > > > > >> >
>> > > > > >>
>> > > > >
>> > > >
>> > >
>> >
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-285%3A+Refactoring+LeaderElection+to+make+Flink+support+multi-component+leader+election+out-of-the-box
>> > > > > >> > > > [2] https://issues.apache.org/jira/browse/FLINK-30338
>> > > > > >> > > > [3]
>> > > > > >> > > >
>> > > > > >> > >
>> > > > > >> >
>> > > > > >>
>> > > > >
>> > > >
>> > >
>> >
>> https://github.com/apache/flink/blob/f611ea8cb5deddb42429df2c99f0c68d7382e9bd/flink-runtime/src/main/java/org/apache/flink/runtime/resourcemanager/slotmanager/TaskExecutorManager.java#L66-L68
>> > > > > >> > > >
>> > > > > >> > > > On Tue, Feb 28, 2023 at 6:14 AM Matt Wang <
>> wangm92@163.com>
>> > > > > wrote:
>> > > > > >> > > >
>> > > > > >> > > >> This is a good proposal for me, it will make the code of
>> > the
>> > > > > >> > SlotManager
>> > > > > >> > > >> more clear.
>> > > > > >> > > >>
>> > > > > >> > > >>
>> > > > > >> > > >>
>> > > > > >> > > >> --
>> > > > > >> > > >>
>> > > > > >> > > >> Best,
>> > > > > >> > > >> Matt Wang
>> > > > > >> > > >>
>> > > > > >> > > >>
>> > > > > >> > > >> ---- Replied Message ----
>> > > > > >> > > >> | From | David Morávek<dm...@apache.org> |
>> > > > > >> > > >> | Date | 02/27/2023 22:45 |
>> > > > > >> > > >> | To | <de...@flink.apache.org> |
>> > > > > >> > > >> | Subject | Re: [DISCUSS] FLIP-298: Unifying the
>> > > Implementation
>> > > > > of
>> > > > > >> > > >> SlotManager |
>> > > > > >> > > >> Hi Weihua, I still need to dig into the details, but the
>> > > > overall
>> > > > > >> > > sentiment
>> > > > > >> > > >> of this change sounds reasonable.
>> > > > > >> > > >>
>> > > > > >> > > >> Best,
>> > > > > >> > > >> D.
>> > > > > >> > > >>
>> > > > > >> > > >> On Mon, Feb 27, 2023 at 2:26 PM Zhanghao Chen <
>> > > > > >> > > zhanghao.chen@outlook.com>
>> > > > > >> > > >> wrote:
>> > > > > >> > > >>
>> > > > > >> > > >> Thanks for driving this topic. I think this FLIP could
>> help
>> > > > > clean up
>> > > > > >> > the
>> > > > > >> > > >> codebase to make it easier to maintain. +1 on it.
>> > > > > >> > > >>
>> > > > > >> > > >> Best,
>> > > > > >> > > >> Zhanghao Chen
>> > > > > >> > > >> ________________________________
>> > > > > >> > > >> From: Weihua Hu <hu...@gmail.com>
>> > > > > >> > > >> Sent: Monday, February 27, 2023 20:40
>> > > > > >> > > >> To: dev <de...@flink.apache.org>
>> > > > > >> > > >> Subject: [DISCUSS] FLIP-298: Unifying the
>> Implementation of
>> > > > > >> > SlotManager
>> > > > > >> > > >>
>> > > > > >> > > >> Hi everyone,
>> > > > > >> > > >>
>> > > > > >> > > >> I would like to begin a discussion on FLIP-298: Unifying
>> > the
>> > > > > >> > > Implementation
>> > > > > >> > > >> of SlotManager[1]. There are currently two types of
>> > > SlotManager
>> > > > > in
>> > > > > >> > > Flink:
>> > > > > >> > > >> DeclarativeSlotManager and FineGrainedSlotManager.
>> > > > > >> > > FineGrainedSlotManager
>> > > > > >> > > >> should behave as DeclarativeSlotManager if the user does
>> > not
>> > > > > >> configure
>> > > > > >> > > the
>> > > > > >> > > >> slot request profile.
>> > > > > >> > > >>
>> > > > > >> > > >> Therefore, this FLIP aims to unify the implementation of
>> > > > > SlotManager
>> > > > > >> > in
>> > > > > >> > > >> order to reduce maintenance costs.
>> > > > > >> > > >>
>> > > > > >> > > >> Looking forward to hearing from you.
>> > > > > >> > > >>
>> > > > > >> > > >> [1]
>> > > > > >> > > >>
>> > > > > >> > > >>
>> > > > > >> > > >>
>> > > > > >> > >
>> > > > > >> >
>> > > > > >>
>> > > > >
>> > > >
>> > >
>> >
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-298%3A+Unifying+the+Implementation+of+SlotManager
>> > > > > >> > > >>
>> > > > > >> > > >> Best,
>> > > > > >> > > >> Weihua
>> > > > > >> > > >>
>> > > > > >> > > >>
>> > > > > >> > >
>> > > > > >> >
>> > > > > >>
>> > > > >
>> > > >
>> > >
>> >
>>
>

Re: [DISCUSS] FLIP-298: Unifying the Implementation of SlotManager

Posted by Weihua Hu <hu...@gmail.com>.
Thanks Shammon,

I've updated FLIP to add this redundant Task Manager limitation.


Best,
Weihua


On Mon, Mar 6, 2023 at 5:07 PM Shammon FY <zj...@gmail.com> wrote:

> Hi weihua
>
> Can you add content related to `heterogeneous resources` to this FLIP? We
> can record it and consider it in the future. It may be useful for some
> scenarios, such as the combination of streaming and ML.
>
> Best,
> Shammon
>
>
> On Mon, Mar 6, 2023 at 1:39 PM weijie guo <gu...@gmail.com>
> wrote:
>
> > Hi Weihua,
> >
> > Thanks for your clarification, SGTM.
> >
> > Best regards,
> >
> > Weijie
> >
> >
> > Weihua Hu <hu...@gmail.com> 于2023年3月6日周一 11:43写道:
> >
> > > Thanks Weijie.
> > >
> > > Heterogeneous task managers will not be considered in this FLIP since
> > > it does not request heterogeneous resources as you said.
> > >
> > > My first thought is we can adjust the meaning of redundant
> configuration
> > > to redundant number of per resource type. These can be considered in
> > > detail when we decide to support heterogeneous task managers.
> > >
> > > Best,
> > > Weihua
> > >
> > >
> > > On Sat, Mar 4, 2023 at 1:13 AM weijie guo <gu...@gmail.com>
> > > wrote:
> > >
> > > > Thanks Weihua for preparing this FLIP.
> > > >
> > > > This FLIP overall looks reasonable to me after updating as suggested
> by
> > > > Matthias.
> > > >
> > > > I only have one small question about keeping some redundant task
> > > managers:
> > > > In the fine-grained resource management, theoretically, it can
> support
> > > > heterogeneous taskmanagers. When we complete the missing features for
> > > FGSM,
> > > > do we plan to take this into account?
> > > > Of course, if I remember correctly, FGSM will not request
> heterogeneous
> > > > resources at present, so it is also acceptable to me if there is no
> > > special
> > > > treatment now.
> > > >
> > > > +1 for this changes if we can ensure the test coverage.
> > > >
> > > > Best regards,
> > > >
> > > > Weijie
> > > >
> > > >
> > > > John Roesler <vv...@apache.org> 于2023年3月2日周四 12:53写道:
> > > >
> > > > > Thanks for the test plan, Weihua!
> > > > >
> > > > > Yes, it addresses my concerns.
> > > > >
> > > > > Thanks,
> > > > > John
> > > > >
> > > > > On Wed, Mar 1, 2023, at 22:38, Weihua Hu wrote:
> > > > > > Hi, everyone,
> > > > > > Thanks for your suggestions and ideas.
> > > > > > Thanks Xintong for sharing the detailed backgrounds of
> SlotManager.
> > > > > >
> > > > > > *@Matthias
> > > > > >
> > > > > > 1. Did you do a proper test coverage analysis?
> > > > > >
> > > > > >
> > > > > > Just as Xintong said, we already have a CI stage for fine grained
> > > > > resource
> > > > > > managers.
> > > > > > And I will make sure FineGrainedSlotManager as the default
> > > SlotManager
> > > > > can
> > > > > > pass all the tests of CI.
> > > > > > In addition, I will review all unit tests of
> > > > DeclarativeSlotManager(DSM)
> > > > > to
> > > > > > ensure that there are no gaps in the
> > > > > > coverage provided by the FineGrainedSlotManager.
> > > > > > I also added the 'Test Plan' part to the FLIP.
> > > > > > @Matthias @John @Shammon Does this test plan address your
> concerns?
> > > > > >
> > > > > > 2.  DeclarativeSlotManager and FineGrainedSlotManager feel quite
> > big
> > > in
> > > > > >
> > > > > > terms of lines of code....
> > > > > >
> > > > > >
> > > > > > IMO, the refactoring of SlotManager does not belong to this FLIP
> > > since
> > > > it
> > > > > > may lead to some unstable risks. For
> > > > > > FineGrainedSlotManager(FGSM), we already split some reasonable
> > > > > components.
> > > > > > They are:
> > > > > > * TaskManagerTracker: Track task managers and their resources.
> > > > > > * ResourceTracker: track requirements of jobs
> > > > > > * ResourceAllocationStrategy: Try to fulfill the resource
> > > requirements
> > > > > with
> > > > > > available/pending resources.
> > > > > > * SlotStatusSyncer: communicate with TaskManager, for
> > > > allocating/freeing
> > > > > > slot and reconciling the slot status
> > > > > > Maybe we can start a discussion about refactoring SlotManager in
> > > > another
> > > > > > FLIP if there are some good suggestions.
> > > > > > WDYT
> > > > > >
> > > > > > 3. For me personally, having a more detailed summary comparing
> the
> > > > > >> subcomponents of both SlotManager implementations with where
> > > > > >> their functionality matches and where they differ might help
> > > > understand
> > > > > the
> > > > > >> consequences of the changes proposed in FLIP-298
> > > > > >
> > > > > > Good suggestion, I have updated the comparison in this FLIP.
> > Looking
> > > > > > forward to any suggestions/thoughts
> > > > > > if they are not described clearly.
> > > > > >
> > > > > > *@John
> > > > > >
> > > > > > 4. In addition to changing the default, would it make sense to
> log
> > a
> > > > > >> deprecation warning on initialization
> > > > > >
> > > > > > if the DeclarativeSlotManager is used?
> > > > > >>
> > > > > > SGTM, We should add Deprecated annotations to DSM for devs. And
> > log a
> > > > > > deprecation warning for users.
> > > > > >
> > > > > > *@Shammon
> > > > > >
> > > > > > 1. For their functional differences, can you give some detailed
> > tests
> > > > to
> > > > > >> verify that the new FineGrainedSlotManager has these
> capabilities?
> > > > This
> > > > > can
> > > > > >> effectively verify the new functions
> > > > > >>
> > > > > > As just maintained, there is already a CI stage of FGSM, and I
> will
> > > do
> > > > > more
> > > > > > review of unit tests for DSM.
> > > > > >
> > > > > >  2. I'm worried that many functions are not independent and it is
> > > > > difficult
> > > > > >> to migrate step-by-step. You can list the relationship between
> > them
> > > in
> > > > > >> detail.
> > > > > >
> > > > > >  As Xintong saied the DSM is a subset of FGSM by design. But as
> > time
> > > > goes
> > > > > > on, FGSM has some lacking
> > > > > > functions as I listed in this FLIP. And I have added the
> comparison
> > > > > between
> > > > > > DSM and FGSM in this FLIP.
> > > > > >
> > > > > >
> > > > > > Thanks again for all your thoughts. Any feedback is appreciated!
> > > > > >
> > > > > > Best,
> > > > > > Weihua
> > > > > >
> > > > > >
> > > > > > On Wed, Mar 1, 2023 at 2:17 PM Xintong Song <
> tonysong820@gmail.com
> > >
> > > > > wrote:
> > > > > >
> > > > > >> Thanks Weihua for preparing this FLIP. +1 for the proposal.
> > > > > >>
> > > > > >>
> > > > > >> As one of the contributors of the fine-grained slot manager, I'd
> > > like
> > > > to
> > > > > >> share some backgrounds here.
> > > > > >>
> > > > > >> - There used to be a defaut slot manager implementation, which
> is
> > > > > >> non-declarative and has been removed now. The two features,
> > > > declarative
> > > > > /
> > > > > >> reactive resource management and fine-grained resource
> management,
> > > > were
> > > > > >> proposed at about the same time. We were aware that by design
> the
> > > > > >> declarative slot manager is a subset of fine-grained slot
> manager
> > at
> > > > > that
> > > > > >> time, but still decided to implement two slot managers for the
> > > purpose
> > > > > of
> > > > > >> decoupling efforts and reducing cross-team synchronization
> > overhead.
> > > > > >> Merging the two slot managers once they are proved stable is
> IMO a
> > > > > >> technical debt that was planned at the very beginning.
> > > > > >>
> > > > > >> - The FineGrainedSlotManager has been verified in Alibaba's
> > internal
> > > > > >> production as well as Alibaba Cloud services as the default slot
> > > > manager
> > > > > >> for about 2 years.
> > > > > >>
> > > > > >>
> > > > > >> Concerning test cases, we currently have a ci stage for fine
> > grained
> > > > > >> resource management. To avoid adding too much burden, the stage
> > only
> > > > > >> includes tests from flink-runtime and flink-test modules. I
> think
> > > > > switching
> > > > > >> the default slot manager and applying the whole set of tests on
> > the
> > > > > >> fine-grained slot manager would help us to be more confident
> about
> > > it.
> > > > > >>
> > > > > >>
> > > > > >> Concerning cutting out functionalities out of slot manager, I
> > think
> > > > > Yangze
> > > > > >> and I have tried our best to shape the FineGrainedSlotManager
> into
> > > > > >> reasonable components. I personally don't have other ideas to
> > > further
> > > > > >> disassemble the component, but I'm open to such suggestions.
> > > However,
> > > > > from
> > > > > >> the stability perspective, I'd be in favor of not introducing
> > > > > significant
> > > > > >> changes to the FineGrainedSlotManager while switching it to the
> > > > default.
> > > > > >> Because the current implementation has already been verified (or
> > at
> > > > > least
> > > > > >> partially verified because Alibaba does not cover all the Flink
> > use
> > > > > cases),
> > > > > >> and introducing more changes also means more chances of breaking
> > > > things.
> > > > > >>
> > > > > >>
> > > > > >> Best,
> > > > > >>
> > > > > >> Xintong
> > > > > >>
> > > > > >>
> > > > > >>
> > > > > >> On Wed, Mar 1, 2023 at 11:12 AM Shammon FY <zj...@gmail.com>
> > > wrote:
> > > > > >>
> > > > > >> > Hi
> > > > > >> >
> > > > > >> > Thanks for starting this work weihua, I think unifying
> > > > > >> > DeclarativeSlotManager and FineGrainedSlotManager is valuable.
> > > > > >> >
> > > > > >> > I agree with @Matthias and @John that we need a way to ensure
> > that
> > > > > >> > DeclarativeSlotManager's capabilities are fully covered by
> > > > > >> > FineGrainedSlotManager
> > > > > >> >
> > > > > >> > 1. For their functional differences, can you give some
> detailed
> > > > tests
> > > > > to
> > > > > >> > verify that the new FineGrainedSlotManager has these
> > capabilities?
> > > > > This
> > > > > >> can
> > > > > >> > effectively verify the new functions
> > > > > >> >
> > > > > >> > 2. I'm worried that many functions are not independent and it
> is
> > > > > >> difficult
> > > > > >> > to migrate step-by-step. You can list the relationship between
> > > them
> > > > in
> > > > > >> > detail.
> > > > > >> >
> > > > > >> > 3. As John mentioned, give a smoke test for
> > FineGrainedSlotManager
> > > > is
> > > > > a
> > > > > >> > good idea. Or you can add some test information to the
> > > > > >> > DeclarativeSlotManager to determine how many tests have used
> it.
> > > In
> > > > > this
> > > > > >> > way, we can gradually construct test cases for
> > > > FineGrainedSlotManager
> > > > > >> > during the development process.
> > > > > >> >
> > > > > >> >
> > > > > >> > Best,
> > > > > >> > Shammon
> > > > > >> >
> > > > > >> >
> > > > > >> > On Tue, Feb 28, 2023 at 10:22 PM John Roesler <
> > > vvcephei@apache.org>
> > > > > >> wrote:
> > > > > >> >
> > > > > >> > > Thanks for the FLIP, Weihua!
> > > > > >> > >
> > > > > >> > > I’ve read the FLIP, and it sounds good to me. We need to
> avoid
> > > > > >> > > proliferating alternative implementations wherever
> possible. I
> > > > have
> > > > > >> just
> > > > > >> > a
> > > > > >> > > couple of comments:
> > > > > >> > >
> > > > > >> > > 1. I share Matthias’s concern about ensuring the behavior is
> > > > really
> > > > > the
> > > > > >> > > same. One suggestion I’ve used for this kind of thing is,
> as a
> > > > smoke
> > > > > >> > test,
> > > > > >> > > to update the DeclarativeSlotManager to just delegate to the
> > > > > >> > > FineGrainedSlotManager. If the full test suite still passes,
> > you
> > > > > can be
> > > > > >> > > pretty sure the new default is really ok. It would not be a
> > good
> > > > > idea
> > > > > >> to
> > > > > >> > > actually keep that in for the release, since it would remove
> > the
> > > > > option
> > > > > >> > to
> > > > > >> > > fall back in case of bugs. Either way, we need to make sure
> > all
> > > > test
> > > > > >> > > scenarios are present for the FGSM.
> > > > > >> > >
> > > > > >> > > 4. In addition to changing the default, would it make sense
> to
> > > > log a
> > > > > >> > > deprecation warning on initialization if the
> > > > DeclarativeSlotManager
> > > > > is
> > > > > >> > used?
> > > > > >> > >
> > > > > >> > > Thanks again,
> > > > > >> > > John
> > > > > >> > >
> > > > > >> > > On Tue, Feb 28, 2023, at 07:20, Matthias Pohl wrote:
> > > > > >> > > > Hi Weihua,
> > > > > >> > > > Thanks for your proposal. From a conceptual point: AFAIU,
> > the
> > > > > >> > > > DeclarativeSlotManager covers a subset (i.e. only evenly
> > sized
> > > > > slots)
> > > > > >> > of
> > > > > >> > > > what the FineGrainedSlotManager should be able to achieve
> > > > > (variable
> > > > > >> > slot
> > > > > >> > > > size per task manager). Is this the right
> > > > > assumption/understanding?
> > > > > >> In
> > > > > >> > > this
> > > > > >> > > > sense, merging both implementations into a single one
> sounds
> > > > > good. A
> > > > > >> > few
> > > > > >> > > > more general comments, though:
> > > > > >> > > >
> > > > > >> > > > 1. Did you do a proper test coverage analysis? That's not
> > > > > mentioned
> > > > > >> in
> > > > > >> > > the
> > > > > >> > > > current version of the FLIP. I'm bringing this up because
> we
> > > ran
> > > > > into
> > > > > >> > the
> > > > > >> > > > same issue when fixing the flaws that popped up after
> > > > introducing
> > > > > the
> > > > > >> > > > multi-component leader election (see FLIP-285 [1]). There
> > is a
> > > > > risk
> > > > > >> > that
> > > > > >> > > by
> > > > > >> > > > removing the legacy code we decrease test coverage because
> > > > certain
> > > > > >> > > > test cases that were covered for the legacy classes might
> > not
> > > be
> > > > > >> > > > necessarily covered in the new implementation, yet (see
> > > > > FLINK-30338
> > > > > >> [2]
> > > > > >> > > > which covers this issue for the leader election case).
> > > Ideally,
> > > > we
> > > > > >> > don't
> > > > > >> > > > want to remove test cases accidentally because they were
> > only
> > > > > >> > implemented
> > > > > >> > > > for the DeclarativeSlotManager but missed for the
> > > > > >> > FineGrainedSlotManager.
> > > > > >> > > >
> > > > > >> > > > 2. DeclarativeSlotManager and FineGrainedSlotManager feel
> > > quite
> > > > > big
> > > > > >> in
> > > > > >> > > > terms of lines of code. Without knowing whether it's
> > actually
> > > a
> > > > > >> > > reasonable
> > > > > >> > > > thing to do: Instead of just adding more features to the
> > > > > >> > > > FineGrainedSlotManager, have you thought of cutting out
> > > > > functionality
> > > > > >> > > into
> > > > > >> > > > smaller sub-components along this refactoring? Such a
> > > > step-by-step
> > > > > >> > > approach
> > > > > >> > > > might improve the overall codebase and might make
> reviewing
> > > the
> > > > > >> > > refactoring
> > > > > >> > > > easier. I did a first pass over the code and struggled to
> > > > identify
> > > > > >> code
> > > > > >> > > > blocks that could be moved out of the SlotManager
> > > > > implementation(s).
> > > > > >> > > > Therefore, I might be wrong with this proposal. I haven't
> > > worked
> > > > > on
> > > > > >> > this
> > > > > >> > > > codebase in that detail that it would allow me to come up
> > > with a
> > > > > >> > > judgement
> > > > > >> > > > call. I wanted to bring it up, anyway, because I'm curious
> > > > whether
> > > > > >> that
> > > > > >> > > > could be an option. There's a comment created by Chesnay
> > > (CC'd)
> > > > in
> > > > > >> the
> > > > > >> > > > JavaDoc of TaskExecutorManager [3] indicating something
> > > similar.
> > > > > I'm
> > > > > >> > > > wondering whether he can add some insights here.
> > > > > >> > > >
> > > > > >> > > > 3. For me personally, having a more detailed summary
> > comparing
> > > > the
> > > > > >> > > > subcomponents of both SlotManager implementations with
> where
> > > > > >> > > > their functionality matches and where they differ might
> help
> > > > > >> understand
> > > > > >> > > the
> > > > > >> > > > consequences of the changes proposed in FLIP-298.
> > > > > >> > > >
> > > > > >> > > > Best,
> > > > > >> > > > Matthias
> > > > > >> > > >
> > > > > >> > > > [1]
> > > > > >> > > >
> > > > > >> > >
> > > > > >> >
> > > > > >>
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-285%3A+Refactoring+LeaderElection+to+make+Flink+support+multi-component+leader+election+out-of-the-box
> > > > > >> > > > [2] https://issues.apache.org/jira/browse/FLINK-30338
> > > > > >> > > > [3]
> > > > > >> > > >
> > > > > >> > >
> > > > > >> >
> > > > > >>
> > > > >
> > > >
> > >
> >
> https://github.com/apache/flink/blob/f611ea8cb5deddb42429df2c99f0c68d7382e9bd/flink-runtime/src/main/java/org/apache/flink/runtime/resourcemanager/slotmanager/TaskExecutorManager.java#L66-L68
> > > > > >> > > >
> > > > > >> > > > On Tue, Feb 28, 2023 at 6:14 AM Matt Wang <
> wangm92@163.com>
> > > > > wrote:
> > > > > >> > > >
> > > > > >> > > >> This is a good proposal for me, it will make the code of
> > the
> > > > > >> > SlotManager
> > > > > >> > > >> more clear.
> > > > > >> > > >>
> > > > > >> > > >>
> > > > > >> > > >>
> > > > > >> > > >> --
> > > > > >> > > >>
> > > > > >> > > >> Best,
> > > > > >> > > >> Matt Wang
> > > > > >> > > >>
> > > > > >> > > >>
> > > > > >> > > >> ---- Replied Message ----
> > > > > >> > > >> | From | David Morávek<dm...@apache.org> |
> > > > > >> > > >> | Date | 02/27/2023 22:45 |
> > > > > >> > > >> | To | <de...@flink.apache.org> |
> > > > > >> > > >> | Subject | Re: [DISCUSS] FLIP-298: Unifying the
> > > Implementation
> > > > > of
> > > > > >> > > >> SlotManager |
> > > > > >> > > >> Hi Weihua, I still need to dig into the details, but the
> > > > overall
> > > > > >> > > sentiment
> > > > > >> > > >> of this change sounds reasonable.
> > > > > >> > > >>
> > > > > >> > > >> Best,
> > > > > >> > > >> D.
> > > > > >> > > >>
> > > > > >> > > >> On Mon, Feb 27, 2023 at 2:26 PM Zhanghao Chen <
> > > > > >> > > zhanghao.chen@outlook.com>
> > > > > >> > > >> wrote:
> > > > > >> > > >>
> > > > > >> > > >> Thanks for driving this topic. I think this FLIP could
> help
> > > > > clean up
> > > > > >> > the
> > > > > >> > > >> codebase to make it easier to maintain. +1 on it.
> > > > > >> > > >>
> > > > > >> > > >> Best,
> > > > > >> > > >> Zhanghao Chen
> > > > > >> > > >> ________________________________
> > > > > >> > > >> From: Weihua Hu <hu...@gmail.com>
> > > > > >> > > >> Sent: Monday, February 27, 2023 20:40
> > > > > >> > > >> To: dev <de...@flink.apache.org>
> > > > > >> > > >> Subject: [DISCUSS] FLIP-298: Unifying the Implementation
> of
> > > > > >> > SlotManager
> > > > > >> > > >>
> > > > > >> > > >> Hi everyone,
> > > > > >> > > >>
> > > > > >> > > >> I would like to begin a discussion on FLIP-298: Unifying
> > the
> > > > > >> > > Implementation
> > > > > >> > > >> of SlotManager[1]. There are currently two types of
> > > SlotManager
> > > > > in
> > > > > >> > > Flink:
> > > > > >> > > >> DeclarativeSlotManager and FineGrainedSlotManager.
> > > > > >> > > FineGrainedSlotManager
> > > > > >> > > >> should behave as DeclarativeSlotManager if the user does
> > not
> > > > > >> configure
> > > > > >> > > the
> > > > > >> > > >> slot request profile.
> > > > > >> > > >>
> > > > > >> > > >> Therefore, this FLIP aims to unify the implementation of
> > > > > SlotManager
> > > > > >> > in
> > > > > >> > > >> order to reduce maintenance costs.
> > > > > >> > > >>
> > > > > >> > > >> Looking forward to hearing from you.
> > > > > >> > > >>
> > > > > >> > > >> [1]
> > > > > >> > > >>
> > > > > >> > > >>
> > > > > >> > > >>
> > > > > >> > >
> > > > > >> >
> > > > > >>
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-298%3A+Unifying+the+Implementation+of+SlotManager
> > > > > >> > > >>
> > > > > >> > > >> Best,
> > > > > >> > > >> Weihua
> > > > > >> > > >>
> > > > > >> > > >>
> > > > > >> > >
> > > > > >> >
> > > > > >>
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] FLIP-298: Unifying the Implementation of SlotManager

Posted by Shammon FY <zj...@gmail.com>.
Hi weihua

Can you add content related to `heterogeneous resources` to this FLIP? We
can record it and consider it in the future. It may be useful for some
scenarios, such as the combination of streaming and ML.

Best,
Shammon


On Mon, Mar 6, 2023 at 1:39 PM weijie guo <gu...@gmail.com> wrote:

> Hi Weihua,
>
> Thanks for your clarification, SGTM.
>
> Best regards,
>
> Weijie
>
>
> Weihua Hu <hu...@gmail.com> 于2023年3月6日周一 11:43写道:
>
> > Thanks Weijie.
> >
> > Heterogeneous task managers will not be considered in this FLIP since
> > it does not request heterogeneous resources as you said.
> >
> > My first thought is we can adjust the meaning of redundant configuration
> > to redundant number of per resource type. These can be considered in
> > detail when we decide to support heterogeneous task managers.
> >
> > Best,
> > Weihua
> >
> >
> > On Sat, Mar 4, 2023 at 1:13 AM weijie guo <gu...@gmail.com>
> > wrote:
> >
> > > Thanks Weihua for preparing this FLIP.
> > >
> > > This FLIP overall looks reasonable to me after updating as suggested by
> > > Matthias.
> > >
> > > I only have one small question about keeping some redundant task
> > managers:
> > > In the fine-grained resource management, theoretically, it can support
> > > heterogeneous taskmanagers. When we complete the missing features for
> > FGSM,
> > > do we plan to take this into account?
> > > Of course, if I remember correctly, FGSM will not request heterogeneous
> > > resources at present, so it is also acceptable to me if there is no
> > special
> > > treatment now.
> > >
> > > +1 for this changes if we can ensure the test coverage.
> > >
> > > Best regards,
> > >
> > > Weijie
> > >
> > >
> > > John Roesler <vv...@apache.org> 于2023年3月2日周四 12:53写道:
> > >
> > > > Thanks for the test plan, Weihua!
> > > >
> > > > Yes, it addresses my concerns.
> > > >
> > > > Thanks,
> > > > John
> > > >
> > > > On Wed, Mar 1, 2023, at 22:38, Weihua Hu wrote:
> > > > > Hi, everyone,
> > > > > Thanks for your suggestions and ideas.
> > > > > Thanks Xintong for sharing the detailed backgrounds of SlotManager.
> > > > >
> > > > > *@Matthias
> > > > >
> > > > > 1. Did you do a proper test coverage analysis?
> > > > >
> > > > >
> > > > > Just as Xintong said, we already have a CI stage for fine grained
> > > > resource
> > > > > managers.
> > > > > And I will make sure FineGrainedSlotManager as the default
> > SlotManager
> > > > can
> > > > > pass all the tests of CI.
> > > > > In addition, I will review all unit tests of
> > > DeclarativeSlotManager(DSM)
> > > > to
> > > > > ensure that there are no gaps in the
> > > > > coverage provided by the FineGrainedSlotManager.
> > > > > I also added the 'Test Plan' part to the FLIP.
> > > > > @Matthias @John @Shammon Does this test plan address your concerns?
> > > > >
> > > > > 2.  DeclarativeSlotManager and FineGrainedSlotManager feel quite
> big
> > in
> > > > >
> > > > > terms of lines of code....
> > > > >
> > > > >
> > > > > IMO, the refactoring of SlotManager does not belong to this FLIP
> > since
> > > it
> > > > > may lead to some unstable risks. For
> > > > > FineGrainedSlotManager(FGSM), we already split some reasonable
> > > > components.
> > > > > They are:
> > > > > * TaskManagerTracker: Track task managers and their resources.
> > > > > * ResourceTracker: track requirements of jobs
> > > > > * ResourceAllocationStrategy: Try to fulfill the resource
> > requirements
> > > > with
> > > > > available/pending resources.
> > > > > * SlotStatusSyncer: communicate with TaskManager, for
> > > allocating/freeing
> > > > > slot and reconciling the slot status
> > > > > Maybe we can start a discussion about refactoring SlotManager in
> > > another
> > > > > FLIP if there are some good suggestions.
> > > > > WDYT
> > > > >
> > > > > 3. For me personally, having a more detailed summary comparing the
> > > > >> subcomponents of both SlotManager implementations with where
> > > > >> their functionality matches and where they differ might help
> > > understand
> > > > the
> > > > >> consequences of the changes proposed in FLIP-298
> > > > >
> > > > > Good suggestion, I have updated the comparison in this FLIP.
> Looking
> > > > > forward to any suggestions/thoughts
> > > > > if they are not described clearly.
> > > > >
> > > > > *@John
> > > > >
> > > > > 4. In addition to changing the default, would it make sense to log
> a
> > > > >> deprecation warning on initialization
> > > > >
> > > > > if the DeclarativeSlotManager is used?
> > > > >>
> > > > > SGTM, We should add Deprecated annotations to DSM for devs. And
> log a
> > > > > deprecation warning for users.
> > > > >
> > > > > *@Shammon
> > > > >
> > > > > 1. For their functional differences, can you give some detailed
> tests
> > > to
> > > > >> verify that the new FineGrainedSlotManager has these capabilities?
> > > This
> > > > can
> > > > >> effectively verify the new functions
> > > > >>
> > > > > As just maintained, there is already a CI stage of FGSM, and I will
> > do
> > > > more
> > > > > review of unit tests for DSM.
> > > > >
> > > > >  2. I'm worried that many functions are not independent and it is
> > > > difficult
> > > > >> to migrate step-by-step. You can list the relationship between
> them
> > in
> > > > >> detail.
> > > > >
> > > > >  As Xintong saied the DSM is a subset of FGSM by design. But as
> time
> > > goes
> > > > > on, FGSM has some lacking
> > > > > functions as I listed in this FLIP. And I have added the comparison
> > > > between
> > > > > DSM and FGSM in this FLIP.
> > > > >
> > > > >
> > > > > Thanks again for all your thoughts. Any feedback is appreciated!
> > > > >
> > > > > Best,
> > > > > Weihua
> > > > >
> > > > >
> > > > > On Wed, Mar 1, 2023 at 2:17 PM Xintong Song <tonysong820@gmail.com
> >
> > > > wrote:
> > > > >
> > > > >> Thanks Weihua for preparing this FLIP. +1 for the proposal.
> > > > >>
> > > > >>
> > > > >> As one of the contributors of the fine-grained slot manager, I'd
> > like
> > > to
> > > > >> share some backgrounds here.
> > > > >>
> > > > >> - There used to be a defaut slot manager implementation, which is
> > > > >> non-declarative and has been removed now. The two features,
> > > declarative
> > > > /
> > > > >> reactive resource management and fine-grained resource management,
> > > were
> > > > >> proposed at about the same time. We were aware that by design the
> > > > >> declarative slot manager is a subset of fine-grained slot manager
> at
> > > > that
> > > > >> time, but still decided to implement two slot managers for the
> > purpose
> > > > of
> > > > >> decoupling efforts and reducing cross-team synchronization
> overhead.
> > > > >> Merging the two slot managers once they are proved stable is IMO a
> > > > >> technical debt that was planned at the very beginning.
> > > > >>
> > > > >> - The FineGrainedSlotManager has been verified in Alibaba's
> internal
> > > > >> production as well as Alibaba Cloud services as the default slot
> > > manager
> > > > >> for about 2 years.
> > > > >>
> > > > >>
> > > > >> Concerning test cases, we currently have a ci stage for fine
> grained
> > > > >> resource management. To avoid adding too much burden, the stage
> only
> > > > >> includes tests from flink-runtime and flink-test modules. I think
> > > > switching
> > > > >> the default slot manager and applying the whole set of tests on
> the
> > > > >> fine-grained slot manager would help us to be more confident about
> > it.
> > > > >>
> > > > >>
> > > > >> Concerning cutting out functionalities out of slot manager, I
> think
> > > > Yangze
> > > > >> and I have tried our best to shape the FineGrainedSlotManager into
> > > > >> reasonable components. I personally don't have other ideas to
> > further
> > > > >> disassemble the component, but I'm open to such suggestions.
> > However,
> > > > from
> > > > >> the stability perspective, I'd be in favor of not introducing
> > > > significant
> > > > >> changes to the FineGrainedSlotManager while switching it to the
> > > default.
> > > > >> Because the current implementation has already been verified (or
> at
> > > > least
> > > > >> partially verified because Alibaba does not cover all the Flink
> use
> > > > cases),
> > > > >> and introducing more changes also means more chances of breaking
> > > things.
> > > > >>
> > > > >>
> > > > >> Best,
> > > > >>
> > > > >> Xintong
> > > > >>
> > > > >>
> > > > >>
> > > > >> On Wed, Mar 1, 2023 at 11:12 AM Shammon FY <zj...@gmail.com>
> > wrote:
> > > > >>
> > > > >> > Hi
> > > > >> >
> > > > >> > Thanks for starting this work weihua, I think unifying
> > > > >> > DeclarativeSlotManager and FineGrainedSlotManager is valuable.
> > > > >> >
> > > > >> > I agree with @Matthias and @John that we need a way to ensure
> that
> > > > >> > DeclarativeSlotManager's capabilities are fully covered by
> > > > >> > FineGrainedSlotManager
> > > > >> >
> > > > >> > 1. For their functional differences, can you give some detailed
> > > tests
> > > > to
> > > > >> > verify that the new FineGrainedSlotManager has these
> capabilities?
> > > > This
> > > > >> can
> > > > >> > effectively verify the new functions
> > > > >> >
> > > > >> > 2. I'm worried that many functions are not independent and it is
> > > > >> difficult
> > > > >> > to migrate step-by-step. You can list the relationship between
> > them
> > > in
> > > > >> > detail.
> > > > >> >
> > > > >> > 3. As John mentioned, give a smoke test for
> FineGrainedSlotManager
> > > is
> > > > a
> > > > >> > good idea. Or you can add some test information to the
> > > > >> > DeclarativeSlotManager to determine how many tests have used it.
> > In
> > > > this
> > > > >> > way, we can gradually construct test cases for
> > > FineGrainedSlotManager
> > > > >> > during the development process.
> > > > >> >
> > > > >> >
> > > > >> > Best,
> > > > >> > Shammon
> > > > >> >
> > > > >> >
> > > > >> > On Tue, Feb 28, 2023 at 10:22 PM John Roesler <
> > vvcephei@apache.org>
> > > > >> wrote:
> > > > >> >
> > > > >> > > Thanks for the FLIP, Weihua!
> > > > >> > >
> > > > >> > > I’ve read the FLIP, and it sounds good to me. We need to avoid
> > > > >> > > proliferating alternative implementations wherever possible. I
> > > have
> > > > >> just
> > > > >> > a
> > > > >> > > couple of comments:
> > > > >> > >
> > > > >> > > 1. I share Matthias’s concern about ensuring the behavior is
> > > really
> > > > the
> > > > >> > > same. One suggestion I’ve used for this kind of thing is, as a
> > > smoke
> > > > >> > test,
> > > > >> > > to update the DeclarativeSlotManager to just delegate to the
> > > > >> > > FineGrainedSlotManager. If the full test suite still passes,
> you
> > > > can be
> > > > >> > > pretty sure the new default is really ok. It would not be a
> good
> > > > idea
> > > > >> to
> > > > >> > > actually keep that in for the release, since it would remove
> the
> > > > option
> > > > >> > to
> > > > >> > > fall back in case of bugs. Either way, we need to make sure
> all
> > > test
> > > > >> > > scenarios are present for the FGSM.
> > > > >> > >
> > > > >> > > 4. In addition to changing the default, would it make sense to
> > > log a
> > > > >> > > deprecation warning on initialization if the
> > > DeclarativeSlotManager
> > > > is
> > > > >> > used?
> > > > >> > >
> > > > >> > > Thanks again,
> > > > >> > > John
> > > > >> > >
> > > > >> > > On Tue, Feb 28, 2023, at 07:20, Matthias Pohl wrote:
> > > > >> > > > Hi Weihua,
> > > > >> > > > Thanks for your proposal. From a conceptual point: AFAIU,
> the
> > > > >> > > > DeclarativeSlotManager covers a subset (i.e. only evenly
> sized
> > > > slots)
> > > > >> > of
> > > > >> > > > what the FineGrainedSlotManager should be able to achieve
> > > > (variable
> > > > >> > slot
> > > > >> > > > size per task manager). Is this the right
> > > > assumption/understanding?
> > > > >> In
> > > > >> > > this
> > > > >> > > > sense, merging both implementations into a single one sounds
> > > > good. A
> > > > >> > few
> > > > >> > > > more general comments, though:
> > > > >> > > >
> > > > >> > > > 1. Did you do a proper test coverage analysis? That's not
> > > > mentioned
> > > > >> in
> > > > >> > > the
> > > > >> > > > current version of the FLIP. I'm bringing this up because we
> > ran
> > > > into
> > > > >> > the
> > > > >> > > > same issue when fixing the flaws that popped up after
> > > introducing
> > > > the
> > > > >> > > > multi-component leader election (see FLIP-285 [1]). There
> is a
> > > > risk
> > > > >> > that
> > > > >> > > by
> > > > >> > > > removing the legacy code we decrease test coverage because
> > > certain
> > > > >> > > > test cases that were covered for the legacy classes might
> not
> > be
> > > > >> > > > necessarily covered in the new implementation, yet (see
> > > > FLINK-30338
> > > > >> [2]
> > > > >> > > > which covers this issue for the leader election case).
> > Ideally,
> > > we
> > > > >> > don't
> > > > >> > > > want to remove test cases accidentally because they were
> only
> > > > >> > implemented
> > > > >> > > > for the DeclarativeSlotManager but missed for the
> > > > >> > FineGrainedSlotManager.
> > > > >> > > >
> > > > >> > > > 2. DeclarativeSlotManager and FineGrainedSlotManager feel
> > quite
> > > > big
> > > > >> in
> > > > >> > > > terms of lines of code. Without knowing whether it's
> actually
> > a
> > > > >> > > reasonable
> > > > >> > > > thing to do: Instead of just adding more features to the
> > > > >> > > > FineGrainedSlotManager, have you thought of cutting out
> > > > functionality
> > > > >> > > into
> > > > >> > > > smaller sub-components along this refactoring? Such a
> > > step-by-step
> > > > >> > > approach
> > > > >> > > > might improve the overall codebase and might make reviewing
> > the
> > > > >> > > refactoring
> > > > >> > > > easier. I did a first pass over the code and struggled to
> > > identify
> > > > >> code
> > > > >> > > > blocks that could be moved out of the SlotManager
> > > > implementation(s).
> > > > >> > > > Therefore, I might be wrong with this proposal. I haven't
> > worked
> > > > on
> > > > >> > this
> > > > >> > > > codebase in that detail that it would allow me to come up
> > with a
> > > > >> > > judgement
> > > > >> > > > call. I wanted to bring it up, anyway, because I'm curious
> > > whether
> > > > >> that
> > > > >> > > > could be an option. There's a comment created by Chesnay
> > (CC'd)
> > > in
> > > > >> the
> > > > >> > > > JavaDoc of TaskExecutorManager [3] indicating something
> > similar.
> > > > I'm
> > > > >> > > > wondering whether he can add some insights here.
> > > > >> > > >
> > > > >> > > > 3. For me personally, having a more detailed summary
> comparing
> > > the
> > > > >> > > > subcomponents of both SlotManager implementations with where
> > > > >> > > > their functionality matches and where they differ might help
> > > > >> understand
> > > > >> > > the
> > > > >> > > > consequences of the changes proposed in FLIP-298.
> > > > >> > > >
> > > > >> > > > Best,
> > > > >> > > > Matthias
> > > > >> > > >
> > > > >> > > > [1]
> > > > >> > > >
> > > > >> > >
> > > > >> >
> > > > >>
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-285%3A+Refactoring+LeaderElection+to+make+Flink+support+multi-component+leader+election+out-of-the-box
> > > > >> > > > [2] https://issues.apache.org/jira/browse/FLINK-30338
> > > > >> > > > [3]
> > > > >> > > >
> > > > >> > >
> > > > >> >
> > > > >>
> > > >
> > >
> >
> https://github.com/apache/flink/blob/f611ea8cb5deddb42429df2c99f0c68d7382e9bd/flink-runtime/src/main/java/org/apache/flink/runtime/resourcemanager/slotmanager/TaskExecutorManager.java#L66-L68
> > > > >> > > >
> > > > >> > > > On Tue, Feb 28, 2023 at 6:14 AM Matt Wang <wa...@163.com>
> > > > wrote:
> > > > >> > > >
> > > > >> > > >> This is a good proposal for me, it will make the code of
> the
> > > > >> > SlotManager
> > > > >> > > >> more clear.
> > > > >> > > >>
> > > > >> > > >>
> > > > >> > > >>
> > > > >> > > >> --
> > > > >> > > >>
> > > > >> > > >> Best,
> > > > >> > > >> Matt Wang
> > > > >> > > >>
> > > > >> > > >>
> > > > >> > > >> ---- Replied Message ----
> > > > >> > > >> | From | David Morávek<dm...@apache.org> |
> > > > >> > > >> | Date | 02/27/2023 22:45 |
> > > > >> > > >> | To | <de...@flink.apache.org> |
> > > > >> > > >> | Subject | Re: [DISCUSS] FLIP-298: Unifying the
> > Implementation
> > > > of
> > > > >> > > >> SlotManager |
> > > > >> > > >> Hi Weihua, I still need to dig into the details, but the
> > > overall
> > > > >> > > sentiment
> > > > >> > > >> of this change sounds reasonable.
> > > > >> > > >>
> > > > >> > > >> Best,
> > > > >> > > >> D.
> > > > >> > > >>
> > > > >> > > >> On Mon, Feb 27, 2023 at 2:26 PM Zhanghao Chen <
> > > > >> > > zhanghao.chen@outlook.com>
> > > > >> > > >> wrote:
> > > > >> > > >>
> > > > >> > > >> Thanks for driving this topic. I think this FLIP could help
> > > > clean up
> > > > >> > the
> > > > >> > > >> codebase to make it easier to maintain. +1 on it.
> > > > >> > > >>
> > > > >> > > >> Best,
> > > > >> > > >> Zhanghao Chen
> > > > >> > > >> ________________________________
> > > > >> > > >> From: Weihua Hu <hu...@gmail.com>
> > > > >> > > >> Sent: Monday, February 27, 2023 20:40
> > > > >> > > >> To: dev <de...@flink.apache.org>
> > > > >> > > >> Subject: [DISCUSS] FLIP-298: Unifying the Implementation of
> > > > >> > SlotManager
> > > > >> > > >>
> > > > >> > > >> Hi everyone,
> > > > >> > > >>
> > > > >> > > >> I would like to begin a discussion on FLIP-298: Unifying
> the
> > > > >> > > Implementation
> > > > >> > > >> of SlotManager[1]. There are currently two types of
> > SlotManager
> > > > in
> > > > >> > > Flink:
> > > > >> > > >> DeclarativeSlotManager and FineGrainedSlotManager.
> > > > >> > > FineGrainedSlotManager
> > > > >> > > >> should behave as DeclarativeSlotManager if the user does
> not
> > > > >> configure
> > > > >> > > the
> > > > >> > > >> slot request profile.
> > > > >> > > >>
> > > > >> > > >> Therefore, this FLIP aims to unify the implementation of
> > > > SlotManager
> > > > >> > in
> > > > >> > > >> order to reduce maintenance costs.
> > > > >> > > >>
> > > > >> > > >> Looking forward to hearing from you.
> > > > >> > > >>
> > > > >> > > >> [1]
> > > > >> > > >>
> > > > >> > > >>
> > > > >> > > >>
> > > > >> > >
> > > > >> >
> > > > >>
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-298%3A+Unifying+the+Implementation+of+SlotManager
> > > > >> > > >>
> > > > >> > > >> Best,
> > > > >> > > >> Weihua
> > > > >> > > >>
> > > > >> > > >>
> > > > >> > >
> > > > >> >
> > > > >>
> > > >
> > >
> >
>

Re: [DISCUSS] FLIP-298: Unifying the Implementation of SlotManager

Posted by weijie guo <gu...@gmail.com>.
Hi Weihua,

Thanks for your clarification, SGTM.

Best regards,

Weijie


Weihua Hu <hu...@gmail.com> 于2023年3月6日周一 11:43写道:

> Thanks Weijie.
>
> Heterogeneous task managers will not be considered in this FLIP since
> it does not request heterogeneous resources as you said.
>
> My first thought is we can adjust the meaning of redundant configuration
> to redundant number of per resource type. These can be considered in
> detail when we decide to support heterogeneous task managers.
>
> Best,
> Weihua
>
>
> On Sat, Mar 4, 2023 at 1:13 AM weijie guo <gu...@gmail.com>
> wrote:
>
> > Thanks Weihua for preparing this FLIP.
> >
> > This FLIP overall looks reasonable to me after updating as suggested by
> > Matthias.
> >
> > I only have one small question about keeping some redundant task
> managers:
> > In the fine-grained resource management, theoretically, it can support
> > heterogeneous taskmanagers. When we complete the missing features for
> FGSM,
> > do we plan to take this into account?
> > Of course, if I remember correctly, FGSM will not request heterogeneous
> > resources at present, so it is also acceptable to me if there is no
> special
> > treatment now.
> >
> > +1 for this changes if we can ensure the test coverage.
> >
> > Best regards,
> >
> > Weijie
> >
> >
> > John Roesler <vv...@apache.org> 于2023年3月2日周四 12:53写道:
> >
> > > Thanks for the test plan, Weihua!
> > >
> > > Yes, it addresses my concerns.
> > >
> > > Thanks,
> > > John
> > >
> > > On Wed, Mar 1, 2023, at 22:38, Weihua Hu wrote:
> > > > Hi, everyone,
> > > > Thanks for your suggestions and ideas.
> > > > Thanks Xintong for sharing the detailed backgrounds of SlotManager.
> > > >
> > > > *@Matthias
> > > >
> > > > 1. Did you do a proper test coverage analysis?
> > > >
> > > >
> > > > Just as Xintong said, we already have a CI stage for fine grained
> > > resource
> > > > managers.
> > > > And I will make sure FineGrainedSlotManager as the default
> SlotManager
> > > can
> > > > pass all the tests of CI.
> > > > In addition, I will review all unit tests of
> > DeclarativeSlotManager(DSM)
> > > to
> > > > ensure that there are no gaps in the
> > > > coverage provided by the FineGrainedSlotManager.
> > > > I also added the 'Test Plan' part to the FLIP.
> > > > @Matthias @John @Shammon Does this test plan address your concerns?
> > > >
> > > > 2.  DeclarativeSlotManager and FineGrainedSlotManager feel quite big
> in
> > > >
> > > > terms of lines of code....
> > > >
> > > >
> > > > IMO, the refactoring of SlotManager does not belong to this FLIP
> since
> > it
> > > > may lead to some unstable risks. For
> > > > FineGrainedSlotManager(FGSM), we already split some reasonable
> > > components.
> > > > They are:
> > > > * TaskManagerTracker: Track task managers and their resources.
> > > > * ResourceTracker: track requirements of jobs
> > > > * ResourceAllocationStrategy: Try to fulfill the resource
> requirements
> > > with
> > > > available/pending resources.
> > > > * SlotStatusSyncer: communicate with TaskManager, for
> > allocating/freeing
> > > > slot and reconciling the slot status
> > > > Maybe we can start a discussion about refactoring SlotManager in
> > another
> > > > FLIP if there are some good suggestions.
> > > > WDYT
> > > >
> > > > 3. For me personally, having a more detailed summary comparing the
> > > >> subcomponents of both SlotManager implementations with where
> > > >> their functionality matches and where they differ might help
> > understand
> > > the
> > > >> consequences of the changes proposed in FLIP-298
> > > >
> > > > Good suggestion, I have updated the comparison in this FLIP. Looking
> > > > forward to any suggestions/thoughts
> > > > if they are not described clearly.
> > > >
> > > > *@John
> > > >
> > > > 4. In addition to changing the default, would it make sense to log a
> > > >> deprecation warning on initialization
> > > >
> > > > if the DeclarativeSlotManager is used?
> > > >>
> > > > SGTM, We should add Deprecated annotations to DSM for devs. And log a
> > > > deprecation warning for users.
> > > >
> > > > *@Shammon
> > > >
> > > > 1. For their functional differences, can you give some detailed tests
> > to
> > > >> verify that the new FineGrainedSlotManager has these capabilities?
> > This
> > > can
> > > >> effectively verify the new functions
> > > >>
> > > > As just maintained, there is already a CI stage of FGSM, and I will
> do
> > > more
> > > > review of unit tests for DSM.
> > > >
> > > >  2. I'm worried that many functions are not independent and it is
> > > difficult
> > > >> to migrate step-by-step. You can list the relationship between them
> in
> > > >> detail.
> > > >
> > > >  As Xintong saied the DSM is a subset of FGSM by design. But as time
> > goes
> > > > on, FGSM has some lacking
> > > > functions as I listed in this FLIP. And I have added the comparison
> > > between
> > > > DSM and FGSM in this FLIP.
> > > >
> > > >
> > > > Thanks again for all your thoughts. Any feedback is appreciated!
> > > >
> > > > Best,
> > > > Weihua
> > > >
> > > >
> > > > On Wed, Mar 1, 2023 at 2:17 PM Xintong Song <to...@gmail.com>
> > > wrote:
> > > >
> > > >> Thanks Weihua for preparing this FLIP. +1 for the proposal.
> > > >>
> > > >>
> > > >> As one of the contributors of the fine-grained slot manager, I'd
> like
> > to
> > > >> share some backgrounds here.
> > > >>
> > > >> - There used to be a defaut slot manager implementation, which is
> > > >> non-declarative and has been removed now. The two features,
> > declarative
> > > /
> > > >> reactive resource management and fine-grained resource management,
> > were
> > > >> proposed at about the same time. We were aware that by design the
> > > >> declarative slot manager is a subset of fine-grained slot manager at
> > > that
> > > >> time, but still decided to implement two slot managers for the
> purpose
> > > of
> > > >> decoupling efforts and reducing cross-team synchronization overhead.
> > > >> Merging the two slot managers once they are proved stable is IMO a
> > > >> technical debt that was planned at the very beginning.
> > > >>
> > > >> - The FineGrainedSlotManager has been verified in Alibaba's internal
> > > >> production as well as Alibaba Cloud services as the default slot
> > manager
> > > >> for about 2 years.
> > > >>
> > > >>
> > > >> Concerning test cases, we currently have a ci stage for fine grained
> > > >> resource management. To avoid adding too much burden, the stage only
> > > >> includes tests from flink-runtime and flink-test modules. I think
> > > switching
> > > >> the default slot manager and applying the whole set of tests on the
> > > >> fine-grained slot manager would help us to be more confident about
> it.
> > > >>
> > > >>
> > > >> Concerning cutting out functionalities out of slot manager, I think
> > > Yangze
> > > >> and I have tried our best to shape the FineGrainedSlotManager into
> > > >> reasonable components. I personally don't have other ideas to
> further
> > > >> disassemble the component, but I'm open to such suggestions.
> However,
> > > from
> > > >> the stability perspective, I'd be in favor of not introducing
> > > significant
> > > >> changes to the FineGrainedSlotManager while switching it to the
> > default.
> > > >> Because the current implementation has already been verified (or at
> > > least
> > > >> partially verified because Alibaba does not cover all the Flink use
> > > cases),
> > > >> and introducing more changes also means more chances of breaking
> > things.
> > > >>
> > > >>
> > > >> Best,
> > > >>
> > > >> Xintong
> > > >>
> > > >>
> > > >>
> > > >> On Wed, Mar 1, 2023 at 11:12 AM Shammon FY <zj...@gmail.com>
> wrote:
> > > >>
> > > >> > Hi
> > > >> >
> > > >> > Thanks for starting this work weihua, I think unifying
> > > >> > DeclarativeSlotManager and FineGrainedSlotManager is valuable.
> > > >> >
> > > >> > I agree with @Matthias and @John that we need a way to ensure that
> > > >> > DeclarativeSlotManager's capabilities are fully covered by
> > > >> > FineGrainedSlotManager
> > > >> >
> > > >> > 1. For their functional differences, can you give some detailed
> > tests
> > > to
> > > >> > verify that the new FineGrainedSlotManager has these capabilities?
> > > This
> > > >> can
> > > >> > effectively verify the new functions
> > > >> >
> > > >> > 2. I'm worried that many functions are not independent and it is
> > > >> difficult
> > > >> > to migrate step-by-step. You can list the relationship between
> them
> > in
> > > >> > detail.
> > > >> >
> > > >> > 3. As John mentioned, give a smoke test for FineGrainedSlotManager
> > is
> > > a
> > > >> > good idea. Or you can add some test information to the
> > > >> > DeclarativeSlotManager to determine how many tests have used it.
> In
> > > this
> > > >> > way, we can gradually construct test cases for
> > FineGrainedSlotManager
> > > >> > during the development process.
> > > >> >
> > > >> >
> > > >> > Best,
> > > >> > Shammon
> > > >> >
> > > >> >
> > > >> > On Tue, Feb 28, 2023 at 10:22 PM John Roesler <
> vvcephei@apache.org>
> > > >> wrote:
> > > >> >
> > > >> > > Thanks for the FLIP, Weihua!
> > > >> > >
> > > >> > > I’ve read the FLIP, and it sounds good to me. We need to avoid
> > > >> > > proliferating alternative implementations wherever possible. I
> > have
> > > >> just
> > > >> > a
> > > >> > > couple of comments:
> > > >> > >
> > > >> > > 1. I share Matthias’s concern about ensuring the behavior is
> > really
> > > the
> > > >> > > same. One suggestion I’ve used for this kind of thing is, as a
> > smoke
> > > >> > test,
> > > >> > > to update the DeclarativeSlotManager to just delegate to the
> > > >> > > FineGrainedSlotManager. If the full test suite still passes, you
> > > can be
> > > >> > > pretty sure the new default is really ok. It would not be a good
> > > idea
> > > >> to
> > > >> > > actually keep that in for the release, since it would remove the
> > > option
> > > >> > to
> > > >> > > fall back in case of bugs. Either way, we need to make sure all
> > test
> > > >> > > scenarios are present for the FGSM.
> > > >> > >
> > > >> > > 4. In addition to changing the default, would it make sense to
> > log a
> > > >> > > deprecation warning on initialization if the
> > DeclarativeSlotManager
> > > is
> > > >> > used?
> > > >> > >
> > > >> > > Thanks again,
> > > >> > > John
> > > >> > >
> > > >> > > On Tue, Feb 28, 2023, at 07:20, Matthias Pohl wrote:
> > > >> > > > Hi Weihua,
> > > >> > > > Thanks for your proposal. From a conceptual point: AFAIU, the
> > > >> > > > DeclarativeSlotManager covers a subset (i.e. only evenly sized
> > > slots)
> > > >> > of
> > > >> > > > what the FineGrainedSlotManager should be able to achieve
> > > (variable
> > > >> > slot
> > > >> > > > size per task manager). Is this the right
> > > assumption/understanding?
> > > >> In
> > > >> > > this
> > > >> > > > sense, merging both implementations into a single one sounds
> > > good. A
> > > >> > few
> > > >> > > > more general comments, though:
> > > >> > > >
> > > >> > > > 1. Did you do a proper test coverage analysis? That's not
> > > mentioned
> > > >> in
> > > >> > > the
> > > >> > > > current version of the FLIP. I'm bringing this up because we
> ran
> > > into
> > > >> > the
> > > >> > > > same issue when fixing the flaws that popped up after
> > introducing
> > > the
> > > >> > > > multi-component leader election (see FLIP-285 [1]). There is a
> > > risk
> > > >> > that
> > > >> > > by
> > > >> > > > removing the legacy code we decrease test coverage because
> > certain
> > > >> > > > test cases that were covered for the legacy classes might not
> be
> > > >> > > > necessarily covered in the new implementation, yet (see
> > > FLINK-30338
> > > >> [2]
> > > >> > > > which covers this issue for the leader election case).
> Ideally,
> > we
> > > >> > don't
> > > >> > > > want to remove test cases accidentally because they were only
> > > >> > implemented
> > > >> > > > for the DeclarativeSlotManager but missed for the
> > > >> > FineGrainedSlotManager.
> > > >> > > >
> > > >> > > > 2. DeclarativeSlotManager and FineGrainedSlotManager feel
> quite
> > > big
> > > >> in
> > > >> > > > terms of lines of code. Without knowing whether it's actually
> a
> > > >> > > reasonable
> > > >> > > > thing to do: Instead of just adding more features to the
> > > >> > > > FineGrainedSlotManager, have you thought of cutting out
> > > functionality
> > > >> > > into
> > > >> > > > smaller sub-components along this refactoring? Such a
> > step-by-step
> > > >> > > approach
> > > >> > > > might improve the overall codebase and might make reviewing
> the
> > > >> > > refactoring
> > > >> > > > easier. I did a first pass over the code and struggled to
> > identify
> > > >> code
> > > >> > > > blocks that could be moved out of the SlotManager
> > > implementation(s).
> > > >> > > > Therefore, I might be wrong with this proposal. I haven't
> worked
> > > on
> > > >> > this
> > > >> > > > codebase in that detail that it would allow me to come up
> with a
> > > >> > > judgement
> > > >> > > > call. I wanted to bring it up, anyway, because I'm curious
> > whether
> > > >> that
> > > >> > > > could be an option. There's a comment created by Chesnay
> (CC'd)
> > in
> > > >> the
> > > >> > > > JavaDoc of TaskExecutorManager [3] indicating something
> similar.
> > > I'm
> > > >> > > > wondering whether he can add some insights here.
> > > >> > > >
> > > >> > > > 3. For me personally, having a more detailed summary comparing
> > the
> > > >> > > > subcomponents of both SlotManager implementations with where
> > > >> > > > their functionality matches and where they differ might help
> > > >> understand
> > > >> > > the
> > > >> > > > consequences of the changes proposed in FLIP-298.
> > > >> > > >
> > > >> > > > Best,
> > > >> > > > Matthias
> > > >> > > >
> > > >> > > > [1]
> > > >> > > >
> > > >> > >
> > > >> >
> > > >>
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-285%3A+Refactoring+LeaderElection+to+make+Flink+support+multi-component+leader+election+out-of-the-box
> > > >> > > > [2] https://issues.apache.org/jira/browse/FLINK-30338
> > > >> > > > [3]
> > > >> > > >
> > > >> > >
> > > >> >
> > > >>
> > >
> >
> https://github.com/apache/flink/blob/f611ea8cb5deddb42429df2c99f0c68d7382e9bd/flink-runtime/src/main/java/org/apache/flink/runtime/resourcemanager/slotmanager/TaskExecutorManager.java#L66-L68
> > > >> > > >
> > > >> > > > On Tue, Feb 28, 2023 at 6:14 AM Matt Wang <wa...@163.com>
> > > wrote:
> > > >> > > >
> > > >> > > >> This is a good proposal for me, it will make the code of the
> > > >> > SlotManager
> > > >> > > >> more clear.
> > > >> > > >>
> > > >> > > >>
> > > >> > > >>
> > > >> > > >> --
> > > >> > > >>
> > > >> > > >> Best,
> > > >> > > >> Matt Wang
> > > >> > > >>
> > > >> > > >>
> > > >> > > >> ---- Replied Message ----
> > > >> > > >> | From | David Morávek<dm...@apache.org> |
> > > >> > > >> | Date | 02/27/2023 22:45 |
> > > >> > > >> | To | <de...@flink.apache.org> |
> > > >> > > >> | Subject | Re: [DISCUSS] FLIP-298: Unifying the
> Implementation
> > > of
> > > >> > > >> SlotManager |
> > > >> > > >> Hi Weihua, I still need to dig into the details, but the
> > overall
> > > >> > > sentiment
> > > >> > > >> of this change sounds reasonable.
> > > >> > > >>
> > > >> > > >> Best,
> > > >> > > >> D.
> > > >> > > >>
> > > >> > > >> On Mon, Feb 27, 2023 at 2:26 PM Zhanghao Chen <
> > > >> > > zhanghao.chen@outlook.com>
> > > >> > > >> wrote:
> > > >> > > >>
> > > >> > > >> Thanks for driving this topic. I think this FLIP could help
> > > clean up
> > > >> > the
> > > >> > > >> codebase to make it easier to maintain. +1 on it.
> > > >> > > >>
> > > >> > > >> Best,
> > > >> > > >> Zhanghao Chen
> > > >> > > >> ________________________________
> > > >> > > >> From: Weihua Hu <hu...@gmail.com>
> > > >> > > >> Sent: Monday, February 27, 2023 20:40
> > > >> > > >> To: dev <de...@flink.apache.org>
> > > >> > > >> Subject: [DISCUSS] FLIP-298: Unifying the Implementation of
> > > >> > SlotManager
> > > >> > > >>
> > > >> > > >> Hi everyone,
> > > >> > > >>
> > > >> > > >> I would like to begin a discussion on FLIP-298: Unifying the
> > > >> > > Implementation
> > > >> > > >> of SlotManager[1]. There are currently two types of
> SlotManager
> > > in
> > > >> > > Flink:
> > > >> > > >> DeclarativeSlotManager and FineGrainedSlotManager.
> > > >> > > FineGrainedSlotManager
> > > >> > > >> should behave as DeclarativeSlotManager if the user does not
> > > >> configure
> > > >> > > the
> > > >> > > >> slot request profile.
> > > >> > > >>
> > > >> > > >> Therefore, this FLIP aims to unify the implementation of
> > > SlotManager
> > > >> > in
> > > >> > > >> order to reduce maintenance costs.
> > > >> > > >>
> > > >> > > >> Looking forward to hearing from you.
> > > >> > > >>
> > > >> > > >> [1]
> > > >> > > >>
> > > >> > > >>
> > > >> > > >>
> > > >> > >
> > > >> >
> > > >>
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-298%3A+Unifying+the+Implementation+of+SlotManager
> > > >> > > >>
> > > >> > > >> Best,
> > > >> > > >> Weihua
> > > >> > > >>
> > > >> > > >>
> > > >> > >
> > > >> >
> > > >>
> > >
> >
>

Re: [DISCUSS] FLIP-298: Unifying the Implementation of SlotManager

Posted by Weihua Hu <hu...@gmail.com>.
Thanks Weijie.

Heterogeneous task managers will not be considered in this FLIP since
it does not request heterogeneous resources as you said.

My first thought is we can adjust the meaning of redundant configuration
to redundant number of per resource type. These can be considered in
detail when we decide to support heterogeneous task managers.

Best,
Weihua


On Sat, Mar 4, 2023 at 1:13 AM weijie guo <gu...@gmail.com> wrote:

> Thanks Weihua for preparing this FLIP.
>
> This FLIP overall looks reasonable to me after updating as suggested by
> Matthias.
>
> I only have one small question about keeping some redundant task managers:
> In the fine-grained resource management, theoretically, it can support
> heterogeneous taskmanagers. When we complete the missing features for FGSM,
> do we plan to take this into account?
> Of course, if I remember correctly, FGSM will not request heterogeneous
> resources at present, so it is also acceptable to me if there is no special
> treatment now.
>
> +1 for this changes if we can ensure the test coverage.
>
> Best regards,
>
> Weijie
>
>
> John Roesler <vv...@apache.org> 于2023年3月2日周四 12:53写道:
>
> > Thanks for the test plan, Weihua!
> >
> > Yes, it addresses my concerns.
> >
> > Thanks,
> > John
> >
> > On Wed, Mar 1, 2023, at 22:38, Weihua Hu wrote:
> > > Hi, everyone,
> > > Thanks for your suggestions and ideas.
> > > Thanks Xintong for sharing the detailed backgrounds of SlotManager.
> > >
> > > *@Matthias
> > >
> > > 1. Did you do a proper test coverage analysis?
> > >
> > >
> > > Just as Xintong said, we already have a CI stage for fine grained
> > resource
> > > managers.
> > > And I will make sure FineGrainedSlotManager as the default SlotManager
> > can
> > > pass all the tests of CI.
> > > In addition, I will review all unit tests of
> DeclarativeSlotManager(DSM)
> > to
> > > ensure that there are no gaps in the
> > > coverage provided by the FineGrainedSlotManager.
> > > I also added the 'Test Plan' part to the FLIP.
> > > @Matthias @John @Shammon Does this test plan address your concerns?
> > >
> > > 2.  DeclarativeSlotManager and FineGrainedSlotManager feel quite big in
> > >
> > > terms of lines of code....
> > >
> > >
> > > IMO, the refactoring of SlotManager does not belong to this FLIP since
> it
> > > may lead to some unstable risks. For
> > > FineGrainedSlotManager(FGSM), we already split some reasonable
> > components.
> > > They are:
> > > * TaskManagerTracker: Track task managers and their resources.
> > > * ResourceTracker: track requirements of jobs
> > > * ResourceAllocationStrategy: Try to fulfill the resource requirements
> > with
> > > available/pending resources.
> > > * SlotStatusSyncer: communicate with TaskManager, for
> allocating/freeing
> > > slot and reconciling the slot status
> > > Maybe we can start a discussion about refactoring SlotManager in
> another
> > > FLIP if there are some good suggestions.
> > > WDYT
> > >
> > > 3. For me personally, having a more detailed summary comparing the
> > >> subcomponents of both SlotManager implementations with where
> > >> their functionality matches and where they differ might help
> understand
> > the
> > >> consequences of the changes proposed in FLIP-298
> > >
> > > Good suggestion, I have updated the comparison in this FLIP. Looking
> > > forward to any suggestions/thoughts
> > > if they are not described clearly.
> > >
> > > *@John
> > >
> > > 4. In addition to changing the default, would it make sense to log a
> > >> deprecation warning on initialization
> > >
> > > if the DeclarativeSlotManager is used?
> > >>
> > > SGTM, We should add Deprecated annotations to DSM for devs. And log a
> > > deprecation warning for users.
> > >
> > > *@Shammon
> > >
> > > 1. For their functional differences, can you give some detailed tests
> to
> > >> verify that the new FineGrainedSlotManager has these capabilities?
> This
> > can
> > >> effectively verify the new functions
> > >>
> > > As just maintained, there is already a CI stage of FGSM, and I will do
> > more
> > > review of unit tests for DSM.
> > >
> > >  2. I'm worried that many functions are not independent and it is
> > difficult
> > >> to migrate step-by-step. You can list the relationship between them in
> > >> detail.
> > >
> > >  As Xintong saied the DSM is a subset of FGSM by design. But as time
> goes
> > > on, FGSM has some lacking
> > > functions as I listed in this FLIP. And I have added the comparison
> > between
> > > DSM and FGSM in this FLIP.
> > >
> > >
> > > Thanks again for all your thoughts. Any feedback is appreciated!
> > >
> > > Best,
> > > Weihua
> > >
> > >
> > > On Wed, Mar 1, 2023 at 2:17 PM Xintong Song <to...@gmail.com>
> > wrote:
> > >
> > >> Thanks Weihua for preparing this FLIP. +1 for the proposal.
> > >>
> > >>
> > >> As one of the contributors of the fine-grained slot manager, I'd like
> to
> > >> share some backgrounds here.
> > >>
> > >> - There used to be a defaut slot manager implementation, which is
> > >> non-declarative and has been removed now. The two features,
> declarative
> > /
> > >> reactive resource management and fine-grained resource management,
> were
> > >> proposed at about the same time. We were aware that by design the
> > >> declarative slot manager is a subset of fine-grained slot manager at
> > that
> > >> time, but still decided to implement two slot managers for the purpose
> > of
> > >> decoupling efforts and reducing cross-team synchronization overhead.
> > >> Merging the two slot managers once they are proved stable is IMO a
> > >> technical debt that was planned at the very beginning.
> > >>
> > >> - The FineGrainedSlotManager has been verified in Alibaba's internal
> > >> production as well as Alibaba Cloud services as the default slot
> manager
> > >> for about 2 years.
> > >>
> > >>
> > >> Concerning test cases, we currently have a ci stage for fine grained
> > >> resource management. To avoid adding too much burden, the stage only
> > >> includes tests from flink-runtime and flink-test modules. I think
> > switching
> > >> the default slot manager and applying the whole set of tests on the
> > >> fine-grained slot manager would help us to be more confident about it.
> > >>
> > >>
> > >> Concerning cutting out functionalities out of slot manager, I think
> > Yangze
> > >> and I have tried our best to shape the FineGrainedSlotManager into
> > >> reasonable components. I personally don't have other ideas to further
> > >> disassemble the component, but I'm open to such suggestions. However,
> > from
> > >> the stability perspective, I'd be in favor of not introducing
> > significant
> > >> changes to the FineGrainedSlotManager while switching it to the
> default.
> > >> Because the current implementation has already been verified (or at
> > least
> > >> partially verified because Alibaba does not cover all the Flink use
> > cases),
> > >> and introducing more changes also means more chances of breaking
> things.
> > >>
> > >>
> > >> Best,
> > >>
> > >> Xintong
> > >>
> > >>
> > >>
> > >> On Wed, Mar 1, 2023 at 11:12 AM Shammon FY <zj...@gmail.com> wrote:
> > >>
> > >> > Hi
> > >> >
> > >> > Thanks for starting this work weihua, I think unifying
> > >> > DeclarativeSlotManager and FineGrainedSlotManager is valuable.
> > >> >
> > >> > I agree with @Matthias and @John that we need a way to ensure that
> > >> > DeclarativeSlotManager's capabilities are fully covered by
> > >> > FineGrainedSlotManager
> > >> >
> > >> > 1. For their functional differences, can you give some detailed
> tests
> > to
> > >> > verify that the new FineGrainedSlotManager has these capabilities?
> > This
> > >> can
> > >> > effectively verify the new functions
> > >> >
> > >> > 2. I'm worried that many functions are not independent and it is
> > >> difficult
> > >> > to migrate step-by-step. You can list the relationship between them
> in
> > >> > detail.
> > >> >
> > >> > 3. As John mentioned, give a smoke test for FineGrainedSlotManager
> is
> > a
> > >> > good idea. Or you can add some test information to the
> > >> > DeclarativeSlotManager to determine how many tests have used it. In
> > this
> > >> > way, we can gradually construct test cases for
> FineGrainedSlotManager
> > >> > during the development process.
> > >> >
> > >> >
> > >> > Best,
> > >> > Shammon
> > >> >
> > >> >
> > >> > On Tue, Feb 28, 2023 at 10:22 PM John Roesler <vv...@apache.org>
> > >> wrote:
> > >> >
> > >> > > Thanks for the FLIP, Weihua!
> > >> > >
> > >> > > I’ve read the FLIP, and it sounds good to me. We need to avoid
> > >> > > proliferating alternative implementations wherever possible. I
> have
> > >> just
> > >> > a
> > >> > > couple of comments:
> > >> > >
> > >> > > 1. I share Matthias’s concern about ensuring the behavior is
> really
> > the
> > >> > > same. One suggestion I’ve used for this kind of thing is, as a
> smoke
> > >> > test,
> > >> > > to update the DeclarativeSlotManager to just delegate to the
> > >> > > FineGrainedSlotManager. If the full test suite still passes, you
> > can be
> > >> > > pretty sure the new default is really ok. It would not be a good
> > idea
> > >> to
> > >> > > actually keep that in for the release, since it would remove the
> > option
> > >> > to
> > >> > > fall back in case of bugs. Either way, we need to make sure all
> test
> > >> > > scenarios are present for the FGSM.
> > >> > >
> > >> > > 4. In addition to changing the default, would it make sense to
> log a
> > >> > > deprecation warning on initialization if the
> DeclarativeSlotManager
> > is
> > >> > used?
> > >> > >
> > >> > > Thanks again,
> > >> > > John
> > >> > >
> > >> > > On Tue, Feb 28, 2023, at 07:20, Matthias Pohl wrote:
> > >> > > > Hi Weihua,
> > >> > > > Thanks for your proposal. From a conceptual point: AFAIU, the
> > >> > > > DeclarativeSlotManager covers a subset (i.e. only evenly sized
> > slots)
> > >> > of
> > >> > > > what the FineGrainedSlotManager should be able to achieve
> > (variable
> > >> > slot
> > >> > > > size per task manager). Is this the right
> > assumption/understanding?
> > >> In
> > >> > > this
> > >> > > > sense, merging both implementations into a single one sounds
> > good. A
> > >> > few
> > >> > > > more general comments, though:
> > >> > > >
> > >> > > > 1. Did you do a proper test coverage analysis? That's not
> > mentioned
> > >> in
> > >> > > the
> > >> > > > current version of the FLIP. I'm bringing this up because we ran
> > into
> > >> > the
> > >> > > > same issue when fixing the flaws that popped up after
> introducing
> > the
> > >> > > > multi-component leader election (see FLIP-285 [1]). There is a
> > risk
> > >> > that
> > >> > > by
> > >> > > > removing the legacy code we decrease test coverage because
> certain
> > >> > > > test cases that were covered for the legacy classes might not be
> > >> > > > necessarily covered in the new implementation, yet (see
> > FLINK-30338
> > >> [2]
> > >> > > > which covers this issue for the leader election case). Ideally,
> we
> > >> > don't
> > >> > > > want to remove test cases accidentally because they were only
> > >> > implemented
> > >> > > > for the DeclarativeSlotManager but missed for the
> > >> > FineGrainedSlotManager.
> > >> > > >
> > >> > > > 2. DeclarativeSlotManager and FineGrainedSlotManager feel quite
> > big
> > >> in
> > >> > > > terms of lines of code. Without knowing whether it's actually a
> > >> > > reasonable
> > >> > > > thing to do: Instead of just adding more features to the
> > >> > > > FineGrainedSlotManager, have you thought of cutting out
> > functionality
> > >> > > into
> > >> > > > smaller sub-components along this refactoring? Such a
> step-by-step
> > >> > > approach
> > >> > > > might improve the overall codebase and might make reviewing the
> > >> > > refactoring
> > >> > > > easier. I did a first pass over the code and struggled to
> identify
> > >> code
> > >> > > > blocks that could be moved out of the SlotManager
> > implementation(s).
> > >> > > > Therefore, I might be wrong with this proposal. I haven't worked
> > on
> > >> > this
> > >> > > > codebase in that detail that it would allow me to come up with a
> > >> > > judgement
> > >> > > > call. I wanted to bring it up, anyway, because I'm curious
> whether
> > >> that
> > >> > > > could be an option. There's a comment created by Chesnay (CC'd)
> in
> > >> the
> > >> > > > JavaDoc of TaskExecutorManager [3] indicating something similar.
> > I'm
> > >> > > > wondering whether he can add some insights here.
> > >> > > >
> > >> > > > 3. For me personally, having a more detailed summary comparing
> the
> > >> > > > subcomponents of both SlotManager implementations with where
> > >> > > > their functionality matches and where they differ might help
> > >> understand
> > >> > > the
> > >> > > > consequences of the changes proposed in FLIP-298.
> > >> > > >
> > >> > > > Best,
> > >> > > > Matthias
> > >> > > >
> > >> > > > [1]
> > >> > > >
> > >> > >
> > >> >
> > >>
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-285%3A+Refactoring+LeaderElection+to+make+Flink+support+multi-component+leader+election+out-of-the-box
> > >> > > > [2] https://issues.apache.org/jira/browse/FLINK-30338
> > >> > > > [3]
> > >> > > >
> > >> > >
> > >> >
> > >>
> >
> https://github.com/apache/flink/blob/f611ea8cb5deddb42429df2c99f0c68d7382e9bd/flink-runtime/src/main/java/org/apache/flink/runtime/resourcemanager/slotmanager/TaskExecutorManager.java#L66-L68
> > >> > > >
> > >> > > > On Tue, Feb 28, 2023 at 6:14 AM Matt Wang <wa...@163.com>
> > wrote:
> > >> > > >
> > >> > > >> This is a good proposal for me, it will make the code of the
> > >> > SlotManager
> > >> > > >> more clear.
> > >> > > >>
> > >> > > >>
> > >> > > >>
> > >> > > >> --
> > >> > > >>
> > >> > > >> Best,
> > >> > > >> Matt Wang
> > >> > > >>
> > >> > > >>
> > >> > > >> ---- Replied Message ----
> > >> > > >> | From | David Morávek<dm...@apache.org> |
> > >> > > >> | Date | 02/27/2023 22:45 |
> > >> > > >> | To | <de...@flink.apache.org> |
> > >> > > >> | Subject | Re: [DISCUSS] FLIP-298: Unifying the Implementation
> > of
> > >> > > >> SlotManager |
> > >> > > >> Hi Weihua, I still need to dig into the details, but the
> overall
> > >> > > sentiment
> > >> > > >> of this change sounds reasonable.
> > >> > > >>
> > >> > > >> Best,
> > >> > > >> D.
> > >> > > >>
> > >> > > >> On Mon, Feb 27, 2023 at 2:26 PM Zhanghao Chen <
> > >> > > zhanghao.chen@outlook.com>
> > >> > > >> wrote:
> > >> > > >>
> > >> > > >> Thanks for driving this topic. I think this FLIP could help
> > clean up
> > >> > the
> > >> > > >> codebase to make it easier to maintain. +1 on it.
> > >> > > >>
> > >> > > >> Best,
> > >> > > >> Zhanghao Chen
> > >> > > >> ________________________________
> > >> > > >> From: Weihua Hu <hu...@gmail.com>
> > >> > > >> Sent: Monday, February 27, 2023 20:40
> > >> > > >> To: dev <de...@flink.apache.org>
> > >> > > >> Subject: [DISCUSS] FLIP-298: Unifying the Implementation of
> > >> > SlotManager
> > >> > > >>
> > >> > > >> Hi everyone,
> > >> > > >>
> > >> > > >> I would like to begin a discussion on FLIP-298: Unifying the
> > >> > > Implementation
> > >> > > >> of SlotManager[1]. There are currently two types of SlotManager
> > in
> > >> > > Flink:
> > >> > > >> DeclarativeSlotManager and FineGrainedSlotManager.
> > >> > > FineGrainedSlotManager
> > >> > > >> should behave as DeclarativeSlotManager if the user does not
> > >> configure
> > >> > > the
> > >> > > >> slot request profile.
> > >> > > >>
> > >> > > >> Therefore, this FLIP aims to unify the implementation of
> > SlotManager
> > >> > in
> > >> > > >> order to reduce maintenance costs.
> > >> > > >>
> > >> > > >> Looking forward to hearing from you.
> > >> > > >>
> > >> > > >> [1]
> > >> > > >>
> > >> > > >>
> > >> > > >>
> > >> > >
> > >> >
> > >>
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-298%3A+Unifying+the+Implementation+of+SlotManager
> > >> > > >>
> > >> > > >> Best,
> > >> > > >> Weihua
> > >> > > >>
> > >> > > >>
> > >> > >
> > >> >
> > >>
> >
>

Re: [DISCUSS] FLIP-298: Unifying the Implementation of SlotManager

Posted by weijie guo <gu...@gmail.com>.
Thanks Weihua for preparing this FLIP.

This FLIP overall looks reasonable to me after updating as suggested by
Matthias.

I only have one small question about keeping some redundant task managers:
In the fine-grained resource management, theoretically, it can support
heterogeneous taskmanagers. When we complete the missing features for FGSM,
do we plan to take this into account?
Of course, if I remember correctly, FGSM will not request heterogeneous
resources at present, so it is also acceptable to me if there is no special
treatment now.

+1 for this changes if we can ensure the test coverage.

Best regards,

Weijie


John Roesler <vv...@apache.org> 于2023年3月2日周四 12:53写道:

> Thanks for the test plan, Weihua!
>
> Yes, it addresses my concerns.
>
> Thanks,
> John
>
> On Wed, Mar 1, 2023, at 22:38, Weihua Hu wrote:
> > Hi, everyone,
> > Thanks for your suggestions and ideas.
> > Thanks Xintong for sharing the detailed backgrounds of SlotManager.
> >
> > *@Matthias
> >
> > 1. Did you do a proper test coverage analysis?
> >
> >
> > Just as Xintong said, we already have a CI stage for fine grained
> resource
> > managers.
> > And I will make sure FineGrainedSlotManager as the default SlotManager
> can
> > pass all the tests of CI.
> > In addition, I will review all unit tests of DeclarativeSlotManager(DSM)
> to
> > ensure that there are no gaps in the
> > coverage provided by the FineGrainedSlotManager.
> > I also added the 'Test Plan' part to the FLIP.
> > @Matthias @John @Shammon Does this test plan address your concerns?
> >
> > 2.  DeclarativeSlotManager and FineGrainedSlotManager feel quite big in
> >
> > terms of lines of code....
> >
> >
> > IMO, the refactoring of SlotManager does not belong to this FLIP since it
> > may lead to some unstable risks. For
> > FineGrainedSlotManager(FGSM), we already split some reasonable
> components.
> > They are:
> > * TaskManagerTracker: Track task managers and their resources.
> > * ResourceTracker: track requirements of jobs
> > * ResourceAllocationStrategy: Try to fulfill the resource requirements
> with
> > available/pending resources.
> > * SlotStatusSyncer: communicate with TaskManager, for allocating/freeing
> > slot and reconciling the slot status
> > Maybe we can start a discussion about refactoring SlotManager in another
> > FLIP if there are some good suggestions.
> > WDYT
> >
> > 3. For me personally, having a more detailed summary comparing the
> >> subcomponents of both SlotManager implementations with where
> >> their functionality matches and where they differ might help understand
> the
> >> consequences of the changes proposed in FLIP-298
> >
> > Good suggestion, I have updated the comparison in this FLIP. Looking
> > forward to any suggestions/thoughts
> > if they are not described clearly.
> >
> > *@John
> >
> > 4. In addition to changing the default, would it make sense to log a
> >> deprecation warning on initialization
> >
> > if the DeclarativeSlotManager is used?
> >>
> > SGTM, We should add Deprecated annotations to DSM for devs. And log a
> > deprecation warning for users.
> >
> > *@Shammon
> >
> > 1. For their functional differences, can you give some detailed tests to
> >> verify that the new FineGrainedSlotManager has these capabilities? This
> can
> >> effectively verify the new functions
> >>
> > As just maintained, there is already a CI stage of FGSM, and I will do
> more
> > review of unit tests for DSM.
> >
> >  2. I'm worried that many functions are not independent and it is
> difficult
> >> to migrate step-by-step. You can list the relationship between them in
> >> detail.
> >
> >  As Xintong saied the DSM is a subset of FGSM by design. But as time goes
> > on, FGSM has some lacking
> > functions as I listed in this FLIP. And I have added the comparison
> between
> > DSM and FGSM in this FLIP.
> >
> >
> > Thanks again for all your thoughts. Any feedback is appreciated!
> >
> > Best,
> > Weihua
> >
> >
> > On Wed, Mar 1, 2023 at 2:17 PM Xintong Song <to...@gmail.com>
> wrote:
> >
> >> Thanks Weihua for preparing this FLIP. +1 for the proposal.
> >>
> >>
> >> As one of the contributors of the fine-grained slot manager, I'd like to
> >> share some backgrounds here.
> >>
> >> - There used to be a defaut slot manager implementation, which is
> >> non-declarative and has been removed now. The two features, declarative
> /
> >> reactive resource management and fine-grained resource management, were
> >> proposed at about the same time. We were aware that by design the
> >> declarative slot manager is a subset of fine-grained slot manager at
> that
> >> time, but still decided to implement two slot managers for the purpose
> of
> >> decoupling efforts and reducing cross-team synchronization overhead.
> >> Merging the two slot managers once they are proved stable is IMO a
> >> technical debt that was planned at the very beginning.
> >>
> >> - The FineGrainedSlotManager has been verified in Alibaba's internal
> >> production as well as Alibaba Cloud services as the default slot manager
> >> for about 2 years.
> >>
> >>
> >> Concerning test cases, we currently have a ci stage for fine grained
> >> resource management. To avoid adding too much burden, the stage only
> >> includes tests from flink-runtime and flink-test modules. I think
> switching
> >> the default slot manager and applying the whole set of tests on the
> >> fine-grained slot manager would help us to be more confident about it.
> >>
> >>
> >> Concerning cutting out functionalities out of slot manager, I think
> Yangze
> >> and I have tried our best to shape the FineGrainedSlotManager into
> >> reasonable components. I personally don't have other ideas to further
> >> disassemble the component, but I'm open to such suggestions. However,
> from
> >> the stability perspective, I'd be in favor of not introducing
> significant
> >> changes to the FineGrainedSlotManager while switching it to the default.
> >> Because the current implementation has already been verified (or at
> least
> >> partially verified because Alibaba does not cover all the Flink use
> cases),
> >> and introducing more changes also means more chances of breaking things.
> >>
> >>
> >> Best,
> >>
> >> Xintong
> >>
> >>
> >>
> >> On Wed, Mar 1, 2023 at 11:12 AM Shammon FY <zj...@gmail.com> wrote:
> >>
> >> > Hi
> >> >
> >> > Thanks for starting this work weihua, I think unifying
> >> > DeclarativeSlotManager and FineGrainedSlotManager is valuable.
> >> >
> >> > I agree with @Matthias and @John that we need a way to ensure that
> >> > DeclarativeSlotManager's capabilities are fully covered by
> >> > FineGrainedSlotManager
> >> >
> >> > 1. For their functional differences, can you give some detailed tests
> to
> >> > verify that the new FineGrainedSlotManager has these capabilities?
> This
> >> can
> >> > effectively verify the new functions
> >> >
> >> > 2. I'm worried that many functions are not independent and it is
> >> difficult
> >> > to migrate step-by-step. You can list the relationship between them in
> >> > detail.
> >> >
> >> > 3. As John mentioned, give a smoke test for FineGrainedSlotManager is
> a
> >> > good idea. Or you can add some test information to the
> >> > DeclarativeSlotManager to determine how many tests have used it. In
> this
> >> > way, we can gradually construct test cases for FineGrainedSlotManager
> >> > during the development process.
> >> >
> >> >
> >> > Best,
> >> > Shammon
> >> >
> >> >
> >> > On Tue, Feb 28, 2023 at 10:22 PM John Roesler <vv...@apache.org>
> >> wrote:
> >> >
> >> > > Thanks for the FLIP, Weihua!
> >> > >
> >> > > I’ve read the FLIP, and it sounds good to me. We need to avoid
> >> > > proliferating alternative implementations wherever possible. I have
> >> just
> >> > a
> >> > > couple of comments:
> >> > >
> >> > > 1. I share Matthias’s concern about ensuring the behavior is really
> the
> >> > > same. One suggestion I’ve used for this kind of thing is, as a smoke
> >> > test,
> >> > > to update the DeclarativeSlotManager to just delegate to the
> >> > > FineGrainedSlotManager. If the full test suite still passes, you
> can be
> >> > > pretty sure the new default is really ok. It would not be a good
> idea
> >> to
> >> > > actually keep that in for the release, since it would remove the
> option
> >> > to
> >> > > fall back in case of bugs. Either way, we need to make sure all test
> >> > > scenarios are present for the FGSM.
> >> > >
> >> > > 4. In addition to changing the default, would it make sense to log a
> >> > > deprecation warning on initialization if the DeclarativeSlotManager
> is
> >> > used?
> >> > >
> >> > > Thanks again,
> >> > > John
> >> > >
> >> > > On Tue, Feb 28, 2023, at 07:20, Matthias Pohl wrote:
> >> > > > Hi Weihua,
> >> > > > Thanks for your proposal. From a conceptual point: AFAIU, the
> >> > > > DeclarativeSlotManager covers a subset (i.e. only evenly sized
> slots)
> >> > of
> >> > > > what the FineGrainedSlotManager should be able to achieve
> (variable
> >> > slot
> >> > > > size per task manager). Is this the right
> assumption/understanding?
> >> In
> >> > > this
> >> > > > sense, merging both implementations into a single one sounds
> good. A
> >> > few
> >> > > > more general comments, though:
> >> > > >
> >> > > > 1. Did you do a proper test coverage analysis? That's not
> mentioned
> >> in
> >> > > the
> >> > > > current version of the FLIP. I'm bringing this up because we ran
> into
> >> > the
> >> > > > same issue when fixing the flaws that popped up after introducing
> the
> >> > > > multi-component leader election (see FLIP-285 [1]). There is a
> risk
> >> > that
> >> > > by
> >> > > > removing the legacy code we decrease test coverage because certain
> >> > > > test cases that were covered for the legacy classes might not be
> >> > > > necessarily covered in the new implementation, yet (see
> FLINK-30338
> >> [2]
> >> > > > which covers this issue for the leader election case). Ideally, we
> >> > don't
> >> > > > want to remove test cases accidentally because they were only
> >> > implemented
> >> > > > for the DeclarativeSlotManager but missed for the
> >> > FineGrainedSlotManager.
> >> > > >
> >> > > > 2. DeclarativeSlotManager and FineGrainedSlotManager feel quite
> big
> >> in
> >> > > > terms of lines of code. Without knowing whether it's actually a
> >> > > reasonable
> >> > > > thing to do: Instead of just adding more features to the
> >> > > > FineGrainedSlotManager, have you thought of cutting out
> functionality
> >> > > into
> >> > > > smaller sub-components along this refactoring? Such a step-by-step
> >> > > approach
> >> > > > might improve the overall codebase and might make reviewing the
> >> > > refactoring
> >> > > > easier. I did a first pass over the code and struggled to identify
> >> code
> >> > > > blocks that could be moved out of the SlotManager
> implementation(s).
> >> > > > Therefore, I might be wrong with this proposal. I haven't worked
> on
> >> > this
> >> > > > codebase in that detail that it would allow me to come up with a
> >> > > judgement
> >> > > > call. I wanted to bring it up, anyway, because I'm curious whether
> >> that
> >> > > > could be an option. There's a comment created by Chesnay (CC'd) in
> >> the
> >> > > > JavaDoc of TaskExecutorManager [3] indicating something similar.
> I'm
> >> > > > wondering whether he can add some insights here.
> >> > > >
> >> > > > 3. For me personally, having a more detailed summary comparing the
> >> > > > subcomponents of both SlotManager implementations with where
> >> > > > their functionality matches and where they differ might help
> >> understand
> >> > > the
> >> > > > consequences of the changes proposed in FLIP-298.
> >> > > >
> >> > > > Best,
> >> > > > Matthias
> >> > > >
> >> > > > [1]
> >> > > >
> >> > >
> >> >
> >>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-285%3A+Refactoring+LeaderElection+to+make+Flink+support+multi-component+leader+election+out-of-the-box
> >> > > > [2] https://issues.apache.org/jira/browse/FLINK-30338
> >> > > > [3]
> >> > > >
> >> > >
> >> >
> >>
> https://github.com/apache/flink/blob/f611ea8cb5deddb42429df2c99f0c68d7382e9bd/flink-runtime/src/main/java/org/apache/flink/runtime/resourcemanager/slotmanager/TaskExecutorManager.java#L66-L68
> >> > > >
> >> > > > On Tue, Feb 28, 2023 at 6:14 AM Matt Wang <wa...@163.com>
> wrote:
> >> > > >
> >> > > >> This is a good proposal for me, it will make the code of the
> >> > SlotManager
> >> > > >> more clear.
> >> > > >>
> >> > > >>
> >> > > >>
> >> > > >> --
> >> > > >>
> >> > > >> Best,
> >> > > >> Matt Wang
> >> > > >>
> >> > > >>
> >> > > >> ---- Replied Message ----
> >> > > >> | From | David Morávek<dm...@apache.org> |
> >> > > >> | Date | 02/27/2023 22:45 |
> >> > > >> | To | <de...@flink.apache.org> |
> >> > > >> | Subject | Re: [DISCUSS] FLIP-298: Unifying the Implementation
> of
> >> > > >> SlotManager |
> >> > > >> Hi Weihua, I still need to dig into the details, but the overall
> >> > > sentiment
> >> > > >> of this change sounds reasonable.
> >> > > >>
> >> > > >> Best,
> >> > > >> D.
> >> > > >>
> >> > > >> On Mon, Feb 27, 2023 at 2:26 PM Zhanghao Chen <
> >> > > zhanghao.chen@outlook.com>
> >> > > >> wrote:
> >> > > >>
> >> > > >> Thanks for driving this topic. I think this FLIP could help
> clean up
> >> > the
> >> > > >> codebase to make it easier to maintain. +1 on it.
> >> > > >>
> >> > > >> Best,
> >> > > >> Zhanghao Chen
> >> > > >> ________________________________
> >> > > >> From: Weihua Hu <hu...@gmail.com>
> >> > > >> Sent: Monday, February 27, 2023 20:40
> >> > > >> To: dev <de...@flink.apache.org>
> >> > > >> Subject: [DISCUSS] FLIP-298: Unifying the Implementation of
> >> > SlotManager
> >> > > >>
> >> > > >> Hi everyone,
> >> > > >>
> >> > > >> I would like to begin a discussion on FLIP-298: Unifying the
> >> > > Implementation
> >> > > >> of SlotManager[1]. There are currently two types of SlotManager
> in
> >> > > Flink:
> >> > > >> DeclarativeSlotManager and FineGrainedSlotManager.
> >> > > FineGrainedSlotManager
> >> > > >> should behave as DeclarativeSlotManager if the user does not
> >> configure
> >> > > the
> >> > > >> slot request profile.
> >> > > >>
> >> > > >> Therefore, this FLIP aims to unify the implementation of
> SlotManager
> >> > in
> >> > > >> order to reduce maintenance costs.
> >> > > >>
> >> > > >> Looking forward to hearing from you.
> >> > > >>
> >> > > >> [1]
> >> > > >>
> >> > > >>
> >> > > >>
> >> > >
> >> >
> >>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-298%3A+Unifying+the+Implementation+of+SlotManager
> >> > > >>
> >> > > >> Best,
> >> > > >> Weihua
> >> > > >>
> >> > > >>
> >> > >
> >> >
> >>
>

Re: [DISCUSS] FLIP-298: Unifying the Implementation of SlotManager

Posted by John Roesler <vv...@apache.org>.
Thanks for the test plan, Weihua!

Yes, it addresses my concerns. 

Thanks,
John

On Wed, Mar 1, 2023, at 22:38, Weihua Hu wrote:
> Hi, everyone,
> Thanks for your suggestions and ideas.
> Thanks Xintong for sharing the detailed backgrounds of SlotManager.
>
> *@Matthias
>
> 1. Did you do a proper test coverage analysis?
>
>
> Just as Xintong said, we already have a CI stage for fine grained resource
> managers.
> And I will make sure FineGrainedSlotManager as the default SlotManager can
> pass all the tests of CI.
> In addition, I will review all unit tests of DeclarativeSlotManager(DSM) to
> ensure that there are no gaps in the
> coverage provided by the FineGrainedSlotManager.
> I also added the 'Test Plan' part to the FLIP.
> @Matthias @John @Shammon Does this test plan address your concerns?
>
> 2.  DeclarativeSlotManager and FineGrainedSlotManager feel quite big in
>
> terms of lines of code....
>
>
> IMO, the refactoring of SlotManager does not belong to this FLIP since it
> may lead to some unstable risks. For
> FineGrainedSlotManager(FGSM), we already split some reasonable components.
> They are:
> * TaskManagerTracker: Track task managers and their resources.
> * ResourceTracker: track requirements of jobs
> * ResourceAllocationStrategy: Try to fulfill the resource requirements with
> available/pending resources.
> * SlotStatusSyncer: communicate with TaskManager, for allocating/freeing
> slot and reconciling the slot status
> Maybe we can start a discussion about refactoring SlotManager in another
> FLIP if there are some good suggestions.
> WDYT
>
> 3. For me personally, having a more detailed summary comparing the
>> subcomponents of both SlotManager implementations with where
>> their functionality matches and where they differ might help understand the
>> consequences of the changes proposed in FLIP-298
>
> Good suggestion, I have updated the comparison in this FLIP. Looking
> forward to any suggestions/thoughts
> if they are not described clearly.
>
> *@John
>
> 4. In addition to changing the default, would it make sense to log a
>> deprecation warning on initialization
>
> if the DeclarativeSlotManager is used?
>>
> SGTM, We should add Deprecated annotations to DSM for devs. And log a
> deprecation warning for users.
>
> *@Shammon
>
> 1. For their functional differences, can you give some detailed tests to
>> verify that the new FineGrainedSlotManager has these capabilities? This can
>> effectively verify the new functions
>>
> As just maintained, there is already a CI stage of FGSM, and I will do more
> review of unit tests for DSM.
>
>  2. I'm worried that many functions are not independent and it is difficult
>> to migrate step-by-step. You can list the relationship between them in
>> detail.
>
>  As Xintong saied the DSM is a subset of FGSM by design. But as time goes
> on, FGSM has some lacking
> functions as I listed in this FLIP. And I have added the comparison between
> DSM and FGSM in this FLIP.
>
>
> Thanks again for all your thoughts. Any feedback is appreciated!
>
> Best,
> Weihua
>
>
> On Wed, Mar 1, 2023 at 2:17 PM Xintong Song <to...@gmail.com> wrote:
>
>> Thanks Weihua for preparing this FLIP. +1 for the proposal.
>>
>>
>> As one of the contributors of the fine-grained slot manager, I'd like to
>> share some backgrounds here.
>>
>> - There used to be a defaut slot manager implementation, which is
>> non-declarative and has been removed now. The two features, declarative /
>> reactive resource management and fine-grained resource management, were
>> proposed at about the same time. We were aware that by design the
>> declarative slot manager is a subset of fine-grained slot manager at that
>> time, but still decided to implement two slot managers for the purpose of
>> decoupling efforts and reducing cross-team synchronization overhead.
>> Merging the two slot managers once they are proved stable is IMO a
>> technical debt that was planned at the very beginning.
>>
>> - The FineGrainedSlotManager has been verified in Alibaba's internal
>> production as well as Alibaba Cloud services as the default slot manager
>> for about 2 years.
>>
>>
>> Concerning test cases, we currently have a ci stage for fine grained
>> resource management. To avoid adding too much burden, the stage only
>> includes tests from flink-runtime and flink-test modules. I think switching
>> the default slot manager and applying the whole set of tests on the
>> fine-grained slot manager would help us to be more confident about it.
>>
>>
>> Concerning cutting out functionalities out of slot manager, I think Yangze
>> and I have tried our best to shape the FineGrainedSlotManager into
>> reasonable components. I personally don't have other ideas to further
>> disassemble the component, but I'm open to such suggestions. However, from
>> the stability perspective, I'd be in favor of not introducing significant
>> changes to the FineGrainedSlotManager while switching it to the default.
>> Because the current implementation has already been verified (or at least
>> partially verified because Alibaba does not cover all the Flink use cases),
>> and introducing more changes also means more chances of breaking things.
>>
>>
>> Best,
>>
>> Xintong
>>
>>
>>
>> On Wed, Mar 1, 2023 at 11:12 AM Shammon FY <zj...@gmail.com> wrote:
>>
>> > Hi
>> >
>> > Thanks for starting this work weihua, I think unifying
>> > DeclarativeSlotManager and FineGrainedSlotManager is valuable.
>> >
>> > I agree with @Matthias and @John that we need a way to ensure that
>> > DeclarativeSlotManager's capabilities are fully covered by
>> > FineGrainedSlotManager
>> >
>> > 1. For their functional differences, can you give some detailed tests to
>> > verify that the new FineGrainedSlotManager has these capabilities? This
>> can
>> > effectively verify the new functions
>> >
>> > 2. I'm worried that many functions are not independent and it is
>> difficult
>> > to migrate step-by-step. You can list the relationship between them in
>> > detail.
>> >
>> > 3. As John mentioned, give a smoke test for FineGrainedSlotManager is a
>> > good idea. Or you can add some test information to the
>> > DeclarativeSlotManager to determine how many tests have used it. In this
>> > way, we can gradually construct test cases for FineGrainedSlotManager
>> > during the development process.
>> >
>> >
>> > Best,
>> > Shammon
>> >
>> >
>> > On Tue, Feb 28, 2023 at 10:22 PM John Roesler <vv...@apache.org>
>> wrote:
>> >
>> > > Thanks for the FLIP, Weihua!
>> > >
>> > > I’ve read the FLIP, and it sounds good to me. We need to avoid
>> > > proliferating alternative implementations wherever possible. I have
>> just
>> > a
>> > > couple of comments:
>> > >
>> > > 1. I share Matthias’s concern about ensuring the behavior is really the
>> > > same. One suggestion I’ve used for this kind of thing is, as a smoke
>> > test,
>> > > to update the DeclarativeSlotManager to just delegate to the
>> > > FineGrainedSlotManager. If the full test suite still passes, you can be
>> > > pretty sure the new default is really ok. It would not be a good idea
>> to
>> > > actually keep that in for the release, since it would remove the option
>> > to
>> > > fall back in case of bugs. Either way, we need to make sure all test
>> > > scenarios are present for the FGSM.
>> > >
>> > > 4. In addition to changing the default, would it make sense to log a
>> > > deprecation warning on initialization if the DeclarativeSlotManager is
>> > used?
>> > >
>> > > Thanks again,
>> > > John
>> > >
>> > > On Tue, Feb 28, 2023, at 07:20, Matthias Pohl wrote:
>> > > > Hi Weihua,
>> > > > Thanks for your proposal. From a conceptual point: AFAIU, the
>> > > > DeclarativeSlotManager covers a subset (i.e. only evenly sized slots)
>> > of
>> > > > what the FineGrainedSlotManager should be able to achieve (variable
>> > slot
>> > > > size per task manager). Is this the right assumption/understanding?
>> In
>> > > this
>> > > > sense, merging both implementations into a single one sounds good. A
>> > few
>> > > > more general comments, though:
>> > > >
>> > > > 1. Did you do a proper test coverage analysis? That's not mentioned
>> in
>> > > the
>> > > > current version of the FLIP. I'm bringing this up because we ran into
>> > the
>> > > > same issue when fixing the flaws that popped up after introducing the
>> > > > multi-component leader election (see FLIP-285 [1]). There is a risk
>> > that
>> > > by
>> > > > removing the legacy code we decrease test coverage because certain
>> > > > test cases that were covered for the legacy classes might not be
>> > > > necessarily covered in the new implementation, yet (see FLINK-30338
>> [2]
>> > > > which covers this issue for the leader election case). Ideally, we
>> > don't
>> > > > want to remove test cases accidentally because they were only
>> > implemented
>> > > > for the DeclarativeSlotManager but missed for the
>> > FineGrainedSlotManager.
>> > > >
>> > > > 2. DeclarativeSlotManager and FineGrainedSlotManager feel quite big
>> in
>> > > > terms of lines of code. Without knowing whether it's actually a
>> > > reasonable
>> > > > thing to do: Instead of just adding more features to the
>> > > > FineGrainedSlotManager, have you thought of cutting out functionality
>> > > into
>> > > > smaller sub-components along this refactoring? Such a step-by-step
>> > > approach
>> > > > might improve the overall codebase and might make reviewing the
>> > > refactoring
>> > > > easier. I did a first pass over the code and struggled to identify
>> code
>> > > > blocks that could be moved out of the SlotManager implementation(s).
>> > > > Therefore, I might be wrong with this proposal. I haven't worked on
>> > this
>> > > > codebase in that detail that it would allow me to come up with a
>> > > judgement
>> > > > call. I wanted to bring it up, anyway, because I'm curious whether
>> that
>> > > > could be an option. There's a comment created by Chesnay (CC'd) in
>> the
>> > > > JavaDoc of TaskExecutorManager [3] indicating something similar. I'm
>> > > > wondering whether he can add some insights here.
>> > > >
>> > > > 3. For me personally, having a more detailed summary comparing the
>> > > > subcomponents of both SlotManager implementations with where
>> > > > their functionality matches and where they differ might help
>> understand
>> > > the
>> > > > consequences of the changes proposed in FLIP-298.
>> > > >
>> > > > Best,
>> > > > Matthias
>> > > >
>> > > > [1]
>> > > >
>> > >
>> >
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-285%3A+Refactoring+LeaderElection+to+make+Flink+support+multi-component+leader+election+out-of-the-box
>> > > > [2] https://issues.apache.org/jira/browse/FLINK-30338
>> > > > [3]
>> > > >
>> > >
>> >
>> https://github.com/apache/flink/blob/f611ea8cb5deddb42429df2c99f0c68d7382e9bd/flink-runtime/src/main/java/org/apache/flink/runtime/resourcemanager/slotmanager/TaskExecutorManager.java#L66-L68
>> > > >
>> > > > On Tue, Feb 28, 2023 at 6:14 AM Matt Wang <wa...@163.com> wrote:
>> > > >
>> > > >> This is a good proposal for me, it will make the code of the
>> > SlotManager
>> > > >> more clear.
>> > > >>
>> > > >>
>> > > >>
>> > > >> --
>> > > >>
>> > > >> Best,
>> > > >> Matt Wang
>> > > >>
>> > > >>
>> > > >> ---- Replied Message ----
>> > > >> | From | David Morávek<dm...@apache.org> |
>> > > >> | Date | 02/27/2023 22:45 |
>> > > >> | To | <de...@flink.apache.org> |
>> > > >> | Subject | Re: [DISCUSS] FLIP-298: Unifying the Implementation of
>> > > >> SlotManager |
>> > > >> Hi Weihua, I still need to dig into the details, but the overall
>> > > sentiment
>> > > >> of this change sounds reasonable.
>> > > >>
>> > > >> Best,
>> > > >> D.
>> > > >>
>> > > >> On Mon, Feb 27, 2023 at 2:26 PM Zhanghao Chen <
>> > > zhanghao.chen@outlook.com>
>> > > >> wrote:
>> > > >>
>> > > >> Thanks for driving this topic. I think this FLIP could help clean up
>> > the
>> > > >> codebase to make it easier to maintain. +1 on it.
>> > > >>
>> > > >> Best,
>> > > >> Zhanghao Chen
>> > > >> ________________________________
>> > > >> From: Weihua Hu <hu...@gmail.com>
>> > > >> Sent: Monday, February 27, 2023 20:40
>> > > >> To: dev <de...@flink.apache.org>
>> > > >> Subject: [DISCUSS] FLIP-298: Unifying the Implementation of
>> > SlotManager
>> > > >>
>> > > >> Hi everyone,
>> > > >>
>> > > >> I would like to begin a discussion on FLIP-298: Unifying the
>> > > Implementation
>> > > >> of SlotManager[1]. There are currently two types of SlotManager in
>> > > Flink:
>> > > >> DeclarativeSlotManager and FineGrainedSlotManager.
>> > > FineGrainedSlotManager
>> > > >> should behave as DeclarativeSlotManager if the user does not
>> configure
>> > > the
>> > > >> slot request profile.
>> > > >>
>> > > >> Therefore, this FLIP aims to unify the implementation of SlotManager
>> > in
>> > > >> order to reduce maintenance costs.
>> > > >>
>> > > >> Looking forward to hearing from you.
>> > > >>
>> > > >> [1]
>> > > >>
>> > > >>
>> > > >>
>> > >
>> >
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-298%3A+Unifying+the+Implementation+of+SlotManager
>> > > >>
>> > > >> Best,
>> > > >> Weihua
>> > > >>
>> > > >>
>> > >
>> >
>>

Re: [DISCUSS] FLIP-298: Unifying the Implementation of SlotManager

Posted by Weihua Hu <hu...@gmail.com>.
Hi, everyone,
Thanks for your suggestions and ideas.
Thanks Xintong for sharing the detailed backgrounds of SlotManager.

*@Matthias

1. Did you do a proper test coverage analysis?


Just as Xintong said, we already have a CI stage for fine grained resource
managers.
And I will make sure FineGrainedSlotManager as the default SlotManager can
pass all the tests of CI.
In addition, I will review all unit tests of DeclarativeSlotManager(DSM) to
ensure that there are no gaps in the
coverage provided by the FineGrainedSlotManager.
I also added the 'Test Plan' part to the FLIP.
@Matthias @John @Shammon Does this test plan address your concerns?

2.  DeclarativeSlotManager and FineGrainedSlotManager feel quite big in

terms of lines of code....


IMO, the refactoring of SlotManager does not belong to this FLIP since it
may lead to some unstable risks. For
FineGrainedSlotManager(FGSM), we already split some reasonable components.
They are:
* TaskManagerTracker: Track task managers and their resources.
* ResourceTracker: track requirements of jobs
* ResourceAllocationStrategy: Try to fulfill the resource requirements with
available/pending resources.
* SlotStatusSyncer: communicate with TaskManager, for allocating/freeing
slot and reconciling the slot status
Maybe we can start a discussion about refactoring SlotManager in another
FLIP if there are some good suggestions.
WDYT

3. For me personally, having a more detailed summary comparing the
> subcomponents of both SlotManager implementations with where
> their functionality matches and where they differ might help understand the
> consequences of the changes proposed in FLIP-298

Good suggestion, I have updated the comparison in this FLIP. Looking
forward to any suggestions/thoughts
if they are not described clearly.

*@John

4. In addition to changing the default, would it make sense to log a
> deprecation warning on initialization

if the DeclarativeSlotManager is used?
>
SGTM, We should add Deprecated annotations to DSM for devs. And log a
deprecation warning for users.

*@Shammon

1. For their functional differences, can you give some detailed tests to
> verify that the new FineGrainedSlotManager has these capabilities? This can
> effectively verify the new functions
>
As just maintained, there is already a CI stage of FGSM, and I will do more
review of unit tests for DSM.

 2. I'm worried that many functions are not independent and it is difficult
> to migrate step-by-step. You can list the relationship between them in
> detail.

 As Xintong saied the DSM is a subset of FGSM by design. But as time goes
on, FGSM has some lacking
functions as I listed in this FLIP. And I have added the comparison between
DSM and FGSM in this FLIP.


Thanks again for all your thoughts. Any feedback is appreciated!

Best,
Weihua


On Wed, Mar 1, 2023 at 2:17 PM Xintong Song <to...@gmail.com> wrote:

> Thanks Weihua for preparing this FLIP. +1 for the proposal.
>
>
> As one of the contributors of the fine-grained slot manager, I'd like to
> share some backgrounds here.
>
> - There used to be a defaut slot manager implementation, which is
> non-declarative and has been removed now. The two features, declarative /
> reactive resource management and fine-grained resource management, were
> proposed at about the same time. We were aware that by design the
> declarative slot manager is a subset of fine-grained slot manager at that
> time, but still decided to implement two slot managers for the purpose of
> decoupling efforts and reducing cross-team synchronization overhead.
> Merging the two slot managers once they are proved stable is IMO a
> technical debt that was planned at the very beginning.
>
> - The FineGrainedSlotManager has been verified in Alibaba's internal
> production as well as Alibaba Cloud services as the default slot manager
> for about 2 years.
>
>
> Concerning test cases, we currently have a ci stage for fine grained
> resource management. To avoid adding too much burden, the stage only
> includes tests from flink-runtime and flink-test modules. I think switching
> the default slot manager and applying the whole set of tests on the
> fine-grained slot manager would help us to be more confident about it.
>
>
> Concerning cutting out functionalities out of slot manager, I think Yangze
> and I have tried our best to shape the FineGrainedSlotManager into
> reasonable components. I personally don't have other ideas to further
> disassemble the component, but I'm open to such suggestions. However, from
> the stability perspective, I'd be in favor of not introducing significant
> changes to the FineGrainedSlotManager while switching it to the default.
> Because the current implementation has already been verified (or at least
> partially verified because Alibaba does not cover all the Flink use cases),
> and introducing more changes also means more chances of breaking things.
>
>
> Best,
>
> Xintong
>
>
>
> On Wed, Mar 1, 2023 at 11:12 AM Shammon FY <zj...@gmail.com> wrote:
>
> > Hi
> >
> > Thanks for starting this work weihua, I think unifying
> > DeclarativeSlotManager and FineGrainedSlotManager is valuable.
> >
> > I agree with @Matthias and @John that we need a way to ensure that
> > DeclarativeSlotManager's capabilities are fully covered by
> > FineGrainedSlotManager
> >
> > 1. For their functional differences, can you give some detailed tests to
> > verify that the new FineGrainedSlotManager has these capabilities? This
> can
> > effectively verify the new functions
> >
> > 2. I'm worried that many functions are not independent and it is
> difficult
> > to migrate step-by-step. You can list the relationship between them in
> > detail.
> >
> > 3. As John mentioned, give a smoke test for FineGrainedSlotManager is a
> > good idea. Or you can add some test information to the
> > DeclarativeSlotManager to determine how many tests have used it. In this
> > way, we can gradually construct test cases for FineGrainedSlotManager
> > during the development process.
> >
> >
> > Best,
> > Shammon
> >
> >
> > On Tue, Feb 28, 2023 at 10:22 PM John Roesler <vv...@apache.org>
> wrote:
> >
> > > Thanks for the FLIP, Weihua!
> > >
> > > I’ve read the FLIP, and it sounds good to me. We need to avoid
> > > proliferating alternative implementations wherever possible. I have
> just
> > a
> > > couple of comments:
> > >
> > > 1. I share Matthias’s concern about ensuring the behavior is really the
> > > same. One suggestion I’ve used for this kind of thing is, as a smoke
> > test,
> > > to update the DeclarativeSlotManager to just delegate to the
> > > FineGrainedSlotManager. If the full test suite still passes, you can be
> > > pretty sure the new default is really ok. It would not be a good idea
> to
> > > actually keep that in for the release, since it would remove the option
> > to
> > > fall back in case of bugs. Either way, we need to make sure all test
> > > scenarios are present for the FGSM.
> > >
> > > 4. In addition to changing the default, would it make sense to log a
> > > deprecation warning on initialization if the DeclarativeSlotManager is
> > used?
> > >
> > > Thanks again,
> > > John
> > >
> > > On Tue, Feb 28, 2023, at 07:20, Matthias Pohl wrote:
> > > > Hi Weihua,
> > > > Thanks for your proposal. From a conceptual point: AFAIU, the
> > > > DeclarativeSlotManager covers a subset (i.e. only evenly sized slots)
> > of
> > > > what the FineGrainedSlotManager should be able to achieve (variable
> > slot
> > > > size per task manager). Is this the right assumption/understanding?
> In
> > > this
> > > > sense, merging both implementations into a single one sounds good. A
> > few
> > > > more general comments, though:
> > > >
> > > > 1. Did you do a proper test coverage analysis? That's not mentioned
> in
> > > the
> > > > current version of the FLIP. I'm bringing this up because we ran into
> > the
> > > > same issue when fixing the flaws that popped up after introducing the
> > > > multi-component leader election (see FLIP-285 [1]). There is a risk
> > that
> > > by
> > > > removing the legacy code we decrease test coverage because certain
> > > > test cases that were covered for the legacy classes might not be
> > > > necessarily covered in the new implementation, yet (see FLINK-30338
> [2]
> > > > which covers this issue for the leader election case). Ideally, we
> > don't
> > > > want to remove test cases accidentally because they were only
> > implemented
> > > > for the DeclarativeSlotManager but missed for the
> > FineGrainedSlotManager.
> > > >
> > > > 2. DeclarativeSlotManager and FineGrainedSlotManager feel quite big
> in
> > > > terms of lines of code. Without knowing whether it's actually a
> > > reasonable
> > > > thing to do: Instead of just adding more features to the
> > > > FineGrainedSlotManager, have you thought of cutting out functionality
> > > into
> > > > smaller sub-components along this refactoring? Such a step-by-step
> > > approach
> > > > might improve the overall codebase and might make reviewing the
> > > refactoring
> > > > easier. I did a first pass over the code and struggled to identify
> code
> > > > blocks that could be moved out of the SlotManager implementation(s).
> > > > Therefore, I might be wrong with this proposal. I haven't worked on
> > this
> > > > codebase in that detail that it would allow me to come up with a
> > > judgement
> > > > call. I wanted to bring it up, anyway, because I'm curious whether
> that
> > > > could be an option. There's a comment created by Chesnay (CC'd) in
> the
> > > > JavaDoc of TaskExecutorManager [3] indicating something similar. I'm
> > > > wondering whether he can add some insights here.
> > > >
> > > > 3. For me personally, having a more detailed summary comparing the
> > > > subcomponents of both SlotManager implementations with where
> > > > their functionality matches and where they differ might help
> understand
> > > the
> > > > consequences of the changes proposed in FLIP-298.
> > > >
> > > > Best,
> > > > Matthias
> > > >
> > > > [1]
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-285%3A+Refactoring+LeaderElection+to+make+Flink+support+multi-component+leader+election+out-of-the-box
> > > > [2] https://issues.apache.org/jira/browse/FLINK-30338
> > > > [3]
> > > >
> > >
> >
> https://github.com/apache/flink/blob/f611ea8cb5deddb42429df2c99f0c68d7382e9bd/flink-runtime/src/main/java/org/apache/flink/runtime/resourcemanager/slotmanager/TaskExecutorManager.java#L66-L68
> > > >
> > > > On Tue, Feb 28, 2023 at 6:14 AM Matt Wang <wa...@163.com> wrote:
> > > >
> > > >> This is a good proposal for me, it will make the code of the
> > SlotManager
> > > >> more clear.
> > > >>
> > > >>
> > > >>
> > > >> --
> > > >>
> > > >> Best,
> > > >> Matt Wang
> > > >>
> > > >>
> > > >> ---- Replied Message ----
> > > >> | From | David Morávek<dm...@apache.org> |
> > > >> | Date | 02/27/2023 22:45 |
> > > >> | To | <de...@flink.apache.org> |
> > > >> | Subject | Re: [DISCUSS] FLIP-298: Unifying the Implementation of
> > > >> SlotManager |
> > > >> Hi Weihua, I still need to dig into the details, but the overall
> > > sentiment
> > > >> of this change sounds reasonable.
> > > >>
> > > >> Best,
> > > >> D.
> > > >>
> > > >> On Mon, Feb 27, 2023 at 2:26 PM Zhanghao Chen <
> > > zhanghao.chen@outlook.com>
> > > >> wrote:
> > > >>
> > > >> Thanks for driving this topic. I think this FLIP could help clean up
> > the
> > > >> codebase to make it easier to maintain. +1 on it.
> > > >>
> > > >> Best,
> > > >> Zhanghao Chen
> > > >> ________________________________
> > > >> From: Weihua Hu <hu...@gmail.com>
> > > >> Sent: Monday, February 27, 2023 20:40
> > > >> To: dev <de...@flink.apache.org>
> > > >> Subject: [DISCUSS] FLIP-298: Unifying the Implementation of
> > SlotManager
> > > >>
> > > >> Hi everyone,
> > > >>
> > > >> I would like to begin a discussion on FLIP-298: Unifying the
> > > Implementation
> > > >> of SlotManager[1]. There are currently two types of SlotManager in
> > > Flink:
> > > >> DeclarativeSlotManager and FineGrainedSlotManager.
> > > FineGrainedSlotManager
> > > >> should behave as DeclarativeSlotManager if the user does not
> configure
> > > the
> > > >> slot request profile.
> > > >>
> > > >> Therefore, this FLIP aims to unify the implementation of SlotManager
> > in
> > > >> order to reduce maintenance costs.
> > > >>
> > > >> Looking forward to hearing from you.
> > > >>
> > > >> [1]
> > > >>
> > > >>
> > > >>
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-298%3A+Unifying+the+Implementation+of+SlotManager
> > > >>
> > > >> Best,
> > > >> Weihua
> > > >>
> > > >>
> > >
> >
>

Re: [DISCUSS] FLIP-298: Unifying the Implementation of SlotManager

Posted by Chesnay Schepler <ch...@apache.org>.
Well there's no reason to make them work for both since we're dropping 
the DSM.

On 13/03/2023 15:39, Weihua Hu wrote:
> Thanks Chesnay for your suggestion.
>
> One thing you could potentially look into is integrating the
>> TaskExecutorManager into the FIneGrainedSlotManager or find another way
>> to move the timeout handling into another component.
> After going over the codes. It's hard to make TaskExecutorManager both work
> for DSM and FGSM.
> But we can move task manager related work from FGSM to TaskManagerTracker,
> which already
> tracks task managers but not including request/release. After this change,
> the TaskManagerTracker
> will manage all the behavior of TaskManager:
> 1. request new task manager
> 2. tracking pending task manager
> 3. tracking registered task manager
> 4. release idle task manager
> 5. deal with max resources limitation
>
> w.r.t. testing, there's the DeclarativeSlotManagerTest that covers some
>> specific issues that should be ported to the fine grained slot manager.
> Yes, we will check all the unit tests related to DeclarativeSlotManager
> carefully
> to make sure there are no tests lost while migrating from DSM to FGSM.
>
> Best,
> Weihua
>
>
> On Mon, Mar 13, 2023 at 6:27 PM Chesnay Schepler <ch...@apache.org> wrote:
>
>> One thing you could potentially look into is integrating the
>> TaskExecutorManager into the FIneGrainedSlotManager or find another way
>> to move the timeout handling into another component.
>>
>> w.r.t. testing, there's the DeclarativeSlotManagerTest that covers some
>> specific issues that should be ported to the fine grained slot manager.
>>
>> On 01/03/2023 07:17, Xintong Song wrote:
>>> Thanks Weihua for preparing this FLIP. +1 for the proposal.
>>>
>>>
>>> As one of the contributors of the fine-grained slot manager, I'd like to
>>> share some backgrounds here.
>>>
>>> - There used to be a defaut slot manager implementation, which is
>>> non-declarative and has been removed now. The two features, declarative /
>>> reactive resource management and fine-grained resource management, were
>>> proposed at about the same time. We were aware that by design the
>>> declarative slot manager is a subset of fine-grained slot manager at that
>>> time, but still decided to implement two slot managers for the purpose of
>>> decoupling efforts and reducing cross-team synchronization overhead.
>>> Merging the two slot managers once they are proved stable is IMO a
>>> technical debt that was planned at the very beginning.
>>>
>>> - The FineGrainedSlotManager has been verified in Alibaba's internal
>>> production as well as Alibaba Cloud services as the default slot manager
>>> for about 2 years.
>>>
>>>
>>> Concerning test cases, we currently have a ci stage for fine grained
>>> resource management. To avoid adding too much burden, the stage only
>>> includes tests from flink-runtime and flink-test modules. I think
>> switching
>>> the default slot manager and applying the whole set of tests on the
>>> fine-grained slot manager would help us to be more confident about it.
>>>
>>>
>>> Concerning cutting out functionalities out of slot manager, I think
>> Yangze
>>> and I have tried our best to shape the FineGrainedSlotManager into
>>> reasonable components. I personally don't have other ideas to further
>>> disassemble the component, but I'm open to such suggestions. However,
>> from
>>> the stability perspective, I'd be in favor of not introducing significant
>>> changes to the FineGrainedSlotManager while switching it to the default.
>>> Because the current implementation has already been verified (or at least
>>> partially verified because Alibaba does not cover all the Flink use
>> cases),
>>> and introducing more changes also means more chances of breaking things.
>>>
>>>
>>> Best,
>>>
>>> Xintong
>>>
>>>
>>>
>>> On Wed, Mar 1, 2023 at 11:12 AM Shammon FY <zj...@gmail.com> wrote:
>>>
>>>> Hi
>>>>
>>>> Thanks for starting this work weihua, I think unifying
>>>> DeclarativeSlotManager and FineGrainedSlotManager is valuable.
>>>>
>>>> I agree with @Matthias and @John that we need a way to ensure that
>>>> DeclarativeSlotManager's capabilities are fully covered by
>>>> FineGrainedSlotManager
>>>>
>>>> 1. For their functional differences, can you give some detailed tests to
>>>> verify that the new FineGrainedSlotManager has these capabilities? This
>> can
>>>> effectively verify the new functions
>>>>
>>>> 2. I'm worried that many functions are not independent and it is
>> difficult
>>>> to migrate step-by-step. You can list the relationship between them in
>>>> detail.
>>>>
>>>> 3. As John mentioned, give a smoke test for FineGrainedSlotManager is a
>>>> good idea. Or you can add some test information to the
>>>> DeclarativeSlotManager to determine how many tests have used it. In this
>>>> way, we can gradually construct test cases for FineGrainedSlotManager
>>>> during the development process.
>>>>
>>>>
>>>> Best,
>>>> Shammon
>>>>
>>>>
>>>> On Tue, Feb 28, 2023 at 10:22 PM John Roesler <vv...@apache.org>
>> wrote:
>>>>> Thanks for the FLIP, Weihua!
>>>>>
>>>>> I’ve read the FLIP, and it sounds good to me. We need to avoid
>>>>> proliferating alternative implementations wherever possible. I have
>> just
>>>> a
>>>>> couple of comments:
>>>>>
>>>>> 1. I share Matthias’s concern about ensuring the behavior is really the
>>>>> same. One suggestion I’ve used for this kind of thing is, as a smoke
>>>> test,
>>>>> to update the DeclarativeSlotManager to just delegate to the
>>>>> FineGrainedSlotManager. If the full test suite still passes, you can be
>>>>> pretty sure the new default is really ok. It would not be a good idea
>> to
>>>>> actually keep that in for the release, since it would remove the option
>>>> to
>>>>> fall back in case of bugs. Either way, we need to make sure all test
>>>>> scenarios are present for the FGSM.
>>>>>
>>>>> 4. In addition to changing the default, would it make sense to log a
>>>>> deprecation warning on initialization if the DeclarativeSlotManager is
>>>> used?
>>>>> Thanks again,
>>>>> John
>>>>>
>>>>> On Tue, Feb 28, 2023, at 07:20, Matthias Pohl wrote:
>>>>>> Hi Weihua,
>>>>>> Thanks for your proposal. From a conceptual point: AFAIU, the
>>>>>> DeclarativeSlotManager covers a subset (i.e. only evenly sized slots)
>>>> of
>>>>>> what the FineGrainedSlotManager should be able to achieve (variable
>>>> slot
>>>>>> size per task manager). Is this the right assumption/understanding? In
>>>>> this
>>>>>> sense, merging both implementations into a single one sounds good. A
>>>> few
>>>>>> more general comments, though:
>>>>>>
>>>>>> 1. Did you do a proper test coverage analysis? That's not mentioned in
>>>>> the
>>>>>> current version of the FLIP. I'm bringing this up because we ran into
>>>> the
>>>>>> same issue when fixing the flaws that popped up after introducing the
>>>>>> multi-component leader election (see FLIP-285 [1]). There is a risk
>>>> that
>>>>> by
>>>>>> removing the legacy code we decrease test coverage because certain
>>>>>> test cases that were covered for the legacy classes might not be
>>>>>> necessarily covered in the new implementation, yet (see FLINK-30338
>> [2]
>>>>>> which covers this issue for the leader election case). Ideally, we
>>>> don't
>>>>>> want to remove test cases accidentally because they were only
>>>> implemented
>>>>>> for the DeclarativeSlotManager but missed for the
>>>> FineGrainedSlotManager.
>>>>>> 2. DeclarativeSlotManager and FineGrainedSlotManager feel quite big in
>>>>>> terms of lines of code. Without knowing whether it's actually a
>>>>> reasonable
>>>>>> thing to do: Instead of just adding more features to the
>>>>>> FineGrainedSlotManager, have you thought of cutting out functionality
>>>>> into
>>>>>> smaller sub-components along this refactoring? Such a step-by-step
>>>>> approach
>>>>>> might improve the overall codebase and might make reviewing the
>>>>> refactoring
>>>>>> easier. I did a first pass over the code and struggled to identify
>> code
>>>>>> blocks that could be moved out of the SlotManager implementation(s).
>>>>>> Therefore, I might be wrong with this proposal. I haven't worked on
>>>> this
>>>>>> codebase in that detail that it would allow me to come up with a
>>>>> judgement
>>>>>> call. I wanted to bring it up, anyway, because I'm curious whether
>> that
>>>>>> could be an option. There's a comment created by Chesnay (CC'd) in the
>>>>>> JavaDoc of TaskExecutorManager [3] indicating something similar. I'm
>>>>>> wondering whether he can add some insights here.
>>>>>>
>>>>>> 3. For me personally, having a more detailed summary comparing the
>>>>>> subcomponents of both SlotManager implementations with where
>>>>>> their functionality matches and where they differ might help
>> understand
>>>>> the
>>>>>> consequences of the changes proposed in FLIP-298.
>>>>>>
>>>>>> Best,
>>>>>> Matthias
>>>>>>
>>>>>> [1]
>>>>>>
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-285%3A+Refactoring+LeaderElection+to+make+Flink+support+multi-component+leader+election+out-of-the-box
>>>>>> [2] https://issues.apache.org/jira/browse/FLINK-30338
>>>>>> [3]
>>>>>>
>> https://github.com/apache/flink/blob/f611ea8cb5deddb42429df2c99f0c68d7382e9bd/flink-runtime/src/main/java/org/apache/flink/runtime/resourcemanager/slotmanager/TaskExecutorManager.java#L66-L68
>>>>>> On Tue, Feb 28, 2023 at 6:14 AM Matt Wang <wa...@163.com> wrote:
>>>>>>
>>>>>>> This is a good proposal for me, it will make the code of the
>>>> SlotManager
>>>>>>> more clear.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>>
>>>>>>> Best,
>>>>>>> Matt Wang
>>>>>>>
>>>>>>>
>>>>>>> ---- Replied Message ----
>>>>>>> | From | David Morávek<dm...@apache.org> |
>>>>>>> | Date | 02/27/2023 22:45 |
>>>>>>> | To | <de...@flink.apache.org> |
>>>>>>> | Subject | Re: [DISCUSS] FLIP-298: Unifying the Implementation of
>>>>>>> SlotManager |
>>>>>>> Hi Weihua, I still need to dig into the details, but the overall
>>>>> sentiment
>>>>>>> of this change sounds reasonable.
>>>>>>>
>>>>>>> Best,
>>>>>>> D.
>>>>>>>
>>>>>>> On Mon, Feb 27, 2023 at 2:26 PM Zhanghao Chen <
>>>>> zhanghao.chen@outlook.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>> Thanks for driving this topic. I think this FLIP could help clean up
>>>> the
>>>>>>> codebase to make it easier to maintain. +1 on it.
>>>>>>>
>>>>>>> Best,
>>>>>>> Zhanghao Chen
>>>>>>> ________________________________
>>>>>>> From: Weihua Hu <hu...@gmail.com>
>>>>>>> Sent: Monday, February 27, 2023 20:40
>>>>>>> To: dev <de...@flink.apache.org>
>>>>>>> Subject: [DISCUSS] FLIP-298: Unifying the Implementation of
>>>> SlotManager
>>>>>>> Hi everyone,
>>>>>>>
>>>>>>> I would like to begin a discussion on FLIP-298: Unifying the
>>>>> Implementation
>>>>>>> of SlotManager[1]. There are currently two types of SlotManager in
>>>>> Flink:
>>>>>>> DeclarativeSlotManager and FineGrainedSlotManager.
>>>>> FineGrainedSlotManager
>>>>>>> should behave as DeclarativeSlotManager if the user does not
>> configure
>>>>> the
>>>>>>> slot request profile.
>>>>>>>
>>>>>>> Therefore, this FLIP aims to unify the implementation of SlotManager
>>>> in
>>>>>>> order to reduce maintenance costs.
>>>>>>>
>>>>>>> Looking forward to hearing from you.
>>>>>>>
>>>>>>> [1]
>>>>>>>
>>>>>>>
>>>>>>>
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-298%3A+Unifying+the+Implementation+of+SlotManager
>>>>>>> Best,
>>>>>>> Weihua
>>>>>>>
>>>>>>>
>>


Re: [DISCUSS] FLIP-298: Unifying the Implementation of SlotManager

Posted by Weihua Hu <hu...@gmail.com>.
Thanks Chesnay for your suggestion.

One thing you could potentially look into is integrating the
> TaskExecutorManager into the FIneGrainedSlotManager or find another way
> to move the timeout handling into another component.

After going over the codes. It's hard to make TaskExecutorManager both work
for DSM and FGSM.
But we can move task manager related work from FGSM to TaskManagerTracker,
which already
tracks task managers but not including request/release. After this change,
the TaskManagerTracker
will manage all the behavior of TaskManager:
1. request new task manager
2. tracking pending task manager
3. tracking registered task manager
4. release idle task manager
5. deal with max resources limitation

w.r.t. testing, there's the DeclarativeSlotManagerTest that covers some
> specific issues that should be ported to the fine grained slot manager.

Yes, we will check all the unit tests related to DeclarativeSlotManager
carefully
to make sure there are no tests lost while migrating from DSM to FGSM.

Best,
Weihua


On Mon, Mar 13, 2023 at 6:27 PM Chesnay Schepler <ch...@apache.org> wrote:

> One thing you could potentially look into is integrating the
> TaskExecutorManager into the FIneGrainedSlotManager or find another way
> to move the timeout handling into another component.
>
> w.r.t. testing, there's the DeclarativeSlotManagerTest that covers some
> specific issues that should be ported to the fine grained slot manager.
>
> On 01/03/2023 07:17, Xintong Song wrote:
> > Thanks Weihua for preparing this FLIP. +1 for the proposal.
> >
> >
> > As one of the contributors of the fine-grained slot manager, I'd like to
> > share some backgrounds here.
> >
> > - There used to be a defaut slot manager implementation, which is
> > non-declarative and has been removed now. The two features, declarative /
> > reactive resource management and fine-grained resource management, were
> > proposed at about the same time. We were aware that by design the
> > declarative slot manager is a subset of fine-grained slot manager at that
> > time, but still decided to implement two slot managers for the purpose of
> > decoupling efforts and reducing cross-team synchronization overhead.
> > Merging the two slot managers once they are proved stable is IMO a
> > technical debt that was planned at the very beginning.
> >
> > - The FineGrainedSlotManager has been verified in Alibaba's internal
> > production as well as Alibaba Cloud services as the default slot manager
> > for about 2 years.
> >
> >
> > Concerning test cases, we currently have a ci stage for fine grained
> > resource management. To avoid adding too much burden, the stage only
> > includes tests from flink-runtime and flink-test modules. I think
> switching
> > the default slot manager and applying the whole set of tests on the
> > fine-grained slot manager would help us to be more confident about it.
> >
> >
> > Concerning cutting out functionalities out of slot manager, I think
> Yangze
> > and I have tried our best to shape the FineGrainedSlotManager into
> > reasonable components. I personally don't have other ideas to further
> > disassemble the component, but I'm open to such suggestions. However,
> from
> > the stability perspective, I'd be in favor of not introducing significant
> > changes to the FineGrainedSlotManager while switching it to the default.
> > Because the current implementation has already been verified (or at least
> > partially verified because Alibaba does not cover all the Flink use
> cases),
> > and introducing more changes also means more chances of breaking things.
> >
> >
> > Best,
> >
> > Xintong
> >
> >
> >
> > On Wed, Mar 1, 2023 at 11:12 AM Shammon FY <zj...@gmail.com> wrote:
> >
> >> Hi
> >>
> >> Thanks for starting this work weihua, I think unifying
> >> DeclarativeSlotManager and FineGrainedSlotManager is valuable.
> >>
> >> I agree with @Matthias and @John that we need a way to ensure that
> >> DeclarativeSlotManager's capabilities are fully covered by
> >> FineGrainedSlotManager
> >>
> >> 1. For their functional differences, can you give some detailed tests to
> >> verify that the new FineGrainedSlotManager has these capabilities? This
> can
> >> effectively verify the new functions
> >>
> >> 2. I'm worried that many functions are not independent and it is
> difficult
> >> to migrate step-by-step. You can list the relationship between them in
> >> detail.
> >>
> >> 3. As John mentioned, give a smoke test for FineGrainedSlotManager is a
> >> good idea. Or you can add some test information to the
> >> DeclarativeSlotManager to determine how many tests have used it. In this
> >> way, we can gradually construct test cases for FineGrainedSlotManager
> >> during the development process.
> >>
> >>
> >> Best,
> >> Shammon
> >>
> >>
> >> On Tue, Feb 28, 2023 at 10:22 PM John Roesler <vv...@apache.org>
> wrote:
> >>
> >>> Thanks for the FLIP, Weihua!
> >>>
> >>> I’ve read the FLIP, and it sounds good to me. We need to avoid
> >>> proliferating alternative implementations wherever possible. I have
> just
> >> a
> >>> couple of comments:
> >>>
> >>> 1. I share Matthias’s concern about ensuring the behavior is really the
> >>> same. One suggestion I’ve used for this kind of thing is, as a smoke
> >> test,
> >>> to update the DeclarativeSlotManager to just delegate to the
> >>> FineGrainedSlotManager. If the full test suite still passes, you can be
> >>> pretty sure the new default is really ok. It would not be a good idea
> to
> >>> actually keep that in for the release, since it would remove the option
> >> to
> >>> fall back in case of bugs. Either way, we need to make sure all test
> >>> scenarios are present for the FGSM.
> >>>
> >>> 4. In addition to changing the default, would it make sense to log a
> >>> deprecation warning on initialization if the DeclarativeSlotManager is
> >> used?
> >>> Thanks again,
> >>> John
> >>>
> >>> On Tue, Feb 28, 2023, at 07:20, Matthias Pohl wrote:
> >>>> Hi Weihua,
> >>>> Thanks for your proposal. From a conceptual point: AFAIU, the
> >>>> DeclarativeSlotManager covers a subset (i.e. only evenly sized slots)
> >> of
> >>>> what the FineGrainedSlotManager should be able to achieve (variable
> >> slot
> >>>> size per task manager). Is this the right assumption/understanding? In
> >>> this
> >>>> sense, merging both implementations into a single one sounds good. A
> >> few
> >>>> more general comments, though:
> >>>>
> >>>> 1. Did you do a proper test coverage analysis? That's not mentioned in
> >>> the
> >>>> current version of the FLIP. I'm bringing this up because we ran into
> >> the
> >>>> same issue when fixing the flaws that popped up after introducing the
> >>>> multi-component leader election (see FLIP-285 [1]). There is a risk
> >> that
> >>> by
> >>>> removing the legacy code we decrease test coverage because certain
> >>>> test cases that were covered for the legacy classes might not be
> >>>> necessarily covered in the new implementation, yet (see FLINK-30338
> [2]
> >>>> which covers this issue for the leader election case). Ideally, we
> >> don't
> >>>> want to remove test cases accidentally because they were only
> >> implemented
> >>>> for the DeclarativeSlotManager but missed for the
> >> FineGrainedSlotManager.
> >>>> 2. DeclarativeSlotManager and FineGrainedSlotManager feel quite big in
> >>>> terms of lines of code. Without knowing whether it's actually a
> >>> reasonable
> >>>> thing to do: Instead of just adding more features to the
> >>>> FineGrainedSlotManager, have you thought of cutting out functionality
> >>> into
> >>>> smaller sub-components along this refactoring? Such a step-by-step
> >>> approach
> >>>> might improve the overall codebase and might make reviewing the
> >>> refactoring
> >>>> easier. I did a first pass over the code and struggled to identify
> code
> >>>> blocks that could be moved out of the SlotManager implementation(s).
> >>>> Therefore, I might be wrong with this proposal. I haven't worked on
> >> this
> >>>> codebase in that detail that it would allow me to come up with a
> >>> judgement
> >>>> call. I wanted to bring it up, anyway, because I'm curious whether
> that
> >>>> could be an option. There's a comment created by Chesnay (CC'd) in the
> >>>> JavaDoc of TaskExecutorManager [3] indicating something similar. I'm
> >>>> wondering whether he can add some insights here.
> >>>>
> >>>> 3. For me personally, having a more detailed summary comparing the
> >>>> subcomponents of both SlotManager implementations with where
> >>>> their functionality matches and where they differ might help
> understand
> >>> the
> >>>> consequences of the changes proposed in FLIP-298.
> >>>>
> >>>> Best,
> >>>> Matthias
> >>>>
> >>>> [1]
> >>>>
> >>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-285%3A+Refactoring+LeaderElection+to+make+Flink+support+multi-component+leader+election+out-of-the-box
> >>>> [2] https://issues.apache.org/jira/browse/FLINK-30338
> >>>> [3]
> >>>>
> >>
> https://github.com/apache/flink/blob/f611ea8cb5deddb42429df2c99f0c68d7382e9bd/flink-runtime/src/main/java/org/apache/flink/runtime/resourcemanager/slotmanager/TaskExecutorManager.java#L66-L68
> >>>> On Tue, Feb 28, 2023 at 6:14 AM Matt Wang <wa...@163.com> wrote:
> >>>>
> >>>>> This is a good proposal for me, it will make the code of the
> >> SlotManager
> >>>>> more clear.
> >>>>>
> >>>>>
> >>>>>
> >>>>> --
> >>>>>
> >>>>> Best,
> >>>>> Matt Wang
> >>>>>
> >>>>>
> >>>>> ---- Replied Message ----
> >>>>> | From | David Morávek<dm...@apache.org> |
> >>>>> | Date | 02/27/2023 22:45 |
> >>>>> | To | <de...@flink.apache.org> |
> >>>>> | Subject | Re: [DISCUSS] FLIP-298: Unifying the Implementation of
> >>>>> SlotManager |
> >>>>> Hi Weihua, I still need to dig into the details, but the overall
> >>> sentiment
> >>>>> of this change sounds reasonable.
> >>>>>
> >>>>> Best,
> >>>>> D.
> >>>>>
> >>>>> On Mon, Feb 27, 2023 at 2:26 PM Zhanghao Chen <
> >>> zhanghao.chen@outlook.com>
> >>>>> wrote:
> >>>>>
> >>>>> Thanks for driving this topic. I think this FLIP could help clean up
> >> the
> >>>>> codebase to make it easier to maintain. +1 on it.
> >>>>>
> >>>>> Best,
> >>>>> Zhanghao Chen
> >>>>> ________________________________
> >>>>> From: Weihua Hu <hu...@gmail.com>
> >>>>> Sent: Monday, February 27, 2023 20:40
> >>>>> To: dev <de...@flink.apache.org>
> >>>>> Subject: [DISCUSS] FLIP-298: Unifying the Implementation of
> >> SlotManager
> >>>>> Hi everyone,
> >>>>>
> >>>>> I would like to begin a discussion on FLIP-298: Unifying the
> >>> Implementation
> >>>>> of SlotManager[1]. There are currently two types of SlotManager in
> >>> Flink:
> >>>>> DeclarativeSlotManager and FineGrainedSlotManager.
> >>> FineGrainedSlotManager
> >>>>> should behave as DeclarativeSlotManager if the user does not
> configure
> >>> the
> >>>>> slot request profile.
> >>>>>
> >>>>> Therefore, this FLIP aims to unify the implementation of SlotManager
> >> in
> >>>>> order to reduce maintenance costs.
> >>>>>
> >>>>> Looking forward to hearing from you.
> >>>>>
> >>>>> [1]
> >>>>>
> >>>>>
> >>>>>
> >>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-298%3A+Unifying+the+Implementation+of+SlotManager
> >>>>> Best,
> >>>>> Weihua
> >>>>>
> >>>>>
>
>

Re: [DISCUSS] FLIP-298: Unifying the Implementation of SlotManager

Posted by Chesnay Schepler <ch...@apache.org>.
One thing you could potentially look into is integrating the 
TaskExecutorManager into the FIneGrainedSlotManager or find another way 
to move the timeout handling into another component.

w.r.t. testing, there's the DeclarativeSlotManagerTest that covers some 
specific issues that should be ported to the fine grained slot manager.

On 01/03/2023 07:17, Xintong Song wrote:
> Thanks Weihua for preparing this FLIP. +1 for the proposal.
>
>
> As one of the contributors of the fine-grained slot manager, I'd like to
> share some backgrounds here.
>
> - There used to be a defaut slot manager implementation, which is
> non-declarative and has been removed now. The two features, declarative /
> reactive resource management and fine-grained resource management, were
> proposed at about the same time. We were aware that by design the
> declarative slot manager is a subset of fine-grained slot manager at that
> time, but still decided to implement two slot managers for the purpose of
> decoupling efforts and reducing cross-team synchronization overhead.
> Merging the two slot managers once they are proved stable is IMO a
> technical debt that was planned at the very beginning.
>
> - The FineGrainedSlotManager has been verified in Alibaba's internal
> production as well as Alibaba Cloud services as the default slot manager
> for about 2 years.
>
>
> Concerning test cases, we currently have a ci stage for fine grained
> resource management. To avoid adding too much burden, the stage only
> includes tests from flink-runtime and flink-test modules. I think switching
> the default slot manager and applying the whole set of tests on the
> fine-grained slot manager would help us to be more confident about it.
>
>
> Concerning cutting out functionalities out of slot manager, I think Yangze
> and I have tried our best to shape the FineGrainedSlotManager into
> reasonable components. I personally don't have other ideas to further
> disassemble the component, but I'm open to such suggestions. However, from
> the stability perspective, I'd be in favor of not introducing significant
> changes to the FineGrainedSlotManager while switching it to the default.
> Because the current implementation has already been verified (or at least
> partially verified because Alibaba does not cover all the Flink use cases),
> and introducing more changes also means more chances of breaking things.
>
>
> Best,
>
> Xintong
>
>
>
> On Wed, Mar 1, 2023 at 11:12 AM Shammon FY <zj...@gmail.com> wrote:
>
>> Hi
>>
>> Thanks for starting this work weihua, I think unifying
>> DeclarativeSlotManager and FineGrainedSlotManager is valuable.
>>
>> I agree with @Matthias and @John that we need a way to ensure that
>> DeclarativeSlotManager's capabilities are fully covered by
>> FineGrainedSlotManager
>>
>> 1. For their functional differences, can you give some detailed tests to
>> verify that the new FineGrainedSlotManager has these capabilities? This can
>> effectively verify the new functions
>>
>> 2. I'm worried that many functions are not independent and it is difficult
>> to migrate step-by-step. You can list the relationship between them in
>> detail.
>>
>> 3. As John mentioned, give a smoke test for FineGrainedSlotManager is a
>> good idea. Or you can add some test information to the
>> DeclarativeSlotManager to determine how many tests have used it. In this
>> way, we can gradually construct test cases for FineGrainedSlotManager
>> during the development process.
>>
>>
>> Best,
>> Shammon
>>
>>
>> On Tue, Feb 28, 2023 at 10:22 PM John Roesler <vv...@apache.org> wrote:
>>
>>> Thanks for the FLIP, Weihua!
>>>
>>> I’ve read the FLIP, and it sounds good to me. We need to avoid
>>> proliferating alternative implementations wherever possible. I have just
>> a
>>> couple of comments:
>>>
>>> 1. I share Matthias’s concern about ensuring the behavior is really the
>>> same. One suggestion I’ve used for this kind of thing is, as a smoke
>> test,
>>> to update the DeclarativeSlotManager to just delegate to the
>>> FineGrainedSlotManager. If the full test suite still passes, you can be
>>> pretty sure the new default is really ok. It would not be a good idea to
>>> actually keep that in for the release, since it would remove the option
>> to
>>> fall back in case of bugs. Either way, we need to make sure all test
>>> scenarios are present for the FGSM.
>>>
>>> 4. In addition to changing the default, would it make sense to log a
>>> deprecation warning on initialization if the DeclarativeSlotManager is
>> used?
>>> Thanks again,
>>> John
>>>
>>> On Tue, Feb 28, 2023, at 07:20, Matthias Pohl wrote:
>>>> Hi Weihua,
>>>> Thanks for your proposal. From a conceptual point: AFAIU, the
>>>> DeclarativeSlotManager covers a subset (i.e. only evenly sized slots)
>> of
>>>> what the FineGrainedSlotManager should be able to achieve (variable
>> slot
>>>> size per task manager). Is this the right assumption/understanding? In
>>> this
>>>> sense, merging both implementations into a single one sounds good. A
>> few
>>>> more general comments, though:
>>>>
>>>> 1. Did you do a proper test coverage analysis? That's not mentioned in
>>> the
>>>> current version of the FLIP. I'm bringing this up because we ran into
>> the
>>>> same issue when fixing the flaws that popped up after introducing the
>>>> multi-component leader election (see FLIP-285 [1]). There is a risk
>> that
>>> by
>>>> removing the legacy code we decrease test coverage because certain
>>>> test cases that were covered for the legacy classes might not be
>>>> necessarily covered in the new implementation, yet (see FLINK-30338 [2]
>>>> which covers this issue for the leader election case). Ideally, we
>> don't
>>>> want to remove test cases accidentally because they were only
>> implemented
>>>> for the DeclarativeSlotManager but missed for the
>> FineGrainedSlotManager.
>>>> 2. DeclarativeSlotManager and FineGrainedSlotManager feel quite big in
>>>> terms of lines of code. Without knowing whether it's actually a
>>> reasonable
>>>> thing to do: Instead of just adding more features to the
>>>> FineGrainedSlotManager, have you thought of cutting out functionality
>>> into
>>>> smaller sub-components along this refactoring? Such a step-by-step
>>> approach
>>>> might improve the overall codebase and might make reviewing the
>>> refactoring
>>>> easier. I did a first pass over the code and struggled to identify code
>>>> blocks that could be moved out of the SlotManager implementation(s).
>>>> Therefore, I might be wrong with this proposal. I haven't worked on
>> this
>>>> codebase in that detail that it would allow me to come up with a
>>> judgement
>>>> call. I wanted to bring it up, anyway, because I'm curious whether that
>>>> could be an option. There's a comment created by Chesnay (CC'd) in the
>>>> JavaDoc of TaskExecutorManager [3] indicating something similar. I'm
>>>> wondering whether he can add some insights here.
>>>>
>>>> 3. For me personally, having a more detailed summary comparing the
>>>> subcomponents of both SlotManager implementations with where
>>>> their functionality matches and where they differ might help understand
>>> the
>>>> consequences of the changes proposed in FLIP-298.
>>>>
>>>> Best,
>>>> Matthias
>>>>
>>>> [1]
>>>>
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-285%3A+Refactoring+LeaderElection+to+make+Flink+support+multi-component+leader+election+out-of-the-box
>>>> [2] https://issues.apache.org/jira/browse/FLINK-30338
>>>> [3]
>>>>
>> https://github.com/apache/flink/blob/f611ea8cb5deddb42429df2c99f0c68d7382e9bd/flink-runtime/src/main/java/org/apache/flink/runtime/resourcemanager/slotmanager/TaskExecutorManager.java#L66-L68
>>>> On Tue, Feb 28, 2023 at 6:14 AM Matt Wang <wa...@163.com> wrote:
>>>>
>>>>> This is a good proposal for me, it will make the code of the
>> SlotManager
>>>>> more clear.
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>>
>>>>> Best,
>>>>> Matt Wang
>>>>>
>>>>>
>>>>> ---- Replied Message ----
>>>>> | From | David Morávek<dm...@apache.org> |
>>>>> | Date | 02/27/2023 22:45 |
>>>>> | To | <de...@flink.apache.org> |
>>>>> | Subject | Re: [DISCUSS] FLIP-298: Unifying the Implementation of
>>>>> SlotManager |
>>>>> Hi Weihua, I still need to dig into the details, but the overall
>>> sentiment
>>>>> of this change sounds reasonable.
>>>>>
>>>>> Best,
>>>>> D.
>>>>>
>>>>> On Mon, Feb 27, 2023 at 2:26 PM Zhanghao Chen <
>>> zhanghao.chen@outlook.com>
>>>>> wrote:
>>>>>
>>>>> Thanks for driving this topic. I think this FLIP could help clean up
>> the
>>>>> codebase to make it easier to maintain. +1 on it.
>>>>>
>>>>> Best,
>>>>> Zhanghao Chen
>>>>> ________________________________
>>>>> From: Weihua Hu <hu...@gmail.com>
>>>>> Sent: Monday, February 27, 2023 20:40
>>>>> To: dev <de...@flink.apache.org>
>>>>> Subject: [DISCUSS] FLIP-298: Unifying the Implementation of
>> SlotManager
>>>>> Hi everyone,
>>>>>
>>>>> I would like to begin a discussion on FLIP-298: Unifying the
>>> Implementation
>>>>> of SlotManager[1]. There are currently two types of SlotManager in
>>> Flink:
>>>>> DeclarativeSlotManager and FineGrainedSlotManager.
>>> FineGrainedSlotManager
>>>>> should behave as DeclarativeSlotManager if the user does not configure
>>> the
>>>>> slot request profile.
>>>>>
>>>>> Therefore, this FLIP aims to unify the implementation of SlotManager
>> in
>>>>> order to reduce maintenance costs.
>>>>>
>>>>> Looking forward to hearing from you.
>>>>>
>>>>> [1]
>>>>>
>>>>>
>>>>>
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-298%3A+Unifying+the+Implementation+of+SlotManager
>>>>> Best,
>>>>> Weihua
>>>>>
>>>>>


Re: [DISCUSS] FLIP-298: Unifying the Implementation of SlotManager

Posted by Xintong Song <to...@gmail.com>.
Thanks Weihua for preparing this FLIP. +1 for the proposal.


As one of the contributors of the fine-grained slot manager, I'd like to
share some backgrounds here.

- There used to be a defaut slot manager implementation, which is
non-declarative and has been removed now. The two features, declarative /
reactive resource management and fine-grained resource management, were
proposed at about the same time. We were aware that by design the
declarative slot manager is a subset of fine-grained slot manager at that
time, but still decided to implement two slot managers for the purpose of
decoupling efforts and reducing cross-team synchronization overhead.
Merging the two slot managers once they are proved stable is IMO a
technical debt that was planned at the very beginning.

- The FineGrainedSlotManager has been verified in Alibaba's internal
production as well as Alibaba Cloud services as the default slot manager
for about 2 years.


Concerning test cases, we currently have a ci stage for fine grained
resource management. To avoid adding too much burden, the stage only
includes tests from flink-runtime and flink-test modules. I think switching
the default slot manager and applying the whole set of tests on the
fine-grained slot manager would help us to be more confident about it.


Concerning cutting out functionalities out of slot manager, I think Yangze
and I have tried our best to shape the FineGrainedSlotManager into
reasonable components. I personally don't have other ideas to further
disassemble the component, but I'm open to such suggestions. However, from
the stability perspective, I'd be in favor of not introducing significant
changes to the FineGrainedSlotManager while switching it to the default.
Because the current implementation has already been verified (or at least
partially verified because Alibaba does not cover all the Flink use cases),
and introducing more changes also means more chances of breaking things.


Best,

Xintong



On Wed, Mar 1, 2023 at 11:12 AM Shammon FY <zj...@gmail.com> wrote:

> Hi
>
> Thanks for starting this work weihua, I think unifying
> DeclarativeSlotManager and FineGrainedSlotManager is valuable.
>
> I agree with @Matthias and @John that we need a way to ensure that
> DeclarativeSlotManager's capabilities are fully covered by
> FineGrainedSlotManager
>
> 1. For their functional differences, can you give some detailed tests to
> verify that the new FineGrainedSlotManager has these capabilities? This can
> effectively verify the new functions
>
> 2. I'm worried that many functions are not independent and it is difficult
> to migrate step-by-step. You can list the relationship between them in
> detail.
>
> 3. As John mentioned, give a smoke test for FineGrainedSlotManager is a
> good idea. Or you can add some test information to the
> DeclarativeSlotManager to determine how many tests have used it. In this
> way, we can gradually construct test cases for FineGrainedSlotManager
> during the development process.
>
>
> Best,
> Shammon
>
>
> On Tue, Feb 28, 2023 at 10:22 PM John Roesler <vv...@apache.org> wrote:
>
> > Thanks for the FLIP, Weihua!
> >
> > I’ve read the FLIP, and it sounds good to me. We need to avoid
> > proliferating alternative implementations wherever possible. I have just
> a
> > couple of comments:
> >
> > 1. I share Matthias’s concern about ensuring the behavior is really the
> > same. One suggestion I’ve used for this kind of thing is, as a smoke
> test,
> > to update the DeclarativeSlotManager to just delegate to the
> > FineGrainedSlotManager. If the full test suite still passes, you can be
> > pretty sure the new default is really ok. It would not be a good idea to
> > actually keep that in for the release, since it would remove the option
> to
> > fall back in case of bugs. Either way, we need to make sure all test
> > scenarios are present for the FGSM.
> >
> > 4. In addition to changing the default, would it make sense to log a
> > deprecation warning on initialization if the DeclarativeSlotManager is
> used?
> >
> > Thanks again,
> > John
> >
> > On Tue, Feb 28, 2023, at 07:20, Matthias Pohl wrote:
> > > Hi Weihua,
> > > Thanks for your proposal. From a conceptual point: AFAIU, the
> > > DeclarativeSlotManager covers a subset (i.e. only evenly sized slots)
> of
> > > what the FineGrainedSlotManager should be able to achieve (variable
> slot
> > > size per task manager). Is this the right assumption/understanding? In
> > this
> > > sense, merging both implementations into a single one sounds good. A
> few
> > > more general comments, though:
> > >
> > > 1. Did you do a proper test coverage analysis? That's not mentioned in
> > the
> > > current version of the FLIP. I'm bringing this up because we ran into
> the
> > > same issue when fixing the flaws that popped up after introducing the
> > > multi-component leader election (see FLIP-285 [1]). There is a risk
> that
> > by
> > > removing the legacy code we decrease test coverage because certain
> > > test cases that were covered for the legacy classes might not be
> > > necessarily covered in the new implementation, yet (see FLINK-30338 [2]
> > > which covers this issue for the leader election case). Ideally, we
> don't
> > > want to remove test cases accidentally because they were only
> implemented
> > > for the DeclarativeSlotManager but missed for the
> FineGrainedSlotManager.
> > >
> > > 2. DeclarativeSlotManager and FineGrainedSlotManager feel quite big in
> > > terms of lines of code. Without knowing whether it's actually a
> > reasonable
> > > thing to do: Instead of just adding more features to the
> > > FineGrainedSlotManager, have you thought of cutting out functionality
> > into
> > > smaller sub-components along this refactoring? Such a step-by-step
> > approach
> > > might improve the overall codebase and might make reviewing the
> > refactoring
> > > easier. I did a first pass over the code and struggled to identify code
> > > blocks that could be moved out of the SlotManager implementation(s).
> > > Therefore, I might be wrong with this proposal. I haven't worked on
> this
> > > codebase in that detail that it would allow me to come up with a
> > judgement
> > > call. I wanted to bring it up, anyway, because I'm curious whether that
> > > could be an option. There's a comment created by Chesnay (CC'd) in the
> > > JavaDoc of TaskExecutorManager [3] indicating something similar. I'm
> > > wondering whether he can add some insights here.
> > >
> > > 3. For me personally, having a more detailed summary comparing the
> > > subcomponents of both SlotManager implementations with where
> > > their functionality matches and where they differ might help understand
> > the
> > > consequences of the changes proposed in FLIP-298.
> > >
> > > Best,
> > > Matthias
> > >
> > > [1]
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-285%3A+Refactoring+LeaderElection+to+make+Flink+support+multi-component+leader+election+out-of-the-box
> > > [2] https://issues.apache.org/jira/browse/FLINK-30338
> > > [3]
> > >
> >
> https://github.com/apache/flink/blob/f611ea8cb5deddb42429df2c99f0c68d7382e9bd/flink-runtime/src/main/java/org/apache/flink/runtime/resourcemanager/slotmanager/TaskExecutorManager.java#L66-L68
> > >
> > > On Tue, Feb 28, 2023 at 6:14 AM Matt Wang <wa...@163.com> wrote:
> > >
> > >> This is a good proposal for me, it will make the code of the
> SlotManager
> > >> more clear.
> > >>
> > >>
> > >>
> > >> --
> > >>
> > >> Best,
> > >> Matt Wang
> > >>
> > >>
> > >> ---- Replied Message ----
> > >> | From | David Morávek<dm...@apache.org> |
> > >> | Date | 02/27/2023 22:45 |
> > >> | To | <de...@flink.apache.org> |
> > >> | Subject | Re: [DISCUSS] FLIP-298: Unifying the Implementation of
> > >> SlotManager |
> > >> Hi Weihua, I still need to dig into the details, but the overall
> > sentiment
> > >> of this change sounds reasonable.
> > >>
> > >> Best,
> > >> D.
> > >>
> > >> On Mon, Feb 27, 2023 at 2:26 PM Zhanghao Chen <
> > zhanghao.chen@outlook.com>
> > >> wrote:
> > >>
> > >> Thanks for driving this topic. I think this FLIP could help clean up
> the
> > >> codebase to make it easier to maintain. +1 on it.
> > >>
> > >> Best,
> > >> Zhanghao Chen
> > >> ________________________________
> > >> From: Weihua Hu <hu...@gmail.com>
> > >> Sent: Monday, February 27, 2023 20:40
> > >> To: dev <de...@flink.apache.org>
> > >> Subject: [DISCUSS] FLIP-298: Unifying the Implementation of
> SlotManager
> > >>
> > >> Hi everyone,
> > >>
> > >> I would like to begin a discussion on FLIP-298: Unifying the
> > Implementation
> > >> of SlotManager[1]. There are currently two types of SlotManager in
> > Flink:
> > >> DeclarativeSlotManager and FineGrainedSlotManager.
> > FineGrainedSlotManager
> > >> should behave as DeclarativeSlotManager if the user does not configure
> > the
> > >> slot request profile.
> > >>
> > >> Therefore, this FLIP aims to unify the implementation of SlotManager
> in
> > >> order to reduce maintenance costs.
> > >>
> > >> Looking forward to hearing from you.
> > >>
> > >> [1]
> > >>
> > >>
> > >>
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-298%3A+Unifying+the+Implementation+of+SlotManager
> > >>
> > >> Best,
> > >> Weihua
> > >>
> > >>
> >
>

Re: [DISCUSS] FLIP-298: Unifying the Implementation of SlotManager

Posted by Shammon FY <zj...@gmail.com>.
Hi

Thanks for starting this work weihua, I think unifying
DeclarativeSlotManager and FineGrainedSlotManager is valuable.

I agree with @Matthias and @John that we need a way to ensure that
DeclarativeSlotManager's capabilities are fully covered by
FineGrainedSlotManager

1. For their functional differences, can you give some detailed tests to
verify that the new FineGrainedSlotManager has these capabilities? This can
effectively verify the new functions

2. I'm worried that many functions are not independent and it is difficult
to migrate step-by-step. You can list the relationship between them in
detail.

3. As John mentioned, give a smoke test for FineGrainedSlotManager is a
good idea. Or you can add some test information to the
DeclarativeSlotManager to determine how many tests have used it. In this
way, we can gradually construct test cases for FineGrainedSlotManager
during the development process.


Best,
Shammon


On Tue, Feb 28, 2023 at 10:22 PM John Roesler <vv...@apache.org> wrote:

> Thanks for the FLIP, Weihua!
>
> I’ve read the FLIP, and it sounds good to me. We need to avoid
> proliferating alternative implementations wherever possible. I have just a
> couple of comments:
>
> 1. I share Matthias’s concern about ensuring the behavior is really the
> same. One suggestion I’ve used for this kind of thing is, as a smoke test,
> to update the DeclarativeSlotManager to just delegate to the
> FineGrainedSlotManager. If the full test suite still passes, you can be
> pretty sure the new default is really ok. It would not be a good idea to
> actually keep that in for the release, since it would remove the option to
> fall back in case of bugs. Either way, we need to make sure all test
> scenarios are present for the FGSM.
>
> 4. In addition to changing the default, would it make sense to log a
> deprecation warning on initialization if the DeclarativeSlotManager is used?
>
> Thanks again,
> John
>
> On Tue, Feb 28, 2023, at 07:20, Matthias Pohl wrote:
> > Hi Weihua,
> > Thanks for your proposal. From a conceptual point: AFAIU, the
> > DeclarativeSlotManager covers a subset (i.e. only evenly sized slots) of
> > what the FineGrainedSlotManager should be able to achieve (variable slot
> > size per task manager). Is this the right assumption/understanding? In
> this
> > sense, merging both implementations into a single one sounds good. A few
> > more general comments, though:
> >
> > 1. Did you do a proper test coverage analysis? That's not mentioned in
> the
> > current version of the FLIP. I'm bringing this up because we ran into the
> > same issue when fixing the flaws that popped up after introducing the
> > multi-component leader election (see FLIP-285 [1]). There is a risk that
> by
> > removing the legacy code we decrease test coverage because certain
> > test cases that were covered for the legacy classes might not be
> > necessarily covered in the new implementation, yet (see FLINK-30338 [2]
> > which covers this issue for the leader election case). Ideally, we don't
> > want to remove test cases accidentally because they were only implemented
> > for the DeclarativeSlotManager but missed for the FineGrainedSlotManager.
> >
> > 2. DeclarativeSlotManager and FineGrainedSlotManager feel quite big in
> > terms of lines of code. Without knowing whether it's actually a
> reasonable
> > thing to do: Instead of just adding more features to the
> > FineGrainedSlotManager, have you thought of cutting out functionality
> into
> > smaller sub-components along this refactoring? Such a step-by-step
> approach
> > might improve the overall codebase and might make reviewing the
> refactoring
> > easier. I did a first pass over the code and struggled to identify code
> > blocks that could be moved out of the SlotManager implementation(s).
> > Therefore, I might be wrong with this proposal. I haven't worked on this
> > codebase in that detail that it would allow me to come up with a
> judgement
> > call. I wanted to bring it up, anyway, because I'm curious whether that
> > could be an option. There's a comment created by Chesnay (CC'd) in the
> > JavaDoc of TaskExecutorManager [3] indicating something similar. I'm
> > wondering whether he can add some insights here.
> >
> > 3. For me personally, having a more detailed summary comparing the
> > subcomponents of both SlotManager implementations with where
> > their functionality matches and where they differ might help understand
> the
> > consequences of the changes proposed in FLIP-298.
> >
> > Best,
> > Matthias
> >
> > [1]
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-285%3A+Refactoring+LeaderElection+to+make+Flink+support+multi-component+leader+election+out-of-the-box
> > [2] https://issues.apache.org/jira/browse/FLINK-30338
> > [3]
> >
> https://github.com/apache/flink/blob/f611ea8cb5deddb42429df2c99f0c68d7382e9bd/flink-runtime/src/main/java/org/apache/flink/runtime/resourcemanager/slotmanager/TaskExecutorManager.java#L66-L68
> >
> > On Tue, Feb 28, 2023 at 6:14 AM Matt Wang <wa...@163.com> wrote:
> >
> >> This is a good proposal for me, it will make the code of the SlotManager
> >> more clear.
> >>
> >>
> >>
> >> --
> >>
> >> Best,
> >> Matt Wang
> >>
> >>
> >> ---- Replied Message ----
> >> | From | David Morávek<dm...@apache.org> |
> >> | Date | 02/27/2023 22:45 |
> >> | To | <de...@flink.apache.org> |
> >> | Subject | Re: [DISCUSS] FLIP-298: Unifying the Implementation of
> >> SlotManager |
> >> Hi Weihua, I still need to dig into the details, but the overall
> sentiment
> >> of this change sounds reasonable.
> >>
> >> Best,
> >> D.
> >>
> >> On Mon, Feb 27, 2023 at 2:26 PM Zhanghao Chen <
> zhanghao.chen@outlook.com>
> >> wrote:
> >>
> >> Thanks for driving this topic. I think this FLIP could help clean up the
> >> codebase to make it easier to maintain. +1 on it.
> >>
> >> Best,
> >> Zhanghao Chen
> >> ________________________________
> >> From: Weihua Hu <hu...@gmail.com>
> >> Sent: Monday, February 27, 2023 20:40
> >> To: dev <de...@flink.apache.org>
> >> Subject: [DISCUSS] FLIP-298: Unifying the Implementation of SlotManager
> >>
> >> Hi everyone,
> >>
> >> I would like to begin a discussion on FLIP-298: Unifying the
> Implementation
> >> of SlotManager[1]. There are currently two types of SlotManager in
> Flink:
> >> DeclarativeSlotManager and FineGrainedSlotManager.
> FineGrainedSlotManager
> >> should behave as DeclarativeSlotManager if the user does not configure
> the
> >> slot request profile.
> >>
> >> Therefore, this FLIP aims to unify the implementation of SlotManager in
> >> order to reduce maintenance costs.
> >>
> >> Looking forward to hearing from you.
> >>
> >> [1]
> >>
> >>
> >>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-298%3A+Unifying+the+Implementation+of+SlotManager
> >>
> >> Best,
> >> Weihua
> >>
> >>
>

Re: [DISCUSS] FLIP-298: Unifying the Implementation of SlotManager

Posted by John Roesler <vv...@apache.org>.
Thanks for the FLIP, Weihua!

I’ve read the FLIP, and it sounds good to me. We need to avoid proliferating alternative implementations wherever possible. I have just a couple of comments:

1. I share Matthias’s concern about ensuring the behavior is really the same. One suggestion I’ve used for this kind of thing is, as a smoke test, to update the DeclarativeSlotManager to just delegate to the FineGrainedSlotManager. If the full test suite still passes, you can be pretty sure the new default is really ok. It would not be a good idea to actually keep that in for the release, since it would remove the option to fall back in case of bugs. Either way, we need to make sure all test scenarios are present for the FGSM. 

4. In addition to changing the default, would it make sense to log a deprecation warning on initialization if the DeclarativeSlotManager is used?

Thanks again,
John

On Tue, Feb 28, 2023, at 07:20, Matthias Pohl wrote:
> Hi Weihua,
> Thanks for your proposal. From a conceptual point: AFAIU, the
> DeclarativeSlotManager covers a subset (i.e. only evenly sized slots) of
> what the FineGrainedSlotManager should be able to achieve (variable slot
> size per task manager). Is this the right assumption/understanding? In this
> sense, merging both implementations into a single one sounds good. A few
> more general comments, though:
>
> 1. Did you do a proper test coverage analysis? That's not mentioned in the
> current version of the FLIP. I'm bringing this up because we ran into the
> same issue when fixing the flaws that popped up after introducing the
> multi-component leader election (see FLIP-285 [1]). There is a risk that by
> removing the legacy code we decrease test coverage because certain
> test cases that were covered for the legacy classes might not be
> necessarily covered in the new implementation, yet (see FLINK-30338 [2]
> which covers this issue for the leader election case). Ideally, we don't
> want to remove test cases accidentally because they were only implemented
> for the DeclarativeSlotManager but missed for the FineGrainedSlotManager.
>
> 2. DeclarativeSlotManager and FineGrainedSlotManager feel quite big in
> terms of lines of code. Without knowing whether it's actually a reasonable
> thing to do: Instead of just adding more features to the
> FineGrainedSlotManager, have you thought of cutting out functionality into
> smaller sub-components along this refactoring? Such a step-by-step approach
> might improve the overall codebase and might make reviewing the refactoring
> easier. I did a first pass over the code and struggled to identify code
> blocks that could be moved out of the SlotManager implementation(s).
> Therefore, I might be wrong with this proposal. I haven't worked on this
> codebase in that detail that it would allow me to come up with a judgement
> call. I wanted to bring it up, anyway, because I'm curious whether that
> could be an option. There's a comment created by Chesnay (CC'd) in the
> JavaDoc of TaskExecutorManager [3] indicating something similar. I'm
> wondering whether he can add some insights here.
>
> 3. For me personally, having a more detailed summary comparing the
> subcomponents of both SlotManager implementations with where
> their functionality matches and where they differ might help understand the
> consequences of the changes proposed in FLIP-298.
>
> Best,
> Matthias
>
> [1]
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-285%3A+Refactoring+LeaderElection+to+make+Flink+support+multi-component+leader+election+out-of-the-box
> [2] https://issues.apache.org/jira/browse/FLINK-30338
> [3]
> https://github.com/apache/flink/blob/f611ea8cb5deddb42429df2c99f0c68d7382e9bd/flink-runtime/src/main/java/org/apache/flink/runtime/resourcemanager/slotmanager/TaskExecutorManager.java#L66-L68
>
> On Tue, Feb 28, 2023 at 6:14 AM Matt Wang <wa...@163.com> wrote:
>
>> This is a good proposal for me, it will make the code of the SlotManager
>> more clear.
>>
>>
>>
>> --
>>
>> Best,
>> Matt Wang
>>
>>
>> ---- Replied Message ----
>> | From | David Morávek<dm...@apache.org> |
>> | Date | 02/27/2023 22:45 |
>> | To | <de...@flink.apache.org> |
>> | Subject | Re: [DISCUSS] FLIP-298: Unifying the Implementation of
>> SlotManager |
>> Hi Weihua, I still need to dig into the details, but the overall sentiment
>> of this change sounds reasonable.
>>
>> Best,
>> D.
>>
>> On Mon, Feb 27, 2023 at 2:26 PM Zhanghao Chen <zh...@outlook.com>
>> wrote:
>>
>> Thanks for driving this topic. I think this FLIP could help clean up the
>> codebase to make it easier to maintain. +1 on it.
>>
>> Best,
>> Zhanghao Chen
>> ________________________________
>> From: Weihua Hu <hu...@gmail.com>
>> Sent: Monday, February 27, 2023 20:40
>> To: dev <de...@flink.apache.org>
>> Subject: [DISCUSS] FLIP-298: Unifying the Implementation of SlotManager
>>
>> Hi everyone,
>>
>> I would like to begin a discussion on FLIP-298: Unifying the Implementation
>> of SlotManager[1]. There are currently two types of SlotManager in Flink:
>> DeclarativeSlotManager and FineGrainedSlotManager. FineGrainedSlotManager
>> should behave as DeclarativeSlotManager if the user does not configure the
>> slot request profile.
>>
>> Therefore, this FLIP aims to unify the implementation of SlotManager in
>> order to reduce maintenance costs.
>>
>> Looking forward to hearing from you.
>>
>> [1]
>>
>>
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-298%3A+Unifying+the+Implementation+of+SlotManager
>>
>> Best,
>> Weihua
>>
>>

Re: [DISCUSS] FLIP-298: Unifying the Implementation of SlotManager

Posted by Matthias Pohl <ma...@aiven.io.INVALID>.
Hi Weihua,
Thanks for your proposal. From a conceptual point: AFAIU, the
DeclarativeSlotManager covers a subset (i.e. only evenly sized slots) of
what the FineGrainedSlotManager should be able to achieve (variable slot
size per task manager). Is this the right assumption/understanding? In this
sense, merging both implementations into a single one sounds good. A few
more general comments, though:

1. Did you do a proper test coverage analysis? That's not mentioned in the
current version of the FLIP. I'm bringing this up because we ran into the
same issue when fixing the flaws that popped up after introducing the
multi-component leader election (see FLIP-285 [1]). There is a risk that by
removing the legacy code we decrease test coverage because certain
test cases that were covered for the legacy classes might not be
necessarily covered in the new implementation, yet (see FLINK-30338 [2]
which covers this issue for the leader election case). Ideally, we don't
want to remove test cases accidentally because they were only implemented
for the DeclarativeSlotManager but missed for the FineGrainedSlotManager.

2. DeclarativeSlotManager and FineGrainedSlotManager feel quite big in
terms of lines of code. Without knowing whether it's actually a reasonable
thing to do: Instead of just adding more features to the
FineGrainedSlotManager, have you thought of cutting out functionality into
smaller sub-components along this refactoring? Such a step-by-step approach
might improve the overall codebase and might make reviewing the refactoring
easier. I did a first pass over the code and struggled to identify code
blocks that could be moved out of the SlotManager implementation(s).
Therefore, I might be wrong with this proposal. I haven't worked on this
codebase in that detail that it would allow me to come up with a judgement
call. I wanted to bring it up, anyway, because I'm curious whether that
could be an option. There's a comment created by Chesnay (CC'd) in the
JavaDoc of TaskExecutorManager [3] indicating something similar. I'm
wondering whether he can add some insights here.

3. For me personally, having a more detailed summary comparing the
subcomponents of both SlotManager implementations with where
their functionality matches and where they differ might help understand the
consequences of the changes proposed in FLIP-298.

Best,
Matthias

[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-285%3A+Refactoring+LeaderElection+to+make+Flink+support+multi-component+leader+election+out-of-the-box
[2] https://issues.apache.org/jira/browse/FLINK-30338
[3]
https://github.com/apache/flink/blob/f611ea8cb5deddb42429df2c99f0c68d7382e9bd/flink-runtime/src/main/java/org/apache/flink/runtime/resourcemanager/slotmanager/TaskExecutorManager.java#L66-L68

On Tue, Feb 28, 2023 at 6:14 AM Matt Wang <wa...@163.com> wrote:

> This is a good proposal for me, it will make the code of the SlotManager
> more clear.
>
>
>
> --
>
> Best,
> Matt Wang
>
>
> ---- Replied Message ----
> | From | David Morávek<dm...@apache.org> |
> | Date | 02/27/2023 22:45 |
> | To | <de...@flink.apache.org> |
> | Subject | Re: [DISCUSS] FLIP-298: Unifying the Implementation of
> SlotManager |
> Hi Weihua, I still need to dig into the details, but the overall sentiment
> of this change sounds reasonable.
>
> Best,
> D.
>
> On Mon, Feb 27, 2023 at 2:26 PM Zhanghao Chen <zh...@outlook.com>
> wrote:
>
> Thanks for driving this topic. I think this FLIP could help clean up the
> codebase to make it easier to maintain. +1 on it.
>
> Best,
> Zhanghao Chen
> ________________________________
> From: Weihua Hu <hu...@gmail.com>
> Sent: Monday, February 27, 2023 20:40
> To: dev <de...@flink.apache.org>
> Subject: [DISCUSS] FLIP-298: Unifying the Implementation of SlotManager
>
> Hi everyone,
>
> I would like to begin a discussion on FLIP-298: Unifying the Implementation
> of SlotManager[1]. There are currently two types of SlotManager in Flink:
> DeclarativeSlotManager and FineGrainedSlotManager. FineGrainedSlotManager
> should behave as DeclarativeSlotManager if the user does not configure the
> slot request profile.
>
> Therefore, this FLIP aims to unify the implementation of SlotManager in
> order to reduce maintenance costs.
>
> Looking forward to hearing from you.
>
> [1]
>
>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-298%3A+Unifying+the+Implementation+of+SlotManager
>
> Best,
> Weihua
>
>

Re: [DISCUSS] FLIP-298: Unifying the Implementation of SlotManager

Posted by Matt Wang <wa...@163.com>.
This is a good proposal for me, it will make the code of the SlotManager more clear.



--

Best,
Matt Wang


---- Replied Message ----
| From | David Morávek<dm...@apache.org> |
| Date | 02/27/2023 22:45 |
| To | <de...@flink.apache.org> |
| Subject | Re: [DISCUSS] FLIP-298: Unifying the Implementation of SlotManager |
Hi Weihua, I still need to dig into the details, but the overall sentiment
of this change sounds reasonable.

Best,
D.

On Mon, Feb 27, 2023 at 2:26 PM Zhanghao Chen <zh...@outlook.com>
wrote:

Thanks for driving this topic. I think this FLIP could help clean up the
codebase to make it easier to maintain. +1 on it.

Best,
Zhanghao Chen
________________________________
From: Weihua Hu <hu...@gmail.com>
Sent: Monday, February 27, 2023 20:40
To: dev <de...@flink.apache.org>
Subject: [DISCUSS] FLIP-298: Unifying the Implementation of SlotManager

Hi everyone,

I would like to begin a discussion on FLIP-298: Unifying the Implementation
of SlotManager[1]. There are currently two types of SlotManager in Flink:
DeclarativeSlotManager and FineGrainedSlotManager. FineGrainedSlotManager
should behave as DeclarativeSlotManager if the user does not configure the
slot request profile.

Therefore, this FLIP aims to unify the implementation of SlotManager in
order to reduce maintenance costs.

Looking forward to hearing from you.

[1]

https://cwiki.apache.org/confluence/display/FLINK/FLIP-298%3A+Unifying+the+Implementation+of+SlotManager

Best,
Weihua


Re: [DISCUSS] FLIP-298: Unifying the Implementation of SlotManager

Posted by David Morávek <dm...@apache.org>.
Hi Weihua, I still need to dig into the details, but the overall sentiment
of this change sounds reasonable.

Best,
D.

On Mon, Feb 27, 2023 at 2:26 PM Zhanghao Chen <zh...@outlook.com>
wrote:

> Thanks for driving this topic. I think this FLIP could help clean up the
> codebase to make it easier to maintain. +1 on it.
>
> Best,
> Zhanghao Chen
> ________________________________
> From: Weihua Hu <hu...@gmail.com>
> Sent: Monday, February 27, 2023 20:40
> To: dev <de...@flink.apache.org>
> Subject: [DISCUSS] FLIP-298: Unifying the Implementation of SlotManager
>
> Hi everyone,
>
> I would like to begin a discussion on FLIP-298: Unifying the Implementation
> of SlotManager[1]. There are currently two types of SlotManager in Flink:
> DeclarativeSlotManager and FineGrainedSlotManager. FineGrainedSlotManager
> should behave as DeclarativeSlotManager if the user does not configure the
> slot request profile.
>
> Therefore, this FLIP aims to unify the implementation of SlotManager in
> order to reduce maintenance costs.
>
> Looking forward to hearing from you.
>
> [1]
>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-298%3A+Unifying+the+Implementation+of+SlotManager
>
> Best,
> Weihua
>

Re: [DISCUSS] FLIP-298: Unifying the Implementation of SlotManager

Posted by Zhanghao Chen <zh...@outlook.com>.
Thanks for driving this topic. I think this FLIP could help clean up the codebase to make it easier to maintain. +1 on it.

Best,
Zhanghao Chen
________________________________
From: Weihua Hu <hu...@gmail.com>
Sent: Monday, February 27, 2023 20:40
To: dev <de...@flink.apache.org>
Subject: [DISCUSS] FLIP-298: Unifying the Implementation of SlotManager

Hi everyone,

I would like to begin a discussion on FLIP-298: Unifying the Implementation
of SlotManager[1]. There are currently two types of SlotManager in Flink:
DeclarativeSlotManager and FineGrainedSlotManager. FineGrainedSlotManager
should behave as DeclarativeSlotManager if the user does not configure the
slot request profile.

Therefore, this FLIP aims to unify the implementation of SlotManager in
order to reduce maintenance costs.

Looking forward to hearing from you.

[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-298%3A+Unifying+the+Implementation+of+SlotManager

Best,
Weihua